Measuring Semantic Drift on the Archain

The Arweave Project
Oct 13, 2017

Semantic Drift

Traditionally, linguistic expressions have been viewed as meaningful because they relate to real-world concepts in some objective manner. However, since the advent of modern linguistics and semiotics, it has been widely accepted that the meaning of words and expressions is subjective and can drift over time. This evolution in the meaning and usage of words is known as Semantic Change or Semantic Drift.

A word or expression can undergo semantic drift in a number of ways. One is semantic expansion, where a word comes to mean something broader than its prior meaning (e.g. Middle English bridde, meaning ‘small bird’, which has become the broader word bird). Another is semantic restriction, the inverse of expansion, as with Middle English mete, meaning food in general, which has narrowed to Modern English meat.

Semantic drift is, of course, a key interest in linguistics. However, its applications go far beyond the academic. By tracking and charting the usage and interpretation of words pertaining to marketing, branding, company names and so on, it is possible to gain novel, valuable and quantitative insight into public perception of these words or concepts, and to make predictions about future public perception. For example, establishing the cultural codings (the broad public understanding and perception) of motorcyclists was essential to the Department for Transport’s motorcycle awareness campaign [1], which set out to culturally re-encode motorcycle riders through its advertising. Predictions about the usage of business branding material could also provide an early warning of a negative association, or of a predicted one, allowing the business to counter the negative shift early.

By providing access to a real-time stream of history, the Archain presents an ideal platform on which to analyse cultural codings, assess semantic drift and make predictions about future perceptions of words or expressions.

Archain as a Platform for Measuring Semantic Drift

Archain is a distributed, permanent and cryptographically verified archive of information. Once populated, it yields access to new information in a novel manner: as a live feed of hand-picked information that users find historically important, valuable or otherwise worthy of storage. Data is stored on a sharded blockchain (the Blockweave) [2], a cryptographically linked chain of blocks containing the information users wish to store. After a block is mined, the data in its transactions becomes available.

This real-time access to historical information allows history to be interpreted and analysed in real time, in terms of the data stored in each block. An Archain monitor application built with the Application Developer Toolkit (ADT) would provide a straightforward means of accessing data in this way. You can find a video tutorial on how to build a simple monitor on our YouTube channel.

If we assume that the change of any one word is independent of, and unaffected by, the changes of other words, semantic drift appears to be a prime candidate for modelling and prediction with Markov chains.

Mathematical Outline

The first step in performing the analysis is to transform the expression in question into a word cloud. A word cloud is a set of items, generated over a number of blocks, that relate to the expression in terms of connotation and denotation.

For example, a small word cloud for the word ‘car’ might be {‘wheels’, ‘engine’, ‘fast’, ‘polluting’}. Note that the cloud contains both words without positive or negative association, e.g. ‘wheels’, and words that may carry such associations: ‘fast’, ‘polluting’.

Fig. 1: Word cloud for the Richard Stallman (RMS) Wikipedia page. Source: http://wordcloud.cs.arizona.edu/

Many algorithms exist for the generation of semantic word clouds. Generally, a text is parsed and tokenised, common words such as ‘a’ or ‘the’ are removed, and the remaining words are weighted and grouped according to some factor such as co-occurrence in sentences. An outline of the algorithm used to generate Fig. 1, a semantic word cloud built from the Wikipedia article on Richard Stallman, is given in [3].
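As a rough illustration of the steps just described, the following Python sketch builds a small word cloud from sentence co-occurrence. It is not the algorithm of [3]: the function names (tokenise, word_cloud), the tiny stop-word list and the ranking by raw co-occurrence count are all assumptions made for this example.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "has", "and", "of", "to", "in", "it", "but"}


def tokenise(sentence):
    """Lower-case a sentence, split it into word tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", sentence.lower()) if t not in STOP_WORDS]


def word_cloud(text, target, size=5):
    """Return the `size` words that most often share a sentence with `target`."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        tokens = set(tokenise(sentence))
        if target in tokens:
            # Count each co-occurring word once per sentence.
            counts.update(tokens - {target})
    # Rank by descending count, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:size]}


text = ("The car has wheels and an engine. The car is fast. "
        "The car is polluting but fast. Birds sing.")
print(word_cloud(text, "car", size=4))
```

For this sample text the resulting cloud is the set containing ‘wheels’, ‘engine’, ‘fast’ and ‘polluting’, matching the ‘car’ example above. A real monitor would run this over the text stored in each block.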

Now we are able to define a discrete-time Markov chain over blocks on the blockweave:

Where P is the matrix of transition probabilities (the transition matrix), that is, the matrix of probabilities that a word will be featured in the word cloud of the next block, based on historical data from the blockweave. In the general case, it will take the following form:

Such that w₀ through wₙ are binary values representing the membership of words in the word cloud, and wp₀ⁱ⁺¹ is the probability, in the interval [0,1], of w₀ being featured in the word cloud of block i+1. Using this definition it is possible to make a prediction about the content of the next block.
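One possible way to write such a chain down is sketched below. It is an illustrative assumption and not necessarily the exact formulation intended here: each word k is treated as an independent two-state (absent or present) chain, and the symbols P_k, a_k and b_k are introduced only for this sketch.

```latex
% w_k^{i} = 1 if word k is in the word cloud of block i, and 0 otherwise.
\mathbf{w}^{i} = \left(w_0^{i}, w_1^{i}, \dots, w_n^{i}\right),
\qquad w_k^{i} \in \{0, 1\}

% Assumed two-state (absent/present) transition matrix for word k.
P_k =
\begin{pmatrix}
1 - a_k & a_k \\
1 - b_k & b_k
\end{pmatrix},
\qquad
a_k = \Pr\left(w_k^{i+1} = 1 \mid w_k^{i} = 0\right),
\quad
b_k = \Pr\left(w_k^{i+1} = 1 \mid w_k^{i} = 1\right)

% Predicted probability that word k is featured in the cloud of block i+1.
wp_k^{i+1} = \left(1 - w_k^{i}\right) a_k + w_k^{i}\, b_k
```

Under this assumption, a_k and b_k can be estimated from the blockweave by counting how often word k appears in a block's cloud after being absent from, or present in, the previous one.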

On the release of a new block, the Markov chain is rebuilt along with the new word cloud and transition matrix. Starting from a suitably populated weave, a real-time monitor app that performs these calculations each time a new block is mined would let the user track semantic word clouds for desired terms over time, detect trends by comparing word clouds and transition matrices, and make use of the Markov chain’s predictive capability.
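The rebuild-and-predict step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each word is modelled as an independent two-state (absent or present) chain, probabilities are estimated by counting transitions between consecutive blocks, and word clouds are compared with the Jaccard distance. The names estimate_transitions, predict_next and jaccard_distance are hypothetical, no ADT calls are shown, and the history list stands in for the clouds a monitor would collect as blocks are mined.

```python
def estimate_transitions(history, vocabulary):
    """Estimate, per word, P(present in next block | absent or present now).

    `history` is a list of word clouds (sets of words), one per block, oldest first.
    Returns {word: (p_given_absent, p_given_present)}.
    """
    # counts[word][prev][nxt] = number of observed prev -> nxt transitions.
    counts = {w: [[0, 0], [0, 0]] for w in vocabulary}
    for prev_cloud, next_cloud in zip(history, history[1:]):
        for w in vocabulary:
            counts[w][int(w in prev_cloud)][int(w in next_cloud)] += 1
    probs = {}
    for w, c in counts.items():
        row = []
        for prev in (0, 1):
            total = c[prev][0] + c[prev][1]
            # With no observations for this state, fall back to 0.5 (an arbitrary choice).
            row.append(c[prev][1] / total if total else 0.5)
        probs[w] = tuple(row)
    return probs


def predict_next(current_cloud, probs):
    """Probability that each word appears in the next block's word cloud."""
    return {w: (p[1] if w in current_cloud else p[0]) for w, p in probs.items()}


def jaccard_distance(cloud_a, cloud_b):
    """Simple drift measure between two clouds: 0 means identical, 1 means disjoint."""
    union = cloud_a | cloud_b
    return 1 - len(cloud_a & cloud_b) / len(union) if union else 0.0


# Clouds for the term 'car' over four consecutive blocks (made-up data).
history = [{"fast", "wheels"}, {"fast", "wheels"},
           {"polluting", "wheels"}, {"polluting", "wheels"}]
vocabulary = set().union(*history)

probs = estimate_transitions(history, vocabulary)
print(predict_next(history[-1], probs))
# fast: 0.0, wheels: 1.0, polluting: 1.0
print(jaccard_distance(history[0], history[-1]))  # roughly 0.67
```

On each newly mined block, a monitor would append the new cloud to history and call estimate_transitions again, which corresponds to the rebuild described above.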

Conclusions

In this article we have presented a means of tracking semantic drift using a real-time Archain ADT monitor application that implements a Markov chain approach. Such an application would make it possible to detect trends over time and would offer some predictive capability. We have presented a sample use case and explained the benefits of using the Archain in solving the problems of the use case.

The model and implementation are still relatively naive. A better implementation might use a multi-dimensional matrix as the representation of a block’s semantic word cloud, utilising extra variables such as importance and sentence co-occurrence, as seen in Fig. 1 and described in [3]. This would describe the state more fully. Tracking semantic drift remains but one of the many applications where blockweave technology might prove effective; however, it contains ideas that could well bear fruit in the future.
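As a small step in that direction, a weighted word cloud could replace the binary one. The Python sketch below assumes each cloud is a mapping from word to a numeric weight (for instance an importance score or co-occurrence count) and compares two clouds by cosine similarity. The function name cosine_similarity and the sample weights are assumptions of this sketch, not part of any existing implementation.

```python
import math


def cosine_similarity(cloud_a, cloud_b):
    """Cosine similarity between two weighted word clouds ({word: weight})."""
    dot = sum(weight * cloud_b.get(word, 0.0) for word, weight in cloud_a.items())
    norm_a = math.sqrt(sum(v * v for v in cloud_a.values()))
    norm_b = math.sqrt(sum(v * v for v in cloud_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty cloud has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up weighted clouds for the same term in an earlier and a later block.
earlier = {"fast": 3.0, "wheels": 1.0}
later = {"polluting": 2.0, "wheels": 1.0}
print(cosine_similarity(earlier, later))
```

A similarity that falls over successive blocks would suggest that the associations of the term are shifting, even when the set of words in the cloud changes little.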

For more information on the Archain project, visit us at archain.org, find us on Twitter or email the team.

-archain-jon

References

[1] Sign Salad: Case studies DfT ‘Think! Biker’, http://www.signsalad.com/semiotics-explained/case-studies/case-studies-dft/

[2] Archain: An Open, Irrevocable, Unforgeable and Uncensorable Archive for the Internet, S Williams, W Jones, https://www.archain.org/whitepaper.pdf

[3] Semantic Word Cloud Representations: Hardness and Approximation Algorithms, Barth et al., https://arxiv.org/abs/1311.4778
