Due to rapid advances in technology, increasingly large datasets of human activity are now available for study. This wealth of information is transforming research in human dynamics and computational social science. During a minisymposium at the 2017 SIAM Conference on Applications of Dynamical Systems, held in Snowbird, Utah, this May, James Bagrow (University of Vermont) used information-theoretic tools and large-scale data analysis to study the dynamics of content circulation on Twitter. Bagrow, a physicist by training, is particularly interested in the networks, organizing principles, and underlying rules of complicated social systems, such as present-day social media.
“The data that social media platforms collect these days is gigantic,” Bagrow said. In 2014, Facebook was growing by 600 terabytes a day. Twitter currently generates 6,000 tweets (of 140 characters or less) per second, while Apple receives 200,000 text messages per second. In essence, short text exchanges are quickly becoming the dominant means of communication. However, questions pertaining to information flow remain. How much information do users regularly exchange? How much of this information is legitimate, and how much is noise? To what degree do users influence each other? And finally, can researchers accurately address these queries?
“Information diffusion and information flow are very, very old topics,” Bagrow said. Researchers have thoroughly studied social media and mobile phone data, but such studies are structural; they focus on keywords and/or timing, rather than the words themselves. Thus, Bagrow concentrated his efforts on the specific wording of tweets. Using entropy rate \(h\), he examined the influence of past words on subsequent text. For example, when \(h=0\), no additional information is necessary to describe upcoming words. “But how do you compute these information measures from real, written text?” Bagrow asked. “You can’t just shuffle the text; you’ll destroy massive amounts of data.”
To address this problem, Bagrow presented his entropy estimator, which differs significantly from the well-known work of Claude Shannon in 1951. “Moving forward, I wanted to compute the shortest substring of words that you haven’t seen before,” he said. “How many bits are needed to describe the next word, given past words?” To test his estimator, Bagrow conducted an entropy experiment on three books: J.R.R. Tolkien’s The Fellowship of the Ring, James Joyce’s Ulysses, and Ernest Hemingway’s For Whom the Bell Tolls. He intentionally chose novels with varied language. For example, Ulysses is known for its verbosity, while Hemingway uses fairly simple language. Upon looking more closely at The Fellowship of the Ring, Bagrow found that certain events in the book coincide with dips in his estimator, suggesting that action sequences likely use simpler, more straightforward language that is easier to predict.
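Estimators of this flavor work by scanning, at each position in the text, for the shortest substring that has not appeared earlier in the sequence. The following Python sketch illustrates the idea with a deliberately naive quadratic search over word-level tokens; it is an illustrative simplification, not Bagrow’s actual implementation:

```python
import math

def entropy_rate(words):
    """Match-length entropy estimate (bits per word): for each position i,
    find the length of the shortest substring starting at i that has not
    appeared earlier in the sequence, then apply the standard
    n * log2(n) / sum(match lengths) formula."""
    n = len(words)
    match_lengths = []
    for i in range(n):
        k = 1
        while i + k <= n:
            sub = tuple(words[i:i + k])
            # naive O(n^2) scan of the past for this substring
            if not any(tuple(words[j:j + k]) == sub for j in range(i)):
                break
            k += 1
        match_lengths.append(k)
    return n * math.log2(n) / sum(match_lengths)
```

Highly repetitive text yields long matches and therefore a low entropy rate, while text whose every word is new yields an entropy rate near \(\log_2 n\), which is consistent with the dips Bagrow observed during formulaic action sequences.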
Certain events in The Fellowship of the Ring coincide with dips in James Bagrow's entropy estimator, suggesting that action sequences likely use simpler, more straightforward language that is easier to predict. Image courtesy of James Bagrow.
Bagrow then applied his estimator to Twitter streams. He gathered data (3,200 tweets in total) on Twitter activity via application programming interface (API) calls and concatenated each user’s tweets into a single long stream of words. Bagrow treated each stream of text as a symbolic time series, in which the entropy rate measures the impact of a user’s past tweets on future word choice. Next, he filtered out bots and low-activity users before employing Fano’s inequality to connect his estimator to social predictability. Using data compression theorems, Bagrow estimated a correlated entropy rate that incorporates temporal and long-range correlations in the data. This rate estimates the intrinsic uncertainty of a user’s word choice in future posts.
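The step from entropy to predictability rests on Fano’s inequality: given an entropy rate \(h\) (bits per word) and an effective vocabulary of \(V\) words, the probability \(\Pi\) of correctly predicting the next word is bounded by \(h \le H(\Pi) + (1 - \Pi)\log_2(V - 1)\), where \(H(\Pi)\) is the binary entropy. A hedged numerical sketch of solving this bound for \(\Pi\) (the function name and bisection approach are illustrative, not from the talk):

```python
import math

def max_predictability(h, vocab_size):
    """Upper bound on predictability Pi from Fano's inequality:
    h <= H(Pi) + (1 - Pi) * log2(V - 1), solved numerically for Pi.
    Sketch; assumes 0 < h < log2(vocab_size)."""
    def fano_rhs(pi):
        hb = 0.0  # binary entropy H(pi), with 0*log(0) taken as 0
        for p in (pi, 1 - pi):
            if 0 < p < 1:
                hb -= p * math.log2(p)
        return hb + (1 - pi) * math.log2(vocab_size - 1)
    # fano_rhs decreases from log2(V) at pi = 1/V down to 0 at pi = 1,
    # so bisect to find the pi where it equals h
    lo, hi = 1.0 / vocab_size, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if fano_rhs(mid) > h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Lower entropy rates translate into higher predictability bounds, which is how a single number \(h\) per user becomes a statement like the 54 percent figure discussed next.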
Bagrow found that his estimator has a 54 percent chance of correctly predicting the next word in a tweet. Though this percentage might not sound particularly impressive, he reminded the audience that the baseline is roughly a one in 5,000 chance of correct prediction when handling vocabularies of thousands of words. He then went on to measure social information flow between pairs of users, called egos and alters. “If I try to predict Alice’s words with Bob’s history, I should do worse,” he said. “But how much worse?” To answer this question, he used the estimator in much the same way as before, employing a cross entropy that approximates how much information about the ego’s future word choice is present in the alter’s prior words. The results varied on a case-by-case basis: some alters contain almost as much information about the ego as the ego itself, while other pairs exhibit minimal transmission.
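The cross-entropy variant reuses the same match-length machinery, but searches for the ego’s substrings in the alter’s stream rather than in the ego’s own past. A simplified sketch along these lines (again an unoptimized, illustrative search, not the talk’s implementation):

```python
import math

def cross_entropy_rate(ego_words, alter_words):
    """Cross-entropy sketch: for each position i in the ego's stream, find
    the shortest substring starting at i that does NOT appear anywhere in
    the alter's stream. Long matches mean the alter's words carry much
    information about the ego's word choice; short matches mean little."""
    n = len(ego_words)
    m = len(alter_words)
    match_lengths = []
    for i in range(n):
        k = 1
        while i + k <= n:
            sub = tuple(ego_words[i:i + k])
            found = any(tuple(alter_words[j:j + k]) == sub
                        for j in range(m - k + 1))
            if not found:
                break
            k += 1
        match_lengths.append(k)
    return n * math.log2(m) / sum(match_lengths)
```

An ego who frequently echoes the alter’s phrasing produces long matches and a low cross entropy; an ego whose words never appear in the alter’s stream produces the maximal rate, mirroring the case-by-case variation Bagrow reported.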
Bagrow implemented a simple toy model to better understand cross entropies. In the model, the alter generates tweets randomly from a collection of words. The ego does the same, but occasionally copies a word sequence from the alter’s past. He plotted these results for different parameter values of a Zipf distribution over word frequencies. “As you add more alters, your entropy goes down, and you get more and more information,” Bagrow said. He also created a temporal control that essentially replaced each alter with a pseudo-alter, and was surprised to find that the alter actually provides a bit more information than the ego itself. “Alters are providing a little bit of additional information, a little bit of predictability,” he said. “At the end of the day, the ego is only giving a little bit of additional information.”
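The toy model described above can be sketched as follows; all parameter values (vocabulary size, Zipf exponent, copy probability and length) are illustrative guesses rather than those used in the talk:

```python
import random

def zipf_words(vocab_size, exponent, n, rng):
    """Draw n words from a Zipf(exponent) distribution over a toy vocabulary."""
    words = [f"w{r}" for r in range(1, vocab_size + 1)]
    weights = [1.0 / r ** exponent for r in range(1, vocab_size + 1)]
    return rng.choices(words, weights=weights, k=n)

def toy_streams(vocab_size=500, exponent=1.2, n=300,
                copy_prob=0.2, copy_len=5, seed=1):
    """Toy ego/alter model (illustrative sketch): the alter tweets random
    Zipf-distributed words; the ego does the same, but with probability
    copy_prob it copies a short sequence from the alter's past instead."""
    rng = random.Random(seed)
    alter = zipf_words(vocab_size, exponent, n, rng)
    ego = []
    while len(ego) < n:
        if rng.random() < copy_prob:
            j = rng.randrange(0, n - copy_len)
            ego.extend(alter[j:j + copy_len])  # copy from the alter's past
        else:
            ego.extend(zipf_words(vocab_size, exponent, 1, rng))
    return ego[:n], alter
```

Feeding such synthetic streams into a cross-entropy estimator shows how the copy probability controls information flow from alter to ego, which is the knob the toy model is built to turn.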
In summation, Bagrow’s results offer unique quantitative bounds on the transfer of information in social networks. By combining large-scale data analysis with mathematical models to understand Twitter’s complex information system, he facilitates a better understanding of how ideas spread and influence each other in impressionable human populations. Predicting the next word of a tweet or other online communication offers a measurable proxy signal. “We can find all sorts of sociological signals embedded in social networks,” he said. Tracking the words used in online exchanges allows researchers to then track the concepts that users are discussing, which is much more valuable.
Lina Sorg is the associate editor of SIAM News.