In a minisymposium talk on public opinion polling through Twitter, Emily Cody talked about the opportunities and challenges of using Twitter for polling over traditional surveys.
“People tend to get on their computer and tend to type how they're feeling about an issue,” said Cody, talking about the ease of obtaining people’s opinions in the digital age.
Motivations for using Twitter opinion polls. Image credit: Emily Cody, AN16 presentation.
Can we use this data to determine public opinion and analyze sentiment? Cody answered the question by describing the process behind extracting information by pulling all tweets on a subject of interest. A massive Twitter dataset is a great resource to infer characteristics of human emotion and behavior.
The disadvantage with traditional surveys, she said, is that people often don't answer them. Moreover, surveys only reach a limited number of willing participants. With Twitter, data is easily available at little cost to researchers. Tweets come in every 15 minutes, as opposed to traditional polling, where data is collected monthly or even yearly.
Cody went on to describe the mathematical methods used by her research group to achieve this.
The group uses an instrument called the Hedonometer, which is designed to calculate the happiness score for a large collection of texts. A word list called labMT supplies the per-word scores used for sentiment analysis.
Each word is assigned a happiness score based on the positivity or negativity associated with it. Neutral words tend to skew the results toward a neutral score, so all neutral words are removed from the text. The happiness score of a text can thus be calculated by averaging the happiness scores of the remaining words in the dataset.
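The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the group's implementation: the word scores in TOY_SCORES are invented rather than taken from labMT, the 1-to-9 scale and the "neutral band" thresholds are assumed parameters, and the function name happiness_score is hypothetical.

```python
from collections import Counter

# Hypothetical scores on an assumed 1 (sad) to 9 (happy) scale.
# These are made-up values for illustration, not real labMT entries.
TOY_SCORES = {
    "laughter": 8.5,
    "happy": 8.3,
    "the": 5.0,
    "of": 5.0,
    "sad": 2.4,
    "terrible": 2.0,
}


def happiness_score(words, scores, neutral_band=(4.0, 6.0)):
    """Average the scores of the scored, non-neutral words in `words`.

    Words missing from `scores` are ignored. Words whose score falls
    strictly inside `neutral_band` (an assumed threshold) are dropped
    so they do not pull the average toward the middle of the scale.
    Returns None if no scorable words remain.
    """
    low, high = neutral_band
    counts = Counter(word.lower() for word in words)
    total = 0.0
    kept = 0
    for word, count in counts.items():
        score = scores.get(word)
        if score is None or low < score < high:
            continue  # unscored or neutral word: skip it
        total += score * count  # weight each word by how often it occurs
        kept += count
    return total / kept if kept else None


# Example: "the" is neutral and dropped, so only "happy" and "sad" count.
example = happiness_score(["happy", "the", "sad"], TOY_SCORES)
```

Weighting by word frequency means a word that appears in many tweets contributes proportionally more to the average than a rare one.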
The happiness map in the image below is a result of such analysis using a hedonometer, which calculates the ambient happiness for each labMT word.
For example, to gauge tweeters’ opinion on climate, all words that appeared in tweets containing the word “climate,” except the word “climate” itself, would be used to calculate the happiness score.
In their studies, Cody and her group also compare ambient happiness time series to solicited polling data. They have created 10,000 matrices that connect words like laughter and happiness that appear with other words (for example, Obama or snow). Once the happiness score is obtained, this data can then be broken down by time to determine if particular incidents occurred that may have influenced the happiness or sadness associated with a person or concept. If tweets are geolocated, they may even be broken down by location or geography, though Cody’s group has not done this yet.
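As a rough sketch of how the words for such an "ambient" score might be gathered, the snippet below collects every word from tweets that mention a keyword, leaving out the keyword itself. The function name ambient_words and the simple whitespace-and-punctuation tokenizer are assumptions made for illustration; the talk did not describe the group's actual text processing.

```python
def ambient_words(tweets, keyword):
    """Collect the words that co-occur with `keyword` across `tweets`.

    A tweet contributes only if it contains the keyword as a token.
    The keyword itself is excluded so that its own score does not
    influence the result. Tokenization here is deliberately naive.
    """
    keyword = keyword.lower()
    collected = []
    for tweet in tweets:
        # Lowercase and strip common punctuation from each token.
        tokens = [t.strip(".,!?\"'#").lower() for t in tweet.split()]
        if keyword in tokens:
            collected.extend(t for t in tokens if t and t != keyword)
    return collected


tweets = [
    "Climate change is terrible",
    "I love pizza",
    "Happy about climate action!",
]
words = ambient_words(tweets, "climate")
```

The resulting word list would then be scored against the word list, with neutral words removed, to give the ambient happiness for the keyword.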
Happiness by state as determined by Twitter opinion polling. Image credit: Emily Cody, AN16 presentation.
As an example she illustrated happiness scores for tweets about Obama within a given timeframe. As far as Twitter was concerned, Obama’s happiest days included the day he won the Nobel Prize and his birthday – the latter was because people were sending him birthday greetings on Twitter. One of the saddest days was the day he declared a state of emergency for the H1N1 virus. Going by quarters, Obama’s first quarter was seen to be his happiest quarter and his saddest quarter was the 23rd quarter.
Cody explained that this seems logical, given that people were still happy about his election and the newness of his presidency in the first quarter. As someone in the audience aptly noted, the 23rd quarter was during the mid-term elections where the Democratic Party did poorly – ample reason to be sad.
The group found that their Twitter time series preceded traditional polling, since the latter is usually restricted to just a few days of polling and is done infrequently.
Cody demonstrated Twitter polling with another example: snow. “How do people feel about snow?” she said. “With snow you see very seasonal happiness.” During summer happiness about snow seems to go up, though the elation frequency is low. In winter, when people are actually dealing with snow, happiness about snow goes down but frequency goes up.
One of the pitfalls to consider when using Twitter polls is erroneous happiness or sadness that may be generated due to unrelated events. For instance, one of the huge happiness spikes in the case of snow was related to a Snow White movie that came out in summer. In such cases, the specific word should be removed from the process so results aren’t skewed.
The group compared their data with the Consumer Sentiment Index, a consumer confidence index published monthly by the University of Michigan. Twitter data showed some correlation. Text or Twitter sentiment responds faster to news events than conventional polls, so tweets were lagged in the comparison analysis. Sure enough, this made the Twitter polling data correlate better with the index.
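The idea of lagging one series before correlating it with another can be illustrated as follows. This is a minimal sketch assuming both series are sampled at the same interval and that Twitter sentiment at time t is paired with the index at time t + lag; the function names and the use of a plain Pearson correlation are assumptions, since the talk did not specify the statistic or the lag the group used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def lagged_correlation(twitter, index, lag):
    """Correlate twitter[t] with index[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(twitter, index)
    # Drop the last `lag` Twitter points and the first `lag` index points
    # so that each Twitter value lines up with a later index value.
    return pearson(twitter[:-lag], index[lag:])


def best_lag(twitter, index, max_lag):
    """Return the lag in 0..max_lag that gives the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(twitter, index, k))


# Toy data: the index repeats the Twitter series one step later.
twitter_series = [1, 3, 2, 5, 4, 6, 8, 7]
index_series = [0, 1, 3, 2, 5, 4, 6, 8]
```

On this toy data the two series line up exactly once the Twitter series is shifted by one step, which is the kind of improvement in correlation the comparison with the index relied on.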
“This shows the power of Twitter in predicting sentiment,” Cody said.
The group also looked at sentiment surrounding businesses like McDonald's and Walmart through Twitter. In the case of Walmart, the saddest days seemed to correlate with three to four shootings that had occurred at a few stores. There was also a dip around the time actor Tracy Morgan sued the corporation over a car accident caused by a Walmart truck. The happiness spikes usually correlated with free gift cards that stores were giving away. In this case the spikes might be inflated by Twitter bots, Cody acknowledged.
In the case of McDonald's, there was a spike in happiness when they won the national toilet award for cleanliness. The saddest day was when Ferguson protestors broke into a McDonald’s when hit with tear gas.
Cody illustrated another potential source of error with this method: a so-called negative word may actually be associated with something positive. “Disappear” is generally labeled a sad word in Twitter opinion polling, but the actual instance in the McDonald’s study was a McDonald's employee doing a magic trick, a generally “happy” incident by all accounts. In such a case, this word would be removed from the analysis.
Another pitfall of Twitter opinion polling is the lack of comparable public opinion surveys to validate it. Retweets tend to emphasize small or irrelevant events.
However, Twitter is an easier source of data that can be used to complement traditional surveys. It also allows the investigation of any topic of interest since data on most topics is readily available.