How is Big Data Changing Social Science?

By Karthika Swamy Cohen

Social data has grown at an unprecedented rate in the last decade, driven by web data stores like Wikipedia and Google, social media channels like Facebook, Twitter, and YouTube, smartphones and handhelds, the prevalence of Wi-Fi, and so on. These tools have enabled not only easy access to social data, but also new ways to learn about and analyze the social world at a scale, granularity, and precision not possible before.

Digital vs. analog data. Photo Credit: Matthew Salganik, AN16 presentation.

In his invited talk at the SIAM Annual Meeting, “Social Science in the Age of Big Data,” Matthew Salganik of Princeton University talked about how traditional social science research design can help us understand these new data sources. He also emphasized that new ways of obtaining information require us to update our thinking on research design.

Salganik began by asking, “Is computational science a fad?”

No, it's not, he said, answering his own question. “Computational science is not a fad since there is a fundamental change in the world that’s driving it, and that's the change from the analog to the digital world.” The amount of information we have access to keeps growing in the digital world. And as Moore’s Law is commonly summarized, computing power doubles roughly every two years.

“People who study social behavior have a choice of taking advantage of this [data explosion] or being left behind,” said Salganik. “Eventually computational science will become social science since that's the way we will learn about the world.”

He hastened to clarify that this doesn’t mean computational science methods will displace other methods. They will instead complement the methods already in use. For instance, you can't gauge the homeless population with Twitter.

One issue with computational science is that people tend to think about it as online and on the web, but in reality, it's everywhere.

“Pokemon GO combines digital devices and the physical world,” said Salganik, giving an example of how the two worlds are interconnected. “Increasingly, the Internet of things is going to be about building sensors into all kinds of devices that will be a great opportunity for social research. We are going to start thinking about it just as behavior rather than online behavior.”

One aspect that computational and social scientists need to start thinking about is a move from “found data” to “design data”. We should be moving toward design data where data is collected for the purposes of research. That is much more useful for social science research than data that just happens to be available.

Salganik then moved on to the topic of research design, describing randomized clinical trials as an example of research design. Research design is extremely important in social science research, he said, and described four main methods of design: observing people's behavior; asking questions; running experiments; and mass collaboration.

He then turned to Wikipedia as a great example of mass collaboration. The digital age makes it much easier for people to communicate and collaborate.

Wiki surveys are like Kittenwar! for ideas. Photo Credit: Matthew Salganik, AN16 presentation.

Surveys, however, are still important in social science. “People hate surveys,” said Salganik, but, “Most of what we know about quantitative psychology is from surveys.”

Why should we care about surveys in the age of big data? We will always need to ask people questions because of the inherent nature and limitations of “digital exhaust,” a term used to describe passive data traces emitted by the Internet of things.

A lot of the data that governments and companies collect serves their own information needs. But in social science, you need to get your own data based on the questions you want to answer. While the data collected by big corporations and government entities is good for measuring external states, it’s not ideal for measuring internal states. Social science relies on internal states, which are often how we try to explain behavior.

Salganik divided survey research into three eras. The first was the era of face-to-face interviews. The second was the era of landline phone interviews. While people were skeptical about phone surveys in the beginning, phones were greatly beneficial in terms of speed, cost, and accuracy. This realization in social science led to the prevalence of random digit dialing surveys. All kinds of large-scale social science surveys began to use this approach.

We are now in the third era of survey research: non-probability survey research, which does not involve random selection. Interviews are going to change from being administered by humans to being administered by computers. They're no longer going to be standalone exchanges, but will be more closely linked to other data sources.

Moving from human-administered to computer-administered interviews enables and requires lots of changes, Salganik cautioned. With in-person and phone interviews, we are used to designing for captive audiences.

“[In face-to-face interviews], individuals are physically sitting there. When you're on the phone, social pressures keep you talking,” said Salganik. “But on computer surveys, you are just one click away from a skateboarding dog.”

Drawing motivation from online aggregation sites like Wikipedia and conventional survey research, Salganik proposed a new method called wiki surveys. Similar to Wikipedia’s progress over time based on editor participation, a wiki survey is an evolving survey aided by continuous contributions from respondents.

The main principles of wiki surveys should be greed, collaboration, and adaptivity, he explained. Wiki surveys should collect as much or as little data as people are willing to give. “Some people contribute 80 hours a week, some add a comma every month,” he said. With wiki surveys, researchers want to optimize the experience for respondents and deal with the complexities themselves.

By being adaptive, wiki surveys only ask respondents the most valuable questions. “If we consider our survey respondents’ time valuable, we should use their time more efficiently and use the information we have more efficiently,” he said. They’re collaborative in that they are created jointly with users, and researchers remain open to new information.
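The article does not say how a wiki survey decides which question is most valuable. As a rough illustration only, the sketch below assumes one simple heuristic: show the two ideas that have appeared least often so far, so that newly contributed ideas quickly collect votes. The function name `next_pair`, the idea names, and the counts are all hypothetical and are not taken from Salganik's system.

```python
def next_pair(appearances):
    """Pick the two ideas shown least often so far.

    `appearances` maps each idea to the number of times it has been
    shown to respondents. Ties are broken alphabetically so the
    choice is deterministic. This is an assumed heuristic, not the
    selection rule used by any real wiki survey platform.
    """
    if len(appearances) < 2:
        raise ValueError("need at least two ideas to form a pair")
    least_shown = sorted(appearances, key=lambda idea: (appearances[idea], idea))
    return least_shown[0], least_shown[1]


# Made-up appearance counts for three seed ideas.
appearances = {"bike lanes": 12, "street trees": 9, "more parking": 11}

# A respondent contributes a new idea; it has not been shown yet.
appearances["shore power for ships"] = 0

pair = next_pair(appearances)
print(pair)
```

Under this heuristic the new idea is paired with the least-shown existing idea, which is one way a survey could spend respondents' time where information is scarcest.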

Matthew Salganik talked about social science in the age of big data at the SIAM Annual Meeting. Staff photo.

Based on this premise, Salganik’s group has developed methods for data collection and analysis for a pairwise wiki survey. He explained that pairwise wiki surveys can obtain information and insights that would be difficult to obtain with other methods.

His group hosts wiki surveys for anyone who wants to participate at allourideas.org. He described a project currently underway with Mayor Bloomberg’s office in New York. The wiki survey fielded 25 questions for creating a cleaner, greener, greater New York City.

The survey is open-ended. “This is basically kitten war for ideas,” said Salganik. “Anyone can add their own ideas as well, which go into the queue for people to vote.” This remark was accompanied by a shot of Kittenwar! winners, which had the audience in splits.

The data structure includes votes from all respondents, which go into an opinion matrix. That helps estimate how much each person likes each idea. There are some pitfalls, since respondents do not all get the same amount of information: ideas are added at different times, so not every idea is available from the start, and some people don't see all ideas. Participants were recruited through Facebook and Twitter.
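To make the idea of scoring from pairwise votes concrete, here is a minimal sketch. It does not reproduce the opinion-matrix estimation described in the talk, which works at the level of individual respondents; instead it substitutes a much simpler aggregate, a smoothed win rate per idea. The function name `score_ideas`, the smoothing parameters, and the sample votes are assumptions made for illustration.

```python
from collections import defaultdict


def score_ideas(votes, prior_wins=1.0, prior_losses=1.0):
    """Return a smoothed win rate for every idea.

    `votes` is an iterable of (winner, loser) pairs, one per pairwise
    vote. The priors act as pseudo-counts so that an idea with very
    few votes is not scored as exactly 0 or 1. This is a simplified
    stand-in, not the estimator used by Salganik's group.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        losses[loser] += 1

    ideas = set(wins) | set(losses)
    return {
        idea: (wins[idea] + prior_wins)
        / (wins[idea] + losses[idea] + prior_wins + prior_losses)
        for idea in ideas
    }


# Made-up votes: each tuple is (idea chosen, idea passed over).
votes = [
    ("bike lanes", "more parking"),
    ("bike lanes", "street trees"),
    ("street trees", "more parking"),
    ("shore power for ships", "more parking"),
]

scores = score_ideas(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
for idea in ranking:
    print(f"{idea}: {scores[idea]:.2f}")
```

The smoothing matters for the pitfall noted above: an idea added late has few votes, and without pseudo-counts a single win would put it at the top of the ranking.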

They received 31,000 responses and the fascinating thing was that eight of the top ten ideas were originally proposed by users. This shows the value of the openness of surveys, Salganik said.

He explained this with examples of two top user-generated ideas:
- “plug ships into electricity grid so they don’t idle in port—reducing emissions equivalent to 12000 cars per ship.”
- “Keep NYC’s drinking water clean by banning fracking in NYC’s watershed”

In the case of the first idea, the Mayor’s office hadn’t considered this a priority. In the case of the second idea, while the Mayor’s office was certainly interested in protecting NYC’s watershed, they would not have mentioned fracking.

The high-scoring user-contributed ideas either contained novel information or suggested alternative framings for existing ideas. This reinforces the importance of collaboration and openness in surveys. People are able to articulate responses that resonate with other people more than those written by researchers or public officials.

In conclusion, Salganik said, “We don't need to look through other people's trash—we can move beyond found data.” Ideal data is almost certainly not the data you have.

“With great power comes great responsibility,” Salganik said, quoting Spider-Man to reinforce that while social scientists can now observe millions of people and analyze their data without their knowledge, this power can be used for good and should be used in a safe and responsible way.

Karthika Swamy Cohen is the managing editor of SIAM News.