Network Data: Dealing with Incompleteness and Bias

By Sucheta Soundarajan and Jeremy D. Wendt

While network analysis is applied in a broad variety of scientific fields (such as physics, computer science, biology, and the social sciences), the methods used to construct networks and the resulting bias and incompleteness have drawn more limited attention. For example, in biology, gene networks are typically developed via experiment – many actual interactions are likely yet to be discovered. In addition to this incompleteness, the data-collection processes can introduce significant bias into the observed network datasets [2, 3]. For instance, a classic random walk used to observe part of the World Wide Web network would be more likely to find high-degree nodes than a random selection of nodes [1]. Unfortunately, such incomplete and biased data collection methods are often necessary.
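
This degree bias is easy to reproduce. The sketch below — a generic illustration on a synthetic Barabási–Albert graph, not data from any study cited here — runs a simple random walk and shows that the nodes it visits have a much higher average degree than the graph as a whole, because the walk's stationary distribution is proportional to node degree.

```python
import random

import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=42)  # toy graph with a heavy-tailed degree distribution

def random_walk_visits(G, steps=50000, seed=0):
    """Simple random walk; returns how often each node was visited."""
    rng = random.Random(seed)
    node = rng.choice(list(G.nodes))
    visits = {}
    for _ in range(steps):
        node = rng.choice(list(G.neighbors(node)))
        visits[node] = visits.get(node, 0) + 1
    return visits

visits = random_walk_visits(G)
avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
walk_avg_degree = (sum(G.degree(n) * c for n, c in visits.items())
                   / sum(visits.values()))
print(f"average degree of all nodes:          {avg_degree:.2f}")
print(f"average degree of walk-sampled nodes: {walk_avg_degree:.2f}")  # noticeably higher
```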

At the recent Workshop on Incomplete Network Data (WIND), held at Sandia National Laboratories1 in Livermore, CA, researchers from academia, industry, and national labs gathered to discuss perspectives on dealing with incomplete network data. WIND was organized by Tina Eliassi-Rad (Northeastern University), James Ferry (Metron, Inc.), Ali Pinar (Sandia), and C. Seshadhri (University of California, Santa Cruz).

The workshop highlighted a host of areas with biased graph samples. Dennis Feehan (University of California, Berkeley) and Forrest Crawford (Yale University) both discussed a particularly interesting problem from the social sciences – how to accurately estimate the size of hidden or rare groups in massive populations by querying survey respondents. Bradley Huffaker (Center for Applied Internet Data Analysis, University of California, San Diego) presented problems related to obtaining an accurate map of the Internet, and Jiawei Han (University of Illinois at Urbana-Champaign) described techniques for supplementing explicit graphs by mining unstructured text.

Three main approaches emerged. The first involves estimating properties or characteristics of the global network, given only a partial observation of that network. For example, can one estimate the number of triangles (i.e., “A” knows “B,” “B” knows “C,” and “C” knows “A”) in the full network with only partial access to the full network data (e.g., Tammy Kolda, Sandia)? The second entails performing data collection in such a way as to reduce bias or increase the quality of the information obtained. For instance, how can one sample a node uniformly at random from a graph when access to the data is only through a random-walk-like crawl (e.g., Ravi Kumar, Google)? The third identifies how algorithms degrade under noise or incomplete data and designs them to be more robust. For example, local spectral methods provide results akin to full-graph spectral methods without being affected by problems in distant parts of the graph (e.g., Michael Mahoney, University of California, Berkeley; David Gleich, Purdue University).
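
To make the second approach concrete, here is a minimal sketch of one textbook technique for it: a Metropolis-Hastings random walk that samples nodes uniformly while querying only neighbors. This is a generic illustration under assumed parameters, not the specific method Kumar presented.

```python
import random

import networkx as nx

def mh_uniform_walk(G, start, steps=100000, seed=0):
    """Metropolis-Hastings random walk targeting the uniform distribution:
    a proposed move from u to a random neighbor v is accepted with
    probability min(1, deg(u)/deg(v)), which exactly cancels the walk's
    usual bias toward high-degree nodes."""
    rng = random.Random(seed)
    u = start
    visits = {}
    for _ in range(steps):
        v = rng.choice(list(G.neighbors(u)))
        if rng.random() < min(1.0, G.degree(u) / G.degree(v)):
            u = v  # accept the move; otherwise stay at u
        visits[u] = visits.get(u, 0) + 1
    return visits

G = nx.barabasi_albert_graph(1000, 3, seed=42)
visits = mh_uniform_walk(G, start=0)
sampled_avg = (sum(G.degree(n) * c for n, c in visits.items())
               / sum(visits.values()))
print(f"average degree under MH walk: {sampled_avg:.2f}")  # close to the true mean degree
```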

These categories are complementary, and a real-world application of network analysis on incomplete data would ideally incorporate all three techniques. The third category (understanding the robustness of existing algorithms against noise or incompleteness) is an important first step in any such unified approach.

David Kempe (University of Southern California) discussed the problem of algorithm robustness at the workshop. He pointed out that in real-world network data, “Noise is the norm, not the exception,” and that understanding the effects of noise on algorithmic tasks is critical. To illustrate this point, he considered the problem of influence maximization: in a network setting, if we assume that an individual’s beliefs can affect their neighbors’ beliefs, which nodes should we influence to have the greatest effect on the beliefs of the population as a whole? Kempe argued that the influence probabilities (i.e., the probability that node “A” will influence the belief of node “B”) can have a large effect on which nodes are selected, yet any estimates of these probabilities are likely to be inaccurate.
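
This sensitivity shows up even in toy settings. The sketch below runs Kempe-Kleinberg-Tardos-style greedy seed selection under the independent cascade model on Zachary's karate club graph; the edge probabilities are made-up illustrative values, and the chosen seed sets can differ as the assumed probability changes.

```python
import random

import networkx as nx

def simulate_cascade(G, seeds, p, rng):
    """One independent-cascade simulation: each newly activated node gets a
    single chance to activate each inactive neighbor, with probability p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def greedy_seeds(G, k, p, trials=200, seed=0):
    """Greedy influence maximization: repeatedly add the node with the
    largest estimated marginal gain in expected cascade size."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for cand in G.nodes:
            if cand in seeds:
                continue
            spread = sum(simulate_cascade(G, seeds + [cand], p, rng)
                         for _ in range(trials)) / trials
            if spread > best_spread:
                best_node, best_spread = cand, spread
        seeds.append(best_node)
    return seeds

G = nx.karate_club_graph()
print(greedy_seeds(G, 3, p=0.05))  # seed set under one probability estimate
print(greedy_seeds(G, 3, p=0.20))  # can differ under a different estimate
```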

Kempe also considered the effect of noise or incomplete data on community detection (the problem of clustering the nodes of a network into cohesive groups). He argued that the output of community detection methods can likewise be significantly affected by noise or missing edges. For example, missing edges might lead an algorithm to identify two communities where, had those edges been observed, it would have found only one. Kempe argued that community detection on incomplete network datasets, while not sufficient for drawing conclusions, may be appropriate for suggesting hypotheses that can then be verified by other means.
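
A toy version of this effect, using networkx's greedy modularity heuristic as a stand-in for whatever detection method one might use in practice: a single tightly knit group splits into two reported communities once the edges bridging its halves go unobserved.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.complete_graph(16)                     # one tightly knit group
print(len(greedy_modularity_communities(G)))  # -> 1 community

H = G.copy()
# Suppose every edge between the two halves of the group went unobserved:
H.remove_edges_from([(u, v) for u, v in G.edges if (u < 8) != (v < 8)])
print(len(greedy_modularity_communities(H)))  # -> 2 communities
```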

Along similar lines, Anil Vullikanti (Virginia Tech) considered how noise can affect the core decomposition of a graph. A core of a graph is, in essence, a ‘dense’ or ‘central’ part of the graph and, among other applications, can be used to measure the importance or centrality of nodes in the network. Through experimental results, Vullikanti demonstrated that k-cores are unstable when the network is perturbed in degree-biased ways (that is, when the probability of a perturbation affecting a node depends on the number of connections the node has). This is a critical problem because one of the most common ways of obtaining network data (crawling via breadth-first search) leads to exactly this kind of degree-biased sampling.
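
A rough sketch of this kind of experiment follows; the graph model, removal rule, and parameters are illustrative assumptions, not Vullikanti's actual setup. It removes a fraction of edges with probability weighted by endpoint degree, then counts how many nodes' core numbers change.

```python
import random

import networkx as nx

def degree_biased_removal(G, frac=0.1, seed=0):
    """Delete roughly frac of the edges, preferring high-degree endpoints."""
    rng = random.Random(seed)
    edges = list(G.edges)
    # Weight each edge by the sum of its endpoint degrees.
    weights = [G.degree(u) + G.degree(v) for u, v in edges]
    # Weighted sampling with replacement, so slightly fewer than
    # frac * |E| distinct edges may be removed.
    doomed = rng.choices(edges, weights=weights, k=int(frac * len(edges)))
    H = G.copy()
    H.remove_edges_from(doomed)
    return H

G = nx.barabasi_albert_graph(2000, 5, seed=1)
H = degree_biased_removal(G, frac=0.1)
before, after = nx.core_number(G), nx.core_number(H)
changed = sum(1 for n in G if before[n] != after[n])
print(f"nodes whose core number changed: {changed} / {G.number_of_nodes()}")
```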

Other research presented during the workshop proposed techniques to overcome missing data or noise, including strategies for counteracting bias or generating more accurate network samples. Many attendees presented early solutions that show great promise.

However, the consensus among WIND attendees was that incomplete data presents a daunting challenge to performing accurate network analysis. Several critical questions remain: How do we measure or estimate the noise, bias, or incompleteness of network datasets? What tests could we run to thoroughly assess the effects of these data errors on later analyses?


1 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

References

[1] Kurant, M., Markopoulou, A., & Thiran, P. (2010). On the bias of BFS. In 22nd International Teletraffic Congress (pp. 1-8). Amsterdam, Netherlands.

[2] Leskovec, J., & Faloutsos, C. (2006, August 20-23). Sampling from large graphs. In 12th International Conference on Knowledge Discovery and Data Mining (KDD). Philadelphia, PA.

[3] Maiya, A.S., & Berger-Wolf, T.Y. (2010, April 26-30). Sampling community structure. In 19th International World Wide Web Conference. Raleigh, NC.

About the Authors

Sucheta Soundarajan is an assistant professor at Syracuse University. Her research is in the area of social network analysis, including topics related to network sampling, incomplete network data, and community detection. Jeremy D. Wendt is a member of the R&D staff at Sandia National Laboratories. His research focuses on machine learning, text mining, and graph analytics, especially when driven by social analytics.