The organizers of the 2013 SIAM Conference on Computational Science and Engineering chose “Scalable Algorithms for Big Data” as one of the main conference themes. We can better understand many real-world data sets by looking at their connectivity (i.e., the network or graph) rather than at static features. The early success of Google’s PageRank analysis, for instance, showed that incorporating connectivity in web search was much more effective than keyword analysis alone. Today, network analysis plays an important role in a wide variety of application domains, including biology, chemistry, economics, communications, transportation systems, cybersecurity, sociology, and even sports analysis.
Large-scale networks are ubiquitous: Consider social networks; graphs of links between web pages; the power grid; graphs of phone, e-mail, or text communications; graphs of purchasing and/or financial transactions; player interaction networks in multiplayer online games; and computer network traffic between various IP addresses. These networks are not described simply by nodes and edges, but by attributed graphs, whose nodes and edges carry (possibly changing) properties. Moreover, the edges may have both timestamps and directionality, and they can appear (and disappear) based on the interactions between nodes.
BTER graph model. The nodes are color-coded—darker nodes are of higher degree. The blue edges correspond to affinity blocks, and the green edges to “random” connections. Image by Nurcan Durak, courtesy of Tamara Kolda.
The size of such networks is already quite large. For instance, public data sets available for research have graphs as large as 100 million nodes and 4.5 billion edges, and these are tiny compared to many graphs that companies like Facebook and Google are analyzing. Moreover, these data sets continue to grow in size—far outstripping analysis capabilities.
Several talks at CSE13 focused on ways to handle these large, complex, and evolving data sets. The challenges ranged from the theoretical (e.g., generating uniformly random instances of graphs) to the practical (e.g., discovering computer malware) and everything in between. Here are a few highlights:
- In her plenary talk, Tamara G. Kolda discussed the challenge of modeling large-scale networks in a way that captures both the connectivity of the nodes (i.e., the degree distribution) and the community structure (i.e., the degree-wise clustering coefficients). Ultimately, her team’s proposed BTER (block two-level Erdős–Rényi) model was able to capture these properties on a simple graph with more than 4 billion edges, although specialized techniques were needed just to compute the clustering coefficients of such large-scale graphs. For directed graphs, she showed that the situation is much more complicated because of the impact of the reciprocity of edges on the community structures.
- Graph modeling was the focus of several other talks as well. Dmitri Krioukov (UC San Diego) presented a hyperbolic model of evolving networks that achieves good clustering coefficients. Pierre-André Maugis (University College London) and Sonja Petrović (Penn State) both considered the problem of correctly estimating parameters for multiplicative probability models of edges. Maugis discussed the use of these parameters to find hubs. Petrović turned to polyhedral geometry to discover whether a solution even exists, and also used random walks to modify networks. Similarly, Ali Pinar considered how many moves of a Markov chain are needed to guarantee that the resulting graph is sufficiently far from the original graph.
- The theme of graph analysis for cybersecurity surfaced in several talks. John R. Johnson (Pacific Northwest National Laboratory) discussed the “pass the hash” attack that adversaries use to gain increasing levels of access within a network. His team’s goal was to analyze the vulnerability of a network, but they needed to use graph minors (an aggregated version of the graph) to make the problem tractable. Benjamin A. Miller and Matthew C. Schmidt (MIT Lincoln Laboratory) discussed methods for anomaly detection in time-varying graphs and their utility for finding real-world malicious activity in a graph of computer connections.
- High-performance computing was a theme in many talks. Big graphs pose challenges that differ from those of traditional scientific computing. David A. Bader (Georgia Tech) is involved in the Graph 500 Benchmark, which aims to guide the design of hardware and software architectures appropriate for network science. Henning Meyerhenke (Karlsruhe Institute of Technology) discussed HPC clustering methods.
- Several speakers described new packages for working with graphs. Jason Riedy (Georgia Tech) tackled the big-data problem with streaming analysis via the STINGER package, keeping only a small part of the data in memory at any time. Aydin Buluç (Lawrence Berkeley National Laboratory) presented the Knowledge Discovery Toolbox, which is accessible to domain scientists but uses the Combinatorial BLAS library for high performance. Joseph Gonzalez (UC Berkeley) discussed GraphLab, which uses clever hashing and approximations for fast graph computations.
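The two properties targeted by Kolda’s BTER model—the degree distribution and the degree-wise clustering coefficients—are easy to state concretely. As a minimal sketch in plain Python (not the speakers’ code, and feasible only for small graphs; the talk stressed that graphs with billions of edges require specialized techniques), one might compute them as:

```python
from collections import defaultdict
from itertools import combinations

def degree_distribution(adj):
    """Count of nodes at each degree, for a graph given as node -> neighbor set."""
    dist = defaultdict(int)
    for node, nbrs in adj.items():
        dist[len(nbrs)] += 1
    return dict(dist)

def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

def degreewise_clustering(adj):
    """Average local clustering coefficient among nodes of each degree."""
    by_deg = defaultdict(list)
    for node in adj:
        by_deg[len(adj[node])].append(local_clustering(adj, node))
    return {d: sum(cs) / len(cs) for d, cs in by_deg.items()}

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(degree_distribution(adj))    # two nodes of degree 2, one of degree 3, one of degree 1
print(degreewise_clustering(adj))  # triangle corners have clustering 1.0
```

Note that the degree-wise (per-degree) average is what distinguishes this measure from the single global clustering coefficient; matching it forces a model to place triangles correctly across the whole degree range.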
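The Markov-chain moves in Pinar’s question are typically degree-preserving edge swaps: pick two edges (a, b) and (c, d) and rewire them to (a, d) and (c, b), which leaves every node’s degree unchanged. A hedged sketch of one such move (a generic double edge swap on a simple undirected graph, not the talk’s implementation) looks like:

```python
import random

def double_edge_swap(edges, adj, rng=random):
    """Attempt one degree-preserving swap on a simple undirected graph.

    `edges` is a list of (u, v) tuples; `adj` maps node -> set of neighbors.
    Returns True if the swap was applied, False if it was rejected because
    it would create a self-loop or a duplicate edge.
    """
    i, j = rng.sample(range(len(edges)), 2)
    (a, b), (c, d) = edges[i], edges[j]
    # Reject swaps that would create self-loops or multi-edges.
    if a == d or c == b or d in adj[a] or b in adj[c]:
        return False
    # Remove the old edges (a, b) and (c, d)...
    adj[a].remove(b); adj[b].remove(a)
    adj[c].remove(d); adj[d].remove(c)
    # ...and insert the rewired edges (a, d) and (c, b).
    adj[a].add(d); adj[d].add(a)
    adj[c].add(b); adj[b].add(c)
    edges[i], edges[j] = (a, d), (c, b)
    return True
```

Repeating this move gives a Markov chain on the set of graphs with a fixed degree sequence; the open question discussed at the meeting is how many such moves suffice before the chain’s state is adequately far from (and uncorrelated with) the starting graph.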
CSE13 was remarkable for demonstrating both the growing importance of network analysis itself and the role of applied mathematics within network analysis.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.