# Symposium Yields Insights on Big Data and Predictive Computational Modeling

Everyone is aware of the revolution that has taken place in the data sciences over the last few years and the impact it has had on medicine, commerce, education, and the media. A multitude of reports on the “big data” paradigm and the marked success of companies like Google and IBM provide constant reminders.

Advances in networking, sensor technologies, and computer science have enabled the collection of gigantic amounts of data at ever accelerating rates. At the same time, developments in machine learning/computational statistics and data mining have led to powerful tools for extracting patterns and trends from the data. Most of this output is mapped to phenomenological but predictive models of increasing complexity and involving huge numbers of parameters.

In parallel, computational scientists in physics, chemistry, biology, and engineering have been experiencing their own big data revolution. Thanks to the sophistication of the mathematical models and the availability of high-performance computing platforms, we can simulate physical processes at unparalleled levels of spatiotemporal resolution. With almost every simulation we see exuberant growth of output data.

Can we extract meaningful information from huge amounts of simulation data? Can we use the data to make predictions at scales that are currently inaccessible? How can we couple tools from the data sciences—tools that are capable of dealing with high dimensions and uncertainty—with physical principles?

These are some of the questions addressed during the symposium entitled “Big Data and Predictive Computational Modeling,” held at the Institute for Advanced Study, Technical University of Munich, May 18–21, 2015.

With a mindset firmly grounded in computational discovery, but a polychromatic set of viewpoints, several leading scientists—from physics and chemistry, biology, engineering, applied mathematics, scientific computing, neuroscience, statistics, computer science, and machine learning—met, engaged in discussions, and exchanged ideas for four days.

### Some Conclusions

**1. Big data can mean different things in different communities.** Whereas data in machine learning applications is cheap and abundant, data in the physical sciences comes from expensive, computationally demanding simulations. In the latter case what is produced might be better described as “tall data”: it is high-dimensional and structured, but we can expect neither a large number of instantiations nor uniform coverage of the problem's configuration space.

**2. Physical scientists trust their models more than data scientists do.** To a large extent, physical models are anchored in venerated physical principles (e.g., conservation of mass, energy). Nevertheless, parts of these models (e.g., constitutive laws) are as phenomenological as regression/classification models used in data mining. It has been said that you do not need to know the true cause as long as you can minimize the prediction error. This approach has been applied with great success in several machine learning tasks, but physical scientists would also like the discovery of patterns and trends to lead to comprehensible and interpretable physical principles as the underlying drivers of the complexity observed.

**3. Quantifying model uncertainties is recognized as an important step.** Model-selection issues arise prominently in obtaining reduced or coarse-grained descriptions of physical models (e.g., in molecular dynamics), and here the expertise and tools of machine learning and computational statistics can be extremely powerful. Information-theoretic tools and concepts are particularly useful in this respect and can be used even in non-equilibrium settings.

**4. Models employed in both machine learning and the physical sciences are multilevel and high-dimensional, with thousands of parameters to be inferred or learned.** The feasibility of these tasks is often limited to distributed computational environments in which each node is aware of only a portion of the (experimental or simulation) data. Novel methods are needed that reduce communication costs but can nevertheless lead to accurate estimates.
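As a deliberately simple illustration of the communication issue (not a method presented at the symposium), the sketch below assumes each node holds one shard of scalar data and sends only three numbers to a coordinator. The names `Summary`, `summarize`, and `merge` are hypothetical. For a mean and a variance this summary is exact; the multilevel models discussed above generally admit no such compact sufficient statistic, which is why new methods are needed.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """What a single node transmits: three numbers, whatever its shard size."""
    count: int
    total: float
    total_sq: float


def summarize(shard):
    """Runs locally on a node that sees only its own portion of the data."""
    return Summary(len(shard), sum(shard), sum(x * x for x in shard))


def merge(summaries):
    """Runs on the coordinator; returns the global mean and population variance."""
    n = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sum-of-squares form: fine for a sketch, but it can lose precision
    # when the variance is tiny relative to the mean.
    variance = total_sq / n - mean ** 2
    return mean, variance


# Hypothetical data split across three nodes.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(merge([summarize(s) for s in shards]))
```

The communication cost here is independent of the amount of data per node, which is the property one would like to approximate for richer models.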

**5. A little bias in estimates can be a good thing if it also leads to reduced variance.** Advocating approximate inference and learning tools to the applied mathematics and engineering community is preaching to the choir—these groups have become comfortable with the idea of approximate solutions to difficult mathematical problems. Such algorithms, which have seen explosive growth in the machine learning community in the last few years as part of efforts to address big-data challenges, would be ideal for the computationally demanding tasks of fusing models with data in the physical sciences and in the context of Bayesian model calibration and validation. Deterministic tools from numerical analysis (e.g., adjoint formulations) can frequently complement and enhance probabilistic methods.
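The trade-off in the heading can be seen in a textbook toy problem, sketched here under stated assumptions: we estimate the mean of a Gaussian from a handful of samples and compare the unbiased sample mean with a deliberately shrunk version of it. The function name `scaled_mean_error` and all parameter values are hypothetical, chosen only for illustration. For the estimator `c * sample_mean`, the mean squared error is c²σ²/n + (1 − c)²μ², so a factor c < 1 trades a little bias for a larger reduction in variance when the noise is large relative to the signal.

```python
import random
import statistics


def scaled_mean_error(c, mu, sigma, n, trials=20000, seed=0):
    """Monte Carlo estimate of (bias, variance, MSE) for the estimator c * sample_mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        estimates.append(c * sample_mean)
    bias = statistics.fmean(estimates) - mu
    variance = statistics.pvariance(estimates)
    return bias, variance, bias ** 2 + variance


# Hypothetical setting: weak signal (mu = 1), strong noise (sigma = 3), few samples (n = 5).
for c in (1.0, 0.5):
    bias, variance, mse = scaled_mean_error(c, mu=1.0, sigma=3.0, n=5)
    print(f"c={c}: bias={bias:.3f}, variance={variance:.3f}, mse={mse:.3f}")
```

In practice the best shrinkage factor depends on the unknown mean, so it must itself be estimated or chosen by a prior; the sketch only shows that an unbiased estimator is not automatically the most accurate one.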

**6. Many of the events of interest are rare.** Whether in seeking transition paths to overcome large free energy barriers in molecular simulations or in assessing the extremely small probabilities of failure in engineering systems, we need new tools that are capable of directing our simulations or data-acquisition mechanisms to the most informative regimes.
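Importance sampling is one classical way of directing samples toward the informative regime; it is used here as a stand-in and is not claimed to be what the symposium speakers proposed. The minimal sketch below assumes a toy problem: the probability that a standard normal variable exceeds a high threshold. Samples are drawn from a normal distribution centered at the threshold and reweighted by the density ratio. The function names are hypothetical.

```python
import math
import random


def tail_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Density ratio N(0,1) / N(threshold,1) evaluated at x.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def tail_prob_exact(threshold):
    """Closed-form reference value via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = tail_prob_importance_sampling(4.0, 20000)
print(estimate, tail_prob_exact(4.0))
```

With the same budget, sampling directly from the standard normal would produce on average fewer than one exceedance of the threshold, so the naive estimate would usually be zero or badly off. In molecular or engineering settings the hard part is finding a good proposal distribution, which this toy problem sidesteps.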

**7. Symposium participants knew a priori that dimension reduction is a key aspect of the analysis, whether the task is to make sense of atomistic trajectories, to look at huge databases of features or networks, or simply to visualize high-dimensional data.** Several nonlinear dimension-reduction tools and low-rank matrix factorization techniques were presented and discussed. It became apparent that for predictive purposes, dimension reduction is necessary but not always sufficient. In addition to a lower-dimensional set of collective variables, we must simultaneously infer a model for their interaction and evolution in time. This will not only enable extrapolation into regions for which data is not available, but also lead to efficient tools that exhibit sub-linear complexity with respect to the fine-scale degrees of freedom.

**8. Relationships between different communities can be bi-directional.** For a long time, methods and tools developed by the computational physics community (e.g., Markov chain Monte Carlo) have stimulated developments in statistics and machine learning, where the methods were formalized and their domain of application was expanded. Similarly, tools and techniques from machine learning have inspired developments and advanced our capabilities in the analysis of physical systems (e.g., ISOMAP for dimension reduction, graphical models for uncertainty quantification).

*Videos and lecture presentations from the symposium are available on the TUM-IAS website, and a post-symposium publication of related papers by the participants is planned. For further information, readers can contact the authors of this article, who also served as the symposium organizers.*