Trust Me. QED.

By Michael A. Heroux

Consider a standard SIAM journal article containing theoretical results. Each theorem has a proof that typically builds on previous developments. Since every theorem stems from a firm foundation, the research community can trust a result without further evidence. One could thus argue that a theorem does not require a proof because surely an author would not publish it if no proof existed to back it up. However, respectable reviewers and editors expect proofs without exception, and papers containing proof-less theorems will likely go unpublished.

Next, consider a paper with computational results and look for details on the generation of these results. Suppose your task is to obtain qualitatively the same outcome. Does the article contain enough information to support you in this task? If your experience is like mine, you will find that while some authors provide details that give you a chance to succeed, many others do not. In contrast to the rigor of justifying theoretical conclusions, the meticulousness (or lack thereof) applied to many published computational results is the equivalent of “Trust me. QED.”

In defense of the insufficiency currently associated with computational results, we must acknowledge that fully capturing the input conditions and specifying the necessary execution environment to repeat a computational experiment has traditionally been very challenging. Providing this information for another scientist’s use is even more so. Furthermore, scientists trained in formal academic environments were seldom exposed to the tools, practices, and processes used to confirm result reproducibility. But with computation’s increasingly critical role in science and engineering—and the availability of new tools, practices, and processes to make the job easier—we can and must improve reproducibility.

The past decade has seen the emergence of new platforms that support rigorous software management. Environments like GitHub and GitLab provide users with the ability to develop, test, and integrate software changes using efficient collaborative workflows. The wide usage and accessibility of these platforms ensure that community members can easily document and publish a description of the software environment used to compute a result. Additionally, container technologies such as Docker support encapsulation of the full software environment, enabling portable execution on many computer systems with very little overhead. With these improved tools, scientific software developers can adopt new workflows and practices that make reproducible computational results more feasible.
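As a rough illustration of what documenting the software environment might look like in practice, the sketch below records the interpreter version, platform, current git commit, and installed package versions alongside a result. It is my own minimal example, not part of any platform’s API; the function name capture_environment and the output file provenance.json are illustrative.

```python
import json
import platform
import subprocess
import sys
from importlib import metadata

def capture_environment(outfile="provenance.json"):
    """Write a minimal description of the software environment used to
    compute a result (illustrative sketch, not a complete provenance tool)."""
    try:
        # Commit hash of the code that produced the result, if run inside a git repository.
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=False
        ).stdout.strip() or "unknown"
    except OSError:
        commit = "unknown"

    record = {
        "python": sys.version,
        "platform": platform.platform(),
        "git_commit": commit,
        # Versions of all installed packages visible to this interpreter.
        "packages": {dist.metadata["Name"]: dist.version
                     for dist in metadata.distributions()},
    }
    with open(outfile, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Publishing even such a small record with a result, or better yet a container image that bundles the full environment, gives reviewers and later readers a concrete starting point for rerunning the computation.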

Reproducibility as the Key Focus

Both the growing importance of computation in science and engineering and the availability of new tools, practices, and processes yield more opportunities to elevate the quality of computational science. Raising expectations for reproducible results provides the incentive for realizing these opportunities.

Reproducibility as a fundamental goal in computational science is very powerful. First, the computation should be repeatable, yielding the same qualitative result each time. Second, the result must be usable, trustworthy, and extensible as a step toward scientific progress, without concern about moving in the wrong direction. While the details can be more complicated and require careful articulation, the basic concepts should be easy to grasp.

Demanding Reproducibility Improves Productivity and Sustainability

Reproducibility expectations dictate that the software, input data, and execution environment used to produce published results must be available in the future. Validation of computational results using independent software, data, and execution environments would provide the most rigorous evidence of reproducibility (commonly termed “replicability”). However, verification of an author’s findings using his or her own environment is still valuable and arguably the first step in a prudent approach to obtaining trustworthy results.

The increased incentive to improve developer productivity and software sustainability is a serendipitous outcome of pursuing reproducible results. Authors have compelling reasons to invest in source code management and annotation, data provenance, improved documentation, automated build tools, and comprehensive tests. The higher workload will initially delay results, raise costs, and perhaps require a research team to invest in skills or people that are not directly focused on scientific questions. Yet in the long run, researchers will see overall improvement in scientific output, especially when accounting for the increased trustworthiness of computational results. The ensuing increase in quality is fundamentally valuable.

Metadata and Meta-computation

As we have observed, trustworthy computational results require something beyond published outcomes. In most cases, program source code, documentation, input data, and details about the computing environment are essential. A complete software container that provides the entire computing environment in a single file is even better. However, metadata is not sufficient in all situations. In particular, some computational results are obtained from computing environments with limited access, such as supercomputing centers. In these cases, one can perform additional computations, such as tests of conservation properties or operator symmetries, that provide independent evidence for the correctness of the main result.

For example, if a computation involves the application of a symmetric linear transformation \(A\) to a vector—as would happen in a Krylov iterative solver—one can compute the expressions \(x^T(Ay)\) and \(y^T(Ax)\) for two random vectors \(x\) and \(y\) and obtain the same scalar result, up to roundoff error. Putting this simple test into a preamble computation prior to executing the main code is inexpensive and useful. Meta-computations can provide suitable substitutes or supplements to metadata for boutique computing environments, wherein reviewers or authors may be unable to access the same computing environment in the future. In addition to supercomputers, experimental computer systems, testbeds, and configurable hardware systems are all transient computational environments that can be difficult to re-instantiate at a later date. When they are inexpensive to perform, meta-computations can also help debug new functionality and confirm the sanity of novel software environments.
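A minimal sketch of such a preamble check follows, assuming the operator is available as a function that applies \(A\) to a vector; the names check_symmetry and apply_A are illustrative, not from the article.

```python
import numpy as np

def check_symmetry(apply_A, n, tol=1.0e-10, seed=None):
    """Preamble sanity check: for a symmetric operator A,
    x^T (A y) and y^T (A x) should agree up to roundoff."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    lhs = x @ apply_A(y)   # x^T (A y)
    rhs = y @ apply_A(x)   # y^T (A x)
    scale = max(abs(lhs), abs(rhs), 1.0)
    return abs(lhs - rhs) <= tol * scale

# Usage: a small symmetric matrix treated as a black-box operator.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])
assert check_symmetry(lambda v: A @ v, n=3, seed=0)
```

Running a handful of such checks before the main computation costs little and leaves a recorded trace that the software environment behaved as expected at the time the published result was produced.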

Reproducible and Replicable

Two distinct approaches tend to emerge as research communities assume expectations of reproducible scientific results. The first is the activity of a reviewer using an author’s experimental environment to duplicate that author’s result. This activity can be thought of as verification, answering the question, “Did the author do things right?” A second endeavor involves using a different experimental environment to obtain a consistent result; this can be considered validation, answering the question, “Did the author do the right thing?” Independent validation surely results in the most powerful evidence of correctness, but both activities are important to improving the trustworthiness of scientific results.

As communities develop distinct activities in pursuit of trustworthiness, taxonomies arise that allow common synonyms to take on specific meanings. In computational science communities, “reproducible” is typically associated with the verification activity and “replicable” with the validation activity. However, not all communities use these terms consistently. For example, the Association for Computing Machinery (ACM) defines “reproducible” and “replicable” in essentially the opposite way [2-3]. Even so, the concepts remain the same and both actions are valuable.

Expecting Reproducibility: How to Get There

Introducing new research approaches will increase the time and effort required to produce computational results. We will need novel tools, methodologies, and workflows, and may have to recruit or collaborate with new people to gain additional expertise. Investing in productivity and sustainability can reduce future time and effort in pursuit of reproducibility and improve the quality of our work, but not immediately. Any improvement strategy should be incremental, both within a given research team and across collective teams in a community (see Figure 1).

Figure 1. Introducing increased reproducibility expectations will raise the time and cost of obtaining computational results. We must raise expectations incrementally, improving how we work while pursuing our results. Finding and learning from early adopters while gradually increasing expectations over the span of a few years has been effective in other communities—such as the Association for Computing Machinery—where reproducibility efforts are rewarded. Figure courtesy of Michael A. Heroux.

Identifying teams within a community that have already made progress in reproducible computational results is a good starting point. Colleagues can often more readily adapt these teams’ approaches than seek methods from dissimilar software communities. These early adopters can provide inspiration and practical advice to others. Many options exist to help teams work toward reproducible results, meaning that the how of improved reproducibility—beyond the need to move ahead incrementally—is straightforward. Adapting incentive systems is more challenging.

The real first step toward improving reproducibility of computational results is to expose its value, and SIAM can play a central role toward that end by rewarding authors for assuring reproducible results (see Figure 2). The ACM awards badges for papers whose results have been reviewed [1], and many conferences do the same. SIAM can also provide recognition for society members who help lead the community toward reproducible computational science. Beyond SIAM, funding agencies and employers—especially academic institutions—can also play important roles by rewarding people whose computational work is consistently reproducible.

Figure 2. Adapting our incentive systems to expect improved reproducibility will increase demand for heightened productivity and sustainability, which will in turn enable our desired reproducibility improvements. SIAM can play a central role by introducing reproducibility incentives in its publications and giving recognition to community members who are leaders in reproducible computational science. Figure courtesy of Michael A. Heroux.

Computational scientists are inherent problem solvers. Given the challenge and incentive to make our computational results reproducible, we will develop effective and efficient ways to meet that challenge. We will also serendipitously improve our productivity and the sustainability of our software environments, ultimately moving from “Trust me. QED.” to trustworthy.


References
[1] Association for Computing Machinery (2018). Artifact Review and Badging. Retrieved from https://www.acm.org/publications/policies/artifact-review-badging.
[2] Heroux, M.A., Barba, L.A., Parashar, M., Stodden, V., & Taufer, M. (2018). Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences. Technical report. Office of Scientific and Technical Information, U.S. Department of Energy. Retrieved from https://www.osti.gov/biblio/1481626-toward-compatible-reproducibility-taxonomy-computational-computing-sciences.
[3] National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press.

Michael A. Heroux is a senior scientist at Sandia National Laboratories, director of Software Technology for the U.S. Department of Energy’s Exascale Computing Project, and scientist-in-residence at St. John’s University in Minnesota. His research interests include all aspects of scalable scientific and engineering software for new and emerging parallel computing architectures.