Creating Findable, Accessible, Interoperable, and Reusable Cyberinfrastructure

By Jillian Kunze

The landscape of scientific computing is evolving. "It is apparent that both the demand for and availability of cyberinfrastructure resources are growing, especially given the trend of interdisciplinary collaborations," said Carol Song of Purdue University during her minisymposium presentation at the 2021 SIAM Conference on Computational Science and Engineering, which is taking place virtually this week. Collaborative cyberinfrastructure is quickly growing into an essential component of multidisciplinary research on many complex systems of global importance; however, this paradigm also faces major challenges. While collaborative infrastructures can help to make research more accessible and reproducible, they can also be difficult to build and sustain due to funding uncertainties and rapid technological changes. Researchers who develop collaborative cyberinfrastructures must tackle very complex problems for multiple audiences (such as collaborators, decision-makers, students, and the public) and thus need to balance the goals of research, outreach, and training.

Song is a member of a research software engineering group that works with scientists to develop computational infrastructure for data-driven geophysical research. During her talk, she described how her team evolved the software architecture and interoperability model of their science gateways to make community codes and tools available, and she listed four major steps in evolving collaborative cyberinfrastructure. The first step is to create software building blocks for data management in gateways. The second is to leverage those building blocks for one-off interoperability implementations, and the third is to develop end-to-end research workflows that span gateways to enable reproducibility. This leads to the final step: the development of customizable, lightweight, and personalized cyberinfrastructure.

The geospatial data analysis building blocks project (GABBs) is one example of a useful web-based data management tool in the realm of geophysical data workflows. This project consolidates individual data management and processing tools within the field of geophysics into reusable software building blocks, which users can employ to manage, share, and publish data. Conveniently, users can find many associated tools on a single platform. Developers can also use GABBs to quickly build new tools for geospatial data with innovative visualization capabilities, thanks to the framework’s support for seamless data management (see Figure 1). All of these tools are hosted on one platform through the core infrastructure of HubZero, a platform for scientific computing that Song and her team have worked with for many years. They have also worked with educational resource groups to build interoperability; HydroShare, an online environment for sharing data and code, has been very helpful for developing lessons with built-in interactive online exercises.

Figure 1. Tools for utilizing geospatial data developed through the geospatial data analysis building blocks.

The next phase of developing collaborative cyberinfrastructure that Song described focused on data wrangling. Her team works with geophysical researchers whose workflows incorporate data from multiple sources, which can lead to issues with formatting and aggregation. This messy reality means that researchers spend a huge amount of their time trying to organize data instead of performing analyses or creating visualizations. To address this issue, Song and a team of interdisciplinary collaborators developed the Extensible Geospatial Data Framework (GeoEDF). The project, which is funded by the National Science Foundation through the Cyberinfrastructure for Sustained Scientific Innovation program, aims to create a plug-and-play framework with pluggable data processors that allow users to compose their own pipelines for their workflows. GeoEDF’s goal is to make remote data directly usable, thus helping data-driven science become more findable, accessible, interoperable, and reusable. GeoEDF is deployed on MyGeoHub, a hub for geospatial modeling that reduces the hassle for users. The connectors and processors in GeoEDF are platform-agnostic and thus reusable, and much of the underlying complexity is abstracted away. The outcome of GeoEDF is a personalized, Python-based engine for composing and executing workflows.
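
The idea of composing a pipeline from pluggable connectors and processors can be illustrated with a minimal Python sketch. This is not GeoEDF's actual API; every name below (make_text_connector, to_celsius, drop_below, run_pipeline) is hypothetical, and the in-memory text lines merely stand in for a remote data source. The sketch only assumes that a connector acquires raw records and that each processor transforms the records it receives.

```python
from typing import Any, Callable, Dict, Iterable, List

# Illustrative types only; these are assumptions, not GeoEDF definitions.
Record = Dict[str, Any]
Connector = Callable[[], List[Record]]
Processor = Callable[[List[Record]], List[Record]]


def make_text_connector(lines: Iterable[str]) -> Connector:
    """Hypothetical connector that parses 'station,temp_f' lines.

    A real connector would fetch data from a remote source instead.
    """
    def connect() -> List[Record]:
        records = []
        for line in lines:
            station, temp_f = line.strip().split(",")
            records.append({"station": station, "temp_f": float(temp_f)})
        return records
    return connect


def to_celsius(records: List[Record]) -> List[Record]:
    """Hypothetical processor: convert Fahrenheit readings to Celsius."""
    return [
        {"station": r["station"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 2)}
        for r in records
    ]


def drop_below(threshold_c: float) -> Processor:
    """Hypothetical processor factory: keep records at or above a threshold."""
    def process(records: List[Record]) -> List[Record]:
        return [r for r in records if r["temp_c"] >= threshold_c]
    return process


def run_pipeline(connector: Connector, processors: List[Processor]) -> List[Record]:
    """Acquire data through the connector, then apply each processor in order."""
    data = connector()
    for step in processors:
        data = step(data)
    return data


raw = ["A,32.0", "B,50.0", "C,212.0"]
result = run_pipeline(make_text_connector(raw), [to_celsius, drop_below(5.0)])
print(result)
```

Because each stage depends only on the shape of the records it receives, a user could swap in a different connector or reorder the processors without touching the rest of the pipeline, which is the kind of reuse that the plug-and-play design is meant to allow.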

Song next addressed the sustainability of personalized cyberinfrastructure. She observed that these applications often have a very specific target audience, many members of which want an intuitive and interactive online interface through which they can explore a dataset’s documentation and visualizations before downloading it. It is also important for these infrastructures to be highly adaptable. They should be self-contained, so that researchers can continue to use them even if they move to a new institution, and must also be able to handle large datasets that update frequently. In addition, it is helpful for infrastructures to be both maintainable and extendable by researchers, who may not have dedicated web development expertise.

MyGeoHub has its own sustainability model that is split into three prongs: funding, code, and data. For funding, the shared hosting on the MyGeoHub platform enables multiple projects to share infrastructure costs; this also helps projects to stay online if there are any interruptions in funding. In terms of code sustainability, MyGeoHub promotes community contributions by keeping its code separate from any specific cyberinfrastructure. For data, the publication process on MyGeoHub generates digital object identifiers and annotations, both of which are important for long-term curation. There is also an emphasis on seamless transfers between institutional repositories and archival storage. These good practices will be essential for the continued operation of MyGeoHub as it becomes more widely adopted within the research community.

Song concluded her talk with some thoughts on careers in research software engineering. “This is an exciting career with lots of opportunities and good company, and people tend to stay for a long time,” she said. However, it can be difficult to find and hire research software engineers, which makes peer mentoring within the field especially important. Because building sustainable cyberinfrastructure is difficult, and the field has little institutional support and no dedicated professional society, working closely with colleagues and collaborators is essential.

Jillian Kunze is the associate editor of SIAM News.