July 10, 2020

Understanding the Global Impact of COVID-19 through Data Science

The ongoing COVID-19 pandemic and associated travel restrictions prevented the 2020 SIAM Conference on the Life Sciences (LS20) from convening in person in Garden Grove, Calif., as planned. The conference's new online format nevertheless allowed researchers to present their work, some of which focused on studying and modeling SARS-CoV-2 and the resulting disease.

During a virtual minisymposium presentation at LS20, Adam Mahdi of the University of Oxford shared his experience working on the OxCOVID19 Database, which is associated with Oxford’s Institute of Biomedical Engineering. The project began as a response to the many challenges that researchers face when attempting to model the spread of COVID-19. The reported caseloads from some areas are inconsistent; for example, Leicester, England has not been reporting all of its cases to the public. Erratic reporting and inconsistent formatting also affect online data, as does a lack of granularity — some countries only report caseloads at the country level instead of by specific regions within the country.

Adam Mahdi described the hard work that went into building the OxCOVID19 Database.

About 14 researchers and developers, 10 of whom work on the project full-time, are contributing to the OxCOVID19 Database. Their goal is to build a large relational database of COVID-19 cases across the world using PostgreSQL, an open-source database system. “The data we are aiming to collect is not at the country level but at the regional level,” Mahdi explained. This regional focus adds to the project’s difficulty, since the team needs to assign identifiers that correspond to the different regions. Some of these identifiers were added automatically, but many had to be assigned by hand.

Finding the data to feed into the database was a challenging process, as some countries are better than others at providing data on their caseloads. For instance, Italy has a GitHub account that allows anyone to access its COVID-19 data, but Poland only provides data on Twitter via tweets from its Ministry of Health. The OxCOVID19 Database fetches data from over 60 sources daily, then unifies the data and pairs it with geographical regions. After the data is validated for consistency, it is stored in the PostgreSQL database, where users can access it as a CSV file or through an application programming interface.
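The fetch, unify, pair, and validate steps described above can be illustrated with a small sketch. This is not the project's actual pipeline: the source formats, field names, region identifiers, and the consistency rule (cumulative counts must be non-negative and must not decrease) are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical raw records from two sources that use different formats.
SOURCE_A = [  # ISO dates
    {"date": "2020-07-01", "region": "Lombardia", "confirmed": "100"},
    {"date": "2020-07-02", "region": "Lombardia", "confirmed": "120"},
]
SOURCE_B = [  # day/month/year dates and different field names
    {"day": "01/07/2020", "area": "Mazowieckie", "cases": "50"},
    {"day": "02/07/2020", "area": "Mazowieckie", "cases": "40"},
]

# Hypothetical lookup that pairs source region names with identifiers.
REGION_IDS = {"Lombardia": "ITA-LOM", "Mazowieckie": "POL-MAZ"}


def unify_a(rec):
    """Convert a SOURCE_A record to the common schema."""
    return {
        "date": date.fromisoformat(rec["date"]),
        "region_id": REGION_IDS[rec["region"]],
        "confirmed": int(rec["confirmed"]),
    }


def unify_b(rec):
    """Convert a SOURCE_B record to the common schema."""
    day, month, year = (int(part) for part in rec["day"].split("/"))
    return {
        "date": date(year, month, day),
        "region_id": REGION_IDS[rec["area"]],
        "confirmed": int(rec["cases"]),
    }


def validate(rows):
    """Split rows into accepted and rejected lists.

    Assumed rule: within a region, the cumulative count must be
    non-negative and must not fall below the last accepted value.
    """
    last_seen = {}
    accepted, rejected = [], []
    for row in sorted(rows, key=lambda r: (r["region_id"], r["date"])):
        previous = last_seen.get(row["region_id"], 0)
        if row["confirmed"] < previous:
            rejected.append(row)
        else:
            accepted.append(row)
            last_seen[row["region_id"]] = row["confirmed"]
    return accepted, rejected


def to_csv(rows):
    """Serialize unified rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["date", "region_id", "confirmed"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


unified = [unify_a(r) for r in SOURCE_A] + [unify_b(r) for r in SOURCE_B]
valid, rejected = validate(unified)
print(to_csv(valid))
```

In this toy example, the second hypothetical Mazowieckie record is set aside because its count drops from 50 to 40. Real sources do sometimes revise counts downward, so a production check would likely flag such rows for review rather than discard them.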

One important aspect of the database is its relational structure. Users can easily make queries by utilizing a composite key, which is a combination of several columns that together uniquely identify a row in a table. The database also contains data beyond COVID-19 caseloads, which allows researchers to explore how different factors affect infection rates. For example, it pulls in 40 megabytes of weather data each day, as well as information on factors such as intervention stringency, workplace mobility, and the status of transit options.
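To make the idea of a composite key concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs without a server, and the table names, column names, and values are hypothetical rather than taken from the OxCOVID19 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table declares a composite primary key: no single column is unique,
# but the listed combination of columns identifies exactly one row.
conn.executescript(
    """
    CREATE TABLE epidemiology (
        source    TEXT NOT NULL,
        date      TEXT NOT NULL,
        region_id TEXT NOT NULL,
        confirmed INTEGER,
        PRIMARY KEY (source, date, region_id)
    );
    CREATE TABLE weather (
        date          TEXT NOT NULL,
        region_id     TEXT NOT NULL,
        temperature_c REAL,
        PRIMARY KEY (date, region_id)
    );
    """
)

conn.executemany(
    "INSERT INTO epidemiology VALUES (?, ?, ?, ?)",
    [
        ("SRC", "2020-07-01", "ITA-LOM", 100),
        ("SRC", "2020-07-02", "ITA-LOM", 120),
        ("SRC", "2020-07-01", "POL-MAZ", 50),
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2020-07-01", "ITA-LOM", 24.5),
        ("2020-07-02", "ITA-LOM", 26.0),
        ("2020-07-01", "POL-MAZ", 19.0),
    ],
)

# Join caseloads to weather on the shared (date, region_id) columns.
rows = conn.execute(
    """
    SELECT e.date, e.region_id, e.confirmed, w.temperature_c
    FROM epidemiology AS e
    JOIN weather AS w
      ON w.date = e.date AND w.region_id = e.region_id
    ORDER BY e.region_id, e.date
    """
).fetchall()

# A second row with the same (source, date, region_id) violates the key.
try:
    conn.execute(
        "INSERT INTO epidemiology VALUES (?, ?, ?, ?)",
        ("SRC", "2020-07-01", "ITA-LOM", 999),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

for row in rows:
    print(row)
```

Because both tables share the date and region columns, a single join lines up caseloads with covariates such as temperature, which is the kind of cross-factor query the relational design is meant to support.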

The OxCOVID19 Database is still a work in progress, and Mahdi assured attendees that “more data is coming soon.” However, the database has already been used in a number of projects, including a visualization tool. Because the project is intended to collect data rather than analyze it, other researchers can now use this resource to model infection rates further. As the pandemic proceeds, the OxCOVID19 Database will serve as an essential tool for understanding COVID-19 and predicting its future course.

Jillian Kunze is the associate editor of SIAM News.