SIAM News Blog

When Data Meets Diversity

By Brianna C. Heggeseth and Chad M. Topaz

In 2017, social media discussions about the opening of a new wing at the Massachusetts Museum of Contemporary Art first alerted us to the dearth of works by women and people of color in major museum collections. When we asked Steven Nelson—now Dean of the Center for Advanced Study in the Visual Arts at the National Gallery of Art in Washington, D.C.—about the magnitude of underrepresentation, he noted that no researchers had ever gathered a data set with the size and completeness required to address this issue in a systematic way.

Together, we built the first collaborative research effort to quantify certain axes of demographic diversity among artists with works in the permanent collections of major art museums [3]. The lack of publicly available museum collections data, as well as the lack of data and consistency standards across museum curatorial databases, complicated our research efforts. Though we were eventually able to locate and standardize our data, the process was cumbersome and time-consuming.

The response of major museums to the COVID-19 pandemic has since shifted the terrain. Museums are increasingly putting their collections online, thereby presenting a golden opportunity for data scientists to ask and answer questions that were previously inaccessible. To this end—and noting that many areas of applied mathematics are now inextricably dependent on data—we present some opportunities and complexities of museum collections data.

Accessing Museum Data

Museum collections data and metadata are curated predominantly in an industry-standard software package called The Museum System (TMS). TMS allows museums to record acquisitions; the movement of art objects and their separable or non-separable components; and object metadata such as titles, artists, dates, and dimensions. The internal consistency of the data formats depends on the historical implementation of data content standards, which were established by museum registrars and database managers. In 2006, the Cataloging Cultural Objects [1] standards were published to “move toward shared cataloging and contribute to improved documentation and access to cultural heritage information.” While these data standards are gaining traction in the U.S., vocabularies (e.g., object classifications and artist nationality) and data entry formats (e.g., date formats and artist names) are still inconsistent. Missing data is also an issue, as some information has been lost in the transition to digital cataloguing systems; this is especially true of early acquisitions.
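To make the consistency problem concrete, here is a minimal sketch of the kind of cleanup that inconsistent data entry formats call for. It is not the pipeline used in [3]; the function names and the sample strings are hypothetical, and real catalog records would need many more cases (for example, dates before the year 1000 or names with suffixes).

```python
import re

def normalize_artist_name(raw):
    """Return 'First Last' from either 'Last, First' or 'First Last'.

    Hypothetical helper: assumes at most one comma separates the parts.
    """
    name = " ".join(raw.split())  # collapse repeated whitespace
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}".strip()
    return name

def extract_year(raw):
    """Return the first four-digit year (1000-2099) in a free-text date, or None."""
    match = re.search(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)", raw)
    return int(match.group(1)) if match else None

# Hypothetical examples of the same information entered in different ways.
for raw_name in ["O'Keeffe,  Georgia", "Georgia O'Keeffe"]:
    print(normalize_artist_name(raw_name))
for raw_date in ["c. 1887", "1887-1986", "1920s", "undated"]:
    print(raw_date, "->", extract_year(raw_date))
```

Even a small routine like this has to be revisited for each museum, since every catalog tends to have its own conventions.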

Until the last decade, only museum staff could access this data. The rise of social media and digital communication inspired major art museums to engage their audiences with online collections on their websites. While this move increases data accessibility, the collections’ HTML formatting is often inconsistent. Data scraping is possible, but it is cumbersome since it must be customized for each museum.

Analyzing Museum Data

Coding artist data for gender, ethnicity, and other demographics can be difficult, as these metadata are not often stored in TMS. This leaves any data scientist who is interested in anti-racism and other forms of social justice with three primary options for coding the relevant data.

First, one could manually code artist profiles. For example, living artists can self-identify or art historians can infer demographic characteristics based on primary sources. Second, researchers might decide to consult data sets such as the Getty Research Institute’s Union List of Artist Names Online (ULAN), a large database of artists that includes some information about gender and other identities. However, linking ULAN to art museum website data requires that one manually add the static URL from Wikipedia or the ULAN ID to each piece on a museum’s website. As an additional challenge, the available race and ethnicity information appears within a variable called “nationality,” which, perhaps surprisingly, “contains reference to the nationality, culture, ethnic group, religion, or sexual orientation associated with the person.”

Finally, large-scale coding through Human Intelligence Tasks on web-based crowdsourcing platforms like Amazon Mechanical Turk can handle sizable data sets, especially when the first two options are infeasible or impractical. This crowdsourcing approach requires that laypersons read secondary sources online and make informed guesses about the ethnicity and gender of a random sample of artists. Because we expect more erroneous inferences from non-experts, statistical techniques are needed to assess inter-rater reliability and to provide confidence intervals on aggregate museum-level statistics.
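As a rough illustration of how crowdsourced labels might be aggregated, the sketch below takes a majority vote among raters for each sampled artist and then computes a Wilson score interval for the resulting proportion. This is a simplification and not necessarily the procedure used in [3]: the choice of the Wilson interval, the function names, and the sample ratings are all assumptions, and the interval ignores rater error and finite-population effects.

```python
import math
from collections import Counter

def majority_label(labels):
    """Return the most common label, or None if empty or tied for first place."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical ratings: each sampled artist is labeled by three raters.
ratings = {
    "artist_1": ["woman", "woman", "man"],
    "artist_2": ["man", "man", "man"],
    "artist_3": ["woman", "man", "unknown"],  # three-way tie, left unresolved
}
resolved = {a: majority_label(r) for a, r in ratings.items()}
decided = [label for label in resolved.values() if label is not None]
n_women = sum(label == "woman" for label in decided)
low, high = wilson_interval(n_women, len(decided))
print(resolved)
print(f"share labeled 'woman': {n_women}/{len(decided)}, interval ({low:.2f}, {high:.2f})")
```

The share of artists on which raters disagree is itself informative, so in practice one would report it alongside the interval rather than silently dropping ties.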

It is important to remember that all demographic data, with the exception of information that the artists provide themselves, are inferred data. An individual’s characteristics, such as gender and ethnicity, can be reliably stated only by the individual themself.

Our Projects

Our first deep dive into museum data sought to measure the (under)representation of female artists and artists of color in 18 major American art museums [3]. Within this group of museums, we estimate that 85 percent of artists are White and 87 percent are men (see Figure 1). Some people believe that the large proportion of White people and men “makes sense” because museum collections sometimes focus on time periods and geographic regions in which those two groups dominated artistic production. However, even putting aside issues pertaining to whose work is considered art and valued by society, our results support a different conclusion.

Figure 1. Based on the results of [3], data journalist and artist Mona Chalabi created images to represent the demographics of the data set, scaled down from over 10,000 artists to 100 for the purpose of visualization. In this case, 88 of the artists would be men (75 White, eight Asian, three Latinx, one Black, and one man of another race/ethnicity). Figure courtesy of Mona Chalabi [2].

To reach our conclusion, we clustered the 18 museums in two different ways. First we clustered them by their “collection habits” — the time periods and geographic regions in which their art was created. Next we clustered them by their estimated demographic percentages for gender and ethnicity. Quite simply, these two clustering schemes are uncorrelated. For instance, the Museum of Fine Arts in Boston and the Detroit Institute of Arts both have catalogs in which the average artist birth year is around 1800 and roughly 30 percent of artists are of North American origin. However, we estimate that 95 percent of identifiable artists in the Detroit catalog are White, in contrast to only 80 percent of the artists in Boston. Of course, our study represents one snapshot in time. Collections are subject to change as museums make acquisitions and loan, sell, or gift various pieces.
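One common way to check whether two clusterings of the same items agree is the adjusted Rand index, which is near 0 when the groupings are unrelated and equals 1 when they match up to relabeling. The text above does not say which statistic was used to compare the two clustering schemes, so the choice of this index, the function name, and the toy labels below are assumptions made purely for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Count pairs of items that share a cluster in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings put all items together
    return (both - expected) / (max_index - expected)

# Hypothetical cluster labels for six museums under the two schemes.
by_collection_habits = [0, 0, 1, 1, 2, 2]
by_demographics = [0, 1, 2, 0, 1, 2]
print(adjusted_rand_index(by_collection_habits, by_demographics))
```

A value close to 1 would mean that museums with similar collecting habits also have similar artist demographics; a value near or below 0 would be consistent with the two groupings being unrelated.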

In response to our work, the National Gallery of Art invited us to participate in a two-day datathon, during which they allowed us full access to their internal data stores. Our major focal point for this event was the representation of women and artists of color on public view in the gallery spaces and exhibitions curated by the National Gallery, as distinct from their representation within the catalog alone. More specifically, we sought to answer the question, “Whose art is being seen by the public as they visit the museum?” Using metadata from TMS and location history data for the individual pieces, we found that over 75 percent of the art objects in public view at the National Gallery are attributable to an identifiable male and/or White artist.

However, the last five years have seen an increase in female and Black representation, primarily in the renovated East Gallery. Contemporary photographs, prints, and drawings by female and Black artists have driven this shift. The gallery staff informed us that while these media are more financially accessible to new artists, they are also more physically sensitive and can only be on public view for short periods of time. We created an online interactive visualization tool for gallery staff, the public, and other researchers that explores representation across the gallery space and over time.

Next Steps

Based on the results of our study, we recommend that museums immediately focus on standardizing data and metadata practices to allow for greater transparency. These improvements would enable easier and more rigorous data collection and analysis, thus helping researchers identify the extent of the underrepresentation of minoritized artists. The time is long past for major art museums to become activist collectors, emphasizing the work of women and artists of color in their collection practices to address historical underrepresentation and bias. Some museums have already begun to engage in this practice.

This area of study presents numerous opportunities for the applied mathematics, statistics, and data science communities. When subject to the right disciplinary and data cleaning expertise, museum databases can serve as looking glasses into museums’ degrees of success in living up to their missions, including their attention to diversity. Local art museums may have limited in-house resources and could be open to collaboration with data scientists and applied mathematicians. For example, we are currently working with the Minneapolis Institute of Art to analyze their collection’s accession and deaccession history, apply natural language processing to facilitate tag creation, and plan for data improvements and consistency. Greater communication between and within art and the mathematical and computational sciences could help improve museum data management systems, ultimately enabling greater progress towards diversification within the collections.


References
[1] Baca, M., Harpring, P., Lanzi, E., McRae, L., & Whiteside, A. (Eds.) (2006). Cataloging cultural objects: a guide to describing cultural works and their images. Chicago, IL: American Library Association.
[2] Chalabi, M. (2019, May 21). Museum art collections are very male and very white. The Guardian. Retrieved from https://www.theguardian.com/news/datablog/2019/may/21/museum-art-collections-study-very-male-very-white.
[3] Topaz, C.M., Klingenberg, B., Turek, D., Heggeseth, B., Harris, P.E., Blackwood, J.C., …, Murphy, K.M. (2019). Diversity of artists in major U.S. museums. PLOS One, 14(3), e0212852.

Chad M. Topaz is co-founder and Executive Director of Research at the Institute for the Quantitative Study of Inclusion, Diversity, and Equity: an independent, nonprofit research-into-action organization. He is also an applied mathematician, data scientist, and professor of mathematics at Williams College. Brianna C. Heggeseth is a statistician, data scientist, and associate professor of statistics at Macalester College.
