By Karthika Swamy Cohen and Lina Sorg
Megan Price, executive director of Human Rights Data Analysis Group (HRDAG), a nonprofit in San Francisco, spoke at the Women in Data Science Conference held at Stanford University last Friday. Price discussed her organization’s behind-the-scenes work to collect and analyze data on the ground for human rights advocacy organizations. HRDAG partners with a wide variety of human rights organizations, including local grassroots non-governmental groups and—most notably—multiple branches of the United Nations.
The organization also helped collect evidence of human rights violations under former Guatemalan president Efraín Ríos Montt, and contributed to testimony against him. HRDAG’s analysis of the relevant data allowed lawyers to argue that the patterns of violence in Guatemala were consistent with killings that targeted a particular ethnic group, and therefore with genocide.
In 2015, HRDAG supplied data analysis during the trial of former Chadian dictator Hissène Habré for systematic torture and crimes against political prisoners. Habré was sentenced to life in prison in 2016. “For 25 years we have held state leaders accountable for violations of human rights, and we will continue to do that at home and abroad,” Price said.
She explained that her team uses record linkage, the process of deduplicating databases by merging records that refer to the same person, as well as an analytics method called multiple systems estimation (MSE) to collect data and study patterns. Researchers first developed MSE roughly 150 years ago to study animal populations in ecology. MSE encompasses a broad class of methods; implementations are available in various R packages (such as LCMCR) and often take Bayesian approaches.
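The core idea behind MSE can be illustrated with its simplest special case, the two-list (Lincoln–Petersen) capture-recapture estimator: when two independently compiled lists document overlapping subsets of a population, the size of the overlap suggests how much of the population neither list captured. The counts below are made up, and HRDAG’s actual analyses use far more sophisticated Bayesian multi-list models; this is only a sketch of the principle.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list capture-recapture estimate of total population size.

    n1: individuals documented by list 1
    n2: individuals documented by list 2
    m:  individuals appearing on both lists (found via record linkage)
    """
    if m == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    return n1 * n2 / m

# Hypothetical counts: two lists of 300 and 250 documented victims,
# 100 of whom appear on both lists.
print(lincoln_petersen(300, 250, 100))  # → 750.0
```

Intuitively, if list 2 caught 100 of list 1’s 300 individuals, it caught roughly a third of everyone, so its own 250 records imply a total near 750.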
In 2012, the Office of the United Nations High Commissioner for Human Rights (OHCHR) approached HRDAG requesting help with the analysis and description of data sets in Syria. HRDAG has since co-written publications with the UN documenting deaths in the ongoing crisis, and Price serves as the lead statistician for this work. The project focuses only on conflict deaths directly related to violence; different demographic techniques would be necessary to study non-conflict deaths, such as disease-related casualties.
HRDAG receives data from four different organizations that compile lists of named victims: the Center for Statistics and Research – Syria, the Damascus Center for Human Rights Studies, the Syrian Network for Human Rights, and the Violations Documentation Centre. The groups have collectively amassed over 400,000 records of named victims from the beginning of the conflict to the end of 2015. However, Price hastened to clarify that this figure in no way indicates the true number of victims, as many victims are reported multiple times, perhaps to different sources or even to the same source by different community members.
“Victims of conflict violence are a subset of all victims,” she said. “Not every victim is identified, not every victim’s story is told right away. It may be days, months, or years before we hear some stories from the conflict.”
HRDAG first deduplicates the lists, merging records that refer to the same person via the aforementioned process of record linkage, which the team implements with Python modules. Price’s team performed supervised record linkage to clean up the databases: training data first goes out for human review, during which reviewers look at a pair of records, compare them, and classify them as referring to the same person or not. The pairs are also evaluated numerically, by summarizing how similar or dissimilar the records are; these summaries, along with the human-labeled data, feed into a computer model.
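The “numerically summarize” step might look roughly like the following sketch, which compares a hypothetical pair of records field by field using Python’s standard-library difflib. The field names and similarity measures here are illustrative assumptions, not HRDAG’s actual feature set.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] measuring how similar two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def feature_vector(rec1: dict, rec2: dict) -> list:
    """Numerically summarize how similar or dissimilar two records are.

    The fields ('name', 'location', 'date') are hypothetical; a real
    schema and real comparison functions would differ.
    """
    return [
        similarity(rec1["name"], rec2["name"]),
        similarity(rec1["location"], rec2["location"]),
        1.0 if rec1["date"] == rec2["date"] else 0.0,
    ]

# A hypothetical pair: same person, name transliterated two ways.
pair = ({"name": "Ahmad Khalid", "location": "Homs", "date": "2013-05-02"},
        {"name": "Ahmed Khaled", "location": "Homs", "date": "2013-05-02"})
print(feature_vector(*pair))
```

Vectors like this for the human-reviewed pairs, together with the reviewers’ match/non-match labels, become the training input for the classifier described next.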
Price’s team then feeds those key pieces of information, the human-labeled data and feature vectors (numerical comparisons of the pairs of records), into a classification model. The model generates a score for each pair, which can be used to set a threshold: above it, one concludes that two records refer to the same person; below it, that they do not.
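A minimal sketch of the scoring-and-thresholding step, with made-up weights and threshold; in practice the weights would come from a model trained on the human-labeled pairs, not be set by hand.

```python
def match_score(features, weights):
    """Weighted combination of similarity features.

    In a real pipeline these weights are learned by a classification
    model from human-labeled training pairs; here they are invented.
    """
    return sum(f * w for f, w in zip(features, weights))

def classify(score, threshold=0.8):
    """Decide match vs. non-match by comparing the score to a threshold."""
    return "match" if score >= threshold else "non-match"

weights = [0.5, 0.3, 0.2]  # hypothetical learned weights
print(classify(match_score([0.95, 1.0, 1.0], weights)))  # → match
print(classify(match_score([0.10, 0.2, 0.0], weights)))  # → non-match
```

Pairs whose scores fall near the threshold are exactly the ones most worth sending back for additional human review.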
Once the pairs are classified, the group must decide which sets of records refer to the same person: are A and B the same? B and C? A and C? This is a clustering problem; determining whether records A, B, and C all describe one person involves forming and breaking clusters, followed by post-processing. Eventually, the group arrives at a solution to the record linkage problem. “At the end of the problem, we will have a database that tells the number of uniquely identifiable victims that have been documented by one or more of the sources to which we have access,” Price said.
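The grouping step can be sketched as taking the transitive closure of the classifier’s match decisions, implemented here with a simple union-find structure. The record IDs and matched pairs are hypothetical, and real systems also break clusters apart and post-process, which this sketch omits.

```python
def cluster(records, matched_pairs):
    """Group records into clusters via union-find: if A matches B and
    B matches C, all three land in one cluster (one candidate victim)."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# Hypothetical record IDs: A–B and B–C were classified as matches,
# so A, B, and C form one cluster; D stands alone.
print(cluster(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

The number of resulting clusters is the count of uniquely identifiable documented victims, which is the quantity Price describes the database as reporting.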
At this point, however, the team is technically only halfway through the problem. They have identified the subset of identifiable victims, but not those that are as yet unknown. “Our job as data scientists is to recognize that there is some amount of the population that we don’t know about yet,” Price said. “There is more than just what we have been able to document and report.”