Statistical Methods Estimate Missing Information on Human Rights Violations

By Jillian Kunze

The available data on human rights violations is often incomplete, as collecting data in conflict settings is a difficult and frequently dangerous process. Datasets may therefore not be statistically representative of the entire victim population. “Some victims’ stories are never recorded, and those whose stories are documented may still be missing information about the victim, the perpetrator, or the violent event,” Maria Gargiulo, a graduate student at the University of Oxford, said.

Gargiulo shared the statistical methods that she uses in her work to document human rights violations as a statistician with the Human Rights Data Analysis Group (HRDAG) during a technical vision talk at the Women in Data Science Worldwide Conference 2022. In her talk, Gargiulo introduced different approaches for addressing missing information and estimating the total size of a victim population, including incidents that were never recorded.

“These missing data are generally not missing at random,” Gargiulo said. “There are a number of different statistical biases that affect which victims are reported and, perhaps more importantly, which victims are not.” For instance, events with a large number of victims are more likely to be documented than events with a small number of victims. There is also an urban bias, in which violence in urban settings is more likely to be documented than violence in rural settings due to issues of physical access and internet connection. Visibility can also affect reporting, as people with prominent positions (like lawyers and politicians) are more likely to be reported when victimized than less prominent individuals. The overall result of these effects is that the data does not represent the victim population as a whole.

Figure 1. A stylized example of human rights violation data with missing information. Green blocks represent missing fields of information for victims that do have some information present, while yellow rows represent undocumented victims who do not appear in the records. Figure courtesy of Maria Gargiulo.

Figure 1 provides a stylized example of how data with missing components might appear. The multiple sources for this data can include government institutions, non-profits, activists, and more; the same incident may appear on more than one list. The question marks in Figure 1 represent missing information. The question marks that are color-coded in green appear in fields for incidents that do have some other information documented — Person 4, for example, has information on the location of the incident but is missing data on both the date and the perpetrator of the violation. There are also undocumented victims who do not appear in the records at all, represented by the yellow rows.

Gargiulo explained two processes for dealing with this missing information. Imputation—the first process that she described—fills in blank spots in incomplete records of documented victims. “We can think of imputation as a process of filling in missing fields with information based on the non-missing information available in other records,” Gargiulo said. Doing so is necessary in order to calculate estimates that are stratified by fields with missing data, such as finding patterns in time or by perpetrator in this example. The only alternatives to imputation are to not perform a stratified analysis at all, or to drop the incomplete records entirely — which would statistically bias the results.

There are several different methods for imputation, and the most appropriate one generally depends on the specific type of analysis that a researcher is pursuing. Single imputation approaches (such as mean/mode imputation, regression imputation, or tree-based methods) fill in the missing values one time. However, it is preferable to represent the uncertainty that is inherent in this process. This can be accomplished through multiple imputation: repeating the imputation several times with a statistical model to randomly create distinct versions of the data, each with the missing values filled in differently. These repetitions allow researchers to propagate the uncertainty from the imputation stage into the final estimates.
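As a rough illustration of the difference between single and multiple imputation, the following Python sketch fills a missing field several times by drawing at random from the observed values. The records, the field name, and the helper functions are all hypothetical, and drawing from the observed values alone is a deliberate simplification: it ignores the other fields and the non-random missingness described above, which a real imputation model would need to address.

```python
import random
from collections import Counter

# Hypothetical toy records; None marks a missing perpetrator field.
records = [
    {"id": 1, "perpetrator": "state"},
    {"id": 2, "perpetrator": "state"},
    {"id": 3, "perpetrator": "rebel"},
    {"id": 4, "perpetrator": None},
    {"id": 5, "perpetrator": None},
]

def impute_once(rows, field, rng):
    """Return a copy of rows with missing values of `field` filled by
    random draws from the observed values (a simplifying assumption)."""
    observed = [r[field] for r in rows if r[field] is not None]
    return [
        {**r, field: r[field] if r[field] is not None else rng.choice(observed)}
        for r in rows
    ]

def multiple_imputation(rows, field, m=10, seed=0):
    """Create m completed datasets that differ only in the imputed values."""
    rng = random.Random(seed)
    return [impute_once(rows, field, rng) for _ in range(m)]

imputed = multiple_imputation(records, "perpetrator", m=10)

# The stratified count varies across the completed datasets; that spread
# is the imputation uncertainty carried into later estimates.
state_counts = [
    Counter(r["perpetrator"] for r in dataset)["state"] for dataset in imputed
]
print(state_counts)
```

A single imputation would correspond to keeping only one of these completed datasets, which hides the variation visible in `state_counts`.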

The second process for addressing missing data—called multiple systems estimation (MSE)—helps account for victims who were never recorded. “The basic idea behind this method is that we would like to use the information about the documentation patterns of the observed victims to estimate the total population size,” Gargiulo said. There are many different Bayesian and frequentist MSE models, including log-linear models, decomposable graphical models, and latent class multiple capture-recapture. Due to certain model assumptions, these methods are best used when data from at least three distinct sources are available.

Figure 2. A simplified visualization of the intuition and basic mathematics behind multiple systems estimation. Figure courtesy of Maria Gargiulo.

Part of the input to MSE is the pattern of which victims appear on only one list and which appear on multiple lists. Figure 2 provides a simplified visualization of the intuition behind this technique. In this example, there are two lists of victims (\(A\) and \(B\)) that capture part of a victim population of unknown size \(N\). \(A\) and \(B\) are independent of each other but do have some overlap in the specific victims that they list, represented by \(M\). One first defines the probabilities of a victim being on lists \(A\) and \(B\), then writes the probability of being in \(M\). Because lists \(A\) and \(B\) are independent of each other, one can rewrite this expression to estimate \(N\) from the observed data, thereby finding an estimated value for the entire number of victims, including those who were never documented.
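The two-list argument can be written out as follows. This is the standard capture-recapture calculation rather than a transcription of Figure 2; the notation \(|A|\), \(|B|\), and \(|M|\) for the number of victims on each list and in their overlap is an assumption introduced here.

\[
\begin{aligned}
P(A) &\approx \frac{|A|}{N}, \qquad P(B) \approx \frac{|B|}{N}, \qquad P(M) \approx \frac{|M|}{N}, \\
P(M) &= P(A)\,P(B) \quad \text{(independence of the two lists)}, \\
\frac{|M|}{N} &\approx \frac{|A|}{N}\cdot\frac{|B|}{N}
\quad\Longrightarrow\quad
\hat{N} = \frac{|A|\,|B|}{|M|}.
\end{aligned}
\]

For example, with hypothetical counts \(|A| = 100\), \(|B| = 80\), and \(|M| = 20\), the estimate is \(\hat{N} = (100 \times 80)/20 = 400\), even though only \(100 + 80 - 20 = 160\) distinct victims were documented.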

Gargiulo finished her talk by describing how she combines multiple imputation and MSE so that estimates stratified by fields with missing data include measures of uncertainty from both processes. The first step is to create around 10 imputed datasets with the R package mice; these datasets will all be slightly different, as imputation is a probabilistic process. For each imputed dataset, Gargiulo then calculates MSE estimates of the total population size with the R package LCMCR. By combining the resulting sets of estimates via certain combination rules, she obtains a point estimate of the likely number of victims with an interval that reflects the uncertainty in both imputation and MSE.
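The article does not say which combination rules Gargiulo uses. One common choice for multiple imputation is Rubin's rules, which the Python sketch below applies to hypothetical per-dataset estimates and variances. Treat both the choice of rules and the numbers as assumptions; a Bayesian workflow built on LCMCR posterior samples may pool results differently.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-dataset estimates with Rubin's rules (an assumed choice).

    estimates: point estimate of the population size from each imputed dataset
    variances: estimation variance attached to each of those estimates
    Returns the pooled point estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average estimation (here, MSE) variance
    between = variance(estimates)   # spread across imputations, m - 1 denominator
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from 10 imputed datasets.
estimates = [410.0, 395.0, 402.0, 420.0, 388.0,
             405.0, 399.0, 412.0, 393.0, 406.0]
variances = [900.0] * 10

point, total_var = pool_rubin(estimates, variances)
print(point, sqrt(total_var))
```

The total variance exceeds the average within-dataset variance whenever the estimates differ across imputations, which is how the imputation uncertainty enters the final interval.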

The estimate of the total size of the victim population and its associated uncertainty interval can become the basis for studies on patterns of violence. Gargiulo mentioned that her favorite part of working with HRDAG is collaborating with partner organizations to support their work on the ground. Such projects include training analysts from the Colombian truth commission and the Jurisdicción Especial para la Paz in data processing, record linkage, imputation, and MSE. Statistical approaches that deal with missing information in real datasets on human rights violations can make a difference in research.

A recording of Maria Gargiulo's technical vision talk at the Women in Data Science Worldwide Conference 2022 is available on YouTube.

Jillian Kunze is the associate editor of SIAM News.