June 01, 2021

An Exploration of Inliers

The following is a short reflection from the author of Mining Imperfect Data: With Examples in R and Python (Second Edition), which was published by SIAM in 2020. This updated edition is more complete and contains techniques and applications that were not available in 2005, when the first edition (titled Mining Imperfect Data: Dealing with Contamination and Incomplete Records) was initially published.

The second edition of Mining Imperfect Data focuses on the definitions, consequences, detection, and treatment of 10 forms of “imperfection” that commonly occur in real datasets. These imperfections include well-known data anomalies like outliers and missing observations, as well as more obscure issues like inliers. We can define inliers as “data values that are consistent with the distribution of the bulk of the data, but are in error” [3]. The following example [1] from Mining Imperfect Data illustrates the phenomenon of disguised missing data [4] as one possible source of inliers: “Recently, a colleague rented a car in the U.S. Since she was Dutch, her post code did not fit into the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”

Another possible source of inliers is extreme coarsening, such as when one uses the first day of a month or year as a surrogate for “exact date unknown.”

Although this definition of inliers unfortunately provides no basis for detecting them, inliers in numerical data typically appear as values that repeat unusually often. By adopting this condition as a working definition, researchers can detect inliers by tabulating the frequency of each distinct value and searching for those with atypically high frequencies. This method converts the inlier detection problem into an outlier detection problem, for which a variety of solutions exist.
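The tabulation step can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function name `value_frequencies` and the toy loss values are assumptions made here for demonstration.

```python
from collections import Counter

def value_frequencies(values):
    """Return (value, count) pairs, ordered from most to least frequent."""
    return Counter(values).most_common()

# Hypothetical toy data: 200.0 repeats far more often than any other value.
toy_losses = [200.0, 531.2, 200.0, 1187.45, 200.0, 96.3, 200.0, 2440.9]

freqs = value_frequencies(toy_losses)
print(freqs[0])  # the most frequent value and its count
```

The counts returned here are the sequence to which an outlier detector is then applied; the values with outlying counts become the candidate inliers.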

The best known of these solutions is likely the “three-\(\sigma\) edit rule,” which declares points that lie more than three standard deviations from the mean to be outliers. Despite its popularity, this approach often performs badly in practice, leading to the development of alternatives like the Hampel identifier or the boxplot rule (discussed in chapter two of Mining Imperfect Data, “Dealing with Univariate Outliers”). Ironically, these alternative approaches fail catastrophically in the presence of numerical data with more than 50 percent ties (i.e., repeated values). Since there is zero probability of ties under the continuously distributed random variable model that researchers commonly use to characterize numerical data, one would expect to see few of them in inlier-free data. This means that the majority of distinct value frequencies that are computed via the aforementioned inlier detection strategy should be 1. As a consequence, the three-\(\sigma\) edit rule probably represents the best approach for identifying unusually large frequencies in inlier detection, despite its limitations.
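One way to see the failure is to apply both rules to a count sequence that consists mostly of 1s. The sketch below uses hypothetical counts and assumes the common form of the Hampel identifier (median plus or minus three MAD scale estimates, with the usual 1.4826 normalization); the book's own implementation may differ in detail. Because more than half of the counts are tied at 1, the MAD is zero, so the Hampel identifier flags every count that differs from the median.

```python
import statistics

# Hypothetical frequency counts: most distinct values occur once, as expected
# for continuous data, and a few values repeat.
counts = [1] * 30 + [2, 3, 40]
t = 3  # threshold multiplier used by both rules

# Hampel identifier: distance from the median, scaled by the MAD.
med = statistics.median(counts)
mad = statistics.median([abs(c - med) for c in counts])
hampel_flagged = [c for c in counts if abs(c - med) > t * 1.4826 * mad]

# Three-sigma edit rule: distance from the mean, scaled by the
# sample standard deviation.
mean = statistics.mean(counts)
sd = statistics.stdev(counts)
three_sigma_flagged = [c for c in counts if abs(c - mean) > t * sd]

print("MAD:", mad)
print("Hampel flags:", hampel_flagged)
print("Three-sigma flags:", three_sigma_flagged)
```

In this toy case the Hampel identifier flags 2, 3, and 40 alike, while the three-sigma rule singles out only the count of 40.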

Figure 1. Two views of the log nonzero loss data from the Australian vehicle insurance dataset.

The Australian vehicle insurance dataset [2], which is also available on the book's companion website, provides a real-world inlier example (see Figure 1). This dataset contains 67,856 records, with 11 fields that provide losses, claim counts, and vehicle and driver characteristics. The loss variable claimcst0 exhibits 3,257 distinct values that range from 0 to 55,922.13; 3,182 of the values appear only once, while the most frequent value appears 63,232 times. The five most frequently occurring values and their frequencies (in parentheses) are as follows: 0 (63,232), 200 (695), 353.76999998 (219), 389.94999981 (94), and 390 (35).

Applying the three-\(\sigma\) edit rule to this count sequence yields a mean frequency of 20.83, a standard deviation of 1108.02, and an upper outlier detection threshold of 3344.90. The only frequency that exceeds this threshold is that for the value 0. This extremely high frequency of 0s is characteristic of variables like insurance loss data—where claims are relatively rare—or daily rainfall amounts, which are 0 for most days except in extremely wet regions.
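The first-pass arithmetic can be checked from the figures reported above. In this sketch the mean frequency is recomputed as records divided by distinct values, while the standard deviation is simply taken as reported, since the full count sequence is not reproduced here. The variable names are illustrative.

```python
n_records = 67_856    # records in the dataset
n_distinct = 3_257    # distinct values of claimcst0
sd_freq = 1108.02     # standard deviation of the frequencies, as reported

mean_freq = n_records / n_distinct
upper = mean_freq + 3 * sd_freq

# The five most frequent values and their counts, as listed in the text.
top_five = [(0, 63_232), (200, 695), (353.76999998, 219),
            (389.94999981, 94), (390, 35)]
flagged = [value for value, count in top_five if count > upper]

print(f"mean frequency: {mean_freq:.2f}")
print(f"upper threshold: {upper:.2f}")
print("flagged values:", flagged)
```

The threshold computed this way differs from the quoted 3344.90 in the last decimal place only because the standard deviation is entered in rounded form.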

The fact that 0 is the only value that is identified by the proposed inlier detection procedure illustrates the weakness of the three-\(\sigma\) edit rule. An extension of this approach that sometimes greatly improves its performance is inward detection. For this extension, we detect outlying counts as before, then remove these records and reapply the outlier detection procedure. By adopting this approach, we exclude the very large 0 count and recompute the mean frequency as 1.42 and the standard deviation as 12.96, thus giving an upper outlier detection limit of 40.29. The second pass of this inward detection procedure therefore identifies the second through fourth most frequent values as candidate inliers (200, 353.76999998, and 389.94999981). Of these results, the first value (200) is the most interesting because it is such a round number. Upon further investigation, it turns out to be the smallest nonzero loss value.
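The two-pass procedure might be sketched as follows. This is an illustration under stated assumptions rather than the book's implementation: the function names, the use of the sample standard deviation, the fixed number of passes, and the toy frequency table (loosely shaped like the insurance example, but not the real data) are all choices made here.

```python
import statistics

def three_sigma_upper(counts):
    """Upper three-sigma limit: mean plus three sample standard deviations."""
    return statistics.mean(counts) + 3 * statistics.stdev(counts)

def inward_detection(freqs, passes=2):
    """Flag values with outlying counts, remove them, and repeat.

    freqs maps each distinct value to its count. Returns one list of
    flagged values per pass, each ordered by decreasing count.
    """
    remaining = dict(freqs)
    flagged_per_pass = []
    for _ in range(passes):
        if len(remaining) < 2:  # stdev needs at least two counts
            break
        limit = three_sigma_upper(list(remaining.values()))
        flagged = sorted(
            (v for v, c in remaining.items() if c > limit),
            key=lambda v: -remaining[v],
        )
        flagged_per_pass.append(flagged)
        for v in flagged:
            del remaining[v]
    return flagged_per_pass

# Hypothetical frequency table: one dominant value, two moderately
# repeated values, a few pairs, and many values that occur once.
toy = {0.0: 5000, 200.0: 60, 353.77: 20}
toy.update({1000.0 + i: 2 for i in range(5)})
toy.update({2000.5 + i: 1 for i in range(300)})

print(inward_detection(toy))
```

In this toy table the first pass flags only the dominant value and the second pass flags the two moderately repeated ones. The number of passes is left as a parameter because each further pass lowers the threshold again, so deciding when to stop is a judgment call.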

A more detailed discussion of this example is available in section 6.1.2 of Mining Imperfect Data: “Inward Detection of Inliers.”


References
[1] Adriaans, P., & Zantinge, D. (1996). Data mining. Reading, MA: Addison-Wesley.
[2] de Jong, P., & Heller, G.Z. (2008). Generalized linear models for insurance data. Cambridge, U.K.: Cambridge University Press.
[3] DesJardins, D. (2001). Paper 169-26: Outliers, inliers, and just plain liars—new graphical EDA+ (EDA Plus) techniques for understanding data. In Proceedings of the SAS User’s Group International Conference (SUGI 26). Cary, NC: SAS Institute.
[4] Pearson, R.K. (2006). The problem of disguised missing data. SIGKDD Explor., 8(1), 83-92.

Ronald K. Pearson is a senior data scientist at GeoVera Holdings, a U.S.-based property insurance company. He has held positions in both academia and industry and has been actively involved in research and applications in several data-related fields, including industrial process control and monitoring, signal processing, bioinformatics, drug safety data analysis, property-casualty insurance, and software development.