By James Case
Cathy O’Neil discovered prime numbers on her own, at an early age. At 14 years old, she mastered the Rubik’s Cube while attending summer math camp. Following undergraduate work at the University of California, Berkeley, O’Neil earned a Ph.D. from Harvard University in 1999, with a thesis in algebraic geometry. Next came a postdoctoral position at the Massachusetts Institute of Technology, followed by a tenure-track professorship at Barnard College/Columbia University. In 2007, seeking excitement, she elected to try her hand at finance with hedge fund D.E. Shaw.
After four of the most tumultuous years in Wall Street history, O’Neil became convinced that the computer programs used to scour the global economy for promising investment opportunities were partly to blame for the housing crisis, the collapse of major financial institutions, the rise of unemployment, and other societal plights. Moreover, she began to suspect that data-driven finance was but a small part of an emerging “big data economy,” with limitless potential for good or ill.
A computer program can speed through thousands of résumés or loan applications in a second or two. Not only do the machines save time and money, they also treat everybody the same way, which makes them appear fair and objective in court.
Regrettably, programs written by humans nearly always encode at least a few of the biases, prejudices, and misconceptions harbored—consciously or unconsciously—by their creators. Moreover, the verdicts they render are all but impossible to appeal, since nobody really knows what makes the programs work the way they do. O’Neil maintains that, whether by accident or design, too many of these electronic decision-makers punish the poor and oppressed while further rewarding the already-rich. For ease of reference, she took to describing the more dangerous decision-making programs as “weapons of math destruction,” or WMDs for short.
O’Neil’s opening example involves a procedure meant to improve the school system in Washington, D.C. In 2007, the new mayor set out to reform the district’s underperforming schools. Only eight percent of the system’s eighth graders were performing at grade level in math, and barely half of those entering high school were soldiering on to graduation. Choosing to blame the teachers, the mayor decided that the solution was to identify and remove those who were incompetent. To that end, he created a powerful new post, chancellor of D.C. schools, and hired Michelle Rhee, a young but highly regarded reformer, to fill it.
Rhee engaged a Princeton, N.J.-based consulting firm called Mathematica Policy Research to construct a “value added” program known as IMPACT to measure each student’s year-to-year learning progress. Teachers were then rated according to their students’ progress. At the end of the 2009-2010 school year, all teachers with IMPACT scores in the bottom two percent were fired. A year later, another five percent (206 teachers in total) were terminated. Everything seemed to be going according to plan, collateral damage included.
Sarah Wysocki, a fifth-grade teacher with two years of experience, had received nothing but positive feedback from her superiors and students’ parents. One evaluation praised her attentiveness to the children, while another called her “one of the best teachers I’ve ever come in contact with.” Yet her IMPACT score was dismal, obliging the district to fire her. How, she wondered, could this have happened?
Upon inquiry, Wysocki learned that her students’ test papers from the previous year, which weighed heavily in their prior IMPACT scores, contained an unusual number of erasures. Had prior teachers changed answers to improve scores? Had they protected their own jobs, while costing Wysocki hers? Fortunately, Wysocki quickly landed another job in Virginia, where teachers are evaluated differently. She arrived with glowing letters of recommendation, while D.C. retained a possibly dishonest (and/or incompetent) teacher.
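To see how inflated prior-year results could drag down a rating of this kind, consider a minimal sketch. It assumes, purely for illustration, that a teacher's score is the average year-over-year gain of her students; the review does not describe IMPACT's actual formula, and the function name `value_added` and all scores below are hypothetical.

```python
from statistics import mean


def value_added(prior_scores, current_scores):
    """Average year-over-year gain across a teacher's students.

    This mean-gain rule is an assumption made for illustration,
    not the formula used by IMPACT.
    """
    if len(prior_scores) != len(current_scores):
        raise ValueError("each student needs a prior and a current score")
    return mean(c - p for p, c in zip(prior_scores, current_scores))


# Hypothetical test scores for five students (invented data).
true_prior = [50, 55, 60, 45, 52]
current = [58, 62, 66, 54, 60]

# Suppose the prior-year papers were altered, adding 10 points to each score.
inflated_prior = [p + 10 for p in true_prior]

honest_gain = value_added(true_prior, current)        # gain against the true baseline
reported_gain = value_added(inflated_prior, current)  # gain against the inflated baseline

print(f"gain against true baseline:     {honest_gain:+.1f}")
print(f"gain against inflated baseline: {reported_gain:+.1f}")
```

Under this assumption, the same classroom performance yields a positive gain against the true baseline and a negative one against the inflated baseline, so the current teacher absorbs the entire cost of the earlier tampering.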
Wysocki’s firing was an obvious miscarriage of justice, quickly recognized and soon corrected. But what of the other mishaps, the ones that went undetected? Wysocki could hardly have been the only qualified teacher to be fired. What, one wonders, do value-added assessments actually measure? Do they measure a teacher’s ability to teach? Do they measure his or her impact on students? Or do they measure nothing at all? The Tim Clifford case suggests that the last possibility is entirely too real.
Clifford was a middle school English teacher in New York City with 26 years of experience. When the city adopted a rating system similar to the one that cost Wysocki her job, he was shocked to learn that his initial rating was an appalling six out of 100. Clifford worried that with a few more such scores, even his tenured position might be in jeopardy. It also concerned him that poor scores for tenured teachers call into question the validity of the tenure system, already under fire from would-be reformers. So imagine his relief when, a year later and with no discernible change in his teaching methods, his rating rose to an enviable 96! How can one trust such a volatile performance index?
“In fact, misinterpreted statistics run through the history of teacher evaluation,” O’Neil writes. She offers an imposing list of difficulties to overcome, and a litany of mistakes commonly made during the process.
Electronic decision-makers are, of course, by no means restricted to the education system. They are often used to decide which internet shoppers should see certain ads. Advertisers test different versions on (disjoint) samples drawn from a target demographic to learn which generate the greatest response. The winning version is shown to the entire audience of presumed “susceptibles” only after a number of these small-scale trials. Every purchaser of western garb, get-rich-quick schemes, weight-loss programs, exercise equipment, or exotic vacations is sure to be rewarded with “opportunities” to buy more such products.
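The ad-testing procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the version names and counts are invented, the helpers `response_rate` and `pick_winner` are hypothetical, and the sketch simply picks the highest observed response rate, whereas a real campaign would presumably also check whether the differences are statistically meaningful.

```python
def response_rate(responses, impressions):
    """Fraction of viewers in one sample who responded to an ad version."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return responses / impressions


def pick_winner(trials):
    """Return the name of the ad version with the highest response rate.

    `trials` maps a version name to a (responses, impressions) pair,
    each measured on its own disjoint sample of the target demographic.
    """
    return max(trials, key=lambda name: response_rate(*trials[name]))


# Hypothetical small-scale trials (invented counts).
trials = {
    "version_a": (12, 1000),
    "version_b": (30, 1000),
    "version_c": (21, 1000),
}

winner = pick_winner(trials)
print(f"show {winner} to the full audience")
```

Because every trial is scored and compared, a system like this receives constant feedback on whether its choices actually work.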
Men and women in the armed services nearing completion of their tours of duty are routinely swamped with offers from for-profit universities, mainly because government loans are more easily obtained on their behalf. This, says O’Neil, is a particularly grievous use of electronic rating techniques because it encourages primarily poor, often poorly educated, and easily misled individuals to assume unrepayable quantities of debt in return for all-but-worthless credentials.
Many of O’Neil’s complaints involve fairness and accuracy. Judges, she notes, often hand down more severe sentences to convicts deemed likely to become repeat offenders. They do so even though the likelihood in question is typically assessed electronically, and may be due to the offender’s broken family or residence in a high-crime neighborhood. Although such information would be inadmissible in court due to its propensity to result in a verdict of guilt by association, it still counts against the defendant at sentencing time.
Similarly, loans are frequently denied to applicants considered liable to default, despite the fact that electronic credit rating algorithms often predicate their evaluations on applicants’ residence in neighborhoods where jobs are likely to be temporary and/or loan defaults are unusually common. This also invites an unfair finding of guilt by association.
Finally, O’Neil laments the lack of concern regarding the accuracy of electronic rating schemes, despite the fact that their main purpose is to fairly narrow down the field of job seekers, loan applicants, or potential parolees at low cost. The fact that more labor-intensive evaluation schemes might result in slightly better workers, slightly fewer loan defaults, or slightly more law-abiding parolees counts for little beside the indisputable cost savings obtained with electronic screening.
O’Neil points out that this cavalier attitude toward accuracy is in marked contrast to the player evaluation schemes employed in professional sports, where even slightly better players can mean the difference between (profitable) winning and (unprofitable) losing records. Those electronic rating systems, like those that determine which ads will pop up on your computer screen, are continually monitored and improved. Too often, the ones applied to teachers, job seekers, loan applicants, and convicted criminals are not. Funds are readily available for the purchase of packages like IMPACT, but funds for testing their accuracy are regrettably scarce.
O’Neil’s important and fact-filled book will win no awards for suspense. One chapter tends to resemble another, since many of the same pitfalls await the use of electronic rating schemes in different fields of application. But by exposing the shortcomings of existing methods, Weapons of Math Destruction gives reason to hope that demonstrably better systems may yet be developed.
James Case writes from Baltimore, Maryland.