In order to understand what’s wrong with governments relying too heavily on data and algorithms, let’s go back to the forests of 18th-century Prussia.
As James Scott wrote in his book Seeing Like a State, governments in the former German state primarily viewed forests as a way to extract revenue. So when they did surveys of forests, all they counted was the number of trees that could be cut down and turned into commercial wood products.
With the countable target of salable wood in hand, the state reconfigured forests in the region to optimize for that metric. Underbrush was cleared. Profitable monocultures of identical trees were planted simultaneously—often in straight, orderly lines that made them easy to tally.
It took about a century for calamity to hit. The first generation of bureaucratically managed trees did great. By the second generation, lumber production was collapsing. Turns out, maintaining a healthy forest relies heavily on things unmeasured and unoptimized. Decaying underbrush promotes healthy soil, for example. Uniform rows of trees from the same species all planted at the same time proved to be particularly vulnerable to natural disaster and disease. “A new term, Waldsterben (forest death), entered the German vocabulary to describe the worst cases,” Scott, a political scientist and anthropologist, writes.
Bureaucracies use measurements to optimize and rearrange the world around them. For those measurements to be effective, they have to be conducted in units as relevant as possible to the conditions on the ground. Locals who used those German forests daily would have listed many important elements beyond the volume of harvestable lumber. For managing a complex system effectively, those details are far more important than the sort of universal, high-level standardization that anyone in a bureaucracy can quickly peek at.
Unfortunately, this obsession with “legibility,” a concept Scott identifies as a “central problem in statecraft,” has accompanied governments as they’ve evolved in the digital era. Joe Flood’s book The Fires tells the story of a wave of destruction that devastated New York City’s South Bronx neighborhood during the 1970s. Forty-four different census tracts in the area, which had seen an influx of Black and Latino residents in the preceding decades, lost over half their housing stock to fires and abandonments over the course of the decade; seven tracts lost nearly everything. By 1977, over 2,000 square blocks were devastated and a quarter-million people had been dislocated from the neighborhood.
How the devastation was remembered tended to align with people’s political orientations. Liberals said it was greedy landlords burning down their buildings to collect insurance money. Conservatives said the ungrateful tenants of public housing programs were setting the fires.
In reality, it was forest death all over again. The fire department pinned less than 7 percent of the blazes on arson—and those occurred mostly in abandoned buildings without paying tenants.
The crucial factor was an algorithm. New York used a computer program designed by the RAND Corporation, a California-based think tank famous for military consulting, to find cost savings by identifying which fire stations could be closed without substantially decreasing the city’s firefighting capacity.
“Convinced that their statistical training trumped the hands-on experience of veteran fire officers, RAND analysts spent years building computer models they thought could accurately predict fire patterns and efficiently restructure the department based on those patterns,” Flood writes. “The models were deeply flawed—closing busy ghetto fire companies and opening new units in sleepy suburban-style neighborhoods—but they benefited from the sheen of omniscience coming from the whirring supercomputers and whiz-kid Ph.D’s behind them.”
While firehouses were previously evaluated on stats like the number of fatalities or the dollar figure of property damage, RAND’s models, which were then among the most complicated in the world, boiled the complexity of fighting fires down to a set of metrics focused on the time between when a fire company was notified about a fire and when a fire truck arrived on the scene.
But response time obscured as much as it revealed. There were factors it didn’t account for: firefighters figuring out which building a fire was in and how to enter it, rescuing victims stuck inside, and doing proactive patrols to make sure fire hydrants were functioning as they should.
Response time was a proxy for the thing RAND was really trying to measure: how effective a fire station was at quickly putting out nearby fires. The think tank could optimize for response time all it wanted, but there were other limiting factors throughout the process of firefighting that it wasn’t measuring. As a result, RAND’s algorithm missed crucial aspects of effective firefighting that occurred between getting to a fire and putting it out.
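To make the proxy problem concrete, here is a minimal sketch with invented numbers (nothing like RAND’s actual model): it scores two hypothetical fire companies on response time alone, while the outcome the city actually cared about includes everything that happens after the truck pulls up.

```python
# A toy illustration of the proxy-metric trap, with invented numbers --
# not RAND's actual model. Each company is scored only on response time,
# even though the outcome that matters is the total time until the fire is out.
companies = [
    # (name, response_min, size_up_and_entry_min, extinguish_min)
    ("Busy South Bronx company", 4.0, 7.0, 25.0),
    ("Quiet outlying company", 3.0, 1.0, 5.0),
]

def proxy_score(company):
    """What the model optimizes: minutes from alarm to arrival on scene."""
    return company[1]

def real_outcome(company):
    """What residents actually care about: minutes from alarm to fire out."""
    return company[1] + company[2] + company[3]

# Ranked by the proxy alone, the quiet company looks "better" -- but most of
# the time that decides whether a building survives was never measured at all.
for c in sorted(companies, key=proxy_score):
    print(f"{c[0]}: proxy = {proxy_score(c):.0f} min, total = {real_outcome(c):.0f} min")
```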
RAND’s system incorrectly insisted that closing 34 of the city’s busiest fire stations, many in the South Bronx, wouldn’t lead to significantly worse outcomes. When the algorithm suggested closing a station in an area with politically influential residents, the department’s decision-makers often skipped right over it to avoid a tough political fight, squeezing even more firefighting capacity out of areas already hit with closures resulting from the RAND system’s recommendation.
Unions representing the firefighters sued the city, insisting the cuts were racially motivated. Presented with reams of RAND-supplied data, the courts tossed the suit, maintaining that the decisions were based on cold, hard numbers rather than petty human prejudices.
High-tech computer systems can reflect the misjudgments of the people who designed them—but this idea gets obscured because the systems’ algorithms don’t have any agency. “This is one of the problems with the technocracy,” Flood said in an interview with The Markup. “It doesn’t want to be accountable. It wants to make calculations, run expert systems, and then steamroll. It doesn’t like to think about implementation problems.”
The waterfall effect
The debacle of RAND’s firefighting algorithm reflected the fallacy described in Seeing Like a State on two levels simultaneously. First, the government, seeking out an objective assessment, tried to simplify urban fires into a set of variables that anyone within that bureaucracy could understand and, if necessary, justify. Second, RAND’s algorithm was going through that exact same process—manipulating measurements, shorn from their real-world context, and pushing them through a digital bureaucracy made of ones and zeroes.
The smartphone you’re reading this article on while sipping your morning coffee (or maybe you’re sitting on the toilet, no judgment, just happy you’re reading) has far more computing power than the mainframes that ran RAND’s 1970s firefighting algorithm. But the problems RAND’s approach embodied—the desire for high-level institutional clarity overwhelming fine-grained, context-based local knowledge—have only gotten worse.
In the endnotes of Jennifer Pahlka’s new book Recoding America, the Code for America founder and former U.S. deputy chief technology officer lists Seeing Like a State as a book to read if you want to better understand how bureaucracy functions. (Full disclosure: Pahlka is a donor to The Markup.) Recoding America, a recounting of times she helped fix flailing government technology projects, is a treatise on how the institutional tendency to impose rigid legibility at the expense of more useful local knowledge also exists within organizations themselves. If you’ve ever cursed in frustration using a government website, Recoding America explains why it was built like that.
Pahlka frames the problem by contrasting two different methods for building software: waterfall development and agile development.
Waterfall development, which Pahlka asserts is the default across the government, starts with whoever is in charge of the effort declaring what the software should do and when it should be delivered. From there, everyone on the project works to get as close as possible to that initial design before the deadline.
Inside government, the top of the waterfall is policy—made by lawmakers and then interpreted by regulators. In Pahlka’s experience, when a policy involves the creation of a technology product, like a website or an app, there is a plethora of metrics to collect and benchmarks to hit as it’s being created. Each of those metrics is usually there to make sure that people stationed further down the waterfall—the ones actually doing the work—are doing what they should. In reality, they are often incidental process details (like response time to a fire) sitting off to the side of the real question (like whether the fire actually gets put out), which typically isn’t being tracked at all. For government tech projects, the key metric should be: Does the damn thing work for the people using it?
Pahlka recounts working on a project improving the Veterans Benefits Management System—a database of disability claim files, mostly a collection of scanned forms, that was painfully slow to operate. She was brought in to reduce the database’s latency, the time it took each page to load. But soon after she began working on the problem, she was told that her services weren’t needed; the issue had been solved. Agency officials had simply redefined “unacceptable latency” to mean any time a page took longer than two minutes to load.
The system wasn’t more usable in any way that mattered. The measurements were just being gamed to obscure the reality on the ground because the bureaucracy was fundamentally limited in how it was able to look at and identify the problem.
The example is ludicrous, but it’s worth highlighting that it only made sense to game the measurement because latency was being evaluated independently from the purpose of the system as a whole: to help veterans gain timely access to their benefits. Disconnected from that high-level orientation, individual process measurements can easily become unmoored.
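A back-of-the-envelope sketch shows how cheap that kind of fix is. The page-load times and the old threshold below are invented; the only detail taken from Pahlka’s account is the two-minute bar.

```python
# A toy illustration of threshold gaming. The page-load samples and the old
# ten-second bar are invented; the two-minute redefinition is the real detail.
page_load_seconds = [15, 30, 45, 70, 95, 118, 130]

def share_unacceptable(samples, threshold_seconds):
    """Fraction of page loads slower than whatever counts as 'unacceptable.'"""
    return sum(s > threshold_seconds for s in samples) / len(samples)

print(share_unacceptable(page_load_seconds, threshold_seconds=10))   # old bar: 1.0 (every load fails)
print(share_unacceptable(page_load_seconds, threshold_seconds=120))  # new two-minute bar: ~0.14
# The pages load exactly as slowly as they did before; only the metric improved.
```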
The irony is that when Pahlka asked the actual engineers who worked on the database for ideas on how to improve it, they had loads of great suggestions. They just felt like improving those systems wasn’t part of their jobs. They had critical local knowledge about how the database functioned but weren’t empowered to use it in any way that mattered because they were at the bottom of the waterfall.
Whereas Scott eventually took his ideas about bureaucracies two-thirds of the way to becoming an anarchist, Pahlka’s solution is different: she advocates for another method of building software, called “agile,” which was first detailed in a 2001 manifesto.
The idea behind agile development is that—instead of starting with a highly prescriptive blueprint for what you’re going to build and working tightly to those specifications—a team should start with a problem to solve. From there, it builds successive working versions of the program that gradually improve until release. The key is a continuous feedback loop with users: the software is tested with the people who will be using it, and their feedback is used to produce something they find useful—even if it conflicts with metrics the program’s initial designers had in mind.
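As a rough sketch of what that loop looks like in code: the “product” below is just a list of features and the “users” are a hard-coded wish list, so this is a schematic illustration of the structure rather than any real team’s process.

```python
# A deliberately simple sketch of the agile loop: the "product" is a list of
# features and the "users" are a hard-coded wish list. The point is the shape
# of the process -- ship something working, test with users, improve, repeat.
def agile_build(problem, user_needs, iterations=5):
    product = [f"minimal working version: {problem}"]
    for i in range(iterations):
        # "Testing with users": whatever they still need is this round's feedback.
        feedback = [need for need in user_needs if need not in product]
        if not feedback:
            break
        product.append(feedback[0])  # fold the most pressing request into the next release
        print(f"iteration {i + 1}: shipped {product[-1]!r}")
    return product

agile_build(
    "file a benefits claim online",
    user_needs=["works on a phone", "plain-language questions", "saves progress"],
)
```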
In an interview with The Markup, Pahlka said the message she frequently gives policymakers interested in improving their technology’s effectiveness is to bring in the perspectives of the folks designing and using the program: Craft policy in a way that gives the people with the most local knowledge the power to both learn as they build and incorporate their lessons into whatever they’re building.
“When I talk to [policymakers], they’re just like, here, write this bill,” she said. “And it’s like, if you understood it, then you would give them more flexibility, you’d be more in dialogue with them instead of just giving them these directives.”
This entire train of thought—from the forest to the fires to the waterfall—is something I encounter all the time reporting on algorithms at The Markup. I recently published a story evaluating the accuracy of a predictive policing program used in Plainfield, New Jersey, and found that its predictions lined up with where crimes were reported less than one percent of the time.
The algorithm took in data about where and when crimes had happened in the past, and that’s what it optimized for. But it didn’t take into account that making 80 predictions per day in a city with little crime wasn’t going to help officers prevent crime; it was just going to overwhelm them with useless data. The local police already knew where crimes generally clustered. After poking around with the software, the department decided it wasn’t helpful and, at least in interviews with The Markup, said it had stopped using it.
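For the curious, this is roughly the shape of that kind of accuracy check in code. The data and the matching rule are invented for illustration; this is not The Markup’s actual methodology or Plainfield’s data.

```python
# A simplified sketch of computing a prediction "hit rate" -- the data and the
# matching rule here are hypothetical, not The Markup's methodology.
from datetime import datetime

# Each prediction: (location, window_start, window_end). Each report: (location, time).
predictions = [
    ("100 block of Front St", datetime(2022, 3, 1, 18), datetime(2022, 3, 1, 22)),
    ("200 block of Park Ave", datetime(2022, 3, 1, 20), datetime(2022, 3, 2, 0)),
]
reported_crimes = [
    ("100 block of Front St", datetime(2022, 3, 1, 19, 30)),
]

def hit_rate(predictions, reports):
    """Share of predictions with a reported crime at the same place inside the window."""
    hits = sum(
        any(place == p and start <= when <= end for p, when in reports)
        for place, start, end in predictions
    )
    return hits / len(predictions)

print(f"{hit_rate(predictions, reported_crimes):.0%}")  # 50% on this toy data; under 1% in the real analysis
```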
The lesson here, ultimately, is that if you ever feel the urge to measure the forest, know that the endeavor is, in a sense, doomed to failure—especially if you narrow its complexity down to the mere counting of trees. But for institutions ranging from the enormity of the United States government down to a little computer program you’re hacking together, doing the measuring right can still be worth the effort. It just takes a bit of humility to put your ear as close as possible to the ground and listen.