My name is Jon Keegan, and I’m an investigative data reporter here at The Markup. Recently I’ve been writing about how companies have been monetizing our data in many aspects of our daily lives—at the supermarket, as we drive our cars, and when we check in to see where our kids are.
Over the years my reporting has often taken me deep into various government agency websites, and I have always been amazed to see the incredible variety of public data that our government collects. I have a background in art and design, so I’ve made a habit of keeping a spreadsheet of the more unusual data that is visually interesting. Last fall I started publishing a newsletter called Beautiful Public Data that lets me share these quirky collections with a wider audience, alongside reporting on why and how they came to exist. For Beautiful Public Data, I’ve written about the official federal style guide for America’s highways, a crazy chart showing how our radio frequencies are sliced up, and the National Library of Medicine’s database of 8,693 pills. As a reader of The Markup, you’d probably like it!
Recently, this personal project and my day job collided when I sat down to write about one dataset I had added to my spreadsheet a few years ago.
Special Database 18
Buried deep in the National Institute of Standards and Technology’s (NIST) data catalog, among biometric measurement datasets such as fingerprints, iris scans, and handwriting samples, you will find a most unusual collection of federal data. Special Database 18 is labeled as “NIST Mugshot Identification Database.” This dataset contains black-and-white mugshot photographs of 1,573 people. The 3,248 undated photos appear to be many decades old and to come from a variety of locations. The photos feature front and profile views of people and are accompanied by a metadata file containing the subject’s gender and age at the time of the photo. Some of the subjects appear in multiple photos taken at different times. NIST’s website says the database is being distributed “for use in development and testing of automated mugshot identification systems.”
Though names and any other obviously identifying information have been redacted in these images, I was still shocked to see this available. The dataset must be requested from NIST, which will provide a copy if you are a “qualified researcher” and it is “to be used for biometrics related research, development and education.” In the dataset documentation, NIST does not offer any context other than the gender and age metadata.
In the dataset I reviewed, which I requested and downloaded from NIST’s website a few years ago, almost all of the subjects are labeled as male, with only 78 labeled as female. Perhaps the most notable number I found while looking through this dataset was that 175 of the subjects were minors at the time their mugshot was taken, with 10 of them under the age of 15. There are even two 12-year-old children in the collection. I’m a father of a 12-year-old, and the thought of my kid getting caught up in some trouble and having his face permanently added to a government mugshot collection made my stomach turn.
I originally found this dataset because of Beautiful Public Data, but I have to admit that I paused at even considering this dataset for the newsletter, as it is obviously not a natural fit. But in my role as a curator of unusual visual data in government archives, this dataset is hard to ignore. One of the biggest lessons I have learned as a journalist in my time at The Markup is that when you find something that seems bad, sort by harm. (In this case, that meant loading the data into a Python notebook and quite literally sorting by age.)
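For readers curious what “sorting by age” can look like in practice, here is a minimal sketch. It is not the notebook used for this story. It assumes the metadata can be read as a simple table with one row per subject; the column names (subject_id, sex, age) and every row below are invented for illustration, and the real NIST metadata file may be laid out differently.

```python
import csv
import io

# Synthetic stand-in for a metadata file. These rows are invented for
# illustration and do not come from Special Database 18.
SAMPLE_METADATA = """subject_id,sex,age
A001,M,34
A002,F,17
A003,M,12
A004,M,52
A005,M,14
"""


def load_subjects(text):
    """Parse CSV text into a list of dicts, converting age to an integer."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"subject_id": row["subject_id"], "sex": row["sex"], "age": int(row["age"])}
        for row in reader
    ]


def sort_by_age(subjects):
    """Return the subjects ordered from youngest to oldest."""
    return sorted(subjects, key=lambda s: s["age"])


def count_under(subjects, age_limit):
    """Count the subjects whose recorded age is below age_limit."""
    return sum(1 for s in subjects if s["age"] < age_limit)


subjects = sort_by_age(load_subjects(SAMPLE_METADATA))
for s in subjects:
    print(s["subject_id"], s["sex"], s["age"])

print("under 18:", count_under(subjects, 18))
print("under 15:", count_under(subjects, 15))
```

Sorting youngest-first puts the rows most likely to signal harm at the top of the table, and the two counts mirror the kind of tallies reported above (minors, and those under 15).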
These photographs likely depict people at one of the worst moments in their lives, and it shows. Bandages, black eyes, and fresh wounds hint at grim, untold stories that led them to the moment of their photo. Among the people in the photographs are sailors and soldiers and bus drivers. One photo of a woman crying was particularly wrenching. Some of these photos are blurry, smudged snapshots while others are evocative portraits that are hard to pin down in time. Looking at this huge collection of pictures, it is hard not to empathize with these subjects, though it is sobering to remember that at least some of these people may have done some bad things.
I reached out to NIST to ask some questions about this dataset, including one about the presence of minors in it. Richard Press, a spokesperson for the institute, told me that NIST has “paused distribution of SD-18 [the mugshot database] and are reviewing the dataset to remove any that depict people who were minors at the time the photos were taken.” Press added, “We are also notifying anyone who has received the dataset for biometric related research that they must remove these images, as per the terms and conditions of the agreement they signed with NIST. Thank you for bringing this matter to our attention.”
In response to my questions about the origin of the photos, Press said that they were shared with NIST in the 1990s by the FBI to “support research in mugshot identification.” Press said that when the FBI provided the collection of images to NIST, it confirmed that each of the individuals in the photographs was deceased and that they each had “criminal records.” It was not clear looking at these photographs if any of these people in this dataset were convicted of any crime or were simply arrested at some point, and when I asked NIST to clarify what having a criminal record meant exactly, Press told me that “NIST defers to the FBI on the meaning of ‘criminal records.’ ”
The intended use of this data does not appear to be training facial recognition models (you can use NIST’s Special Database 32 for that) but rather for image recognition systems to classify a photograph as a mugshot, as opposed to one used for a passport or a government-issued ID.
One academic study that cited the use of Special Database 18 for analysis was titled “Criminal tendency detection from facial images and the gender bias effect” and described the study as an effort to “explore a new level of image understanding, inferring criminal tendency from facial images via deep learning.” The study has since been retracted by the publisher as the authors failed to seek approval from their ethics committee before working with sensitive human biometric data.
I have submitted a public records request to both NIST and the FBI for more information on the dataset’s origins.
Mugshots as Public Records
State laws vary widely regarding the classification of mugshots as public records. Thirty-one states generally consider them public to some degree, but some impose additional rules controlling their release. Other states do not appear to have rules specifically governing mugshots, at least according to a review of laws assembled by the Reporters Committee for Freedom of the Press.
Some states have limited the distribution of mugshots in recent years amid growing concern about privacy and the ability of such photos to spread online.
California enacted a law in 2021 that prohibits police departments or sheriff’s offices from sharing certain booking photos of people on social media and mandates the photos be removed if the individual is not charged, found not guilty, or had their conviction overturned.
In a controversial move, New York recently amended its freedom of information law to prevent the disclosure of mugshots in public records requests. But the state allows individual law enforcement agencies the discretion to release such photos if they have “a specific law enforcement purpose.” The amendment was criticized both by news organizations and civil liberties groups.
Louisiana and Florida are among a group of states that have recently enacted new restrictions related to mugshots that limit their use on exploitative websites that post the photos and make their money by forcing people to pay to have them removed.
While mugshots are considered public records in many cases, the use of these photos as a technical testing standard in a long-term government dataset raises significant ethical questions.
I spoke with Sarah Lageson, an associate professor at Rutgers’ School of Criminal Justice who studies the ways in which online crime data and criminal records have lingering effects on people. Lageson said that the lack of context surrounding the subjects’ arrests raises serious ethical concerns. “I think it’s this decontextualization that is harmful for any circumstance where mugshots are involved because it’s just so personal,” she said. “It’s your face in this context in which you don’t have any rights.”
Lageson added that a criminal case is a dynamic, evolving thing, and the mugshot reflects just one early moment in that process. “Mug shots are basically just a reflection of who a police officer has decided to arrest. It’s really not an indicator of criminality or proof that the person has been harmful or anything like that,” said Lageson. “Just because they were brought into the system for one minute, or for one instance, they’re forever now relegated to their biometric data to be used over and over.”
Albert Fox Cahn is the founder and executive director of the Surveillance Technology Oversight Project (S.T.O.P.), which advocates for ending discriminatory surveillance practices. “It’s hard to imagine how anything is ethical about this, if it’s being done with no informed, meaningful consent—[and] there is no consent when you’re arrested or when your booking photos [are] forcibly taken,” Cahn told me in an interview.
When asked if such ethical problems are mitigated by the fact that the subjects of these photographs are apparently deceased, Cahn replied, “There is something harmful about this, even if someone is deceased, harmful to their family, harmful to their legacy. I know that after I’m gone, I don’t want to have my image being used by the government to train new forms of AI.”
Cahn told me that the fact that such a collection has been out there on a government website for years reflects a wider societal problem: the unfettered use of surveillance technologies to ingest biometric data without consent, producing what he called “a mountain of stolen data.”
Press said that, to the best of NIST’s knowledge, no subjects or their families had been notified that they were included in this dataset, and no subjects or their families have ever contacted NIST requesting to be removed from the dataset.
On its website, NIST includes among its core values, “Integrity: We are ethical, honest, independent, and provide an objective perspective.”
Press added that all of the image datasets used by the agency are subject to “Human Subjects Protection” regulations. When asked if NIST has had any recent discussion about the ethics of these involuntary subjects being made part of a permanent dataset, Press said, “Yes, this is an active area of consideration within NIST and among data experts.”
A Look at NIST’s Collections
The National Institute of Standards and Technology is set apart from most federal agencies by the wide array of standards and measurements that it works on for many industries. Founded in 1901, NIST is part of the Department of Commerce and was created to help boost American commercial competitiveness by defining and maintaining measurements and standards that would help the U.S. catch up to more established European industrial leaders. NIST’s mission: “To promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.”
Diving into NIST’s website yields all sorts of interesting collections of data. NIST is the official keeper of time for the U.S. government. NIST maintains a large amount of physics data, including tables of atomic weights, properties, and isotopic compositions for all known elements. There’s a collection of fundamental physical constants in the known universe, where you can find the precise values of the speed of light in a vacuum (299,792,458 meters per second) or the Rydberg frequency (3.289 841 960 2508(64) × 10¹⁵ Hz).
NIST also has its own online store, where industrial labs can order “standard reference material” for a dizzying array of items, both mundane and exotic. For $1,107 you can buy 170 grams of peanut butter. Looking for a good source of industrial sludge? For $780, NIST can set you up with 70 grams, shipped to your door. You can also order “Lake Superior Fish Tissue,” “Urban dust,” “Organic Contaminants in Smokers’ Urine (Frozen),” and “Slurried Spinach.”
From a purely bureaucratic perspective, this mugshot database perhaps seemed to fit neatly beside these items as yet another “standard reference material” in NIST’s collections. But today, decades after the photos were taken, their presence reminds us that the recent leaps forward in machine learning are fueled by mountains of data, and that includes real human faces. The practice of involuntary personal data collection used to train algorithms is not limited to people in the criminal legal system; it is happening to all of us. Access to Special Database 18 may be on pause, but that broader problem remains.
March 20, 2023: This story originally described NIST as regulating standards and measurements. NIST does not regulate standards and measurements but rather develops voluntary standards. The story has been updated to correct the error.