
Hello World

How To Handle the Growing Flood of Leaked Data

An interview with Micah Lee, author of a new book on analyzing datasets that were leaked, hacked, or just accidentally left in the open

Photo illustration by Gabriel Hongsdusit; photo by © Marco Bottigelli

Hello again, readers. I hope your 2024 is off to a great start. I recently had the pleasure of meeting Micah Lee, an investigative data journalist, digital privacy expert, and now a newly minted author.

Micah Lee. Credit: Micah Lee

Micah’s day job is Director of Information Security at the investigative news site The Intercept. He’s kind of a big deal in the leaked data world, having worked with the Snowden leaks and, more recently, the “BlueLeaks” hacked law enforcement dataset, among many others. Micah just published “Hacks, Leaks and Revelations: The Art of Analyzing Hacked and Leaked Data,” in which he offers a masterclass in the tools and techniques used to safely investigate these revelatory troves. I interviewed Micah about his book; the following discussion is lightly edited for brevity and clarity.


Jon Keegan: Who would you say this book is for? It feels like there are four or five mini books gracefully interwoven together in this one. You wrote lessons on working with command-line tools, writing software in the Python programming language, and writing database queries in SQL. You also told fascinating stories about how some leaks happened and were reported on, and you describe in detail how you conducted your investigations.

Micah Lee: I wrote it for journalists, but also for researchers or activists or really anyone who wants to learn these skills. I wanted to make it so that if you have the motivation to learn how to analyze datasets, you can pick up the book, follow along, and start learning, and you could publish on your blog or post about what you find on social media or whatever. But also, I think that a core audience is journalists, who are going to be digging into datasets and publishing articles.

Keegan: We’ve seen an increasing number of these large leaked datasets coming out. Can you say a little bit more about why this is?

Lee: It’s insane. There are new datasets pretty much every single day, and they keep getting bigger and bigger. Sometimes they’re hundreds of megabytes, sometimes hundreds of gigabytes, sometimes terabytes.

The reason is that, in the 21st century, everything is digitized, everything is on computers, and everything is getting hacked. Also, it’s really hard to manage data.

Everything is digitized, everything is on computers, and everything is getting hacked.

Micah Lee

People oftentimes make mistakes, like leaving an Amazon S3 bucket open so that somebody finds it and downloads everything in it, and that sort of thing just happens constantly. An example that I’ve been talking about recently is the American College of Pediatricians. It’s this group that the Southern Poverty Law Center calls a fringe anti-LGBTQ hate group. They left a Google Drive folder with 20 gigabytes of data just open to anyone with the link. That’s an example of one of these datasets. It wasn’t really a hack. Somebody somehow came across the link, thought, “wow, look at all this data,” and downloaded it all.

Now there’s been some journalism based off of it. I think that as the world gets more dependent on computers and the internet, this is just going to increase even more.

Keegan: I know some newsrooms have restrictions in place that prevent the use of leaked data in reporting. 

How should news organizations and their legal departments adapt to this world of increasingly large and revelatory leaks?

Lee: I think the newsrooms should definitely be looking into these leaks. I think that the main thing for journalists to be considering from an ethical point of view is protecting people’s privacy when you’re investigating these leaks, making sure that whatever you end up publishing from them is in the public interest and making sure that you are only publishing details that are relevant to your story and aren’t harming people’s privacy. 

But really, you are maybe reporting on a specific email. You just have to make sure that, when you publish that email, you don’t publish all the private information of people who are completely unrelated.

For example, it’s very common for leaks to include a ton of personal information about random people that aren’t actually the topic of your investigation. Maybe you have an email dump, you have somebody’s entire inbox or a company’s entire set of inboxes and there’s a ton of people’s names and email addresses and maybe phone numbers and their email signatures and all sorts of stuff. 

Make sure that whatever you publish is something that’s newsworthy and that’s in the public interest and that’s exposing some sort of corruption or wrongdoing or crimes or things like that.
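As a rough illustration of that kind of scrubbing, here is a minimal Python sketch that masks email addresses and US-style phone numbers before publication. The regex patterns and placeholder strings are my own assumptions, not from the book, and an automated pass like this supplements human review rather than replacing it.

```python
import re

# Illustrative patterns only; real redaction needs a human in the loop.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("[email redacted]", text)
    return PHONE.sub("[phone redacted]", text)
```

Run over an excerpt before it goes in a story, this turns, say, a source’s contact line into `[email redacted]` and `[phone redacted]` while leaving the newsworthy text intact.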

Keegan: I just read a great and disturbing story in Wired by Dhruv Mehrotra about a California police department that used DNA evidence to generate a speculative image of a suspect, and then they asked another agency to run it through facial recognition tools. It’s a crazy story. 

The revelation came from the BlueLeaks dataset that you write about extensively in the book. Even though the dataset is a few years old, new stories continue to come out. Do you think there are enough journalists actually digging through these files? Are there many more stories like that buried away?

Lee: I think there are not nearly enough journalists digging through files like this. So BlueLeaks is a collection of 270 gigabytes of police documents from hundreds of law enforcement websites. It was hacked and sent to Distributed Denial of Secrets in 2020, in the middle of the Black Lives Matter uprising that summer, and I reported on it a bunch myself.

But I basically just focused on a small sliver of BlueLeaks: just a few weeks’ worth of police documents from the summer of 2020, because I was looking at how police were responding to the Black Lives Matter protests.

This is what most of the journalists who looked at it were looking at. But BlueLeaks itself is a massive amount of data, and I still think that journalists haven’t even looked through it all, which is why, years later, there’s a sprinkling of stories like this insane DNA facial recognition AI one. I’ve also noticed a few other stories from BlueLeaks in the last few months.

In November, Wired published an article about a secret White House surveillance program that gave federal, state, and local law enforcement access to trillions of phone call records of Americans who weren’t suspected of a crime. This came from reviewing BlueLeaks years after it was out.

Then in December, two months after October 7th and after Israel’s assault on Gaza, Guardian journalists reported that “US law enforcement agencies for decades received analysis of incidents in the Israel-Palestine conflict directly from the Israeli Defense Forces and Israeli thinktanks, training on domestic ‘Muslim extremists’ from pro-Israel nonprofits and surveilled social media accounts of pro-Palestine activists in the US.”

That’s another BlueLeaks revelation from three years after it came out. So I’m pretty confident that there will be more.

When there is a big dataset like this, a handful of people, like a very small minority of journalists, actually look at it all. 

Micah Lee

I think that one of the big problems is that very few journalists actually have the technical skills to look through datasets like this. So what ends up happening is that when there is a big dataset like this, a handful of people, a very small minority of journalists, actually look at it, and when they do, they do a story or two and move on, leaving most of it unexamined.

I know that, for me, I hear about datasets all the time, but I just have too many projects, so I ignore them. I think that’s probably the case for you and for other data journalists. If there were, like, 1,000 times more of us, maybe we could make a bigger dent.

Keegan: I’ll do my part! Send them my way. 

What are some of the risks of using leaked data as a source for reporting?

In your book, you describe handling some of the most sensitive leaked data imaginable, such as internal chat messages from the neo-Nazis who incited violence in Charlottesville, the NSA revelations by Reality Winner, and the Edward Snowden leaks. You even mention handling documents detailing another nation’s nuclear secrets, which you chose not to publish.

You describe tiered levels of caution readers should use when handling different types of leaked data. 

Lee: I like to think of datasets in regards to how sensitive they are. On one [end of the] spectrum, there’s the Snowden documents, which are incredibly sensitive and in this case, they’re classified U.S. intelligence community documents. 

A big risk when dealing with this sort of data is protecting your source, especially if it’s from a very authoritarian government where your source could be facing assassination if they get caught, or serious prison time. Protecting your source is a very important thing that you have to consider but also protecting the source documents. 

In the case of the Snowden documents, we definitely wanted to keep the secrets secret except for what was in the public interest. We didn’t want to unnecessarily expose things that really shouldn’t be exposed. 

We wanted to protect the privacy of surveillance targets and everything else like that. So we had to worry not just about the NSA but also about the FSB, right? Russian intelligence, Chinese intelligence, Iranian intelligence, all of which I’m sure would be very interested in these documents. So we had to take extraordinary measures, like doing everything on air-gapped computers. At the time, when we were transferring files from one computer to another, we would burn them to CDs and then shred the CDs, because we didn’t want to use USB devices. That’s definitely a risk for the very, very highly sensitive type of dataset.

An example of what I would call a “medium sensitive” dataset: I have a case study in the book about America’s Frontline Doctors’ patient data. A hacker had hacked some telehealth companies that worked with America’s Frontline Doctors. [The group has been criticized for its claims related to COVID-19.]

I had hundreds of thousands of patient records, obviously very sensitive data. It’s people’s names and birth dates and email addresses and physical addresses and everything like that. I took various precautions. I didn’t store it in cloud services ever. Even though I had tons of spreadsheets with lots of data, I didn’t use Google Sheets. I did everything locally on my computer. I made sure to store all of the original source documents in a special encrypted container on my computer so that when I wasn’t working on that project, they weren’t accessible.

This is also what I was talking about [in terms of] protecting privacy. The main point of my story is what America’s Frontline Doctors was doing and what these telehealth companies were doing. It wasn’t the individual patients. We didn’t publish any information about individual patients. We didn’t publish anyone’s name or anything like that. 

Keegan: Sometimes the documents themselves could pose a risk?

Lee: Yeah. Whenever you get documents and you don’t really know or trust their source, just opening them on your computer could get you hacked. If you open a malicious Word document, Word could automatically run macros, or the document could contain an exploit for Word that hacks your computer and starts stealing your data.

There are some tips for how to deal with this, including running your documents through an open source tool called Dangerzone (which I developed, but which Freedom of the Press Foundation has now taken over). It’s sort of a digital version of printing out a document and then rescanning it. If the document has malware in it, the safe version that Dangerzone gives you definitely doesn’t.

You could also open documents in virtual machines, so that if they do hack your computer, they’ll hack the virtual machine and your main computer should stay safe. It’s also a good idea to use tools like Aleph, one of the tools I talk about in the book for indexing big datasets, because Aleph does some of its own parsing of documents. Then you can view them in your web browser in Aleph, without having to open them in Microsoft Word.

Keegan: There’s something kind of exciting about reading your book, particularly coming across passages like, “In this next exercise, you will be downloading four gigabytes of Oath Keepers’ emails and searching through them.” I was uncomfortable looking through the email inbox of, say, the head of the Heritage Foundation, or the emails from the Oath Keepers organization’s different chapters around the country. Say a little bit about the mindset that you take when you’re doing this kind of exploration.

It’s very surreal the first time you’re sitting there reading someone else’s email.

Micah Lee

Lee: It’s definitely very surreal the very first time you’re sitting there reading someone else’s email, and that’s why the chapter is actually called “Reading other people’s email.” Whenever I’m doing this sort of investigation, I always have a goal in mind for the type of thing that I’m looking for. It also makes me keenly aware of my own email and data, and of other people potentially digging through them. It makes me think twice when I start drafting an email about how it could end up in an email dump someday.

Generally, if I’m looking into a specific person, I look through their recent emails, or I search their emails for specific keywords that I’m looking for or things like that. I try to narrow the investigation somewhat, but it is kind of hard, especially when you really don’t know what’s in there.
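A keyword search like the one Lee describes can be sketched with Python’s standard-library mailbox module. The file path, keyword list, and function name below are hypothetical, not from the book, and a real dump may arrive in other formats (PST, individual EML files) that need different tooling; multipart messages are also skipped here for brevity.

```python
import mailbox

def search_mbox(path, keywords):
    """Yield (subject, sender) for messages that mention any keyword
    in their subject line or plain (non-multipart) body, case-insensitively."""
    needles = [k.lower() for k in keywords]
    for msg in mailbox.mbox(path):
        subject = str(msg.get("Subject", ""))
        # Keep it simple: only inspect non-multipart bodies in this sketch.
        body = msg.get_payload() if not msg.is_multipart() else ""
        haystack = (subject + " " + str(body)).lower()
        if any(k in haystack for k in needles):
            yield subject, str(msg.get("From", ""))
```

Calling `search_mbox("dump.mbox", ["surveillance", "contract"])` narrows a large inbox to the messages worth reading closely, which is exactly the kind of triage Lee describes before sitting down to read individual emails.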

Sometimes you have to sit there reading through each one in order to discover what you didn’t know was there. It’s definitely surreal, but that’s part of living on the internet today. That’s part of the 21st century that we all have to deal with.

Keegan: Is there anything else you want the readers of The Markup to know about this book?

Lee: I want to make sure that everyone has access to the information in this book, and that if you can’t afford it, that isn’t a barrier, especially for people all around the world with different incomes. So I’ve also released the book under a Creative Commons license. From the website hacksandleaks.com, you can read the whole book online for free. If you can afford it, then buy a copy; it’s definitely much nicer to read the physical book. But all the information is right there on the website.


Thanks for reading.

Sincerely,

Jon Keegan
Investigative Data Journalist
The Markup

We don't only investigate technology. We instigate change.

Your donations power our award-winning reporting and our tools. Together we can do more. Give now.
