Hello everyone, Ryan here.
Behind many of our stories here at The Markup are some surprisingly controversial little bots called scrapers.
A scraper is essentially just a piece of software set loose on the internet. Typically, it visits a web page, extracts some information, stores it, and moves on to another page where it will repeat the process.
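For readers curious what that loop looks like in practice, here is a minimal sketch in Python. It is an illustration only, not any scraper The Markup has used. To stay self-contained it walks a made-up, in-memory "website" (the `PAGES` dictionary) instead of making real network requests, and the names `PageParser` and `scrape` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a website: path -> HTML.
# A real scraper would fetch each page over HTTP instead.
PAGES = {
    "/page1": '<h1>First post</h1><a href="/page2">next</a>',
    "/page2": '<h1>Second post</h1><a href="/page3">next</a>',
    "/page3": "<h1>Third post</h1>",
}


class PageParser(HTMLParser):
    """Collects the text of <h1> tags and the targets of <a href> links."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def scrape(start, pages):
    """Visit a page, extract its headings, store them, then follow its links."""
    stored = {}
    queue = [start]
    while queue:
        path = queue.pop(0)
        if path in stored or path not in pages:
            continue  # skip pages already visited or not on the site
        parser = PageParser()
        parser.feed(pages[path])          # visit and extract
        stored[path] = parser.headings    # store
        queue.extend(parser.links)        # move on to the next page(s)
    return stored


results = scrape("/page1", PAGES)
```

After this runs, `results` maps each visited path to the headings found on that page. A scraper pointed at a live site would add the fetching step, plus practical details such as pausing between requests, which this sketch leaves out.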
At The Markup, we and our sources have used scrapers to show how posts made by Amazon Ring owners were forwarded to local police, possibly without the posters’ knowledge; how people living in poorer, less white neighborhoods paid the same for slow internet service as people in other neighborhoods paid for fast connections; and how Facebook collects information about visitors to dozens of popular websites that target kids.
Scraping has long been controversial precisely because it allows people to collect this sort of information from online organizations—information that the organizations don’t necessarily want to be aggregated and analyzed. Sometimes the pushback against scraping involves technical countermeasures, like a website owner blocking bots arriving from particular internet addresses. Sometimes the pushback has been more serious—a criminal charge or civil lawsuit.
We at The Markup have clearly stated that we believe scraping is vital to democracy. It has enabled much of our own journalism and a great deal of public-interest research and journalism published elsewhere. In 2020, we filed an amicus brief in a Supreme Court case saying as much and arguing that it was a First Amendment violation to criminally prosecute people for scraping simply because it violated a website’s terms of service. The case, Van Buren v. United States, made it safe to scrape in violation of a website’s own rules by clarifying a federal anti-hacking law, the Computer Fraud and Abuse Act (CFAA).
But a lawsuit filed by X Corp. in July over scraping of its social network, formerly known as Twitter, has raised new questions about how safe scraping really is. It seeks tens of millions of dollars in damages from a nonprofit that produced research into the prevalence of hate speech on X’s platform.
To better understand the current legal landscape around scraping, I turned to Esha Bhandari, the deputy director of the American Civil Liberties Union’s Speech, Privacy, and Technology Project. Bhandari has worked for years to establish the importance of scraping to civil liberties and public interest research and reporting. We first crossed paths when she prepared to litigate Sandvig v. Barr, a federal scraping-related lawsuit that preceded Van Buren. Testimony from Sandvig was cited in the Van Buren decision, alongside an amicus brief from Bhandari and her ACLU colleagues.
You can find our conversation below, edited for brevity and clarity.
Ryan Tate: What is the legal status of web scraping these days—particularly in the wake of the rulings in the last several years, and for someone doing it without permission, potentially in violation of terms of service?
Esha Bhandari: The answer is that it depends. It continues to be uncertain.
One important development in the last couple of years has been that the threat against scraping from the federal Computer Fraud and Abuse Act has been severely diminished, for a few reasons.
One, in Sandvig v. Barr, we sued the federal government, before any prosecutorial action, on behalf of a group of researchers and journalists, saying, “we need clarity from the court that when you scrape public information like this, and it’s in violation of terms of service, that that is not enough to constitute a computer crime under the CFAA.” And the district court in that case agreed with us and said essentially that you can’t make out a computer crimes violation under the CFAA simply because you violate terms of service—there has to be something more than that, something that constitutes behavior that looks more like breaking and entering rather than simply violating a written term.
Then, shortly after that decision, the Supreme Court decided Van Buren v. United States. In that case, it was addressing the Computer Fraud and Abuse Act and how to interpret it. That case came up in a very different context. It involved a police officer who accessed information to use for personal purposes, who was looking up database information to help a friend who was harassing a woman. So the facts of that case [were] very different from the context of public interest research and journalism. But the key legal question there was, is it a CFAA violation? Did the police officer commit a computer crimes violation by accessing information in a database that he had access to, but violating written policies that said you can’t use it for these non-work purposes? Obviously the behavior was not great and you can see why this is a context in which charges were brought.
But the Supreme Court said, we can’t read the Computer Fraud and Abuse Act so broadly as to make it a crime just to violate written use policies. So if you’ve been given access to certain information and then you misuse it or use it in a way that was not intended by the person who gave it to you, that’s not a computer crimes violation within the meaning of the statute because you didn’t break and enter, you didn’t bypass a technological barrier. What you did was misuse information, and that might be a separate violation under separate laws in certain contexts, but it’s not a CFAA violation.
That decision left some open questions but made it clear that you can’t really be charged for a CFAA violation criminally for just violating terms of service. And that includes all the kinds of terms of service violations that data journalists and researchers engage in, not just scraping.
Tate: What are those open questions—what is still up in the air after Van Buren?
Bhandari: Van Buren focused on written terms and whether violating written terms alone could be a computer crimes law violation of the CFAA. There are still open questions about whether certain research techniques or other techniques used by others online would fall within the CFAA. There’s very much a lack of clarity around things like using a VPN, masking your IP address, or whether you can use someone else’s credentials to access a website with their consent. Is that breaking and entering within the meaning of the law?
Tate: I’m also curious about whether civil suits come into this. As a journalist, I think about, and I imagine other researchers think about, the possibility of getting sued, and certainly a civil case might be better than a criminal case, but it still could potentially be a deterrent to engaging in certain kinds of research.
Bhandari: Yeah, totally, that’s exactly right. The threat of civil lawsuits remains. That’s why the landscape remains uncertain and there’s still more work to be done to clarify that digital journalism techniques, data journalism methods, have to be protected by the courts.
I’ll give one example.
The recent lawsuit that X Corp., formerly Twitter, filed against a nonprofit called the Center for Countering Digital Hate illustrates the ongoing threat to researchers—whether they’re nonprofit researchers, academics, journalists—who engage in public interest investigations of platforms and often speak critically about platforms. They will often find things that the platforms are not happy for them to publicize.
In this case, the Center for Countering Digital Hate published reports that talked about what it termed hate speech and misinformation that remained on the Twitter platform. In doing this research, they had to scrape public information on Twitter. They analyzed posts at scale and they argued that Twitter allowed content to remain up that violated Twitter’s own policies on content. X Corp. sued CCDH and their theory was that CCDH violated the terms of service and that that’s a breach of contract.
They’re seeking tens of millions of dollars in damages based on the reputational harm to them of these reports, which they say caused advertisers to flee. You can see why, as much as I think that the threat of criminal charges was a large deterrent, and it’s good to have the threat of criminal charges cleared away, it’s still a deterrent to potentially face tens of millions of dollars in liability for publishing reports that a platform doesn’t like.
Tate: And that is still ongoing, so we don’t have a sense of where that is going to end up yet.
Bhandari: Exactly. That case was filed a few months ago, and the court will presumably issue a decision sometime in the next few months to a year.
I think this is the next frontier of battles over scraping in the public interest and digital journalism techniques and whether platforms will be able to wield their terms of service as a weapon against researchers who are engaged in adversarial research. There is obviously a totally different set of considerations when they grant access to information and who they grant access to information to. But we’re talking about independent, autonomous, often adversarial investigations of platforms that might yield results that they don’t like, and whether platforms can then look at the outcome of that research and go back and sort of find technical violations of their terms of service and hold the researchers liable in breach of contract.
Tate: What is particularly novel or interesting about that case? Were civil cases like this uncommon before? Is it the scale of damages that makes it interesting?
Bhandari: Both of those actually.
To my knowledge, this is the first time that a platform has actually sued a researcher or a journalist who published something that was clearly public-interest-focused research. Now platforms have often, behind the scenes, quietly sent cease and desist letters to researchers and journalists. But they’ve kept those out of the public eye, I think in large part because platforms have recognized that it’s a bad look for them reputationally to be going after researchers and journalists—particularly researchers and journalists who uncover things about their policies that are of great public interest.
Obviously, the catalyst is that X Corp. has changed now, and they are willing to file suit—that in and of itself makes this notable.
And the second point is the tens of millions of dollars in damages, which are rooted in the reputational harm. What’s interesting here is it’s not that X is saying the scraping materially affected the operation of the platform, that it resulted in less functionality or they got access to information that they otherwise wouldn’t have, like scraping private information that otherwise wasn’t made available to the researchers.
No, it’s not damages tied to the scraping. It’s damages flowing from the subsequent speech of the researchers when they wrote a report saying Twitter allows this type of content on its platform and that made advertisers skittish.
We at the ACLU filed an amicus brief in support of CCDH in this case [on Nov. 24]. And our argument was that the court shouldn’t allow enforcement of a contract term like this. When the contract term is really being used to stifle speech in the public interest, it should be deemed void for public policy because, in this case, X is essentially trying to recover damages for reputational harm, and the reputational harm is based on speech in the public interest. But they didn’t bring a defamation claim, likely because they know a defamation claim has a high bar under the First Amendment. You’d have to show a whole host of things, including actual malice on the part of the speaker, that they were just disregarding that what they were saying was false.
Tate: I wanted to ask about a distinction you made—going after research done in the public interest, maybe by a nonprofit or a journalist. Have there been civil suits against a more commercial entity like a competitor or a startup?
Bhandari: Yes, that has happened. Perhaps the most famous lawsuit is hiQ versus LinkedIn, in which hiQ [Labs] was a startup commercial entity that was scraping data from LinkedIn and LinkedIn sued hiQ saying, “this is our information, this is our competitive advantage, hiQ getting this information allows them to offer competing services.”
Now, hiQ was, again, only scraping public information. In that case, the court held that it’s not a CFAA violation, even if there’s a civil provision of the CFAA (LinkedIn had sued under the civil provision). They said it’s not a CFAA violation because, again, there was no breaking and entering here. The scraping was just of public information.
But the court did end up holding hiQ liable for breach of contract for the terms of service, which is very interesting in that commercial context.
What’s different about the X Corp. lawsuit and what we’ve emphasized is that holding parties to breach of contract claims when it’s about speech in the public interest, important research and journalism methods to enable speech in the public interest, is void for public policy, because there is a public policy in California and in the United States generally to encourage this kind of speech.
Tate: Looking back over the last several years, do you feel like the landscape has become meaningfully safer legally for journalists and researchers to do work that involves scraping? Or is it one step forward, one step back?
Bhandari: I think there has been meaningful progress. The CFAA threat being cleared away, in large part, is really important progress. The fact that we’ve had a few court decisions now recognizing the importance of digital era journalism techniques like scraping is really important, including in the Sandvig case. We’ve also had, in the last five years, more evidence of the value of this type of research. And you’ve had government regulators weigh in and say, “this kind of research is important,” and even rely on that research to bring enforcement action.
For example, when the Department of Housing and Urban Development brought enforcement action against Facebook for its online ad targeting system, a lot of that enforcement work was informed by research done by organizations like ProPublica and many others who have continued to do studies of Facebook and its ad targeting system.
So I think there’s just a greater recognition of the value of the research and how important it is, including for government regulation, than there was five to seven years ago when we might’ve been in the position of arguing to a court, trying to explain why this matters.
The progress can often be slow in clearing away legal hurdles, but I’m really hopeful that we’ll get there.
One last note: If you’re reading this with an eye toward figuring out if something you’re doing is legal or not, know that Bhandari always advises people to get their own legal advice and assess their own risk factors, because the answer to whether a particular activity that involves scraping is a legal risk ultimately is, “it depends.”
Thank you, as always, for reading, and may your holiday wishlist be scraped only by friends and loved ones.
Yours,
Ryan Tate
Editor
The Markup