In October 2022, The Markup published an investigation into internet disparities in 38 major cities across the United States. Afterward, many readers, including fellow journalists, reached out to ask how they could replicate the investigation for other cities. We wanted to help make that possible, but we knew that the early challenge we had with sourcing, cleaning, and wrangling street addresses would pose an issue. So, we got to work on finding a way to help people get easy access to randomized address data—or as we like to say, receipts from streets.
Here's the tool
Today we’re introducing a beta version of The United States Place Sampler (USPS), a new tool by Big Local News and The Markup that samples random street addresses in the United States: https://usps.biglocalnews.org/.
The “P” in USPS refers to the U.S. Census Bureau’s definition of “place,” which is an encompassing term for geographic entities such as municipalities, cities, towns, neighborhoods, and named areas. This includes legally defined geographic areas called “incorporated places,” like New York City and Milwaukee, Wis., as well as “Census designated places,” which are unincorporated and have no legal boundaries, like Stanford, Calif., and Hershey, Pa.
USPS can collect either a percentage or a finite number of street addresses for geographic designations, such as Census block groups, cities, towns, and counties.
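To make the "percentage or finite number" choice concrete, here is a minimal sketch of simple random sampling in Python. It is not the code behind USPS; the function name `sample_addresses`, its parameters, and the made-up address list are all assumptions for illustration.

```python
import math
import random


def sample_addresses(addresses, count=None, percent=None, seed=None):
    """Return a simple random sample, sized by a count or by a percentage."""
    if (count is None) == (percent is None):
        raise ValueError("pass exactly one of count or percent")
    if percent is not None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Round up so a small percentage of a small area still returns a row.
        count = math.ceil(len(addresses) * percent / 100)
    # Never ask for more rows than exist.
    count = min(count, len(addresses))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(addresses, count)


# Hypothetical data: 200 made-up addresses on one street.
addresses = [f"{n} Main St" for n in range(1, 201)]
by_count = sample_addresses(addresses, count=10, seed=1)
by_percent = sample_addresses(addresses, percent=5, seed=1)
```

Sampling without replacement, as `random.Random.sample` does, means no address appears twice in a sample.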
Based on the work The Markup has done so far, we know the tool will be useful for analyzing disparities by pairing address-level outcomes with socioeconomic data from the U.S. Census Bureau and historical redlining maps from the University of Richmond’s Mapping Inequality project. With that in mind, each address record USPS provides contains the surrounding area’s Federal Information Processing Series (FIPS) codes down to the Census block group level.
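For readers new to FIPS codes, a Census block group is identified by 12 digits: 2 for the state, 3 for the county, 6 for the tract, and 1 for the block group. The sketch below splits such an identifier into its parts. The function and class names are hypothetical, and it assumes the code arrives as a single 12-digit string, which may not match how the columns are laid out in a USPS download.

```python
from typing import NamedTuple


class BlockGroup(NamedTuple):
    state: str        # 2 digits
    county: str       # 3 digits
    tract: str        # 6 digits
    block_group: str  # 1 digit


def split_block_group_geoid(geoid: str) -> BlockGroup:
    """Split a 12-digit block group identifier into its FIPS components."""
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError("expected a 12-digit block group identifier")
    return BlockGroup(geoid[:2], geoid[2:5], geoid[5:11], geoid[11])


# Illustrative identifier: state 36 is New York; the remaining digits are
# used here only as an example.
parts = split_block_group_geoid("360290001001")
```

Keeping the pieces as strings preserves leading zeros, which matter when joining against Census tables.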
Aside from allowing readers to reproduce our investigation in their own cities, we also hope that USPS will be useful to fellow journalists, researchers, and others who seek accountability through street-level data.
Journalists and researchers can use USPS to sample addresses and test for disparities in:
- Outages to utilities through service outage portals for electricity, gas, and cable
- Access to grocery stores, hospitals, trauma centers, polling places, and other essential locations across neighborhoods and cities
- Availability and rates of location-based services like ride sharing (Uber, Lyft, Revel), food delivery (GrubHub, DoorDash, Uber Eats), and e-commerce
Soon, we will be releasing a step-by-step guide showing how to replicate our investigation into internet disparities in other American towns and cities using USPS. (Sign up for our newsletter here to be among the first to know.)
Why Did We Build USPS?
As part of our investigation into internet service providers, we collected and analyzed more than one million internet plans offered across 38 major cities and found that lower-income, least-White, and historically redlined areas were disproportionately offered slow internet speeds for the same price as faster speeds elsewhere in the same city.
In the process, we discovered how difficult a seemingly simple task can be: building a representative sample of random addresses in a city.
The first issue was finding a data source. Publicly available street address databases were scarce, and those that we could access were incomplete. Even though “clicking around Google Maps” was the accessible, readily available option for most people trying to gather addresses in a given area at any sort of scale, statisticians advised us (and we agree) that it’s biased and shouldn’t be used.
We decided to use OpenAddresses, which is an open-source repository of addresses and associated geographic coordinates collected from public data from state and local governments. Although OpenAddresses does not include full coverage of the United States, it was the best source we could find and had adequate coverage in the major cities in our investigation.
Coverage aside, we found that many addresses were incomplete—lacking cities, zip codes, and other geographic markers we would need for our analysis.
To standardize and complete these addresses, we used the U.S. Census Bureau’s geocoder API, which lets us map the Census block group and designated place (city) to the geographic coordinates that were provided for every address in the dataset. To learn more about our methodology, see our companion article for our investigation here.
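As a rough illustration of that lookup, the sketch below builds a request URL for the Census geocoder's coordinates endpoint and reads a block group out of a response. It makes no network request. The helper names are hypothetical, the sample response is a trimmed, made-up stand-in, and the assumed response layout (a "geographies" object whose block layer name ends in "Census Blocks") should be checked against the geocoder's current documentation.

```python
from urllib.parse import urlencode

GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/geographies/coordinates"


def build_lookup_url(lon, lat, benchmark="Public_AR_Current", vintage="Current_Current"):
    """Build a coordinates-to-geography lookup URL (x is longitude, y is latitude)."""
    params = {
        "x": lon,
        "y": lat,
        "benchmark": benchmark,
        "vintage": vintage,
        "format": "json",
    }
    return f"{GEOCODER_URL}?{urlencode(params)}"


def block_group_geoid(response):
    """Return the 12-digit block group ID from a parsed response, or None."""
    geographies = response["result"]["geographies"]
    for layer, features in geographies.items():
        # Assumption: the block layer name ends with "Census Blocks".
        if layer.endswith("Census Blocks") and features:
            # A block ID has 15 digits; the first 12 identify the block group.
            return features[0]["GEOID"][:12]
    return None


url = build_lookup_url(-78.8784, 42.8864)

# Made-up, trimmed response used only to exercise the parser.
fake_response = {
    "result": {"geographies": {"2020 Census Blocks": [{"GEOID": "360290001001000"}]}}
}
geoid = block_group_geoid(fake_response)
```

Calling an endpoint like this once per address is what becomes impractical at the scale of a nationwide address file.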
Following publication, journalists in at least seven of the cities we investigated used our data to localize our findings for their neighbors. But journalists, researchers, city workers, educators, and readers who were not in one of those 38 cities also expressed an interest in reproducing our methodology to map internet speeds near them.
We began writing a guide on how to manually collect internet offer information from the front end of internet service providers’ websites so that anyone from high school classrooms to community groups to civil servants could follow the process.
However, we soon realized it would be unrealistic to expect others to source street addresses the same way we did. We also didn’t feel it would be responsible to offer an easy solution that could be biased, such as “clicking around Google Maps.”
So, we partnered with Big Local News to download, clean, and index every address we could find in the United States to simplify the task of sampling random street addresses.
How Do You Use USPS?
To build a sample of addresses on usps.biglocalnews.org:
- Enter the name of the location you would like to sample from
- Specify the number or percentage of addresses you would like to sample
- Press “Search”
- Now you should see a map with some dots; click the plus sign and then “download CSV”
These searches will require a bit of patience, since USPS queries more than 200 million addresses.
Want More Control Over Your Search?
If you need greater precision, you can specify a search by a Census block group, county, or city using custom query strings.
To search by a city or town name, use the `place` query syntax.
`place: buffalo city`
You can do the same for county, county-subdivision, tract, and Census block group.
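If you keep a list of searches in a script or spreadsheet, it can help to split each query string into its prefix and search text. The sketch below is an assumption about how a reader might do that on their own machine; it is not the parser the site uses, and `place` is the only prefix it is tested with because it is the only one spelled out here.

```python
def parse_query(query):
    """Split 'kind: value' into (kind, value); plain text has no kind."""
    kind, sep, value = query.partition(":")
    if not sep:
        # No colon: treat the whole string as a plain location name.
        return None, query.strip()
    return kind.strip().lower(), value.strip()


parsed = parse_query("place: buffalo city")
plain = parse_query("Buffalo")
```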
In the future, we will document the underlying API for others to programmatically pull samples of addresses. Please visit Big Local News’s GitHub page and usps.biglocalnews.org for updates.
How Did We Build USPS?
USPS combines addresses from the open-source project OpenAddresses, which compiles public data, with shapes and geographic metadata from the U.S. Census Bureau’s 2022 TIGER/Line Shapefiles.
OpenAddresses data was ingested for every address in the United States between Feb. 19 and Feb. 23. In total, the first iteration of USPS contains more than 200 million addresses.
The tool is powered by PostGIS, an extension to the PostgreSQL database that supports geographic objects and spatial queries.
Rather than make a request to the Census geocoder API for each U.S. address from OpenAddresses, we preprocessed addresses with the PostGIS TIGER Geocoder. This performs the same function as our original methodology but with a simpler data pipeline that, given the scale of addresses we preprocessed, doesn’t strain the U.S. Census Bureau’s servers.
PostGIS also allows USPS to use “spatial indexing,” which uses geographic bounding boxes as the main reference point, making geographic queries fast and efficient.
For USPS, we created spatial indices around the Census TIGER Shapefiles for each geographic designation (Census block group, place, county, etc.) and used those as reference points to quickly filter addresses against their geographic coordinates.
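The idea behind bounding-box filtering can be sketched in plain Python: a cheap rectangle check rules out most points, and only the survivors get the more expensive point-in-polygon test. This is a simplified illustration with made-up shapes and hypothetical function names. A real spatial index in PostGIS organizes the boxes in a tree so it does not have to scan every point the way this loop does.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count how many edges a rightward ray from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_points(points, polygon):
    box = bounding_box(polygon)
    # Cheap box test first; exact polygon test only for points that pass it.
    return [p for p in points if in_box(p, box) and in_polygon(p, polygon)]


# A made-up triangular "place" and three made-up points.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = [(1.0, 1.0), (3.0, 3.0), (9.0, 9.0)]
matches = filter_points(points, triangle)
```

In this example the point (3.0, 3.0) falls inside the bounding box but outside the triangle, which is why the exact test is still needed after the box check.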
But before that was possible, each address’s geographic coordinates were converted to the same coordinate system as the TIGER data (EPSG:4269, also known as NAD83) and stored in another spatial index.
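For readers who want to try something similar in their own PostGIS database, the statements below sketch the reprojection, indexing, and sampling steps as SQL held in Python strings. The table and column names (`addresses`, `places`, `lon`, `lat`, `geom`, `geom_nad83`, `name`) are hypothetical, the source coordinates are assumed to be WGS 84 (EPSG:4326), and the `ORDER BY random()` sampling is one simple option rather than a description of how USPS draws its samples. Only the PostGIS functions themselves are real.

```python
# Hypothetical schema; adjust names to match your own tables.

# Store each address point in the TIGER coordinate system (EPSG:4269).
REPROJECT_SQL = """
ALTER TABLE addresses ADD COLUMN geom_nad83 geometry(Point, 4269);
UPDATE addresses
SET geom_nad83 = ST_Transform(ST_SetSRID(ST_MakePoint(lon, lat), 4326), 4269);
"""

# Build a spatial (GiST) index on the reprojected points.
INDEX_SQL = """
CREATE INDEX addresses_geom_nad83_idx ON addresses USING GIST (geom_nad83);
"""

# Pick random addresses that fall inside one named place.
SAMPLE_SQL = """
SELECT a.*
FROM addresses AS a
JOIN places AS p ON ST_Contains(p.geom, a.geom_nad83)
WHERE p.name = %(place)s
ORDER BY random()
LIMIT %(limit)s;
"""

STATEMENTS = [REPROJECT_SQL, INDEX_SQL, SAMPLE_SQL]
```

The `%(place)s` and `%(limit)s` placeholders follow the named-parameter style used by common Python PostgreSQL drivers, so values are passed separately rather than pasted into the query text.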