Data Is Plural is a weekly newsletter of useful/curious datasets. This edition, dated Feb. 1, 2023, has been republished with permission of the author.
In last week’s newsletter, I noted that the Coast Guard’s list of boat recalls “seems possible to scrape.” Reader Michael Nolan took up the challenge; here’s the dataset he extracted. Thanks, Michael!
Women’s well-being. Camille Belmin et al.’s LivWell dataset presents “a set of key indicators on women’s socio-economic status, health and well-being, access to basic services and demographic outcomes” in 447 regions of 52 countries from 1990 to 2019. The indicators include, for example, rates of home ownership, educational attainment, and domestic violence; they’re based primarily on data from the Demographic and Health Surveys Program, a USAID-funded initiative that, since 1984, “has provided technical assistance to more than 400 surveys in over 90 countries, advancing global understanding of health and population trends in developing countries.” Read more: An introductory Twitter thread from Belmin.
Radiation-contaminated waste. The U.S. Nuclear Regulatory Commission regulates the disposal of “low-level” radioactive waste—“items that have become contaminated with radioactive material or have become radioactive through exposure to neutron radiation,” such as protective equipment and cleaning supplies. The NRC provides annual statistics (facility, volume, and total curies) for the country’s four active disposal sites. The Department of Energy’s Manifest Information Management System provides more detailed figures, with breakdowns by month, state of origin, waste classification, isotope, and more. Previously: Data from the Nuclear Fuel Data Survey (DIP 2022.11.23).
Novel dialogue. Krishnapriya Vishnubhotla et al.’s Project Dialogism Novel Corpus contains every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000-plus quotations, the corpus “is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.” Jane Austen is the most-represented author (five novels), followed by E.M. Forster (two). The researchers have also published a document that they “hope will help standardize future annotation work in this domain.”
Browser capabilities. Alexis Deveria’s caniuse.com indicates which versions of which web browsers support various web technologies, such as CSS’s grid layout, the WebP image format, and the Image Capture API. The project’s dataset covers 530-plus technologies and 19 browsers (six desktop, 13 mobile). It also provides estimates of the percentage of all users whose browsers support a given technology. [h/t Simon Willison]
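For readers curious how a "percentage of users" figure could be derived from this kind of data, here is a minimal sketch. It assumes a layout in which each browser version has a global usage share and each technology has a per-version support flag ("y" for yes, "a" for partial, "n" for no). The `SAMPLE` values, the `example-feature` name, and the `support_share` helper are all hypothetical, made up for illustration, and the field names are an assumption about the published JSON rather than a confirmed schema.

```python
# Hypothetical sample data. The field names ("agents", "usage_global",
# "data", "stats") are an assumption about the dataset's layout, and the
# numbers are invented for illustration only.
SAMPLE = {
    "agents": {
        "chrome": {"usage_global": {"108": 20.0, "109": 25.0}},
        "safari": {"usage_global": {"15": 5.0, "16": 10.0}},
    },
    "data": {
        "example-feature": {
            "stats": {
                "chrome": {"108": "y", "109": "y"},
                "safari": {"15": "n", "16": "a"},
            }
        }
    },
}


def support_share(dataset, feature):
    """Return (full, partial) usage percentages for one technology.

    Adds up the usage share of every browser version whose support flag
    starts with "y" (full support) or "a" (partial support).
    """
    full = 0.0
    partial = 0.0
    stats = dataset["data"][feature]["stats"]
    for browser, versions in stats.items():
        usage = dataset["agents"][browser]["usage_global"]
        for version, flag in versions.items():
            share = usage.get(version, 0.0)  # versions without usage data count as 0
            if flag.startswith("y"):
                full += share
            elif flag.startswith("a"):
                partial += share
    return full, partial


full, partial = support_share(SAMPLE, "example-feature")
print(f"full support: {full}%, partial support: {partial}%")
```

With real data, the same loop would run over every browser and version in the file; how the project itself weights or rounds its estimates may differ from this sketch.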
More English football. Josh Fjelstul’s English Football Database “is a comprehensive database of football matches played in the Premier League and the English Football League from the inaugural season of the Football League (1888–89) through the most recent season (2021–22).” It records each season, team, match, and the standings table at the end of each season. Previously: James P. Curley’s English soccer dataset (DIP 2016.05.04), since expanded to leagues in a dozen more countries. [h/t Derek M. Jones]
Notice: Unlike most of our content, this edition of Data Is Plural by Jeremy Singer-Vine is not available for republication under a Creative Commons license.