Data is Plural archive

Trawl through the backlog, or roll the dice for a random dataset 🎲
Jul 2024

Australia shipwrecks.

The Western Australian Museum hosts a range of datasets, including details concerning 1,600+ local shipwrecks and 30,000+ artifacts recovered from them. The shipwreck dataset lists each ship’s builder, construction materials, owner, cargo, wreck location, date wrecked, known deaths, date found, and more. Previously: Ancient shipwrecks (DIP 2024.07.10). [h/t Kristin Milton]

Jul 2024

National park species.

The National Park Service’s NPSpecies portal “documents our knowledge about the occurrence and status of species” on the agency’s lands. For each NPS-managed area, you can download a list of the species, their scientific and common names, occurrence status (present, probably present, unconfirmed), nativeness, conservation status, and more. Related: Noting that “many of the observations in NPSpecies remain unverified and the lists are often outdated,” Benjamin J. LaFrance et al. have created an updated dataset for amphibian species, which they checked against other sources and verified with regional experts.

Jul 2024

Commercial zones.

Byeonghwa Jeong et al. have constructed a dataset estimating the geographic boundaries of 23,000+ commercial zones in 69 metro areas in the US and Canada. To build it, they used data on retail and office locations from OpenStreetMap, and on job density from the US Census Bureau’s Longitudinal Employer-Household Dynamics program (DIP 2021.05.26) and Statistics Canada. For each detected commercial zone, the dataset provides its outline, total area, a score of its relative concentration (on which the zone comprising most of Manhattan scored the highest), its MSA, and the street at its centroid.

Jul 2024

Human rights scores.

The CIRIGHTS project aims “to create numerical measures for every internationally recognized human right for all countries of the world.” The team has developed a detailed guide to scoring each government’s record on dozens of such rights, such as freedom of religion, women’s political rights, freedom from extrajudicial killings, the right to a fair trial, and “reasonable limits” on working hours. For each year from 1981 to 2021, the project’s scorers have rated each country on each right, generally on a three-point scale, based on information in the US State Department’s Country Reports on Human Rights Practices, Amnesty International’s annual reports, and similar sources. The resulting dataset includes those scores, as well as several summary metrics.

Jul 2024

News homepages, archived.

Since launching in March 2022, homepages.news has archived millions of screenshots, performance audits, robots.txt files, accessibility trees, and hyperlink lists from the homepages of 1,100+ news sites. The open-source project, run by journalist Ben Welsh, provides bulk data for each of those assets. The screenshots themselves are stored on the Internet Archive; you can also view the latest screenshots from all the sites on one page. To date, the publications span 32 countries and 17 languages. Related: Welsh and volunteer Alex Garcia are using the robots.txt data to track which sites block OpenAI, Google AI, and Common Crawl — findings that have been cited widely.

Jul 2024

Ancient shipwrecks.

The Summary Geodatabase of Shipwrecks 1500BCE-1500CE merges two catalogs of ancient wrecks: one from the Oxford Roman Economy Project and one from Harvard’s Mapping Past Societies project. Building on scholarly research by Toby Parker, Julia Strauss, and others, the combined effort includes 1,900+ known wrecks, listing (where known) their coordinates, depth, estimated time period when wrecked, year discovered, cargo, and more.

Jul 2024

US coral reefs.

The National Coral Reef Monitoring Program collects scientific and socioeconomic/attitude survey data related to the coral reefs offshore of the continental United States, Hawaii, and US territories. It provides the data through public visualizations as well as download tools and raw files. The scientific data include species-level coral cover, colony density, bleaching prevalence, and disease rates; fish populations; water alkalinity and dissolved inorganic carbon levels; and more. [h/t Gary Price]

Jul 2024

East Asia building outlines.

Recent years have seen the development of ambitious datasets that provide the outlines of buildings by the millions. For instance, from the archives: buildings in the US (DIP 2018.07.18), Africa (DIP 2021.08.25), and Canada and New Zealand (DIP 2019.09.25). In the Journal of Remote Sensing, however, Qian Shi et al. note a relative lack of such data for buildings in East Asia, which the authors attribute “to the more complex distribution of buildings and the scarcity of auxiliary data”. As an antidote, they’ve generated a dataset that outlines more than 280 million buildings in 2,800+ cities across China, Japan, South Korea, North Korea, and Mongolia.

Jul 2024

Corporate AI activity.

The Private-Sector AI Indicators dataset, from Georgetown University’s Emerging Technology Observatory, provides “a diverse range of indicators of AI-related activity for hundreds of companies worldwide, from startups to multinationals.” For each of the 670+ companies included, the dataset counts the number of AI-related research articles published by its employees (disaggregated by topic), AI-related patents filed (by application area and use-case), and workforce (overall and those estimated to be AI-involved). It also lists each company’s main location, sector, growth stage, and description, as well as aliases, stock listings, and identifiers in several external data sources. See also: An interactive version of the database. [h/t Zach Arnold]

Jul 2024

Federal inmate complaints.

The Federal Bureau of Prisons’ Administrative Remedy Program “allow[s] an inmate to seek formal review of an issue relating to any aspect of his/her own confinement.” In October 2022, the Data Liberation Project (which, disclosure, I run) filed a FOIA request seeking records from the agency database that tracks those complaints. In response, BOP last month provided data on 1.78 million complaint and appeal submissions filed from January 2000 through late May 2024, spanning nearly 1 million distinct cases. The records, published yesterday with the help of volunteers, indicate when each filing was received, its relevant case number, complaint subject, facility where the issue occurred, case status, status update date, reasons for rejection/closure, and other details.

Jul 2024

UK film stats.

The British Film Institute publishes a variety of statistical reports, including spreadsheets of weekend box office figures. Those spreadsheets cover each weekend’s 15 highest-grossing films, all UK-originated films, and other newly-released films; they list each film’s title, country of origin, distributor, cinema count, weekend gross, total gross to date, and more. [h/t Gina Acosta Gutiérrez]

Jul 2024

Hurricane forecast accuracy.

The National Hurricane Center says it “receives frequent inquiries on the accuracy and skill of its forecasts and of the computer models available to it.” To help answer those questions, the agency publishes a series of regularly-updated verification reports, as well as a database quantifying its forecast errors. For each official projection since 1970, the database compares each storm’s predicted location and wind speed to those attributes’ ultimate values. As seen in: “The Social Value of Hurricane Forecasts,” a study by Renato Molina and Ivan Rudik.
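
As a rough illustration of what comparing a predicted location and wind speed to their ultimate values can mean in practice, here is a minimal sketch. It assumes a track error measured as great-circle distance in nautical miles and an intensity error measured as a signed difference in knots. The function names, units, and sign convention are assumptions made for this example, not details taken from the agency's database.

```python
import math

# Assumption: mean Earth radius expressed in nautical miles.
EARTH_RADIUS_NM = 3440.065


def track_error_nm(fcst_lat, fcst_lon, obs_lat, obs_lon):
    """Great-circle (haversine) distance between forecast and observed positions."""
    phi1 = math.radians(fcst_lat)
    phi2 = math.radians(obs_lat)
    dphi = phi2 - phi1
    dlam = math.radians(obs_lon - fcst_lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def intensity_error_kt(fcst_wind_kt, obs_wind_kt):
    """Signed wind-speed error; positive means the forecast was too strong."""
    return fcst_wind_kt - obs_wind_kt


if __name__ == "__main__":
    # Hypothetical forecast and observed values, for illustration only.
    print(round(track_error_nm(25.0, -80.0, 25.5, -79.5), 1))
    print(intensity_error_kt(100, 90))
```

Keeping the intensity error signed, rather than absolute, makes it possible to distinguish a forecast that ran too strong from one that ran too weak.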

Jul 2024

Beach replenishment.

The Program for the Study of Developed Shorelines at Western Carolina University maintains a database of 2,500+ beach-replenishing efforts since the 1920s. The project is “a 25-year research and data collection effort that, to the best of our knowledge, represents the most comprehensive compilation of beach nourishment history in the United States.” For each sand-adding “episode,” the dataset indicates its location, year completed, sand volume, length of shoreline treated, total cost, primary type of funding source (private, federal, state, etc.), and justification (shore protection, navigation, emergency dune construction, etc.). As seen in: “Sand Dollars,” by CBS News Investigations.

Jul 2024

Death penalty status by country.

The Comparative Death Penalty Database, compiled by Carsten Anckar and Thomas Denk, tracks the status of capital punishment in 206 independent countries annually from 1800 to 2022. It places each observation into one of five categories, indicating whether the death penalty is: (a) fully abolished, (b) abolished “for ordinary crimes only,” (c) abolished “for ordinary crimes only but where at least one execution has occurred in the last 10 years,” (d) de facto abolished, or (e) still in use. Previously: The Death Penalty Information Center’s database of US executions (DIP 2019.05.15); data on death sentences from The Intercept (DIP 2019.12.11) and from Brandon L. Garrett (DIP 2018.08.01).

Jul 2024

Historical newswire articles.

Emily Silcock et al. have created Newswire, a dataset of 2.7 million newswire articles published in the US between 1878 and 1977. To build it, they extracted 138 million articles from scans of newspapers’ front pages and then used machine learning to group those coming “from the same underlying newswire source article, in the presence of significant abridgement and noise.” For each detected newswire item, the dataset lists the newspapers that carried it, dates published, the text of a representative version, its extracted byline, dispatch location, people mentioned in the text, general topic, and more. Previously: American Stories (DIP 2023.09.13), a dataset of historical newspaper articles — also from Melissa Dell’s research group. [h/t Robin Sloan]

Jun 2024

Broadway attendance.

The Broadway League’s Internet Broadway Database lets you search the famed industry’s theaters, shows, casts and staff, awards, and more. It also publishes charts and structured tables of weekly attendance and ticket revenue, both overall and for individual shows. The League itself also publishes show-level statistics. [h/t Millie Giles]

Jun 2024

Fish-spawning areas.

Kimberly L. Oremus et al. have constructed a geospatial dataset that indicates the spawning locations and timing for 1,000+ saltwater fish in 2,900+ marine regions around the world. The authors primarily sourced — and then geocoded — the records from FishBase (an initiative providing data on the habitats, body shapes, and other characteristics of 35,000+ fish species) and the database of Science and Conservation of Fish Aggregations (a nonprofit that focuses on massive reproductive gatherings of fish).

Jun 2024

Federal rural investment.

The Department of Agriculture’s Rural Development agency runs dozens of financial assistance programs to support housing, business, community facilities, telecommunications, and other developments in less populated areas of the United States. The agency’s data gateway, launched last year, provides dashboards and downloads tracking these loans, loan guarantees, and grants going back to fiscal year 2012. At their most granular, the data describe each investment’s type, amount, program, sector, ZIP code, city, state, and more. [h/t James Barham]

Jun 2024

Internet politics.

The Digital Society Project, founded in 2018 as a collaboration with the Varieties of Democracy initiative (DIP 2019.04.24), “aims to answer some of the most important questions surrounding interactions between the internet and politics.” To do so, they conduct surveys of experts and ask them questions such as: “How often does the government shut down domestic access to the Internet?” and “How often do domestic elites use social media to organize offline political action of any kind?” The survey’s downloadable datasets cover 170+ countries and provide aggregate metrics for each question. [h/t Donata Columbro]

Jun 2024

Supreme gifts.

Fix the Court is a nonprofit that “advocates for non-ideological ‘fixes’ that would make the federal courts, and primarily the U.S. Supreme Court, more open and more accountable to the American people.” Earlier this month, they published a series of spreadsheets tallying 500+ gifts accepted by the Supreme Court’s justices since 2000, with an estimated total value of $4.76 million. The findings “are largely based on last year’s groundbreaking work by ProPublica and includes data from stories in the New York Times, L.A. Times, the congressional record, annual disclosures,” and Fix the Court’s own work. The data indicate each gift’s year, recipient, description, giver, value, and the source of information. Previously: The Free Law Project’s database of federal judges’ financial disclosures (DIP 2021.10.20). Related: ProPublica’s interactive database of the current justices’ disclosures.

Jun 2024

Sudoku solves.

In the spirit of introspection, Sudoku enthusiast Vivek Rao has conducted a detailed analysis of his cell-by-cell performance on 100 puzzles from the New York Times’ daily offerings. The underlying data, collected via a custom browser extension that Rao built, indicates the order and timing of every cell he filled.

Jun 2024

English women’s football.

The English Women’s Football Database “covers all matches played since the 2011 season for the highest division (the Women’s Super League) and since the 2014 season for the second-highest division (the Women’s Championship).” The project, built by Rob Clapp, lists the date, teams, score, attendance, division, tier, and season of each match, as well as each season’s final standings. Previously: Josh Fjelstul’s English Football Database (DIP 2023.02.01), which Clapp cites as inspiration.

Jun 2024

NYC shelter exits.

A 2022 law requires New York City to report the monthly number of individuals and families exiting the city’s shelter system. Unfortunately, the city publishes those reports only as PDFs and without a historical archive. Patrick Spauster has built a pipeline to download and preserve the reports, and to turn them into structured data. For each month since May 2023, each row indicates the number of exits for a particular city agency, family/person category, and destination type. The destination types include various kinds of permanent housing, transitional housing, and medical facilities, as well as “unknown.” Read more: Spauster’s analysis of the data for City Limits. Previously: NYC shelter counts (DIP 2023.12.13).

Jun 2024

State tax revenues.

How much money do US states collect through different types of taxes? The Census Bureau’s Quarterly Summary of State and Local Government Tax Revenue provides these figures every three months, going back decades. The categories include taxes on property, income, general sales, sales of specific products (such as tobacco, alcohol, gas, and gambling), licenses, and more. For several years now, the agency has also published monthly data for a subset of those taxes. As seen in: “Which states make the most from sports betting? What about lotteries?” by the Washington Post’s Andrew Van Dam. Previously: The Census’s Annual Survey of State and Local Government Finances (DIP 2020.11.18).

Jun 2024

Interest group positions.

Galen Hall et al. have compiled a dataset of “over 13 million policy positions stated by tens of thousands of interest groups and individuals on bills in 17 state legislatures over the past 25 years.” The authors collected and standardized the data, which span 1997 to 2022, from lobbying and testimony disclosures. For each of those positions, the dataset indicates the relevant bill, client or individual represented, representative name, position phrase (for, against, monitoring, undecided, etc.), the date the position was reported, and more. It also provides details about each bill from Legiscan and the National Conference of State Legislatures, as well as client-industry categorizations from FollowTheMoney.org.

May 2024

France and Italy’s protected wines.

Sebastian Candiago et al. have assembled a dataset of 5,400+ Italian and French wines granted Protected Designation of Origin status, restricting their production to specific geographies and methods. For each wine, the dataset lists its name, country, designated area, color, category, grape varieties used, maximum allowed yields, registration date, and more. Previously: Protected European ham and the EU’s register of protected indications (DIP 2023.05.24).

May 2024

Satellite-aided rescues.

NOAA’s Search and Rescue Satellite-Aided Tracking (SARSAT) program is part of an international collaboration to locate distress beacons activated (manually or automatically) by mariners, aviators, and wilderness explorers. The agency publishes annual maps of SARSAT-enabled rescues, along with data for the most recent year-plus. NOAA has also provided Data Is Plural with data for 2016–2022. The maps and data files contain each rescue’s date, category, description, beacon type, coordinates, and number of people saved. [h/t Dan Brady]

May 2024

Security Council resolutions.

Seán Fobbe et al.’s Corpus of Resolutions: UN Security Council “collects and presents for the first time in human and machine-readable form all resolutions, drafts, and meeting records of the UN Security Council, including detailed metadata, as published by the UN Digital Library and revised by the authors.” It covers all 2,700+ resolutions from the council’s founding in 1946 through early 2024. In addition to providing the texts in all six official UN languages, the dataset includes each resolution’s title, date, council votes, related meeting number, meeting transcript, keywords, countries of focus, and more. An auxiliary dataset represents the corpus’s internal citations as a directed graph. [h/t Sharon Machlis]

May 2024

Amazon purchases.

The MIT Media Lab’s Alex Berke et al. have compiled “a first-of-its-kind dataset containing detailed purchase histories from 5027 U.S. Amazon.com consumers, spanning 2018 through 2022, with more than 1.8 million purchases […] crowdsourced through an online survey and shared with participants’ informed consent.” The published data include “order date, product code, title, price, quantity, and shipping address state,” and are “linked to survey data with information about participants’ demographics, lifestyle, and health.” The researchers found that a stratified subsample of the data demonstrated “expected seasonal trends and strong relationships to other public datasets.” [h/t Data Science Community Newsletter]

May 2024

Federal court dockets.

Journalist Matt Clark has compiled a database of more than 350 million docket entries across more than 13 million cases in 180+ federal courts — including the majority of district, appellate, and bankruptcy courts. The records, which Clark collected through the RSS feeds that many federal courts provide, span 2013 to the near-present. Clark’s downloadable database provides information about each docket entry (time filed, entry number, description, and URL), case (number, name, type, and URL), and court. Although the database does not include the docketed documents themselves, they can be retrieved via PACER and the free RECAP archive, among other sources.
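
To give a sense of how RSS items might be turned into structured docket entries, here is a minimal sketch using only Python's standard library. The sample feed, its element layout, the `DocketEntry` fields, and the `parse_feed` helper are all hypothetical, invented for this illustration. They are not Clark's code or schema, and real court feeds may be structured differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical feed content; real court feeds may use a different layout.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Hypothetical court feed</title>
<item>
<title>1:24-cv-00001 Example v. Example</title>
<pubDate>Mon, 06 May 2024 14:30:00 GMT</pubDate>
<link>https://example.invalid/docket/1</link>
<description>Order</description>
</item>
</channel></rss>"""


@dataclass
class DocketEntry:
    case_title: str      # case number and name, as given in the item title
    description: str     # short text describing the docket entry
    url: str             # link to the entry
    filed_at: datetime   # publication time from the feed


def parse_feed(xml_text):
    """Convert each RSS <item> into a DocketEntry."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append(DocketEntry(
            case_title=item.findtext("title", default=""),
            description=item.findtext("description", default=""),
            url=item.findtext("link", default=""),
            filed_at=parsedate_to_datetime(item.findtext("pubDate")),
        ))
    return entries


if __name__ == "__main__":
    for entry in parse_feed(SAMPLE_FEED):
        print(entry.filed_at.isoformat(), entry.case_title, entry.description)
```

Because feeds typically expose only recent items, a collector along these lines would need to poll regularly and store what it has already seen to build up a multi-year archive.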