Data is Plural archive

Trawl through the backlog, or roll the dice for a random dataset 🎲
Feb 2025

Chord progressions.

Spyridon Kantarelis et al. have created CHORDONOMICON, a dataset identifying the progressions of 51 million chords in 667,000+ songs. The dataset is based on tablatures from the website Ultimate Guitar and then “annotated with structural parts, genre, and release date”. Most entries also include the song’s and artist’s IDs in Spotify’s system. [h/t Dale Debber]

Feb 2025

Argentine treaties.

Javier I. Santander, a career diplomat, has built a dataset of 8,200+ bilateral treaties signed by Argentina between 1810 and 2023. It lists each treaty’s title, status, date signed, and counterpart country. The dataset is based on the government’s Digital Library of Treaties, where you can find copies of the treaties themselves. The most common counterparts have been neighboring countries (Chile, Brazil, Bolivia, Paraguay, and Uruguay), followed by Germany, the US, and Italy.

Feb 2025

18 million deceased veterans.

BIRLS.org, a new website from Reclaim The Records, provides “an index to basic biographical information on more than 18 million deceased American veterans who received some sort of veterans benefits in their lifetime”. Those records, obtained through a FOIA lawsuit, represent a substantial chunk of the Department of Veterans Affairs’ Beneficiary Identification Records Locator Subsystem. The site also helps you file follow-up requests for any individual’s “full VA claims file, which may contain hundreds of pages of never-before-seen biographical and historical material about the veteran, their military service, and their interactions with the VA.” Note: The “database is not a comprehensive database of all American veterans, but rather a partial and incomplete index of veterans who were eligible for VA benefits or whose heirs had some kind of contact with the VA regarding benefits.”

Feb 2025

Water availability.

The US Geological Survey last month released its National Water Availability Assessment, “a pioneering scientific overview of water availability that offers first-of-its-kind insights into the balance between water supply and demand across the conterminous United States.” Alongside the report, USGS launched a “data companion” providing “regularly updated, model-based estimates” of monthly water usage within each of the country’s hydrologic units. Estimates for water availability and water supply are “coming soon,” while those for water quality and aquatic ecosystems are “coming later.”

Feb 2025

Presidential schedules.

Among its various White House–related undertakings, Roll Call Factba.se provides event-by-event structured data representing the public presidential calendars for Donald Trump and Joe Biden since the latter’s inauguration in January 2021. The schedules, available to download in bulk, provide each event’s day and time, location, a brief description, and other details. They contain 9,400+ entries from Biden’s four years in office plus another 300+ from Trump’s second term so far. The events include those from the official presidential schedule, those derived from pool reports, and press briefings. As seen in: POTUS Tracker. [h/t Dan Brady]

Jan 2025

A royal regatta.

The Henley Royal Regatta, a multi-day rowing competition, has been held on the River Thames nearly every year since 1839. Dominic Goymour has scraped the event’s online results into a dataset covering 7,500+ outcomes since 1999. It includes each race’s date, starting time, stage, boat class, cup, winning crew/club, losing crew/club, winning time, and more.

Jan 2025

Grocery ingredients.

To compile GroceryDB, Babak Ravandi et al. scraped data about 50,000+ food products available on the websites of Walmart, Target, and Whole Foods. For each product, they extracted the nutritional information and ingredient list, which they provide as structured data and use for estimating each product’s degree of processing. Related: TrueFood, a website the research team built with the findings.

Jan 2025

Hurricane landfalls.

NOAA’s Hurricane Research Division maintains a table of hurricanes that have made landfall on the continental US since the 1850s. It records the year and month of landfall, designated name, states affected, the highest Saffir-Simpson category, central pressure at landfall, and maximum sustained wind speed. The division publishes another table containing more details — such as the full date, latitude, and longitude of landfall — but with a gap in the late 1970s to early 1980s. [h/t Michael Ferragamo + Dale Debber]

Jan 2025

Private schools.

The National Center for Education Statistics’s Private School Universe Survey has been gathering data about private elementary and secondary schools every two years since the 1989–90 school year. It collects information on “religious orientation; level of school; size of school; length of school year, length of school day; total enrollment (K-12); number of high school graduates, whether a school is single-sexed or coeducational and enrollment by sex; number of teachers employed; program emphasis” and more. In the latest data, covering the 2021–22 school year, “there were 29,727 private schools, enrolling 4,731,303 students and employing 482,571 full-time teachers”. As seen in: ProPublica’s Private School Demographics lookup tool (webinar scheduled for January 31) and its reporting on “segregation academies”.

Jan 2025

Hyperlocal Trump/Harris results.

Earlier this month, colleagues at The New York Times published “An Extremely Detailed Map of the 2024 Election” and made the underlying data available to download. The effort “currently includes results for more than 110,000 precincts, or 73 percent of all votes, and will be updated as more data is collected.” The dataset lists each precinct’s state, county FIPS code, votes received by Kamala Harris, votes received by Donald Trump, and total votes (including third parties and write-ins). It also provides each precinct’s geographical boundaries, derived from a mix of official sources and estimations. Previously: “An Extremely Detailed Map of the 2020 Election” and the data behind it (DIP 2021.02.10). See also: Precinct-level election results for 2020, 2018, 2016, and 2012 from the Voting and Election Science Team.
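As a rough illustration of how the three vote columns described above could be combined, here is a minimal sketch that rolls precinct rows up to county-level margins. The field names (`county_fips`, `votes_harris`, `votes_trump`, `votes_total`) and the sample rows are assumptions made up for this example; they are not taken from the published files.

```python
from collections import defaultdict

# Made-up sample rows. Field names are assumptions, not the published schema.
SAMPLE_PRECINCTS = [
    {"county_fips": "00001", "votes_harris": 120, "votes_trump": 80, "votes_total": 205},
    {"county_fips": "00001", "votes_harris": 40, "votes_trump": 60, "votes_total": 100},
    {"county_fips": "00003", "votes_harris": 10, "votes_trump": 30, "votes_total": 41},
]


def margin_pct(harris, trump, total):
    """Harris-minus-Trump margin as a percentage of all votes cast."""
    if total == 0:
        return None  # no votes recorded, so no margin can be computed
    return 100.0 * (harris - trump) / total


def county_margins(rows):
    """Sum precinct votes by county FIPS code, then compute each county's margin."""
    totals = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        county = totals[row["county_fips"]]
        county[0] += row["votes_harris"]
        county[1] += row["votes_trump"]
        county[2] += row["votes_total"]
    return {fips: margin_pct(*sums) for fips, sums in totals.items()}


if __name__ == "__main__":
    for fips, margin in sorted(county_margins(SAMPLE_PRECINCTS).items()):
        print(fips, round(margin, 1))
```

Dividing by total votes rather than the two-party sum keeps third-party and write-in ballots in the denominator, matching the way the total column is described. Because the collection covers only part of the national vote so far, county sums built this way may be incomplete.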

Jan 2025

ISS telemetry.

The International Space Station beams home a wide range of measurements: cabin temperature, solar array angles, spacesuit power supply, wastewater tank capacity, oxygen production rate, and much more. NASA, in collaboration with Lightstreamer, provides a feed of these measurements. A team developing a live 3D model of the station has also published a couple of dashboards of the realtime data, historical data going back to 2018, and a data dictionary. [h/t ajdud + AIorNot]

Jan 2025

NEA writing fellowships.

A team led by English professor Alexander Manshel has compiled a dataset of every recipient of the National Endowment for the Arts’ fellowship for creative writing, “from the organization’s founding in 1965 to 2024, including information about those writers’ demographics, education, and geography.” The dataset, which lists 3,700+ recipients, is based on the NEA’s own directory and a 2006 report, as well as “author biographies and websites, institutional websites, interviews, encyclopedias, literary criticism, and literary journalism.” [h/t Melanie Walsh + Derek Willis]

Jan 2025

AI governance documents.

The Emerging Technology Observatory’s AGORA “is a living collection of AI-relevant laws, regulations, standards, and other governance documents from the United States and around the world.” The dataset, available to download and explore online, provides the full text, metadata (e.g., jurisdiction, title, relevant dates), summaries, and thematic tags for 600+ documents. The project currently “skews toward U.S. law and policy” but is aiming “to broaden coverage of U.S. state documents […] and to broaden coverage of Chinese central government documents and major corporate commitments.”

Jan 2025

Opioid settlement spending.

KFF Health News, working with researchers at Johns Hopkins and Shatterproof, has published “a first-of-its kind database” tracking how states and local governments are using the billions of dollars received via opioid settlements in recent years. The database, drawing from “dozens of interviews, thousands of pages of documents, an array of public records requests, and outreach to all 50 states,” represents “the most comprehensive resource to date tracking some of the largest public health settlements in American history.” For each state, it indicates the total funds received in 2022-23, amount committed or spent in various categories (e.g., prevention, treatment, recovery), amount set aside, and amount “untrackable via public reports.” It also catalogs 7,000+ specific spending decisions: funder, destination, purpose, and amount. Previously: Opioid settlement payouts (DIP 2024.04.10). [h/t Aneri Pattani]

Jan 2025

Overdose demographics.

Since mid-2024, reporters at the Baltimore Banner have been publishing a series examining the city’s overdose crisis — reporting supported by The New York Times’ Local Investigations Fellowship and Stanford’s Big Local News. Last month the team partnered with The Upshot and a range of local news organizations to examine a stark phenomenon: In dozens of US counties, “Black men born between 1951 and 1970 have died of overdose at exceptionally high rates for decades.” They’ve published the supporting data, which list overdose death counts and rates by year, county, race/ethnicity, sex, and age group. The data, based on restricted-use records from the CDC, cover the years 1989 to 2022 for “the 408 U.S. counties that had 200 or more overdose deaths between 2018 and 2022”. [h/t Cheryl Phillips + Kimi Yoshino]

Dec 2024

A long-running ultramarathon.

The Comrades Marathon, first run in 1921, is considered “the oldest and largest ultramarathon in the world.” The route stretches 80+ kilometers between Durban and Pietermaritzburg, flipping annually between “up” and “down” directions. In 2019, Kyle Stratton scraped the official website to construct a dataset of all 445,000+ finishers (year, name, country, club, category, finishing time, medal received) through that year. Related: The Association of Road Racing Statisticians’ lists of longest-running marathons and ultramarathons, last updated in 2017. As seen in: Antony Unwin’s Getting (more out of) Graphics.

Dec 2024

Serbian political party funds.

The Center for Investigative Journalism of Serbia’s Party Funds database “tracks all reported incomes and expenses of 40 political parties and citizens’ groups in Serbia over the past nine years.” The records, based on financial disclosure reports, can be browsed online, searched, and downloaded. They indicate revenues, overhead costs, ad spending, salary expenditures, and more, and specify each line item’s year, amount, purpose, and other context-dependent details. [h/t Teodora Ćurčić]

Dec 2024

Crop rotations.

The Department of Agriculture’s Crop Sequence Boundaries initiative algorithmically analyzes satellite imagery to create “estimates of field boundaries, crop acreage, and crop rotations across the contiguous United States.” The results are available via an interactive map and downloads for eight-year time frames. The underlying code is open-source and can be used to generate datasets for custom time frames. Previously: The USDA’s CropScape tool and Cropland Data Layer (DIP 2019.03.06). [h/t Forest Gregg]

Dec 2024

Food safety alerts.

Data journalist Adrian Nesta is building an automated pipeline to collect and standardize data on food safety recalls and alerts from two US federal agencies: the FDA and the USDA. For each alert, the standardized dataset indicates the notice’s title, ID, URL, and time posted, as well as the product description, company name, brand name, recall type, recall reason, impacted states, risk level, and more.

Dec 2024

Quits and layoffs.

Minneapolis Fed–affiliated economists Kathrin Ellieroth and Amanda Michaud have constructed a new dataset on monthly quits and layoffs. Using Current Population Survey (CPS) microdata going back to 1978, the dataset estimates the proportions of employees who, after quitting or being laid off, transition to unemployment versus exiting the labor market. In a recent article, Ellieroth and Michaud note that “CPS data offer a perspective not seen in the most-often-used series on quits and layoffs, the Job Openings and Labor Turnover Survey (JOLTS),” featured in DIP 2022.09.21. “Whereas the JOLTS tracks what happens to a job, the CPS tracks what happens to people.” Analyzing it, they found “that increases in unemployment are typically not due to increases in layoffs; rather, they happen because laid-off workers are less likely to quickly find a new job, more likely to stay in the labor force, and thus more likely to join a growing pool of unemployed people hunting for work.” [h/t Alex Albright]
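To make the “what happens to people” framing concrete, here is a minimal sketch of the kind of share being estimated: among workers who separate for a given reason, what fraction land in unemployment versus leave the labor force. The records and labels below are invented for illustration and are not drawn from the dataset. Real CPS estimates would also apply survey weights, which this sketch omits.

```python
from collections import Counter

# Invented person-level records: (reason for separation, status the following month).
SAMPLE_TRANSITIONS = [
    ("quit", "unemployed"),
    ("quit", "not_in_labor_force"),
    ("quit", "not_in_labor_force"),
    ("layoff", "unemployed"),
    ("layoff", "unemployed"),
    ("layoff", "unemployed"),
    ("layoff", "not_in_labor_force"),
]


def destination_shares(records):
    """Return, for each separation reason, the share of people ending up in each status."""
    pair_counts = Counter(records)
    reason_totals = Counter(reason for reason, _ in records)
    return {
        (reason, status): count / reason_totals[reason]
        for (reason, status), count in pair_counts.items()
    }


if __name__ == "__main__":
    for key, share in sorted(destination_shares(SAMPLE_TRANSITIONS).items()):
        print(key, round(share, 2))
```

The shares within each reason sum to one, so a rise in the “unemployed” share for layoffs corresponds to an equal fall in the share leaving the labor force.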

Dec 2024

Pixar films.

Software engineer Eric Leung built and maintains a dataset and R package providing structured information about every Pixar film — from 1995’s Toy Story to 2024’s Inside Out 2. It lists each film’s creators (storywriters, screenwriters, directors, composers, and producers), budget, box-office earnings, aggregate critic ratings, Oscar nominations and wins, and more. [h/t Josh Laurito]

Dec 2024

Nanosatellites.

Space systems engineer Erik Kulu’s Nanosats Database tracks 4,000+ nanosatellites that have been launched into space, are planned for future launch, or have had their launches cancelled. The data for each satellite include its mission name and description, launching organization and country, mass/unit size, launch date, and status. Additional tables provide lists of CubeSat companies, launch providers, costs, and more. [h/t Ahmad Assem]

Dec 2024

China leaders’ foreign visits.

Yu Wang and Randall W. Stone’s China Visits dataset records 400+ visits by China’s presidents and premiers to 100+ countries between 1998 and early 2020. To compile it, the authors consulted official reports, web search results, and relevant Wikipedia pages. For each visit, the dataset indicates its starting and ending date, Chinese leader, foreign country, broader meeting (e.g., those of the Shanghai Cooperation Organisation), and source URL.

Dec 2024

Education policies around the world.

Adrián del Río et al. “introduce a global dataset on education policies and systems across modern history,” with “measures on compulsory education, ideological guidance and content of education, governmental intervention and level of education centralization, and teacher training.” The dataset covers 157 countries annually from 1789 to 2020. The questions answered by the team’s evaluators include, for example, “How many years of schooling are required by compulsory education?”, “Are there any national laws in place that ban specific subjects or topics in school?”, and “Which entities operate secondary schools?”

Dec 2024

100 million places.

Foursquare has released an open dataset describing more than 100 million points of interest across 200+ countries. For each place, the dataset includes its name, address, latitude/longitude, date entered, date updated, date marked closed, telephone number, website, email address, and relevant categories. Among the many possible labels: casino, comedy club, 300+ kinds of restaurants (e.g., deli, diner, Korean BBQ, “mac and cheese joint”), and 100+ types of retailers (e.g., candy store, used car dealership, shopping mall). Learn more: Some initial explorations from Tim Wallace and from Simon Willison. Previously: The Overture Maps Foundation’s datasets (DIP 2023.08.09), including information about 53 million places. [h/t Derek M. Jones + Sharon Machlis + Giuseppe Sollazzo]

Nov 2024

NYC marathon finishers.

New York Road Runners publishes a searchable database of all races it has organized since 1970 — the year of NYC’s first marathon — and all finishers of those races. Data Is Plural reader Joe Hovde has scraped the results of the 2024 marathon into a downloadable spreadsheet. Each row represents one of the 55,000+ finishers and provides their name, bib number, age, gender, city, state, country, time ran, and place finished. Read more: “Marcelo & Karolina, the Fastest Names in the NYC Marathon,” by Hovde.

Nov 2024

Waves.

The Coastal Data Information Program, launched in the 1970s by a research group at the Scripps Institution of Oceanography, “is an extensive network for monitoring waves and beaches along the coastlines of the United States.” The program provides a map of its stations, a table of recent observations, a catalog of real-time and historical wave measurements, and an extreme wave tracker. As seen in: Dion Häfner et al.’s “FOWD: A Free Ocean Wave Dataset for Data Mining and Machine Learning.”

Nov 2024

Substance abuse treatment.

The Substance Abuse and Mental Health Services Administration’s Treatment Episode Data Set records admissions to, and discharges from, substance abuse centers in the US. The public-use datasets, which span several decades, are based on records collected by state agencies. They include each patient’s demographic information, state, metro/micro area, referral source, treatment type, substances used, frequency of use, age at first use, and number of previous treatment episodes, among other details. Related: The administration’s National Survey of Substance Abuse Treatment Services, “an annual census of treatment facilities.” [h/t Conor Lennon et al.]

Nov 2024

Climate summit attendees.

Daria Blinova et al. have built a dataset of 310,000+ attendees of United Nations climate summits. The data, largely compiled from PDFs of attendance rosters, include each attendee’s year and meeting attended, name, job title, affiliation, delegation, delegation type (party, observer state, intergovernmental organization, NGO), gender, and more. In all, the attendees span 27,000+ delegations across three decades of COP and predecessor summits. Read more: “This Is 29 Years of International Climate Summits, Visualized,” by The New York Times’ Mira Rojanasakul.

Nov 2024

Tariffs.

The United States International Trade Commission maintains annual datasets of US import tariffs going back to 1997. The datasets include each impacted product’s eight-digit Harmonized Tariff Schedule code, a brief description, the duty rate, rate type, effective and ending dates, and more. The commission also publishes a tariff search tool and data on upcoming tariff rates. More globally, the World Trade Organization provides tools to query and download data about its members’ tariffs, as well as databases of regional trade agreements and preferential trade agreements. Previously: Trade policy intervention data from Global Trade Alert (DIP 2022.01.19).
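Eight-digit Harmonized Tariff Schedule codes are hierarchical: the first two digits give the chapter, the first four the heading, and the first six the internationally harmonized subheading, with the full eight digits identifying the US tariff line. A minimal sketch of splitting a code into those levels follows; the function name and the dotted input format are assumptions for illustration, and the files themselves may store codes differently.

```python
def split_hts8(code: str) -> dict:
    """Break an eight-digit HTS code into its nested levels.

    Accepts codes with or without dots (e.g. "0101.21.00" or "01012100").
    """
    digits = code.replace(".", "").strip()
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected eight digits, got {code!r}")
    return {
        "chapter": digits[:2],      # broad product group
        "heading": digits[:4],      # chapter plus two more digits
        "subheading": digits[:6],   # six-digit level shared internationally
        "tariff_line": digits,      # full eight-digit US line
    }


if __name__ == "__main__":
    print(split_hts8("0101.21.00"))
```

Keeping each level as a string preserves leading zeros, which would be lost if the codes were parsed as integers.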