Data is Plural archive

May 2025

What the nose knows.

Antonie Louise Bierling et al. have published a dataset of “descriptions, evaluative ratings, and qualitative labels for 74 chemically diverse mono-molecular odors, rated by a large sample of young adults.” Another paper by Bierling et al. “elicited body odor descriptions from 2,607 participants across 17 countries and 13 languages” to assemble “a standardized lexicon of body odor words.” Related: The Pyrfume Project provides “tools, models, and data for odorant-linked research.”

May 2025

US dams.

The National Inventory of Dams “documents all known dams in the U.S. and its territories that meet certain criteria” related to the dam’s height, reservoir size, and likely impacts of its “failure or mis-operation.” The inventory, maintained by the US Army Corps of Engineers since the 1970s, now includes 92,000+ structures. The data — available via a searchable map, bulk downloads, and an API — indicate each dam’s name, location, year built, structural characteristics, purpose, operational status, and much more. Previously: Global Dam Watch’s datasets (DIP 2020.01.29) and the USGS’s National Hydrography Dataset (DIP 2022.10.12).

May 2025

California ghost guns.

“Ghost guns have been a uniquely Californian issue,” with the state accounting for a majority of the untraceable firearms that are reported to the ATF, according to The Trace. Earlier this year, on its Gun Violence Data Hub, the publication posted datasets counting the ghost guns recovered by California law enforcement agencies, as well as “firearm-level data on guns reported lost or stolen in the state.” [h/t Aaron Mendelson]

May 2025

European workforces.

Each quarter, dozens of countries collectively conduct more than 1.7 million interviews for the European Union Labour Force Survey. The survey, the continent’s largest, aims “to classify people into 3 groups that are mutually exclusive and cover the whole target population”: employed, unemployed, and outside the labor force. Eurostat publishes aggregate results, with breakdowns by age, sex, country, nationality, citizenship status, education level, sector, and more. Detailed microdata are also available to approved researchers. As seen in: Bruegel’s labor market dashboard. [h/t Nina Ruer]

May 2025

Deportation records.

The Deportation Data Project, run by a team of academics and lawyers, “collects and posts public, anonymized U.S. government immigration enforcement datasets.” These include data from border apprehensions, deportations, Title 42 expulsions, ICE arrests and detentions, ICE-operated flights, and more. Some of the data files come directly from the government, while others were initially obtained from the government by other organizations, such as the University of Washington Center for Human Rights. The project also posts information about its Freedom of Information Act requests. Read more: The project’s “U.S. Immigration Enforcement Data: A Short Guide.” As seen in: “The Rising Cost of ICE Flying Immigrants to Far-Flung Detention Centers” (Bloomberg). [h/t Alex Albright]

Apr 2025

Canoe marathons.

Paddle UK’s Marathon Racing Committee promotes endurance canoe and kayak competitions that range “from a couple of miles or kilometres to the ultimate challenge of the 125-mile Devizes to Westminster Canoe Race.” The organization publishes race results online, which data scientist Andrew Collier has collected into structured data files that indicate each competition’s date, name, region, and category, as well as each paddler’s name, club, division, class, finishing time, position, and points.

Apr 2025

Previously unmapped waterways.

WaterNet Global Waterways is “a new global dataset that predicts the locations of waterways around the world” using an AI model trained on satellite imagery and elevation data. A collaboration between Bridges to Prosperity and the Better Planet Laboratory, the dataset — available as raster files, vector files, and an interactive map — “triples the known extent of mapped waterways globally, adding 124 million kilometers to the previously mapped 54 million kilometers.” [h/t Cameron Kruse]
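The quoted figures can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: the two inputs come from the quote above, while the function and variable names are made up for illustration.

```python
# Figures quoted in the dataset description, in millions of kilometers.
PREVIOUSLY_MAPPED_MKM = 54
NEWLY_PREDICTED_MKM = 124


def expansion_factor(previous: float, added: float) -> float:
    """Return the ratio of the new total extent to the previous extent."""
    return (previous + added) / previous


total_mkm = PREVIOUSLY_MAPPED_MKM + NEWLY_PREDICTED_MKM
factor = expansion_factor(PREVIOUSLY_MAPPED_MKM, NEWLY_PREDICTED_MKM)

print(f"Total: {total_mkm} million km, about {factor:.1f}x the earlier extent")
```

The new total works out to 178 million kilometers, a bit more than three times the previously mapped extent, which is consistent with the "triples" claim.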

Apr 2025

US sewer overflow sites.

“There are approximately 700 communities in the United States that have combined sewer systems and experience combined sewer overflow (CSO) discharges,” according to the EPA, whose National Combined Sewer Overflow Inventory lists 8,600+ outfalls across those communities. The downloadable inventory, last updated in September 2023, provides each outfall’s location and relevant information from the National Pollutant Discharge Elimination System’s permit database. As seen in: “Minority communities twice as likely to have sewage polluting nearby river or creek, CBS News analysis shows”. Previously: Sewer overflows in England (DIP 2024.05.15).

Apr 2025

Tens of millions of flights.

Sebastiaan Menger has developed a series of quarterly datasets “featuring global, high-level flight schedules extracted from worldwide aircraft ADS-B position transmissions,” going back to early 2024. Each quarterly extract, derived from the ADSB.lol flight-tracking initiative’s open data, features 10–13 million flights. Each flight’s entry indicates the aircraft’s registration number, type, call sign, airline (when applicable), approximate liftoff/touchdown times, and origin/destination airports.

Apr 2025

Refugee and asylum policies.

The Dataset of World Refugee and Asylum Policies “offers a complete dataset of de jure asylum and refugee policies” across 190+ countries and 70+ years, from 1951 to 2022. The project, developed by Christopher W. Blair et al. and updated in collaboration with the Joint Data Center on Forced Displacement, evaluates 54 aspects of each policy across five dimensions: access, services, the ability to earn a livelihood, freedom of movement, and political inclusion. Each aspect is scored on a 0-1-2-3 scale. The results are available to download and to analyze online. [h/t Annika Younge]

Feb 2025

Chord progressions.

Spyridon Kantarelis et al. have created CHORDONOMICON, a dataset identifying the progressions of 51 million chords in 667,000+ songs. The dataset is based on tablatures from the website Ultimate Guitar and then “annotated with structural parts, genre, and release date”. Most entries also include the song’s and artist’s IDs in Spotify’s system. [h/t Dale Debber]

Feb 2025

Argentine treaties.

Javier I. Santander, a career diplomat, has built a dataset of 8,200+ bilateral treaties signed by Argentina from 1810 to 2023. It lists each treaty’s title, status, date signed, and counterpart country. The dataset is based on the government’s Digital Library of Treaties, where you can find copies of the treaties themselves. The most common counterparts have been neighboring countries (Chile, Brazil, Bolivia, Paraguay, and Uruguay), followed by Germany, the US, and Italy.

Feb 2025

18 million deceased veterans.

BIRLS.org, a new website from Reclaim The Records, provides “an index to basic biographical information on more than 18 million deceased American veterans who received some sort of veterans benefits in their lifetime”. Those records, obtained through a FOIA lawsuit, represent a substantial chunk of the Department of Veterans Affairs’ Beneficiary Identification Records Locator Subsystem. The site also helps you file follow-up requests for any individual’s “full VA claims file, which may contain hundreds of pages of never-before-seen biographical and historical material about the veteran, their military service, and their interactions with the VA.” Note: The “database is not a comprehensive database of all American veterans, but rather a partial and incomplete index of veterans who were eligible for VA benefits or whose heirs had some kind of contact with the VA regarding benefits.”

Feb 2025

Water availability.

The US Geological Survey last month released its National Water Availability Assessment, “a pioneering scientific overview of water availability that offers first-of-its-kind insights into the balance between water supply and demand across the conterminous United States.” Alongside the report, USGS launched a “data companion” providing “regularly updated, model-based estimates” of monthly water usage within each of the country’s hydrologic units. Estimates for water availability and water supply are “coming soon,” while those for water quality and aquatic ecosystems are “coming later.”

Feb 2025

Presidential schedules.

Among its various White House–related undertakings, Roll Call Factba.se provides event-by-event structured data representing the public presidential calendars for Donald Trump and Joe Biden since the latter’s inauguration in January 2021. The schedules, available to download in bulk, provide each event’s day and time, location, a brief description, and other details. They contain 9,400+ entries from Biden’s four years in office plus another 300+ from Trump’s second term so far. The events include those from the official presidential schedule, those derived from pool reports, and press briefings. As seen in: POTUS Tracker. [h/t Dan Brady]

Jan 2025

A royal regatta.

The Henley Royal Regatta, a multi-day rowing competition, has been held on the River Thames nearly every year since 1839. Dominic Goymour has scraped the event’s online results into a dataset covering 7,500+ outcomes since 1999. It includes each race’s date, starting time, stage, boat class, cup, winning crew/club, losing crew/club, winning time, and more.

Jan 2025

Grocery ingredients.

To compile GroceryDB, Babak Ravandi et al. scraped data about 50,000+ food products available on the websites of Walmart, Target, and Whole Foods. For each product, they extracted the nutritional information and ingredient list, which they provide as structured data and use for estimating each product’s degree of processing. Related: TrueFood, a website the research team built with the findings.

Jan 2025

Hurricane landfalls.

NOAA’s Hurricane Research Division maintains a table of hurricanes that have made landfall on the continental US since the 1850s. It records the year and month of landfall, designated name, states affected, the highest Saffir-Simpson category, central pressure at landfall, and maximum sustained wind speed. The division publishes another table containing more details — such as the full date, latitude, and longitude of landfall — but with a gap in the late 1970s to early 1980s. [h/t Michael Ferragamo + Dale Debber]

Jan 2025

Private schools.

The National Center for Education Statistics’s Private School Universe Survey has been gathering data about private elementary and secondary schools every two years since the 1989–90 school year. It collects information on “religious orientation; level of school; size of school; length of school year, length of school day; total enrollment (K-12); number of high school graduates, whether a school is single-sexed or coeducational and enrollment by sex; number of teachers employed; program emphasis” and more. In the latest data, covering the 2021–22 school year, “there were 29,727 private schools, enrolling 4,731,303 students and employing 482,571 full-time teachers”. As seen in: ProPublica’s Private School Demographics lookup tool (webinar scheduled for January 31) and its reporting on “segregation academies”.

Jan 2025

Hyperlocal Trump/Harris results.

Earlier this month, colleagues at The New York Times published “An Extremely Detailed Map of the 2024 Election” and made the underlying data available to download. The effort “currently includes results for more than 110,000 precincts, or 73 percent of all votes, and will be updated as more data is collected.” The dataset lists each precinct’s state, county FIPS code, votes received by Kamala Harris, votes received by Donald Trump, and total votes (including third parties and write-ins). It also provides each precinct’s geographical boundaries, derived from a mix of official sources and estimations. Previously: “An Extremely Detailed Map of the 2020 Election” and the data behind it (DIP 2021.02.10). See also: Precinct-level election results for 2020, 2018, 2016, and 2012 from the Voting and Election Science Team.
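For readers planning to work with precinct-level records like these, here is a minimal sketch of how one row might be modeled and summarized. The field names are hypothetical stand-ins for the columns described above, not the actual headers in the published file, and the example numbers are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Precinct:
    # Hypothetical field names; the published file may label columns differently.
    state: str
    county_fips: str
    votes_harris: int
    votes_trump: int
    votes_total: int  # includes third parties and write-ins

    def other_votes(self) -> int:
        """Votes that went to neither Harris nor Trump."""
        return self.votes_total - self.votes_harris - self.votes_trump

    def margin_points(self) -> float:
        """Harris minus Trump, in percentage points of all votes cast."""
        if self.votes_total == 0:
            return 0.0
        return 100 * (self.votes_harris - self.votes_trump) / self.votes_total


# Invented example values, not real results.
example = Precinct(
    state="XX",
    county_fips="00000",
    votes_harris=420,
    votes_trump=380,
    votes_total=820,
)

print(example.other_votes(), round(example.margin_points(), 1))
```

Keeping the county FIPS code as a string, rather than an integer, preserves any leading zeros.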

Jan 2025

ISS telemetry.

The International Space Station beams home a wide range of measurements: cabin temperature, solar array angles, spacesuit power supply, wastewater tank capacity, oxygen production rate, and much more. NASA, in collaboration with Lightstreamer, provides a feed of these measurements. A team developing a live 3D model of the station has also published a couple of dashboards of the realtime data, historical data going back to 2018, and a data dictionary. [h/t ajdud + AIorNot]

Jan 2025

NEA writing fellowships.

A team led by English professor Alexander Manshel has compiled a dataset of every recipient of the National Endowment for the Arts’ fellowship for creative writing, “from the organization’s founding in 1965 to 2024, including information about those writers’ demographics, education, and geography.” The dataset, which lists 3,700+ recipients, is based on the NEA’s own directory and a 2006 report, as well as “author biographies and websites, institutional websites, interviews, encyclopedias, literary criticism, and literary journalism.” [h/t Melanie Walsh + Derek Willis]

Jan 2025

AI governance documents.

The Emerging Technology Observatory’s AGORA “is a living collection of AI-relevant laws, regulations, standards, and other governance documents from the United States and around the world.” The dataset, available to download and explore online, provides the full text, metadata (e.g., jurisdiction, title, relevant dates), summaries, and thematic tags for 600+ documents. The project currently “skews toward U.S. law and policy” but is aiming “to broaden coverage of U.S. state documents […] and to broaden coverage of Chinese central government documents and major corporate commitments.”

Jan 2025

Opioid settlement spending.

KFF Health News, working with researchers at Johns Hopkins and Shatterproof, has published “a first-of-its kind database” tracking how states and local governments are using the billions of dollars received via opioid settlements in recent years. The database, drawing from “dozens of interviews, thousands of pages of documents, an array of public records requests, and outreach to all 50 states,” represents “the most comprehensive resource to date tracking some of the largest public health settlements in American history.” For each state, it indicates the total funds received in 2022-23, amount committed or spent in various categories (e.g., prevention, treatment, recovery), amount set aside, and amount “untrackable via public reports.” It also catalogs 7,000+ specific spending decisions: funder, destination, purpose, and amount. Previously: Opioid settlement payouts (DIP 2024.04.10). [h/t Aneri Pattani]

Jan 2025

Overdose demographics.

Since mid-2024, reporters at the Baltimore Banner have been publishing a series examining the city’s overdose crisis — reporting supported by The New York Times’ Local Investigations Fellowship and Stanford’s Big Local News. Last month the team partnered with The Upshot and a range of local news organizations to examine a stark phenomenon: In dozens of US counties, “Black men born between 1951 and 1970 have died of overdose at exceptionally high rates for decades.” They’ve published the supporting data, which list overdose death counts and rates by year, county, race/ethnicity, sex, and age group. The data, based on restricted-use records from the CDC, cover the years 1989 to 2022 for “the 408 U.S. counties that had 200 or more overdose deaths between 2018 and 2022”. [h/t Cheryl Phillips + Kimi Yoshino]

Dec 2024

A long-running ultramarathon.

The Comrades Marathon, first run in 1921, is considered “the oldest and largest ultramarathon in the world.” The route stretches 80+ kilometers between Durban and Pietermaritzburg, flipping annually between “up” and “down” directions. In 2019, Kyle Stratton scraped the official website to construct a dataset of all 445,000+ finishers (year, name, country, club, category, finishing time, medal received) through that year. Related: The Association of Road Racing Statisticians’ lists of longest-running marathons and ultramarathons, last updated in 2017. As seen in: Antony Unwin’s Getting (more out of) Graphics.

Dec 2024

Serbian political party funds.

The Center for Investigative Journalism of Serbia’s Party Funds database “tracks all reported incomes and expenses of 40 political parties and citizens’ groups in Serbia over the past nine years.” The records, based on financial disclosure reports, can be browsed online, searched, and downloaded. They indicate revenues, overhead costs, ad spending, salary expenditures, and more. The data specify each line item’s year, amount, purpose, and other context-dependent details. [h/t Teodora Ćurčić]

Dec 2024

Crop rotations.

The Department of Agriculture’s Crop Sequence Boundaries initiative algorithmically analyzes satellite imagery to create “estimates of field boundaries, crop acreage, and crop rotations across the contiguous United States.” The results are available via an interactive map and downloads for eight-year time frames. The underlying code is open-source and can be used to generate datasets for custom time frames. Previously: The USDA’s CropScape tool and Cropland Data Layer (DIP 2019.03.06). [h/t Forest Gregg]

Dec 2024

Food safety alerts.

Data journalist Adrian Nesta is building an automated pipeline to collect and standardize data on food safety recalls and alerts from two US federal agencies: the FDA and the USDA. For each alert, the standardized dataset indicates the notice’s title, ID, URL, and time posted, as well as the product description, company name, brand name, recall type, recall reason, impacted states, risk level, and more.

Dec 2024

Quits and layoffs.

Minneapolis Fed–affiliated economists Kathrin Ellieroth and Amanda Michaud have constructed a new dataset on monthly quits and layoffs. Using Current Population Survey (CPS) microdata going back to 1978, the dataset estimates the proportions of employees who, after quitting or being laid off, transition to unemployment versus exiting the labor market. In a recent article, Ellieroth and Michaud note that “CPS data offer a perspective not seen in the most-often-used series on quits and layoffs, the Job Openings and Labor Turnover Survey (JOLTS),” featured in DIP 2022.09.21. “Whereas the JOLTS tracks what happens to a job, the CPS tracks what happens to people.” Analyzing it, they found “that increases in unemployment are typically not due to increases in layoffs; rather, they happen because laid-off workers are less likely to quickly find a new job, more likely to stay in the labor force, and thus more likely to join a growing pool of unemployed people hunting for work.” [h/t Alex Albright]