Data is Plural archive

May 2024

Agatha Christie’s bibliography.

Nicole Mark has compiled a dataset of Agatha Christie’s published stories, covering 75 novels, 154 short stories, and 22 short story collections. The spreadsheets provide each work’s title and the character-based series to which it belongs (e.g., Hercule Poirot, Miss Marple, etc.). The novel and collection entries also indicate their year of initial publication, while the short-story entries list the collections that included them.

May 2024

Openly-licensed video transcripts.

The YouTube-Commons dataset, built by a French startup, contains 15 million original and auto-translated audio transcripts from 2 million Creative Commons–licensed YouTube videos, sourced from 400,000+ channels. The dataset indicates each video’s YouTube ID, title, channel, and date, as well as each transcript’s original language, translated language, word count, and character count. Translations are available primarily in Dutch, English, French, German, Italian, Russian, and Spanish. [h/t Data Machina]

May 2024

Biological numbers.

BioNumbers wants “you to find in one minute any useful molecular biology number that can be important for your research.” As its creators Ron Milo et al. described in 2010, those numbers “range from cell sizes to metabolite concentrations, from reaction rates to generation times, from genome sizes to the number of mitochondria in a cell.” You can search, browse, and download more than 14,000 entries. Each includes a number and/or range, units and method of measurement, relevant organism, and source. For instance: The diameter of an E. coli cell is 1-1.1 micrometers, the lifespan of a human red blood cell is 70-140 days, and a chicken’s genome has 1.05 billion base pairs.

May 2024

Historical markers.

Launched in 2006, the Historical Marker Database “is an illustrated searchable online catalog of historical information viewed through the filter of […] permanent outdoor markers, monuments, and plaques.” The crowdsourced project has documented 195,000+ markers in the US, plus thousands more in Canada, Mexico, the UK, and elsewhere. You can browse them by location and by topic, and download data corresponding to each collection. You can also search by person, keyword, historical date, and other attributes. As seen in: “Historical markers are everywhere in America. Some get history wrong,” by NPR’s Laura Sullivan and Nick McMillan, with associated data analysis. [h/t Walt Hickey]

May 2024

Prison commissary prices.

Through a series of public records requests, reporters at The Appeal have constructed the “first national database of prison commissary lists,” based on documents provided by 46 states. The database contains three tables. The first links to, and provides metadata about, each list. The second summarizes each state’s prices for two dozen types of products — such as ramen, toothpaste, and rosary beads — across three categories: food, personal care/hygiene, and religious items. The third table provides 2,200+ commissary-specific prices for those products. Read more: “Locked In, Priced Out: How Prison Commissary Price-Gouging Preys on the Incarcerated,” by reporters Elizabeth Weill-Greenberg and Ethan Corey. [h/t JQ Whitcomb]

Apr 2024

Three-dimensional natural history.

At MorphoSource, you can “find, view, and download 3D data representing the world’s natural history, cultural heritage, and scientific collections.” The service hosts 82,000+ 3D models you can view online; roughly half can be downloaded without prior approval. As described in a recent BioScience article, the Florida Museum of Natural History’s openVertebrate project used MorphoSource to share 3D models and volumetric CT scans of 10,000+ amphibian, reptile, fish, bird, and mammal specimens. [h/t Duncan Geere]

Apr 2024

Food-system indicators.

The Food Systems Countdown Initiative produces “annual publications to measure, assess, and track the performance of global food systems toward 2030 and the conclusion of the Sustainable Development Goals.” As part of that work, its Food Systems Dashboard incorporates 275 metrics from dozens of sources about countries’ food availability, affordability, supply chains, safety, and related topics. Examples include protein supply per capita, number of supermarkets per capita, trans fat regulations, and food safety poll results. [h/t Jan Willem Tulp]

Apr 2024

Wholesale electricity markets.

The US Energy Information Administration recently launched a Wholesale Electricity Market Portal, providing visualizations and downloads of market data from regional transmission organizations (RTOs) and independent system operators (ISOs) — the country’s electric-grid coordinating entities. The data include “day-ahead” electricity prices, real-time prices, actual/forecasted load and demand, fuel mix by time, and local temperatures. Learn more: A 30-minute introductory video from the EIA.

Apr 2024

Chain locations and more.

All The Places is “a growing set of web scrapers designed to output consistent geodata about as many places of business in the world as possible.” During its latest weekly run, the project’s 2,400+ open-source scrapers collected data on nearly 5 million locations. They include postal collection boxes, ATMs, various fast food chains and chain stores, gas stations, and more. The results are available on an interactive map, to download in bulk, and by location type. Related: Journalist Matt Stiles maintains a collection of US-focused scrapers gathering the locations of dozens of chain stores and restaurants, with Python notebooks for each scraper. [h/t Forest Gregg + Sharon Machlis]

Apr 2024

Public procurement.

The Global Public Procurement Dataset provides standardized data on 72 million government contracts in 42 countries. The dataset, constructed from official sources by the Budapest-based Government Transparency Institute, represents $17 trillion in total procurement. It begins in the 2000s for most of the countries and concludes in 2021. For each contract, it provides information about the tender (title, procedure type, product code, publication date, award date, final price, currency, etc.), government buyer, bidders’ names and locations, and more. The downloadable files are split into two repositories. The US, Italy, Brazil, Poland, and Colombia have the most contracts represented. Previously: The Open Contracting Partnership’s data registry (DIP 2023.03.08) and data standard (DIP 2020.02.26).

Apr 2024

Aerial obstacles.

The Federal Aviation Administration’s Obstacles Team “investigates and evaluates existing obstacles that may be hazardous to safe flight navigation,” such as tall buildings, windmills, water tanks, utility poles, amusement parks, monuments, blimps, and other structures. Its Daily Digital Obstacle File contains 580,000+ entries, which it says provides full coverage of the US and partial coverage of Canada, Mexico, the Pacific, and the Caribbean. It lists each obstacle’s type, country, state, city, coordinates, height, type of lighting, and more. [h/t Michael Allen]

Apr 2024

Automated decision-making in government.

The UK nonprofit Public Law Project last year launched the Tracking Automated Government Register, which describes automated systems that government agencies there use “to make or inform decisions on a range of sensitive policy areas, including how people are policed, what benefits they receive, and their immigration status.” It currently lists 55 systems, along with their names, purposes, agencies in charge, policy areas, transparency level, potential unequal impacts, and more. Last week, Western University’s Joanna Redden and colleagues launched a version for Canada, listing 303 systems.

Apr 2024

Border crossings.

Michael R. Kenwick et al.’s Border Crossings of the World dataset “explores state authority spatially by collecting information about infrastructure built where highways cross internationally recognized borders.” Using satellite imagery, the team’s researchers identified the locations of gates, official buildings, and split-lane inspection facilities annually from the 1990s onward. An accompanying dataset calculates a “border orientation” score that summarizes the “extent to which the State is committed to the spatial display of capacities to control the terms of penetration of its national borders.” [h/t Erik Gahner Larsen]

Apr 2024

Unregulated water contaminants.

Through the Unregulated Contaminant Monitoring Rule, the Environmental Protection Agency “collect[s] data for contaminants that are suspected to be present in drinking water and do not have health-based standards set under the Safe Drinking Water Act.” The current version of the rule requires public water systems to test for lithium and 29 per- and polyfluoroalkyl substances, better known as PFAS. (Prior iterations, which go back to the early 2000s, have tested for other contaminants.) The EPA publishes the data collected, with the most detailed files listing each sample, its public water system, facility, sampling point, sample date, contaminant tested, concentration detected, and more. Related: Last week, the EPA finalized its first-ever limits for PFAS in drinking water. [h/t Lisa Sorg]

Apr 2024

Things flung spaceward.

The UN’s Office for Outer Space Affairs maintains an Online Index of Objects Launched into Outer Space, based on a similarly-named register, itself mandated by a similarly-named convention, which went into force in 1976. Presented as an HTML table, the index lists the names, launching countries, statuses, launch dates, and more for 17,000+ satellites, spacecraft, probes, and other objects. As seen in: Our World In Data’s chart of the annual number of objects launched. [h/t Chartr]

Apr 2024

Colonial empire timelines.

The Colonial Dates Dataset, compiled by political scientist Bastian Becker, “aggregates information on the reach and duration of European colonial empires from renowned secondary sources.” Producing the dataset from its four main sources “is largely automated, relying on predefined coding rules.” The dataset indicates the first and last years each contemporary country was colonized, disaggregated by eight colonizing countries (Belgium, Britain, France, Germany, Italy, Netherlands, Portugal, and Spain). [h/t Erik Gahner Larsen]

Apr 2024

Work-injury laws.

Nate Breznau and Felix Lanver’s Global Work-Injury Policy Database tracks the history of work-injury laws (also known as workers’ compensation laws) in 186 “independent nation states.” For each, the database lists the year of its first such law, the year a law first provided insurance for work-related injuries, the type of program it established, and more. It also incorporates data on current laws’ coverage and payment rates from Kenneth Nelson et al.’s Social Insurance Entitlements dataset.

Apr 2024

Groundwater wells.

Despite the infrastructural importance of groundwater wells, “a unified database collecting and standardizing information on the characteristics and locations of these wells across the United States has been lacking,” Chung-Yi Lin et al. write. “To bridge this gap, we have created a comprehensive database of groundwater well records collected from state and federal agencies.” Their United States Groundwater Well Database contains ~14 million records, each indicating a well equipped for monitoring or extracting water. Where available, each row lists the well’s coordinates, county, aquifer, watershed, depth, capacity, water use category, water potability, and more. [h/t Derek M. Jones]

Apr 2024

Opioid settlement payouts.

KFF Health News has been following the slew of legal settlements by companies accused of exacerbating the opioid crisis. Last week, they published an interactive and downloadable database of payouts to state and local governments so far (and expected in the future) from the largest national settlement, “a $26 billion deal with four companies that will be paid out over nearly two decades.” The data, gathered from the court-appointed firm administering the settlement, include overall numbers for 48 states and DC, plus locality-level data for 35 states. (Los Angeles County, for instance, received $47 million in 2022 and 2023, with another $210 million expected in years to come.) Learn more: A webinar scheduled for tomorrow, in which reporter Aneri Pattani “will discuss the data and how it can help you launch into coverage of the historic opioid settlement story.”

Apr 2024

Deaths in plague-era London.

Death by Numbers, also known as the Bills of Mortality Project, aims to transcribe ~8,000 official weekly tallies of deaths in London published in the 1600s and 1700s. Initially focused on plague deaths, the reports expanded to “dozens of other causes of death, such as childbirth, measles, syphilis, and suicide, ensuring their continued publication for decades after the final outbreak of plague in England.” The project’s data are available to browse online, to download, and via API. [h/t Derek M. Jones + Cody Winchester]

Apr 2024

LLM data provenance.

The Data Provenance Initiative “is a multi-disciplinary volunteer effort to improve transparency, documentation, and responsible use of training datasets for AI.” Its first release, the Data Provenance Collection, catalogs dozens of corpora used for fine-tuning large language models, as well as their component datasets’ names, task categories, known sources, licensing, various text metrics, and more. Related: Yang Liu et al.’s “Datasets for Large Language Models: A Comprehensive Survey,” accompanied by semi-structured descriptions of hundreds of training and evaluation datasets. [h/t u/cavedave]

Apr 2024

Candid animals.

The Labeled Information Library of Alexandria data repository is “intended as a resource for both machine learning (ML) researchers and those that want to harness ML for biology and conservation.” Its datasets include millions of images, mostly captured by motion-triggered cameras. Its North American Camera Trap Images dataset, for instance, “contains 3.7M camera trap images from five locations across the United States, with labels for 28 animal categories, primarily at the species level.” Read more: “Machine learning to classify animal species in camera trap images: Applications in ecology.” [h/t Corin Faife]

Apr 2024

European Parliament activity.

Parltrack keeps tabs on 4,000+ active and prior members of the European Parliament, 23,000+ policy dossiers, 39,000+ votes, and much more. The project, launched in 2011, scrapes data from various official websites and links it together — so that you can see, for example, any given member’s dossiers, committee roles, and activities such as plenary speeches and proposed legislative amendments. Its bulk datasets are updated daily and include details beyond what the online interfaces offer. [h/t Stefan Marsiske]

Apr 2024

Power outages.

Christa Brelsford et al. have compiled county-level estimates of the number of US customers experiencing power outages at 15-minute intervals from 2014 to 2023. The records come from Oak Ridge National Laboratory’s restricted-access Environment for Analysis of Geo-Located Energy Information, a “platform created to monitor electric utility customer outages from data gathered from public sources.” The data’s coverage has increased over time; by 2022, it represented 92% of customers in the 50 states, DC, and Puerto Rico. “The remaining 8% of customers belong to utilities which do not report outage information publicly in near-real time in a format that is currently accessible to EAGLE-I parsers,” the authors write. “These are most typically small, rural, municipal utilities which lack robust information technology infrastructure.”

Mar 2024

Rolling Stone’s album rankings.

A new visual essay from The Pudding compares Rolling Stone’s “500 Greatest Albums of All Time” lists from 2003, 2012, and 2020. A methodology note says the project began with a spreadsheet by Chris Eckert and eventually led the authors to develop a dataset of their own. Theirs lists every album in the rankings — its name, genre, release year, 2003/2012/2020 rank, the artist’s name, birth year, gender, and more — plus each year’s voters. [h/t Jason Kottke]

Mar 2024

Agri-environmental policies.

David Wuepper et al. have constructed a dataset of 6,000+ policies between 1960 and 2022 “at the intersection of agriculture and the environment, implemented not only by national entities but also by subnational and supranational entities, covering different instruments (for example, regulations, frameworks, payment programmes) and topics,” such as the US Safe Drinking Water Act, the Bavarian Forestry Law, and Tanzania’s 2009 Wildlife Conservation Act. Each entry lists the policy’s country, title, type, keywords, year implemented, description, and other details.

Mar 2024

Extrajudicial killings in Bangladesh.

Between 2009 and 2022, “Bangladesh’s security forces killed at least 2,597 people in apparent extrajudicial executions, custodial torture, and by firing bullets at protesters,” according to Nazmul Ahasan’s analysis for Netra News, building on data “compiled by Bangladeshi human rights defenders and collated by the Australia-based Capital Punishment Justice Project.” Ahasan and colleagues “independently verified more than 98% of the cases in the dataset using press reports and subsequently updated any incomplete data.” The records are available as a table in the article and as a JSON file. Each entry includes the victim’s name (if known), incident date, description, location, agencies involved, purported justification, and news source. As seen in: The 2024 Sigma Awards.

Mar 2024

Human development, indexed.

Perhaps the best-known metric of its kind, the United Nations’ Human Development Index combines statistics on life expectancy, income per capita, and years of schooling into a single number for each country-year. The UN provides downloads and an API for all annual HDI ratings and sub-components for 1990 to 2022. Those resources also feature data from related indices, such as the Inequality-adjusted Human Development Index, Gender Development Index, and Gender Inequality Index. [h/t Michael A. Rice]
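
For a rough sense of how the three ingredients become one number, here is a minimal sketch. It assumes the method described in the UN's technical notes as we understand it: each statistic is rescaled between fixed "goalposts," income is log-transformed, and the three dimension indices are combined with a geometric mean. The goalpost values, the function names, and the choice to clamp every index to the 0-1 range are assumptions made for illustration; check the current technical notes before relying on any of them.

```python
import math

# Goalposts (minimum, maximum) are assumptions for illustration;
# verify them against the current UN technical notes.
GOALPOSTS = {
    "life_expectancy": (20.0, 85.0),      # years
    "expected_schooling": (0.0, 18.0),    # years
    "mean_schooling": (0.0, 15.0),        # years
    "gni_per_capita": (100.0, 75000.0),   # PPP dollars
}


def dimension_index(value, low, high):
    """Rescale value between the goalposts, clamped to the 0-1 range."""
    return min(max((value - low) / (high - low), 0.0), 1.0)


def hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    """Geometric mean of health, education, and income indices.

    gni_per_capita must be positive because it is log-transformed.
    """
    health = dimension_index(life_expectancy, *GOALPOSTS["life_expectancy"])
    # Education averages the two schooling indices.
    education = (
        dimension_index(expected_schooling, *GOALPOSTS["expected_schooling"])
        + dimension_index(mean_schooling, *GOALPOSTS["mean_schooling"])
    ) / 2
    low, high = GOALPOSTS["gni_per_capita"]
    income = dimension_index(
        math.log(gni_per_capita), math.log(low), math.log(high)
    )
    return (health * education * income) ** (1 / 3)


# Hypothetical inputs, not any real country's figures.
example = hdi(
    life_expectancy=70.0,
    expected_schooling=12.0,
    mean_schooling=8.0,
    gni_per_capita=10000.0,
)
```

Because the combination is a geometric mean, a very low score on any one dimension pulls the overall figure down more than a simple average would.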

Mar 2024

Institutional investments.

If you’re an institutional investor with US operations and you manage at least $100 million in publicly traded securities, the Securities and Exchange Commission requires you to file Form 13F each quarter. (The biggest filers — such as Vanguard, BlackRock, and State Street — have trillions of dollars invested.) These filings, available to download going back to mid-2013, detail each investor’s long positions for each security: their number of shares, market value, security type, issuer name, CUSIP code, and more. As seen in: Michigan teenager Anonyo Noor’s wallstreetlocal.com, which aggregates the data, matches it to additional information, and provides a search interface.