Data is Plural archive

May 2024

Los Angeles street trees.

Journalist Matt Stiles has been using public records requests and official portals to compile data on 1.6 million street trees in 40+ Los Angeles County municipalities. The information varies by city but generally includes the tree’s coordinates and species, often also with measurements such as height and trunk diameter. Previously: Street trees in DIP 2022.09.07, DIP 2020.11.18, DIP 2018.08.08, and DIP 2016.11.16.

May 2024

England sewage discharge.

The UK’s Environment Agency collects data from utility companies regarding sewage-discharging storm overflows in England. The records, available for 2020–2023, list every reported overflow event, its timing and location, number of detected discharges, discharge points, and much more. Related: Wales overflow data are available from other sources. As seen in: Maps from The Rivers Trust, Surfers Against Sewage, and The Guardian. [h/t Giuseppe Sollazzo + James Cheshire + Hugh Graham]

May 2024

1 million ChatGPT conversations.

The WildChat Dataset, constructed by Wenting Zhao et al., “is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts.” The researchers, primarily affiliated with Cornell and the Allen Institute for AI, built it “by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.” Each of the 1 million rows in the dataset represents a conversation and provides its text, main language, timestamp of its conclusion, underlying model used, moderation results, inferred country, and more. [h/t Data Machina]

May 2024

Greenhouse gas giants.

Carbon Majors, run by the UK-based InfluenceMap, “is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers.” It attributes to these producers 1,421 metric gigatons of CO2-equivalent emissions from 1854 through 2022. Launched last month, the database provides downloads at several levels of granularity. The least granular version indicates the emissions calculated for each entity-year combination. The most granular version breaks those emissions down by commodity produced, quantity of commodity produced, emitting activity, and reporting entity.

May 2024

Unaccompanied migrant children.

The New York Times’ Hannah Dreier was awarded a Pulitzer Prize last week for a “series of stories revealing the stunning reach of migrant child labor across the United States—and the corporate and governmental failures that perpetuate it.” That reporting, Dreier has noted, was partly driven by data she obtained from the Department of Health and Human Services via a FOIA request and lawsuit. The dataset, published in December along with visualizations, describes the placement of 550,000+ unaccompanied migrant children with local sponsors between January 2015 and May 2023: each child’s country of origin, gender, date of entry, and date of release to a sponsor, plus the sponsor’s ZIP code and relationship to the child. Read more: “The data pointed to spots I never would have thought of: Flandreau, South Dakota; Parksley, Virginia; Bozeman, Montana,” Dreier says.

May 2024

Poisonous book bindings.

As recently as the 1800s, green pigments containing arsenic were in fairly wide use. The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used these colorants in their covers and other binding components. The database lists each book’s title, author, imprint, publication year, arsenical material, testing method, and owner. [h/t Tom Merritt Smith]

May 2024

Video gaming layoffs.

Game Industry Layoffs, run by @dekaf, lists known staffing cuts at video game studios, publishers, and related companies. Each entry provides the company’s name, type, parent company (for subsidiaries), layoff date, estimated number of employees laid off, and source link. In 2023, the site tallied 10,000+ employees affected by 170+ layoffs; in 2024 so far, the employee count is already nearly that high, spread across ~100 layoffs. As seen in: “Visualizing Games Industry Layoffs,” by Ben Oldenburg. [h/t Vivien Serve]

May 2024

Global entrepreneurship.

The Global Entrepreneurship Monitor, active since 1999, considers itself “the world’s foremost study of entrepreneurship.” A collaboration between Babson College and the London Business School, the project publishes national and response-level data from two main surveys. The Adult Population Survey, answered over the years by millions of respondents around the world, examines “the characteristics, motivations and ambitions of individuals starting businesses, as well as social attitudes towards entrepreneurship.” The National Expert Survey, meanwhile, assesses factors such as access to financing, physical infrastructure, and government support.

May 2024

Emergency room visits.

The CDC’s National Hospital Ambulatory Medical Care Survey, conducted annually since 1992, is based on patient records from a strategic sample of emergency department visits in “noninstitutional general and short-stay hospitals.” Its public, anonymized data indicate the patient’s time of arrival, length of wait, and length of visit; their demographic information, vital signs, and reasons for visiting; the hospital’s diagnoses, medications given, tests and procedures conducted; and much more. The most recent release includes 16,000+ visits that occurred in 2021. As seen in: “Grabbing the NHAMCS emergency room data in python,” a blog post by Andrew P. Wheeler.

May 2024

US greenhouse gas accounting.

The Inventory of U.S. Greenhouse Gas Emissions and Sinks, published annually by the EPA, “provides a comprehensive accounting of total greenhouse gas emissions for all man-made sources in the United States,” as well as the removal of carbon dioxide from the atmosphere attributable to “land use, land-use change, and forestry.” The reports contain a slew of data tables. Those in the latest edition, which covers 1990 through 2022, include total emissions by year and economic sector; transportation-related emissions by vehicle type and gas emitted; removals by land-use category; and more. Previously: EPA’s facility-level emissions data (DIP 2023.09.20). [h/t Ben Young et al.]

May 2024

Agatha Christie’s bibliography.

Nicole Mark has compiled a dataset of Agatha Christie’s published stories, covering 75 novels, 154 short stories, and 22 short story collections. The spreadsheets provide each work’s title and the character-based series to which it belongs (e.g., Hercule Poirot, Miss Marple, etc.). The novel and collection entries also indicate their year of initial publication, while the short-story entries list the collections that included them.

May 2024

Openly-licensed video transcripts.

The YouTube-Commons dataset, built by a French startup, contains 15 million original and auto-translated audio transcripts from 2 million Creative Commons–licensed YouTube videos, sourced from 400,000+ channels. The dataset indicates each video’s YouTube ID, title, channel, and date, as well as each transcript’s original language, translated language, word count, and character count. Translations are available primarily in Dutch, English, French, German, Italian, Russian, and Spanish. [h/t Data Machina]

May 2024

Biological numbers.

BioNumbers wants “you to find in one minute any useful molecular biology number that can be important for your research.” As its creators Ron Milo et al. described in 2010, those numbers “range from cell sizes to metabolite concentrations, from reaction rates to generation times, from genome sizes to the number of mitochondria in a cell.” You can search, browse, and download more than 14,000 entries. Each includes a number and/or range, units and method of measurement, relevant organism, and source. For instance: The diameter of an E. coli cell is 1-1.1 micrometers, the lifespan of a human red blood cell is 70-140 days, and a chicken’s genome has 1.05 billion base pairs.

May 2024

Historical markers.

Launched in 2006, the Historical Marker Database “is an illustrated searchable online catalog of historical information viewed through the filter of […] permanent outdoor markers, monuments, and plaques.” The crowdsourced project has documented 195,000+ markers in the US, plus thousands more in Canada, Mexico, the UK, and elsewhere. You can browse them by location and by topic, and download data corresponding to each collection. You can also search by person, keyword, historical date, and other attributes. As seen in: “Historical markers are everywhere in America. Some get history wrong,” by NPR’s Laura Sullivan and Nick McMillan, with associated data analysis. [h/t Walt Hickey]

May 2024

Prison commissary prices.

Through a series of public records requests, reporters at The Appeal have constructed the “first national database of prison commissary lists,” based on documents provided by 46 states. The database contains three tables. The first links to, and provides metadata about, each list. The second summarizes each state’s prices for two dozen types of products — such as ramen, toothpaste, and rosary beads — across three categories: food, personal care/hygiene, and religious items. The third table provides 2,200+ commissary-specific prices for those products. Read more: “Locked In, Priced Out: How Prison Commissary Price-Gouging Preys on the Incarcerated,” by reporters Elizabeth Weill-Greenberg and Ethan Corey. [h/t JQ Whitcomb]

Apr 2024

Three-dimensional natural history.

At MorphoSource, you can “find, view, and download 3D data representing the world’s natural history, cultural heritage, and scientific collections.” The service hosts 82,000+ 3D models you can view online; roughly half can be downloaded without prior approval. As described in a recent BioScience article, the Florida Museum of Natural History’s openVertebrate project used MorphoSource to share 3D models and volumetric CT scans of 10,000+ amphibian, reptile, fish, bird, and mammal specimens. [h/t Duncan Geere]

Apr 2024

Food-system indicators.

The Food Systems Countdown Initiative produces “annual publications to measure, assess, and track the performance of global food systems toward 2030 and the conclusion of the Sustainable Development Goals.” As part of that work, its Food Systems Dashboard incorporates 275 metrics from dozens of sources about countries’ food availability, affordability, supply chains, safety, and related topics. Examples include protein supply per capita, number of supermarkets per capita, trans fat regulations, and food safety poll results. [h/t Jan Willem Tulp]

Apr 2024

Wholesale electricity markets.

The US Energy Information Administration recently launched a Wholesale Electricity Market Portal, providing visualizations and downloads of market data from regional transmission organizations (RTOs) and independent system operators (ISOs) — the country’s electric-grid coordinating entities. The data include “day-ahead” electricity prices, real-time prices, actual/forecasted load and demand, fuel mix by time, and local temperatures. Learn more: A 30-minute introductory video from the EIA.

Apr 2024

Chain locations and more.

All The Places is “a growing set of web scrapers designed to output consistent geodata about as many places of business in the world as possible.” During its latest weekly run, the project’s 2,400+ open-source scrapers collected data on nearly 5 million locations. They include postal collection boxes, ATMs, various fast food chains and chain stores, gas stations, and more. The results are available on an interactive map, to download in bulk, and by location type. Related: Journalist Matt Stiles maintains a collection of US-focused scrapers gathering the locations of dozens of chain stores and restaurants, with Python notebooks for each scraper. [h/t Forest Gregg + Sharon Machlis]

Apr 2024

Public procurement.

The Global Public Procurement Dataset provides standardized data on 72 million government contracts in 42 countries. The dataset, constructed from official sources by the Budapest-based Government Transparency Institute, represents $17 trillion in total procurement. It begins in the 2000s for most of the countries and concludes in 2021. For each contract, it provides information about the tender (title, procedure type, product code, publication date, award date, final price, currency, etc.), government buyer, bidders’ names and locations, and more. The downloadable files are split into two repositories. The US, Italy, Brazil, Poland, and Colombia have the most contracts represented. Previously: The Open Contracting Partnership’s data registry (DIP 2023.03.08) and data standard (DIP 2020.02.26).

Apr 2024

Aerial obstacles.

The Federal Aviation Administration’s Obstacles Team “investigates and evaluates existing obstacles that may be hazardous to safe flight navigation,” such as tall buildings, windmills, water tanks, utility poles, amusement parks, monuments, blimps, and other structures. Its Daily Digital Obstacle File contains 580,000+ entries and, according to the team, provides full coverage of the US and partial coverage of Canada, Mexico, the Pacific, and the Caribbean. It lists each obstacle’s type, country, state, city, coordinates, height, type of lighting, and more. [h/t Michael Allen]

Apr 2024

Automated decision-making in government.

The UK nonprofit Public Law Project last year launched the Tracking Automated Government Register, which describes automated systems that government agencies there use “to make or inform decisions on a range of sensitive policy areas, including how people are policed, what benefits they receive, and their immigration status.” It currently lists 55 systems, along with their names, purposes, agencies in charge, policy areas, transparency level, potential unequal impacts, and more. Last week, Western University’s Joanna Redden and colleagues launched a version for Canada, listing 303 systems.

Apr 2024

Border crossings.

Michael R. Kenwick et al.’s Border Crossings of the World dataset “explores state authority spatially by collecting information about infrastructure built where highways cross internationally recognized borders.” Using satellite imagery, the team’s researchers identified the locations of gates, official buildings, and split-lane inspection facilities annually from the 1990s onward. An accompanying dataset calculates a “border orientation” score that summarizes the “extent to which the State is committed to the spatial display of capacities to control the terms of penetration of its national borders.” [h/t Erik Gahner Larsen]

Apr 2024

Unregulated water contaminants.

Through the Unregulated Contaminant Monitoring Rule, the Environmental Protection Agency “collect[s] data for contaminants that are suspected to be present in drinking water and do not have health-based standards set under the Safe Drinking Water Act.” The current version of the rule requires public water systems to test for lithium and 29 per- and polyfluoroalkyl substances, better known as PFAS. (Prior iterations, which go back to the early 2000s, have tested for other contaminants.) The EPA publishes the data collected, with the most detailed files listing each sample, its public water system, facility, sampling point, sample date, contaminant tested, concentration detected, and more. Related: Last week, the EPA finalized its first-ever limits for PFAS in drinking water. [h/t Lisa Sorg]

Apr 2024

Things flung spaceward.

The UN’s Office for Outer Space Affairs maintains an Online Index of Objects Launched into Outer Space, based on a similarly-named register, itself mandated by a similarly-named convention, which went into force in 1976. Presented as an HTML table, the index covers 17,000+ satellites, spacecraft, probes, and other objects, listing their names, launching countries, statuses, launch dates, and more. As seen in: Our World In Data’s chart of the annual number of objects launched. [h/t Chartr]

Apr 2024

Colonial empire timelines.

The Colonial Dates Dataset, compiled by political scientist Bastian Becker, “aggregates information on the reach and duration of European colonial empires from renowned secondary sources.” Producing the dataset from its four main sources “is largely automated, relying on predefined coding rules.” The dataset indicates the first and last years each contemporary country was colonized, disaggregated by eight colonizing countries (Belgium, Britain, France, Germany, Italy, Netherlands, Portugal, and Spain). [h/t Erik Gahner Larsen]

Apr 2024

Work-injury laws.

Nate Breznau and Felix Lanver’s Global Work-Injury Policy Database tracks the history of work-injury laws (also known as workers’ compensation laws) in 186 “independent nation states.” For each, the database lists the year of its first such law, the year a law first provided insurance for work-related injuries, the type of program it established, and more. It also incorporates data on current laws’ coverage and payment rates from Kenneth Nelson et al.’s Social Insurance Entitlements dataset.

Apr 2024

Groundwater wells.

Despite the infrastructural importance of groundwater wells, “a unified database collecting and standardizing information on the characteristics and locations of these wells across the United States has been lacking,” Chung-Yi Lin et al. write. “To bridge this gap, we have created a comprehensive database of groundwater well records collected from state and federal agencies.” Their United States Groundwater Well Database contains ~14 million records, each indicating a well equipped for monitoring or extracting water. Where available, each row lists the well’s coordinates, county, aquifer, watershed, depth, capacity, water use category, water potability, and more. [h/t Derek M. Jones]

Apr 2024

Opioid settlement payouts.

KFF Health News has been following the slew of legal settlements by companies accused of exacerbating the opioid crisis. Last week, they published an interactive and downloadable database of payouts to state and local governments so far (and expected in the future) from the largest national settlement, “a $26 billion deal with four companies that will be paid out over nearly two decades.” The data, gathered from the court-appointed firm administering the settlement, include overall numbers for 48 states and DC, plus locality-level data for 35 states. (Los Angeles County, for instance, received $47 million in 2022 and 2023, with another $210 million expected in years to come.) Learn more: A webinar scheduled for tomorrow, in which reporter Aneri Pattani “will discuss the data and how it can help you launch into coverage of the historic opioid settlement story.”