Data is Plural archive

Apr 2024

Things flung spaceward.

The UN’s Office for Outer Space Affairs maintains an Online Index of Objects Launched into Outer Space, based on a similarly-named register, itself mandated by a similarly-named convention, which went into force in 1976. Presented as an HTML table, the index lists the names, launching countries, statuses, launch dates, and more for 17,000+ satellites, spacecraft, probes, and other objects. As seen in: Our World In Data’s chart of the annual number of objects launched. [h/t Chartr]

Apr 2024

Colonial empire timelines.

The Colonial Dates Dataset, compiled by political scientist Bastian Becker, “aggregates information on the reach and duration of European colonial empires from renowned secondary sources.” Producing the dataset from its four main sources “is largely automated, relying on predefined coding rules.” The dataset indicates the first and last years each contemporary country was colonized, disaggregated by eight colonizing countries (Belgium, Britain, France, Germany, Italy, Netherlands, Portugal, and Spain). [h/t Erik Gahner Larsen]

Apr 2024

Work-injury laws.

Nate Breznau and Felix Lanver’s Global Work-Injury Policy Database tracks the history of work-injury laws (also known as workers’ compensation laws) in 186 “independent nation states.” For each, the database lists the year of its first such law, the year a law first provided insurance for work-related injuries, the type of program it established, and more. It also incorporates data on current laws’ coverage and payment rates from Kenneth Nelson et al.’s Social Insurance Entitlements dataset.

Apr 2024

Groundwater wells.

Despite the infrastructural importance of groundwater wells, “a unified database collecting and standardizing information on the characteristics and locations of these wells across the United States has been lacking,” Chung-Yi Lin et al. write. “To bridge this gap, we have created a comprehensive database of groundwater well records collected from state and federal agencies.” Their United States Groundwater Well Database contains ~14 million records, each indicating a well equipped for monitoring or extracting water. Where available, each row lists the well’s coordinates, county, aquifer, watershed, depth, capacity, water use category, water potability, and more. [h/t Derek M. Jones]

Apr 2024

Opioid settlement payouts.

KFF Health News has been following the slew of legal settlements by companies accused of exacerbating the opioid crisis. Last week, they published an interactive and downloadable database of payouts to state and local governments so far (and expected in the future) from the largest national settlement, “a $26 billion deal with four companies that will be paid out over nearly two decades.” The data, gathered from the court-appointed firm administering the settlement, include overall numbers for 48 states and DC, plus locality-level data for 35 states. (Los Angeles County, for instance, received $47 million in 2022 and 2023, with another $210 million expected in years to come.) Learn more: A webinar scheduled for tomorrow, in which reporter Aneri Pattani “will discuss the data and how it can help you launch into coverage of the historic opioid settlement story.”

Apr 2024

Deaths in plague-era London.

Death by Numbers, also known as the Bills of Mortality Project, aims to transcribe ~8,000 official weekly tallies of deaths in London published in the 1600s and 1700s. Initially focused on plague deaths, the reports expanded to “dozens of other causes of death, such as childbirth, measles, syphilis, and suicide, ensuring their continued publication for decades after the final outbreak of plague in England.” The project’s data are available to browse online, to download, and via API. [h/t Derek M. Jones + Cody Winchester]

Apr 2024

LLM data provenance.

The Data Provenance Initiative “is a multi-disciplinary volunteer effort to improve transparency, documentation, and responsible use of training datasets for AI.” Its first release, the Data Provenance Collection, catalogs dozens of corpora used for fine-tuning large language models, as well as their component datasets’ names, task categories, known sources, licensing, various text metrics, and more. Related: Yang Liu et al.’s “Datasets for Large Language Models: A Comprehensive Survey,” accompanied by semi-structured descriptions of hundreds of training and evaluation datasets. [h/t u/cavedave]

Apr 2024

Candid animals.

The Labeled Information Library of Alexandria data repository is “intended as a resource for both machine learning (ML) researchers and those that want to harness ML for biology and conservation.” Its datasets include millions of images, mostly captured by motion-triggered cameras. Its North American Camera Trap Images dataset, for instance, “contains 3.7M camera trap images from five locations across the United States, with labels for 28 animal categories, primarily at the species level.” Read more: “Machine learning to classify animal species in camera trap images: Applications in ecology.” [h/t Corin Faife]

Apr 2024

European Parliament activity.

Parltrack keeps tabs on 4,000+ active and prior members of the European Parliament, 23,000+ policy dossiers, 39,000+ votes, and much more. The project, launched in 2011, scrapes data from various official websites and links it together — so that you can see, for example, any given member’s dossiers, committee roles, and activities such as plenary speeches and proposed legislative amendments. Its bulk datasets are updated daily and include details beyond what the online interfaces offer. [h/t Stefan Marsiske]

Apr 2024

Power outages.

Christa Brelsford et al. have compiled county-level estimates of the number of US customers experiencing power outages at 15-minute intervals from 2014 to 2023. The records come from Oak Ridge National Laboratory’s restricted-access Environment for Analysis of Geo-Located Energy Information, a “platform created to monitor electric utility customer outages from data gathered from public sources.” The data’s coverage has increased over time; by 2022, it represented 92% of customers in the 50 states, DC, and Puerto Rico. “The remaining 8% of customers belong to utilities which do not report outage information publicly in near-real time in a format that is currently accessible to EAGLE-I parsers,” the authors write. “These are most typically small, rural, municipal utilities which lack robust information technology infrastructure.”
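
For a sense of how 15-minute, county-level counts like these might be summarized, here is a minimal sketch that converts them into customer-hours without power. The row layout, county codes, and numbers are invented for illustration and are not the dataset's actual schema or values.

```python
from collections import defaultdict

# Hypothetical rows: (county code, timestamp, customers without power).
# These fields and values are assumptions, not the dataset's real columns.
rows = [
    ("47093", "2022-01-01 00:00", 120),
    ("47093", "2022-01-01 00:15", 80),
    ("47093", "2022-01-01 00:30", 0),
    ("47001", "2022-01-01 00:00", 40),
]

INTERVAL_HOURS = 0.25  # each record is assumed to cover one 15-minute window


def customer_hours(records):
    """Sum (customers out x interval length) for each county."""
    totals = defaultdict(float)
    for county, _timestamp, customers_out in records:
        totals[county] += customers_out * INTERVAL_HOURS
    return dict(totals)


totals = customer_hours(rows)
```

Treating each count as constant over its 15-minute window is a simplification; intervals missing from the data would need separate handling.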

Mar 2024

Rolling Stone’s album rankings.

A new visual essay from The Pudding compares Rolling Stone’s “500 Greatest Albums of All Time” lists from 2003, 2012, and 2020. A methodology note says the project began with a spreadsheet by Chris Eckert and eventually led the authors to develop a dataset of their own. Theirs lists every album in the rankings — its name, genre, release year, 2003/2012/2020 rank, the artist’s name, birth year, gender, and more — plus each year’s voters. [h/t Jason Kottke]

Mar 2024

Agri-environmental policies.

David Wuepper et al. have constructed a dataset of 6,000+ policies between 1960 and 2022 “at the intersection of agriculture and the environment, implemented not only by national entities but also by subnational and supranational entities, covering different instruments (for example, regulations, frameworks, payment programmes) and topics,” such as the US Safe Drinking Water Act, the Bavarian Forestry Law, and Tanzania’s 2009 Wildlife Conservation Act. Each entry lists the policy’s country, title, type, keywords, year implemented, description, and other details.

Mar 2024

Extrajudicial killings in Bangladesh.

Between 2009 and 2022, “Bangladesh’s security forces killed at least 2,597 people in apparent extrajudicial executions, custodial torture, and by firing bullets at protesters,” according to Nazmul Ahasan’s analysis for Netra News, building on data “compiled by Bangladeshi human rights defenders and collated by the Australia-based Capital Punishment Justice Project.” Ahasan and colleagues “independently verified more than 98% of the cases in the dataset using press reports and subsequently updated any incomplete data.” The records are available as a table in the article and as a JSON file. Each entry includes the victim’s name (if known), incident date, description, location, agencies involved, purported justification, and news source. As seen in: The 2024 Sigma Awards.

Mar 2024

Human development, indexed.

Perhaps the best-known metric of its kind, the United Nations’ Human Development Index combines statistics on life expectancy, income per capita, and years of schooling into a single number for each country-year. The UN provides downloads and an API for all annual HDI ratings and sub-components for 1990 to 2022. Those resources also feature data from related indices, such as the Inequality-adjusted Human Development Index, Gender Development Index, and Gender Inequality Index. [h/t Michael A. Rice]

Mar 2024

Institutional investments.

If you’re an institutional investor with US operations and managing at least $100 million in publicly traded securities, the Securities and Exchange Commission requires you to file Form 13F each quarter. (The biggest filers — such as Vanguard, BlackRock, and State Street — have trillions of dollars invested.) These filings, available to download going back to mid-2013, detail each investor’s long positions for each security: their number of shares, market value, security type, issuer name, CUSIP code, and more. As seen in: Michigan teenager Anonyo Noor’s wallstreetlocal.com, which aggregates the data, matches it to additional information, and provides a search interface.

Mar 2024

Aviation waypoints.

For his recent exploration of the FAA’s aviation maps, Beautiful Public Data’s Jon Keegan has turned the agency’s list of 67,000+ navigation waypoints into a downloadable dataset. “Often these waypoint names will reflect the culture, food or sports teams of the city they are near,” Keegan writes. “Off the coast of New England, there is LBSTA and WHALE. Boston’s sports legacy gave us BOSOX, BRUWN, CELTS, PATSS, FENWY, ORRRR and BORQE. Salem has WITCH, and Plymouth has PLGRM.”

Mar 2024

Meta Oversight Board decisions.

Meta’s independent Oversight Board reviews a selection of the company’s content-moderation decisions and has the power to overturn them. The board publishes its rulings online, as does Meta itself; neither, however, provides a download link. But Information Is Beautiful has compiled a spreadsheet of the board’s 80+ decisions through early February, supporting a visualization of the cases’ topics and outcomes over time. [h/t Data Science Community Newsletter]

Mar 2024

State legislators.

Nicholas Carnes and Eric Hansen’s 2023-4 State Legislators Dataset features “biographical information about state lawmakers who held office in 2023 and 2024 compiled from legislative and campaign websites and other online sources.” The dataset spans all 50 states and includes 7,300+ lawmakers. “The project’s principal aim was to record the current or most recent main occupation (outside of elected office) held by each member,” the authors write, “but the dataset also includes information about a wide range of characteristics including race, gender, and education.” A version for 2021–22 is also available. Previously: State legislator financial disclosures (DIP 2017.12.13) and ideology scores (DIP 2020.01.01). [h/t Derek Willis]

Mar 2024

Real-world vehicle emissions.

On Monday, the European Commission published its first report analyzing the real-world CO2 emissions of cars and vans, based on fuel consumption monitoring devices that the EU now requires. The report uses data received from 600,000+ vehicles. That sample is available to download, along with metrics aggregated by manufacturer and fuel type: average fuel consumption, emissions, and comparisons to standardized test results. Related: Data on millions of EU car registrations (and van registrations), including each vehicle’s fuel economy and emissions ratings. Previously: FuelEconomy.gov (DIP 2017.04.12), with data on decades of car models. [h/t Jan Willem Tulp + Xan Gregg]

Mar 2024

Human trafficking.

The Counter-Trafficking Data Collaborative’s Global Synthetic Dataset uses differential privacy techniques to represent “over 206,000 victims and survivors of trafficking identified across 190 countries and territories from 2002 to 2022.” The approach, developed in partnership with Microsoft Research, converts anonymized case records into “a new dataset in which records do not correspond to actual individuals, but which preserves the structure and statistics (i.e., utility) of the original data.” Each row indicates a (synthetic) individual’s gender, age group, citizenship, country of exploitation, duration of reported trafficking, traffickers’ means of control, types of exploitation, and the year the collaborative’s partners registered the case. Related: The collaborative’s Global Victim-Perpetrator Synthetic Dataset, which takes a similar approach to relationships between victims and perpetrators. [h/t Mariana Moreira + Lorraine Wong]

Mar 2024

Counting fish.

The University of Washington’s Columbia Basin Research provides (among other data) daily, species-level counts of adult salmon and trout passing through more than a dozen sites in the Pacific Northwest. CalFish publishes fish counts and population estimates for the Upper Sacramento River Basin, which “contains much of California’s salmon and steelhead populations.” Similar resources include those available from Alaska, Oregon, and the Yakama Nation. [h/t Dan Brady]

Mar 2024

NYC council members.

Maximum New York has published a biographical dataset of people elected to the New York City Council. For each member since 1998 (plus some before that), it lists their name, district, borough, political party, date of birth, undergraduate/graduate universities and fields of study, whether they ever served on a community board, prior employer, and more. Related: DataMade’s Chicago Councilmatic lists all members, bills, votes, and meetings, and is also available as structured data. [h/t Vikram Oberoi + Forest Gregg]

Mar 2024

EU infringements.

The European Commission publishes a searchable and downloadable database of all its decisions regarding national infringements of EU regulations, decisions, and directives. It currently contains 58,000+ decisions in 24,000+ cases, going back to the late 1980s. (To download the full database, conduct a blank search and then click the “Export to Excel” link.) Each entry lists a decision type and date, case identifier, country, policy area, and more. Recent examples include the Commission’s decisions to refer Ireland to court for failing to protect its peat bogs and Italy for noncompliance with a wastewater treatment directive. [h/t Maximilian Haag et al.]

Mar 2024

Materials and their properties.

The Materials Project, led by scientists at Lawrence Berkeley National Lab, “is a multi-institution, multi-national effort to compute the properties of all inorganic materials” with the “ultimate goal” being “to drastically reduce the time needed to invent new materials.” Its online explorer and API currently provide information about 150,000+ materials. You can search by component elements, formula, thermodynamics, structural properties, magnetism, elasticity, and many other characteristics.

Mar 2024

Humanitarian emergency mapping.

The United Nations Satellite Centre (UNOSAT) provides a range of services to UN agencies and the general public, including downloadable maps, data, and analyses produced “in response to humanitarian emergencies related to disasters, complex emergencies and conflict situations.” Those currently available include assessments of flood impacts in Libya, landslides in the Republic of the Congo, and building damage in the Gaza Strip. The last of these identifies structures that satellite imagery suggests have been damaged; the data indicate each building’s location and damage level, plus an assessment confidence and notes. Assessments for prior humanitarian emergencies can be found by adjusting the listing page’s filters, and also via UNOSAT’s contributions to the Humanitarian Data Exchange. [h/t Allison Martell]

Mar 2024

Pinball machines.

The Open Pinball Database provides a searchable inventory and API of ~2,000 pinball machines and 120+ manufacturers. Details include each machine’s name, manufacture date, mechanism type, display type, player count, and more. Related: Pinball Map’s crowdsourced global map and API of the locations of installed pinball machines. [h/t Jeremy Herrman + technophiliac]

Mar 2024

Real-time airport disruptions.

The Federal Aviation Administration’s National Airspace System Status dashboard provides real-time listings of delays and closures at US airports. For each disruption, it indicates the type of problem, reason, current average delay times, and more. A minimal API linked from the site provides the information as an XML-formatted file. Read more: Ruihai Youngblood describes his experience helping to redesign the dashboard. [h/t Jason Scott]

Mar 2024

Price-fixing cartels.

Industrial economist John M. Connor has constructed the Private International Cartels dataset, “which the author believes to be the largest collection of legal-economic information on contemporary price-fixing cartels.” It spans three decades (1990–2019) and covers 1,500+ suspected or convicted cartels, including 1,100+ that “have been deemed guilty of price fixing by one or more antitrust authority.” It also links those cartels to tens of thousands of companies and to 2,000+ individuals indicted or punished for their involvement. The dataset’s variables include information about cartel geography, industry, market share, overcharges, penalties, and much more.

Mar 2024

Fatal police pursuits.

Reporters at the San Francisco Chronicle have compiled a national dataset of 3,300+ deaths in police car chases in 2017–2022. To build it, they used information from the federal government’s Fatality Analysis Reporting System (DIP 2016.08.31), research organizations, news reports, lawsuits, and public records requests. For each death, the dataset indicates the person’s name, age, gender, race, and connection to the pursuit (driver, passenger, bystander, officer). It also includes the incident’s date, location, reason given for the pursuit, and main law enforcement agency involved. Read more: “Fast and Fatal,” the Chronicle’s investigation based on the dataset. [h/t Susie Neilson]

Mar 2024

Global military spending.

How much money has each country spent, each year, on its military? Different datasets have different answers, cover different timeframes, and use different methodologies. Miriam Barnum et al.’s Global Military Spending Dataset attempts to bring them together. By uniting “76 variables from 9 dataset collection projects,” the authors write, “we provide the most comprehensive and complete set of published datasets on military spending ever assembled.” Each of the variables represents one source/methodology, and each observation is a country-year. “Disagreement on the actual expenditure value for a given country-year is common, even between datasets produced by the same project,” they find. Previously: The Stockholm International Peace Research Institute’s Military Expenditure Database (DIP 2017.03.29), one of the sources.
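
Because each variable is one source's estimate for the same country-year, one simple way to gauge disagreement is the relative spread between the highest and lowest reported values. The sketch below assumes a made-up observation; the source names and figures are hypothetical and do not come from the dataset.

```python
# Hypothetical country-year observation: one spending estimate per source.
# Keys and values are invented; None marks a source with no estimate.
observation = {"source_a": 100.0, "source_b": 110.0, "source_c": None}


def relative_spread(obs):
    """Return (max - min) / min across the available estimates.

    Returns None when fewer than two sources report a value, or when the
    lowest estimate is zero and a ratio would be undefined.
    """
    values = [v for v in obs.values() if v is not None]
    if len(values) < 2 or min(values) == 0:
        return None
    return (max(values) - min(values)) / min(values)


spread = relative_spread(observation)
```

This assumes the estimates share a currency and price basis; variables built on different methodologies would need to be made comparable before their spread means much.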