Data is Plural archive

Sep 2024

Italian tax-to-charity allocations.

Italy’s “five per thousand” program allows taxpayers to allocate 0.5% of their income tax to certain nonprofits, research institutions, and other social-benefit organizations. The country’s Ministry of Economy and Finance has published information about 2022’s beneficiaries, but initially did so only via PDFs. Earlier this year, the Liberiamoli tutti! initiative converted those PDFs into structured data that list each recipient organization’s name, tax ID, category, region, province, and municipality, number of taxpayers choosing it, and amount of money allocated. The ministry has since added structured files of its own.

Sep 2024

Source code.

Software Heritage, a nonprofit initiative collaborating with UNESCO, maintains “the largest public collection of source code in existence”: an archive tracking 20 billion source files and 4 billion code-commits from 317 million projects from a range of public software hosts (GitHub, GitLab, BitBucket, npm, et cetera). Its Graph Dataset, which provides access to the archive’s content and internal relationships, is available via bulk downloads and APIs. [h/t Derek M. Jones]

Sep 2024

Monthly crime trends.

The Real-Time Crime Index, launched last week by a team of crime-data analysts, presents a “sample of reported crime data from hundreds of law enforcement agencies nationwide which mimics national crime trends with as little lag and the most accuracy possible.” Framed as a supplement to the FBI’s slow-to-update official statistics, the project provides monthly and rolling 12-month totals of reported crimes (using the FBI’s UCR Part I offense categories) for the nation, individual cities, and by city population size. You can download the data and see the sources for each of the 300+ local agencies in the national sample. Read more: “The Real-Time Crime Index Shows Declining Crime in 2024,” from project co-leader Jeff Asher’s newsletter.
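For readers who want to reproduce the "rolling 12-month totals" idea from monthly counts, here is a minimal sketch. The monthly figures are made up for illustration and are not taken from the index; the function name is hypothetical, and the project's own method may differ in details such as handling missing months.

```python
from collections import deque


def rolling_12_month_totals(monthly_counts):
    """Return rolling 12-month sums for a chronological list of monthly counts.

    Positions before the 12th month are None, because a full
    12-month window is not yet available there.
    """
    window = deque(maxlen=12)  # holds the most recent 12 monthly counts
    running = 0
    totals = []
    for count in monthly_counts:
        if len(window) == 12:
            # The oldest month is about to drop out of the window.
            running -= window[0]
        window.append(count)
        running += count
        totals.append(running if len(window) == 12 else None)
    return totals


# Hypothetical monthly counts of reported offenses (14 months).
monthly = [10, 12, 9, 11, 13, 8, 10, 12, 11, 9, 10, 12, 14, 7]
totals = rolling_12_month_totals(monthly)
print(totals[11:])
```

Each rolling total covers the current month plus the 11 months before it, which smooths out seasonal swings that a single month's count would show.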

Sep 2024

Health and nutrition.

Since 1999, CDC has been continuously fielding its National Health and Nutrition Examination Survey, interviewing and testing approximately 5,000 people in 15 different counties each year. The survey combines “demographic, socioeconomic, dietary, and health-related questions” with an “examination component” involving “medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.” Its public-access data files provide anonymized, respondent-level records and are currently available for surveys conducted through March 2020. As seen in: Catherine McDonough et al.’s dataset and interactive dashboard “exploring factors associated with prediabetes and diabetes mellitus among youth in the United States.”

Aug 2024

Olympic medalists.

The European Data Journalism Network’s Giorgio Comai has used Wikipedia and Wikidata to create a series of datasets listing the name, birth date, sex, and birthplace of Summer Olympic medalists. Comai has mapped the birthplace coordinates and, for Europe-born medalists, linked them to their NUTS regions. The project focuses on the 2024 and 2020 Summer Olympics but also provides provisional data for other recent iterations. [h/t Federico Caruso]

Aug 2024

UK grantmakers.

The UK Grantmaking initiative “is a unique cross-sector collaboration between” several major organizations in the field. Their downloadable dataset provides information about 12,000+ trusts, foundations, charities, and other grantmakers for financial year 2022-23, based on records from government regulators. The dataset lists each organization’s name, government-assigned ID, location, category, registration date, income, spending totals, net assets, and more. Previously: UK grants via 360Giving (DIP 2018.12.05). [h/t Giuseppe Sollazzo]

Aug 2024

California residential water supply.

Marie-Philine Gross et al.’s dataset of residential water demand and supply in California includes the monthly volumes of water produced/sold by 404 of the state’s water suppliers, covering 2013–2021. The researchers extracted, standardized, and cleaned the data from the state’s mandatory annual reports, which collect thousands of data points from each supplier. They also added contextual information, such as climatic data (monthly local precipitation, temperature, and drought severity) and each supplier’s hydrologic region.

Aug 2024

Multinational corporations.

The Multinational Enterprise Information Platform, a collaboration between the OECD and the UN Statistics Division, provides publicly sourced data on the 500 multinational corporations with the largest market capitalization. Its “Global Register” dataset examines the companies’ structure, listing each subsidiary’s name, parent company, address, alternative names, and various unique identifiers. The “Digital Register” dataset lists all known web domains controlled by each company and assessments of those domains’ popularity. The platform’s “Media Monitor” feature, although not downloadable, links to news articles and other webpages mentioning the companies. [h/t Annie Burns-Pieper]

Aug 2024

H-1B lotteries.

A recent Bloomberg News investigation into the US government’s annual H-1B lottery, a key step in allocating the country’s skilled-worker visas, finds that “thousands of companies got an unfair advantage by helping themselves to extra lottery tickets.” To reach those conclusions, the team “obtained data on all H-1B lottery registrations, selections, and petitions for fiscal years 2021 through 2024 after bringing a lawsuit against the Department of Homeland Security under the Freedom of Information Act.” They’ve shared the records, which indicate each registration’s employer, as well as the proposed beneficiary’s gender, nationality, and birth year. For registrations that led to visa petitions, the data include additional details, such as the worksite, salary, job title, and beneficiary’s field of study. [h/t Eric Fan]

Aug 2024

Watching grass grow.

The Jornada Experimental Range, run by the USDA’s Agricultural Research Service, “is one of the longest serving laboratories focused on rangelands and drylands in the world.” Located north of Las Cruces, N.M., the site has been operating since the 1910s. A few years ago, Erica Christensen et al. published a dataset of grass and shrub growth within 122 one-meter-by-one-meter squares on the range from 1915 to 2016, containing roughly 200,000 observations.

Aug 2024

Real-time UK voter registrations.

The UK government’s voter registration statistics dashboard updates hundreds of times per day. It provides downloadable data on the number of online applications in each five-minute interval in the past 24 hours and daily counts broken down by online vs. paper applications, age group, elector type, and nation. [h/t Giuseppe Sollazzo]

Aug 2024

Western water rights.

Matthew D. Lisk et al. have compiled and standardized a dataset of water rights records — key documents in the allocation of the scarce resource — in the Western United States. Drawing on raw data collected from 11 states, the harmonized dataset “provides consistent unique identifiers for each spatial unit of water management across the domain, unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of 7 broad use categories.” Those categories: irrigation, domestic, livestock, fish, industrial, environmental, and other. The authors have also published a set of shapefiles outlining each water management area’s boundaries.

Aug 2024

Recreational boating accidents.

In the US, recreational boaters must notify state authorities soon after any incident involving a death, serious injury, disappearance, or substantial damage. The authorities relay those notifications to the Coast Guard, which stores them in its centralized Boating Accident Report Database. The Data Liberation Project (which, customary disclosure, I run) filed a FOIA request for the database and, earlier this week, published the records it received. The data describe 58,000+ boating accidents, 78,000+ vessels, 8,900+ deaths, and 36,000+ injuries from 2009 to 2023, although a few states and territories withheld their incidents from disclosure. Read more: The Data Liberation Project’s introductory documentation.

Aug 2024

The Freedman’s Bank.

The Freedman’s Savings and Trust Company was chartered by Congress in 1865, toward the end of the Civil War, to provide banking services to formerly enslaved Americans. “Though the bank achieved some early successes, it failed catastrophically in 1874, destroying the savings of a broad swath of newly freed black citizens,” write Malcolm Wardlaw and Virginia Traweek, who have constructed several datasets based on handwritten records preserved by the federal government. One dataset lists 5,000+ transactions from 500+ accounts’ “passbooks”, indicating the account holder, city, transaction type, date, and amount. Another lists 40,000+ accounts’ final balances at the time of the bank’s failure. Read more: Wardlaw and Traweek’s studies analyzing the records.

Aug 2024

Art auction sales.

Kangsan Lee et al. have compiled a dataset of “34,200 auction sales records, including images, artists’ attributes, and market information, encompassing 590 living contemporary artists spanning 17 years (1996 to 2012) across 23 countries.” It includes the artist’s name, nationality, and birth year; the artwork’s name, year, medium, and dimensions; the auction date, house, initial estimates, and final sale price; and more. The sales information comes from Blouin, which the researchers “cross-check[ed] with publicly available auction house data at the time, such as Christie’s and Sotheby’s.”

Aug 2024

Weather balloons.

When a weather balloon rises into the atmosphere, it carries a radiosonde to record the temperature, pressure, wind speed, humidity, and other measurements. NOAA’s Integrated Global Radiosonde Archive “consists of radiosonde and pilot balloon observations from more than 2,800 globally distributed stations,” some dating back to the early 1900s. SondeHub, meanwhile, tracks hundreds of weather balloons a day in real-time, thanks to a community-run network of 1,400+ receiver stations. You can browse the flight paths and detailed measurements online, and access the data via download and API. [h/t Michael Allen]

Aug 2024

Ireland’s gender pay gaps.

Ireland’s Gender Pay Gap Information Act requires certain-sized companies to report their differences in pay for men versus women. “While plans are in place to create a central portal (similar to that in the UK) to collate this information, such a database does not exist yet,” writes Jennifer Keane, who has built PayGap.ie to fill the void. For each company and year, the project’s database lists the metrics the Act requires (mean and median hourly pay gaps, percentages of men and women paid bonuses, and proportion of employees in each pay quartile for each gender, among others), plus a link to the company’s public report.
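To make the mean and median hourly pay gap metrics concrete, here is a minimal sketch. It assumes the conventional definition, in which the gap is the difference between men's and women's average hourly pay, expressed as a percentage of men's average hourly pay. The function name and the pay figures are hypothetical and do not come from PayGap.ie or any company's report.

```python
from statistics import mean, median


def pay_gap_percent(men_hourly, women_hourly, average=mean):
    """Pay gap as a percentage of men's average hourly pay.

    Assumed definition: (men's average - women's average) / men's average * 100.
    Pass average=median to get the median gap instead of the mean gap.
    """
    men_avg = average(men_hourly)
    women_avg = average(women_hourly)
    return (men_avg - women_avg) / men_avg * 100


# Hypothetical hourly pay rates, in euros.
men = [20, 30, 40]
women = [18, 27, 30]

mean_gap = pay_gap_percent(men, women)            # uses the mean
median_gap = pay_gap_percent(men, women, median)  # uses the median
print(round(mean_gap, 1), round(median_gap, 1))
```

A positive value means men are paid more on average; the mean and median versions can diverge when a few very high or very low earners skew one group.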

Aug 2024

Railroad incidents.

Since last August, the Federal Railroad Administration has been rolling out a new portal for its safety data. Through it, you can find datasets on incidents and accidents involving railroad equipment, incidents at grade crossings, and reported injuries and illnesses, as well as dashboards and reports on related topics. The grade crossing dataset, for instance, lists 246,000+ incidents since 1975; it indicates each incident’s date, railroad, crossing identifier, nearest station, number of injuries, vehicle and train speeds, and much more. Previously: Blocked rail crossings (DIP 2023.05.10).

Aug 2024

State/local government employment.

Every year, the Census Bureau sends its Annual Survey of Public Employment & Payroll to all 50 state governments and 90,000+ local governments, requesting employee counts and payroll totals. The survey’s public datasets provide those figures for each government unit, broken down by several dozen “functional categories” (such as “Highways”, “Financial Administration”, and “Hospitals”). As seen in: The Marshall Project’s guide to using the data to examine declines in prison staffing, part of the organization’s new Investigate This! initiative; they’ve also aggregated the raw records into a spreadsheet of annual state totals. [h/t David Eads]

Jul 2024

Wait Wait.

Linh Pham considers himself “the unofficial scorekeeper” of Wait Wait… Don’t Tell Me!, NPR’s weekly quiz show. Since 2007, he’s been maintaining a structured database that describes Wait Wait’s episodes, venues, hosts, guests, panelists, and more. Pham provides the data via API, and also publishes charts and automated reports, such as this list of panelists who won their debut appearances. [h/t Cody Winchester]

Jul 2024

People surveyed since 1979.

The Bureau of Labor Statistics’ NLSY79 survey has interviewed the same people dozens of times since 1979. It began with a “nationally representative sample of 12,686 young men and women”; more than four decades later, 6,000+ interviewees are still responding to the project’s biennial inquiries. The survey asks about a range of topics, including education, employment, health, dating, marriage, children, attitudes, and substance abuse. Public-use data are available to download and through the agency’s NLS Investigator tool. Related: The agency’s other national longitudinal surveys. [h/t Prashant Bharadwaj et al.]

Jul 2024

Scholarship, networked.

OpenAlex, “a free and open catalog of the global research system,” has compiled data on more than 250 million scholarly works, and has linked those works to structured information about their authors, institutions, publishers, funders, and topics. The data are available to search online, to download in bulk, and via API. As seen in: Aliakbar Akbaritabar et al.’s “Bilateral flows and rates of international migration of scholars for 210 countries for the period 1998-2020” and Philippe Mongeon et al.’s dataset of scholars’ Twitter/X usernames.

Jul 2024

Hurricane evacuation orders.

Harsh Anand et al.’s Hurricane Evacuation Order Database “is a comprehensive and standardized database of evacuation orders issued by state and local government officials in response to the hurricanes that impacted the United States between 2014 and 2023.” To build it, the authors combed through government websites, official social media, news reports, and other sources. The database covers 27 storms and several types of announcements: state-of-emergency declarations, mandatory evacuations, voluntary evacuations, and the lifting of those orders. For each announcement, the database indicates the order type, date/time announced, date/time effective, counties affected, and evacuation area.

Jul 2024

Ballots cast.

“Electronic records of actual ballots cast (cast vote records) are available to the public in some jurisdictions,” Shiro Kuriwaki et al. write. “However, they have been released in a variety of formats and have not been independently evaluated.” So the researchers have constructed a standardized dataset representing 40.7 million (anonymous) ballots in the November 2020 general election, spanning 352 counties across 20 states. Each of the 160 million rows corresponds to a voter’s choice in a particular race and indicates the precinct, legislative district, office in question, candidate selected, and candidate’s party. The initial release, which the authors use to analyze ticket-splitting patterns, covers votes for president, Senate, House, governor, and state legislature. [h/t Derek Willis]

Jul 2024

Australia shipwrecks.

The Western Australian Museum hosts a range of datasets, including details concerning 1,600+ local shipwrecks and 30,000+ artifacts recovered from them. The shipwreck dataset lists each ship’s builder, construction materials, owner, cargo, wreck location, date wrecked, known deaths, date found, and more. Previously: Ancient shipwrecks (DIP 2024.07.10). [h/t Kristin Milton]

Jul 2024

National park species.

The National Park Service’s NPSpecies portal “documents our knowledge about the occurrence and status of species” on the agency’s lands. For each NPS-managed area, you can download a list of the species, their scientific and common names, occurrence status (present, probably present, unconfirmed), nativeness, conservation status, and more. Related: Noting that “many of the observations in NPSpecies remain unverified and the lists are often outdated,” Benjamin J. LaFrance et al. have created an updated dataset for amphibian species, which they checked against other sources and verified with regional experts.

Jul 2024

Commercial zones.

Byeonghwa Jeong et al. have constructed a dataset estimating the geographic boundaries of 23,000+ commercial zones in 69 metro areas in the US and Canada. To build it, they used data on retail and office locations from OpenStreetMap, and on job density from the US Census Bureau’s Longitudinal Employer-Household Dynamics program (DIP 2021.05.26) and Statistics Canada. For each detected commercial zone, the dataset provides its outline, total area, a score of its relative concentration (on which the zone comprising most of Manhattan scored the highest), its MSA, and the street at its centroid.

Jul 2024

Human rights scores.

The CIRIGHTS project aims “to create numerical measures for every internationally recognized human right for all countries of the world.” The team has developed a detailed guide to scoring each government’s record on dozens of such rights, such as freedom of religion, women’s political rights, freedom from extrajudicial killings, the right to a fair trial, and “reasonable limits” on working hours. For each year from 1981 to 2021, the project’s scorers have rated each country on each right, generally on a three-point scale, based on information in the US State Department’s Country Reports on Human Rights Practices, Amnesty International’s annual reports, and similar sources. The resulting dataset includes those scores, as well as several summary metrics.

Jul 2024

News homepages, archived.

Since launching in March 2022, homepages.news has archived millions of screenshots, performance audits, robots.txt files, accessibility trees, and hyperlink lists from the homepages of 1,100+ news sites. The open-source project, run by journalist Ben Welsh, provides bulk data for each of those assets. The screenshots themselves are stored on the Internet Archive; you can also view the latest screenshots from all the sites on one page. To date, the publications span 32 countries and 17 languages. Related: Welsh and volunteer Alex Garcia are using the robots.txt data to track which sites block OpenAI, Google AI, and Common Crawl — findings that have been cited widely.