Data is Plural archive
Synths.
Iftah Gabbai is building a dataset of “hardware synthesizers, samplers, and drum machines” produced since 1896, “compiled through a mix of automated and manual processes, combined with extensive research.” For each of the 2,300+ devices identified, the dataset indicates its name, brand, release year, years in production, device type (synth, sampler, et cetera), form factor, architecture, synth engine used, number of keys, key type, oscillator count, and more. Learn more: Gabbai’s introductory video. [h/t Stefan Bohacek]
German election results.
GERDA, a new project by Vincent Heddesheimer et al., “provides a comprehensive dataset of local, state, and federal election results in Germany.” The results go back to 1953 for federal elections, to 1990 for local elections, and to 1996 for state elections. The files indicate each geographic unit’s number of eligible voters, actual voters, valid votes, invalid votes, and vote shares by party. The authors have also created “geographically harmonized datasets that account for changes in municipal boundaries and mail-in voting districts.”
The cost of sustenance.
The UN World Food Programme’s Fill the Nutrient Gap initiative conducted a series of analyses in 2015 through 2021 to “calculate the costs of energy-sufficient and nutrient-adequate diets and the percentage of households that were unable to afford each diet.” In a recent paper, Zuzanna Turowska et al. describe the analyses’ methodology and share their results as a dataset. For each of the 37 countries analyzed, the dataset contains one row per geographic unit, timeframe, and type of household member; each row provides the cost and unaffordability estimates for that category.
US buildings.
“Leveraging high performance computing, remote sensing, geographic data science, machine learning, and computer vision,” Hsiuhan Lexie Yang et al., researchers at Oak Ridge National Laboratory, have “partnered with Federal Emergency Management Agency (FEMA) to build a baseline structure inventory covering the US and its territories to support disaster preparedness, response, and recovery.” The dataset and interactive map trace the outlines of 125 million buildings and, in many cases, contain the building’s address, occupancy class, usage type, height, elevation, and other attributes. They also provide information about the imagery used to identify the structure.
SBA disaster loans.
“Following a declared disaster,” the US Small Business Administration offers “disaster assistance in the form of low-interest, long-term disaster loans for damages not covered by insurance or other recoveries to businesses of all sizes, private nonprofit organizations, as well as homeowners and renters.” The SBA publishes anonymized data about each such loan in fiscal years 2000 to 2022, drawn directly from its Disaster Credit Management System. The records provide the relevant disaster declaration IDs, property type, ZIP code, city, county, state, verified losses (in real estate and in “content”), and approved loan amounts. Previously: SBA datasets on the Paycheck Protection Program (DIP 2020.07.08) and the administration’s 7(a) and 504 loan programs (DIP 2023.01.11). [h/t Benjamin L. Collier et al.]
Anglo-Saxons on record.
The Prosopography of Anglo-Saxon England project “aims to provide structured information relating to all the recorded inhabitants of England from the late sixth to the late eleventh century.” Built over (relatively less) time by several teams at UK universities, PASE is “based on a systematic examination of the available written sources for the period, including chronicles, saints’ Lives, charters, libri vitae, inscriptions, Domesday Book and coins.” The Domesday-focused portion of the project features a downloadable table of 17,000+ landholders from that manuscript, listing their name (where known), gender, description, value of holdings, and linking to details about those holdings. [h/t Derek M. Jones]
Trash balloons.
Since May of this year, North Korea has floated thousands of trash-carrying balloons into South Korea. A team from the Center for Strategic and International Studies’ Beyond Parallel project has mapped 160+ known balloon landing locations, based on public sources. The map’s downloadable data indicates each landing’s date, associated “wave,” coordinates, location name, and province. As seen in: Reuters’ visually immersive article on the topic. [h/t Soph Warnes]
Fuel forecasts.
The US Energy Information Administration’s monthly Short-Term Energy Outlook provides forecasts and recent trends of energy supply, consumption, prices, and inventory. It covers a range of commodities and electricity sources, such as crude oil, coal, natural gas, gasoline, renewables, and nuclear. Starting with its September 2024 report, the outlooks have also begun to include more detailed data on biofuels, available in Table 4d of its structured datasets.
Evapotranspiration.
With the goal of “filling the biggest data gap in water management,” the OpenET project uses satellite imagery, weather data, and other sources to estimate the volume of evapotranspiration — “the process by which water is transferred from the land to the atmosphere” — at a 30-meter resolution across 17 states in the western US. The results are available to explore via an online map of annual cumulative evapotranspiration (2019–2024), as monthly datasets via Earth Engine, and through an API. [h/t Mira Rojanasakul]
Disability claims processing.
The Social Security Administration publishes monthly and annual datasets tracking each state agency’s progress processing disability claims. The datasets, which go back to October 2000, count the number of initial claims received, pending, determined, and approved by each agency during each period, as well as similar breakdowns for denial reconsiderations and continuing disability reviews. The administration’s extensive catalog of public datasets also includes several that measure the waiting involved, such as monthly average initial claim processing times, average wait times for reconsiderations, and wait times for administrative hearings. As seen in: “Wait times for Social Security disability benefit decisions reach new high” (USAFacts).
James Beard honorees.
Cody Winchester has constructed a dataset of James Beard Foundation Award semifinalists, nominees, and winners since 1991, sourced from the foundation’s award-search page. For each honoree, the dataset provides their name, year, category, subcategory, and award status, plus additional category-specific variables (such as publisher for the book awards, and location for restaurant and chef awards).
Canadian mines.
Economist Clara Dallaire-Fortier has compiled a dataset of “mine-level estimates for the Canadian mining industry with a persistent annual coverage between 1950 and 2022,” based partly on historical government maps. For each of the 947 mines identified, the dataset indicates its name, location, mining companies, dates open/closed, and commodities produced. Previously: Australian mine production, 1799–2021 (DIP 2023.07.12).
Tree canopies.
The High Resolution Canopy Height Maps dataset, released in April by Meta and the World Resources Institute, estimates “tree canopy height at a 1-meter resolution, allowing the detection of single trees at a global scale.” It is available to explore online and download, and was constructed by applying machine learning techniques to satellite imagery and LiDAR data. The estimates use satellite imagery mostly from 2018–2020, and “when newer imagery is available, the publicly shared model can be used to detect change in canopy heights.” [h/t Ben Hur Pintor]
Open-source weather APIs.
Open-Meteo, an open-source project built on data from national weather services, offers a range of weather and climate APIs that are free for non-commercial use. They include weather forecasts (temperature, humidity, precipitation, wind speed, etc.), daily historical weather since 1940, climate change model outputs, marine wave forecasts, air quality assessments, and more. The project also provides bulk downloads of the underlying data and self-hosting instructions. As seen in: Jan Kühn’s Historical Meteo Graphs. [h/t Giuseppe Sollazzo]
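For readers who want to try the forecast API, here is a minimal sketch of how a request URL might be assembled. The endpoint path and the parameter names (`latitude`, `longitude`, `hourly`, `temperature_2m`, `precipitation`) reflect our understanding of Open-Meteo's public documentation and should be treated as assumptions to verify there; the function name `build_forecast_url` is hypothetical. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed base URL of Open-Meteo's forecast endpoint; verify against the project's docs.
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude, longitude, hourly=("temperature_2m", "precipitation")):
    """Return a forecast request URL for the given coordinates (hypothetical helper)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Hourly variables are passed as one comma-separated value.
        "hourly": ",".join(hourly),
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{FORECAST_URL}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    # Example coordinates (roughly Berlin); any latitude/longitude pair works.
    print(build_forecast_url(52.52, 13.41))
```

The resulting URL can be opened in a browser or passed to any HTTP client; the response format and the full list of available variables are described in the project's documentation.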
NYC evictions.
New York City’s government publishes a dataset listing evictions “pending, scheduled and executed” since 2017, updated daily. The data are “compiled from the majority of New York City Marshals,” who are mayor-appointed officers tasked with enforcing civil court cases. Each of the 97,000+ rows indicates the eviction court case number, address, property type, eviction type, execution date, and marshal. Related: The city also publishes data on marshals’ annual eviction revenues. Also related: nycdb points to, and helps download, a range of NYC housing–related datasets. As seen in: “Spiking Evictions Renew Calls to Reform NYC Marshals System,” by Patrick Spauster for City Limits.
Messi’s moves.
StatsBomb, a soccer/football-data company, publishes a subset of its detailed, in-play data for free. Among the offerings: Every touch, pass, dribble, and shot from Lionel Messi’s 17 seasons playing for Barcelona in La Liga. Related: Carlos Menezes’s tool for visualizing StatsBomb event data files. Read more: Net Gains: Inside the Beautiful Game’s Analytics Revolution, by Ryan O’Hanlon. [h/t Giuseppe Sollazzo]
Art words.
The Getty Vocabularies, published by the Getty Research Institute, “contain structured terminology for art, architecture, decorative arts, archival materials, visual surrogates, art conservation, and bibliographic materials.” They provide definitions, relationships, translations, and disambiguations for a broad range of terms and entities. Their Art & Architecture Thesaurus, for example, describes 57,000+ generic concepts (e.g., lithography), while others focus on artist names, cultural objects, and geographies. The records are available several ways, including bulk downloads. [h/t Lynn Cherny]
Alcohol consumption.
The National Institute on Alcohol Abuse and Alcoholism’s latest consumption surveillance report, published earlier this year, uses sales and shipment data to measure annual alcohol intake by beverage type (beer, wine, spirits) and state. The report and corresponding data file estimate the likely total and per-capita volumes (of the beverages and of their ethanol content) consumed each year from the 1970s through 2022. Related: Additional surveillance reports and the CDC’s list of surveys gathering data on alcohol use. [h/t Millie Giles]
Long-run economic growth.
The Maddison Project Database, based on the work of Angus Maddison (1926-2010), “provides information on comparative economic growth and income levels over the very long run.” Its latest release includes historical per-capita GDP estimates for 169 countries, in many cases spanning several centuries. In all, the database contains 21,000+ such estimates and another 17,000+ population estimates, drawn from hundreds of sources. Previously: The Penn World Table (DIP 2016.08.17) — “income, output, input and productivity” estimates now “covering 183 countries between 1950 and 2019” — and the Long-Term Productivity Database (DIP 2020.04.08).
Landslides.
The US Geological Survey has released a new map of landslide susceptibility, indicating the specific areas of the country (at 90-meter resolution) that are at greatest risk of slides. The map and county-level metrics are also available as structured data. To calculate the susceptibilities, Benjamin B. Mirus et al. combined data from the agency’s 3D Elevation Program and a national landslide inventory that they updated. The latter provides the location (as a single point or more detailed boundary), timing, number of fatalities, and confidence level for 610,000+ landslides (or evidence of them) in the US since the early 1900s. The researchers also consulted data from state landslide inventories published by Idaho, Maine, North Dakota, and West Virginia.
Snakes.
SnakeDB — created by Sascha Steinhoff “after [he] accidentally stepped into a snake in South-East Asia” — provides downloadable data on the maximum size, fang position, pupil shape, mode of reproduction, and toxicity of thousands of species, drawn from a broad range of sources. As seen in: Oleksandra Oskyrko et al.’s ReptTraits database.
Italian tax-to-charity allocations.
Italy’s “five per thousand” program allows taxpayers to allocate 0.5% of their income tax to certain nonprofits, research institutions, and other social-benefit organizations. The country’s Ministry of Economy and Finance has published information about 2022’s beneficiaries, but initially did so only via PDFs. Earlier this year, the Liberiamoli tutti! initiative converted those PDFs into structured data that list each recipient organization’s name, tax ID, category, region, province, municipality, number of taxpayers choosing it, and amount of money allocated. The ministry has since added structured files of its own.
Source code.
Software Heritage, a nonprofit initiative collaborating with UNESCO, maintains “the largest public collection of source code in existence”: an archive tracking 20 billion source files and 4 billion code-commits from 317 million projects from a range of public software hosts (GitHub, GitLab, BitBucket, npm, et cetera). Its Graph Dataset, which provides access to the archive’s content and internal relationships, is available via bulk downloads and APIs. [h/t Derek M. Jones]
Monthly crime trends.
The Real-Time Crime Index, launched last week by a team of crime-data analysts, presents a “sample of reported crime data from hundreds of law enforcement agencies nationwide which mimics national crime trends with as little lag and the most accuracy possible.” Framed as a supplement to the FBI’s slow-to-update official statistics, the project provides monthly and rolling 12-month totals of reported crimes (using the FBI’s UCR Part I offense categories) for the nation, individual cities, and by city population size. You can download the data and see the sources for each of the 300+ local agencies in the national sample. Read more: “The Real-Time Crime Index Shows Declining Crime in 2024,” from project co-leader Jeff Asher’s newsletter.
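To make "rolling 12-month totals" concrete, here is a small sketch of how such a series could be derived from monthly counts. It illustrates the general technique only, not the project's own code; the function name and the sample numbers are hypothetical.

```python
def rolling_12_month_totals(monthly_counts):
    """Return the sum of each consecutive 12-month window, oldest window first."""
    window = 12
    totals = []
    running = 0
    for i, count in enumerate(monthly_counts):
        running += count
        if i >= window:
            # Drop the month that just fell out of the 12-month window.
            running -= monthly_counts[i - window]
        if i >= window - 1:
            # A full 12 months is available, so record the window total.
            totals.append(running)
    return totals


if __name__ == "__main__":
    # Made-up monthly counts for 14 months, purely for illustration.
    counts = list(range(1, 15))
    print(rolling_12_month_totals(counts))
```

With fewer than 12 months of data the function returns an empty list, since no complete window exists yet.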
Health and nutrition.
Since 1999, the CDC has been continuously fielding its National Health and Nutrition Examination Survey, interviewing and testing approximately 5,000 people in 15 different counties each year. The survey combines “demographic, socioeconomic, dietary, and health-related questions” with an “examination component” involving “medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.” Its public-access data files provide anonymized, respondent-level records and are currently available for surveys conducted through March 2020. As seen in: Catherine McDonough et al.’s dataset and interactive dashboard “exploring factors associated with prediabetes and diabetes mellitus among youth in the United States.”
Olympic medalists.
The European Data Journalism Network’s Giorgio Comai has used Wikipedia and Wikidata to create a series of datasets listing the name, birth date, sex, and birthplace of Summer Olympic medalists. Comai has mapped the birthplace coordinates and, for Europe-born medalists, linked them to their NUTS regions. The project focuses on the 2024 and 2020 Summer Olympics but also provides provisional data for other recent iterations. [h/t Federico Caruso]
UK grantmakers.
The UK Grantmaking initiative “is a unique cross-sector collaboration between” several major organizations in the field. Their downloadable dataset provides information about 12,000+ trusts, foundations, charities, and other grantmakers for financial year 2022-23, based on records from government regulators. The dataset lists each organization’s name, government-assigned ID, location, category, registration date, income, spending totals, net assets, and more. Previously: UK grants via 360Giving (DIP 2018.12.05). [h/t Giuseppe Sollazzo]
California residential water supply.
Marie-Philine Gross et al.’s dataset of residential water demand and supply in California includes the monthly volumes of water produced/sold by 404 of the state’s water suppliers, covering 2013–2021. The researchers extracted, standardized, and cleaned the data from the state’s mandatory annual reports, which collect thousands of data points from each supplier. They also added contextual information, such as climatic data (monthly local precipitation, temperature, and drought severity) and each supplier’s hydrologic region.
Multinational corporations.
The Multinational Enterprise Information Platform, a collaboration between the OECD and the UN Statistics Division, provides publicly sourced data on the 500 multinational corporations with the largest market capitalization. Its “Global Register” dataset examines the companies’ structure, listing each subsidiary’s name, parent company, address, alternative names, and various unique identifiers. The “Digital Register” dataset lists all known web domains controlled by each company and assessments of those domains’ popularity. The platform’s “Media Monitor” feature, although not downloadable, links to news articles and other webpages mentioning the companies. [h/t Annie Burns-Pieper]
H-1B lotteries.
A recent Bloomberg News investigation into the US government’s annual H-1B lottery, a key step in allocating the country’s skilled-worker visas, finds that “thousands of companies got an unfair advantage by helping themselves to extra lottery tickets.” To reach those conclusions, the team “obtained data on all H-1B lottery registrations, selections, and petitions for fiscal years 2021 through 2024 after bringing a lawsuit against the Department of Homeland Security under the Freedom of Information Act.” They’ve shared the records, which indicate each registration’s employer, as well as the proposed beneficiary’s gender, nationality, and birth year. For registrations that led to visa petitions, the data include additional details, such as the worksite, salary, job title, and beneficiary’s field of study. [h/t Eric Fan]