Data is Plural archive

Oct 2024

Synths.

Iftah Gabbai is building a dataset of “hardware synthesizers, samplers, and drum machines” produced since 1896, “compiled through a mix of automated and manual processes, combined with extensive research.” For each of the 2,300+ devices identified, the dataset indicates its name, brand, release year, years in production, device type (synth, sampler, et cetera), form factor, architecture, synth engine used, number of keys, key type, oscillator count, and more. Learn more: Gabbai’s introductory video. [h/t Stefan Bohacek]

Oct 2024

German election results.

GERDA, a new project by Vincent Heddesheimer et al., “provides a comprehensive dataset of local, state, and federal election results in Germany.” The results go back to 1953 for federal elections, to 1990 for local elections, and to 1996 for state elections. The files indicate each geographic unit’s number of eligible voters, actual voters, valid votes, invalid votes, and vote shares by party. The authors have also created “geographically harmonized datasets that account for changes in municipal boundaries and mail-in voting districts.”
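As a rough illustration of how those count columns relate to one another, the sketch below derives turnout and party vote shares for a single geographic unit. The field names and figures are hypothetical, not GERDA's actual column names or data, and the sketch assumes vote shares are computed over valid votes.

```python
from dataclasses import dataclass


@dataclass
class UnitResult:
    """Counts for one geographic unit (hypothetical field names)."""
    eligible_voters: int
    voters: int
    valid_votes: int
    invalid_votes: int
    party_votes: dict[str, int]


def turnout(r: UnitResult) -> float:
    """Share of eligible voters who cast a ballot."""
    return r.voters / r.eligible_voters


def vote_shares(r: UnitResult) -> dict[str, float]:
    """Each party's share of the valid votes (assumed denominator)."""
    return {party: votes / r.valid_votes for party, votes in r.party_votes.items()}


# Made-up numbers for illustration only.
example = UnitResult(
    eligible_voters=1000,
    voters=800,
    valid_votes=780,
    invalid_votes=20,
    party_votes={"A": 390, "B": 234, "C": 156},
)
```

With these made-up counts, valid and invalid votes add up to the number of voters, which is a cheap consistency check worth running on any real file before trusting the shares.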

Oct 2024

The cost of sustenance.

The UN World Food Programme’s Fill the Nutrient Gap initiative conducted a series of analyses from 2015 through 2021 to “calculate the costs of energy-sufficient and nutrient-adequate diets and the percentage of households that were unable to afford each diet.” In a recent paper, Zuzanna Turowska et al. describe the analyses’ methodology and share their results as a dataset. For each of the 37 countries analyzed, the dataset contains one row per geographic unit, timeframe, and type of household member; each row provides the cost and unaffordability estimates for that category.

Oct 2024

US buildings.

“Leveraging high performance computing, remote sensing, geographic data science, machine learning, and computer vision,” Hsiuhan Lexie Yang et al., researchers at Oak Ridge National Laboratory, have “partnered with Federal Emergency Management Agency (FEMA) to build a baseline structure inventory covering the US and its territories to support disaster preparedness, response, and recovery.” The dataset and interactive map trace the outlines of 125 million buildings and, in many cases, contain the building’s address, occupancy class, usage type, height, elevation, and other attributes. They also provide information about the imagery used to identify the structure.

Oct 2024

SBA disaster loans.

“Following a declared disaster,” the US Small Business Administration offers “disaster assistance in the form of low-interest, long-term disaster loans for damages not covered by insurance or other recoveries to businesses of all sizes, private nonprofit organizations, as well as homeowners and renters.” The SBA publishes anonymized data about each such loan in fiscal years 2000 to 2022, drawn directly from its Disaster Credit Management System. The records provide the relevant disaster declaration IDs, property type, ZIP code, city, county, state, verified losses (in real estate and in “content”), and approved loan amounts. Previously: SBA datasets on the Paycheck Protection Program (DIP 2020.07.08) and the administration’s 7(a) and 504 loan programs (DIP 2023.01.11). [h/t Benjamin L. Collier et al.]

Oct 2024

Anglo-Saxons on record.

The Prosopography of Anglo-Saxon England project “aims to provide structured information relating to all the recorded inhabitants of England from the late sixth to the late eleventh century.” Built over (relatively less) time by several teams at UK universities, PASE is “based on a systematic examination of the available written sources for the period, including chronicles, saints’ Lives, charters, libri vitae, inscriptions, Domesday Book and coins.” The Domesday-focused portion of the project features a downloadable table of 17,000+ landholders from that manuscript, listing their name (where known), gender, description, value of holdings, and linking to details about those holdings. [h/t Derek M. Jones]

Oct 2024

Trash balloons.

Since May of this year, North Korea has floated thousands of trash-carrying balloons into South Korea. A team from the Center for Strategic and International Studies’ Beyond Parallel project has mapped 160+ known balloon landing locations, based on public sources. The map’s downloadable data indicates each landing’s date, associated “wave,” coordinates, location name, and province. As seen in: Reuters’ visually immersive article on the topic. [h/t Soph Warnes]

Oct 2024

Fuel forecasts.

The US Energy Information Administration’s monthly Short-Term Energy Outlook provides forecasts and recent trends of energy supply, consumption, prices, and inventory. It covers a range of commodities and electricity sources, such as crude oil, coal, natural gas, gasoline, renewables, and nuclear. Starting with the September 2024 report, the outlook has also begun to include more detailed data on biofuels, available in Table 4d of its structured datasets.

Oct 2024

Evapotranspiration.

With the goal of “filling the biggest data gap in water management,” the OpenET project uses satellite imagery, weather data, and other sources to estimate the volume of evapotranspiration — “the process by which water is transferred from the land to the atmosphere” — at a 30-meter resolution across 17 states in the western US. The results are available to explore via an online map of annual cumulative evapotranspiration (2019–2024), as monthly datasets via Earth Engine, and through an API. [h/t Mira Rojanasakul]

Oct 2024

Disability claims processing.

The Social Security Administration publishes monthly and annual datasets tracking each state agency’s progress processing disability claims. The datasets, which go back to October 2000, count the number of initial claims received, pending, determined, and approved by each agency during each period, as well as similar breakdowns for denial reconsiderations and continuing disability reviews. The administration’s extensive catalog of public datasets also includes several that measure the waiting involved, such as monthly average initial claim processing times, average wait times for reconsiderations, and wait times for administrative hearings. As seen in: “Wait times for Social Security disability benefit decisions reach new high” (USAFacts).

Sep 2024

James Beard honorees.

Cody Winchester has constructed a dataset of James Beard Foundation Award semifinalists, nominees, and winners since 1991, sourced from the foundation’s award-search page. For each honoree, the dataset provides their name, year, category, subcategory, and award status, plus additional category-specific variables (such as publisher for the book awards, and location for restaurant and chef awards).

Sep 2024

Canadian mines.

Economist Clara Dallaire-Fortier has compiled a dataset of “mine-level estimates for the Canadian mining industry with a persistent annual coverage between 1950 and 2022,” based partly on historical government maps. For each of the 947 mines identified, the dataset indicates its name, location, mining companies, dates open/closed, and commodities produced. Previously: Australian mine production, 1799–2021 (DIP 2023.07.12).

Sep 2024

Tree canopies.

The High Resolution Canopy Height Maps dataset, released in April by Meta and the World Resources Institute, estimates “tree canopy height at a 1-meter resolution, allowing the detection of single trees at a global scale.” It is available to explore online and download, and was constructed by applying machine learning techniques to satellite imagery and LiDAR data. The estimates use satellite imagery mostly from 2018–2020, and “when newer imagery is available, the publicly shared model can be used to detect change in canopy heights.” [h/t Ben Hur Pintor]

Sep 2024

Open-source weather APIs.

Open-Meteo, an open-source project built on data from national weather services, offers a range of weather and climate APIs that are free for non-commercial use. They include weather forecasts (temperature, humidity, precipitation, wind speed, etc.), daily historical weather since 1940, climate change model outputs, marine wave forecasts, air quality assessments, and more. The project also provides bulk downloads of the underlying data and self-hosting instructions. As seen in: Jan Kühn’s Historical Meteo Graphs. [h/t Giuseppe Sollazzo]
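For a sense of what calling the forecast API might look like, here is a minimal sketch that builds a request URL and reads one hourly variable out of a response. The endpoint, the `latitude`/`longitude`/`hourly` parameters, and the `temperature_2m` variable reflect Open-Meteo's public documentation as best recalled and should be checked against the current docs; the helper function names and the sample payload are hypothetical, written by hand so the sketch runs without a network call.

```python
import json
from urllib.parse import urlencode

# Forecast endpoint as documented by Open-Meteo at the time of writing;
# treat the URL and parameter names as assumptions and check the current docs.
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude: float, longitude: float, hourly: list[str]) -> str:
    """Build a forecast request URL for the given point and hourly variables."""
    query = urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "hourly": ",".join(hourly),
    })
    return f"{FORECAST_URL}?{query}"


def hourly_series(payload: dict, variable: str) -> list[tuple[str, float]]:
    """Pair each timestamp with its value for one hourly variable."""
    hourly = payload["hourly"]
    return list(zip(hourly["time"], hourly[variable]))


# Hand-written sample shaped like a response, so the sketch runs offline.
sample = json.loads("""
{"hourly": {"time": ["2024-09-01T00:00", "2024-09-01T01:00"],
            "temperature_2m": [14.2, 13.8]}}
""")

url = build_forecast_url(52.52, 13.41, ["temperature_2m"])
series = hourly_series(sample, "temperature_2m")
```

In real use, the URL would be fetched with an HTTP client (for instance `urllib.request.urlopen`) and the decoded JSON passed to `hourly_series` in place of `sample`.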

Sep 2024

NYC evictions.

New York City’s government publishes a dataset listing evictions “pending, scheduled and executed” since 2017, updated daily. The data are “compiled from the majority of New York City Marshals,” who are mayor-appointed officers tasked with enforcing civil court cases. Each of the 97,000+ rows indicates the eviction court case number, address, property type, eviction type, execution date, and marshal. Related: The city also publishes data on marshals’ annual eviction revenues. Also related: nycdb points to, and helps download, a range of NYC housing–related datasets. As seen in: “Spiking Evictions Renew Calls to Reform NYC Marshals System,” by Patrick Spauster for City Limits.

Sep 2024

Messi’s moves.

StatsBomb, a soccer/football-data company, publishes a subset of its detailed, in-play data for free. Among the offerings: Every touch, pass, dribble, and shot from Lionel Messi’s 17 seasons playing for Barcelona in La Liga. Related: Carlos Menezes’s tool for visualizing StatsBomb event data files. Read more: Net Gains: Inside the Beautiful Game’s Analytics Revolution, by Ryan O’Hanlon. [h/t Giuseppe Sollazzo]

Sep 2024

Art words.

The Getty Vocabularies, published by the Getty Research Institute, “contain structured terminology for art, architecture, decorative arts, archival materials, visual surrogates, art conservation, and bibliographic materials.” They provide definitions, relationships, translations, and disambiguations for a broad range of terms and entities. Their Art & Architecture Thesaurus, for example, describes 57,000+ generic concepts (e.g., lithography), while others focus on artist names, cultural objects, and geographies. The records are available several ways, including bulk downloads. [h/t Lynn Cherny]

Sep 2024

Alcohol consumption.

The National Institute on Alcohol Abuse and Alcoholism’s latest consumption surveillance report, published earlier this year, uses sales and shipment data to measure annual alcohol intake by beverage type (beer, wine, spirits) and state. The report and corresponding data file estimate the likely total and per-capita volumes (of the beverages and of their ethanol content) consumed each year from the 1970s through 2022. Related: Additional surveillance reports and the CDC’s list of surveys gathering data on alcohol use. [h/t Millie Giles]

Sep 2024

Long-run economic growth.

The Maddison Project Database, based on the work of Angus Maddison (1926-2010), “provides information on comparative economic growth and income levels over the very long run.” Its latest release includes historical per-capita GDP estimates for 169 countries, in many cases spanning several centuries. In all, the database contains 21,000+ such estimates and another 17,000+ population estimates, drawn from hundreds of sources. Previously: The Penn World Table (DIP 2016.08.17) — “income, output, input and productivity” estimates now “covering 183 countries between 1950 and 2019” — and the Long-Term Productivity Database (DIP 2020.04.08).

Sep 2024

Landslides.

The US Geological Survey has released a new map of landslide susceptibility, indicating the specific areas of the country (at 90-meter resolution) that are at greatest risk of slides. The map and county-level metrics are also available as structured data. To calculate the susceptibilities, Benjamin B. Mirus et al. combined data from the agency’s 3D Elevation Program and a national landslide inventory that they updated. The latter provides the location (as a single point or more detailed boundary), timing, number of fatalities, and confidence level for 610,000+ landslides (or evidence of them) in the US since the early 1900s. The researchers also consulted data from state landslide inventories published by Idaho, Maine, North Dakota, and West Virginia.

Sep 2024

Italian tax-to-charity allocations.

Italy’s “five per thousand” program allows taxpayers to allocate 0.5% of their income tax to certain nonprofits, research institutions, and other social-benefit organizations. The country’s Ministry of Economy and Finance has published information about 2022’s beneficiaries, but initially did so only via PDFs. Earlier this year, the Liberiamoli tutti! initiative converted those PDFs into structured data that list each recipient organization’s name, tax ID, category, region, province, municipality, number of taxpayers choosing it, and amount of money allocated. The ministry has since added structured files of its own.

Sep 2024

Source code.

Software Heritage, a nonprofit initiative collaborating with UNESCO, maintains “the largest public collection of source code in existence”: an archive tracking 20 billion source files and 4 billion code-commits from 317 million projects from a range of public software hosts (GitHub, GitLab, BitBucket, npm, et cetera). Its Graph Dataset, which provides access to the archive’s content and internal relationships, is available via bulk downloads and APIs. [h/t Derek M. Jones]

Sep 2024

Monthly crime trends.

The Real-Time Crime Index, launched last week by a team of crime-data analysts, presents a “sample of reported crime data from hundreds of law enforcement agencies nationwide which mimics national crime trends with as little lag and the most accuracy possible.” Framed as a supplement to the FBI’s slow-to-update official statistics, the project provides monthly and rolling 12-month totals of reported crimes (using the FBI’s UCR Part I offense categories) for the nation, individual cities, and by city population size. You can download the data and see the sources for each of the 300+ local agencies in the national sample. Read more: “The Real-Time Crime Index Shows Declining Crime in 2024,” from project co-leader Jeff Asher’s newsletter.
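The rolling 12-month totals can be understood as a trailing sum over the monthly counts. The sketch below shows one way to compute such a sum; it is an illustration of the general technique with made-up numbers, not the project's own code, and the function name is hypothetical.

```python
from collections import deque
from typing import Iterable, Optional


def rolling_12_month_totals(monthly_counts: Iterable[int]) -> list[Optional[int]]:
    """Return the trailing 12-month sum for each month.

    Months with fewer than 12 months of history get None.
    """
    window: deque[int] = deque(maxlen=12)
    running = 0
    totals: list[Optional[int]] = []
    for count in monthly_counts:
        if len(window) == 12:
            running -= window[0]  # oldest month, about to be evicted
        window.append(count)
        running += count
        totals.append(running if len(window) == 12 else None)
    return totals


# Made-up monthly counts: 13 months, so only the last two have a full window.
counts = [10, 12, 11, 9, 8, 10, 13, 14, 12, 11, 10, 9, 7]
totals = rolling_12_month_totals(counts)
```

Because each total spans a full year, a trailing sum like this smooths out seasonal swings that make single-month comparisons noisy.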

Sep 2024

Health and nutrition.

Since 1999, the CDC has been continuously fielding its National Health and Nutrition Examination Survey, interviewing and testing approximately 5,000 people in 15 different counties each year. The survey combines “demographic, socioeconomic, dietary, and health-related questions” with an “examination component” involving “medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.” Its public-access data files provide anonymized, respondent-level records and are currently available for surveys conducted through March 2020. As seen in: Catherine McDonough et al.’s dataset and interactive dashboard “exploring factors associated with prediabetes and diabetes mellitus among youth in the United States.”

Aug 2024

Olympic medalists.

The European Data Journalism Network’s Giorgio Comai has used Wikipedia and Wikidata to create a series of datasets listing the name, birth date, sex, and birthplace of Summer Olympic medalists. Comai has mapped the birthplace coordinates and, for Europe-born medalists, linked them to their NUTS regions. The project focuses on the 2024 and 2020 Summer Olympics but also provides provisional data for other recent iterations. [h/t Federico Caruso]

Aug 2024

UK grantmakers.

The UK Grantmaking initiative “is a unique cross-sector collaboration between” several major organizations in the field. Their downloadable dataset provides information about 12,000+ trusts, foundations, charities, and other grantmakers for financial year 2022-23, based on records from government regulators. The dataset lists each organization’s name, government-assigned ID, location, category, registration date, income, spending totals, net assets, and more. Previously: UK grants via 360Giving (DIP 2018.12.05). [h/t Giuseppe Sollazzo]

Aug 2024

California residential water supply.

Marie-Philine Gross et al.’s dataset of residential water demand and supply in California includes the monthly volumes of water produced/sold by 404 of the state’s water suppliers, covering 2013–2021. The researchers extracted, standardized, and cleaned the data from the state’s mandatory annual reports, which collect thousands of data points from each supplier. They also added contextual information, such as climatic data (monthly local precipitation, temperature, and drought severity) and each supplier’s hydrologic region.

Aug 2024

Multinational corporations.

The Multinational Enterprise Information Platform, a collaboration between the OECD and the UN Statistics Division, provides publicly sourced data on the 500 multinational corporations with the largest market capitalization. Its “Global Register” dataset examines the companies’ structure, listing each subsidiary’s name, parent company, address, alternative names, and various unique identifiers. The “Digital Register” dataset lists all known web domains controlled by each company and assessments of those domains’ popularity. The platform’s “Media Monitor” feature, although not downloadable, links to news articles and other webpages mentioning the companies. [h/t Annie Burns-Pieper]

Aug 2024

H-1B lotteries.

A recent Bloomberg News investigation into the US government’s annual H-1B lottery, a key step in allocating the country’s skilled-worker visas, finds that “thousands of companies got an unfair advantage by helping themselves to extra lottery tickets.” To reach those conclusions, the team “obtained data on all H-1B lottery registrations, selections, and petitions for fiscal years 2021 through 2024 after bringing a lawsuit against the Department of Homeland Security under the Freedom of Information Act.” They’ve shared the records, which indicate each registration’s employer, as well as the proposed beneficiary’s gender, nationality, and birth year. For registrations that led to visa petitions, the data include additional details, such as the worksite, salary, job title, and beneficiary’s field of study. [h/t Eric Fan]