Data is Plural archive
Pleiades, “a community-built gazetteer and graph of ancient places,” has collected data on 40,000+ settlements, roads, rivers, monuments, and many other types of landmarks. It also describes the relationships between them — linking, for instance, the Parthenon to the Acropolis and the Acropolis to Athens. Related: ORBIS: The Stanford Geospatial Network Model of the Roman World. Previously: Roman amphitheaters (DIP 2022.06.08) and the Digital Atlas of Roman and Medieval Civilizations (DIP 2020.06.24). [h/t Avi Levin]
GitHub’s new Innovation Graph datasets present a range of quarterly metrics on the code-sharing site, aggregated by “economy” — a concept similar to “country” but slightly broader. (Antarctica is an “economy” in the data, for example.) The datasets count the number of developers based in each economy, their repositories and code pushes, most-used programming languages, and more. As noted in the project’s datasheet, the locations are based on IP addresses, so VPN usage may distort the results. Previously: More-granular GitHub activity via the GH Archive (DIP 2018.02.21). [h/t Kevin Xu]
The PopuList, constructed by Matthijs Rooduijn et al., “offers academics and journalists an overview of populist, far-left and far-right parties in Europe from 1989 until 2022.” Version 3.0 of the dataset, released last month, lists each party’s country, local/English names, presence in parliament, and identifiers in the Party Facts (DIP 2019.01.16) and ParlGov (DIP 2018.09.19) databases. It also indicates whether the project’s comparativists and country experts classified the party (outright or “borderline”) as populist, far-right, far-left, and/or euroskeptic, and for which time periods.
Since launching in 2011, FracFocus has become the largest registry of hydraulic fracturing chemical disclosures in the US. The database, available to explore online and download in bulk, contains 210,000+ such disclosures from fracking operators; it details the location, timing, and water volume of each fracking job, plus the names and amounts of chemicals used. The project is managed by the Ground Water Protection Council, “a nonprofit 501(c)6 organization whose members consist of state ground water regulatory agencies.” As seen in: The latest installment of the New York Times’ Uncharted Water series.
The Census Tree, developed by Kasey Buckles et al., “is the largest-ever database of record links among the historical U.S. censuses, with over 700 million links for people living in the United States between 1850 and 1940.” By the team’s estimates, the dataset “includes over 70% of the possible links that could be made for men, and over 60% of possible links for women.” To build it, the researchers began with genealogy records from FamilySearch.org, developed a machine learning algorithm to identify additional links, and incorporated data from other census-linking projects. Each Census Tree row connects an individual’s IPUMS identifier in one decade’s census to their identifier in another. Per the project’s instructions, you’ll need to download the Census data itself from IPUMS.
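To make the row structure concrete, here is a minimal sketch of how a table of links might be used to pair up one person's records across two census years. Everything in it is hypothetical: the identifiers, field names, and sample people are invented for illustration, and the actual Census Tree and IPUMS files use their own column names and formats.

```python
# Hypothetical link rows: (identifier in the earlier census, identifier in the later census).
links = [("A1", "B7"), ("A2", "B9"), ("A3", "B4")]

# Hypothetical census extracts, keyed by each census's own person identifier.
census_earlier = {
    "A1": {"name": "Mary", "age": 12},
    "A2": {"name": "John", "age": 30},
}
census_later = {
    "B7": {"name": "Mary", "age": 22},
    "B9": {"name": "John", "age": 40},
}


def follow_links(links, earlier, later):
    """Return (earlier_record, later_record) pairs for links found in both extracts."""
    pairs = []
    for id_earlier, id_later in links:
        # Skip links whose records are missing from either extract.
        if id_earlier in earlier and id_later in later:
            pairs.append((earlier[id_earlier], later[id_later]))
    return pairs


linked = follow_links(links, census_earlier, census_later)
for before, after in linked:
    print(before["name"], before["age"], "->", after["age"])
```

The third link is dropped because neither extract contains its records, which mirrors the situation where you have downloaded only part of the underlying census data.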
Zahra Gharaee et al.’s BIOSCAN-1M Insect Dataset contains one million microscope photographs of bugs, each taxonomically classified by experts and supplemented with raw DNA sequences and genetic “barcode” identifiers. The dataset, part of the broader BIOSCAN initiative, includes 8,300+ species across 3,400+ genera; the specimens were primarily collected in Costa Rica, Canada, and South Africa, using tent-like traps. Previously: Bug splats (DIP 2020.03.04). [h/t Robin Sloan]
Through a FOIA request to the US Department of the Interior, journalist Ben Welsh has obtained and published the agency’s drone roster. For each of the 850+ remote-controlled aircraft, the dataset lists the agency bureau and office, drone manufacturer, model, cost, serial number, and more. The FOIA request also unearthed spreadsheets listing specific drone flights and multi-flight deployments. Previously: Drone registration data (DIP 2022.12.21), also obtained by Welsh via FOIA.
Political scientist Matthew M. Singer’s State Executive Approval Database contains the results of 10,000+ gubernatorial approval polls, spanning all 50 states and going back decades. The database, which builds on earlier efforts by Thad Beyle et al., lists each poll’s date, state, governor, pollster, sample size, sample type, ratings scale, percentage of positive/negative responses, and more. Previously: The Executive Approval Project (DIP 2019.10.16), which Singer co-directs.
Net migration estimates.
Using national and subnational data on birth rates, death rates, and population counts, Venla Niva et al. have constructed a dataset of estimated net migration for each part of the world, each year between 2000 and 2019. The estimates are available as gridded data (with ~10km resolution) and at three levels of administrative units: national, provincial, and communal. (For the US, the latter two levels correspond to states and counties, respectively.) The researchers have also published an interactive map of the estimates for each administrative unit.
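One common way to derive net migration from these kinds of inputs is as a residual: the population change left over after accounting for births and deaths. The sketch below illustrates that general idea only; it is not necessarily the researchers' exact method, and the function name, the choice to apply rates to the starting population, and the sample figures are all assumptions made for illustration.

```python
def net_migration(pop_start, pop_end, birth_rate, death_rate):
    """Estimate net migration as the residual of population change.

    birth_rate and death_rate are per 1,000 people per year and are applied
    to the starting population (a simplifying assumption for this sketch).
    """
    natural_increase = pop_start * (birth_rate - death_rate) / 1000
    return (pop_end - pop_start) - natural_increase


# Hypothetical area: grows from 100,000 to 101,500 people in one year,
# with 12 births and 9 deaths per 1,000 residents.
estimate = net_migration(100_000, 101_500, birth_rate=12.0, death_rate=9.0)
print(estimate)
```

A positive result suggests more people arrived than left; a negative one suggests the reverse.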
US facility GHG emissions.
The EPA’s Facility Level Information on Greenhouse Gases Tool “gives you access to greenhouse gas data reported to EPA by large emitters, facilities that inject CO2 underground, and suppliers of products that result in GHG emissions when used in the United States.” The information comes from the agency’s Greenhouse Gas Reporting Program, which also provides bulk data downloads. Per the EPA: “Approximately 8,000 facilities are required to report their emissions annually, and the reported data are made available to the public in October of each year.” The data, which go back to 2010, indicate the facility type, emissions reported, measurement methods, types of fuel used, and much more. Previously: Climate TRACE’s estimates of the world’s largest GHG emitters (DIP 2022.11.16). [h/t Terin V. Mayer]
License plate designs.
Beautiful Public Data’s Jon Keegan scraped the websites of every US state’s (and DC’s) motor vehicle agency to assemble a dataset of 8,291 license plate designs. The dataset provides each design’s name, state, and image. Read more: Keegan’s exploration of the data, which includes a searchable table. Previously: Vanity plates requested in California (DIP 2020.01.29) and New York (DIP 2015.10.21).
The human genome has been sequenced, but what do all those genes do? João J. Rocha et al.’s Unknome database assigns a “knownness” score to “all protein clusters that contain at least 1 protein from humans or any of 11 model organisms.” The score is based on the density of annotations in the Gene Ontology knowledgebase, which bills itself as “the world’s largest source of information on the functions of genes.” The clustering comes from another downloadable database, PANTHER, which contains “comprehensive information about the evolution of protein-coding gene families.”
Prigozhin audio messages.
Giorgio Comai has collected and auto-transcribed hundreds of audio messages from Russian mercenary leader Yevgeny Prigozhin, which were posted “on his official Telegram channel - the press service of his holding company” from late 2022 through June 2023. For each transcribed segment within each message, the resulting datasets (one in Russian and one auto-translated into English) include the message ID, time posted, segment timestamp, and segment text. [h/t Federico Caruso]
Historical newspaper articles.
Melissa Dell et al.’s American Stories dataset contains the text of ~400 million newspaper articles, extracted from ~20 million public-domain scans in the Library of Congress’s Chronicling America project (DIP 2017.08.16). To construct the dataset, the authors built “a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes.” For each article, the dataset provides the newspaper name, edition number, date of publication (largely in the 1800s–1920s), page number, headline, byline, and article text. Previously: The LOC’s Newspaper Navigator dataset (DIP 2020.10.07), which extracts visual content from the Chronicling America scans. [h/t Derek M. Jones]
Political contributions, enhanced.
Political scientist Adam Bonica’s Database on Ideology, Money in Politics, and Elections (DIME) gathers “500 million itemized political contributions made by individuals and organizations to local, state, and federal elections covering from 1979 to 2020.” The project, which received a major update last month, “is intended to make data on campaign finance and elections (1) more centralized and accessible, (2) easier to work with, and (3) more versatile […].” It assigns each contributor a unique identifier, geocodes their stated addresses, quantifies their ideological orientation, and more. The raw data come from several sources, including the Federal Election Commission and OpenSecrets. Related: MoneyInPolitics.wtf, a collaborative project that aims to be “America’s most comprehensive dictionary of campaign finance jargon.” [h/t Isadora Borges Monroy]
Vox populi, except not.
The Onion runs a regular feature called “American Voices,” which presents fake quotes from fake people responding to not-fake events in the news. Cody Winchester has built a spreadsheet listing the headlines, descriptions, and dates for 7,000+ of these features since August 1996; the 23,000+ quotes in them; and the names, occupations, and (almost always recycled) photos of the fictional personae purportedly quoted.
Drawing on 20+ published sources, Tori M. Hoehler et al. have compiled a dataset of 10,000+ measurements of metabolic rates of mammals, fish, birds, insects, tree saplings, bacteria, and other living organisms. The dataset includes several types of metabolic rates (primarily basal, field, and maximum), and a mix of individual-organism and species-average measurements.
Govinda Clayton et al.’s Civil Conflict Ceasefire Dataset “covers all ceasefires in civil conflict between 1989 and 2020, including multilateral, bilateral and unilateral arrangements, ranging from verbal arrangements to detailed written agreements.” The dataset’s 2,200+ ceasefires span 109 conflicts in 66 countries, largely based on news articles discussing the agreements. A team of reviewers manually coded each instance, indicating its type and stated purpose; sides participating; dates declared, entered effect, and ended; and more. Previously: The PA-X Peace Agreements Database (DIP 2018.02.28).
Doctors in practice.
The Physician and Physician Practice Research Database, published by the US government’s Agency for Healthcare Research and Quality, harmonizes data on medical practices from 13 participating states. The public-use files provide each practice’s ZIP code, number of physicians, most common specialty, organizational NPI, and other characteristics. They also provide statistical aggregates at the 3-digit ZIP code level, such as the number of physicians accepting Medicare and/or Medicaid, average claims per month, and more. [h/t Gary Price]
Mapping Diversity, created by the European Data Journalism Network, “is a platform for discovering key facts about diversity and representation in street names across Europe, and to spark a debate about who is missing from our urban spaces.” The interactive analysis of 145,000+ streets in 30 major European cities launched in March, accompanied by spreadsheets calculating city-level statistics and listing all the streets named after women. Last month the team released its full dataset, which provides information for every street examined, plus data for six more cities. Each row indicates a street’s country, city, and name; whether it’s named after anyone; and, if so, the person’s name, gender, and various attributes from Wikidata, such as occupation and date of birth. Previously: Las Calles de las Mujeres (DIP 2019.05.29).
Scotland’s Common Good.
Scotland’s Common Good Act, passed in 1491, creates a legal distinction for historical property owned by local authorities — often land and buildings, but also “moveable assets” such as paintings, chains of office, and furniture. CommonGood.scot, launched in April by investigative journalism cooperative The Ferret, presents a searchable, browsable, and downloadable dataset of 2,900+ of these common good assets, compiled largely through freedom of information requests. [h/t Chris H]
Singapore’s Ministry of Manpower released its 2022 wage tables last month. The tables list the 25th, 50th, and 75th percentiles of basic and gross wages for more than 300 occupations by industry, plus median wages by worker sex, by worker age, and by establishment size. As seen in: The Straits Times’ benchmarking tool and DIP reader Joses Ho’s interactive chart of Singapore’s gender wage gaps.
Dunham’s Data, a project led by Kate Elswit and Harmony Bench, “explores the kinds of questions and problems that make the analysis and visualization of data meaningful for dance history, through the case study of 20th century African American choreographer Katherine Dunham” (1909–2006). Drawing on materials “held by seven archives across the United States,” the team has built three core datasets, accompanied by essays, visualizations, and code repositories. They “document the daily itinerary of Dunham’s touring and travel from the 1930s-60s; the over 300 dancers, drummers, and singers who appeared with her; and the shifting configurations of the nearly 300 repertory entities they performed.” [h/t Selena Chau]
SSVF satisfaction surveys.
The Veterans Administration’s Supportive Services for Veteran Families program aims “to promote housing stability among very low-income Veteran families who reside in or are transitioning to permanent housing.” The VA outsources those services to a network of 200+ selected nonprofits, which it grants hundreds of millions of dollars per year. When a veteran exits a grantee’s program, they’re invited to complete a satisfaction survey. The Data Liberation Project (which, disclosure, I run) filed a FOIA request for the survey data, and received three spreadsheets in return (among other documents), detailing nearly 40,000 anonymized responses from fiscal years 2016–20 and 2022.
International cancer statistics.
The World Health Organization’s Global Cancer Observatory provides interfaces to a range of studies and statistics. Its Cancer Today portal features tables, charts, and maps of “incidence, mortality and prevalence for year 2020 in 185 countries or territories for 36 cancer types by sex and age group.” Those figures come from the latest GLOBOCAN estimates, calculated by the WHO’s International Agency for Research on Cancer based on data from national and regional registries. Note: “Caution must be exercised when interpreting these estimates, given the limited quality and coverage of cancer data worldwide at present, particularly in low- and middle-income countries,” the researchers warn. Previously: Statistics from the American Cancer Society (DIP 2016.01.27).
New York State’s Gaming Commission publishes various lottery-related datasets, including the winning numbers for many national and state lotteries, such as Powerball (since 2010), Mega Millions (since 2002), and Pick 10 (since 1987). New York isn’t alone; the Colorado Lottery, for instance, also publishes downloadable drawing histories. Their Powerball results go back to August 2001 and include the jackpot values, unavailable from New York. As seen in: “The jackpot is a lie,” by Zach Seward.
The Families of England project, led by economic historians Gregory Clark and Neil Cummins, aims “to reconstruct the economic and social position, and the demography, of a representative set of English families” over time. A recent paper by Clark includes a public version of the dataset, which “details the family connections of 422,374 people with rarer surnames in England for births from 1600 to 2022.” The dataset, based in part on genealogies from the Guild of One-Name Studies, indicates (where available) each person’s years of birth, marriage, and death, plus indicators of literacy, sex, occupational status, and more. [h/t Derek M. Jones]
Federal Reserve communications.
Agam Shah et al. have compiled a corpus of key communications by the US Federal Reserve’s Federal Open Market Committee, which “controls the three tools of monetary policy — open market operations, the discount rate, and reserve requirements.” Gathered from the Fed’s website, the corpus includes all meeting minutes and speeches from 1996 to mid-October 2022, and all press conferences from April 2011 to mid-October 2022. The published records include the raw text and metadata of each communication, as well as datasets filtered to key sentences. Previously: Federal Reserve Bank directors (DIP 2021.05.05) and Fed forecasts (DIP 2018.02.07).
The US Department of Agriculture’s Sugar and Sweeteners Yearbook Tables provide “summary statistics on sugar, sugarbeets, sugarcane, corn sweeteners (dextrose, glucose, and high-fructose corn syrup), and honey.” Compiled by the agency’s Economic Research Service from a range of national, international, and industry sources, the statistics are provided as regularly-updated spreadsheets, many of which go back multiple decades. They estimate global and country-level production, supply, distribution, and prices, as well as US imports and consumption. [h/t Sam Larson]
Much mapping material.
The Overture Maps Foundation has released its first datasets, which include 59 million “points of interest” (landmarks, businesses, parks, etc.), 785 million building outlines, road network data, and administrative boundaries. The initiative, which is steered by several giant tech companies, “could help third-party developers use maps that don’t rely on Google and Apple,” The Verge’s Emma Roth writes. The datasets draw on a range of sources, including the project’s member-companies, OpenStreetMap, and USGS’s 3D Elevation Program. Read more: “Exploring the Overture Maps places data using DuckDB, sqlite-utils and Datasette,” by Simon Willison, who considers the data release “a really big deal.” [h/t Avi Levin]