Data is Plural archive

Mar 2023

Political podcasts.

The Popular Political Podcast Dataset, developed by the Brookings Institution’s Valerie Wirtschafter and Chris Meserole, covers 50,000+ episodes from 100+ “prominent political podcast series” — the latter based on Apple Podcasts’ popularity rankings and its “You Might Also Like” recommendations. Updated daily and explorable online, the dataset provides each episode’s name, description, air date, and URL, plus the series name, partisan leaning, and Apple Podcasts category.

Mar 2023

Municipal incorporations.

Christopher B. Goodman, a professor of public administration, has consulted a range of state-level sources to compile a dataset listing the year of incorporation for 18,000+ municipalities in the United States. The dataset, which covers nearly 96% of all active municipalities, also provides each place’s name, state, coordinates, canonical ID in the Census, and more. Read more: In a Twitter thread, Goodman explains why he undertook the effort and shares a couple of visualizations. [h/t Maggie Lee]

Mar 2023

Aid for Ukraine.

Christoph Trebesch et al.’s Ukraine Support Tracker “lists and quantifies military, financial and humanitarian aid to Ukraine in the context of the Russia-Ukraine war.” The 1,400+ entries in the tracker’s dataset include contributions and commitments from 40 governments, plus several European Union institutions. (It does not include aid from NGOs and other non-state entities.) Each entry indicates the country, announcement date, type of aid, total value, description, sources, and more. The tracker’s next update is scheduled for March 29.

Mar 2023

Civilian harm in Ukraine.

Researchers at Bellingcat and contributors to its Global Authentication Project have assembled a map and dataset of 1,000+ incidents “that have resulted in potential civilian impact or harm since Russia began its invasion of Ukraine.” They include incidents “where rockets or missiles struck civilian areas,” “where attacks have resulted in the destruction of civilian infrastructure,” and/or where visual evidence depicts civilian injuries or “immobile civilian bodies.” The information, collected from public sources and vetted by Bellingcat, includes each incident’s date, location, description, sources, type of area affected, and type of weapon system (if known). [h/t Philip Bump]

Mar 2023

The market for X-Men.

Anderson Evans’s Mutant Moneyball project uses comic book market data to explore the financial value of individual X-Men characters. The project’s dataset provides decade-by-decade statistics for 26 members of the team, drawn from sales histories and pricing guides, as well as a matrix indicating the issues in which each character appeared.

Mar 2023

Irrigation by county and crop.

P. J. Ruess et al. have developed annual, county-level estimates of irrigation water use for 20 crop groups between 2008 and 2020. The calculations draw on water use data from the US Geological Survey, as well as high-resolution data on crop locations, climate, and more. They generate estimates for surface water withdrawals, groundwater withdrawals, and nonrenewable groundwater depletion, making the findings “the first national-scale assessment of irrigation by crop, water source, and year.” [h/t Mike Stucka]

Mar 2023

Policies, categorized.

The Comparative Agendas Project “assembles and codes information on the policy processes of governments from around the world,” categorizing them into 20+ topics (e.g., “Civil Rights”) and 200+ subtopics (e.g., “Handicap Discrimination”). It “actively monitors thirty different data series,” which you can download and explore online, “all coded by this same predictable, reliable coding system.” Previously: CAP categorizations for a decade of NYT front-page stories (DIP 2018.04.25). [h/t E.J. Fagan]

Mar 2023

Bank financials, 1867–1904.

Federal Reserve economists Sergio Correia and Stephan Luck have compiled a dataset of “annual national bank balance sheets for more than 7,000 unique national banks, covering the years 1867 to 1904.” They did so by “combining optical character recognition (OCR) techniques with modern layout separation techniques,” which allowed them to extract information from scans of the Office of the Comptroller of the Currency’s annual reports to Congress. The data include asset and liability subtotals, receivership dates, city-level variables, and more. Related: Correia and Luck describe their methodology in a recent paper and open-access preprint.

Mar 2023

Bank financials, 1976–present.

The Federal Financial Institutions Examination Council’s National Information Center “provides comprehensive financial and structure information on banks and other institutions for which the Federal Reserve has a supervisory, regulatory, or research interest.” Its datasets include quarterly financial statements for bank holding companies, going back to 2016, plus detailed attributes of all active banks, 150,000+ banks closed since the mid-1930s (including Silicon Valley Bank), and 160,000+ bank branches. The agency also provides bank financials in the form of “call reports” going back to 2001. Earlier call reports, going back to 1976, are available from the Chicago Fed. Related: The FDIC’s list of failed banks since October 2000. [h/t Sergio Correia]

Mar 2023

Shows cut short.

IsItCutShort.com provides a searchable list of television shows that were canceled (e.g., Knight Rider), ended on a cliffhanger (The Sopranos), or both (Rubicon). The database provides each series’s title, cliffhanger and cancellation status, IMDB identifier, and occasional extra notes. A handful of the 130+ entries fit another category: shows that “ended without a cliffhanger, but more show content exists outside the show itself.” [h/t Dan Brady]

Mar 2023

Decades of UK prices.

In January 2023, the UK’s Office for National Statistics collected 139,000+ price quotes from thousands of stores and across hundreds of products, from “A4 PRINTER PAPER (500 REAM)” to “YORKSHIRE PUDDING FROZEN”. The agency has collected this kind of price-quote data for decades, using it to calculate inflation and price indices. Economist Richard Davies has aggregated the data going back to 1988 and standardized it, correcting misrecorded prices and offsetting measurement changes, among other efforts described in a 2021 working paper.

Mar 2023

Changes of address.

In the “Frequently Requested Records” section of its online FOIA library, the US Postal Service provides datasets counting how many individuals, families, and businesses have registered for the agency’s change-of-address service, by month and ZIP code. The datasets tally the moves originating from a given ZIP code separately from those destined for it, although moves within the same ZIP code are counted on both sides of the ledger. Related: The companies to which USPS sells mover-level data. [h/t Tim Henderson]
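
A minimal sketch of how the two tallies might be combined into a net-migration figure per ZIP code. The function name, the dictionary layout, and the sample counts are all hypothetical; the actual USPS files may be structured differently. Because a move within a single ZIP code is counted once as an origin and once as a destination, it cancels out of the net figure.

```python
def net_moves(moves_from: dict[str, int], moves_to: dict[str, int]) -> dict[str, int]:
    """Return destination count minus origin count for each ZIP code.

    Both inputs are hypothetical mappings of ZIP code -> number of
    change-of-address requests. A ZIP missing from one side counts as 0.
    """
    zips = set(moves_from) | set(moves_to)
    return {z: moves_to.get(z, 0) - moves_from.get(z, 0) for z in sorted(zips)}


# Made-up counts for a single month, for illustration only.
origins = {"10001": 120, "94103": 80}
destinations = {"10001": 100, "94103": 95}
print(net_moves(origins, destinations))
```

A negative value means more requests left the ZIP code than arrived in it during the period.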

Mar 2023

Debt-to-income ratios.

The US Federal Reserve generates quarterly statistics estimating the median ratio of household debt to income in each state, county, and metro area. The published maps and datasets, which go back to 1999, don’t include precise figures, but rather place each geographic unit into one of ten ranges. The income calculations come from the Bureau of Labor Statistics, while the debt estimates (which do not include student loans) come from the New York Fed Consumer Credit Panel, “an anonymized 5 percent random sample of Americans with credit files at the credit reporting bureau Equifax.” As seen in: “Debt and Inequality” (American Inequality).

Mar 2023

Government contracting data, cataloged.

The nonprofit Open Contracting Partnership has launched a registry of government procurement datasets that use its Open Contracting Data Standard (featured in DIP 2020.02.26). The registry contains 100+ entries so far, across 50+ countries — from Argentina’s national roads authority and the city of Buenos Aires to Zambia’s Public Procurement Authority. You can filter the listings by dataset recency, update frequency, and the data types included (parties, awards, documents, amendments, et cetera). [h/t Georg Neumann]

Mar 2023

Thump, thump, coconut.

“Traditionally,” in the Philippines, “coconuts are classified into their maturity levels manually,” June Anne Caladcad and Eduardo Piedad Jr. write. “Traders often use their fingernails, knuckles, or the blunt end of the knife to tap the coconuts before assessing the sounds produced.” The authors and their colleagues have developed hardware and software to emulate that process, and used it to collect acoustic signal data from 129 premature, mature, and overmature coconuts, each mechanically knocked on each of its three ridges.

Mar 2023

20th-century occupations.

Between 1939 and 1991, the US government published several iterations of the now-discontinued Dictionary of Occupational Titles, a precursor to the O*NET database (DIP 2017.09.27). The dictionaries included job descriptions, classification codes, and cross-references, but are mostly available only as scans. So Shahad Althobaiti et al. organized the manual transcription of five major editions into structured text files. A random sample of 1939’s titles: punch-press operator, seam dampener, base brander, box pleater, and necktie finisher.

Mar 2023

Programming languages.

PLDB is a database that describes several thousand programming languages, file formats, communications protocols, and other related concepts. Its downloads, available in several formats, provide information on the languages’ years announced, technical features, creators, countries and communities of origin, relevant books and URLs, popularity metrics, and more. [h/t Derek M. Jones]

Mar 2023

EPA-regulated facilities.

The US Environmental Protection Agency’s Facility Registry System “provides Internet access to a single source of comprehensive information about facilities, sites or places subject to environmental regulations or of environmental interest.” It includes each entity’s name, type, location, industry, regulatory programs, and more. That information, which spans millions of facilities, is “subjected to rigorous verification and data management quality assurance procedures.” The records also provide facilities’ ID numbers from other EPA systems, such as the agency’s Risk Management Program database featured in last week’s edition. [h/t Michael Allen]

Mar 2023

Congressional votes and ideology.

The Voteview project “allows users to view every congressional roll call vote in American history,” and places those votes in the context of ideology estimates along a liberal-to-conservative spectrum. The core estimates come from DW-NOMINATE, a method developed by the project’s directors emeritus, Keith T. Poole and Howard Rosenthal. Voteview’s bulk data includes ideology estimates for every member of the House and Senate since 1789, every vote taken in either chamber, and every member’s position on those votes. [h/t Philip Bump]

Feb 2023

Data journalists, surveyed.

The European Journalism Center’s DataJournalism.com has published a dataset of 1,800+ anonymized responses to its second annual State of Data Journalism Survey, including 50+ entries each from the US, UK, Italy, Germany, Spain, India, and Nigeria, plus double-digit counts from dozens of other countries. The questions touch on demographics, employment, training, skills, the COVID-19 pandemic, and more. [h/t Simona Bisiani]

Feb 2023

Unclaimed estates.

The UK government’s Bona Vacantia division publishes a dataset of unclaimed estates: inheritances that nobody has yet claimed. The entries indicate the deceased person’s name, aliases, date/place of birth and death, marital status, and more. Related: California provides a dataset of unclaimed property, such as “lost or forgotten” bank accounts, insurance benefits, and stock holdings.

Feb 2023

Daily European gas imports.

Researchers at Bruegel are tracking daily and weekly natural gas imports to Europe, using data from the European Network of Transmission System Operators for Gas’s transparency portal. Alongside the imports, which they’re aggregating by source (e.g., Russia, Norway, Algeria) and by route (e.g., Nord Stream, TurkStream), the researchers are also tracking gas storage levels, using data from Gas Infrastructure Europe (DIP 2022.01.26). Previously: Eurostat’s data on annual European energy imports and exports (DIP 2022.03.16).

Feb 2023

Animal Welfare Act inspections.

The USDA’s Animal and Plant Health Inspection Service checks whether animal dealers, exhibitors, research facilities, and transporters are complying with the care standards set by the Animal Welfare Act. The agency provides public access to the inspection reports but no bulk data on them. So, in a collaboration between Big Local News and the Data Liberation Project (same disclosure as in the entry below), Ben Welsh and I wrote code to fetch the 80,000+ (and counting) inspections going back to 2014, parse their PDFs, and make the data more accessible. The information includes each inspection’s date, type, licensee, violation counts, species inspected, and more.

Feb 2023

Facilities handling hazardous chemicals.

The US Environmental Protection Agency’s Risk Management Program rule requires facilities that handle “extremely hazardous substances” to tell the government, at least every five years, about those substances, their safety plans, their recent accident history, and more. Through a FOIA request to the EPA, the Data Liberation Project (which, disclosure, I run) obtained a copy of the agency’s database of these filings (minus some parts the government deems non-disclosable), containing submissions by 21,000+ facilities from early 1999 to February 2022. You can now access that data, in various formats, along with documentation guiding you through it.

Feb 2023

Open Data Day events.

Open Data Day(s), scheduled for March 4–10 this year, “is an annual celebration of open data all over the world.” The Open Knowledge Foundation, which helps to coordinate the locally organized gatherings, hosts a searchable list of registered events, plus datasets of each year’s events since 2014.

Feb 2023

Bog bodies.

Roy van Beek et al. “present the first large-scale overview of well-dated human remains from northern European mires, based on a database of 266 sites and more than 1000 bog mummies, bog skeletons and disarticulated/partial skeletal remains.” The database, which can be found in the study’s supplementary materials tab, indicates the bog bodies’ location, year found, preservation level, sex, estimated age, assumed cause of death, and much more. [h/t Miriam Posner + Robin Sloan]

Feb 2023

Diplomatic visits.

The US Department of State’s Office of the Historian maintains a dataset and online directory of every visit by a foreign leader from 1874 to 2020. The details include each visit’s starting and ending date, the visitor’s name, their country, and a brief description. The office publishes similar data on the travels of US presidents and secretaries of state. Related: Matt Malis and Alastair Smith have expanded the data for 1946–2019, adding fields that indicate the type of visit, whether it involved a presidential meeting, the names of agreements signed, and more. Previously: Diplomatic gifts (DIP 2022.08.03).

Feb 2023

Humanitarian groups.

“Based on clear and reproducible criteria,” Clara Egger and Doris Schopper have compiled the Humanitarian Organizations Dataset, which describes 2,500+ groups active in the sector. It includes the organizations’ founding years, structures, countries headquartered, regional scopes, types of activities, targets for assistance, and more. You can also explore a version of the dataset online. Related: The Global Database of Humanitarian Organisations, from Humanitarian Outcomes, “a team of specialist consultants providing research and policy advice for humanitarian aid agencies and donor governments.”

Feb 2023

Fair market rents.

Every year, the Department of Housing and Urban Development recalculates what it calls “fair market rents” for every county in the US and for individual ZIP codes in metropolitan counties. The results, which factor into various housing subsidy programs, represent the 40th percentile cost of monthly rent and (basic) utilities for “recent movers” in “standard quality” units, adjusted for the number of bedrooms. HUD’s annual spreadsheets go back to the early 2000s; you can also browse the estimates online and query them via an API. As seen in: “Where are rents rising post COVID-19?” (USAFacts).
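
To make "40th percentile" concrete, here is a small sketch that computes a percentile by linear interpolation over a sorted list. It illustrates only the statistic, not HUD's methodology, which also involves the recent-mover and unit-quality criteria and the bedroom adjustments described above. The function name and the sample rents are hypothetical.

```python
def percentile(values: list[float], pct: float) -> float:
    """Return the pct-th percentile of values, interpolating linearly
    between the two nearest ranks in the sorted list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100  # fractional index into the sorted list
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up monthly rents (including utilities) for one area and bedroom size.
sample_rents = [1400, 900, 1200, 1000, 1300, 1100]
print(percentile(sample_rents, 40))
```

By construction, roughly 40 percent of the sampled rents fall at or below the resulting figure, so it sits somewhat under the median.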