Data is Plural archive

Jan 2022

Tech support scams.

From 2018 to 2021, the now-shuttered PopupDB Project collected information about tech support scams and their deceptive browser popups. Its maintainers have since released two final downloads: a “light” dataset that lists the URLs and web hosts of 11,000+ such popups, and a “full” database that includes screenshots and source code. [h/t NeeP]

Jan 2022

Commuting zones.

Decades ago, the USDA Economic Research Service developed a methodology to group the nation’s counties into hundreds of “commuting zones,” based on the Census’s journey to work data. Those groupings are available from the agency (for 1980, 1990, and 2000), and from researchers at Penn State (for those years plus 2010). More recently, Facebook/Meta has developed its own methodology for estimating commuting zones, using location data collected from its users. The project’s public dataset spans the world and specifies the zones not as sets of counties but as detailed, custom boundaries.

Jan 2022

Foreign commerce interventions.

The Global Trade Alert, launched in 2009, “provides timely information on state interventions taken since November 2008 that are likely to affect foreign commerce,” such as new subsidies, export quotas, import tariffs, or anti-dumping laws. Its downloadable dataset describes 33,000+ interventions, listing their types, implementing jurisdictions, affected jurisdictions, affected products, and more. The project, affiliated with the University of St. Gallen, has also started tracking interventions that affect digital commerce. [h/t Simon Evenett and Johannes Fritz]

Jan 2022

More political emails.

Derek Willis, a journalism lecturer with expertise in political data, has published a searchable, downloadable database of 100,000+ email messages received in recent years by an address he created for this purpose, and “which I routinely plug into forms I find on candidate and committee sites.” The database, which Willis plans to update weekly, lists each message’s timestamp, sender, subject line, and body. Previously: The Princeton Corpus of Political Emails (DIP 2021.06.23), the Markup’s collection of 5,000+ campaign emails (DIP 2020.03.04), and DCInbox’s congressional e-newsletter collection (DIP 2021.03.03). [h/t jcberk]

Jan 2022

Chocolate bar reviews.

The Manhattan Chocolate Society’s Brady Brelinski has reviewed 2,500+ bars of craft chocolate since 2006, and compiles his findings into a copy-paste-able table that lists each bar’s manufacturer, bean origin, percent cocoa, ingredients, review notes, and numerical rating. Related: Craft chocolate makers in the US and Canada, also compiled by Brelinski. [h/t Andrew Maranhão]

Jan 2022

Radio on the internet.

Radio-browser.info is “a community driven effort (like wikipedia) with the aim of collecting as many internet radio and TV stations as possible.” The 29,000+ stations span 200+ countries and 280+ languages. You can explore them on a map, through an API, and via bulk database snapshots. [h/t jlkuester7]

Jan 2022

Journal editors.

For his Open Editors project, Andreas Nishikawa-Pacher scrapes the websites of scholarly journals, extracting the names, affiliations, and roles of their listed editors and board members. The project’s dataset contains half a million associations between editors and 6,000+ scholarly journals from 22 publishers (17 mainstream and 5 “predatory”). Read more: Nishikawa-Pacher et al.’s introductory working paper. And: “Why researchers created a database of half a million journal editors” (Nature Index, 2021).

Jan 2022

Pension plans.

Public Plans Data, a collaboration led by Boston College professor Alicia H. Munnell, gathers extensive information about the retirement plans that state and local governments offer their employees, drawn from those plans’ annual financial reports. The project maintains a range of datasets, including two decades of participation and financial figures for 200+ plans that account for “95 percent of public pension membership and assets nationwide,” their investments, early payout options, and more. It also provides interactive tools and an API.

Jan 2022

Slaveholders in Congress.

At least “1,715 members of Congress were enslavers at some point in their adult lives,” according to a Washington Post investigation published Monday. Reporter Julie Zauzmer Weil began her research with a list of every person who ever served in the House or Senate, filtered it to those born before 1840, and then consulted their biographies, Census records, and other historical documents. The Post’s public dataset, the first of its kind, lists the congressmen from that era, their dates of birth, positions held in Congress, states served, dates served, and whether the Post determined they were slaveholders. For 677 of the congressmen, the Post “couldn’t reach a conclusion” and is seeking assistance from readers.

Jan 2022

Honey bees.

Since the 1980s, the US Department of Agriculture has conducted an annual Bee and Honey Inquiry Survey, which generates estimates of “the number of colonies producing honey, yield per colony, honey production, average price, price by color class and value as well as honey stocks at the state and national levels.” Since 2016, it has also published annual reports that examine the gain and loss of colonies, including losses due to colony collapse disorder.

Jan 2022

Foundation shades.

For “The Naked Truth,” a Pudding article published last year with Ofunne Amaka, Amber Thomas scraped information about 6,800+ foundation shades from the websites of two major cosmetics retailers. The project’s datasets identify each product’s name, description, URL, and the predominant RGB/HSL color value in its swatch image.

Jan 2022

Medical drug names.

To build their International Drug Dictionary, Mohammad A. Khaleel et al. collected trade names and ingredient names “from open access websites belonging to official drug regulatory agencies, official healthcare systems, or recognized scientific bodies from 44 countries around the world,” among other sources. Each of the 450,000+ entries maps a name to standardized ingredient information from the National Library of Medicine’s RxNorm database.

Jan 2022

Joint military exercises.

Jordan Bernhardt’s Joint Military Exercises Dataset describes 5,000+ such operations undertaken between 1977 and 2016, drawn from historical news reports. The dataset lists each exercise’s name, location, when it began and ended, the countries that participated, activities involved, and more. Related: Brandon J. Kinne’s Defense Cooperation Agreement Dataset, a “comprehensive, human-coded dataset” covering bilateral defense treaties between 1980 and 2010.

Jan 2022

Civil asset forfeiture.

“Most states and the federal government have laws allowing police and prosecutors to seize and permanently keep Americans’ cash, cars, homes and other property suspected of being involved in a crime — without regard to the owners’ guilt or innocence,” the nonprofit law firm Institute for Justice writes in its third edition of Policing for Profit, published in 2020. The report gathers and analyzes datasets on property seized in dozens of states through this practice of civil asset forfeiture, and on the spending of forfeiture funds. It also examines seizures from the federal Consolidated Asset Tracking System, detailed public extracts of which the Department of Justice updates quarterly. As seen in: “Cops still take more stuff from people than burglars do” (The Why Axis, 2021), and “Stop and Seize” (Washington Post, 2014).

Dec 2021

Root traits.

Oak Ridge National Laboratory’s Fine-Root Ecology Database categorizes the root characteristics of 4,500+ plant species, as observed and published in scientific literature. The hundreds of types of traits relate to vessel density, root angles, lifespan, macronutrients, microbial symbionts, and more.

Dec 2021

Leaders’ economic persuasions.

Political scientist Bastian Herre’s new Global Leader Ideology dataset “provides unprecedented coverage of chief executives’ [economic] ideologies across time and space,” classifying their approaches as leftist, centrist, rightist, or non-ideological in 182 countries, from 1945 to 2020. Read more: Herre’s introductory paper and Twitter thread.

Dec 2021

COVID-era news layoffs.

“At least 6,154 news organization workers, which includes both editorial and non-editorial staffers, were laid off beginning March 2020 through August 2021,” according to a new report from the Tow Center’s Gabby Miller. An interactive tracker provides a map and downloadable table of the layoffs and other cutbacks, listing each outlet’s name, medium, owner, and location; the cutback’s date, description, and category (layoffs, pay cuts, etc.); and source links. Related: The Washington Post Magazine’s Lost Local News issue.

Dec 2021

Religious congregations.

The Association of Religion Data Archives, founded in 1997, “strives to democratize access to the best data on religion.” Among its resources are four waves of the National Congregations Study, a Duke University–based survey that asks US religious establishments about their denominational affiliation, buildings of worship, congregants, staffing, educational offerings, and more. Representatives from 5,300+ Christian, Jewish, Muslim, Buddhist, Hindu, and other congregations participated in the latest wave, conducted in 2018–19. [h/t Patricia Homan and Amy Burdette + Kevin Lewis]

Dec 2021

Local mortality and the 1918 pandemic.

Martin Eiermann, Elizabeth Wrigley-Field, et al. have conducted an analysis of deaths before, during, and after the 1918 influenza pandemic, drawing on “data from multiple sources, including digitized mortality records for 70 U.S. cities, linked census records that establish urban residency status across multiple censuses” and newspaper accounts of non-pharmaceutical interventions. The team’s published files include a dataset that, for each city-and-year, lists the mortality rate overall, for white vs. non-white residents, and due to flu/pneumonia; a range of demographic variables; the timing of certain interventions; and more.

Dec 2021

Toothbrushing.

Zawar Hussain et al. recorded data from 120 electric and manual toothbrushing sessions, using sensors attached to the brush handle and brusher’s wrist. Each session’s data files trace the sensors’ positions over time and indicate the brush type; the participant’s gender, age, and handedness; and more.

Dec 2021

California water wells.

Domestic wells in the San Joaquin Valley “are drying up at an alarming pace” amid “a frenzy of new well construction and heavy agricultural pumping,” according to a Los Angeles Times article last week. Data reporter Gabrielle LaMarr LeMee’s analysis provides the quantitative backbone, drawing on three state datasets: well completion reports and periodic groundwater level measurements, both of which go back more than a century, and household water supply shortage reports since 2013.

Dec 2021

COVID-19 in European prisons.

A collaboration coordinated by Deutsche Welle and the European Data Journalism Network has gathered data on the pandemic’s impact on prisoners and prison staff in dozens of European countries, including the number of COVID-19 tests, cases, and deaths over time. The data also note the types of preventative measures and vaccine policies in place. Previously: US prison COVID-19 data from the Marshall Project and AP (DIP 2020.05.06) and the New York Times (DIP 2021.04.21). [h/t Lorenzo Ferrari]

Dec 2021

Working hours.

Political scientist Magnus Bergli Rasmussen has compiled data on the regulation of laborers’ total work hours in nearly every country since 1789, available as a Stata file. For each year and territory, the dataset indicates whether such laws existed, the “normal” number of contractually obligated weekly hours, the maximum number of hours allowed, and increases in pay for overtime. Read more: “The Great Standardization: Working Hours Around the World,” in which Rasmussen describes the dataset’s construction.

Dec 2021

Tobacco habits.

Every few years since 1992, the National Cancer Institute has sponsored the Tobacco Use Supplement to the Current Population Survey, administered by the US Census Bureau. In addition to extensive demographic information, the supplement asks about historical tobacco usage (“Have you smoked at least 100 cigarettes in your entire life?”), preferences (“Do you usually smoke menthol or non-menthol cigarettes?”), purchasing habits, e-cigarettes, and much more. Anonymized responses and documentation are available for all survey waves through 2018–19. Previously: The CDC’s Behavioral Risk Factor Surveillance System (DIP 2016.09.14). [h/t Christian Gunadi et al. + Kevin Lewis]

Dec 2021

1 million Bandcamp sales.

Components’ Andrew Thompson has published a dataset of 1,000,000 sales on the music platform Bandcamp during a few weeks in late 2020. For each sale, it includes the item’s description, price, and type; the buyer’s country; a timestamp; and more. It’s a slice of the data used in “The Chaos Bazaar: An analysis of Bandcamp sales, 9/1/2020 - 12/31/2020.”

Dec 2021

Country leaders’ birthplaces.

Axel Dreher et al.’s Political Leaders’ Affiliation Database lists the birthplaces and ethnicities of 1,109 leaders of 177 countries between 1989 and 2020. The birthplaces are described at several levels of administrative detail and are ascribed a latitude, a longitude, and an indication of those coordinates’ precision. The ethnicities are drawn from external, linked sources. [h/t Simon Heß]

Dec 2021

Reproductive assistance in the EU.

Reporters at Civio have collected data on the access to, and availability of financial aid for, in vitro fertilization and artificial insemination in 43 European countries, noting limits based on age, marital status, sexual orientation, gender identity, and other factors. Read more: “More than half of European countries prohibit access to assisted reproduction for lesbians and almost a third do so for single women.” Related: Civio’s visualization code in Observable notebooks. [h/t Olaya Argüeso Pérez + Mike Freeman]

Dec 2021

Zero-day exploits.

Researchers at Google’s Project Zero study “zero-day” vulnerabilities — software flaws discovered by hackers before they can be fixed. Since 2019, the team has published a spreadsheet of known zero-day exploits “in the wild.” The spreadsheet’s 200 entries go back to 2014 and note the software product, its vendor, the flaw’s type and description, date discovered, date patched, and more. Previously: The Common Vulnerabilities and Exposures list (DIP 2018.12.12). [h/t Patrick Howell O’Neill + Bruce Schneier]

Dec 2021

Faster-turnaround mortality data.

Last week, the CDC began publishing provisional US mortality statistics for 2018 to the near-present. The data are based on death certificates and can be queried by location, timing, demographics, and causes of death. They’re similar to the CDC’s non-provisional statistics for earlier years, but “with a lag of just a few weeks” instead of more than a year, writes the COVID-19 Data Dispatch’s Betsy Ladyzhets. Read more: “Researchers say the US is undercounting COVID deaths. Now we have a tool to figure out why,” an article by Ladyzhets and other members of Documenting COVID-19, who are hosting a webinar about the data today.