Data is Plural archive

Jan 2021

2020 in haiku.

Over the course of 2020, Eli Holder paid workers on Mechanical Turk to turn news headlines into 5/7/5-syllable poems. The result: 2,760 “Doom Haikus,” which you can browse on a timeline or download in bulk. For each poem, the dataset also includes the original article URL, date processed, headline, and SEO snippet. [h/t Karsten Johansson]

Jan 2021

Millions of computational notebooks.

In 2017, a team of researchers downloaded and analyzed 1.25 million publicly-available Jupyter notebooks — documents that weave computational code, output, and text. They also published the notebooks and their related metadata. Inspired by that project, a team at JetBrains recently did a follow-up scan, analyzing and publishing data on nearly 10 million notebooks.

Jan 2021

More college sports financing.

The College Athletics Financial Information Database, run by the privately-funded Knight Commission on Intercollegiate Athletics, details the annual sources of revenue (such as ticket sales) and expenses (such as coaches’ compensation) for hundreds of schools, based on information self-reported to the NCAA and federal government. Many of the records were obtained via freedom-of-information requests by USA Today and Syracuse University students. [h/t Craig Garthwaite et al.]

Jan 2021

Commodity-transportation costs.

The UN and the World Bank have launched a new interactive map and dataset that quantify the transportation costs for international trade — country-by-country and broken down by mode of transportation (sea, air, rail, road), trading partner, and commodity. The numbers, based both on directly-reported figures and statistical modelling, include costs overall, per unit, and per unit per kilometer. The project currently covers only 2016, but has plans to expand. [h/t Jan Hoffmann]

Jan 2021

Megascale coronavirus surveys.

Carnegie Mellon University’s epidemiological forecasting group and Facebook have partnered to field a large-scale coronavirus survey in the US; they’ve collected more than 15 million responses since April 2020. The University of Maryland has formed a similar partnership for an international survey, in which “a representative sample of Facebook users is invited on a daily basis to report on symptoms, social distancing behavior, mental health issues, and financial constraints”; millions have also participated. Geographically-aggregated results of the US survey can be downloaded via an online interface or Delphi’s API; the international results are also available via API. Practical example: An analysis of state-by-state mask usage, with code. [h/t Alex Reinhart]

Dec 2020

Who washes meat?

YouTuber (and former public radio reporter) Adam Ragusea recently asked his viewers to answer a detailed survey about whether (and why, and how) they wash meat before cooking it. He received more than 13,000 responses. He then made a video about what he found and published a spreadsheet of the anonymized answers.

Dec 2020

Permafrost.

The European Space Agency has released new longitudinal data on the Northern Hemisphere’s permafrost — ground that remains 0°C/32°F or colder for at least two years. Through a combination of satellite detection and on-the-ground measurements, the datasets quantify the permafrost’s thickness, extent, and temperature between 1997 and 2017. [h/t Simon Proud]

Dec 2020

Social scientists testifying.

In a paper published this spring, Mahler et al. describe their dataset of social scientists’ appearances in US congressional hearings — more than 15,000 instances in all, at more than 10,000 hearings between 1946 and 2016. For each testimony, the dataset indicates the expert’s name, discipline, title, and professional affiliations, as well as the hearing’s date, title, and committee. Economists predominate, followed by political scientists, psychologists, sociologists, and then anthropologists. [h/t Deblina Mukherjee]

Dec 2020

County-level coronavirus tests.

The US federal government has finally begun publishing county-level data on COVID-19 test counts, positivity rates, and delays. And that’s just a slice of the information now available through the daily-updated, multi-agency Community Profile Reports, which also assign each county to a “concern category” and aggregate the metrics to the CBSA, state, and regional levels. Related: Ryan Panchadsaram’s enthusiastic Twitter thread.

Dec 2020

Vaccine doses.

Our World in Data is tracking the number of COVID-19 vaccine doses administered per country, compiling their dataset from a range of government sources, including press releases and ministers’ tweets. In addition to listing the total doses administered, the US Department of Health and Human Services is also publishing datasets that tally how many Pfizer-BioNTech and Moderna vaccine doses have been allocated and shipped to each state and territory.

Dec 2020

Cyber wargames.

This is, “to the best of our knowledge, [...] the first dataset providing network traffic traces and corresponding event logs from a complex cyber defense exercise” — a two-day Cyber Czech event in March 2019.

Dec 2020

Third-Republic France.

Economist Victor Gay has built a geographic dataset that traces, year by year, the administrative boundaries of France’s Third Republic, which governed from 1870 to 1940, when the Vichy Regime took power. The dataset provides annual shapefiles delineating the country’s départements, arrondissements, and cantons, as well as its “most significant special administrative constituencies: military, judicial and penitentiary, electoral, academic, labor inspection, and ecclesiastical.”

Dec 2020

More on travel/immigration bans.

The COVID Border Accountability Project is tracking countries’ pandemic-related travel and immigration restrictions on a weekly basis. The project’s team categorizes various aspects of the restrictions (whether they hinge on citizenship, halt new visa applications, et cetera) and turns them into a longitudinal dataset. Previously: The UN World Food Program’s travel-restrictions dataset (DIP 2020.12.09).

Dec 2020

More PPP details.

Thanks to a FOIA lawsuit by a group of news organizations, the US Small Business Administration has released additional data about the financial assistance distributed through its Paycheck Protection Program. Previously (DIP 2020.07.08), the SBA’s public data withheld the specific amounts for all loans (instead listing only a broad range), as well as names and addresses for loans below $150,000. The new datasets include those amounts, names, and addresses for all loans.

Dec 2020

People of slavery.

Recently launched, Enslaved.org allows the public to “explore or reconstruct the lives of individuals who were enslaved, owned slaves, or participated in the historical trade.” Its interactive database contains 600,000+ records, with plans to expand. The collaborative, scholar-led project also includes The Journal of Slavery and Data Preservation, which “publishes original, peer-reviewed datasets about the lives of enslaved Africans and their descendants.” The first issue features three datasets originally published through a precursor to Enslaved.org, Slave Biographies: The Atlantic Database Network. Those datasets focus on Louisiana slaves (1719–1820), New Orleans “Free Blacks” (1840–1860), and enslaved Africans in Maranhão, Brazil (1767–1831).

Dec 2020

Bob Ross paintings.

Data scientist Jared Wilber has built a dataset of all paintings in Bob Ross’s 31 seasons of “The Joy of Painting,” scraped from the searchable database at TwoInchBrush.com. For each painting, the dataset lists the title, season, episode, YouTube link, and list of colors used. For a 2014 article at FiveThirtyEight, Walt Hickey created a dataset categorizing the types of things Ross depicted in each episode. Related: “Where Are All the Bob Ross Paintings? We Found Them,” a video from the New York Times. [h/t u/palpitations]

Dec 2020

La Pola and her compatriots.

During Colombia’s struggle for independence, Royalists executed scores of women by firing squad, the most famous of whom was the seamstress and spy known as “La Pola.” Writing last year for the cultural journal of Colombia’s central bank, historian Pablo Rodríguez Jiménez presented a list of 76 women known to have suffered this fate, including their names, locations, and dates of death. Colombia-based Datasketch has converted that list into a spreadsheet. [h/t Juan Pablo Marín Díaz]

Dec 2020

Country facts.

The CIA’s World Factbook “provides information on the history, people and society, government, economy, energy, geography, communications, transportation, military, and transnational issues for 267 world entities.” The details are extensive and fairly standardized. Open data–enthusiast Gerald Bauer has converted the publication into a series of JSON files. Now you know: The physical areas of five countries (Georgia, Ireland, Latvia, Lithuania, and Sri Lanka) are all described as “slightly larger than West Virginia.”

Dec 2020

Pandemic travel restrictions.

The UN World Food Program has been tracking countries’ and airlines’ travel restrictions during the COVID-19 pandemic, based on official communications, media reports, and other sources. The country-level dataset indicates whether travelers must obtain a recent negative test and what type of quarantine or self-isolation is required. [h/t Cassidy Chansirik]

Dec 2020

COVID-19 hospital capacity.

On Monday, the Department of Health and Human Services released a dataset on coronavirus-related capacity at thousands of US hospitals — information the agency previously only published as state-level metrics. The self-reported, weekly-updated dataset quantifies various aspects of capacity, such as the number of staffed ICU beds and the number of beds occupied by patients with COVID-19. “This data is tremendously complex and is the result of substantial ongoing efforts,” notes an accompanying blog post. “We opted not to have perfect be the enemy of good, so these datasets will have imperfections.” Related: An FAQ “developed in collaboration with a group of data journalists, data scientists, and healthcare system researchers who have reviewed the data.” [h/t Ryan Panchadsaram]

Dec 2020

Integer sequences.

The decades-old, frequently-updated, and downloadable On-Line Encyclopedia of Integer Sequences contains more than 338,000 lists of those things. Each has some particular significance, ranging from the famous (the Fibonacci numbers) to the intriguing (“days required to spread gossip to n people”) to the obscure (“numbers n such that 2^n + 35 is prime”) to the super-obscure. Related: This xkcd comic and its impact. [h/t Dan Brady]
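To make the “obscure” example concrete, here is a minimal sketch that recomputes the first few terms of “numbers n such that 2^n + 35 is prime” from the definition alone. It does not touch the OEIS download or any OEIS API; the function names `is_prime` and `exponents_with_prime` are hypothetical helpers invented for this illustration, and simple trial division is an assumption that only suits small exponents.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the small values used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def exponents_with_prime(limit: int, offset: int = 35) -> list[int]:
    """Return every n in 0..limit for which 2**n + offset is prime."""
    return [n for n in range(limit + 1) if is_prime(2**n + offset)]


# Exponents up to 12 that satisfy the definition.
print(exponents_with_prime(12))  # prints [1, 3, 5, 7, 9, 11]
```

Even exponents never appear: 2^n leaves remainder 1 when divided by 3 for even n, and 35 leaves remainder 2, so the sum is always divisible by 3.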

Dec 2020

Jefferson’s weather.

From July 1776 to June 1826, Thomas Jefferson recorded thousands of nearly-daily weather observations — temperatures, precipitations, humidities, wind speeds — at Monticello, Paris, Milan, and scores of other locations. Now a UVA/Princeton collaboration has turned those handwritten records into an explorable and downloadable database. [h/t Erica Cavanaugh]

Dec 2020

More coups.

Last month, the University of Illinois’ Cline Center for Advanced Social Research published version 2.0 of its Coup D’état Project, a dataset detailing more than 900 coups, attempted coups, and coup conspiracies from 1945 to 2019. Each entry indicates the country and date, plus the “type of actor who initiated the coup (i.e. military, palace, rebel, etc.) as well as the fate of the deposed executive (killed, injured, exiled, etc.).” Previously: Powell and Thyne’s coup dataset (DIP 2016.07.20).

Dec 2020

Rural facilities in India.

As part of its Pradhan Mantri Gram Sadak Yojana road-development program, India’s Ministry of Rural Development has gathered data on 700,000+ rural facilities, which data-science engineer Pratap Vardhan has organized into state-level CSV files. The information includes each facility’s name, category (e.g., education or medical), subcategory, state, district, block, address, and geocoordinates. Related: An exploratory Twitter thread by Vardhan, who says, “This is probably the largest open indian geo-tagged dataset I’ve seen!? It’s mostly great!?”

Dec 2020

Student loans.

The US Department of Education publishes a range of aggregate datasets on federal student loans, including the amounts outstanding ($1.5+ trillion overall, from 43 million students), volumes of financial aid requested and awarded (by student demographic and by school), default rates, and forgiveness.

Nov 2020

State spending on kids.

A new dataset from the Urban Institute “provides a comprehensive accounting of public spending on children from 1997 through 2016.” Drawing on the US Census Bureau’s Annual Survey of State and Local Government Finances and other sources, the dataset summarizes “state-by-state spending on education, income security, health, and other areas.” [h/t Erica Greenberg]

Nov 2020

Education and civil rights.

For decades, the US Department of Education’s Civil Rights Data Collection has compiled “data on key education and civil rights issues in our nation’s public schools,” including “student enrollment and educational programs and services, most of which is disaggregated by race/ethnicity, sex, limited English proficiency, and disability.” Last month, the department released the CRDC for the 2017–18 school year. Related: ProPublica has used CRDC data to investigate racial inequality and the use of restraints and seclusions. [h/t Andrew McCartney]

Nov 2020

Urban traffic.

Researchers from ETH Zurich’s Institute for Transport Planning and Systems have assembled 170 million observations of traffic intensity on urban roads, registered by 23,000+ detection points in 40 cities, “making it the largest multi-city traffic dataset publically available.” The cities are mostly in Western Europe, but also include Tokyo, Taipei, Melbourne, Vilnius, Los Angeles, and Toronto. [h/t ddechamb]

Nov 2020

Global inequality.

The World Inequality Database “aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within countries and between countries.” The project, co-directed by Thomas Piketty, published a major update last week, expanding its geographic and temporal coverage. The data points vary by country; you can download them interactively or in bulk. Previously: Frederick Solt’s Standardized World Income Inequality Database (DIP 2019.12.04) and the United Nations University’s World Income Inequality Database (DIP 2016.06.01).