Data is Plural archive

Jun 2022

NYC tree plantings.

New York City’s Department of Parks & Recreation publishes a map and dataset of recent and likely-upcoming street tree plantings. The information includes each location’s coordinates, nearest street address, ZIP code, city council district, and borough, as well as the dates of completed plantings. Previously: Every street tree in NYC (DIP 2016.11.16). [h/t Soph Warnes]

Jun 2022

Inclusive crossword names.

In a post for The New York Times’ Gameplay section, psychology professor Erica Hsiung Wojcik describes her motivations for creating the Expanded Crossword Name Database, “a free and regularly updated list of names, places and things that represent groups, identities and people often excluded from crossword grids,” with a particular focus on “names of women, non-binary, trans, and/or people of color.” It contains 2,400+ potential entries — from AALIYAH to ZORANEALEHURSTON — that correspond to 900+ distinct proper nouns, each briefly described in the project’s main spreadsheet. [h/t George Ho]

Jun 2022

Central bank interest rates.

The Bank for International Settlements maintains a longitudinal dataset of policy interest rates, which central banks adjust to influence inflation and other aspects of the economy. The dataset, which includes both official policy rates and analogous precursors, covers three dozen countries plus the European Central Bank. The records span decades, going as far back as 1946 for Denmark, India, Japan, Sweden, Switzerland, and the UK; 1954 for the US; 1960 for Canada; and 1976 for Australia.

Jun 2022

Ukraine air raid alerts.

Volodymyr Agafonkin, a Kyiv-based software engineer, has been scraping and charting the emergency notifications published through Air Alert Ukraine, a Telegram channel. The notifications serve as a digital counterpart to the sirens that warn residents of potentially imminent Russian air attacks. Agafonkin’s dataset indicates the starting and ending times of 8,000+ alerts for 240 locations since March 15. Read more: An interview with Agafonkin in How To Read This Chart, a Washington Post newsletter.

Jun 2022

Monkeypox cases.

Global.health, a data-sharing initiative launched during the COVID-19 pandemic, has compiled a dataset of 2,500+ confirmed cases from this year’s monkeypox outbreak. Drawing from government and media sources, the dataset lists each case’s country and publicly known characteristics, such as the patient’s gender, age range, date of confirmation, and/or symptoms. As seen in: Charts and maps from the Global.health team and from Our World In Data.

Jun 2022

European royal families.

In “A Network of Thrones,” economists Seth G. Benzell and Kevin Cooke describe building a dataset that links “European royal kinship networks, monarchies, and wars to study the effect of family ties on conflict” between 1495 and 1918. The kinship records come from Brian Tompsett’s Directory of Royal Genealogical Data, which covers “almost every ruling house in the western world.” Related: Andrej Kokkonen et al. have compiled a dataset of “royal offspring, siblings, and paternal uncles and aunts” for 27 European monarchies from 1000 to 1799. Previously: European monarchs (DIP 2019.06.19). [h/t Phenomenal World + Kevin Lewis]

Jun 2022

Airplane laser incidents.

“Aiming a laser at an aircraft is a serious safety risk and violates federal law,” according to the Federal Aviation Administration, which encourages flight staff and the public to submit reports of such incidents. The agency publishes annual spreadsheets of each reported incident since 2010, listing the date and time, flight number, aircraft model, altitude, local airport, laser color, and an injury indicator.

Jun 2022

Drug combinations.

Guy Shtar et al.’s Continuous Drug Combination Database identifies 17,000+ unique medical drug combinations that have generated clinical interest, “curated automatically” from ClinicalTrials.gov (DIP 2018.05.09), the FDA’s Orange Book (DIP 2017.03.08), and international patent records. The automated approach allows the database to be “continuously updated,” with new versions published weekly.

Jun 2022

Incarceration and redistricting.

The US Census counts prisoners as residing where they are incarcerated. In 2010, however, New York passed a law requiring the state to adjust these figures for redistricting purposes, reassigning people in state and federal prisons to their pre-incarceration addresses. Those adjusted counts, down to the Census block level, are available for 2010 and 2020. A new report from the Prison Policy Initiative and VOCAL-NY cross-references the 2020 data with the original Census numbers to determine state prison incarceration rates for each county, city, ZIP code, and other geographies. PPI says it plans to issue similar reports for other states that have enacted comparable reforms. [h/t Mike Wessler]
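As a rough illustration of the arithmetic behind such rates, here is a minimal sketch. It assumes a rate is simply the number of imprisoned people reassigned to an area divided by that area's population, scaled per 100,000 residents; the report's actual method may differ. The function name and the figures are hypothetical.

```python
def rate_per_100k(imprisoned, population):
    """Return imprisoned people per 100,000 residents (illustrative formula only)."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so round figures stay exact.
    return imprisoned * 100_000 / population

# Hypothetical area: 150 people in state prison whose home addresses fall
# within an area of 60,000 residents.
example_rate = rate_per_100k(150, 60_000)
print(example_rate)
```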

Jun 2022

Consumer prices.

The US Bureau of Labor Statistics’ widely cited Consumer Price Index measures “the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services.” In addition to overall averages, the index provides decades of detailed data on monthly price changes for hundreds of sub-baskets, ranging from broad groupings (e.g., apparel, housing, education) to narrower categories (e.g., frozen vegetables, window coverings, veterinarian services). [h/t Mike Reilley]

Jun 2022

Roman amphitheaters.

Sebastian Heath, a professor of computational humanities and Roman archaeology, has constructed a dataset of 260+ amphitheaters in the Roman Empire. It provides the structures’ known names, coordinates, orientations, and capacities, among other characteristics, and links the entries to external data sources.

Jun 2022

Mercenaries.

Ulrich Petersohn et al.’s Commercial Military Actor Database examines “the market for force” in 72 countries from 1980 to 2016. It contains information, primarily sourced from news reports, on thousands of contractual relationships between providers (mercenaries and private military/security companies) and their clients (governments, opposition groups, NGOs, and transnational corporations). The contracted work ranges “from combat services and support services (e.g., communication, maintenance), to logistics, security, consultancy, training, and reconstruction.”

Jun 2022

Hong Kong political prisoners.

The Hong Kong Democracy Council, a US-based advocacy group, last month published the first version of its Hong Kong Political Prisoners Database, which contains information about 1,000+ protesters, opposition leaders, and national security law defendants incarcerated since the city’s pro-democracy mass protests in mid-2019. It lists each defendant’s age, arrest date, arrest location, conviction date, convicted offenses, sentencing date, sentence length, and other details. An accompanying report describes the database’s context and methodology. [h/t Samuel Bickett]

Jun 2022

Where college grads go.

Johnathan Conzelmann et al. have created a dataset that estimates the geographic distribution of recent graduates from 2,600 US colleges and universities, calculated from information on the schools’ official LinkedIn landing pages. For each institution, the dataset indicates the proportions of alumni in each of the 278 specific US locations in LinkedIn’s geographic lexicon and cross-references them with government-defined metropolitan and micropolitan statistical areas. Read more: An introductory Twitter thread. [h/t Sharon Machlis]

Jun 2022

Six decades of House primaries.

In 2014, Stephen Pettigrew, Karen Owen, and Emily Wanless published a dataset of all Democratic and Republican primary election results for the US House of Representatives between 1956 and 2010. It indicates each election’s year, state, redistricting status, primary system (open, closed, semi-open, multiparty), and more. The dataset also lists each candidate’s name, gender, prior office, and votes received. In 2020, Michael G. Miller and Nicki Camberg published a follow-up dataset, adding coverage for 2012 through 2018. It uses the same variable names and structure as the earlier dataset, so that the two files can be easily combined.
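Because the two files share variable names, combining them can be as simple as stacking their rows. The sketch below shows that idea with tiny made-up extracts; the column names (year, state, candidate, votes) and all values are hypothetical stand-ins, not the datasets' actual fields.

```python
import csv
import io

# Hypothetical miniature extracts; the real files' column names may differ.
EARLIER = """year,state,candidate,votes
1956,IN,Candidate A,12000
2010,NY,Candidate B,8500
"""
LATER = """year,state,candidate,votes
2012,IN,Candidate C,9100
2018,NY,Candidate D,15200
"""

def read_rows(text):
    """Parse CSV text into a header list and a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)

def combine(earlier_text, later_text):
    """Stack two files that share a header, refusing to merge if they differ."""
    header_a, rows_a = read_rows(earlier_text)
    header_b, rows_b = read_rows(later_text)
    if header_a != header_b:
        raise ValueError("column names differ; align them before combining")
    return header_a, rows_a + rows_b

header, rows = combine(EARLIER, LATER)
years = sorted({int(r["year"]) for r in rows})
print(len(rows), years[0], years[-1])
```

Checking that the headers match before stacking guards against silently misaligned columns if either file's structure changes.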

Jun 2022

“What Middletown Read.”

Thanks to the discovery of “a collection of dusty ledgers” in 2003, researchers have built a database of (nearly) every checkout from Muncie, Indiana’s public library from November 1891 to December 1902. The project, a collaboration between the library and Ball State University, takes its name from a famous sociological study that pseudonymized Muncie as Middletown. Previously: Seattle Public Library checkouts since 2005 (DIP 2017.03.01). [h/t Matt Brown]

Jun 2022

Olympic accounting.

Martin Müller et al. have compiled a dataset of the costs and revenues of three recurring “mega-events”: the Summer Olympic Games, Winter Olympic Games, and FIFA Men’s World Cup. For each event between 1964 and 2018, it indicates the number of athletes, number of accredited media, venue costs, organization costs, ticketing revenue, broadcast revenue, and sponsorship revenue.

Jun 2022

Wind and solar power.

The Global Energy Monitor’s Global Wind Power Tracker is “a worldwide dataset of utility-scale wind facilities,” focusing on those with planned or installed capacities of at least 10 megawatts. It provides each facility’s name, location, status, capacity, installation type, owner, and other details. The project launched last week alongside a sibling dataset, the Global Solar Power Tracker. They join a growing collection of trackers from the organization, including those examining coal infrastructure, steel plants, and oil and gas resources. [h/t Nathaniel Hoffman]

Jun 2022

Congress, consolidated.

CongressData, published last month by political scientists at the Institute for Public Policy and Social Research, “compiles information about all US congressional districts,” the legislators representing them, and those legislators’ policymaking behavior (such as committee memberships and number of bills sponsored). The dataset spans 1789-2021, although many of the variables (such as those derived from the Census Bureau’s American Community Survey) are only available for more recent years. [h/t Erik Gahner Larsen]

Jun 2022

School shootings, continued.

The K-12 School Shooting Database, housed at the Naval Postgraduate School, “documents each and every instance a gun is brandished, is fired, or a bullet hits school property for any reason, regardless of the number of victims, time, day of the week.” It describes 2,000+ incidents from 1970 to the present and links them to information regarding 3,000+ victims killed and wounded, 2,200+ shooters, and 2,000+ weapons. Related: A few years ago, CNN compiled a dataset of 180 school shootings from 2009 to 2018, focusing on incidents where at least one person was shot. Previously: Data on school shootings from The Washington Post (DIP 2018.04.25), and on mass shootings from The Violence Project, Gun Violence Archive, and Mother Jones (DIP 2021.03.24, DIP 2015.12.09). [h/t Michael A. Rice + Sam Petulla]

May 2022

Art Garfunkel’s library.

The legendary folk singer’s official website includes a catalog of “every book Art has read since 1968.” It lists each book’s title, author, year published, month/year read, page count, and whether it was one of the musician’s favorites. Recently, AI engineer Corey Christensen converted the HTML pages into a downloadable dataset.

May 2022

Moreno and Jennings’ sociograms.

In the 1930s, Jacob Moreno and Helen Hall Jennings created a series of “sociograms” representing the seating preferences of grade-school classmates. These graphics “are frequently considered as the first examples of social network analysis and visualization,” according to historian and network analysis practitioner Martin Grandjean, who has translated them into simple data files. [h/t Christian Miles + Jer Thorp]

May 2022

European election results.

Dominik Schraff et al. have built EU-NED, a dataset that harmonizes European election results at a subnational level, providing party vote totals for 31 countries’ NUTS 2 and NUTS 3 geographic units. The dataset covers 1990 to 2020 and uses party identifiers from PartyFacts (DIP 2019.01.16), making it easier to link the records to other projects. [h/t Christian Breuer]

May 2022

Infrastructure permitting.

The US government’s Federal Infrastructure Permitting Dashboard tracks the “environmental review and authorization processes for large or complex infrastructure projects,” particularly those funded by the Department of Transportation and those participating in a voluntary review-coordination effort known as FAST-41. The dashboard’s full dataset describes 12,000+ milestones relating to nearly 1,000 projects, roughly half of which have been completed. Online, you can search across projects and browse their characteristics and timetables.

May 2022

Supercomputers.

Since 1993, a team of researchers has regularly assessed the most powerful computers in the world. The resulting TOP500 lists are published twice a year, in June and November, using a performance benchmark developed by team member Jack Dongarra, who became a Turing Award laureate this year. Downloadable versions indicate each supercomputer’s name, rank, location, manufacturer, year built, power consumption, technical specifications, and more. As seen in: “The race to build the fastest supercomputer,” by Datawrapper’s Edurne Morillo, who recommends visiting Barcelona’s MareNostrum, which ranked 74th on the latest list and is housed in a former chapel.

May 2022

Banknote people.

The visual essay “Who’s in Your Wallet?” examines the famous faces that appear on 38 countries’ paper money. To do that, Alejandra Arevalo and Eric Hausken built a dataset describing 279 person-banknote combinations. It lists the banknote’s currency and value, plus the person’s name, gender, profession, year first on the bill, year deceased, and more. Related: Wikipedia’s lists of people on banknotes and on coins.

May 2022

Europe prison populations.

The Council of Europe publishes annual statistical reports on prison populations and facilities, based on surveys sent to its member states. The council publishes most of the data only in the report PDFs, but does provide HTML tables of country-level inmate counts and facility capacity from 2018 to 2022. As seen in: A Civio.es-led analysis of pretrial detention rates, for which reporters extracted (and have shared) the numbers of untried and unsentenced prisoners in early 2021. [h/t Olaya Argüeso Pérez]

May 2022

Open data governance.

To develop the Global Data Barometer, a network of local experts and regional organizations evaluated “the state of data for public good” in 109 countries between May 2019 and May 2021. The project’s initial results, released last week as a report and downloadable dataset, reflect 60,000+ of their observations, which focused on data governance and capabilities, plus the availability and use of data on specific topics, such as public finance, climate action, and company ownership. [h/t cat cortes]

May 2022

Municipal pandemic responses.

The National League of Cities’ Local Action Tracker describes itself as “the most complete collection of municipal responses to COVID-19.” It contains information about 4,800+ policies undertaken or planned in roughly 800 US cities between February 2020 and February 2022, listing each response’s date, policy area (e.g., housing, utilities, vaccinations), type of action (e.g., ordinance, emergency declaration), a brief description, and more. [h/t Joshua Pine]

May 2022

Religion and government.

To compile the Government Religious Preference dataset, researchers scrutinized primary documents and secondary sources “for information on the existence, origination, change, or discontinuation of a law or policy directed toward” any of 30 religious denominations in 200+ countries. Then they assessed the degree to which those policies reflected institutional favor or disfavor across 28 variables, themselves grouped into “five broad components of state-religion”: official status, financial support, regulatory burdens, religious education, and free exercise. The project provides individual and composite scores for each country-year-denomination, from as early as 1800 through 2015. Related: Country-level religious demographics, from the same principal investigators. [h/t Ariel Zellman and Davis Brown]