Combining family history and machine learning to link historical records: The Census Tree data set

B-Tier
Journal: Explorations in Economic History
Year: 2021
Volume: 80
Issue: C

Authors (4)

Price, Joseph (not in RePEc); Buckles, Kasey (University of Notre Dame); Van Leeuwen, Jacob (not in RePEc); Riley, Isaac (not in RePEc)

Score contribution per author:

0.503 = (α = 2.01 / 4 authors) × 1.0 (B-tier multiplier)

α: calibrated so average coauthorship-adjusted count equals average raw count
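
The per-author score above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the formula is exactly the one displayed (α divided by the number of authors, times the tier multiplier); the function and parameter names are hypothetical, and α is taken at its displayed, rounded value of 2.01, so the last digit may differ from the listed 0.503.

```python
def score_per_author(alpha: float, n_authors: int, tier_multiplier: float) -> float:
    """Coauthorship-adjusted score credited to each author of one paper.

    Assumed formula (taken from the line displayed in this entry):
        score = (alpha / n_authors) * tier_multiplier
    """
    if n_authors <= 0:
        raise ValueError("n_authors must be positive")
    return (alpha / n_authors) * tier_multiplier


# Values shown in this entry: alpha = 2.01 (rounded), 4 authors, B-tier multiplier 1.0.
score = score_per_author(alpha=2.01, n_authors=4, tier_multiplier=1.0)
print(f"{score:.4f}")
```

Dividing by the author count splits credit evenly among coauthors, and α rescales the result so that, on average, the adjusted counts match the raw counts.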

Abstract

A key challenge for research on many questions in the social sciences is that it is difficult to link records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we contribute to recent efforts to create these links with a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. We use these “true” links both to inform the decisions one needs to make when using automated methods to link records and as a training data set for use in a supervised machine learning approach. We describe our procedure and illustrate its potential by linking individuals across the 100% samples of the US censuses from 1900, 1910, and 1920. When linking adjacent censuses, we obtain an overall match rate of 62-65 percent (for over 88.9 million matches), with a false positive rate that is around 6-7 percent and with links that are similar to the population along observable characteristics. Thus, our method allows us to link records with a combination of a high match rate, precision, and representativeness that is beyond the current frontier. Finally, we demonstrate the potential of the data by estimating the degree of intergenerational transmission of literacy between father-son and mother-daughter pairs.
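
The two headline quality measures in the abstract, match rate and false positive rate, can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes the match rate is the share of base-census records that receive any link, and the false positive rate is the share of predicted links that disagree with a known "true" link among records where one exists. The paper's exact definitions may differ, and all record identifiers below are hypothetical.

```python
import math


def link_quality(predicted: dict, truth: dict, n_base: int) -> tuple:
    """Return (match_rate, false_positive_rate) for a set of predicted links.

    predicted: base-record id -> linked-record id chosen by the linking method
    truth:     base-record id -> linked-record id from a trusted source
               (for example, links made by family-tree contributors)
    n_base:    number of records in the base census we tried to link
    """
    # Share of base records for which the method produced a link.
    match_rate = len(predicted) / n_base

    # Only predicted links that also have a trusted link can be checked.
    checked = [rec for rec in predicted if rec in truth]
    if not checked:
        return match_rate, math.nan

    # A checked link counts as a false positive when it disagrees with the trusted link.
    wrong = sum(1 for rec in checked if predicted[rec] != truth[rec])
    return match_rate, wrong / len(checked)


# Toy data: 5 base records, 3 predicted links, 2 of which can be checked.
predicted_links = {"a": 1, "b": 2, "c": 3}
true_links = {"a": 1, "b": 9, "d": 4}
rate, fp_rate = link_quality(predicted_links, true_links, n_base=5)
print(rate, fp_rate)
```

Restricting the false positive check to records with a trusted link reflects that an error can only be measured where the correct answer is known.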

Technical Details

RePEc Handle: repec:eee:exehis:v:80:y:2021:i:c:s0014498321000024
Journal Field: Economic History
Author Count: 4
Added to Database: 2026-01-25