A Comparison of Rule-based and Supervised Machine Learning Approaches for Record Linkage of Italian Historical Data

Author(s)

  • Saverio Minardi, University of Bologna
  • Suzanne Greco, ItalianParishRecords.org, USA
  • Nicola Barban, University of Bologna

DOI:

https://doi.org/10.51964/hlcs18990

Keywords:

Record linkage, Parish records, Historical demography

Abstract

Parish and civil records are crucial sources for reconstructing historical socio-demographic processes. However, their analysis presents significant challenges, particularly the need to digitize data and link life events across documents that lack formal identifiers. With the growing availability of digitized records, the development and evaluation of automated linkage techniques have become increasingly important. This study compares rule-based and supervised machine learning approaches for linking birth and death records derived from crowdsourced transcriptions of Italian parish and civil registers. Using a set of hand-linked data as a benchmark, we assess the performance of both approaches in terms of precision and recall, under standard conditions and in scenarios where key disambiguating information is missing. Our findings suggest that the machine learning approach outperforms the rule-based method both under standard conditions and when information is incomplete, making it the preferred option when training data are available. Nonetheless, the rule-based method can still achieve high precision when configured with sufficiently strict matching thresholds. While the focus of this exercise is on linking birth and death records, the procedures can be adapted to a wide range of historical reconstruction projects based on names and dates. 
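The trade-off between strict thresholds, precision, and recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the `difflib`-based name similarity, the birth-before-death rule, the function names, and the thresholds are all assumptions made purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy data: record id -> (full name, year of event).
births = {
    "b1": ("Giovanni Rossi", 1850),
    "b2": ("Maria Bianchi", 1852),
    "b3": ("Giovanna Rossi", 1851),
}
deaths = {
    "d1": ("Giovanni Rossi", 1901),
    "d2": ("Maria Bianci", 1890),   # transcription variant of "Bianchi"
    "d3": ("Giovanna Rosi", 1853),  # transcription variant of "Rossi"
}
# Hypothetical hand-linked benchmark: the set of true (birth, death) pairs.
benchmark = {("b1", "d1"), ("b2", "d2"), ("b3", "d3")}


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_based_links(births, deaths, threshold):
    """Link a birth to a death when the death does not precede the birth
    and the name similarity reaches the threshold."""
    links = set()
    for b_id, (b_name, b_year) in births.items():
        for d_id, (d_name, d_year) in deaths.items():
            if d_year >= b_year and name_similarity(b_name, d_name) >= threshold:
                links.add((b_id, d_id))
    return links


def precision_recall(predicted, benchmark):
    """Precision: share of predicted links that are in the benchmark.
    Recall: share of benchmark links that were predicted."""
    true_pos = len(predicted & benchmark)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(benchmark) if benchmark else 0.0
    return precision, recall


# Compare a strict, an intermediate, and a fully permissive threshold.
for threshold in (1.0, 0.9, 0.0):
    predicted = rule_based_links(births, deaths, threshold)
    precision, recall = precision_recall(predicted, benchmark)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting, requiring an exact name match keeps only the one pair whose spelling agrees exactly, so every predicted link is correct but most true links are missed. Dropping the threshold to zero recovers every true link at the cost of many false ones. A supervised approach would instead learn from hand-linked pairs how to weigh such similarity features.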

Published

2025-06-03

Section

Articles

How to Cite

Minardi, S., Greco, S., & Barban, N. (2025). A Comparison of Rule-based and Supervised Machine Learning Approaches for Record Linkage of Italian Historical Data. Historical Life Course Studies, 15, 28–46. https://doi.org/10.51964/hlcs18990
