A Comparison of Rule-based and Supervised Machine Learning Approaches for Record Linkage of Italian Historical Data
DOI: https://doi.org/10.51964/hlcs18990
Keywords: Record linkage, Parish records, Historical demography
Abstract
Parish and civil records are crucial sources for reconstructing historical socio-demographic processes. However, their analysis presents significant challenges, particularly the need to digitize data and link life events across documents that lack formal identifiers. With the growing availability of digitized records, the development and evaluation of automated linkage techniques have become increasingly important. This study compares rule-based and supervised machine learning approaches for linking birth and death records derived from crowdsourced transcriptions of Italian parish and civil registers. Using a set of hand-linked data as a benchmark, we assess the performance of both approaches in terms of precision and recall, under standard conditions and in scenarios where key disambiguating information is missing. Our findings suggest that the machine learning approach outperforms the rule-based method both under standard conditions and when information is incomplete, making it the preferred option when training data are available. Nonetheless, the rule-based method can still achieve high precision when configured with sufficiently strict matching thresholds. While the focus of this exercise is on linking birth and death records, the procedures can be adapted to a wide range of historical reconstruction projects based on names and dates.
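As an illustration of the evaluation described in the abstract, the following minimal sketch computes precision and recall for a set of proposed birth-death links against a hand-linked benchmark. It is not the authors' code: the function name, the record identifiers, and the example links are all hypothetical, and links are assumed to be represented as (birth_id, death_id) pairs.

```python
def precision_recall(predicted_links, true_links):
    """Return (precision, recall) of predicted links against a benchmark.

    Both arguments are iterables of (birth_id, death_id) pairs.
    This pair representation is an assumption made for illustration.
    """
    predicted = set(predicted_links)
    truth = set(true_links)
    true_positives = len(predicted & truth)
    # Precision: share of proposed links that appear in the benchmark.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of benchmark links that the method recovered.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical hand-linked benchmark and hypothetical output of a linkage method.
hand_linked = {("b1", "d1"), ("b2", "d2"), ("b3", "d3"), ("b4", "d4")}
predicted = {("b1", "d1"), ("b2", "d2"), ("b3", "d9")}

precision, recall = precision_recall(predicted, hand_linked)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case two of the three proposed links are correct and two of the four benchmark links are recovered. A stricter matching threshold would typically drop uncertain links such as ("b3", "d9"), raising precision at the cost of recall, which is the trade-off the abstract notes for the rule-based method.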

License
Copyright (c) 2025 Saverio Minardi, Suzanne Greco, Nicola Barban

This work is licensed under a Creative Commons Attribution 4.0 International License.
Funding data
Horizon 2020 Framework Programme, grant number 865356