POPP. An OCR-Generated Database of the Population Censuses of Paris (1926–1936)
DOI:
https://doi.org/10.52024/hlcs18627
Keywords:
Database, Census, Machine learning, Artificial Intelligence, Paris, France, Interwar
Abstract
Empirical research in historical demography is usually time-consuming and labour-intensive. Recent developments in machine learning offer new possibilities for building very large databases with reduced time and costs, though these new methods raise new challenges as well. This article describes the process of constructing the POPP database, a data collection project based on the exploitation of the nominative lists of the Parisian population censuses of 1926, 1931, and 1936. This database provides a host of information for almost 9 million individuals: their name and surname, year and location of birth, nationality, relation to the household head, and occupation. The article discusses the digitisation of archival sources — several hundred thousand handwritten pages — their transformation into a database by computer scientists using machine learning techniques, and the work required on the part of social scientists to correct and adapt the resulting data for statistical purposes. Beyond its methodological contribution, this article also discusses the various ways in which the POPP database will improve our knowledge of the economic, social, and demographic evolution of an important European urban population.
License
Copyright (c) 2026 Sandra Brée, Victor Gay, Marion Leturcq, Yoann Doignon, Baptiste Coulmont

This work is licensed under a Creative Commons Attribution 4.0 International License.