POPP. An OCR-Generated Database of the Population Censuses of Paris (1926–1936)

DOI:

https://doi.org/10.52024/hlcs18627

Keywords:

Database, Census, Machine learning, Artificial Intelligence, Paris, France, Interwar

Abstract

Empirical research in historical demography is usually time-consuming and labour-intensive. Recent developments in machine learning offer new possibilities for building very large databases with reduced time and costs, though these new methods raise new challenges as well. This article describes the process of constructing the POPP database, a data collection project based on the exploitation of the nominative lists of the Parisian population censuses of 1926, 1931, and 1936. This database provides a host of information for almost 9 million individuals: their name and surname, year and location of birth, nationality, relation to the household head, and occupation. The article discusses the digitisation of archival sources — several hundred thousand handwritten pages — their transformation into a database by computer scientists using machine learning techniques, and the work required on the part of social scientists to correct and adapt the resulting data for statistical purposes. Beyond its methodological contribution, this article also discusses the various ways in which the POPP database will improve our knowledge of the economic, social, and demographic evolution of an important European urban population.

 

Author Biographies

  • Sandra Brée, French National Centre for Scientific Research

    Sandra Brée is a full-time researcher at the French National Centre for Scientific Research (CNRS). She holds a PhD in historical demography from Paris-Sorbonne (defended in 2011). She studies marital and non-marital fertility, family formation and dissolution, and household structures. She is the PI of two projects that aim to construct large databases with the help of machine learning. She is also the PI of the Observatory of the history of the French population: large databases and artificial intelligence.

  • Victor Gay, Toulouse School of Economics

    Victor Gay holds a PhD in Economics from the University of Chicago (defended in 2018). His research is at the crossroads of economic history, labor economics, and the economics of culture. He focuses primarily on the economic history of France, and develops data infrastructures based on novel archival material.

  • Marion Leturcq, National Institute for Demographic Studies

    Marion Leturcq is a full-time researcher at INED (French Institute for Demographic Studies). She holds a PhD from Paris School of Economics (defended in 2013). She studies gender inequality and family formation and dissolution. Her main research topic is how the gender wealth gap is intertwined with the legal settings of marriage and cohabitation.

  • Baptiste Coulmont, École Normale Supérieure Paris-Saclay

    Baptiste Coulmont holds a PhD from the École des hautes études en sciences sociales (School of Advanced Studies in the Social Sciences), defended in 2003. He is a professor of sociology at the École Normale Supérieure Paris-Saclay. His research focuses on the sociology of voting and cultural stratification, with a marked interest in the choice of first names.

  • Yoann Doignon, French National Centre for Scientific Research

    Yoann Doignon holds a PhD in population geography from Aix-Marseille University, France. He is a researcher at the French National Centre for Scientific Research (CNRS). He specialises in spatio-temporal analyses of population phenomena, particularly the spatial diffusion of family changes in Europe and fertility decline. He has worked on Mediterranean populations and on the territorial and spatial convergence of population ageing in the Mediterranean.

Published

2026-02-12

Issue

Section

Articles

How to Cite

Brée, S., Gay, V., Leturcq, M., Coulmont, B., Doignon, Y., Constum, T., Paquet, T., & Tranouez, P. (2026). POPP. An OCR-Generated Database of the Population Censuses of Paris (1926–1936). Historical Life Course Studies, 16, 3–28. https://doi.org/10.52024/hlcs18627