A Tale of Two Transcriptions. Machine-Assisted Transcription of Historical Sources

Gunnar Thorvaldsen; Joana Maria Pujadas-Mora; Trygve Andersen; Line Eikvil; Josep Lladós; Alícia Fornés; Anna Cabré

doi:10.51964/hlcs9355

Keywords:

Word spotting, Optical Character Recognition, Vital records, Census, Nominative sources, Computer vision

Abstract

This article explains how two projects implement semi-automated transcription routines: one for census sheets in Norway and one for marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books for the Barcelona area from 1451 to 1905, one of the world’s longest series of preserved vital records. In the project “Five Centuries of Marriages” (5CofM) at the Autonomous University of Barcelona’s Center for Demographic Studies, these books were used to build the Barcelona Historical Marriage Database; more than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material is cross-sectional: it is the 1891 census, recorded on one sheet per person. This format, together with the underlining of keywords for several variables, made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to recognize the marriage licenses automatically. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere, the source material is available as scanned images on the Internet, which opens up the possibility of further international cooperation on automating the transcription of historical source materials. As in projects that digitize printed materials, the optimal solution for handwritten sources is likely to be a combination of manual transcription and machine-assisted recognition.
