Fair and Tender Data. The FAIRness of Four Databases With Historical Individual Life Course Data Tested

e-ISSN: 2352-6343 DOI article: https://doi.org/10.51964/hlcs9559 © 2021, Heerma van Voss This open-access work is licensed under a Creative Commons Attribution 4.0 International License, which permits use, reproduction & distribution in any medium for non-commercial purposes, provided the original author(s) and source are given credit. See http://creativecommons.org/licenses/.


Four databases with data on individual historical life courses are tested for FAIRness: the TRA, Umeå, HSN and IPUMS databases. All databases make their data much more Findable than they were in the original sources. But as databases, they are best findable if their name is a unique acronym, and if different subdatasets all use that same acronym. Sensitive data have to be protected. Two databases make anonymous data sets or those only containing information on deceased individuals Accessible without any formalities, and other databases could follow this example. To increase Interoperability a large number of tools are offered by the databases. Reusability is among the raisons d'être of these databases.

Lex Heerma van Voss
Huygens Institute for the History of the Netherlands & Utrecht University
Come all ye fair and tender ladies
Take warning how you court your man
They're like a star on a summer morning
First they appear then they're gone again
A traditional Appalachian folk song, recorded by numerous folk and country artists, tells fair and tender ladies to take care in love. Men are like a heavenly body, which is visible before dawn but disappears from sight at sunrise. In other versions of the song, the faithless male lovers are likened to sparrows or swallows flittering away (Roud no. 451). Even if the lover has disappeared from sight, the love may have left visible traces, in life and in historical records. The latter are in turn chased by historians. While historical individuals may seem as fleeting as morning stars or birds, historians come to grips with their life courses by collecting life events in historical databases. These allow us to describe collective life courses over generations, and to contrast singular or group experiences with the average life course in historical societies. Quaranta (2015) gives an overview of the kind of questions such data allow us to answer.
Data on individual historical life courses are collected in dozens of major databases accessible online.¹ There is broad agreement that such digital assets should be FAIR: Findable, Accessible, Interoperable, and Reusable (www.go-fair.org/fair-principles/). The following represents a very simple test of the FAIRness of a few of the larger databases which offer online demographic data on historical life courses. Approaching the databases online as a naïve user, I assess their FAIRness from the position of an outsider who aims to access the data. Being interoperable and reusable is more or less a raison d'être of this type of database, which is not to say that these aspects pose no challenges. But the findability and accessibility of the data will get most of our attention.
For this assessment we selected four high-profile databases. The TRA survey aims at life reconstitution of all individuals whose last name begins with the letters 'Tra' and their descendants who died in France between 1800 and 1939. It contains data from population registers, military and fiscal records (Bourdieu, Kesztenbaum, & Postel-Vinay, 2014).² The Umeå database is based on the historical parish records of northern Sweden, from the 18th and the 19th centuries, and extended for a subset to the 1950s (Edvinsson & Engberg, 2020).³ The HSN (Historical Sample of the Netherlands) contains life courses based on population registers and additional sources for a 0.5% sample of those born in the Netherlands between 1812 and 1922 (Mandemakers & Kok, 2020).⁴ IPUMS contains data at the individual and household level from the U.S. decennial censuses from 1790 to 2010 (Ruggles et al., 2020).⁵
The first question is how findable the databases are. This simple question was approached in a simple way. Each database was searched for both under the name by which it seems to be known in the historical profession and as 'database historical demography <country name>'. Both searches were executed in Google Chrome in the first two weeks of February 2021.
The TRA database may be known as such in the historical profession, but searching for 'TRA data set' yields only one relevant hit among the first 50 results. It is in thirteenth place and refers to an article on researchgate.net on parental status homogamy in France in the 19th century. 'TRA database' gives a hit in eighth place.
Before we conclude on findability, it is important to stress that all of these databases make dispersed and hard-to-access data far more findable than they would otherwise be. What we have looked at here is merely how findable the resulting databases are. One could well argue that this only scratches the surface of findability. That said, a few clear recommendations are possible:
• The best way to make your database findable is by using a distinctive acronym or name. IPUMS is much better than TRA or HSN in that way. Being known by the same name as your town or university does not increase findability.
• Being findable only under the exact spelling of your name is not conducive to being found.
• Having lots of references to your original website with different search terms enhances findability. This is mostly the ordinary management of web site findability, but international forms of association of these databases or crosslinking can be tremendously helpful (see note 1).
• Stick to your name. Obviously, this may run counter to some of the other recommendations if you have chosen the wrong name originally.
• Name your data sets after the main database, and find other ways to acknowledge contributions from researchers and funders.
How accessible are the data sets? Individual demographic data are sensitive. Some of the data sets contain medical information, like cause of death or the results of a medical examination for military service. Having spent time in an institution or having lived in certain neighborhoods can also be deemed sensitive, as well as belonging to an ethnic group or religious community. Historians generally do not think that the deceased have a right to privacy, but some of this information can be sensitive across generations. All four databases are careful about the way sensitive information may spread, and want their users to adhere to relevant scholarly standards.
Both TRA and HSN only deliver data after one has requested and acquired permission from a database representative, who can be contacted by e-mail. HSN users must fill in and sign a license agreement and agree with HSN privacy rules. TRA lists eight conditions to which users must subscribe, and is therefore slightly more transparent than HSN. Both Umeå and IPUMS restrict the use of data that may contain information on living individuals. Umeå asks users to specify which data will be retrieved and how they will be used. This is reviewed by the database's Approval Committee, and in case of sensitive personal data also by the Swedish Ethical Review Authority. IPUMS-International uses what it describes as a 'lengthy, probing registration form' which it deems 'an effective deterrent for unqualified applicants'.
I did not bother the officials at TRA and HSN by requesting access just to experience the process. Both IPUMS and Umeå make available data sets which are anonymized or which contain no data on surviving individuals. Umeå makes available anonymous data for the older period without further ado through its application SHiPS. IPUMS asked me to check two boxes promising not to redistribute the data without permission, to cite the data appropriately and to add publications making use of IPUMS data to their bibliography (Ruggles et al., 2020). Similar wishes are expressed by all databases. Giving my email address and answering a small number of questions resulted in an immediate response from IPUMS granting me access. The application which allowed me to download data from IPUMS was user-friendly, SHiPS somewhat less so.
Recommendations again are possible:
• Be transparent about the conditions you set for using data.
• Let users of privacy-sensitive data describe how they will use the data and ask them to state that they will comply with your rules of good scholarly practice. Let their application be judged by the database and, if necessary, by an ethical review board.
• Make non-sensitive data available upon simple request and grant access to them through a quick and automatic procedure.
Are the databases interoperable? The proof of the pudding would be in linking relevant data sets of our four databases. That is beyond the scope of this contribution, as the fact that only two data sets were accessed already makes clear. Fortunately, others have already tasted the pudding.
On theoretical grounds alone we would assume that data sets based on census data and covering recent decades stand a better chance of recording the same data, defined in a similar way, than data sets based on register data and covering an older period. From 2006 onwards an Intermediate Data Structure (IDS) was developed to improve interoperability among the databases (Alter & Mandemakers, 2014; Quaranta, 2015). This makes it possible to download data from two of the databases into the IDS and compare them. Using this tool, it was for instance possible to link the HSN and Umeå databases, and two others for Scania and Antwerp, and help establish that infant mortality runs in the family: the likelihood that a woman's offspring died in infancy was higher when any of her siblings had died in infancy (Quaranta & Sommerseth, 2018).
It is also possible to follow individuals from one database to the other. Paiva, Anguita and Mandemakers (2020) did this when they followed Dutch migrants to the USA by linking HSN to IPUMS. This exercise points to a difference between census-based and register-based data sets. Both have to link individuals and households that appear at different places in the original sources. For the census data, linking them from one observation to the next means bridging the time lapse between two censuses, typically ten years. A child's whole life course can fall between two censuses and thus will not be visible in any census. Registers follow individuals and households through time. Apart from additional richness in observed geographical movements and life events, this also makes it easier to link people from one register to another than from one census to the next.
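The census gap described above can be illustrated with a minimal sketch. It assumes a hypothetical decennial census taken on 1 June of each census year and two invented life courses; the function names, dates and census schedule are illustrative assumptions and are not taken from IPUMS, HSN or any other database discussed here.

```python
from datetime import date


def census_dates(first_year, last_year, interval=10, month=6, day=1):
    """Hypothetical census reference dates, one every `interval` years."""
    return [date(year, month, day)
            for year in range(first_year, last_year + 1, interval)]


def censuses_observing(birth, death, censuses):
    """Return the census dates on which a person was alive.

    A person is counted if born on or before the census date
    and still alive on it (birth <= census date < death).
    """
    return [c for c in censuses if birth <= c < death]


# Assumed census schedule: 1850, 1860, ..., 1900 (all on 1 June).
censuses = census_dates(1850, 1900)

# Invented example: a child whose whole life falls between two censuses.
short_life = censuses_observing(date(1881, 3, 14), date(1888, 11, 2), censuses)

# Invented example: an adult life that spans several censuses.
long_life = censuses_observing(date(1848, 1, 5), date(1893, 7, 20), censuses)

print("Censuses observing the child:", [c.year for c in short_life])
print("Censuses observing the adult:", [c.year for c in long_life])
```

In this sketch the child is never observed, while the adult appears in several censuses that still have to be linked to one another. A register, by contrast, would in principle record both lives from the first entry to the last.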
Dataverse versions of the HSN data (https://datasets.iisg.amsterdam/dataverse/hsn) come with aids to enhance interoperability: a suite of tools for the classification and comparison of historical occupational titles (HISCO, HISCAM and HISCLASS), and a gazetteer of place names in the Netherlands in the relevant period (AMCO). IPUMS used an adapted version of the HISCO classification for several of its data sets (IPUMS USA 1880 100% population database, IPUMS NAPP = North Atlantic Population Project). A conversion table between HISCO and the US Bureau of the Census 1950 Standard for Occupations, CROSSWALK, is available (https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:73810).
Being reusable is clearly a strong point of all these data sets. The investment necessary to establish the databases in the first place is far too large to be justified by answering a single historical research question. They are specifically designed to be used over and over again. They often grow when they are reused and the additional data needed to answer a fresh research question are entered in the same format and added to the database.