Reflections on the Use of the Intermediate Data Structure (IDS) in Historical Demographic Research

e-ISSN: 2352-6343 DOI article: https://doi.org/10.51964/hlcs9571


The Intermediate Data Structure (IDS) was developed as a strategy aimed at standardizing the dissemination of micro-level historical demographic data. The structure provides a common and clear data strategy which facilitates studies that consider several databases, and the development and exchange of software. Based on my own experiences from working with the IDS, in this article I provide reflections on the use of IDS to create datasets for analysis and to conduct comparative demographic research.

Luciana Quaranta Lund University
The Intermediate Data Structure (IDS) was developed by George Alter, Myron Gutmann and Kees Mandemakers as a strategy aimed at standardizing the dissemination of micro-level historical demographic data (Alter & Mandemakers, 2014; Alter, Mandemakers, & Gutmann, 2009). The structure provides a common and clear data strategy which facilitates comparative studies using multiple databases as well as the development and exchange of software for analysis. It thereby increases the transparency, replicability and generalizability of research and lowers the barriers to entry into the field of historical demography. Using longitudinal demographic data is, in fact, complex, given its multilevel and relational character and the fact that it involves processes that develop over time.
The article by George Alter (2021) included in the current issue examines the main characteristics of IDS and how these characteristics make the structure flexible and expandable. In the current work I provide reflections on the use of IDS to create data sets for analysis and to conduct comparative demographic research. These reflections are based on my own experiences of working with the IDS of the Scanian Economic Demographic Database and from the development of an international comparative project that used IDS databases from five different historical European populations.
Using the IDS requires initial investments. First of all, it is necessary to transfer the data from its original form into the standardized structure. This can be done using either self-written programs or the publicly available IDS transposer (Klancher Merchant & Alter, 2017). There are large returns to the investments made in transferring the data into IDS, since it provides a clear and well-defined structure and, through the METADATA table, good documentation of variables and their values, all of which substantially facilitate the work of researchers. The common structure also makes it easier for researchers to switch from one database to another, or to conduct comparative studies. Moreover, the possibility of using publicly available software (e.g. Alter, Newton, & Oeppen, 2020; Quaranta, 2016, 2018b) reduces the barriers to entry into the field for researchers who are less experienced in data management.
To conduct longitudinal statistical analyses using data stored in the IDS it is necessary to select from the IDS tables the information that is required for the study, to process such data, to construct additional variables, and to convert the data extraction into a rectangular episodes table. This process is not simple. The IDS follows the entity-attribute-value model, and therefore contains one row for each attribute (variable) about an entity (person, place, etc.). Although this model has the advantage of making the structure clear and standardized, extracting variables from the tables can be somewhat tedious. For variables that are used multiple times in the process of creating data sets for analysis (birth date, sex, etc.), it can sometimes be useful to construct intermediate wide tables.
For other variables, extracting them from the IDS can require some extra lines of code compared with more traditional database structures. However, because the IDS is standardized, the same code can be reused with simple modifications to process different variables, for example through the use of loops.
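The extraction steps described above can be illustrated with a minimal sketch. It assumes a simplified stand-in for an entity-attribute-value table, reduced to three fields (person identifier, attribute type, value); the field layout, the attribute names and the helper functions `extract_attribute` and `build_wide_table` are hypothetical and chosen only for illustration. The sketch also assumes one value per person for each attribute, as with time-invariant variables such as birth date and sex.

```python
from collections import defaultdict

# Simplified, hypothetical stand-in for entity-attribute-value rows:
# (person id, attribute type, value). Real IDS tables hold more fields.
individual_rows = [
    (1, "BIRTH_DATE", "1850-03-12"),
    (1, "SEX", "Female"),
    (2, "BIRTH_DATE", "1852-07-01"),
    (2, "SEX", "Male"),
    (2, "OCCUPATION", "Farmer"),
]


def extract_attribute(rows, attribute_type):
    """Return {person id: value} for a single attribute type.

    Assumes at most one value per person; a later row would
    overwrite an earlier one.
    """
    return {pid: value for pid, attr, value in rows if attr == attribute_type}


def build_wide_table(rows, attribute_types):
    """Loop over attribute types and collect one wide record per person."""
    wide = defaultdict(dict)
    for attribute_type in attribute_types:
        # The same extraction code is reused for every variable.
        for pid, value in extract_attribute(rows, attribute_type).items():
            wide[pid][attribute_type] = value
    return dict(wide)


# Intermediate wide table for variables that are needed repeatedly.
wide_table = build_wide_table(individual_rows, ["BIRTH_DATE", "SEX"])
```

Adding a further variable to the wide table only requires adding its attribute type to the list passed to `build_wide_table`, which is the kind of reuse that a standardized structure makes possible.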
In a previous article (Quaranta, 2015), I proposed a solution for creating rectangular data sets for analysis from IDS databases, which involved the use of the Chronicle file. The Chronicle file is a middle step between the IDS tables, which store the information found in sources, and the episodes tables, which are rectangular files that are ready for statistical analysis. Like the basic IDS tables, the Chronicle file follows the entity-attribute-value model, and it contains one row for each date of an event and each date of change in variable values. Using the program 'Episodes File Creator' (Quaranta, 2016), the Chronicle file can be transformed into a rectangular table for statistical analysis. One advantage of using the Chronicle file when developing a data set for analysis from IDS data is that it allows variables to be created in a modular way. In other words, rather than creating all variables at the same time, it is possible to develop separate syntax files or sections of syntax files, each focusing on a specific variable (marital status, occupation, household size, etc.) or event (marriage, birth of a child, death, etc.). Programming variables in a modular fashion does not increase computational time. It makes it easy to check, modify and run variable-specific code, and to use the same constructed variables in different types of analyses, in the same way as source variables stored in the basic IDS tables. Modular programming also facilitates the exchange of software between researchers, since it is possible to share not only software that produces full data sets for analysis but also software that produces only one specific variable.
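The idea of turning dated entity-attribute-value rows into a rectangular episodes table can be sketched as follows. This is a simplified illustration and not the algorithm of the 'Episodes File Creator'. The row layout (person identifier, ISO date, variable, value), the variable names and the function `chronicle_to_episodes` are assumptions made for the example. Each list of rows stands for the output of a separate, variable-specific module.

```python
from itertools import groupby

# Hypothetical output of two separate modules, each producing
# Chronicle-style rows: (person id, ISO date, variable, value).
observation_rows = [
    (1, "1850-03-12", "UNDER_OBSERVATION", "yes"),
    (1, "1880-09-30", "UNDER_OBSERVATION", "no"),
]
marital_status_rows = [
    (1, "1850-03-12", "MARITAL_STATUS", "unmarried"),
    (1, "1872-05-01", "MARITAL_STATUS", "married"),
]

# Modules are combined simply by stacking their rows.
chronicle = observation_rows + marital_status_rows


def chronicle_to_episodes(rows):
    """Build one episode per person and interval between change dates.

    Values are carried forward until the next change. The last date
    of a person closes the final episode; no open-ended episode is
    written after it.
    """
    episodes = []
    # ISO date strings sort chronologically, so a plain sort suffices.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for pid, person_rows in groupby(ordered, key=lambda r: r[0]):
        state = {}    # current value of each variable
        start = None  # start date of the episode being built
        for date, changes in groupby(person_rows, key=lambda r: r[1]):
            if start is not None:
                # Close the running episode with the values valid so far.
                episodes.append({"id": pid, "start": start, "stop": date, **state})
            for _, _, variable, value in changes:
                state[variable] = value
            start = date
    return episodes


episodes = chronicle_to_episodes(chronicle)
```

Because each variable enters through its own list of rows, a module can be corrected or replaced without touching the others, and the conversion step stays the same.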

INTRODUCTION
In 2009, together with research engineer Clas Andersson, I converted the Scanian Economic Demographic Database (SEDD) into the IDS (see Dribe and Quaranta (2020) for a full description of the SEDD source material, the structure of the data and the IDS transfer). In addition to creating basic IDS tables containing source information, we used that source information to construct a wide range of individual- and context-level variables, to be used for many different types of research questions (related to fertility, marriage, migration, social mobility, mortality, etc.). All these variables are included in a Chronicle file. Researchers who request data from SEDD are provided with both basic IDS tables and a Chronicle file, which allows them to conduct most types of analyses without further elaboration of the data. This has not only facilitated the work of researchers, but it has also made different research outputs more consistent, since variables were defined identically. The modular nature of the Chronicle file and of the programs used to create variables has made it easy, when necessary, to update and fix the code specific to each constructed variable as new data was added to SEDD. It has also made it possible to add new modules to create additional variables when research questions were expanded. From these experiences of working with IDS in SEDD I can therefore say that the IDS functions well, that it is very advantageous for research, and that even if the investments made are large, the returns to these investments are much larger.

REFLECTIONS ON USING IDS IN COMPARATIVE RESEARCH
The growing availability of micro-level longitudinal databases combined with the development of advanced statistical techniques and related software has allowed major developments in the field of historical demography. However, due to the high costs involved in the digitization of historical sources, the majority of the research that has been published has been based on studies that focus on small communities and on limited time periods, making it difficult to test the external validity of their findings.
One of the best ways to extend the scope of small databases and the generalizability of results across different contexts and time periods is by conducting comparative research that considers several populations. Facilitating comparative research was one of the main aims in the development of the IDS. The first, and to date the only, international comparative research project based on the IDS looked into intergenerational transmissions in infant mortality across the maternal line in five different historical populations in Europe (Quaranta & Sommerseth, 2018): Belgium (Donrovich, Puschmann, & Matthijs, 2018), the Netherlands (van Dijk & Mandemakers, 2018), Norway (Sommerseth, 2018), northern Sweden (Broström, Edvinsson, & Engberg, 2018) and southern Sweden (Quaranta, 2018a). The studies were conducted using the same methods and theoretical framework, and the same programs were used to create a Chronicle file for each database (Quaranta, 2018b), to transform that file into a rectangular episodes table (Quaranta, 2016) and to run the statistical models (Quaranta, 2018b).
Developing the software used in the project to create the Chronicle file was complex. However, such complexity was related to the underlying characteristics of the research question and data, which required following individuals longitudinally across their lives and across generations, rather than to the fact that five different databases were used. Writing one common program to be used across the five databases included in the project was not much more difficult or time consuming than writing a program for a single database. The only additional challenge faced in developing a common syntax code was in the definition of when individuals were under observation and thus exposed to risk, given that some of the databases contained only information on vital events, while other databases included data on vital events as well as data on migration.
Thanks to the IDS, we were able to use exactly the same programs in our study, which was a novelty with respect to previous works. The few comparative historical demographic studies that had been published before were primarily based on common research questions and model definitions across the communities considered, but, given the different underlying structure of the databases used, they employed independent syntax for each database to create the data sets for analysis and run the statistical models. Such studies were unable to distinguish differences in the findings that stemmed from diverging characteristics of the contexts from those related to differences in data handling and model generating processes. Instead, in our IDS-based study, it was possible to fully compare the findings and even to pool the data sets to estimate one common model across the five populations. The project demonstrated that the IDS and the construction of Chronicle files do indeed greatly facilitate the development of collaborative research and make it possible for studies to be fully comparative and reproducible. Given that the software used in the project was published, the same software could also be used in other contexts. The project also demonstrated that this approach can be utilized in many new IDS-based comparative projects, thereby expanding the future of historical demographic research.
The IDS was developed with the aim of providing a standardized structure to facilitate the dissemination of micro-level historical demographic data. My own experience with the use of the IDS has shown that, although the initial investments needed to transfer the data into IDS and to learn how to use the structure are not negligible, there are very large returns to these investments. Detailed documentation is provided to researchers. Data sets for analysis can be developed in a modular way by first creating Chronicle files, which simplifies the control, expansion and exchange of programs. The standard nature of IDS also greatly facilitates the development of comparative studies. The IDS therefore has the potential to lower the barriers to entry into the field of historical demography, to make research more transparent and reproducible, and to make results generalizable to wider contexts, in these ways expanding the scope of future historical demographic research.
To date not many standard IDS programs have been published, perhaps because documenting software is time consuming and software development is not yet recognized as an academic merit within our discipline. The organization of research workshops devoted to writing IDS programs can be a step forward in this development. The establishment of more research collaborations involving IDS-based comparative studies can also help to incentivise more databases to transfer their data into the IDS and researchers to produce and share IDS software. The main Swedish historical demographic databases, including SEDD, have joined the national infrastructure SwedPop (www.swedpop.se), which will allow researchers to download IDS-based harmonized data from the databases. This infrastructure will not only significantly improve access to Swedish population data, but it will also enable more comparative studies to be conducted and possibly also further promote the use of IDS internationally.