Historical Chinese Microdata. 40 Years of Dataset Construction by the Lee-Campbell Research Group
Cameron D. Campbell & James Z. Lee

e-ISSN: 2352-6343 PID article: http://hdl.handle.net/10622/23526343-2020-0004?locatt=view:master © 2020, Campbell, Lee. This open-access work is licensed under a Creative Commons Attribution 4.0 International License, which permits use, reproduction & distribution in any medium for non-commercial purposes, provided the original author(s) and source are given credit. See http://creativecommons.org/licenses/.

For over forty years, beginning in 1979, the Lee-Campbell Group has devoted considerable effort to locating, constructing, and analysing individual-level datasets based largely on Chinese archival materials to produce a scholarship of discovery. Initially, we studied Chinese demographic behaviour, households, kin networks, and socioeconomic attainment. We constructed datasets that followed individuals from birth to death and families and households across multiple generations. More recently, we have turned to the study of the social origins and careers of civil and military officials and other educational and professional elites. We are now constructing datasets that describe the civil service careers of Qing dynasty officials from the 18th century to the beginning of the 20th century, the social origins and educational trajectories of university students during the 20th century, the qualifications and careers of government officials and educated professionals largely during the Republican era, and the experience of hundreds of thousands of Chinese peasants and their families during the process of rural reconstruction from land reform in the mid-1940s to People's Communes in the mid-1960s. Based on these data, we have published seven academic books and some 70 academic articles, mostly in English, eleven of which have won thirteen best academic prizes or equivalent recognition, including four books and two articles we wrote in English and two books and three articles we wrote in Chinese. 1 This article is a retrospective on these projects and a summary of their findings. In part one, we give an overview of the datasets themselves, summarizing their contents, organization, and notable features. In part two, we provide an integrated history, starting in 1979 with James Lee's effort to locate systematic historical demographic microdata in China, and continuing up to the present.
In part three, we summarize the contributions from the analysis of these datasets, beginning with the key demographic outcomes that were the focus of our early work, and then move to inequality and stratification. We conclude with reflections based on our experience. This is the first time we have presented all our projects together and discussed them and the results of our analysis as a single integrated whole. By doing so, we clarify the full scale and range of our efforts for readers who may only be familiar with some of the projects. The projects share an emphasis on the discovery of social phenomena by an inductive approach that prioritizes careful description, sensitivity to the institutions that produced the sources, and awareness of the historical and social context in which the individuals and families we study are embedded. Up-to-date information on each project is available at our group website. 2
In this section we present the basic features and current status of each dataset. For this introduction, we organize our datasets into four categories: 1) family, kinship and demographic behaviour, 2) education, 3) employment, and 4) rural reconstruction. Each category includes a variety of datasets, some largely complete, and some still in progress. The populations are all Chinese, largely from the 18th, 19th, and 20th centuries. As of July 2020, the datasets include 8,167,457 records with nominative information on the behaviour and life outcomes of 1,753,700 individuals who were the focus of those records as well as several hundred thousand related individuals. 3 Table 1 summarizes their contents. Three datasets and their accompanying documentation are already available for download; we introduce them in detail below, with links and other information. What these diverse datasets have in common is that each seeks to record a substantively interesting, well-defined target population in its entirety.
This approach is different from constructing a statistically representative sample from a large target population in order to make inferences about it. By including entire target populations, we can describe broader social, political, and economic processes involving important or influential population subgroups in more detail than is possible with representative samples drawn from the general population. This is particularly important for upper-class populations such as nobles, civil servants, university faculty, and elite university graduates who may be important agents and/or harbingers of change, but account for only a small fraction of the general population and in a representative sample would include too few cases to allow much analysis of their composition, function, and change over time.
We introduce the datasets in detail below. The family, kinship and population datasets are defined by geographic area and hereditary status. Two are largely rural populations in northeast China, while the third consists of members of the Qing Imperial Lineage, almost all of whom lived in Beijing and what is now Shenyang. The education datasets include students from almost all the major Republican-era Chinese universities and from two major universities after 1949, as well as the vast majority of Chinese who graduated from foreign universities before the early 1950s. The employment datasets include nearly all civil and many military Chinese officials employed between 1760 and 1912, as well as separate nominative information for Chinese professionals from the fall of the Qing in 1912 to the early 1950s, including almost all certified accountants, health professionals, engineers, and university faculty nationally as well as legal professionals in Shanghai and Beijing. Finally, the rural reconstruction datasets are drawn from nominative lists of individuals, often organized by household, for entire villages, rural brigades and communes, and even entire counties undergoing rural reconstruction either during Land Reform in the late 1940s or subsequent reorganization in the early and mid-1960s.

FAMILY, KINSHIP AND DEMOGRAPHIC BEHAVIOUR
Inspired by earlier studies on Chinese population history by Ping-ti Ho (1959) as well as by the contributions to English and French social and economic history by Louis Henry (Gautier & Henry, 1958), Peter Laslett (Laslett & Wall, 1972) and their colleagues and students, our initial efforts at collecting individual-level microdata for historical China focused on family, kinship and demographic behaviour. The datasets we eventually produced, collectively referred to as the China Multi-Generational Panel Datasets (hereafter CMGPD), are amenable to event-history analysis to examine community and family contextual influences on demographic behaviour and socioeconomic outcomes (Dong, Campbell, Kurosu, Yang, & Lee, 2015b). Table 2 summarizes the CMGPD datasets.
3 For the family, kinship and population datasets, each record only provided information about the person who was the focus of the record, akin to an entry in a census form. For many of the remaining datasets, in addition to details about the individual who was the focus, records sometimes also listed the names and sometimes occupation or other information of related individuals, typically their spouses, parents, or other relatives.
The China Multigenerational Panel Dataset-Liaoning (hereafter CMGPD-LN) and China Multigenerational Panel Dataset-Shuangcheng (hereafter CMGPD-SC) are based on sets of population registers covering different administrative populations within Liaoning and Shuangcheng that in terms of format and organization resemble census-like listings of households and their members compiled at frequent intervals. Households and individuals are listed in roughly the same order in every edition, making the manual linkage of their records across time straightforward. Using the longitudinal data that we construct through record linkage, we can study not only the life histories of individuals, but also the histories of households and lineages. Because the content and organization of the CMGPD-SC and CMGPD-LN resemble those of population sources for other countries, they have been used in comparative studies, most notably in the Eurasia Project on Population and Family History introduced below.
The CMGPD-SC and CMGPD-LN both record household of residence, relationship to household head, demographic outcomes including birth, marriage, death, and basic measures of socioeconomic status. The populations they record are closed in the sense that entries and exits are rare, and when they occur, the timing is recorded, so at any given point in time the set of individuals at risk of experiencing an event is well-defined. Moreover, the target communities for each register series are recorded completely. Because the data are organized by residential household and are also multi-generational, and the detail on relationship to household head allows for children to be linked to their parents, grandparents, and other kin, they allow for us to embed individuals in their households and larger kin networks and examine how their life outcomes depend on the characteristics of distant kin and ancestors.
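Because entries and exits are dated, the set of individuals at risk between any two registers can be derived directly from the linked records. The following is a minimal Python sketch of that idea on invented data. The field names (`person_id`, `year`, `status`) and the status codes are hypothetical and are not CMGPD variable names.

```python
from collections import defaultdict

def build_intervals(observations):
    """Turn linked register observations into person-intervals.

    Each observation is a dict with hypothetical keys: person_id, year,
    and status in {"present", "died", "exited"}. A person is at risk
    from a register in which they are present until the next register
    in which they appear.
    """
    by_person = defaultdict(list)
    for obs in observations:
        by_person[obs["person_id"]].append(obs)

    intervals = []
    for person_id, rows in by_person.items():
        rows.sort(key=lambda r: r["year"])
        for current, following in zip(rows, rows[1:]):
            if current["status"] != "present":
                continue  # no longer at risk after a death or exit
            intervals.append({
                "person_id": person_id,
                "start": current["year"],
                "end": following["year"],
                "died": following["status"] == "died",
            })
    return intervals

def risk_summary(intervals, start_year):
    """Return (number at risk, number of deaths) for intervals starting in start_year."""
    at_risk = [iv for iv in intervals if iv["start"] == start_year]
    deaths = sum(iv["died"] for iv in at_risk)
    return len(at_risk), deaths

# Invented example: three individuals in registers compiled three years apart.
observations = [
    {"person_id": 1, "year": 1774, "status": "present"},
    {"person_id": 1, "year": 1777, "status": "present"},
    {"person_id": 1, "year": 1780, "status": "died"},
    {"person_id": 2, "year": 1774, "status": "present"},
    {"person_id": 2, "year": 1777, "status": "exited"},
    {"person_id": 3, "year": 1777, "status": "present"},
    {"person_id": 3, "year": 1780, "status": "present"},
]
intervals = build_intervals(observations)
print(risk_summary(intervals, 1774))  # (2, 0)
print(risk_summary(intervals, 1777))  # (2, 1)
```

A person-interval layout of this kind is one common input format for event-history models. The actual datasets carry many more covariates per observation.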
The China Multigenerational Panel Dataset-Liaoning includes 698 communities over a swath of what is now Liaoning province between 1749 and 1909. CMGPD-LN communities were scattered over a large area roughly equivalent in size to the Netherlands (see map 1), and were economically, ecologically, and geographically diverse, including coastal communities who relied on fishing as well as farming, inland communities who cultivated fruit orchards as well as dry field agriculture, and mountain communities which supplemented such activities with hunting and gathering (Ding, Guo, Lee, & Campbell, 2004). The population included farmers who were tenants on state land as well as specialized populations who supplied the state in-kind levies of fish, honey, mink pelts, and other goods. Unusually for an historical Chinese source, the registers record married and widowed women completely and in detail. Like many historical Chinese sources, however, they tend to omit children who died young, especially if they were female.

Map 1 China Multigenerational Panel Dataset-Liaoning Communities
Source: Reproduced from page 4 of Lee, Campbell and Chen (2010).
The China Multigenerational Panel Dataset-Shuangcheng covers 129 communities in Shuangcheng county in Heilongjiang province from 1866 to 1913 (Chen, 2017; Wang et al., 2013). It contains 1,346,826 individual records and 156,711 household records. Through linkage, we reconstruct the histories of 107,551 individuals over as many as five generations. We also follow households over time. Map 2 presents the geographic locations of these communities and shows the location of Shuangcheng in contemporary China. We publicly released the CMGPD-SC along with accompanying documentation at ICPSR in 2014. 5 The data are annual and drawn from 14 separate register series. Unlike the CMGPD-LN, which covers a much larger area, CMGPD-SC communities are confined to one 3,000 square kilometre county directly south of the city of Harbin. Relative to the CMGPD-LN, the data in the CMGPD-SC have more detail on socioeconomic characteristics. Reflecting the population's diverse origins, for example, the registers record each household's official ethnicity: Manchu, Mongol, Han, Xibo and others. Moreover, by linking the CMGPD-SC household records to separate CMGPD-SC land registers, the CMGPD provides data on each household's landed wealth, distinguishing between assigned and acquired land. Like the CMGPD-LN, the CMGPD-SC records widows and married women in detail. It records somewhat more children who died young and more daughters than the CMGPD-LN, but such recording is hardly complete.

5 In the last three years the documentation has been downloaded 6,240 times and the data have been downloaded 2,389 times (according to https://pcms.icpsr.umich.edu/pcms/reports/studies/35292/utilization, accessed on July 14, 2020).

Map 2 Contemporary Shuangcheng County with CMGPD-SC Villages
Source: Reproduced from Wang et al. (2013, p. 5).
The China Multigenerational Panel Dataset-Imperial Lineage (CMGPD-IL) records 115,033 members of the Qing Imperial Lineage and another 135,000 or so related individuals such as spouses from before the founding of the Qing dynasty in 1644 until 1933, two decades after the fall of the Qing (Cai, Lee, Campbell, & Myers, 1994; Lee, Campbell, & Wang, 1993; Lee & Guo (Eds.), 1994). 70,000 members are sons and daughters in the main line (zongshi) of the lineage and the remaining 45,000 or so are sons in the collateral line (jueluo) (Wang, 2012). In contrast with most lineage genealogies in China, these records were compiled prospectively by the Imperial Lineage Office (zongrenfu), which throughout the Qing employed 50 to 60 officials to record Imperial Lineage members, who resided almost exclusively in Beijing and Shenyang, and administer their affairs from cradle to grave. The 28 editions of the Jade Records (yudie) produced by the Office between 1660 and 1921 are among the most detailed and complete records of fertility and infant and child mortality for a large Chinese population before the mid-20th century. Unlike the CMGPD-LN or CMGPD-SC, they do not record residential household composition. However, they do include almost all births, including daughters, official titles and employment, notable events, and the timing of exits via death and (for daughters) out-marriage (Ju, 1994). By contrast, the privately compiled lineage genealogies that were used in many studies of Chinese historical demography rarely record daughters or wives, and tend to omit sons who died in infancy, childhood, and even early adolescence, as well as adult males who never married or married but had no surviving sons (Campbell & Lee, 2002a).
By the 19th century, the Imperial Lineage was internally diverse in terms of social and economic status, including close relatives of the emperors who had a variety of privileges, and very distant relatives of the emperors whose status was mundane.

EDUCATION
As our research interests moved from population and family history to long-term trends in social mobility and social stratification, we expanded our collection to include individual student records in university archives. We collectively refer to these datasets as the China University Student Datasets (hereafter CUSD). We distinguish between overseas students (OS) who graduated from foreign universities and domestic university students during the Republic of China (ROC) and People's Republic of China (PRC). Based mostly on records of matriculating university students, these datasets typically include their name, major, place of origin, current address, previous education, and the names and occupations of their parents, and sometimes guarantors. The data accordingly not only provide information on students' family origins, but also for some students allow linkage of their records to those of their parents or siblings in the CUSD as well as to their and their family members' records in other datasets. 6 Table 3 summarizes the CUSD.
The China University Student Dataset-Republic of China (CUSD-ROC) covers university students in the Republican era (Liang, Dong, Ren, & Lee, 2017; Ren, Liang, & Lee, 2020). It includes all or partial student registration records for 34 Republican Chinese universities. While these 34 universities represent only one-third of the universities during the Republic of China, they account for 90% of the surviving student registration records we have located in Chinese university and administrative archives. They include most of the major public, private, and missionary universities. As of January 2020, we have entered 165,981 records of 136,220 students from 34 universities. Almost all these records include the student's major, age, gender, and place of origin. Most student records also include the names, occupations, and addresses of at least one parent, and in some cases of grandparents and guarantors as well. Entry of such information on parents, grandparents, and guarantors is still in progress, and we hope to add data from additional universities whose data we have already located.
The China University Student Dataset-People's Republic of China (CUSD-PRC) includes information for 64,500 undergraduate students who matriculated at Peking University between 1952 and 1999 and 86,393 undergraduate students who matriculated at Suzhou University between 1933 and 2003. The datasets are important both for their focus on elite university students in the People's Republic of China, and for their coverage of the last half of the 20th century. Peking University is one of the top national universities in China, and Suzhou is one of the best-ranked regional universities. While censuses and retrospective surveys carried out since the 1980s identify college graduates, only a small number of recent surveys specify the university they attended. Thus, with only a few exceptions, it was not previously possible to study the social or geographic origins of students at elite institutions, let alone from the 1950s to the present. 7
6 Entry is ongoing and we expect the number of students linked to parents or other kin to rise substantially in the coming years, thus we do not present numbers here.
7 See pages 24-37 of Liang et al. (2013) for details on the construction of the CUSD-PRC, pages 37-46 for a discussion of the methods used in the analysis, and 46-57 for the contents. To protect the privacy of the students in the records, entry was carried out onsite at both universities by university personnel and the data remained there. Analysis of identifying data was also carried out onsite by university personnel. Analysis offsite relied on non-identifying tabulations, transformations or other calculations that had been conducted onsite.
The newest university student dataset, the CUSD-Overseas (OS), includes 52,664 Chinese students who pursued education overseas from the late 19th to the mid-20th century, accounting for 75 to 80% of the estimated 65,000-70,000 Chinese students who graduated from foreign universities during this period. As of June 2020, the dataset includes 64,164 records for 32,543 Chinese students in Japan, 12,457 records for 11,289 Chinese students in the USA, 7,402 records for 7,356 students in Europe, and over a thousand students who studied in the Soviet Union and elsewhere or for whom country of study is not available. While the CUSD-OS are based on Chinese and foreign government records of overseas Chinese students and graduates and rarely include information on students' family members, such information will be available via linkage for those students whose undergraduate records are in the CUSD-ROC.

EMPLOYMENT
We recently have turned to constructing a variety of large datasets on individual employment in the professions and in civil and military service in late imperial, Republican, and contemporary China. Table 4 lists our employment datasets. The largest and most developed of these datasets is the China Government Employee Dataset-Qing (CGED-Q) (Chen, Campbell, Ren, & Lee, 2020; Ren, Chen, Hao, Campbell, & Lee, 2016). The core information for the CGED-Q comes from the jinshenlu, a roster of civil offices that was compiled every three months during the Qing and listed almost every regular, salaried civil office and included the holder's name, place of origin, banner affiliation (if any), location of post, job title, and other details. Positions ranged from high offices in the Six Ministries and other central government units down to low-level offices in county administrations. Each edition lists 13,000-15,000 employees. 8 As of July 2020, we have entered 4,178,078 records of 346,541 officials for the period between 1760 and 1912. 9 Most of these are from the period between 1830 and 1912, during which coverage of surviving editions is nearly complete. We are releasing the data in stages. Microdata for the period 1900-1912 are already available for download at the HKUST DataSpace and at a mirror site maintained by Renmin University Institute of Qing History. 10
8 The CGED-Q also includes some lists of military officials from a roster zhongshubeilan that originally was also compiled every three months. Each edition recorded 7,000-8,000 military officials.
Nominative linkage of the records of the same official in successive editions in the CGED-Q allows us to construct and study their career histories. Linkage procedures depended on whether officials had a hereditary affiliation with the Eight Banners by virtue of their descent from the conquest elite who established the Qing in 1644. The officials who were not affiliated with the Eight Banners were mostly Han Chinese and can be linked based on their surname, name, and province and county of origin. The combination of these four attributes was almost always unique, and the primary challenge is addressing a relatively small number of cases where records of the same individual are not linked because their name is written slightly differently in two editions, usually because one character is replaced by another that looks or sounds similar. Linkage of officials who were affiliated with the Eight Banners is more difficult because instead of a province and county of origin, their banner affiliation was recorded. Moreover, most officials who were bannermen were Manchu or Mongol and did not have surnames recorded. The primary challenge is accordingly the opposite of the one we face for non-banner officials. While approximately 86% of the combinations of name and banner affiliation are unique, for the remainder who are not unique we use additional information to prevent records of different individuals who have the same name and banner affiliation from being linked together as if they were one person. We are also entering and linking information on the family backgrounds of officials and other characteristics from records of exam degree holders and other sources.
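The exact-match step for non-banner officials can be illustrated with a short Python sketch. It is not the group's actual procedure: the field names (`surname`, `given_name`, `province`, `county`, `edition`, `title`) and the sample records are invented for illustration, and editions are represented as hypothetical (year, quarter) pairs.

```python
from collections import defaultdict

def link_careers(records):
    """Group roster records into career histories by exact key match.

    The key is the four attributes named in the text: surname, given
    name, province of origin and county of origin. All field names
    are hypothetical.
    """
    careers = defaultdict(list)
    for rec in records:
        key = (rec["surname"], rec["given_name"], rec["province"], rec["county"])
        careers[key].append(rec)
    # Order each career chronologically by (year, quarter) of the edition.
    for recs in careers.values():
        recs.sort(key=lambda r: r["edition"])
    return dict(careers)

# Invented records from two quarterly editions.
records = [
    {"surname": "Li", "given_name": "A", "province": "Zhejiang", "county": "X",
     "edition": (1900, 2), "title": "county magistrate"},
    {"surname": "Li", "given_name": "A", "province": "Zhejiang", "county": "X",
     "edition": (1900, 1), "title": "assistant magistrate"},
    {"surname": "Li", "given_name": "A", "province": "Shandong", "county": "Y",
     "edition": (1900, 1), "title": "prefect"},
]
careers = link_careers(records)
for key, recs in careers.items():
    print(key, [r["title"] for r in recs])
```

Exact matching of this kind would miss the variant-character cases described above, which would need an additional near-match step that is not shown. For banner officials the key would instead combine name and banner affiliation, with further information needed to separate the non-unique combinations.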
We have also begun to extend our coverage of employment to include government officials and educated professionals in the Republican era and the early years of the People's Republic of China. The data are important for our understanding of state building and the emergence of educated professionals as a distinct social group during this period. They also allow for comparisons of officials and some professionals who served in the Qing, Republican, and early People's Republic of China. For the CGED-ROC (Republic of China) we have entered 31,658 records of government officials who served between 1911 and 1949. These include, with some overlap, 9,988 officials from the Ministry of Education, the Ministry of Defence, the Academia Sinica, and the five administrative branches called the Control, Examination, Executive, Judicial, and Legislative Yuan, and 21,580 officials from the Transportation and Railroad Ministries. Common variables include name, sex, age, place of origin, education credentials, current employment and employment history. Acquisition of relevant materials continues, and this dataset should expand substantially.
The China Professional Occupation Datasets-Republic of China (CPOD-ROC) date back to 2016 when Bamboo Ren located related sources from the Liaoning Provincial Archive at FamilySearch (formerly the Genealogical Society of Utah). Bamboo subsequently worked with other group members, notably Yibei Wu, in archives and libraries at Beijing, Hangzhou, Nanjing, and Shanghai, to compile five discrete datasets. Of the 55,178 currently entered, 18% are medical doctors, 36% are university faculty, and another 36% are engineers. The remaining 10% are lawyers and certified accountants. Data entry is ongoing, and we expect the number of professions and especially the number of professionals for whom we have data to increase rapidly.

RURAL RECONSTRUCTION
Finally, motivated largely by students' interest in better understanding China's rural revolution, we have collected and continue to collect nominative individual-level datasets, collectively referred to as the China Rural Reconstruction Datasets (CRRD), to research China's rural reconstruction, especially during the third quarter of the 20th century. Table 5 summarizes the rural reconstruction datasets.
We do so because one of the defining features of 20th century China was the transformation of the world's largest agrarian society during this period. We are constructing two datasets that record information about individual and household experiences during the most dramatic stages of this process between 1946 and 1966, when the Chinese Communist Party carried out a nationwide redistribution of land and then gradually organized rural communities into agricultural cooperatives and ultimately People's Communes. The China Rural Reconstruction Dataset-Land Reform (CRRD-LR) was created to study the nationwide Land Reform Movement from 1946 to 1953. During this movement, local governments in many parts of rural China kept systematic records of land reform events and activities. These records include detailed individual-and household-level registers of property expropriation and reallocation and the political struggles that accompanied this redistribution of wealth. Currently the CRRD-LR contains county-wide data on the land reform experiences of over 80,000 households with approximately 400,000 individuals in Shuangcheng, Heilongjiang between 1946 and 1948.
The China Rural Reconstruction Dataset-Siqing (CRRD-SQ) is one of the most systematic and detailed sources available on social and economic change in rural China from before land reform in the 1940s up to the eve of the Cultural Revolution in 1966 (Xing, Campbell, Li, Noellert, & Lee, 2020). It is based on household social class registration forms compiled in rural areas around 1966 as part of the Socialist Education Campaign also known as the Four Clean-up (Siqing) Movement. The CRRD-SQ currently contains data from over 25,000 of these household forms, one quarter in collaboration with the Shanxi University Research Centre for Chinese Social History, from four provinces: Shanxi, Hebei, Inner Mongolia, and Guangdong. Each form records two to three pages of information per household, including their property holdings and occupations before and after land reform in the late 1940s, at the time when cooperatives were formed in the mid-1950s, and at the time of compilation in 1965 and 1966; the household head's social relations, a three-generation family history, and social, demographic, and political details on every household member over 15 sui, that is approximately 13.5 years or older according to Western ages.
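The age equivalence used above can be written out explicitly. In traditional reckoning a person is 1 sui at birth and gains one sui at each lunar New Year, so for ages recorded near the start of the year a given sui corresponds on average to a Western age about 1.5 years lower. This is the offset implied by treating 15 sui as approximately 13.5 years. The function name below is hypothetical, and the constant is an average approximation rather than an exact conversion for any individual.

```python
def sui_to_western_age(sui: float) -> float:
    """Approximate average Western age for an age recorded in sui.

    Assumes the 1.5-year offset implied by the text
    (15 sui is approximately 13.5 Western years).
    """
    return sui - 1.5

print(sui_to_western_age(15))  # 13.5
```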

HISTORY
Looking back over the last forty years, we can distinguish three distinct phases in terms of dataset construction and to some extent research. In the first phase, from 1979 to 1989, transcription and analysis were slow because of limitations in funding, technology, and support personnel. Work focused on demographic analysis of early iterations of the CMGPD-LN. In the second phase, from 1990 to 2010, data entry accelerated as stable funding became available to support a core team of full-time data coders, first in the USA and then in the People's Republic of China. Data coverage expanded to include the entire current CMGPD as well as the CUSD-PRC. In the third phase, from 2010 to the present, the range of population categories broadened to include government officials, professionals and other educated elites with the initiation of the CGED, the CPOD, and the CUSD-ROC and CUSD-OS, as well as such topics as rural reconstruction with the creation of the CRRD-LR and CRRD-SQ.
Here we narrate progress across these three phases. The emphasis is on when and how each project was initiated, the participants and their contributions, and key transitions in terms of approach and scale.
We began more than forty years ago, when James Lee started looking for quantifiable nominative individual-level microdata in historical archives in mainland China, beginning with a winter-long visit to the First Historical Archives in Beijing in 1979. Inspired by the quantitative historians and social scientists who in the 1960s and 1970s transformed our understanding of family and population in past times in Europe and North America by the construction and analysis of datasets from archival sources, he hoped to do the same for historical China. 11
11 Funding during this decade came from a combination of internal Caltech resources and external support from the National Endowment for the Humanities and from a Wang Fellowship in Chinese Studies. In addition, support from the Academia Sinica and the National Program for Advanced Study and Research in China funded nearly all our travel as well as some of our research expenses in Beijing, Shenyang and Taipei.
In 1982, on the advice of Deyuan Ju, Lee visited the Liaoning Provincial Archives and obtained microfilms of five household registers from Daoyi covering the period between 1774 and 1798. 12 Together with Robert Eng, an economic historian with prior experience with Japanese historical population registers, Lee developed a coding scheme and personally transcribed the contents of the 1774, 1780, 1786, and 1792 registers into a fixed-column format, first on paper forms and then into digital files. 13 Lee also took a course in demographic methods in 1984 offered by the Graduate Group in Demography at the University of California, Berkeley. Working primarily with various California Institute of Technology (Caltech) undergraduates, he published the first analyses of Chinese mortality, fertility, and household structure for specific historical populations on the Chinese mainland before the 20th century using household register microdata (Lee, Anthony, & Suen, 1988; Lee & Eng, 1984; Lee & Gjerde, 1986). 14 He also published with William Lavely and Feng Wang an influential article on how new historical and contemporary microdata were reshaping the understanding of Chinese demographic behaviour (Lavely, Lee, & Wang, 1990).
While Lee acquired additional 19th century household and population registers for Daoyi beginning in 1985, research using such microdata did not advance significantly until the arrival in the summer of 1987 of Cameron Campbell, a Caltech sophomore (second year student) majoring in Electrical Engineering with a side interest in Chinese history he had developed in high school. 15 Campbell had prior training and experience with database programming. After going over the various C programs that had been written for specific data transformations and calculations for the Daoyi registers, he proposed to Lee a new workflow where the data would be managed in dBase III+ (later dBase IV) and then exported to SPSS for analysis. Campbell began to develop the new code in summer 1987 when he and Lee re-visited the First Historical Archives in Beijing and the Liaoning Provincial Archives in Shenyang. 16 Processing included construction of flag variables for the occurrence of demographic events, identifiers for records of the same individual in different registers and links between kin, measures of household structure and composition, and measures of context at the individual level including the presence and absence of specific kin. This simplified data entry and created new possibilities for analysis to move beyond the calculation of rates and proportions. 17 Entry of the CMGPD-LN subsequently accelerated, first in 1990 thanks to support from the Academia Sinica and the National Science Council in Taiwan, and again in 1999 thanks to a large private gift to James Lee. Before 1990, only one of the 29 administrative populations that eventually made up the CMGPD-LN, Daoyi, had been entered, along with a few registers from another administrative population, Gaizhou, yielding about 100,000 observations. In contrast, between 1990 to 1999, we entered an additional 400,000 records from 8 more administrative populations. 
12 Additional registers were filmed at the Liaoning Provincial Archives in 1985 and 1987 that extended the series to 1873.
13 See Lee and Campbell (1997, pp. xix-xxi) for complete lists of the people who helped with the entry and analysis of the Daoyi registers, the coders who entered the data, and the funding sources.
14 By this time, Arthur Wolf had already done related work for Taiwan using Japanese colonial records from the first half of the 20th century (Wolf & Huang, 1980

PHASE 2 - ACCELERATION, 1990-2009
Our increased speed was partly due to the acquisition of most of the historical household registers and related materials from the Liaoning Provincial Archives by the Genealogical Society of Utah who made these sources available to us, and partly because the increase in funding enabled us to support a larger team of coders to transcribe these data into digital datasets. 18 Almost all the data entry was done by coders in the United States, though two CMGPD-LN series were completed in Taiwan. In 1999, we moved data entry to mainland China where we were fortunate to have the support of one and soon afterwards three reliable and enthusiastic full-time coders: Xing Xiao, Huicheng Sun, and Jiyang. 19 Over the next four years they entered the remaining 19 CMGPD-LN datasets, yielding an additional 1 million observations, and subsequently spent a full person-year largely in 2010 to clean the entire 1.5 million observation CMGPD-LN in preparation for public release of these data. 20
To conduct more detailed analyses of fertility and infant and child mortality than was possible with the CMGPD-LN, Lee collaborated with Huimin Lai and Sufen Liu at the Academia Sinica in Taiwan to construct the CMGPD-IL beginning in 1990. Deyuan Ju at the First Historical Archives had earlier introduced Lee to the collections of nominative historical demographic microdata from the Office of the Imperial Lineage including the Jade Registers (yudie) or Imperial Lineage Genealogy (Ju, 1994).
Recognizing that the nearly complete recording of male and female births and infant and child deaths made the yudie an invaluable complement to the CMGPD-LN, which recorded few daughters, and omitted some sons who died in infancy or childhood, Lee obtained copies of these data in 1985 when the First Historical Archives in Beijing microfilmed the yudie for the Genealogical Society of Utah. Lee oversaw the coding of the members of the Main Line (zongshi) in collaboration with Huimin Lai and Sufen Liu between 1990 and 1992. Lee also recruited Feng Wang in 1989 to participate in the analysis of these data and published with Campbell an introduction to the CMGPD-IL dataset (Lee, Campbell, & Wang, 1993). 21 Campbell used the CMGPD-IL in his PhD dissertation which studied long-term trends in mortality in Beijing by comparing mortality patterns in the CMGPD-IL in the 18th and 19th century with mortality patterns in Beijing in the 1920s and 1930s and after 1949. 22 Later, working with Linlan Wang, a sociology PhD student of Lee's at Peking University, Lee added the sons from the collateral line (jueluo) recorded in the 1933 Aixin Jueluo Genealogy for her 2012 PhD thesis (Wang, 2012; Wang, Lee, & Campbell, 2010). 23 In 2003, Lee, who was now at the University of Michigan, began to work with Shuang Chen, a PhD student in history, to construct the CMGPD-SC to examine the relationship between landholding and demographic behaviour in Shuangcheng, Heilongjiang (Chen, 2009; Wang et al., 2013). 24
SC between 2004 and 2007. 25 By contrast, entry of the 1.5 million records in the CMGPD-LN had taken approximately twenty years. Shuang Chen, who is now an associate professor of history at the University of Iowa, oversaw the data coding and led the analysis of the CMGPD-SC for her dissertation (Chen, 2009) and book (Chen, 2017) as well as for our contributions to the Eurasia Project fertility volume (Chen, Campbell, & Lee, 2014).
Another logistical challenge after moving data entry to China was processing payments to the coders. We are grateful to staff at University of Michigan ICPSR, UCLA CCPR and HKUST for their oversight and management of the grants, especially Ruth Danner at ICPSR, Lucy Shao at UCLA CCPR, and Freda Ching at HKUST.
20 Between 1999 and 2006, we also visited some 57 Liaoning villages during which we acquired almost 250 related data sets such as family genealogies.
21 Lee, Campbell and Wang (1993, p. 361) lists the scholars at the Academia Sinica who facilitated this work, the coders who entered the data, the Caltech undergraduates who helped with programming, and the funding sources. Lee and Guo (Eds.) (1994) provides a detailed history in Chinese of the project, including the data entry process and the transformation of the data to prepare it for analysis.
22 Campbell collected the mortality data for 20th century Beijing while doing archival and library research there in summer 1993. While conducting this research he was assisted by Jennifer Huang Bouey.
23 Unlike the Imperial Lineage Genealogy but like almost all other historical Chinese genealogies, the Aixin Jueluo Genealogy did not report daughters. Data entry was done principally by Jiyang with the assistance of Xing Xiao from April 2008 to July 2011.
24 Wang et al. (2013, p. viii) provides a complete list of the individuals who contributed in various ways to the construction of the CMGPD-SC.
Our capacity to manage and analyse steadily larger datasets improved as a result of additional changes in the workflow. The dBase programs that read in the files provided by the coders to produce the files to be analysed in SPSS and later STATA were maintained by Chris Myers in the early 1990s and then again by Cameron Campbell. However, the programs were slow. In the late 1980s, when there were fewer than 70,000 records to deal with, the dBase programs that transformed the raw data entered by coders into files to be used in analysis could run for more than a day and were prone to crashing. 26 Subsequent improvements in processing and disk speed were offset by increases in the number of records to be handled. Finally, in the mid-1990s, we froze development of the dBase programs. They continued to be used to process incoming files and prepare a file for analysis in STATA, but they were not further developed. Creation of new variables was done in STATA. Eventually, Campbell retired the dBase programs completely and wrote STATA code to handle the entire process of importing files, organizing them, creating variables for analysis, and carrying out analysis. This reduced the time required to go from the raw files provided by coders to the work files used in analysis to just a few hours.
To better understand the context of the communities that were recorded in the CMGPD-LN and learn more about the histories of the families it recorded, we conducted fieldwork in rural Liaoning. Between 1999 and 2006, we made eight field trips to Liaoning accompanied by Gao Jing and colleagues from the Liaoning Provincial Gazetteer Office and local Gazetteer Offices during which we visited 57 largely rural communities. We spent some 250 person-days visiting descendants of the CMGPD-LN populations and collecting local sources such as genealogies, tomb inscriptions, deeds, and other family documents on these populations. We also gathered oral histories and collected information on the families during the time from the end of the CMGPD-LN in 1911 up to the time of our visit. We compared these local data with the state household and population registers in Campbell and Lee (2002a) and Ding, Guo, Lee, and Campbell (2004). In each community, we shared with the families we visited genealogies of their lineages generated from the CMGPD-LN. Many families had lost their genealogies or had only rudimentary genealogies that listed only the generation and given names of male lineage members, and the materials we provided helped them reconstruct their family histories, including names and other information on ancestors who held office or had other achievements or recognition.
Taking advantage of advances in technology, we began to use event-history analysis and other regression-based approaches to study associations between individual demographic behaviour and outcomes and household and community context. The capacity of the personal computers that we were using for processing and analysis improved dramatically. In the early 1990s, computations involving the 100,000 or so records in the Daoyi series that involved anything more than tabulation or linear regression took fifteen minutes to an hour, depending on the number of observations included, the number of variables, the type of model, and the number of models. By the late 1990s, more advanced estimations run on much larger numbers of records took much less time. By the late 2000s, calculations involving the combined CMGPD-LN and CMGPD-SC files, nearly 3 million records, could be completed in minutes on a personal computer.
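For readers unfamiliar with the approach, the data layout behind a discrete-time event-history analysis can be sketched in a few lines. This is a generic illustration, not the specification used in the studies cited here. Each row is one interval during which a person is at risk, with a flag for whether the event occurred and a single binary covariate. The counts are invented and the covariate name is hypothetical. With one binary covariate, the coefficient from a logistic regression of the event flag on the covariate equals the log odds ratio from the two-by-two table, which is what the sketch computes directly.

```python
import math

# Hypothetical person-interval records: (privileged_status, died_in_interval).
# The counts are invented for illustration only.
intervals = (
    [(1, 1)] * 30 + [(1, 0)] * 970 +
    [(0, 1)] * 20 + [(0, 0)] * 980
)


def hazard_table(rows):
    """Tabulate events and intervals at risk for each covariate value."""
    counts = {0: [0, 0], 1: [0, 0]}  # covariate value -> [events, intervals at risk]
    for covariate, died in rows:
        counts[covariate][0] += died
        counts[covariate][1] += 1
    return counts


def log_odds_ratio(counts):
    """Log odds ratio of the event for covariate = 1 versus covariate = 0."""
    def odds(events, at_risk):
        return events / (at_risk - events)
    return math.log(odds(*counts[1]) / odds(*counts[0]))


counts = hazard_table(intervals)
hazards = {value: events / at_risk for value, (events, at_risk) in counts.items()}
lor = log_odds_ratio(counts)

print("interval hazards by covariate value:", hazards)
print("log odds ratio:", round(lor, 4))
```

A full model adds further covariates such as age, household context, and prices, at which point a regression routine replaces the hand tabulation. The person-interval layout stays the same.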
Our shift to event-history analysis was also spurred by Akira Hayami's 1993 invitation to James Lee to participate in the Eurasia Project in Population and Family History, an international comparative project that studied interactions between community context, household organization, and demographic behaviour in past times. 27 The project began in 1994 and yielded three volumes in a dedicated series published by MIT Press under the editorship of Lee, Bengtsson, and Alter. The first was on mortality (Bengtsson, Campbell, Lee et al., 2004), the second was on fertility (Tsuya, Wang, Alter, Lee et al., 2010), and the third was on marriage (Lundh, Kurosu et al., 2014). 28 Teams of researchers who had household register data from communities in Belgium, China, Italy, Japan, and Sweden collaborated to specify event-history models of mortality, fertility and marriage that could be estimated in all of the datasets which would yield results that could be compared. We discuss findings from the project below.
25 During this time, our work received support from NIH NICHHD 1R01HD045695-01A2 (Demographic Responses to Community and Family Context, James Lee PI).
26 While Campbell was still at Caltech, the programs were so slow and prone to crashing that when he and Lee processed the dataset to integrate newly coded registers or reprocessed the dataset to add new variables, he often slept on the floor in Lee's office, waking up every few hours to check that the programs were still running, and if they had failed, correcting the problems and restarting them.
In the early 2000s we began to plan for the public release of the CMGPD. Lee (Campbell, Dong, & Lee, 2013) as a companion to the User Guides Wang et al., 2013).
In the late 2000s, we initiated a new line of work on social mobility, stratification, and inequality. We had originally considered father-son associations in social and economic outcomes in Daoyi in Lee and Campbell (1997, pp. 196-214). When Campbell first started as an assistant professor in sociology at CMGPD-LN to study social mobility, inequality, kinship and other topics. 30 We moved past father-son associations to study associations of socioeconomic outcomes with the characteristics of progressively wider networks of kin, starting with siblings, uncles and grandparents in the same household, then moving to kin outside the household, and then eventually to lineages. We faced a constraint, however, in that the CMGPD datasets only recorded official positions held by adult males. They provide no information about other non-agricultural occupations, and only the CMGPD-SC recorded landholding.
Inspired by this growing interest in inequality and social mobility and a desire to move beyond the study of demography, and after having devoted over two decades to the collection of individual-level information on socioeconomic attainment and related demographic behaviour in pre-20th century China, we turned our attention to the construction of datasets for the study of inequality, social mobility, and social change in historical and contemporary China. Lee  We now construct datasets for the study of inequality, social mobility, and social change in historical and contemporary China. In 2010, Chen Liang proposed a study of the social origins of university students in the first half of the 20th century that would extend the CUSD-PRC through the construction and analysis of a dataset (CUSD-ROC) based on student registration cards held in historical archives throughout China (Liang et al., 2017; Ren et al., 2020). Working with others in the Lee-Campbell Group, he located the student registration cards for half of the 34 universities that currently make up the CUSD-ROC and arranged for most of their data entry. Bamboo Y. Ren, James Lee, and Mingyu Zhang located and coded the other half of these data. The process for transcription of these records differed from that of the CMGPD datasets. Rather than having a dedicated team transcribe the contents of scans of the original sources, data was entered on-site in the archives by personnel recruited locally for the task. Coding additional variables or checking the originals to resolve inconsistencies required return trips to the archives.
The next project we initiated was the CGED-Q.

PHASE 3 - THE EXPANSION, 2009-PRESENT
at Shanghai Jiao Tong University, showed him and James Lee work she was doing with records of officials in northeast China she had transcribed from quarterly jinshenlu in a collection of 206 editions published by the Tsinghua University Library. 32 Campbell, Lee, and Ren developed a plan to enter all of the 2.8 million records in this collection and an additional 1.2 million records from jinshenlu editions held elsewhere. This was completed in summer 2020. The coders we had relied on for the CMGPD began entering data in 2014. In 2016 we added new coders and the pace of entry doubled. Bijia Chen joined the project in 2015 while she was an MPhil student in Social Science at HKUST. She played a key role in the coordination of the data entry and then wrote her PhD dissertation on the careers of Qing officials (Chen, 2019).
While the CMGPD, CGED, and much of the CUSD were all initiated either by James Lee or Cameron Campbell, working with other senior members of our research team such as Chen Liang, our most recent datasets, on China's rural revolution (CRRD-LR and CRRD-SQ) and the rise of China's professionals (CPOD), are largely the initiative of our younger team members, who found these sources and organized the data construction for their PhD theses. Matthew Noellert discovered the materials that became the basis of the CRRD-LR while conducting fieldwork in Shuangcheng in 2011 and used these data to write his 2014 PhD dissertation and 2020 book. 33
The resulting dataset is held at Shanxi University and analysis is carried out by personnel there. As with the CUSD-PRC, any analysis we conduct is based on tabulations or other calculations produced in response to our requests, not on the original data. Since 2016, we have located and entered another 18,000 forms. Matt Noellert and Xiangning Li lead the analysis of these materials.

FINDINGS
We organize our review of findings by topic. We begin with studies of demographic behaviour and household organization. We follow the development of these studies from estimates of demographic rates and household structure to examinations of the implications of family hierarchy based on patterns of differentials according to household context and finally to analyses of assortative mating, relationships between household context and health and mortality in later life, and other topics. We then present work on intergenerational social mobility and inequality more generally. We start with the earliest studies of the associations between fathers' and sons' socioeconomic outcomes, move to multi-generational studies that considered the role of kin other than the father in shaping individual outcomes, and conclude with recent studies that move beyond the individual to consider kinship networks and descent groups as units of analysis, and stratification and inequality more generally. Third, we summarize recent published work on the geographic and social origins of university students in 20th century China. Finally, we summarize recent published studies of the careers of government officials during the Qing.
The earliest line of work described trends and patterns in mortality, fertility, population age and sex composition, and household structure by presentation of aggregated rates and proportions. Based on five registers from Daoyi between 1774 and 1798, Lee and Eng (1984) introduced the data and presented descriptive results on birth and death rates, population age composition, and household structure. Among other findings, they showed that these sources recorded adult males and married and widowed females completely but omitted many sons who died in infancy or early childhood along with most daughters. Lee, Campbell, and Wang (1993) introduced the CMGPD-IL and presented time trends and age patterns of mortality for the members of the Imperial Lineage. Lee, Anthony, and Suen (1988) and Lee, Campbell, and Anthony (1995) showed that levels and patterns of mortality in Daoyi resembled those in other historical populations. Lee and Gjerde (1986) compared household forms in Daoyi with those in Norway and the United States to show that existing schemes for classifying household structure were inadequate and proposed a new classification scheme more amenable to comparison between Europe and other societies. Comparison to the CMGPD-LN revealed that adult Imperial Lineage males had higher death rates than adult males in Daoyi in rural Liaoning, presumably because being restricted to living in Beijing subjected them to an 'urban penalty'. Results from these early studies led to lines of research exploring relationships of fertility, mortality, and other aspects of demographic behaviour to other social and economic variables which took advantage of the individual-level detail in the data. We describe these lines of research below.

FERTILITY
Early results on fertility patterns in these studies led to a line of work on the role of deliberate delay or cessation of childbearing in producing low levels of fertility within marriage. Wang, Lee, and Campbell (1995) and Lee and Campbell (1997, pp. 83-102) showed that in the CMGPD-IL and Daoyi respectively, marital fertility was lower than in Europe, intervals between marriage and first birth and between subsequent births were much longer, and childbearing ceased much earlier. They argued that these and other patterns were consistent with deliberate behaviour to delay births and cease childbearing. These findings were the basis of the claim in Lee and Wang (1999) that contrary to the beliefs of Malthus and his successors, a fertility-based preventive check played an important role in the population dynamics of China before the 20th century, and that the rapidity of fertility decline in the 20th century in mainland China, Taiwan and Hong Kong reflected a historical legacy of adjusting fertility according to economic and other circumstances that primed the population to respond quickly when new technologies for fertility limitation became available. A vigorous debate with advocates of a Malthusian interpretation of China's historical population dynamics ensued. Campbell and Lee (2010b) revisited the issue of fertility control and showed that once heterogeneity in fecundity across couples was properly accounted for, there was clear evidence of stopping behaviour.
Paralleling our study of mortality, we moved on to map fertility differentials to illuminate influences of community, household and individual context on reproduction. Lee and Campbell (1997, pp. 133-156, pp. 177-195) compared cumulative numbers of boys born by household structure, location within the household, and socioeconomic status in Daoyi. In general, men who had privileged statuses in the household or socioeconomic hierarchy had more children. This reflected not only earlier marriage and a higher likelihood of remarriage, but in some cases higher fertility within marriage. This consistent positive relationship between privilege and reproduction is in contrast with mortality, which as noted above was in some cases the opposite of what was expected, with privileged males experiencing higher death rates. Fertility fell during times of hardship, that is when grain prices were high or when there were climatic shocks (Campbell & Lee, 2010a). Dong (2016) also studied the role of local family system in moderating the influence of co-resident kin on reproduction across East Asian populations.
Subsequent analyses focused primarily on fertility within marriage and such related behaviour as adoption. Later work revisited fertility in an expanded CMGPD-LN and demonstrated that it was linked to location in economic and household hierarchies. Campbell and Lee (2009) examined associations of marital fertility with characteristics of kin living outside the household but did not find any. Chen, Lee, and Campbell (2010) showed that fertility in Shuangcheng was positively associated with family landholding and with other measures of socioeconomic and household status. Wang and Lee (1998) showed that in the Qing imperial lineage, as many as 12.5% of sons were adopted between related individuals, and that more generally, adoption played an important role in maintaining the continuity of the descent line and achieving other goals.

MORTALITY
Early descriptive analysis of infant and child mortality led to an examination of female infanticide that became one of the foundations of the critique of Malthusian interpretations of China's historical population dynamics in Lee and Wang (1999). Lee, Campbell, and Tan (1992) and Lee and Campbell (1997, pp. 58-82) used indirect evidence from registered births and deaths in Daoyi to argue that families employed female infanticide or neglect to influence the number and sex composition of surviving children, and in doing so responded to economic conditions as well as their personal circumstances. As noted above, dissatisfaction with reliance on indirect evidence inspired the creation of the CMGPD-IL, which recorded the births of sons and daughters completely, as well as the deaths that occurred afterward. This led to the analysis of infant and child mortality in the Imperial Lineage in Lee, Wang, and Campbell (1994) that provided direct evidence of infanticide in the form of dramatically higher death rates for daughters in the first day and month of life, furthermore demonstrating that infanticide was not only a response to crisis or desperate poverty.
The next set of mortality studies employed event-history analysis to map patterns of mortality differentials and illuminate the influence of family, community, and institutional context on death risks. Lee and Campbell (1997, pp. 133-156, pp. 177-195) first showed that mortality rates varied according to socioeconomic status and location within the household hierarchy. Relationships were sometimes counterintuitive: male privilege was sometimes associated with higher mortality risk. Campbell and Lee (1996) showed that mortality risks depended not only on household size and composition but also on the presence or absence in the household of specific kin. Campbell and Lee (2002b) examined how household context conditioned the mortality effects of widowhood and orphanhood and showed that widows' mortality risks depended on whether they had a son. Widows who had a son were unaffected by the loss of their husband, but widows without a son experienced elevated mortality. As we describe below in our summary of findings, Hao Dong lead-authored comparisons of family contextual influences on mortality in Liaoning and Taiwan in China and northeast Japan using pooled household register data from all three locations (Dong, 2016; Dong et al., 2017).
This led to studies of the short-term consequences of economic and climatic shocks and long-term effects of public health interventions. Campbell and Lee (2000) argued based on an analysis of effects of the interactions of social status, household context and prices that there was a trade-off between privilege and mortality risk, with the mortality of privileged individuals also being more sensitive to price fluctuations. Campbell and Lee (2004) used a much larger sample from the CMGPD-LN to investigate differentials in mortality levels and mortality sensitivity to price fluctuations in more detail. Male mortality was more sensitive to grain price fluctuations than female mortality, and the response was conditioned by age, socioeconomic status, and household context. These results contributed to the comparisons between East and West in Bengtsson et al. (2004). Campbell and Lee (2010a) investigated the effects of unusually cold summers and other climatic disruptions in the years 1782-1789, 1813-1815, and 1831-1841. During the first of these periods, life expectancy fell by more than 10 years. Young males and females were especially hard hit, with the death rates of males aged 5-15 being multiplied by 8.78 and females aged 5-15 being multiplied by 4.65. Campbell (1997; 2001) assessed the effects of public health interventions in Beijing at the beginning of the 20th century and immediately after 1949 by comparing mortality in the CMGPD-IL in the 19th century to rates in Beijing at different points in time in the early, mid and late 20th century.
Recent studies investigate the consequences of family context and history for mortality later in life or in later generations. Chen et al. (2005) used the CMGPD-SC to compare the death rates of settlers in Shuangcheng according to whether their families originated in urban Beijing and its environs or from rural northeast China. They found that descendants of migrants from Beijing experienced a persistent mortality disadvantage even though state policies privileged them. Campbell and Lee (2009) used the CMGPD-LN to study how household context affected mortality in adulthood and old age. They found that men who had lost their mothers in childhood or whose mothers were 35 or older when they were born had higher death rates in adulthood, and that men experienced elevated mortality risks in old age if they were born after a short preceding birth interval, to women who were 35 or older, to a father listed as disabled, or to a father who held a salaried official position. Dong and Lee (2014) used the CMGPD-LN to examine mortality in later life of men who had migrated from one village to another in childhood and found that they had more favourable outcomes if they had kin in their destination village. Most recently, Zang and Campbell (2018) used the CMGPD-LN to investigate how co-residence with grandparents in childhood influenced mortality in adulthood and old age.

MARRIAGE AND HOUSEHOLD
Marriage has been a fruitful area for study because marriage timing and the overall chances of marriage closely reflected household priorities and individual privilege within the household. Marriage was the direct result of an explicit decision by the household about when a son or daughter would marry, and who they would marry. By contrast, with the exception of infanticide, fertility and mortality were outcomes that were influenced by household priorities and decisions but were also subject to a variety of other unrelated influences, making associations much more difficult to interpret. High status males were more likely to marry and if widowed, remarry. The first demonstration of the positive association between male socioeconomic and household status and marriage chances was for Daoyi in Lee and Campbell (1997, pp. 133-156, pp. 177-195). In the Imperial Lineage, social status was also positively associated with male marriage chances (Lee, Wang, & Ruan, 2001). Socioeconomic status of distant but co-residing kin influenced male marriage chances, and there was also clear evidence of sequencing among unmarried males of the same generation within the household (Campbell & Lee, 2008b). Higher status females tended to marry later but almost all females married eventually (Chen, Campbell, & Lee, 2014). Remarriage chances were tied to socioeconomic status as well, with higher status widowers more likely to remarry.
We also considered other aspects of marriage, including polygyny and the effects of economic shocks. Even though polygyny was one of the most widely noted features of marriage in China before the 20th century, it was extremely rare in the rural populations covered by the CMGPD-LN and CMGPD-SC. Even in the elite Imperial Lineage, polygyny became steadily less common over time, so that by the last half of the 19th century it was rare except among close relatives of the emperor. Moreover, polygyny was used primarily to extend the reproductive span of males rather than to father children with different partners at the same time. In rural Liaoning, economic hardship as reflected in elevated grain prices did not have an immediate effect on marriage the way it did on mortality and fertility, but had a lagged effect because elevated female infant and child mortality disproportionately reduced the numbers of girls reaching adulthood two decades later, worsening the imbalance in the marriage market (Campbell & Lee, 2008a).
Recently we have examined assortative marriage, that is who marries whom, for insight into family preferences regarding their affinal connections. This helps delineate social, economic and institutional boundaries between groups in historical China. Our first paper on the topic examined interethnic marriage in the CMGPD-SC for insight into whether in a unique institutional setting where Han and Manchu were allowed to intermarry without being affected by rules forbidding marriage between affiliates of the Eight Banners and regular civilians, they would do so (Chen, Campbell, & Dong, 2018). We found that marriage between Manchu and Han was common and that its likelihood depended on family characteristics including a family history of intermarriage, local marriage market composition, and other factors. Our second paper on the topic examined assortative mating by education and family class label in rural Shanxi in China in the middle of the 20th century (Xing et al., 2020) and found that both were important in marriage formation, and that patterns changed little before and after 1949, when the People's Republic of China was established. This was a novel finding because while there are many studies of educational assortative marriage in China in the last half of the 20th century, there are fewer that consider the middle of the 20th century and simultaneously consider the role of class labels.
Another line of work investigates household dynamics, including the growth of households and the formation of new ones by household division. In Liaoning, a large share of the population lived in large households with many distantly related individuals living together (Lee & Campbell, 1997, pp. 105-132; Lee & Gjerde, 1986). These households were highly hierarchical, with status and privilege determined by relationship to the household head (Lee & Campbell, 1997, pp. 133-156). The head and his or her children and grandchildren were most privileged, and more distant kin were less privileged. When households divided, it was typically upon the death of a head or other senior relative whose presence linked different kin groups within the household (Lee & Campbell, 1998). Household heads were predominantly male, but widows sometimes inherited the headship after the death of their husband. Household division was a liberating experience for distant relatives of the head who previously had been at the bottom of the hierarchy but as a result of division now exercised control over the resources of the newly formed household (Campbell & Lee, 1999
The resulting comparisons of mortality (Bengtsson et al., 2004), fertility (Tsuya et al., 2010) and nuptiality (Lundh et al., 2014) revealed unanticipated similarities between East and West, the role of household context in shaping demographic behaviour, as well as unexpected differences. Key findings were that in the West, socioeconomic differences were important for shaping demographic responses to economic shocks, while in the East, socio-political differences in household context were more important. Overall, and in contrast with expectations based on Malthusian interpretations of population dynamics, demographic responses to economic shocks were weaker in the East than in the West. The emphasis on comparison by analysis of results from models that were the same across all the different datasets distinguished this effort from previous international comparisons of population and family in past times and led to novel results.
For insight into the strengths and weaknesses of the registers that were the basis of the CMGPD-LN, we also compared the records of the same families in the CMGPD-LN and in their own genealogies. Among the materials we collected from each village we visited during our eight field trips were lineage genealogies. We transcribed these into a dataset and then compared the recording of lineage members between the CMGPD-LN and the genealogies (Campbell & Lee, 2002a). We found that, as is already widely known, sons who died in infancy and childhood as well as daughters tended to be omitted from family genealogies, leading fertility estimated from genealogies to be underestimated. We also showed that fertility estimated from genealogies could be overestimated because genealogies were more likely to omit adults who never married and married adults who did not have any surviving heirs. Whereas previous research had assumed that because of the omission of sons who died early and most daughters, fertility estimates from genealogies could be 'corrected' with an adjustment for infant and child mortality and the sex ratio at birth, the countervailing biases associated with the omission of childless adults made adjustment much more difficult, or perhaps even impossible.
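The countervailing biases discussed above can be illustrated with a small worked example. The numbers below are purely hypothetical, chosen for illustration rather than drawn from the CMGPD-LN or any genealogy, and the function name is likewise an assumption.

```python
def children_per_man(children: int, men: int) -> float:
    """Average number of recorded children per recorded adult man."""
    return children / men


# Hypothetical population (illustrative numbers only, not from any dataset).
men_total = 100          # all adult men in the lineage
men_childless = 20       # never married, or married without surviving heirs
children_total = 400     # all children ever born
children_omitted = 180   # daughters plus sons who died in infancy or childhood

true_rate = children_per_man(children_total, men_total)

# Bias 1: omitted children shrink the numerator, so fertility is understated.
numerator_only = children_per_man(children_total - children_omitted, men_total)

# Bias 2: omitted childless men shrink the denominator, so fertility is overstated.
denominator_only = children_per_man(children_total, men_total - men_childless)

# A genealogy is subject to both omissions at once.
genealogy_rate = children_per_man(
    children_total - children_omitted, men_total - men_childless
)

print(true_rate, numerator_only, denominator_only, genealogy_rate)
```

In this hypothetical case, scaling the genealogy-based rate up only for the omitted children would overshoot the true rate, because the denominator is also too small; a correction would need to know the share of omitted childless adults as well.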
We have initiated a new comparative, collaborative study of family and demographic behaviour in historical East Asia led by Hao Dong. Hao Dong harmonized datasets from northeast China, northeast Japan, Korea and Taiwan, and worked with Satomi Kurosu (Japan) and Wenshan Yang (Taiwan) on the analysis of these data. For these comparisons we have also made use of triennial Korean household registers from the county of Tansong, made publicly available by a group of historians who at the time were mostly at Sungkyunkwan University, which we have turned into a longitudinal dataset through nominative linkage. 36 The resulting comparative studies explore how family context, including the presence and absence of various kin, influenced demographic outcomes across East Asian populations (Dong et al., 2015a, 2015b; Dong, 2016; Dong et al., 2017).
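As a rough illustration of how nominative linkage can turn successive cross-sectional registers into a panel, the sketch below links records across register years when household, name, and implied birth year agree. It is a simplified sketch under stated assumptions, not the software actually used for the Korean registers: the field names, the matching key, the one-year tolerance, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One person as listed in one register (hypothetical fields)."""
    register_year: int
    household: str
    name: str
    age: int  # age as recorded in that register

    @property
    def birth_year(self) -> int:
        return self.register_year - self.age


def link_records(records, tolerance=1):
    """Map each record to a person ID, linking the same person across years.

    A record is linked to an earlier one when household and name match and
    the implied birth years differ by at most `tolerance` years.
    """
    person_ids = {}
    last_seen = []  # (person_id, most recent record for that person)
    next_id = 0
    for rec in sorted(records, key=lambda r: r.register_year):
        match = None
        for i, (_, prev) in enumerate(last_seen):
            if (prev.register_year < rec.register_year
                    and prev.household == rec.household
                    and prev.name == rec.name
                    and abs(prev.birth_year - rec.birth_year) <= tolerance):
                match = i
                break
        if match is None:
            pid = next_id
            next_id += 1
            last_seen.append((pid, rec))
        else:
            pid = last_seen[match][0]
            last_seen[match] = (pid, rec)
        person_ids[rec] = pid
    return person_ids


# Illustrative records only; names, years, and ages are invented.
registers = [
    Record(1717, "H1", "Kim A", 30),
    Record(1720, "H1", "Kim A", 33),
    Record(1720, "H1", "Kim B", 5),
    Record(1723, "H1", "Kim A", 37),  # age off by one year, still linked
]
ids = link_records(registers)
```

Real registers would call for more forgiving rules, for example handling name changes, household moves, and competing candidate matches, which this sketch deliberately leaves out.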
35 Each of the teams had other participants with whom we interacted more sporadically.
36 Cameron Campbell and Hao Dong wrote software to produce longitudinal links of records of the same individual in different registers to transform the cross-sectional data into a panel dataset (Dong et al., 2015b). The longitudinal links are available at https://doi.org/10.14711/dataset/IVIDZV.

SOCIAL MOBILITY, INEQUALITY AND MIGRATION

Our study of social mobility progressed from the analysis of associations in the outcomes of fathers and sons to the study of the role of networks of kin in shaping individual outcomes and finally to the study of lineages in their own right, with lineage membership as a key stratifying variable in historical Chinese society. Initial studies of father-son associations revealed that a son had a greater chance of obtaining a salaried official position in Daoyi if his father held one (Lee & Campbell, 1997, pp. 196-215). Comparison with results of studies of social mobility in 19th century North America and Europe revealed that the attainment advantages of the sons of locally elite fathers were nevertheless much less pronounced in Liaoning than in the West (Campbell & Lee, 2003). Ethnic mobility accompanied social mobility in the sense that Han men who held salaried official positions were more likely to change from Han to Manchu names (Campbell, Lee, & Elliott, 2002). In every generation, large proportions of the men in Liaoning who entered the local elite by obtaining salaried government positions were 'new' in the sense that not only did their father not hold a position, but neither did any of their other patrilineal relatives (Campbell & Lee, 2003). Positions held by other kin were usually a source of advantage, though not always (Campbell & Lee, 2008b).
Lineage membership was also a source of differentiation in rural Liaoning. Social and demographic outcomes, especially attainment chances and marriage chances, depended not only on individual and household characteristics but also on lineage affiliation (Campbell & Lee, 2008c). There was long-term continuity in the relative status of lineages, not only during the Qing but also between the Qing and the late 20th century (Campbell & Lee, 2011). Socioeconomic privilege not only increased the number of children a man had, but increased the total number of descendants he had for as many as six generations, meaning that in every generation, a disproportionate share of the population was descended from the most socioeconomically privileged members of the population in previous generations (Song, Campbell, & Lee, 2015). We have also explored computational approaches to the study of lineages: Fu et al. (2018) used visualization and network techniques to study the determinants of the morphology of descent line structure.
Newer work considers inequality from an even broader perspective. Chen (2017) examines stratification based on institutional affiliation and landholding in Shuangcheng. The state specified different entitlements to land according to population category defined by institutional affiliation. These differential land entitlements affected landholding as well as access to other social and economic privileges. The residents of Shuangcheng challenged the state-defined social hierarchy in some cases but at the same time reinforced it in others. Noellert (2020) examines individual-level data on Land Reform events in Shuangcheng after 1945 and finds that a redistribution of power away from local strongmen paved the way for the reallocation of property, which continued to be defined by state entitlements.
We have also conducted studies of migration. The CMGPD-LN follows households when they move within Liaoning. When people leave the region entirely, usually illegally, that is recorded as well. Our first study examined the determinants of legal migration of households within Liaoning and illegal departure from the region (Campbell & Lee, 2001). Household age structure conditioned legal migration: 'younger' households with fewer elderly dependents were much more likely to migrate. Households with men who held salaried positions, meanwhile, were less likely to move. Illegal departure was more common for men who were unmarried or widowed, distant relatives of the household head, or members of smaller households. Dong et al. (2015a) compare patterns in northeast China with those in 18th and 19th century Korea and Japan.

SOCIAL AND SPATIAL ORIGINS OF UNIVERSITY STUDENTS IN 20TH CENTURY CHINA

Studies based on CUSD datasets of student registration cards and other materials have illuminated shifts in the spatial and social origins of university students in China from the late 19th century to the beginning of the 21st century. Whereas during the Qing educated elites were recruited nationally via the examination system until its abolition in 1905, the educated elites who dominated Republican China in the first half of the 20th century were generally drawn from merchant and white-collar professional families in the major coastal cities (Liang et al., 2017; Ren et al., 2020). Liang et al. (2013) moreover showed that immediately after 1949, student origins at Peking University and Soochow University continued to resemble those at Republican-era universities, with disproportionate numbers of students coming from business and professional families in the coastal cities.
More importantly, Liang et al. (2012, 2013) also show that the introduction of standardized exams (gaokao) in 1955 together with a major expansion in primary and secondary education fundamentally transformed the composition of university-eligible students. In particular, the numbers of students from farm and factory families who were the first in their families to attend college significantly increased. This pattern persisted well into the 1990s, when the share of children of professionals began to rise again. At least until 2004, however, approximately 30% of students in Peking University and 40% of students in Soochow University still originated from working-class families. This was a very different pattern from much of the West, where the students who attend the elite private universities that are the counterpart of Peking and Soochow Universities overwhelmingly come from high-income families. These findings had considerable impact on ongoing Chinese debates at the turn of the twenty-first century over whether the college entrance examinations still maintained earlier opportunities for students from families of modest means, or favoured students from already well-off families. 37
Analysis of the CGED-Q has already yielded insights into Qing officialdom and the careers of officials not available from traditional approaches to the study of the Qing civil service, which emphasize case studies of individuals or offices, or specific time periods. Ren et al. (2016), Chen (2019) and Chen et al. (2020) show that the central government, especially its upper reaches, was dominated by Manchu and other bannermen right up to the end of the Qing. Only a relatively small share of Han who qualified by their civil service examination performance served in the central government, largely confined to the Hanlin Academy and related offices. Outside the central government, however, officials were predominantly Han, and included more holders of purchased degrees than holders of exam degrees. Median career length was just under seven years, except for bannermen and holders of gongsheng exam degrees, whose median career length was three years (Chen et al., 2020).
The abolition of the examination system in 1905 had little effect on the holders of exam degrees who were already officials, or on holders of exam degrees who awaited appointment. Chen, Campbell and Lee (2018) examined Banner officials at the very end of the Qing and found that their numbers and positions changed little during the New Government period, but that their share of officials declined because of an increase in the number of Han officials. Campbell (2020) shows that after the abolition of the civil service exams in 1905, men who already held exam degrees continued to be appointed at the same pace as before, and the turnover of exam degree holders who were officials was unaffected. Such results challenge claims made in other studies that the abolition of the examinations adversely affected aspiring elites (Bai & Jia, 2016).
Looking back at four decades of collaboration on the study of demographic, social and economic history, we have some reflections and observations. The first is that we have been extraordinarily fortunate in terms of finding, acquiring, and constructing the diverse, large microdata sources which are the basis of almost all our research. This has been a group effort. Our achievements in identifying new sources of microdata to understand China's past and sometimes present are increasingly due to collaborations with our colleagues in the Lee-Campbell Research
Third, we also benefitted from many other individuals who contributed to our construction of these datasets. Listing everyone here would be impossible, but we can single out some individuals who played especially important roles. Ju Deyuan, Robert Eng, Alice Suen, Anna Chi and others helped James Lee initiate research on Daoyi. Mel Thatcher arranged for access to the collections of the Genealogical Society of Utah. Ts'ui-jung Liu, Sufen Liu, and Huimin Lai at the Academia Sinica facilitated the creation of the CMGPD-IL and one of the CMGPD-LN register series. Among the coders at the Academia Sinica, Shu-mei Tsay made the most contributions to the CMGPD-IL and CMGPD-LN. Shuang Chen, Matt Noellert, and Bijia Chen coordinated and oversaw the data entry of the CMGPD-SC, CRRD-LR and CRRD-SQ, and CGED-Q, respectively. Similarly, Liang Chen, Bamboo Ren, Hao Zhang, Yibei Wu, and Li Yang initiated, coordinated, and oversaw the creation of various subsets of the CUSD and CPOD. Hao Dong helped create longitudinal links for the Korean registers and led the efforts to harmonize the CMGPD and other datasets from Japan, Korea, and Taiwan. Many, many coders worked tirelessly to enter all these data. It is not possible to list all of them here, but we highlight six who made especially large contributions over extended periods of time: Huicheng Sun, Jiyang, and Xing Xiao entered much of the CMGPD-LN, CMGPD-SC, and CRRD-LR and, together with Xiaodong Ge, Yibei Liu, and Mi Zhao, the CGED-Q.
Fourth, we could not have proceeded without generous institutional and occasionally personal support. James Lee began his career at the California Institute of Technology and Cameron Campbell met him there while an undergraduate. In retrospect, Caltech was one of only a few places that in the early 1980s would support an assistant and then associate and finally full professor of humanities/history to carry out quantitative research on China. It is also hard to imagine that at any other institution, a sophomore studying electrical engineering who had a side interest in Chinese history but no language ability could walk into a history professor's office and after some discussions outline a plan to reorganize data management and analysis for an ongoing project and then become a collaborator. In graduate school at Penn and then as an assistant, associate and then full professor of sociology at UCLA, Campbell was supported by mentors and then colleagues even though his work was esoteric.
Internal funding and administrative support from the California Institute of Technology, the University of Michigan, UCLA, Peking University, the Hong Kong University of Science and Technology, Shanghai Jiao Tong University, and from other universities in China allowed us to greatly expand our data acquisition and dataset construction, as well as to apply for sustained research support from the National Institutes of Health in the USA, the National Science Council in Taiwan, the National Natural Science Foundation in mainland China, and the Research Grants Council in Hong Kong. Equally important, these universities provided opportunities and sometimes funding to collaborate with the graduate students, postdoctoral fellows, and visiting professors who made crucial contributions to, and in some cases led, the data construction and research which define much of our work over the last forty years and hopefully for many years in the future. We are especially grateful to Myron Gutmann, who, as head of the Inter-university Consortium for Political and Social Research, provided guidance and support when we were scaling up our operations and seeking for the first time substantial extramural funding.
Long-term collaborations played a key role in advancing our work. The most sustained and for us influential collaboration was the twenty years we spent working with colleagues from a variety of countries and disciplines on the Eurasia Project in Population and Family History. Interactions with project participants stimulated us to broaden our range of research topics, learn and apply more advanced methods, and seek opportunities for comparisons for our other projects. The camaraderie that grew out of frequent, sustained interaction with others working with data like ours and on similar topics was also important for our own morale. We have especially fond memories of our fruitful two-decade-long collaboration with Feng Wang, which produced a number of discrete studies as well as Lee and Wang (1999) and Tsuya et al. (2010), and of our collaborations with Tommy Bengtsson and Noriko Tsuya, which involved reciprocal visits by us in Lund and Tokyo and by them in Pasadena.
Many short-term collaborations on specific papers and long-term projects were also important. Yizhuang Ding and Songyi Guo advised on the CMGPD-LN and with the help of Gao Jing from the Liaoning Provincial Gazetteer Office we conducted fieldwork with them that resulted in Ding et al. (2004). Songyi Guo also shared his expertise with us when we were constructing and analysing the CMGPD-IL. We have been fortunate to co-author with a variety of others on papers or sets of papers using our datasets, including Lawrence Anthony, Mark Elliott, Robert Eng, William Lavely, Chris Myers, Xi Song, Alice Suen, Guofu Tan, Emma Zang, and Siwei Fu and other members of Huamin Qu's group. Similarly, we have benefited from interaction with Akira Hayami, Kuentae Kim, Satomi Kurosu, Sangkuk Lee, Ts'ui-jung Liu, Byun-giu Son, Noriko Tsuya, Wenshan Yang, and other collaborators.
Looking back, we believe that a distinctive feature of our research that has been central to our success has been our inductive, data-driven approach which emphasizes discovery of facts about demographic behaviour and family, social and economic organization through empirical analysis of the datasets we have constructed. We have always started with data that we thought might help us investigate a topic of general interest, and then through exploratory and descriptive analysis sought to uncover key patterns of demographic behaviour and family and social organization, moving to carefully specified regression-based models only after extensive work to verify the data and then elaborate on relationships and patterns discovered in descriptive analysis. While this approach is time-consuming, sometimes taking years to locate, access, enter and clean data for the analyses that led to our major results, we believe that the result has been a fundamental transformation in our understanding of basic patterns of family, social and economic organization in China in the past and increasingly up to the very recent present.
Our strong belief in the importance of a scholarship of microdata-driven empirical discovery, as opposed to a scholarship of interpretation, rests on the fact that there is much about the Chinese past that we do not know or, worse, think we know but have wrong. 38 Whenever feasible and permissible, therefore, we have coded datasets in their entirety and devoted considerable time and energy to producing detailed documentation and User Guides to accompany possible future public data releases, and to creating a complete, permanent resource to be used by ourselves and others to study a wide range of topics. Related projects described in this special issue are underway for a variety of other locations around the world and we look forward to a future where such datasets are produced and used routinely in social science and in history to discover basic facts about life in the past, right up to the time a few decades ago when longitudinal surveys and other sources became available.
While working on this manuscript, Lee and Campbell received support from HK RGC GRF 16602117 (Lee PI), HK RGC GRF 16601718 (Campbell PI), and HK RGC GRF 16600017 (Campbell PI). We are grateful to the members of the Lee-Campbell Group for their comments, corrections, and other feedback.

38 A good example world-wide would be Malthus' previously commonly accepted description of the Chinese demographic system and its consequences (Lee & Wang, 1999). Another largely Chinese example would be the debate over opportunity for first-college students from families of marginal means to attend some of China's most competitive universities. See too our acknowledgement in Bengtsson et al. (2004, pp. 435-439) of our own past misunderstandings of mortality and related behaviour and our description of the processes by which we were able to correct ourselves and achieve new understandings of the past.