Nominative Linkage of Records of Officials in the China Government Employee Dataset-Qing (CGED-Q)

We introduce our approach to the nominative linkage of records of Qing officials who were included in the China Government Employee Dataset-Qing (CGED-Q) Jinshenlu (JSL) and Examination Records (ER). We constructed these datasets by transcription of quarterly rosters of civil and military officials produced by the government and by commercial presses, and records of examination degree holders. We assess each of the primary attributes available in the original sources in terms of their usefulness for disambiguation, focusing on their diversity and potential for inconsistent recording. For officials who were not affiliated with the Eight Banners, these primary attributes include surname, given name, and province and county of origin. For the subset of officials who were affiliated with the Eight Banners, we assess the available data separately. We also assess secondary attributes available in the data that may be useful for adjudicating candidate matches. We then describe the approach that we developed to address the issues we identified with the primary and secondary attributes. The issues we have identified and the approach that we have developed will be of interest to researchers engaged in similar efforts to construct and link datasets based on elite males in historical China.


Introduction
We describe our approach to the large-scale nominative linkage of records of elite males in two Qing dynasty  historical datasets that we have constructed: the China Government Employee Dataset-Qing Jinshenlu (CGED-Q JSL) and Examination Records (CGED-Q ER). By transcribing records of Qing civil and military officials in quarterly personnel rosters from the period between 1762 and 1911 to produce the CGED-Q JSL and then linking those records over time, we have reconstructed the career histories of officials. By linking officials in the CGED-Q JSL to their records in the CGED-Q ER, we have also attached information about their year of birth, exam performance, ancestry, and other attributes to their career records. This allows us to examine a major topic in the sociological study of stratification: the roles of family background and 'ability' (as measured by exam performance) in the appointment, promotion, and exit from work of officials. Attaching information on year of birth to career histories allows for the study of the age structure of officialdom and the age dynamics of appointment, promotion, and exit from work.
We arrived at the approach we describe here iteratively, building on experience analyzing career histories in the CGED-Q JSL in a series of publications on appointment, promotion, and exit of Qing officials (Chen, Campbell & Lee, 2018; Hu, Chen & Campbell, 2020; Hu et al., 2020; Xue & Campbell, 2022), a visualization platform, an introduction to the CGED-Q JSL, and a dissertation (Chen, 2019). Each analysis brought to light issues with the sources, the transcription process, and linkage procedures that had not arisen previously and required adjustments. As our dataset expanded, meanwhile, we adjusted our code and obtained substantial improvements in speed. In the end, as described below, we used probabilistic linkage as implemented in the Stata package dtalink (Kranker, 2018).
The most important contribution of the paper is the thorough documentation of the many problems that arise in the recording of names, place of origin, and other attributes in Qing administrative sources, the implications of these problems for nominative linkage, and our solutions to them. We hope that our experience will be useful to researchers carrying out large-scale nominative linkage in other Chinese sources and to users of the CGED-Q JSL public releases that we have made available for download (Ren et al., 2019). 1 The problems that we identify and our solutions to them should be general to historical Chinese sources. Common problems include the replacement of characters in surnames and given names with variant forms, homonyms, and similar-looking characters, and the inconsistency in the recording of locations because of changes in administrative boundaries. To facilitate work by others who are carrying out nominative linkage with historical Chinese sources, we recording of other attributes like age or date of birth, all of which could create false negatives, and overall low diversity of surnames and given names, which could lead to false positives. By false negatives, we refer to situations where two records that should have been linked together were not. By false positives, we refer to situations where records that should not have been linked together were. Misspellings occurred because people were inconsistent in the way they wrote their own name, or because census takers or other officials were inconsistent in the way they wrote the name in official records. International migrants might have new names assigned to them by immigration officers who transliterated their original names in ad hoc fashion, or might adopt new names on their own. Women typically adopted their husband's surname on marriage. People might use contractions of their name or nicknames in some situations but not in others, for example, writing Bill in some situations and William in others.
In many communities in Europe, diversity of surnames and given names was low, making it difficult to distinguish whether records of the same name referred to the same or different people.
The issues that arise with names written in Chinese are very different. Surnames are not diverse. In 2020, the top 5 surnames in China accounted for 30.8% of the population, and the top 100 surnames accounted for 85.8% of the population. 5 Given names are potentially more diverse since they are typically two characters, and for each of those two characters there are thousands to choose from. The actual diversity of given names depended on naming practices in different periods and social classes. While names of elite males during the Qing and the first half of the 20th century should have been very diverse because well-off families could showcase their erudition by including rare characters with literary, historical or philosophical connotations in the names of their sons, names for people born between the 1960s and 1980s were much less diverse than for those born before or after because single-character names with political or patriotic implications became more popular (Cai et al., 2018; Bao et al., 2021). 6

Developing procedures for record linkage is important because there are numerous ongoing efforts to create biographical databases of historical Chinese individuals. Prominent examples include the China Biographical Database (Fuller, 2021; Song & Wang, 2022; Tsui & Wang, 2020), the Modern China Historical Database (Armand et al., 2022), and the various projects of the Lee-Campbell Group (Campbell & Lee, 2020). Such databases are the basis of prosopographical studies of social groups (Stone, 1971), especially elites, in historical China. The creators of these databases carry out what they refer to as 'disambiguation' to assess whether the same name and other attributes appearing in two or more sources refer to the same person or different people, and then attach unique identifiers to each appearance of a person in the dataset.
The underlying task is similar to the record linkage that we carry out in the CGED-Q, but somewhat broader in that it may also involve individuals named in unstructured texts like newspaper articles, dynastic histories, or gazetteers. 7 Pronunciation-based approaches developed for linkage of individuals with names written in phonetic scripts are not immediately useful for these Chinese language sources because the prevalence of homonyms in Chinese means that names with identical pronunciations can be completely different. Meanwhile, characters that look similar and may be mistakenly replaced with each other during the production process can be pronounced differently and have different meanings.

5 See the 2019 and 2020 年全國姓名報告 (National Surname and Given Name Report) published by the Public Security Bureau of the People's Republic of China.
6 See Chua (2021) for an overview of contemporary naming practices in China and descriptive results on the popularity of different kinds of names during the 20th century. The analysis was based on the Chinese Name Database (1930-2008) created by Han-Wu-Shang (Bruce) Bao and shared at https://github.com/psychbruce/ChineseNames
7 and Chen & Campbell (2022) include brief, non-technical overviews of linkage in the CGED-Q as part of their overviews of the methods used in the project. We describe linkage procedures for two of our other publicly released datasets, the China Multigenerational Panel Datasets (CMGPD) Liaoning (LN) and Shuangcheng (SC), in Appendix A of Lee & Campbell (1997), Lee et al. (2010) and Wang et al. (2013). According to personal communication with the leaders of the China Biographical Database and Modern China Historical Database projects, they do not yet have any publications describing their procedures for linkage and disambiguation.
The studies of Chinese language nominative linkage and disambiguation that we have located focus on names in contemporary unstructured Chinese language texts (for example, web pages), not on structured records like those in the CGED-Q. We mention them here because they could eventually help with the linkage of the officials in the CGED-Q to mentions of them in unstructured texts. Chen and Huang (2010) assessed issues that arise in the disambiguation of the names of individuals in Chinese language texts. They report that single-character given names are more challenging than two-character given names. Combinations of surname and single-character name that are also commonly used words are especially difficult to disambiguate. For example, the combination Gaofeng (高峰) could be the surname Gao followed by the given name Feng but could also be the word for 'peak.' 8 Han et al. (2011) and Fan & Li (2021) describe approaches based on clustering in which the same names appearing in different documents are disambiguated by reference to other words appearing with them in the text. The problem these papers address differs from the one we face in our own linkage of names in tabular datasets, where the surname and given name are clearly specified in fields of their own, but it is relevant for efforts by others to extract and disambiguate names in unstructured historical texts like newspaper articles, books, and essays.
Several studies discuss the disambiguation of Chinese names of authors of texts. Han et al. (2017) focus on the specific case of disambiguating the names of authors of Chinese language publications, and introduce a method based on the names of the co-authors, the author's institution, and 'semantic fingerprints.' Kim et al. (2021) show that disambiguation of the names of Chinese authors of English language publications is easier if their name in Chinese characters is available alongside their phoneticized names. Yin et al. (2020) present the results of an effort to disambiguate the names of inventors listed on Chinese patents between 1985 and 2016. They use a supervised learning approach that begins with hand-labelled data for training.
Another line of studies offers potentially useful approaches for measuring similarity in the sound and the appearance of Chinese characters and then using this to assess the similarity of strings of Chinese characters. Liu et al. (2017) offer a method for encoding Chinese characters in terms of their sound, appearance, and meaning, and then ranking pairs of characters according to their similarity. A "SoundShape Code" has also been proposed for Chinese characters that reflects their pronunciation and appearance, and which may be used as a basis for measuring the similarity between two characters in a pair. Xu et al. (2020) combine the SoundShape Code for individual Chinese characters with the Dice similarity measure for strings of potentially different lengths. Such methods address a challenge that we describe below: because of errors in the original source or errors during our transcription, the name of the same individual may appear with slightly different characters in different records in our dataset. Characters may be replaced by a homonym that looks different, or with a visually similar character that is pronounced very differently.
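To make the idea of string-level similarity concrete, the following Python sketch computes a Dice coefficient on character bigrams after mapping variant characters to a single form. It is only an illustration under stated assumptions: it does not implement the SoundShape Code or the method of any of the cited papers, and the small VARIANTS table and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical, illustrative table mapping a variant form to one chosen form.
VARIANTS = {"黄": "黃", "淸": "清", "髙": "高"}

def canonical(text: str) -> str:
    """Replace each known variant character with its chosen form."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)

def bigrams(text: str) -> Counter:
    """Character bigrams of a padded string, so one-character names still yield bigrams."""
    padded = f"^{text}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / total bigrams in both strings."""
    ga, gb = bigrams(canonical(a)), bigrams(canonical(b))
    overlap = sum((ga & gb).values())
    return 2 * overlap / (sum(ga.values()) + sum(gb.values()))
```

Under these assumptions, two names that differ only by a listed variant character score 1.0, while names that share no adjacent characters score 0.0. A measure that also reflected sound and shape would need character-level codes of the kind the cited studies describe.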

China Government Employee Dataset-Qing Jinshenlu (CGED-Q JSL)
We constructed the China Government Employee Dataset-Qing Jinshenlu (CGED-Q JSL) from Jinshenlu (縉紳錄) and Zhongshubeilan (中樞備覧) rosters of Qing civil and military officials, respectively, which were produced every three months. We have described the CGED-Q JSL and the sources from which it was constructed in detail elsewhere (Ren et al., 2016; Ren et al., 2019) and only provide key details here. Official editions of the Jinshenlu and Zhongshubeilan were produced by the Qing Ministries of Personnel and War, respectively. 9 The government used the official editions to keep track of posts and the officials who held them. In the 19th century, commercial publishers produced and sold editions that supplemented information on officials from the official editions with additional information collected by the publishers. 10 Purchasers of commercial editions used them for a variety of purposes, including searching for vacant positions and locating kin, classmates, or other connections who they knew were officials.
At the time of writing, the CGED-Q JSL contains 4,433,600 records from 275 Jinshenlu editions and 75 Zhongshubeilan editions. Each Jinshenlu roster lists 13,000 to 15,000 posts in the civil service and identifies the officials who held them. Zhongshubeilan rosters each list approximately 8,000 military posts and the officers who held them. The editions in the CGED-Q JSL are from the period 1762 to 1912. Coverage is sparse before 1830, but very complete after that year. From 1830 to 1911, the CGED-Q JSL includes at least one Jinshenlu edition from nearly every year. In many years, it includes all four quarterly editions. Zhongshubeilan are sparser and the gaps between them are longer.
78.9% of officials were ordinary citizens (minren 民人) and almost all the remainder were Bannermen (qiren 旗人). The vast majority of minren were what we would now refer to as Han Chinese. 11 Bannermen were hereditary affiliates of the Eight Banners, originally the army used to conquer China and establish the Qing in 1644, and in the 18th and 19th centuries, an organization used by the Qing state to maintain political and military control. Most officials who were Bannermen were Manchu or Mongol, but 16.4% were Han Chinese. The latter were referred to as Han Martial Bannermen (hanjun qiren 漢軍旗人). They were the descendants of Han Chinese who had been incorporated into the Eight Banners. Bannermen had a privileged position in the Qing government, with their own pathways to appointment and promotion, and quotas for certain positions. Thus, even though Bannermen accounted for only 2%-4% of the population of the Qing (Elliott et al., 2016), they accounted for one-fifth of civil officials overall, two-thirds of civil officials serving in the capital Jingshi (now Beijing) and 90% of officials in the secondary capital Shengjing (now Shenyang) (Chen et al., 2020, 454).
To produce career histories by longitudinal linkage of CGED-Q JSL records of officials, we distinguish between what we refer to as the primary and secondary attributes recorded for officials. We define primary attributes as basic and stable information about an official that is available in all or nearly all records and should be available in almost any other source that we might wish to link to. The most important of these are the names. We define secondary attributes as characteristics that are specific to the CGED-Q JSL and may not be available in other sources or recorded in every edition of a Jinshenlu or Zhongshubeilan. They may also be attributes that vary over time, for example, the official's current position. These may be used to adjudicate candidate links made based on the primary attributes, but on their own are not sufficient for linkage within the CGED-Q JSL or between the CGED-Q JSL and CGED-Q ER.
For linkage, we separate officials according to whether they had a surname recorded because the primary attributes available for officials with surnames differed from those available for those without surnames. Officials with surnames accounted for 80.2% of records. These included all the minren and one-third of the Han Martial Bannermen. 12 Basic information recorded for them included not only their surname (xing 姓) and given name (ming 名) but also their place of origin. The latter was usually the province and prefecture or county of origin, though there are complications that we discuss below. Officials without surnames included all Manchu (滿洲) and Mongol (蒙古) Bannermen and two-thirds of Han Martial Bannermen. 13 The only attributes recorded for officials without surnames that were in principle stable were given name and Banner affiliation (qifen 旗分). We use these as the primary attributes for Bannermen.
The primary attributes for officials with and without surnames differ in terms of their ability to uniquely identify officials within an edition. For officials with a surname, the combination of surname, given name, and province and county of origin was usually unique within an edition. If these were all recorded reliably and consistently across every edition, they would in principle be sufficient for linkage. Table 1 summarizes the number of repetitions of combinations of primary attributes within each quarterly Jinshenlu edition. For officials with a surname, 95.0% of the combinations of surname and given name were unique within their edition. In other words, for 95.0% of records, there was no other record in the same edition with the same surname and given name. For 4.4% of records, there was only one other record in the same edition with the same surname and given name. 98.1% of records of officials with a surname were unique within their edition in terms of the combination of surname, given name, and place of origin. Our investigations have revealed that for these officials, most repetitions within the same edition refer to the same official. If an official held more than one post, there was a separate record for each of them. For officials without surnames, given name by itself is not sufficient for linkage. Only two-thirds of records recorded a given name that was unique within the quarterly edition. One-third of records had a name that appeared in one or more other records. When Banner affiliation was added, 88% of records became unique within their quarterly edition in terms of the primary attributes. 12% of records had a given name and Banner affiliation that appeared in at least one other record in the same quarterly edition. Based on our investigations, these reflect some cases where the same official held more than one office, as well as cases where two different officials had the same name.
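The within-edition uniqueness tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Table 1: the field names (edition, surname, given_name, origin), the edition labels, and the toy records are all hypothetical.

```python
from collections import Counter

def share_unique(records, keys):
    """Fraction of records whose combination of `keys` occurs once within its edition."""
    def combo(r):
        return (r["edition"],) + tuple(r[k] for k in keys)
    counts = Counter(combo(r) for r in records)
    return sum(1 for r in records if counts[combo(r)] == 1) / len(records)

# Toy records with hypothetical field names and values.
records = [
    {"edition": "1850Q1", "surname": "王", "given_name": "大明", "origin": "浙江山陰"},
    {"edition": "1850Q1", "surname": "王", "given_name": "大明", "origin": "江蘇吳縣"},
    {"edition": "1850Q1", "surname": "李", "given_name": "文光", "origin": "直隸大興"},
    {"edition": "1850Q2", "surname": "王", "given_name": "大明", "origin": "浙江山陰"},
]

by_name = share_unique(records, ["surname", "given_name"])
by_name_and_origin = share_unique(records, ["surname", "given_name", "origin"])
```

In this toy example, adding place of origin separates the two records that share a surname and given name within the same edition, which mirrors the pattern reported for officials with surnames.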
These results highlight that the approaches to linkage must differ according to whether a surname was available. For officials with a surname, as discussed above, the combination of surname, given name, and province and county of origin all written in Chinese characters is likely to be unique, and 'false positives' in which records of different officials are mistakenly linked together should be rare. The main task for linkage of officials with a surname is avoiding 'false negatives' in which an inconsistency in the recording of the name or some other attribute prevents a link from being made. For officials without a surname, the risk of false positives is high because surnames and place of origin are not available, and there are enough officials who share the same combination of given name and Banner affiliation to raise concerns that two records with identical name and Banner affiliation may refer to different officials.
Below we introduce the primary and secondary attributes in detail and assess their usefulness for linkage, with a focus on their homogeneity or heterogeneity. We divide our discussion of attributes between those available in records of officials with surnames and those available in records of officials without surnames.

Surnames
Because a small number of surnames accounted for a large share of the records of officials, surnames are of limited utility as a primary attribute for linkage. According to Table 2, which presents the cumulative percentages of records accounted for by the 100 most common surnames in the CGED-Q JSL, the five most common surnames appeared in one-quarter of the records. These were Wang (王), Zhang (張), Li (李), Chen (陳) and Liu (劉). The top 10 surnames accounted for 38.3% of the records of officials with surnames. The top 20 surnames accounted for approximately one-half of the records, and the top 200 accounted for 95.1%. There were a total of 1,626 distinct surnames recorded, though the actual number was lower because in this tabulation a surname may have more than one entry if the character appears in more than one form.

Note: Based on authors' calculations on 3,244,484 CGED-Q JSL records with a legible surname.
One issue that arises with linkage based on surnames is that a character may be replaced with one that looks similar in an adjacent edition. Of the 1,559,380 pairs of records in editions which were no more than one year apart and almost certainly referred to the same official because they recorded the same two-character given name, province and county of origin, and position and broad category of degree qualification, 20,055 pairs (1.3%) differed on the character written for the surname. 14 Table 3 presents the cumulative frequencies of discordant pairs of surnames. The most common discordant pair (黃 黄) accounted for 22.4% of discordant pairs overall, the top 20 accounted for nearly two-thirds (63.1%) and the top 100 accounted for 79.2%.
Note: Of the 1,559,380 pairs of records in adjacent editions no more than one year apart that were identical on given name, province and county of origin, broad category of degree qualification, and position, 20,055 (1.3%) were discordant.

Table 3 reveals two common issues that may generate 'false negatives', in which records of the same official are not properly linked. The first issue is that some pairs are the same character written in variant forms (yitizi 異體字). The four most common pairs in Table 3 are examples: 黃 and 黄, 吳 and 呉, 高 and 髙, and 吕 and 呂 are different ways of writing the surnames Huang, Wu, Gao, and Lu respectively. In the Unicode standard these are recognized as different representations of the same character, and as we describe below, this is straightforward to address. The second and more challenging issue is that sometimes between editions a character for a surname is replaced by one that looks similar but is a completely different character. Examples in Table 3 include the fifth entry (叚 Xia and 段 Duan), the seventh entry (宋 Song and 朱 Zhu), the tenth entry (汪 Wāng and 王 Wáng), and the fifteenth entry (馬 Ma and 馮 Feng). These issues reflect either inconsistencies in the production process across different editions or transcription errors by coders. There are also examples of discordant pairs in Table 3 that consist of characters that are clearly different, for example the 24th entry 張 章 (Zhang and Zhang) and the 28th entry 程 陳 (Cheng and Chen). In most of these cases, one or both characters are relatively common surnames. While there is some possibility that these could be from records of different people, they may also be transcription errors that occurred during data entry.
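One way to handle the two kinds of surname discrepancy separately is sketched below in Python. This is a minimal illustration, not the procedure used for the CGED-Q: the two lookup tables contain only the pairs named in the text, the choice of which form to treat as canonical is arbitrary, and the function name and return labels are hypothetical. It also assumes single-character surnames.

```python
# Variant forms of the same surname, mapped to one arbitrarily chosen form.
VARIANT_TO_CANONICAL = {"黄": "黃", "呉": "吳", "髙": "高", "吕": "呂"}

# Different characters that look alike and may replace one another.
LOOKALIKES = {frozenset("叚段"), frozenset("宋朱"), frozenset("汪王"), frozenset("馬馮")}

def compare_surnames(a: str, b: str) -> str:
    """Classify a pair of surnames as 'exact', 'lookalike', or 'different'."""
    a = VARIANT_TO_CANONICAL.get(a, a)
    b = VARIANT_TO_CANONICAL.get(b, b)
    if a == b:
        return "exact"       # identical after variant normalization
    if frozenset((a, b)) in LOOKALIKES:
        return "lookalike"   # candidate link that needs support from other attributes
    return "different"
```

Treating variant forms as exact matches and lookalikes as a weaker kind of agreement keeps clearly different pairs such as 張 and 章 from being linked on the surname alone.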

Given Names
Given names (Table 4) were the most diverse of the primary attributes available for officials with surnames, and therefore the most useful for record linkage. We distinguish between records of officials with two- and one-character names. The former accounted for 85% of the records and the latter accounted for the remainder. A total of 102,648 distinct given names appeared in our data, 98,745 of which were two-character names, with the remaining 3,903 being one-character names. According to Table 4, two-character names were very diverse. The top 100 accounted for only 5.7% of records, the top 200 accounted for 9% of records, the top 1,000 accounted for 23% of records, and the top 10,000 accounted for only 61% of records. The diversity of two-character names reflects the large number of characters available to choose from: we found that at least 5,764 different characters made at least one appearance in a two-character given name in the CGED-Q JSL. 15

Like surnames, characters in given names may also be inconsistent across different quarterly editions. If not addressed, this may also lead to false negatives. Table 5 repeats the exercise for surnames carried out in Table 3 for the characters in two-character given names. 16 It presents the cumulative percentages of discordant pairs, defined as characters in given names that differ between records in editions that are no more than one year apart, and where the surname, one of the two characters in the given name, place of origin, position, and degree qualification are all identical. Out of 1,539,198 such pairs of records, 4.34% (66,994) differed on one character in the given name. Discordant pairs of characters in given names were much more diverse than was the case for surnames. The most common discordant pair (淸 and 清) accounted for only 3.7% of discordant pairs. The top 20 accounted for one-fifth (20.3%) of discordant pairs, and the top 100 accounted for 39.2%.

Note: In the 1,539,198 pairs of records with legible surname and two-character given names in adjacent editions no more than one year apart that were identical on surname, one character of the given name, province and county of origin, broad category of degree qualification, and position, there were 66,994 discordant pairs.
Once again, the most common issue is that between one edition and the next, a character was replaced with a variant, of which the seven most frequent pairs are all examples. 淸 and 清, for example, are both ways of writing the same character (Qing). However, there are also cases where a character is replaced by one that is different but looks similar. The twelfth, fourteenth, twenty-second and thirty-ninth entries are examples: 傅 (Fu) and 傳 (Chuan), 思 (Si) and 恩 (En), 增 (Zeng) and 曾 (Ceng), and 先 (Xian) and 光 (Guang), respectively. Again, this likely reflects a problem during the production of the source, or during the transcription.
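The tabulation of discordant characters in two-character given names can be illustrated with a short Python sketch. It assumes, hypothetically, that the input is a list of given-name pairs drawn from records that have already been matched on surname, place of origin, position, and degree qualification; the function name and the toy pairs are not from the CGED-Q.

```python
from collections import Counter

def discordant_char(name_a: str, name_b: str):
    """If two two-character given names differ in exactly one position,
    return the differing characters as a sorted pair; otherwise return None."""
    if len(name_a) != 2 or len(name_b) != 2:
        return None
    diffs = [(x, y) for x, y in zip(name_a, name_b) if x != y]
    if len(diffs) != 1:
        return None
    return tuple(sorted(diffs[0]))

# Toy pairs of given names from records assumed to match on all other attributes.
pairs = [("國淸", "國清"), ("思榮", "恩榮"), ("國清", "國淸"), ("文光", "文光")]

tally = Counter()
for a, b in pairs:
    pair = discordant_char(a, b)
    if pair is not None:
        tally[pair] += 1
```

Sorting the two characters makes the count independent of which edition carried which form, so 淸 followed by 清 and 清 followed by 淸 fall into the same entry.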
Single-character names were less diverse. According to Table 6, the top 10 most common single-character names accounted for 6.6% of records with single-character names and the top 100 accounted for 37%. According to separate tabulations, the top 200 accounted for 54% and the top 500 accounted for 78%. According to a separate tabulation like the ones in Tables 3 and 5 but not shown here, the patterns in discordant pairs are like those in Table 5. Most discordant pairs consisted of the same characters written differently or similar-looking characters that could be mistaken for each other. There were examples, however, of characters that were clearly different, at least raising the possibility that they were men from the same county with the same surname and post who should not be linked. Accordingly, we link records of officials with single-character names separately, with more stringent criteria for matches on other attributes when assessing candidate links.

The given names recorded in the CGED-Q JSL should otherwise be stable and are the ones recorded for officials in their family genealogies and other sources like the CGED-Q ER, not their courtesy name (biaozi 表字) or style name (hao 號). We have shared data with researchers who have constructed datasets from lineage genealogies, and they report success linking men in the genealogies to officials in the CGED-Q JSL based on the names in the genealogies. Users of our CGED-Q JSL search page also report success locating ancestors or other figures based on names recorded in genealogies or other sources. 17 As for the stability of names, while we have not explicitly searched for cases where an official appeared to change their given name, we are not aware of any cases where someone appeared with two different given names except as the result of problems with the sources or transcription process that we discuss below. 18

Place of Origin
For place of origin, the available level of detail differed between the civil officials recorded in the Jinshenlu and the military officials in the Zhongshubeilan. The place of origin was where an official had first sat for an exam. In most cases this was where their family lived, but as we will discuss below, there were exceptions. 95% of the records of civil officials with surnames in the Jinshenlu specified county of origin and either specified province of origin or allowed for it to be inferred from the province in which the official was currently serving. 19 Of the records of military officers with surnames in the Zhongshubeilan, 13% had both province and county of origin, 84% only had province of origin and 3% only had county of origin.
For civil officials, the place of origin was diverse, though not as diverse as the given name. Table 7 presents the cumulative percentages for the 100 most common places of origin as recorded for officials with surnames in the Jinshenlu. In most cases this is the province and county or prefecture where an official earned the shengyuan (生員) degree that made them eligible to sit for further exams or purchase the degrees that would qualify them for office. Since usually the county was recorded, not the prefecture, below we will only refer to county. A total of 10,156 distinct combinations of province and county or prefecture appeared in the CGED-Q JSL. For reasons that we discuss below, this is larger than the actual number of counties and prefectures at any given time. 18 If evidence emerges that officials did change their name, we will have to revisit our procedures for linkage within the CGED-Q JSL to produce career histories, as well as our procedures for linkage to other sources like the CGED-Q ER. 19 Province of origin could be imputed from province of current post because the Jinshenlu typically omitted province of origin for officials serving in their home province.  There were two main reasons that the number of combinations of province and county appearing in the data was larger than the number of counties at any given time, and these require attention during linkage. First, the province of origin listed for an official could change between editions even when the county of origin did not. 21 Out of 1,789,985 pairs of records in adjacent editions with identical surname and given name, position, and degree qualification, there were still 0.1% (1941) in which the province changed. This could occur because a provincial boundary was redrawn, but in other cases it was likely the production of a mistake during the production of the edition or the transcription by coders. 
Several sets of adjacent provinces stood out for the frequency with which one was replaced by the other across two records of the same official with the same county of origin listed: 1) Guangdong and Guangxi, 2) Zhejiang, Jiangsu, Jiangxi, and Anhui, 3) Hubei and Hunan, 4) Shandong and Shanxi, 5) Shuntian and Zhili, and 6) Shaanxi and Gansu. 22
20 There were too few secondary places of origin included to be of much use in linkage. Of the records that included a province and county of origin, only 13,533 (0.39%) listed an additional place of origin.
21 The Zhongshubeilan editions had additional complications. In the Zhongshubeilan rosters of military officials, Huguang (湖廣) appeared as a province of origin in some late 18th century and early 19th century editions. This was a combination of Hunan and Guangdong. We assigned the four counties that were associated with Huguang to Hunan. These were 慈利 (Cili), 祁陽 (Qiyang), 衡陽 (Hengyang), and 道州 (Daozhou). Similarly, in the Zhongshubeilan and sometimes in the Jinshenlu, counties in Jiangsu, Zhejiang, Anhui and sometimes Jiangxi were listed as being in Jiangnan (江南).
22 When we compared records in adjacent editions that were less than three years apart and which were identical on the surname, given name, county of origin, degree qualification and position, there were 36 cases
Second, the characters used to write the name of a county could differ across editions. Out of 1,581,616 pairs of records in adjacent editions that had an identical surname, given name, province of origin, degree qualification, and position recorded, there were 3.6% (57,066) in which the county differed. Almost all of these were situations where a character within a county name was replaced with a variant form of the same character, as happened above with the surnames and characters that were part of given names. For example, the 3rd and 10th most common counties (山陰 and 山隂) in Zhejiang (浙江) are the same county.
The cumulative implication of the discrepancies in surname, given name, and location for nominative linkage of all the records of an official across their career is serious. By combining the discrepancy rates for the primary attributes, we can estimate the probability that, for two records of the same official in two adjacent editions, at least one of the four primary attributes differs. Assuming independence between the probabilities of each of the four primary attributes differing, we have 1 - (1 - 0.035)(1 - 0.001)(1 - 0.0434)(1 - 0.0128) = 0.0896, or 8.96%. Assuming a typical career length of 5 years, or 20 quarterly editions and therefore 19 pairs of adjacent records, the probability of a discrepancy in at least one pair of records is 83.2% (1 - (1 - 0.0896)^19). In other words, assuming independence of these probabilities, it is almost certain that for any official whose career lasted for more than a few years at least one of their records will not match exactly, and in the absence of measures to accommodate discrepancies, the records of many if not most officials with careers of more than just a few years of service will be split incorrectly into two or more officials. Below, we will present tabulations from career histories of officials produced by our linkage to show that such discrepancies were indeed common.
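The arithmetic behind these two figures can be reproduced with a short script. This is only an illustrative sketch: the four rates are the ones quoted in the calculation above, the function names are our own, and the independence assumption is the same simplifying assumption made in the text.

```python
# Adjacent-edition discrepancy rates for the four primary attributes,
# as quoted in the calculation in the text.
RATES = [0.035, 0.001, 0.0434, 0.0128]

def prob_any_discrepancy(rates):
    """Probability that at least one attribute differs between two adjacent
    records, assuming the attributes differ independently."""
    p_all_match = 1.0
    for r in rates:
        p_all_match *= 1.0 - r
    return 1.0 - p_all_match

def prob_career_discrepancy(p_pair, n_editions):
    """Probability that at least one of the (n_editions - 1) adjacent pairs
    in a career has a discrepancy, assuming independence across pairs."""
    n_pairs = n_editions - 1
    return 1.0 - (1.0 - p_pair) ** n_pairs

p_pair = prob_any_discrepancy(RATES)            # about 0.0896
p_career = prob_career_discrepancy(p_pair, 20)  # about 0.832
print(f"{p_pair:.4f} {p_career:.3f}")
```

Varying `n_editions` shows how quickly the chance of at least one mismatch grows with career length under these assumptions.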

Secondary Attributes
Secondary attributes help adjudicate in situations where the primary attributes in a pair of records of officials with a surname are close but not an exact match. As we discuss below, they may be useful to confirm a candidate match, but by themselves they are rarely adequate to rule one out because they are not recorded completely, may not be recorded in a consistent fashion, or may change. For example, commercial editions tended to record more details that could be used as secondary attributes than official editions. Available secondary attributes for officials with surnames include the exam or purchased degree that qualified an official for appointment, the official position, courtesy or style name, and title.
The most important of these are degree qualifications. 84.2% of the records of officials with surnames included the examination or purchased degree that qualified them for appointment (chushen 出身). For some officials who held a jinshi or juren examination degree, the name of the degree was not included in the record, but the year (ganzhi 干支) in which they earned their degree was included. Since the provincial and metropolitan exams that were the basis of the juren and jinshi were held in different years, whether an official held a juren or jinshi could be inferred from the exam year. When jinshi or juren inferred from exam year are included, 93.2% of records of officials with surnames specified a degree qualification. Hundreds of different degrees were recorded in the original, but for 89.3% of them, the degree fell into one of the following five broad categories: 1) Jinshi (進士) degrees for graduates of the Metropolitan Exam, 2) Juren (舉人) degrees for graduates of the provincial exam, 3) Regular gongsheng (正途貢生) degrees earned by examination, 4) Irregular gongsheng (異途貢生) acquired by purchase, or 5) Purchased Jiansheng (監生) degree. 24 Of 1,405,138 pairs of records in adjacent editions that matched on surname, given name, place of origin, and post and which had a degree qualification recorded in the original source, only 7.5% (106,007) changed their degree between two editions. Nearly all these changes were within the broad categories above and represented different ways of writing the same degree. Actual transitions between broad categories were rare. 25
22 (continued) where an official from Lingui (臨桂) county was listed as being from Guangdong in one record and Guangxi in another, 25 cases where someone was listed with Changping (昌平) as county of origin and was listed as being from Shuntian province in one record and Zhili province in the other, 21 cases where an official from Dantu (丹徒) county was listed as being from Jiangsu in one record and Jiangxi in the other, and 19 cases where an official from Hanyang (漢陽) was listed as being from Hubei in one record and Hunan in the other. Counties that switched between Shuntian and Zhili in more than 10 cases included Baoding (保定), Wuqing (武淸), Ninghe (甯河) and Wanping (宛平).
23 We have made a list of pairs of discordant counties at the same website as the other tables.
Official post is useful for confirmation of candidate matches. Relevant information includes an official's job title (guanzhi 官職). For officials in the capital, their ministry and department were recorded. For officials outside the capital, their province, prefecture, and county were recorded. According to our calculations based on record pairs in adjacent editions that were identical on all primary attributes, 7.3% of job titles changed between editions, either because the official changed jobs, or because the title was written differently. If we consider the entire post, including the geographic location or ministry and department, 12.6% changed between editions.
Again, this reflected not only actual changes, but inconsistencies across editions in recording. The recorded post had high specificity: for 85% of the records of officials with a surname, the combination of geographic location or ministry and department and job title was unique within the quarterly edition. We have also mapped posts to the numeric bureaucratic ranks used in the civil service (pinji 品级) and then categorized these numeric ranks as high, middle, low, and unranked. Below, this helps us assess whether two records with the same name belong to the same or different officials. 26 Some other attributes were recorded only for a few officials, but when they were recorded, could be useful for helping to confirm a match. One of these was the official's courtesy name (biaozi 表字) or style name (hao 號). 11.7% of the records of officials with a surname included a courtesy or style name alongside the given name. Whether or not these names were recorded also varied across editions: in 74 of 275 Jinshenlu editions, no courtesy or style names were recorded at all. They are also not systematically available in the CGED-Q ER, limiting their usefulness for linkage to that dataset. Titles (juewei 爵位) were recorded consistently, but only 0.5% of civil officials with a surname had one. Year of appointment to the current post and related information could be useful, but they are only available for 60.2% of records of officials with a surname in the CGED-Q JSL, and not available at all in the CGED-Q ER. 57 Jinshenlu editions do not record year of appointment to the current post.
24 A small number of civil officials in the Jinshenlu and many military officials in the Zhongshubeilan had military exam (武舉) degrees. A small number of officials were Yinsheng (蔭生), that is holders of a hereditary honorary status. See Chen et al. (2020) for a detailed discussion of these degrees, including tabulations and trends over time.

Attributes Available for Officials Without Surnames
Given Names
26,727 distinct given names appeared for officials without surnames in the data. In principle, all or almost all of these officials should have been Bannermen, mostly Manchu but in some cases Mongol. 84.1% of the given names consisted of only two characters, 11.2% three characters and less than 1% four or more characters. According to Table 8, the top 100 names accounted for 8.6% of records. This was only slightly higher than the 6.6% accounted for by the top 100 given names of officials with a surname. The main difference is that the distribution of given names of officials without surnames has a shorter tail: separate calculations reveal that the top 200 account for 13%, the top 1,000 account for 36%, and the top 10,000 account for 92%. By contrast, the top 10,000 names accounted for only 64% of the records of officials with surnames. While the smaller number of officials without surnames may have accounted for the overall smaller number of distinct given names, it should not have affected the shape of the distribution.
The given names recorded for officials without surnames were transliterations into Chinese of originally Manchu or Mongol names. Bannerman officials had different combinations of characters to choose from for the transliteration of their name. For example, the most common name in terms of toneless pronunciation, Qing'an, appeared variously as 慶安, 清安, and 淸安. In the latter two, 清 and 淸 are variants of the same Chinese character. The next most common name in terms of toneless pronunciation, Xilin, appeared as 錫麟, 錫霖, 熙麟 and 西林. These are all different characters. As a result, our tabulations of the romanized names without tones reveal that they were less diverse than names written as Chinese characters. There were 14,560 distinct names if we only consider the pronunciations without tones.
The top 100 accounted for 11.8% of records, the top 200 accounted for 19.4% of records, the top 1000 accounted for half of records and the top 10,000 accounted for 99.0% of records.
In the CGED-Q JSL, changes in the transliterations of the same Manchu or Mongol name across different editions appear to have been rare. While officials who had the same Manchu or Mongol name may have had different transliterations to choose from at the beginning of their career, once they chose one they do not seem to have changed it later. Of 560,559 pairs of records of officials without surnames in editions no more than one year apart that were identical in terms of the toneless Mandarin pronunciation of the name, Banner affiliation, and post, the Chinese characters used to write the name changed in only 2.3% of pairs (13,128). Our further inspection revealed that many of these apparent changes were the result of replacement of one character in the name with a variant form of the same character.

Banner Affiliation
Banner affiliation was stable enough to help confirm candidate links, but there were enough changes to suggest caution against reliance on it to exclude possible links. Every Bannerman was associated with one of eight banners defined by a combination of either Plain or Bordered and one of four colours: Yellow, White, Red, and Blue. 27 When we examined 488,734 pairs of records of Manchu and Mongol Bannermen in adjacent editions with identical names in Chinese characters, identical location or ministry and department, and identical job title, 4.4% (21,634) changed banner. More than one-quarter of these were between Plain and Bordered Banners of the same colour. Most of the changes were among officials with the three job titles that were most common among Bannermen: clerk (bitieshi 筆帖式), yuanwailang (員外郎) or zhushi (主事). At present we are unclear about the process by which officials changed Banners, and we will need to conduct further inquiries with the help of Qing historians.

Secondary Attributes of Bannermen
The posts recorded for officials without surnames within a quarterly edition were not unique. Table 9 presents the tabulation of the concatenation of job title and administrative unit for officials without surnames. For those serving in the capital, the administrative unit was their ministry and department. For those serving outside the capital, it was the province and possibly prefecture and county where they were assigned. Only 16.7% of job titles (guanzhi 官職) were unique within an edition. More than three-quarters appeared five or more times within an edition. The most common were clerks (bitieshi 筆帖式), yuanwailang (員外郎) and zhushi (主事). Even when we consider the combination of location or ministry and department and job title, less than one-third of positions were unique. For more than half of positions, there were five or more records in the same edition with an identical position. Most of the repeated positions were clerks who were in pools assigned to the central government ministries.
Officials without surnames had other details recorded that are potentially useful as secondary attributes, but which are only available for small numbers of records. Those who were members of the main line (zongshi 宗室) or collateral line (jueluo 覺羅) of the Imperial Lineage were recorded as such and accounted for 7.4% of the civil officials who had no surname and 1.7% of civil officials overall. Over the entire course of the Qing and into the Republican era, the Imperial Lineage had only 83,656 male members in total, thus its members were heavily overrepresented among officials. More than one-third (35.8%) of civil officials who were Bannermen had an examination or purchased degree recorded. This tended to be more common later in the nineteenth century. 11.6% of Bannermen had a courtesy or style name recorded. Year of appointment was recorded in only 7.5% of the records of Bannermen.

China Government Employee Dataset-Qing Examination Records (CGED-Q ER)
The China Government Employee Dataset-Qing Examination Records (CGED-Q ER) consists of records of examination degree holders transcribed from originally separate lists of exam passers from different sittings of the exam. The most important sources are lists in books self-published by the exam degree holders who had passed at the same sitting of an exam and thought of themselves as classmates. Most of these were titled Tongnianchilu (同年齒錄), though some appeared with other titles. Hereafter we refer to them as Classmate Books. Each one listed the surname, given name, and province and county of origin for exam passers at a single sitting along with their current post, if any, and names and degrees held for their father and paternal grandfather and great-grandfather. In most cases they also provide age at passing the exam. They also list other kin, but such information is less systematic.
The CGED-Q ER involves two linkage tasks. The first is to link within the CGED-Q ER between the different levels, connecting the records of Juren to their records as Jinshi. This allows us to examine how characteristics of a Juren influenced their chances of going on to earn the Jinshi. The second task is to link the information about degree holders in the CGED-Q ER to their career records in the CGED-Q JSL. This allows us to examine how the characteristics of degree holders, including their family background and their exam performance, affected their chances of being appointed and subsequently being promoted.
For these linkage tasks we make use of surname, given name, province and county of origin, the year in which the degree was earned, and the type of degree recorded in the CGED-Q JSL and ER. Issues related to the use of surname, given name, and province and county of origin are similar to those in the CGED-Q JSL. The combination of surname, given name, and province and county of origin is almost always unique for degree holders with surnames who earned their degrees at the same time, thus we do not repeat the detailed analysis for officials in the CGED-Q JSL from above. There is also the possibility that across different sources, characters may be replaced by variants. The approach we describe below for dealing with this in the CGED-Q JSL will also work for linkage of exam records.
Exam year is useful because it allows us to constrain matching to exclude situations where someone appears to earn the jinshi before the juren, or else earns it more than a decade after the juren.
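The exam-year constraint just described can be expressed as a simple plausibility check. The sketch below is illustrative only: the function name and the `max_gap_years` parameter are our own, and the default of 10 years is taken from the phrase "more than a decade" rather than from any documented setting.

```python
def plausible_juren_to_jinshi(juren_year, jinshi_year, max_gap_years=10):
    """Return True if a jinshi record could belong to the same man as a
    juren record, judged only by the years in which the degrees were earned.

    Assumptions (ours): years are integers in the Western calendar, the
    jinshi may not precede the juren, and the gap may not exceed
    max_gap_years.
    """
    gap = jinshi_year - juren_year
    return 0 <= gap <= max_gap_years

# Hypothetical years, used only to exercise the rule.
print(plausible_juren_to_jinshi(1844, 1847))  # True: three years later
print(plausible_juren_to_jinshi(1847, 1844))  # False: jinshi before juren
print(plausible_juren_to_jinshi(1820, 1847))  # False: gap exceeds a decade
```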

Linkage
We carry out linkage in four stages. First, as described in 5.1, we prepare for linkage by constructing standardized versions of key attributes. Second, as described in 5.2, we carry out simple deterministic linkage to form groups of records that match exactly on a variety of primary and secondary attributes and therefore are unambiguously the same official. We then extract the first record in each group to produce the dataset that will be used in the later stages. This substantially reduces the number of records to be considered in the later stages. Third, as described in 5.3, we make use of the capability in the STATA probabilistic linkage package dtalink (Kranker, 2018) to specify attributes to be used for 'blocking', according to which pairs of records are selected for scoring in the probabilistic linkage only if they have an exact match on those attributes. By excluding large numbers of record pairs that are clearly not matches, for example ones in which records differ on both surname and given name, it yields another order of magnitude reduction in the time required for linkage. In the fourth stage (5.4), we carry out probabilistic linkage, again with dtalink. Candidate pairs of records left over after the formation of record groups and application of blocking are scored and then based on these scores, linked together by assignment of a unique identifier to all records that have been associated with a specific official.

Preparation
We prepare the datasets for linkage by producing standardized versions of the primary and secondary attributes. To reduce the chances that inconsistencies in the recording of a given name for the same person across different editions will produce false negatives during linkage, we create transformed versions of the surname and given name. We begin by consolidating the characters in surnames and given names recognized in the Unicode standard as different versions of the same character. 28 Examples in Table 6 include 淸 and 清 (Qing) and 勲 and 勳 (Xun). We refer to these as the CV versions of the names, for Consolidated Variants. We then carry out a second round of consolidation on the CV versions in which we group sets of characters in given names that are not recognized as variants in the Unicode standard but look like each other. 29 Examples include the ones mentioned in the discussion of Table 6: 傅 (Fu) and 傳 (Chuan), 思 (Si) and 恩 (En), 增 (Zeng) and 曾 (Ceng), and 先 (Xian) and 光 (Guang). We refer to these as the SC versions, for Similar Characters. At the end of the process, each record contains the given name as originally entered, and fields for the CV and the SC version.
We also produce standardized versions of the surnames. We first consolidate variant forms of characters based on the Unicode standard to produce CV versions. We then consolidate similar-looking CV characters to produce SC versions. To do this, we manually reviewed the results of the tabulation that produced Table 3 to identify the most common discordant pairs that were not variant forms that would be accounted for by consolidation on the Unicode standard. As we noted in our discussion of Table 3, there were pairs of characters that were different enough that we concluded that they may have been for different people who were otherwise similar on the attributes we matched on. After excluding these, for the time being we have settled on twelve sets of characters that were especially likely to appear in place of each other, and which we thought were similar enough that they could be swapped by mistake between editions, either during the production of the editions, or during transcription by our coders. 30 This is a more conservative approach than we took with the characters in given names because surnames are less diverse than the characters in given names, and accordingly the risk is higher that two people who are the same on other attributes but differ on their surname really are different people. We may adjust our approach later.
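The two rounds of consolidation can be sketched as two character-level mappings applied in sequence. This is a minimal illustration, not the project's code: the tables below contain only the example characters mentioned in the text, the choice of which character in each set serves as the representative is arbitrary, and the names `CV_MAP`, `SC_GROUPS`, `to_cv`, and `to_sc` are hypothetical.

```python
# Round 1 (CV): variant forms of the same character map to one form.
# Only the two examples from the text; a real table would be derived
# from the Unihan variant data.
CV_MAP = {"淸": "清", "勲": "勳"}

# Round 2 (SC): visually similar but distinct characters are grouped.
# The first member of each group is used as the representative
# (an arbitrary choice made for this sketch).
SC_GROUPS = [("傅", "傳"), ("思", "恩"), ("增", "曾"), ("先", "光")]
SC_MAP = {ch: group[0] for group in SC_GROUPS for ch in group}

def to_cv(name):
    """Consolidated Variants version of a name."""
    return "".join(CV_MAP.get(ch, ch) for ch in name)

def to_sc(name):
    """Similar Characters version, built on top of the CV version."""
    return "".join(SC_MAP.get(ch, ch) for ch in to_cv(name))

print(to_cv("淸安"))                   # 清安
print(to_sc("傳恩") == to_sc("傅思"))  # True: same SC form, different CV forms
```

Keeping the original, CV, and SC forms side by side lets the strictest form be used for scoring while the loosest one is used to find candidates.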
We produce standardized versions of the province and county of origin. We create two versions of the county name romanized in Hanyu pinyin to account for the possibility that characters in the name of a county were replaced with homonyms by mistake. These are listed in Table 10. The first version (PY) includes tone marks, and the second version (PY TL) excludes them. Finally, to address inconsistency in the association of counties with provinces, we create a version of province of origin in which Anhui, Jiangsu, Jiangxi and Zhejiang are all combined into Jiangnan, and Hunan, Guangdong, and Guangxi are all combined into Huguang. We refer to this as the C version of province. In the very small number of records in which a second province and county of origin were listed in the original source, we used that instead of the first listed province and county of origin.
28 This includes converting characters mistakenly typed in simplified form into traditional form. See https://unicode.org/reports/tr38/ for a report on the latest version of the Unicode Han Database. We downloaded the Unicode database for Han Chinese characters from https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
29 We did this by carrying out a tabulation like the one that produced Table 6 but which only used the CV versions of the characters to produce a list of pairs of characters that are commonly swapped. We manually assessed each of the resulting pairs to flag those that were visually similar enough that it is plausible that they could be switched. We use the resulting pairs to map sets of similar characters to a single character.
30 These were 1) 宋, 朱, 宗, 2) 叚, 段, 3) 王, 汪, 江, 4) 馬, 馮, 溤, 5) 柳, 栁, 6) 季, 李, 7) 龍, 龔, 8) 余, 徐, 涂, 9) 湛, 諶, 10) 㓂, 寇, 11) 樂, 欒, 12) 褚, 諸.
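The place-of-origin transformations can be illustrated in the same spirit. In this sketch the province grouping follows the description of the C version in the text, while the function names are hypothetical and the toneless pinyin is derived by stripping combining marks from an already romanized string, which is an assumption about how a PY TL field might be produced rather than a description of the actual procedure.

```python
import unicodedata

# C version of province, following the grouping described in the text.
C_PROVINCE = {
    "Anhui": "Jiangnan", "Jiangsu": "Jiangnan",
    "Jiangxi": "Jiangnan", "Zhejiang": "Jiangnan",
    "Hunan": "Huguang", "Guangdong": "Huguang", "Guangxi": "Huguang",
}

def province_c(province):
    """Return the combined province, or the province itself if ungrouped."""
    return C_PROVINCE.get(province, province)

def strip_tones(pinyin):
    """Derive a toneless (PY TL) form from pinyin with tone marks (PY).

    Caveat: this removes every combining mark, so the diaeresis of the
    vowel written u-with-two-dots is lost as well.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(province_c("Zhejiang"))  # Jiangnan
print(strip_tones("Shānyīn"))  # Shanyin
```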

Deterministic Linkage
We group records that match exactly on a large number of primary and secondary attributes and are in editions less than one year apart and create an extract of the data that only includes the first record in each of these groups. We make the criteria for inclusion of a record in one of these groups so exacting as to rule out false positives in which records of different officials are accidentally linked. 31 The creation of these record groups by deterministic linkage is straightforward and we do not discuss it further. Because the number of record groups that need to be linked is an order of magnitude less than the original number of records to be linked, the time required for the third and fourth stages is substantially reduced.
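One way to picture this grouping step is as a sort followed by a single pass that starts a new group whenever the exact-match key changes or the gap between consecutive editions reaches one year. The sketch below is an assumption-laden illustration: the field names, the representation of editions as decimal years, and the sample records are all invented, and the actual set of attributes used for exact matching is not reproduced here.

```python
from itertools import groupby

def deterministic_groups(records, key_fields, max_gap_years=1.0):
    """Group records that agree exactly on key_fields and whose consecutive
    editions are less than max_gap_years apart."""
    def key_of(rec):
        return tuple(rec[f] for f in key_fields)

    ordered = sorted(records, key=lambda r: (key_of(r), r["edition"]))
    groups = []
    for _, same_key in groupby(ordered, key=key_of):
        current = []
        for rec in same_key:
            # A gap of a year or more starts a new group for the same key.
            if current and rec["edition"] - current[-1]["edition"] >= max_gap_years:
                groups.append(current)
                current = []
            current.append(rec)
        groups.append(current)
    return groups

# Invented records; "edition" is a decimal year (1850.25 = second quarter).
records = [
    {"surname": "王", "given": "清安", "county": "山陰", "edition": 1850.00},
    {"surname": "王", "given": "清安", "county": "山陰", "edition": 1850.25},
    {"surname": "王", "given": "清安", "county": "山陰", "edition": 1852.00},
    {"surname": "李", "given": "清安", "county": "山陰", "edition": 1850.00},
]
groups = deterministic_groups(records, ["surname", "given", "county"])
first_records = [g[0] for g in groups]  # the extract passed to later stages
print(len(records), "records ->", len(groups), "groups")
```

The two records separated by more than a year stay in different groups here, leaving it to the later probabilistic stage to decide whether they belong to the same official.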

Blocking
We divide blocking for the CGED-Q JSL and CGED-Q ER linkage into six types based on the attributes available in the records involved and the risks of false positives or negatives. For linkage within the CGED-Q JSL to produce career histories, we distinguish three types: 1) officials with a surname who had a two-character given name, 2) officials with a surname who had a single-character given name, and 3) officials without surnames. We link officials with surnames and one-character given names separately because comparison of Tables 4 and 5 suggests that the risk of a false positive is higher than for those with a two-character given name. This requires stricter criteria for matching on other attributes. Because the combination of surname and two-character given name is more likely to be unique, for linkage of officials with surnames who had two-character given names we can be more forgiving for other attributes. Officials without surnames have only the given name and Banner affiliation as primary attributes, a combination that is less likely to be unique, thus we must put more weight on secondary attributes. Linkage within the CGED-Q ER forms the fourth type.
Here, we treat all the records the same. The total number of records is small enough that false positives for degree-holders with a surname and only a one-character given name are unlikely. For linkage between the CGED-Q JSL and CGED-Q ER, we distinguish the fifth and sixth types according to whether men with surnames have one- or two-character given names. Table 10 summarizes the attributes used for blocking for each of the six types of linkage. In each case, we balance the risk of false negatives associated with use of overly strict criteria against the increased linkage time associated with the use of loose criteria. In general, we make the blocking criteria as loose as possible while seeking to prevent clearly impossible pairs from going through to be scored. Thus, for example, we typically block on SC versions of names rather than CV versions of names, and then use scoring on other attributes to assess pairs that match on the SC but not CV versions.
For blocking within the CGED-Q JSL, we apply different criteria for each linkage type. For the first type, officials with surnames who had two-character given names, we block on the SC and pinyin versions of the surname and given name. That is, if two records have the same SC or pinyin version of the surname and given name, they are a candidate match and go on to be scored on the other attributes, including the CV versions of the names. We do not use the CV version of the names for blocking because it would be too strict, and would preclude making matches based on the looser criteria associated with use of the SC versions. For the second type, officials with surnames and only one-character given names, we only allow pairs of records with the same SC versions of the names. Our experiments with allowing for matches on the pinyin version of the surname and given name yielded too many false positives.
For the third type, officials with no surname, we block on the SC version of the given name and Banner affiliation, or on the combination of the pinyin version of the name, the Banner affiliation, Imperial Lineage affiliation, title, and complete post. In other words, a pair in which the SC version of the name does not match but the pinyin version does match can still be treated as a candidate pair and scored if there is an exact match on a variety of other characteristics. We allow candidate pairs that match on the pinyin name only when several additional secondary attributes also match because allowing candidate pairs based on pinyin given name alone would substantially expand the number of pairs to be considered. For the fourth type, linkage within the CGED-Q ER, the SC versions of the surname and given name are sufficient for blocking. Rather than have a separate approach to blocking in the CGED-Q ER for men with a surname and a single-character given name, as we describe below, we apply tighter criteria for scoring candidate pairs involving such records. 32 For the fifth type, linkage between the CGED-Q ER and CGED-Q JSL of men with a surname and a two-character given name, we allow for candidate pairs that match on the SC or toneless pinyin versions of the surname and name. 33 For the sixth type, linkage between the CGED-Q ER and CGED-Q JSL of men with single-character given names, we only allow candidate pairs that match on the SC versions of the names.
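The logic of blocking on one key or another can be sketched independently of any particular linkage package. The example below is a plain-Python illustration, not the dtalink interface: the function, field names, and records are invented for the purpose. A pair becomes a candidate if the two records agree exactly on every field of at least one blocking key.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, block_keys):
    """Return index pairs (i, j), i < j, of records that agree exactly on
    all fields of at least one blocking key."""
    pairs = set()
    for key_fields in block_keys:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[tuple(rec[f] for f in key_fields)].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs

# Invented records: 0 and 1 share SC and pinyin forms; 2 shares only pinyin.
records = [
    {"surname_sc": "王", "given_sc": "錫麟", "surname_py": "wang", "given_py": "xilin"},
    {"surname_sc": "王", "given_sc": "錫麟", "surname_py": "wang", "given_py": "xilin"},
    {"surname_sc": "王", "given_sc": "錫霖", "surname_py": "wang", "given_py": "xilin"},
    {"surname_sc": "李", "given_sc": "錫麟", "surname_py": "li", "given_py": "xilin"},
]
SC_KEY = ("surname_sc", "given_sc")
PY_KEY = ("surname_py", "given_py")

loose = candidate_pairs(records, [SC_KEY, PY_KEY])  # SC or pinyin match
strict = candidate_pairs(records, [SC_KEY])         # SC match only
print(sorted(loose), sorted(strict))
```

Blocking on the SC key alone yields fewer candidates than allowing either key, which mirrors the trade-off between linkage time and false negatives discussed above.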

Probabilistic Linkage
Since probabilistic linkage is already widely used and described in detail elsewhere, here we only provide a summary of the basic concept. Probabilistic matching considers every possible pair of records in a dataset left over after blocking and then scores each pair for similarity according to criteria specified by the user. For the scoring, the user specifies the attributes to compare, and the amount to be added to or subtracted from the score if they match or differ. Calipers may also be specified according to which some amount may be added to the score for a pair if two numeric attributes are within some range of each other, and some other amount may be deducted if they are not. A match is made by comparing the scores of candidate pairs and selecting the ones with the highest score that also meet a cutoff score set by the user.
32 We do not have separate blocking for Bannermen when linking between the CGED-Q JSL and CGED-Q ER because there are too few of them (1.2% of records overall) in the CGED-Q ER to warrant special handling.
33 We include Bannermen with officials with surnames because only a small proportion (1.23%) of exam degree holders in the Classmate Books we have coded were Bannermen, and the chances of different individuals having the same name were small.
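The scoring concept summarized here, with rewards, penalties, calipers, and a cutoff, can be sketched in a few lines. This is a generic illustration and not the dtalink implementation: the weights, field names, and records are invented, and the choice to skip attributes that are missing in either record is our own assumption.

```python
def score_pair(a, b, exact_rules, caliper_rules):
    """Score a candidate pair.

    exact_rules:   (field, reward, penalty) tuples; reward if equal,
                   penalty if different.
    caliper_rules: (field, width, reward, penalty) tuples; reward if the
                   numeric values are within width of each other.
    Attributes missing in either record are skipped (our assumption).
    """
    score = 0
    for field, reward, penalty in exact_rules:
        if a.get(field) is None or b.get(field) is None:
            continue
        score += reward if a[field] == b[field] else -penalty
    for field, width, reward, penalty in caliper_rules:
        if a.get(field) is None or b.get(field) is None:
            continue
        score += reward if abs(a[field] - b[field]) <= width else -penalty
    return score

def best_match(record, candidates, exact_rules, caliper_rules, cutoff):
    """Index of the highest-scoring candidate meeting the cutoff, or None."""
    scored = [(score_pair(record, c, exact_rules, caliper_rules), i)
              for i, c in enumerate(candidates)]
    scored = [s for s in scored if s[0] >= cutoff]
    return max(scored)[1] if scored else None

# Invented weights and records, for illustration only.
EXACT = [("given_cv", 10, 0), ("county", 8, 6), ("province_c", 2, 10)]
CALIPER = [("edition", 1.0, 3, 20)]
a = {"given_cv": "清安", "county": "山陰", "province_c": "Jiangnan", "edition": 1850.0}
b = {"given_cv": "清安", "county": "山陰", "province_c": "Jiangnan", "edition": 1850.25}
c = {"given_cv": "清安", "county": "漢陽", "province_c": "Hubei", "edition": 1870.0}

print(score_pair(a, b, EXACT, CALIPER))             # 23
print(score_pair(a, c, EXACT, CALIPER))             # -26
print(best_match(a, [c, b], EXACT, CALIPER, 15))    # 1
```

Here the namesake from a different county and province, recorded twenty years later, falls far below the cutoff even though the given name matches.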
We scored the candidate pairs of record groups left over after blocking according to their concordance or discordance on specified primary and secondary attributes. Tables 11 and 12 summarize our current rewards and penalties for concordance or discordance on each primary or secondary attribute for our six types of linkage. The rewards ("+" in the tables) are added to the score for a candidate pair if the condition specified in the row heading is satisfied. The penalties ("-" in the tables) are subtracted from the score if the condition is not satisfied. Tables 11 and 12 also include the cutoffs that a score had to be greater than or equal to in order for a match to be made. For each of the six linkage tasks, we choose the amounts to be added to or subtracted for a match or mismatch on a specified attribute to balance the risks of false negatives and false positives. We apply more stringent criteria when there are larger numbers of records to be linked, most notably within the CGED-Q JSL, and therefore a higher chance that separate individuals will have the same primary attributes. We apply looser criteria when the chances of a false positive are lower, usually because there are fewer records to be linked. Linkage within the CGED-Q ER is one example.
We arrived at the rewards, penalties, and cutoffs in Tables 11 and 12 iteratively. We inspected the results every time we ran the linkage. We located false negatives by searching the data for groups of records that matched exactly on secondary attributes such as position and degree and most but not all of the primary attributes, and which were not associated with a single official. We examined these groups to assess whether the records in the group should all have been assigned to the same official. This helped clarify how often characters were replaced with ones that looked similar and inspired our effort to create the CV and SC versions of names.
It led us to increase the rewards for exact matches on such secondary attributes as courtesy name and complete post that were highly unlikely to match by chance. It also led to our discovery of inconsistencies in the recording of province.
We searched for false positives by identifying groups of records that had all been assigned to the same official, but which differed on at least one primary attribute, for example, surname, or one character in a two-character given name. This led to our realization that we needed to apply more stringent criteria for individuals with single character given names and led us also to increase penalties for mismatches on attributes such as province of origin or broad category of purchased or examination degree that should be stable. Users working with extracts of the data to study topics of their own, most commonly the appointment and promotion of specific categories of officials, also reported problems that they noticed, and our investigations revealed. 34 For linkage within the CGED-Q JSL (types one through three), we assigned the largest rewards to concordance on attributes like given name, post, or county that are the most diverse and therefore the least likely to match by chance. Even though blocking differed for one-and two-character names, scoring was the same. We gave large rewards to matches on secondary attributes like courtesy or style name and complete post. This helped counter the effects of inconsistencies in the recording of province and county of origin that were not addressed by the transformations described above. Since posts were listed in the same order from one edition to the next, we also rewarded concordance on the name of the official in the record above or below. Rewards are smaller for concordance on attributes like province or broad category of examination or purchase degree that are less diverse and more likely to match by chance.
We apply the largest penalties for discordance on attributes like province or county of origin that should have been stable and were less diverse. A mismatch on a less diverse attribute like the C version of the province, Banner affiliation, or broad category of degree qualification will lead to a large penalty. We apply a penalty for a mismatch on county, with an additional penalty if the provinces in which the counties are located are part of Huguang or Jiangnan. 35 We also penalize matches of records that are further apart in time, and in the case of records so far apart that it is implausible for them to be the same person, we apply a penalty so large that it will preclude a match from being made. For officials without surnames, we also penalize candidate matches if the categories of bureaucratic rank (pinji 品级) are too far apart. This helps reduce the chances that a record of a high official will be linked to those of another official with the same given name who is a low-ranking clerk. We apply a smaller penalty for mismatches on attributes that are more prone to inconsistent recording, like detailed examination or purchase degree. Courtesy and style names were diverse, often missing, and sometimes seem to have changed, thus we do not apply a penalty for a mismatch on them. Similarly, because complete positions and the components that made up the position were expected to change when an official was promoted or reassigned, and because different editions could record positions differently even when the official was not promoted or reassigned, we do not apply a penalty for a mismatch on position.
For linkage within the CGED-Q ER (type four), we began with surname, name, province and county of origin, and exam year. We created CV and then SC versions of the surname and name. We blocked on the SC version of the surname and name. We rewarded matches on the combination of SC surname, SC name, and county or province, and heavily penalized discordance on the pinyin (PY) version of the county or C version of the province. We used the SC version of the name rather than the CV version because the overall number of men to be linked was much smaller than in the CGED-Q JSL and the risk of a 'false positive' accordingly smaller. We applied only a mild penalty for a gap between exam years because we wanted to allow for links between records of juren and jinshi degrees earned in different years. We applied a much larger penalty if the exam years were so far apart that the rules would not have allowed a juren to sit for the Metropolitan exam in the specified year.
For linkage between the CGED-Q JSL and CGED-Q ER (types five and six), we relied on surname, given name, province and county of origin, CGED-Q ER exam year, and CGED-Q JSL edition year. We blocked on the SC version of the surname and given name. We gave very large rewards for matches on the CV version of the surname and name and smaller rewards for matches on the SC versions. We allowed for matches not only on the province and county of origin in the CGED-Q JSL, but also on the province and county of origin (籍貫) listed in the CGED-Q JSL for officials who sat for the exam someplace other than their actual place of origin, usually Shuntian. We allowed up to 30 years for the time between earning a degree and being appointed for the first time.
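The blocking step and the 30-year window can be sketched as follows. The field names (`sc_surname`, `sc_given`, `exam_year`, `first_edition_year`) are hypothetical, and the SC versions are assumed to have been computed beforehand. The sketch also assumes that each JSL entry represents an official's earliest appearance and that an appearance before the exam year is excluded; neither assumption is stated in the text.

```python
from collections import defaultdict

MAX_YEARS_TO_FIRST_POST = 30  # window stated in the text


def candidate_pairs(er_records, jsl_records):
    """Pair ER and JSL records that share an SC surname and given name
    and fall within the allowed window between degree and first appearance."""
    blocks = defaultdict(list)
    for er in er_records:
        blocks[(er["sc_surname"], er["sc_given"])].append(er)

    pairs = []
    for jsl in jsl_records:
        key = (jsl["sc_surname"], jsl["sc_given"])
        for er in blocks.get(key, []):
            gap = jsl["first_edition_year"] - er["exam_year"]
            if 0 <= gap <= MAX_YEARS_TO_FIRST_POST:
                pairs.append((er["id"], jsl["id"]))
    return pairs


er_records = [
    {"id": "E1", "sc_surname": "王", "sc_given": "德", "exam_year": 1800},
    {"id": "E2", "sc_surname": "李", "sc_given": "德", "exam_year": 1800},
]
jsl_records = [
    {"id": "J1", "sc_surname": "王", "sc_given": "德", "first_edition_year": 1820},
    {"id": "J2", "sc_surname": "王", "sc_given": "德", "first_edition_year": 1840},
]
pairs = candidate_pairs(er_records, jsl_records)
```

Only the pairs that survive this step would go on to be scored with rewards and penalties, which keeps the number of comparisons manageable.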

Results
To illustrate how the approach described above reduces false positives and false negatives, while also reducing the amount of time required, we present the results of linkage within the CGED-Q JSL, that is, types one, two, and three. We focus on linkage within the CGED-Q JSL because it was the most challenging and complex, and made use not only of primary attributes but also of a wide range of secondary attributes. Linkage of the 4,108,586 records in the CGED-Q JSL with a name and other information required by the approach described in the sections above yielded 326,315 sets of linked records, each the career history of a single official. For each of the three types of linkage within the CGED-Q JSL, Table 13 presents the original number of records to be linked, the number of groups remaining after the deterministic linkage described in 5.2, the number of candidate pairs left after the blocking described in 5.3, and the final number of officials produced by the probabilistic linkage described in 5.4. According to Table 13, grouping records with deterministic linkage on the primary and some secondary attributes substantially reduces the number of items to be linked. In the case of type 1 linkage, the number of items to be linked is reduced by 88.6%, from 2,676,108 to 315,015. The resulting number of candidate pairs to be scored is modest. For type 1 linkage, the number of candidate pairs is lower than the number of groups because many groups are isolates: blocking left them without any other groups to be paired with and scored, so they go straight to being recognized as an official. The number of candidate pairs for type 3 linkage is much larger because only the given name and Banner affiliation are available for blocking, and these are less diverse than the surname, given name, and province and county of origin of officials who have surnames.
Probabilistic linkage on standardized primary attributes that compensates for discrepancies when there are matches on secondary attributes reduces the number of false negatives. Had we required exact matches on the primary attributes as originally recorded, each distinct combination within one of the histories produced by our linkage would have been associated with a separate official. Table 14 tabulates the career histories according to the numbers of distinct combinations of surname, name, and province and county of origin or Banner affiliation within them in the original data. In 28% (100-72) of the career histories of officials with a one-character given name, more than one surname, given name, or place of origin appeared. The corresponding figure for officials with two-character given names was 29.9% (100-70.1). Among the career histories of officials without surnames produced by linkage, 13.9% (100-86.1) contained more than one name or Banner affiliation. According to our calculations, linkage that required exact matching on the original primary attributes, rather than probabilistic linkage on the consolidated CV or SC versions of names to allow for discrepancies, would have led to the creation of 453,375 career histories. Career histories that in our probabilistic linkage were attributed to a single official would have been separated. The total number of career histories, in other words, would have been inflated by 38%. The gains associated with applying probabilistic linkage within the CGED-Q ER and between the CGED-Q JSL and CGED-Q ER are similar: the number of juren degree holders who are linked to jinshi records increases substantially, as do the numbers of juren and jinshi linked to the CGED-Q JSL.
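The counterfactual count described above can be sketched on toy data. The record layout (official identifier, surname, given name, place of origin) and the sample values are hypothetical, and the sketch follows the logic stated in the text, in which each distinct combination within a linked history would become a separate official under exact matching.

```python
from collections import defaultdict


def exact_match_counterfactual(records):
    """Return (linked histories, histories under exact matching,
    share of linked histories with more than one distinct combination)."""
    combos = defaultdict(set)
    for official_id, surname, given_name, origin in records:
        combos[official_id].add((surname, given_name, origin))

    linked = len(combos)
    exact = sum(len(c) for c in combos.values())
    multi = sum(1 for c in combos.values() if len(c) > 1)
    return linked, exact, multi / linked


# Toy data: official 1 appears under two recorded variants of the given name.
toy = [
    (1, "王", "德", "江蘇"),
    (1, "王", "徳", "江蘇"),
    (1, "王", "德", "江蘇"),
    (2, "李", "文正", "浙江"),
]
linked, exact, share_multi = exact_match_counterfactual(toy)
```

On the toy data, two linked histories would become three under exact matching, and half of the linked histories contain more than one distinct combination.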

Conclusion
This is unlikely to be the final word, especially for the linkage of officials without surnames. Based on manual examination of the resulting data, we are confident that our linkage of officials with surnames is close to optimal in terms of its balance between avoiding false positives and false negatives, and that any further accommodation of additional discrepancies we have noticed would open the door to false positives in which the records of clearly different officials would be combined. Any further adjustments to the linkage of officials with surnames are likely to consist of small refinements to the lists of similar characters, and adjustments to the handling of problems with provinces and counties. For Bannermen, however, we suspect that the lack of diversity in the combination of names and Banner affiliation means that we still have too many false positives.
Our experiences, and our descriptive results about patterns in names, should be useful to other teams that are carrying out large-scale record linkage in datasets constructed from historical Chinese sources. The issues we discuss here and our approach to linkage are most relevant for the linkage of highly structured data transcribed from rosters and related records, but the descriptive results on the consistency and potential for overlap in the recording of names may be of interest to those conducting disambiguation in unstructured data like newspaper articles. Particular attention needs to be paid to the possibility that across different sources, the characters in the names of individuals to be linked may be replaced with variant forms of the same character, or with entirely different characters that are superficially similar.
We now have ongoing projects to construct, link, and analyze datasets of individuals during the Republican era. Our efforts to create datasets from university student records are the furthest along, but we have other projects to create datasets of Republican officials, professionals, and other elites. While we expect some of our experiences with Qing records to be relevant, we also anticipate that there will be other issues specific to the Republican data. Naming patterns may have changed. Consistency in the usage of genealogical given names as opposed to courtesy or style names may have changed as well. Customs for the recording of place of origin may also have evolved.