Cross-Searching Simplified & Traditional Chinese

The U-M Library Catalog contains more than 400,000 bibliographic records for Chinese-language materials. These records display titles, authors, and other information in both vernacular Chinese characters and transliterated Roman characters. The Chinese-language metadata elements are entered in one of two character sets, traditional Chinese characters or simplified Chinese characters, although some records contain a mix of both. The two character sets are equivalent, but, as the name suggests, simplified characters typically take fewer strokes to write. The same word or concept can be written in either character set, and some corresponding characters look similar. Where the traditional and simplified forms differ, however, they are distinct characters, each with its own Unicode code point. (For more on the origins of simplified Chinese characters, see the Wikipedia article "Simplified Chinese characters.")

Our challenge

The Chinese-language records in our catalog are a mix of both character sets: some records were created using traditional Chinese characters to represent titles, authors, publishers, etc., while other records were cataloged using simplified characters. Similarly, individuals who search using Chinese characters may enter their search query with either simplified or traditional characters. 

Until recently, users of our catalog could not be confident that they were retrieving all relevant records unless they ran two searches, one in simplified characters and a second in traditional characters. Recognizing this obstacle to finding Chinese-language materials, we undertook a project to address it in partnership with U-M’s Asia Library.

Our solution

Catalog Search is built on Solr, an open-source indexing and search platform. Our solution was to add Unicode Traditional-to-Simplified Chinese transliteration (as implemented in the standard ICU library) to our existing Solr indexing system. The ICU Filter performs script transliteration while catalog records are indexed, creating a new searchable index in which all Chinese-language characters are normalized to simplified characters. User queries are normalized to simplified characters in the same way. As a result, a query in either traditional or simplified Chinese characters is matched against a normalized index of records encoded in simplified Chinese characters. When records are displayed, they appear as they were created, in either traditional or simplified characters.
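The normalize-both-sides idea described above can be sketched in a few lines of Python. This is only an illustration, not our production configuration: the mapping table is a tiny hand-picked subset standing in for ICU's full Traditional-to-Simplified transform, and the record titles, `normalize`, and `search` are hypothetical names invented for the example.

```python
# Tiny illustrative subset of traditional -> simplified mappings.
# ASSUMPTION: a stand-in for ICU's transform, which covers far more characters.
T2S = str.maketrans({
    "戶": "户",
    "圖": "图",
    "書": "书",
    "館": "馆",
    "國": "国",
    "學": "学",
})

def normalize(text: str) -> str:
    """Map each traditional character to its simplified form; leave others alone."""
    return text.translate(T2S)

# Hypothetical records, stored exactly as they were cataloged.
RECORDS = {
    1: "戶口制度研究",   # traditional characters
    2: "户口管理",       # simplified characters
    3: "中國圖書館學",   # traditional characters
}

# Build the search index once, from normalized titles.
INDEX = {rec_id: normalize(title) for rec_id, title in RECORDS.items()}

def search(query: str) -> list[int]:
    """Return ids of records whose normalized title contains the normalized query."""
    needle = normalize(query)
    return sorted(rec_id for rec_id, title in INDEX.items() if needle in title)

if __name__ == "__main__":
    for query in ("戶口", "户口"):
        hits = search(query)
        # Display uses the original title, not the normalized one.
        print(query, [RECORDS[rec_id] for rec_id in hits])
```

Both queries find the same two records, and each record is still displayed in the script in which it was cataloged, because only the index and the query are normalized.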

For example, a title query for “household registration” now returns the same 95 records whether it is entered in traditional or simplified characters. Previously, the two separate searches returned 6 and 87 records. (The remaining two are on-order titles that were not yet in the catalog when the original searches were run.)

Throughout the project, our partners in the Asia Library helped us identify test records for indexing, tested the beta version, and provided feedback before it was released. Because Chinese characters are also used in Korean and Japanese, our Asia Library colleagues also checked that the change had no unintended effects on searches in those languages. Because the ICU Filter applies the same character-by-character transformation to both records and queries, it does not change how searches perform on records in those languages.
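A small Python sketch suggests why applying the same mapping to both sides leaves Japanese and Korean searches intact. As before, this is an assumption-laden illustration: the four-entry table is a hypothetical stand-in for ICU's transform, and the titles and the `matches` helper are invented for the example.

```python
# ASSUMPTION: a four-character stand-in for ICU's Traditional-to-Simplified transform.
T2S = str.maketrans({"書": "书", "館": "馆", "韓": "韩", "國": "国"})

def normalize(text: str) -> str:
    """Apply the mapping one character at a time; unmapped characters pass through."""
    return text.translate(T2S)

def matches(query: str, title: str) -> bool:
    """Compare query and title only after both have been normalized the same way."""
    return normalize(query) in normalize(title)

# Hypothetical titles mixing Han characters with kana and hangul.
japanese_title = "図書館の歴史"
korean_title = "韓國 도서관"

if __name__ == "__main__":
    # Kana and hangul have no entry in the table, so they are unchanged.
    print(normalize("の") == "の", normalize("도서관") == "도서관")
    # Han characters that are mapped get mapped in the query and the title alike,
    # so a query typed the way the record was cataloged still matches.
    print(matches("図書館", japanese_title))
    print(matches("韓國", korean_title))
    print(matches("도서관", korean_title))
```

In this sketch the normalized Japanese title is no longer valid Japanese, but that does not matter: the normalized form is used only for matching and is never displayed.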

With this improvement, users can now find all records regardless of whether the query or the record is in simplified or traditional Chinese characters. Credit goes to Bill Dueber in LIT’s Digital Library Applications department for his work on this and to Liangyu Fu, Gengna Wang, Yunah Sung, and Keiko Yokota-Carter in the Asia Library for their guidance and testing.