Last week I had the opportunity to attend Japanese Text Mining, an extraordinary four-day workshop held at Emory University and sponsored by the Japan Foundation (and numerous other organizations listed at the link). I say extraordinary because surely this cutting-edge topic has never been explored this deeply and for this long by a group of 20+ people in the United States, and perhaps not even in Japan. Our group was a diverse mix of students, professors, and librarians; one person came from Canada and one directly from Japan to attend the workshop.
Another extraordinary aspect of the gathering was the strong presence of the University of Michigan. Two of our PhD students were in attendance, Paula Curtis and Melissa Van Wyk, along with three alums of our PhD program: Brian Dowdle, Hoyt Long, and Molly Des Jardin! The last two served as our instructors, along with the brilliant Mark Ravina, a professor of Japanese history at Emory who organized, instructed, and generally made the entire event happen. The crew of graduate students he had on hand to help with our computing difficulties and coding questions was top-notch!
Text mining is the process of analyzing natural-language text to derive high-quality information from it. That type of research has been going on for some time with English-language text. For example, one might process all of the writings of Charles Dickens to derive information about his patterns of language use over time. It is relatively straightforward to scan a page of printed English text and then use optical character recognition (OCR) to make the words in the text readable and searchable by a computer. Japanese text, however, does not usually have spaces between the words, so OCR alone doesn’t get you as far: the text still has to be divided into words before they can be counted or compared. And even mere OCR of Japanese, alas, is not as straightforward as it is for English. If you want to know more, a rich starting point is Molly Des Jardin’s guide to the subject.
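To make the word-boundary problem concrete, here is a minimal sketch in Python. It is my own illustration, not what we did at the workshop, where the hands-on work was in R. The tiny dictionary and the function name `tokenize_longest_match` are invented for the example. Real Japanese tokenizers rely on large dictionaries and statistical models rather than a simple longest-match rule like this one.

```python
from collections import Counter

# Toy vocabulary, invented for this illustration only. Real tokenizers
# use large dictionaries plus statistical models.
TOY_DICTIONARY = {"私", "は", "日本語", "日本", "語", "の", "本", "を", "読む"}


def tokenize_longest_match(text, dictionary):
    """Split text greedily, always taking the longest dictionary word that fits."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the dictionary matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


sentence = "私は日本語の本を読む"  # "I read a Japanese-language book"

# Splitting on whitespace, which works for English, finds no word boundaries:
# the whole sentence comes back as a single "word".
whitespace_tokens = sentence.split()

# Dictionary-based splitting recovers the individual words.
tokens = tokenize_longest_match(sentence, TOY_DICTIONARY)

# Once the text is tokenized, counting word frequencies is straightforward.
word_counts = Counter(tokens)

print(whitespace_tokens)
print(tokens)
print(word_counts.most_common(3))
```

Note that the longest-match rule keeps 日本語 ("Japanese language") as one word instead of splitting it into 日本 ("Japan") and 語 ("language"). A simple rule like this guesses wrong often enough on real text that dedicated tools are needed.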
Our four days were packed with activities: presentations; hands-on work, including learning to use the programming language R to manipulate Japanese text once it had been successfully OCR’ed and broken into words (the technical term is “tokenized”); and lively discussion. We did a little tweeting under the hashtag #dhjapan, which we hope will be used for other Japanese digital humanities topics in the future. I am still processing everything I learned and the ways I would like to use it in future projects. Happily, the group intends to keep in touch and advance this exciting new area of scholarship.