Statistical Approaches to Mining Texts in Literary Chinese

October 2013

This seminar explores new lines of research in natural language processing, pursued by statisticians in discussion with the humanists who would use these methods. It focuses on texts written in literary Chinese for three reasons. First, there is an influential group of potential users of these techniques at Harvard: scholars working on Chinese history, religion, and literature. Second, Harvard has a number of faculty and students in statistics who read Chinese and who are just beginning to explore this field. Third, developing natural language processing approaches for literary Chinese is especially hard because the language lacks the punctuation and word division on which traditional approaches, developed for Western languages, depend; methods developed for literary Chinese would apply to Japanese and Korean as well. The massive digitization efforts under way in China and Taiwan are now producing an extraordinary amount of material in searchable digital format. Natural language processing, particularly the approach central to this seminar (which does not depend on the prior existence of a dictionary), has the potential to enable scholars to develop quantitative approaches to this material. However, the reading protocols of history and literature differ so much from statistical analysis that, on the face of it, there would seem to be little common ground between the two. The organizers believe the time has come to explore the possibility that the scientists and the humanists have much to learn from each other.