Chinese Informal Words Detector and Microblog Segmenter (13275N)
IDA Technology Roadmap 2012
This technology falls in the following categories of Singapore's IDA Infocomm Technology Roadmap 2012:
The system automatically segments Chinese sentences into words and, at the same time, recognizes informal Chinese words. The processed text can be used to build a lexicon of informal Chinese words, which other technologies that need such input can use as a reference.
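As a rough illustration of how segmented output could be turned into an informal-word lexicon, the following minimal sketch counts the words flagged as informal. The input format (a list of sentences, each a list of word/flag pairs), the function name build_informal_lexicon and the min_count parameter are assumptions made for this example; they do not describe the system's actual output format or interface.

```python
from collections import Counter

def build_informal_lexicon(tagged_sentences, min_count=1):
    """Collect informal words and their frequencies.

    tagged_sentences: iterable of sentences, each a list of
    (word, is_informal) pairs. This format is hypothetical.
    min_count: drop words seen fewer than this many times.
    """
    counts = Counter(
        word
        for sentence in tagged_sentences
        for word, is_informal in sentence
        if is_informal
    )
    return {word: n for word, n in counts.items() if n >= min_count}

# Illustrative input only; these sentences are not taken from the system.
sample = [
    [("我", False), ("稀饭", True), ("你", False)],
    [("稀饭", True), ("神马", True)],
]
lexicon = build_informal_lexicon(sample)
```

Raising min_count is one simple way to keep one-off recognition errors out of the resulting lexicon.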
One novel aspect is the use of a joint inference model. To exploit the close dependency between Chinese word segmentation and informal word recognition, the system uses a factorial conditional random field model that performs both tasks jointly. Performance on both tasks improves because each task informs the other. The current implementation can be queried through a web browser as well as programmatically.
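The idea of decoding two label sequences jointly can be sketched with a toy example. The code below is not the system's model: it uses hand-set scores rather than trained weights, a simplified two-label set per chain (B/I for segmentation, F/IF for formal/informal), and exact Viterbi decoding over the product of the two label sets. All names and numbers are assumptions chosen for illustration.

```python
from itertools import product

SEG = ("B", "I")    # B = character begins a word, I = continues a word
INF = ("F", "IF")   # F = formal, IF = informal

def joint_viterbi(emit_seg, emit_inf, trans_seg, trans_inf, cross):
    """Find the best joint labeling of two coupled chains.

    emit_seg[t][s], emit_inf[t][i]: per-position scores for each chain.
    trans_seg[(s_prev, s)], trans_inf[(i_prev, i)]: within-chain scores.
    cross[(s, i)]: score linking the two labels at the same position.
    """
    states = list(product(SEG, INF))

    def local(t, s, i):
        return emit_seg[t][s] + emit_inf[t][i] + cross[(s, i)]

    def step(prev, cur):
        return trans_seg[(prev[0], cur[0])] + trans_inf[(prev[1], cur[1])]

    best = {st: local(0, *st) for st in states}
    back = []
    for t in range(1, len(emit_seg)):
        new_best, ptr = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] + step(p, cur))
            new_best[cur] = best[prev] + step(prev, cur) + local(t, *cur)
            ptr[cur] = prev
        best = new_best
        back.append(ptr)

    # Trace the highest-scoring path backwards.
    path = [max(states, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy scores for a two-character input (illustrative values only).
emit_seg = [{"B": 2.0, "I": 0.0}, {"B": 0.6, "I": 0.4}]
emit_inf = [{"F": 0.0, "IF": 1.0}, {"F": 0.0, "IF": 2.0}]
trans_seg = {pair: 0.0 for pair in product(SEG, SEG)}
trans_inf = {(a, b): 0.5 if a == b else 0.0 for a, b in product(INF, INF)}
cross = {pair: 0.0 for pair in product(SEG, INF)}
cross[("I", "IF")] = 1.0   # couples the two chains

example_path = joint_viterbi(emit_seg, emit_inf, trans_seg, trans_inf, cross)
```

With these toy scores, the segmentation chain on its own would slightly prefer B for the second character, but the cross-chain score tips the joint decision to I, giving the path (B, IF), (I, IF). This is the kind of mutual interaction a joint model allows; exact decoding over the product space is practical here only because the label sets are tiny.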
The model is trained on a data set crawled from China’s Sina Weibo with gold annotations obtained through crowdsourcing.
The technology is at Technology Readiness Level 3 on the scale used by the Ministry of Defence, Singapore.
An interactive demo of this technology can be found here.
About the Research Group
Dr. Min-Yen Kan is an associate professor with research interests in digital libraries, natural language processing and information retrieval. Specific interests include scholarly digital libraries, definitional QA, statistical MT, text summarization, verb analysis, optimizing access to scientific literature, web crawling, and combining search and browsing user interfaces under human-computer interaction.