(13275N) Chinese Informal Words Detector and Microblog Segmenter

Applications

Chinese language informal word recognition
Chinese language word segmentation in the microtext, chat, SMS and IM domains
Chinese language lexicon creation

Patent

Know-how

Opportunity

Exclusive/non-exclusive licensing
Partnership in commercial development

Advantages

Automatically extract informal words from Chinese sentences.
Automatically segment Chinese microtext sentences into words with high accuracy, in contrast to standard Chinese word segmentation systems.

IDA Technology Roadmap 2012
This technology falls in the following categories of Singapore's IDA Infocomm Technology Roadmap 2012:

Big Data
Social Media
User Interface

For more information on this technology contact:
Dr. Jose Rojas
Industry Liaison Office
NUS Enterprise
National University of Singapore

Technology Overview
The system automatically segments Chinese sentences into words and, at the same time, recognizes informal Chinese words. The processed text can then be used to build a lexicon of informal Chinese words, which can serve as a reference for other technologies that need such input.
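
As an illustration of how processed text could be turned into a lexicon, the sketch below groups per-character labels into words and counts the informal ones. The label scheme (B/I/E/S for word position, IF/F for informal/formal), the function names, and the example sentence are assumptions made for this example; the listing does not specify the system's actual output format.

```python
from collections import Counter

def tags_to_words(chars, seg_tags, informal_tags):
    """Group characters into words using assumed B/I/E/S tags.

    A word is marked informal if any of its characters carries
    the assumed informal tag "IF".
    """
    words = []
    current, informal = "", False
    for ch, seg, inf in zip(chars, seg_tags, informal_tags):
        current += ch
        informal = informal or inf == "IF"
        if seg in ("E", "S"):  # "E" ends a multi-character word, "S" is a single-character word
            words.append((current, informal))
            current, informal = "", False
    if current:  # keep a trailing word that was never closed
        words.append((current, informal))
    return words

def build_lexicon(tagged_sentences):
    """Count how often each informal word occurs across tagged sentences."""
    lexicon = Counter()
    for chars, seg_tags, informal_tags in tagged_sentences:
        for word, is_informal in tags_to_words(chars, seg_tags, informal_tags):
            if is_informal:
                lexicon[word] += 1
    return lexicon

# Hypothetical tagged sentence: "稀饭" (literally "rice porridge") is used
# informally online in place of "喜欢" ("to like").
sentence = "我稀饭你"
seg = ["S", "B", "E", "S"]
inf = ["F", "IF", "IF", "F"]

words = tags_to_words(sentence, seg, inf)
lexicon = build_lexicon([(sentence, seg, inf)])
```

Here `words` holds the segmented sentence with an informal flag per word, and `lexicon` maps each informal word to its frequency, which is one plausible shape for a reference lexicon.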

One of the novel aspects is the use of a joint inference model. Chinese word segmentation and informal word recognition depend closely on each other, so a factorial conditional random field model is used to perform both tasks jointly. The interaction between the two tasks improves performance on both. The current implementation can be queried through a web browser as well as programmatically.
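
The effect of joint inference can be sketched with a toy decoder. The code below is not the system's model: it uses hand-set scores rather than trained weights, assumes B/I/E/S segmentation labels and IF/F informal labels, and decodes a two-layer model exactly by running Viterbi over the product of the two label sets. The `coupling` parameter is a hypothetical cross-layer score that rewards informal characters sitting at the start or end of a multi-character word.

```python
import itertools

SEG = ["B", "I", "E", "S"]          # assumed segmentation labels
INF = ["F", "IF"]                   # assumed formal / informal labels
JOINT = list(itertools.product(SEG, INF))

# Which segmentation label may follow which (standard B/I/E/S constraints).
VALID_NEXT = {"B": "IE", "I": "IE", "E": "BS", "S": "BS"}

# Hand-set per-position scores for the 4-character toy sentence "我稀饭你".
SEG_PREF = [{"S": 1.0}, {"B": 0.5, "S": 0.6}, {"E": 0.5, "S": 0.6}, {"S": 1.0}]
INF_PREF = [{"F": 1.0}, {"IF": 1.0}, {"IF": 1.0}, {"F": 1.0}]

def make_node_score(coupling):
    """Per-position score: one term per layer plus a cross-layer term."""
    def node_score(t, y):
        seg, inf = y
        cross = coupling if (inf == "IF" and seg in ("B", "E")) else 0.0
        return SEG_PREF[t].get(seg, 0.0) + INF_PREF[t].get(inf, 0.0) + cross
    return node_score

def edge_score(prev, cur):
    """Transition score: forbid invalid segmentation label sequences."""
    return 0.0 if cur[0] in VALID_NEXT[prev[0]] else -1e9

def joint_viterbi(n, node_score, edge_score):
    """Return the highest-scoring sequence of (seg, inf) label pairs."""
    best = {y: node_score(0, y) for y in JOINT}
    back = []
    for t in range(1, n):
        new_best, pointers = {}, {}
        for y in JOINT:
            prev = max(JOINT, key=lambda p: best[p] + edge_score(p, y))
            new_best[y] = best[prev] + edge_score(prev, y) + node_score(t, y)
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    y = max(JOINT, key=best.get)
    path = [y]
    for pointers in reversed(back):   # follow back-pointers to the start
        y = pointers[y]
        path.append(y)
    return path[::-1]

independent = joint_viterbi(4, make_node_score(0.0), edge_score)
joint = joint_viterbi(4, make_node_score(0.3), edge_score)
```

With the coupling set to zero the two layers are decoded independently and the segmentation layer splits "稀饭" into two single characters. With a positive coupling, the evidence that both characters are informal tips the segmentation toward treating them as one word, which is the kind of mutual interaction the joint model is meant to exploit. Sentence-boundary constraints are omitted for brevity.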

The model is trained on a data set crawled from China’s Sina Weibo with gold annotations obtained through crowdsourcing.

Development Status
Technology Readiness Level 3 on the scale used by Singapore's Ministry of Defence.

An interactive demo of this technology is available online.

About the Research Group
Dr. Min-Yen Kan is an associate professor with research interests in digital libraries, natural language processing and information retrieval. Specific interests include scholarly digital libraries, definitional QA, statistical MT, text summarization, verb analysis, optimizing access to scientific literature, web crawling, and combining search and browsing user interfaces under human-computer interaction.