IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data

Hyderabad, India - January 8, 2007


Supported by IBM Research
Endorsed by the International Association for Pattern Recognition (IAPR)

Possible Sources of Noisy Text and Application Scenarios

“Contact center” is a general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones and apparel. On average, a person in the developed world interacts with a contact center agent at least once a week, and a typical agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and email, and the industry produces gigabytes of data in the form of emails, chat logs, voice conversation transcriptions and customer feedback. The bulk of contact center data is voice conversations. Transcribing these with state-of-the-art automatic speech recognition yields text with a word error rate of 30-40%. Furthermore, even written modes of communication, such as online chat and email between customers and agents, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.

Poorly written text is also produced in large amounts in online chat, SMS, blogs, wikis, discussion forums and newsgroups. These are important sources of data for market buzz analysis, market review, trend analysis, etc. Because of the large volume of data, efficient methods are needed for information extraction, classification, summarization and analysis.

Many government and national defence organizations have vast repositories of hard-copy documents. To retrieve and process their content, these documents must first be converted to text using optical character recognition (OCR). In addition to printed text, they may also contain handwritten annotations. OCRed text can be highly noisy depending on the font size, print quality, etc., with error rates ranging from 2-3% to as high as 50-60%. Handwritten annotations can be particularly hard to decipher, and error rates can be quite high in their presence.

Documents written in historical languages can also be considered noisy with respect to today’s knowledge of the language. Such text contains valuable historical, religious and ancient medical knowledge.