IJCAI-07 Workshop on Analytics for Noisy Unstructured Text Data

Proceedings
Hyderabad, India - January 8, 2007



Foreword

Noisy unstructured text data is found in informal settings such as online chat, SMS, emails, message boards, newsgroups, blogs, wikis and web pages. Text produced by processing spontaneous speech, printed text and handwritten text also contains processing noise. Text produced under such circumstances is typically highly noisy, containing spelling errors, abbreviations, nonstandard words, false starts, repetitions, missing punctuation, missing case information, and pause-filling words such as "um" and "uh." Such text can be seen in large amounts in contact centers, online chat rooms, OCRed text documents, SMS corpora, etc. Documents with historical language can also be considered noisy with respect to today’s knowledge about the language. Such text contains useful historical, religious and ancient medical knowledge. The theme of the IJCAI 2007 Conference is "AI and its benefits to society." In keeping with this theme, this workshop proposes to look at analytics of highly noisy text that is produced in everyday applications in society.

The goal of the workshop is to focus on the problems encountered in analyzing noisy documents coming from various sources. The nature of the text warrants moving beyond traditional text analytics techniques. This workshop brings together a diverse group of researchers to present current research and development in addressing this challenge. As a result of this workshop, some new real-life noisy data sets have also become available to a wider research community.

We were fortunate to assemble a diverse group of researchers from the Natural Language Processing, Machine Learning and Knowledge Management communities to help us in organizing this workshop. The workshop call for papers had a very good response. We received 30 submissions spanning a diverse set of issues relevant to noisy text analytics. Each submission was reviewed by at least three members of the program committee. To encourage discussion, the workshop program is structured into topic-oriented oral and poster sessions. In addition to the contributed papers, the program also contains a keynote address and a panel discussion on whether noisy text analytics is possible at all and, if so, how.

We would like to thank our organizing and program committees for their invaluable input and thoughtful reviews. We would like to thank Monojit Choudhury, Matthew Hurst, Ted Pedersen and Sudeshna Sarkar for sharing the noisy text datasets they prepared. We would also like to thank the others who pointed us to many relevant noisy text datasets. We thank the International Association for Pattern Recognition for endorsing this workshop and instituting a best student paper award. We would like to thank Raghuram Krishnapuram for chairing the committee that decides the best student paper award. We also thank IBM Research for providing financial support for the workshop.

Craig Knoblock, Daniel Lopresti, Shourya Roy, L. Venkata Subramaniam (Workshop Co-Chairs)



Endorsed by the International Association for Pattern Recognition

Supported by IBM Research