2005 IBM USER INTERFACE & SIGNAL PROCESSING
TECHNOLOGIES SYMPOSIUM
Monday, Sept. 19, Room 20-043
 
 

TIME SPEAKER TITLE
9:00-9:05 Jiri Navratil Welcome
9:05-9:30 Fred Mintzer IBM and Signal Processing: An Informal and Myopic History
9:30-10:00 Jay W. Summet (Georgia Tech) Virtual Rear Projection: Technology and Evaluation
10:00-10:30 Guangqi Ye (Johns Hopkins) Robust Modeling of Heterogeneous Gestures Using Localized Parsers
10:30-10:45 BREAK  
10:45-11:15 Rosa M. Figueras (EPFL) Flexible Geometric Image Coding through Matching Pursuit
11:15-11:45 Yinian Mao (Maryland) Coordinated Sensor Deployment for Improving Secure Communications and Sensing Coverage
11:45-12:15 Petar S. Aleksic (Northwestern) Audio-Visual Interactions in Communications
12:15-1:30 LUNCH  
1:30-2:00 Dimitri Bitouk (Johns Hopkins) 3-D Feature Extraction for Audio-Visual Speech Recognition
2:00-2:30 Abhinav Sethy (USC) Building Topic Specific Language Models from Webdata
2:30-3:00 Damianos Karakos (Johns Hopkins) The Maximum Likelihood Set: A Novel Approach to Language Modeling

IBM Presentations

TIME SPEAKER TITLE
3:00-3:05 Gerasimos Potamianos Introduction of IBM presentations
3:05-3:30 David Nahamoo Overview of IBM Research / UIT Areas
3:30-3:40 BREAK  
3:40-4:00 Ellen Eide Text-To-Speech
4:00-4:20 Ying Li Automobile Damage Detection
4:20-4:40 Jason Pelecanos Conversational Biometrics
4:40-5:00 Gerasimos Potamianos Audio-Visual Speech Recognition
5:00 Alain Biem Closing Remarks

DETAILS OF INVITED STUDENT / POSTDOC PRESENTATIONS:


9:30-10:00: Virtual Rear Projection: Technology and Evaluation
JAY W. SUMMET, College of Computing, Georgia Institute of Technology

ABSTRACT:

Virtual Rear Projection (VRP) uses multiple redundant front-projectors to simulate a rear-projected experience on surfaces where true rear-projection is impossible or cost-prohibitive (e.g., external walls or concrete floors). By detecting users as they occlude individual projectors, the system can turn off the light falling on them and fill in the resulting shadows on the display surface using other, non-occluded projectors. This talk will discuss the technology we have developed for VRP systems, along with our work to evaluate the user experience of the technology, and will describe how user evaluations have motivated our technological development path.

More Information: http://www.cc.gatech.edu/cpl/vrp/

BIO:

Jay Summet is a PhD student at the Georgia Institute of Technology, co-advised by Gregory Abowd and Jim Rehg. His research involves the development and evaluation of projector-based technology for display and tracking. Jay holds an MS degree in End User Visual Programming from Oregon State University and a BS in Scientific Computing from Central Washington University, and has interned at Pacific Northwest National Laboratory, Intel Research, and Mitsubishi Electric Research Laboratories.

More Information: http://www.cc.gatech.edu/~summetj/cv.html


10:00-10:30: Robust Modeling of Heterogeneous Gestures Using Localized Parsers
GUANGQI YE, Department of Computer Science, the Johns Hopkins University

ABSTRACT:

With the ubiquity of powerful computers and rapid advances in sensing and human-computer interaction technologies, there is great potential for creating intelligent computing environments. Vision and speech have emerged as convenient and effective means of interaction and control that can augment or even replace traditional input devices, such as keyboards and mice. We propose a new methodology for vision-based human-computer interaction called the Visual Interaction Cues (VICs) paradigm. VICs fundamentally relies on a perceptual space shared between the user and a computer equipped with cameras.

In this space, each interface component is represented as a localized region in the image(s). Thus, we can efficiently extract local visual cues to model gestures using localized parsers, without globally tracking the user. In this talk, we present a novel approach to efficiently capture hand shape and motion. Based on the extracted features, low-level gesture modeling methods, such as Hidden Markov Models and Neural Networks, are used to model postures and dynamic gestures.

Most existing methods for modeling gestures deal only with unimodal gestures, i.e., all gestures in the vocabulary are either postures, dynamic gestures, or parameterized gestures. To fully harness the power of gestures and to build an intelligent interface, we propose a high-level probabilistic framework to model heterogeneous gestures. Each low-level unimodal gesture is represented as a gesture word, while a gesture sentence is composed of a series of temporally and contextually constrained gesture words.

We built a system based on the VICs paradigm and carried out experiments to test the proposed methods. The experiments involved sixteen users and fourteen low-level gestures. The experimental results show that our method can robustly model multi-modal gestures based on localized parsers.

BIO:

Guangqi Ye received the BE degree in computer science from Tsinghua University, China, in 1998, and the MSE degree in computer science from the Johns Hopkins University in 2002. He is currently a PhD candidate in the Department of Computer Science at the Johns Hopkins University. His research interests include computer vision, human-computer interaction, and pattern recognition. He is a student member of the IEEE.


10:45-11:15: Flexible Geometric Image Coding through Matching Pursuit
ROSA M. FIGUERAS i VENTURA, Signal Processing Institute, Swiss Federal Institute of Technology (EPFL)

ABSTRACT:

In natural images, most of the information needed to recognize a scene is concentrated in object edges. Thus, efficient modeling of natural images should place special emphasis on the efficient approximation of contours. Image contours can be considered as geometric singularities, smooth along one direction with a discontinuity along the orthogonal direction.

Typically, images are represented as a sum of simpler pieces, or basis functions. Common state-of-the-art image coding approaches, typically based on separable bases, are not adapted to exploit the geometry of edges. Indeed, separable decompositions fail to detect the regularity of contours. The use of basis functions with arbitrary anisotropy and rotation, able to exploit the geometry of edges, reduces the number of functions needed to approximate a given edge, providing sparser image approximations. Sparse image approximations are useful in a large number of applications, such as image coding and denoising.

This talk presents a flexible image coder that provides geometrical image approximations. The approximation is obtained by using a set of basis functions (a dictionary) that includes arbitrary anisotropy and rotations. This dictionary is created by applying geometrical transformations (translation, anisotropic scaling, and rotation) to a mother-wavelet-like basis function, forming an overcomplete basis. Because this dictionary is overcomplete, there is no unique solution for signal representation. Of all the available solutions, the desired approximation is the sparsest, but finding it is an NP-hard problem. To approximate this sparsest solution, a suboptimal greedy algorithm called Matching Pursuit (MP) is used to decompose the image over the selected dictionary. Because the basis function indices are parametric and have a geometrical meaning, this coder provides a parametric description of the image, which allows geometrical transformations of the image directly in the transformed domain. In addition, thanks to the coefficient distribution that Matching Pursuit provides, and to a specific coding of the MP coefficients and function indices, this coder gives a bit-stream that is scalable in both PSNR and bit-rate, while allowing the image to be reconstructed at any resolution at the receiver.
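
For readers unfamiliar with Matching Pursuit, the greedy selection step can be sketched as follows. This is only a minimal illustration on a toy 2-D signal, assuming a small hand-made dictionary of unit-norm atoms; it is not the anisotropic image dictionary or the coder described in the talk, and the names matching_pursuit, atoms, and signal are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def matching_pursuit(signal, dictionary, n_iter):
    """Greedy approximation: at each step, pick the atom most correlated
    with the current residual and subtract its contribution.
    Assumption: every atom in `dictionary` has unit norm."""
    residual = list(signal)
    terms = []  # list of (atom index, coefficient) pairs
    for _ in range(n_iter):
        # Inner product of the residual with every atom
        corrs = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(corrs[k]))
        coeff = corrs[best]
        terms.append((best, coeff))
        # Remove the selected atom's contribution from the residual
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return terms, residual

# Toy overcomplete dictionary in R^2: three atoms for a 2-D space
atoms = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
signal = [2.0, 2.0]
terms, residual = matching_pursuit(signal, atoms, n_iter=1)
print(terms, residual)
```

In this toy case the diagonal atom alone captures the signal, whereas the two axis-aligned atoms would need two terms, which is the kind of sparsity gain an overcomplete dictionary is meant to offer.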

BIO:

Rosa M. Figueras i Ventura received her Master of Science in Telecommunication Engineering from Escola Tècnica Superior d'Enginyers de Telecomunicació de Barcelona, UPC (Technical University of Catalonia) in 2000. During the academic year 1999-2000 she was an exchange student at the Swiss Federal Institute of Technology (EPFL), where she completed her master's thesis on image coding. She then pursued a Ph.D. in Signal and Image Processing at the Signal Processing Institute (ITS, EPFL), where she carried out research on Sparse Image Approximation and Coding and also worked as a teaching and research assistant. Rosa Maria received her Ph.D. in July 2005. She now wishes to continue her research career in the domain of Signal and Image Processing.


11:15-11:45: Coordinated Sensor Deployment for Improving Secure Communications and Sensing Coverage
YINIAN MAO, Department of Electrical and Computer Engineering, University of Maryland, College Park

ABSTRACT:

Sensor networks have great potential in applications such as habitat monitoring, wildlife tracking, building surveillance, and military combat. The design of a sensor network system involves several important issues, including sensing coverage, node-to-node or node-to-base-station communications, and security in information gathering and relay by the sensors. In this talk, I will show that system performance in these respects depends closely on how the sensors are deployed in the field, and on how the sensor locations can be adjusted after the initial deployment.

For static sensor deployment, we investigate the hexagonal and square lattice topologies and analyze their impact on secure connectivity and sensing coverage. For advanced sensing devices that allow for location adjustment after deployment, we propose the Weighted Centroid algorithm, which can jointly improve sensing coverage and secure connectivity. This algorithm is an adaptation of the Lloyd-Max quantization algorithm to building secure sensor networks. I shall show a number of simulation results that demonstrate the effectiveness of the proposed algorithm.
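
As background on the Lloyd-style iteration mentioned above, the sketch below shows the plain, unweighted version: each field point is assigned to its nearest sensor, and each sensor then moves to the centroid of its assigned points. It is not the Weighted Centroid algorithm itself, whose weights for secure connectivity are not specified in this abstract; the sample field, the starting positions, and the function names are assumptions made for illustration.

```python
def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def lloyd_step(sensors, points):
    """One unweighted Lloyd iteration: nearest-sensor assignment,
    then move each sensor to the centroid of its cell."""
    cells = [[] for _ in sensors]
    for p in points:
        nearest = min(range(len(sensors)), key=lambda i: sq_dist(sensors[i], p))
        cells[nearest].append(p)
    moved = []
    for s, cell in zip(sensors, cells):
        if cell:
            moved.append((sum(p[0] for p in cell) / len(cell),
                          sum(p[1] for p in cell) / len(cell)))
        else:
            moved.append(s)  # a sensor with an empty cell stays in place
    return moved

def mean_sq_distance(sensors, points):
    """Average squared distance from each field point to its nearest sensor,
    used here as a simple stand-in for a coverage measure."""
    return sum(min(sq_dist(s, p) for s in sensors) for p in points) / len(points)

# Hypothetical field: a 10 x 10 grid of sample points on the unit square
points = [((i + 0.5) / 10, (j + 0.5) / 10) for i in range(10) for j in range(10)]
# Hypothetical initial deployment: four sensors clustered in one corner
sensors = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)]

before = mean_sq_distance(sensors, points)
for _ in range(20):
    sensors = lloyd_step(sensors, points)
after = mean_sq_distance(sensors, points)
print(before, after)
```

Each iteration cannot increase the average squared distance, so the sensors spread out from the corner toward a more even coverage of the field.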

If time permits, I shall also briefly introduce our work on constructing and analyzing robust and secure image hashing, for applications such as image authentication and image/video watermarking; and provide a quick overview of other research projects that I am involved in.

BIO:

Yinian Mao received the B.E. degree in electrical engineering from Tsinghua University, Beijing, China, in 2001. He is currently working towards his Ph.D. degree in signal processing and communications in the Department of Electrical and Computer Engineering at the University of Maryland, College Park. He was a research intern at Microsoft Research (Redmond, WA) in 2004. His research interests include information security and multimedia signal processing. Mr. Mao is a co-author of a paper on media security that won the Student Paper Contest at the 2005 International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05).


11:45-12:15: Audio-Visual Interactions in Communications
PETAR S. ALEKSIC, Department of Electrical and Computer Engineering, Northwestern University

ABSTRACT:

This presentation describes several human-computer interaction applications that exploit joint processing of audio and visual speech signals, which I have investigated in my research. In particular, it focuses on audio-visual automatic speech recognition (AV-ASR), speech-to-video synthesis, and audio-visual biometrics. It especially explores these applications in relation to the MPEG-4 compliant facial animation parameters (FAPs). For each of the applications, the developed system and experimental setup are described, and the results presented.

I developed a large-vocabulary, continuous AV-ASR system, which utilizes multi-stream HMMs to combine the visual speech information contained in lip movement with the acoustic speech information. I also developed an automatic and robust method for lip tracking that does not require hand labeling or extensive training procedures. The method combines active contour and template algorithms.

A novel speech-to-video synthesis system that exploits correlation between acoustic and visual speech signals to generate synthetic talking faces that are directly driven by the acoustic signal is also described. The main contribution of this work is the development of the correlation HMM system, which maps acoustic into visual HMM state sequences. I also developed an audio-visual speaker recognition system that utilizes dynamic visual speech information, contained in the FAPs describing lip movement, in addition to audio in order to improve speaker recognition performance.

BIO:

Petar S. Aleksic received the B.S. degree in electrical engineering from the University of Belgrade, Serbia, in 1999, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University in 2001 and 2004, respectively. He has been a member of the Image and Video Processing Lab at Northwestern University since 1999, where he is currently a postdoctoral fellow. His primary research interests include multimedia communications, computer vision, and pattern recognition. In particular, he focuses on investigating visual feature extraction and analysis, audio-visual speech recognition, speech-to-video synthesis, audio-visual biometrics, and facial expression recognition.


1:30-2:00: 3-D Feature Extraction for Audio-Visual Speech Recognition
DIMITRI BITOUK, Center for Imaging Science, the Johns Hopkins University

ABSTRACT:

Audio-visual speech recognition (AVSR) aims to improve the performance of conventional speech recognition by incorporating visual information. One of the major challenges in AVSR is visual feature extraction. Almost all of the approaches to visual feature extraction introduced so far suffer from a fundamentally limited 2-D representation.

The focus of this talk is on the development of 3-D methods for visual speech recognition, emphasizing the creation of an efficient view-independent representation of the speaker's appearance and facial motion. The major advantage of this approach is that it allows tracking and recognition of articulatory facial motion invariant to the speaker's pose and the illumination conditions in the scene. At the end of the talk, the use of such 3-D visual features in large-vocabulary AVSR will be discussed.

BIO:

Dimitri Bitouk expects to receive his PhD in Electrical and Computer Engineering from The Johns Hopkins University in Baltimore, MD, in Fall 2005. He received his Master's degree in Physics from Moscow State University in 1999. For the last five years, his research at The Center for Imaging Science has concentrated on various problems in computer vision and image understanding, including visual speech recognition, 3-D face tracking, and automatic target recognition (ATR).


2:00-2:30: Building Topic Specific Language Models from Webdata
ABHINAV SETHY, Department of Electrical Engineering, University of Southern California

ABSTRACT:

The ability to build task-specific language models rapidly and with minimal human effort is an important factor for fast deployment of natural language processing applications, such as speech recognition, in different domains. Although in-domain data is difficult to gather, we can utilize easily accessible large sources of generic text, such as the Internet (WWW) or the GigaWord corpus, to build statistical task language models through appropriate data selection and filtering methods. We propose a query generation and data weighting strategy that iteratively acquires data from such sources using a set of adaptive models, greatly improving on the performance achieved by models built from limited in-domain data.

The proposed query generation mechanism utilizes Relative Entropy to extend measures such as TFIDF to larger text contexts and weighted utterances/data sets. Our method also models the data source properties by tracking the performance of queries in every iteration. The data obtained from these sources is weighted in terms of its fit to the topic/domain and merged with existing models in an iterative fashion. The fitness to the task is evaluated using a combination of features in a positive-only classification framework using SVMs. By including features that measure speech recognizer confusability, we attempt to select data that helps build a better discriminative language model for speech recognition. In some speech recognition applications, such as spoken document retrieval and automated call centers, it is possible to acquire a lot of raw speech data. The manual annotation effort required to convert this speech data into text is costly and time-consuming. We present ways to merge the data acquisition process with unsupervised adaptation and active learning methods to significantly reduce the annotation requirement by selecting a smaller subset of the raw speech data for annotation.
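
To make the role of relative entropy in term selection more concrete, here is a generic sketch that ranks words by their contribution to the relative entropy between an in-domain unigram model and a background unigram model. The abstract does not give the actual query generation formula, so this is an assumed, simplified stand-in; the add-alpha smoothing, the toy sentences, and the function names are all hypothetical.

```python
import math
from collections import Counter

def unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def rank_query_terms(in_domain, background):
    """Rank words by their per-word contribution p(w) * log(p(w) / q(w))
    to the relative entropy D(p || q), where p is the in-domain model
    and q is the background model."""
    vocab = set(in_domain) | set(background)
    p = unigram(in_domain, vocab)
    q = unigram(background, vocab)
    scores = {w: p[w] * math.log(p[w] / q[w]) for w in vocab}
    return sorted(vocab, key=lambda w: scores[w], reverse=True)

# Toy data standing in for limited in-domain text and generic background text
in_domain = "patient reports chest pain and the doctor orders the test".split()
background = "the cat sat on the mat and the dog sat on the log".split()

ranked = rank_query_terms(in_domain, background)
print(ranked[:3])
```

Words that are frequent in the in-domain text but rare in the background text rise to the top and would be natural candidates for search queries, while common function words fall to the bottom.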

BIO:

Abhinav Sethy is a PhD candidate at the University of Southern California (USC), working with Prof. Shrikanth Narayanan in the Department of Electrical Engineering. He received his B.Tech degree in Electrical Engineering from the Indian Institute of Technology (Delhi) in 1999. He has previously worked at Adobe Systems, India, and interned at the IBM T.J. Watson Research Center. His research interests include data mining for NLP applications, acoustic modeling for speech recognition and speech pedagogy applications, and learning from unlabeled data.


2:30-3:00: The Maximum Likelihood Set: A Novel Approach to Language Modeling
DAMIANOS KARAKOS, Center for Language and Speech Processing, the Johns Hopkins University

ABSTRACT:

A recurring problem in statistical language modeling and clustering of natural language texts is data sparseness. The distribution of words in a document is usually modeled using a non-parametric probability mass function (pmf), which must be estimated from sample text. The dimension of such a pmf (the vocabulary size) is often in the tens of thousands, while a document itself may be just a few thousand words long, leading to severe data sparseness problems. In this talk, we will describe a novel method for density estimation, which is based on the computation of a set of probability distributions. In essence, this set, which we call the Maximum Likelihood Set (or MLS for short), contains all pmfs under which the observed word-counts are more likely than any other set of word-counts possible for the same amount of data. We will discuss the properties of the MLS, as well as a way of choosing one of its pmfs as an estimate. We will also present some ongoing work on the application of this method to statistical language modeling.
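
The verbal definition of the MLS above can be restated symbolically. The following is only a sketch under an assumption not stated in the abstract, namely that the n observed words are modeled as independent draws, so that the word-counts follow a multinomial distribution. The symbols c (observed counts), V (vocabulary size), and \Delta_V (the set of pmfs over V words) are notation introduced here, and the comparison is written as "at least as likely".

```latex
\[
\mathcal{M}(c) = \Bigl\{\, p \in \Delta_V \;:\;
  \Pr\nolimits_p(c) \ge \Pr\nolimits_p(c')
  \ \text{for all nonnegative integer count vectors } c'
  \text{ with } \textstyle\sum_{i=1}^{V} c'_i = n \,\Bigr\},
\qquad
\Pr\nolimits_p(c) = \frac{n!}{\prod_{i=1}^{V} c_i!}\,\prod_{i=1}^{V} p_i^{c_i}.
\]
```

Under this reading, the MLS is a set of candidate pmfs rather than a single point estimate, which is why the talk also addresses how to choose one of its members.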

BIO:

Damianos Karakos obtained his BSc degree in Computer Science from the University of Crete, Greece, in 1995, and the MSc and PhD degrees in Electrical Engineering from the University of Maryland, College Park, in 1998 and 2002, respectively. Since 2003 he has been working as a postdoctoral fellow at the Center for Language and Speech Processing, Johns Hopkins University, on problems in document clustering, language modeling, word-sense disambiguation, and machine translation. He is also interested in information theory, machine learning, and signal/image processing.



Related Links:
http://www.watson.ibm.com
http://www.research.ibm.com/compsci/uit
http://www.research.ibm.com/pics/signal
