IBM Presentations
| TIME | SPEAKER | TITLE |
| 3:00-3:05 | Gerasimos Potamianos | Introduction of IBM presentations |
| 3:05-3:30 | David Nahamoo | Overview of IBM Research / UIT Areas |
| 3:30-3:40 | BREAK | |
| 3:40-4:00 | Ellen Eide | Text-To-Speech |
| 4:00-4:20 | Ying Li | Automobile Damage Detection |
| 4:20-4:40 | Jason Pelacanos | Conversational Biometrics |
| 4:40-5:00 | Gerasimos Potamianos | Audio-Visual Speech Recognition |
| 5:00 | Alain Biem | Closing Remarks |
DETAILS OF INVITED STUDENT / POSTDOC PRESENTATIONS:
9:30-10:00: Virtual Rear Projection: Technology and Evaluation
JAY W. SUMMET, College of Computing, Georgia Institute of Technology
ABSTRACT:
Virtual Rear Projection uses multiple redundant front-projectors
to simulate a rear-projected experience on surfaces where true
rear-projection is impossible or cost-prohibitive (e.g., external
walls or concrete floors). By detecting users as they occlude
individual projectors, the system can turn off the light falling
on them and fill in the resulting shadows on the display surface
using other, non-occluded projectors. This talk will discuss the technology
we have developed for VRP systems along with work to evaluate
the user experience aspect of the technology, describing how user
evaluations have motivated our technological development path.
More Information: http://www.cc.gatech.edu/cpl/vrp/
BIO:
Jay Summet is a PhD student at the Georgia Institute of Technology,
co-advised by Gregory Abowd and Jim Rehg. His research involves
the development and evaluation of projector-based technology for
display and tracking. Jay holds an MS degree in End User Visual
Programming from Oregon State University and a BS in Scientific Computing
from Central Washington University, and has interned at Pacific
Northwest National Laboratories, Intel Research, and Mitsubishi
Electric Research Laboratories.
More Information: http://www.cc.gatech.edu/~summetj/cv.html
10:00-10:30: Robust Modeling of Heterogeneous Gestures Using Localized Parsers
GUANGQI YE, Department of Computer Science, the Johns Hopkins
University
ABSTRACT:
With the ubiquity of powerful computers and rapid advances in
sensing and human-computer interaction technologies, there is
great potential for creating intelligent computing environments.
Vision, like speech, has emerged as a convenient and effective
means of interaction and control that can augment or even replace
traditional input devices such as keyboards and mice. We propose
a new methodology for vision-based human-computer interaction
called the Visual Interaction Cues (VICs) paradigm. VICs relies
fundamentally on a perceptual space, observed by cameras, that is
shared between the user and the computer.
In this space, each interface component is represented as a
localized region in the image(s). Thus, we can efficiently extract
local visual cues to model gesture using localized parsers without
globally tracking the user. In this talk, we present a novel approach
to efficiently capture hand shape and motion. Based on extracted
features, low-level gesture modeling methods, such as Hidden Markov
Models and Neural Networks, are used to model postures and dynamic
gestures.
Most existing methods to model gestures only deal with unimodal
gestures, i.e., all gestures in the vocabulary are either postures,
dynamic gestures or parameterized gestures. To fully harness the
power of gestures and to build an intelligent interface, we propose
a high-level probabilistic framework to model heterogeneous gestures.
Each low-level unimodal gesture is represented as a gesture word,
while a gesture sentence is composed as a series of temporally
and contextually constrained gesture words.
We built a system based on the VICs paradigm and carried out
experiments to test the proposed methods. The experiments involve
sixteen users and fourteen low-level gestures. The experimental
results show that our method can robustly model multi-modal gestures
based on localized parsers.
BIO:
Guangqi Ye received the BE degree in computer science from Tsinghua
University, China, in 1998, and the MSE degree in computer science
from the Johns Hopkins University in 2002. Currently
he is a PhD candidate in the Department of Computer Science at
the Johns Hopkins University. His research interests include computer
vision, human-computer interaction, and pattern recognition. He
is a student member of the IEEE.
10:45-11:15: Flexible Geometric Image Coding through Matching Pursuit
ROSA M. FIGUERAS i VENTURA, Signal Processing Institute, Swiss
Federal Institute of Technology (EPFL)
ABSTRACT:
Natural images have most of the information needed to recognize
a scene concentrated in object edges. Thus, efficient modeling
of natural images should place special emphasis on the efficient
approximation of contours. Image contours can be considered as
geometric singularities, smooth along one direction with a discontinuity
along the orthogonal direction.
Typically, images are represented as a sum of simpler pieces,
or basis functions. Common state-of-the-art image coding approaches,
typically based on separable bases, are not adapted to exploit
the geometry of edges. Indeed, separable decompositions fail to
detect the regularity of contours. The use of basis functions
with arbitrary anisotropy and rotation, able to exploit the geometry
of edges, reduces the number of functions needed to approximate
a given edge, providing sparser image approximations. Sparse image
approximations allow for a large number of applications, such
as image coding or denoising.
This talk presents a flexible image coder that provides geometrical
image approximations. The approximation is obtained by using a
set of basis functions (dictionary) that include arbitrary anisotropy
and rotations. This dictionary is created by applying geometrical
transformations (translation, anisotropic scaling and rotation)
to a mother-wavelet-like basis function, forming an overcomplete
basis. As this dictionary is overcomplete, there is no unique
solution for signal representation. From all the available solutions,
the desired approximation is the sparsest, but finding it is an
NP-hard problem. To approach this sparsest solution,
a suboptimal algorithm, called Matching Pursuit, is used to approximate
the image in the selected dictionary. As the basis function indices
are parametric and with a geometrical meaning, this coder provides
a parametric description of the image. Thanks to this, it allows
for geometrical transformations of the image directly in the transformed
domain. In addition, thanks to the coefficient distribution that
Matching Pursuit provides, and to a specific coding of the MP
coefficients and function indices, this coder gives a scalable
bit-stream, both in PSNR and in bit-rate, while it allows reconstructing
the image at any resolution at the receiver.
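As a rough illustration of the greedy selection step that Matching Pursuit performs, the following is a minimal 1-D sketch, not the coder presented in the talk. It assumes a tiny hand-made overcomplete dictionary (`toy_atoms`, a hypothetical name) of unit-norm vectors in place of the anisotropic, rotated 2-D atoms described in the abstract.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(signal, atoms, n_iter):
    """Greedy approximation over a dictionary of unit-norm atoms.

    At each iteration, pick the atom most correlated with the current
    residual, record its coefficient, and subtract its contribution.
    Returns the list of (atom index, coefficient) pairs and the residual.
    """
    residual = list(signal)
    picks = []
    for _ in range(n_iter):
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(scores[k]))
        coeff = scores[best]
        residual = [r - coeff * a for r, a in zip(residual, atoms[best])]
        picks.append((best, coeff))
    return picks, residual


# Hypothetical toy dictionary: four spikes plus three step-like atoms.
# Seven atoms in a 4-dimensional space, so the set is overcomplete.
toy_atoms = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    normalize([1.0, 1.0, 0.0, 0.0]),
    normalize([0.0, 0.0, 1.0, 1.0]),
    normalize([1.0, 1.0, 1.0, 1.0]),
]

picks, residual = matching_pursuit([2.0, 2.0, 0.0, 0.0], toy_atoms, n_iter=1)
```

In this toy case the step-like atom covering the first two samples represents the signal with a single term, whereas the spike atoms alone would need two, which is the sparsity benefit of a dictionary adapted to the signal's structure.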
BIO:
Rosa M. Figueras i Ventura received her Master of Science in
Telecommunication Engineering from Escola Tècnica Superior d'Enginyers
de Telecomunicacio de Barcelona, UPC (Technical University of
Catalonia) in 2000. During the academic year 1999-2000 she was
an exchange student at the Swiss Federal Institute of Technology
(EPFL), where she completed her master's thesis on image
coding. She then pursued a Ph.D. in Signal and Image Processing at
the Signal Processing Institute (ITS, EPFL), carrying out research
on sparse image approximation and coding while also serving as a
teaching and research assistant. She received her Ph.D. in July 2005
and intends to continue her research career in signal and image
processing.
11:15-11:45: Coordinated Sensor Deployment for Improving Secure Communications and Sensing Coverage
YINIAN MAO, Department of Electrical and Computer Engineering,
University of Maryland, College Park
ABSTRACT:
Sensor networks have great potential in applications such as
habitat monitoring, wildlife tracking, building surveillance,
and military combat. The design of a sensor network system involves
several important issues, including the sensing coverage, node-to-node
or node-to-base-station communications, and the security in information
gathering and relay by the sensors. In this talk, I will show
that the system performance on these aspects depends closely on
how the sensors are deployed in the field, and on how the sensor
locations can be adjusted after the initial deployment.
For static sensor deployment, we investigate the hexagon and
square lattice topology and analyze their impact on secure connectivity
and sensing coverage. For advanced sensing devices that allow
for location adjustment after deployment, we propose the Weighted
Centroid algorithm, which can jointly improve sensing coverage
and secure connectivity. This algorithm is an adaptation of the
Lloyd-Max quantization algorithm to building secure sensor networks.
I shall show a number of simulation results that demonstrate the
effectiveness of the proposed algorithm.
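For readers unfamiliar with the Lloyd-Max iteration mentioned above, the following is a minimal sketch of a plain Lloyd step applied to sensor positions. It is not the Weighted Centroid algorithm from the talk, whose details the abstract does not give: the field is approximated by sample points, the optional `weights` are a hypothetical stand-in for whatever importance a deployment assigns to each point, and secure connectivity is not modeled at all.

```python
def nearest_sensor(sensors, point):
    """Index of the sensor closest to the given (x, y) point."""
    px, py = point
    return min(
        range(len(sensors)),
        key=lambda k: (sensors[k][0] - px) ** 2 + (sensors[k][1] - py) ** 2,
    )


def coverage_cost(sensors, points, weights=None):
    """Weighted mean squared distance from each point to its nearest sensor."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for (px, py), w in zip(points, weights):
        sx, sy = sensors[nearest_sensor(sensors, (px, py))]
        total += w * ((sx - px) ** 2 + (sy - py) ** 2)
    return total / sum(weights)


def lloyd_step(sensors, points, weights=None):
    """One Lloyd iteration: assign points to sensors, then move each
    sensor to the weighted centroid of the points assigned to it."""
    if weights is None:
        weights = [1.0] * len(points)
    sums = [[0.0, 0.0, 0.0] for _ in sensors]  # weighted x, weighted y, weight
    for (px, py), w in zip(points, weights):
        acc = sums[nearest_sensor(sensors, (px, py))]
        acc[0] += w * px
        acc[1] += w * py
        acc[2] += w
    # A sensor with no assigned points stays where it is.
    return [
        (sx / sw, sy / sw) if sw > 0 else sensor
        for (sx, sy, sw), sensor in zip(sums, sensors)
    ]


# Sample the unit square on a 10 x 10 grid; start with sensors bunched in a corner.
grid = [((i + 0.5) / 10, (j + 0.5) / 10) for i in range(10) for j in range(10)]
sensors = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2)]
for _ in range(10):
    sensors = lloyd_step(sensors, grid)
```

Each step can only lower (or keep) the weighted mean squared distance, so the sensors spread out from the corner toward positions that cover the field more evenly.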
If time permits, I shall also briefly introduce our work on constructing
and analyzing robust and secure image hashing, for applications
such as image authentication and image/video watermarking; and
provide a quick overview of other research projects that I am
involved in.
BIO:
Yinian Mao received the B.E. degree in electrical engineering
from Tsinghua University, Beijing, China, in 2001. He is currently
working towards his Ph.D. degree in signal processing and communications
at the Electrical and Computer Engineering Department of the University
of Maryland, College Park. He was a research intern at Microsoft
Research (Redmond, WA) in 2004. His research interests include
information security and multimedia signal processing. Mr. Mao
is a co-author of a paper on media security that won the Student
Paper Contest at the 2005 International Conference on Acoustics,
Speech, and Signal Processing (ICASSP'05).
11:45-12:15: Audio-Visual Interactions in Communications
PETAR S. ALEKSIC, Department of Electrical and Computer Engineering,
Northwestern University
ABSTRACT:
This presentation describes several human-computer interaction
applications that exploit joint processing of audio and visual
speech signals, which I have investigated in my research. In particular,
it focuses on audio-visual automatic speech recognition (AV-ASR),
speech-to-video synthesis, and audio-visual biometrics. It especially
explores these applications in relation to the MPEG-4 compliant
facial animation parameters (FAPs). For each of the applications,
the developed system and experimental setup are described, and
the results presented.
I developed a large vocabulary, continuous AV-ASR system, which
utilizes multi-stream HMMs to combine visual speech information
contained in the lip movement with the acoustic speech information.
An automatic and robust method for lip-tracking, which does not
require hand labeling or extensive training procedures, was also
developed. The method combines active contour and template algorithms.
A novel speech-to-video synthesis system that exploits correlation
between acoustic and visual speech signals to generate synthetic
talking faces that are directly driven by the acoustic signal
is also described. The main contribution of this work is the development
of the correlation HMM system, which maps acoustic into visual
HMM state sequences. I also developed an audio-visual speaker
recognition system that utilizes dynamic visual speech information,
contained in the FAPs describing lip movement, in addition to
audio in order to improve speaker recognition performance.
BIO:
Petar S. Aleksic received the B.S. degree in electrical engineering
from the University of Belgrade, Serbia, in 1999, and the M.S. and
Ph.D. degrees in electrical engineering from Northwestern University,
in 2001 and 2004, respectively. He has been a member of the Image
and Video Processing Lab at Northwestern University since 1999,
where he is currently a postdoctoral fellow. His primary research
interests include multimedia communications, computer vision,
and pattern recognition. In particular, he focuses on investigating
visual feature extraction and analysis, audio-visual speech recognition,
speech-to-video synthesis, audio-visual biometrics, and facial
expression recognition.
1:30-2:00: 3-D Feature Extraction for Audio-Visual Speech Recognition
DIMITRI BITOUK, Center for Imaging Science, the Johns Hopkins
University
ABSTRACT:
Audio-visual speech recognition (AVSR) aims to improve the performance
of conventional speech recognition by incorporating visual
information. One of the major challenges in AVSR is visual feature
extraction. Almost all of the approaches to visual feature extraction
introduced so far suffer from a fundamentally limited 2-D representation.
The focus of this talk is on the development of 3-D methods for
visual speech recognition, emphasizing the creation of an efficient
view-independent representation of the speaker's appearance and
facial motion. The major advantage of this approach is the fact
that it allows tracking and recognition of articulatory facial
motion invariant to the speaker's pose and illumination conditions
in the scene. At the end of the talk the use of such 3-D visual
features in large vocabulary AVSR will be discussed.
BIO:
Dimitri Bitouk expects to receive his PhD in Electrical and Computer
Engineering at The Johns Hopkins University in Baltimore, MD in
Fall 2005. He received his Master's degree in Physics from Moscow
State University in 1999. For the last 5 years, his research at
The Center for Imaging Science has concentrated on various problems
in computer vision and image understanding, including visual speech
recognition, 3-D face tracking and automatic target recognition
(ATR).
2:00-2:30: Building Topic Specific Language Models from Webdata
ABHINAV SETHY, Department of Electrical Engineering, University
of Southern California
ABSTRACT:
The ability to build task-specific language models, rapidly and
with minimal human effort, is an important factor for fast deployment
of natural language processing applications such as speech recognition
in different domains. Although in-domain data is difficult to
gather, we can utilize easily accessible large sources of generic
text such as the Internet (WWW) or the GigaWord corpus for building
statistical task language models by appropriate data selection
and filtering methods. We propose a query generation and data
weighting strategy which iteratively acquires data from such sources
using a set of adaptive models to greatly improve the performance
achieved from models built from limited in-domain data.
The proposed query generation mechanism utilizes Relative Entropy
to extend measures such as TFIDF to larger text contexts and weighted
utterances/data sets. Our method also models the data source properties
by tracking the performance of queries in every iteration. The
data obtained from these sources is weighted in terms of its fit
to the topic/domain and merged into existing models in an iterative
fashion. The fitness to the task is evaluated using a combination
of features in a positive-only classification framework using
SVMs. By including features which measure the speech recognizer
confusability we attempt to select data which helps build a better
discriminative language model for speech recognition. In some
speech recognition applications, such as spoken document retrieval
and automated call centers, it is possible to acquire a lot of raw speech
data. The manual annotation effort required to convert this speech
data into text is costly and time-consuming. We present ways to
merge the data acquisition process with unsupervised adaptation
and active learning methods to help reduce the annotation requirement
significantly by selecting a smaller subset from the raw speech
data for annotation.
BIO:
Abhinav Sethy is a PhD candidate at the University of Southern
California (USC), working with Prof. Shrikanth Narayanan in the
Department of Electrical Engineering. He received his B.Tech degree
in Electrical Engineering from the Indian Institute of Technology
(Delhi) in 1999. He previously worked at Adobe Systems, India,
and interned at the IBM T.J. Watson Research Center. His research interests
include data mining for NLP applications, acoustic modeling for
speech recognition and speech pedagogy applications, and learning
from unlabeled data.
2:30-3:00: The Maximum Likelihood Set: A Novel Approach to Language Modeling
DAMIANOS KARAKOS, Center for Language and Speech Processing, the
Johns Hopkins University
ABSTRACT:
A recurring problem in statistical language modeling and clustering
of natural language texts is data sparseness. The distribution
of words in a document is usually modeled using a non-parametric
probability mass function (pmf), which needs to be estimated from
sample text. The dimension of such a pmf (vocabulary size) is
often tens of thousands, while a document itself may be just a
few thousand words long, leading to severe data sparseness problems.
In this talk, we will describe a novel method for density estimation,
which is based on the computation of a set of probability distributions.
In essence, this set, which we call the Maximum Likelihood Set (or
MLS for short), contains all pmfs under which the observed word-counts
are more likely than any other set of word-counts possible for
the same amount of data. We will discuss the properties of the
MLS, as well as a way of choosing one of its pmfs as an estimate.
We will present some ongoing work in the application of this method
to statistical language modeling.
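The MLS definition above can be checked directly on a toy vocabulary. The sketch below is a brute-force membership test written from that definition alone, not the estimation method of the talk: it enumerates every count vector with the same total and compares multinomial likelihoods, which is feasible only for very small vocabularies. The function names and the tie tolerance are assumptions made for illustration.

```python
from math import factorial


def multinomial_prob(counts, pmf):
    """Probability of a count vector under a multinomial with the given pmf."""
    coef = factorial(sum(counts))
    prob = 1.0
    for c, p in zip(counts, pmf):
        coef //= factorial(c)  # stays an exact integer at every step
        prob *= p ** c
    return coef * prob


def count_vectors(total, size):
    """Yield every tuple of `size` non-negative integers summing to `total`."""
    if size == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in count_vectors(total - first, size - 1):
            yield (first,) + rest


def in_mls(pmf, counts, tol=1e-12):
    """True if the observed counts are at least as likely under `pmf`
    as every other count vector with the same total."""
    observed = multinomial_prob(counts, pmf)
    return all(
        observed >= multinomial_prob(other, pmf) - tol
        for other in count_vectors(sum(counts), len(counts))
    )


# Three-word vocabulary, four observed tokens, third word never seen.
observed_counts = (3, 1, 0)
empirical = (0.75, 0.25, 0.0)    # relative frequencies
smoothed = (0.7, 0.2, 0.1)       # gives the unseen word some mass
uniform = (1 / 3, 1 / 3, 1 / 3)  # ignores the data entirely
```

Here `in_mls(empirical, observed_counts)` and `in_mls(smoothed, observed_counts)` both hold, while `in_mls(uniform, observed_counts)` does not, which illustrates that the set can contain pmfs assigning non-zero probability to unseen words.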
BIO:
Damianos Karakos obtained his BSc degree in Computer Science
from the University of Crete, Greece in 1995, and the MSc and
PhD degrees in Electrical Engineering from the University of Maryland,
College Park, in 1998 and 2002, respectively. Since 2003 he has
been working as a postdoctoral fellow at the Center for Language
and Speech Processing, Johns Hopkins University, on problems in
document clustering, language modeling, word-sense disambiguation,
and machine translation. He is also interested in information
theory, machine learning, and signal/image processing.
Related Links:
http://www.watson.ibm.com
http://www.research.ibm.com/compsci/uit
http://www.research.ibm.com/pics/signal
Site hosted by IBM Research