Research Projects NTT-MIT Research Collaboration: a partnership in the future of communication and computation

Adaptive Man-Machine Interfaces

MIT9904-15

Start date: 07/99

Tomaso Poggio
MIT AI Lab

Norihiro Hagita
NTT

Project summary


Mary 101, A Videorealistic Text-to-Audiovisual Speech Synthesizer: The goal of this project is to create a videorealistic text-to-audiovisual speech synthesizer. The system should take as input any typed sentence and produce as output an audiovisual movie of a face enunciating that sentence. By videorealistic we mean that the final audiovisual output should look as if it were a video-camera recording of a talking human subject.

Project description


 

Prior Work

Much of the previous work in text-to-audiovisual (TTAVS) speech synthesis has focused on integrating physically based facial models with a particular speech synthesis system to give the impression of a "talking face". Some TTAVS systems have also used Cyberware scanning techniques to overlay realistic-looking skin texture on top of the underlying graphics model.

In previous work, we explored an image-based, morphing approach to facial synthesis, in an attempt to bypass the need for any underlying 3D physical models. Our talking facial model consisted of a collection of viseme images and the set of optical flow vectors defining the morph transition paths from every viseme to every other viseme. A many-to-one map was assumed between the set of phonemes and the set of visemes.
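The morph between two visemes can be sketched as follows. This is a minimal illustration, assuming precomputed optical flow and grayscale viseme images; the function name, nearest-neighbour warping, and simple cross-dissolve are our simplifications, not the system's actual implementation:

```python
import numpy as np

def morph_frames(viseme_a, viseme_b, flow, alpha):
    """Generate one intermediate frame of a morph between two viseme
    images, given an optical flow field mapping A to B.

    viseme_a, viseme_b : (H, W) grayscale images as float arrays
    flow               : (H, W, 2) flow vectors (dy, dx) from A to B
    alpha              : morph parameter in [0, 1]
    """
    h, w = viseme_a.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Sample A a fraction alpha of the way along the flow, and B the
    # remaining (1 - alpha) of the way back; nearest-neighbour sampling
    # keeps the sketch simple (a real system would interpolate).
    ay = np.clip((ys + alpha * flow[..., 0]).round().astype(int), 0, h - 1)
    ax = np.clip((xs + alpha * flow[..., 1]).round().astype(int), 0, w - 1)
    by = np.clip((ys - (1 - alpha) * flow[..., 0]).round().astype(int), 0, h - 1)
    bx = np.clip((xs - (1 - alpha) * flow[..., 1]).round().astype(int), 0, w - 1)

    warped_a = viseme_a[ay, ax]
    warped_b = viseme_b[by, bx]

    # Cross-dissolve the two warped images.
    return (1 - alpha) * warped_a + alpha * warped_b
```

Sweeping alpha from 0 to 1 produces the sequence of intermediate frames for one viseme-to-viseme transition.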

Our Approach

While our earlier work made significant strides towards photorealism, it did not address the dynamic aspects of mouth motion. Modelling dynamics requires addressing a phenomenon termed coarticulation, in which the visual manifestation of a particular phone is affected by its preceding and following context.

To model lip dynamics, we propose an unsupervised learning framework for estimating the parameters of a dynamic speech production model. Our new approach consists of three steps:

  1. Morphable Model of the Lips: we first record a training corpus of a human speaker uttering various sentences naturally. Motivated by recent progress in the creation of statistical shape-appearance models of flexible objects, we build a flexible shape-appearance model of the speaker's lips. We then analyze the entire corpus using the model, yielding a low-dimensional time-series of the lip shape over the entire corpus.
  2. HMM Speech Production Model: next, we hypothesize a dynamic speech production model based on hidden Markov models (HMMs). Each phone is a three-state left-to-right HMM with Gaussian emissions. Baum-Welch training is used to learn the parameters of each phone model from the labeled time-series computed in step 1 above.
  3. Synthesis: finally, we propose a synthesis algorithm that generates novel visual utterances from input text. The appropriate HMM phone models are concatenated, and a smooth parameter trajectory is generated by computing the most likely emission outputs of the concatenated HMM given the known state sequence.
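The synthesis step can be illustrated with a toy sketch. For a Gaussian emission with a known state, the most likely output is simply that state's mean; here a moving-average filter stands in for the paper's actual smoothing criterion, and all model names and values are hypothetical:

```python
import numpy as np

def synthesize_trajectory(phone_models, phone_sequence,
                          frames_per_state=5, smooth_window=3):
    """Toy trajectory synthesis from concatenated left-to-right phone HMMs.

    phone_models   : dict mapping phone name -> (n_states, dim) array of
                     Gaussian emission means, one row per HMM state
    phone_sequence : list of phone names to concatenate
    """
    # For a known state sequence, the most likely Gaussian emission in
    # each state is its mean; dwell a fixed number of frames per state.
    means = []
    for phone in phone_sequence:
        for state_mean in phone_models[phone]:   # left-to-right states
            means.extend([state_mean] * frames_per_state)
    traj = np.asarray(means, dtype=float)

    # Moving-average smoothing along time: a crude surrogate for the
    # trajectory optimization used in the real system.
    kernel = np.ones(smooth_window) / smooth_window
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, traj)
```

The resulting parameter trajectory would then drive the shape-appearance model from step 1 to render the output frames.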

Demos, movies and other examples


The following sequences are a sample of our results. They are sentences produced by Mary101 that were never uttered by the original speaker.

"jam"
"more news in a moment"
"the meeting was frank"

The following sequences are preliminary results from our work on a 3D talking face:

"houston, we have a problem"
"12345"
"12345"

 

The principal investigators


Presentations and posters


"From Bits to Information: Machine Learning Theory and Applications", Tomaso Poggio, presented at IEEE Kansai Chapter, July 2000.

"Adaptive Man-Machine Interfaces", Tony Ezzat and Tomaso Poggio, May 2000.

"Adaptive Man-Machine Interfaces", Tony Ezzat and Tomaso Poggio, NTT, Musashino, Japan, January 2000.

Publications


"Visual Speech Synthesis by Morphing Visemes", Tony Ezzat and Tomaso Poggio, MIT AI Memo No 1658/CBCL Memo No 173, May 1999.

"MikeTalk: A Talking Facial Display Based on Morphing Visemes", Tony Ezzat and Tomaso Poggio, Proceedings of the Computer Animation Conference Philadelphia, PA, June 1998.

"Videorealistic Talking Faces: A Morphing Approach", Tony Ezzat and Tomaso Poggio, Proceedings of the Audiovisual Speech Processing Workshop, Rhodes, Greece, September 1997.

Proposals and progress reports


Proposals:

NTT Bi-Annual Progress Report, July to December 1999:

NTT Bi-Annual Progress Report, January to June 2000:

NTT Bi-Annual Progress Report, July to December 2000:

NTT Bi-Annual Progress Report, January to June 2001:

NTT Bi-Annual Progress Report, July to December 2001:

NTT Bi-Annual Progress Report, January to June 2002:

NTT Bi-Annual Progress Report, July to December 2002:

For more information