Adaptive Man-Machine Interfaces (MIT9904-15). Start date: 07/99
Tomaso Poggio, MIT AI Lab; Norihiro Hagita, NTT
Project summary
Mary 101, a Videorealistic Text-to-Audiovisual Speech Synthesizer: The goal of this project is to create a videorealistic text-to-audiovisual speech synthesizer. The system takes as input any typed sentence and produces as output an audiovisual movie of a face enunciating that sentence. By videorealistic we mean that the final audiovisual output should look as if it were a video-camera recording of a talking human subject.
Project description
Prior Work

Much of the previous work in text-to-audiovisual (TTAVS) speech synthesis has focused on integrating physically based facial models with a particular speech synthesis system in order to give the impression of a "talking face". Some TTAVS systems have also used Cyberware scanning techniques to overlay realistic-looking skin texture on top of the underlying graphics model. In previous work, we explored an image-based morphing approach to facial synthesis, in an attempt to bypass the need for any underlying 3D physical models. Our talking facial model comprised a collection of viseme imagery and the set of optical flow vectors defining the morph transition paths from every viseme to every other viseme. A many-to-one map was assumed between the set of phonemes and the set of visemes.

Our Approach

While our earlier work made strong strides toward photorealism, it did not address the dynamic aspects of mouth motion. Modelling dynamics requires addressing a phenomenon termed coarticulation, in which the visual manifestation of a particular phone is affected by its preceding and following context. To model lip dynamics, we propose an unsupervised learning framework that learns the parameters of a dynamic speech production model. Our new approach is composed of three substeps:
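As a toy illustration of the viseme representation described above, the sketch below shows a many-to-one phoneme-to-viseme map and a simple flow-based morph between two viseme frames. All names, the phoneme groupings, and the 1-D "images" and flow field are hypothetical simplifications for illustration, not the project's actual data or code.

```python
# Toy sketch of the viseme-morphing idea (hypothetical names/values).

# Many-to-one map: phonemes that look alike on the lips share one viseme.
# Groupings here are illustrative, e.g. /p/, /b/, /m/ are all bilabial.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "aa": "V_open", "ae": "V_open",
    "iy": "V_spread", "ih": "V_spread",
}

def visemes_for(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

def morph_frame(src, dst, flow, alpha):
    """Toy 1-D morph between two viseme frames.

    Backward-warps `src` part-way along a per-pixel flow field scaled
    by alpha in [0, 1], then cross-dissolves with `dst`.  The real
    system uses 2-D optical flow fields between viseme images.
    """
    n = len(src)
    warped = [src[min(n - 1, max(0, int(round(i - alpha * flow[i]))))]
              for i in range(n)]
    return [(1 - alpha) * w + alpha * d for w, d in zip(warped, dst)]
```

Sweeping alpha from 0 to 1 and rendering each intermediate frame yields the morph transition from one viseme to the next; coarticulation is what this simple pairwise scheme does not capture, motivating the learning framework above.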
Demos, movies and other examples
The following sequences are a sample of our results. They are sentences produced by Mary 101 that were never uttered by the original speaker.
"jam"
"more news in a moment"
"the meeting was frank"
The following sequences are preliminary results from our work on a 3D talking face:
"houston, we have a problem"
"12345"
"12345"
The principal investigators
Presentations and posters
"Adaptive Man-Machine Interfaces", Tony Ezzat and Tomaso Poggio, May 2000.
"Adaptive Man-Machine Interfaces", Tony Ezzat and Tomaso Poggio, NTT, Musashino, Japan, January 2000.
Publications
"Visual Speech Synthesis by Morphing Visemes", Tony Ezzat and Tomaso Poggio, MIT AI Memo No 1658/CBCL Memo No 173, May 1999.
"MikeTalk: A Talking Facial Display Based on Morphing Visemes", Tony Ezzat and Tomaso Poggio, Proceedings of the Computer Animation Conference Philadelphia, PA, June 1998.
"Videorealistic Talking Faces: A Morphing Approach", Tony Ezzat and Tomaso Poggio, Proceedings of the Audiovisual Speech Processing Workshop, Rhodes, Greece, September 1997.
Proposals and progress reports
Proposals:
NTT Bi-Annual Progress Report, July to December 1999:
NTT Bi-Annual Progress Report, January to June 2000:
NTT Bi-Annual Progress Report, July to December 2000:
NTT Bi-Annual Progress Report, January to June 2001:
NTT Bi-Annual Progress Report, July to December 2001:
NTT Bi-Annual Progress Report, January to June 2002:
NTT Bi-Annual Progress Report, July to December 2002:
For more information