NTT-MIT Research Collaboration: a partnership in the future of communication and computation

Research and Development of Multi-Lingual and Multi-Modal Conversational Interfaces


Start date: 07/2001

James Glass and Stephanie Seneff

Kiyoaki Aikawa and Mikio Nakano

Project summary

Developing human language technologies to support natural multi-lingual and multi-modal interactions between humans and machines.

Project description


Creating conversational systems that enable natural interactions with users currently requires significant expertise in the underlying human language technologies. To broaden the developer pool, researchers in the Spoken Language Systems group have started to develop a utility called SpeechBuilder, which will make it easier for both novice and experienced system developers to rapidly prototype new mixed-initiative conversational systems. Expanding the SpeechBuilder infrastructure to accommodate multi-lingual and multi-modal usage is a significant emphasis of the initial phase of this NTT/MIT project. Additional areas of research will involve corpus-based speech synthesis, flexible dialogue strategies, emotion recognition, and language learning assistance.

Demos, movies and other examples

Video: An example of a bilingual conversational system. This video demonstrates two talkers alternately talking to a weather system in English and Japanese. The system considers English and Japanese hypotheses in parallel to decide on the appropriate language and word sequence. Since the internal components of the conversational system are language-transparent, the discourse component can follow the interactions in both languages.
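The idea of considering hypotheses from both languages in parallel can be illustrated with a minimal sketch. This is not the project's actual recognizer: the `Hypothesis` class, the `pick_best` function, and the example scores are hypothetical, and the sketch assumes the recognizer scores are directly comparable across languages, which a real system would have to ensure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    language: str     # hypothetical language tag, e.g. "en" or "ja"
    words: List[str]  # hypothesized word sequence
    score: float      # higher is better; assumed comparable across languages


def pick_best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Return the highest-scoring hypothesis, regardless of language."""
    if not hypotheses:
        raise ValueError("no hypotheses to choose from")
    return max(hypotheses, key=lambda h: h.score)


# Made-up candidate lists for one utterance, one list per language.
english = [
    Hypothesis("en", ["what's", "the", "weather", "in", "boston"], -12.4),
    Hypothesis("en", ["what", "is", "whether", "in", "boston"], -15.1),
]
japanese = [
    Hypothesis("ja", ["bosuton", "no", "tenki", "wa"], -18.9),
    Hypothesis("ja", ["bosuton", "no", "denki", "wa"], -21.3),
]

# Pool both lists so language and word sequence are decided together.
best = pick_best(english + japanese)
print(best.language, " ".join(best.words))
```

Choosing over the pooled list means the language decision falls out of the same comparison that picks the word sequence, rather than requiring a separate language-identification step.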

You can try out the English language version of Jupiter over the phone.

Speaking Style: Jupiter is intended to recognize and understand natural, conversational speech, so it works best when you speak to it as you would to another person. You don't need to pause between words, overemphasize words (e.g., pronouncing them one syllable at a time), or speak in computerese (e.g., "weather boston" instead of "what's the weather in boston"). The system does less well if you shout, mumble, or speak softly; it is best to speak clearly, as you might to a young child. If your voice trails off (especially at the end of your sentence) or you pause extensively, the endpoint detector might clip your speech, which will make it harder for Jupiter to understand you.
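To see why soft or interrupted speech can be clipped, consider a simple energy-based detector. This is only an illustrative sketch and not Jupiter's actual detector: the function name `detect_endpoint`, the threshold, the silence limit, and the per-frame energy values are all assumptions made for the example.

```python
from typing import Sequence


def detect_endpoint(frame_energies: Sequence[float],
                    threshold: float = 0.1,
                    max_silence_frames: int = 3) -> int:
    """Return the index one past the last frame kept as speech.

    A frame counts as speech when its energy reaches `threshold`. Once
    speech has started, `max_silence_frames` consecutive quiet frames
    end the utterance, and everything after that point is discarded.
    """
    last_speech = 0
    silence_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            last_speech = i + 1
            silence_run = 0
        else:
            silence_run += 1
            if last_speech > 0 and silence_run >= max_silence_frames:
                break
    return last_speech


# Made-up per-frame energies for three ways of saying a sentence.
steady = [0.5, 0.6, 0.4, 0.3, 0.2, 0.05]
soft_ending = [0.5, 0.6, 0.4, 0.08, 0.07, 0.06]
long_pause = [0.5, 0.6, 0.4, 0.05, 0.04, 0.03, 0.5, 0.6]

print(detect_endpoint(steady))       # keeps the first 5 frames
print(detect_endpoint(soft_ending))  # keeps 3 frames; the soft ending is lost
print(detect_endpoint(long_pause))   # keeps 3 frames; speech after the pause is lost
```

Under these assumptions, the quiet final words fall below the threshold and are treated as silence, and a long pause ends the utterance before the speaker has finished.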

Here are some example sentences you can speak to Jupiter. Notice that the system will remember some aspects of your previous queries.

- What cities do you know about in California?
- How about in Japan?
- What will the temperature be in Boston tomorrow?
- What about the humidity?
- Are there any flood warnings in the United States?
- Where is it sunny in the Caribbean?
- What's the wind speed in Chicago?
- How about London?
- Can you give me the forecast for Seattle?
- Will it rain tomorrow in Denver?
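The way follow-up questions such as "What about the humidity?" reuse earlier context can be sketched as slot inheritance between turns. This is a hypothetical illustration, not Jupiter's discourse component: the `resolve` function and the slot names `attribute`, `city`, and `date` are assumptions made for the example.

```python
from typing import Dict, Optional

Frame = Dict[str, Optional[str]]


def resolve(query: Frame, context: Frame) -> Frame:
    """Fill slots missing from `query` with values from the previous context.

    Slots the user states explicitly override what was remembered.
    """
    stated = {slot: value for slot, value in query.items() if value is not None}
    return {**context, **stated}


context: Frame = {}
turns = [
    # "What will the temperature be in Boston tomorrow?"
    {"attribute": "temperature", "city": "Boston", "date": "tomorrow"},
    # "What about the humidity?" (city and date are left unstated)
    {"attribute": "humidity"},
]
for turn in turns:
    context = resolve(turn, context)
    print(context)
```

After the second turn the frame still carries the city and date from the first question, so the follow-up is interpreted as a request for tomorrow's humidity in Boston.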

Here is how to call Jupiter.


The principal investigators

Presentations and posters


M. Nakano, Y. Minami, S. Seneff, T. Hazen, S. Cyphers, J. Glass, J. Polifroni, and V. Zue, "Mokusei: A telephone-based Japanese conversational system in the weather domain," to appear in Proc. Eurospeech 2001, Aalborg, Denmark.

J. Glass and E. Weinstein, "SpeechBuilder: Facilitating spoken dialogue system development," to appear in Proc. Eurospeech 2001, Aalborg, Denmark.

S. Seneff, D. Goddeau, C. Pao, and J. Polifroni, "Multimodal discourse modelling in a multi-user multi-domain environment," in Proc. ICSLP 1996, Philadelphia, PA.

V. Zue, S. Seneff, J. Polifroni, H. Meng, and J. Glass, "Multilingual human-computer interactions: From information access to language learning," in Proc. ICSLP 1996, Philadelphia, PA.

Proposals and progress reports


NTT Bi-Annual Progress Report, July to December 2001:

NTT Bi-Annual Progress Report, January to June 2002:

NTT Bi-Annual Progress Report, July to December 2002:

For more information