Recognition of affective intent in speech

Human speech provides a natural and intuitive interface for both communicating with humanoid robots as well as for teaching them. To this end, Kismet recognizes and affectively responds to praise, prohibition, attention, and comfort in robot-directed speech. These affective intents are well matched to human-style instruction scenarios since praise, prohibition, and directing the robot's attention to relevant aspects of a task, could be intuitively used to train a robot.

Watch clip: Recognition of affective intent 

The system runs in real-time and exhibits robust performance. For a teaching task, confusing a strongly valenced intent with a neutrally valenced one is far less harmful than confusing oppositely valenced intents: mistaking an approval for an attentional bid, or a prohibition for neutral speech, is acceptable, whereas interpreting a prohibition as praise is not. Communicative efficacy has been tested and demonstrated in multi-lingual studies with the robot's caregivers as well as with naive subjects (only female subjects have been tested so far). Importantly, we have discovered some intriguing social dynamics that arise between robot and human when expressive feedback is introduced. This expressive feedback plays an important role in facilitating natural and intuitive human-robot communication.
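This asymmetry between error types can be made concrete with a cost matrix. The sketch below is purely illustrative (the labels and cost values are our own shorthand, not part of the actual evaluation code): a cross-valence confusion is penalized more heavily than a strong-to-neutral one.

```python
# Hypothetical sketch: score confusions asymmetrically so that
# cross-valence errors (e.g. prohibition -> approval) cost more than
# strong-to-neutral errors (e.g. prohibition -> neutral).
VALENCE = {
    "approval": +1, "soothing": +1,   # positively valenced
    "prohibition": -1,                # negatively valenced
    "attention": 0, "neutral": 0,     # neutrally valenced
}

def error_cost(true_label: str, predicted: str) -> int:
    """0 for a correct label, 1 for a valence-preserving confusion,
    3 for a confusion that flips valence (worst case for teaching)."""
    if true_label == predicted:
        return 0
    if VALENCE[true_label] * VALENCE[predicted] < 0:
        return 3  # oppositely valenced: e.g. prohibition read as approval
    return 1      # e.g. approval read as an attentional bid

print(error_cost("prohibition", "approval"))   # -> 3
print(error_cost("approval", "attention"))     # -> 1
```

With costs like these, a classifier tuned to minimize expected cost will prefer to fall back to "neutral" when uncertain rather than risk a valence flip.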

Infant recognition of affective intent

Developmental psycholinguists have extensively studied how affective intent is communicated to preverbal infants. Infant-directed speech is typically quite exaggerated in pitch and intensity. From the results of a series of cross-cultural studies, Anne Fernald suggests that much of this information is communicated through the "melody" of infant-directed speech. In particular, there is evidence for at least four distinctive prosodic contours, each of which communicates a different affective meaning to the infant (approval, prohibition, comfort, and attention) -- see figure. Maternal exaggerations in infant-directed speech seem to be particularly well matched to the innate affective responses of human infants.

Fernald contours

Recognition of affective intent

Inspired by this work, we have implemented a recognizer to distinguish four affective intents: praise, prohibition, comfort, and attentional bids. Of course, not everything a human says to Kismet will have an affective meaning, so we also distinguish neutral robot-directed speech. We have intentionally designed Kismet to resemble a very young creature so that people are naturally inclined to speak to Kismet with appropriately exaggerated prosody. This aesthetic choice has paid off nicely for us. As shown below, the preprocessed pitch contours of labeled utterances resemble Fernald's prototypical prosodic contours for approval, attention, prohibition, and comfort/soothing.

prosodic contours of kismet-directed speech
As shown below, the affective speech recognizer receives robot-directed speech as input. The speech signal is analyzed by the low-level speech processing system, producing time-stamped pitch (Hz), percent periodicity (a measure of how likely it is that a frame belongs to a voiced segment), energy (dB), and phoneme values. This low-level auditory processing code is provided by the Spoken Language Systems Group at MIT. The next module performs filtering and pre-processing to reduce the amount of noise in the data. The resulting pitch and energy data are then passed through the feature extractor, which calculates a set of selected features.
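The kind of global statistics this stage produces can be sketched as follows. This is not the original MIT code; the frame format, the periodicity threshold, and the particular statistics are illustrative assumptions, chosen to match the features the text names (pitch mean, energy variance).

```python
import statistics

# Illustrative sketch: compute global pitch/energy features from
# per-frame (pitch_hz, periodicity, energy_db) triples, keeping only
# frames that look voiced (high periodicity, nonzero pitch).
def global_features(frames, periodicity_threshold=0.5):
    voiced = [(p, e) for (p, per, e) in frames
              if per >= periodicity_threshold and p > 0]
    pitches = [p for p, _ in voiced]
    energies = [e for _, e in voiced]
    return {
        "pitch_mean": statistics.fmean(pitches),
        "pitch_var": statistics.pvariance(pitches),
        "energy_var": statistics.pvariance(energies),
    }

frames = [(310.0, 0.90, 62.0),   # voiced frame
          (340.0, 0.95, 65.0),   # voiced frame
          (0.0,   0.10, 40.0),   # unvoiced frame -- filtered out
          (280.0, 0.80, 60.0)]   # voiced frame
feats = global_features(frames)
print(feats["pitch_mean"])  # -> 310.0
```

Filtering on periodicity before computing statistics is what keeps unvoiced frames (where the pitch tracker outputs garbage) from corrupting the global features.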

multi stage model
Finally, based on the trained model, the classifier determines whether the computed features are derived from an approval, an attentional bid, a prohibition, soothing speech, or a neutral utterance. As shown above, we adopted a multi-stage approach where several mini-classifiers are used to classify the data in stages. In all training phases we modeled each class of data using a Gaussian mixture model, updated with the EM algorithm and a kurtosis-based approach for dynamically deciding the appropriate number of kernels. In the first stage, the classifier uses global pitch and energy features to separate some classes (high arousal versus low arousal). Below, you can see that pitch mean and energy variance separate the utterances according to arousal nicely.
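The per-class GMM idea can be sketched with scikit-learn. This is an illustration, not the original implementation: the class labels and training distributions are synthetic, and the number of kernels is simply fixed at two rather than chosen by the kurtosis-based rule described above. One mixture model is fit per class, and an utterance is assigned to the class whose model gives it the highest log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D feature vectors (pitch mean, energy variance) for two
# hypothetical arousal classes -- stand-ins for labeled utterances.
rng = np.random.default_rng(0)
train = {
    "high_arousal": rng.normal([300.0, 8.0], [30.0, 2.0], size=(200, 2)),
    "low_arousal":  rng.normal([180.0, 2.0], [20.0, 1.0], size=(200, 2)),
}

# Fit one GMM per class (EM runs inside .fit()).
models = {label: GaussianMixture(n_components=2, random_state=0).fit(X)
          for label, X in train.items()}

def classify(features):
    """Pick the class whose mixture model best explains the features."""
    scores = {label: m.score_samples(np.atleast_2d(features))[0]
              for label, m in models.items()}
    return max(scores, key=scores.get)

print(classify([310.0, 9.0]))  # -> high_arousal
```

A later stage would then run the same recipe within each arousal cluster, using additional features to separate the classes that global pitch and energy alone cannot.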

the energy/pitch space
The remaining clustered classes are then passed to subsequent classification stages. Utilizing prior information, we included a new set of features that encoded the shape of the pitch contour according to Fernald's results. We found these features to be useful in separating the difficult classes in the subsequent classification stages.
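One simple way to encode contour shape, in the spirit of Fernald's prototypes, is to split the smoothed pitch track into segments and record the slope of each. The segmentation into thirds below is our own illustrative choice, not the feature set actually used: it is just enough to make a bell-shaped "approval" contour come out as rise-then-fall while a "prohibition" contour comes out as uniformly falling.

```python
# Hypothetical shape features: slope of each third of the pitch track.
def contour_shape(pitch):
    n = len(pitch)
    thirds = [pitch[:n // 3], pitch[n // 3: 2 * n // 3], pitch[2 * n // 3:]]
    # Slope = (last - first) / (number of frame steps) per segment.
    return [(seg[-1] - seg[0]) / max(len(seg) - 1, 1) for seg in thirds]

rise_fall = [200, 240, 280, 320, 330, 320, 280, 240, 200]  # "approval"-like
falling   = [280, 260, 240, 220, 200, 180, 160, 140, 120]  # "prohibition"-like
print(contour_shape(rise_fall))  # -> [40.0, 0.0, -40.0]
```

Features like these carry exactly the prior information missing from the global statistics: two utterances with the same pitch mean can still have opposite contour shapes.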

For Kismet, output of the vocal affective intent classifier is interfaced with the emotion subsystem where the information is appraised at an affective level and then used to directly modulate the robot's own affective state. In this way, the affective meaning of the utterance is communicated to the robot through a mechanism similar to the one Fernald suggests. The robot's current "emotive" state is reflected by its facial expression and body posture. This affective response provides critical feedback to the human as to whether or not the robot properly understood their intent. As with human infants, socially manipulating the robot's affective system is a powerful way to modulate the robot's behavior and to elicit an appropriate response. The video segment on this page illustrates these points.
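The coupling between recognized intent and affective state can be sketched as a nudge in a valence/arousal space. The table of deltas below is an illustrative assumption, not Kismet's actual emotion subsystem; it only shows the mechanism by which a recognized prohibition could push the robot toward a low-arousal, negative-valence state that would then surface as a "sad" expression.

```python
# Hypothetical (valence delta, arousal delta) per recognized intent.
INTENT_EFFECT = {
    "approval":    (+0.5, +0.3),
    "prohibition": (-0.5, -0.3),
    "attention":   ( 0.0, +0.4),
    "soothing":    (+0.3, -0.4),
    "neutral":     ( 0.0,  0.0),
}

def update_affect(state, intent, gain=1.0):
    """Nudge a (valence, arousal) state by the recognized intent,
    clamping both dimensions to [-1, 1]."""
    dv, da = INTENT_EFFECT[intent]
    clamp = lambda x: max(-1.0, min(1.0, x))
    return (clamp(state[0] + gain * dv), clamp(state[1] + gain * da))

state = (0.0, 0.0)                            # neutral starting state
state = update_affect(state, "prohibition")   # -> (-0.5, -0.3)
print(state)
```

Because the resulting state drives the face and posture, the human immediately sees whether the intent landed, closing the feedback loop the paragraph above describes.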

Other topics
Kismet's hardware
Facial expression
Visual attention
Ocular-motor control
Low-level features
Expressive speech
Homeostatic regulation mechanisms
The behavior system