ACM SIGCHI General Interest Announcements (Mailing List)


Jean-Claude MARTIN <[log in to unmask]>
Wed, 24 Mar 2010 12:09:51 +0100
JMUI Special issue on real-time affect analysis and interpretation in
virtual agents and robots



Special issue: Real-time affect analysis and interpretation: closing the
affective loop in virtual agents and robots

Guest Editors: Ginevra Castellano, Kostas Karpouzis, Christopher Peters and
Jean-Claude Martin
Volume 3, Issues 1-2, pages 1-153, March 2010


"Special issue on real-time affect analysis and interpretation: closing the
affective loop in virtual agents and robots"

Ginevra Castellano, Kostas Karpouzis, Christopher Peters and Jean-Claude
Martin
Pages 1-3


"On-line emotion recognition in a 3-D activation-valence-time continuum
using acoustic and linguistic cues"

Florian Eyben, Martin Wöllmer, Alex Graves, Björn Schuller, Ellen
Douglas-Cowie and Roddy Cowie
Pages 7-19

For many applications of emotion recognition, such as virtual agents, the
system must select responses while the user is speaking. This requires
reliable on-line recognition of the user’s affect. However, most emotion
recognition systems are based on turnwise processing. We present a novel
approach to on-line emotion recognition from speech using Long Short-Term
Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a
two-dimensional valence-activation continuum. In contrast to
current state-of-the-art approaches, recognition is performed on low-level
signal frames, similar to those used for speech recognition. No statistical
functionals are applied to low-level feature contours. Framing at a higher
level is therefore unnecessary and regression outputs can be produced in
real-time for every low-level input frame. We also investigate the benefits
of including linguistic features on the signal frame level obtained by a
keyword spotter.

"Student mental state inference from unintentional body gestures using
dynamic Bayesian networks"

Abdul Rehman Abbasi, Matthew N. Dailey, Nitin V. Afzulpurkar and Takeaki Uno
Pages 21-31

Applications that interact with humans would benefit from knowing the
intentions or mental states of their users. However, mental state prediction
is not only uncertain but also context dependent. In this paper, we present
a dynamic Bayesian network model of the temporal evolution of students’
mental states and causal associations between mental states and body
gestures in context. Our approach is to convert sensory descriptions of
student gestures into semantic descriptions of their mental states in a
classroom lecture situation. At model learning time, we use expectation
maximization (EM) to estimate model parameters from partly labeled training
data, and at run time, we use the junction tree algorithm to infer mental
states from body gesture evidence. A maximum a posteriori classifier
evaluated with leave-one-out cross validation on labeled data from 11
students obtains a generalization accuracy of 97.4% over cases where the
student reported a definite mental state, and 83.2% when we include cases
where the student reported no mental state. Experimental results demonstrate
the validity of our approach. Future work will explore utilization of the
model in real-time intelligent tutoring systems.

"Multimodal emotion recognition in speech-based interaction using facial
expression, body gesture and acoustic analysis"

Loic Kessous, Ginevra Castellano and George Caridakis
Pages 33-48

In this paper a study on multimodal automatic emotion recognition during a
speech-based interaction is presented. A database was constructed consisting
of people pronouncing a sentence in a scenario where they interacted with an
agent using speech. Ten people pronounced a sentence corresponding to a
command while making 8 different emotional expressions. Gender was equally
represented, with speakers of several different native languages including
French, German, Greek and Italian. Facial expression, gesture and acoustic
analysis of speech were used to extract features relevant to emotion. For
the automatic classification of unimodal data, bimodal data and multimodal
data, a system based on a Bayesian classifier was used. After performing an
automatic classification of each modality, the different modalities were
combined using a multimodal approach. Fusion of the modalities at the
feature level (before running the classifier) and at the results level
(combining results from the classifier for each modality) were compared. Fusing
the multimodal data resulted in a large increase in the recognition rates in
comparison to the unimodal systems: the multimodal approach increased the
recognition rate by more than 10% when compared to the most successful
unimodal system. Bimodal emotion recognition based on all combinations of
the modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’)
was also investigated. The results show that the best pairing is
‘gesture-speech’. Using all three modalities resulted in a 3.3%
classification improvement over the best bimodal results.

"Multimodal user’s affective state analysis in naturalistic interaction"

George Caridakis, Kostas Karpouzis, Manolis Wallace, Loic Kessous and Noam
Pages 49-66

Affective and human-centered computing have attracted an abundance of
attention during the past years, mainly due to the abundance of environments
and applications able to exploit and adapt to multimodal input from the
users. The combination of facial expressions with prosody information allows
us to capture the users’ emotional state in an unintrusive manner, relying
on the best performing modality in cases where one modality suffers from
noise or bad sensing conditions. In this paper, we describe a multi-cue,
dynamic approach to detect emotion in naturalistic video sequences, where
input is taken from nearly real world situations, contrary to controlled
recording conditions of audiovisual material. Recognition is performed via a
recurrent neural network, whose short term memory and approximation
capabilities cater for modeling dynamic events in facial and prosodic
expressivity. This approach also differs from existing work in that it
models user expressivity using a dimensional representation, instead of
detecting discrete ‘universal emotions’, which are scarce in everyday
human-machine interaction. The algorithm is deployed on an audiovisual
database which was recorded simulating human-human discourse and, therefore,
contains less extreme expressivity and subtle variations of a number of
emotion labels. Results show that in turns lasting more than a few frames,
recognition rates rise to 98%.

"From expressive gesture to sound - The development of an embodied mapping
trajectory inside a musical interface"

Pieter-Jan Maes, Marc Leman, Micheline Lesaffre, Michiel Demey and Dirk
Pages 67-78

This paper contributes to the development of a multimodal, musical tool that
extends the natural action range of the human body to communicate
expressiveness into the virtual music domain. The core of this musical tool
consists of a low cost, highly functional computational model developed upon
the Max/MSP platform that (1) captures real-time movement of the human body
into a 3D coordinate system on the basis of the orientation output of any
type of inertial sensor system that is OSC-compatible, (2) extracts low-level
movement features that specify the amount of contraction/expansion as a
measure of how a subject uses the surrounding space, (3) recognizes these
movement features as being expressive gestures, and (4) creates a mapping
trajectory between these expressive gestures and the sound synthesis process
of adding harmonic related voices on an in origin monophonic voice. The
concern for a user-oriented and intuitive mapping strategy was thereby of
central importance. This was achieved by conducting an empirical experiment
based on theoretical concepts from the embodied music cognition paradigm.
Based on empirical evidence, this paper proposes a mapping trajectory that
facilitates the interaction between a musician and his instrument, the
artistic collaboration between (multimedia) artists and the communication of
expressiveness in a social, musical context.

"The mental ingredients of bitterness"

Isabella Poggi and Francesca D’Errico
Pages 79-86

In view of multimodal interfaces capable of a detailed representation of the
User’s possible emotions, the paper analyses bitterness in terms of its
mental ingredients, the beliefs and goals represented in the mind of a
person when feeling an emotion. Bitterness is a negative emotion in between
anger and sadness: like anger, it is caused by a sense of injustice, but
also entails a sense of impotence which makes it similar to sadness. Often
caused by betrayal, it comes from the disappointment of an expectation from
oneself or others with whom one is affectively involved, or from a
disproportion between commitment and actual results. The ingredients found
in a pilot study were tested through qualitative analysis of a further
questionnaire, which confirmed the ingredients hypothesized, further
revealing the different nature of bitterness across ages and across types of

"Affect recognition for interactive companions: Challenges and design in
real world scenarios"

Ginevra Castellano, Iolanda Leite, André Pereira, Carlos Martinho, Ana Paiva
and Peter W. McOwan
Pages 89-98

Affect sensitivity is an important requirement for artificial companions to
be capable of engaging in social interaction with human users. This paper
provides a general overview of some of the issues arising from the design of
an affect recognition framework for artificial companions. Limitations and
challenges are discussed with respect to other capabilities of companions
and a real world scenario where an iCat robot plays chess with children is
presented. In this scenario, affective states that a robot companion should
be able to recognise are identified and the non-verbal behaviours that are
affected by the occurrence of these states in the children are investigated.
The experimental results aim to provide the foundation for the design of an
affect recognition system for a game companion: in this interaction scenario
children tend to look at the iCat and smile more when they experience a
positive feeling and they are engaged with the iCat.

"When my robot smiles at me - Enabling human-robot rapport via real-time
head gesture mimicry"

Laurel D. Riek, Philip C. Paul and Peter Robinson
Pages 99-108

People use imitation to encourage each other during conversation. We have
conducted an experiment to investigate how imitation by a robot affects
people’s perceptions of their conversation with it. The robot operated in
one of three ways: full head gesture mimicking, partial head gesture
mimicking (nodding), and non-mimicking (blinking). Participants rated how
satisfied they were with the interaction. We hypothesized that participants
in the full head gesture condition would rate their interaction the most
positively, followed by the partial and non-mimicking conditions. We also
performed gesture analysis to see if any differences existed between groups,
and did find that men made significantly more gestures than women while
interacting with the robot. Finally, we interviewed participants to try to
ascertain additional insight into their feelings of rapport with the robot,
which revealed a number of valuable insights.

"Communication of musical expression by means of mobile robot gestures"

Birgitta Burger and Roberto Bresin
Pages 109-118

We developed a robotic system that can behave in an emotional way. A
3-wheeled simple robot with limited degrees of freedom was designed. Our
goal was to make the robot display emotions in music performance by
performing expressive movements. These movements have been compiled and
programmed based on literature about emotion in music, musicians’ movements
in expressive performances, and object shapes that convey different
emotional intentions. The emotions happiness, anger, and sadness have been
implemented in this way. General results from behavioral experiments show
that emotional intentions can be synthesized, displayed and communicated by
an artificial creature, also in constrained circumstances.

"Investigating shared attention with a virtual agent using a gaze-based
interface"

Christopher Peters, Stylianos Asteriadis and Kostas Karpouzis
Pages 119-130

This paper investigates the use of a gaze-based interface for testing simple
shared attention behaviours during an interaction scenario with a virtual
agent. The interface is non-intrusive, operating in real-time using a
standard web-camera for input, monitoring users’ head directions and
processing them in real-time for resolution to screen coordinates. We use
the interface to investigate user perception of the agent’s behaviour during
a shared attention scenario. Our aim is to elaborate important factors to be
considered when constructing engagement models that must account not only
for behaviour in isolation, but also for the context of the interaction, as
is the case during shared attention situations.

"HMM modeling of user engagement in advice-giving dialogues"

Nicole Novielli
Pages 131-140

This research aims at defining a real-time probabilistic model of user
engagement in advice-giving dialogues. We propose an approach based on
Hidden Markov Models (HMMs) to describe the differences in the dialogue
pattern due to the different levels of engagement experienced by the users.
We train our HMM models on a corpus of natural dialogues with an Embodied
Conversational Agent (ECA) in the domain of healthy eating. The dialogues
are coded in terms of Dialogue Acts associated with each system or user move.
Results are quite encouraging: HMMs are a powerful formalism for describing
the differences in the dialogue patterns due to the different levels of
engagement of users, and they can be successfully employed in real-time
detection of user engagement. However, the HMM learning process shows a lack
of robustness when using low-dimensional and skewed corpora. Therefore we
plan a further validation of our approach with larger corpora in the near
future.

"Natural interaction with a virtual guide in a virtual environment - A
multimodal dialogue system"

Dennis Hofs, Mariët Theune and Rieks op den Akker
Pages 141-153

This paper describes the Virtual Guide, a multimodal dialogue system
represented by an embodied conversational agent that can help users to find
their way in a virtual environment, while adapting its affective linguistic
style to that of the user. We discuss the modular architecture of the
system, and describe the entire loop from multimodal input analysis to
multimodal output generation. We also describe how the Virtual Guide detects
the level of politeness of the user’s utterances in real-time during the
dialogue and aligns its own language to that of the user, using different
politeness strategies. Finally we report on our first user tests, and
discuss some potential extensions to improve the system.


Professor of Computer Science at Paris-South 11 University
Head of the Interactive Virtual Characters team at LIMSI-CNRS

Editor-in-Chief of the Springer Journal on Multimodal User Interfaces (JMUI)

