Dear Colleague,


On behalf of the organizing committee, we sincerely invite you to submit papers to the MULEA Workshop, held in conjunction with ACM Multimedia 2019. We invite you to help define new directions in this broad research area!


The MULEA Workshop and the Conference will be held in France, from 2019/10/21 through 2019/10/25. The full name of the Workshop is the 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, and it will cover many AI applications, such as robotics, autonomous driving, multimodal chatbots, and simulated games. It also covers many new and exciting research areas.


Paper submission deadline: July 15, 2019. This interdisciplinary Workshop has considerable breadth and diversity across several research fields, including language, vision, and robotics. We therefore encourage 2-page Abstracts and Cross-Submissions, in addition to regular papers. Please refer to the Workshop website for details. The topics of the Workshop include, but are not limited to:


1. Multimodal context understanding. Context includes the environment, task/goal states, dynamic objects in the scene, activities, etc. Relevant research streams include visually grounded learning, context understanding, and environmental modeling, which includes 3D environment modeling and understanding. Language grounding is also an interesting topic. Connecting the vision and language modalities is essential in applications such as question answering and image captioning. Other relevant research areas include multimodal understanding, context modeling, and grounded dialog systems.


2. Knowledge inference. Knowledge in this multimedia scenario is represented with knowledge graphs, scene graphs, memory, etc. Representing contextual knowledge is a topic that has attracted much interest, and goal-driven knowledge representation and reasoning are also new research directions. Deep learning methods are good options for dealing with unstructured multimodal knowledge signals.


3. Embodied learning. Building on context understanding and knowledge representation, the policy generates actions for intelligent agents to achieve goals or finish tasks. The input signals are multimodal and can be images, dialogs, etc. The learning policies need not only to provide short-term reactions but also to plan their actions to optimally achieve long-term goals. The actions may also involve navigation and localization, which are mainstream topics in the robotics and self-driving vehicle fields. This is relevant to reinforcement learning, and the algorithms are driven by multiple industrial applications in robotics, self-driving vehicles, simulated games, multimodal chatbots, etc.


Please feel free to forward this email to colleagues who might be interested. Any suggestions or comments are welcome and appreciated! We apologize if you receive duplicate copies of this email.


Best Regards,

-John and Tim (Co-Chairs)