Multimodal KDD 2023:

International Workshop on Multimodal Learning

Held in conjunction with KDD'23


Welcome to Multimodal KDD 2023!

Recent advances in machine learning and artificial intelligence, particularly foundation models such as BERT, GPT-3, T5, and ResNet, have demonstrated remarkable capabilities and driven revolutionary changes in how we make inferences from complex data. These models represent a fundamental shift in how data are approached and offer exciting new research directions and opportunities for multimodal learning and data fusion.

Given the potential of foundation models to transform the field of multimodal learning, there is a need to bring together experts and researchers to discuss the latest developments in this area, exchange ideas, and identify key research questions and challenges that need to be addressed. By hosting this workshop, we aim to create a forum for researchers to share their insights and expertise on multimodal data fusion and learning with foundation models, and to explore potential new research directions and applications in this rapidly evolving field. We expect contributions from interdisciplinary researchers who study and model interactions between (but not limited to) modalities such as language, graphs, time-series, vision, tabular data, and sensors. The workshop will emphasize interdisciplinary work and aims to seed cross-team collaborations around new tasks, datasets, and models.

Contact: kdd2023-ws-multimodal@amazon.com

KEYNOTES

Next-Generation Intelligent Assistants for AR/VR Devices


Abstract: An intelligent assistant should be an agent that knows you and the world, can receive your requests or predict your needs, and provides the right services at the right time with your permission. As smart devices such as Amazon Alexa, Google Home, and Meta Ray-Ban Stories become popular, intelligent assistants are gradually playing an important role in people's lives. The emergence of AR/VR devices brings more opportunities and calls for the next generation of intelligent assistants. In this talk, we discuss the many challenges and opportunities we face in growing intelligent assistants from voice-only to multimodal, from context-agnostic to context-aware, from listening to users' requests to predicting their needs, and from server-side to on-device. We expect these new challenges to open doors to new research areas and start a new chapter in providing personal assistance services.

Bio: Xin Luna Dong is a Principal Scientist at Meta Reality Labs, working on intelligent assistants. She has spent more than a decade building knowledge graphs, such as the Amazon Product Graph and the Google Knowledge Graph. She has co-authored the books "Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases" and "Big Data Integration". She received the VLDB Women in Database Research Award (2023) for "significant contributions to knowledge graph construction and data integration", was named an ACM Distinguished Member (2018) for "significant contributions to data and knowledge integration", and received the VLDB Early Career Research Contribution Award (2016) for "advancing the state of the art of knowledge fusion". She serves on the VLDB Endowment and the PVLDB Advisory Committee, and has served as PC co-chair for the KDD 2022 ADS track, WSDM 2022, VLDB 2021, and SIGMOD 2018.


Graph-based Fusion for Multimodal Learning


Abstract: With advances in data collection techniques, large amounts of multimodal data collected from multiple sources are becoming available. Such multimodal data can provide complementary information that reveals fundamental characteristics of real-world subjects, and multimodal machine learning has therefore become an active research area. Extensive work has been devoted to exploiting multimodal interactions and integrating multi-source information. In this talk, we will discuss a graph-based fusion approach that enables multimodal fusion of incomplete data within a heterogeneous graph structure. This approach offers a unique strategy for learning from incomplete multimodal data without data deletion or imputation. Moreover, we will discuss a dynamic graph-based approach that supports federated training over multimodal distributed data without assuming that all clients have similar active sensors. The key idea is to employ a dynamic, multi-view graph structure to adaptively capture the correlations among multimodal client models.
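To make the general idea concrete, below is a minimal, hypothetical sketch (not the speaker's actual method) of graph-based fusion over incomplete multimodal data: each sample node is connected only to the modality features it actually has, and its representation aggregates messages from those neighbors, so missing modalities require neither deletion nor imputation. The function name graph_fuse and the random projections standing in for learned encoders are illustrative assumptions.

  # Illustrative sketch only: fuse incomplete multimodal samples over a
  # sample-modality graph, aggregating messages from observed modalities only.
  import numpy as np

  def graph_fuse(sample_modalities, dim=4):
      # sample_modalities: sample id -> {modality name: feature vector}; missing
      # modalities are simply absent (no imputation, no deletion).
      rng = np.random.default_rng(0)
      projections = {}   # one random projection per modality (stand-in for learned encoders)
      fused = {}
      for sid, feats in sample_modalities.items():
          messages = []
          for name, x in feats.items():
              x = np.asarray(x, dtype=float)
              if name not in projections:
                  projections[name] = rng.normal(size=(dim, x.shape[0]))
              messages.append(projections[name] @ x)   # modality node -> sample node message
          fused[sid] = np.mean(messages, axis=0)       # aggregate over observed modalities only
      return fused

  # Sample "b" lacks the image modality but is still fused without imputation.
  print(graph_fuse({
      "a": {"text": [1.0, 0.0], "image": [0.2, 0.3, 0.5]},
      "b": {"text": [0.0, 1.0]},
  }))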

Short Bio: Dr. Aidong Zhang is the Thomas M. Linville Endowed Professor of Computer Science, with joint appointments in Data Science and Biomedical Engineering, at the University of Virginia (UVA). Prof. Zhang's research interests include machine learning, data science, bioinformatics, and health informatics. She is a fellow of ACM, IEEE, and AIMBE.

SCHEDULE

August 7th, 2023, 1:00 PM – 5:00 PM (Pacific Time), Long Beach, CA, USA.

  Opening
  1:00 - 1:10 PM

Introduction by organizers.

  Keynote Talk 1: Next-Generation Intelligent Assistants for AR/VR Devices
  1:10 - 1:50 PM

Xin (Luna) Dong, Principal Scientist, Meta

  Evaluating Picture Description Speech for Dementia Detection using Image-text Alignment
  1:50 - 2:05 PM

Youxiang Zhu, Nana Lin, Xiaohui Liang, John Batsis, Robert Roth and Brian MacWhinney

  Optimizing Learning Across Multimodal Transfer Features for Modeling Olfactory Perception
  2:05 - 2:20 PM

Daniel Shin, Gao Pei, Priyadarshini Kumari and Tarek Besold

  Explainable Local and Global Models for Fine-Grained Multimodal Product Recognition
  2:20 - 2:35 PM

Tobias Pettersson, Maria Riveiro and Tuwe Löfström

  Contrastive Multimodal Text Generation for E-Commerce Brand Advertising
  2:35 - 2:50 PM

Nikhil Madaan, Krishna Kesari, Manisha Verma, Shaunak Mishra and Tor Steiner

  What Makes Good Open-Vocabulary Detector: A Disassembling Perspective
  2:50 - 3:05 PM

Jincheng Li, Chunyu Xie, Xiaoyu Wu, Bin Wang and Dawei Leng

  Coffee Break
  3:05 - 3:30 PM

  Keynote Talk 2: Graph-based Fusion for Multimodal Learning
  3:30 - 4:10 PM

Aidong Zhang, Thomas M. Linville Endowed Professor of Computer Science, University of Virginia

  Detecting text-rich objects: OCR or object detection? A case study with stopwatch detection
  4:10 - 4:25 PM

Yarong Feng, Zongyi Liu, Yuan Ling, Shunyan Luo, Shujing Dong, Shuyi Wang and Bruce Ferry

  REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
  4:25 - 4:40 PM

Ziniu Hu

  Closing
  4:55 - 5:00 PM

Concluding remarks by organizers.

Call for Contributions

This workshop will provide a platform to discuss the latest advances and trends in the theory, methodologies, and applications of multimodal learning. This year's workshop theme is the use of foundation models. Foundation models such as BERT, T5, LLaMA, and GPT-4, trained on massive data collections, have significantly revolutionized the field of natural language processing (NLP). Using such foundation models to solve NLP tasks represents a fundamental paradigm shift in how these problems are approached, especially given their ability to integrate knowledge from other domains such as computer vision (DALL-E, CLIP), retrieval, knowledge graphs, and more. Moreover, foundation models have brought fundamental changes to the multimodal problem setting, especially when integrating text or images with graphs, time-series, and other forms of structured data. As such, the workshop focuses on utilizing these foundation models and integrating multiple modalities. Although the workshop also welcomes discussions and papers on general multimodal learning problems, more emphasis will be given to works that utilize recently developed foundation models. Our goal is to explore and showcase innovative ways in which multimodal learning and data fusion can be employed, with a particular emphasis on how to leverage the capabilities of foundation models for these purposes. The workshop topics include, but are not limited to:

  • Multimodal data generation
  • Multimodal data preprocessing and feature engineering
  • Multimodal data fusion
  • Multimodal self-supervised and/or unsupervised learning
  • Multimodal learning with noisy data
  • Multimodal transfer learning
  • Multimodal zero-shot learning with foundation models
  • Biases in multimodal learning
  • Explainable multimodal learning
  • Multimodal generative AI
  • Trustworthy multimodal learning
  • Large-scale multimodal learning
  • Responsible multimodal learning
  • Applications of multimodal learning (e.g., finance, healthcare, social media, climate)

The workshop seeks to bring together researchers from the machine learning and data mining communities and to provide a unique opportunity for interdisciplinary researchers to explore and model interactions, using foundation models, between various modalities such as text, images, graphs, tabular data, time-series, and more. The workshop will feature invited talks, accepted paper presentations, and a panel discussion to encourage knowledge sharing and foster cross-team collaboration among research and industry communities in Natural Language Processing (NLP), Information Retrieval, Data Mining, Machine Learning, and other fields.
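As a minimal illustration of the foundation-model theme described above, the sketch below performs simple late fusion of vision and language embeddings. It assumes the Hugging Face transformers CLIP interface and the public openai/clip-vit-base-patch32 checkpoint; the concatenation step is a deliberately simple placeholder for the fusion methods solicited by this call, and the inputs in the usage comment are hypothetical.

  # Hedged sketch: encode each modality with a foundation model, then fuse.
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  # Assumed public checkpoint; any CLIP-style vision-language model would do.
  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  def fuse(image_path: str, caption: str) -> torch.Tensor:
      inputs = processor(text=[caption], images=Image.open(image_path),
                         return_tensors="pt", padding=True)
      with torch.no_grad():
          text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
          image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
      # Late fusion by concatenation; a downstream head (classifier, retriever,
      # graph encoder, ...) would consume this joint representation.
      return torch.cat([text_emb, image_emb], dim=-1)

  # joint = fuse("product.jpg", "wireless noise-cancelling headphones")  # hypothetical inputs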

Submission Guidelines

  • Paper submissions are limited to 9 pages, excluding references; they must be in PDF format and use the ACM Conference Proceedings template (two-column format).
  • Additional supplemental material focused on reproducibility may be provided. Proofs, pseudo-code, and code may also be included in the supplement, which has no explicit page limit. The supplement may be formatted in either single or double column. The paper itself should be self-contained, since reviewers are not required to read the supplement.
  • The Word template guideline can be found here: [link]
  • The Latex/overleaf template guideline can be found here: [link]
  • Submissions will be judged on quality and relevance through single-blind reviewing.
  • Papers should be submitted in PDF format through EasyChair at the following link: [link]