(The special sessions are sorted alphabetically)

Exploring Generative AI Technologies in Multimedia Signal Processing

In recent years, the emergence of Generative Artificial Intelligence (AI) technologies has revolutionized the landscape of multimedia signal processing. These technologies have gained significant attention and popularity due to their ability to capture complex patterns and generate realistic samples across various domains, including images, text, and audio. This special session aims to delve into the forefront of this transformative intersection, offering researchers the opportunity to present the latest developments, advancements, challenges, and opportunities in leveraging Generative AI for multimedia signal processing, and to exchange insights on these new technologies.

With the proliferation of Generative AI techniques such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based architectures, the possibilities for multimedia content creation, compression, enhancement, and understanding have expanded dramatically. This special session will feature presentations on cutting-edge research spanning various aspects of multimedia signal processing empowered by AI, with a particular emphasis on Generative AI. Generative models have the potential to revolutionize many aspects of artificial intelligence and creativity, and as research in this field continues to advance, we can expect even more exciting developments and applications of generative models in the future.

We invite researchers, academics, industry professionals, and students to contribute to this session by sharing their expertise, insights, and innovative technologies that shape the future of multimedia signal processing in the era of Generative AI. Topics of interest include, but are not limited to:

  • Visual signal compression with generative artificial intelligence technology
  • Generative visual content compression and processing
  • Human/machine-centric applications incorporating Generative AI in multimedia experiences
  • Generative AI applications in content creation and manipulation
  • Cross-modal generation: integrating vision, audio, and text

Organizers:

  • Meng Wang (mwang98-c@my.cityu.edu.hk): Department of Computer Science, City University of Hong Kong
  • Junru Li (lijunru@bytedance.com): ByteDance Inc.
  • Li Zhang (lizhang.idm@bytedance.com): ByteDance Inc.

Latent Space Metrics in AI to Improve Multi-Object Detection (MOD), Tracking (MOT), Re-ID and 3D Segmentation Tasks

Multi-object detection, multi-object tracking, object Re-ID, and segmentation are common tasks in multimedia video. The performance of models that find, track, and re-identify objects within a camera feed and across camera feeds is key to many applications, such as surveillance, anomaly detection, motion prediction, 3D medical image-guided surgery, 3D segmentation, 2D-to-3D view synthesis (NeRF), and video-to-speech. Current state-of-the-art AI backbone models perform very well on benchmark data but still fall short on real-world data. This session will focus on the use of metrics developed in the latent/feature space of AI models, which then inform the selection of data to test or train these models. The metrics and techniques discussed in the session are used to improve multi-object detection (MOD), multi-object tracking (MOT), matching (Re-ID), and 3D segmentation tasks. Generative data from Diffusion Models and Variational Autoencoders (VAEs) is explored for creating samples from the latent space to supplement and augment real-world training data, with the goal of improving performance, robustness, and resilience to adversarial attacks.
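
To illustrate the basic latent-space sampling step behind such augmentation, the minimal PyTorch sketch below draws codes from the unit-Gaussian prior of a trained VAE and decodes them into synthetic samples. It is a sketch under stated assumptions, not a method from the session: the decoder interface and all names are hypothetical.

    import torch

    @torch.no_grad()
    def sample_synthetic_batch(decoder, n, latent_dim, device="cpu"):
        # Draw latent codes from the VAE's unit-Gaussian prior and decode
        # them into synthetic samples for augmenting real training data.
        # "decoder" is any module mapping (n, latent_dim) codes to samples;
        # this interface is an assumption, not a specific library API.
        z = torch.randn(n, latent_dim, device=device)  # z ~ N(0, I)
        return decoder(z)

    # Hypothetical usage: blend synthetic samples into a real training batch.
    # synthetic = sample_synthetic_batch(trained_decoder, n=32, latent_dim=128)
    # batch = torch.cat([real_images, synthetic], dim=0)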

Organizers:

  • Dr. Lauren Christopher (lachrist@purdue.edu): Department of Electrical and Computer Engineering, IUPUI, Indianapolis, IN, U.S.A.
  • Dr. Paul Salama (psalama@purdue.edu): Department of Electrical and Computer Engineering, IUPUI, Indianapolis, IN, U.S.A.

Reproducible Neural Visual Coding

In recent years, we have witnessed exponential growth in research and development on learning-based visual coding. These learned coding approaches, regardless of whether they focus on image, video, or 3D point clouds, have demonstrated remarkable improvements in coding efficiency compared to traditional solutions that have been refined over decades.

Although international standards organizations such as JPEG and MPEG have devoted efforts to promoting learning-based visual coding techniques, these techniques are often criticized for their lack of reproducibility. Reproducibility concerns the complexity and generalization of the underlying coding model, both of which are vital for faithfully evaluating the performance of these methods and ensuring their adoption in practical applications. Complexity here includes computational complexity and memory (space) consumption in both training and inference. Generalization ensures the applicability of the trained model across various data domains, even to unseen data.
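
As a point of reference, the inference-side complexity figures mentioned above (parameter count and runtime) can be reported with a few lines of instrumentation. The sketch below is a minimal PyTorch example; "model" and "input_shape" are placeholders for an arbitrary single-input coding model, and full FLOP or training-memory profiling would require a dedicated profiler.

    import time
    import torch

    def report_inference_complexity(model, input_shape, device="cpu"):
        # Report two reproducibility figures discussed above: parameter
        # count and wall-clock inference time on a random input.
        model = model.to(device).eval()
        n_params = sum(p.numel() for p in model.parameters())
        x = torch.randn(*input_shape, device=device)
        with torch.no_grad():
            model(x)                                  # warm-up pass
            start = time.perf_counter()
            model(x)
            elapsed = time.perf_counter() - start
        print(f"params: {n_params / 1e6:.2f} M | inference: {elapsed * 1e3:.1f} ms")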

This special session seeks original contributions reporting and discussing the reproducibility of recently emerged neural visual coding solutions. It targets a mixed audience of researchers and product developers from several communities, including multimedia coding, machine learning, and computer vision. The topics of interest include, but are not limited to:

  • Efficient neural visual coding for image, video, 3D point cloud, etc.
  • Model complexity analysis of neural visual coding
  • Model generalization studies of neural visual coding
  • Overview of standardization activities and summaries of relevant techniques
  • Technical alignment of training and testing (e.g., datasets, procedural steps) for fair comparison

Organizers:

  • Dr. Zhan Ma (mazhan@nju.edu.cn): School of Electronic Science and Engineering, Nanjing University, China
  • Dr. Dong Tian (Dong.tian@interdigital.com): InterDigital, New York, NY, U.S.A.