The USTC Future Media Computing Lab, led by Prof. Xiaojun Chang and Prof. Xun Yang, is a cutting-edge research center dedicated to advancing the frontiers of multimedia and AI technologies. Focused on areas such as video content analysis, multimodal intelligence, 3D vision, and human-computer interaction, the lab aims to transform how media is processed, understood, and generated. Through interdisciplinary collaboration, the lab develops innovative algorithms and systems that address real-world challenges, from video-based recognition tasks to intelligent media creation, fostering breakthroughs in both academic and industrial applications.
The USTC Future Media Computing Lab is always looking for talented undergraduate students, graduate students, and postdocs. If you’re interested in working in the exciting field of future media computing, feel free to reach out!
This paper presents the DNA Family, a new framework for boosting the effectiveness of weight-sharing Neural Architecture Search (NAS) by dividing large search spaces into smaller blocks and applying block-wise supervision. The approach demonstrates strong performance on benchmarks such as ImageNet, surpassing previous NAS techniques in both accuracy and efficiency.
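As a rough, hypothetical illustration of the block-wise supervision idea (a minimal sketch, not the authors' released implementation; the block modules and training loop here are invented for clarity), each student block can be trained to reproduce the output features of the corresponding teacher block, so blocks can be trained and rated independently:

```python
# Minimal sketch of block-wise supervision: one supernet block is distilled
# against the matching teacher block via a feature-matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_block(student_block, teacher_block, feats, optimizer):
    """One training step for a single supernet block under block-wise supervision."""
    with torch.no_grad():
        target = teacher_block(feats)      # teacher block's output features
    pred = student_block(feats)            # student block fed the same input
    loss = F.mse_loss(pred, target)        # per-block feature-matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: both "blocks" are simple conv layers here.
teacher = nn.Conv2d(16, 32, 3, padding=1)
student = nn.Conv2d(16, 32, 3, padding=1)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
distill_block(student, teacher, torch.randn(2, 16, 8, 8), opt)
```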
This paper presents DS-Net++, a novel framework for efficient inference in neural networks. Dynamic weight slicing allows for scalable performance across multiple architectures like CNNs and vision transformers. The method delivers up to 61.5% real-world acceleration with minimal accuracy drops on models like MobileNet, ResNet-50, and Vision Transformer, showing its potential in hardware-efficient dynamic networks.
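To give a flavor of what dynamic weight slicing means in practice (an assumed, simplified sketch rather than the DS-Net++ code; the `SlicedConv2d` module and `ratio` argument are illustrative), a convolution can keep one full weight tensor and, per input, use only its first k output filters chosen by some gating decision:

```python
# Illustrative dynamic weight slicing: only a leading slice of the filters
# is used at inference time, trading accuracy for speed on the fly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.padding = padding

    def forward(self, x, ratio=1.0):
        k = max(1, int(self.weight.shape[0] * ratio))   # number of filters to keep
        return F.conv2d(x, self.weight[:k], self.bias[:k], padding=self.padding)

x = torch.randn(1, 16, 32, 32)
layer = SlicedConv2d(16, 64)
y_full = layer(x, ratio=1.0)   # full width: 64 output channels
y_slim = layer(x, ratio=0.25)  # sliced width: 16 output channels
```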
This paper presents ContrastZSD, a semantics-guided contrastive network for zero-shot object detection (ZSD). The framework improves visual-semantic alignment and mitigates the bias problem towards seen classes by incorporating region-category and region-region contrastive learning. ContrastZSD demonstrates superior performance in both ZSD and generalized ZSD tasks across PASCAL VOC and MS COCO datasets.
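As a loose sketch of the region-category contrastive component (our own simplification, not the paper's exact loss; the function name and temperature value are assumptions), each region feature is pulled toward the semantic embedding of its ground-truth class and pushed away from the other class embeddings:

```python
# InfoNCE-style contrast between region features and class semantic embeddings.
import torch
import torch.nn.functional as F

def region_category_contrastive(region_feats, class_embeds, labels, tau=0.1):
    """region_feats: (N, d), class_embeds: (C, d), labels: (N,) class indices."""
    region_feats = F.normalize(region_feats, dim=-1)
    class_embeds = F.normalize(class_embeds, dim=-1)
    logits = region_feats @ class_embeds.t() / tau   # (N, C) similarity scores
    return F.cross_entropy(logits, labels)           # contrast over categories

loss = region_category_contrastive(torch.randn(8, 300), torch.randn(20, 300),
                                    torch.randint(0, 20, (8,)))
```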
TN-ZSTAD introduces a novel approach to zero-shot temporal activity detection (ZSTAD) in long untrimmed videos. By integrating an activity graph transformer with zero-shot detection techniques, it addresses the challenge of recognizing and localizing unseen activities. Experiments on THUMOS'14, Charades, and ActivityNet datasets validate its superior performance in detecting unseen activities.
This paper proposes the SAD-SP model, which improves open-world compositional zero-shot learning by capturing contextuality and feasibility dependencies between states and objects. Using semantic attention and knowledge disentanglement, the approach enhances performance on benchmarks like MIT-States and C-GQA by predicting unseen compositions more accurately.
This paper introduces a novel framework for partial person re-identification, addressing the challenge of image spatial misalignment due to occlusions. The framework utilizes an adaptive threshold-guided masked graph convolutional network and incorporates human attributes to enhance the accuracy of pedestrian representations. Experimental results demonstrate its effectiveness across multiple public datasets.
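For intuition, a threshold-masked graph convolution over part-level features might look like the sketch below (a hypothetical simplification, not the published model; the module name, learnable threshold, and hard mask are assumptions made for illustration):

```python
# Edges whose affinity falls below a threshold are masked out before
# message passing over part-level pedestrian features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.threshold = nn.Parameter(torch.tensor(0.5))  # adaptive threshold

    def forward(self, x):
        """x: (num_parts, dim) part features of one pedestrian image."""
        affinity = torch.sigmoid(x @ x.t())            # pairwise part affinity
        mask = (affinity > self.threshold).float()     # drop weak/occluded links
        adj = affinity * mask
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.linear(adj @ x))            # masked message passing

layer = MaskedGraphConv(256)
out = layer(torch.randn(6, 256))   # 6 body parts, 256-dim features each
```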
ZeroNAS presents a differentiable generative adversarial network architecture search method specifically designed for zero-shot learning (ZSL). The approach optimizes both generator and discriminator architectures, leading to significant improvements in ZSL and generalized ZSL tasks across various datasets.
This paper provides a comprehensive review of the recent advancements in knowledge distillation (KD)-based object detection (OD) models. It covers different KD strategies for improving object detection tasks, such as incremental OD, small object detection, and weakly supervised OD. The paper also explores advanced distillation techniques and highlights future research directions in the field.
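For readers new to the area, the vanilla logit-distillation objective that many of the surveyed detectors build on can be sketched as follows (a generic formulation, not tied to any specific model in the survey):

```python
# Temperature-scaled KL divergence between teacher and student class logits.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss between per-class detection logits."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

loss = kd_loss(torch.randn(16, 80), torch.randn(16, 80))  # 80 COCO classes
```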
This paper introduces a video pivoting method for unsupervised multi-modal machine translation (UMMT), which uses spatial-temporal graphs to align sentence pairs in the latent space. By leveraging visual content from videos, the approach enhances translation accuracy and generalization across multiple languages, as demonstrated on the VATEX and HowToWorld datasets.
This survey provides a thorough exploration of the concept of scene graphs, discussing their role in visual understanding tasks. Scene graphs represent objects, their attributes, and relationships, helping improve tasks like visual reasoning and image captioning. The paper outlines various generation methods and applications, and also highlights key challenges like the long-tailed distribution of relationships.
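As a toy illustration of the representation the survey discusses (our own minimal data structure, not a standard library), a scene graph stores objects as attributed nodes and relationships as directed triplets:

```python
# A scene graph: objects with attributes, plus <subject, predicate, object> triplets.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)     # id -> {"label", "attributes"}
    relations: list = field(default_factory=list)   # (subject_id, predicate, object_id)

g = SceneGraph()
g.objects[0] = {"label": "person", "attributes": ["standing"]}
g.objects[1] = {"label": "horse", "attributes": ["brown"]}
g.relations.append((0, "riding", 1))                # <person, riding, horse>
```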
This work addresses the catastrophic forgetting problem in one-shot neural architecture search by treating supernet training as a constrained optimization problem. The proposed method uses a novelty search-based architecture selection approach to enhance diversity and boost performance, achieving competitive results on CIFAR-10 and ImageNet datasets.
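A hypothetical sketch of the novelty-search flavor of architecture selection (assumed encoding and distance metric, not the paper's exact procedure) scores each candidate by its average distance to the nearest architectures already sampled and prefers candidates that explore new regions of the search space:

```python
# Novelty score = mean distance to the k nearest previously sampled architectures.
import numpy as np

def novelty_score(candidate, archive, k=5):
    """candidate: encoding vector; archive: list of previously sampled encodings."""
    if not archive:
        return float("inf")
    dists = sorted(np.linalg.norm(np.asarray(a) - np.asarray(candidate)) for a in archive)
    return float(np.mean(dists[:k]))

archive = [np.random.randint(0, 4, size=10) for _ in range(50)]   # past samples
candidates = [np.random.randint(0, 4, size=10) for _ in range(8)]
best = max(candidates, key=lambda c: novelty_score(c, archive))   # most novel one
```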
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities first need to be encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept-based or latent-space-based, we introduce hybrid space learning, which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective, and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space.
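To make the hybrid-space idea concrete, the sketch below is a loose simplification (not the released code at the link above; the module name, dimensions, and mixing weight are assumptions): a video and a query are each projected into a latent space and a concept space, and the final relevance combines both similarities:

```python
# Hybrid-space matching: latent-space similarity plus concept-space similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSpaceHead(nn.Module):
    def __init__(self, video_dim, text_dim, latent_dim=512, num_concepts=512):
        super().__init__()
        self.v_latent = nn.Linear(video_dim, latent_dim)
        self.t_latent = nn.Linear(text_dim, latent_dim)
        self.v_concept = nn.Linear(video_dim, num_concepts)   # one interpretable axis per concept
        self.t_concept = nn.Linear(text_dim, num_concepts)

    def forward(self, video_feat, text_feat, alpha=0.6):
        lat_sim = F.cosine_similarity(self.v_latent(video_feat), self.t_latent(text_feat), dim=-1)
        con_sim = F.cosine_similarity(torch.sigmoid(self.v_concept(video_feat)),
                                      torch.sigmoid(self.t_concept(text_feat)), dim=-1)
        return alpha * lat_sim + (1 - alpha) * con_sim        # hybrid relevance score

head = HybridSpaceHead(video_dim=2048, text_dim=1024)
score = head(torch.randn(4, 2048), torch.randn(4, 1024))      # one score per video-query pair
```

Here `video_feat` and `text_feat` stand in for the multi-level video and query encodings produced by the dual encoders.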
This paper introduces a novel semantic pooling method for event analysis tasks like detection, recognition, and recounting in long untrimmed Internet videos. Using semantic saliency, the approach ranks video shots to prioritize the most relevant ones, improving the classifier’s accuracy. The paper proposes a nearly-isotonic SVM classifier, validated with experiments on real-world datasets, showcasing significant performance improvements.
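A simplified sketch of semantic-saliency pooling (an assumed formulation for illustration, not the paper's nearly-isotonic SVM itself): shots are scored by how well their concept responses match the event's semantic description, then aggregated with weights that decrease along the saliency ranking:

```python
# Rank shots by semantic saliency and pool them with monotonically decaying weights.
import numpy as np

def semantic_pooling(shot_features, shot_concepts, event_semantics, decay=0.9):
    """shot_features: (S, d); shot_concepts: (S, c); event_semantics: (c,)."""
    saliency = shot_concepts @ event_semantics        # relevance of each shot to the event
    order = np.argsort(-saliency)                     # most salient shots first
    weights = decay ** np.arange(len(order))          # monotonically decreasing weights
    weights /= weights.sum()
    return weights @ shot_features[order]             # pooled video-level feature

video_feat = semantic_pooling(np.random.rand(30, 128), np.random.rand(30, 50),
                              np.random.rand(50))
```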