This paper presents the DNA Family, a new framework for boosting the effectiveness of weight-sharing Neural Architecture Search (NAS) by dividing large search spaces into smaller blocks and applying block-wise supervision. The approach demonstrates high performance on benchmarks such as ImageNet, surpassing previous NAS techniques in both accuracy and efficiency.
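The block-wise idea can be illustrated with a minimal sketch (hypothetical module names, not the authors' code): each candidate block is supervised by the matching block of a pretrained teacher, so blocks can be trained and ranked independently instead of evaluating whole architectures at once.

```python
import torch
import torch.nn as nn

# Hypothetical teacher/student blocks; in practice the teacher is a fixed
# pretrained network and each student block is sampled from a small
# per-block search space.
teacher_blocks = nn.ModuleList([nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)])
student_blocks = nn.ModuleList([nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)])

def blockwise_distillation_losses(x):
    """One loss per block: the student block mimics the teacher block.

    Each student block receives the *teacher's* input features for that
    depth, so blocks can be trained and scored independently.
    """
    losses = []
    feat = x
    for t_blk, s_blk in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            target = t_blk(feat)           # teacher output at this depth
        pred = s_blk(feat)                 # student sees the teacher's input
        losses.append(nn.functional.mse_loss(pred, target))
        feat = target                      # next block's input comes from the teacher
    return losses

x = torch.randn(2, 16, 8, 8)
print([loss.item() for loss in blockwise_distillation_losses(x)])
```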
This paper presents DS-Net++, a novel framework for efficient inference in neural networks. Its dynamic weight slicing scheme enables scalable performance across multiple architectures, including CNNs and vision transformers. The method delivers up to 61.5% real-world acceleration with minimal accuracy drops on models such as MobileNet, ResNet-50, and Vision Transformer, demonstrating its potential for hardware-efficient dynamic networks.
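At its core, weight slicing keeps only the first k filters of a single shared weight tensor, so that one set of weights serves sub-networks of several widths. A minimal sketch (hypothetical class, assuming PyTorch; not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedConv2d(nn.Conv2d):
    """Convolution whose output width can be sliced at run time.

    Keeping the first `k` filters of one shared weight tensor yields a
    family of sub-networks of different widths, which is the basic idea
    behind dynamic weight slicing (sketch, not the paper's code).
    """
    def forward(self, x, width_ratio=1.0):
        k = max(1, int(self.out_channels * width_ratio))
        w = self.weight[:k]                                # first k filters
        b = self.bias[:k] if self.bias is not None else None
        return F.conv2d(x, w, b, self.stride, self.padding)

conv = SlicedConv2d(8, 32, 3, padding=1)
x = torch.randn(1, 8, 16, 16)
print(conv(x, width_ratio=0.5).shape)   # torch.Size([1, 16, 16, 16])
```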
TN-ZSTAD introduces a novel approach to zero-shot temporal activity detection (ZSTAD) in long untrimmed videos. By integrating an activity graph transformer with zero-shot detection techniques, it addresses the challenge of recognizing and localizing unseen activities. Experiments on THUMOS'14, Charades, and ActivityNet datasets validate its superior performance in detecting unseen activities.
This paper presents ContrastZSD, a semantics-guided contrastive network for zero-shot object detection (ZSD). The framework improves visual-semantic alignment and mitigates the bias problem towards seen classes by incorporating region-category and region-region contrastive learning. ContrastZSD demonstrates superior performance in both ZSD and generalized ZSD tasks across PASCAL VOC and MS COCO datasets.
This paper proposes the SAD-SP model, which improves open-world compositional zero-shot learning by capturing contextuality and feasibility dependencies between states and objects. Using semantic attention and knowledge disentanglement, the approach enhances performance on benchmarks like MIT-States and C-GQA by predicting unseen compositions more accurately.
ZeroNAS presents a differentiable generative adversarial network architecture search method specifically designed for zero-shot learning (ZSL). The approach optimizes both generator and discriminator architectures, leading to significant improvements in ZSL and generalized ZSL tasks across various datasets.
This paper introduces a novel framework for partial person re-identification, addressing the challenge of image spatial misalignment due to occlusions. The framework utilizes an adaptive threshold-guided masked graph convolutional network and incorporates human attributes to enhance the accuracy of pedestrian representations. Experimental results demonstrate its effectiveness across multiple public datasets.
This paper provides a comprehensive review of the recent advancements in knowledge distillation (KD)-based object detection (OD) models. It covers different KD strategies for improving object detection tasks, such as incremental OD, small object detection, and weakly supervised OD. The paper also explores advanced distillation techniques and highlights future research directions in the field.
This paper introduces a video pivoting method for unsupervised multi-modal machine translation (UMMT), which uses spatial-temporal graphs to align sentence pairs in the latent space. By leveraging visual content from videos, the approach enhances translation accuracy and generalization across multiple languages, as demonstrated on the VATEX and HowToWorld datasets.
This survey provides a thorough exploration of the concept of scene graphs, discussing their role in visual understanding tasks. Scene graphs represent objects, their attributes, and relationships, helping improve tasks like visual reasoning and image captioning. The paper outlines various generation methods and applications, and also highlights key challenges like the long-tailed distribution of relationships.
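As a quick illustration of what a scene graph encodes, the toy sketch below (hypothetical types and example scene) represents objects with attributes plus subject-predicate-object relationships:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str
    obj: SceneObject

# Toy scene: "a brown dog sitting on a red couch"
dog = SceneObject("dog", ["brown"])
couch = SceneObject("couch", ["red"])
graph = [Relationship(dog, "sitting on", couch)]

for r in graph:
    print(f"({r.subject.name}) -[{r.predicate}]-> ({r.obj.name})")
```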
Active learning (AL) attempts to maximize a model’s performance gain while annotating the fewest samples possible. Deep learning (DL) is greedy for data: it requires a large supply of data to optimize a massive number of parameters if the model is to learn how to extract high-quality features. In recent years, due to the rapid development of internet technology, we have entered an era of information abundance characterized by massive amounts of available data. As a result, DL has attracted significant attention from researchers and has developed rapidly. Compared with DL, however, researchers have shown relatively little interest in AL. This is mainly because, before the rise of DL, traditional machine learning required relatively few labeled samples, meaning that early AL was rarely accorded the value it deserves. Although DL has made breakthroughs in various fields, most of this success is due to a large number of publicly available annotated datasets. However, acquiring a large number of high-quality annotated datasets consumes considerable manpower, making it infeasible in fields that require high levels of expertise (such as speech recognition, information extraction, and medical imaging). AL is therefore gradually receiving the attention it is due. It is natural to investigate whether AL can be used to reduce the cost of sample annotation while retaining the powerful learning capabilities of DL. As a result of such investigations, deep active learning (DeepAL) has emerged. Although research on this topic is quite abundant, there has not yet been a comprehensive survey of DeepAL-related works; accordingly, this article aims to fill this gap. We provide a formal classification method for the existing work, along with a comprehensive and systematic overview. In addition, we analyze and summarize the development of DeepAL from an application perspective. Finally, we discuss the points of confusion and open problems associated with DeepAL and suggest some possible directions for future development.
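To make the DeepAL setting concrete, here is a minimal sketch of one round of pool-based uncertainty sampling (the DummyModel stand-in and all names are hypothetical; any deep classifier exposing class probabilities would slot in):

```python
import numpy as np

rng = np.random.default_rng(0)

class DummyModel:
    """Stand-in for a trained deep classifier (placeholder only)."""
    def predict_proba(self, x):
        logits = rng.normal(size=(len(x), 3))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

def entropy(probs):
    # Predictive entropy: higher means the model is less certain.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_batch(model, pool_x, budget):
    """One AL round: pick the `budget` most uncertain pool samples."""
    scores = entropy(model.predict_proba(pool_x))
    return np.argsort(-scores)[:budget]   # indices to send to the oracle

pool = np.zeros((100, 5))                 # unlabeled pool (dummy features)
print(select_batch(DummyModel(), pool, budget=8))
```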
This work addresses the catastrophic forgetting problem in one-shot neural architecture search by treating supernet training as a constrained optimization problem. The proposed method uses a novelty search-based architecture selection approach to enhance diversity and boost performance, achieving competitive results on CIFAR-10 and ImageNet datasets.
Deep learning has made substantial breakthroughs in many fields due to its powerful automatic representation capabilities. It has been proven that neural architecture design is crucial to the feature representation of data and the final performance. However, the design of the neural architecture relies heavily on researchers’ prior knowledge and experience, and due to the limitations of inherent human knowledge, it is difficult for people to step outside their original thinking paradigm and design an optimal model. An intuitive idea, therefore, is to reduce human intervention as much as possible and let the algorithm design the neural architecture automatically. Neural Architecture Search (NAS) is just such a revolutionary algorithm, and the related research work is rich and complex, so a comprehensive and systematic survey of NAS is essential. Previous surveys have mainly classified existing work according to the key components of NAS: search space, search strategy, and evaluation strategy. While this classification is intuitive, it makes it difficult for readers to grasp the challenges involved and the landmark work that addressed them. Therefore, in this survey, we provide a new perspective: we begin with an overview of the characteristics of the earliest NAS algorithms, summarize the problems in these early algorithms, and then trace the solutions offered by subsequent related work. In addition, we conduct a detailed and comprehensive analysis, comparison, and summary of these works. Finally, we suggest some possible directions for future research.
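The three components named above fit together in even the simplest NAS procedure. The toy sketch below (illustrative only, with a stand-in proxy score instead of real training) pairs a small search space with random search as the search strategy:

```python
import random

# Toy illustration of NAS's three components (not from the survey itself):
# a search space, a search strategy, and an evaluation strategy.
SEARCH_SPACE = {
    "depth": [2, 4, 6],
    "width": [16, 32, 64],
    "op": ["conv3x3", "conv5x5", "skip"],
}

def sample_architecture():
    """Search strategy: here, plain random sampling."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    """Evaluation strategy: a stand-in proxy score instead of training."""
    return arch["depth"] * 0.1 + arch["width"] * 0.01

best = max((sample_architecture() for _ in range(20)), key=evaluate)
print("best architecture:", best)
```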
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space.
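The hybrid space idea can be sketched as a retrieval score that mixes similarity in the learned latent space with similarity between interpretable concept vectors (hypothetical function, assuming PyTorch; cosine is used in both spaces here for simplicity, and the balancing weight alpha is an assumption):

```python
import torch
import torch.nn.functional as F

def hybrid_similarity(video_feat, query_feat, video_concepts, query_concepts,
                      alpha=0.5):
    """Combine latent-space and concept-space matching (sketch only).

    `video_feat` / `query_feat`: dense embeddings in a learned latent space.
    `video_concepts` / `query_concepts`: per-concept probabilities, which
    keep the score interpretable. `alpha` balances the two spaces.
    """
    latent_sim = F.cosine_similarity(video_feat, query_feat, dim=-1)
    concept_sim = F.cosine_similarity(video_concepts, query_concepts, dim=-1)
    return alpha * latent_sim + (1 - alpha) * concept_sim

v = torch.randn(4, 256)   # video embeddings
q = torch.randn(4, 256)   # query embeddings
vc = torch.rand(4, 512)   # concept probabilities for videos
qc = torch.rand(4, 512)   # concept probabilities for queries
print(hybrid_similarity(v, q, vc, qc))
```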
Modeling multivariate time series has long been a subject that attracts researchers from a diverse range of fields, including economics, finance, and traffic. A basic assumption behind multivariate time series forecasting is that its variables depend on one another but, upon looking closely, it is fair to say that existing methods fail to fully exploit latent spatial dependencies between pairs of variables. In recent years, meanwhile, graph neural networks (GNNs) have shown high capability in handling relational dependencies. However, GNNs require well-defined graph structures for information propagation, which means they cannot be applied directly to multivariate time series where the dependencies are not known in advance. In this paper, we propose a general graph neural network framework designed specifically for multivariate time series data. Our approach automatically extracts the uni-directed relations among variables through a graph learning module, into which external knowledge such as variable attributes can easily be integrated. A novel mix-hop propagation layer and a dilated inception layer are further proposed to capture the spatial and temporal dependencies within the time series. The graph learning, graph convolution, and temporal convolution modules are jointly learned in an end-to-end framework. Experimental results show that our proposed model outperforms the state-of-the-art baseline methods on three of four benchmark datasets and achieves on-par performance with other approaches on two traffic datasets that provide extra structural information.
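A minimal sketch of such a graph learning module (assuming PyTorch; hyperparameters and names are illustrative): an antisymmetric combination of two node-embedding projections yields a uni-directed adjacency, since at most one of A[i, j] and A[j, i] can survive the ReLU.

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    """Learn a uni-directed adjacency matrix from node embeddings.

    The antisymmetric term m1 @ m2.T - m2 @ m1.T means A[i, j] and
    A[j, i] cannot both be positive, so learned dependencies are
    uni-directional (sketch in the spirit of the paper's module).
    """
    def __init__(self, num_nodes, dim, alpha=3.0):
        super().__init__()
        self.emb1 = nn.Embedding(num_nodes, dim)
        self.emb2 = nn.Embedding(num_nodes, dim)
        self.lin1 = nn.Linear(dim, dim)
        self.lin2 = nn.Linear(dim, dim)
        self.alpha = alpha

    def forward(self, idx):
        m1 = torch.tanh(self.alpha * self.lin1(self.emb1(idx)))
        m2 = torch.tanh(self.alpha * self.lin2(self.emb2(idx)))
        a = m1 @ m2.t() - m2 @ m1.t()          # antisymmetric scores
        return torch.relu(torch.tanh(self.alpha * a))

learner = GraphLearner(num_nodes=10, dim=16)
adj = learner(torch.arange(10))
print(adj.shape)   # torch.Size([10, 10])
```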
This paper introduces a novel semantic pooling method for event analysis tasks like detection, recognition, and recounting in long untrimmed Internet videos. Using semantic saliency, the approach ranks video shots to prioritize the most relevant ones, improving the classifier’s accuracy. The paper proposes a nearly-isotonic SVM classifier, validated with experiments on real-world datasets, showcasing significant performance improvements.
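The pooling step can be sketched as follows (illustrative only; the decaying-weight scheme below stands in for the paper's nearly-isotonic formulation): shots are reordered by semantic saliency, and weights shrink monotonically down the ranking so the most relevant shots dominate the video-level representation.

```python
import numpy as np

def semantic_pooling(shot_features, saliency_scores, decay=0.8):
    """Rank shots by semantic saliency and pool with decaying weights.

    Shots most relevant to the target event dominate the video-level
    feature; weights decrease monotonically down the ranking, echoing
    the nearly-isotonic constraint (illustrative sketch only).
    """
    order = np.argsort(-saliency_scores)          # most salient first
    weights = decay ** np.arange(len(order))
    weights /= weights.sum()
    return (weights[:, None] * shot_features[order]).sum(axis=0)

shots = np.random.rand(6, 128)    # per-shot features
saliency = np.random.rand(6)      # relevance to the event description
print(semantic_pooling(shots, saliency).shape)   # (128,)
```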