Video Pivoting Unsupervised Multi-Modal Machine Translation

Abstract

This paper introduces a video pivoting method for unsupervised multi-modal machine translation (UMMT). The method models the visual content of videos with spatial-temporal graphs and uses them to align source and target sentences in a shared latent space. Pivoting on this visual content improves translation accuracy and generalization across multiple languages, as demonstrated on the VATEX and HowToWorld datasets.
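
As a rough illustration of the pivoting idea only (the paper's actual model builds spatial-temporal graphs over objects in the video, which this sketch omits), the snippet below assumes pre-extracted sentence and video features and pulls captions in both languages toward the embedding of the video they describe, so the video acts as the pivot between otherwise unpaired sentences. All class names, dimensions, and the contrastive loss form are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of video-pivoted alignment in a shared latent space.
# Not the paper's model: spatial-temporal graph encoding is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PivotAligner(nn.Module):
    """Projects sentence and video features into a shared latent space."""

    def __init__(self, sent_dim=512, video_dim=1024, latent_dim=256):
        super().__init__()
        self.src_proj = nn.Linear(sent_dim, latent_dim)   # source-language sentences
        self.tgt_proj = nn.Linear(sent_dim, latent_dim)   # target-language sentences
        self.vid_proj = nn.Linear(video_dim, latent_dim)  # pooled video features

    def forward(self, src_feat, tgt_feat, vid_feat):
        # L2-normalize so cosine similarity reduces to a dot product
        z_src = F.normalize(self.src_proj(src_feat), dim=-1)
        z_tgt = F.normalize(self.tgt_proj(tgt_feat), dim=-1)
        z_vid = F.normalize(self.vid_proj(vid_feat), dim=-1)
        return z_src, z_tgt, z_vid


def pivot_alignment_loss(z_src, z_tgt, z_vid, temperature=0.07):
    """Contrastive loss aligning each language to its shared video pivot."""
    labels = torch.arange(z_vid.size(0), device=z_vid.device)
    logits_src = z_src @ z_vid.t() / temperature
    logits_tgt = z_tgt @ z_vid.t() / temperature
    return F.cross_entropy(logits_src, labels) + F.cross_entropy(logits_tgt, labels)


if __name__ == "__main__":
    model = PivotAligner()
    # Dummy batch: 8 videos, each with one source- and one target-language caption
    src = torch.randn(8, 512)
    tgt = torch.randn(8, 512)
    vid = torch.randn(8, 1024)
    z_src, z_tgt, z_vid = model(src, tgt, vid)
    loss = pivot_alignment_loss(z_src, z_tgt, z_vid)
    loss.backward()
    print(f"pivot alignment loss: {loss.item():.4f}")
```

Because both languages are anchored to the same video embedding, sentences describing the same clip end up close to each other in the latent space even though no parallel text is used, which is the unsupervised alignment effect the abstract refers to.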

Publication
IEEE Transactions on Pattern Analysis and Machine Intelligence