TVR Dataset

A Large-Scale Dataset for Video-Subtitle Moment Retrieval

What is TVR?

TV show Retrieval (TVR) is a new multimodal retrieval task in which a short video moment must be localized from a large video (with subtitle) corpus, given a natural language query. The associated TVR dataset is a large-scale, high-quality dataset consisting of 108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment. Read our paper for details.


TVR text data files, including train/val/test-public set annotations and subtitles:

We use the same set of videos as the TVQA dataset; click the button below to download 3 FPS video frames. Note that you will be redirected to the TVQA website.

We provide a code base to get you started, which includes basic data preprocessing and analysis tools, feature extraction tools, as well as our XML baseline model code. You can also find the associated video features in the repo.


The ground-truth video names and timestamp annotations are not released for the test-public set; you must submit your model predictions to our evaluation server. Follow the instructions below:

Submission Instructions

Fill out the Google Form below if you would like to show your results on our leaderboard:


This research is supported by NSF, DARPA, Google, and ARO.


Ask us questions: jielei [at]

TVR Leaderboard

TVR tests a system's ability to localize a moment from a large video (with subtitle) corpus. Performance is measured by R@K (Recall@K, K = 1, 10, 100), where a prediction counts as correct only if it falls in the ground-truth video and has temporal IoU >= 0.7 with the ground-truth moment.
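As a rough illustration of this metric, the sketch below computes temporal IoU between two moments and Recall@K over a ranked prediction list. This is our own minimal reimplementation for clarity, with assumed function names and data layout; it is not the official evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, gt, k, iou_thresh=0.7):
    """1 if any top-k prediction hits the ground truth, else 0.

    ranked_preds: list of (video_id, start, end), best first.
    gt: (video_id, start, end).
    A hit requires both the correct video and IoU >= iou_thresh.
    """
    for vid, start, end in ranked_preds[:k]:
        if vid == gt[0] and temporal_iou((start, end), gt[1:]) >= iou_thresh:
            return 1
    return 0
```

The reported R@K numbers are then the average of `recall_at_k` over all queries in the evaluation set.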



Date         | Team                                                                      | Links                  | R@1  | R@10  | R@100
Jan 20, 2020 | UNC Chapel Hill                                                           | Paper, Code            | 3.32 | 13.41 | 30.52
Jan 20, 2020 | ENS & INRIA & CIIRC + KAUST & Adobe Research & INRIA (Implemented by UNC) | MEE Paper, CAL Paper   | 0.66 | 3.09  | 12.03
Jan 20, 2020 | ENS & INRIA & CIIRC + CMU (Implemented by UNC)                            | MEE Paper, ExCL Paper  | 0.40 | 1.73  | 2.87
Jan 20, 2020 | UNC Chapel Hill                                                           |                        | 0.00 | 0.00  | 0.07