TVC Dataset

A large-scale multimodal video captioning dataset

What is TVC?

TV show Captions (TVC) is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe the dialogue/subtitles, while captions in other datasets describe only the visual content. Read our paper for details.


TVC text data files, including annotations and subtitles for the train/val/test-public sets:

We use the same set of videos as the TVQA dataset; click the button below to download 3 FPS video frames. Note that you will be redirected to the TVQA website.

We provide a codebase to get you started, which includes basic data preprocessing and analysis tools, feature extraction tools, and our MMT baseline model code. You can also find the associated video features in the repo.
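The annotation files can be parsed with a few lines of Python. The sketch below assumes a JSON Lines layout (one record per line) and uses hypothetical field names (vid_name, ts, descs) for illustration; the released files may differ.

```python
import json
import os
import tempfile

# Hypothetical TVC-style annotation record: the field names (vid_name,
# ts, descs) are assumptions and may differ from the released files.
sample = {
    "vid_name": "friends_s01e01_seg02_clip_00",
    "ts": [10.5, 15.2],  # moment start/end, in seconds
    "descs": [{"desc": "Rachel walks into the coffee shop."}],
}

# Write one record per line (JSON Lines).
path = os.path.join(tempfile.mkdtemp(), "tvc_train_sample.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(sample) + "\n")

# Load every caption/moment pair back.
with open(path) as f:
    records = [json.loads(line) for line in f]

for r in records:
    start, end = r["ts"]
    print(f"{r['vid_name']} [{start:.1f}s-{end:.1f}s]: {r['descs'][0]['desc']}")
```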


Reference captions are not released for the test-public set; you need to submit your model predictions to our evaluation server. Follow the instructions below:

Submission Instructions

Fill out the Google Form below if you would like your results shown on our leaderboard:


This research is supported by NSF, DARPA, Google, and ARO.


Ask us questions: jielei [at]

TVC Leaderboard

TVC requires systems to gather information from both the video and the subtitles to generate relevant descriptions. Performance is measured by B@4 (BLEU@4), M (METEOR), R (ROUGE-L), and C (CIDEr-D).
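To make the B@4 column concrete, here is a toy single-reference BLEU@4 in plain Python. It is illustrative only: the official evaluation uses multi-reference corpus-level scoring, and the example sentences below are made up.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    """Toy single-reference BLEU@4: geometric mean of clipped 1-4-gram
    precisions times a brevity penalty, with a tiny floor to avoid
    log(0). Not the official multi-reference evaluation."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precision = 0.0
    for n in range(1, 5):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precision += math.log(max(overlap, 1e-9) / total) / 4
    brevity = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_precision)

# An exact match scores 1.0; an unrelated caption scores near 0.
print(bleu4("sheldon knocks on penny 's door",
            "sheldon knocks on penny 's door"))
```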



Date          Model            Team             Links        B@4    M      R      C
Sep 15, 2020  -                MS D365 AI       Paper, Code  12.35  17.64  34.16  49.98
Jan 20, 2020  MMT (video+sub)  UNC Chapel Hill  Paper, Code  10.87  16.91  32.81  45.38
Jan 20, 2020  MMT (video)      UNC Chapel Hill  Paper, Code   9.98  15.23  30.44  36.07
Jan 20, 2020  MMT (sub)        UNC Chapel Hill  Paper, Code   6.33  13.92   7.73  33.76