TVC (TV show Caption) is a large-scale multimodal captioning dataset containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe dialogue/subtitles, while the captions in other datasets describe only the visual content. Read our paper for details.
TVC text data files, including train/val/test-public sets annotations and subtitles:
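If you want a quick look at the annotation files before running the full code base, a minimal loading sketch is below. It assumes the files are in JSON Lines format (one JSON object per line); the field names shown in the comment are assumptions for illustration, so check the downloaded files for the exact schema.

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line.

    For TVC-style annotations, each object is assumed to hold fields
    such as a video id, a timestamp span, and a caption string
    (hypothetical names; verify against the released files).
    """
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

This returns a plain list of dicts, which is convenient for the kind of basic preprocessing and analysis the code base provides.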
We use the same set of videos as the TVQA dataset; click the button below to download 3 FPS video frames. Note that you will be redirected to the TVQA website.
We provide a code base to get you started, which includes basic data preprocessing and analysis tools, feature extraction tools, and our MMT baseline model code. You can also find the associated video features in the repo.
Reference captions are not released for the test-public set; you need to submit your model predictions to our evaluation server. Follow the instructions below:
Submission Instructions
Fill out the Google Form below if you want to show your results on our Leaderboard:
This research is supported by NSF, DARPA, Google, and ARO.
Ask us questions: tvr-tvc-unc@googlegroups.com or jielei [at] cs.unc.edu.
TVC requires systems to gather information from both the video and the subtitles to generate relevant descriptions. Performance is measured by B@4 (BLEU@4), M (METEOR), R (ROUGE-L), and C (CIDEr-D).
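To make the primary metric concrete, here is a minimal sentence-level BLEU@4 sketch in pure Python: uniform weights over 1- to 4-gram clipped precisions, add-one smoothing, and a brevity penalty. This is an illustrative simplification, not the official evaluation code; the leaderboard numbers come from the standard corpus-level scripts.

```python
from collections import Counter
import math

def bleu4(candidate, references):
    """Sentence-level BLEU@4 sketch (whitespace tokens, add-one smoothing)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so a single zero count does not zero the score.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / 4)
```

A perfect match scores 1.0; partial overlap scores strictly between 0 and 1, and short candidates are penalized by the brevity term.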
| Rank | Date | Model | Team | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|
| 1 | Sep 15, 2020 | HERO (Paper, Code) | MS D365 AI | 12.35 | 17.64 | 34.16 | 49.98 |
| 2 | Jan 20, 2020 | MMT (video+sub) (Paper, Code) | UNC Chapel Hill | 10.87 | 16.91 | 32.81 | 45.38 |
| 3 | Jan 20, 2020 | MMT (video) (Paper, Code) | UNC Chapel Hill | 9.98 | 15.23 | 30.44 | 36.07 |
| 4 | Jan 20, 2020 | MMT (sub) (Paper, Code) | UNC Chapel Hill | 6.33 | 13.92 | 7.73 | 33.76 |