
[MM'23] Video Entailment via Reaching a Structure-Aware Cross-modal Consensus


MM2023 - SAMCN

Introduction


Video Entailment via Reaching a Structure-Aware Cross-modal Consensus

Xuan Yao, Junyu Gao, Mengyuan Chen, Changsheng Xu*

*Correspondence should be addressed to C.X.

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences.

Paper Link on ACM MM 2023

Prerequisites

1 Environment

conda create -n samcn python=3.9
conda activate samcn
pip install -r requirements.txt
  • requirements.txt lists the core dependencies for running the code in the SAMCN packages. NOTE: PyTorch >= 1.2 is required.
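The PyTorch version constraint can be checked programmatically. The helper below is a hypothetical sketch (not part of the SAMCN codebase) that compares a dotted version string, such as `torch.__version__`, against the stated minimum:

```python
def meets_min_version(version: str, minimum: tuple = (1, 2)) -> bool:
    """Return True if a dotted version string (e.g. torch.__version__)
    is at least the given (major, minor) minimum.

    Local build suffixes like '1.12.1+cu113' are tolerated because only
    the first two dot-separated components are compared.
    """
    parts = tuple(int(p.split("+")[0]) for p in version.split(".")[:2])
    return parts >= minimum
```

For example, `meets_min_version("1.12.1+cu113")` is `True`, while `meets_min_version("1.1.0")` is `False`.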

2 Data Preparation

a) VIOLIN dataset:

  1. We use the visual features, statements, and subtitles provided by the CVPR 2020 paper VIOLIN: A Large-Scale Dataset for Video-and-Language Inference. Please download the visual features (C3D features), statements, and subtitles, and unzip them under the ./dataset/violin folder.

  2. We represent the statements and subtitles using the pretrained RoBERTa encoder from the arXiv 2019 paper RoBERTa: A Robustly Optimized BERT Pretraining Approach. Please download the pretrained RoBERTa model and put it into the ./roberta.base folder.
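Before training, it can help to verify that the layout described above is in place. The check below is a hypothetical pre-flight script (not part of the repository); it assumes only the ./dataset/violin and ./roberta.base paths named in the steps above:

```python
from pathlib import Path

def missing_dirs(root: str = ".") -> list:
    """Return the expected data/model directories that are missing under root.

    An empty list means the layout from the data-preparation steps is present.
    """
    expected = ["dataset/violin", "roberta.base"]
    return [d for d in expected if not (Path(root) / d).is_dir()]
```

Running `missing_dirs()` from the repository root before launching training surfaces a forgotten download immediately, instead of part-way through data loading.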

b) VLEP dataset (TODO)

Train & Test

a) VIOLIN dataset:

python violin_main.py --results_dir_base 'YOUR OUTPUT PATH' \
                      --feat_dir ./dataset/violin \
                      --bert_dir ./roberta.base \
                      --model VlepModel \
                      --data ViolinDataset \
                      --lr1 5e-6 \
                      --beta1 0.9 \
                      --first_n_epoch 60 \
                      --batch_size 8 \
                      --test_batch_size 8 \
                      --feat c3d \
                      --input_streams vid sub \
                      --dropout 0.3 \
                      --cmcm \
                      --cmcm_loss
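For hyperparameter sweeps, the invocation above can be assembled programmatically. The wrapper below is a hypothetical convenience (not part of the repository); it simply mirrors the flag names and defaults from the example command:

```python
def build_cmd(results_dir: str, **overrides) -> list:
    """Assemble the violin_main.py command line as an argv list.

    Keyword overrides replace the defaults taken from the example
    invocation (e.g. build_cmd(out, batch_size=16)).
    """
    cfg = {
        "results_dir_base": results_dir,
        "feat_dir": "./dataset/violin",
        "bert_dir": "./roberta.base",
        "model": "VlepModel",
        "data": "ViolinDataset",
        "lr1": "5e-6",
        "beta1": "0.9",
        "first_n_epoch": "60",
        "batch_size": "8",
        "test_batch_size": "8",
        "feat": "c3d",
        "dropout": "0.3",
    }
    cfg.update(overrides)
    cmd = ["python", "violin_main.py"]
    for key, value in cfg.items():
        cmd += [f"--{key}", str(value)]
    # Multi-valued and boolean flags, as in the example invocation:
    cmd += ["--input_streams", "vid", "sub", "--cmcm", "--cmcm_loss"]
    return cmd
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with the output path.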

b) VLEP dataset (TODO)

Acknowledgements

We acknowledge that part of the video entailment code for the VIOLIN dataset is adapted from the official VIOLIN repository. Thanks to the authors for sharing their code.

Citation

Feel free to cite this work if you find it useful!

@inproceedings{Yao2023VideoEV,
  title={Video Entailment via Reaching a Structure-Aware Cross-modal Consensus},
  author={Xuan Yao and Junyu Gao and Mengyuan Chen and Changsheng Xu},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  year={2023},
  url={https://api.semanticscholar.org/CorpusID:264492780}
}
