PhotoChat is the first dataset that casts light on photo-sharing behavior in online messaging. PhotoChat contains 12k dialogues, each paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context.
17 PAPERS • 2 BENCHMARKS
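To make the two PhotoChat tasks concrete, here is a minimal Python sketch. The record fields (`shares_photo`, `photo_id`, etc.) are illustrative assumptions, not the actual PhotoChat schema, and the retrieval scoring is a plain dot product rather than any published baseline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Turn:
    speaker: str
    text: str
    shares_photo: bool          # True if a photo is shared in this turn (assumed field)

@dataclass
class PhotoChatDialogue:
    turns: List[Turn]
    photo_id: str               # the single user photo paired with the dialogue (assumed field)

def intent_examples(dialogue: PhotoChatDialogue) -> List[Tuple[List[str], bool]]:
    """Task 1: for each dialogue prefix, predict whether a photo
    will be shared in the next turn (binary classification)."""
    examples = []
    for i in range(1, len(dialogue.turns)):
        context = [t.text for t in dialogue.turns[:i]]
        label = dialogue.turns[i].shares_photo
        examples.append((context, label))
    return examples

def retrieve_photo(context_embedding: List[float],
                   photo_embeddings: List[List[float]]) -> int:
    """Task 2: rank candidate photos by similarity to the dialogue context;
    return the index of the best match (simple dot-product scoring)."""
    scores = [sum(c * p for c, p in zip(context_embedding, emb))
              for emb in photo_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```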
MMDialog is a large-scale multi-turn dialogue dataset containing multi-modal open-domain conversations derived from real human-human chats on social media. MMDialog contains 1.08M dialogue sessions and 1.53M associated images. On average, a dialogue session has 2.59 images, which can appear at any conversation turn.
15 PAPERS • 1 BENCHMARK
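The sketch below illustrates the session structure described above, where each turn may carry zero or more images; the field names are assumptions for illustration, not MMDialog's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MMTurn:
    speaker: str
    text: str = ""
    image_ids: List[str] = field(default_factory=list)  # zero or more images per turn

@dataclass
class MMSession:
    turns: List[MMTurn]

def images_per_session(sessions: List[MMSession]) -> float:
    """Average number of images per session (MMDialog reports 2.59 overall)."""
    total = sum(len(t.image_ids) for s in sessions for t in s.turns)
    return total / len(sessions) if sessions else 0.0

# Example: a two-turn session where the image appears mid-conversation.
session = MMSession(turns=[
    MMTurn(speaker="A", text="Just got back from the coast!", image_ids=["img_001"]),
    MMTurn(speaker="B", text="Looks amazing, where is that?"),
])
print(images_per_session([session]))  # 1.0
```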
MIntRec is a novel dataset for multimodal intent recognition. It formulates coarse-grained and fine-grained intent taxonomies based on data collected from the TV series Superstore. The dataset consists of 2,224 high-quality samples with text, video, and audio modalities, annotated across twenty intent categories.
7 PAPERS • 1 BENCHMARK
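As a sketch of the MIntRec task setting, the PyTorch module below classifies a sample into twenty intent categories from text, video, and audio features via simple late fusion. The feature dimensions and fusion scheme are assumptions for illustration, not the MIntRec reference baseline.

```python
import torch
import torch.nn as nn

NUM_INTENTS = 20  # MIntRec defines twenty intent categories

class LateFusionIntentClassifier(nn.Module):
    """Minimal late-fusion baseline: project per-modality features
    (text, video, audio) into a shared space, concatenate, and classify."""
    def __init__(self, text_dim=768, video_dim=256, audio_dim=128, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.classifier = nn.Linear(3 * hidden, NUM_INTENTS)

    def forward(self, text_feat, video_feat, audio_feat):
        fused = torch.cat([
            torch.relu(self.text_proj(text_feat)),
            torch.relu(self.video_proj(video_feat)),
            torch.relu(self.audio_proj(audio_feat)),
        ], dim=-1)
        return self.classifier(fused)  # logits over intent categories

# Example with random features for a batch of 4 samples.
model = LateFusionIntentClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 20])
```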