Technicolor Rich Multimedia Retrieval from Input Videos Grand Challenge

Visual search, which aims at retrieving copies of an image as well as information on a specific object, person or place in that image, has progressed dramatically in the past few years. Thanks to modern techniques for large-scale image description, indexing and matching, such image-based information retrieval can be conducted either in a structured image database for a given topic (e.g., photos in a collection, paintings, book covers, monuments) or in an unstructured image database that is weakly labeled (e.g., via user-input tags or surrounding text, including captions).

This Grand Challenge aims at exploring tools to push this search paradigm forward by addressing the following question: how can we search unstructured multimedia databases based on video queries? This problem is already encountered in professional environments where large semi-structured multimedia assets, such as TV/radio archives or cultural archives, are operationally managed. In these cases, resorting to trained professionals such as archivists remains the rule, both to annotate part of the database beforehand and to conduct searches. Unfortunately, this workflow does not apply to large-scale search of wildly unstructured repositories accessible online.

The challenge is to automatically retrieve and organize relevant multimedia documents based on an input video. In a scenario where the input video features a news story, for instance, can we retrieve other videos, articles and photos about the same news story? And, when the retrieved information is voluminous, how can these multimedia documents be linked, organized and summarized for easy reference, navigation and exploitation?

This is exactly the prototypical scenario we propose to explore with this Grand Challenge:

Given a short news video with its rich content (visible people and places, scene text and overlaid headlines, soundtrack and voice-over), the system should output and analyze a corpus of relevant documents: other videos on the same or a related story, related news articles, links between retrieved documents or parts of them, meaningful subgroups of tightly related documents, an automatic summary of the story, a description of the video content (what, who, where, when?), etc.

Participants may address only part of this complete scenario, though systems addressing the whole pipeline (joint exploitation of visual and audio input information; retrieval of multimodal documents; organization and summarization of these documents in relation to the input video) would be preferred. Regarding test beds, participants should feel free to report results on any real data of their choice. In addition, a small selection of representative input videos can be downloaded here.


Patrick Perez  patrick.perez -at-
Christophe Diot  christophe.diot -at-
