Yahoo! Large-scale Flickr-tag Image Classification Grand Challenge

Image classification is one of the fundamental problems of computer vision and multimedia research. With the proliferation of the Internet, the availability of cheap digital cameras, and the ubiquity of cell-phone cameras, the amount of accessible visual content has increased astronomically. Flickr alone boasts over 5 billion images, not counting the many other such websites and the countless images that are never published online. This explosion poses unique challenges for the classification of images.

Classification with a large number of classes and images has attracted several research efforts in recent years. The availability of datasets such as ImageNet [1], which contains over 14 million images and over 21 thousand classes, has motivated researchers to develop classification algorithms that can deal with large quantities of data. However, most of the effort has been dedicated to building systems that scale up when the number of classes is large. In this challenge we are interested in learning classifiers when the number of images per class is large. Some recent work deals with thousands of training images; in this challenge, however, we are looking at 200,000 images per class. What makes the challenge difficult is that the annotations are provided by Flickr users, so they might not always be accurate. Furthermore, each class can be considered a collection of sub-classes with varied visual properties.

In summary, the challenge is to classify images with:

  1. Hundreds of thousands of images per class: for each class, 150K training images and another 50K test images will be provided. This is much larger than existing image classification challenges.
  2. Visually diverse sub-classes within each class: for example, the class “Nature” contains images from sub-classes such as “Waterfall”, “Mountains” and “Beach”.
  3. Sub-classes that may be “visually” unrelated to the root class: for example, the class “Nature” contains some images of scientists who study how nature works, which are not necessarily images of nature.
  4. User-provided annotations: Flickr users have provided all the annotations, so the labels reflect their tagging behavior.


The task is to classify images into a given set of classes. The classifiers should be learned using a training set of images and evaluated on a separate set of test images (both provided by Yahoo!). The system should assign a score to each test image with respect to each of the classes. Note that the final output should be a confidence score that the image belongs to the class, and not just a binary 0-1 output.

Methodology and Dataset

Yahoo! will release a set of ten image classes:

  1. nature
  2. food
  3. people
  4. wedding
  5. music
  6. sky
  7. london
  8. beach
  9. 2012
  10. travel

with 150K training and 50K test images per class. These classes are among the top tags assigned by Flickr users. Both download links for the original images and a precomputed set of features will be provided. Participants should learn classifiers based on the image content and/or the provided image features. Any cross-validation should be performed on the training set.
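As an illustration of cross-validating on the training set alone, here is a minimal sketch of a k-fold split over image indices. The function name `k_fold_indices`, the choice of five folds and the fixed seed are assumptions made for this example; the challenge does not prescribe any particular validation scheme.

```python
import random


def k_fold_indices(n_images, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Hypothetical helper: indices refer to positions in the training set
    only, so the test images are never touched during model selection.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, k=5):
        print(len(train), len(validation))
```

Each training image appears in exactly one validation fold, so every image contributes to both fitting and validation across the k rounds.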


The classifiers will be evaluated on a per-class basis using average precision (AP), similar to the classification task of the PASCAL Visual Object Classes (VOC) Challenge [2]. For a given class, participants should report the precision/recall curve computed from the classifier output. Recall is the fraction of all positive images that are retrieved at a given rank. Precision is the fraction of retrieved images at a given rank that are from the positive class. The AP summarises the shape of the precision/recall curve and is computed as the mean precision at a set of eleven equally spaced recall levels [0,0.1,...,1]. Furthermore, participants are encouraged to evaluate their results using 1K, 5K, 10K, 20K, 50K, 100K and 150K training images. Note that the test tags may be noisy, but algorithms should still be tested on the available test set. As with the other Grand Challenges, finalists will be selected based on the scientific interest of their solution through a peer review process, which will take the AP performance of the solution into consideration. Preference will be given to systems that use fewer training images to achieve the same precision, and to simpler approaches.
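The AP definition above can be sketched in a few lines of Python. This is an illustrative implementation, not the official evaluation script. The function name `eleven_point_ap` is hypothetical, and the precision at each recall level is taken as the maximum precision at any recall greater than or equal to that level, an interpolation convention borrowed from the VOC evaluation [2] and assumed here.

```python
def eleven_point_ap(scores, labels):
    """Eleven-point average precision for one class.

    scores: confidence per test image (higher means more likely positive).
    labels: 1 if the image carries the class tag, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive image")

    # Rank images by decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Precision and recall after each rank.
    precisions, recalls = [], []
    hits = 0
    for rank, i in enumerate(order, start=1):
        hits += labels[i]
        precisions.append(hits / rank)
        recalls.append(hits / total_pos)

    # Assumed interpolation: best precision at any recall >= the level.
    total = 0.0
    for k in range(11):
        level = k / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / 11


if __name__ == "__main__":
    print(eleven_point_ap([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

Because the measure depends only on the ranking induced by the scores, a binary 0-1 output would collapse most of the ranking into ties, which is why real-valued confidence scores are needed.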


The training and test data can be downloaded from Yahoo! Webscope™ Program at

Note that to download the dataset you must be affiliated with a research lab at a company or a university. To download the dataset, click on the above link and then click on “I1 – Yahoo! Flickr Creative Common Images tagged with ten concepts, version 1.0”. Add the dataset to the cart and follow the next few screens to fill in the required information.


Nikhil Rasiwasia   nikux -at-
Kshitiz Garg    kshitizg -at-


[1] J. Deng, A. Berg, K. Li and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? Proceedings of the 11th European Conference on Computer Vision (ECCV), 2010.

[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2), 303-338, 2010.
