WIKImage project

WIKImage is a project to build datasets of correlated images and text for data mining experiments and for exploring how combined image and text representations influence classification algorithms.

WIKImage started as part of the bilateral project between Slovenia and Serbia, "Correlating Images and Words: Enhancing Image Analysis Through Machine Learning and Semantic Technologies".

Instances in the datasets were manually classified, using the following binary labels: abstract and generated, animals, art, buildings and constructions, documents and maps, logos and flags, machines, tools and tech, misc nontech objects, nature and scenic, people, plants and fungi, space, sports, and vehicles. Some of these labels were introduced with the idea of being secondary -- generally applied together with another label, but with the potential for providing interesting results when used later. For instance, sports would probably appear with people or vehicles. Additionally, a few special labels were applied -- for pictures with no captions, for pictures with bad captions, and for pictures whose tagging was questionable.
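Because the labels are binary and several may apply to one image, each instance can be viewed as a vector of 0/1 flags. The following is a minimal sketch of that idea, not the project's actual file format: the label subset, its order, and the function name `encode_labels` are assumptions made for illustration.

```python
# Illustrative subset of the label names; the order is an arbitrary choice.
LABELS = ["animals", "people", "sports", "vehicles"]


def encode_labels(assigned, labels=LABELS):
    """Turn a set of label names into a list of 0/1 flags, one per label."""
    unknown = set(assigned) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in assigned else 0 for name in labels]


# A picture of athletes carries a primary and a secondary label at once.
print(encode_labels({"people", "sports"}))  # [0, 1, 1, 0]
```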

Published papers

If you wish to use this data, please reference our paper in your work.

d3 dataset

The d3 dataset is the second public dataset. The main goal was to create a larger set. It was created from the images in Category: Creative Commons Attribution-ShareAlike 3.0 images. A total of 15941 images were collected in the first run. Of these, 11491 have been manually labeled so far, using the above-mentioned labels.

Captions were automatically extracted from the containing pages. Some images have no caption; for these, the CSV files contain "?" in place of the caption.
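When loading the data, the "?" placeholder is best converted to an explicit missing value so that it is not treated as caption text. The sketch below assumes comma-separated files with a header row; the column names `image` and `caption` and the sample rows are hypothetical and should be adjusted to the actual files.

```python
import csv
import io

MISSING = "?"  # placeholder used for images without a caption


def read_captions(csv_file, caption_field="caption"):
    """Yield each row as a dict, mapping a missing caption to None."""
    for row in csv.DictReader(csv_file):
        if row[caption_field].strip() == MISSING:
            row[caption_field] = None
        yield row


# Small in-memory stand-in for a dataset file (made-up rows).
sample = io.StringIO(
    "image,caption\n"
    "example1.jpg,A red tractor in a field\n"
    "example2.jpg,?\n"
)
rows = list(read_captions(sample))
print([row["caption"] for row in rows])
```

For a real file, pass an open file object instead of the `io.StringIO` stand-in, for example `open(path, newline="")`.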

Paragraphs were also automatically extracted from the containing pages. Some images have no proper paragraph.

SIFT data was created using the OpenSIFT library. A codebook of 400 representative features was created from a sample of the SIFT features, and a histogram over this codebook was then computed for every image.
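The codebook and histogram steps follow the usual bag-of-visual-words pattern, which can be sketched as below. This is an illustration under stated assumptions, not the code used to build the dataset: the text does not say how the 400 representative features were chosen, so plain k-means is assumed here, and the example uses random 8-dimensional vectors and a codebook of 4 entries in place of real 128-dimensional SIFT descriptors and 400 entries. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def build_codebook(sample, k, iterations=10, seed=0):
    """Cluster a sample of descriptors with k-means (assumed method)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(sample, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for vec in sample:
            clusters[nearest(vec, centers)].append(vec)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def histogram(descriptors, codebook):
    """Count how many of an image's descriptors fall on each codebook entry."""
    counts = [0] * len(codebook)
    for vec in descriptors:
        counts[nearest(vec, codebook)] += 1
    return counts


# Synthetic stand-in data: 3 "images" with 30 descriptors each.
rng = random.Random(1)
images = [[[rng.random() for _ in range(8)] for _ in range(30)] for _ in range(3)]
sample = [d for img in images for d in img[:10]]  # sample used for the codebook
codebook = build_codebook(sample, k=4)
hists = [histogram(img, codebook) for img in images]
print(hists)
```

Each image ends up with a fixed-length vector regardless of how many descriptors it contains, which is what makes the histograms usable as features for a classifier.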


d1 dataset

Dataset d1 is one of the initial experiments in data collection. It was created from the English Wikipedia (API:, with images from Category: Creative Commons Attribution-ShareAlike 2.5 images, and contains 1007 instances.

All instances were manually classified with the above-mentioned labels.


Browse the dataset online