Summary: | 1. Oxford-IIIT combined four methods: a spatial pyramid intersection kernel SVM image classifier, a sliding-window random forest object detector, a sliding-window intersection kernel SVM object detector, and a discriminative constellation model facial feature extractor. For each of the twenty features, the methods were ranked by their performance on a validation set and assigned to successive runs in order of decreasing performance. For training, TRECVID annotations were manually corrected and augmented with object bounding boxes, and additional training data was used for under-represented features such as Airplane flying.
<br>
2. The methods performed significantly differently depending on the feature, as expected from their design.
<br>
3. The image classifier worked better for scene-level features such as Cityscape, Classroom, and Doorway; the object detectors worked better for Boat or ship, Bus, and Person riding a bicycle; and the face feature extractor worked well for Female face closeup.
<br>
4. Three conclusions can be drawn: (i) different features are better addressed by specialised methods, (ii) removing noise from TRECVID annotations significantly improves performance, and (iii) adding data for under-represented features significantly improves performance.
|
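
The run-assignment step in point 1 (rank the methods per feature on a validation set, then map them to successive runs by decreasing performance) can be sketched as follows. This is a minimal illustration, not the team's actual code: the function name, the method labels, and the validation scores are hypothetical placeholders, and it assumes a higher score is better.

```python
from typing import Dict, List


def assign_methods_to_runs(
    val_scores: Dict[str, Dict[str, float]]
) -> List[Dict[str, str]]:
    """Map each feature's methods to runs by decreasing validation score.

    val_scores[feature][method] is a validation score (assumed: higher is
    better). Returns a list of runs; runs[0] holds the best method for
    every feature, runs[1] the second best, and so on.
    """
    n_runs = max((len(methods) for methods in val_scores.values()), default=0)
    runs: List[Dict[str, str]] = [{} for _ in range(n_runs)]
    for feature, by_method in val_scores.items():
        # Rank this feature's methods from best to worst on validation data.
        ranked = sorted(by_method, key=by_method.get, reverse=True)
        for run_idx, method in enumerate(ranked):
            runs[run_idx][feature] = method
    return runs


# Hypothetical placeholder scores, invented purely for illustration.
example_scores = {
    "Cityscape": {"image_classifier": 0.30, "rf_detector": 0.10},
    "Bus": {"image_classifier": 0.10, "rf_detector": 0.20},
}
example_runs = assign_methods_to_runs(example_scores)
```

With these placeholder scores, the first run uses the image classifier for Cityscape and the random forest detector for Bus, and the second run uses the remaining method for each feature.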