CNN features off-the-shelf: An Astounding Baseline for Recognition
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. (Computer Vision)
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. (Computer Vision) ORCID iD: 0000-0001-5211-6388
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. (Computer Vision)
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. (Computer Vision)
2014 (English). In: Proceedings of CVPR 2014, 2014. Conference paper (Refereed)
Abstract [en]

Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle a diverse range of recognition tasks (object image classification, scene recognition, fine-grained recognition, attribute detection and image retrieval) applied to a diverse set of datasets. We selected these tasks and datasets because they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistently superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods, except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g., jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
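The pipeline described in the abstract (fixed features, then a linear SVM for classification or L2 distance for retrieval) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy 4-dimensional vectors stand in for the 4096-dimensional OverFeat features, the function names are hypothetical, and the plain subgradient-descent trainer is a simple stand-in, not the solver used in the paper.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def l2_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=100):
    """Binary linear SVM without a bias term, trained by subgradient descent
    on the L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # The regularizer always shrinks w slightly.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # The hinge loss only contributes when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

def rank_by_l2(query, database):
    """Indices of database vectors sorted from nearest to farthest."""
    return sorted(range(len(database)), key=lambda i: l2_distance(query, database[i]))

# Hypothetical toy "features"; real ones would come from a pretrained network.
RAW_FEATURES = [
    [0.9, 0.1, 0.0, 0.0],
    [1.0, 0.2, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0],
]
LABELS = [1, 1, -1, -1]

database = [l2_normalize(f) for f in RAW_FEATURES]
weights = train_linear_svm(database, LABELS)
query = l2_normalize([0.8, 0.3, 0.0, 0.0])
predicted_label = predict(weights, query)
nearest_first = rank_by_l2(query, database)
```

The point of the sketch is that the representation stays fixed: only the linear classifier is learned, and retrieval needs no learning at all.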

National Category
Computer Science
URN: urn:nbn:se:kth:diva-149178
DOI: 10.1109/CVPRW.2014.131
ISI: 000349552300079
ScopusID: 2-s2.0-84908537903
OAI: diva2:738235
Computer Vision and Pattern Recognition (CVPR) 2014, DeepVision workshop, June 28, 2014, Columbus, Ohio

Best Paper Runner-up Award.

QC 20140825

Available from: 2014-08-16. Created: 2014-08-16. Last updated: 2016-09-08. Bibliographically approved.
In thesis
1. Visual Representations and Models: From Latent SVM to Deep Learning
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Two important components of a visual recognition system are representation and model. Both involve selecting and learning the features that are indicative for recognition and discarding those that are uninformative. This thesis, in its general form, proposes different techniques for representation and modeling within the frameworks of two learning systems: latent support vector machines (latent SVMs) and deep learning.

First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class.

In another effort, a strongly supervised deformable part model (DPM) is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e., their presence and location), which are used to constrain the DPM's latent variables during or prior to the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection.

Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence.
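For orientation, the standard latent SVM that these paragraphs build on scores an input $x$ by maximizing over its latent variables $z$ and is trained with a hinge loss. The notation below ($w$, $\Phi$, $Z(x)$, $C$) is the conventional one and is assumed here, not taken from the thesis:

$$
f_{w}(x) = \max_{z \in Z(x)} w^{\top} \Phi(x, z), \qquad
\min_{w} \; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \max\bigl(0,\; 1 - y_i f_{w}(x_i)\bigr)
$$

One way to read the negative latent variables described above is as a second maximization whose score counts against the object hypothesis. This form is an illustrative assumption, and the exact formulation in the thesis may differ:

$$
f(x) = \max_{z^{+}} w_{+}^{\top} \Phi_{+}(x, z^{+}) \;-\; \max_{z^{-}} w_{-}^{\top} \Phi_{-}(x, z^{-})
$$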

Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with $\sim1.3$ million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPM). We further investigate ways to improve this representation for a given task. We propose various factors, applied before or after training the representation, that can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.
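The idea of treating the activations at each layer as a representation can be sketched with a toy network. This is a minimal illustration under stated assumptions: a two-layer fully connected ReLU network with made-up weights stands in for a trained ConvNet, and all names (`forward_with_activations`, `TOY_LAYERS`) are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def dense(weights, bias, x):
    """Fully connected layer: one output per weight row, plus its bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward_with_activations(layers, x):
    """Run the input through every layer and keep each layer's output.
    Any entry of the returned list can serve as a feature vector for x."""
    activations = []
    hidden = list(x)
    for weights, bias in layers:
        hidden = relu(dense(weights, bias, hidden))
        activations.append(hidden)
    return activations

# Made-up weights for illustration; a real setup would load a trained network.
TOY_LAYERS = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]], [0.0, 0.5]),          # 3 inputs -> 2 units
    ([[1.0, 1.0], [-1.0, 0.5], [0.5, -2.0]], [0.0, 0.0, 1.0]),   # 2 units -> 3 units
]

layer_features = forward_with_activations(TOY_LAYERS, [1.0, 2.0, 3.0])
first_layer_representation = layer_features[0]
last_layer_representation = layer_features[-1]
```

Which layer to read the representation from is itself a design choice; the sketch simply exposes all of them so that a downstream classifier can be trained on any one.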

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 172 p.
TRITA-CSC-A, ISSN 1653-5723 ; 21
Keywords
Computer Vision, Machine Learning, Artificial Intelligence, Deep Learning, Learning Representation, Deformable Part Models, Discriminative Latent Variable Models, Convolutional Networks, Object Recognition, Object Detection
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Computer Systems
Research subject
Computer Science
urn:nbn:se:kth:diva-192289 (URN)
978-91-7729-110-7 (ISBN)
Public defence
2016-09-27, Kollegiesalen, Brinellvägen 8, KTH-huset, våningsplan 4, KTH Campus, Stockholm, 15:26 (English)

QC 20160908

Available from: 2016-09-08. Created: 2016-09-08. Last updated: 2016-09-09. Bibliographically approved.

By author/editor
Sharif Razavian, Ali; Azizpour, Hossein; Sullivan, Josephine; Carlsson, Stefan