Visual Representations and Models: From Latent SVM to Deep Learning
Azizpour, Hossein
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP (Computer Vision)
ORCID iD: 0000-0001-5211-6388
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Two important components of a visual recognition system are the representation and the model. Both involve selecting and learning the features that are indicative for recognition while discarding those that are uninformative. This thesis, in its general form, proposes different techniques within the frameworks of two learning systems for representation and modeling: latent support vector machines (latent SVMs) and deep learning.

First, we propose various approaches to grouping the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class.
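
A minimal sketch of the exemplar-based similarity idea (not the thesis code): each positive exemplar is turned into a linear classifier against a shared negative set, the vector of exemplar scores serves as an embedding of the positives, and that embedding is clustered. The use of scikit-learn's LinearSVC and k-means, and all variable names, are illustrative assumptions.

```python
# Sketch: exemplar-based similarity for clustering positive samples.
# Assumptions (not from the thesis code): features are precomputed row
# vectors, LinearSVC stands in for the exemplar SVMs, k-means is the
# final clustering step.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def exemplar_embedding(X_pos, X_neg, C=0.1):
    """Train one linear classifier per positive exemplar vs. all negatives
    and describe every positive by its vector of exemplar scores."""
    scores = np.zeros((X_pos.shape[0], X_pos.shape[0]))
    for i, exemplar in enumerate(X_pos):
        X = np.vstack([exemplar[None, :], X_neg])
        y = np.r_[1, np.zeros(X_neg.shape[0])]
        clf = LinearSVC(C=C, class_weight="balanced").fit(X, y)
        scores[:, i] = clf.decision_function(X_pos)  # column i: similarity to exemplar i
    return normalize(scores)  # L2-normalize rows to suppress scale differences

def cluster_positives(X_pos, X_neg, n_clusters=3):
    emb = exemplar_embedding(X_pos, X_neg)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pos = np.vstack([rng.normal(m, 0.3, size=(20, 16)) for m in (-1.0, 0.0, 1.0)])
    X_neg = rng.normal(0.0, 2.0, size=(200, 16))
    print(cluster_positives(X_pos, X_neg))
```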

In another effort, a strongly supervised DPM is proposed in order to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e. their presence and location), which are used to constrain the DPM's latent variables during, or prior to, the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection.

Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show how this generalization relates to state-of-the-art techniques and how it helps visual recognition by explicitly searching for counter-evidence of an object's presence.
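
As an illustration of the generalized scoring function (the notation below is ours, not quoted from the thesis), positive latent variables contribute the best foreground evidence while negative latent variables contribute the best counter-evidence, which is subtracted:

```latex
% Illustrative form: z_+ ranges over foreground configurations (e.g. part
% placements), z_- over hypothesized counter-evidence for the background class.
\[
  f_w(x) \;=\; \max_{z_+ \in Z_+(x)} w_+^\top \Phi(x, z_+)
         \;-\; \max_{z_- \in Z_-(x)} w_-^\top \Psi(x, z_-),
  \qquad
  \hat{y} \;=\; \operatorname{sign}\!\big(f_w(x)\big).
\]
```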

Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with roughly 1.3 million images. The activations at each layer of the trained ConvNet can then be treated as a representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, clearly superior to all the handcrafted features previously used in visual recognition (such as the HOG features in our earlier works on DPMs). We further investigate how this representation can be improved for a given target task. We propose various factors, applied before or after the training of the representation, which can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.
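
The sketch below reproduces this recipe only in spirit: it treats an ImageNet-pretrained network as a fixed feature extractor and trains a linear SVM on its activations. The choice of torchvision's ResNet-50 is an assumption made for brevity; the thesis experiments used OverFeat and related networks.

```python
# Sketch: use a pretrained ConvNet's activations as a generic image
# representation and train a linear SVM on top. ResNet-50 is a stand-in
# for the networks actually used in the thesis; assumes a recent torchvision.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import LinearSVC

# ImageNet-pretrained network, last classification layer removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(paths):
    """Return one 2048-d activation vector per image path."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch).numpy()

# Target-task training (train_paths, train_labels, etc. are hypothetical):
# X_train = extract(train_paths); clf = LinearSVC(C=1.0).fit(X_train, train_labels)
# X_test  = extract(test_paths);  print(clf.score(X_test, test_labels))
```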

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016, 172 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 21
Keyword [en]
Computer Vision, Machine Learning, Artificial Intelligence, Deep Learning, Learning Representation, Deformable Part Models, Discriminative Latent Variable Models, Convolutional Networks, Object Recognition, Object Detection
National Category
Electrical Engineering, Electronic Engineering, Information Engineering; Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-192289
ISBN: 978-91-7729-110-7
OAI: oai:DiVA.org:kth-192289
DiVA: diva2:967455
Public defence
2016-09-27, Kollegiesalen, Brinellvägen 8, KTH-huset, floor 4, KTH Campus, Stockholm, 15:26 (English)
Note

QC 20160908

Available from: 2016-09-08. Created: 2016-09-08. Last updated: 2016-09-09. Bibliographically approved.
List of papers
1. Object detection using strongly-supervised deformable part models
2012 (English). In: Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I / [ed] Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid, Springer, 2012, pp. 836-849. Conference paper (Refereed)
Abstract [en]

Deformable part-based models [1, 2] achieve state-of-the-art performance for object detection, but rely on heuristic initialization during training due to the optimization of a non-convex cost function. This paper investigates the limitations of such an initialization and extends earlier methods by using additional supervision. We explore strong supervision in terms of annotated object parts and use it to (i) improve model initialization, (ii) optimize model structure, and (iii) handle partial occlusions. Our method is able to deal with sub-optimal and incomplete annotations of object parts and is shown to benefit from semi-supervised learning setups where part-level annotation is provided for only a fraction of the positive examples. Experimental results are reported for the detection of six animal classes in the PASCAL VOC 2007 and 2010 datasets. We demonstrate significant improvements in detection performance compared to the LSVM [1] and the Poselet [3] object detectors.
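
A minimal sketch of how part annotations could constrain the latent placements (the actual strongly-supervised DPM imposes such constraints inside the latent SVM optimization rather than as a standalone filter): candidate placements are kept only if they overlap the annotated part box sufficiently. The box format and the 0.5 overlap threshold are assumptions.

```python
# Sketch: restrict latent part placements to those consistent with an
# annotated part box; boxes are [x1, y1, x2, y2].
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def constrain_latent_placements(candidates, annotated_box, min_iou=0.5):
    """Keep only candidate placements overlapping the annotation; if the
    annotation is missing (None), fall back to the unconstrained set."""
    if annotated_box is None:          # tolerate incomplete part annotations
        return candidates
    kept = [c for c in candidates if iou(c, annotated_box) >= min_iou]
    return kept or candidates          # never leave the latent search empty

print(constrain_latent_placements([[0, 0, 10, 10], [40, 40, 60, 60]], [1, 1, 11, 11]))
```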

Place, publisher, year, edition, pages
Springer, 2012
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 7572 LNCS
Keyword
Data sets, Detection performance, Model initialization, Nonconvex cost functions, Object Detection, Object detectors, Partial occlusions, Positive examples, Semi-supervised learning, State-of-the-art performance, Computer vision, Optimization, Supervised learning, Object recognition
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-107260 (URN), 10.1007/978-3-642-33718-5_60 (DOI), 000343418300060 (), 2-s2.0-84867871564 (ScopusID), 978-364233717-8 (ISBN)
Conference
12th European Conference on Computer Vision, ECCV 2012, 7 October 2012 through 13 October 2012, Florence
Note

QC 20121210

Available from: 2012-12-10. Created: 2012-12-10. Last updated: 2016-09-08. Bibliographically approved.
2. Mixture component identification and learning for visual recognition
2012 (English). In: Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI, Springer, 2012, pp. 115-128. Conference paper (Refereed)
Abstract [en]

The non-linear decision boundary between object and background classes, caused by large intra-class variations, needs to be modelled by any classifier wishing to achieve good results. While a mixture of linear classifiers is capable of modelling this non-linearity, learning this mixture from weakly annotated data is non-trivial and is the focus of this paper. Our approach is to identify the modes in the distribution of our positive examples by clustering, and to utilize this clustering in a latent SVM formulation to learn the mixture model. The clustering relies on a robust measure of visual similarity which suppresses uninformative clutter by using a novel representation based on the exemplar SVM. This subtle clustering of the data leads to better mixture models, as is demonstrated via extensive evaluations on PASCAL VOC 2007. The final classifier, using a HOG representation of the global image patch, achieves performance comparable to the state of the art while being more efficient at detection time.
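
The following sketch shows the mixture-learning loop in its simplest form, assuming the clustering above provides the initial component assignment: each positive is assigned to the component that scores it highest, then each component is retrained. scikit-learn's LinearSVC stands in for the latent SVM solver, and all names are illustrative assumptions.

```python
# Sketch: learn a mixture of linear classifiers with latent component
# assignment, in the spirit of the latent SVM formulation described above.
import numpy as np
from sklearn.svm import LinearSVC

def learn_mixture(X_pos, X_neg, init_assign, n_iters=5, C=0.1):
    """init_assign: one component index per positive sample; every
    component is assumed to start with at least one member."""
    n_components = int(init_assign.max()) + 1
    assign = np.asarray(init_assign).copy()
    models = [None] * n_components
    for _ in range(n_iters):
        # Train each component on its currently assigned positives vs. all negatives.
        for k in range(n_components):
            members = assign == k
            if not members.any():          # empty component: keep its previous model
                continue
            X = np.vstack([X_pos[members], X_neg])
            y = np.r_[np.ones(members.sum()), np.zeros(len(X_neg))]
            models[k] = LinearSVC(C=C).fit(X, y)
        # Latent step: reassign each positive to its best-scoring component.
        scores = np.column_stack([m.decision_function(X_pos) for m in models])
        assign = scores.argmax(axis=1)
    return models, assign
```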

Place, publisher, year, edition, pages
Springer, 2012
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7577
Keyword
Decision boundary, Detection time, Image patches, Intra-class variation, Linear classifiers, Mixture components, Mixture model, Non-Linearity, Non-trivial, Positive examples, Visual recognition, Visual similarity, Weakly annotated data
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-106987 (URN), 10.1007/978-3-642-33783-3_9 (DOI), 000342828800009 (), 2-s2.0-84867892975 (ScopusID), 978-364233782-6 (ISBN)
Conference
12th European Conference on Computer Vision, ECCV 2012, Florence, 7 October 2012 through 13 October 2012
Funder
ICT - The Next Generation
Note

QC 20121207

Available from: 2012-12-05. Created: 2012-12-05. Last updated: 2016-09-08. Bibliographically approved.
3. Self-tuned Visual Subclass Learning with Shared Samples: An Incremental Approach
2013 (English). Article, review/survey (Other academic). Epub ahead of print
Abstract [en]

Computer vision tasks are traditionally defined and evaluated using semantic categories. However, it is known to the field that semantic classes do not necessarily correspond to a unique visual class (e.g. the inside and outside of a car). Furthermore, many of the feasible learning techniques at hand cannot model a visual class which appears consistent to the human eye. These problems have motivated the use of 1) unsupervised or supervised clustering as a preprocessing step to identify the visual subclasses to be used in a mixture-of-experts learning regime; 2) the part model of Felzenszwalb et al. and other works that model the mixture assignment with latent variables optimized during learning; and 3) highly non-linear classifiers which are inherently capable of modelling a multi-modal input space but are inefficient at test time. In this work, we promote an incremental view of the recognition of semantic classes with varied appearances. We propose an optimization technique which incrementally finds maximal visual subclasses in a regularized risk minimization framework. Our proposed approach unifies the clustering and classification steps in a single algorithm. The importance of this approach lies in its compliance with the classification task: it does not need to know a priori the number of clusters, or the representation and similarity measures used in pre-processing clustering methods. Following this approach we show significant results, both qualitatively and quantitatively. We show that the visual subclasses exhibit a long-tail distribution. Finally, we show that state-of-the-art object detection methods (e.g. DPM) are unable to use the tail of this distribution, which comprises 50% of the training samples. In fact, we show that DPM performance slightly increases on average when this half of the data is removed.
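
A greedy illustration of the incremental idea (the paper optimizes a regularized risk jointly rather than greedily): a subclass grows from a seed sample by absorbing the nearest remaining positive, and stops when a classifier trained on the subclass no longer separates it reliably from the negatives. The cosine similarity, cross-validated F1 criterion, and 0.9 threshold are assumptions.

```python
# Sketch: grow one visual subclass incrementally from a seed positive.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def grow_subclass(X_pos, X_neg, seed_idx, min_f1=0.9):
    """Greedily grow one subclass of positive samples starting from a seed."""
    members, remaining = [seed_idx], set(range(len(X_pos))) - {seed_idx}
    while remaining:
        center = X_pos[members].mean(axis=0)
        # Candidate: the remaining positive closest (cosine) to the subclass mean.
        sim = lambda i: X_pos[i] @ center / (np.linalg.norm(X_pos[i]) * np.linalg.norm(center) + 1e-9)
        cand = max(remaining, key=sim)
        trial = members + [cand]
        X = np.vstack([X_pos[trial], X_neg])
        y = np.r_[np.ones(len(trial), dtype=int), np.zeros(len(X_neg), dtype=int)]
        f1 = cross_val_score(LinearSVC(C=0.1), X, y, cv=min(3, len(trial)), scoring="f1").mean()
        if f1 < min_f1:
            break                      # subclass no longer reliably separable: stop growing
        members, remaining = trial, remaining - {cand}
    return members
```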

National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-192293 (URN)
Note

QC 20160912

Available from: 2016-09-08. Created: 2016-09-08. Last updated: 2016-09-12. Bibliographically approved.
4. Spotlight the Negatives: A Generalized Discriminative Latent Model
2015 (English). Conference paper (Refereed)
Abstract [en]

Discriminative latent variable models (LVM) are frequently applied to various visual recognition tasks. In these systems the latent (hidden) variables provide a formalism for modeling structured variation of visual features. Conventionally, latent variables are defined on the variation of the foreground (positive) class. In this work we augment LVMs to include negative latent variables corresponding to the background class. We formalize the scoring function of such a generalized LVM (GLVM). Then we discuss a framework for learning a model based on the GLVM scoring function. We theoretically showcase how some of the current visual recognition methods can benefit from this generalization. Finally, we experiment on a generalized form of Deformable Part Models with negative latent variables and show significant improvements on two different detection tasks.
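
A small numerical sketch of such a scoring function (names and feature maps are assumptions, not the paper's code): the best-scoring positive latent configuration is offset by the best-scoring negative one.

```python
# Sketch: score an image with both positive and negative latent variables.
# Positive latent configurations contribute the best foreground evidence;
# negative latent configurations contribute the best counter-evidence,
# which is subtracted.
import numpy as np

def glvm_score(x, w_pos, w_neg, pos_latents, neg_latents, phi, psi):
    """Max over positive latents minus max over negative latents."""
    fg = max(w_pos @ phi(x, z) for z in pos_latents)
    bg = max(w_neg @ psi(x, z) for z in neg_latents)
    return fg - bg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=32)
    phi = lambda x, z: np.roll(x, z)     # toy latent variable: a shift of the feature vector
    psi = phi
    w_pos, w_neg = rng.normal(size=32), rng.normal(size=32)
    print(glvm_score(x, w_pos, w_neg, range(4), range(4), phi, psi))
```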

National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-172138 (URN)
Conference
British Machine Vision Conference (BMVC), 7-10 September 2015, Swansea, UK
Note

QC 20150828

Available from: 2015-08-13. Created: 2015-08-13. Last updated: 2016-09-08. Bibliographically approved.
5. CNN features off-the-shelf: An Astounding Baseline for Recognition
2014 (English). In: Proceedings of CVPR 2014, 2014. Conference paper (Refereed)
Abstract [en]

Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle a diverse range of recognition tasks: object image classification, scene recognition, fine-grained recognition, attribute detection and image retrieval, applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistently superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms methods with a low memory footprint, except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
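
A minimal sketch of the retrieval setting described above, assuming the 4096-d features have already been extracted: each image is represented by an L2-normalized activation vector, and database images are ranked by Euclidean distance to the query.

```python
# Sketch: instance retrieval with off-the-shelf ConvNet features:
# L2-normalize the activation vectors and rank database images by
# Euclidean distance to the query. Feature matrices are assumed precomputed.
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)      # L2 distance to every database image
    return np.argsort(dists)[:top_k]            # indices of the top-k nearest images

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 4096))              # stand-in for extracted features
print(retrieve(db[42] + 0.05 * rng.normal(size=4096), db))
```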

National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-149178 (URN), 10.1109/CVPRW.2014.131 (DOI), 000349552300079 (), 2-s2.0-84908537903 (ScopusID)
Conference
Computer Vision and Pattern Recognition (CVPR) 2014, DeepVision workshop, June 28, 2014, Columbus, Ohio
Note

Best Paper Runner-up Award.

QC 20140825

Available from: 2014-08-16. Created: 2014-08-16. Last updated: 2016-09-08. Bibliographically approved.
6. Factors of Transferability for a Generic ConvNet Representation
2015 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539. Article in journal (Refereed). Epub ahead of print
Abstract [en]

Evidence is mounting that Convolutional Networks (ConvNets) are the most effective representation learning method for visual recognition tasks. In the common scenario, a ConvNet is trained on a large labeled dataset (source) and the feed-forward unit activations of the trained network, at a certain layer, are used as a generic representation of an input image for a task with a relatively smaller training set (target). Recent studies have shown this form of representation transfer to be suitable for a wide range of target visual recognition tasks. This paper introduces and investigates several factors affecting the transferability of such representations. They include the parameters for training the source ConvNet, such as its architecture and the distribution of the training data, as well as the parameters of feature extraction, such as which layer of the trained ConvNet is used and how its dimensionality is reduced. By optimizing these factors, we show that significant improvements can be achieved on various (17) visual recognition tasks. We further show that these tasks can be categorically ordered by their similarity to the source task, and that a correlation is observed between task performance and this similarity with respect to the proposed factors.
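
As a hedged illustration of how two of these factors could be swept on a target task, the sketch below grids over the layer the representation is taken from and the amount of PCA dimensionality reduction, scoring each combination with a cross-validated linear SVM. Per-layer feature matrices are assumed precomputed; all names and values are illustrative.

```python
# Sketch: grid over two transfer factors -- which layer the representation
# comes from, and how much PCA reduction is applied -- and score each
# combination on the target task with a cross-validated linear SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def sweep_factors(layer_feats, labels, pca_dims=(64, 256, None)):
    """layer_feats: dict mapping layer name -> (n_samples, dim) feature array."""
    results = {}
    for layer, X in layer_feats.items():
        for dim in pca_dims:
            steps = ([PCA(n_components=dim)] if dim else []) + [LinearSVC(C=1.0)]
            acc = cross_val_score(make_pipeline(*steps), X, labels, cv=5).mean()
            results[(layer, dim)] = acc
    return results

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)                      # stand-in target-task labels
feats = {"conv5": rng.normal(size=(300, 512)), "fc7": rng.normal(size=(300, 4096))}
for (layer, dim), acc in sorted(sweep_factors(feats, labels).items(), key=lambda kv: -kv[1]):
    print(layer, dim, round(acc, 3))
```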

Place, publisher, year, edition, pages
IEEE Computer Society Digital Library, 2015
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-177033 (URN), 10.1109/TPAMI.2015.2500224 (DOI), 000381432700006 (), 2-s2.0-84981266620 (ScopusID)
Note

QP 201512

Available from: 2015-11-13. Created: 2015-11-13. Last updated: 2016-09-30. Bibliographically approved.

Open Access in DiVA

fulltext (19225 kB)
File name: FULLTEXT01.pdf
File size: 19225 kB
Checksum (SHA-512): b542da2cd5e2e0e46425e1956ff9769088880274c2c0bd8a468073f81a58193ed996ca0068bd17ae7640bc29d0adb8ef6acbd9a8c1e4236c753a0a520493e2af
Type: fulltext
Mimetype: application/pdf
