Digitala Vetenskapliga Arkivet

A generative appearance model for end-to-end video object segmentation
Johnander, Joakim. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering; Zenuity, Sweden. ORCID iD: 0000-0003-2553-3367
Danelljan, Martin. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering; ETH Zurich, Switzerland. ORCID iD: 0000-0001-6144-9520
Brissman, Emil. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering; Saab, Sweden. ORCID iD: 0000-0002-0418-9694
Khan, Fahad Shahbaz. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering; IIAI, UAE.
Felsberg, Michael. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering.
2019 (English). In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8945-8954. Conference paper, published paper (refereed).
Abstract [en]

One of the fundamental challenges in video object segmentation is to find an effective representation of the target and background appearance. The best performing approaches resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue, which is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close the gap to approaches based on online fine-tuning on DAVIS17, while operating at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
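As a concrete illustration of the idea, the following is a minimal, hypothetical sketch in PyTorch, not the authors' implementation, of a differentiable generative appearance module. Target and background features are modeled as class-conditional Gaussians with a shared diagonal covariance, and posterior class probabilities for a new frame follow from Bayes' rule with equal priors; the function names and the single-Gaussian simplification are assumptions made here.

```python
import torch
import torch.nn.functional as F

def fit_gaussians(feats, mask, eps=1e-6):
    """Estimate class-conditional Gaussian parameters from a reference frame.

    feats: (C, H, W) feature map; mask: (H, W) soft target mask in [0, 1].
    Returns per-class means (2, C) and a shared diagonal variance (C,).
    """
    C = feats.shape[0]
    x = feats.reshape(C, -1)                         # (C, N) pixel features
    w_fg = mask.reshape(1, -1)                       # foreground weights
    w_bg = 1.0 - w_fg                                # background weights
    mu_fg = (x * w_fg).sum(1) / (w_fg.sum() + eps)
    mu_bg = (x * w_bg).sum(1) / (w_bg.sum() + eps)
    var = x.var(dim=1) + eps                         # shared across classes
    return torch.stack([mu_bg, mu_fg]), var

def posterior(feats, means, var):
    """Per-pixel posterior over {background, target}. Since the variance is
    shared, constant log-likelihood terms cancel inside the softmax."""
    C, H, W = feats.shape
    x = feats.reshape(1, C, -1)                      # (1, C, N)
    mu = means.unsqueeze(-1)                         # (2, C, 1)
    loglik = -0.5 * (((x - mu) ** 2) / var.view(1, C, 1)).sum(1)  # (2, N)
    return F.softmax(loglik, dim=0).reshape(2, H, W)
```

In a pipeline of this kind, fit_gaussians would run once on the annotated first frame and posterior on every subsequent frame; because both consist of differentiable tensor operations, gradients flow through the module during offline training.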

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019. p. 8945-8954
Series
Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
Keywords [en]
Segmentation; Grouping and Shape; Motion and Tracking
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-161037; DOI: 10.1109/CVPR.2019.00916; ISI: 000542649302058; ISBN: 9781728132938 (electronic); ISBN: 9781728132945 (print); OAI: oai:DiVA.org:liu-161037; DiVA id: diva2:1361997
Conference
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, CA, USA, 15-20 June 2019
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; Swedish Research Council
Available from: 2019-10-17. Created: 2019-10-17. Last updated: 2025-02-07. Bibliographically approved.
In thesis
1. Dynamic Visual Learning
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomous robots act in a dynamic world where both the robots and other objects may move. The surround sensing systems of such robots therefore work with dynamic input data and need to estimate both the current state of the environment and its dynamics. One of the key elements in obtaining a high-level understanding of the environment is to track dynamic objects. This enables the system to understand what the objects are doing, to predict where they will be in the future, and to better estimate where they are later on. In this thesis, I focus on input from visual cameras: images. With the advent of neural networks, images have become a cornerstone in sensing systems. Image-processing neural networks are optimized to perform a specific computer vision task, such as recognizing cats and dogs, on vast datasets of annotated examples. This is usually referred to as offline training, and given a well-designed neural network, enough high-quality data, and a suitable offline training formulation, the neural network is expected to become adept at the specific task.

This thesis starts with a study of object tracking. The tracking is based on the visual appearance of the object, achieved via discriminative correlation filters (DCFs). The first contribution of this thesis is to decompose the filter into multiple subfilters. This increases robustness under object deformations and rotations. Moreover, it provides a more fine-grained representation of the object state, as the subfilters are expected to roughly track object parts. In the second contribution, a neural network is trained directly for object tracking. In order to obtain a fine-grained representation of the object state, the state is represented as a segmentation. The main challenge lies in the design of a neural network able to tackle this task. While common neural networks excel at recognizing patterns seen during offline training, they struggle to store novel patterns in order to recognize them later. To overcome this limitation, a novel appearance learning mechanism is proposed. The mechanism extends the state of the art and is shown to generalize remarkably well to novel data. In the third contribution, the method is used together with a novel fusion strategy and a failure detection criterion to semi-automatically annotate visual and thermal videos.
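To make the subfilter decomposition concrete, here is a loose sketch, with invented names and shapes, of combining part-based correlation responses; the actual method learns the subfilters discriminatively and treats borders and part placement far more carefully.

```python
import torch
import torch.nn.functional as F

def subfilter_response(feats, subfilters, offsets):
    """Sum of subfilter correlation responses, each shifted by its part offset.

    feats:      (1, C, H, W) search-region features.
    subfilters: list of (C, h, w) filters, one per object part.
    offsets:    list of (dy, dx) part positions relative to the object center.
    """
    total = 0.0
    for f, (dy, dx) in zip(subfilters, offsets):
        # Correlate the features with one subfilter (stride-1 "same" conv).
        r = F.conv2d(feats, f.unsqueeze(0), padding="same")   # (1, 1, H, W)
        # Shift the response so every part votes for the object center;
        # torch.roll wraps at the border, a simplification accepted here.
        total = total + torch.roll(r, shifts=(-dy, -dx), dims=(2, 3))
    return total
```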

Sensing systems need not only track objects, but also detect them. The fourth contribution of this thesis strives to tackle joint detection, tracking, and segmentation of all objects from a predefined set of object classes. The challenge here lies not only in the neural network design, but also in the design of the offline training formulation. The final approach, a recurrent graph neural network, outperforms prior works that have a runtime of the same order of magnitude.
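The association step at the core of such a tracker can be sketched as follows; this hypothetical snippet only scores every track-detection pair with a small learned network, whereas the thesis method is a full recurrent graph neural network with message passing over time.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Scores the compatibility of every (track, detection) pair."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, tracks, dets):
        """tracks: (T, dim); dets: (D, dim). Returns (T, D) association scores."""
        T, D = tracks.shape[0], dets.shape[0]
        pairs = torch.cat(
            [tracks.unsqueeze(1).expand(T, D, -1),
             dets.unsqueeze(0).expand(T, D, -1)],
            dim=-1,
        )
        return self.mlp(pairs).squeeze(-1)
```

A matching over these scores, greedy or Hungarian, would then extend each track with at most one detection per frame.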

Last, this thesis studies dynamic learning of novel visual concepts. It is observed that the learning mechanisms used for object tracking essentially learn the appearance of the tracked object. It is natural to ask whether this appearance learning could be extended beyond individual objects to entire semantic classes, enabling the system to learn new concepts based on just a few training examples. Such an ability is desirable in autonomous systems, as it removes the need to manually annotate thousands of examples of each class that must be recognized. Instead, the system is trained to efficiently learn to recognize new classes. In the fifth contribution, we propose a novel learning mechanism based on Gaussian process regression. With this mechanism, our neural network outperforms the state of the art, and the performance gap is especially large when multiple training examples are given.
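The appeal of Gaussian process regression here is that label propagation from support to query examples has a differentiable closed form. The snippet below is a simplification with an assumed RBF kernel and noise level, not the thesis configuration.

```python
import torch

def gp_predict(support_x, support_y, query_x, lengthscale=1.0, noise=1e-2):
    """Closed-form GP regression from support features to class scores.

    support_x: (N, C) features; support_y: (N, K) one-hot labels;
    query_x:   (M, C) features. Returns (M, K) predicted class scores.
    """
    def rbf(a, b):
        return torch.exp(-0.5 * torch.cdist(a, b) ** 2 / lengthscale ** 2)

    K_ss = rbf(support_x, support_x)                  # (N, N)
    K_qs = rbf(query_x, support_x)                    # (M, N)
    n = support_x.shape[0]
    alpha = torch.linalg.solve(K_ss + noise * torch.eye(n), support_y)
    return K_qs @ alpha                               # GP posterior mean
```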

To summarize, this thesis studies and makes several contributions to learning systems that parse dynamic visuals and that dynamically learn visual appearances or concepts.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2022. p. 59
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2196
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-181604; DOI: 10.3384/9789179291488; ISBN: 9789179291471; ISBN: 9789179291488
Public defence
2022-01-19, Ada Lovelace, B Building, Campus Valla, Linköping, 09:00 (English)
Opponent
Supervisors
Projects
WASP Industrial PhD student
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-12-08. Created: 2021-12-03. Last updated: 2025-02-07. Bibliographically approved.
2. Learning to Analyze Visual Data Streams for Environment Perception
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

A mobile robot, instructed by a human operator, acts in an environment with many other objects. For an autonomous robot, however, human instructions should be minimal and limited to high-level goals, such as the ultimate task or destination. In order to increase the level of autonomy, it has become a foremost objective to mimic human vision using neural networks that take a stream of images as input and learn a specific computer vision task from large amounts of data. In this thesis, we explore several different models for surround sensing, each of which contributes to making a higher-level understanding of the environment possible.

As its first contribution, this thesis presents an object tracking method for video sequences, a crucial component in a perception system. The method predicts a fine-grained mask that separates the pixels corresponding to the target from those corresponding to the background. Rather than tracking location and size, the method tracks the pixels initially assigned to the target, so-called video object segmentation. For subsequent time steps, the goal is to learn how the target looks, using features from a neural network. We named our method A-GAME, after its generative modeling of the deep feature space, which separates target and background appearance.

In the second contribution of this thesis, we detect, track, and segment all objects from a set of predefined object classes. This information enables the robot to better perceive its surroundings. We experiment with a graph neural network to weight all new detections against existing tracks. This model outperforms prior works by separating visually and semantically similar objects, processing one frame at a time.

The third contribution investigates a limitation of anchor-based detectors, which classify predefined bounding boxes as either negative or positive and thus handle only a limited set of object shapes. One idea is to learn an alternative instance representation. We experiment with a neural network that predicts the distance to the nearest object contour in different directions from each pixel. The network then computes an approximated signed distance function containing the respective instance information.
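To illustrate what such a representation encodes, the hypothetical sketch below constructs per-pixel, per-direction distances to the nearest contour from a binary mask, the kind of regression target the description implies; the network architecture and the decoding into instances are not reproduced.

```python
import numpy as np

def directional_contour_distances(mask, directions=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """mask: (H, W) binary instance mask. Returns (len(directions), H, W)
    distances, in pixels, to the first mask change (or the image border)."""
    H, W = mask.shape
    out = np.zeros((len(directions), H, W), dtype=np.float32)
    for d, (dy, dx) in enumerate(directions):
        for y in range(H):
            for x in range(W):
                steps, yy, xx = 0, y, x
                # Walk along the direction until the mask value flips or we
                # leave the image; the step count is the directional distance.
                while 0 <= yy < H and 0 <= xx < W and mask[yy, xx] == mask[y, x]:
                    yy, xx, steps = yy + dy, xx + dx, steps + 1
                out[d, y, x] = steps
    return out
```

Taking the minimum over directions and a sign from the mask, e.g. np.where(mask, 1.0, -1.0) * out.min(0), yields a coarse approximation of the signed distance function described above.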

Last, this thesis studies a concept within model validation. We observed that overfitting can increase performance on benchmarks. However, this opportunity is of little practical value for sensing systems, since measurements, such as lengths or angles, are quantities that describe the environment. The fourth contribution of this thesis is an extended validation technique for camera calibration. The technique uses a statistical model for each error difference between an observed value and a corresponding prediction of the projective model. A test is computed over the differences to detect whether the projective model is incorrect.
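A simplified reading of the test can be sketched as follows, assuming each residual between an observed image point and its projection is zero-mean Gaussian with a known standard deviation; the thesis's actual statistical model and test statistic may differ.

```python
import numpy as np
from scipy import stats

def calibration_suspect(observed, predicted, sigma, alpha=0.01):
    """observed, predicted: (N, 2) image points; sigma: residual std. dev.
    Returns True if the residuals are implausibly large for the model."""
    r = (observed - predicted) / sigma   # normalized residuals
    chi2_stat = np.sum(r ** 2)           # ~ chi^2 with r.size dof if the model holds
    p_value = stats.chi2.sf(chi2_stat, df=r.size)
    return p_value < alpha
```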

Abstract [sv]

A mobile robot, instructed by a human operator, acts in an environment with many other objects. For an autonomous robot, human intervention should be minimal and consist only of high-level instructions, such as the ultimate task or destination. Neural networks that take a stream of images as input and learn a specific computer vision task from large amounts of data, in order to mimic the ability that comes naturally to humans, have become crucial in the pursuit of autonomy. In this thesis, we explore different models, each of which contributes to making a higher-level understanding of the surroundings possible.

The first contribution of the thesis investigates a method for object tracking, that is, for keeping track of objects, an ability that is a key element of how the surrounding world can be perceived. The method estimates a detailed pixel mask of the object and classifies all other pixels as background. The pixels initially belonging to the object are tracked, so-called video object segmentation, instead of tracking position and size. For subsequent time steps, the goal is to learn the appearance of the object from features computed by a neural network. We named our method A-GAME, after the generative modeling of deep features that separates how the object and the background look.

In the second contribution of this thesis, we detect, track, and segment all objects from a set of predefined object classes. This information is how the robot can increase its ability to perceive the surroundings. We experiment with a graph neural network to weight all newly detected objects and existing object tracks. The method, which processes one image at a time and separates visually and semantically similar objects, outperforms prior works.

The third contribution investigates a limitation of detectors that use anchor-based object candidates. These detectors classify predefined box types for potential objects as either negative or positive, and thereby limit which objects can be detected, depending on their shape. One idea is to learn an alternative object representation. We experiment with a neural network that predicts the distance to the nearest object contour in different directions from each pixel. The neural network then computes an approximated distance function, one image at a time, that contains information about the individual objects.

Finally, this thesis studies a concept within validation. We observed that overfitting could increase performance measures on benchmark datasets. However, this opportunity is insignificant for us in practice, since measurements, such as lengths or angles, are quantities used to describe the surroundings. The fourth contribution of this thesis is an extended validation technique for camera calibration. The technique uses a statistical model for each deviation between an observed value and a corresponding prediction of the projective model. A statistical test is computed over the deviations to detect whether such a model is incorrect.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2023. p. 45
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2283
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-192620; DOI: 10.3384/9789180750158; ISBN: 9789180750141; ISBN: 9789180750158
Public defence
2023-04-28, Ada Lovelace, B-building, Campus Valla, Linköping, 10:15 (English)
Opponent
Supervisors
Note

Funding agencies: Saab Dynamics and the Wallenberg AI, Autonomous Systems, and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Furthermore, the computations were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre, and by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Alvis, partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Available from: 2023-03-24. Created: 2023-03-24. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

fulltext (2100 kB), 717 downloads
File information
File name: FULLTEXT01.pdf; File size: 2100 kB; Checksum (SHA-512):
bfa5f781e534e3a6a9938d945ac3083e30924678f93fd063f0ff7092f8e396ec3f559845c2663df3d0a723013b3e80e391ef7706b58257ecd09276b4d749b45a
Type: fulltext; Mimetype: application/pdf

Other links

Publisher's full text

Search in DiVA

By author/editor
Johnander, Joakim; Danelljan, Martin; Brissman, Emil; Khan, Fahad Shahbaz; Felsberg, Michael
By organisation
Computer Vision; Faculty of Science & Engineering
Computer graphics and computer vision
