Scalable Machine Learning through Approximation and Distributed Computing
KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST). RISE. ORCID iD: 0000-0002-8180-7521
2019 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Machine learning algorithms are now being deployed in practically all areas of our lives. Part of this success can be attributed to the ability to learn complex representations from massive datasets. However, computational speed increases have not kept up with the increase in the sizes of data we want to learn from, leading naturally to algorithms that need to be resource-efficient and parallel. As the proliferation of machine learning continues, the ability of algorithms to adapt to a changing environment and deal with uncertainty becomes increasingly important.

In this thesis we develop scalable machine learning algorithms, with a focus on efficient, online, and distributed computation. We make use of approximations to dramatically reduce the computational cost of exact algorithms, and develop online learning algorithms to deal with a constantly changing environment under a tight computational budget. We design parallel and distributed algorithms to ensure that our methods can scale to massive datasets.

We first propose a scalable algorithm for graph vertex similarity calculation and concept discovery. We demonstrate its applicability to multiple domains, including text, music, and images, and show its scalability by training on one of the largest text corpora available. Then, motivated by a real-world use case of predicting the session length in media streaming, we propose improvements to several aspects of learning with decision trees. We propose two algorithms to estimate the uncertainty in the predictions of online random forests. We show that our approach can achieve better accuracy than the state of the art while being an order of magnitude faster. We then propose a parallel and distributed online tree boosting algorithm that maintains the correctness guarantees of serial algorithms while providing an order of magnitude speedup on average. Finally, we propose an algorithm that allows the training of gradient boosted trees to be distributed across both the data point and feature dimensions. We show that we can achieve communication savings of several orders of magnitude for sparse datasets, compared to existing approaches that can only distribute the computation across the data point dimension and use dense communication.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2019, p. 120
Series
TRITA-EECS-AVL ; 2019:43
Keywords [en]
Online Learning, Distributed Computing, Graph Similarity, Decision Trees, Gradient Boosting
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-250038
ISBN: 978-91-7873-181-7 (print)
OAI: oai:DiVA.org:kth-250038
DiVA, id: diva2:1307122
Public defence
2019-05-28, Sal B, Kistagången 16, floor 2, Electrum 1, KTH Kista, Kista, 14:00 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research, RIT10-0043
Swedish Foundation for Strategic Research, BD15-0006
Note

QC 20190426

Available from: 2019-04-26 Created: 2019-04-25 Last updated: 2019-04-30
Bibliographically approved
List of papers
1. Domain-agnostic discovery of similarities and concepts at scale
2017 (English) In: Knowledge and Information Systems, ISSN 0219-1377, E-ISSN 0219-3116, Vol. 51, no 2, p. 531-560
Article in journal (Refereed) Published
Abstract [en]

Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.
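
The premise that two objects are similar when their contexts are similar can be illustrated with a small sketch. This is not the paper's algorithm: cosine similarity over correlation vectors, the fixed threshold, the quadratic all-pairs loop, and the names `context_similarity`, `similarity_graph`, and `toy_corr` are all assumptions made here for illustration, and a scalable implementation would need to avoid comparing every pair.

```python
import math

def context_similarity(corr, a, b):
    """Cosine similarity between the contexts (correlation vectors) of a and b."""
    # Leave out the direct a-b correlation so that only shared context counts.
    ctx_a = {k: w for k, w in corr.get(a, {}).items() if k != b}
    ctx_b = {k: w for k, w in corr.get(b, {}).items() if k != a}
    dot = sum(w * ctx_b[k] for k, w in ctx_a.items() if k in ctx_b)
    norm_a = math.sqrt(sum(w * w for w in ctx_a.values()))
    norm_b = math.sqrt(sum(w * w for w in ctx_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_graph(corr, threshold=0.5):
    """Keep an edge for every pair whose context similarity reaches the threshold."""
    objects = sorted(corr)
    edges = {}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            s = context_similarity(corr, a, b)
            if s >= threshold:
                edges[(a, b)] = s
    return edges

# Toy correlation graph (hypothetical data): object -> {context object: weight}.
toy_corr = {
    "cat": {"purrs": 2.0, "pet": 1.0},
    "kitten": {"purrs": 2.0, "pet": 1.0},
    "dog": {"barks": 2.0, "pet": 1.0},
}
toy_sim = similarity_graph(toy_corr, threshold=0.5)
```

In this toy input, "cat" and "kitten" share their whole context and end up connected, while "dog" shares only one weak context item with them and falls below the threshold. Clustering such a similarity graph, for instance by taking connected components, would be one simple way to group similar objects into candidate concepts.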

Place, publisher, year, edition, pages
London: Springer, 2017
Keywords
Similarity discovery, Concept mining, Distributional semantics, Graph processing
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-249992 (URN)
10.1007/s10115-016-0984-2 (DOI)
000399408200007 ()
2-s2.0-84984793995 (Scopus ID)
Projects
E2ECLOUDS
Funder
Swedish Foundation for Strategic Research, RIT10-0043
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2019-04-26
Bibliographically approved
2. Knowing an Object by the Company It Keeps: A Domain-Agnostic Scheme for Similarity Discovery
2015 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Appropriately defining and then efficiently calculating similarities from large data sets are often essential in data mining, both for building tractable representations and for gaining understanding of data and generating processes. Here we rely on the premise that given a set of objects and their correlations, each object is characterized by its context, i.e. its correlations to the other objects, and that the similarity between two objects therefore can be expressed in terms of the similarity between their respective contexts. Resting on this principle, we propose a data-driven and highly scalable approach for discovering similarities from large data sets by representing objects and their relations as a correlation graph that is transformed to a similarity graph. Together these graphs can express rich structural properties among objects. Specifically, we show that concepts -- representations of abstract ideas and notions -- are constituted by groups of similar objects that can be identified by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of domains, and will here be demonstrated for three distinct types of objects: codons, artists and words, where the numbers of objects and correlations range from small to very large.

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-250036 (URN)
10.1109/ICDM.2015.85 (DOI)
000380541000013 ()
2-s2.0-84963516560 (Scopus ID)
978-1-4673-9504-5 (ISBN)
Conference
2015 IEEE International Conference on Data Mining (ICDM)
Funder
Swedish Foundation for Strategic Research, RIT10-0043
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2019-06-17
Bibliographically approved
3. BoostVHT: Boosting distributed streaming decision trees
2017 (English) In: International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery, 2017, p. 899-908
Conference paper, Published paper (Refereed)
Abstract [en]

Online boosting improves the accuracy of classifiers for unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently developed model-parallel learning algorithm for streaming decision trees as a base learner. This design allows the boosting of the model to be neatly separated from its training. As a result, BoostVHT provides a flexible learning framework that can employ any existing online boosting algorithm while leveraging the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can be run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state of the art.

Place, publisher, year, edition, pages
Association for Computing Machinery, 2017
Keywords
Boosting, Decision trees, Distributed systems, Online learning, Big data, Cluster computing, Clustering algorithms, Data mining, Distributed computer systems, Forestry, Knowledge management, Online systems, Trees (mathematics), Distributed clusters, Distributed streaming, Flexible Learning, Open source platforms, Parallel learning algorithms, Learning algorithms
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-227056 (URN)
10.1145/3132847.3132974 (DOI)
000440845300089 ()
2-s2.0-85037345394 (Scopus ID)
9781450349185 (ISBN)
Conference
26th ACM International Conference on Information and Knowledge Management, CIKM 2017, 6 November 2017 through 10 November 2017
Note

QC 20180503

Available from: 2018-05-03 Created: 2018-05-03 Last updated: 2019-04-26
Bibliographically approved
4. Block-distributed Gradient Boosted Trees
2019 (English) Conference paper, Published paper (Refereed)
Keywords
Gradient Boosted Trees, Distributed Systems, Communication Efficiency, Scalability
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-249994 (URN)
Conference
42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Funder
Swedish Foundation for Strategic Research, BD15-0006
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2019-05-21
Bibliographically approved
5. Predicting Session Length in Media Streaming
2017 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Session length is a very important aspect in determining a user's satisfaction with a media streaming service. Being able to predict how long a session will last can be of great use for various downstream tasks, such as recommendations and ad scheduling. Most of the related literature on user interaction duration has focused on dwell time for websites, usually in the context of approximating post-click satisfaction either in search results or in display ads.

In this work we present the first analysis of session length in a mobile-focused online service, using a real-world dataset from a major music streaming service. We use survival analysis techniques to show that the characteristics of the length distributions can differ significantly between users, and use gradient boosted trees with appropriate objectives to predict the length of a session using only information available at its beginning. Our evaluation on real-world data illustrates that our proposed technique outperforms the considered baseline.

Keywords
User Behavior, Survival Analysis, Dwell Time, Session Length
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-250037 (URN)
10.1145/3077136.3080695 (DOI)
Conference
40th International ACM SIGIR Conference on Research and Development in Information Retrieval
Funder
Swedish Foundation for Strategic Research, RIT10-0043
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2019-04-26
Bibliographically approved
6. Quantifying Uncertainty in Online Regression Forests
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Accurately quantifying uncertainty in predictions is essential for the deployment of machine learning algorithms in critical applications where mistakes are costly. Most approaches to quantifying prediction uncertainty have focused on settings where the data is static, or bounded. In this paper, we investigate methods that quantify the prediction uncertainty in a streaming setting, where the data is potentially unbounded.

We propose two meta-algorithms that produce prediction intervals for online regression forests of arbitrary tree models: one based on conformal prediction theory, and the other based on quantile regression. We show that the approaches are able to maintain specified error rates, with constant computational cost per example and bounded memory usage. We provide empirical evidence that the methods outperform the state of the art in terms of maintaining error guarantees, while being an order of magnitude faster. We also investigate how the algorithms are able to recover from concept drift.
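
As a rough illustration of how a conformal-style interval can be maintained on a stream with bounded memory, the sketch below keeps a sliding window of absolute residuals and uses their empirical quantile as the interval half-width. It is not the manuscript's meta-algorithm: the class name `SlidingConformalInterval`, the window size, and the use of absolute residuals as nonconformity scores are assumptions made here, and the underlying regressor is left to the caller.

```python
import math
from collections import deque

class SlidingConformalInterval:
    """Split-conformal-style intervals over a bounded window of residuals."""

    def __init__(self, alpha=0.1, window=200):
        self.alpha = alpha                     # target error rate
        self.residuals = deque(maxlen=window)  # bounded memory: old scores drop out

    def interval(self, prediction):
        """Return (low, high) around a point prediction."""
        n = len(self.residuals)
        # Rank of the conformal quantile among the n stored scores.
        k = math.ceil((1.0 - self.alpha) * (n + 1))
        if n == 0 or k > n:
            # Too few scores to support the requested error rate.
            return (-math.inf, math.inf)
        q = sorted(self.residuals)[k - 1]
        return (prediction - q, prediction + q)

    def update(self, prediction, truth):
        """Record the residual of a prediction made before seeing the label."""
        self.residuals.append(abs(truth - prediction))
```

A typical prequential loop would call `interval(pred)` when an example arrives, then `update(pred, y)` once its label is revealed, and only then train the model on that example, so that stored residuals are always out-of-sample. Because old residuals fall out of the window, the interval width can adjust after a concept drift, at the cost of a noisier quantile estimate for small windows.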

Keywords
Online learning, Uncertainty, Decision Trees, Regression
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-250039 (URN)
Funder
Swedish Foundation for Strategic Research, BD15-0006
Note

QC 20190426

Available from: 2019-04-25 Created: 2019-04-25 Last updated: 2019-04-26
Bibliographically approved

Open Access in DiVA

fulltext (4379 kB), 180 downloads
File information
File name FULLTEXT01.pdf
File size 4379 kB
Checksum SHA-512
ab2ff5aa0ab30acad6dc0fd930ddcfafbc0413eede8a906e659cc51eac88ebd62ea781936cb55f455427d99a390bd2a2382d94dc69f4677cbd912a9d0f74c7e6
Type fulltext
Mimetype application/pdf

Search in DiVA

By author/editor
Vasiloudis, Theodore
By organisation
Computational Science and Technology (CST)
Computer Sciences
