Data-driven Performance Prediction and Resource Allocation for Cloud Services
Yanggratoke, Rerngvit. KTH, School of Electrical Engineering (EES), Communication Networks; ACCESS Linnaeus Centre. ORCID iD: 0000-0002-2680-9065
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud services, which provide online entertainment, enterprise resource management, tax filing, etc., are becoming essential for consumers, businesses, and governments. The key functionalities of such services are provided by backend systems in data centers. This thesis focuses on three fundamental problems related to the management of backend systems. We address these problems using data-driven approaches: triggering dynamic allocation by changes in the environment, obtaining configuration parameters from measurements, and learning from observations.

The first problem relates to resource allocation for large clouds with potentially hundreds of thousands of machines and services. We developed and evaluated a generic gossip protocol for distributed resource allocation. Extensive simulation studies suggest that the quality of the allocation is independent of the system size for the management objectives considered.

The second problem focuses on performance modeling of a distributed key-value store, and we study specifically the Spotify backend for streaming music. We developed analytical models for system capacity under different data allocation policies and for response time distribution. We evaluated the models by comparing model predictions with measurements from our lab testbed and from the Spotify operational environment. We found the prediction error to be below 12% for all investigated scenarios.

The third problem relates to real-time prediction of service metrics, which we address through statistical learning. Service metrics are learned from observing device and network statistics. We performed experiments on a server cluster running video streaming and key-value store services. We showed that feature set reduction significantly improves the prediction accuracy, while simultaneously reducing model computation time. Finally, we designed and implemented a real-time analytics engine, which produces model predictions through online learning.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. 53 p.
Series
TRITA-EE, ISSN 1653-5146; 2016:020
National Category
Communication Systems; Computer Systems; Telecommunications; Computer Engineering; Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-184601
ISBN: 978-91-7595-876-7 (print)
OAI: oai:DiVA.org:kth-184601
DiVA: diva2:916753
Public defence
2016-05-03, F3, Lindstedtsvägen 26, KTH Campus, Stockholm, 14:00 (English)
Funder
VINNOVA, 2013-03895
Note

QC 20160411

Available from: 2016-04-11. Created: 2016-04-01. Last updated: 2016-05-30. Bibliographically approved.
List of papers
1. Gossip-based Resource Allocation for Green Computing in Large Clouds
2011 (English). In: The 7th International Conference on Network and Service Management, IFIP, 2011. Conference paper, Published paper (Refereed)
Abstract [en]

We address the problem of resource allocation in a large-scale cloud environment, which we formalize as that of dynamically optimizing a cloud configuration for green computing objectives under CPU and memory constraints. We propose a generic gossip protocol for resource allocation, which can be instantiated for specific objectives. We develop an instantiation of this generic protocol which aims at minimizing power consumption through server consolidation, while satisfying a changing load pattern. This protocol, called GRMP-Q, provides an efficient heuristic solution that performs well in most cases and is optimal in special cases. Under overload, the protocol gives a fair allocation of CPU resources to clients.

Simulation results suggest that key performance metrics do not change with increasing system size, making the resource allocation process scalable to well above 100,000 servers. Generally, the effectiveness of the protocol in achieving its objective increases with increasing memory capacity in the servers.
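The consolidation idea behind a gossip protocol like this can be sketched at toy scale. The following is a minimal illustration, not GRMP-Q itself: the server count, capacities, and the greedy pairwise merge rule are invented for the example. Each server repeatedly contacts a random peer, and the pair consolidates load onto the more-loaded server when capacity allows.

```python
import random

def gossip_round(load, capacity, rng):
    # One gossip round: every server contacts one random peer; the pair
    # greedily moves the smaller load onto the more-loaded server when
    # its capacity allows. A toy version of the consolidation objective
    # (minimize the number of active servers), not GRMP-Q itself.
    n = len(load)
    for i in range(n):
        j = rng.randrange(n)
        if i == j:
            continue
        a, b = (i, j) if load[i] >= load[j] else (j, i)
        if load[b] and load[a] + load[b] <= capacity[a]:
            load[a] += load[b]
            load[b] = 0

def active_servers(load):
    return sum(1 for x in load if x > 0)

rng = random.Random(42)
capacity = [10] * 8
load = [2] * 8            # total load of 16 units fits on two servers
for _ in range(50):
    gossip_round(load, capacity, rng)
```

Because each exchange uses only the state of two servers, rounds like this need no central coordinator, which is the property that lets the protocol scale; the simulations in the paper quantify this up to very large system sizes.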

Place, publisher, year, edition, pages
IFIP, 2011
Keyword
cloud computing, green computing, distributed management, power management, resource allocation, gossip protocols, server consolidation
National Category
Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-37883 (URN)
2-s2.0-84855744790 (Scopus ID)
9781457715884 (ISBN)
Conference
The 7th International Conference on Network and Service Management, Paris, France, 24-28 October, 2011
Funder
ICT - The Next Generation
Note

QC 20110818

Available from: 2011-08-18. Created: 2011-08-18. Last updated: 2016-04-11. Bibliographically approved.
2. On the performance of the Spotify backend
2013 (English). In: Journal of Network and Systems Management, ISSN 1064-7570, E-ISSN 1573-7705. Article in journal (Refereed). Published.
Abstract [en]

We model and evaluate the performance of a distributed key-value storage system that is part of the Spotify backend. Spotify is an on-demand music streaming service, offering low-latency access to a library of over 20 million tracks and currently serving over 20 million users. We first present a simplified model of the Spotify storage architecture, in order to make its analysis feasible. We then introduce an analytical model for the distribution of the response time, a key metric in the Spotify service. We parameterize and validate the model using measurements from two different testbed configurations and from the operational Spotify infrastructure. We find that the model is accurate, with measurements within 11% of predictions, within the range of normal load patterns. In addition, we model the capacity of the Spotify storage system under different object allocation policies and find that measurements on our testbed are within 9% of the model predictions. The model helps us justify the object allocation policy adopted for the Spotify storage system.
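The notion of a response-time distribution can be made concrete with a minimal queueing calculation. This is an elementary M/M/1 sketch with invented arrival and service rates, not the paper's actual storage model:

```python
import math

def response_time_cdf(t, lam, mu):
    # P(T <= t) for an M/M/1 queue with Poisson arrivals at rate lam
    # and exponential service at rate mu (requests/s); stable if lam < mu.
    assert lam < mu, "queue is unstable"
    return 1.0 - math.exp(-(mu - lam) * t)

def latency_percentile(p, lam, mu):
    # Invert the CDF: the latency t within which a fraction p of
    # requests complete.
    return -math.log(1.0 - p) / (mu - lam)

# Example: a server receiving 100 req/s with capacity for 150 req/s.
t95 = latency_percentile(0.95, lam=100.0, mu=150.0)
```

A model of this kind lets one ask exactly the questions the paper validates against measurements: what fraction of requests finish within a latency bound, and how that fraction shifts as load approaches capacity.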

Place, publisher, year, edition, pages
Springer-Verlag New York, 2013
Keyword
Key-value store, distributed object store, object allocation policy, performance modeling, performance measurements, response times
National Category
Communication Systems; Computer Systems
Identifiers
urn:nbn:se:kth:diva-129973 (URN)
10.1007/s10922-013-9292-2 (DOI)
000350554700009 ()
2-s2.0-84921067530 (Scopus ID)
Note

QC 20131024

Available from: 2013-10-07. Created: 2013-10-07. Last updated: 2017-12-06. Bibliographically approved.
3. A service-agnostic method for predicting service metrics in real-time
(English). Manuscript (preprint) (Other academic)
Abstract [en]

We predict performance metrics of cloud services using statistical learning, whereby the behavior of a system is learned from observations. Specifically, we collect device and network statistics from a cloud testbed and apply regression methods to predict, in real-time, client-side service metrics for video streaming and key-value store services. Our method is service agnostic in the sense that it takes as input operating-systems and network statistics instead of service-specific metrics. We show that feature set reduction significantly improves the prediction accuracy in our case, while simultaneously reducing model computation time. We find that the prediction accuracy decreases when, instead of a single service, both services run on the same testbed simultaneously or when the network quality on the path between the server cluster and the client deteriorates. Finally, we discuss the design and implementation of a real-time analytics engine, which processes streams of device statistics and service metrics from testbed sensors and produces model predictions through online learning.
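The feature-set reduction step can be illustrated with a small self-contained sketch. The data, coefficients, and the correlation-based ranking below are invented for the example; the paper works with real device and network statistics and evaluates several regression methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 22
X = rng.standard_normal((n_samples, n_features))
# Synthetic "device statistics": only features 0 and 1 actually drive
# the service metric y (coefficients invented for illustration).
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(n_samples)

def top_k_features(X, y, k):
    # Feature-set reduction: rank features by absolute correlation
    # with the target and keep the top k.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

keep = top_k_features(X, y, 2)
# Fit ordinary least squares on the reduced feature set only.
w, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
pred = X[:, keep] @ w
nmae = np.mean(np.abs(pred - y)) / np.mean(np.abs(y))
```

Dropping uninformative features shrinks the model's input dimension, which is why reduction can cut computation time while holding or improving accuracy, the effect the paper reports.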

Keyword
Quality of service, cloud computing, network analytics, statistical learning, machine learning
National Category
Computer Systems; Communication Systems; Telecommunications
Identifiers
urn:nbn:se:kth:diva-184203 (URN)
Projects
REALM
Funder
VINNOVA, 2013-03895
Note

QC 20160411

Available from: 2016-03-30. Created: 2016-03-30. Last updated: 2016-04-11. Bibliographically approved.
4. Predicting Real-time Service-level Metrics from Device Statistics
2015 (English). In: IFIP/IEEE International Symposium on Integrated Network Management, IM 2015, Ottawa, Canada. IEEE Communications Society, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

While real-time service assurance is critical for emerging telecom cloud services, understanding and predicting performance metrics for such services is hard. In this paper, we pursue an approach based upon statistical learning whereby the behavior of the target system is learned from observations. We use methods that learn from device statistics and predict metrics for services running on these devices. Specifically, we collect statistics from the Linux kernel of a server machine and predict client-side metrics for a video-streaming service (VLC). The fact that we collect thousands of kernel variables, while omitting service instrumentation, makes our approach service-independent and unique. While our current lab configuration is simple, our results, gained through extensive experimentation, demonstrate the feasibility of accurately predicting client-side metrics, such as video frame rates and RTP packet rates, often within 10-15% error (NMAE), even under high computational load and across traces from different scenarios.
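The NMAE figure can be made concrete. One common definition, which the paper may normalize slightly differently, divides the mean absolute error by the mean of the observed values; the frame-rate numbers below are made up for illustration:

```python
def nmae(actual, predicted):
    # Normalized Mean Absolute Error: mean absolute prediction error
    # divided by the mean of the actual (measured) values.
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return mae / (sum(actual) / n)

# e.g. measured video frame rates vs. model predictions (invented data)
err = nmae([25.0, 24.0, 26.0], [27.0, 24.0, 23.0])
```

Normalizing by the mean of the measurements makes errors comparable across metrics with different scales, such as frame rates and packet rates.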

Place, publisher, year, edition, pages
IEEE Communications Society, 2015
Keyword
Quality of service, cloud computing, network analytics, statistical learning, machine learning, video streaming
National Category
Communication Systems; Computer Systems; Telecommunications
Research subject
Computer Science; Electrical Engineering
Identifiers
urn:nbn:se:kth:diva-158063 (URN)
10.1109/INM.2015.7140318 (DOI)
000380495900049 ()
2-s2.0-84942572120 (Scopus ID)
978-3-9018-8276-0 (ISBN)
Conference
IFIP/IEEE International Symposium on Integrated Network Management, IM 2015, Ottawa, Canada, May 11-15, 2015
Projects
VINNOVA REALM
Funder
VINNOVA, 2013-03895
Note

QC 20150527

Available from: 2014-12-22. Created: 2014-12-22. Last updated: 2016-09-23. Bibliographically approved.
5. Allocating Compute and Network Resources under Management Objectives in Large-Scale Clouds
2013 (English). In: Journal of Network and Systems Management, ISSN 1064-7570, E-ISSN 1573-7705. Article in journal (Refereed). Published.
Abstract [en]

We consider the problem of jointly allocating compute and network resources in a large Infrastructure-as-a-Service (IaaS) cloud. We formulate the problem of optimally allocating resources to virtual data centers (VDCs) for four well-known management objectives: balanced load, energy efficiency, fair allocation, and service differentiation. Then, we outline an architecture for resource allocation, which centers around a set of cooperating controllers, each solving a problem related to the chosen management objective. We illustrate how a global management objective is mapped onto objectives that govern the execution of these controllers. For a key controller, the Dynamic Placement Controller, we give a detailed distributed design, which is based on a gossip protocol that can switch between management objectives. The design is applicable to a broad class of management objectives, which we characterize through a property of the objective function. The property ensures the applicability of an iterative descent method that the gossip protocol implements. We evaluate, through simulation, the dynamic placement of VDCs for a large cloud under changing load and VDC churn. Simulation results show that this controller is effective and highly scalable, up to 100,000 nodes, for the management objectives considered.
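The iterative-descent idea can be sketched for the balanced-load objective. This is a toy centralized step with made-up server loads; the paper's controller performs improvements of this kind through pairwise gossip exchanges rather than a global view.

```python
def imbalance(load):
    # Balanced-load objective: sum of squared deviations from the mean.
    avg = sum(load) / len(load)
    return sum((x - avg) ** 2 for x in load)

def descent_step(load, unit=1):
    # One descent step: move `unit` of load from the most-loaded to the
    # least-loaded server, but only if that strictly lowers the objective.
    hi = max(range(len(load)), key=lambda i: load[i])
    lo = min(range(len(load)), key=lambda i: load[i])
    candidate = load[:]
    candidate[hi] -= unit
    candidate[lo] += unit
    return candidate if imbalance(candidate) < imbalance(load) else load

load = [9, 1, 2]
for _ in range(10):
    load = descent_step(load)
```

Each step only ever decreases the objective, which is the property (monotone improvement under local moves) that makes an iterative descent method applicable to the class of objective functions the paper characterizes.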

Keyword
Cloud computing, distributed management, resource allocation, gossip protocols, management objectives
National Category
Computer Systems; Communication Systems
Identifiers
urn:nbn:se:kth:diva-126965 (URN)
10.1007/s10922-013-9280-6 (DOI)
000350554700005 ()
2-s2.0-84921068169 (Scopus ID)
Note

QC 20130828

Available from: 2013-08-23. Created: 2013-08-23. Last updated: 2017-12-06. Bibliographically approved.

Open Access in DiVA

fulltext_final (PDF, 5118 kB)

By author/editor
Yanggratoke, Rerngvit
