Self-Management for Large-Scale Distributed Systems
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
2012 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management.

In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers.

In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time-consuming and error-prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store that supports multiple consistency levels and is based on a peer-to-peer network. The store enables a tradeoff between high availability and data consistency. Using majorities avoids the potential drawbacks of master-based consistency control, namely a single point of failure and a potential performance bottleneck.

In the third part, we investigate self-management for Cloud-based storage systems with a focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a state-space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using a state-space model that makes it possible to trade off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012, xix, 266 p.
Series
TRITA-ICT-ECS AVH, ISSN 1653-6363 ; 12:04
Keyword [en]
Self-Management, Autonomic Computing, Control Theory, Distributed Systems, Grid Computing, Cloud Computing, Elastic Services, Key-Value Stores
National Category
Computer Systems
Research subject
SRA - ICT
Identifiers
URN: urn:nbn:se:kth:diva-101661
ISBN: 978-91-7501-437-1 (print)
OAI: oai:DiVA.org:kth-101661
DiVA: diva2:548547
Public defence
2012-09-26, Sal E, Forum IT-Universitetet, KTH, Isajordsgatan 39, Kista, 14:00 (English)
Opponent
Supervisors
Funder
ICT - The Next Generation
Note

QC 20120831

Available from: 2012-08-31 Created: 2012-08-30 Last updated: 2014-01-23
Bibliographically approved
List of papers
1. Enabling Self-Management Of Component Based Distributed Applications
2008 (English) In: From Grids to Service and Pervasive Computing, Springer-Verlag New York, 2008, 163-174 p. Conference paper, Published paper (Refereed)
Abstract [en]

Deploying and managing distributed applications in dynamic Grid environments requires a high degree of autonomous management. Programming autonomous management in turn requires programming environment support and higher-level abstractions to become feasible. We present a framework for programming self-managing component-based distributed applications. The framework enables the separation of an application’s functional and non-functional (self-*) parts. The framework extends the Fractal component model with the component group abstraction and with one-to-any and one-to-all bindings between components and groups. The framework supports a network-transparent view of the system architecture, which simplifies the design of application self-* code. The framework provides a concise and expressive API for self-* code. The implementation of the framework relies on the scalability and robustness of the Niche structured P2P overlay network. We have also developed a distributed file storage service to illustrate and evaluate our framework.

Place, publisher, year, edition, pages
Springer-Verlag New York, 2008
Series
CoreGRID
Keyword
self-management, autonomic computing, component-based applications, P2P, Grid
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-12956 (URN)
10.1007/978-0-387-09455-7_12 (DOI)
000259036400012 ()
978-0-387-09455-7 (ISBN)
Conference
10th CoreGRID Symposium 2008, Canary Islands, Spain, August 25-26, 2008
Projects
FP6 EU project Grid4All (Contract IST-2006-034567)
FP6 Network of Excellence CoreGRID (Contract IST-2002-004265)
Note
QC 20100520 VV 20111221
Available from: 2010-05-20 Created: 2010-05-20 Last updated: 2012-08-31
Bibliographically approved
2. Niche: A Platform for Self-Managing Distributed Applications
2012 (English) In: Formal and Practical Aspects of Autonomic Computing and Networking: Specification, Development, and Verification / [ed] Phan Cong-Vinh, IGI Global, 2012, 241-283 p. Chapter in book (Refereed)
Abstract [en]

We present Niche, a general-purpose, distributed component management system used to develop, deploy, and execute self-managing distributed applications. Niche consists of both a component-based programming model and a distributed runtime environment. It is especially designed for complex distributed applications that run and manage themselves in dynamic and volatile environments. Self-management in dynamic environments is challenging due to the high rate of system or environmental changes and the corresponding need to frequently reconfigure, heal, and tune the application. The challenges are met partly by making use of an underlying overlay in the platform to provide an efficient, location-independent, and robust sensing and actuation infrastructure, and partly by allowing for maximum decentralization of management. We describe the overlay services and the execution environment, showing how the challenges in dynamic environments are met. We also describe the programming model and a high-level design methodology for developing decentralized management, illustrated by two application case studies.

Place, publisher, year, edition, pages
IGI Global, 2012
National Category
Computer Science Software Engineering
Identifiers
urn:nbn:se:kth:diva-50235 (URN)
10.4018/978-1-60960-845-3.ch010 (DOI)
2-s2.0-84898222936 (Scopus ID)
9781609608453 (ISBN)
1609608453 (ISBN)
Projects
FP6 EU-project Grid4All (contract IST-2006-034567)
FP6 EU-project SELFMAN (contract IST-2006-034084)
Funder
ICT - The Next Generation
Note

QC 20130823

Available from: 2011-12-02 Created: 2011-12-02 Last updated: 2014-01-27
Bibliographically approved
3. A design methodology for self-management in distributed environments
2009 (English) In: IEEE International conference on Computational Science and Engineering, 2009, 430-436 p. Conference paper, Published paper (Refereed)
Abstract [en]

Autonomic computing is a paradigm that aims at reducing administrative overhead by providing autonomic managers to make applications self-managing. In order to better deal with dynamic environments, and for improved performance and scalability, we advocate distributing management functions among several cooperative managers that coordinate their activities in order to achieve management objectives. We present a methodology for designing the management part of a distributed self-managing application in a distributed manner. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. We illustrate the proposed design methodology by applying it to the design and development of a distributed storage service as a case study. The storage service prototype has been developed using the distributed component management system Niche. Distribution of autonomic managers allows distributing the management overhead and increases management performance due to concurrency and better locality.

Keyword
autonomic computing, control loops, distributed systems, selfmanagement, component management system, design and development, design methodology, design steps, distributed environments, distributed storage, dynamic environments, management functions, management objectives, self management, self-managing, storage services, computer science, design, distribution functions, large scale systems, light measurements, managers, model checking, remote control, management
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-12957 (URN)
10.1109/CSE.2009.301 (DOI)
2-s2.0-70749096986 (Scopus ID)
9780760538235 (ISBN)
Note
QC 20100520
Available from: 2010-05-20 Created: 2010-05-20 Last updated: 2012-08-31
Bibliographically approved
4. Achieving Robust Self-Management for Large-Scale Distributed Applications
2010 (English) In: Self-Adaptive and Self-Organizing Systems (SASO), 2010 4th IEEE International Conference on: SASO 2010, IEEE Computer Society, 2010, 31-40 p. Conference paper, Published paper (Refereed)
Abstract [en]

Achieving self-management can be challenging, particularly in dynamic environments with resource churn (joins/leaves/failures). Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of robust management elements (RMEs), which are able to heal themselves under continuous churn. Using RMEs allows the developer to separate the issue of dealing with the effect of churn on management from the management logic. This facilitates the development of robust management by making the developer focus on managing the application while relying on the platform to provide the robustness of management. RMEs can be implemented as fault-tolerant long-living services. We present a generic approach and an associated algorithm to achieve fault-tolerant long-living services. Our approach is based on replicating a service using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. The algorithm uses P2P replica placement schemes to place replicas and uses the P2P overlay to monitor them. The replicated state machine is extended to analyze monitoring data in order to decide on when and where to migrate. We describe how to use our approach to achieve robust management elements. We present a simulation-based evaluation of our approach which shows its feasibility.

Place, publisher, year, edition, pages
IEEE Computer Society, 2010
Keyword
P2P replica placement schemes, fault-tolerant long-living services, finite state machine replication, large-scale distributed applications, management logic, reconfigurable replica set, resource churn, robust management elements, robust self-management, fault tolerant computing, finite state machines, peer-to-peer computing
National Category
Computer Science Computer Systems
Identifiers
urn:nbn:se:kth:diva-53219 (URN)
10.1109/SASO.2010.42 (DOI)
2-s2.0-79952045321 (Scopus ID)
978-1-4244-8537-6 (ISBN)
Conference
2010 Fourth IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO), Budapest, Hungary, Sep. 27-Oct. 1, 2010
Funder
ICT - The Next Generation
Note
VV 20121223. QC 20120103
Available from: 2011-12-23 Created: 2011-12-23 Last updated: 2012-08-31
Bibliographically approved
5. Robust Fault-Tolerant Majority-Based Key-Value Store Supporting Multiple Consistency Levels
2011 (English) In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), 2011, 589-596 p. Conference paper, Published paper (Refereed)
Abstract [en]

The wide spread of Web 2.0 applications with rapidly growing amounts of user-generated data, such as wikis, social networks, and media sharing, has posed new challenges for the supporting infrastructure, in particular for storage systems. In order to meet these challenges, Web 2.0 applications have to trade off between the high availability and the consistency of their data. Another important issue is the privacy of user-generated data, which might be compromised by organizations that own and control the datacenters where user data are stored. We propose a large-scale, robust, and fault-tolerant key-value object store that is based on a peer-to-peer network owned and controlled by a community of users. To meet the demands of Web 2.0 applications, the store supports an API consisting of different read and write operations with various data consistency guarantees, from which a wide range of web applications can choose the operations that match their data consistency, performance, and availability requirements. For evaluation, simulation has been carried out to test the system's availability, scalability, and fault tolerance in a dynamic, Internet-wide environment.
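
As a rough illustration of the majority-based idea mentioned in the abstract above, the following minimal sketch shows versioned reads and writes that succeed only when a majority of replicas is reachable. It is not the store's actual API or algorithm. The names `Replica`, `majority`, `quorum_read`, and `quorum_write` are hypothetical, and the sketch ignores concurrency, networking, and the multiple consistency levels the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node holding versioned values (hypothetical model)."""
    data: dict = field(default_factory=dict)  # key -> (version, value)
    alive: bool = True


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def quorum_read(replicas, key):
    """Ask every reachable replica and return the newest (version, value)."""
    replies = [r.data.get(key, (0, None)) for r in replicas if r.alive]
    if len(replies) < majority(len(replicas)):
        raise RuntimeError("no majority reachable")
    # Any two majorities overlap, so the newest version is among the replies.
    return max(replies, key=lambda reply: reply[0])


def quorum_write(replicas, key, value):
    """Store value under a version newer than any seen by a majority."""
    version, _ = quorum_read(replicas, key)  # raises if no majority
    for r in replicas:
        if r.alive:
            r.data[key] = (version + 1, value)
    return version + 1
```

Because any two majorities of the same replica set share at least one member, a read that reaches a majority sees the latest majority-acknowledged write even if a different subset of nodes is down, and no single master node is required.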

Series
International Conference on Parallel and Distributed Systems - Proceedings, ISSN 1521-9097
Keyword
peer-to-peer, key-value store, consistency models, distributed hash table, majority-based quorum technique
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-92454 (URN)
10.1109/ICPADS.2011.110 (DOI)
000299395900076 ()
2-s2.0-84856603401 (Scopus ID)
978-0-7695-4576-9 (ISBN)
Conference
17th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Tainan, Taiwan, December 7-9, 2011
Note
QC 20120404
Available from: 2012-04-04 Created: 2012-04-02 Last updated: 2012-08-31
Bibliographically approved
6. State-Space Feedback Control for Elastic Distributed Storage in a Cloud Environment
2012 (English) In: ICAS 2012: The Eighth International Conference on Autonomic and Autonomous Systems, St. Maarten, Netherlands Antilles, 2012, 589-596 p. Conference paper, Published paper (Refereed)
Abstract [en]

Elasticity in Cloud computing is an ability of a system to scale up and down (request and release resources) in response to changes in its environment and workload. Elasticity can be achieved manually or automatically. Efforts are being made to automate elasticity in order to improve system performance under dynamic workloads. In this paper, we report our experience in designing an elasticity controller for a key-value storage service deployed in a Cloud environment. To design our controller, we have adopted a control theoretic approach. Automation of elasticity is achieved by providing a feedback controller that automatically increases and decreases the number of nodes in order to meet service level objectives under high load and to reduce costs under low load. Every step in the building of a controller for elastic storage, including system identification and controller design, is discussed. We have evaluated our approach by using simulation. We have developed a simulation framework EStoreSim in order to simulate an elastic key-value store in a Cloud environment and be able to experiment with different controllers. We have examined the implemented controller against specific service level objectives and evaluated the controller behavior in different scenarios. Our simulation experiments have shown the feasibility of our approach to automate elasticity of storage services using state-space feedback control.
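
To make the notion of a feedback controller that adds and removes nodes more concrete, here is a minimal sketch of one control step, under loudly stated assumptions. It uses a simple static state-feedback law u = K (x - x_ref), where the state x holds measured metrics and x_ref holds their targets. The choice of metrics (CPU load and response time), the gain values, and the function names are hypothetical. The paper derives its controller through system identification, which this sketch does not attempt.

```python
def control_input(state, setpoint, gain):
    """Static state feedback: u = K . (x - x_ref).

    The sign convention is an assumption: positive gains request more
    nodes when the measured metrics exceed their targets.
    """
    errors = [x - ref for x, ref in zip(state, setpoint)]
    return sum(k * e for k, e in zip(gain, errors))


def next_node_count(nodes, state, setpoint, gain, min_nodes=1, max_nodes=100):
    """Apply one control step and clamp the result to the allowed range."""
    u = control_input(state, setpoint, gain)
    return max(min_nodes, min(max_nodes, nodes + round(u)))


# Hypothetical targets: 60% CPU load and 100 ms response time.
SETPOINT = (0.6, 100.0)
# Hypothetical gains, one per state variable.
GAIN = (10.0, 0.05)
```

For example, `next_node_count(5, (0.9, 120.0), SETPOINT, GAIN)` asks for more nodes because both metrics are above target, while a state below target releases nodes down to `min_nodes`.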

Place, publisher, year, edition, pages
St. Maarten, Netherlands Antilles, 2012
Keyword
elasticity, key-value store, Cloud, state-space feedback control
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-101375 (URN)978-1-61208-187-8 (ISBN)
Conference
The 8th International Conference on Autonomic and Autonomous Systems (ICAS 2012)
Funder
ICT - The Next Generation
Note

QC 20130524

QC 20151216

Available from: 2012-08-27 Created: 2012-08-27 Last updated: 2015-12-16
Bibliographically approved
7. ElastMan: Autonomic Elasticity Manager for Cloud-Based Key-Value Stores
2012 (English) Report (Other academic)
Abstract [en]

The increasing spread of elastic Cloud services, together with the pay-as-you-go pricing model of Cloud computing, has led to the need for an elasticity controller. The controller automatically resizes an elastic service in response to changes in workload, in order to meet Service Level Objectives (SLOs) at a reduced cost. However, variable performance of Cloud virtual machines and nonlinearities in Cloud services, such as the diminishing reward of adding a service instance as the scale increases, complicate the controller design. We present the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores. ElastMan combines feedforward and feedback control. Feedforward control is used to respond to spikes in the workload by quickly resizing the service to meet SLOs at a minimal cost. Feedback control is used to correct modeling errors and to handle diurnal workload. To address nonlinearities, our design of ElastMan leverages the near-linear scalability of elastic Cloud services in order to build a scale-independent model of the service. Our design, based on combining feedforward and feedback control, makes it possible to efficiently handle both diurnal and rapid changes in workload in order to meet SLOs at a minimal cost. Our evaluation shows the feasibility of our approach to automation of Cloud service elasticity.
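
The division of labor between feedforward and feedback control described above can be sketched as follows. This is an illustrative assumption, not ElastMan's actual design: the spike threshold, the per-node capacity model, the proportional feedback rule, and all function and parameter names are hypothetical.

```python
import math


def feedforward(workload_ops, capacity_per_node):
    """Model-based sizing: nodes needed if throughput scales near-linearly."""
    return math.ceil(workload_ops / capacity_per_node)


def feedback(nodes, latency_ms, slo_ms, gain):
    """Proportional correction based on the relative latency error."""
    return nodes + round(gain * (latency_ms - slo_ms) / slo_ms * nodes)


def resize(nodes, workload_ops, prev_workload_ops, latency_ms, slo_ms,
           capacity_per_node=1000.0, spike_ratio=0.3, gain=0.5):
    """Pick feedforward for sudden workload spikes, feedback otherwise."""
    change = abs(workload_ops - prev_workload_ops) / max(prev_workload_ops, 1.0)
    if change > spike_ratio:
        # Hypothetical rule: a large relative change counts as a spike.
        target = feedforward(workload_ops, capacity_per_node)
    else:
        # Gradual change: correct the current size from the measured latency.
        target = feedback(nodes, latency_ms, slo_ms, gain)
    return max(1, target)
```

Expressing both rules relative to a per-node quantity (capacity per node, relative latency error) is one simple way to keep the logic independent of the current scale, in the spirit of the scale-independent model the abstract mentions.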

Publisher
14 p.
Series
TRITA-ICT-ECS R, ISSN 1653-7238 ; 12:01
Keyword
Elasticity Controller, Cloud Storage, feedback, feedforward, SLO
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-101660 (URN)
Funder
ICT - The Next Generation
Note

QC 20120831

Available from: 2012-08-30 Created: 2012-08-30 Last updated: 2014-01-23
Bibliographically approved

Open Access in DiVA

AhmadAlShishtawy_PhDThesis (8313 kB), 2529 downloads
File information
File name: FULLTEXT01.pdf
File size: 8313 kB
Checksum (SHA-512): 615daf655ec426638fa33637b9ce8d444789c9bfc8d29e3631f061f503e76c347abb0edbd81325784de0d0887199ebb8127b5b1bfab3bb51cfe88ef00188643b
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Al-Shishtawy, Ahmad
By organisation
Software and Computer systems, SCS
Computer Systems

Total: 2529 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
Total: 1294 hits