Probabilistic Fault Management in Networked Systems
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. Swedish ICT SICS.
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Technical advances in network communication systems (e.g. radio access networks), combined with evolving concepts based on virtualization (e.g. clouds), require new management algorithms to handle the increasing complexity of network behavior and the variability of the network environment. Current network management operations are primarily centralized and deterministic, and are carried out via automated scripts and manual interventions, which work for mid-sized and fairly static networks. The next generation of communication networks and systems will be significantly larger and more complex, and will require scalable and autonomous management algorithms to meet operational requirements on reliability, failure resilience, and resource efficiency.

A promising approach to address these challenges includes the development of probabilistic management algorithms, following three main design goals. The first goal relates to all aspects of scalability, ranging from efficient usage of network resources to computational efficiency. The second goal relates to adaptability in maintaining the models up-to-date for the purpose of accurately reflecting the network state. The third goal relates to reliability in the algorithm performance in the sense of improved performance predictability and simplified algorithm control.

This thesis is about probabilistic approaches to fault management that follow the concepts of probabilistic network management (PNM). An overview of existing network management algorithms and methods in relation to PNM is provided. The concepts of PNM and the implications of employing PNM-algorithms are presented and discussed. Moreover, some of the practical differences of using a probabilistic fault detection algorithm compared to a deterministic method are investigated. Further, six probabilistic fault management algorithms that implement different aspects of PNM are presented.

The algorithms are highly decentralized, adaptive and autonomous, and cover several problem areas, such as probabilistic fault detection and controllable detection performance; distributed and decentralized change detection in modeled link metrics; root-cause analysis in virtual overlays; event-correlation and pattern mining in data logs; and probabilistic failure diagnosis. The probabilistic models (for a large part based on Bayesian parameter estimation) are memory-efficient and can be used and re-used for multiple purposes, such as performance monitoring, detection, and self-adjustment of the algorithm behavior.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014. 61 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2014:06
Keyword [en]
probabilistic network management; probabilistic modeling; fault management; fault detection; event-correlation; change detection; probabilistisk nätverkshantering; probabilistiska modeller; felhantering; feldetektion; korrelationsanalys; förändringsdetektion
National Category
Communication Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-144608
ISBN: 978-91-7595-114-0 (print)
OAI: oai:DiVA.org:kth-144608
DiVA: diva2:714337
Public defence
2014-05-28, F3, Lindstedtsvägen 26, KTH, Stockholm, 14:00 (English)
Note

QC 20140509

Available from: 2014-05-09 Created: 2014-04-27 Last updated: 2014-05-13 Bibliographically approved
List of papers
1. Fault-Tolerant Incremental Diagnosis with Limited Historical Data
2008 (English). Conference paper, Published paper (Refereed)
Abstract [en]

We describe a novel incremental diagnostic system based on a statistical model that is trained from empirical data. The system guides the user by calculating what additional information would be most helpful for the diagnosis. We show that our diagnostic system can produce satisfactory classification rates using only small amounts of available background information, which reduces the need to collect vast quantities of initial training data. Further, we show that incorporating inconsistency-checking mechanisms in our diagnostic system reduces the number of incorrect diagnoses caused by erroneous input.

National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-144601 (URN); 10.1109/PHM.2008.4711451 (DOI); 000266719200038 (); 2-s2.0-58449111615 (Scopus ID)
Conference
The International Conference on Prognostics and Health Management (PHM)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
2. Estimating the Parameters of Randomly Interleaved Markov Models
2009 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Sequences that can be assumed to have been generated by a number of Markov models, whose outputs are randomly interleaved but where the actual sources are hidden, occur in a number of practical situations where data is captured as an unlabeled stream of events. We present a practical method for estimating model parameters on large data sets under the assumption that all sources are identical. Results on representative examples are presented, together with a discussion on the accuracy and performance of the proposed estimation algorithms. Finally, we describe a real-world case study where we apply the technique to the sequence of events recorded in the technical support database of an IT vendor.

National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-144602 (URN); 10.1109/ICDMW.2009.17 (DOI); 000290247100049 (); 2-s2.0-77951187103 (Scopus ID)
Conference
IEEE International Conference on Data Mining Workshops (ICDMW)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
3. Towards Distributed and Adaptive Detection and Localisation of Network Faults
2010 (English). Conference paper, Published paper (Refereed)
Abstract [en]

We present a statistical probing-approach to distributed fault-detection in networked systems, based on autonomous configuration of algorithm parameters. Statistical modelling is used for detection and localisation of network faults. A detected fault is isolated to a node or link by collaborative fault-localisation. From local measurements obtained through probing between nodes, probe response delay and packet drop are modelled via parameter estimation for each link. Estimated model parameters are used for autonomous configuration of algorithm parameters, related to probe intervals and detection mechanisms. Expected fault-detection performance is formulated as a cost instead of specific parameter values, significantly reducing configuration efforts in a distributed system. The benefit offered by using our algorithm is fault-detection with increased certainty based on local measurements, compared to other methods not taking observed network conditions into account. We investigate the algorithm performance for varying user parameters and failure conditions. The simulation results indicate that more than 95% of the generated faults can be detected with few false alarms. At least 80% of the link faults and 65% of the node faults are correctly localised. The performance can be improved by parameter adjustments and by using alternative paths for communication of algorithm control messages.
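The idea of deriving detection parameters from estimated link models and a stated performance target can be illustrated with a small sketch. It is not the algorithm of the paper: it assumes independent probe drops, a Beta prior on the drop probability, and a target false-alarm probability, and the function names `estimate_drop_probability` and `probes_before_alarm` are hypothetical.

```python
import math


def estimate_drop_probability(dropped, sent, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the per-probe drop probability.

    Assumes a Beta(prior_a, prior_b) prior (an assumption of this sketch),
    so the estimate stays strictly between 0 and 1 even with no data.
    """
    if sent < 0 or not 0 <= dropped <= sent:
        raise ValueError("need 0 <= dropped <= sent")
    return (dropped + prior_a) / (sent + prior_a + prior_b)


def probes_before_alarm(drop_probability, max_false_alarm):
    """Smallest number n of consecutive unanswered probes to wait for.

    Under the independence assumption, a healthy link loses n probes in a
    row with probability drop_probability ** n; n is chosen so that this
    probability does not exceed max_false_alarm.
    """
    if not 0.0 < drop_probability < 1.0:
        raise ValueError("drop_probability must be in (0, 1)")
    if not 0.0 < max_false_alarm < 1.0:
        raise ValueError("max_false_alarm must be in (0, 1)")
    n = math.ceil(math.log(max_false_alarm) / math.log(drop_probability))
    return max(1, n)


# Usage: one drop observed in 98 probes, at most 1 false alarm in 10,000 checks.
p_drop = estimate_drop_probability(dropped=1, sent=98)
n_probes = probes_before_alarm(p_drop, max_false_alarm=1e-4)
```

The operator only states the acceptable false-alarm probability; the number of probes to wait for follows from what was observed on the link, which is the sense in which configuration effort is reduced.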

National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-144600 (URN); 10.1109/AICT.2010.65 (DOI); 2-s2.0-77955355785 (Scopus ID)
Conference
The Sixth Advanced International Conference on Telecommunications (AICT)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
4. Long-Term Adaptation and Distributed Detection of Local Network Changes
2010 (English). Conference paper, Published paper (Refereed)
Abstract [en]

We present a statistical approach to distributed detection of local latency shifts in networked systems. For this purpose, response delay measurements are performed between neighbouring nodes via probing. The expected probe response delay on each connection is statistically modelled via parameter estimation. Adaptation to drifting delays is accounted for by the use of overlapping models, such that previous models are partially used as input to future models. Based on the symmetric Kullback-Leibler divergence metric, latency shifts can be detected by comparing the estimated parameters of the current and previous models. In order to reduce the number of detection alarms, thresholds for divergence and convergence are used. The method that we propose can be applied to many types of statistical distributions, and requires only constant memory compared to e.g., sliding window techniques and decay functions. Therefore, the method is applicable in various kinds of network equipment with limited capacity, such as sensor networks, mobile ad hoc networks etc. We have investigated the behaviour of the method for different model parameters. Further, we have tested the detection performance in network simulations, for both gradual and abrupt shifts in the probe response delay. The results indicate that over 90% of the shifts can be detected. Undetected shifts are mainly the effects of long convergence processes triggered by previous shifts. The overall performance depends on the characteristics of the shifts and the configuration of the model parameters.
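As a rough illustration of comparing the current and previous models with a symmetric Kullback-Leibler divergence and separate divergence and convergence thresholds, consider the sketch below. It assumes univariate Gaussian delay models, which is only one of the distribution types the abstract allows; the class names, threshold values, and alarm logic are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GaussianModel:
    """Estimated probe-delay model for one link (Gaussian is an assumption)."""
    mean: float
    var: float


def symmetric_kl(p: GaussianModel, q: GaussianModel) -> float:
    """KL(p||q) + KL(q||p) for univariate Gaussians.

    The logarithmic terms of the two directed divergences cancel,
    leaving only variance ratios and the squared mean difference.
    """
    d2 = (p.mean - q.mean) ** 2
    return (p.var + d2) / (2.0 * q.var) + (q.var + d2) / (2.0 * p.var) - 1.0


class ShiftDetector:
    """Two-threshold (hysteresis) detector, hypothetical for this sketch.

    An alarm is raised when the divergence exceeds `diverge_at`; no new
    alarm is raised until the divergence has fallen below `converge_at`.
    """

    def __init__(self, diverge_at: float, converge_at: float):
        if converge_at > diverge_at:
            raise ValueError("converge_at must not exceed diverge_at")
        self.diverge_at = diverge_at
        self.converge_at = converge_at
        self.shifted = False

    def update(self, divergence: float) -> bool:
        """Return True only when a new shift alarm is raised."""
        if not self.shifted and divergence > self.diverge_at:
            self.shifted = True
            return True
        if self.shifted and divergence < self.converge_at:
            self.shifted = False
        return False


# Usage: compare the previous and current model of one link.
detector = ShiftDetector(diverge_at=2.0, converge_at=0.5)
previous = GaussianModel(mean=10.0, var=4.0)
current = GaussianModel(mean=16.0, var=4.0)
alarm = detector.update(symmetric_kl(previous, current))
```

Only the model parameters and one boolean are stored per link, so memory use is constant regardless of how many measurements have been processed.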

National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-144603 (URN); 10.1109/GLOCOM.2010.5684137 (DOI); 000287977405109 (); 2-s2.0-79551638217 (Scopus ID)
Conference
IEEE Global Telecommunications Conference (GLOBECOM)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
5. Toward Decentralized Probabilistic Management
2011 (English). In: IEEE Communications Magazine, ISSN 0163-6804, E-ISSN 1558-1896, Vol. 49, no 7, 80-96 p. Article in journal (Refereed) Published
Abstract [en]

In recent years, data communication networks have grown to immense size and have been diversified by the mobile revolution. Existing management solutions are based on a centralized deterministic paradigm, which is appropriate for networks of moderate size operating in relatively stable conditions. However, it is becoming increasingly apparent that these management solutions are not able to cope with the large dynamic networks that are emerging. In this article, we argue that the adoption of a decentralized and probabilistic paradigm for network management will be crucial to meet the challenges of future networks, such as efficient resource usage, scalability, robustness, and adaptability. We discuss the potential of decentralized probabilistic management and its impact on management operations, and illustrate the paradigm by three example solutions for real-time monitoring and anomaly detection.

National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-144604 (URN); 10.1109/MCOM.2011.5936159 (DOI); 000292376000010 (); 2-s2.0-79959961883 (Scopus ID)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2017-12-05 Bibliographically approved
6. A Distributed Spatio-Temporal Event Correlation Protocol for Multi-Layer Virtual Networks
2011 (English). Conference paper, Published paper (Refereed)
Abstract [en]

We present a distributed spatio-temporal event correlation protocol for multi-layer networks. The problems that we address relate to scalability in stacked overlay networks and network equipment with asynchronous clocks, which complicates the problem of event correlation. We describe a cross-layer protocol designed to address these problems, operating in a fully distributed manner and taking into account asynchronous timestamps. It is assumed that events in one layer may arise from a series of events in lower layers. Detected events that are spatially related in one layer are aggregated using a gossip-like protocol, and constitute a root cause. The set of aggregated events is disseminated to lower layers and used for temporal correlation. We have tested the scalability and the performance of the distributed event protocol, using both synthetically generated and real-world topologies. The results indicate that the average overhead produced for collecting events down the stack of overlays increases with the number of layers. For a fixed number of layers, the protocol scales similarly with the graph-theoretic properties for a network of increasing size.

National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-144605 (URN); 10.1109/GLOCOM.2011.6133988 (DOI); 000300509002116 (); 2-s2.0-84857207708 (Scopus ID)
Conference
IEEE Global Telecommunications Conference (GLOBECOM)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
7. Performance Evaluation of a Distributed and Probabilistic Network Monitoring Approach
2012 (English). Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the effects of employing a probabilistic fault detection approach relative to the performance of a deterministic network monitoring method. The approach has its foundation in probabilistic network management, in which performance limits and thresholds are specified in terms of e.g. probabilities or belief values. When combined with adaptive mechanisms, probabilistic approaches can potentially offer improved controllability, adaptivity and reliability, compared to deterministic monitoring methods. Results from synthetically generated and real network QoS measurements indicate that the probabilistic approach can generally perform at least as well as a deterministic algorithm, with more predictable performance and higher resource efficiency. Due to the stochastic nature of the algorithm, worse performance than expected is sometimes observed. Nevertheless, the results give additional support to some of the practical benefits expected from using probabilistic approaches for network management purposes.

National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-144606 (URN); 2-s2.0-84872084126 (Scopus ID)
Conference
The 8th International Conference on Network and Service Management (CNSM)
Note

QC 20140509

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2014-05-09 Bibliographically approved
8. Direct In-Network Localization of Performance Degradations
(English). In: IEEE Transactions on Network and Service Management, ISSN 1932-4537, E-ISSN 1932-4537. Article in journal (Other academic) Submitted
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-144607 (URN)
Note

QS 2014

Available from: 2014-04-27 Created: 2014-04-27 Last updated: 2017-12-05 Bibliographically approved

Open Access in DiVA

Thesis (974 kB), 458 downloads
File information
File name: FULLTEXT01.pdf
File size: 974 kB
Checksum (SHA-512): ffc9bd54ed38248ec1528ed7fd92420f8bdbc51f3dd6897691e2217b70808a0451e6db04a21523ee5e62528b583afc2df2dccca6ce9c3d1d5ad3e553caa5c113
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Steinert, Rebecca
By organisation
Computational Biology, CB
Communication Systems
