Real telecommunications networks are not failure-free. Any single disconnection affects the network provider's reputation and finances, and has incalculable consequences for customers through the affected applications. A common way to handle this issue is to stipulate the availability to be guaranteed in a business contract known as a Service Level Agreement (SLA). The stipulated availability must be commercially competitive and must fit the customer's needs. However, fulfilling the SLA may imply huge costs in terms of the resources to be reserved and/or the penalties associated with violating the agreement. In addition, selecting the availability to be stipulated is a difficult task because of the following challenges: (1) SLAs are defined for a finite time interval, which demands the study of the probability distribution of the interval availability and the risk that it represents. (2) Failure and repair data from operational networks are essential to assess the SLA risk accurately. However, this kind of information is limited for a number of reasons, among them that operators do not like to have failures exposed in a competitive commercial marketplace. (3) The end-to-end interval availability is affected by several stochastic processes. The networks addressed in this thesis are compound systems where Markovian assumptions do not apply: up and down times are not exponentially distributed, and correlation between failure and repair processes may exist. (4) Designing assignment policies that use network resources efficiently is a classical challenge that becomes harder when SLA availability constraints have to be considered. This thesis addresses these challenges as follows.
In the first part of the thesis, a theoretical study of the probability distribution of the interval availability is made. A mathematical approximation is proposed to evaluate the cumulative downtime distribution of a single-component system that does not have the Markovian properties, and the evolution of the distribution of the interval availability as the observation period increases is studied. This study shows how sensitive the SLA risk is to (i) the distribution of up and down times and (ii) the duration of the SLA.
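The sensitivity described above can be illustrated with a small simulation. The sketch below is not the mathematical approximation proposed in the thesis. It is a plain Monte Carlo estimate for a single component modelled as an alternating up/down process, and all distribution parameters, the contract length, and the function names are hypothetical values chosen for illustration.

```python
import random

def interval_availability(duration, draw_up, draw_down, rng):
    """Simulate one up/down sample path; return the fraction of [0, duration] spent up."""
    t = 0.0
    downtime = 0.0
    while t < duration:
        t += draw_up(rng)                      # component works until the next failure
        if t >= duration:
            break
        repair = draw_down(rng)
        downtime += min(repair, duration - t)  # truncate a repair that outlasts the contract
        t += repair
    return 1.0 - downtime / duration

def sla_risk(duration, promised, draw_up, draw_down, runs=20000, seed=1):
    """Estimate P(interval availability < promised) over independent sample paths."""
    rng = random.Random(seed)
    violations = sum(
        interval_availability(duration, draw_up, draw_down, rng) < promised
        for _ in range(runs)
    )
    return violations / runs

if __name__ == "__main__":
    # Hypothetical parameters: mean up time 1000 h in both cases, mean down time 4 h.
    down = lambda rng: rng.gammavariate(2.0, 2.0)          # shape 2, scale 2
    exp_up = lambda rng: rng.expovariate(1.0 / 1000.0)     # exponential up times
    weib_up = lambda rng: rng.weibullvariate(500.0, 0.5)   # scale 500, shape 0.5
    for months in (1, 6, 12):
        hours = 730.0 * months
        print(months,
              sla_risk(hours, 0.995, exp_up, down),
              sla_risk(hours, 0.995, weib_up, down))
```

Running the demo prints the estimated risk for three contract lengths under two up-time distributions with the same mean, which is one way to explore how the shape of the distribution and the SLA duration interact.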
An important part of the thesis is the study of operational failure and repair events obtained from measurements of the UNINETT core network. In this study, up and down times of routers and links are characterized, and the correlation between failure and repair processes is studied. Network components are classified according to their dependability characteristics. The information obtained from the characterization phase is used as input to the simulations developed in later parts of the thesis, and to justify some of the assumptions made. These analyses lead to the conclusion that failure and repair processes in a backbone network are not independent and do not have the Markovian properties.
In this thesis, the probability that the availability delivered to a compound network connection over the contract duration is less than the availability promised in the SLA is assessed using simulation techniques. First, unprotected, shared-protected and dedicated-protected connections under non-Markovian failure and repair processes are studied. In addition, two methods to model correlated Weibull, gamma and empirically distributed up and down times are proposed. The first method uses trace-driven simulation combined with random circular shifting. The second method uses Monte Carlo techniques. Through these methods, the SLA risk may be assessed while taking the features of real, operational networks into account.
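As a rough illustration of trace-driven simulation with random circular shifting, the sketch below represents each component's measured history as a list of per-slot states (1 for up, 0 for down) and rotates it by a random offset in every replication. This is only one plausible reading of the technique, not the procedure used in the thesis. The slot representation, the independent offset per component, the series (unprotected) path model, and all function names are assumptions.

```python
import random

def circular_shift(trace, offset):
    """Rotate a trace left by offset slots, wrapping the head around to the tail."""
    offset %= len(trace)
    return trace[offset:] + trace[:offset]

def path_availability(traces, contract_slots):
    """Fraction of the first contract_slots slots in which every component is up."""
    up = sum(all(tr[i] for tr in traces) for i in range(contract_slots))
    return up / contract_slots

def trace_driven_risk(traces, contract_slots, promised, runs=1000, seed=7):
    """Estimate P(path availability < promised) over randomly shifted traces."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(runs):
        # Assumption: each component gets its own offset. This keeps the order of
        # up and down periods within a trace but breaks alignment across components.
        shifted = [circular_shift(tr, rng.randrange(len(tr))) for tr in traces]
        if path_availability(shifted, contract_slots) < promised:
            violations += 1
    return violations / runs
```

Shifting a whole trace, rather than resampling individual up and down times, preserves whatever dependence exists between consecutive periods in the measured data. Using a common offset for a group of components would additionally preserve the correlation between them.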
This thesis discusses how to allocate requested connections in a given network topology under SLA availability constraints. An intelligent sharing mechanism that uses bandwidth efficiently is proposed. In addition, the thesis studies the SLA penalty scheme and proposes a model, formulated as a two-stage stochastic optimization program, that allocates connections so as to fulfill SLA requirements and maximize the operator's profit. The model considers the stochastic behavior of network components, the correlation between failure and repair processes, the finite duration of the SLA, and the flexibility to allocate or reject a connection based on its impact on the provider's profit.
Finally, the problem of guaranteeing SLA availability is studied in cloud computing environments. This study proposes the use of the SLA budget to implement smart policies for (i) the assignment of spare servers when virtual machines are restored and (ii) the dynamic use of fault tolerance licenses. The result is a considerable reduction in the probability of failing the SLA availability requirement, achieved by making efficient use of the available cloud resources. This work is a first step in the design of SLA-aware cloud computing management, and it illustrates how the distribution of the interval availability may be manipulated by mechanisms under the provider's control.
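The abstract does not define the SLA budget. The sketch below assumes it means the downtime a virtual machine may still accumulate before its promised availability is violated, and shows a hypothetical policy that restores the machines with the least remaining budget first. The function names, the data layout, and the ordering rule are illustrative assumptions, not the policies proposed in the thesis.

```python
def remaining_budget(promised, contract_hours, downtime_hours):
    """Downtime (hours) still allowed before the promised availability is missed."""
    allowed = (1.0 - promised) * contract_hours
    return allowed - downtime_hours

def restore_order(vms, contract_hours):
    """Order VM names so that those closest to an SLA violation are restored first.

    vms maps a VM name to (promised availability, downtime accumulated so far in hours).
    """
    return sorted(
        vms,
        key=lambda name: remaining_budget(vms[name][0], contract_hours, vms[name][1]),
    )
```

For example, with a 1000-hour contract, a machine promised 0.999 availability has one hour of allowed downtime in total, so a machine that has already been down for 0.9 hours would be placed ahead of one that has been down for 0.1 hours.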
NTNU Trondheim, 2013. 192 p.
2013-05-15, Rådsrommet, Electro Building, NTNU, Gløshaugen, Trondheim, 09:33 (English)