Estimating Time to Repair Failures in a Distributed System
Linköping University, Department of Computer and Information Science.
2016 (English)
Independent thesis, Basic level (degree of Bachelor), 10,5 credits / 16 HE credits
Student thesis
Alternative title
Estimering av reparationstid vid haverier i ett distribuerat system (Swedish)
Abstract [en]

To ensure the quality of important services, high availability is critical. One aspect of availability is system downtime, which can be measured as the time to recover from failures. In this report we investigate current research on repair time and the possibility of estimating this metric from relevant parameters such as hardware and the type of fault. We thoroughly analyze a data set containing 43 000 failure traces from Los Alamos National Laboratory, covering 22 different cluster-organized systems. To enable the analysis we create and use a program that parses the raw data, sorts and categorizes it based on certain criteria, and formats the output for visualization. We analyze the data set with respect to type of fault, memory size, processor quantity, and the times at which repairs were started and completed. We visualize the number of failures and the average repair times as functions of these parameters. For the different faults and times of day we also display the empirical cumulative distribution function, to give an overview of the probability of different repair times. The failures are caused by a variety of faults, of which hardware and software faults occur most frequently. These two, along with network faults, have the highest average downtime. Time of failure proves important, since both day of week and hour of day show patterns that can be explained by, for example, work schedules. The hardware characteristics of nodes seem to affect the repair time as well, although how this correlation works is difficult to determine. Based on the extracted data, we suggest two simple methods of formulating a mathematical model for estimating downtime, both of which prove insufficient; more research on the subject, and on how the parameters affect each other, is required.

Place, publisher, year, edition, pages
2016. 27 p.
National Category
Computer and Information Science
URN: urn:nbn:se:liu:diva-131847
ISRN: LIU-IDA/LITH-EX-G--16/072--SE
OAI: diva2:1034002
Available from: 2016-10-17
Created: 2016-10-10
Last updated: 2016-10-17
Bibliographically approved

Open Access in DiVA

fulltext (525 kB), 2 downloads
File information
File name: FULLTEXT01.pdf
File size: 525 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf
