As many of us already depend on computer systems to lead our lives to a standard we find acceptable, we put a correspondingly strong emphasis on ensuring that these systems are indeed dependable. Of the various attributes that make up the system dependability, this work focuses on reliability, the ability of a system to provide continuous service. One of the main barriers to increasing the reliability of a system is that it requires special competences that most developers do not possess. Hence, many developers are faced with either spending resources on additional training or hiring external experts, who may know little of the system in question and its problem domain. As a result, increasing the reliability of a computer system can increase the acquisition cost so much that the system is built without a cost-eective level of reliability, when considering costs over its total life span. Another barrier is that a naive introduction of fault-tolerance mechanisms may actually worsen the reliability of a system. Introducing fault-tolerance mechanisms is sure to add to the complexity of the system. This added complexity, if not properly managed, is likely to result in a system with more faults, possibly contributing to a net decrease in the system reliability. Due to such barriers, computer systems often fail to meet our expectations in terms of reliability, and we as a society frequently pay a higher than necessary price when suering the consequences of system failures.
This motivates our decision to work towards the goal of reducing the development effort and competence required to create reliable distributed reactve systems. Model-driven software engineering aids developers in breaking down complex systems by abstracting away from the implementation details and often provides viewpoints capturing one concern at a time. We therefore build on the existing model-driven engineering method SPACE. The abstractions provided by model-driven engineering also facilitate automated model checking to detect software faults. We take advantage of this to provide highly automated fault-removal features.
Even a system completely free from software faults will be aected by external faults, such as power failures and the breakdown of hardware components, if the system is not designed to tolerate them. We therefore focus on adding fault-tolerance mechanisms to systems, so that they can be relied upon also in a less than ideal operational environment. To easily ensure such mechanisms are correctly built and integrated, we again look to model checking.
Going into more detail, the first contribution described is an automatic transformation of SPACE models into a formal system specification along with properties to be verified, complete with visualization of erroneous behaviour and, in some cases, automatic diagnoses and fixes. Further, the work herein contributes new capabilities to build services tolerating message losses or process crashes, as well as a two-phase development method allowing developers to start with a simpler system and deal with the full complexity of fault tolerance afterwards. Finally, we introduce additional concepts for interface contracts allowing to compositionally verify properties of fault-tolerant services, even if internally replicated across several components.
Trondheim NTNU, 2014. , 275 p.