This thesis explores methods for making computing systems more resilient to problems in network communication, both where existing infrastructure can be used and where no infrastructure is available. Specifically, we target network partitions: situations in which a network of computers or devices is split into two or more parts that cannot communicate with each other.
The first of the two tracks in the thesis is concerned with upholding system availability during a network partition even when there are integrity constraints on data. Nodes that have no means of communicating during a finite interval cannot be coordinated, so the system optimistically accepts requests during the partition. This in turn requires a reconciliation process to take place once the network is healed.
We provide several algorithms for reconciling the divergent states of the nodes, one of which allows the system to continue accepting operations during the reconciliation phase rather than having to stop all invocations. The algorithms are evaluated analytically, proving correctness and the conditions for termination. Their performance has been analysed using simulations and using an implementation as a middleware plugin in an emulated setting.
The second track considers more extreme conditions where the network is partitioned by its nature. The nodes move around in an area and opportunistically exchange messages with nodes that they meet. This serves as a model of the situation in a disaster area where the telecommunication networks are disabled. This scenario poses a number of challenges: protocols need to be both partition-tolerant and energy-efficient and must handle node mobility, while still providing good delivery and latency properties.
We analyse worst-case latency for message dissemination in such intermittently connected networks. Since the analysis is highly dependent on the mobility of the nodes, we provide a model for characterising connectivity of dynamic networks. This model captures in an abstract way how fast a protocol can spread a message in such a setting. We show how this model can be derived analytically as well as from actual trace files.
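As a rough illustration of what deriving a dissemination bound from a trace can look like, the following minimal sketch floods a message over a list of timestamped contacts and reports the longest time any node has to wait. It is not the connectivity model developed in the thesis. The trace format `(time, node_a, node_b)`, the function names, and the assumptions that contacts are instantaneous and that every node forwards to every node it meets are all hypothetical simplifications made for this example.

```python
import math

# Hypothetical trace format: (time, node_a, node_b), one entry per contact.
Contact = tuple[float, str, str]


def flooding_times(contacts: list[Contact], source: str, start: float = 0.0) -> dict[str, float]:
    """Earliest time each node can hold a message flooded from `source`.

    Assumes instantaneous, symmetric contacts; contacts sharing a
    timestamp are processed in sorted order, one hop at a time.
    """
    informed = {source: start}
    for t, a, b in sorted(contacts):
        if t < start:
            continue  # contact happened before the message existed
        if a in informed and b not in informed:
            informed[b] = t
        elif b in informed and a not in informed:
            informed[a] = t
    return informed


def worst_case_latency(contacts: list[Contact], nodes: list[str], start_times: list[float]) -> float:
    """Largest flooding latency over all sources and start times.

    Returns math.inf if some node is never reached within the trace.
    """
    worst = 0.0
    for source in nodes:
        for t0 in start_times:
            times = flooding_times(contacts, source, t0)
            if len(times) < len(nodes):
                return math.inf
            worst = max(worst, max(times.values()) - t0)
    return worst


# Toy trace with three nodes, invented purely for illustration.
trace = [(1.0, "a", "b"), (2.0, "b", "c"), (5.0, "a", "b")]
print(flooding_times(trace, "a"))
print(worst_case_latency(trace, ["a", "b", "c"], [0.0]))
```

Flooding is used here only because it gives the fastest possible spread over a given set of contacts, so the resulting latency is a lower limit for any forwarding protocol on that trace under these assumptions.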
Finally, we introduce a manycast protocol suited for disaster area networks. The protocol has been evaluated using simulations, which show that it provides very good performance under these conditions, and it has been implemented as a proof of concept on real hardware.