This PhD thesis addresses the vision of a geographically distributed immersive collaboration system that supports real-time delay-sensitive collaborations based on visual cues between performers for synchronization. Examples include collaborative dancing and remote conducting of choirs. The collaborators perform in their own collaboration spaces (CSs) at different remote places, but achieve the quality of experience (QoE) as if they were performing in the same place and scene. To arrive at that very high level of QoE, all physical surfaces of a CS are constructed from arrays of multiview autostereoscopic displays and high-resolution micro-cameras with microphones and speakers. The CSs are interconnected by a high-speed network over which the audiovisual data are transported. The capacity of the links in the network varies as they may be shared by other users outside the collaboration system.
The information era, with rapid developments in many fields, is the right time to address such a complex collaboration system. The system, however, does not yet exist due to at least four technical challenges. First, the synchronization is shown to be harmonious if the maximum end-to-end delay (EED) in processing and transporting video data between the connected CSs can be guaranteed at 11.5 ms. As the Internet is not designed to deliver such a guarantee, the Distributed Multimedia Plays (DMP) system architecture is proposed to address it by means of Quality Shaping. Second, the very low latency constraint becomes more challenging because the video quality rendered in the CSs must also degrade gracefully regardless of changing network conditions. Third, the immense traffic of audiovisual data generated from a CS requires creative data reduction and fast processing to minimize processing delay. The last challenge comes from the transient periods that are expected to occur frequently in such traffic because a CS transmits and receives visual signals only from segmented bodies of the performers. The segmentation is key in the adopted object-based video processing and compression, which discards irrelevant data based on the eye gazes of the performers that are detected and tracked in real time.
This thesis presents research work on four of the many aspects of the collaboration system: modeling, simulation, synthesis, and compression. Since the human body is the smallest building block for simulating the collaboration system, its modeling as a discrete-event system lies at the heart of the modeling and simulation of the collaboration system. By modeling a human body as a system of sixteen interconnected limbs, an event is defined as the spatial displacement of the two end points of a limb that represents its motion.
The motion of a human body is generated by simulating the forward kinematics of its limbs using discrete-event simulation (DES) that includes both stochastic motion and, as deterministic motion, gait cycles for walking and running. DES guarantees that virtually unlimited unique sets of motions can be exactly reproduced. A detailed example illustrates how any collaboration scenario with an arbitrary number of CSs and collaborators can be simulated. Based on the silhouettes of the visualized moving human bodies and the technical specification of the CSs, traces of uniquely reproducible transient traffic are synthesized as input traffic to the DES of the DMP architecture. Moreover, traffic from motions due to camera zoom and panning is also studied by real measurement and mathematical modeling.
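The idea of reproducible, event-based limb motion can be illustrated with a minimal sketch. It is not the model used in the thesis: it assumes a planar chain of only two limbs instead of sixteen, purely stochastic joint rotations, and hypothetical names (`forward_kinematics`, `simulate`, `max_step`). Each event records the displacement of the two end points of a limb, and a fixed random seed makes the whole event trace exactly repeatable.

```python
import math
import random


def forward_kinematics(origin, lengths, angles):
    """Return the joint positions of a planar limb chain.

    Each angle is relative to the parent limb (assumption of this sketch).
    """
    x, y = origin
    heading = 0.0
    points = [(x, y)]
    for length, angle in zip(lengths, angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


def simulate(seed, steps, lengths, max_step=0.1):
    """Generate a reproducible trace of limb-displacement events.

    An event is (time step, limb index, displacement of the limb's start
    point, displacement of the limb's end point).
    """
    rng = random.Random(seed)  # a fixed seed makes the trace repeatable
    angles = [0.0] * len(lengths)
    prev = forward_kinematics((0.0, 0.0), lengths, angles)
    events = []
    for t in range(steps):
        # Stochastic motion: rotate one randomly chosen joint slightly.
        joint = rng.randrange(len(lengths))
        angles[joint] += rng.uniform(-max_step, max_step)
        cur = forward_kinematics((0.0, 0.0), lengths, angles)
        for i in range(len(lengths)):
            d_start = (cur[i][0] - prev[i][0], cur[i][1] - prev[i][1])
            d_end = (cur[i + 1][0] - prev[i + 1][0],
                     cur[i + 1][1] - prev[i + 1][1])
            if d_start != (0.0, 0.0) or d_end != (0.0, 0.0):
                events.append((t, i, d_start, d_end))
        prev = cur
    return events


if __name__ == "__main__":
    trace_a = simulate(seed=42, steps=50, lengths=[0.30, 0.25])
    trace_b = simulate(seed=42, steps=50, lengths=[0.30, 0.25])
    print(len(trace_a), trace_a == trace_b)
```

Running `simulate` twice with the same seed yields identical traces, while a different seed yields a different set of motions, which is the property the reproducibility claim relies on.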
DMP guarantees the maximum EED because every DMP node can deliberately drop video packets according to the instantaneous network conditions to guarantee its local delay. Thus, intelligent packet dropping is the main source of information loss in DMP. Two schemes for such compression of image sequences are studied, in the pixel and transform domains. The first employs windowed kriging (WK) for optimal image interpolation in the Near-natural Object Coding proposed in DMP, and the second is based on the discrete cosine transform (DCT). The application of WK to luminance and chrominance is studied in terms of visual quality and computational time. Furthermore, an ultrafast, embedded, quality-scalable, DCT-based image coding scheme for DMP is proposed and shown to be technically feasible for hardware implementation. The application of resampling to regions in an image indicated by the tracked eye gazes is also studied, together with its effects on visual quality.
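As a rough illustration of why a DCT-based representation can degrade gracefully when data are dropped, the sketch below transforms a one-dimensional block of samples with an orthonormal DCT-II, discards the trailing coefficients, and reconstructs the block. It is not the coding scheme proposed in the thesis; the function names, the block values, and the "keep the first `kept` coefficients" rule are assumptions made only for this example.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scaling factor for coefficient k of n."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(block):
    """Orthonormal DCT-II of a list of samples."""
    n = len(block)
    return [
        _scale(k, n) * sum(
            block[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


def reconstruct(block, kept):
    """Reconstruct a block from only its first `kept` DCT coefficients."""
    coeffs = dct(block)
    truncated = coeffs[:kept] + [0.0] * (len(coeffs) - kept)
    return idct(truncated)


def mse(a, b):
    """Mean squared error between two equally long sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    block = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]  # sample values
    for kept in range(1, len(block) + 1):
        print(kept, mse(block, reconstruct(block, kept)))
```

Because the transform is orthonormal, the reconstruction error never increases as more coefficients are kept, so each dropped coefficient costs some quality without making the block undecodable.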
Addressing the compression aspect is important as the basis for future study of estimating video quality that results from packet dropping. Since this is not possible with the above methods of traffic synthesis, the study on compression complements the aspects of modeling, simulation, and synthesis, showing the coherence of the work.
NTNU, 2014. 181 p.
2014-05-09, Totalrommet, Main Building, Gløshaugen, Trondheim, 22:42 (English)
Schelkens, Peter, Professor; Davidrajuh, Reggie, Professor; Kristiansen, Lill, Professor