OpenACC-based Snow Simulation
In recent years, the GPU platform has risen in popularity in high performance computing
due to its cost effectiveness and the high computing power offered by its many
parallel cores. A GPU's computing power can be harnessed using the low-level GPGPU
programming APIs CUDA and OpenCL. While both CUDA and OpenCL give the programmer
fine-grained control of a GPU's resources, they are generally considered
difficult to use and can lead to complicated software design. To simplify
GPGPU programming and bring GPUs into more mainstream use, there is increased
interest in moving the complexity of GPGPU programming over to the compiler. This
has led to the development of OpenACC, a directive-based standard for heterogeneous
computing supported by NVIDIA, Cray, PGI, CAPS and others.
In this thesis, we explore using OpenACC on a high performance snow simulator code
developed by the HPC-Lab at NTNU. The snow simulator consists of two main simulation
components: the simulation of wind and the simulation of snow particle movement.
The OpenACC version of the snow simulator is made by first updating the current
CUDA version, porting it to a sequential CPU implementation, and then applying OpenACC
directives to accelerate compute-intensive regions in the code. The OpenACC port is
also optimized by reducing data movement between host and device using OpenACC.
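The two steps described above (annotating a compute-intensive loop with a directive, then enclosing the time-step loop in a data region to limit host-device transfers) can be sketched roughly as follows. This is a minimal illustration only: the function name `advance_particles`, its parameters, and the simplified one-dimensional update rule are assumptions and are not taken from the snow simulator. When compiled without OpenACC support, the pragmas are ignored and the code runs sequentially on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical particle update (not the thesis code): each particle is
   carried by a per-particle vertical wind speed and falls at a constant
   speed. All names and the update rule are illustrative assumptions. */
void advance_particles(float *pos_y, const float *wind_y,
                       int n, int steps, float dt, float fall_speed)
{
    assert(pos_y != NULL && wind_y != NULL && n > 0);

    /* Data region: both arrays are copied to the device once and stay
       resident across all time steps; pos_y is copied back once at the end. */
    #pragma acc data copy(pos_y[0:n]) copyin(wind_y[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            /* Iterations are independent, so the loop can be offloaded. */
            #pragma acc parallel loop present(pos_y[0:n], wind_y[0:n])
            for (int i = 0; i < n; ++i) {
                pos_y[i] += (wind_y[i] - fall_speed) * dt;
            }
        }
    }
}
```

Without the enclosing data region, a compiler would typically transfer the arrays to and from the device around every offloaded loop, which is the kind of data movement the optimization aims to avoid.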
We show that, due to the heterogeneous nature of OpenACC, the inability to explicitly
use shared memory as temporary storage and to use texture memory for
hardware-based interpolation and 3D caching are the largest performance bottlenecks
compared to the CUDA version.
This is supported by benchmarks of the OpenACC implementation, which is shown to
reach only 40.6% of the performance of the CUDA version, with an average speedup of 3.2x, when
scaling the number of snow particles simulated and using a balanced windfield dimension.
When scaling the windfield with a constant number of snow particles, 58% of the CUDA performance
is reached, with an average speedup of 4.84x. The best real-time performance is found at
about 1.5M snow particles when using a balanced windfield with about 524K grid cells.
Using OpenACC for accelerating high performance graphical simulations can be a viable
option if the goal is high code portability. However, when the goal is to achieve the best possible performance, our experience shows that it is still better to use the lower-level alternatives CUDA or OpenCL.
Place, publisher, year, edition, pages
Institutt for datateknikk og informasjonsvitenskap, 2013. 122 p.
Identifiers
URN: urn:nbn:no:ntnu:diva-23000
Local ID: ntnudaim:9823
OAI: oai:DiVA.org:ntnu-23000
DiVA: diva2:655634
Elster, Anne Cathrine, Associate Professor