Parallel and Distributed Programming for Data Computation Intensive Applications

Jan, Bilal

Scientific computing requires high computational power to process large volumes of data quickly, with throughput typically measured in gigaFLOPS or teraFLOPS. Supercomputers and grid- or cluster-based systems are the traditional choice for running such massively parallel scientific computing jobs. Because of their high performance and low cost, GPUs have become a preferred choice in high-performance computing. Although GPUs were originally designed for rendering graphics in high-resolution games, they are nowadays used extensively for computation-intensive general-purpose applications, a practice known as GPGPU (general-purpose computing on graphics processing units). Various programming tools and APIs have been developed for GPU computing, with CUDA, OpenCL and OpenGL receiving the most attention. This work uses OpenCL as its parallel programming tool because it is an open standard and supports heterogeneous platforms. GPU computing power is exploited here for several applications: sorting large data sets, the design and implementation of a parallel FFT library, and FFT-based fast magnetostatic field computation in micromagnetics. Sorting algorithms arrange a given sequence of input data into a certain order (monotonically increasing or decreasing) and are categorized by their computational complexity in the best, average and worst cases. Time complexity is not the only deciding parameter: other factors such as stability, robustness, scalability, input distribution, memory storage and access patterns also determine the applicability of a sorting algorithm to a given application domain. A portion of the thesis is devoted to the design and implementation of new parallel sorting techniques well suited to multi-processor architectures such as GPUs and other multi-core systems. The novel sorting technique, Butterfly Network Sort, exploits high parallelism in its design and thus achieves considerable speedup over state-of-the-art sorting techniques.
A Fast Fourier Transform library, named ToPe-FFT, is implemented using OpenCL. ToPe-FFT is based on the well-known Cooley-Tukey algorithm with auto-tuning for multiple GPUs. The open-source ToPe-FFT implements several base radices alongside support for mixed radices, making it an FFT library for almost arbitrary input lengths. The library takes Complex-to-Complex (C2C) input with up to three dimensions. The design and interface of ToPe-FFT are similar to those of cuFFT and FFTW. Its support for arbitrary input lengths, better accuracy in high-dimension transforms, load balancing across multiple GPUs and, above all, significant speedup over cuFFT and FFTW make ToPe-FFT promising for delivering maximum performance. An optimized version is tested in micromagnetic simulations for performance improvement. In micromagnetic simulations, the computation of the magnetostatic field is the most time-consuming part of the overall simulation. For a ferromagnetic region discretized into N elementary cells, the magnetostatic field at a particular location depends on the magnetization of all other elements in the region. These long-range elementary dipole interactions have a high computational cost. In FFT-based magnetostatic field computation, the model is treated as a discrete convolution problem, which reduces the cost from O(N^2) for direct pairwise summation to O(N log N). We have used an optimized version of our ToPe-FFT library to accelerate the magnetostatic field computation. Our optimized GPU-based field solver achieves significant speedup over the magnetostatic field computation in OOMMF.
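The idea of treating a long-range interaction as a discrete convolution and evaluating it with FFTs can be illustrated with a minimal sketch. It is not the ToPe-FFT implementation and does not use OpenCL: it is a plain-Python, one-dimensional toy in which every function name (fft, ifft, direct_convolution, fft_convolution) and the kernel values are assumptions made for illustration only. The fft function is a textbook recursive radix-2 Cooley-Tukey transform, whereas the thesis describes a mixed-radix, multi-GPU library.

```python
import cmath


def fft(x, inverse=False):
    """Textbook recursive radix-2 Cooley-Tukey FFT (illustrative only).

    len(x) must be a power of two. The inverse transform is unscaled here;
    ifft() below applies the 1/n factor.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)   # transform of even-indexed samples
    odd = fft(x[1::2], inverse)    # transform of odd-indexed samples
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # butterfly, upper half
        out[k + n // 2] = even[k] - twiddle   # butterfly, lower half
    return out


def ifft(spectrum):
    """Inverse FFT with the 1/n normalization applied."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def direct_convolution(kernel, source):
    """O(N^2) linear convolution: each output sums over all source cells."""
    out = [0.0] * (len(kernel) + len(source) - 1)
    for i, k in enumerate(kernel):
        for j, s in enumerate(source):
            out[i + j] += k * s
    return out


def fft_convolution(kernel, source):
    """Same linear convolution in O(N log N) via zero-padded FFTs."""
    m = len(kernel) + len(source) - 1
    size = 1
    while size < m:          # pad to a power of two so the result does not wrap
        size *= 2
    a = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    b = fft([complex(v) for v in source] + [0j] * (size - len(source)))
    product = [p * q for p, q in zip(a, b)]   # pointwise product in frequency space
    return [v.real for v in ifft(product)[:m]]


# Hypothetical 1D interaction kernel and cell values, chosen arbitrarily.
kernel = [1.0, 2.0, 3.0]
cells = [4.0, 5.0, 6.0]
field_direct = direct_convolution(kernel, cells)
field_fft = fft_convolution(kernel, cells)
```

Both routines return the same values up to floating-point rounding. Zero-padding matters in this sketch because the FFT product computes a circular convolution; without the padding, contributions would wrap around the ends of the array.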

PORTO @ Archivio Istituzionale della Ricerca