Abstract

Medical image processing faces two great challenges: high memory consumption and the need for algorithm acceleration. The real challenge is the efficient implementation of medical image processing algorithms on high-sharpness images of the highest resolution within a very short processing time. The CUDA platform can be a solution to the processing-time issue of modern medical imaging algorithms: with the advancement of CUDA, programming the GPU has been simplified, and the GPU has become easier to use in general-purpose programming applications. Image processing algorithms were traditionally developed and optimized for serial CPUs; they are memory-intensive and require a high degree of computational effort. Many image-processing tasks exhibit a high degree of data locality and parallelism, and map quite readily onto specialized massively parallel computing hardware. Different parallelism paradigms can be implemented on GPUs: task, data, and instruction parallelism. More specifically, the GPU is especially well suited to SPMD ("single program, multiple data") problems. CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA for massively parallel high-performance computing. Developers no longer have to understand the complexities behind the GPU: hardware abstraction in CUDA allows NVIDIA to change the GPU architecture in the future without requiring developers to learn a new instruction set, although this abstraction can limit the optimization process. The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor.
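As a minimal sketch of the SPMD model described above (a generic element-wise kernel with hypothetical names, not one of the kernels of this study), every thread runs the same program on different data, and the runtime distributes the blocks of the grid across whatever SMs are available:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// SPMD: each thread executes the same program on a different pixel.
// A grid-stride loop lets one launch configuration cover any image size.
__global__ void scalePixels(float *img, int n, float gain)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        img[i] *= gain;
}

int main(void)
{
    const int n = 512 * 512;              // e.g. a 512x512 image
    float *d_img;
    cudaMalloc(&d_img, n * sizeof(float));
    cudaMemset(d_img, 0, n * sizeof(float));

    // The host invokes a kernel grid; the blocks are enumerated and handed
    // to multiprocessors with free capacity ("automatic scalability").
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scalePixels<<<blocks, threads>>>(d_img, n, 1.5f);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}
```

A GPU with more SMs simply consumes the same list of blocks faster; no change to the source is required.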
As thread blocks terminate, new blocks are launched on the vacated multiprocessors. A multithreaded program is partitioned into blocks of threads that execute independently of each other, so that a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors ("automatic scalability"). The source code of any CUDA program consists of host and device code mixed in the same file; because the source targets two different processing architectures, additional steps are required in the compilation process. Image filtering is a very important technique, primarily used to reduce noise, but also to sharpen an image, to enhance its edges, or generally to emphasize or suppress certain structures in an image. Removing the noise from an image should neither blur nor shift the edges, so as to avoid artifacts in subsequent segmentation operations.

Summary and Conclusion

Anisotropic diffusion is one of the most robust noise-reduction techniques: it has the advantage of edge preservation, but it is a mathematically exhaustive method, so this work focused on it. Nonlinear diffusion filtering goes back to Perona and Malik. Diffusion is a physical phenomenon, intuitively interpreted as a process that equilibrates concentration differences without creating or destroying mass. Applying this concept to image processing, we identify the concentration with the grey value at a given location. To preserve the edges, the diffusion tensor must therefore be a function of the image pixel values, and diffusion across edges must be inhibited for edge enhancement. Often the diffusion tensor is a function of the differential structure of the evolving image itself; such feedback leads to nonlinear diffusion filters. The diffusion process still levels densities (i.e., image intensities), but it proceeds more slowly at potential edge locations.
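The Perona–Malik filter referred to above is commonly written as a nonlinear diffusion PDE; one standard formulation (with $I$ the image intensity and $K$ a contrast parameter) is:

```latex
\frac{\partial I}{\partial t}
  = \operatorname{div}\!\bigl( c\bigl(\lVert \nabla I \rVert\bigr)\, \nabla I \bigr),
\qquad
c(s) = \frac{1}{1 + (s/K)^2}
\quad\text{or}\quad
c(s) = e^{-(s/K)^2}.
```

The diffusivity $c$ approaches 1 in flat regions (strong smoothing) and falls toward 0 where the gradient is large, which is exactly the inhibition of diffusion across edges described above.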
Anisotropic diffusion was chosen as the case study because it exhibits varied implementation challenges and can be subdivided into a number of kernels, each with different characteristics and a different implementation structure targeting different features of the CUDA architecture. Execution time and occupancy are the two main attributes of an efficient algorithm; improving them requires further optimization and good programming skills. The objective of this study was to present an overview of the GPU and CUDA, detailed implementation steps for heterogeneous programming, a quick review of design optimization strategies, and the implementation trade-offs for performance, by comparing the results of different implementation techniques and the effect of different memory types. The anisotropic diffusion algorithm was divided into four kernels. Each kernel was implemented in parallel form in two alternatives, to show the difference in performance between the implementation alternatives, and also in serial form, to show the gain achieved by parallelization. The Microsoft SDK was used for coding and NVCC for compilation; the Visual Profiler and cuda-memcheck were then used for profiling, debugging, and verification during the optimization process, helping to achieve better results in terms of execution time and occupancy. The images under test were mammograms at three resolutions (512x512, 1024x1024, and 2048x2048). The execution time of each kernel, averaged over 10 trials, was recorded, the Visual Profiler results for each kernel were tabulated and discussed, and different image sizes were processed to show the effect of image size on the performance gain. The kernels implemented in parallel form were found to be faster than their serial counterparts.
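Per-kernel timings averaged over repeated trials, as reported here, can be collected with CUDA events; a minimal sketch, where `myKernel` is a hypothetical stand-in for any of the four kernels:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical stand-in for one of the four anisotropic-diffusion kernels.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024 * 1024, trials = 10;   // e.g. a 1024x1024 image
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int t = 0; t < trials; ++t) {
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);           // wait for the kernel to finish
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }
    printf("average kernel time over %d trials: %.3f ms\n", trials, totalMs / trials);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

CUDA events time the device work itself, unlike host-side timers, which also include launch overhead and any synchronization stalls.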
The alternative implementations of the kernels also differed in execution time: image regularization using a Gaussian filter implemented in shared memory with pre-fetching was faster than the texture-memory version; the "K-choosing" kernel using algorithm cascading outperformed, in execution time, the variant with complete loop unrolling and sequential addressing; and the two texture-memory implementations of the diffusion tensor calculation were similar in execution time. As regards the effect of image size on the performance gain, it was found that as the image size increases, the speed-up increases. From this study, it can be concluded that parallel programming using CUDA is a very efficient way to accelerate medical image processing algorithms and can substantially enhance their performance in terms of execution time.
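Algorithm cascading, as used in the "K-choosing" kernel, means each thread first accumulates several elements serially in a grid-stride loop before the shared-memory tree reduction begins; a sketch of the general technique (a plain sum reduction for illustration, not the thesis code):

```cuda
// One partial sum per block; a second pass (or host loop) combines them.
__global__ void reduceSum(const float *in, float *out, unsigned n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;

    // Algorithm cascading: each thread sums many elements serially first,
    // so the tree reduction runs over far fewer partial sums.
    float acc = 0.0f;
    for (unsigned i = blockIdx.x * blockDim.x + tid;
         i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction with sequential addressing: the active threads stay
    // contiguous, avoiding divergent warps and shared-memory bank conflicts.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Launch (shared memory sized to one float per thread):
// reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```

Cascading raises the arithmetic work done per thread, which is why it can beat a fully unrolled tree: the reduction itself is bandwidth-bound, and fewer, busier threads use the memory system more efficiently.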