Abstract

Medical image processing faces two great challenges: high memory consumption and the need for algorithm acceleration. The real challenge is the efficient implementation of medical image processing algorithms on high-sharpness images of the highest resolution within a very short processing time. The CUDA platform can be a solution to the processing-time issue of modern medical imaging algorithms: with the advancement of CUDA, programming the GPU has been simplified, and the GPU has become easier to use in general-purpose programming applications. Image processing algorithms were traditionally developed and optimized for serial CPUs; they are memory-intensive and require a high degree of computational effort. Many image-processing tasks exhibit a high degree of data locality and parallelism, and map quite readily onto specialized massively parallel computing hardware. Different parallelism paradigms can be implemented on GPUs: task, data, and instruction parallelism. More specifically, the GPU is especially well suited to SPMD ("single program, multiple data") problems. CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA for massively parallel high-performance computing. Developers no longer have to understand the complexities behind the GPU: hardware abstraction in CUDA allows NVIDIA to change the GPU architecture in the future without requiring developers to learn a new instruction set, although this abstraction can limit the optimization process. The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor.
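As a minimal sketch of the SPMD model described above (a generic element-wise kernel with hypothetical names, not one of the kernels of this study), every thread runs the same program on different data, and the runtime distributes the blocks of the grid across whatever SMs are available:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// SPMD: each thread executes the same program on a different pixel.
// A grid-stride loop lets one launch configuration cover any image size.
__global__ void scalePixels(float *img, int n, float gain)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        img[i] *= gain;
}

int main(void)
{
    const int n = 512 * 512;              // e.g. a 512x512 image
    float *d_img;
    cudaMalloc(&d_img, n * sizeof(float));
    cudaMemset(d_img, 0, n * sizeof(float));

    // The host invokes a kernel grid; the blocks are enumerated and handed
    // to multiprocessors with free capacity ("automatic scalability").
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scalePixels<<<blocks, threads>>>(d_img, n, 1.5f);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}
```

A GPU with more SMs simply consumes the same list of blocks faster; no change to the source is required.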
As thread blocks terminate, new blocks are launched on the vacated multiprocessors. A multithreaded program is partitioned into blocks of threads that execute independently of each other, so that a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors ("automatic scalability"). The source code of any CUDA program consists of host and device code mixed in the same file; because the source targets two different processing architectures, additional steps are required in the compilation process. Image filtering is a very important technique, primarily used to reduce noise, but also to sharpen an image, to enhance its edges, or generally to emphasize or suppress certain structures in an image. Removing the noise from an image should neither blur nor shift the edges, so as to avoid artifacts in subsequent segmentation operations.

Summary and Conclusion

Anisotropic diffusion is one of the most robust noise-reduction techniques: it has the advantage of edge preservation, but it is a mathematically exhaustive method, so this work focused on it. Nonlinear diffusion filtering goes back to Perona and Malik. Diffusion is a physical phenomenon, intuitively interpreted as a process that equilibrates concentration differences without creating or destroying mass. Applying this concept to image processing, we identify the concentration with the grey value at a given location. To preserve the edges, the diffusion tensor must therefore be a function of the image pixel values, and diffusion across edges must be inhibited for edge enhancement. Often the diffusion tensor is a function of the differential structure of the evolving image itself; such feedback leads to nonlinear diffusion filters. The diffusion process still levels densities (i.e., image intensities), but it proceeds more slowly at potential edge locations.
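The Perona–Malik filter referred to above is commonly written as a nonlinear diffusion PDE; one standard formulation (with $I$ the image intensity and $K$ a contrast parameter) is:

```latex
\frac{\partial I}{\partial t}
  = \operatorname{div}\!\bigl( c\bigl(\lVert \nabla I \rVert\bigr)\, \nabla I \bigr),
\qquad
c(s) = \frac{1}{1 + (s/K)^2}
\quad\text{or}\quad
c(s) = e^{-(s/K)^2}.
```

The diffusivity $c$ approaches 1 in flat regions (strong smoothing) and falls toward 0 where the gradient is large, which is exactly the inhibition of diffusion across edges described above.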
Anisotropic diffusion was chosen as the case study because it exhibits varied implementation challenges and can be subdivided into a number of kernels, each with different characteristics and a different implementation structure targeting different features of the CUDA architecture. Execution time and occupancy are the two main attributes of an efficient algorithm; improving them requires further optimization and good programming skills. The objective of this study was to present an overview of the GPU and CUDA, detailed implementation steps for heterogeneous programming, a quick review of design optimization strategies, and the implementation trade-offs for performance, by comparing the results of different implementation techniques and the effect of different memory types. The anisotropic diffusion algorithm was divided into four kernels. Each kernel was implemented in parallel form in two alternatives, to show the difference in performance between the implementation alternatives, and also in serial form, to show the gain achieved by parallelization. The Microsoft SDK was used for coding and NVCC for compilation; the Visual Profiler and cuda-memcheck were then used for profiling, debugging, and verification during the optimization process, helping to achieve better results in terms of execution time and occupancy. The images under test were mammograms at three resolutions (512x512, 1024x1024, and 2048x2048). The execution time of each kernel, averaged over 10 trials, was recorded, the Visual Profiler results for each kernel were tabulated and discussed, and different image sizes were processed to show the effect of image size on the performance gain. The kernels implemented in parallel form were found to be faster than their serial counterparts.
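Per-kernel timings averaged over repeated trials, as reported here, can be collected with CUDA events; a minimal sketch, where `myKernel` is a hypothetical stand-in for any of the four kernels:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical stand-in for one of the four anisotropic-diffusion kernels.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024 * 1024, trials = 10;   // e.g. a 1024x1024 image
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int t = 0; t < trials; ++t) {
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);           // wait for the kernel to finish
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }
    printf("average kernel time over %d trials: %.3f ms\n", trials, totalMs / trials);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

CUDA events time the device work itself, unlike host-side timers, which also include launch overhead and any synchronization stalls.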
The alternative implementations of the kernels also differed in execution time: image regularization using a Gaussian filter implemented in shared memory with pre-fetching was faster than the texture-memory version; the "K-choosing" kernel using algorithm cascading outperformed, in execution time, the variant with complete loop unrolling and sequential addressing; and the two texture-memory implementations of the diffusion tensor calculation were similar in execution time. As regards the effect of image size on the performance gain, it was found that as the image size increases, the speed-up increases. From this study, it can be concluded that parallel programming using CUDA is a very efficient way to accelerate medical image processing algorithms and can substantially enhance their performance in terms of execution time.
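Algorithm cascading, as used in the "K-choosing" kernel, means each thread first accumulates several elements serially in a grid-stride loop before the shared-memory tree reduction begins; a sketch of the general technique (a plain sum reduction for illustration, not the thesis code):

```cuda
// One partial sum per block; a second pass (or host loop) combines them.
__global__ void reduceSum(const float *in, float *out, unsigned n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;

    // Algorithm cascading: each thread sums many elements serially first,
    // so the tree reduction runs over far fewer partial sums.
    float acc = 0.0f;
    for (unsigned i = blockIdx.x * blockDim.x + tid;
         i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction with sequential addressing: the active threads stay
    // contiguous, avoiding divergent warps and shared-memory bank conflicts.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Launch (shared memory sized to one float per thread):
// reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```

Cascading raises the arithmetic work done per thread, which is why it can beat a fully unrolled tree: the reduction itself is bandwidth-bound, and fewer, busier threads use the memory system more efficiently.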