Abstract

Deep neural networks are currently among the most prominent machine learning approaches, owing to their ability to learn many features from hierarchical data representations in complex classification applications. In such applications, the learning process is performed on sufficiently large datasets by going deeper in the network configuration, consolidating more layers to obtain highly accurate results. Energy consumption, area, and latency consequently increase, so hardware acceleration is employed to minimize the computational overhead by relocating the training and inference tasks from CPUs to dedicated hardware platforms whose specialized architectures are tailored to such network workloads. The softmax layer is a widely used non-linear activation layer in most deep neural networks, playing a crucial role across many classification domains. The softmax function comprises expensive exponential and division units, which cause overflow problems, low accuracy, large area, and low throughput. It is therefore a challenge to implement the softmax layer efficiently in hardware with both high accuracy and low cost. The purpose of this thesis is to present a high-accuracy, resource-efficient hardware implementation of the softmax layer for image classification tasks involving many categories. The key feature of this implementation is changing the exponential base of the softmax function, so that the complex operations of the traditional softmax are replaced by simple shift and addition operations, with simpler look-up tables and higher accuracy. The hardware model is described in Verilog Hardware Description Language and relies on single-precision floating-point arithmetic cores. Additionally, an evaluation setup for the model is established to offer a meaningful performance estimate.
To assess the model, a dataset is chosen from open standard benchmarks, allowing comparison with a standard reference and with prior implementations of the layer. The model achieves 100% classification accuracy relative to the reference model, occupies an area of 0.0802 mm², and consumes 8.93 mW when synthesized in TSMC 28 nm CMOS technology at a frequency of 1 GHz.
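The base-change idea described above can be sketched numerically. The following is a minimal software illustration, not the thesis's hardware design: it only shows why replacing the natural base e with base 2 leaves the softmax output mathematically unchanged (since 2^(x·log₂e) = e^x), while 2^y is far cheaper in hardware. The function names are illustrative, and the fixed-point LUT-plus-shift decomposition mentioned in the comments is the standard rationale for base-2 exponentials, stated here as an assumption about the implementation style.

```python
import math

def softmax_e(x):
    # Standard softmax with natural base e.
    # The running maximum is subtracted for numerical stability (avoids overflow).
    m = max(x)
    z = [math.exp(v - m) for v in x]
    s = sum(z)
    return [v / s for v in z]

def softmax_base2(x):
    # Base-2 softmax: scale the inputs by log2(e), then exponentiate with base 2.
    # Because 2^(v * log2(e)) = e^v, the normalized result is identical to
    # the base-e softmax. In hardware, 2^y for y = k + f (integer part k,
    # fractional part f) reduces to a small LUT for 2^f followed by a left
    # shift by k bits, replacing the costly exponential unit.
    s = [v * math.log2(math.e) for v in x]
    m = max(s)
    z = [2.0 ** (v - m) for v in s]
    t = sum(z)
    return [v / t for v in z]

logits = [2.0, 1.0, 0.1]
print(softmax_e(logits))
print(softmax_base2(logits))  # matches the base-e result
```

Both functions produce the same probability vector; only the cost of the exponentiation step differs, which is what makes the base-2 formulation attractive for dedicated hardware.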