Award Date
5-1-2022
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering
First Committee Member
Mei Yang
Second Committee Member
Yingtao Jiang
Third Committee Member
Henry Selvaraj
Fourth Committee Member
Mingon Kang
Number of Pages
109
Abstract
The increasing popularity of deep neural network (DNN) applications demands high computing power and efficient hardware accelerator architectures. DNN accelerators use a large number of processing elements (PEs) and on-chip memory for storing weights and other parameters. A significant design challenge for a many-core DNN accelerator is handling the data movement among the processing elements. As the communication backbone of a DNN accelerator, the network-on-chip (NoC) plays an important role in supporting various dataflow patterns and enabling the overlap of processing with communication. However, the widely used mesh-based NoC architectures inherently cannot efficiently support the many-to-one (gather) and one-to-many (multicast) traffic that is prevalent in DNN workloads. This dissertation focuses on efficient communication support for such traffic in DNN accelerators.
In NoCs, many-to-one traffic is typically handled as repeated unicast packets, which is inefficient. The dissertation first proposes gather-supported routing on mesh-based NoCs employing the Output Stationary (OS) systolic array to support many-to-one traffic. Initiated at the left-most node, a gather packet collects the data generated by the intermediate nodes along its way to the global memory on the right side of the mesh. Without changing the router pipeline, gather-supported routing significantly reduces network latency and power consumption compared with the repetitive unicast method, as evaluated with traffic traces generated from DNN workloads. The study is then extended with a modified mesh architecture that adds a one-way/two-way streaming bus to speed up multicast traffic and supports multiple PEs per router under gather-supported routing.
The analysis of the runtime latency of a convolutional layer shows that the two-way streaming architecture achieves a larger improvement than the one-way streaming architecture for an OS dataflow architecture. Simulation results confirm the effectiveness of the proposed method, which achieves up to 1.8x improvement in runtime latency and up to 1.7x improvement in network power consumption. The hardware overhead of the proposed method is justified by the performance improvements achieved over the repetitive unicast method. Finally, In-Network Accumulation (INA) is proposed to further accelerate DNN workload execution on a many-core spatial DNN accelerator under the Weight Stationary (WS) dataflow model. The INA unit further improves latency and power consumption by allowing the router to perform partial-sum accumulation, avoiding the overhead of injecting partial sums from, and ejecting them back to, the PE. Compared with the OS dataflow model, the INA-enabled WS dataflow model achieves up to 1.19x latency improvement and 2.16x power improvement across different DNN workloads.
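As a back-of-the-envelope illustration of why gather support reduces traffic (a sketch based on the abstract's description, not figures from the dissertation): with repetitive unicast, every node in a mesh row sends its own packet to the memory at the right edge, whereas a single gather packet starting at the left-most node collects all intermediate results in one traversal. The hop counts below assume a single row of N nodes, one hop between neighbors, and memory one hop past the right-most node.

```python
# Hop-count comparison for one row of an N-node mesh, global memory at the
# right edge. Node i (0 = left-most) is N - i hops away from memory.
# These formulas are illustrative assumptions, not results from the work.

def unicast_hops(n: int) -> int:
    """Repetitive unicast: each of the n nodes sends a separate packet."""
    return sum(n - i for i in range(n))  # N + (N-1) + ... + 1

def gather_hops(n: int) -> int:
    """Gather-supported routing: one packet starts at the left-most node
    and picks up data from intermediate nodes on its way to memory."""
    return n  # a single end-to-end traversal of the row

for n in (4, 8, 16):
    print(f"N={n}: unicast={unicast_hops(n)} hops, gather={gather_hops(n)} hops")
```

The gap grows quadratically with row width, which is consistent with the abstract's point that handling many-to-one traffic as repeated unicasts is inefficient on wider meshes.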
Keywords
Accelerator; CNN; DNN; NoC; on-chip network; Routing
Disciplines
Electrical and Computer Engineering
File Format
File Size
2200 KB
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Tiwari, Binayak, "Efficient Networks-On-Chip Communication Support Solutions for Deep Neural Network Acceleration" (2022). UNLV Theses, Dissertations, Professional Papers, and Capstones. 4480.
http://dx.doi.org/10.34917/31813375
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/