Anh T. Tran
PhD Dissertation
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis
Technical Report ECE-VCL-2012-3,
VLSI Computation Laboratory,
University of California, Davis, 2012.
Processor designers have been utilizing more processing elements (PEs) on a single chip
to make efficient use of technology scaling and also to speed up system performance through increased
parallelism. Networks on-chip (NoCs) have been shown to be promising for scalable interconnection
of large numbers of PEs in comparison to structures such as point-to-point interconnects
or global buses. This dissertation investigates the designs of on-chip interconnection networks
for many-core computational platforms in three application domains: high-performance network
designs for applications with high communication bandwidths; low-cost networks for application-specific
low-bandwidth dynamic traffic; and reconfigurable networks for platforms targeting digital
signal processing (DSP) applications which have deterministic inter-task communication characteristics.
An on-chip router architecture named RoShaQ is proposed for platforms executing general-purpose
applications with dynamic and high communication bandwidths. RoShaQ maximizes
buffer utilization by allowing multiple buffer queues to be shared among input ports, and hence achieves
high network performance. Experimental results show that RoShaQ achieves 17.2% lower latency, 18.2%
higher saturation throughput, and 8.3% lower energy dissipated per bit than state-of-the-art virtual-channel
routers given the same buffer capacity, averaged over a broad range of traffic patterns.
For mapping applications showing low inter-task communication bandwidths, five low-cost
bufferless routers are proposed. All routers guarantee in-order packet delivery so that expensive
reordering buffers are not required. The proposed bufferless routers have lower costs and higher
performance per unit cost than all buffered wormhole routers -- the smallest proposed bufferless
router has 32.4% less area, 24.5% higher throughput, 29.5% lower latency, 10.0% lower power and
26.5% lower energy per bit than the smallest buffered router.
A globally asynchronous locally synchronous (GALS)-compatible reconfigurable circuit-switched
on-chip network is proposed for use in many-core platforms targeting streaming DSP and
embedded applications which show deterministic inter-task communication traffic. Inter-processor
communication is achieved through a simple yet effective source-synchronous technique which can
sustain the ideal throughput of one word per cycle and the ideal latency approaching the wire delay.
This network was utilized in a GALS many-core chip fabricated in 65 nm CMOS. For evaluating
the efficiency of this platform, a complete IEEE 802.11a baseband receiver was implemented. The
receiver achieves a real-time throughput of 54 Mbps and consumes 174.8 mW with only 12.2 mW
(7.0%) dissipated by its interconnects.
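As a quick sanity check on the receiver figures above, the short Python sketch below recomputes the interconnect share of total power and derives an energy-per-bit figure. The energy-per-bit values are not reported in the abstract; they are derived here under the assumption that the quoted power numbers correspond to the 54 Mbps operating point. The function and variable names are illustrative only.

```python
# Figures quoted in the abstract for the IEEE 802.11a baseband receiver.
TOTAL_POWER_MW = 174.8         # total receiver power
INTERCONNECT_POWER_MW = 12.2   # portion dissipated by the interconnects
THROUGHPUT_MBPS = 54.0         # real-time throughput


def interconnect_share(interconnect_mw, total_mw):
    """Fraction of total power dissipated by the interconnects."""
    return interconnect_mw / total_mw


def energy_per_bit_nj(power_mw, throughput_mbps):
    """Energy per bit in nJ.

    mW / Mbps = (1e-3 J/s) / (1e6 bit/s) = 1e-9 J/bit, i.e. nJ/bit.
    Assumes the power figure was measured at the given throughput.
    """
    return power_mw / throughput_mbps


share = interconnect_share(INTERCONNECT_POWER_MW, TOTAL_POWER_MW)
print(f"Interconnect share of power: {share:.1%}")  # prints 7.0%

# Derived values (not stated in the abstract).
print(f"Receiver energy per bit: "
      f"{energy_per_bit_nj(TOTAL_POWER_MW, THROUGHPUT_MBPS):.3f} nJ")
print(f"Interconnect energy per bit: "
      f"{energy_per_bit_nj(INTERCONNECT_POWER_MW, THROUGHPUT_MBPS):.3f} nJ")
```

The first printed value matches the 7.0% stated above; the per-bit energies are back-of-the-envelope estimates rather than measured results.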
A highly parameterizable NoC simulator named NoCTweak is also proposed for early
exploration of performance and energy efficiency of on-chip networks. The simulator has been
developed in SystemC, a C++ class library, which allows fast modeling of concurrent hardware modules
with cycle-level accuracy. Area, timing, and power figures for router components are post-layout data based
on a 65 nm CMOS standard-cell library. NoCTweak was used in many experiments reported in this
dissertation.
Anh T. Tran, "On-Chip Network Designs for Many-Core Computational Platforms," Ph.D. Dissertation, Technical Report ECE-VCL-2012-3, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2012.
@phdthesis{atran:vcl:phdthesis,
  author  = {Anh T. Tran},
  title   = {On-Chip Network Designs for Many-Core Computational Platforms},
  school  = {University of California},
  year    = 2012,
  address = {Davis, CA, USA},
  month   = Aug,
  note    = {\url{http://www.vcl.ece.ucdavis.edu/pubs/theses/2012-3/}}
}