Many-core processor architecture has become the most promising computer
architecture for next-generation computing. However, exploiting the added system performance
in real applications such as video encoding remains challenging.
This dissertation investigates architecture design, physical implementation
and performance evaluation of a fine-grained many-core processor for advanced
video coding with a focus on interconnection, topology, memory system and related parallel programming methodology.
A baseline residual encoder for H.264/AVC on a current-generation
fine-grained many-core system is proposed that utilizes no
application-specific hardware. The 25-processor encoder
encodes video sequences with variable frame sizes
and can encode 1080p HDTV at 30 frames per second with 293~mW average
power consumption by adjusting each processor to workload-based optimal
clock frequencies and dual supply voltages—a 38.4% power reduction
compared to operation with only one clock frequency and supply voltage.
In comparison to published implementations on the TI C642 DSP platform,
the design has approximately 2.9–3.7 times higher scaled throughput,
11.2–15.0 times higher throughput per chip area, and 4.5–5.8 times
lower energy per pixel. Compared to a heterogeneous SIMD architecture
customized for H.264, the presented design has 2.8–3.6 times greater
throughput, 4.5–5.9 times higher area efficiency, and similar energy
efficiency.
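As a rough check on the figures above, the following Python sketch derives the pixel rate, the average energy per pixel, and the implied single-voltage power from the stated 293 mW and 38.4% reduction. The 1920x1080 frame size is an assumption (encoders often pad 1080p frames), and these back-of-envelope values are not results reported in the dissertation.

```python
# Back-of-envelope arithmetic from the numbers quoted in the abstract.
# Assumption: a 1080p frame is 1920 x 1080 pixels (no padding).
FRAME_WIDTH = 1920
FRAME_HEIGHT = 1080
FRAMES_PER_SECOND = 30

AVG_POWER_W = 0.293        # 293 mW average power (from the abstract)
POWER_REDUCTION = 0.384    # 38.4% reduction vs. one clock frequency and voltage

# Pixels the encoder must process every second.
pixels_per_second = FRAME_WIDTH * FRAME_HEIGHT * FRAMES_PER_SECOND

# Average energy spent per pixel, in nanojoules.
energy_per_pixel_nj = AVG_POWER_W / pixels_per_second * 1e9

# Power implied for single-frequency, single-voltage operation, in milliwatts:
# 293 mW is (1 - 0.384) of that baseline.
baseline_power_mw = AVG_POWER_W * 1e3 / (1 - POWER_REDUCTION)

print(f"{pixels_per_second} pixels/s")
print(f"{energy_per_pixel_nj:.2f} nJ/pixel")
print(f"{baseline_power_mw:.1f} mW baseline")
```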
Next, this dissertation proposes novel processor shapes and inter-connection
topologies for many-core processor arrays which result in an overall
application processor that requires fewer cores and has a lower
total communication length. The proposed topologies are compared to the commonly used 2D mesh and
include two 8-neighbor topologies, two 5-nearest-neighbor topologies, and
three 6-nearest-neighbor topologies, three of which utilize 5-sided or hexagonal
processor tiles. A 1080p H.264/AVC residual video encoder and a complete 54 Mbps
802.11a/11g wireless LAN baseband receiver are mapped onto all topologies and compared.
The methodology to implement an array of hexagonal-shaped processor tiles with
industry-standard CAD tools and automatic place and route flow
is described. A 16-bit DSP processor tile is tailored for all proposed topologies and implemented in 65 nm CMOS technology without full-custom layout. Results show that the 6-neighbor hexagonal tile and the 6-neighbor rectangular tile incur a 2.9% area increase per tile compared to the 4-neighbor 2D mesh, but their much more effective inter-processor interconnect yields an average total application area reduction of 21% and
a total application inter-processor communication distance reduction of 19%.
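To illustrate why denser neighbor connectivity can shorten communication paths, the sketch below compares hop counts on a 4-neighbor 2D mesh with an idealized 8-neighbor array. It assumes the 8 neighbors are the four edge-adjacent plus the four diagonal tiles, which may differ from the dissertation's actual 8-neighbor topologies, and it averages over all tile pairs rather than over a mapped application; the function names are hypothetical.

```python
from itertools import product


def mesh_hops(a, b):
    """Hops between tiles a and b on a 4-neighbor 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def eight_neighbor_hops(a, b):
    """Hops when diagonal links also exist (Chebyshev distance).

    Assumption: the 8 neighbors are the 4 edge-adjacent and 4 diagonal tiles.
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def average_hops(size, hops):
    """Average hop count over all ordered pairs of distinct tiles in a size x size array."""
    tiles = list(product(range(size), repeat=2))
    pairs = [(a, b) for a in tiles for b in tiles if a != b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


# Example: a 5 x 5 array (array size chosen only for illustration).
mesh_avg = average_hops(5, mesh_hops)
eight_avg = average_hops(5, eight_neighbor_hops)
print(f"4-neighbor mesh average hops: {mesh_avg:.2f}")
print(f"8-neighbor array average hops: {eight_avg:.2f}")
```

This uniform all-pairs average is only a proxy: the reductions reported above come from mapping real applications onto each topology.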
Motivated by the fact that video encoding tasks normally read and write a block of data in a single transaction, the third part of this dissertation proposes a novel source-synchronous bufferless shared memory that enables safe memory sharing among multiple processors in different clock domains. Compared with the previous FIFO-buffered memory design, the bufferless memory module achieves lower latency, higher throughput, lower area overhead and lower power consumption. The bufferless memory module also supports direct communication with far-away processors through the existing processor-to-processor circuit-switched interconnection network. The implementation results show that a 16~KB bufferless memory module reduces single memory access latency by 58% and has 1% higher burst-mode throughput compared to the 16~KB buffered memory module. The bufferless memory module also reduces the area overhead from 63% to 17% compared with the buffered memory module, which yields a power reduction of 43%.
Zhibin Xiao, "Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications," Ph.D. Dissertation, Technical Report ECE-VCL-2012-4, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2012.
@phdthesis{zxiao:vcl:phdthesis,
  author  = {Zhibin Xiao},
  title   = {Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications},
  school  = {University of California},
  month   = dec,
  year    = 2012,
  address = {Davis, CA, USA},
  note    = {\url{http://www.ece.ucdavis.edu/vcl/pubs/theses/2012-4/}}
}