In the last ten years, limited clock frequency scaling and increasing power density has shifted IC design focus towards parallelism, heterogeneity and energy efficiency. Improving energy efficiency is by no means simple and it calls for a reevaluation of old design choices in processor architecture, and perhaps more importantly, development of new programming methodologies that exploit the features of modern architectures.
This thesis discusses the design of energy-efficient digital signal processors with application-specific instructions sets, so-called ASIP-DSPs, and their programming tools. Target applications for such processors include, but are not limited to, communications, multimedia, image processing, intelligent vision and radar. These applications are often implemented by a limited set of kernel algorithms, whose performance and efficiency are critical to the application's success. At the same time, the extreme non-recurring engineering cost of system-on-chip designs means that product life-time must be kept as long as possible. Neither general-purpose processors nor non-programmable ASICs can meet both the flexibility and efficiency requirements, and ASIPs may instead be the best trade-off between all the conflicting goals.
Traditional superscalar- and VLIW processor design focus has been to improve the throughput of fine-grained instructions, which results in high flexibility, but also high energy consumption. SIMD architectures, on the other hand, are often restricted by inefficient data access. The result is architectures which spend more energy and/or time on supporting operations rather than actual computing.
This thesis defines the performance limit of an architecture with an N-way parallel datapath as consuming 2N elements of compute data per clock cycle. To approach this performance, this work proposes coarse-grained higher-order functional (HOF) instructions, which encode the most frequently executed compute-, data access- and control sequences into single many-cycle instructions, to reduce the overheads of instruction delivery, while at the same time maintaining orthogonality. The work further investigates opportunities for operation fusion to improve computing performance, and proposes a flexible memory subsystem for conflict-free parallel memory access with permutation and lookup-table-based addressing, to ensure that high computing throughput can be sustained even in the presence of irregular data access patterns. These concepts are extensively studied by implementing a large kernel algorithm library with typical DSP kernels, to prove their effectiveness and adequacy. Compared to contemporary VLIW DSP solutions, our solution can practically eliminate instruction fetching energy in many scenarios, significantly reduce control path switching, simplify the implementation of kernels and reduce code size, sometimes by as much as 30 times.
The techniques proposed in this thesis have been implemented in the DSP platform ePUMA (embedded Parallel DSP processor with Unique Memory Access), a configurable control-compute heterogeneous platform with distributed memory, optimized for low-power predictable DSP computing. Hardware evaluation has been done with FPGA prototypes. In addition, several VLSI layouts have been created for energy and area estimations. This includes smaller designs, as well as a large design with 73 cores, capable of 1280 integer GOPS or 256 GFLOPS at 500MHz and which measures 45mm2 in 28nm FD-SOI technology.
In addition to the hardware design, this thesis also discusses parallel programming flow for distributed memory architectures and ePUMA application implementation. A DSP kernel programming language and its compiler is presented. This effectively demonstrates how kernels written in a high-level language can be translated into HOF instructions for very high processing efficiency.
Linköping: Linköping University Electronic Press, 2016. , 340 p.
Liu, Dake, ProfessorEhliar, Andreas, Dr.