Mapping Signal Processing Kernels to Tiled Architectures
MASSACHUSETTS INST OF TECH LEXINGTON LINCOLN LAB
Pagination or Media Count:
This presentation focuses on understanding when a stream algorithm exists for a given kernel. We do so by considering the directed acyclic graph DAG for a particular implementation of the kernel. Nodes in the DAG represent inputs, outputs, or intermediate products of the algorithm, and edges from node A to node B in the DAG show that A is used to compute B. We can characterize the DAG for an algorithm by the ratio of inputs, W, to the number of intermediate products, Q, for which any one value is directly required. That is, for each output, there are W 2N inputs used and a total of Q N intermediate products the partial sums computed. The stream algorithm implementation of matrix multiply meets the compute efficiency condition. Matrix multiplication is an example of a kernel with a constant ratio of W to Q. All known algorithms with a constant ratio of W to Q have implementations that meet the compute efficiency condition. Stream algorithm techniques can be used to implement an efficient implementation of the radix-4 FFT for a 4x4 tile array, but this implementation is not scalable and performance will be worse on larger Raw systems. The Raw FFT achieves high performance for large data sizes, and offers performance that is more stable across a range of data sizes. In this talk, we will describe the implementation of FFT, QR factorization, and CFAR kernels on the Raw simulator and Raw board. We examine the performance of these kernels and compare to conventional implementations on the Pentium and G4 architectures. Finally, we characterize the DAG of each kernel and discuss how the DAG influences the implementation on Raw and on tiled architectures in general.
- Numerical Mathematics
- Computer Hardware