Reducing Cache Misses in Numerical Applications Using Data Relocation and Prefetching.
Abstract:
Numerical applications frequently contain nested loops that process large arrays of data. The execution of these loop structures often produces memory reference patterns that use data caches poorly: poor data reuse, large working sets, and frequent non-unit stride accesses combine to cause many cache misses. Data copying has been proposed to improve cache performance, but it incurs high overhead. In this paper, we instead propose a combined hardware and software technique, called data relocation and prefetching, which uses special hardware to eliminate much of the overhead of data copying. Relocating the data while performing software prefetching reduces the copying overhead further. The technique performs better than prefetching alone because relocation reduces cache misses and because prefetching multiple elements at once reduces overhead. The hardware is designed to overlap relocation and prefetching with normal execution and to make heavy use of the available bus bandwidth. Simulation results show that this technique greatly reduces data cache miss rates. As a result, large applications, including PERFECT and SPEC benchmarks, achieve speedups of up to 2.5 times. The hardware support required by this technique has been greatly refined over that presented in an earlier paper.
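As background, the conventional software data copying that the abstract contrasts against can be sketched as follows. This is a minimal illustration under stated assumptions, not the hardware mechanism proposed in the paper: the function names (relocate_strided, dot_contiguous, matmul_with_copy) and the choice of matrix multiplication as the example loop nest are hypothetical. Each column of a row-major matrix, normally read with a stride of n elements, is first copied into a contiguous buffer and then reused across every row of the other operand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: gather n elements spaced `stride` apart in src
 * into the contiguous buffer dst. This is the software copying step
 * whose overhead the abstract refers to. */
static void relocate_strided(const double *src, size_t stride, size_t n,
                             double *dst)
{
    assert(src != NULL && dst != NULL && stride > 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}

/* Dot product of two contiguous (unit-stride) vectors of length n. */
static double dot_contiguous(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* C = A * B for n x n row-major matrices.
 * Column j of B is copied once into buf (length n) and then reused for
 * all n rows of A, so the inner loop reads both operands at unit stride. */
static void matmul_with_copy(const double *a, const double *b, double *c,
                             size_t n, double *buf)
{
    for (size_t j = 0; j < n; j++) {
        relocate_strided(b + j, n, n, buf);   /* non-unit stride read, once */
        for (size_t i = 0; i < n; i++)
            c[i * n + j] = dot_contiguous(a + i * n, buf, n);
    }
}
```

In this sketch the copy loop itself still walks B with non-unit stride and executes extra instructions on every column; that cost is the copying overhead the abstract describes.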