A Systolic 2-D Convolution Chip

H. T. Kung and S. W. Song

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pa. 15213

Last Revised March 1981
A Systolic 2-D Convolution Chip

H. T. Kung and S. W. Song

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pa. 15213

Last Revised: March 1981

Copyright © 1981 H. T. Kung and S. W. Song

This research was supported in part by the Office of Naval Research under Contracts N00014-76-C-0370 and N00014-80-C-0236, in part by the National Science Foundation under Grant MCS 78-236-76, and in part by the Defense Advanced Research Projects Agency under Contract F33615-78-C-1551 (monitored by the Air Force Office of Scientific Research). The first author is currently on leave from Carnegie-Mellon University at ESL's Advanced Processor Technology Group in San Jose, California. (ESL is a subsidiary of TRW.) The second author is supported in part by CNPq, Conselho Nacional de Desenvolvimento Cientifico e Tecnologico, Brazil, under Contract 200.402-79-CC, and is on leave from the Institute of Mathematics and Statistics of the University of Sao Paulo, Brazil.
A SYSTOLIC 2-D CONVOLUTION CHIP

H. T. Kung and S. W. Song
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pa. 15213
Last Revised March 1981

Abstract:

This paper describes a chip for performing the 2-D (two-dimensional) convolution in signal and image processing. The chip, based on a systolic design, consists of essentially only one type of simple cells, which are mesh-interconnected in a regular and modular way, and achieves high performance through extensive concurrent and pipelined use of these cells. Denoting by \( u \) the cycle time of the basic cell, the chip allows convolving a \( k \times k \) window with an \( n \times n \) image in \( O(n^2u/k) \) time, using a total of \( k^3 \) basic cells. The total number of cells is optimal in the sense that the usual sequential algorithm takes \( O(n^2k^2u) \) time. Furthermore, because of the modularity of the design, the number of cells used by the chip can be easily adjusted to achieve any desirable balance between I/O and computation speeds.

Keywords:

Image and signal processing, parallel computation, special-purpose chip design, systolic arrays, two-dimensional convolution, VLSI designs.

1. Introduction

With recent technological advances in the VLSI circuitry, the chip capacity (or component count on a chip) is increasing at an astonishing rate [7]. Both the opportunities and challenges regarding effective use of VLSI are tremendous. Systolic design is an architectural concept proposed for VLSI [3]. Chip designs based on systolic architectures tend to be simple, modular, and of high performance. In [6] a general discussion on the attractiveness of systolic architectures is given. Systolic architectures are particularly suited to chip implementation of operations in signal and image processing such as filtering, correlation, and discrete Fourier transform [5]. In this paper we introduce a chip, based on a novel systolic design, for performing the 2-D convolution operator. Examples of application of this operator include noise smoothing, linear edge enhancement, edge crispening, etc. Similar designs can be used for digital filtering and numerical relaxation.
2. Description of the Systolic Design

To define the 2-D convolution problem, consider a matrix (or image) \( X = \{ x_{ij} \} \) and a \( k \times k \) window with weighting coefficients \( w_{ij} \) as shown in Figure 2-1. For the purpose of illustration, we use \( k = 3 \). Now slide the \( 3 \times 3 \) window along the matrix. For each position of the window, the weighted sum of the entries (or pixels) in the submatrix covered by the window is to be computed. More precisely, if the entry at the center of the submatrix is \( x_{rs} \), then we wish to compute

\[
\sum_{i=1}^{3} \sum_{j=1}^{3} w_{ij} x_{r+i-2, s+j-2}
\]

\[
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nn}
\end{bmatrix}
\]

The 2-D convolution chip is based on a variant of the systolic FIR filtering array proposed in [4] and [5]. The chip uses essentially only one type of basic cells (see Figure 2-2), which are interconnected in a regular and modular way to form a two-dimensional systolic array. The function of a basic cell is to update a result \( Y \) as indicated in the figure. Note that in each cell, \( \omega \) is a preloaded weighting coefficient and each \( x \) stays inside the cell for a cycle.

\[
\begin{align*}
\text{X}_{\text{in}} & \rightarrow \text{X}_{\text{out}} \\
\text{Y}_{\text{in}} & \rightarrow \text{Y}_{\text{out}} \\
\text{X}_{\text{out}} & \leftarrow \omega \cdot \text{X}_{\text{in}} + \text{Y}_{\text{in}} \\
\text{X}_{\text{out}} & \leftarrow \text{X} \\
\text{X} & \leftarrow \text{X}_{\text{in}}
\end{align*}
\]

[Each X stays inside the cell for one cycle]

[Value \( \omega \) is a preloaded weighting coefficient]

\[
\begin{align*}
\sum_{i=1}^{3} \sum_{j=1}^{3} w_{ij} x_{r+i-2, s+j-2}
\end{align*}
\]

Figure 2-1: \( nxn \) matrix and \( k \times k \) window with \( k = 3 \).

Figure 2-2: Basic cell.
The three kernel cells that form the systolic array are shown in Figure 2-3. Each kernel cell is composed of nine basic cells and one row-interface cell whose function will become clear later. Five rows of the matrix $X$ advance synchronously from left to right. The small dots in the figure denote appropriate delays needed to make each column advance as a whole. (This does not mean that these delays have to be implemented on-chip: the inputs can be spaced accordingly when fed into the device.) During the cycle subsequent to that an entry $x_{ij}$ enters the upper right basic cell of a kernel cell, the weighted sum corresponding to a submatrix with that entry as the upper left entry will be output from that kernel cell. Therefore by sweeping the first five rows, a systolic array consisting of three kernel cells can compute the first three output rows on-the-fly. In general, by sweeping rows $3i+1$ to $3i+5$ the array computes rows $3i+1$ to $3i+3$. Thus an entry in the $nxn$ matrix is input to the cell array at most twice. In the rest of the section, we describe the underlying idea of the kernel cell design.

Suppose we want to compute the following weighted sum in the first row of a kernel cell:

$$Y = \omega_{11} x_{i1j} + \omega_{12} x_{i1j-1} + \omega_{13} x_{i1j-2}.$$  

One way to do this is to use a special case of the linear systolic FIR filtering array discussed in [5]. We use a two-way flow basic cell (to distinguish it from the one-way flow basic cell defined earlier) as in Figure 2-4, and the desired value for $Y$ can be obtained as illustrated in Figure 2-5 (a) through (c). In Figure 2-5 (a), we have denoted by $Y_{in}$ the particular $Y_{in}$ to the rightmost basic cell, with its value initialized to zero.

An improved way is to use the one-way flow basic cell as defined in Figure 2-2. Consider Figure 2-6 (a) where three basic cells in the first row of a kernel cell are shown. Consider the moment the entry $x_{ij}$ enters the
leftmost basic cell. Denote by \( Y \) the particular \( Y_{in} \) to this cell, whose value is initialized as zero. During the cycle \( x_i \) enters the leftmost cell. \( Y \) becomes \( \omega_{13} x_{iy} \). In the next cycle \( \omega_{12} x_{i+1} \) is accumulated to \( Y \), as shown in Figure 2-6 (b). Finally, as in Figure 2-6 (c), \( \omega_{11} x_{i+2} \) is further accumulated to \( Y \). As a result, when \( Y \) emerges from the rightmost cell, it has already accumulated the weighted sum of the first row of a submatrix. In the meanwhile, the same has been happening at the two bottom rows of the kernel cell. Therefore, by summing the partial results at the row-interface cell, the desired output, which is the weighted sum of all the nine entries in a 3x3 submatrix, is obtained (see Figure 2-3).

The filtering array of Figure 2-6 differs from the previous one in that both the x-stream and Y-stream now
move from left to right and the x-stream travels at half the speed of the Y-stream. This variation results in the active use of every cell at any given time. As a result each kernel cell can produce a pixel every cycle rather than every other cycle, though each basic cell now requires the additional storage to provide the needed delay for the x-stream.

3. Balancing I/O and Computation

Because of the modularity of the basic design of the systolic convolution array, its size can easily be adjusted to match the bandwidth between the array and the host to which it is attached. Let $u$ denote the cycle time of the basic cell. Suppose that the image is $n \times n$, and that in time $u$, $2k-1$ pixel values can be transferred from the host to the array, and $k$ output pixels can be sent back to the host. For processing the whole image, the I/O time alone is thus $O(n^2u/k)$. To balance this, the convolution array allows convolving a $k \times k$ window with the whole image in $O(n^2u/k)$ time, using $k \times k$ kernel cells. We thus see how I/O complexity can be used to guide the design of special-purpose devices. The reader is referred to [2] for several lower bound results on the I/O complexity for a number of problems including the fast Fourier transform. The total number of cells, $k^4$, used in the proposed convolution array is optimal in the sense that the usual sequential algorithm takes $O(n^2k^2u)$ time.
4. Implementation

Using part of a chip we have implemented a 3x3 kernel cell, consisting of nine basic cells plus one row-interface cell. The implementation uses a bit-serial word-parallel organization. Pixels are input as 8-bit samples and output with 16 bits. The weighting coefficients can be changed during the loading phase prior to the convolution computation. Their values are restricted to be powers of two in [-16, 16], to reduce the area of multiplier circuits in each basic cell. (See the remark in Section 4.4 for a technique of using the chip to handle cases where weighting coefficients are not powers of two.)

We expect that based on this prototype design a full chip can contain three 3x3 kernel cells and output a pixel in less than 175 ns. The anticipated high computation rate is mostly due to the high degree of concurrency inherent in the overall design.

4.1. I/O Description

A total of twelve I/O pins are used. Eleven of these are for input; only one is for output. Weighting coefficients are loaded into the convolution chip prior to the computation proper and therefore the inputs of these coefficients and the pixel values share the same pins. A cycle is composed of 16 minor cycles each of which includes phases \( \varphi_1 \) and \( \varphi_2 \). During each cycle one pixel value is input and one result is output. Since an input pixel value has only 8 bits which are input bit serially, input will occur every other half-cycle. In the following we give a list of names of each pin accompanied by a brief description. Some of the control signals can in fact be generated on-chip.

<table>
<thead>
<tr>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vdd</td>
<td>Power</td>
</tr>
<tr>
<td>Gnd</td>
<td>Ground</td>
</tr>
<tr>
<td>( \varphi_1 )</td>
<td>Clock phase 1</td>
</tr>
<tr>
<td>( \varphi_2 )</td>
<td>Clock phase 2</td>
</tr>
<tr>
<td>xorw₁</td>
<td>Pixel value or weighting coefficient for row 1</td>
</tr>
<tr>
<td>xorw₂</td>
<td>Pixel value or weighting coefficient for row 2</td>
</tr>
<tr>
<td>xorw₃</td>
<td>Pixel value or weighting coefficient for row 3</td>
</tr>
<tr>
<td>Loadω</td>
<td>Control signal indicating weighting coefficients are being loaded</td>
</tr>
<tr>
<td>LSB</td>
<td>Control signal indicating the least significant pixel bits are being input</td>
</tr>
<tr>
<td>Circulate</td>
<td>Control signal indicating the 2nd 1/2 cycle (absence of input)</td>
</tr>
<tr>
<td>Yaux</td>
<td>Optional input to be accumulated to the computed result</td>
</tr>
<tr>
<td>Y</td>
<td>Result of the weighted sum</td>
</tr>
</tbody>
</table>

(to allow repeated computation for non-power-of-2 weights)
4.2. A General Layout

In Figure 4-1 we illustrate a general layout of a kernel cell. Power and ground, as well as the clock signals, have been omitted in the illustration.

![Diagram of a kernel cell layout](image)

**Figure 4-1:** General layout of a kernel cell.

4.3. Implementation of the Basic Cell

The basic cell is represented schematically as in Figure 4-2. Each pixel value remains in a basic cell for two cycles while each result stays there for one cycle. Two 8-bit shift registers are provided for the storage of two 8-bit pixel values. During the first half cycle, pixel bits entering a basic cell advance from left to right into the first 8-bit shift register, and at the same time enter the multiplier, as indicated by the solid lines. During this same time, pixel bits already inside the first 8-bit shift register are moved into the second 8-bit shift register.
while pixel bits originally inside the second 8-bit shift register are moved to the next basic cell. During the second half cycle all pixel bits circulate inside the 8-bit shift registers. Flow during the second half cycle is indicated by the dash lines.

Multiplication in this particular case is merely a shift operation. It takes one minor cycle for an input bit to enter the shifter and the result bit to exit the serial adder. A 15-bit shift register therefore is provided to hold the result bits so that they will arrive at the next basic cell at the appropriate time.

Figure 4-2: Implementation of the basic cell.
4.4. The Row-Interface Cell

This cell computes the sum of the three partial results from the three rows of basic cells. It also takes an optional input $Y_{aux}$ which will be accumulated to the computed result. This will allow the handling of cases in which weighting coefficients are not powers of two by repeating the computation several times. In most image processing applications it is sufficient to have weighting coefficients as powers of two and, in such cases, $Y_{aux}$ will be zero.

5. Concluding Remarks

We have developed a chip design for the 2-D convolution operator, which uses essentially only one type of basic cells interconnected in a regular and modular way. While this experiment is about a particular design, we wish to make the following general remark concerning the balance of I/O and computation. Since a special-purpose device is typically attached to a host, from which it gets the data to be processed and to which it outputs results, I/O considerations play an important role on the overall performance. The ultimate goal of a special-purpose hardware design is (and should be no more than) that of achieving a computation rate that balances the available I/O bandwidth. Thus it is important that the design be modular so that its size can easily be adjusted to match the bandwidth between the device and the host. In the design presented basic cells are implemented in a bit-serial manner. As a result, the time for a basic cell to process a pixel is likely to be longer than that for the memory to input and output a pixel. This calls for parallel inputs and outputs of multiple pixels at each cycle of the basic cell. This is the reason why we made the systolic array of this paper input five pixels and output three pixels every cycle. If, on the other hand, basic cells are implemented by bit-parallel schemes (which imply a much shorter cycle time for the basic cell), then the chip is likely able to input and output only one pixel during each cycle. A one-dimensional, rather than two-dimensional, systolic array would be an appropriate design for this case.

Acknowledgment

We would like to thank all those people who made the MPC79 project [1] possible, through which a prototype design of our chip could be implemented. We would also like to thank the following people at CMU who helped on this research. Dave McKeown and John Kender provided valuable information regarding typical requirements in image processing applications. Dan Hoey provided the design of the serial adder. Mike Foster and Dan Hoey checked for design rule violations.
References

MPC'79: The Large-Scale Demonstration of a New Way to Create Systems in Silicon.

I/O Complexity: The Red-Blue Pebble Game.

Systolic Arrays (for VLSI).
A slightly different version appears in Introduction to VLSI Systems by C. A. Mead and L. A. Conway, Addison-Wesley, 1980, Section 8.3.

Also available as a CMU Computer Science Department technical report, September 1979.

Special-Purpose Devices for Signal and Image Processing: An Opportunity in VLSI.

Why Systolic Architecture?

Introduction to VLSI Systems.
Addison-Wesley, Reading, Massachusetts, 1980.
**REPORT DOCUMENTATION PAGE**

<table>
<thead>
<tr>
<th>Block</th>
<th>Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>REPORT NUMBER&lt;br&gt;CMU-CS-81-110</td>
</tr>
<tr>
<td>2.</td>
<td>GOVT ACCESSION NO.</td>
</tr>
<tr>
<td>3.</td>
<td>RECIPIENT'S CATALOG NUMBER&lt;br&gt;TA-104</td>
</tr>
<tr>
<td>4.</td>
<td>TITLE (and Subtitle)&lt;br&gt;A SYSTOLIC 2-D CONVOLUTION CHIP</td>
</tr>
<tr>
<td>5.</td>
<td>TYPE OF REPORT &amp; PERIOD COVERED&lt;br&gt;Interim</td>
</tr>
<tr>
<td>6.</td>
<td>PERFORMING ORG. REPORT NUMBER</td>
</tr>
<tr>
<td>7.</td>
<td>AUTHOR(S)&lt;br&gt;H.T. KUNG&lt;br&gt;S.W. SONG</td>
</tr>
<tr>
<td>8.</td>
<td>CONTRACT OR GRANT NUMBER(S)&lt;br&gt;N00014-76-C-0370</td>
</tr>
<tr>
<td>9.</td>
<td>PERFORMING ORGANIZATION NAME AND ADDRESS&lt;br&gt;Carnegie-Mellon University&lt;br&gt;Computer Science Department&lt;br&gt;Pittsburgh, PA 15213</td>
</tr>
<tr>
<td>10.</td>
<td>PROGRAM ELEMENT, PROJECT, TASK AREA &amp; WORK UNIT NUMBERS</td>
</tr>
<tr>
<td>11.</td>
<td>CONTROLLING OFFICE NAME AND ADDRESS&lt;br&gt;Office of Naval Research&lt;br&gt;Arlington, VA 22217</td>
</tr>
<tr>
<td>12.</td>
<td>REPORT DATE&lt;br&gt;MARCH 1981</td>
</tr>
<tr>
<td>13.</td>
<td>NUMBER OF PAGES&lt;br&gt;12</td>
</tr>
<tr>
<td>14.</td>
<td>MONITORING AGENCY NAME &amp; ADDRESS (if different from Controlling Office)</td>
</tr>
<tr>
<td>15.</td>
<td>SECURITY CLASS. (of this report)&lt;br&gt;UNCLASSIFIED</td>
</tr>
<tr>
<td>16.</td>
<td>DISTRIBUTION STATEMENT (of this Report)&lt;br&gt;Approved for public release; distribution unlimited</td>
</tr>
<tr>
<td>17.</td>
<td>DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)</td>
</tr>
<tr>
<td>18.</td>
<td>SUPPLEMENTARY NOTES</td>
</tr>
<tr>
<td>19.</td>
<td>KEY WORDS (Continue on reverse side if necessary and identify by block number)</td>
</tr>
<tr>
<td>20.</td>
<td>ABSTRACT (Continue on reverse side if necessary and identify by block number)</td>
</tr>
</tbody>
</table>