Multiplierless FIR Filter Implementation on FPGA

S. M. Badave and A. S. Bhalchandra

Abstract—The area complexity of a finite impulse response (FIR) filter is caused mainly by its multipliers. Among the multiplierless FIR techniques, Distributed Arithmetic (DA) is the most widely preferred area-efficient technique. In DA, precomputed values of the inner product are stored in a look-up table (LUT); the LUT outputs are then added and shifted over a number of iterations equal to the precision of the input samples. In the basic structure, however, the LUT grows exponentially with the order of the FIR filter, which makes it prohibitive for many applications. This paper presents an improvement over the basic DA structure in which the LUT is sliced to a desired length. An architecture for a 16-tap FIR filter is presented with different LUT slice lengths. Implementation and synthesis results show an improvement in operating speed as well as a saving in area as the number of slices increases. A drastic improvement in speed is found when compared with an earlier result.

Index Terms—FIR filter, multiplierless, Distributed Arithmetic.

I. INTRODUCTION

In the last few years, there has been a growing trend to implement DSP functions in Field Programmable Gate Arrays (FPGAs), which offer a balanced solution in comparison with traditional devices. Although application specific integrated circuits (ASICs) and digital signal processors have been the traditional solution for high performance applications, the technology and the market are now imposing new rules. On one hand, the high development costs and time-to-market factors associated with ASICs can be prohibitive for certain applications; on the other hand, programmable DSP processors can be unable to reach a desired performance because of their sequential-execution architecture. In this context, FPGAs offer a very attractive solution that balances flexibility, time-to-market, cost and performance. For this reason, the research community has put great effort into designing efficient architectures for DSP functions such as finite impulse response (FIR) filters, which are used extensively in digital communications, speech processing, wireless/satellite communications, biomedical signal processing and many other applications [1]-[3].

Most digital signal processing applications involve FIR filters because of their linearity and stability. Their only limitation is the large number of taps needed to obtain the desired frequency response, which leads to area complexity. In its general form, the FIR filter [4] is characterized by

\[ y(n) = \sum_{k=0}^{K-1} a_k x(n-k) \quad (1) \]

Equation (1) shows that the filter is an extensive sequence of multiplication operations. Since multipliers are costly in terms of area, many multiplier-centric techniques have been developed for FIR filter implementation to resolve this issue. Research work falls into two broad categories of FIR filter implementation: one that uses multipliers, which can be categorized as multipliered FIR filters [5-7], and another that avoids them, categorized as multiplierless FIR filters [1,8-13]. In multipliered FIR filters, efforts are made to reduce the area either by sharing multipliers or by manipulating the coefficients so as to reduce the number of multiplications, whereas in multiplierless FIR filters the coefficients are transformed to other numeric representations whose hardware implementation or arithmetic manipulation is more efficient than that of the traditional binary representation. In the Canonic Signed Digit (CSD) representation [1]-[13], coefficients are represented by a combination of powers of two in such a way that multiplication can be implemented simply with adders/subtractors and shifters. Using memory or look-up tables to store precomputed values of coefficient operations is another way to replace traditional multipliers. These are known as memory-based methods. The Constant Coefficient Multiplier (KCM) and Distributed Arithmetic (DA) [12], [14] fall under this category.
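
To make the CSD idea concrete, the following is a minimal Python sketch, not taken from the paper, of how an integer coefficient can be recoded into signed digits and then applied to a sample using only shifts, additions and subtractions. The function names `to_csd` and `csd_multiply` are hypothetical, and integer (already scaled) coefficients are assumed.

```python
def to_csd(value: int) -> list[int]:
    """Return signed digits (-1, 0, +1) of value, least significant first.

    The recoding guarantees that no two adjacent digits are nonzero,
    which is the defining property of the canonic signed digit form.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1, value mod 4 == 3 -> digit -1.
            # After subtracting the digit, value is divisible by 4,
            # so the next digit is forced to be zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(sample: int, digits: list[int]) -> int:
    """Multiply sample by the CSD-coded coefficient with shifts and adds only."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += sample << shift
        elif digit == -1:
            acc -= sample << shift
    return acc


# 7 = 8 - 1 needs one subtractor instead of the two adders of 4 + 2 + 1.
print(to_csd(7))                    # [-1, 0, 0, 1]
print(csd_multiply(13, to_csd(7)))  # 91
```

Each nonzero digit costs one adder or subtractor, so fewer nonzero digits mean less hardware per coefficient.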

This paper presents a hardware-efficient multiplierless FIR filter implemented with distributed arithmetic. The multiply-accumulate (MAC) operations of the traditional structure may be replaced by a series of look-up table (LUT) accesses and summations, a technique known as distributed arithmetic (DA). DA is a bit-serial operation that implements a series of fixed-point MAC operations in a fixed number of steps, regardless of the number of terms to be calculated. In the basic form of the DA architecture, however, the LUT size grows exponentially as the filter order increases.

In the present paper, the FIR filter structure is based on slicing of the LUT. For a \( K \)-tap filter, \( m \) slices are taken so as to form \( m \) smaller units, each a \( k \)-tap DA base unit (\( K = m \times k \)). Here it is assumed that \( K \) is not prime. The total memory requirement for a \( K \)-tap FIR filter is drastically reduced from \( 2^K \) to \( m \times 2^k \) memory elements, at the additional cost of \( m-1 \) adders. The proposed DA architecture thus enables FIR implementation with reduced area, which is mainly useful for high-order FIR filters. The design can be extended to reduce the memory further by implementing a LUT-less FIR filter structure. This paper is organized as follows. A review of the basic DA structure is given in Section II, and the proposed DA architecture is presented in Section III. Implementation steps of the slice-based DA architecture are given in Section IV. Section V reports the area utilization and performance of the proposed DA architecture and compares them with earlier work.
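
The memory figures quoted above can be checked with a few lines of Python. This is only an illustrative calculation of the \( m \times 2^k \) word count and the \( m-1 \) adder count for a 16-tap filter; the helper name `lut_words` is hypothetical.

```python
def lut_words(taps: int, slices: int) -> int:
    """Number of LUT words when a taps-input LUT is split into equal slices."""
    if taps % slices != 0:
        raise ValueError("the number of slices must divide the number of taps")
    taps_per_slice = taps // slices       # k in K = m * k
    return slices * 2 ** taps_per_slice   # m * 2^k


for m in (1, 2, 4, 8):
    print(f"m={m}: {lut_words(16, m)} LUT words, {m - 1} extra adders")
```

For \( K = 16 \) this gives 65536 words for the unsliced LUT, against 512, 64 and 32 words for 2, 4 and 8 slices.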

II. DISTRIBUTED ARITHMETIC

Distributed Arithmetic, like Modulo Arithmetic, is a computation algorithm that performs multiplication with look-up table based schemes. It specifically targets the sum-of-products (sometimes referred to as the vector dot product) computation, and it is one of the preferred methods of implementing FIR filters on FPGAs over the conventional one. Equation (1) indicates that $x$ and $y$ are two vectors of size $K$ that represent the input and transformed data, respectively, and that $a_k$ are the constant coefficients of the filter. In the DA scheme, assuming that $a_k$ is a known set of filter coefficients and that the input $x$ to the filter is represented as $L$-bit two's complement binary numbers, with $b_{k,l} \in \{0,1\}$ denoting bit $l$ of $x(n-k)$ and $b_{k,0}$ its sign bit, we have:

$$x(n-k) = -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l} 2^{-l} \quad (2)$$

Replacing this result in (1), we obtain:

$$y(n) = \sum_{k=0}^{K-1} a_k \left( -b_{k,0} + \sum_{l=1}^{L-1} b_{k,l} 2^{-l} \right)$$

$$y(n) = -\sum_{k=0}^{K-1} a_k b_{k,0} + \sum_{l=1}^{L-1} \left( \sum_{k=0}^{K-1} a_k b_{k,l} \right) 2^{-l} \quad (3)$$

From (3), it is observed that the term in the inner parentheses may take one of $2^K$ possible values, given that $b_{k,l} \in \{0,1\}$, and that those values correspond to all possible sums of combinations of the filter coefficients. These values can be precomputed and stored in LUTs or memories and addressed by the bits $b_{k,l}$. Thus, the MAC algorithm of the FIR filter is reduced to LUT accesses and summations.
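
As an illustration of this reduction, the following Python sketch builds the $2^K$-entry table and evaluates the sum of products bit by bit. It is a behavioural model under stated assumptions rather than the hardware described in this paper: samples and coefficients are plain integers (equivalent to the fractional form of (2) scaled by $2^{L-1}$), and the names `build_lut` and `da_output` are hypothetical.

```python
def build_lut(coeffs):
    """Entry at address A holds the sum of coeffs[k] over every set bit k of A."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_output(coeffs, samples, width):
    """Sum of coeffs[k] * samples[k] using only LUT reads, shifts and adds.

    samples are signed integers that fit in `width` bits (two's complement).
    Bit index l follows the text: l = 0 is the sign bit, l = width - 1 the LSB.
    """
    lut = build_lut(coeffs)
    acc = 0
    for l in range(width):
        shift = width - 1 - l            # integer weight 2^shift of bit l
        addr = 0
        for k, x in enumerate(samples):  # one bit from every sample forms the address
            addr |= ((x >> shift) & 1) << k
        term = lut[addr] << shift
        # The sign-bit term is subtracted, all other terms are added.
        acc = acc - term if l == 0 else acc + term
    return acc


coeffs = [3, -5, 7, 2]
samples = [-128, 127, -1, 64]
direct = sum(a * x for a, x in zip(coeffs, samples))
assert da_output(coeffs, samples, width=8) == direct
```

The loop runs `width` times whatever the number of coefficients, which is the sense in which the number of steps is fixed by the input precision.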

In a direct implementation of the filter from (1), the number of MAC units increases with the filter order, whereas in the DA structure the hardware in the critical path is decoupled from the order of the filter. DA therefore proves to be an area-economical structure.

The flexibility of the DA structure permits filter arrangements that vary from fully serial to fully parallel. The right balance among these versions is tied to the specifications of a given application and depends basically on the requirements in terms of hardware cost and throughput. In each case, the designer has to trade bandwidth for area.

III. PROPOSED ARCHITECTURE

In the basic form of Distributed Arithmetic (Fig. 1), the size of the LUT grows exponentially with the order of the filter.

To alleviate this problem, the main strategy is to slice the LUT into a desired number of smaller LUTs. This reduces the size of the memory at the cost of a small increase in area due to adders. Applying this approach, an area-efficient FIR filter is designed and implemented. The sliced LUT-DA scheme on an FPGA consists of input registers, sliced LUT units, and the shifter/accumulator unit. Additionally, it requires an adder tree to add the partial products. A control unit, which is a finite state machine (FSM), is used to sequence the filter operation.

A. Input Register

A stream of input samples $x(n)$ of data width $L$ is stored in the input registers (Fig. 2). These parallel input samples are converted into serial form and advanced to the right on every clock, so as to create the address of the LUT.
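
A behavioural sketch of this parallel-to-serial conversion is given below in Python. It assumes that the least significant bit leaves each register first, which is what a right shift implies, and that register $k$ drives address bit $k$; the function name `serialize_addresses` is hypothetical.

```python
def serialize_addresses(samples, width):
    """Yield one LUT address per clock, taking one bit from every register.

    Each register is shifted right by one position per clock, so the
    least significant bits are presented first.
    """
    mask = (1 << width) - 1
    regs = [x & mask for x in samples]   # two's complement codes of the samples
    for _ in range(width):
        addr = 0
        for k, reg in enumerate(regs):
            addr |= (reg & 1) << k       # register k drives address bit k
        yield addr
        regs = [reg >> 1 for reg in regs]


# Two 3-bit samples, 0b101 and 0b011, give the address sequence 3, 2, 1.
print(list(serialize_addresses([0b101, 0b011], width=3)))
```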

B. LUT Slicing

The exponential growth of a single LUT can be restricted effectively by slicing the LUT into a desired number of slices. When the LUT of a $K$-tap filter is divided into $m$ slices, it forms $m$ units, each addressed by $k = K/m$ bits. By appropriately adjusting the weightage of each unit, the desired output can be calculated. Fig. 3(a) and (b) highlight the structural details before and after slicing of the LUT, respectively.

In the present work, a 16-tap FIR filter is analyzed for various slice sizes. Details of one of the four slices are given in Fig. 4. By accumulating the outputs of all slices in an adder tree, the partial product term is calculated. The final output is then calculated by iterating the successive accumulate and shift operations.
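
The sliced scheme can be modelled in a few lines of Python. The sketch below is a behavioural illustration under stated assumptions, not the synthesized design: coefficients and samples are integers, the adder tree is modelled as a plain sum, bits are processed LSB first, and the names `build_lut` and `sliced_da_output` are hypothetical.

```python
import random


def build_lut(coeffs):
    """Entry at address A holds the sum of coeffs[j] over every set bit j of A."""
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << len(coeffs))]


def sliced_da_output(coeffs, samples, width, slices):
    """Sum of products computed with `slices` small LUTs and an adder tree."""
    taps = len(coeffs)
    if taps % slices != 0:
        raise ValueError("the number of slices must divide the number of taps")
    k = taps // slices                                  # taps per slice
    luts = [build_lut(coeffs[i * k:(i + 1) * k]) for i in range(slices)]
    acc = 0
    for l in range(width):                              # l = 0 is the LSB here
        partial = 0
        for i, lut in enumerate(luts):
            addr = 0
            for j in range(k):
                addr |= ((samples[i * k + j] >> l) & 1) << j
            partial += lut[addr]                        # adder tree over slice outputs
        term = partial << l
        # The most significant bit is the two's complement sign bit.
        acc = acc - term if l == width - 1 else acc + term
    return acc


rng = random.Random(0)
coeffs = [rng.randint(-128, 127) for _ in range(16)]
samples = [rng.randint(-128, 127) for _ in range(16)]
direct = sum(a * x for a, x in zip(coeffs, samples))
for m in (2, 4, 8):
    assert sliced_da_output(coeffs, samples, width=8, slices=m) == direct
```

Every slice count produces the same output as the direct sum of products; only the split between memory words and adders changes.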

<table>
<thead>
<tr>
<th>A3</th>
<th>A2</th>
<th>A1</th>
<th>A0</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>w0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>w1</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>w0 + w1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>w2</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>w2 + w0</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>w2 + w1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>w2 + w1 + w0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>w3</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>w3 + w0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>w3 + w1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>w3 + w1 + w0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>w3 + w2</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>w3 + w2 + w0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>w3 + w2 + w1</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>w3 + w2 + w1 + w0</td>
</tr>
</tbody>
</table>

Fig. 4. $2^4$ word LUT of data
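
The contents of such a slice can be generated mechanically from the address bits. The short Python sketch below reproduces the 16 rows of Fig. 4 symbolically, apart from the order in which the terms of a sum are written; the function name `lut_rows` is hypothetical.

```python
def lut_rows(names):
    """Return (address bits, stored sum) for every address of one LUT slice."""
    rows = []
    for addr in range(1 << len(names)):
        # Address bit k selects coefficient names[k]; list high bits first.
        terms = [names[k] for k in reversed(range(len(names))) if (addr >> k) & 1]
        rows.append((format(addr, f"0{len(names)}b"), " + ".join(terms) or "0"))
    return rows


for bits, data in lut_rows(["w0", "w1", "w2", "w3"]):
    print(bits, data)
```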

C. Accumulator and Shifter Unit

This stage consists of an accumulator and a shifter. The partial product generated by the LUTs is added and shifted in every iteration. The number of iterations is defined by the input precision.
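
One common way to realize this add-and-shift loop is to add each new partial product at the top of a wide accumulator and shift the accumulator right by one bit per iteration. The Python sketch below models that behaviour; it assumes LSB-first processing, integer partial products, and subtraction on the last (sign-bit) iteration, and the name `shift_accumulate` is hypothetical.

```python
def shift_accumulate(partials):
    """Combine LSB-first partial products with one add and one right shift each.

    partials[l] is the adder-tree output for bit l of the inputs; the last
    entry belongs to the sign bit and is therefore subtracted.
    """
    width = len(partials)
    acc = 0
    for l, partial in enumerate(partials):
        term = partial << width          # align the new term with the accumulator top
        acc = acc - term if l == width - 1 else acc + term
        acc >>= 1                        # one right shift per iteration
    return acc


# Coefficient 5, sample -3 in 4 bits (1101): LSB-first bits are 1, 0, 1, 1.
print(shift_accumulate([5, 0, 5, 5]))   # -15
```

Because each term enters with `width` spare low-order bits, none of the right shifts discards information in this model.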

D. Control Unit

This unit controls the other circuit components and the whole circuit behavior. It is a counter whose upper limit depends basically on the input precision and defines the circuit throughput. In contrast to other methods, an advantage of Distributed Arithmetic is that the throughput in DA-based architectures is independent of the order of the filter.
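
The counter-based control can be pictured with the following Python sketch. The three states and the two overhead cycles are assumptions made for illustration, since the state encoding is not detailed here; the names `State` and `control_sequence` are hypothetical.

```python
from enum import Enum


class State(Enum):
    LOAD = 0         # latch a new input sample into the registers
    ACCUMULATE = 1   # one LUT access plus add-and-shift per clock
    DONE = 2         # output register holds a valid result


def control_sequence(width):
    """Yield (state, bit counter) for the computation of one output sample."""
    yield State.LOAD, 0
    for count in range(width):           # the counter limit is the input precision
        yield State.ACCUMULATE, count
    yield State.DONE, width


cycles = list(control_sequence(8))
print(len(cycles))                       # 10 clocks under these assumptions
```

The number of tap coefficients does not appear anywhere in the sequence, which reflects the statement that throughput is independent of the filter order.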

IV. IMPLEMENTATION

To evaluate the performance of the proposed scheme, a 16-tap symmetric lowpass FIR filter is implemented and synthesized, and the results are compared with an earlier implementation. The precision of both the input and the coefficients is 8 bits. First, the filter is designed using the equiripple method in MATLAB. The coefficients are then truncated and scaled to 8 bits of precision. The frequency response of the designed filter is shown in Fig. 5.
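
The scaling and truncation step can be sketched as follows in Python. The coefficient values below are placeholders chosen only to be symmetric; they are not the coefficients of the filter designed in this work. Truncation toward zero and saturation at the 8-bit limits are likewise assumptions, and the function name `quantize` is hypothetical.

```python
import math


def quantize(coeffs, bits=8):
    """Scale real coefficients by 2^(bits-1) and truncate to signed integers."""
    scale = 1 << (bits - 1)
    lowest, highest = -scale, scale - 1
    quantized = []
    for c in coeffs:
        q = math.trunc(c * scale)                       # truncation toward zero (assumed)
        quantized.append(max(lowest, min(highest, q)))  # saturate to the signed range
    return quantized


# Placeholder symmetric 16-tap impulse response (hypothetical values).
half = [-0.01, 0.02, 0.03, -0.05, -0.06, 0.10, 0.30, 0.45]
coeffs = half + half[::-1]
q = quantize(coeffs)
assert q == q[::-1]                                     # symmetry survives quantization
print(q)
```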

Fig. 5. Frequency response of 16 tap FIR filter

The effect of slicing on area and throughput (Fig. 6) is analyzed thoroughly.

V. RESULT

The Xilinx Integrated Software Environment (ISE) is used for synthesis and implementation of the design. A 16-tap FIR filter is designed and implemented with fixed-point filter coefficients. All designs are synthesized for maximum performance. The area complexity and operating speed of the proposed circuit for various numbers of LUT slices are given in Table I. A comparison of the present work with previous work is given in Table II.

TABLE I: RESOURCE UTILIZATION OF PROPOSED SLICED LUT DA STRUCTURE

<table>
<thead>
<tr>
<th>No. of Slices of LUT-DA</th>
<th>Max. Frequency in MHz</th>
<th>Gate Count</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>171.431</td>
<td>1820</td>
</tr>
<tr>
<td>4</td>
<td>173.302</td>
<td>1320</td>
</tr>
<tr>
<td>8</td>
<td>184.641</td>
<td>905</td>
</tr>
</tbody>
</table>

TABLE II: FREQUENCY PERFORMANCE OF PROPOSED 4 INPUT DA AND PREVIOUS DA

<table>
<thead>
<tr>
<th>Filter Type (16 Tap FIR Filter)</th>
<th>Maximum Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proposed 4 Input DA</td>
<td>173.202 MHz</td>
</tr>
<tr>
<td>Previous 4 Input DA</td>
<td>46.7 MHz</td>
</tr>
</tbody>
</table>

VI. CONCLUSION

Distributed Arithmetic has proved to be an area-efficient technique for FIR filter implementation. When using it, special care is required against the exponential growth of the LUT size. Slicing the LUT to a desired length gives an effective solution, particularly for high-order filter designs. The highly flexible nature of this structure allows it to be used in forms ranging from fully serial to fully parallel, with a trade-off between area and bandwidth.

REFERENCES


Sunita Mukund Badave received the B.E. degree in Electrical (Electronics Specialization) from Shivaji University in 1989 and the M.E. degree in Electrical from Dr. B.A.M. University, Aurangabad, India, in 1998. She is currently working toward the Ph.D. degree in Electronics at Dr. B.A.M. University. Her research interests include architectures and circuit design for digital signal processing. She has presented nearly 16 technical papers at the national and international level. Mrs. S. M. Badave is a Member of the Institute of Electronics and Telecommunication Engineers (IETE), India.

Anjali S. Bhalchandra received the B.E. degree in Electronics and Telecommunication and the M.E. degree in Electronics in 1985 and 1992, respectively. She completed her Ph.D. in Electronics at S.R.M. University, Nanded, India, in 2004. She has a wide scientific and technical background covering the areas of Electronics and Communication. Currently, she is Head of the Electronics and Telecommunication Engineering Department and Associate Professor at the Government College of Engineering, Aurangabad. Her research interests include image processing, signal processing and communication. She has published more than 50 technical papers in various reputed journals and conference proceedings. Dr. Bhalchandra is a Fellow of the Institution of Engineers (IE), India, and a life member of the Indian Society for Technical Education (ISTE), India.