# Algorithm for Computation of DCT and its Implementation using a Systolic Architecture

### Anamika Jain, Neeta Pandey



Abstract: This paper presents a new algorithm for computing the N-point DCT, where N=4r, r>1. The algorithm computes the 1D DCT and is realized as a systolic array that uses identical processing elements (PEs). The proposed approach can also be used to obtain other transforms such as the Discrete Sine Transform (DST) and the Discrete Hartley Transform (DHT). The suggested algorithm requires fewer multiplications than other methods of computing the DCT. The suggested structure meets the architectural challenge: it is a simple, regular and cost-effective design for special-purpose systems.

Keywords: Processing Element, Systolic architecture, DST, DCT, and DHT.

### I. INTRODUCTION

The Discrete Cosine Transform (DCT) plays an important role in many digital signal processing (DSP) applications, since it is a good alternative to the DFT. It is used in most digital media, including digital images (such as JPEG, where some high-frequency components can be discarded because they are redundant), digital video, digital audio (MP3), digital radio (AAC+ and DAB+) and digital television (SDTV, HDTV). DCTs are also important for reducing the usage of network bandwidth and are used in finding numerical solutions of partial differential equations by spectral methods [1-4]. However, to meet the demands of real-time applications, a dedicated VLSI implementation of the DCT is inevitable [5-7]. Different styles of implementation have been reported in the literature, including multiplierless designs [8] and distributed-arithmetic ROM-based designs [9,10]. Different algorithms have been reported for computing the DCT, such as the FDCT (including radix-2 and radix-4) [11-12] and recursive algorithms [13-15]. Among them, the radix-4 approach offers fast results. There are essentially two ways to build a fast computer system: one is to use concurrency, and the other is to use fast components. Systolic structures are concurrent structures and are useful for implementing a variety of parallel algorithms [16-21].

Revised Manuscript Received on April 25, 2020. \* Correspondence Author

Anamika Jain\*, ECE department, Maharaja Agrasen Institute of Technology(MAIT) affiliated to GGSIP University, Delhi, India. Email: anaamikajain@yahoo.co.in

**Prof. Neeta Pandey**, ECE department, Delhi Technological University(DTU) Delhi University, Delhi, India. Email: neetapandey@dce.ac.in

© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an <u>open access</u> article under the CC BY-NC-ND license (<u>http://creativecommons.org/licenses/by-nc-nd/4.0/</u>)

Retrieval Number: D9047049420/2020©BEIESP DOI: 10.35940/ijeat.D9047.049420 Journal Website: <u>www.ijeat.org</u>

These features are useful for developing an efficient 1D, length-N DCT algorithm as well as its systolic architecture. In the present work, a new algorithm and a concurrent structure for DCT computation are suggested, together with a systolic architecture. The paper is organized as follows. Section II presents the new algorithm for computing the DCT. Section III shows the systolic architecture for the presented algorithm. For a clearer explanation, the computation of the DCT of an N=16 point sequence is presented in Section IV. The comparison and conclusion are discussed in Section V.

### II. A NEW ALGORITHM FOR DCT

The Discrete Cosine Transform (DCT) of an N-point sequence x[n] is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi k (2n+1)}{2N}\right), \quad k = 0, 1, 2, \ldots, N-1 \quad (1)$$
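As a reference point, eq. (1) can be evaluated term by term. The following is a minimal Python sketch, not part of the proposed architecture; the function name `dct_direct` is illustrative, and no normalization factor is applied because eq. (1) does not include one.

```python
import math

def dct_direct(x):
    """Evaluate eq. (1) directly: X[k] = sum_n x[n] * cos(pi*k*(2n+1) / (2N))."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
        for k in range(N)
    ]

# Small usage example with a 4-point input.
x = [1.0, 2.0, 3.0, 4.0]
X = dct_direct(x)   # X[0] is the plain sum of the inputs
```

This direct form needs N multiplications per coefficient, i.e. N<sup>2</sup> in total, which is the baseline against which the grouped formulation is compared.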

Dividing eq. (1) into four groups:

$$X[k] = \sum_{n=0}^{\frac{N}{4}-1} x[n] \cos\left(\frac{\pi k(2n+1)}{2N}\right) + \sum_{n=\frac{N}{4}}^{\frac{N}{2}-1} x[n] \cos\left(\frac{\pi k(2n+1)}{2N}\right) + \sum_{n=\frac{N}{2}}^{\frac{3N}{4}-1} x[n] \cos\left(\frac{\pi k(2n+1)}{2N}\right) + \sum_{n=\frac{3N}{4}}^{N-1} x[n] \cos\left(\frac{\pi k(2n+1)}{2N}\right), \quad k = 0, 1, 2, \ldots, N-1$$
(2)

$$X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] \cos(\theta_k) + x\left[n + \frac{N}{4}\right] \cos\left(\theta_k + \frac{\pi k}{4}\right) + x\left[n + \frac{N}{2}\right] \cos\left(\theta_k + \frac{\pi k}{2}\right) + x\left[n + \frac{3N}{4}\right] \cos\left(\theta_k + \frac{3\pi k}{4}\right) \right\}$$
(3)

Let

$$x[n] = A, \quad x\left[n + \frac{N}{4}\right] = B, \quad x\left[n + \frac{N}{2}\right] = C, \quad x\left[n + \frac{3N}{4}\right] = D$$

and $\theta_k = \frac{\pi k(2n+1)}{2N}$.
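The four-group regrouping can be checked numerically. The sketch below is an illustration under the definitions of A, B, C, D and θ given above; the function names are hypothetical. It evaluates the regrouped sum of eq. (3) and compares it with the direct evaluation of eq. (1).

```python
import math

def dct_direct(x):
    """Direct evaluation of eq. (1)."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
        for k in range(N)
    ]

def dct_four_groups(x):
    """Evaluate eq. (3): one sum over n = 0..N/4-1 with four phase-shifted cosines."""
    N = len(x)
    if N % 4 != 0:
        raise ValueError("N must be a multiple of 4")
    Q = N // 4
    X = []
    for k in range(N):
        acc = 0.0
        for n in range(Q):
            theta = math.pi * k * (2 * n + 1) / (2 * N)
            A, B, C, D = x[n], x[n + Q], x[n + 2 * Q], x[n + 3 * Q]
            acc += (A * math.cos(theta)
                    + B * math.cos(theta + math.pi * k / 4)
                    + C * math.cos(theta + math.pi * k / 2)
                    + D * math.cos(theta + 3 * math.pi * k / 4))
        X.append(acc)
    return X
```

Shifting the index n by N/4 adds πk/4 to the cosine argument, which is why the three extra phase terms are multiples of πk/4.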

Published By: Blue Eyes Intelligence Engineering & Sciences Publication © Copyright: All rights reserved.




Equation (3) can be written as

$$X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ A\cos(\theta_k) + B\cos\left(\theta_k + \frac{\pi k}{4}\right) + C\cos\left(\theta_k + \frac{\pi k}{2}\right) + D\cos\left(\theta_k + \frac{3\pi k}{4}\right) \right\}$$
(4)

Expanding each term with $\cos(\theta + \phi) = \cos\theta\cos\phi - \sin\theta\sin\phi$ gives

$$X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos(\theta_k) \left[ A + B\cos\left(\frac{\pi k}{4}\right) + C\cos\left(\frac{\pi k}{2}\right) + D\cos\left(\frac{3\pi k}{4}\right) \right] - \sin(\theta_k) \left[ B\sin\left(\frac{\pi k}{4}\right) + C\sin\left(\frac{\pi k}{2}\right) + D\sin\left(\frac{3\pi k}{4}\right) \right] \right\}$$
(5)

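Let $k = 4p + q$, so that $X[k] = X[4p+q]$, where $p = 0, 1, \ldots, N/4-1$ and $q = 0, 1, 2, 3$, and write $\theta_{p,q}$ for $\theta_k$ evaluated at $k = 4p+q$.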

Case 1: q=0

$$X[4p] = \sum_{n=0}^{\frac{N}{4}-1} \cos\left(\theta_{p,q}\right) \left[ A + B\cos(\pi p) + C\cos(2\pi p) + D\cos(3\pi p) \right]$$
(6)

i.e.

$$X[4p] = \sum_{n=0}^{\frac{N}{4}-1} \cos\left(\theta_{p,q}\right) \left[ A + C + (-1)^{p} \left( B + D \right) \right]$$
(7)

Case 2: q=1

Equation (5) reduces to

$$X[4p+1] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos\left(\theta_{p,q}\right) \left(A + \frac{(-1)^{p}}{\sqrt{2}}\left(B - D\right)\right) + \cos\left(\theta_{p,q} + \frac{\pi}{2}\right) \left(C + \frac{(-1)^{p}}{\sqrt{2}}\left(B + D\right)\right) \right\}$$
(8)
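As a worked step for this case (a sketch using only standard trigonometric identities), the fixed-angle factors at $k = 4p+1$ are

$$\cos\frac{\pi k}{4} = \frac{(-1)^p}{\sqrt{2}}, \quad \cos\frac{\pi k}{2} = 0, \quad \cos\frac{3\pi k}{4} = -\frac{(-1)^p}{\sqrt{2}}, \quad \sin\frac{\pi k}{4} = \frac{(-1)^p}{\sqrt{2}}, \quad \sin\frac{\pi k}{2} = 1, \quad \sin\frac{3\pi k}{4} = \frac{(-1)^p}{\sqrt{2}}$$

Substituting these into the angle-sum expansion $\cos(\theta+\phi) = \cos\theta\cos\phi - \sin\theta\sin\phi$ of each term, the cosine part collects to $A + \frac{(-1)^p}{\sqrt{2}}(B-D)$ and the sine part to $C + \frac{(-1)^p}{\sqrt{2}}(B+D)$. Writing $-\sin\theta = \cos\left(\theta + \frac{\pi}{2}\right)$ then gives the form of eq. (8).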

Similarly, Case 3: q=2

$$X[4p+2] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos\left(\theta_{p,q}\right)(A-C) + (-1)^{p} \cos\left(\theta_{p,q} + \frac{\pi}{2}\right)(B-D) \right\}$$
(9)

And Case 4: q=3

$$X[4p+3] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos\left(\theta_{p,q}\right) \left(A - \frac{(-1)^{p}}{\sqrt{2}}(B-D)\right) - \cos\left(\theta_{p,q} + \frac{\pi}{2}\right)\left(C - \frac{(-1)^{p}}{\sqrt{2}}(B+D)\right) \right\}$$
(10)

Therefore, for even values of q (q=0, 2) the generalized equation is:


$$X[4p+q] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos\left(\theta_{p,q}\right) \left(A + (-1)^{\frac{q}{2}}C\right) + (-1)^{p} \cos\left(\theta_{p,q} + \frac{q\pi}{4}\right) \left(B + (-1)^{\frac{q}{2}}D\right) \right\}$$
(11)

Similarly, the generalized equation for odd values of q (q=1, 3) is:

$$X[4p+q] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \cos\left(\theta_{p,q}\right) \left( A + \frac{(-1)^{\frac{3+q}{2}} (-1)^{p}}{\sqrt{2}} \left( B - D \right) \right) + (-1)^{\frac{3+q}{2}} \cos\left(\theta_{p,q} + \frac{\pi}{2}\right) \left( C + \frac{(-1)^{\frac{3+q}{2}} (-1)^{p}}{\sqrt{2}} \left( B + D \right) \right) \right\}$$
(12)

Equations (11) and (12) are the generalized equations for computing the DCT coefficients, and they have been used to realize the systolic DCT structure.
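The grouped formulation can be sketched in Python and checked against the direct definition of the DCT. This is an illustrative sketch, not the hardware mapping. The names `dct_direct` and `dct_grouped` are hypothetical. For odd q the sign factor is taken as +1 for q=1 and -1 for q=3, so that the code reproduces the q=1 and q=3 cases, eqs. (8) and (10).

```python
import math

def dct_direct(x):
    """Direct evaluation: X[k] = sum_n x[n] * cos(pi*k*(2n+1) / (2N))."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
        for k in range(N)
    ]

def dct_grouped(x):
    """Evaluate X[4p+q] from the even-q and odd-q grouped forms."""
    N = len(x)
    if N % 4 != 0:
        raise ValueError("N must be a multiple of 4")
    Q = N // 4
    X = [0.0] * N
    for p in range(Q):
        sp = -1.0 if p % 2 else 1.0                      # (-1)^p
        for q in range(4):
            k = 4 * p + q
            acc = 0.0
            for n in range(Q):
                A, B, C, D = x[n], x[n + Q], x[n + 2 * Q], x[n + 3 * Q]
                th = math.pi * k * (2 * n + 1) / (2 * N)  # theta_{p,q}
                if q % 2 == 0:
                    sq = -1.0 if (q // 2) % 2 else 1.0    # (-1)^(q/2)
                    acc += math.cos(th) * (A + sq * C)
                    acc += sp * math.cos(th + q * math.pi / 4) * (B + sq * D)
                else:
                    sg = -1.0 if ((3 + q) // 2) % 2 else 1.0  # +1 for q=1, -1 for q=3
                    r = sg * sp / math.sqrt(2.0)
                    acc += math.cos(th) * (A + r * (B - D))
                    acc += sg * math.cos(th + math.pi / 2) * (C + r * (B + D))
            X[k] = acc
    return X
```

In this form the bracketed combinations of A, B, C and D depend only on the parity of p and on q, so they can be formed once by adders and subtractors before any cosine multiplication takes place.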

## III. BLOCK DIAGRAM OF THE PROPOSED ALGORITHM

In designing special-purpose systems, cost-effectiveness and fast solutions are always the main concerns. Fast computation can be achieved by using fast algorithms, and cost can be reduced by using suitable architectures. Great savings can be achieved by decomposing a structure into a few simple substructures that are used repetitively with simple interfaces. This is especially true for VLSI designs, where a single chip comprises hundreds of thousands of components. Systolic architectures are concurrent structures that map high-level computations onto hardware. In a systolic system, data flows in a rhythmic fashion, passing through many processing elements before it returns to memory, so high throughput can be achieved with balanced memory bandwidth. Figure 1 shows a basic block diagram of the proposed algorithm, in which the input data is pre-processed to reduce the number of multiplications required to calculate the DCT coefficients.



Fig 1. Block diagram of the proposed algorithm

The pre-processing units contain only adders and subtractors. Different combinations of the inputs, generated by the pre-processing unit, are fed to the systolic array of processing elements (PEs). An example of a pre-processing (PP) unit and a processing element (PE) for two input values is shown in Figure 2. The outputs of the pre-processing unit are the sum and the difference of the two inputs. In the systolic array, a single processing element has four inputs $X_i$, $Y_i$, $U_i$, $V_i$. Two multipliers are used in the processing element, and its outputs are $U_o$ and $V_o$. The inputs $X_i$ and $Y_i$ are passed on to the other processing elements as $X_o$ and $Y_o$.






[Figure 2 panels: Pre-processor (PP); Processing Element (PE)]

In the processing element, two multipliers $\alpha$ and $\beta$ are used:

$$U_o = U_i + \alpha X_i, \qquad V_o = V_i + \beta Y_i$$

Fig 2. Pre-processing unit and processing element for two input samples
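The behaviour of the two units can be modelled in a few lines of Python. This is a behavioural sketch only, assuming one multiply-accumulate per PE input pair as in the relations for $U_o$ and $V_o$; the class and function names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass

def preprocess(a, b):
    """Pre-processing unit for two inputs: returns their sum and difference."""
    return a + b, a - b

@dataclass
class ProcessingElement:
    alpha: float   # multiplier applied to the X input
    beta: float    # multiplier applied to the Y input

    def step(self, x_i, y_i, u_i, v_i):
        """One PE step: accumulate into U and V, pass X and Y through unchanged."""
        u_o = u_i + self.alpha * x_i
        v_o = v_i + self.beta * y_i
        return x_i, y_i, u_o, v_o

# Usage: feed the sum and difference of two samples into one PE.
s, d = preprocess(3.0, 1.0)
pe = ProcessingElement(alpha=0.5, beta=2.0)
x_o, y_o, u_o, v_o = pe.step(s, d, u_i=1.0, v_i=1.0)
```

Because X and Y leave the cell unchanged, the same pre-processed values can be reused by neighbouring cells with different multiplier constants, while the partial sums U and V accumulate along the array.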

For N=16, Figure 3 shows that there are N/4 pre-processing units in stage 1 (PP1, PP2, PP3 and PP4), where the inputs are added and subtracted to provide the inputs to the next stage of pre-processing (PP5).

Fig 3. Pre-processing unit for N=16




The outputs of PP-5 are I0 to I11, where:

- I0-I7: (An+Cn) $\pm$ (Bn+Dn), n=0 to N/4-1 (I0, I1, I2, I3: addition; I4, I5, I6, I7: subtraction)
- I8-I11: (An-Cn) - (B((N/4)-1-n)-D((N/4)-1-n)), n=0 to N/4-1

The outputs of PP-7 are I12 to I27, where:

- I12-I19: [An $\pm$ (Bn-Dn)/$\sqrt{2}$] = [An $\pm$ (1/$\sqrt{2}$ of the outputs from PP-3 and PP-4)]
- I20-I27: [Cn $\pm$ (Bn+Dn)/$\sqrt{2}$] = [Cn $\pm$ (1/$\sqrt{2}$ of the outputs from PP-3 and PP-4)]

For example, I12=A0+(B0-D0)/$\sqrt{2}$, I13=A0-(B0-D0)/$\sqrt{2}$, I14=A1+(B1-D1)/$\sqrt{2}$, I15=A1-(B1-D1)/$\sqrt{2}$, with I16, I17 for n=2 and I18, I19 for n=3; I20=C0+(B0+D0)/$\sqrt{2}$, I21=C0-(B0+D0)/$\sqrt{2}$.

**Systolic array of PE**: The architecture is a pipelined network arrangement of processing elements (PEs) called cells. It is a specialized form of parallel computing, where cells compute the data arriving at their inputs and store the results independently. A systolic architecture is an array composed of matrix-like rows of cells. Each cell shares its information with its neighbors immediately after processing.

For even DCT coefficients X[k] (k=0,4,8,12), i.e. X[4p+q] with q=0:

X[0]=(I0+I3+I1+I2)\*1

X[8]=(I0+I3-I1-I2)\*(-β)

X[4]=(I4-I7)\*γ+(I5-I6)\*α

X[12]=(I4-I7)\*α+(I5-I6)\*γ

Here α=0.3826, γ=0.9238, β=0.7071.

Figure 4 shows the systolic array unit for computing the even coefficients whose indices are multiples of N/4.

Fig 4. Processing unit for computation of the coefficients that are multiples of N/4

The remaining even coefficients X(2,6,10,14), for which q=2, are obtained as

X[2]=(I8\*(a)+I9\*(c)+I10\*(-d)+I11\*(-b))

X[6]=(I8\*(c)+I9\*(-b)+I10\*(-a)+I11\*(-d))

X[10]=(I8\*(d)+I9\*(-a)+I10\*(b)+I11\*(c))

X[14]=(I8\*(b)+I9\*(-d)+I10\*(c)+I11\*(-a))

Multiplication factors: a=0.9807, b=0.1950, c=0.8314, d=0.5555.

Systolic array of PE (one multiplier is used in these PEs)








Fig 5. Processing element for computation of even coefficients of X[K]
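For N=16, the coefficients X[0], X[4], X[8] and X[12] depend only on the pre-processed sums I0 to I7. The following Python sketch illustrates this under stated assumptions: the indexing of I0 to I7 follows the description of the PP-5 outputs, the cosine weights are derived from the q=0 case, eq. (7), rather than hard-coded, and all function names are hypothetical.

```python
import math

def preprocess_sums(x):
    """I[0..3] = (A+C)+(B+D) and I[4..7] = (A+C)-(B+D) for a 16-point input."""
    if len(x) != 16:
        raise ValueError("this sketch assumes N = 16")
    Q = 4
    I = [0.0] * 8
    for n in range(Q):
        A, B, C, D = x[n], x[n + Q], x[n + 2 * Q], x[n + 3 * Q]
        I[n] = (A + C) + (B + D)
        I[4 + n] = (A + C) - (B + D)
    return I

def multiples_of_four(x):
    """Return {k: X[k]} for k = 0, 4, 8, 12 using only I0..I7."""
    N = 16
    I = preprocess_sums(x)
    out = {}
    for p in range(4):
        k = 4 * p
        base = 0 if p % 2 == 0 else 4   # even p uses the sums, odd p the differences
        out[k] = sum(
            math.cos(math.pi * k * (2 * n + 1) / (2 * N)) * I[base + n]
            for n in range(4)
        )
    return out
```

The cosine weights that appear here have magnitudes 1, 0.7071, 0.9238 and 0.3826, which correspond to the multiplication factors β, γ and α listed for these coefficients.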

The processing for odd values of k (1, 3, 5, 7, 9, 11, 13 and 15):

X[1]=m0\*I12+m2\*I14+m4\*I16+m6\*I18-m1\*I20-m3\*I22-m5\*I24-m7\*I26

X[5]=m4\*I13+m1\*I15-m6\*I17-m2\*I19-m5\*I21-m0\*I23-m7\*I25+m3\*I27

X[9]=m7\*I12-m4\*I14-m3\*I16+m0\*I18-m6\*I20-m5\*I22+m2\*I24+m1\*I26

X[13]=m3\*I13-m6\*I15+m0\*I17-m4\*I19-m2\*I21+m7\*I23-m1\*I25-m5\*I27

X[3]=m2\*I13+m7\*I15+m1\*I17-m5\*I19+m3\*I21+m6\*I23+m0\*I25+m4\*I27

X[7]=m6\*I12-m5\*I14-m2\*I16+m1\*I18+m7\*I20+m4\*I22-m3\*I24-m0\*I26

X[11]=m5\*I13-m0\*I15+m7\*I17+m3\*I19+m4\*I21-m1\*I23-m6\*I25+m2\*I27

X[15]=m1\*I12-m3\*I14+m5\*I16-m7\*I18+m0\*I20-m2\*I22+m4\*I24-m6\*I26





Fig 6. Systolic array of PE for computing odd DCT coefficients (outputs, in order: X[1], X[5], X[9], X[13], X[7], X[3], X[15], X[11])


Comparison: The suggested algorithm for computing the DCT has reduced computational complexity. For N=16, the number of multiplications is reduced to nearly 1/3 of N<sup>2</sup> (the number of multiplications required by the general method). The number of PEs required to calculate all the DCT coefficients of a sequence of length N=16 is 43, and the number of multipliers per PE is 2. Latency is defined as the time required to complete the DCT computation. Even and odd coefficients are computed in parallel, so the overall latency is set by the odd coefficients, which take the longest. Computing the even coefficients requires only 11 PEs (with two multipliers each); the latency is T for k=0,8, 2T for k=4,12, and 5T for k=2,6,10,14. The maximum delay occurs for the odd values k=1,3,5,7,9,11,13,15, for which the latency is 11T. Therefore, the overall latency is considered to be 11T. Here T=(m+a) denotes the time required for a multiplication (m) and an addition (a). Table 1 compares different systolic architectures for computing the DCT. It can be seen from the table that the proposed architecture is efficient: it requires fewer processing elements and is faster, since its latency is lower.

Table 1. Performance comparison of systolic architecture for computing N=16 point DCT:

|                              | No. of real multipliers | Required no. of PEs  | Latency    | Throughput |
|------------------------------|-------------------------|----------------------|------------|------------|
| S. B. Pan and R. H. Park [16] | 2                       | N<sup>2</sup>/4=64   | (N-1)T=15T | T          |
| Chang and Wang [22]          | 1                       | N<sup>2</sup>/2=128  | (N-1)T=15T | T          |
| Proposed                     | 2                       | N<sup>2</sup>/6≈43   | (N-5)T=11T | T          |
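The figures in Table 1 can be reproduced with simple arithmetic. The sketch below assumes that the total multiplier count of a design is the number of PEs times the multipliers per PE; this is an upper bound for the proposed design, since some of its PEs are described as using a single multiplier. The dictionary keys are illustrative.

```python
N = 16

# PE counts and latencies (in units of T) as listed in Table 1.
designs = {
    "Pan and Park [16]": {"mult_per_pe": 2, "pes": N**2 // 4, "latency_T": N - 1},
    "Chang and Wang [22]": {"mult_per_pe": 1, "pes": N**2 // 2, "latency_T": N - 1},
    "Proposed": {"mult_per_pe": 2, "pes": round(N**2 / 6), "latency_T": N - 5},
}

# Assumed total multipliers = PEs x multipliers per PE.
for d in designs.values():
    d["total_multipliers"] = d["mult_per_pe"] * d["pes"]

# Ratio of the proposed multiplier count to the N^2 multiplications of the direct method.
ratio = designs["Proposed"]["total_multipliers"] / N**2
```

Under this assumption the proposed design uses 86 multipliers against N<sup>2</sup>=256 multiplications for the direct method, a ratio of about 0.34, which agrees with the "nearly 1/3" stated in the comparison.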

## **IV. CONCLUSION:**

A new algorithm for computing the DCT and its VLSI implementation using a parallel and pipelined network of processing elements are proposed in this paper. The array of processing elements works in a systolic fashion, which makes the computation fast.

### REFERENCES

- 1 Y. Han, W. He, S. Ji, Q. Luo, "A digital watermarking algorithm of color image based on visual cryptography and discrete cosine transform", P2P Parallel grid cloud and internet computing (3PGCIC) 2014 Ninth International conference on pp.525-530,2014.
- 2 C. Li, Z. Qin, "A blind digital image watermarking algorithm based on DCT", smart and sustainable city 2013 (ICSSC 2013) IET International conference on, pp.446-448,2013.
- 3 L. Tan, Z.J. Fang, "An adaptive middle frequency embedded digital watermark algorithm based on DCT domain", Management of e-commerce and e-Government 2008.ICMECG '08. International conference on, pp.382-385,2008.
- 4 M. Saidi, H. Hermassi, R. Rhouma, et al., "A new adaptive image steganography scheme based on DCT and chaotic map", Multimed. Tools Appl 76,13493-13510 (2017).
- 5 S. An, C. Wang, "Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform" IET Image Process. 2(6),286-294 (2008).
- 6 A. Kassem, M. Hamad, E. Haidamous, "Image compression on FPGA using DCT" International conference on Advances in Computational Tools for Engineering Applications, 2009, ACTEA '09, pp.320-323,15-17 July 2009
- 7 An, S., Wang, C. "Recursive algorithm, architectures and FPGA implementation of two-dimensional discrete cosine transform" IET Image Process 2(6),286-294 (2008).

- 8 S. Choomchuay, "A bit-serial architecture for a multiplierless DCT" Journal of Information and communication Technology, vol.2(1),pp.15-30, January 2020
- 9 Y. H. Chen., T. Y. Chang, C. Y. Li, "High throughput DA based DCT with high accuracy error-compensated adder tree" IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (99),1-5 (2010)
- 10 V. Vinetha Kasturi, Y. Syamala, "VLSI architecture for DCT based on distributed arithmetic" Int. J. Eng. Res. Technol. (IJERT) 2(5),2013.
- 11 G. Deng, "Fast algorithm for zero-phase linear filter using discrete cosine transform" The Institute of Engineering and Technology (IET), vol. 55, Issue 10, pp.621-623, 16 May 2019.
- 12 M. N. Murthy, "Radix-2 algorithms for implementation of type-II discrete cosine transform and discrete sine transform" Int. J. Eng. Res. Appl.3(3), 602-608 (2013).
- 13 P. Dahiya and P. Jain, "Realization of Second-Order Structure of Recursive Algorithm for Discrete Cosine Transform" Circuits, Systems, and Signal Processing, vol. 38, pp.791-804. (2019).
- 14 P. Dahiya and P. Jain, "Efficient MDCT Recursive Structure for VLSI Implementation" Circuits, Systems, and Signal Processing, vol. 39, pp. 1372-1386, (2020).
- 15 Y. H. Chan, L. P. Chau, W.C. Siu, "Efficient implementation of discrete cosine transform using recursive filter structure" IEEE Trans. Circuits syst. Video Technol. 4(6),pp. 550-552 (1994).
- 16 S. B. Pan, R. H. Park, "Unified systolic array for fast computation of discrete cosine transform, discrete sine transform and, discrete Hartley transform", IEEE Transactions on circuits and System for Video Technology 7 (2),413-419(1997).
- 17 D. F. Chiper, M. N. S. Swamy , M. O. Ahmad, "Systolic algorithm and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST" IEEE Trans. Circuits Syst. I Regul. Pap. 52(6), 1125-1137 (2005).
- 18 C. Chakrabarti and J. Jaja, "Systolic architectures for the computation of the discrete Hartley and the discrete cosine transforms based on prime factor decomposition", IEEE Transactions on Computers 39 (11),1359-1368, (1990).
- 19 C. Cheng, K. K. Parhi, "A novel systolic array structure for DCT" IEEE Trans. Circuits Syst. II Express Briefs 52(7), 366-369 (2005).
- 20 P. K. Meher, J. C. Patra, "A new convolutional formulation of discrete cosine transform for systolic implementation", IEEE International Conference on Information, Communications and Signal Processing (2007), pp. 1-4.
- 21 M. N. Murthy, "Recursive algorithms and systolic architectures for realization of type-II discrete cosine transform and inverse discrete cosine transform" Int. J. Eng. Res. Appl. 4, 24-32 (2014).
- 22 Y. T. Chang and C. L Wang, "New systolic array implementation of the 2D discrete cosine transform and its inverse", IEEE Transactions on Circuits and Systems Video Technology, CSVT-5,31-40, (1995).

## **AUTHORS PROFILE**



Anamika Jain, B.E., M.E. in Electronics and Communication Engineering, Assistant Professor, ECE department, Maharaja Agrasen Institute of Technology (MAIT), affiliated to Guru Gobind Singh Indraprastha University. More than 18 years of teaching experience. Her fields of interest are VLSI and DSP.



**Prof. Neeta Pandey,** M. E. (Microelectronics), Ph.D., Professor at Delhi Technological University (DTU) with approximately 30 years of teaching and research experience in Electronics and Communication Engineering. More than 250 international journal publications. Her areas of interest are Analog and Digital VLSI Design and Current mode ADC Design.


