
## Fast Fourier Transform in Parallel

Below is the parallel version of our Danielson-Lanczos program, for two-dimensional data.

PROGRAM danielson_lanczos

  IMPLICIT NONE

  REAL(kind=8), PARAMETER :: pi = 3.141592653589793_8
  INTEGER :: i, j
  COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y
  REAL(kind=8) :: dx, dy

  DO i = 0, 7
     DO j = 0, 7
        dx = 2.0_8 * pi / 8.0_8 * i
        dy = 2.0_8 * pi / 8.0_8 * j
        x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8)
     END DO
  END DO
  y = x
  CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
  y = -2.0_8 * pi / 64.0_8 * y
  WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
  WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
  STOP 0

CONTAINS

  FUNCTION bit_reverse(i, size)

    IMPLICIT NONE

    INTEGER :: bit_reverse
    INTEGER, INTENT(in) :: i, size

    INTEGER :: length, temp

    temp = i
    DO length = size, 2, -1
       temp = ISHFTC(temp, -1, length)
    END DO
    bit_reverse = temp

  END FUNCTION bit_reverse

  SUBROUTINE swap (a, b)

    IMPLICIT NONE

    COMPLEX(kind=8), DIMENSION(:) :: a
    COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b
    COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold

    ALLOCATE(hold(SIZE(a)))
    hold = a
    a = b
    b = hold
    DEALLOCATE(hold)

  END SUBROUTINE swap

  SUBROUTINE fft(x)

    IMPLICIT NONE

    COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x

    DOUBLE PRECISION, PARAMETER :: pi = 3.141592653589793_8
    INTEGER :: length, number_of_bits, i, j, mmax, istep, m
    COMPLEX(kind=8) :: wp, w, ws
    REAL(kind=8) :: theta
    COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp

    length = SIZE(x, 2)
    ! NINT guards against round-off in single-precision LOG,
    ! e.g. LOG(8.0)/LOG(2.0) evaluating to 2.9999...
    number_of_bits = NINT(LOG(REAL(length))/LOG(2.0))
    IF (2**number_of_bits .NE. length) THEN
       WRITE(*,*) 'fft: error: input data length must be a power of 2'
       STOP 10
    ELSE
       DO i = 1, length-2
          j = bit_reverse(i, number_of_bits)
          IF (j .GT. i) CALL swap(x(:,i), x(:,j))
       END DO
    END IF

    ALLOCATE(temp(SIZE(x, 1)))
    mmax = 1
    DO
       IF (length .LE. mmax) EXIT
       istep = 2 * mmax
       theta = - pi/mmax
       wp = CMPLX(-2.0_8 * SIN(0.5_8*theta)**2, SIN(theta), kind=8)
       w = CMPLX(1.0_8, 0.0_8, kind=8)
       DO m = 1, mmax
          ws = w
          DO i = m-1, length-1, istep
             j = i+mmax
             temp = ws*x(:,j)
             x(:,j) = x(:,i) - temp
             x(:,i) = x(:,i) + temp
          END DO
          w = w*wp + w
       END DO
       mmax = istep
    END DO
    DEALLOCATE(temp)

  END SUBROUTINE fft

END PROGRAM danielson_lanczos
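Before walking through the diffs, it is worth noting what output to expect. The transform of sin(x) sin(y) on an 8x8 grid concentrates on four corner modes, and the -2*pi/64 scaling turns the peaks into +/- pi/2, about +/- 1.57. A quick cross-check in Python (assuming NumPy is available; its forward FFT uses the same exp(-2*pi*i*k*n/N) sign convention as theta = -pi/mmax above):

```python
import numpy as np

# Reproduce the program's input: x(i,j) = sin(2*pi*i/8) * sin(2*pi*j/8)
n = 8
v = np.sin(2.0 * np.pi * np.arange(n) / n)
x = np.outer(v, v)

# fft2 plays the role of the two fft/TRANSPOSE passes; then apply the
# same -2*pi/64 scaling as the program.
y = -2.0 * np.pi / 64.0 * np.fft.fft2(x)

# The transform of sin*sin lives on the modes (1,1), (1,7), (7,1), (7,7);
# the scaling turns them into +/- pi/2.
print(round(y[1, 1].real, 2), round(y[1, 7].real, 2))  # prints: 1.57 -1.57
```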


This program differs very little from our original one-dimensional program.

I will outline those differences by discussing diffs between the two source files:

1.
6,16c6,10
<   INTEGER :: i, j
<   COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y
<   REAL(kind=8) :: dx, dy
<
<   DO i = 0,7
<      DO j = 0,7
<         dx = 2.0_8 * pi / 8.0_8 * i
<         dy = 2.0_8 * pi / 8.0_8 * j
<         x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8)
<      END DO
<   END DO
---
>   INTEGER :: i
>   COMPLEX(kind=8), DIMENSION(8) :: x, y
>
>   x = (/ (i, i=0,7) /); x = 2.0_8 * pi / 8.0_8 * x; x = SIN(x) + COS(2 * x) &
>       - SIN(3 * x)

This is the first difference: this time my arrays x and y are two-dimensional. For clarity, instead of using an implied-DO loop, I have used an explicit calculation, where dx and dy measure the distance from the origin, (0,0), along each axis, and the input data is of the form sin(x) sin(y).
2.
18,21c12,15
<   CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
<   y = -2.0_8 * pi / 64.0_8 * y
<   WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
<   WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
---
>   CALL fft(y)
>   y = 2.0_8 * pi / 8.0_8 * y
>   WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') x
>   WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') y

Here, instead of just calling fft once, we call it twice. This is still the same fft as before, but this time it operates on whole columns of matrix x, not on scalar entries of vector x. Every manipulation that we have discussed before is now performed on all column entries at once. This is where the parallelism comes in. There is no communication between different rows, hence, if we were to implement this program in High Performance Fortran, the matrix should be distributed row-wise, i.e., different rows should live on different processors. This way our column manipulations will not require expensive inter-process communications.

The 2-dimensional Fourier Transform works as follows: first we need to find the Fourier Transform in the x direction, and then we take the Fourier Transform of that intermediate result in the y direction.

Because our subroutine fft transforms along one direction only, and that direction is always x, in order to evaluate the Fourier Transform in the y direction we need to transpose the matrix, thus interchanging the directions, and then call subroutine fft again.

The returned result must be transposed back to normal.

This is the only place where communication is involved in our program. Matrix transpose is a fairly costly operation.
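The whole sequence (transform along x, transpose, transform again, transpose back) can be sanity-checked with NumPy; this is my own sketch, the original program needs no external libraries:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

# Mirror: CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
step1 = np.fft.fft(a, axis=1)          # transform along one direction
two_d = np.fft.fft(step1.T, axis=1).T  # transpose, transform, transpose back

# The result should agree with a direct two-dimensional transform.
assert np.allclose(two_d, np.fft.fft2(a))
print("transpose-based 2D FFT matches fft2")
```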

Observe that I have also changed the value of the scaling constant. In this case I wanted the value to match what we have used in our heat diffusion program for the initial temperature distribution, T0(x,y), so that we can make a quick check if the program runs and if it returns correct results.

3.
43a35,36
>     COMPLEX(kind=8) :: a, b
>     COMPLEX(kind=8) :: hold
45,51d37
<     IMPLICIT NONE
<
<     COMPLEX(kind=8), DIMENSION(:) :: a
<     COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b
<     COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold
<
<     ALLOCATE(hold(SIZE(a)))
55d40
<     DEALLOCATE(hold)

This is the change from subroutine swap. This time we're swapping two columns of a matrix, no longer two scalar numbers. The columns must be of the same length, of course, and the auxiliary variable hold must be an array too.

This array must be allocated once we know the size of the arrays to be swapped, and then deallocated when it is no longer needed.

4.
63c48
<     COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x
---
>     COMPLEX(kind=8), DIMENSION(0:), INTENT(inout) :: x
66,67c51
<     INTEGER :: length, number_of_bits, i, j, mmax, istep, m
<     COMPLEX(kind=8) :: wp, w, ws
---
>     COMPLEX(kind=8) :: wp, w, ws, temp
69c53
<     COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp
---
>     INTEGER :: length, number_of_bits, i, j, mmax, istep, m

Here we're already looking at declarations in subroutine fft. Observe that x is now an array of rank 2, or, as I prefer to put it, a two-dimensional array.

Variable temp must be made into an array this time. It is going to hold ws*x(:,j). And it must be made allocatable too, because we don't know its size a priori.

5.
71c55
<     length = SIZE(x, 2)
---
>     length = SIZE(x)

The length of the array of columns is now obtained by calling SIZE on the second dimension of the matrix.
6.
79c63
<           IF (j .GT. i) CALL swap(x(:,i), x(:,j))
---
>           IF (j .GT. i) CALL swap(x(i), x(j))

Instead of swapping two scalar entries of an array, we swap two whole columns. But if each row lives on its own separate processor, this operation should take exactly the same time as swapping two scalar entries of a one-dimensional array.
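The bit-reversal reordering itself deserves a note: FUNCTION bit_reverse builds the reversed index out of repeated circular shifts. Here is a sketch of the same trick in Python, with a helper rotate_right standing in for Fortran's ISHFTC (the helper name is my own):

```python
def rotate_right(x, width):
    """Circularly right-shift the rightmost `width` bits of x by one,
    mimicking Fortran's ISHFTC(x, -1, width)."""
    low = x & ((1 << width) - 1)
    high = x & ~((1 << width) - 1)
    return high | (low >> 1) | ((low & 1) << (width - 1))

def bit_reverse(i, size):
    """Reverse the lowest `size` bits of i, as in FUNCTION bit_reverse:
    each successive, ever narrower rotation deposits one more bit in its
    final, reversed position."""
    temp = i
    for length in range(size, 1, -1):  # DO length = size, 2, -1
        temp = rotate_right(temp, length)
    return temp

# For 3 bits, 6 = 110b maps to 3 = 011b, 1 = 001b maps to 4 = 100b, etc.
print([bit_reverse(i, 3) for i in range(8)])  # prints: [0, 4, 2, 6, 1, 5, 3, 7]
```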
7.
83d66
<     ALLOCATE(temp(SIZE(x, 1)))
95,97c78,80
<              temp = ws*x(:,j)
<              x(:,j) = x(:,i) - temp
<              x(:,i) = x(:,i) + temp
---
>              temp = ws*x(j)
>              x(j) = x(i) - temp
>              x(i) = x(i) + temp
103d85
<     DEALLOCATE(temp)

Array temp must be able to store a whole column, so the number of entries that we must allocate for it corresponds to the number of rows, which is the first dimension of the array (remember that the ordering is always rows followed by columns).
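The inner loop shown in this diff also relies on the trigonometric recurrence w = w*wp + w. Since wp = -2 sin^2(theta/2) + i sin(theta) is just exp(i*theta) - 1 in disguise, that step multiplies w by exp(i*theta) without calling SIN or COS inside the loop. A small check of the identity (my own sketch, standard library only):

```python
import cmath
import math

mmax = 4
theta = -math.pi / mmax
# wp = -2*sin(theta/2)**2 + i*sin(theta); by the half-angle identity
# -2*sin(theta/2)**2 = cos(theta) - 1, so wp = exp(i*theta) - 1.
wp = complex(-2.0 * math.sin(0.5 * theta) ** 2, math.sin(theta))
w = complex(1.0, 0.0)
for m in range(1, mmax + 1):
    w = w * wp + w  # multiplies w by exp(i*theta) each pass
    assert abs(w - cmath.exp(1j * m * theta)) < 1e-12
print("recurrence matches exp(i*m*theta)")
```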

And this is it. This is all we had to do to parallelise our one-dimensional Fast Fourier Transform program. If the program is to run on the SP and is to be compiled by HPF, we need to add the following HPF directive in main:

!HPF$DISTRIBUTE (BLOCK, *) x, y

This directive should be matched by corresponding directives in the subroutines of our program. Does this program work? Here is how it compiles and runs:

gustav@blanc:../src 16:51:02 !575$ f90 -o lanczos-p lanczos-p.f90
gustav@blanc:../src 16:53:40 !576 $ ./lanczos-p
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00)
( 0.00, 0.00) ( 0.71, 0.00) ( 1.00, 0.00) ( 0.71, 0.00) ( 0.00, 0.00) ( -0.71, 0.00) ( -1.00, 0.00) ( -0.71, 0.00)
( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00)
( 0.00, 0.00) ( -0.71, 0.00) ( -1.00, 0.00) ( -0.71, 0.00) ( 0.00, 0.00) ( 0.71, 0.00) ( 1.00, 0.00) ( 0.71, 0.00)
( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00)

( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 1.57, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( -1.57, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00)
( 0.00, 0.00) ( -1.57, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 1.57, 0.00)
STOP 0
gustav@blanc:../src 16:53:42 !577$

Comparison with the result discussed in P573 shows that we're right on the spot.

Although in our example we have worked with a square data matrix, subroutine fft should work with rectangular matrices too.
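That claim can be checked with a direct transcription of the column-wise algorithm into Python (my own sketch, assuming NumPy; it is not part of the original program), applied to a rectangular 5x8 array:

```python
import numpy as np

def fft_columns(x):
    """Danielson-Lanczos along the second dimension of x, mirroring
    SUBROUTINE fft: every butterfly acts on whole columns x[:, i]."""
    length = x.shape[1]
    bits = length.bit_length() - 1
    assert 2 ** bits == length, "length must be a power of 2"
    # Bit-reversal reordering of the columns (string reversal here
    # replaces the ISHFTC-based FUNCTION bit_reverse).
    for i in range(1, length - 1):
        j = int(format(i, f"0{bits}b")[::-1], 2)
        if j > i:
            x[:, [i, j]] = x[:, [j, i]]
    mmax = 1
    while length > mmax:
        istep = 2 * mmax
        theta = -np.pi / mmax
        wp = complex(-2.0 * np.sin(0.5 * theta) ** 2, np.sin(theta))
        w = complex(1.0, 0.0)
        for m in range(1, mmax + 1):
            ws = w
            for i in range(m - 1, length, istep):  # DO i = m-1, length-1, istep
                j = i + mmax
                temp = ws * x[:, j]
                x[:, j] = x[:, i] - temp
                x[:, i] = x[:, i] + temp
            w = w * wp + w
        mmax = istep
    return x

# Rectangular data: 5 rows, each transformed along its 8 columns at once.
rng = np.random.default_rng(1)
a = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))
assert np.allclose(fft_columns(a.copy()), np.fft.fft(a, axis=1))
print("column-wise FFT matches NumPy on a 5x8 array")
```

Only the number of columns must be a power of 2 for a single call; the number of rows is arbitrary, which is exactly what makes the rectangular case work.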

Zdzislaw Meglicki
2001-02-26