Below is the parallel version of our Danielson-Lanczos program, for a 2 dimensional data.
PROGRAM danielson_lanczos
IMPLICIT NONE
REAL(kind=8), PARAMETER :: pi = 3.141592653589793_8
INTEGER :: i, j
COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y
REAL(kind=8) :: dx, dy
DO i = 0,7
DO j = 0,7
dx = 2.0_8 * pi / 8.0_8 * i
dy = 2.0_8 * pi / 8.0_8 * j
x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8)
END DO
END DO
y = x
CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
y = -2.0_8 * pi / 64.0_8 * y
WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
STOP 0
CONTAINS
FUNCTION bit_reverse(i, size)
IMPLICIT NONE
INTEGER :: bit_reverse
INTEGER, INTENT(in) :: i, size
INTEGER :: length, temp
temp = i
DO length = size, 2, -1
temp = ISHFTC(temp, -1, length)
END DO
bit_reverse = temp
END FUNCTION bit_reverse
SUBROUTINE swap (a, b)
IMPLICIT NONE
COMPLEX(kind=8), DIMENSION(:) :: a
COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b
COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold
ALLOCATE(hold(SIZE(a)))
hold = a
a = b
b = hold
DEALLOCATE(hold)
END SUBROUTINE swap
SUBROUTINE fft(x)
IMPLICIT NONE
COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x
DOUBLE PRECISION, PARAMETER :: pi = 3.141592653589793_8
INTEGER :: length, number_of_bits, i, j, mmax, istep, m
COMPLEX(kind=8) :: wp, w, ws
REAL(kind=8) :: theta
COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp
length = SIZE(x, 2)
number_of_bits = INT(LOG(REAL(length))/LOG(2.0))
IF (2**number_of_bits .NE. length) THEN
WRITE(*,*) 'fft: error: input data length must be a power of 2'
STOP 10
ELSE
DO i = 1, length-2
j = bit_reverse(i, number_of_bits)
IF (j .GT. i) CALL swap(x(:,i), x(:,j))
END DO
END IF
ALLOCATE(temp(SIZE(x, 1)))
mmax = 1
DO
IF ( length .LE. mmax) EXIT
istep = 2 * mmax
theta = - pi/(mmax)
wp=CMPLX(-2.0_8 * SIN(0.5_8*theta)**2, SIN(theta), kind=8)
w=CMPLX(1.0_8, 0.0_8, kind=8)
DO m = 1, mmax
ws=w
DO i=m-1, length-1, istep
j=i+mmax
temp = ws*x(:,j)
x(:,j) = x(:,i) - temp
x(:,i) = x(:,i) + temp
END DO
w = w*wp + w
END DO
mmax = istep
END DO
DEALLOCATE(temp)
END SUBROUTINE fft
END PROGRAM danielson_lanczos
This program differs very little from our original one-dimensional program.
I will outline those differences by discussing diffs between
the two source files:
6,16c6,10 < INTEGER :: i, j < COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y < REAL(kind=8) :: dx, dy < < DO i = 0,7 < DO j = 0,7 < dx = 2.0_8 * pi / 8.0_8 * i < dy = 2.0_8 * pi / 8.0_8 * j < x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8) < END DO < END DO --- > INTEGER :: i > COMPLEX(kind=8), DIMENSION(8) :: x, y > > x = (/ (i, i=0,7) /); x = 2.0_8 * pi / 8.0_8 * x; x = SIN(x) + COS(2 * x) & > - SIN(3 * x)This is the first difference: this time my arrays
x and y
are two dimensional. For clarity, instead of using an implicit
DO loop, I have used explicit calculation, where dx
and dy measure the distance from the origin, (0,0), and
the input data is of the form of
18,21c12,15
< CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
< y = -2.0_8 * pi / 64.0_8 * y
< WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
< WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
---
> CALL fft(y)
> y = 2.0_8 * pi / 8.0_8 * y
> WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') x
> WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') y
Here, instead of just calling fft once, we call it twice.
This is still the same fft as before, but this time
it operates on whole columns of matrix x, not on scalar
entries of vector x. Every manipulation that we have
discussed before is now performed on all column entries at once.
This is where the parallelism comes in. There is no communication
between different rows, hence, if we were to implement
this program in High Performance Fortran, the matrix should
be distributed row-wise, i.e., different rows should live on different
processors. This way our column manipulations will not require
expensive inter-process communications.
The 2-dimensional Fourier Transform works as follows: first we need to find Fourier Transform in the x direction, and then we need to find Fourier Transform on the first Fourier Transform in the y direction.
Because our subroutine fft does only one direction and
it is always x, in order to evaluate Fourier Transform
in the y direction we need to transpose the matrix, thus
interchanging the directions, and then call subroutine fft
again.
The returned result must be transposed back to normal.
This is the only place where communication is involved in our program. Matrix transpose is a fairly costly operation.
Observe that I have also changed the value of the scaling constant. In this case I wanted the value to match what we have used in our heat diffusion program for the initial temperature distribution, T0(x,y), so that we can make a quick check if the program runs and if it returns correct results.
43a35,36 > COMPLEX(kind=8) :: a, b > COMPLEX(kind=8) :: hold 45,51d37 < IMPLICIT NONE < < COMPLEX(kind=8), DIMENSION(:) :: a < COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b < COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold < < ALLOCATE(hold(SIZE(a))) 55d40 < DEALLOCATE(hold)This is the change from subroutine
swap. This time
we're swapping two columns of a matrix, no longer two
scalar numbers. The columns must be of the same length, of
course, and the auxialiary variable hold must be
an array too.
This array must be allocated once we know the size of the arrays to be swapped, and then deallocated when it is no longer needed.
63c48 < COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x --- > COMPLEX(kind=8), DIMENSION(0:), INTENT(inout) :: x 66,67c51 < INTEGER :: length, number_of_bits, i, j, mmax, istep, m < COMPLEX(kind=8) :: wp, w, ws --- > COMPLEX(kind=8) :: wp, w, ws, temp 69c53 < COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp --- > INTEGER :: length, number_of_bits, i, j, mmax, istep, mHere we're already looking at declarations in subroutine
fft. Observe that x is now an array of rank
2, or, as I prefer to put it, a two-dimensional array.
Variable temp must be made into an array this time.
It is going to hold ws*x(:,j). And it must be
made allocatable to, because, we don't know its size
a priori.
71c55 < length = SIZE(x, 2) --- > length = SIZE(x)The length of the array of columns is not obtained by calling
SIZE on the second dimension of the matrix.
79c63 < IF (j .GT. i) CALL swap(x(:,i), x(:,j)) --- > IF (j .GT. i) CALL swap(x(i), x(j))Instead of swapping two scalar entries of an array, we swap two whole columns. But if each row lives on its own separate processor, this operation should take exactly the same time as swapping two scalar entries of a one-dimensional array.
83d66 < ALLOCATE(temp(SIZE(x, 1))) 95,97c78,80 < temp = ws*x(:,j) < x(:,j) = x(:,i) - temp < x(:,i) = x(:,i) + temp --- > temp = ws*x(j) > x(j) = x(i) - temp > x(i) = x(i) + temp 103d85 < DEALLOCATE(temp)Array
temp must be able to store the whole column, so
the number of entries that we must allocate for it corresponds
to the number of rows, which is the first dimension in an array
(rember that the ordering is always rows followed by
columns).
And this is it. This is all we had to do, to parallelise our one dimensional Fast Fourier Transform program. If the program is to run on the SP and if it is to be compiled by HPF, we need to add the following HPF directive in main:
!HPF$ DISTRIBUTE (BLOCK, *) x, yThis directive should be matched by corresponding directives in subroutines of our program.
Does this program work?
Here is how it compiles and runs:
gustav@blanc:../src 16:51:02 !575 $ f90 -o lanczos-p lanczos-p.f90 gustav@blanc:../src 16:53:40 !576 $ ./lanczos-p ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.71, 0.00) ( 1.00, 0.00) ( 0.71, 0.00) ( 0.00, 0.00) ( -0.71, 0.00) ( -1.00, 0.00) ( -0.71, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( -0.71, 0.00) ( -1.00, 0.00) ( -0.71, 0.00) ( 0.00, 0.00) ( 0.71, 0.00) ( 1.00, 0.00) ( 0.71, 0.00) ( 0.00, 0.00) ( -0.50, 0.00) ( -0.71, 0.00) ( -0.50, 0.00) ( 0.00, 0.00) ( 0.50, 0.00) ( 0.71, 0.00) ( 0.50, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 1.57, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( -1.57, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( -1.57, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 0.00, 0.00) ( 1.57, 0.00) STOP 0 gustav@blanc:../src 16:53:42 !577 $Comparison with the result discussed in P573 for
Although in our example we have worked with a square data matrix,
subroutine fft should work with rectangular matrices too.