Next: param.f Up: The Code Previous: main.f

### Discussion of main.f

• First observe that all arrays are declared ALLOCATABLE. The dimensions of the arrays are obtained from a namelist by subroutine init_diffusion. We'll see how that is done later.
• dif_x and dif_y are one-dimensional arrays that contain the x and y coordinates of the points at which the temperature is going to be evaluated for various times. The points are distributed on a 2D rectangular grid: there are dif_numx points in the x direction and dif_numy points in the y direction, so the total number of points is dif_numx * dif_numy. The authors of the program chose to list all those points as one long one-dimensional array, rather than as a 2D matrix. The one-dimensional index that numbers the entries of the coordinate arrays dif_x and dif_y is called ij, and the formula that converts the 2D matrix indexes i and j into ij is
ij = dif_numx*(j-1) + i


After having obtained the basic parameters for the diffusion program, we allocate space for dif_x and dif_y and then initialise their values so that x_i = i*pi/(dif_numx + 1) and y_j = j*pi/(dif_numy + 1).

ALLOCATE( dif_x(dif_numx*dif_numy) ); ALLOCATE( dif_y(dif_numy*dif_numx) )
delx = PI/ ( dif_numx + 1.d0); dely = PI/ ( dif_numy + 1.d0)
DO i = 1, dif_numx
DO j = 1, dif_numy
ij = dif_numx*(j-1)  + i
dif_x(ij) = delx* i; dif_y(ij) = dely * j
ENDDO
ENDDO

This way of doing things adds unnecessary complexity to the code and gains little, especially on a parallel machine and in HPF, so you may just as well drop it and use 2D grid arrays instead. This is especially so when modifying the program to write images to HDF animation files.
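The index arithmetic above is easy to sandbox outside of Fortran. Here is a minimal plain-Python sketch of the same mapping; the grid sizes are made-up illustrative values, and the names merely mirror the Fortran variables:

```python
import math

dif_numx, dif_numy = 4, 3          # illustrative grid sizes
npts = dif_numx * dif_numy

delx = math.pi / (dif_numx + 1.0)
dely = math.pi / (dif_numy + 1.0)

dif_x = [0.0] * npts
dif_y = [0.0] * npts
for i in range(1, dif_numx + 1):       # 1-based indexes, as in the Fortran code
    for j in range(1, dif_numy + 1):
        ij = dif_numx * (j - 1) + i    # collapse (i, j) into the single index ij
        dif_x[ij - 1] = delx * i
        dif_y[ij - 1] = dely * j

def unflatten(ij, nx=dif_numx):
    """Recover the 2D grid indexes (i, j) from the collapsed index ij."""
    j = (ij - 1) // nx + 1
    i = (ij - 1) % nx + 1
    return i, j
```

The `unflatten` helper is hypothetical (the program never needs it for the grid, though the same arithmetic reappears later when decoding the collapsed harmonic index).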
• The next step is to allocate all the arrays used by the program. Since the call to init_diffusion should have returned the required dimensions of the problem, i.e., the required number of points in the space and the required number of harmonics, we are now ready to allocate as much space as is needed.

The parameters dif_nx and dif_ny yield the requested number of harmonics in the x and y directions respectively. Recall that our collapsed index l (equation 3.51) runs from 1 through nx*ny or, using the program notation, from 1 through n = dif_nx * dif_ny.

The program uses a simple error-control mechanism: it accumulates the number of observed allocation problems in num_errors and checks whether it is greater than 0 at the end of this part of the code. The ALLOCATE statement returns its exit status via the stat= keyword.

• The allocated arrays have the following meaning:
1.
lambda(n) is the array of eigenvalues of matrix F:
ALLOCATE(lambda(n),stat=stat); IF(stat.NE.0) num_errors=num_errors+1

Observe that lambda is not mentioned in the DISTRIBUTE directives. This means that this array is going to be replicated on every node.
2.
a(n) corresponds to what in our derivations we have called a0 = a(0), i.e., Fourier coefficients that correspond to the initial condition T0(x, y) = T(x, y, 0):
!HPF$  DISTRIBUTE (BLOCK) :: a
...
ALLOCATE(a(n), stat=stat); IF(stat.NE.0) num_errors=num_errors+1

3.
ab(n) is the initial condition, i.e., a0 ``rotated'' into the eigen-basis of F. As you remember, our formula for that was ab = b-1 a0. (In order to avoid overloading the word lambda, the program uses the symbol b for the matrix of eigenvectors.) But for orthonormal matrices (and hopefully b is going to be such):
b-1 = bT
So in this case we can write that
ab(k) = SUM_k' b(k',k) * a0(k')
where b(k,k') is the matrix of eigenvectors returned by the eigen-problem subroutine. The trick, in other words, is to contract a0 with the first index of that matrix, and not with the second one, as we would normally do for matrix-vector multiplication. In the code this is done as follows:
!HPF$  DISTRIBUTE (*,BLOCK) :: b
!HPF$  DISTRIBUTE (BLOCK) :: ab
...
ALLOCATE( ab(n), stat=stat); IF( stat .NE. 0) num_errors = num_errors + 1
...
ALLOCATE( b(n,n) , stat=stat ); IF( stat .NE. 0) num_errors = num_errors + 1
...
DO i = 1, n
ab(i) = 0.d0
DO j = 1, n
ab(i) = b(j,i)*a_rep(j) + ab(i)
ENDDO
ENDDO
...
The array a_rep is a replicated version of a:
ALLOCATE( a_rep(n), stat=stat); IF( stat .NE. 0) num_errors = num_errors + 1
...
CALL expand_temp_profile(a)
a_rep = a
...
What do we gain by replicating a_rep(j)? Observe that b is distributed in the second index only. Since a_rep is replicated and not distributed, all those multiplications in b(j,i)*a_rep(j) will be performed locally, as will the outer loop (over i), because ab(i) is distributed the same way that b(*,i) is, i.e., in a BLOCK fashion.
4.
The next two arrays we allocate space for are xsines and ysines. These arrays are used in the final evaluation of T(x, y, t) and are defined as follows:
xsines(p,k) = SQRT(2/pi) * sin( k * x_p )   (3.59)
ysines(p,j) = SQRT(2/pi) * sin( j * y_p )   (3.60)
where k = 1, ..., nx, j = 1, ..., ny, and p = 1, ..., np; nx is the number of harmonics in the x direction (dif_nx), ny is the number of harmonics in the y direction (dif_ny), and np is the total number of physical grid points (dif_npts = dif_numx * dif_numy). The first index of these arrays runs over the physical grid points, and the second index runs over the harmonics:
ALLOCATE( xsines(dif_npts, dif_nx) , stat=stat ); IF( stat .NE. 0) &
num_errors = num_errors + 1
ALLOCATE( ysines(dif_npts, dif_ny) , stat=stat ); IF( stat .NE. 0) &
num_errors = num_errors + 1
These arrays are not distributed either. That is, they are local to every node. The reason for this is the following computation that is performed towards the end of the program:
DO it = 1, dif_ntemps
DO i = 1, n
iy = (i-1)/dif_nx+1; ix = MOD(i-1,dif_nx)+1
DO k = 1, dif_npts
temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
ENDDO
ENDDO
ENDDO
where the variables temp and x are distributed in the second index only.
This means that the two inner loops, over i and k, will be performed locally, and only the outer loop, over it, will be distributed. Observe the somewhat unusual work partitioning: it is the time slices that we are distributing between the various processors, so that the whole history is then calculated in parallel, simultaneously for various time slices. The i index of the middle loop corresponds to the collapsed harmonic index l, whereas the k index of the innermost loop runs over all physical grid points. The variable x(i,it) corresponds to a_l(t_m).
5.
Now we allocate space for temperature at various points in space and time:
!HPF$  DISTRIBUTE (*,BLOCK) :: temp
...
ALLOCATE(temp(dif_npts, dif_ntemps), stat=stat); IF(stat .NE. 0) &
num_errors = num_errors + 1

The first index of this array runs over all points of the physical grid, and the second index numbers consecutive instants of time.
6.
The next array is what we have called F_ll' or F. This is a square array with dimensions n x n:
!HPF$  DISTRIBUTE (*,BLOCK) :: f
...
ALLOCATE( f(n,n) , stat=stat ); IF( stat .NE. 0) num_errors = num_errors + 1

7.
The diagonalisation of F will return a matrix of eigenvectors in addition to the array of eigenvalues. In the code that matrix is called b. We have discussed the allocation of this distributed array above, when describing ab.
8.
The variable abt contains the whole history of the coefficients a(t), for various instants of time, in the eigen-basis. That is, these coefficients have not been rotated back onto the original basis yet. In other words, they are
abt(l,m) = EXP( -lambda(l) * (m-1) * dif_delta_t ) * ab(l)
Like temp, abt is a distributed array, where it is the various time instances that have been distributed over multiple nodes.
!HPF$  DISTRIBUTE (*,BLOCK) :: abt
...
ALLOCATE( abt(n,dif_ntemps) , stat=stat ); IF( stat .NE. 0) &
num_errors = num_errors + 1
...
DO it = 1, dif_ntemps
i = 0
DO i = 1, n
abt(i,it) = EXP( -lambda(i) *(it-1)*dif_delta_t) * ab(i)
ENDDO
ENDDO

9.
The last array we allocate space for is the array that corresponds to the solution rotated back into the original basis, a_l(t_m), i.e., x = b * abt. This array is returned by a call to subroutine gemm, which takes abt on input. Array x is distributed in the same way as abt:
!HPF$  DISTRIBUTE (*,BLOCK) :: x
...
ALLOCATE( x(n,dif_ntemps) , stat=stat ); IF( stat .NE. 0) &
num_errors = num_errors + 1
...
CALL gemm(1.d0,b,abt,0.d0,x,'N','N')
...
DO it = 1, dif_ntemps
DO i = 1, n
iy = (i-1)/dif_nx+1; ix = MOD(i-1,dif_nx)+1
DO k = 1, dif_npts
temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
ENDDO
ENDDO
ENDDO

I have quoted again the computation of T, this time in order to draw your attention to x, rather than to xsines and ysines.
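To see that the collapsed-index bookkeeping in this triple loop is right, here is a small plain-Python sketch that performs the same accumulation and checks it against a direct double sum over the x and y harmonics. All the numbers below (grid coordinates, coefficients, sizes) are made-up illustrative values, not output of the actual program:

```python
import math

dif_nx, dif_ny = 3, 2              # harmonics in x and y (illustrative)
n = dif_nx * dif_ny
dif_npts = 4                       # physical grid points (illustrative)
ntemps = 2                         # time slices

xs = [0.3, 0.7, 1.1, 2.0]          # made-up grid coordinates
ys = [0.2, 0.9, 1.5, 2.5]
# coef[l][it] stands in for the program's x(i,it), the coefficients a_l(t_m)
coef = [[0.1 * (l + 1) + 0.01 * it for it in range(ntemps)] for l in range(n)]

c = math.sqrt(2.0 / math.pi)
xsines = [[c * math.sin(k * xs[p]) for k in range(1, dif_nx + 1)]
          for p in range(dif_npts)]
ysines = [[c * math.sin(j * ys[p]) for j in range(1, dif_ny + 1)]
          for p in range(dif_npts)]

# the triple loop from the program (0-based here)
temp = [[0.0] * ntemps for _ in range(dif_npts)]
for it in range(ntemps):
    for l in range(1, n + 1):                  # collapsed harmonic index
        iy = (l - 1) // dif_nx                 # decode l into (ix, iy)
        ix = (l - 1) % dif_nx
        for p in range(dif_npts):
            temp[p][it] += xsines[p][ix] * ysines[p][iy] * coef[l - 1][it]

def direct(p, it):
    """Direct double sum over (kx, ky), bypassing the collapsed index."""
    s = 0.0
    for ky in range(1, dif_ny + 1):
        for kx in range(1, dif_nx + 1):
            l = dif_nx * (ky - 1) + kx
            s += (c * math.sin(kx * xs[p])) * (c * math.sin(ky * ys[p])) \
                 * coef[l - 1][it]
    return s
```

The check that `temp[p][it]` agrees with `direct(p, it)` for every point and time slice confirms that decoding l into (ix, iy) undoes the collapsing l = dif_nx*(ky-1) + kx.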
• In summary we have the following arrays:
lambda(harmonics)               replicated
a(harmonics)                    distributed over harmonics
ab(harmonics)                   distributed over harmonics
a_rep(harmonics)                replicated (copy of a)
xsines(grid, harmonics_in_x)    replicated
ysines(grid, harmonics_in_y)    replicated
temp(grid, time)                distributed over time
f(harmonics, harmonics)         distributed over second harmonics arg
b(harmonics, harmonics)         distributed over second harmonics arg
abt(harmonics, time)            distributed over time
x(harmonics, time)              distributed over time


• Once all the arrays have been allocated we check if there were any problems and bump out if there were:
IF( num_errors .GT. 0 ) THEN
WRITE(*,*) 'Error in allocating arrays in main'; STOP
ENDIF

• The next two statements call two external procedures, which we're going to discuss later, and evaluate a0 and F. After the distributed array a (which stands for a0) is returned, it is gathered on one of the processes and then scattered in full amongst all participating processes. In HPF this is simply done by saying:
a_rep = a

where a is a DISTRIBUTED array, and a_rep is a replicated array. All HPF objects are replicated by default unless they are either DISTRIBUTED or SEQUENCE.

In summary, the arrays a0 and F are evaluated by the following:

CALL expand_temp_profile(a); a_rep = a
CALL get_diffusion_matrix(f)

• The following two doubly nested DO loops evaluate

xsines(j,i) = SQRT(2/pi) * sin( i * dif_x(j) )
ysines(j,i) = SQRT(2/pi) * sin( i * dif_y(j) )

for i = 1, ..., dif_nx (respectively dif_ny) and j = 1, ..., np, where np is the total number of grid points. Function dsin is the DOUBLE PRECISION version of sin.
DO i = 1, dif_nx
DO j = 1, dif_npts
t1 = i*dif_x(j); xsines(j,i) = SQRT(2.d0/pi) * dsin(t1)
ENDDO
ENDDO

DO i = 1, dif_ny
DO j = 1, dif_npts
t2 = i*dif_y(j); ysines(j,i) = SQRT(2.d0/pi) * dsin( t2)
ENDDO
ENDDO
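The factor SQRT(2/pi) normalises the sine harmonics on the interval (0, pi): with it, the basis functions are orthonormal, which is what ultimately lets us treat the transformation matrices as orthonormal. A quick numerical sanity check of this normalisation, as a plain-Python sketch (the midpoint-rule integrator and its resolution are my own choices, not part of the program):

```python
import math

def inner(k, kp, m=20000):
    """Midpoint-rule approximation of the inner product
    integral over (0, pi) of (2/pi) * sin(k x) * sin(k' x) dx,
    which should be 1 when k == k' and 0 otherwise."""
    h = math.pi / m
    s = 0.0
    for i in range(m):
        x = (i + 0.5) * h
        s += math.sin(k * x) * math.sin(kp * x)
    return (2.0 / math.pi) * s * h
```

With 20000 midpoint panels the quadrature error is far below the tolerances used in the check, so the Kronecker-delta behaviour comes out cleanly.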

• Now we're finally ready to solve the problem we have constructed. This is done simply by calling a PESSL   routine called syevx, which solves the eigenvalue problem and returns both the eigenvalues and eigenvectors:
CALL syevx(f,lambda,'u', z=b)

You will find this routine documented on page 598 of ``PESSL Guide and Reference''.

The routine computes eigenvalues and eigenvectors of a real symmetric matrix.

Is our matrix F real and symmetric? It is certainly real, because of the way we have constructed it. The matrix is ultimately based on sines and cosines, which we replaced at some stage with equivalent Euler formulas employing complex exponentials. So we are not working here with auxiliary imaginary quantities that we would have to throw away at some stage: all entries of F are real.

Is F symmetric? Consider two indexes: l = (j - 1)nx + k and l' = (j' - 1)nx + k'. Switching l and l' around corresponds to switching j and j', and then k and k', around. But this would not affect the expressions, such as kk' + l_r^2 jj' and kk' - l_r^2 jj', that stand in front of the integrals. The integrals are expressed in terms of

• e^{i(k + k')x}, no change here
• e^{i(j + j')y}, ditto
• e^{i(k - k')x}, sign change here
• e^{i(j - j')y}, sign change here

But recall that we have already shown (equation 3.61) that these integrals are invariant under a change of sign of k - k' and j - j', so even though there is a change of sign in the k - k' and j - j' terms, the integrals themselves are unaffected.

To sum up: F is real and symmetric. Since it is symmetric, F = FT, and thus F is also normal; hence the transformation matrix can be chosen to be orthonormal.

• Subroutine syevx has a somewhat complicated synopsis. It takes three mandatory arguments and then a plethora of optional arguments, all of which must be flagged with appropriate keywords, such as abstol=, or ifail=.

The mandatory arguments are:

f
the input matrix itself
lambda
preallocated space for a vector of eigenvalues, which will be returned in it on exit.
uplo
this is a character constant, which has to be either 'U' or 'L' (case doesn't matter), and which tells syevx whether it will find the valid values of F in the upper or in the lower part of matrix f. Because that matrix is symmetric, syevx references only half of it.
In our case we use just one optional parameter, which is flagged by z=. If this option is requested, subroutine syevx will return orthonormal eigenvectors of F on exit in preallocated (and correctly dimensioned) space given as an argument to this option. In our case this is:
z=b

• After a successful return from syevx we evaluate first

ab(i) = SUM_j b(j,i) * a_rep(j)

and store it on ab, and then

abt(i,it) = EXP( -lambda(i) * (it-1) * dif_delta_t ) * ab(i)

and store it on abt for various values of t:
DO i = 1, n
ab(i) = 0.d0
DO j = 1, n
ab(i) = b(j,i)*a_rep(j) + ab(i)
ENDDO
ENDDO

DO it = 1, dif_ntemps
i = 0
DO i = 1, n
abt(i,it) = EXP( -lambda(i) *(it-1)*dif_delta_t) * ab(i)
ENDDO
ENDDO
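The two loops above, together with the subsequent rotation back, implement a(t) = b exp(-lambda t) bT a(0). The whole eigen-basis procedure can be sketched in a few lines of plain Python on a hand-diagonalised 2x2 symmetric matrix (the matrix and the initial condition here are illustrative inventions, not the program's F); the correctness test is that the result satisfies da/dt = -F a:

```python
import math

# a small symmetric stand-in for F: eigenvalues 1 and 3,
# orthonormal eigenvectors stored as the COLUMNS of b
F = [[2.0, 1.0], [1.0, 2.0]]
lam = [1.0, 3.0]
r = 1.0 / math.sqrt(2.0)
b = [[r, r], [-r, r]]          # column 0 <-> lam=1, column 1 <-> lam=3

a0 = [1.0, 0.5]                # illustrative initial condition

# ab(i) = sum_j b(j,i) * a0(j): contract with the FIRST index of b,
# i.e., apply b transposed, exactly as in the Fortran loop
ab = [sum(b[j][i] * a0[j] for j in range(2)) for i in range(2)]

def a_of_t(t):
    # abt(i) = exp(-lambda(i) t) * ab(i), then rotate back: a = b * abt
    abt = [math.exp(-lam[i] * t) * ab[i] for i in range(2)]
    return [sum(b[k][i] * abt[i] for i in range(2)) for k in range(2)]
```

At t = 0 the two rotations cancel (b bT = identity for an orthonormal b) and a_of_t(0) reproduces a0; at later times each eigen-component simply decays with its own rate lambda(i).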

• All that we need to do next is to unrotate the solution away from the eigen-basis. Although this would be quite easy to do given matrix b, there is a special PESSL function called gemm, which stands for General Matrix-Matrix Product, and which can do it for us easily and very efficiently. See page 497 of ``PESSL Guide and Reference'' for more information about it. This subroutine, depending on the various switches with which it has been called, will perform one of the following seven operations:
1.
C = alpha*A*B + beta*C
2.
C = alpha*AT*B + beta*C
3.
C = alpha*A*BT + beta*C
4.
C = alpha*AT*BT + beta*C
5.
C = alpha*AH*B + beta*C
6.
C = alpha*A*BH + beta*C
7.
C = alpha*AH*BH + beta*C
where AT denotes the transpose and AH the conjugate transpose of A.

We call this function in the following way:

CALL gemm(1.d0,b,abt,0.d0,x,'N','N')


And the meaning of various parameters passed to it is as follows:

1.d0
This is the value of the scalar alpha.
b
This is matrix A. In our case it is the matrix of eigenvectors, b.
abt
This is matrix B. In our case this is the as yet unrotated eigensolution, which is stored on abt. The matrix here is of a peculiar type, because its second index numbers time slices.
0.d0
This is the value of the scalar beta.
x
This is C, i.e., space allocated and correctly dimensioned for the result of the matrix multiplication. In our case these will be the coefficients a_l(t_m) (remember that the index l here is just an index, not a power - whenever I want to be very strict about what is a vector and what is a form, I place vector indexes upstairs, and at some stage we needed to get strict about these things in order to derive a correct rule for rotating the eigensolution).
'N'
This character constant tells gemm that matrix A should not be transposed.
'N'
This last parameter tells gemm that matrix B should not be transposed.
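For reference, here is a plain-Python sketch of the operation that this particular call asks for, C = alpha*op(A)*op(B) + beta*C. This illustrates the semantics only; it is not the PESSL calling interface, which takes the transposition flags as trailing character arguments as shown above:

```python
def gemm(alpha, A, B, beta, C, transa='N', transb='N'):
    """Compute C <- alpha*op(A)*op(B) + beta*C in place,
    where op(X) is X for 'N' and the transpose of X for 'T'.
    Matrices are lists of row lists."""
    opA = A if transa == 'N' else [list(r) for r in zip(*A)]
    opB = B if transb == 'N' else [list(r) for r in zip(*B)]
    m, kdim, ncols = len(opA), len(opB), len(opB[0])
    for i in range(m):
        for j in range(ncols):
            acc = sum(opA[i][p] * opB[p][j] for p in range(kdim))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

With alpha = 1, beta = 0 and both flags set to 'N', as in the program's call, this reduces to the plain product C = A*B, i.e., x = b*abt.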
• Now we already have our solution in terms of a_l(t_m) (array x). All that still needs to be done is to convert it to values of T(x, y, t). This is done by the following two loops:
DO it = 1, dif_ntemps
DO k = 1, dif_npts
temp(k, it) = 0.d0
ENDDO
ENDDO

DO it = 1, dif_ntemps
DO i = 1, n
iy = (i-1)/dif_nx+1; ix = MOD(i-1,dif_nx)+1
DO k = 1, dif_npts
temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
ENDDO
ENDDO
ENDDO

• The following two loops with lots of WRITE statements simply write all the results to standard output. They are not worth quoting here. The only thing that you ought to remember is that in HPF programs only one node is responsible for I/O. There is no parallel I/O in standard HPF. Yet. So the HPF run-time system has to collect all those distributed arrays on a single node (and temp, remember, is distributed so that various time slices live on various processors), and then that node does the printing on standard output or the writing to files.

When linking HDF subroutines with HPF programs, though, you must not assume a priori that HPF will do this for you: you must declare all HDF-related variables as SEQUENCE.

• Last, but not least, we clean up, i.e., we free all the space that has been allocated for the arrays. This is not strictly necessary, since we do not do anything more in this program, but if we did, we might wish to reclaim and reuse that space for other tasks.
DEALLOCATE(xsines); DEALLOCATE(ysines); DEALLOCATE(lambda)
DEALLOCATE(temp); DEALLOCATE(a); DEALLOCATE(abt); DEALLOCATE(ab)
DEALLOCATE(f); DEALLOCATE(dif_x); DEALLOCATE(dif_y)


Observe that various arrays used in this program have been very thoughtfully distributed or replicated, as the case may be, in order to minimize communication between processes.

Zdzislaw Meglicki
2001-02-26