### Subroutine get_diffusion_matrix

Whereas subroutine expand_temp_profile is very relaxed when it comes to checking the transform length, especially against the number of nodes used by your HPF program, subroutine get_diffusion_matrix goes about it quite meticulously. This is overkill in this program, though. Observe that expand_temp_profile is called before get_diffusion_matrix. Consequently, if the transform length is not a multiple of the number of nodes, we're going to bomb out in fft, inside expand_temp_profile, well before we get to all that checking in get_diffusion_matrix. But at least this subroutine shows how to do the checking, if you really have to.

Subroutine get_diffusion_matrix calls subroutine factor_nodes on entry. Subroutine factor_nodes, in turn, checks the number of processors used by the program, and then tests if that number can be represented as

np = n = 2^h * 3^i * 5^j * 7^k * 11^m

where h, i, j, k, and m must satisfy the same conditions as are required for subroutine fft. On exit subroutine factor_nodes returns an array:

[h, i, j, k, m]

if the number of processors, np, is kosher, or stops with an error message if it is not.
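The factorisation itself is easy to sketch. The following Python rendering is an assumption about what factor_nodes does, based on the text above; the original is in HPF, and the additional fft-specific conditions on the exponents are not reproduced here:

```python
def factor_nodes(np_):
    # Factor the number of processors np_ as 2^h * 3^i * 5^j * 7^k * 11^m
    # and return [h, i, j, k, m].  The real subroutine stops with an error
    # message if np_ is not of that form; here we raise instead.
    powers = []
    n = np_
    for p in (2, 3, 5, 7, 11):
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        powers.append(e)
    if n != 1:
        raise ValueError("np = %d is not of the form 2^h 3^i 5^j 7^k 11^m" % np_)
    return powers
```

For example, 12 processors factor as [2, 1, 0, 0, 0], i.e., 2^2 * 3.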

After the return of factor_nodes we perform the following computation:

f1 = np / 2^h
f2 = ceiling(4 * (dif_nx + 1) / f1)
nx = f1 * minpower2(f2)

Consider some examples to see how this works.

First assume that np = 2^h. Then f1 = np / 2^h = 1 and f2 = 4 * (dif_nx + 1). Subroutine minpower2 will adjust f2 to the nearest power of two greater than or equal to it, and multiplication by f1 = 1 won't change anything. In short, in this context all that we really do is:

nx = minpower2(4*(dif_nx + 1))

and this is much the same as we have already seen in expand_temp_profile. Furthermore, we still don't check if nx is a multiple of the number of nodes, sic!
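A plausible Python sketch of minpower2, inferred from the text (the name is the original's, the body is my guess):

```python
def minpower2(n):
    # Smallest power of two greater than or equal to n -- what the
    # surrounding text says minpower2 does.
    p = 1
    while p < n:
        p *= 2
    return p

# With dif_nx = 7, the power-of-two case reduces to:
nx = minpower2(4 * (7 + 1))   # minpower2(32) == 32
```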

Why is there 4*(dif_nx + 1) instead of just 2*(dif_nx + 1), as was sufficient in expand_temp_profile? The reason is that when we calculate matrix F we will need terms such as F(k+k') and F(k-k'). In other words, if k and k' run from 1 through 7, then k+k' runs up to 14. In principle k-k' would run from -6 through 6, but we have already shown that in our case we can avoid negative values of k, and ditto for j. In short, this time we need to have Fourier coefficients from 1 through 2*nx and from 1 through 2*ny, and then, in order to stay away from negative frequencies, we need to double the number of Fourier coefficients still.

But since Fourier coefficients are cheap, and you only need to get them once, you can just as well avail yourself of more than you really need.

Now assume that np = 2^h * 3. Then f1 = np / 2^h = 3 and

f2 = ceiling(4 * (dif_nx + 1) / 3)

For dif_nx = 7 this will yield f2 = ceiling(32/3) = 11. The closest power of 2 that's greater than or equal to 11 is 2^4 = 16, so minpower2 will return 16. Now we multiply that 16 by 3 and get 48. Will this work for, say, 12 nodes? 12 = 2^2 * 3, 48 / 12 = 4, and 48 = 2^4 * 3, so we have satisfied both conditions, i.e., that the transform length must be a multiple of the number of nodes and that it must be of the form 2^h 3^i 5^j 7^k 11^m too.
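The whole computation can be strung together in a few lines of Python; this is a sketch of the logic described above, not the HPF original, and the helper names are assumptions:

```python
def minpower2(n):
    # smallest power of two >= n
    p = 1
    while p < n:
        p *= 2
    return p

def transform_length(np_, dif_nx):
    # Split off the odd part f1 of the node count, round
    # 4*(dif_nx+1)/f1 up to f2, and return f1 * minpower2(f2),
    # mirroring the computation in get_diffusion_matrix.
    h = 0
    n = np_
    while n % 2 == 0:
        n //= 2
        h += 1
    f1 = np_ >> h                             # np / 2^h
    f2 = (4 * (dif_nx + 1) + f1 - 1) // f1    # ceiling division
    return f1 * minpower2(f2)
```

For 12 nodes and dif_nx = 7 this returns 48, reproducing the worked example.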

But observe that the total number of nodes still falls out of the equation. We'll be alright going for 12 and for 48 nodes, but not for 96. The subroutine doesn't check if nx and ny evaluated by that process are indeed larger than the number of processors. Admittedly, not many people can afford 96-node machines, and those who can are unlikely to run this demonstration program on all 96 nodes either. Still, this is an omission, or a bug, and, as you see, the procedure could still be improved.

In summary, both ways to calculate the transform length, in expand_temp_profile and in get_diffusion_matrix, are botched. They demonstrate two techniques of attacking the problem, but both have loose ends. My preference would be to go for the simpler one, i.e., for a power of 2, and to forget about the complexities associated with filling the gaps between the powers of 2 with 2^h 3^i 5^j 7^k 11^m, unless you really, really need it. Forcing the number of nodes to be a power of 2 too can be done easily on program entry. The transform length can then be adjusted simply by taking a multiple of the number of nodes and comparing that with the required number of harmonics. On our SP you could then run the program on 2, 4, 8, 16, or 32 nodes.
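The simpler scheme advocated above can be sketched as follows; the function name is mine, and the assumption is that the node count has already been forced to a power of 2 on program entry:

```python
def transform_length_pow2(nodes, nharm):
    # Insist that the node count is a power of two, then keep doubling
    # the transform length, starting from the node count, until it can
    # hold the required number of harmonics.  The result is automatically
    # both a power of two and a multiple of the number of nodes.
    if nodes <= 0 or nodes & (nodes - 1) != 0:
        raise ValueError("number of nodes must be a power of 2")
    nx = nodes
    while nx < nharm:
        nx *= 2
    return nx
```

With 8 nodes and 32 required harmonics this gives a transform length of 32, and every intermediate check in get_diffusion_matrix becomes unnecessary.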

Another alternative would be to evaluate fft locally, i.e., not to make df or atmp distributed arrays. For our problem those arrays are so small, and even sequential fft is so fast, that we don't gain anything by parallelising this computation.