Working with dist_fft

dist_fft conserves the normalization of the data array in both the forward and the inverse transform. The output array is not centered after the FFT: the zero-frequency term is not shifted to the middle of the array.

The steps for using dist_fft are listed below. The dist_fft routines require $n_{x}=n_{y}$ for 2D transforms and $n_{x}=n_{y}=n_{z}$ for 3D transforms.

  1. Following the usual call to MPI_Init(), call
     planForward = dist_fft_create_plan(type, dimension, DIST_FFT_FORWARD,
                   forwardFlags, comm);
     planInverse = dist_fft_create_plan(type, dimension, DIST_FFT_INVERSE,
                   inverseFlags, comm);
    

  2. Then allocate local space by calling
     dist_fft_local_dimensions(planForward, &local_dimension, &local_start,
       &local_storage_size);
     DIST_FFT_MALLOC_DATA(data, local_storage_size);
    
    The global array is divided into slabs along the last dimension. For example, if a $9\times9\times9$ array is split over 3 processes, each process holds a $9\times9\times3$ slab, so local_dimension is 3 on every process and local_start is 0, 3, and 6 respectively. We also allocate a workspace of the same size, which dist_fft uses for the transpose and other intermediate operations.
     DIST_FFT_MALLOC_WORKSPACE(workspace, local_storage_size);
    

  3. Each process can now load its part of the data. Suppose f is the full 3D real-space array, holding the initial complex value of each pixel with $x$ as the fastest index. Each process copies its own slab of f into data with
     /* first and one-past-last global index of this process's slab */
     start = local_start * nx * ny;
     limit = (local_start + local_dimension) * nx * ny;
     for (index = start; index < limit; index++) {
       local_index = index - start;
       c_re(data, local_index) = c_re(f, index);
       c_im(data, local_index) = c_im(f, index);
     }
    

  4. Then run the forward and the inverse transform on data with
     dist_fft_execute(planForward, data, workspace);
     dist_fft_execute(planInverse, data, workspace);
    

  5. At the end of the program, call
     DIST_FFT_FREE_WORKSPACE(workspace);
     dist_fft_destroy_plan(planForward);
     dist_fft_destroy_plan(planInverse);
     MPI_Finalize();
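
The bookkeeping behind steps 2 and 3 can be sketched in plain C without calling dist_fft. slab_range and load_slab are hypothetical helpers, not part of dist_fft. slab_range assumes that leftover planes go to the lowest ranks when $n_{z}$ is not divisible by the number of processes (the text only describes the evenly divisible case), and load_slab assumes separate real and imaginary arrays in place of the c_re/c_im macros.

```c
#include <stddef.h>

/* HYPOTHETICAL helper: slab decomposition along the last (z) axis.
   ASSUMPTION: when nz % nprocs != 0, the extra planes go to the
   lowest ranks. For nz = 9, nprocs = 3 this gives local_dimension 3
   and local_start 0, 3, 6, matching the example in step 2. */
void slab_range(int nz, int nprocs, int rank,
                int *local_dimension, int *local_start)
{
    int base = nz / nprocs;
    int extra = nz % nprocs;
    *local_dimension = base + (rank < extra ? 1 : 0);
    *local_start = rank * base + (rank < extra ? rank : extra);
}

/* HYPOTHETICAL helper: copy this rank's z-planes out of a global array
   stored with x fastest (offset = x + nx*(y + ny*z)), as in step 3.
   Real and imaginary parts are assumed to be separate arrays. */
void load_slab(const double *global_re, const double *global_im,
               double *local_re, double *local_im,
               size_t nx, size_t ny,
               size_t local_start, size_t local_dimension)
{
    size_t start = local_start * nx * ny;
    size_t limit = (local_start + local_dimension) * nx * ny;
    for (size_t index = start; index < limit; index++) {
        size_t local_index = index - start;
        local_re[local_index] = global_re[index];
        local_im[local_index] = global_im[index];
    }
}
```

Because each z-plane is contiguous when $x$ is the fast index, a slab of whole planes is one contiguous range of the global array, which is why the loop in step 3 needs only a start and a limit.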
    

DIST_FFT_COLUMN_INPUT selects array indexing in which $x$ is the fastest-varying array index.
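
A minimal sketch of the two index orderings for an $n_{x}\times n_{y}\times n_{z}$ array. col_offset uses $x$ as the fastest index, the DIST_FFT_COLUMN_INPUT convention that the loading loop in step 3 relies on. row_offset, with $z$ fastest (C's a[x][y][z]), is our assumption about the ROW ordering in the table below; the text does not define it. Both function names are hypothetical.

```c
#include <stddef.h>

/* Column ordering: x varies fastest (DIST_FFT_COLUMN_INPUT). */
size_t col_offset(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return x + nx * (y + ny * z);
}

/* ASSUMED row ordering: z varies fastest, as in a C array a[x][y][z]. */
size_t row_offset(size_t x, size_t y, size_t z, size_t ny, size_t nz)
{
    return z + nz * (y + ny * x);
}
```

In a $4\times4\times4$ array, stepping $x$ by one moves 1 element in column order but 16 elements in row order.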

Test results for 3D dist_fft transforms of $N^{3}$ arrays on the Stony Brook cluster, using 32 processors:

Array size   Data type   Data order    Input   Output   Time (s)
$1024^{3}$   float       split         COL     COL      13.4
$1024^{3}$   float       interleaved   COL     COL      18.0
$512^{3}$    float       split         COL     COL       4.1
$512^{3}$    float       interleaved   COL     COL       3.5
$512^{3}$    float       split         ROW     ROW       8.5
$512^{3}$    float       interleaved   ROW     ROW       6.8
$512^{3}$    float       split         ROW     COL       6.4
$512^{3}$    float       interleaved   ROW     COL       5.2
$512^{3}$    float       split         COL     ROW       6.2
$512^{3}$    float       interleaved   COL     ROW       5.2
$512^{3}$    double      split         COL     COL       6.6
$512^{3}$    double      interleaved   COL     COL       4.4
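
The split and interleaved data orders in the table differ in how the real and imaginary parts are stored. The macros and function below are hypothetical and assume that "split" means all real parts followed by all imaginary parts in one buffer, while "interleaved" means alternating real/imaginary pairs; the text does not define either layout, and the c_re/c_im macros used in step 3 hide this choice from the caller.

```c
#include <stddef.h>

/* ASSUMED interleaved layout: element i is (a[2i], a[2i+1]). */
#define RE_INTERLEAVED(a, i) ((a)[2 * (i)])
#define IM_INTERLEAVED(a, i) ((a)[2 * (i) + 1])

/* ASSUMED split layout for n elements: real parts a[0..n-1],
   then imaginary parts a[n..2n-1]. */
#define RE_SPLIT(a, n, i) ((a)[(i)])
#define IM_SPLIT(a, n, i) ((a)[(n) + (i)])

/* Convert n complex values from interleaved to split layout.
   in and out must each hold 2*n doubles and must not overlap. */
void interleaved_to_split(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        RE_SPLIT(out, n, i) = RE_INTERLEAVED(in, i);
        IM_SPLIT(out, n, i) = IM_INTERLEAVED(in, i);
    }
}
```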

Figure 6.1: Benchmark times for 2D FFTs computed with dist_fft on varying numbers of processors.

Microscope User 2008-04-30