TABLE OF CONTENTS


ABINIT/fourwf [ Functions ]

[ Top ] [ Functions ]

NAME

 fourwf

FUNCTION

 Carry out composite Fourier transforms between real and reciprocal (G) space.
 Wavefunctions, contained in a sphere in reciprocal space,
 can be FFT to real space. They can also be FFT from real space
 to a sphere. Also, the density maybe accumulated, and a local
 potential can be applied.

 The different options are :
 - option=0 --> reciprocal to real space and output the result.
 - option=1 --> reciprocal to real space and accumulate the density.
 - option=2 --> reciprocal to real space, apply the local potential to the wavefunction
                in real space and produce the result in reciprocal space.
 - option=3 --> real space to reciprocal space.
                NOTE that in this case, fftalg=1x1 MUST be used. This may be changed in the future.

 The different sections of this routine corresponds to different
 algorithms, used independently of each others :
 - fftalg=xx0 : use simple complex-to-complex routines, without zero padding
     (rather simple, so can be used to understand how fourwf.f works);
 - fftalg=1x1 : use S Goedecker routines, with zero padding
     (7/12 savings in execution time);
 - fftalg=1x2 : call even more sophisticated coding also based on S Goedecker routines

 This routine contains many parts that differ only
 by small details, in order to treat each case with the better speed.
 Also for better speed, it uses no F90 construct, except the allocate command
 and for zeroing arrays.

COPYRIGHT

 Copyright (C) 1998-2018 ABINIT group (DCA, XG, GMR, FF)
 This file is distributed under the terms of the
 GNU General Public License, see ~abinit/COPYING
 or http://www.gnu.org/copyleft/gpl.txt .
 For the initials of contributors, see ~abinit/doc/developers/contributors.txt .

INPUTS

 cplex= if 1 , denpot is real, if 2 , denpot is complex
    (cplex=2 only allowed for option=2, and istwf_k=1)
    not relevant if option=0 or option=3, so cplex=0 can be used to minimize memory
 fofgin(2,npwin)=holds input wavefunction in G vector basis sphere.
                 (intent(in) but the routine sphere can modify it for another iflag)
 gboundin(2*mgfft+8,2)=sphere boundary info for reciprocal to real space
 gboundout(2*mgfft+8,2)=sphere boundary info for real to reciprocal space
 istwf_k=option parameter that describes the storage of wfs
 kg_kin(3,npwin)=reduced planewave coordinates, input
 kg_kout(3,npwout)=reduced planewave coordinates, output
 mgfft=maximum size of 1D FFTs
 mpi_enreg=information about MPI parallelization
 ndat=number of FFT to do in //
 ngfft(18)=contain all needed information about 3D FFT, see ~abinit/doc/variables/vargs.htm#ngfft
 npwin=number of elements in fofgin array (for option 0, 1 and 2)
 npwout=number of elements in fofgout array (for option 2 and 3)
 n4,n5,n6=ngfft(4),ngfft(5),ngfft(6), dimensions of fofr.
 option= if 0: do direct FFT
         if 1: do direct FFT, then sum the density
         if 2: do direct FFT, multiply by the potential, then do reverse FFT
         if 3: do reverse FFT only
 paral_kgb=Flag related to the kpoint-band-fft parallelism
 tim_fourwf=timing code of the calling routine (can be set to 0 if not attributed)
 weight_r=weight to be used for the accumulation of the density in real space
         (needed only when option=1)
 weight_i=weight to be used for the accumulation of the density in real space
         (needed only when option=1 and (fftalg=4 and fftalgc/=0))
 fofginb(2,npwin)=holds second input wavefunction in G vector basis sphere.
                 (intent(in) but the routine sphere can modify it for another iflag)
                 (for non diagonal occupation)
 use_ndo = use non diagonal occupations.

OUTPUT

  (see side effects)

SIDE EFFECTS

 Input/Output
 for option==0, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere;
                fofr(2,n4,n5,n6*ndat) contains the output Fourier Transform of fofgin;
                no use of denpot, fofgout and npwout.
 for option==1, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere;
                denpot(cplex*n4,n5,n6) contains the input density at input,
                and the updated density at output (accumulated);
                no use of fofgout and npwout.
 for option==2, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere;
                denpot(cplex*n4,n5,n6) contains the input local potential;
                fofgout(2,npwout*ndat) contains the output function;
 for option==3, fofr(2,n4,n5,n6*ndat) contains the input real space wavefunction;
                fofgout(2,npwout*ndat) contains its output Fourier transform;
                no use of fofgin and npwin.

TODO

  Remove paral_kgb, we are already passing mpi_enreg

NOTES

   DO NOT CHANGE THE API OF THIS FUNCTION.
   If you need a specialized routine for the FFT of the wavefunctions, create
   a wrapper that uses fourwf to accomplish your task. This routine, indeed,
   has already too many parameters and each change in the API requires a careful
   modification of the different wrappers used for specialized FFTs such as FFTW3 and MKL-DFTI

PARENTS

      dfpt_accrho,dfpt_mkrho,dfptnl_resp,fock_getghc,getgh1c,getghc
      gwls_hamiltonian,m_cut3d,m_epjdos,m_fft_prof,m_fock,mkrho,mlwfovlp
      pawmkaewf,pawsushat,posdoppler,prep_fourwf,spin_current,susk,suskmm
      tddft,vtowfk

CHILDREN

      ccfft,cg_addtorho,cg_box2gsph,dcopy,dfti_seqfourwf,fftw3_seqfourwf
      fourwf_mpi,gpu_fourwf,ptabs_fourwf,sg_fftpad,sg_fftrisc,sg_fftrisc_2
      sphere,sphere_fft,timab,xmpi_sum

SOURCE

117 #if defined HAVE_CONFIG_H
118 #include "config.h"
119 #endif
120 
121 #include "abi_common.h"
122 
123 subroutine fourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,&
124 &  kg_kin,kg_kout,mgfft,mpi_enreg,ndat,ngfft,npwin,npwout,n4,n5,n6,option,&
125 &  paral_kgb,tim_fourwf,weight_r,weight_i, &
126 &  use_gpu_cuda,use_ndo,fofginb) ! Optional arguments
127 
128  use defs_basis
129  use defs_abitypes
130  use m_profiling_abi
131  use m_xmpi
132  use m_errors
133  use m_cgtools
134 
135  use m_mpinfo,    only : ptabs_fourwf
136  use m_fftcore,   only : sphere_fft, sphere
137  use m_sgfft,     only : sg_fftpad, sg_fftrisc, sg_fftrisc_2
138  use m_dfti,      only : dfti_seqfourwf
139  use m_fftw3,     only : fftw3_seqfourwf
140  use m_fft,       only : fourwf_mpi
141 
142 !This section has been created automatically by the script Abilint (TD).
143 !Do not modify the following lines by hand.
144 #undef ABI_FUNC
145 #define ABI_FUNC 'fourwf'
146  use interfaces_18_timing
147  use interfaces_53_ffts, except_this_one => fourwf
148 !End of the abilint section
149 
150  implicit none
151 
152 !Arguments ------------------------------------
153 !scalars
154  integer,intent(in) :: cplex,istwf_k,mgfft,n4,n5,n6,ndat,npwin,npwout,option,paral_kgb
155  integer,intent(in) :: tim_fourwf
156  integer,intent(in),optional :: use_gpu_cuda,use_ndo
157  real(dp),intent(in) :: weight_r,weight_i
158  type(MPI_type),intent(in) :: mpi_enreg
159 !arrays
160  integer,intent(in) :: gboundin(2*mgfft+8,2),gboundout(2*mgfft+8,2)
161  integer,intent(in) :: kg_kin(3,npwin),kg_kout(3,npwout),ngfft(18)
162  real(dp),intent(inout) :: denpot(cplex*n4,n5,n6),fofgin(2,npwin*ndat)
163  real(dp),intent(inout),optional :: fofginb(:,:) ! (2,npwin*ndat)
164  real(dp),intent(inout) :: fofr(2,n4,n5,n6*ndat)
165  real(dp),intent(out) :: fofgout(2,npwout*ndat)
166 
167 !Local variables-------------------------------
168 !scalars
169  integer :: fftalg,fftalga,fftalgc,fftcache,i1,i2,i2_local,i3,i3_local,i3_glob,idat,ier
170  integer :: iflag,ig,comm_fft,me_g0,me_fft,n1,n2,n3,nd2proc,nd3proc
171  integer :: nfftot,nproc_fft,option_ccfft 
172  real(dp) :: fim,fre,xnorm
173  character(len=500) :: message
174  logical ::  luse_gpu_cuda,luse_ndo 
175 !arrays
176  integer,parameter :: shiftg0(3)=0
177  integer,parameter :: symmE(3,3)=reshape([1,0,0,0,1,0,0,0,1],[3,3])
178  integer, ABI_CONTIGUOUS pointer :: fftn2_distrib(:),ffti2_local(:)
179  integer, ABI_CONTIGUOUS pointer :: fftn3_distrib(:),ffti3_local(:)
180  real(dp) :: tsec(2)
181  real(dp),allocatable :: work1(:,:,:,:),work2(:,:,:,:),work3(:,:,:,:)
182  real(dp),allocatable :: work4(:,:,:,:),work_sum(:,:,:,:) 
183 
184 ! *************************************************************************
185 
186  ! Accumulate timing
187  call timab(840+tim_fourwf,1,tsec)
188 
189  n1=ngfft(1); n2=ngfft(2); n3=ngfft(3); nfftot=n1*n2*n3
190  fftcache=ngfft(8)
191  fftalg=ngfft(7); fftalga=fftalg/100; fftalgc=mod(fftalg,10)
192  me_fft=ngfft(11)
193  nproc_fft=ngfft(10)
194 
195  comm_fft = mpi_enreg%comm_fft; me_g0 = mpi_enreg%me_g0
196 
197  !if (ndat/=1) then
198  !  write(std_out,*)fftalg
199  !  MSG_ERROR("Really? I thought nobody uses ndat > 1")
200  !end if 
201 
202  !if (weight_r /= weight_i) then
203  !  write(std_out,*)fftalg
204  !  MSG_ERROR("Really? I thought nobody uses weight_r != weight_i")
205  !end if 
206 
207  !if (option == 0 .and. fftalgc == 0) then
208  !  MSG_ERROR("Option 0 is buggy when fftalgc ==0 is used!")
209  !end if 
210 
211 !Cuda version of fourwf
212  luse_gpu_cuda=PRESENT(use_gpu_cuda)
213  if (luse_gpu_cuda) luse_gpu_cuda=(luse_gpu_cuda.and.(use_gpu_cuda==1))
214 
215  if(luse_gpu_cuda) then
216 #if defined HAVE_GPU_CUDA
217    call gpu_fourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,&
218 &   kg_kin,kg_kout,mgfft,mpi_enreg,ndat,ngfft,npwin,npwout,n4,n5,n6,option,&
219 &   paral_kgb,tim_fourwf,weight_r,weight_i) !,&
220 !  &  use_ndo,fofginb)
221 #endif
222    call timab(840+tim_fourwf,2,tsec); return
223  end if
224 
225  if ((fftalgc<0 .or. fftalgc>2)) then
226    write(message, '(a,i4,a,a,a,a,a)' )&
227 &   'The input algorithm number fftalg=',fftalg,' is not allowed.',ch10,&
228 &   'The third digit, fftalg(C), must be 0, 1, or 2',ch10,&
229 &   'Action: change fftalg in your input file.'
230    MSG_ERROR(message)
231  end if
232 
233  if (fftalgc/=0 .and. ALL(fftalga/=(/1,3,4,5/)) ) then
234    write(message, '(a,i4,5a)' )&
235 &   'The input algorithm number fftalg=',fftalg,' is not allowed.',ch10,&
236 &   'The first digit must be 1,3,4 when the last digit is not 0.',ch10,&
237 &   'Action : change fftalg in your input file.'
238    MSG_ERROR(message)
239  end if
240 
241  if (option<0 .or. option>3)then
242    write(message, '(a,i4,a,a,a)' )&
243 &   'The option number',option,' is not allowed.',ch10,&
244 &   'Only option=0, 1, 2 or 3 are allowed presently.'
245    MSG_ERROR(message)
246  end if
247 
248  if (option==1 .and. cplex/=1) then
249    write(message, '(a,a,a,i4,a)' )&
250 &   'With the option number 1, cplex must be 1,',ch10,&
251 &   'but it is cplex=',cplex,'.'
252    MSG_ERROR(message)
253  end if
254 
255  if (option==2 .and. (cplex/=1 .and. cplex/=2)) then
256    write(message, '(a,a,a,i4,a)' )&
257 &   'With the option number 2, cplex must be 1 or 2,',ch10,&
258 &   'but it is cplex=',cplex,'.'
259    MSG_ERROR(message)
260  end if
261 
262  ! DMFT uses its own FFT algorithm (that should be wrapped in a different routine!)
263  luse_ndo=.false.
264  if (present(use_ndo).and.present(fofginb)) then
265    if(use_ndo==1) then 
266      luse_ndo=.true.
267      if((size(fofginb,2)==0)) then
268        write(message, '(a,a,a,i4,i5)' )&
269 &       'fofginb has a dimension equal to zero and use_ndo==1',ch10,&
270 &       'Action : check dimension of fofginb',size(fofginb,2),use_ndo
271        MSG_ERROR(message)
272      end if
273    end if
274  end if
275 
276  if (luse_ndo) then 
277    if (.not.(fftalgc==2 .and. option/=3)) then
278      MSG_ERROR("luse_ndo but not .not.(fftalgc==2 .and. option/=3)")
279    end if
280    ABI_CHECK(nproc_fft==1, "DMFT with nproc_fft != 1")
281    ABI_CHECK(ndat == 1, "use_ndo and ndat != 1 not coded")
282 
283    call sg_fftrisc_2(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,&
284 &   istwf_k,kg_kin,kg_kout,&
285 &   mgfft,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_2=weight_i,luse_ndo=luse_ndo,fofgin_p=fofginb)
286    goto 100
287  end if 
288 
289  ! Get the distrib associated with this fft_grid => for i2 and i3 planes
290  call ptabs_fourwf(mpi_enreg,n2,n3,fftn2_distrib,ffti2_local,fftn3_distrib,ffti3_local)
291 
292  ! Branch immediately depending on nproc_fft 
293  if (nproc_fft > 1 .and. fftalg /= 412) then
294    call fourwf_mpi(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,&
295 &   istwf_k,kg_kin,kg_kout,me_g0,mgfft,ngfft,mpi_enreg%distribfft,n1,n2,n3,npwin,npwout,&
296 &   n4,n5,n6,ndat,option,weight_r,weight_i,comm_fft)
297    goto 100
298  end if
299 
300  select case (fftalga)
301 
302  case (FFT_FFTW3)
303    if (luse_ndo) MSG_ERROR("luse_ndo not supported by FFTW3")
304    if (nproc_fft == 1) then
305 !      call wrtout(std_out,"FFTW3_SEQFOURWF","COLL")
306      call fftw3_seqfourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,&
307 &     kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i)
308    else
309      MSG_ERROR("Not coded")
310    end if
311 
312  case (FFT_DFTI) 
313    if (luse_ndo) MSG_ERROR("luse_ndo not supported by DFTI")
314    if (nproc_fft == 1) then
315 !     call wrtout(std_out,"DFTI_SEQFOURWF","COLL")
316      call dfti_seqfourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,&
317 &     kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i)
318    else
319      MSG_ERROR("Not coded")
320    end if
321 
322  case default 
323    ! TODO: Clean this code!
324 
325    ! Here, use routines that make forwards FFT separately of backwards FFT,
326    ! in particular, usual 3DFFT library routines, called in ccfft.
327    if (fftalgc==0 .or. (fftalgc==1 .and. fftalga/=4) .or. &
328 &   (fftalgc==2 .and. fftalga/=4 .and. option==3) )then
329 
330      ABI_ALLOCATE(work1,(2,n4,n5,n6*ndat))
331 
332      if (option/=3)then 
333        ! Insert fofgin into the fft box (array fofr)
334 
335        if (fftalga/=4)then
336          iflag=1
337          call sphere(fofgin,ndat,npwin,fofr,n1,n2,n3,n4,n5,n6,kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one)
338 
339        else if (fftalga==4 .and. fftalgc==0) then
340          ! Note the switch of n5 and n6, as they are only
341          ! needed to dimension work2 inside "sphere"
342          ABI_ALLOCATE(work2,(2,n4,n6,n5*ndat))
343 
344          iflag=2
345          nd2proc=((n2-1)/nproc_fft) +1
346          nd3proc=((n6-1)/nproc_fft) +1
347          ABI_ALLOCATE(work3,(2,n4,n6,nd2proc*ndat))
348          ABI_ALLOCATE(work4,(2,n4,n5,nd3proc*ndat))
349 
350          if (istwf_k == 1 .and. paral_kgb==1) then
351            ! sphere dont need a big array
352            work3=zero
353            call sphere_fft(fofgin,ndat,npwin,work3,n1,n2,n3,n4,n6,kg_kin,&
354 &           mpi_enreg%distribfft%tab_fftwf2_local,nd2proc)
355          else
356            ! sphere needs a big array and communications
357            if (nproc_fft == 1 .and. ndat == 1 .and. istwf_k == 1) then
358              ! dimensions of tab work3 and work2 are identical no need to use work2
359              work3=zero
360              call sphere(fofgin,ndat,npwin,work3,n1,n2,n3,n4,n6,nd2proc,&
361 &             kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one)
362            else
363              work2=zero
364              call sphere(fofgin,ndat,npwin,work2,n1,n2,n3,n4,n6,n5,&
365 &             kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one)
366 
367              if (paral_kgb==1 .and. istwf_k > 1) then
368                ! Collect G-vectors on each node
369                work3=zero
370                ABI_ALLOCATE(work_sum,(2,n4,n6,n5*ndat))
371                call timab(48,1,tsec)
372                call xmpi_sum(work2,work_sum,2*n4*n6*n5*ndat,comm_fft,ier)
373                call timab(48,2,tsec)
374 
375                ! Extract my list of G-vectors needed for MPI-FFT.
376                do idat=1,ndat
377                  do i2=1,n2
378                    if( fftn2_distrib(i2) == me_fft) then
379                      i2_local = ffti2_local(i2) + nd2proc*(idat-1)
380                      do i3=1,n3
381                        do i1=1,n1
382                          work3(1,i1,i3,i2_local)=work_sum(1,i1,i3,i2+n5*(idat-1))
383                          work3(2,i1,i3,i2_local)=work_sum(2,i1,i3,i2+n5*(idat-1))
384                        end do
385                      end do
386                    end if
387                  end do
388                end do
389                ABI_DEALLOCATE(work_sum)
390              end if
391 
392              if (paral_kgb/=1) then
393                do idat=1,ndat
394                  do i2=1,n2
395                    do i3=1,n3
396                      do i1=1,n1
397                        work3(1,i1,i3,i2+nd2proc*(idat-1))=work2(1,i1,i3,i2+n5*(idat-1))
398                        work3(2,i1,i3,i2+nd2proc*(idat-1))=work2(2,i1,i3,i2+n5*(idat-1))
399                      end do
400                    end do
401                  end do
402                end do
403              end if
404            end if
405          end if
406          if (paral_kgb==1) then
407            option_ccfft=1
408          else
409            option_ccfft=2
410          end if
411        end if
412 
413        ! Fourier transform fofr (reciprocal to real space)
414        ! The output might be in work1 or fofr, depending on inplace
415        if (fftalgc==0) then
416          if (fftalga/=4) then 
417            ! Call usual 3DFFT library routines 
418            call ccfft(ngfft,+1,n1,n2,n3,n4,n5,n6,ndat,2,fofr,work1,comm_fft)
419          else 
420            ! SG simplest complex-to-complex routine
421            call ccfft(ngfft,+1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work4,comm_fft)
422            ABI_DEALLOCATE(work2)
423            ABI_DEALLOCATE(work3)
424          end if
425        else
426          ! Call SG routine, with zero padding
427          call sg_fftpad(fftcache,mgfft,n1,n2,n3,n4,n5,n6,ndat,gboundin,+1,fofr,work1)
428        end if
429      end if ! option/=3
430 
431      ! Note that if option==0 everything is alright already, the output is available in fofr.
432      ! MG: TODO: Rewrite this mess in human-readable form!
433      if (option==0) then 
434        if (fftalgc==0) then
435          if (fftalga/=4) then 
436            call DCOPY(2*n4*n5*n6*ndat,work1,1,fofr,1)
437          else 
438            call DCOPY(2*n4*n5*n6*ndat,work4,1,fofr,1)
439          end if
440        else
441          ! Results are copied to fofr.
442          call DCOPY(2*n4*n5*n6*ndat,work1,1,fofr,1)
443        end if
444      end if 
445 
446      if (option==1) then 
447        ! Accumulate density
448        if ((fftalgc==0) .and. (fftalga==4)) then
449          do idat=1,ndat
450            do i3=1,n3
451              if( me_fft == fftn3_distrib(i3) ) then
452                i3_local = ffti3_local(i3) + nd3proc*(idat-1)
453                do i2=1,n2
454                  do i1=1,n1
455                    denpot(i1,i2,i3)=denpot(i1,i2,i3)+&
456 &                   weight_r*work4(1,i1,i2,i3_local)**2+&
457 &                   weight_i*work4(2,i1,i2,i3_local)**2
458                  end do
459                end do
460              end if
461            end do
462          end do ! idat
463        else
464          call cg_addtorho(n1,n2,n3,n4,n5,n6,ndat,weight_r,weight_i,work1,denpot)
465        end if
466      end if ! option==1
467 
468      if (option==2) then 
469        ! Apply local potential
470        if (cplex==1) then
471 
472          if ((fftalgc==0) .and. (fftalga==4)) then
473 !$OMP PARALLEL DO PRIVATE(i3_local,i3_glob) 
474            do idat=1,ndat
475              do i3=1,n3
476                if( me_fft == fftn3_distrib(i3) ) then
477                  i3_local = ffti3_local(i3) + nd3proc*(idat-1)
478                  i3_glob = i3+n3*(idat-1)
479                  do i2=1,n2
480                    do i1=1,n1
481                      fofr(1,i1,i2,i3_glob)= denpot(i1,i2,i3)*work4(1,i1,i2,i3_local)
482                      fofr(2,i1,i2,i3_glob)= denpot(i1,i2,i3)*work4(2,i1,i2,i3_local)
483                    end do
484                  end do
485                end if
486              end do
487            end do
488          end if
489          if ((fftalgc/=0) .or. (fftalga/=4)) then
490 !$OMP PARALLEL DO PRIVATE(i3_glob)
491            do idat=1,ndat
492              do i3=1,n3
493                if( me_fft == fftn3_distrib(i3) ) then
494                  i3_glob = i3+n3*(idat-1)
495                  do i2=1,n2
496                    do i1=1,n1
497                      fofr(1,i1,i2,i3_glob)=denpot(i1,i2,i3)*work1(1,i1,i2,i3+n3*(idat-1))
498                      fofr(2,i1,i2,i3_glob)=denpot(i1,i2,i3)*work1(2,i1,i2,i3+n3*(idat-1))
499                    end do
500                  end do
501                end if
502              end do
503            end do
504          end if
505 
506        else if (cplex==2) then
507          if ((fftalgc==0) .and. (fftalga==4)) then
508 !$OMP PARALLEL DO PRIVATE(fre,fim,i3_local,i3_glob) 
509            do idat=1,ndat
510              do i3=1,n3
511                if( me_fft == fftn3_distrib(i3) ) then
512                  i3_local = ffti3_local(i3) + nd3proc*(idat-1)
513                  i3_glob = i3+n3*(idat-1)
514                  do i2=1,n2
515                    do i1=1,n1
516                      fre=work4(1,i1,i2,i3_local)
517                      fim=work4(2,i1,i2,i3_local)
518                      fofr(1,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fre -denpot(2*i1,i2,i3)*fim
519                      fofr(2,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fim +denpot(2*i1,i2,i3)*fre
520                    end do
521                  end do
522                end if
523              end do
524            end do
525          end if
526 
527          if ((fftalgc/=0) .or. (fftalga/=4)) then
528 !$OMP PARALLEL DO PRIVATE(fre,fim,i3_glob)
529            do idat=1,ndat
530              do i3=1,n3
531                if( me_fft == fftn3_distrib(i3) ) then
532                  i3_glob = i3+n3*(idat-1)
533                  do i2=1,n2
534                    do i1=1,n1
535                      fre=work1(1,i1,i2,i3+n3*(idat-1))
536                      fim=work1(2,i1,i2,i3+n3*(idat-1))
537                      fofr(1,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fre -denpot(2*i1,i2,i3)*fim
538                      fofr(2,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fim +denpot(2*i1,i2,i3)*fre
539                    end do
540                  end do
541                end if
542              end do
543            end do
544          end if
545        end if ! cplex=2
546 
547      end if ! option=2
548 
549      ! The data for option==2 or option==3 is now in fofr.
550      if (option==2 .or. option==3) then
551 
552        if (fftalgc==0) then 
553          ! Call usual 3DFFT library routines or SG simplest complex-to-complex routine
554          if (fftalga==FFT_SG2002) then
555            ABI_DEALLOCATE(work1)
556            ABI_ALLOCATE(work1,(2,n4,n6,n5*ndat))
557          end if
558 
559          if (option==3 .or. fftalga/=4) then
560            call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,2,fofr,work1,comm_fft)
561          else
562            ! creation of small arrays
563            ! nd3proc=((n5-1)/nproc_fft) +1
564            nd3proc=((n6-1)/nproc_fft) +1
565            nd2proc=((n2-1)/nproc_fft) +1
566            ABI_ALLOCATE(work3,(2,n4,n5,nd3proc*ndat))
567            ABI_ALLOCATE(work2,(2,n4,n6,nd2proc*ndat))
568 
569            if (paral_kgb==1) then
570 
571              if (cplex==1) then
572                do idat=1,ndat
573                  do i3=1,n3
574                    if( me_fft == fftn3_distrib(i3) ) then
575                      i3_local = ffti3_local(i3) + nd3proc*(idat-1)
576                      do i2=1,n2
577                        do i1=1,n1
578                          work3(1,i1,i2,i3_local)=denpot(i1,i2,i3)*work4(1,i1,i2,i3_local)
579                          work3(2,i1,i2,i3_local)=denpot(i1,i2,i3)*work4(2,i1,i2,i3_local)
580                        end do
581                      end do
582                    end if
583                  end do
584                end do
585              else
586                do idat=1,ndat
587                  do i3=1,n3
588                    if( me_fft == fftn3_distrib(i3) ) then
589                      i3_local = ffti3_local(i3) + nd3proc*(idat-1)
590                      do i2=1,n2
591                        do i1=1,n1
592                          fre=work4(1,i1,i2,i3_local)
593                          fim=work4(2,i1,i2,i3_local)
594                          work3(1,i1,i2,i3_local) = denpot(2*i1-1,i2,i3)*fre-denpot(2*i1,i2,i3)*fim
595                          work3(2,i1,i2,i3_local) = denpot(2*i1-1,i2,i3)*fim+denpot(2*i1,i2,i3)*fre
596                        end do
597                      end do
598                    end if
599                  end do
600                end do
601              end if
602              option_ccfft=1
603 
604            else 
605              if (nproc_fft /=1 .or. ndat /= 1 ) then
606                do idat=1,ndat
607                  do i3=1,n3
608                    do i2=1,n2
609                      do i1=1,n1
610                        work3(1,i1,i2,i3+nd3proc*(idat-1))=fofr(1,i1,i2,i3+n3*(idat-1))
611                        work3(2,i1,i2,i3+nd3proc*(idat-1))=fofr(2,i1,i2,i3+n3*(idat-1))
612                      end do
613                    end do
614                  end do
615                end do
616                option_ccfft=2
617              end if
618            end if
619 
620            if (paral_kgb==1) then
621              call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work2,comm_fft)
622            else
623              if (nproc_fft /=1 .or. ndat /= 1 ) then
624                call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work2,comm_fft)
625              else
626                call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,fofr,work1,comm_fft)
627              end if
628            end if
629 
630            ! load of work1
631            if ((paral_kgb==1) .and.  ( istwf_k > 1 )) work1(:,:,:,:)=zero
632 
633            if (paral_kgb==1) then
634              if ( istwf_k > 1 ) then
635                do idat=1,ndat
636                  do i2=1,n2
637                    if( me_fft == fftn2_distrib(i2) ) then
638                      i2_local = ffti2_local(i2) + nd2proc*(idat-1)
639                      do i3=1,n3
640                        do i1=1,n1
641                          work1(1,i1,i3,i2+n5*(idat-1))= work2(1,i1,i3,i2_local)
642                          work1(2,i1,i3,i2+n5*(idat-1))= work2(2,i1,i3,i2_local)
643                        end do
644                      end do
645                    end if
646                  end do
647                end do
648              end if
649 
650            else
651              if (nproc_fft /=1 .or. ndat /= 1 ) then
652                do idat=1,ndat
653                  do i2=1,n2
654                    do i3=1,n3
655                      do i1=1,n1
656                        work1(1,i1,i3,i2+n5*(idat-1))=work2(1,i1,i3,i2+nd2proc*(idat-1))
657                        work1(2,i1,i3,i2+n5*(idat-1))=work2(2,i1,i3,i2+nd2proc*(idat-1))
658                      end do
659                    end do
660                  end do
661                end do
662              end if
663            end if
664            ABI_DEALLOCATE(work3)
665            if ((paral_kgb==1) .and.  ( istwf_k > 1 )) then
666              call timab(48,1,tsec)
667              call xmpi_sum(work1,comm_fft,ier)
668              call timab(48,2,tsec)
669            end if
670          end if
671 
672        else
673          ! Call SG routine, with zero padding
674          call sg_fftpad(fftcache,mgfft,n1,n2,n3,n4,n5,n6,ndat,gboundout,-1,fofr,work1)
675        end if
676 
677        xnorm = one/dble(nfftot)
678 
679        if (fftalga/=4) then
680          call cg_box2gsph(n1,n2,n3,n4,n5,n6,ndat,npwout,kg_kout,work1,fofgout, rscal=xnorm)
681        else 
682          ! if fftalga==4
683          if ((paral_kgb==1) .and. ( istwf_k == 1 )) then
684 !$OMP PARALLEL DO PRIVATE(i1,i2,i3,i2_local)
685            do idat=1,ndat
686              do ig=1,npwout
687                i1=kg_kout(1,ig); if(i1<0)i1=i1+n1 ; i1=i1+1
688                i2=kg_kout(2,ig); if(i2<0)i2=i2+n2 ; i2=i2+1
689                i3=kg_kout(3,ig); if(i3<0)i3=i3+n3 ; i3=i3+1
690                i2_local = ffti2_local(i2) + nd2proc*(idat-1)
691                fofgout(1,ig+npwout*(idat-1))= work2(1,i1,i3,i2_local)*xnorm
692                fofgout(2,ig+npwout*(idat-1))= work2(2,i1,i3,i2_local)*xnorm
693              end do
694            end do
695            ABI_DEALLOCATE(work2)
696          else
697 !$OMP PARALLEL DO PRIVATE(i1,i2,i3) 
698            do idat=1,ndat
699              do ig=1,npwout
700                i1=kg_kout(1,ig); if(i1<0)i1=i1+n1; i1=i1+1
701                i2=kg_kout(2,ig); if(i2<0)i2=i2+n2; i2=i2+1
702                i3=kg_kout(3,ig); if(i3<0)i3=i3+n3; i3=i3+1
703                fofgout(1,ig+npwout*(idat-1))=work1(1,i1,i3,i2+n5*(idat-1))*xnorm
704                fofgout(2,ig+npwout*(idat-1))=work1(2,i1,i3,i2+n5*(idat-1))*xnorm
705              end do
706            end do
707          end if
708        end if ! fftalga
709      end if ! if option==2 or 3
710 
711      if (allocated(work1))  then
712        ABI_DEALLOCATE(work1)
713      end if
714    end if
715 
716    ! Here, call more sophisticated specialized 3-dimensional fft
717    ! (zero padding as well as maximize cache reuse) based on S Goedecker routines.
718    ! Specially tuned for cache architectures.
719    if (fftalga==FFT_SG .and. fftalgc==2 .and. option/=3) then
720      call sg_fftrisc(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,&
721 &     istwf_k,kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i)
722    end if
723 
724    ! Here, call new FFT from S Goedecker, also sophisticated specialized 3-dimensional fft
725    ! (zero padding as well as maximize cache reuse)
726    if (fftalga==FFT_SG2002 .and. fftalgc/=0) then
727      ! The args are not the same as fourwf, but might be
728      call fourwf_mpi(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,&
729 &     istwf_k,kg_kin,kg_kout,me_g0,mgfft,ngfft,mpi_enreg%distribfft,n1,n2,n3,npwin,npwout,&
730 &     n4,n5,n6,ndat,option,weight_r,weight_i,comm_fft)
731    end if
732 
733    if (allocated(work4))  then
734      ABI_DEALLOCATE(work4)
735    end if
736    if (allocated(work2))  then
737      ABI_DEALLOCATE(work2)
738    end if
739 
740  end select
741 
742 ! Accumulate timing
743  100 continue
744  call timab(840+tim_fourwf,2,tsec)
745 
746 end subroutine fourwf