TABLE OF CONTENTS
ABINIT/fourwf [ Functions ]
NAME
fourwf
FUNCTION
Carry out composite Fourier transforms between real and reciprocal (G) space. Wavefunctions, contained in a sphere in reciprocal space, can be FFT to real space. They can also be FFT from real space to a sphere. Also, the density maybe accumulated, and a local potential can be applied. The different options are : - option=0 --> reciprocal to real space and output the result. - option=1 --> reciprocal to real space and accumulate the density. - option=2 --> reciprocal to real space, apply the local potential to the wavefunction in real space and produce the result in reciprocal space. - option=3 --> real space to reciprocal space. NOTE that in this case, fftalg=1x1 MUST be used. This may be changed in the future. The different sections of this routine corresponds to different algorithms, used independently of each others : - fftalg=xx0 : use simple complex-to-complex routines, without zero padding (rather simple, so can be used to understand how fourwf.f works); - fftalg=1x1 : use S Goedecker routines, with zero padding (7/12 savings in execution time); - fftalg=1x2 : call even more sophisticated coding also based on S Goedecker routines This routine contains many parts that differ only by small details, in order to treat each case with the better speed. Also for better speed, it uses no F90 construct, except the allocate command and for zeroing arrays.
COPYRIGHT
Copyright (C) 1998-2018 ABINIT group (DCA, XG, GMR, FF) This file is distributed under the terms of the GNU General Public License, see ~abinit/COPYING or http://www.gnu.org/copyleft/gpl.txt . For the initials of contributors, see ~abinit/doc/developers/contributors.txt .
INPUTS
cplex= if 1 , denpot is real, if 2 , denpot is complex (cplex=2 only allowed for option=2, and istwf_k=1) not relevant if option=0 or option=3, so cplex=0 can be used to minimize memory fofgin(2,npwin)=holds input wavefunction in G vector basis sphere. (intent(in) but the routine sphere can modify it for another iflag) gboundin(2*mgfft+8,2)=sphere boundary info for reciprocal to real space gboundout(2*mgfft+8,2)=sphere boundary info for real to reciprocal space istwf_k=option parameter that describes the storage of wfs kg_kin(3,npwin)=reduced planewave coordinates, input kg_kout(3,npwout)=reduced planewave coordinates, output mgfft=maximum size of 1D FFTs mpi_enreg=information about MPI parallelization ndat=number of FFT to do in // ngfft(18)=contain all needed information about 3D FFT, see ~abinit/doc/variables/vargs.htm#ngfft npwin=number of elements in fofgin array (for option 0, 1 and 2) npwout=number of elements in fofgout array (for option 2 and 3) n4,n5,n6=ngfft(4),ngfft(5),ngfft(6), dimensions of fofr. option= if 0: do direct FFT if 1: do direct FFT, then sum the density if 2: do direct FFT, multiply by the potential, then do reverse FFT if 3: do reverse FFT only paral_kgb=Flag related to the kpoint-band-fft parallelism tim_fourwf=timing code of the calling routine (can be set to 0 if not attributed) weight_r=weight to be used for the accumulation of the density in real space (needed only when option=1) weight_i=weight to be used for the accumulation of the density in real space (needed only when option=1 and (fftalg=4 and fftalgc/=0)) fofginb(2,npwin)=holds second input wavefunction in G vector basis sphere. (intent(in) but the routine sphere can modify it for another iflag) (for non diagonal occupation) use_ndo = use non diagonal occupations.
OUTPUT
(see side effects)
SIDE EFFECTS
Input/Output for option==0, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere; fofr(2,n4,n5,n6*ndat) contains the output Fourier Transform of fofgin; no use of denpot, fofgout and npwout. for option==1, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere; denpot(cplex*n4,n5,n6) contains the input density at input, and the updated density at output (accumulated); no use of fofgout and npwout. for option==2, fofgin(2,npwin*ndat)=holds input wavefunction in G sphere; denpot(cplex*n4,n5,n6) contains the input local potential; fofgout(2,npwout*ndat) contains the output function; for option==3, fofr(2,n4,n5,n6*ndat) contains the input real space wavefunction; fofgout(2,npwout*ndat) contains its output Fourier transform; no use of fofgin and npwin.
TODO
Remove paral_kgb, we are already passing mpi_enreg
NOTES
DO NOT CHANGE THE API OF THIS FUNCTION. If you need a specialized routine for the FFT of the wavefunctions, create a wrapper that uses fourwf to accomplish your task. This routine, indeed, has already too many parameters and each change in the API requires a careful modification of the different wrappers used for specialized FFTs such as FFTW3 and MKL-DFTI
PARENTS
dfpt_accrho,dfpt_mkrho,dfptnl_resp,fock_getghc,getgh1c,getghc gwls_hamiltonian,m_cut3d,m_epjdos,m_fft_prof,m_fock,mkrho,mlwfovlp pawmkaewf,pawsushat,posdoppler,prep_fourwf,spin_current,susk,suskmm tddft,vtowfk
CHILDREN
ccfft,cg_addtorho,cg_box2gsph,dcopy,dfti_seqfourwf,fftw3_seqfourwf fourwf_mpi,gpu_fourwf,ptabs_fourwf,sg_fftpad,sg_fftrisc,sg_fftrisc_2 sphere,sphere_fft,timab,xmpi_sum
SOURCE
117 #if defined HAVE_CONFIG_H 118 #include "config.h" 119 #endif 120 121 #include "abi_common.h" 122 123 subroutine fourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,& 124 & kg_kin,kg_kout,mgfft,mpi_enreg,ndat,ngfft,npwin,npwout,n4,n5,n6,option,& 125 & paral_kgb,tim_fourwf,weight_r,weight_i, & 126 & use_gpu_cuda,use_ndo,fofginb) ! Optional arguments 127 128 use defs_basis 129 use defs_abitypes 130 use m_profiling_abi 131 use m_xmpi 132 use m_errors 133 use m_cgtools 134 135 use m_mpinfo, only : ptabs_fourwf 136 use m_fftcore, only : sphere_fft, sphere 137 use m_sgfft, only : sg_fftpad, sg_fftrisc, sg_fftrisc_2 138 use m_dfti, only : dfti_seqfourwf 139 use m_fftw3, only : fftw3_seqfourwf 140 use m_fft, only : fourwf_mpi 141 142 !This section has been created automatically by the script Abilint (TD). 143 !Do not modify the following lines by hand. 144 #undef ABI_FUNC 145 #define ABI_FUNC 'fourwf' 146 use interfaces_18_timing 147 use interfaces_53_ffts, except_this_one => fourwf 148 !End of the abilint section 149 150 implicit none 151 152 !Arguments ------------------------------------ 153 !scalars 154 integer,intent(in) :: cplex,istwf_k,mgfft,n4,n5,n6,ndat,npwin,npwout,option,paral_kgb 155 integer,intent(in) :: tim_fourwf 156 integer,intent(in),optional :: use_gpu_cuda,use_ndo 157 real(dp),intent(in) :: weight_r,weight_i 158 type(MPI_type),intent(in) :: mpi_enreg 159 !arrays 160 integer,intent(in) :: gboundin(2*mgfft+8,2),gboundout(2*mgfft+8,2) 161 integer,intent(in) :: kg_kin(3,npwin),kg_kout(3,npwout),ngfft(18) 162 real(dp),intent(inout) :: denpot(cplex*n4,n5,n6),fofgin(2,npwin*ndat) 163 real(dp),intent(inout),optional :: fofginb(:,:) ! (2,npwin*ndat) 164 real(dp),intent(inout) :: fofr(2,n4,n5,n6*ndat) 165 real(dp),intent(out) :: fofgout(2,npwout*ndat) 166 167 !Local variables------------------------------- 168 !scalars 169 integer :: fftalg,fftalga,fftalgc,fftcache,i1,i2,i2_local,i3,i3_local,i3_glob,idat,ier 170 integer :: iflag,ig,comm_fft,me_g0,me_fft,n1,n2,n3,nd2proc,nd3proc 171 integer :: nfftot,nproc_fft,option_ccfft 172 real(dp) :: fim,fre,xnorm 173 character(len=500) :: message 174 logical :: luse_gpu_cuda,luse_ndo 175 !arrays 176 integer,parameter :: shiftg0(3)=0 177 integer,parameter :: symmE(3,3)=reshape([1,0,0,0,1,0,0,0,1],[3,3]) 178 integer, ABI_CONTIGUOUS pointer :: fftn2_distrib(:),ffti2_local(:) 179 integer, ABI_CONTIGUOUS pointer :: fftn3_distrib(:),ffti3_local(:) 180 real(dp) :: tsec(2) 181 real(dp),allocatable :: work1(:,:,:,:),work2(:,:,:,:),work3(:,:,:,:) 182 real(dp),allocatable :: work4(:,:,:,:),work_sum(:,:,:,:) 183 184 ! ************************************************************************* 185 186 ! Accumulate timing 187 call timab(840+tim_fourwf,1,tsec) 188 189 n1=ngfft(1); n2=ngfft(2); n3=ngfft(3); nfftot=n1*n2*n3 190 fftcache=ngfft(8) 191 fftalg=ngfft(7); fftalga=fftalg/100; fftalgc=mod(fftalg,10) 192 me_fft=ngfft(11) 193 nproc_fft=ngfft(10) 194 195 comm_fft = mpi_enreg%comm_fft; me_g0 = mpi_enreg%me_g0 196 197 !if (ndat/=1) then 198 ! write(std_out,*)fftalg 199 ! MSG_ERROR("Really? I thought nobody uses ndat > 1") 200 !end if 201 202 !if (weight_r /= weight_i) then 203 ! write(std_out,*)fftalg 204 ! MSG_ERROR("Really? I thought nobody uses weight_r != weight_i") 205 !end if 206 207 !if (option == 0 .and. fftalgc == 0) then 208 ! MSG_ERROR("Option 0 is buggy when fftalgc ==0 is used!") 209 !end if 210 211 !Cuda version of fourwf 212 luse_gpu_cuda=PRESENT(use_gpu_cuda) 213 if (luse_gpu_cuda) luse_gpu_cuda=(luse_gpu_cuda.and.(use_gpu_cuda==1)) 214 215 if(luse_gpu_cuda) then 216 #if defined HAVE_GPU_CUDA 217 call gpu_fourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,& 218 & kg_kin,kg_kout,mgfft,mpi_enreg,ndat,ngfft,npwin,npwout,n4,n5,n6,option,& 219 & paral_kgb,tim_fourwf,weight_r,weight_i) !,& 220 ! & use_ndo,fofginb) 221 #endif 222 call timab(840+tim_fourwf,2,tsec); return 223 end if 224 225 if ((fftalgc<0 .or. fftalgc>2)) then 226 write(message, '(a,i4,a,a,a,a,a)' )& 227 & 'The input algorithm number fftalg=',fftalg,' is not allowed.',ch10,& 228 & 'The third digit, fftalg(C), must be 0, 1, or 2',ch10,& 229 & 'Action: change fftalg in your input file.' 230 MSG_ERROR(message) 231 end if 232 233 if (fftalgc/=0 .and. ALL(fftalga/=(/1,3,4,5/)) ) then 234 write(message, '(a,i4,5a)' )& 235 & 'The input algorithm number fftalg=',fftalg,' is not allowed.',ch10,& 236 & 'The first digit must be 1,3,4 when the last digit is not 0.',ch10,& 237 & 'Action : change fftalg in your input file.' 238 MSG_ERROR(message) 239 end if 240 241 if (option<0 .or. option>3)then 242 write(message, '(a,i4,a,a,a)' )& 243 & 'The option number',option,' is not allowed.',ch10,& 244 & 'Only option=0, 1, 2 or 3 are allowed presently.' 245 MSG_ERROR(message) 246 end if 247 248 if (option==1 .and. cplex/=1) then 249 write(message, '(a,a,a,i4,a)' )& 250 & 'With the option number 1, cplex must be 1,',ch10,& 251 & 'but it is cplex=',cplex,'.' 252 MSG_ERROR(message) 253 end if 254 255 if (option==2 .and. (cplex/=1 .and. cplex/=2)) then 256 write(message, '(a,a,a,i4,a)' )& 257 & 'With the option number 2, cplex must be 1 or 2,',ch10,& 258 & 'but it is cplex=',cplex,'.' 259 MSG_ERROR(message) 260 end if 261 262 ! DMFT uses its own FFT algorithm (that should be wrapped in a different routine!) 263 luse_ndo=.false. 264 if (present(use_ndo).and.present(fofginb)) then 265 if(use_ndo==1) then 266 luse_ndo=.true. 267 if((size(fofginb,2)==0)) then 268 write(message, '(a,a,a,i4,i5)' )& 269 & 'fofginb has a dimension equal to zero and use_ndo==1',ch10,& 270 & 'Action : check dimension of fofginb',size(fofginb,2),use_ndo 271 MSG_ERROR(message) 272 end if 273 end if 274 end if 275 276 if (luse_ndo) then 277 if (.not.(fftalgc==2 .and. option/=3)) then 278 MSG_ERROR("luse_ndo but not .not.(fftalgc==2 .and. option/=3)") 279 end if 280 ABI_CHECK(nproc_fft==1, "DMFT with nproc_fft != 1") 281 ABI_CHECK(ndat == 1, "use_ndo and ndat != 1 not coded") 282 283 call sg_fftrisc_2(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,& 284 & istwf_k,kg_kin,kg_kout,& 285 & mgfft,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_2=weight_i,luse_ndo=luse_ndo,fofgin_p=fofginb) 286 goto 100 287 end if 288 289 ! Get the distrib associated with this fft_grid => for i2 and i3 planes 290 call ptabs_fourwf(mpi_enreg,n2,n3,fftn2_distrib,ffti2_local,fftn3_distrib,ffti3_local) 291 292 ! Branch immediately depending on nproc_fft 293 if (nproc_fft > 1 .and. fftalg /= 412) then 294 call fourwf_mpi(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,& 295 & istwf_k,kg_kin,kg_kout,me_g0,mgfft,ngfft,mpi_enreg%distribfft,n1,n2,n3,npwin,npwout,& 296 & n4,n5,n6,ndat,option,weight_r,weight_i,comm_fft) 297 goto 100 298 end if 299 300 select case (fftalga) 301 302 case (FFT_FFTW3) 303 if (luse_ndo) MSG_ERROR("luse_ndo not supported by FFTW3") 304 if (nproc_fft == 1) then 305 ! call wrtout(std_out,"FFTW3_SEQFOURWF","COLL") 306 call fftw3_seqfourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,& 307 & kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i) 308 else 309 MSG_ERROR("Not coded") 310 end if 311 312 case (FFT_DFTI) 313 if (luse_ndo) MSG_ERROR("luse_ndo not supported by DFTI") 314 if (nproc_fft == 1) then 315 ! call wrtout(std_out,"DFTI_SEQFOURWF","COLL") 316 call dfti_seqfourwf(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,istwf_k,& 317 & kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i) 318 else 319 MSG_ERROR("Not coded") 320 end if 321 322 case default 323 ! TODO: Clean this code! 324 325 ! Here, use routines that make forwards FFT separately of backwards FFT, 326 ! in particular, usual 3DFFT library routines, called in ccfft. 327 if (fftalgc==0 .or. (fftalgc==1 .and. fftalga/=4) .or. & 328 & (fftalgc==2 .and. fftalga/=4 .and. option==3) )then 329 330 ABI_ALLOCATE(work1,(2,n4,n5,n6*ndat)) 331 332 if (option/=3)then 333 ! Insert fofgin into the fft box (array fofr) 334 335 if (fftalga/=4)then 336 iflag=1 337 call sphere(fofgin,ndat,npwin,fofr,n1,n2,n3,n4,n5,n6,kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one) 338 339 else if (fftalga==4 .and. fftalgc==0) then 340 ! Note the switch of n5 and n6, as they are only 341 ! needed to dimension work2 inside "sphere" 342 ABI_ALLOCATE(work2,(2,n4,n6,n5*ndat)) 343 344 iflag=2 345 nd2proc=((n2-1)/nproc_fft) +1 346 nd3proc=((n6-1)/nproc_fft) +1 347 ABI_ALLOCATE(work3,(2,n4,n6,nd2proc*ndat)) 348 ABI_ALLOCATE(work4,(2,n4,n5,nd3proc*ndat)) 349 350 if (istwf_k == 1 .and. paral_kgb==1) then 351 ! sphere dont need a big array 352 work3=zero 353 call sphere_fft(fofgin,ndat,npwin,work3,n1,n2,n3,n4,n6,kg_kin,& 354 & mpi_enreg%distribfft%tab_fftwf2_local,nd2proc) 355 else 356 ! sphere needs a big array and communications 357 if (nproc_fft == 1 .and. ndat == 1 .and. istwf_k == 1) then 358 ! dimensions of tab work3 and work2 are identical no need to use work2 359 work3=zero 360 call sphere(fofgin,ndat,npwin,work3,n1,n2,n3,n4,n6,nd2proc,& 361 & kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one) 362 else 363 work2=zero 364 call sphere(fofgin,ndat,npwin,work2,n1,n2,n3,n4,n6,n5,& 365 & kg_kin,istwf_k,iflag,me_g0,shiftg0,symmE,one) 366 367 if (paral_kgb==1 .and. istwf_k > 1) then 368 ! Collect G-vectors on each node 369 work3=zero 370 ABI_ALLOCATE(work_sum,(2,n4,n6,n5*ndat)) 371 call timab(48,1,tsec) 372 call xmpi_sum(work2,work_sum,2*n4*n6*n5*ndat,comm_fft,ier) 373 call timab(48,2,tsec) 374 375 ! Extract my list of G-vectors needed for MPI-FFT. 376 do idat=1,ndat 377 do i2=1,n2 378 if( fftn2_distrib(i2) == me_fft) then 379 i2_local = ffti2_local(i2) + nd2proc*(idat-1) 380 do i3=1,n3 381 do i1=1,n1 382 work3(1,i1,i3,i2_local)=work_sum(1,i1,i3,i2+n5*(idat-1)) 383 work3(2,i1,i3,i2_local)=work_sum(2,i1,i3,i2+n5*(idat-1)) 384 end do 385 end do 386 end if 387 end do 388 end do 389 ABI_DEALLOCATE(work_sum) 390 end if 391 392 if (paral_kgb/=1) then 393 do idat=1,ndat 394 do i2=1,n2 395 do i3=1,n3 396 do i1=1,n1 397 work3(1,i1,i3,i2+nd2proc*(idat-1))=work2(1,i1,i3,i2+n5*(idat-1)) 398 work3(2,i1,i3,i2+nd2proc*(idat-1))=work2(2,i1,i3,i2+n5*(idat-1)) 399 end do 400 end do 401 end do 402 end do 403 end if 404 end if 405 end if 406 if (paral_kgb==1) then 407 option_ccfft=1 408 else 409 option_ccfft=2 410 end if 411 end if 412 413 ! Fourier transform fofr (reciprocal to real space) 414 ! The output might be in work1 or fofr, depending on inplace 415 if (fftalgc==0) then 416 if (fftalga/=4) then 417 ! Call usual 3DFFT library routines 418 call ccfft(ngfft,+1,n1,n2,n3,n4,n5,n6,ndat,2,fofr,work1,comm_fft) 419 else 420 ! SG simplest complex-to-complex routine 421 call ccfft(ngfft,+1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work4,comm_fft) 422 ABI_DEALLOCATE(work2) 423 ABI_DEALLOCATE(work3) 424 end if 425 else 426 ! Call SG routine, with zero padding 427 call sg_fftpad(fftcache,mgfft,n1,n2,n3,n4,n5,n6,ndat,gboundin,+1,fofr,work1) 428 end if 429 end if ! option/=3 430 431 ! Note that if option==0 everything is alright already, the output is available in fofr. 432 ! MG: TODO: Rewrite this mess in human-readable form! 433 if (option==0) then 434 if (fftalgc==0) then 435 if (fftalga/=4) then 436 call DCOPY(2*n4*n5*n6*ndat,work1,1,fofr,1) 437 else 438 call DCOPY(2*n4*n5*n6*ndat,work4,1,fofr,1) 439 end if 440 else 441 ! Results are copied to fofr. 442 call DCOPY(2*n4*n5*n6*ndat,work1,1,fofr,1) 443 end if 444 end if 445 446 if (option==1) then 447 ! Accumulate density 448 if ((fftalgc==0) .and. (fftalga==4)) then 449 do idat=1,ndat 450 do i3=1,n3 451 if( me_fft == fftn3_distrib(i3) ) then 452 i3_local = ffti3_local(i3) + nd3proc*(idat-1) 453 do i2=1,n2 454 do i1=1,n1 455 denpot(i1,i2,i3)=denpot(i1,i2,i3)+& 456 & weight_r*work4(1,i1,i2,i3_local)**2+& 457 & weight_i*work4(2,i1,i2,i3_local)**2 458 end do 459 end do 460 end if 461 end do 462 end do ! idat 463 else 464 call cg_addtorho(n1,n2,n3,n4,n5,n6,ndat,weight_r,weight_i,work1,denpot) 465 end if 466 end if ! option==1 467 468 if (option==2) then 469 ! Apply local potential 470 if (cplex==1) then 471 472 if ((fftalgc==0) .and. (fftalga==4)) then 473 !$OMP PARALLEL DO PRIVATE(i3_local,i3_glob) 474 do idat=1,ndat 475 do i3=1,n3 476 if( me_fft == fftn3_distrib(i3) ) then 477 i3_local = ffti3_local(i3) + nd3proc*(idat-1) 478 i3_glob = i3+n3*(idat-1) 479 do i2=1,n2 480 do i1=1,n1 481 fofr(1,i1,i2,i3_glob)= denpot(i1,i2,i3)*work4(1,i1,i2,i3_local) 482 fofr(2,i1,i2,i3_glob)= denpot(i1,i2,i3)*work4(2,i1,i2,i3_local) 483 end do 484 end do 485 end if 486 end do 487 end do 488 end if 489 if ((fftalgc/=0) .or. (fftalga/=4)) then 490 !$OMP PARALLEL DO PRIVATE(i3_glob) 491 do idat=1,ndat 492 do i3=1,n3 493 if( me_fft == fftn3_distrib(i3) ) then 494 i3_glob = i3+n3*(idat-1) 495 do i2=1,n2 496 do i1=1,n1 497 fofr(1,i1,i2,i3_glob)=denpot(i1,i2,i3)*work1(1,i1,i2,i3+n3*(idat-1)) 498 fofr(2,i1,i2,i3_glob)=denpot(i1,i2,i3)*work1(2,i1,i2,i3+n3*(idat-1)) 499 end do 500 end do 501 end if 502 end do 503 end do 504 end if 505 506 else if (cplex==2) then 507 if ((fftalgc==0) .and. (fftalga==4)) then 508 !$OMP PARALLEL DO PRIVATE(fre,fim,i3_local,i3_glob) 509 do idat=1,ndat 510 do i3=1,n3 511 if( me_fft == fftn3_distrib(i3) ) then 512 i3_local = ffti3_local(i3) + nd3proc*(idat-1) 513 i3_glob = i3+n3*(idat-1) 514 do i2=1,n2 515 do i1=1,n1 516 fre=work4(1,i1,i2,i3_local) 517 fim=work4(2,i1,i2,i3_local) 518 fofr(1,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fre -denpot(2*i1,i2,i3)*fim 519 fofr(2,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fim +denpot(2*i1,i2,i3)*fre 520 end do 521 end do 522 end if 523 end do 524 end do 525 end if 526 527 if ((fftalgc/=0) .or. (fftalga/=4)) then 528 !$OMP PARALLEL DO PRIVATE(fre,fim,i3_glob) 529 do idat=1,ndat 530 do i3=1,n3 531 if( me_fft == fftn3_distrib(i3) ) then 532 i3_glob = i3+n3*(idat-1) 533 do i2=1,n2 534 do i1=1,n1 535 fre=work1(1,i1,i2,i3+n3*(idat-1)) 536 fim=work1(2,i1,i2,i3+n3*(idat-1)) 537 fofr(1,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fre -denpot(2*i1,i2,i3)*fim 538 fofr(2,i1,i2,i3_glob)=denpot(2*i1-1,i2,i3)*fim +denpot(2*i1,i2,i3)*fre 539 end do 540 end do 541 end if 542 end do 543 end do 544 end if 545 end if ! cplex=2 546 547 end if ! option=2 548 549 ! The data for option==2 or option==3 is now in fofr. 550 if (option==2 .or. option==3) then 551 552 if (fftalgc==0) then 553 ! Call usual 3DFFT library routines or SG simplest complex-to-complex routine 554 if (fftalga==FFT_SG2002) then 555 ABI_DEALLOCATE(work1) 556 ABI_ALLOCATE(work1,(2,n4,n6,n5*ndat)) 557 end if 558 559 if (option==3 .or. fftalga/=4) then 560 call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,2,fofr,work1,comm_fft) 561 else 562 ! creation of small arrays 563 ! nd3proc=((n5-1)/nproc_fft) +1 564 nd3proc=((n6-1)/nproc_fft) +1 565 nd2proc=((n2-1)/nproc_fft) +1 566 ABI_ALLOCATE(work3,(2,n4,n5,nd3proc*ndat)) 567 ABI_ALLOCATE(work2,(2,n4,n6,nd2proc*ndat)) 568 569 if (paral_kgb==1) then 570 571 if (cplex==1) then 572 do idat=1,ndat 573 do i3=1,n3 574 if( me_fft == fftn3_distrib(i3) ) then 575 i3_local = ffti3_local(i3) + nd3proc*(idat-1) 576 do i2=1,n2 577 do i1=1,n1 578 work3(1,i1,i2,i3_local)=denpot(i1,i2,i3)*work4(1,i1,i2,i3_local) 579 work3(2,i1,i2,i3_local)=denpot(i1,i2,i3)*work4(2,i1,i2,i3_local) 580 end do 581 end do 582 end if 583 end do 584 end do 585 else 586 do idat=1,ndat 587 do i3=1,n3 588 if( me_fft == fftn3_distrib(i3) ) then 589 i3_local = ffti3_local(i3) + nd3proc*(idat-1) 590 do i2=1,n2 591 do i1=1,n1 592 fre=work4(1,i1,i2,i3_local) 593 fim=work4(2,i1,i2,i3_local) 594 work3(1,i1,i2,i3_local) = denpot(2*i1-1,i2,i3)*fre-denpot(2*i1,i2,i3)*fim 595 work3(2,i1,i2,i3_local) = denpot(2*i1-1,i2,i3)*fim+denpot(2*i1,i2,i3)*fre 596 end do 597 end do 598 end if 599 end do 600 end do 601 end if 602 option_ccfft=1 603 604 else 605 if (nproc_fft /=1 .or. ndat /= 1 ) then 606 do idat=1,ndat 607 do i3=1,n3 608 do i2=1,n2 609 do i1=1,n1 610 work3(1,i1,i2,i3+nd3proc*(idat-1))=fofr(1,i1,i2,i3+n3*(idat-1)) 611 work3(2,i1,i2,i3+nd3proc*(idat-1))=fofr(2,i1,i2,i3+n3*(idat-1)) 612 end do 613 end do 614 end do 615 end do 616 option_ccfft=2 617 end if 618 end if 619 620 if (paral_kgb==1) then 621 call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work2,comm_fft) 622 else 623 if (nproc_fft /=1 .or. ndat /= 1 ) then 624 call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,work3,work2,comm_fft) 625 else 626 call ccfft(ngfft,-1,n1,n2,n3,n4,n5,n6,ndat,option_ccfft,fofr,work1,comm_fft) 627 end if 628 end if 629 630 ! load of work1 631 if ((paral_kgb==1) .and. ( istwf_k > 1 )) work1(:,:,:,:)=zero 632 633 if (paral_kgb==1) then 634 if ( istwf_k > 1 ) then 635 do idat=1,ndat 636 do i2=1,n2 637 if( me_fft == fftn2_distrib(i2) ) then 638 i2_local = ffti2_local(i2) + nd2proc*(idat-1) 639 do i3=1,n3 640 do i1=1,n1 641 work1(1,i1,i3,i2+n5*(idat-1))= work2(1,i1,i3,i2_local) 642 work1(2,i1,i3,i2+n5*(idat-1))= work2(2,i1,i3,i2_local) 643 end do 644 end do 645 end if 646 end do 647 end do 648 end if 649 650 else 651 if (nproc_fft /=1 .or. ndat /= 1 ) then 652 do idat=1,ndat 653 do i2=1,n2 654 do i3=1,n3 655 do i1=1,n1 656 work1(1,i1,i3,i2+n5*(idat-1))=work2(1,i1,i3,i2+nd2proc*(idat-1)) 657 work1(2,i1,i3,i2+n5*(idat-1))=work2(2,i1,i3,i2+nd2proc*(idat-1)) 658 end do 659 end do 660 end do 661 end do 662 end if 663 end if 664 ABI_DEALLOCATE(work3) 665 if ((paral_kgb==1) .and. ( istwf_k > 1 )) then 666 call timab(48,1,tsec) 667 call xmpi_sum(work1,comm_fft,ier) 668 call timab(48,2,tsec) 669 end if 670 end if 671 672 else 673 ! Call SG routine, with zero padding 674 call sg_fftpad(fftcache,mgfft,n1,n2,n3,n4,n5,n6,ndat,gboundout,-1,fofr,work1) 675 end if 676 677 xnorm = one/dble(nfftot) 678 679 if (fftalga/=4) then 680 call cg_box2gsph(n1,n2,n3,n4,n5,n6,ndat,npwout,kg_kout,work1,fofgout, rscal=xnorm) 681 else 682 ! if fftalga==4 683 if ((paral_kgb==1) .and. ( istwf_k == 1 )) then 684 !$OMP PARALLEL DO PRIVATE(i1,i2,i3,i2_local) 685 do idat=1,ndat 686 do ig=1,npwout 687 i1=kg_kout(1,ig); if(i1<0)i1=i1+n1 ; i1=i1+1 688 i2=kg_kout(2,ig); if(i2<0)i2=i2+n2 ; i2=i2+1 689 i3=kg_kout(3,ig); if(i3<0)i3=i3+n3 ; i3=i3+1 690 i2_local = ffti2_local(i2) + nd2proc*(idat-1) 691 fofgout(1,ig+npwout*(idat-1))= work2(1,i1,i3,i2_local)*xnorm 692 fofgout(2,ig+npwout*(idat-1))= work2(2,i1,i3,i2_local)*xnorm 693 end do 694 end do 695 ABI_DEALLOCATE(work2) 696 else 697 !$OMP PARALLEL DO PRIVATE(i1,i2,i3) 698 do idat=1,ndat 699 do ig=1,npwout 700 i1=kg_kout(1,ig); if(i1<0)i1=i1+n1; i1=i1+1 701 i2=kg_kout(2,ig); if(i2<0)i2=i2+n2; i2=i2+1 702 i3=kg_kout(3,ig); if(i3<0)i3=i3+n3; i3=i3+1 703 fofgout(1,ig+npwout*(idat-1))=work1(1,i1,i3,i2+n5*(idat-1))*xnorm 704 fofgout(2,ig+npwout*(idat-1))=work1(2,i1,i3,i2+n5*(idat-1))*xnorm 705 end do 706 end do 707 end if 708 end if ! fftalga 709 end if ! if option==2 or 3 710 711 if (allocated(work1)) then 712 ABI_DEALLOCATE(work1) 713 end if 714 end if 715 716 ! Here, call more sophisticated specialized 3-dimensional fft 717 ! (zero padding as well as maximize cache reuse) based on S Goedecker routines. 718 ! Specially tuned for cache architectures. 719 if (fftalga==FFT_SG .and. fftalgc==2 .and. option/=3) then 720 call sg_fftrisc(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,& 721 & istwf_k,kg_kin,kg_kout,mgfft,ndat,ngfft,npwin,npwout,n4,n5,n6,option,weight_r,weight_i) 722 end if 723 724 ! Here, call new FFT from S Goedecker, also sophisticated specialized 3-dimensional fft 725 ! (zero padding as well as maximize cache reuse) 726 if (fftalga==FFT_SG2002 .and. fftalgc/=0) then 727 ! The args are not the same as fourwf, but might be 728 call fourwf_mpi(cplex,denpot,fofgin,fofgout,fofr,gboundin,gboundout,& 729 & istwf_k,kg_kin,kg_kout,me_g0,mgfft,ngfft,mpi_enreg%distribfft,n1,n2,n3,npwin,npwout,& 730 & n4,n5,n6,ndat,option,weight_r,weight_i,comm_fft) 731 end if 732 733 if (allocated(work4)) then 734 ABI_DEALLOCATE(work4) 735 end if 736 if (allocated(work2)) then 737 ABI_DEALLOCATE(work2) 738 end if 739 740 end select 741 742 ! Accumulate timing 743 100 continue 744 call timab(840+tim_fourwf,2,tsec) 745 746 end subroutine fourwf