TUNING OF THE ABINIT CODE

This file is a complement to abinit_help, and describes how to tune
ABINIT:
- some information about large numbers and large arrays;
- how to reduce the memory needs at the expense of (some) CPU time
  or wall clock time;
- a brief introduction to the most CPU time consuming routines.

Before tuning ABINIT, you should understand how to obtain accurate
results for the physical problem of interest. This is treated in
'abinit_help', not here.
__________________________________________________________________

See a recent version of the file 'new_user_guide' for an introduction
to the ABINIT package. See a recent version of the file 'abinit_help'
for learning how to use the code, as well as for indications about the
numerical accuracy of the code. These files can be found in the
~abinit/doc subdirectory.

Any comment or suggestion to improve the present 'tuning' file is
welcome; simply contact the ABINIT group.

Copyright (C) 1999-2014 ABINIT group (XG,DCA)
This file is distributed under the terms of the GNU General Public
License, see ~abinit/COPYING or http://www.gnu.org/copyleft/gpl.txt .
For the initials of contributors, see
~abinit/doc/developers/contributors.txt .

*******************************************************************

0. Time and memory.
1. Basic variables.
2. Large arrays.
3. Tuning the memory use.
4. The CPU time consuming routines.

*******************************************************************

0. Time and memory

The "wall clock" time is the time seen by the user between the
beginning and the end of a job. It will of course depend on the load
of the machine (if the machine can run many processes concurrently)
and on the number of processors of the machine (if the code can be
used in parallel). It includes the time for all operations: straight
number crunching, but also system interrupts and input/output (I/O)
operations.

The "CPU" time is an internal measure of the time that the job spends
on the Central Processing Unit(s) (the CPUs). It excludes I/O time,
and should depend little on the load of the machine (in practice,
this is not true for all machines, since there is some overhead for
switching from one user to another).

The (central, or core) memory is the space for arrays directly
addressed by the code. On PC computers, it is presently (Feb 1999) on
the order of 32 to 128 MB. Disk space can be used by I/O operations,
which are much slower than addressing central memory.

ABINIT estimates the memory needs for each dataset at the beginning
of the run, and prints, at the end of the run, the CPU and wall clock
time effectively used by the complete job.
__________________________________________________________________

1. Basic variables

It is important to understand first the behaviour of a few variables
(natom, mband, ucvol, mpw, nfft, nkpt, and nsppol) with respect to the
characteristics of the system under investigation. Some of them have
already been described in the abinit_help file.

1.A. The number of atoms

We will consider the number of atoms in the unit cell, natom, as the
primary source of memory and CPU time scaling. It is an input
variable. Roughly speaking, the number of bands, nband, another input
variable, will scale linearly with the number of atoms. Since the
number of bands may depend on the k-point (if occopt==2), we will use
the maximum number of bands over the k-points, called mband, as the
basic variable instead of nband.

1.B. The volume of the cell

The volume of the unit cell (ucvol) is computed from the input
variables acell and rprim. In many cases, the volume of the cell is
proportional to the number of atoms. However, there are cases in which
the volume can be increased while the number of atoms is fixed: an
isolated atom or molecule in a supercell, or a slab calculation (an
elongated cell with a significant fraction of vacuum).

1.C. The spatial resolution

The cell size and number of atoms being fixed, the memory and CPU
time needs may still vary, as they depend on the size of the smallest
features of the wavefunctions and density that must be described
correctly. This is determined by the kinetic energy cutoff of the
plane wave basis set, given by the input variable ecut. The largest
wavevector of the plane wave basis set, contained in spheres centered
on each k-point, scales as the square root of ecut. The number of
plane waves scales as the volume of these spheres, and thus as the
3/2 power of ecut. Since the spheres are different for each k-point,
ecut determines different plane wave sets, whose maximum size is
called mpw (maximum number of plane waves). When the cell size is
allowed to vary, mpw varies linearly with the volume of the cell,
ucvol.

The wavefunctions are treated both in reciprocal space (using the
coefficients of plane waves) and in real space (using a Fourier
transformed set). The transform between real and reciprocal space,
done by the Fast Fourier Transform algorithm, relies on 3D homogeneous
grids of points. The number of points of this grid, called nfft, is
approximately 16 times the number of plane waves (see section 4.A).
This number is usually very large (between 2000 and 10^6).
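
As a rough illustration of these scaling laws, here is a
back-of-the-envelope sketch in Python (not part of ABINIT; the
prefactor simply divides the volume of the cutoff sphere by the
reciprocal-space volume occupied by one plane wave, in atomic units):

  import math

  def estimate_npw(ecut, ucvol):
      # Illustrative only: volume of the sphere of radius
      # kmax = sqrt(2*ecut) (ecut in Hartree, ucvol in Bohr^3),
      # divided by the volume (2*pi)^3/ucvol per plane wave.
      kmax = math.sqrt(2.0 * ecut)
      return (4.0 / 3.0) * math.pi * kmax ** 3 * ucvol / (2.0 * math.pi) ** 3

  def estimate_nfft(npw):
      # The FFT grid holds about 16 times more points than the
      # plane wave sphere (see section 4.A).
      return 16 * npw

  npw = estimate_npw(20.0, 1000.0)  # e.g. a 1000 Bohr^3 cell at ecut = 20 Ha
  print("mpw ~ %.0f, nfft ~ %.0f" % (npw, estimate_nfft(npw)))

The actual mpw reported by ABINIT differs somewhat from this estimate,
since the spheres are centered on the k-points.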
1.D. Sampling the Brillouin zone

Since ABINIT is based on periodic boundary conditions, the Brillouin
zone must be sampled adequately. The number of k-points to be used for
this sampling, in the full Brillouin zone, is inversely proportional
to ucvol, but may also vary a lot from system to system. As a rule of
thumb, a system with a large band gap will need few k-points, while
metals will need a lot of k-points to produce converged results. For
large systems, this inverse scaling with respect to ucvol
unfortunately saturates, because at least one k-point must be used.
The effective number of k-points will be strongly influenced by the
symmetries of the system, since only the irreducible part of the
Brillouin zone must be sampled. Moreover, the time-reversal symmetry
(k equivalent to -k) can be used for ground-state calculations,
sometimes reducing even further the portion of the Brillouin zone to
be sampled. The number of k-points to be used in a calculation is
named nkpt.

There is another way to take advantage of the time-reversal symmetry,
in the specific case of k-points that are invariant under k => -k, or
that are mapped onto a vector differing from the original one by some
vector of the reciprocal lattice. See below for more explanation about
the advantages of using these k-points.

As a rule of thumb, for homogeneous systems, a reasonable accuracy may
be reached when the product of the number of atoms by the number of
k-points in the full Brillouin zone is on the order of 50 or larger
for wide gap insulators, on the order of 250 for small gap
semiconductors like Si, and beyond 500 for metals, depending on the
value of the input variable tsmear. As soon as there is some vacuum in
the system, the product natom * nkpt can be much smaller than this
(for an isolated molecule in a sufficiently large supercell, one
k-point is enough).
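
The following toy helper (hypothetical, not an ABINIT routine) does
nothing more than encode this rule of thumb; production k-point grids
should always be validated by explicit convergence tests:

  def nkpt_full_bz_rule_of_thumb(natom, kind):
      # Illustrative targets for the product natom * nkpt (full BZ):
      # ~50 (wide gap insulator), ~250 (small gap semiconductor),
      # >500 (metal, depending on tsmear). At least one k-point.
      target = {"wide_gap": 50, "small_gap": 250, "metal": 500}[kind]
      return max(1, -(-target // natom))   # ceiling division

  print(nkpt_full_bz_rule_of_thumb(2, "small_gap"))   # Si: ~125 k-points
  print(nkpt_full_bz_rule_of_thumb(64, "wide_gap"))   # large cell: 1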
1.E. The number of spin polarizations

Finally, many treatments and arrays must be duplicated when
calculations involve spin-polarized systems. The basic variable here
is nsppol.
____________________________________________________________________________

2. Large arrays.

There are many large arrays used by ABINIT. Here, we describe the most
important ones.

2.A. The wavefunction arrays

The wavefunction coefficients describe, for each k-point, each band,
and each spin polarization, the wavefunctions in reciprocal space:
that is, for each wavefunction, a set of npw (number of plane waves)
double precision complex coefficients. As presently coded, the number
of bytes taken by the whole set of numbers is the product

  nkpt * mband * nsppol * mpw * 16

(16 is the number of bytes for a double precision complex number).
Starting with small systems, for which the number of k-points scales
as the inverse of the system size, the size of this set of numbers is
roughly proportional to the system size (nkpt is inversely
proportional, while mband and mpw are proportional). When the number
of k-points has been reduced to one, the scaling becomes quadratic in
the system size (scaling of mband and mpw).

In the specific case of the following 8 k-points:
(0 0 0), (0 0 1/2), (0 1/2 0), (0 1/2 1/2),
(1/2 0 0), (1/2 0 1/2), (1/2 1/2 0), (1/2 1/2 1/2),
the use of time-reversal symmetry has been implemented. It connects
the coefficients of the wavefunctions that correspond to G+k and
-G-k: each must be the complex conjugate of the other. So, the number
of plane waves explicitly treated in ABINIT can be taken as half the
usual number of plane waves (see the istwfk input variable). We will
refer to these k-points as TR (Time-Reversal) k-points.

If the time-reversal symmetry is not taken into account, special
k-point grids based on TR k-points are not the most efficient ones.
But the CPU time gains due to the use of time-reversal symmetry can
change this appreciably. For example, for a case without spatial
symmetry, the shifted 2x2x2 k-point grid reduces to 4 k-points:
(1/4 1/4 1/4), (1/4 1/4 -1/4), (1/4 -1/4 1/4), (1/4 -1/4 -1/4),
while a non-shifted 2x2x2 k-point grid gives precisely the set of 8
k-points mentioned above. The accuracy of both k-point sets should be
about the same. If time-reversal symmetry is not taken into account,
the shifted grid is obviously more advantageous (half the CPU time,
the same memory), but taking the time-reversal symmetry into account
gives about the same CPU time for both grids, and only half the
memory needs for the non-shifted grid.

2.B. The FFT arrays

There are many arrays that are dimensioned with nfft. Fortunately, no
other variable that scales as ucvol or natom appears in their other
dimensions, only small numbers like nsppol, or constant factors like
2 or 4 (the latter only for GGA). So, the overall scaling is linear
with system size. However, the total space for these arrays usually
dominates the memory needs when natom is lower than about 20-40,
since nfft is a very large number. The total size is on the order of
15 to 50 times nfft double precision complex numbers (16 bytes each).
The prefactor will be largest for spin-polarized calculations, GGA
calculations, and when iscf=5 is used instead of iscf=2 or 3.

2.C. The phases for the non-local form factors

The last big array to be described is the set of phases generated by
the scalar product of the plane wave vectors in reciprocal space with
the atomic positions in real space. Its size in bytes is

  mpw * natom * 16

(16 is the number of bytes for a double precision complex number).
The scaling of this array is quadratic in the system size.
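
The three estimates of this section can be combined in a short
illustrative sketch (Python, not ABINIT code; the nfft_prefactor
argument stands for the 15 to 50 nfft-sized arrays of section 2.B):

  def large_array_bytes(nkpt, mband, nsppol, mpw, natom, nfft,
                        nfft_prefactor=30):
      # Illustrative only: 16 bytes per double precision complex.
      wf   = nkpt * mband * nsppol * mpw * 16   # 2.A: wavefunctions
      fft  = nfft_prefactor * nfft * 16         # 2.B: nfft-sized arrays
      ph3d = mpw * natom * 16                   # 2.C: non-local phases
      return wf, fft, ph3d

  # Example: 4 k-points, 64 bands, 10000 plane waves, 16 atoms,
  # nfft = 160000 (about 16 * mpw).
  for name, b in zip(("wf", "fft", "ph3d"),
                     large_array_bytes(4, 64, 1, 10000, 16, 160000)):
      print("%-4s ~ %6.1f MB" % (name, b / 1024.0 ** 2))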
____________________________________________________________________________

3. Tuning the memory use.

Reducing the central memory use can be done by placing a few of these
arrays on disk and reading them when needed, or by recomputing them
when needed. This is accomplished through the use of the input
variables mkmem, mffmem and nloalg.

3.A. The wavefunction arrays

When the wavefunctions are modified to find the lowest energy, the
wavefunction coefficients for at least one k-point must stay in
central memory. The wavefunction coefficients for the other k-points
can be placed on disk. This is done by setting the input variable
mkmem to 0. In this case, the minimal memory use for this array is

  mband * nsppol * mpw * 16

(16 is the number of bytes for a double precision complex number).
Its scaling is quadratic with the system size, but its prefactor is
much smaller than that of the FFT arrays, so that it does not dominate
the memory use until the number of atoms exceeds 20-40. The penalty
for the use of disk space will mainly depend on the speed of the disk
I/O system. It is obviously better to have the files saved on a disk
residing on the computing node. This can be accomplished by choosing
correctly the root name of the temporary files, in the 'files' file
(see abinit_help). In this case, the wall clock time may increase by
only a few percent. For the TR k-points, the number of bytes mentioned
above is decreased by a factor of two.

3.B. The FFT arrays

One can also take advantage of disk space for some of the FFT arrays:
simply set the input variable mffmem to 0. The reduction of memory
space will not be very large, however: on the order of 20-40% for
these arrays. The penalty will be similar to the one seen when mkmem
is set to 0. Note that iscf=2 or iscf=3 are less demanding than
iscf=5 at the level of these FFT arrays. However, unlike using disk
space, using a different iscf value may affect the convergence. Also,
intxc=0 is less demanding than intxc=1. In most other electronic
structure codes, you will not find an equivalent to the intxc=1
option. It gives a better accuracy on the xc energy and potential,
but in the vast majority of cases this additional accuracy does not
matter.

3.C. The phases of the non-local form factors

These can be recomputed in blocks, so that the maximal size will be
on the order of mpw * 10 * 16 bytes, and will not scale quadratically
with the system size. Simply set the input variable nloalg to a
negative value (see the abinit_help file). The penalty for
recomputing the phase factors will be seen in the CPU time of the
nonlop routine: perhaps an increase of 20 to 40% of the CPU time in
this routine.
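
As an illustration of the savings described in sections 3.A and 3.C,
here is a small hypothetical sketch (not ABINIT code; it assumes the
block of about 10 atoms quoted above):

  def wf_and_phase_bytes(nkpt, mband, nsppol, mpw, natom,
                         mkmem_zero=False, nloalg_negative=False):
      # Illustrative only. Wavefunctions: all k-points in memory by
      # default; with mkmem = 0 only one k-point block stays in
      # central memory (section 3.A).
      wf = (1 if mkmem_zero else nkpt) * mband * nsppol * mpw * 16
      # Phases: full mpw*natom array by default; with a negative
      # nloalg they are recomputed in blocks of ~10 atoms (section 3.C).
      ph3d = mpw * (10 if nloalg_negative else natom) * 16
      return wf + ph3d

  default = wf_and_phase_bytes(4, 64, 1, 10000, 16)
  tuned   = wf_and_phase_bytes(4, 64, 1, 10000, 16, True, True)
  print("default ~ %.1f MB, tuned ~ %.1f MB"
        % (default / 1024.0 ** 2, tuned / 1024.0 ** 2))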
____________________________________________________________________________

4. The CPU time consuming routines.

When the number of atoms in the system exceeds 5-20, only a few
routines contribute significantly to the CPU time of the job. These
are fourwf, nonlop and projbd. Fourwf scales as the square of the
system size times its logarithm, with a large prefactor, while nonlop
and projbd scale as the cube of the system size. The latter two will
dominate for really large systems.

The routines fourwf and nonlop are called every time the Hamiltonian
is applied to a trial wavefunction or wavefunction change. The routine
projbd is called after that application is done. The number of such
applications is already large, since it is on the order of the product
of mband, nkpt, the number of line minimizations (whose maximum is the
input variable nline), the number of SCF cycles (whose maximum is the
input variable nstep), and the number of 'time steps' (whose maximum
is the input variable ntime). Improving the convergence of the overall
procedure, so as to decrease the number of line minimizations, SCF
cycles and time steps, might lead to a considerable speed-up of the
code, but the corresponding techniques are not described in the
present document. We focus here on the time associated with these
three routines when the number of applications of the Hamiltonian is
kept fixed, except for the linear scaling of mband with ucvol. The
product of mband by nkpt does not scale with the size of the system
when the number of k-points is not reduced to one, but scales linearly
with it otherwise. The other factors may have some weak scaling with
the dimension of the system, but this is also not described here.

4.A. The wavefunction Fourier transform routine (fourwf)

The routine fourwf performs the Fourier transform of the
wavefunctions from real to reciprocal space and vice versa. When the
wavefunction is obtained in real space, the local potential operator
may be applied to it; then the product is transformed back to
reciprocal space. Due to the Nyquist theorem (see standard books on
Fourier transforms), the spacing of the grid in real space must be at
most half the smallest wavelength of the plane waves in order for the
Fourier transform on a grid to give an exact numerical result. Thus,
the plane wave basis sphere can be inscribed in a rectangular grid of
half the linear dimensions of the FFT grid. Taking into account the
ratio between the volume of a sphere and the volume of the
parallelepipedic box in which it is placed, we arrive at a ratio of
about 16 between the number of FFT grid points and the number of
plane wave coefficients.

As a related technical issue, such Fourier transforms between the
sphere of plane wave coefficients and the FFT grid in real space can
be done either in a naive way, by inserting the plane wave sphere in
an FFT box in reciprocal space and filling the remaining grid with
zeroes before performing a usual 3D FFT, or in a more clever way, by
doing computations only on the sets of numbers that are not all
zeros. If the FFT algorithm is otherwise unchanged, the latter
technique uses only 7/12 of the CPU time of the former. If the
wavefunction corresponds to a TR k-point, the FFT CPU time can be
decreased further by about 45%.

As an important piece of technical information, the Fast Fourier
Transform algorithm is efficient only when the size of the grid along
each of the dimensions is a product of powers of small prime numbers,
like 2, 3, 5 and possibly 7 or 11 (consult for example the book
'Numerical Recipes' for an explanation). In this case, the time for
one FFT is proportional to the size of the FFT grid times its
logarithm.
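
A minimal sketch of this constraint (Python, illustrative only; the
count 2*kmax*length/pi restates the Nyquist argument given at the
beginning of this section, and the set of allowed primes is an
assumption that varies between FFT implementations):

  import math

  def fft_friendly(n, primes=(2, 3, 5)):
      # Smallest integer >= n whose only prime factors are in
      # `primes`, so that a 1D FFT of that length runs in N log N.
      while True:
          m = n
          for p in primes:
              while m % p == 0:
                  m //= p
          if m == 1:
              return n
          n += 1

  def fft_grid_points(ecut, length):
      # Along a cell direction of the given length (Bohr), the FFT
      # grid must hold at least 2 * kmax * length / pi points, i.e.
      # twice the diameter of the plane wave sphere, with
      # kmax = sqrt(2*ecut) (ecut in Hartree).
      kmax = math.sqrt(2.0 * ecut)
      return fft_friendly(int(math.ceil(2.0 * kmax * length / math.pi)))

  print(fft_grid_points(20.0, 10.0))  # a 10 Bohr axis at ecut = 20 Ha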
Usually, the vendors of number-crunching computers provide a library
of highly optimized numerical routines, including a 3D FFT routine.
However, such a routine will not take into account the specific
features of the FFT needed by the present code; that is, it will make
computations on arrays of zeroes. Also, the set of allowed products
of powers of small prime numbers for such library routines varies
from vendor to vendor. That is why a portable FFT routine is needed,
that takes advantage of the above-mentioned features, and that
behaves in the same way on different machines. Such a routine has
been included in the ABINIT code thanks to the use of subroutines
written by S. Goedecker (see the ~abinit/doc/users/bibliography.html
file).

In order to be able to compare the speed of this routine with the
speed of vendor routines, the input variable fftalg has been
introduced. The S. Goedecker routines are used when the absolute
value of fftalg is either 1 or 3. Negative values of fftalg use the
naive approach, while positive values do not make operations on
arrays of zeroes. When the ABINIT code is installed on a new
computer, it is possible to use these options immediately. It is also
possible to change the routine fourwf to take advantage of a
vendor-provided 3D FFT routine, if available, which should be called
by fftalg=-2. This machine-dependency is handled through the use of
the C-preprocessor (see the ~abinit/doc/developers/use_cpp file).
Note that the implementation must also be done at the level of the
fourdp routine.

Until now, the fftalg=3 option has been the best one on all scalar
and super-scalar machines, even when vendor-provided 3D FFT routines
are available. It should be possible in the future to build a routine
with vendor-provided 1D FFT routines that takes advantage of the
existence of planes filled with zeroes, but this will take time to be
coded ... For the vector machine VPP Fujitsu, it has been found that
the fftalg=1 option is better.

It is also possible to use the FFTW routines with fftalg=3, with or
without multi-threading (see the ~abinit/doc/developers/use_cpp
file). A good choice of FFTW_NTHREADS depends upon the size of the
transform as well as on the number of processors you have. As a
general rule, you don't want to use more threads than you have
processors (using more threads will work, but there will be extra
overhead with no benefit). In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.

Finally, it must also be mentioned that the fourwf routine is used as
well to generate the density corresponding to a wavefunction. This
use is less frequent (the factor nline is not present in the count of
calls, and only the reciprocal to real space transform is done). In
the timing analysis, the two usages are distinguished: fourwf(pot)
and fourwf(den).

4.B. The application of the non-local part of the Hamiltonian (nonlop)

There will usually be a non-local, separable part of the
pseudopotential, of the Kleinman-Bylander type, for each atom in the
system. The number of components for each atom type varies with the
angular momenta explicitly treated in the pseudopotential: 1 for the
s channel, 3 for the p channel, 6 for the d channel, and 10 for the f
channel. Moreover, for each channel, the number of projectors can be
1 or a larger number. For example, a pseudopotential with 2
s-projectors and 1 p-projector will have 5 components. To apply this
separable part, in the present status of the code, the number of
operations is proportional to mpw * natom * (the number of
components). The scaling for one application is thus proportional to
the square of the size of the system.
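
To make this operation count concrete, here is a small illustrative
sketch (Python, not ABINIT code) that reproduces the component count
of the example above:

  def nl_components(projectors):
      # Components per channel as quoted above (s:1, p:3, d:6, f:10),
      # multiplied by the number of projectors in each channel.
      per_channel = {"s": 1, "p": 3, "d": 6, "f": 10}
      return sum(n * per_channel[ch] for ch, n in projectors.items())

  ncomp = nl_components({"s": 2, "p": 1})   # the example above: 5
  mpw, natom = 10000, 16
  print("ncomp = %d, ops per application ~ %d"
        % (ncomp, mpw * natom * ncomp))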
Different implementations of the non-local pseudopotential
application have been coded, and can be selected by the input
variable nloalg. The value nloalg=4 seems to be good for all
machines; in theory, it is optimized for super-scalar architectures.
In the future, a real-space implementation of the non-local
pseudopotential application might reduce the scaling from quadratic
to linear in the system size, but this takes coding time ... The
present reciprocal space implementation is rather well optimized.

The nonlop routine is also used to compute the non-local
pseudopotential contributions to the forces and stresses. These uses
are less frequent than the application of the non-local part of the
pseudopotential to wavefunctions. In the timing analysis, the three
usages are distinguished: nonlop(energy) for the application to a
wavefunction, and nonlop(forces) and nonlop(stresses) for the
computation of forces and stresses. Their scaling with respect to the
system size is the same, but the force computation is done only once
at the end of a line minimization, while the stress computation is
done only at the end of a set of SCF cycles, when the wavefunctions
are converged. For a wavefunction corresponding to a TR k-point, the
CPU time for the application of the non-local operator is decreased
by about 45%.

4.C. The routine for projection of the change of wavefunctions out of
the occupied space (projbd)

When the Hamiltonian has been applied to a wavefunction, the
resulting vector must be orthogonalized against the existing
wavefunctions of the present band and the other bands at the same
k-point. This is done in reciprocal space, and the scaling of the
operation for a single wavefunction is proportional to the product
mpw * mband. Thus, it scales quadratically with the system size. This
routine might dominate the CPU time when the number of atoms becomes
very large. There is a roughly constant ratio between this time and
the time needed for nonlop; on most machines this ratio is between
1/2 and 3/2. It depends on the number of components of the
pseudopotentials, and on the ratio between the number of atoms and
the number of bands.

The routine projbd is called twice for every Hamiltonian application,
in the standard implementation of the band-by-band conjugate-gradient
algorithm. It is possible to call it only once per application, by
setting the ortalg input variable to a negative value. However, the
convergence of the overall algorithm will change a bit, and this
might cancel the gain due to the reduction from two calls to one.
This routine has also been optimized a bit, by taking into account
the use of registers; the default value, ortalg=2, seems good on all
machines. If the wavefunctions correspond to a TR k-point, the CPU
time for projbd is decreased by at least a factor of two, and at most
a factor of four.

This part of the overall algorithm is the one in which most of the
future progress should be made, both at the level of the algorithm
and at the level of the implementation. It might be very profitable
to use the BLAS library in this respect.
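
To illustrate the projection and the possible benefit of BLAS, here
is a minimal numpy sketch (not the actual projbd routine; it ignores
the TR k-point storage mode and the band-by-band organization of the
algorithm):

  import numpy as np

  def project_out(dpsi, psi_occ):
      # Remove from dpsi (npw complex coefficients) its components
      # along the band wavefunctions psi_occ (mband x npw, assumed
      # orthonormal rows): dpsi - sum_b |psi_b> <psi_b|dpsi>.
      # Two matrix-vector products, so the mpw * mband work can be
      # delegated to a BLAS routine (zgemv).
      overlaps = psi_occ.conj() @ dpsi
      return dpsi - psi_occ.T @ overlaps

  # Tiny check with random orthonormal bands (via QR factorization).
  rng = np.random.default_rng(0)
  q, _ = np.linalg.qr(rng.standard_normal((50, 4))
                      + 1j * rng.standard_normal((50, 4)))
  psi_occ = q.T                      # 4 bands, 50 "plane waves"
  dpsi = rng.standard_normal(50) + 1j * rng.standard_normal(50)
  print(np.abs(psi_occ.conj() @ project_out(dpsi, psi_occ)).max())  # ~ 0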