LLN discussion meeting on parallelism
| When | Feb 21, 2006 from 08:05 AM to 09:05 PM |
|---|---|
| Where | Louvain-la-Neuve (Belgium) |
Parallelism in ABINIT (2006-02-21)
The goal of this meeting was to discuss the state of the parallelism implementation in ABINIT and how to take it further. The meeting was therefore mainly aimed at developers, and several presentations focus on the current state of development. Changes will likely occur in the near future.
Participants and presentations
- François Bottin: CEA, Bruyères le Châtel, France.
- Xavier Gonze: PCPM, Louvain la Neuve, Belgium.
- Torsten Hoefler: Computer Science Department, Technical University of Chemnitz, Germany. Computer scientist, works on data collectors in MPI and their applications in ABINIT.
Presentation: Parallelization Options for the Band-by-Band Minimization of Teter et al.
Notes:
- In the "Implementation Issues" slide, the complexity refers not to the overall data size but to the communication data size.
- The LogP model is used to characterise timing in MPI implementations: 'L' stands for latency, 'o' for overhead (processor time spent sending or receiving a message), 'g' for gap (minimum time between two consecutive message transmissions) and 'P' for the number of processors.
- Non-blocking communications could be used in DFT by overlapping bands: one band is computed while the communication for the previous one is in progress.
- Rebecca Janisch: Institute for Opto- and Solid State Electronics, Technical University of Chemnitz, Germany. Physicist, works on parallelism over coefficients (with Torsten Hoefler).
Presentation: Parallel Perturbations in ABINIT
Notes:
- This presentation is a summary of the work of Phillipp Plaenitz (same address, also a physicist), who is working on the parallelisation of the response function.
- It could be made possible to enable parallelism over k points, over shifts, or over both; in the current implementation, however, the keyword that enables parallelism over shifts does not disable the k-point parallelisation.
- Future work could enable a different number of k points for different perturbations.
- Yann Pouillon: PCPM, Louvain la Neuve, Belgium. Physicist, co-maintainer of ABINIT, works on the build system of versions 5.x of ABINIT.
Presentation: MPI support in ABINIT 5
Notes:
- The 4.7 version contains the latest commits of the Chemnitz team (T. Hoefler, R. Janisch...), of M. Torrent... but not those of T. Deutsch and P-M. Anglade.
- Riad Shaltaf: PCPM, Louvain la Neuve, Belgium. Physicist, postdoctoral researcher, works on GW and the implementation of parallelism in it.
Presentation: Parallelization of GW in ABINIT
Notes:
- In the "Screening I" slide, parallelising over q is not a good idea because q=0 requires many more calculations than the other values of q. The parallelisation is done instead over k in the screening sum.
- The current implementation does not distribute memory: every CPU holds all band and wave-function information.
- Marc Torrent: CEA, Bruyères le Châtel, France. Physicist, has implemented PAW in ABINIT and works on the implementation of PAW in linear response.
- Gilles Zérah: CEA, Bruyères le Châtel, France. Physicist, has implemented Lobpcg in ABINIT and works on implementing parallelism in the Lobpcg method.
Presentation: Parallelisation in Abinit: PW+Lobpcg //lelization and next
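As an illustration of the LogP and band-overlap notes from T. Hoefler's presentation above, here is a minimal Python sketch of a toy timing model. It is not taken from the presentation: the function names and the numbers are hypothetical, the message cost uses the usual LogP estimate of one send overhead, one latency and one receive overhead, and the overlapped case assumes an idealised pipeline in which communication of one band runs fully concurrently with computation of the next.

```python
def logp_message_time(latency, overhead):
    """Time for one small point-to-point message in the LogP model: o + L + o."""
    return overhead + latency + overhead


def blocking_time(n_bands, t_compute, t_comm):
    """Each band is computed, then communicated, with no overlap."""
    return n_bands * (t_compute + t_comm)


def overlapped_time(n_bands, t_compute, t_comm):
    """Band i is computed while band i-1 is communicated (idealised pipeline)."""
    if n_bands == 0:
        return 0
    # The first computation and the last communication cannot be hidden;
    # every step in between costs the slower of the two activities.
    return t_compute + (n_bands - 1) * max(t_compute, t_comm) + t_comm


if __name__ == "__main__":
    # Hypothetical values, in arbitrary time units.
    t_comm = logp_message_time(latency=5, overhead=1)
    print("blocking  :", blocking_time(8, 10, t_comm))
    print("overlapped:", overlapped_time(8, 10, t_comm))
```

In this model the saving per band is bounded by the smaller of the compute and communication times. The sketch also ignores that the overhead 'o' is processor time and cannot itself be overlapped.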
Discussions
Using OpenMP and MPI
CPMD (development version) and PWSCF use both MPI and OpenMP on highly parallel computers (with thousands of processors). MPI is used over k points and OpenMP over band calculations, with 8 to 16 processors per node.
Some MPI implementations behave cleverly when dealing with shared memory: in collective communications they copy arrays only once if shared memory is available. The Open MPI implementation (see the Open MPI web site) has this feature. One solution would therefore be to use only MPI (with an implementation that provides this feature) and let MPI decide when shared memory can be used.
The current implementation of OpenMP in ABINIT is broken because of contributions that do not take the OpenMP flags into account and do not declare private/shared variables when adding lines to loops parallelised with OpenMP.
Conclusion on this point: not too much time should be spent on OpenMP in ABINIT, at least in the old parts of the code.
MPI / MPI-IO / MPI2 / MPI3
The implementation of MPI-IO in ABINIT is in progress. It has proved very efficient in some phonon calculations at the CEA, where 50% of the time is spent waiting for file I/O.
MPI-IO is a part of MPI2, but the latter is not yet available on every architecture.
Conclusion on this point: the following macro flags should be used in ABINIT,
- MPI alone, for all MPI-1.1 calls ;
- MPI_IO, for MPI-IO (of course) ;
- MPI_2, for guess what? ... MPI2 routines ;
- MPI_EXT, for all other MPI extensions.
The current MPI_FFT flag should be dropped: it should be merged into the MPI flag and replaced by a keyword (not yet defined).
Using BLAS/LAPACK
At present, BLAS is not used, because time-reversal symmetry requires a specific treatment for g=0.
It has been decided not to audit all the previously written code to find where BLAS calls could be used, but to take care to add BLAS calls wherever possible in future code.
ScaLAPACK has also been discussed, but since it distributes parallelised data in its own way (usually in a manner that is inconsistent with the current parallelisation in ABINIT), it has been decided not to use it.
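To make the g=0 issue mentioned above concrete, here is a small Python sketch under a stated assumption: with time-reversal symmetry the coefficients satisfy c(-G) = conj(c(G)), so only G=0 and one half of the remaining G vectors are stored. The storage layout and the function names are hypothetical, not ABINIT's. A scalar product over the full set then equals twice the real part of the product over the stored half, minus the G=0 term, which would otherwise be counted twice. This correction is the kind of special case that a plain BLAS dot call does not handle by itself.

```python
import random


def expand_half(half):
    """Rebuild the full coefficient list from [c(0), c(G1), ..., c(Gn)]
    using c(-G) = conj(c(G)). Assumed layout, for illustration only."""
    g0, rest = half[0], half[1:]
    return [c.conjugate() for c in reversed(rest)] + [g0] + list(rest)


def dot_full(a, b):
    """Plain scalar product sum_G conj(a(G)) * b(G) over all G."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def dot_half(a_half, b_half):
    """Same scalar product computed from the stored half only."""
    # This is what a plain dot product over the stored arrays would return.
    plain = sum(x.conjugate() * y for x, y in zip(a_half, b_half))
    # G=0 has no partner at -G, so it must be counted once, not twice.
    g0_term = a_half[0].conjugate() * b_half[0]
    return 2.0 * plain.real - g0_term.real


def random_half(n, rng):
    """Random stored half; c(0) is real, as the symmetry requires."""
    head = [complex(rng.uniform(-1, 1), 0.0)]
    tail = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return head + tail


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = random_half(5, rng), random_half(5, rng)
    print(dot_full(expand_half(a), expand_half(b)).real, dot_half(a, b))
```

The two printed values should agree to rounding error; dropping the `g0_term` correction breaks the agreement whenever both c(0) coefficients are non-zero.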
Band by band parallelisation
There are currently two implementations:
- the cgwf method, which is the classic way in ABINIT;
- Lobpcg, which can be applied without interfering with the rest of the code.
FFT parallelisation
There are two different implementations of the FFT algorithm by S. Goedecker. T. Hoefler has worked on the old one, but his modifications could be applied to the new one. In the newest implementation, the work is done on a cube containing the sphere of non-null data points, which therefore also includes some points with null values. Nevertheless, this implementation gains on data reordering, since dealing with a cube is much simpler than handling a sphere.
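The cost of working on the cube rather than the sphere can be estimated with a short count. The Python sketch below is a toy illustration, not ABINIT's grid: it assumes an integer grid and a cube that just encloses the sphere, and the function names are hypothetical. For large radii the ratio of sphere points to cube points tends towards the volume ratio π/6, so roughly half of the cube holds null values.

```python
def count_sphere_points(radius):
    """Count integer grid points (i, j, k) with i^2 + j^2 + k^2 <= radius^2."""
    r2 = radius * radius
    axis = range(-radius, radius + 1)
    return sum(
        1
        for i in axis
        for j in axis
        for k in axis
        if i * i + j * j + k * k <= r2
    )


def count_cube_points(radius):
    """Number of grid points in the smallest cube enclosing the sphere."""
    return (2 * radius + 1) ** 3


if __name__ == "__main__":
    for radius in (4, 8, 16):
        inside = count_sphere_points(radius)
        total = count_cube_points(radius)
        print(radius, inside, total, round(inside / total, 3))
```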
The current issue in the parallelisation of the FFT is to find a good compromise between load balancing and a second call to MPI_ALLTOALL. This could be solved by using a common distribution scheme (T. Hoefler's job).
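The load-balancing side of this compromise can be illustrated with a toy count in Python. The sketch assumes, purely for illustration, that the sphere of non-null points is distributed plane by plane; the two distribution schemes and all function names are hypothetical and are not the schemes used in ABINIT. Contiguous slabs keep neighbouring planes together but give the processes near the poles of the sphere much less work, while a round-robin assignment balances the load at the price of a different data layout.

```python
def points_per_plane(radius):
    """Number of sphere points in each plane k = -radius .. radius."""
    r2 = radius * radius
    axis = range(-radius, radius + 1)
    return [
        sum(1 for i in axis for j in axis if i * i + j * j + k * k <= r2)
        for k in axis
    ]


def contiguous_loads(weights, nproc):
    """Block distribution: each process gets one contiguous slab of planes."""
    n = len(weights)
    return [
        sum(weights[p * n // nproc:(p + 1) * n // nproc]) for p in range(nproc)
    ]


def round_robin_loads(weights, nproc):
    """Cyclic distribution: plane number idx goes to process idx % nproc."""
    loads = [0] * nproc
    for idx, weight in enumerate(weights):
        loads[idx % nproc] += weight
    return loads


def imbalance(loads):
    """Ratio of the busiest process to the average load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    weights = points_per_plane(8)
    print("contiguous :", contiguous_loads(weights, 4))
    print("round robin:", round_robin_loads(weights, 4))
```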
Nested parallelism
Nested parallelism deals with the way different levels of parallelism can be combined at the same time. Currently, in ground state calculations, parallelism can be found on k points, on spin, on bands... Parallelism is also found in GW calculations or in linear response... All these implementations use the same variable in ABINIT to store their information: mpi_enreg.
This variable should first contain the state of the current parallelisation (e.g. whether we are parallelising over bands). Then it should also contain the way data are distributed, maybe by storing communicator schemes adapted to the different parallelisation stages. But there is the problem that there are different Fourier grids in ABINIT (one for wave-functions, one for density...), so it is impossible, for example, to define a single global communicator for all Fourier transform operations.
This is still an open issue; no decision was taken during the meeting on this point.
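Purely as an illustration of what "storing communicator schemes" could mean, and not as a decision of the meeting, the Python sketch below models a two-level layout (k-point groups times band groups) without using MPI. The class and method names are hypothetical and do not describe the actual content of mpi_enreg. The "colors" it computes play the role of the color argument of MPI_Comm_split: ranks sharing a color would end up in the same sub-communicator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical two-level layout: k-point groups x band groups."""

    n_kpt_groups: int
    n_band_groups: int

    @property
    def size(self):
        """Total number of ranks needed by this layout."""
        return self.n_kpt_groups * self.n_band_groups

    def coords(self, rank):
        """Return (k-point group, band group) of a rank, row-major order."""
        if not 0 <= rank < self.size:
            raise ValueError("rank outside the layout")
        return divmod(rank, self.n_band_groups)

    def band_comm_color(self, rank):
        """Ranks working on the same k-point group share this color."""
        return self.coords(rank)[0]

    def kpt_comm_color(self, rank):
        """Ranks holding the same band group share this color."""
        return self.coords(rank)[1]


def groups(layout, color_of):
    """Collect the ranks of each would-be sub-communicator by color."""
    result = {}
    for rank in range(layout.size):
        result.setdefault(color_of(rank), []).append(rank)
    return result


if __name__ == "__main__":
    layout = ParallelLayout(n_kpt_groups=2, n_band_groups=3)
    print("band communicators   :", groups(layout, layout.band_comm_color))
    print("k-point communicators:", groups(layout, layout.kpt_comm_color))
```

A separate layout object per Fourier grid would be one way to sidestep the single-global-communicator problem described above, but that is only a suggestion made by this sketch.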

