MADNESS  0.10.1
Data and load balancing

The source is here.

This is one of the more computationally demanding examples: either run it on 50 or more nodes of jaguar, or reduce the number of functions employed (the value of NFUNC in the source).

Points of interest
  • Using functors to incorporate state into functions to be compressed
  • Using special points in a functor to provide hints to the adaptive refinement algorithm
  • Operating on multiple functions in a vector
  • Heuristic analysis of function trees to generate a new mapping of data to processor
  • Installation of the new process map with automatic redistribution of data
  • Timing to estimate benefit of load balancing
Background

Poor distribution of work (load imbalance) is the largest reason for inefficient parallel execution within MADNESS. Poor data distribution (data imbalance) contributes to load imbalance and also leads to out-of-memory problems due to one or more processes having too much data. Thus, we are interested in uniform distribution of both work and data.

Many operations in MADNESS are entirely data driven (i.e., computation occurs on the process that "owns" the data) since there is insufficient work to justify moving data between processes (e.g., computing the inner product between functions). However, a few expensive operations can have their work shipped to other processes.

There are presently three load-balancing mechanisms within MADNESS:
  • redistribution of data (and hence of the work bound to it) by installing a new process map,
  • random assignment of work to processes for selected expensive operations, and
  • work stealing (still experimental).

Until the work stealing becomes production quality we must exploit the first two forms. The random work assignment is controlled by options in the FunctionDefaults class.

The process map (an instance of WorldDCPmapInterface) controls the mapping of data to processes, and it is actually quite easy to write your own (e.g., see WorldDCDefaultPmap or LevelPmap) that ensures uniform data distribution. However, one also wants to incorporate estimates of the computational cost into the distribution. The class LBDeuxPmap (deux because it is the second such class) in mra/lbdeux.h does this by examining the functions you provide and using supplied weights to estimate the computational cost; the communication cost is taken to be proportional to the number of tree links broken by the distribution. Since some operations work in the scaling function basis, some in the wavelet basis, and some in non-standard form, there is an element of empiricism in getting the best performance from most algorithms.

Implementation

The example times how long it takes to perform the following operations (the labels in the output below correspond to these steps):
  • projection of the functions into the adaptive basis,
  • truncation,
  • differentiation of each function, and
  • application of a convolution operator to each function.

The process map (data distribution) is then modified using the LBDeux heuristic and the operations are repeated.

Results
/tmp/work/harrison % aprun -n 50 -d 12 -cc none ./dataloadbal
Runtime initialized with 10 threads in the pool and affinity 1 0 2
Before load balancing
project 2.57 truncate 9.18 differentiate 11.80 convolve 12.57 balance 0.00
project 2.74 truncate 9.08 differentiate 11.33 convolve 13.07 balance 0.00
project 2.80 truncate 8.84 differentiate 10.76 convolve 12.66 balance 2.19
After load balancing
project 2.96 truncate 2.56 differentiate 3.28 convolve 9.81 balance 0.00
project 3.05 truncate 2.61 differentiate 3.46 convolve 9.24 balance 0.00
project 3.01 truncate 2.60 differentiate 3.25 convolve 9.30 balance 0.00