v2312: New and improved parallel operation

Coupled patch field value consistency

Coupled constraint boundary conditions, e.g. processor and cyclic, should be consistent with their internal field values. For example, on a processor patch the value field holds the neighbouring processor's cell values, i.e. it caches the cell values from the other side. Other coupled patch fields may instead store an interpolate of the local and neighbouring cell values. The rule of thumb is that any code modifying cell values, e.g. a gradient calculation, should make a call to correct the boundary conditions so that the patch values are updated (this equates to a halo swap on a processor boundary field). However, for 'local' operations, e.g. multiplication, this can sometimes be skipped if the boundary condition only stores a value and does not depend on cell values; most 'normal' boundary conditions fall into this category.
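
The following is a minimal sketch of this rule of thumb, not library code: a small stand-alone application assuming the standard OpenFOAM application skeleton (fvCFD.H, setRootCase.H, createTime.H, createMesh.H) and a scalar field T present in the case; the scaling by 2 is purely illustrative.

#include "fvCFD.H"

int main(int argc, char *argv[])
{
    #include "setRootCase.H"
    #include "createTime.H"
    #include "createMesh.H"

    volScalarField T
    (
        IOobject
        (
            "T",
            runTime.timeName(),
            mesh,
            IOobject::MUST_READ
        ),
        mesh
    );

    // 'Local' operation on the cell (internal) values only
    T.primitiveFieldRef() *= 2.0;

    // Re-evaluate the boundary conditions so that constraint patches are
    // consistent again: halo swap on processor patches, interpolation on
    // cyclic/cyclicAMI patches
    T.correctBoundaryConditions();

    Info<< "Done" << endl;

    return 0;
}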

For coupled boundary conditions this shortcut holds for processor patch fields, but not for e.g. the cyclic and cyclicAMI variants. As a result, the values of intermediate calculations do not obey the consistency constraint.

In this release we enforce an evaluation after local operations such that at any time the value of a constraint patch is up to date. Please note that this choice is still under investigation and likely to be updated in the future.

Coupled patchField consistency checking

Visual code checks identified that the wall-distance fields returned by the meshWave and meshWaveAddressing methods were not parallel-consistent. To help catch this type of error, an optional consistency check has been added, enabled using debug switches:

DebugSwitches
{
    volScalarField::Boundary            1;
    volVectorField::Boundary            1;
    volSphericalTensorField::Boundary   1;
    volSymmTensorField::Boundary        1;
    volTensorField::Boundary            1;

    areaScalarField::Boundary           1;
    areaVectorField::Boundary           1;
    areaSphericalTensorField::Boundary  1;
    areaSymmTensorField::Boundary       1;
    areaTensorField::Boundary           1;
}

A value of 1 enables the check and leads to a FatalError when the check fails. This can be used to easily pinpoint problems, particularly in combination with FOAM_ABORT to produce a stack trace:

FOAM_ABORT=true mpirun -np 2 simpleFoam -parallel

The debug switch is interpreted as a bitmap:

bit   value (2^bit)   effect
---   -------------   ------
 0                1   add a check for every local operation
 1                2   print entry and exit of the check
 2                4   issue a warning instead of a FatalError
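
Since the switch is a bitmap, the bit values can be combined. For example, a value of 5 (= 1 + 4) checks every local operation but issues a warning instead of a FatalError (shown here for the scalar field type only; the other field types follow the same pattern):

DebugSwitches
{
    // 1 (check every local operation) + 4 (warning instead of FatalError)
    volScalarField::Boundary    5;
}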

The comparison tolerance is set to 0 by default. For processor halo swaps there is no tolerance issue since exactly the same operations are performed in exactly the same order. However, for 'interpolating' coupled boundary conditions, e.g. cyclic and cyclicAMI, slightly different truncation errors arise since the local operation, e.g. multiplication by a constant, is performed after the interpolation rather than interpolating the result of the local operation. In this case the tolerance can be overridden:

OptimisationSwitches
{
    volScalarField::Boundary::tolerance           1e-10;
    // .. and similar for all the other field types ..
}

Backwards compatibility

The new consistency operations will slightly change the behaviour of any case that uses a cyclic or cyclicAMI boundary condition or any non-trivial turbulence model using coupled boundary conditions.

Optionally, this behaviour can be reverted to the previous (inconsistent!) form by overriding the localConsistency setting in etc/controlDict:

OptimisationSwitches
{
    //- Enable enforced consistency of constraint bcs after 'local' operations.
    //  Default is on. Set to 0/false to revert to <v2306 behaviour
    localConsistency 0;
}

A simple test is any tutorial with a cyclicAMI patch, e.g. the pipeCyclic tutorial. With the above DebugSwitches enabled to activate the checking, the case runs without problems; however, when the localConsistency flag is disabled an inconsistency is detected:

[0] --> FOAM FATAL ERROR: (openfoam-2302 patch=230110)
[0] Field dev(symm(grad(U))) is not evaluated? On patch side1 type cyclicAMI : average of field = ...

Related issues

Source code

Merge request

  • Merge request 628

Non-blocking cyclic A(C)MI

The cyclicAMI boundary condition implements an area-weighted interpolation from multiple neighbouring faces. These faces can be local, or can reside on remote processors, in which case parallel communication is required.

In previous releases, each cyclicAMI evaluation or matrix contribution in the linear solver (in the case of non-local neighbouring faces) triggered its own set of communications and waited for these to finish before continuing to the next cyclicAMI or processor patch. In this release the procedure follows the same pattern as the processor patches: a start-up phase that posts all sends/receives, followed by a 'consumption' phase that uses the remote data to update local values. A typical boundary condition evaluation or linear solver update now takes the following form (a code sketch is given after the list):

  • do all initEvaluate/initInterfaceMatrixUpdate (coupled boundaries only). For processor, cyclicA(C)MI this starts non-blocking sends/receives.
  • wait for all communication to finish (or combine this wait with the consumption step below using polling; cf. the v2306 nPollProcInterfaces switch)
  • do all evaluate/updateInterfaceMatrix. This uses the received data to calculate the contribution to the matrix solution.
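
A minimal sketch of this two-phase pattern is given below. It is illustrative only, not the actual library code; bf is assumed to be the boundary field of a volume field, e.g. obtained via boundaryFieldRef().

// Phase 1: post non-blocking sends/receives for all coupled patches
const label startRequest = UPstream::nRequests();

forAll(bf, patchi)
{
    if (bf[patchi].coupled())
    {
        bf[patchi].initEvaluate(UPstream::commsTypes::nonBlocking);
    }
}

// Wait until all outstanding requests have completed (optionally
// combined with polling, cf. nPollProcInterfaces)
UPstream::waitRequests(startRequest);

// Phase 2: consume the received data to update the local patch values
forAll(bf, patchi)
{
    if (bf[patchi].coupled())
    {
        bf[patchi].evaluate(UPstream::commsTypes::nonBlocking);
    }
}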

By handling the communication from cyclicA(C)MI in exactly the same way as processor boundary conditions there is less chance of bottlenecks and hopefully better scaling. An additional optimisation is that the local send/receive buffers are allocated once and reused.

Source code

Merge request

  • Merge request 641

Tutorial

  • any case with cyclicAMI or cyclicACMI

GAMG : support for cyclicAMI in processor agglomeration

The GAMG solver, in addition to local agglomeration, can combine matrices across processors (processor agglomeration). This can be beneficial at larger core counts since it:

  • lowers the number of cores solving the coarsest level - most of the global reductions happen at the coarsest level; and
  • increases the amount of implicitness for all operations, e.g. smoothing and preconditioning.

In this release the framework has been extended to allow processor agglomeration of all coupled boundary conditions, e.g. cyclicAMI and cyclicACMI.

As a test, a comparison was made between

  • a single 40x10x1 block; and
  • two 20x10x1 blocks coupled using cyclicAMI.

Both cases were decomposed into 4 subdomains, using the GAMG solver in combination with the masterCoarsest processor agglomerator, whereby all matrices are combined onto the master rank(s):

solvers
{
    p
    {
        solver                  GAMG;
        processorAgglomerator   masterCoarsest;
        ..
    }
}

  • single block (so no cyclicAMI, only processor faces):
                              nCells       nInterfaces
   Level  nProcs         avg     max       avg     max
   -----  ------         ---     ---       ---     ---
       0       4         100     100       1.5       2
       1       4          50      50       1.5       2
       2       1         100     100         0       0
       3       1          48      48         0       0

The number of boundaries (nInterfaces) becomes 0 as all processor faces become internal.

  • two-block case (so cyclicAMI and processor faces):
                              nCells         nInterfaces
   Level  nProcs         avg     max         avg     max
   -----  ------         ---     ---         ---     ---
       0       4         100     100           3       3
       1       4          50      50           3       3
       2       1         100     100           2       2
       3       1          48      48           2       2

Here, the number of boundaries reduces from 3 to 2 since only the two cyclicAMI are preserved.

Notes

  • cyclicA(C)MI :
    • as all faces become local, the behaviour is reset to non-distributed i.e. operations are applied directly on provided fields without any additional copying.
    • rotational transformations are not yet supported. This is not a fundamental limitation but requires additional rewriting of the stencils to take transformations into account.
  • processorCyclic : a cyclic with owner and neighbour cells on different processors. This is not yet supported; it is treated as a normal processor boundary and will therefore lose any transformation. Note that processorCyclic can be avoided by using the patches constraint in decomposeParDict, e.g.
constraints
{
    patches
    {
        //- Keep owner and neighbour on same processor for faces in patches
        //  (only makes sense for cyclic patches and cyclicAMI)
        type    preservePatches;
        patches (cyclic);
    }
}
  • only masterCoarsest has been tested but the code should support any other processor-agglomeration method.
  • the limited testing to date has shown no benefit from processor agglomeration of cyclicAMI. It is only useful if bottlenecks, e.g. the number of global reductions or the degree of implicitness, are the issue.

Source code

Merge request

  • Merge request 645

Improvements for redistributePar and file systems

New hostUncollated fileHandler. This uses the first core on each node to perform the I/O. It is equivalent to explicitly specifying cores using the ioRanks option.
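
For illustration, the following two invocations should behave in the same way; the solver name, rank count and host layout are assumptions (two nodes with four ranks each, so that ranks 0 and 4 are the first ranks on their hosts), and FOAM_IORANKS is one way of specifying the I/O ranks explicitly:

mpirun -np 8 simpleFoam -parallel -fileHandler hostUncollated

FOAM_IORANKS='(0 4)' mpirun -np 8 simpleFoam -parallel -fileHandler masterUncollated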

Improved general support of the collated file format and corresponding adjustments to the redistributePar utility. With these changes, the collated format can be used for a wider range of workflows than previously possible.

Handling of dynamic code, e.g. codedFixedValue boundary condition, is now supported for distributed file systems. For these systems, the dynamically compiled libraries are automatically distributed to the other nodes.
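
As an example of such dynamic code (patch name, field and ramp are illustrative), a codedFixedValue entry on a scalar field might look like:

inlet
{
    type            codedFixedValue;
    value           uniform 0;

    // Name used for the generated code/library
    name            rampedInlet;

    code
    #{
        // Ramp the patch value from 0 to 10 over the first 100 s
        operator==(min(10.0, 0.1*this->db().time().value()));
    #};
}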

The numberOfSubdomains entry in the decomposeParDict file is optional. If not specified, it is set to the number of processors the job was started with. Note that this is not useful for some methods, e.g. hierarchical, which requires a consistent number of subdivisions in the three coordinate directions.
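
A minimal sketch of a system/decomposeParDict relying on this default (the scotch method is just one example that does not need an explicit subdomain count; header omitted for brevity):

// numberOfSubdomains omitted: taken from the number of ranks the job is
// started with, e.g. mpirun -np 6
method          scotch;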

Distributed roots are automatically copied from the master processor. With a hostfile containing multiple hosts it is now possible to automatically construct, e.g. processors5_XXX-YYY on the local or remote nodes:

mpirun -hostfile hostfile -np 5 ${FOAM_ETC}/openfoam redistributePar -parallel -fileHandler hostCollated -decompose

Note the use of the openfoam wrapper script to ensure that all nodes use the same OpenFOAM installation.

Improved handling of included files with collated

In previous versions, using “include” files in combination with the collated file handler could be very fragile when the file contents were treated as runtime-modifiable. The handling of watched files has now been updated to ensure proper correspondence across the processor ranks.