The Use of BLAS and LAPACK in PETSc and external libraries#

  1. BLAS 1 operations (and GPU equivalents) - vector operations such as VecNorm(), VecAXPY(), and VecScale() are used extensively in PETSc (see the sketch after this list). Depending on the simulation, the size of the vectors may range from hundreds of entries to many millions.

  2. BLAS 2 operations - dense matrix-vector operations; generally the dense matrices are very small.

  3. LAPACK eigenvalue and SVD computations, generally for very small matrices.

  4. External packages such as MUMPS and SuperLU_DIST use BLAS 3 operations (and possibly BLAS 1 and 2). The dense matrices may be of modest size, going up to thousands of rows and columns.
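
To make item 1 concrete, here is a minimal sketch (assuming a recent PETSc that provides the PetscCall() error-checking macro) of the kind of BLAS 1-level vector operations PETSc issues internally; the sequential vector of length 1000 is an arbitrary illustrative size:

```c
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec         x, y;
  PetscReal   nrm;
  PetscScalar alpha = 2.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1000, &x));
  PetscCall(VecDuplicate(x, &y));
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecSet(y, 3.0));

  PetscCall(VecAXPY(y, alpha, x));     /* y <- alpha*x + y  (BLAS 1 axpy) */
  PetscCall(VecScale(y, 0.5));         /* y <- 0.5*y        (BLAS 1 scal) */
  PetscCall(VecNorm(y, NORM_2, &nrm)); /* ||y||_2           (BLAS 1 nrm2) */

  PetscCall(PetscPrintf(PETSC_COMM_SELF, "norm %g\n", (double)nrm));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(PetscFinalize());
  return 0;
}
```

With a GPU vector type the same calls dispatch to the corresponding GPU kernels rather than to host BLAS.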

For most PETSc simulations (that is, those not using certain external packages), using an optimized set of BLAS/LAPACK routines provides only a modest improvement in performance. For some external packages, using optimized BLAS/LAPACK can dramatically improve performance.

32 or 64-bit BLAS/LAPACK integers#

BLAS/LAPACK libraries may use 32 or 64-bit integers. PETSc's configure and compile handle this automatically, so long as the arguments to the BLAS/LAPACK routines use the type PetscBLASInt. The routine PetscBLASIntCast(PetscInt, PetscBLASInt *) casts a PetscInt to the BLAS/LAPACK integer size; if the value does not fit in that size, it generates an error. For the vast majority of simulations, even very large ones, 64-bit BLAS/LAPACK integers are not needed, even when 64-bit PETSc integers are used.
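
For illustration, here is a minimal sketch of this pattern, assuming a recent PETSc; the helper function MyAXPY() is hypothetical, while PetscCallBLAS() and the BLASaxpy_() wrapper are the spellings recent PETSc versions use for calling BLAS directly:

```c
#include <petscsys.h>
#include <petscblaslapack.h>

/* Sketch: convert a PetscInt length to PetscBLASInt before a direct BLAS call.
   PetscBLASIntCast() errors if the value does not fit in the BLAS/LAPACK integer size. */
static PetscErrorCode MyAXPY(PetscInt n, PetscScalar alpha, PetscScalar *x, PetscScalar *y)
{
  PetscBLASInt bn, one = 1;

  PetscFunctionBeginUser;
  PetscCall(PetscBLASIntCast(n, &bn));
  PetscCallBLAS("BLASaxpy", BLASaxpy_(&bn, &alpha, x, &one, y, &one)); /* y <- alpha*x + y */
  PetscFunctionReturn(PETSC_SUCCESS);
}
```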

The configure option --with-64-bit-blas-indices attempts to locate and use a 64-bit integer version of the BLAS/LAPACK library. Except for MKL Cluster PARDISO, most external packages do not support 64-bit BLAS/LAPACK integers, so if you are using such packages you cannot use 64-bit BLAS/LAPACK integers.

The configure options --with-64-bit-indices and --with-64-bit-blas-indices are independent; --with-64-bit-indices does not imply that the BLAS/LAPACK libraries use 64-bit indices.

Shared memory BLAS/LAPACK parallelism#

Some BLAS/LAPACK libraries can make use of shared memory parallelism within the function calls, generally using OpenMP or possibly Pthreads. If this feature is turned on, it is in addition to the MPI-based parallelism that PETSc is using and can thus result in over-subscription of hardware resources. For example, if a system has 16 cores and PETSc is run with an MPI size of 16, then each core is assigned an MPI process. But if the BLAS/LAPACK is run with OpenMP and 4 threads per process, this results in 64 threads competing for 16 cores, which will perform poorly.

If one elects to use both MPI parallelism and shared memory BLAS/LAPACK parallelism, one should ensure the hardware resources are not over-subscribed. Since PETSc does not natively use OpenMP, this means that phases of the computation that do not use BLAS/LAPACK will be under-subscribed, thus under-utilizing the system. For PETSc simulations that do not use external packages there is generally no benefit to using parallel BLAS/LAPACK. The environment variable OMP_NUM_THREADS can be used to set the number of threads used by each MPI process for its shared memory parallel BLAS/LAPACK. The additional environment variables OMP_PROC_BIND and OMP_PLACES may also need to be set appropriately for the system to obtain good parallel performance with BLAS/LAPACK. The configure option --with-openmp will trigger PETSc to try to locate and use a parallel BLAS/LAPACK library.

Certain external packages such as MUMPS may benefit from using parallel BLAS/LAPACK operations. See the manual page MATSOLVERMUMPS for details on how one can restrict the number of MPI processes while running MUMPS to utilize parallel BLAS/LAPACK.

Available BLAS/LAPACK libraries#

Most systems (besides Microsoft Windows) come with pre-installed BLAS/LAPACK libraries that are satisfactory for many PETSc simulations.

The freely available Intel MKL mathematics libraries provide BLAS/LAPACK routines that generally perform better than the system-provided libraries and are fine for most users.

For systems that do not provide BLAS/LAPACK, such as Microsoft Windows, PETSc provides the Fortran reference version via --download-fblaslapack and an f2c-generated C version via --download-f2cblaslapack (which also supports 128-bit real number computations). These libraries are less optimized but useful for getting started with PETSc easily.

PETSc also provides access to OpenBLAS via the --download-openblas configure option. OpenBLAS uses some highly optimized operations but falls back on reference routines for many others; see the OpenBLAS manual for more information. The --download-openblas option provides a full BLAS/LAPACK implementation.

BLIS does not bundle LAPACK, so if --download-blis is selected PETSc’s configure attempts to locate a compatible system LAPACK library to use with it. One can instead use --download-f2cblaslapack --download-blis, which is recommended as a portable, high-performance option. Be aware that if you use --download-blis without --download-f2cblaslapack, it is possible that the installed BLIS library will not be used at all; instead, PETSc will link in some other LAPACK implementation and the BLAS that comes with that implementation.