SSIDS  Sparse Symmetric Indefinite Direct Solver¶
#include <spral_ssids.h> /* or <spral.h> for all packages */
Purpose¶
This package solves one or more sets of \(n\times n\) sparse symmetric equations \(AX = B\) using a multifrontal method. The following cases are covered:
1. \(A\) is indefinite. SSIDS computes the sparse factorization
where \(P\) is a permutation matrix, \(L\) is unit lower triangular, and \(D\) is block diagonal with blocks of size \(1 \times 1\) and \(2 \times 2\).
2. \(A\) is positive definite. SSIDS computes the sparse Cholesky factorization
where \(P\) is a permutation matrix and \(L\) is lower triangular. However, as SSIDS is designed primarily for indefinite systems, this may be slower than a dedicated Cholesky solver.
The code optionally supports hybrid computation using one or more NVIDIA GPUs.
SSIDS returns bitcompatible results.
An option exists to scale the matrix. In this case, the factorization of the scaled matrix \(\overline{A} = S A S\) is computed, where \(S\) is a diagonal scaling matrix.
Version history¶
[For detail, see ChangeLog]
 20160923 Version 2.0.0
 Add support for CPU and hybrid execution.
 Add support for 64bit ptr(:) on input matrix \(A\).
 options%presolve removed.
 20140317 Version 1.0.0
 Initial release: GPU only
Usage overview¶
Solving \(AX=B\) using SSIDS is a four stage process.
 Call
spral_ssids_analyse()
to perform a symbolic factorization, stored in akeep.  Call
spral_ssids_factor()
to perform a numeric factorization, stored in fkeep. More than one numeric factorization can refer to the same akeep.  Call
spral_ssids_solve1()
orspral_ssids_solve()
to perform a solve with the factors. More than one solve can be performed with the same fkeep.  Once all desired solutions have been performed, free memory with
spral_ssids_free()
.
In addition, advanced users may use the following routines:
spral_ssids_enquire_posdef()
andspral_ssids_enquire_indef()
return the diagonal entries of the factors and the pivot sequence.spral_ssids_alter()
allows altering the diagonal entries of the factors.
Note
Bitcompatibility: If used with bitcompatible BLAS and compiler options, this routine will return bit compatible results. That is, consecutive runs with the same data on the same machine produces exactly the same solution.
Basic Subroutines¶
Note
For the most efficient use of the package, CSC format should be used without checking.

void
spral_ssids_default_options
(struct spral_ssids_options *options)¶ Intialises members of options structure to default values.
Parameters:  options – Structure to be initialised.

void
spral_ssids_analyse
(bool check, int n, int *order, const long *ptr, const int *row, const double *val, void **akeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ Perform the analyse (symbolic) phase of the factorization for a matrix supplied in CSC format. The resulting symbolic factors stored in
akeep
should be passed unaltered in the following call tospral_ssids_factor()
.Parameters:  check – if true, matrix data is checked. Outofrange entries are dropped and duplicate entries are summed.
 n – number of columns in \(A\).
 order – may be NULL; otherwise must be an array of size n
used on entry a usersupplied ordering
(
options.ordering=0
). On return, the actual ordering used.  ptr[n+1] – column pointers for \(A\) (see CSC format).
 row[ptr[n]] – row indices for \(A\) (see CSC format).
 val – may be NULL; otherwise must be an array of size ptr[n] containing nonzero values for \(A\) (see CSC format). Only used if a matchingbased ordering is requested.
 akeep – returns symbolic factorization, to be passed unchanged to subsequent routines.
 options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).
Note
If a usersupplied ordering is used, it may be altered by this routine, with the altered version returned in order[]. This version will be equivalent to the original ordering, except that some supernodes may have been amalgamated, a topographic ordering may have been applied to the assembly tree and the order of columns within a supernode may have been adjusted to improve cache locality.

void
spral_ssids_analyse_ptr32
(bool check, int n, int *order, const int *ptr, const int *row, const double *val, void **akeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ As
spral_ssids_analyse()
, except ptr has typeint
. Provided for backwards comptability, users are encourage to use 64bit ptr in new code.

void
spral_ssids_analyse_coord
(int n, int *order, long ne, const int *row, const int *col, const double *val, void **akeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ As
spral_ssids_analyse()
, but for coordinate data. The variant parameters are:Parameters:  ne – number of nonzero entries in \(A\).
 row[ne] – row indices for \(A\) (see Coordinate format).
 col[ne] – column indices for \(A\) (see Coordinate format).

void
spral_ssids_factor
(bool posdef, const long *ptr, const int *row, const double *val, double *scale, void *akeep, void **fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ Parameters:  posdef – true if matrix is positivedefinite
 ptr – may be NULL; otherwise a length n+1 array of column pointers
for \(A\), only required if
akeep
was obtained by runningspral_ssids_analyse()
with check=true, in which case it must be unchanged since that call.  row – may be NULL; otherwise a length ptr[n] array of row indices
for \(A\), only required if
akeep
was obtained by runningspral_ssids_analyse()
with check=true, in which case it must be unchanged since that call.  val[] – nonzero values for \(A\) in same format as for
the call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  scale – may be NULL; otherwise a length n array for diagonal
scaling. scale[i1] contains entry \(S_ii\) of \(S\). Must be
supplied by user on entry if
options.scaling=0
(usersupplied scaling). On exit, returns scaling used.  akeep – symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.  fkeep – returns numeric factorization, to be passed unchanged to subsequent routines.
 options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).

void
spral_ssids_solve1
(int job, double *x1, void *akeep, void *fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ Solve (for a single righthand side) one of the following equations:
job Equation solved 0 \(Ax=b\) 1 \(PLx=Sb\) 2 \(Dx=b\) 3 \((PL)^TS^{1}x=b\) 4 \(D(PL)^TS^{1}x=b\) Recall \(A\) has been factorized as either:
 \(SAS = (PL)(PL)^T~\) (positivedefinite case); or
 \(SAS = (PL)D(PL)^T\) (indefinite case).
Parameters:  job – specifies equation to solve, as per above table.
 x1[n] – righthand side \(b\) on entry, solution \(x\) on exit.
 akeep – symbolic factorization returned by preceding
call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  fkeep – numeric factorization returned by preceding
call to
spral_ssids_factor()
.  options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).

void
spral_ssids_solve
(int job, int nrhs, double *x, int ldx, void *akeep, void *fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ Solve (for multiple righthand sides) one of the following equations:
job Equation solved 0 \(AX=B\) 1 \(PLX=SB\) 2 \(DX=B\) 3 \((PL)^TS^{1}X=B\) 4 \(D(PL)^TS^{1}X=B\) Recall \(A\) has been factorized as either:
 \(SAS = (PL)(PL)^T~\) (positivedefinite case); or
 \(SAS = (PL)D(PL)^T\) (indefinite case).
Parameters:  job – specifies equation to solve, as per above table.
 nrhs – number of righthand sides.
 x[ldx*nrhs] – righthand sides \(B\) on entry, solutions \(X\) on exit. The ith entry of righthand side j is in position x[j*ldx+i].
 ldx – leading dimension of x.
 akeep – symbolic factorization returned by preceding
call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  fkeep – numeric factorization returned by preceding
call to
spral_ssids_factor()
.  options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).

int
spral_ssids_free_akeep
(void **akeep)¶ Frees memory and resources associated with
akeep
.Parameters:  akeep – symbolic factors to be freed.
Returns: 0 on success, or a CUDA error code on failure.

int
spral_ssids_free_fkeep
(void **fkeep)¶ Frees memory and resources associated with
fkeep
.Parameters:  fkeep – numeric factors to be freed.
Returns: 0 on success, or a CUDA error code on failure.

int
spral_ssids_free
(void **akeep, void **fkeep)¶ Frees memory and resources associated with
akeep
andfkeep
.Parameters:  akeep – symbolic factors to be freed.
 fkeep – numeric factors to be freed.
Returns: 0 on success, or a CUDA error code on failure.
Warning
The free routine(s) must be called by the user. Merely deallocating
akeep
or fkeep
, or allowing them to go out of scope will
result in memory leaks. akeep
should only be deallocated after all
associated numeric factorizations fkeep
have been freed.
Advanced subroutines¶

void
spral_ssids_enquire_posdef
(const void *akeep, const void *fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform, double *d)¶ Return the diagonal entries of the Cholesky factor.
Parameters:  akeep – symbolic factorization returned by preceding
call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  fkeep – numeric factorization returned by preceding
call to
spral_ssids_factor()
.  options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).  d[n] – returns the diagonal of \(L\). d[i1] stores the entry \(L_{ii}\).
 akeep – symbolic factorization returned by preceding
call to

void
spral_ssids_enquire_indef
(const void *akeep, const void *fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform, int *piv_order, double *d)¶ Return the pivot order and/or values of \(D\) of the Symmetric Indefinite Factorization.
Parameters:  akeep – symbolic factorization returned by preceding
call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  fkeep – numeric factorization returned by preceding
call to
spral_ssids_factor()
.  options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).  piv_order – may be NULL; otherwise a length n array for the pivot order. On return, \(\,\texttt{piv_order[i]}\) gives the position of variable \(i\) in the pivot order.
 d – may be NULL; otherwise a length 2*n array for the \(2\times2\) block diagonal of \(D\). d[2*(i1)+0] stores \(D_{ii}\) and d[2*(i1)+1] stores \(D_{(i+1)i}\).
 akeep – symbolic factorization returned by preceding
call to

void
spral_ssids_alter
(const double *d, const void *akeep, void *fkeep, const struct spral_ssids_options *options, struct spral_ssids_inform *inform)¶ Alter the entries of the diagonal factor \(D\) for a symmetric indefinite factorization. The pivot order remains the same.
Parameters:  d[2*n] – New entries of \(D\). d[2*(i1)+0] stores \(D_{ii}\) and d[2*(i1)+1] stores \(D_{(i+1)i}\).
 akeep – symbolic factorization returned by preceding
call to
spral_ssids_analyse()
orspral_ssids_analyse_coord()
.  fkeep – numeric factorization returned by preceding
call to
spral_ssids_factor()
.  options – specifies algorithm options to be used
(see
spral_ssids_options
).  inform – returns information about the execution of the routine
(see
spral_ssids_inform
).
Note: This routine is not compatabile with the option
options.presolve=1
.
Derived types¶

struct
spral_ssids_options
¶ The derived data type ssids_options is used to specify the options used within
SSIDS
. The components, that are automatically given default values in the definition of the type, are:
int
array_base
¶ Indexing base for arrays. Either 0 (C indexing) or 1 (Fortran indexing). Default is 0.

int
print_level
¶ Level of printing:
< 0 No printing. = 0 (default) Error and warning messages only. = 1 As 0, plus basic diagnostic printing. > 1 As 1, plus some additional diagnostic printing. The default is 0.

int
unit_diagnostics
¶ Fortran unit number for diagnostics printing. Printing is suppressed if <0. The default is 6 (stdout).

int
unit_error
¶ Fortran unit number for printing of error messages. Printing is suppressed if <0. The default is 6 (stdout).

int
unit_warning
¶ Fortran unit number for printing of warning messages. Printing is suppressed if <0. The default is 6 (stdout).

int
ordering
¶ Ordering method to use in analyse phase:
0 Usersupplied ordering is used (order argument to spral_ssids_analyse()
orspral_ssids_analyse_coord()
).1 (default) METIS ordering with default settings. 2 Matchingbased elimination ordering is computed (the Hungarian algorithm is used to identify large offdiagonal entries. A restricted METIS ordering is then used that forces these on to the subdiagonal).
Note: This option should only be chosen for indefinite systems. A scaling is also computed that may be used in
spral_ssids_factor()
(seescaling
below).The default is 1.

int
nemin
¶ Supernode amalgamation threshold. Two neighbours in the elimination tree are merged if they both involve fewer than nemin eliminations. The default is used if nemin<1. The default is 8.

float
gpu_perf_coeff
¶ GPU perfromance coefficient. How many times faster a GPU is than CPU at factoring a subtree. Default is 1.0.

int
scaling
¶ Scaling algorithm to use:
<=0 (default) No scaling (if scale[]
is not present on call tospral_ssids_factor()
, or usersupplied scaling (ifscale[]
is present).=1 Compute using weighted bipartite matching via the Hungarian Algorithm (MC64 algorithm). =2 Compute using a weighted bipartite matching via the Auction Algorithm (may be lower quality than that computed using the Hungarian Algorithm, but can be considerably faster). =3 Use matchingbased ordering generated during the analyse phase using options.ordering=2
. The scaling will be the same as that generated withoptions.scaling=1
if the matrix values have not changed. This option will generate an error if a matchingbased ordering was not used during analysis.>=4 Compute using the normequilibration algorithm of Ruiz (see SCALING  Sparse matrix scalings). The default is 0.

long
small_subtree_threshold
¶ Maximum number of flops in a subtree treated as a single task. See method section. The default is 4e6.

int
cpu_block_size
¶ Block size to use for parallelization of large nodes on CPU resources. Default is 256.

bool
action
¶ Continue factorization of singular matrix on discovery of zero pivot if true (a warning is issued), or abort if false. The default is true.

int
pivot_method
¶ Pivot method to be used on CPU, one of:
0 Aggressive a posteori pivoting. Choleskylike communication pattern is used, but a single failed pivot requires restart of node factorization and potential recalculation of all uneliminated entries. 1 (default) Block a posteori pivoting. A failed pivot only requires recalculation of entries within its own block column. 2 Threshold partial pivoting. Not parallel. Default is 1.

double
small
¶ Threshold below which an entry is treated as equivalent to 0.0. The default is 1e20.

double
u
¶ Relative pivot threshold used in symmetric indefinite case. Values outside of the range \([0,0.5]\) are treated as the closest value in that range. The default is 0.01.

int

struct
spral_ssids_inform
¶ Used to return information about the progress and needs of the algorithm.

int
cublas_error
¶ CUBLAS error code in the event of a CUBLAS error (0 otherwise).

int
cuda_error
¶ CUDA error code in the event of a CUDA error (0 otherwise). Note that due to asynchronous execution, CUDA errors may not be reported by the call that caused them.

int
flag
¶ Exit status of the algorithm (see table below).

int
matrix_dup
¶ Number of duplicate entries encountered (if
spral_ssids_analyse()
called with check=true, or any call tospral_ssids_analyse_coord()
).

int
matrix_missing_diag
¶ Number of diagonal entries without an explicit value (if
spral_ssids_analyse()
called with check=true, or any call tospral_ssids_analyse_coord()
).

int
matrix_outrange
¶ Number of outofrange entries encountered (if
spral_ssids_analyse()
called with check=true, or any call tospral_ssids_analyse_coord()
).

int
matrix_rank
¶ (Estimated) rank (structural after analyse phase, numerical after factorize phase).

int
maxdepth
¶ Maximum depth of the assembly tree.

int
maxfront
¶ Maximum front size (without pivoting after analyse phase, with pivoting after factorize phase).

int
num_delay
¶ Number of delayed pivots. That is, the total number of fullysummed variables that were passed to the father node because of stability considerations. If a variable is passed further up the tree, it will be counted again.

long
num_factor
¶ Number of entries in \(L\) (without pivoting after analyse phase, with pivoting after factorize phase).

long
num_flops
¶ Number of floatingpoint operations for Cholesky factorization (indefinte needs slightly more). Without pivoting after analyse phase, with pivoting after factorize phase.

int
num_neg
¶ Number of negative eigenvalues of the matrix \(D\) after factorize phase.

int
num_sup
¶ Number of supernodes in assembly tree.

int
num_two
¶ Number of \(2 \times 2\) pivots used by the factorization (i.e. in the matrix \(D\)).

int
stat
¶ Fortran allocation status parameter in event of allocation error (0 otherwise).
inform.flag Return status 0 Success. 1 Error in sequence of calls (may be caused by failure of a preceding call). 2 n<0 or ne<1. 3 Error in ptr[]. 4 CSC format: All variable indices in one or more columns are outofrange.
Coordinate format: All entries are outofrange.
5 Matrix is singular and options.action=false 6 Matrix found not to be positive definite. 7 ptr[] and/or row[] not present, but required as spral_ssids_analyse()
was called with check=false.8 options.ordering out of range, or options.ordering=0 and order parameter not provided or not a valid permutation. 9 options.ordering=2 but val[] was not supplied. 10 ldx<n or nrhs<1. 11 job is outofrange. 13 Called spral_ssids_enquire_posdef()
on indefinite factorization.14 Called spral_ssids_enquire_indef()
on positivedefinite factorization.15 options.scaling=3 but a matchingbased ordering was not performed during analyse phase. 50 Allocation error. If available, the stat parameter is returned in inform.stat. 51 CUDA error. The CUDA error return value is returned in inform.cuda_error. 52 CUBLAS error. The CUBLAS error return value is returned in inform.cublas_error. +1 Outofrange variable indices found and ignored in input data. inform.matrix_outrange is set to the number of such entries. +2 Duplicate entries found and summed in input data. inform.matrix_dup is set to the number of such entries. +3 Combination of +1 and +2. +4 One or more diagonal entries of \(A\) are missing. +5 Combination of +4 and +1 or +2. +6 Matrix is found be (structurally) singular during analyse phase. This will overwrite any of the above warning flags. +7 Matrix is found to be singular during factorize phase. +8 Matchingbased scaling found as sideeffect of matchingbased ordering ignored (consider setting options.scaling=3). +50 OpenMP processor binding is disabled. Consider setting the environment variable OMP_PROC_BIND=true (this may affect performance on NUMA systems). 
int
Example¶
Suppose we wish to factorize the matrix
and then solve for the righthand side
The following code may be used.
/* examples/C/ssids.c  Example code for SPRAL_SSIDS package */
#include "spral.h"
#include <stdlib.h>
#include <stdio.h>
void main(void) {
/* Derived types */
void *akeep, *fkeep;
struct spral_ssids_options options;
struct spral_ssids_inform inform;
/* Initialize derived types */
akeep = NULL; fkeep = NULL; /* Important that these are NULL to start with */
spral_ssids_default_options(&options);
/* Data for matrix:
* ( 2 1 )
* ( 1 4 1 1 )
* ( 1 3 2 )
* ( 2 1 )
* ( 1 2 ) */
bool posdef = false;
int n = 5;
long ptr[] = { 1, 3, 6, 8, 9, 10 };
int row[] = { 1, 2, 2, 3, 5, 3, 4, 4, 5 };
double val[] = { 2.0, 1.0, 4.0, 1.0, 1.0, 3.0, 2.0, 1.0, 2.0 };
/* The righthand side with solution (1.0, 2.0, 3.0, 4.0, 5.0) */
double x[] = { 4.0, 17.0, 19.0, 2.0, 12.0 };
/* Perform analyse and factorise with data checking */
bool check = true;
spral_ssids_analyse(check, n, NULL, ptr, row, NULL, &akeep, &options,
&inform);
if(inform.flag<0) {
spral_ssids_free(&akeep, &fkeep);
exit(1);
}
spral_ssids_factor(posdef, NULL, NULL, val, NULL, akeep, &fkeep, &options,
&inform);
if(inform.flag<0) {
spral_ssids_free(&akeep, &fkeep);
exit(1);
}
/* Solve */
spral_ssids_solve1(0, x, akeep, fkeep, &options, &inform);
if(inform.flag<0) {
spral_ssids_free(&akeep, &fkeep);
exit(1);
}
printf("The computed solution is:\n");
for(int i=0; i<n; i++) printf(" %18.10e", x[i]);
printf("\n");
/* Determine and print the pivot order */
int piv_order[5];
spral_ssids_enquire_indef(akeep, fkeep, &options, &inform, piv_order, NULL);
printf("Pivot order:");
for(int i=0; i<n; i++) printf(" %5d", piv_order[i]);
printf("\n");
int cuda_error = spral_ssids_free(&akeep, &fkeep);
if(cuda_error!=0) exit(1);
}
This produces the following output:
The computed solution is:
1.0000000000e+00 2.0000000000e+00 3.0000000000e+00 4.0000000000e+00 5.0000000000e+00
Pivot order: 2 3 0 1 1
Method¶
Partition of work across available resources¶
Once the ordering has been determined and the assembly tree determined in the analyse phase, the tree is broken into a number of leaf subtrees rooted at a single node, leaving a root subtree/forest consisting of all remaining nodes above those. Each leaf subtree is preassigned to a particular NUMA region or GPU for the factorization phase. Details of the algorithm used for finding these subtrees and their assignment can be found in the paper [2].
The factorization phase has two steps. In the first, leaf subtrees are factorized in parallel on their assigned resources. Once all leaf subtrees are factored, the second phase begins where all CPU resources cooperate to factorize the root subtree.
At present the solve phase is performed in serial.
Data checking¶
If check is set to .true. on the call to spral_ssids_analyse()
or if
spral_ssids_analyse_coord()
is called, the usersupplied matrix data is
checked for errors. The cleaned integer matrix data (duplicates are
summed and outofrange indices discarded) is stored in akeep. The use
of checking is optional on a call to spral_ssids_analyse()
as it incurs both
time and memory overheads. However, it is recommended since the
behaviour of the other routines in the package is unpredictable if
duplicates and/or outofrange variable indices are entered.
If the user has supplied an elimination order it is checked for errors.
Otherwise, an elimination order is generated by the package. The
elimination order is used to construct an assembly tree. On exit from
spral_ssids_analyse()
(and spral_ssids_analyse_coord()
), order[] is
set so that order[i] holds the position of variable \(i\) in the
elimination order. If an ordering was supplied by the user, this order may
differ, but will be equivalent in terms of fillin.
Factorization performed¶
The factorization performed depends on the value of posdef.
If posdef=true, a Cholesky factorization is performed:
Pivoting is not performed, so \(P\) is the permutation determined in the analysis phase. \(S\) is a diagonal scaling matrix.
If posdef=false, a symmetric indefinite factorization is performed:
Pivoting is performed, so \(P\) may differ from the permutation determined in the analysis phase, though it is kept as close as possible to minimize fill. The exact pivoting algorithm varies depending on the particular kernel the algorithm chooses to employ.
Full details of the algorithms used are provided in [1] for GPUs and [2] for CPUs.
Small Leaf Subtrees¶
For subtrees allocated to run on the CPU, the factorization of small nodes near
the leaves of the tree can be amalgamated into a single parallel task (normally
each would be treated as its own OpenMP task to be scheduled). This can reduce
scheduling overheads, especially on small problems. If the total number of
operations for a subtree root at a given node is less than
options.small_subtree_threshold
,
that subtree is treated as a single task.
References¶
[1]  J.D. Hogg, E. Ovtchinnikov and J.A. Scott. (2014). A sparse symmetric indefinite direct solver for GPU architectures. ACM Transactions on Mathematical Software 42(1), Article 1, 25 pages. [DOI: 10.1145/2756548] [Preprint RALP2014006] 
[2]  (1, 2) J.D. Hogg. (2016). A new sparse LDLT solver using a posteriori threshold pivoting. RAL Technical Report. RALTR20160xx, to appear. 