Author | Pearu Peterson |
Created | 2021-05-10 |
The aim of this blog post is to review the naming conventions used in various software packages that implement CSR format support. The need for this review originates from a PyTorch issue comment.
The CSR format, originating from the mid-1960s, was introduced to represent two-dimensional arrays (matrices) by three one-dimensional arrays:
- values, of size `nnz`
- extents of rows, of size `nrows + 1`
- column indices, of size `nnz`

where `nrows` denotes the number of array rows and `nnz` denotes the number of specified values.
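To make the three arrays concrete, here is a minimal pure-Python sketch that builds them from a small dense matrix. The function name `dense_to_csr` is hypothetical, the array names follow the PyTorch convention, and the sketch assumes that only non-zero entries are treated as specified.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists matrix to CSR arrays, storing only non-zero entries."""
    values = []          # specified values, length nnz
    crow_indices = [0]   # extents of rows, length nrows + 1
    col_indices = []     # column index of each specified value, length nnz
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        # After each row, record how many values have been stored so far.
        crow_indices.append(len(values))
    return values, crow_indices, col_indices


matrix = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
values, crow_indices, col_indices = dense_to_csr(matrix)
print(values)        # [1, 2, 3, 4, 5]
print(crow_indices)  # [0, 2, 3, 5]
print(col_indices)   # [0, 2, 2, 0, 1]
```

The values of row `i` are `values[crow_indices[i]:crow_indices[i + 1]]`, which is why the row extents array needs one more entry than there are rows.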
Note: the notation `nnz` is an abbreviation of the “number of non-zero” elements. However, the “non-zero” part should not be taken literally because nothing in the CSR format specification requires that the specified values be non-zero. A more appropriate term would be the “number of specified elements” (NSE), but many software packages still use `nnz` while allowing explicit zero values.
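The difference between the two counts can be illustrated with hand-written CSR arrays. This is a sketch on a hypothetical 2x3 matrix that stores one explicit zero; the variable names `nse` and `n_nonzero` are chosen for illustration only.

```python
# CSR arrays for a hypothetical 2x3 matrix with an explicitly stored zero.
values = [7.0, 0.0, 9.0]   # the 0.0 is a specified element
crow_indices = [0, 2, 3]   # row 0 holds values[0:2], row 1 holds values[2:3]
col_indices = [0, 1, 2]

nse = len(values)                              # number of specified elements
n_nonzero = sum(1 for v in values if v != 0)   # literal count of non-zero values

print(nse)        # 3
print(n_nonzero)  # 2
```

A library that reports `nnz` for such data would typically report the first number, which is the sense in which the name is not literal.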
The following table summarizes the CSR format naming conventions used in existing software as well as elsewhere (ordering is arbitrary):
Software | NSE | values | extents of rows | column indices
---|---|---|---|---
PyTorch (Python) | `nnz` | `values` | `crow_indices` | `col_indices`
scipy.sparse (Python) | `nnz` | `data` | `indptr` | `indices`
PyData Sparse (Python) | `nnz` | `data` | `indptr` | `indices`
cuSparse (C) | `nnz` | `csrValA` | `csrRowPtrA` | `csrColIndA`
Intel MKL solvers (C) | | `values` | `rowIndex` | `columns`
Intel MKL CSR format (C) | | `values` | `rows_start`/`rows_end` | `col_indx`
GNU GSL (C) | `nnz` | `data` | `row_ptr` | `col`
AOCL-sparse | `nnz` | `val` | `row_ptr` | `col_ind`
SPARSEKIT (Fortran) | `n` | `a` | `ia` | `ja`
SparseM (R) | `nnz` | `ra` | `ia` | `ja`
MathDotNet (C#) | `ValueCount` | `Values` | `RowPointers` | `ColumnIndices`
Stan Math Library (C++) | `NNZE` | `w` | `u` | `v`
Magma Sparse (C) | `nnz` | `val` | `row` | `col`
https://arxiv.org/abs/1511.02494 | `N` | `val` | `rowptr` | `colind`
Wikipedia Sparse Matrix | `NNZ` | `V` | `ROW_INDEX` | `COL_INDEX`
Sputnik | `nonzeros` | `values` | `row_offsets` | `column_indices`
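Whatever its name, the “extents of rows” array does not hold row indices itself; it holds the boundaries from which row indices can be generated. A minimal pure-Python sketch of such a generator follows. The function name `row_indices_from_extents` is hypothetical and not taken from any of the libraries listed in the table.

```python
def row_indices_from_extents(crow_indices):
    """Expand a row-extents array into one row index per specified value."""
    row_indices = []
    for i in range(len(crow_indices) - 1):
        start, end = crow_indices[i], crow_indices[i + 1]
        # Row i owns the specified values at positions start..end-1.
        row_indices.extend([i] * (end - start))
    return row_indices


crow_indices = [0, 2, 3, 5]
print(row_indices_from_extents(crow_indices))  # [0, 0, 1, 2, 2]
```

The output has one entry per specified value, so pairing it with the column indices array gives the coordinate (COO) view of the same data.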
Notes:
- The use of `NNZ` for the number of specified elements is dominant. The documentation of various software defines it as the “number of non-zero” elements and then, in the next sentence, may mention that explicit zero values are allowed. In summary, the usage of `NNZ` can be characterized as “the most consistently used inconsistency between the notation and the actual definition”.
- `ia`, `w`, etc.
- Names such as `row`, `row_index`, or `rowIndex` for the “extents of rows” array are not really good role models for choosing naming conventions (IMHO) because they are misleading: the values of the “extents of rows” array are never row indices (they are input parameters to row index generators).
- Pointer-based names are common: `RowPtr`, `rowptr`, `row_ptr`, `indptr`. However, the usage of “pointers” may be confusing for C/C++ programmers because in C/C++ the term means “a memory address of a variable”.
- PyTorch uses `crow_indices` for the “extents of rows”; the name is derived from the phrase “Compressed ROW INDICES”. However, an unwitting user may relate “crow” to the bird.

The current choice of the PyTorch naming convention is satisfactory (IMHO) but not ideal, mainly because of the `crow_indices` choice, which is not used elsewhere and has a birdish flavor. On the other hand, there appears to be no naming convention that would be ideal in general, and therefore I think that PyTorch has the freedom as well as the opportunity to introduce a better naming convention than those of other software with respect to sparse tensor formats. The naming convention must be