pearu.github.io

CSR format naming conventions

   
Author Pearu Peterson
Created 2021-05-10

The aim of this blog post is to review the naming conventions used by various software packages that implement support for the CSR (Compressed Sparse Row) format. The need for this review originates from a PyTorch issue comment.

The CSR format, originating in the mid-1960s, was introduced to represent two-dimensional arrays (matrices) by three one-dimensional arrays: the values (of length nnz), the extents of rows (of length nrows + 1), and the column indices (of length nnz), where nrows denotes the number of array rows and nnz denotes the number of specified values.
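
To make the three-array layout concrete, here is a minimal pure-Python sketch that builds the arrays from a small dense matrix. The function name `dense_to_csr` and the variable names `values`, `row_extents` and `col_indices` are hypothetical, chosen only for illustration; they do not follow any particular library.

```python
def dense_to_csr(dense):
    """Return (values, row_extents, col_indices) for a list-of-lists matrix.

    Illustrative helper only: zeros in the input are treated as unspecified.
    """
    values, col_indices, row_extents = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        # The running total of stored values marks where the next row starts.
        row_extents.append(len(values))
    return values, row_extents, col_indices


dense = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
values, row_extents, col_indices = dense_to_csr(dense)
nrows, nnz = len(dense), len(values)

print(values)       # [1, 2, 3, 4, 5]
print(row_extents)  # [0, 2, 3, 5]
print(col_indices)  # [0, 2, 2, 0, 1]
```

The values and column indices both have length nnz, while the row extents array has length nrows + 1: row i occupies the slice `row_extents[i]:row_extents[i + 1]` of the other two arrays.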

Note: the notation nnz is an abbreviation of the “number of non-zero” elements. However, the “non-zero” part should not be taken literally because nothing in the CSR format specification requires the specified values to be non-zero. A more appropriate term would be the “number of specified elements” (NSE), but many software packages still use nnz while allowing explicit zero values.
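
The difference between specified and non-zero elements can be illustrated with a small pure-Python sketch. The helper `csr_to_dense` and all variable names are hypothetical and exist only for this example; the point is that a stored value may itself be zero.

```python
def csr_to_dense(values, row_extents, col_indices, ncols):
    """Expand a CSR triple into a list-of-lists dense matrix (illustrative)."""
    nrows = len(row_extents) - 1
    dense = [[0] * ncols for _ in range(nrows)]
    for i in range(nrows):
        for k in range(row_extents[i], row_extents[i + 1]):
            dense[i][col_indices[k]] = values[k]
    return dense


# A 2x3 matrix whose second stored value is an explicit zero.
values = [7, 0, 9]
row_extents = [0, 2, 3]
col_indices = [0, 1, 2]

dense = csr_to_dense(values, row_extents, col_indices, ncols=3)
nse = len(values)                                 # number of specified elements
nonzero_count = sum(1 for v in values if v != 0)  # literal non-zero count

print(dense)               # [[7, 0, 0], [0, 0, 9]]
print(nse, nonzero_count)  # 3 2
```

Here the quantity that software typically calls nnz corresponds to `nse`, not to `nonzero_count`.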

The following table summarizes the CSR format naming conventions used in existing software as well as elsewhere (ordering is arbitrary):

| Software | NSE | values | extents of rows | column indices |
|---|---|---|---|---|
| PyTorch (Python) | nnz | values | crow_indices | col_indices |
| scipy.sparse (Python) | nnz | data | indptr | indices |
| PyData Sparse (Python) | nnz | data | indptr | indices |
| cuSparse (C) | nnz | csrValA | csrRowPtrA | csrColIndA |
| Intel MKL solvers (C) | | values | rowIndex | columns |
| Intel MKL CSR format (C) | | values | rows_start/rows_end | col_indx |
| GNU GSL (C) | nnz | data | row_ptr | col |
| AOCL-sparse | nnz | val | row_ptr | col_ind |
| SPARSEKIT (Fortran) | n | a | ia | ja |
| SparseM (R) | nnz | ra | ia | ja |
| MathDotNet (C#) | ValueCount | Values | RowPointers | ColumnIndices |
| Stan Math Library (C++) | NNZE | w | u | v |
| Magma Sparse (C) | nnz | val | row | col |
| https://arxiv.org/abs/1511.02494 | N | val | rowptr | colind |
| Wikipedia Sparse Matrix | NNZ | V | ROW_INDEX | COL_INDEX |
| Sputnik | nonzeros | values | row_offsets | column_indices |
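
As a rough sketch of how the same three arrays get relabeled from one convention to another, the snippet below encodes a few rows of the table as a lookup. The dictionary `CSR_NAMES` and the function `label_csr` are hypothetical helpers written for this post; only the field names themselves come from the table.

```python
# (values, extents of rows, column indices) names, copied from the table above.
CSR_NAMES = {
    "PyTorch": ("values", "crow_indices", "col_indices"),
    "scipy.sparse": ("data", "indptr", "indices"),
    "cuSparse": ("csrValA", "csrRowPtrA", "csrColIndA"),
    "GNU GSL": ("data", "row_ptr", "col"),
    "SPARSEKIT": ("a", "ia", "ja"),
}


def label_csr(values, row_extents, col_indices, convention):
    """Return the CSR arrays keyed by the field names of one convention."""
    names = CSR_NAMES[convention]
    return dict(zip(names, (values, row_extents, col_indices)))


# The same 2x2 identity matrix, labeled in two different conventions.
arrays = ([1, 1], [0, 1, 2], [0, 1])
as_scipy = label_csr(*arrays, convention="scipy.sparse")
as_torch = label_csr(*arrays, convention="PyTorch")

print(as_scipy)  # {'data': [1, 1], 'indptr': [0, 1, 2], 'indices': [0, 1]}
print(as_torch)  # {'values': [1, 1], 'crow_indices': [0, 1, 2], 'col_indices': [0, 1]}
```

The data are identical in both cases; only the labels differ, which is the whole subject of the table.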

Notes:

Conclusions

The current PyTorch naming convention is satisfactory (IMHO) but not ideal, mainly because crow_indices is not used elsewhere and has a birdish flavor. On the other hand, no naming convention appears to be ideal in general, so I think that PyTorch has the freedom as well as the opportunity to introduce a better naming convention than other software uses with respect to sparse tensor formats. The naming convention must be