CSR tensor support in PyTorch
|
|
Author |
Pearu Peterson |
Created |
2021-04-07 |
The aim of this blog post is to propose a roadmap for completing the
implementation of the CSR layout for PyTorch tensors that is started
in PR 50937.
The CSR layout was introduced to resolve the issue of slow matrix
multiplication with sparse matrices when using the existing COO layout
support in PyTorch. See Roadmap for torch.sparse Matrix Product
API
for overview. The PR 50937 resolves this issue with high success: the
performance of matrix multiplication of sparse matrix-vector and
sparse matrix-dense matrix increases by ten-fold when using the CSR
layout, another five-fold increase is achieved when using Intel MKL
library tools, see CSR vs COO
benchmarks.
However, maintaining the PR 50937 and preparing it for landing has
become an increasingly difficult task because of
- the constant flux in PyTorch and used libraries APIs that causes
periodic merge conflicts,
- inherit inference between different PyTorch features (Autograd,
TorchScript, Testing, etc) that must be taken into account when
implementing a new storage layout,
- the current conversation count (> 250), the discussion items count
(ca 400), and the commit count (ca 260) makes hard to grasp the
overall state of the PR.
It has been
agreed
that the PR will land as it is but the work on the CSR layout support
needs to continue to meet the (still envolving) PyTorch coding and
testing standards.
Here, an attempt will be made to organize and discuss the follow-up
tasks for completing the CSR layout support.
Unresolved discussion items
- https://github.com/pytorch/pytorch/pull/50937#discussion_r562636792 - simplify under contiguity assumption
- https://github.com/pytorch/pytorch/pull/50937#discussion_r562646336, https://github.com/pytorch/pytorch/pull/50937#discussion_r562648465 - hardcoded development style
- https://github.com/pytorch/pytorch/pull/50937#discussion_r562656595 - docs
- https://github.com/pytorch/pytorch/pull/50937#discussion_r563184880 - VS build failure
- https://github.com/pytorch/pytorch/pull/50937#discussion_r563185162 - int32 vs int64 performance, releated to https://github.com/pytorch/pytorch/issues/56959
- ~https://github.com/pytorch/pytorch/pull/50937#issuecomment-766777337 - is it fixed?~ yes, csr tensor repr is ok.
- ~https://github.com/pytorch/pytorch/pull/50937#issuecomment-766778930 - is it fixed?~ yes, csr tensor size is deduced from indices
- ~https://github.com/pytorch/pytorch/pull/50937#issuecomment-803905480 - is it fixed?~ yes, docs look ok, although doc build success is not verified
- https://github.com/pytorch/pytorch/pull/50937#discussion_r603533421 - is it fixed?, no: decide if
crow_indices()
should return Tensor
or const Tensor &
- https://github.com/pytorch/pytorch/pull/50937#discussion_r603535187 - is it fixed?, no AFICT: need single point entry for constructing csr tensor
- https://github.com/pytorch/pytorch/pull/50937#discussion_r603536011 - resize internal tensors for memory efficiency https://github.com/pytorch/pytorch/issues/56696, https://github.com/pytorch/pytorch/pull/56744
- https://github.com/pytorch/pytorch/pull/50937#discussion_r603537183 - eliminate memory format, https://github.com/pytorch/pytorch/issues/56697
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604353856 - test return value~, fixed in master
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604354153 - use C++ RAII~, fixed in master
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604379873 - dims/size checks, we’ll need implement tests for bad inputs
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604968109 - code repetitions~, looks resolved to me
MKL and Windows build issues
- https://github.com/pytorch/pytorch/pull/50937#issuecomment-778288404
and conversation following that. The issue is summarized in
https://github.com/pytorch/pytorch/pull/50937#issuecomment-779272492
with possible solutions.
MKL and macOS build issues
- https://github.com/pytorch/pytorch/pull/50937#issuecomment-803909578
CSR indices support int32 and int64
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r603538191 - expect one int type~
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604349014 - why not default to int32?~
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604969458~
See also https://github.com/pytorch/pytorch/issues/56959 that summarizes
the background of the three discussion items above.
IIUC, it would be preferred to support a single dtype for CSR indices
tensors. (Explain why this preference). While COO uses only int64 as
dtype for indices, a natural choice for the dtype of CSR
crow/col_indices
would also be int64
. As a result, the situation
would be simpler for users, conversion from COO to CSR and CSR to COO
would be memory/processor efficient, etc. However, a big performance
gain in matrix multiplication is achieved when using Intel MKL library
tools but with the current MKL support in PyTorch, only int32 indices
can be used as inputs to MKL routines.
So, we could fix dtype to int64 (as in COO) but at the expense that
one needs int64->int32 conversion (and inverse?) whenever calling an
MKL routine.
We could also fix dtype to int32 (that would be most efficient when
using MKL support) but at the expense that all conversions between COO
and CSR would be more expensive than necessary.
Btw, the conversion between COO and CSR is important because the COO
layout is the most human-friendly layout for constructing sparse
tensors while the CSR layout is computationally much more efficient
than COO.
COO to CSR conversion
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604346660 - slow because implemented in Python, about 3x slower than gcs, https://github.com/pytorch/pytorch/issues/57381
- https://github.com/pytorch/pytorch/pull/50937#discussion_r608326213 - dense-csr without coo
Not much to discuss here: for efficiency, implement the direct dense
to CSR conversion in C++. This will be important when we decide that
the indices of COO and CSR will have different dtypes. Otherwise, I
would not expect much performance gain.
Testing
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r604348471 - see modern COO testing https://github.com/pytorch/pytorch/issues/56371~
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r608327057 - don’t set default_dtype_type https://github.com/pytorch/pytorch/issues/56369~, fixed
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r608327057 - avoid using numpy https://github.com/pytorch/pytorch/issues/56371~, fixed
- ~https://github.com/pytorch/pytorch/pull/50937#discussion_r608675701 - inefficient CSR samples https://github.com/pytorch/pytorch/issues/56371~, fixed
Avoid COO-isms
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604364248
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604365304 - empty indices, https://github.com/pytorch/pytorch/issues/56696, WIP: https://github.com/pytorch/pytorch/pull/56744
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604369842 - aliasing, what happens to CSR after resizing values? CSR invariants!
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604379081 - always require contiguity?
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604379537 - device testing
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604384386 - introduce sparse_csr namespace
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604974231
- https://github.com/pytorch/pytorch/pull/50937#discussion_r608312412 - docs
Code quality
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604972242
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604972910 - CPU logic
- https://github.com/pytorch/pytorch/pull/50937#discussion_r604989807 - avoid auto-deduction
- https://github.com/pytorch/pytorch/pull/50937#discussion_r608326624 - use assertExpectedInline
Deal for landing PR 50937
- https://github.com/pytorch/pytorch/pull/50937#issuecomment-811253895 - the deal of landing
- https://github.com/pytorch/pytorch/pull/50937#issuecomment-812888029 - tentative plan
- https://github.com/pytorch/pytorch/pull/50937#discussion_r608312830 - dtype/device parameters
- https://github.com/pytorch/pytorch/pull/50937#issuecomment-814944029 - more todo items
Main features missing
- CUDA support in the CSR layout https://github.com/pytorch/pytorch/issues/56485
- Inference with Autograd - this is relevant also to COO layout as to
sparse tensor support in general.
- Generalization of CSR as N-dimensional tensor