Representing Matrices as JSON Objects: Part 2 - Sparse Matrices
Representing a Sparse Matrix as a JSON object is a task that appears in many modern data science contexts. While there is no universally agreed way to achieve this task, in this post we discuss a number of options and the associated tradeoffs.
Recap of Part 1 of the Matrix-to-JSON Post Series
In the first installment of this series, Part 1 we discussed the motivation behind representing and serializing matrices as JSON objects. We defined relevant concepts and in particular the concept of unrolling the matrix into a one-dimensional array and the notion of Column and Row Major orders. We outlined some use cases of interest and initiated a benchmarking exercise that looks into various R and Python JSON serialization utilities (available at the matrix2json repository).
In this second post of this series, we revisit the issue of matrix representations in JSON format for the special (but important) sub-category of matrices that are termed sparse.
What exactly is a Sparse Matrix?
A Sparse Matrix is a special class of matrices that shows up in several subfields of mathematics (Notable certain subfields of Linear Algebra and Graph Theory). Sparse matrices are simply matrices (2D Arrays) where a substantial number of elements are zero. The visual representation of a sparse matrix is similar to that of a regular (full or dense) matrix, the only difference concerns the typical values. E.g., consider the matrix:
$$ B = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 4 \\ \end{bmatrix} $$
Out of the 16 possible values of this 4x4 matrix, only four are non-zero.
The question we want to tackle in the second post of the series is how to work with this type of matrix, namely how to represent it using a JSON object. As mentioned in Part 1, in particular matrices with many elements (large size) and significant sparsity (many zero values) may create opportunities to economize on memory and/or computational time. This applies both to the representation of the matrix in-memory and its serialization on disk.
In-Memory Representations of Sparse Matrices
The representation of sparse matrices (with many zero elements) in computer memory requires different approaches depending on the context and application requirements. The basic strategy is that memory requirements might be reduced by storing only the non-zero entries. There is no unique way to do this. The key trade-off is that accessing individual elements may become less straightforward but worse, (depending on the structure of the matrix), storing the additional required information might also carry a penalty. Broadly speaking, in choosing a representation one must balance potentially competing requirements:
- Efficient memory usage (total amount of memory or disk space used)
- Efficient modification of the matrix (enlarging or reducing the matrix)
- Efficient element access and matrix operations (linear algebra etc.)
A list of various formats that have been proposed historically and are typically mentioned in sparse matrix context is1,2 given by:
- COOrdinate list format (also IJV or Triplet format)
- Dictionary of Keys (DOK) format
- Compressed Sparse Row (CSR) format
- Compressed Sparse Column (CSC) format
- Block Sparse Row (BSR) format
- List of Lists (LIL) format
We will explore here the main design ideas behind this list.
Note: More specialized techniques and variations may be applicable for (anti)symmetric, banded or other special matrices. For example, an antisymmetric matrix where, by construction, the element $(i, j)$ is equal to minus the element $(j, i)$ so in principle this matrix could be stored using half the memory. Matrix libraries may offer specialized tools for working with such matrices (for example SciPy offers the DIAgonal format).
We will next explore how the above list of in-memory representation proposals can be translated into a JSON serialization.
JSON Representation of Sparse Matrices
In Part 1 we saw two natural JSON representations of full matrices: The one-dimensional array format
["a11", "a12", "a13", "a21", "a22", "a23", "a31", "a32", "a33"]
and the two-dimensional array-of-arrays format:
[
["a11", "a12", "a13"],
["a21", "a22", "a23"],
["a31", "a32", "a33"]
]
These can obviously be used to represent sparse matrices when the size is not a concern!
The Coordinate List (COO) Format
The Coordinate List format (COO) representation is the first of the additional formats we will explore. It is actually very intuitive and well known, as it is effectively a table where each matrix element is represented by a table row. The first column of the table indicates the matrix row index (i), the second table column indicates the matrix colum index (j) and the third table column indicates the element value (v). That is why this format is sometimes called IJV format.
Matrix Row i | Matrix Column j | Matrix Value v |
---|---|---|
1 | 1 | $a_{11}$ |
1 | 2 | $a_{12}$ |
… | … | … |
2 | 3 | $a_{23}$ |
3 | 3 | $a_{33}$ |
Hence, to represent a matrix element we need a triplet of entries $i, j, v$, which motivates the name “triplet format”.
When is the COO format efficient?
Notice that when the matrix is full, the COO format is not very efficient (in terms of storage requirements). This is because in the array-of-arrays format the row and column indices are implicit. The strict incremental ordering of row and index values obviates the need to record them explicitly. Here, instead, they must be represented by integer values. This is potentially tripling the amount storage required. Consider, though, the case of our sparse matrix $B$ example.
In COO format it reads:
Matrix Row | Matrix Column | Matrix Value |
---|---|---|
1 | 2 | 1 |
2 | 1 | 2 |
3 | 2 | 3 |
4 | 4 | 4 |
Instead of 16 elements we need to represent only the four non-zero elements. In total, we need 12 values. This balance is the kind of calculus that can make a significant difference for large sparse matrices, depending on their sparsity value.
Sparse Matrix to JSON in C++ Linear Algebra Libraries
In Part 1 we explored matrix serialization using a number of Python and R libraries. Here we expand the programming language toolkit to include two C++ Linear Algebra libraries Armadillo and Eigen that are used widely for matrix related calculations.
Armadillo C++ Library
Armadillo is a popular C++ library for linear algebra and scientific computing. Armadillo provides classes for working (performing linear algebra calculations) with sparse matrices. The Non-zero elements are stored in compressed sparse column (CSC) format (which we will explore more below); zero-valued elements are never stored.
In terms of serialization, the Armadillo matrix classes feature a save method that admits a number of options for formats (binary, csv, hdf5 etc.). The option of most interest to us here is the coord_ascii option, which saves data in the coordinate format.
Specifically, numerical data are stored as a text file in coordinate list format, without a header. Only the non-zero values are stored. For real matrices, each line contains information in the format: row, column, value. For complex matrices, each line contains information in the format: row, column, real_value, imag_value.
This type of serialization applies to both regular and sparse matrices. The following snippet illustrates (all code availabel at the repository).
int main(int argc, const char **argv) {
// Initialize the random generator
arma::arma_rng::set_seed_random();
// Create a 4x4 random matrix and print it on the screen
arma::Mat<double> A = arma::randu(4, 4);
std::cout << "A:\n" << A << "\n";
// Create a new diagonal matrix using the main diagonal of A:
arma::Mat<double> B = arma::diagmat(A);
std::cout << "B:\n" << B << "\n";
// Save matrices A and B using COO format:
A.save("A_mat.txt", arma::coord_ascii);
B.save("B_mat.txt", arma::coord_ascii);
return 0;
}
We thus are almost there with serializing a sparse matrix, but what we have is an ASCII text file, not a JSON file! The problem of converting tabular (e.g., CSV) data to JSON is so common there is a W3C Recommendation about it! According to that recommendation, the Minimal Mode conversion would include only the information gleaned from the cells of the tabular data within the output. For our sparse matrix $B$ this would be the following:
[
{"i": 1, "j": 2, "v": 1},
{"i": 2, "j": 1, "v": 2},
{"i": 3, "j": 2, "v": 3},
{"i": 4, "j": 4, "v": 4}
]
In other words, the sparse matrix can be stored as an array of objects, which in turn are simple dictionaries with a triplet of key/value pairs. Alas, Armadillo itself does not include a built-in JSON serialization facility. We could “roll our own” or use a specialized JSON/C++ library such as rapidJSON that allows working with JSON objects.
Eigen C++ Library
Eigen is another popular C++ library for linear algebra and scientific computing. Eigen provides classes for working (performing linear algebra calculation) with sparse matrices. The Eigen class SparseMatrix is the main sparse matrix representation in Eigen. For in-memory storage Eigen implements a variant of the Compressed Column (or Row) Storage scheme. This representation consists of four compact arrays:
- Values: stores the coefficient values of the non-zeros.
- InnerIndices: stores the row (resp. column) indices of the non-zeros.
- OuterStarts: stores for each column (resp. row) the index of the first non-zero in the previous two arrays.
- InnerNNZs: stores the number of non-zeros of each column (resp. row). The word inner refers to an inner vector that is a column for a column-major matrix, or a row for a row-major matrix. The word outer refers to the other direction.
We will not go over this design in more detail but rather illustrate how Eigen works with Sparse matrices:
int main(int argc, const char **argv) {
IOFormat HeavyFmt(FullPrecision, 0, ", ", ",\n", "[", "]", "[", "]");
// Initialize and print a random sparse matrix
int n = 4; // size of matrix
int k = 4; // number of non-zero elements
typedef Eigen::Triplet<double> T;
std::vector<T> tripletList;
tripletList.reserve(4);
std::cout << "Sparse Matrix B in Coordinates Format" << std::endl;
std::cout << "i" << " " << "j" << " " << "v" << std::endl;
for (int e = 0; e < k; e++) {
int i = rand() % 4;
int j = rand() % 4;
double v = rand() % 1000;
std::cout << i << " " << j << " " << v << std::endl;
tripletList.push_back(T(i, j, v));
}
SparseMatrix<double> B(n, n);
B.setFromTriplets(tripletList.begin(), tripletList.end());
std::cout << "Sparse Matrix B in Array of Arrays Format:\n" << B.toDense().format(HeavyFmt) << std::endl;
return 0;
}
As with Armadillo, Eigen does not provide built-in JSON serialization. Nevertheless, the IOFormat class provides significant flexibility in structuring matrix output.
Matrix to JSON Using the Python Scipy package
As one final example of programmatic representation of sparse matrices we take a look the Python SciPy package. SciPy provides fundamental algorithms for scientific computing in Python. scipy.sparse is the SciPy 2D sparse array package for numeric data. The following snippet illustrates how one works with sparse matrices using SciPy.
import scipy
import numpy
import json
rng = numpy.random.default_rng()
M = 4
a = scipy.sparse.random_array((M, M), density=0.2, format='coo', dtype=float, random_state=rng)
print(a) # print the sparse matrix directly
# Serialize to JSON
b = a.todense()
s_out = json.dumps(b.tolist())
f = open("b.json", "w")
f.write(s_out)
f.close()
It should come as no surprise that here too, there is no native JSON serialization. In order to save a sparse matrix to JSON we need to first convert it into a dense matrix (which might not be feasible for largfe matrices). For example, for a random sparse 4x4 matrix we get:
[[0.94540984, 0., 0.43350626, 0. ],
[0.06298551, 0., 0., 0. ],
[0., 0., 0., 0. ],
[0., 0., 0., 0. ]]
Interestingly though, SciPy offers an alternative ASCII serialization that is quite close to the COO format.
The Dictionary of Keys (DOK) Format
The same matrix as above can represented in DOK format as (NB: we shifted Python’s output to base 1):
(1, 1) 0.9454098352637809
(1, 3) 0.43350625608897153
(2, 1) 0.06298551100462368
What is different here compared to the COO format is the use of Python’s built-in tuple data structure to store the matrix element coordinates in the form $(i, j)$. Using tuples we have now effective a hashtable or dictionary of key/value pairs. This approach is generalizable to higher order matrices (aka tensors).
Unfortunately JSON does not support tuples. The implication is that the DOK format cannot be used as is. But with a small hack we can convert the keys into strings and have e.g., the following JSON representation (notice that the order of the key/value pairs is not important):
{
"(2, 1)": 0.06298551100462368,
"(1, 3)": 0.43350625608897153,
"(1, 1)": 0.9454098352637809
}
The disadvantage is of course that keys must always be translated back and forth between strings and numbers.
The Compressed Sparse Row (CSR) Format
The Compressed Sparse Row (CSR) format represents a sparse matrix using three one-dimensional arrays. It is an important design which we now describe briefly. It’s full ramifications can only be appreciated when using it to express matrix computations.
The overall strategy is to use the following structure:
{
"V": [],
"CI": [],
"IP": []
}
in other words a dictionary of three arrays.
The Values Array
The array denoted Values ($V$) stores the values of the non-zero elements of the matrix, in unrolled, sequential order from left to right along each row, then from the top row to the bottom. It is thus just our well-known Row Major Order representation, except it only includes the non-zero values: $$ V = [v_{1}, v_{2}, \ldots, v_{nz}] $$
where $nz$ is the number of non-zero elements.
In our $B$ matrix example $nz = 4$ and
$$ V = [1, 2, 3, 4] $$
The Column Indexes Array
The array denoted Column Indices ($CI$) stores the column index for each of these non-zero elements. This array is also of size $nz=4$. In our example:
$$ CI = [2, 1, 2, 4] $$
The Index Pointer Array
The array denoted $IP$ (Index Pointer) (or Row Offset) associates row indices and the matrix elements. Its length is equal to the matrix rows (including zero rows) plus 1 $(n + 1)$, where $n x n$ is the size of the matrix.
For each row $i$, if the row has non-zero elements, $IP[i]$ records the index in the Value (or Column Indices) array that corresponds to the first non-zero element of that row. In other words, in the unrolled matrix picture, the index pointer array helps slice the one-dimensional Value array into its the compressed (with zeroes expunged) component rows.
For a row that has no non-zero elements, $IP$ records the index of the next non-zero element occurring in the matrix. For convenience in iterative algorithms, the last element of $IP$ stores the position of the last non-zero element or the size of the Value array ($nz$).
Putting all together for our example:
{
"V": [1, 2, 3, 4],
"CI": [2, 1, 3, 4],
"IP": [1, 2, 3, 4, 4]
}
The List of Lists (LIL) Format
Finally, a linked list is a classic data structure that represents linear collection of data elements whose order is not given by their physical placement in memory but each element points to the next. The standard implementation of a linked list in computer memory, using pointers, thus not particularly suitable for serialization into JSON.