Representing Matrices as JSON Objects: Part 1

Representing a matrix as a JSON object is a task that appears in many modern data science contexts, in particular when one wants to exchange matrix data online. While there is no universally agreed way to achieve this task in all circumstances, in this series of posts we discuss a number of options and the associated tradeoffs.

Motivation and Objective

Representing a matrix as a JSON object is a task that appears in many modern data science contexts, in particular when one wants to exchange matrix data online. There is no universally agreed way to achieve this task and various options are available depending on the matrix type and the programming tools and environment one has available. Matrices are in general not “native” structures in computing environments but are handled with speficic packages (modules, extensions or libraries).

In this post we will discuss a number of options and the associated tradeoffs for common current setups in python, R, julia and C++. To make the task finite, we focus on the following requirements and metrics:

  • The processing speed when serializing / deserializing from / into computer memory
  • The storage requirements on disk. Those can serve also as a proxy for the speed of transmission over the internet (or storage requirements in databases)
  • The convenience and ability to represent a family of similar matrices

In this first post on the topic we will cover quite a bit of ground defining the concepts and use cases of interest and initiating a benchmarking exercise. In subsequent posts we will explore additional issues and/or go in more depth in specific topics. The cumulative results are in the matrix2json repository.

What exactly is a matrix?

A matrix is a versatile mathematical object that shows up in several subfields of mathematics (Linear Algrebra, Geometry, Graph Theory, Probability to name but a few). As part of an applied mathematics toolkit matrices are used in countless applications in many scientific and technical disciplines. For example, matrices are an essential element of many physical theories such as quantum mechanics. As another broad category, the probabilistic interpretation of matrices is the basis of areas such as Markov Chains which find many applications in economics and finance.

The utility of matrices means that from the early days of computing they have been related implementations1. In computer science matrices are frequently termed 2D Arrays. There is a subtle but important semantic nuance here: using the term matrix instead of array implies an object that may have additional properties, whereas a 2D array can be thought more as a data container. We will return to this topic below.

The visual impression of a matrix is familiar enough, a rectangular array or table of numbers, symbols, or expressions (denoted the matrix elements), arranged in rows and columns, for example something like:

$$ \begin{bmatrix} 10 & 1.2 & -1 \\ 2.1 & 5 & -6.2 \\ -20 & 100 & 0 \end{bmatrix} $$

More generally, we may introduce an abstract matrix as follows: $$ {\bf A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33} \end{bmatrix} $$

The above representation is entirely oriented towards human (and more specifically: visual) perception. The brackets to the left and right of the matrix elements aim to visually enclose the elements and isolate them from other information bits - they have no further meaning. The geometrical layout of the elements $a_{ij}$ (ordered in rows and columns with white space in between) is important in two ways. The up/down and left/right dimensions of the page are used to introduce the concept of columns and rows, while the relative placement (location) of each element assigns implicitly its row and column index.

The actual meaning and usage of the above neatly arranged numbers in various domains is the subject of year-long mathematics courses, ranging from basic introductions to linear algebra to sophisticated specialized applications. In what follows we will not attempt even to summarize all the possible variations, as it would take us too far afield. We will selectively reference some specific use cases as they illustrate the practical relevance of representation approaches.

The use context in which a matrix appears determines a number of important attributes for our purposes:

The Matrix Size

The size of the matrix (denoted as N x M where N the number of rows and M the number of columns). The special class of square matrices (N x N) appears quite frequently but we will not need to confine the discussion to this subclass.

The smallest non-trivial matrix dimension is 2 2. The class of 2x2 matrices is already extremely useful in many domains. Such small sized 2x2 (and 3x3 matrices) are very prevalent in e.g. computer graphics and applications related to two and three-dimensional geometries. On the other extreme we have “big data” type matrices where N can range in the billions. For our purposes we can informally group matrices by size as follows:

  • Very small size $N \in [2,3]$. Here the context, meaning and operations on matrix elements is very rigidly defined. The implementations and representations of such matrices will in general be customized and optimised to the application (and is outside our scope).
  • Small size $3 < N < 20$. This typically a domain where each row and column of the matrix carries specific relevance and weight (may, e.g., be a labelled / distinguishable index). An example would be e.g., a transition matrix as used in financial applications to model credit transitions. In this case the size of the matrix is determined by the distinct (labelled) credit rating categories recognized by a rating system.
  • Medium size $20 < N < 1000$. In this size range individual elements of the matrix may no longer be special in any way (but are still individually indexed and identifiable). Efficient representation and working with such matrices is our primary focus in this post.
  • Large size $N > 1000$. This enters into the domain of big data and is outside the scope of the current discussion. For very large matrices the size itself may be fluidly defined and working with such data will invariably require specialized techniques.

Matrix Data Type

The data type of the matrix elements: Applications from applied mathematics will typically involve matrix elements that have a uniform and elementary mathematical type such as integers, real numbers or (occasionally) complex numbers. In principle other more complex types could be present. Here we will simplify the discussion by focusing on real numbers (represented as float / double precision etc.).

In concrete applications the accurate representation of data types might be of extreme importance to avoid information loss. The JSON standard that will be our target representation of matrix data is not particularly precise about numerical precision 3. Yet in principle numbers can be encoded as strings (and thus preserve all required accuracy), so we leave this complication outside the scope.

Matrix Density

The density of the matrix is the frequency with which the zero element appears in the matrix. Matrices that feature large number (e.g. > 50%) of zeros are termed sparse. Non-sparse matrices are termed dense.4 For very large matrices, the density of a matrix becomes a critical aspect in both storage and manipulation performance. We will revisit issues around sparse matrices in a future post.

To summarize, in the representation of matrices we selected three attributes (Size, Data Type, Density) as quite crucial for our purposes. The true utility of matrices in applications only emerges once we add further features, such as:

  • constraints on the possible values of matrix elements. These can be simple and directly observable, e.g., positive values or (anti)symmetric structure or hidden (implied) in the relations of different values (e.g. invertible matrices, orthogonal matrices etc.)
  • operations on matrices, leading to different matrix algebras. There is a vast universe of possible matrix “spaces” spanned by such operations. For our purposes the prototypical example of operations are additions and multiplications of matrices, with scalars, vectors and other matrices.

As a final note, the properties that distinguish matrices from 2D arrays are precisely those not explicitly expressed by the data representations of a matrix. We can think of them instead as metadata that must be communicated separately to convey the information as to what precisely is encoded numberically.

In-Memory Representations of Matrices

In the previous section we discussed briefly matrices and their human oriented depiction on paper or screen. Now we turn briefly to the representation of matrices inside computer memory. In computer applications matrices are concrete formations of data (values) that live inside a digital computer’s (RAM) memory. Interestingly enough, standard computer memory is actually a two-dimensional structure laid out on the surface of a silicon chip.

Physical RAM

Physical RAM

Yet for reasons of simplicity, that two-dimensional physical geometry of computer memory is generally not accessible. Computer memory is addressed instead via linear, one-dimensional, logical addresses using the computer’s memory management unit. Thus, memory appears to a program as a single, continuous address space. In turn this means that some type of strategy is required when attempting to store composite structures such as matrices.

We will not enter here into the low-level details of how matrix representations might be stored in memory but what is important is to appreciate

  • the general approach (For further details 5 ) and
  • the design has can have significant performance implications 6.

Since computer memory is accessed as a linear structure (one-dimensional address space), a matrix must be “unwound”. We pick one of the two dimensions to start the unwinding process and we proceed to one row or column at a time until we exhaust them all:

A matrix is stored in the computer's memory in contiguous (sequential) memory locations as one row following another, or, as one column following another. The two options are termed Row-Major Order and Column-Major Order respectively.

The following visualization illustrates the “matrix unrolling” strategy:

Row and Column Major Order

Row and Column Major Order

Hence, if our system uses Row Major Order the matrix would look like $$ A = [a_{11}, a_{12}, a_{13}, a_{21}, a_{22}, a_{23}, a_{31}, a_{32}, a_{33}] $$

(and similarly for Column Major Order). In high level languages these representations would be hidden by the user.

What is JSON and why use it?

JSON (JavaScript Object Notation) is a lightweight data interchange format 7 (compared, e.g., to XML or HDF5). It is relatively easy to read by humans (depending on size and structure) but can also be processed by computers. As implied by the name, it is based on a subset of the JavaScript Programming Language which has made it a de-facto standard for exchanging data between networked web applications. The specification itself is programming language / computer system independent. The implementation of any associated tools to work with JSON objects will obviously depend on language features (and will inherit, e.g., performance or ease of use characteristics from that language).

As JavaScript became the default language of client-side web development, JSON adoption expanded. Being both human-readable and machine-readable JSON moved beyond web pages and is increasingly used in software everywhere. Hence, exchanging various types of quantitative data (including matrix data) in JSON format becomes quite relevant. Alas, the tradeoff of a light specification is that there is no standard way to encode matrices (and arrays more general).

As a starting point, JSON can represent four primitive types (strings, numbers, Booleans, and null) and two structured types (objects and arrays). Further, JSON has two core structures that can be used in matrix representation:

  • A collection of name/value pairs. In different programming languages, this is realized (and termed) an object, a record, a struct, a dictionary, a hash table, a keyed list, or an associative array.
  • An ordered list of values. Within a programming language environment, this is realized as an array, a vector, a list, or a sequence.

The key/value pair structure looks like this:

{
  "name":"John", 
  "age":31, 
  "city":"New York"
}

whereas the array structure looks like this:

["Chilean", "Argentinean", "Peruvian", "Colombian"]

JSON Representations of Matrices

We turn now to the possible choices representing matrices using the above options. Obviously the unrolled column and row major orders we discussed above are in immediately feasible in JSON using the array element:

["a11", "a12", "a13", "a21", "a22", "a23", "a31", "a32", "a33"]

This representation is pleasantly simple and (pressumably) efficient but the challenge is that it does not convey any shape information. Hence, the receiver of the matrix data would, in general, have difficulty interpreting it correctly. This can in principle be resolved by constructing a slightly more complex object that adds the missing pieces of information:

  • an integer value “shape” that determines the size of the 1D row or column arrays and
  • a binary value “order” that takes value 0/1 to indicate row / column order
{
  "shape" : 3,
  "order" : 0,
  "data" : ["a11", "a12", "a13", "a21", "a22", "a23", "a31", "a32", "a33"]
}

While this is a more complete design, it still requires the receiving party to recognize and parse the object as following this particular standard 8. Therefore, we turn next to de-facto standards, namely the structures produced by popular programming environments.

Matrices in Programming Languages

Let us first introduce a concrete matrix example that will help us both illustrate concepts and measure performance. Please meet matrix $A(3)$, a small square matrix who’s elements span the range 1-9: $$ A = \begin{bmatrix} 1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9 \end{bmatrix} $$

A larger sibling of $A(3)$ would be matrix $A(1000)$, a 1000x1000 matrix who’s elements span the range 1 to 1 million (for obvious reasons we will not represent it explicitly):

$$ A(1000) = \begin{bmatrix} 1 & \ldots & 1000 \\
\ldots & \ldots & \ldots \\
999000 & \ldots & 1000000 \end{bmatrix} $$

This parametric family of matrices $A(N)$ can be used for benchmarking (N can be made large enough to bring any system to its knees). Small scripts that implement and benchmark the different concepts are available in our matrix2json repository.

Matrix to JSON in the R Language

The most basic data type in R is the 1D array. It holds an ordered set of homogeneous values of type “logical” (booleans), character (strings), “raw” (bytes), numeric (doubles), “complex” (complex numbers with a real and imaginary part), or integers.

jsonlite package

Using the jsonlite R package9, we get the following JSON representation in row major order.

    library(jsonlite)
    mat <- matrix(1:9, nrow = 3, ncol = 3, byrow=TRUE)
    toJSON(mat, matrix = c("rowmajor"))
[
  [1,2,3],
  [4,5,6],
  [7,8,9]
] 

We observe that the canonical Row Major Order JSON presentation is an array of arrays. NB: we can ask the toJSON function to return a column major order instead:

    library(jsonlite)
    mat <- matrix(1:9, nrow = 3, ncol = 3, byrow=TRUE)
    toJSON(mat, matrix = c("columnmajor"))

The result is another arrangement of values which, for the current symmetric matrix example, is not particularly useful:

[
  [1,4,7],
  [2,5,8],
  [3,6,9]
] 

Benchmarking data for R

We record the following data:

  • time to convert (in-memory) a matrix to a json string and save the string to a file (T_out)
  • size of the file (Size)
  • time to load the file and convert it to a matrix (T_in)

For the $A(5000)$ matrix we get the following results for different R packages handling JSON I/O for matrices:

Package T_out T_in Size
jsonlite 19.23444 secs 25.10679 secs 238.9 MB
RJSONIO 19.62544 secs 23.94082 secs 238.9 MB
rjson 4.120749 secs 16.52509 sec 238.9 MB

Notes

  • The file size is here identical as the format is the same
  • Different packages may not preserve the data type
  • Recovering the correct matrix orientation in the roundtrip may require some attention

Matrix to JSON in Python

In python the canonical library for handling arrays is numpy. While early versions of numpy had an explicit matrix class this is now deprecated in favor of more general nd-arrays.10

[[1 2 3]
 [4 5 6]
 [7 8 9]]

Numpy does not have a direct toJSON function. An intermediate conversion to a python list is required as follows:

    import numpy as np
    import json
    a = np.arange(1,10).reshape(3,3)
    print(json.dumps(a.tolist()))

which produces again the canonical row ordered list format

[
  [1,2,3],
  [4,5,6],
  [7,8,9]
] 

For the $A(5000)$ matrix we get the following results for different python packages handling JSON I/O for matrices:

Package T_out T_in Size
numpy / json 5.80395 secs 5.51479 secs 238.9 MB
pandas.to_json 7.97321 secs 14.95067 secs 238.9 MB

References


  1. Fortran IV: DIMENSION STATEMENT ↩︎

  2. When the row or column count is 1 it is more appropriate to talk about an 1D array. When both row and column count is 1 it is a scalar. ↩︎

  3. For a more adapted format see Hierarchical Data Format Hierarchical Data Format ↩︎

  4. The choice of zero as the element that defines sparsity reflects typical use cases (the underlying matrix algebra). Similar considerations would apply if another element was deemed as carrying “no information”. ↩︎

  5. Wikipedia on Matrix Representation ↩︎

  6. A program optimization approach motivated by efficient usage of computer memory is termed Data Oriented Design ↩︎

  7. The JSON Spec ↩︎

  8. XKCD on Standards ↩︎

  9. The jsonlite Package A Practical and Consistent Mapping Between JSON Data and R Objects ↩︎

  10. Numpy Docs: numpy matrix deprecation status ↩︎