From Big Data, to Linked Data and Linked Models

The big data problem:

“As certainly as the sun will set today, the big data explosion will lead to a big clean-up mess”

How do we know? We only have to study the still smouldering last chapter of banking industry history. Banks are currently portrayed as something akin to the village idiot as far as technology adoption is concerned, and there is certainly a nugget of truth to this. But it is also true that banks, in many jurisdictions and across trading styles and business lines, adopted data-driven models a long time ago. Long enough ago, in fact, that we have already observed how it all went pear-shaped, Great Financial Crisis and all.

The evidence of how an industry may struggle to contain its data-related liabilities and risks goes by the code name BCBS 239. What is BCBS 239? It is a report by the Basel Committee on Banking Supervision that sets out regulator-mandated “Principles for effective risk data aggregation and reporting”. It was prompted by the inability of the financial industry to effectively manage data flows linked to important business objectives.

This precedent exemplifies the other, dark side of the big data coin. After the big data iron and software stack is purchased and installed and the big data sets start to flow, commensurately big decisions are to be taken on the basis of said data. But big decisions carry big risks. Data quality, provenance (accountability, reproducibility) and other such boring-sounding and poorly understood requirements suddenly become essential to prevent the firm from turning the big data bonanza into a big nightmare.

It gets worse. While sometimes data do sing and dance by themselves, we still live in a world where specialists must cajole the data into producing some useful signal, metric or other distilled piece of information. This process goes by various names: analytics, models, quants, rocket science or data science voodoo.

This step is also fraught with risks, in fact significantly higher ones than those linked to the raw data. It is a virtual certainty that if you

“give the same data set to two different analysts, they will produce different models and metrics”

In this space too, the promotion of so-called artificial intelligence and machine learning as the great “untapped” quantification programme leads into uncharted territory, where not only will the model outcomes be different, but the analysts will also have no idea why this is so…

The Linked Data Solution

There is actually a magic wand to solve the data problem. The prosaic, old-style summary answer is: metadata. But there is a more comprehensive, forward-looking and informative name, namely Linked Data, conceived by none other than Tim Berners-Lee, the same person who brought us the Web (on which you read this post).

Linked Data helps solve the data provenance problem by elevating data into first class citizens in an interconnected web. Data get to have a (web) name and address, a full bio, including pedigree and friends. There is a growing and maturing set of Semantic Web tools that can be used to aggregate, query and use such metadata.
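To make the "name, address, bio, pedigree and friends" picture concrete, here is a minimal sketch of what such a description might look like, written as a JSON-LD style document in plain Python. The dcterms and prov prefixes point to the real Dublin Core and W3C PROV namespaces, but every example.org address, the data set itself and the pedigree helper are hypothetical, chosen purely for illustration.

```python
import json

# JSON-LD style description of a data set.
# Assumption: all example.org IRIs and the data set are made up.
dataset = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "prov": "http://www.w3.org/ns/prov#",
    },
    # Name and address: one IRI identifies and locates the data set
    "@id": "https://example.org/data/loan-book-snapshot",
    # Bio: human readable facts about the data set
    "dcterms:title": "Loan book snapshot",
    "dcterms:creator": {"@id": "https://example.org/teams/credit-risk"},
    # Pedigree: the upstream data this set was derived from
    "prov:wasDerivedFrom": [
        {"@id": "https://example.org/data/core-banking-extract"},
    ],
    # Friends: other data sets it is related to
    "dcterms:relation": [
        {"@id": "https://example.org/data/collateral-register"},
    ],
}


def pedigree(doc):
    """Return the IRIs this data set declares it was derived from."""
    return [ref["@id"] for ref in doc.get("prov:wasDerivedFrom", [])]


print(json.dumps(dataset, indent=2))
print(pedigree(dataset))
```

Because the pedigree is itself a list of web addresses, a provenance question ("where did this number come from?") becomes a matter of following links rather than interviewing colleagues.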

The good news (for BCBS 239 enthusiasts) is that this technology is so vital for the broader economy that it will certainly find increasing support and development. The bad news is that Linked Data do not sing and dance by themselves either. While we are well on our way to taming data management, the corresponding status for model management is far less advanced.

The Linked Models Solution

The challenge around governance and management of quantitative models is huge: we do not even have a good definition of what a model is. If you want to experience large-scale communication breakdown, take a random organization and try to collect all the “models” being used.

Is an Excel sheet a model? Is the Excel sheet a different model after you hit F9? (Think about this one.) Is the napkin where the CEO scribbled the economics of a deal a model? Is a procedure stored in a vendor system a model? Is a calculation done in a Bloomberg terminal a model? Is the expert opinion of a collection of silver-haired individuals a model? What if they express this opinion as a score?

The problem is that models of one sort or another are so useful and ubiquitous that they are all over the place. In fact, IT people will call the mere schema of a database a “model”…

The solution we are proposing is not unlike the Linked Data based solution to the Big Data mess. Namely we promote models to be first class citizens in an interconnected web:

“A Linked Model has a name and an address, a full bio, including pedigree and friends”

Hence, away with napkins, sheets and throwaway models of any sort if you don’t want to be facing a calamity a few years down the line.

But let's be a bit more specific (as the well-known Professor says, this section is – more – wonkish).

A Linked Model is not one but three (3) things at the same time. We call this DOAM (semantic Description Of A Model):

  • There is an abstract model: a paper or other description of a Function that takes specific Linked Data as input and produces specific Linked Data as output. This description is available somewhere on the web, although it may be private. It may also be machine readable or translatable into code.
  • There is the model source code: the pieces of programming required to implement the abstract model, again available somewhere on the web, and again either private or public.
  • There is the model instance: the actual application that executes the model source code (produces live results), available (via an API) somewhere on the web.
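As a rough sketch of how these three facets could hang together in a single description, consider the following Python fragment. It is an illustration under stated assumptions, not the DOAM specification: the LinkedModel class, its field names, the doam: prefix and all example.org addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedModel:
    """One model, three web addresses (hypothetical field names)."""
    iri: str             # name and address of the model itself
    abstract_model: str  # paper or other description of the Function
    source_code: str     # repository implementing the abstract model
    instance: str        # API endpoint of the running application
    inputs: tuple = ()   # IRIs of the Linked Data consumed
    outputs: tuple = ()  # IRIs of the Linked Data produced

    def missing_parts(self):
        """List the facets that have no address yet."""
        parts = {
            "abstract_model": self.abstract_model,
            "source_code": self.source_code,
            "instance": self.instance,
        }
        return [name for name, iri in parts.items() if not iri]


def describe(model):
    """Render the model as a JSON-LD style dict (made-up doam: vocabulary)."""
    return {
        "@context": {"doam": "https://example.org/doam#"},
        "@id": model.iri,
        "doam:abstractModel": {"@id": model.abstract_model},
        "doam:sourceCode": {"@id": model.source_code},
        "doam:instance": {"@id": model.instance},
        "doam:input": [{"@id": i} for i in model.inputs],
        "doam:output": [{"@id": o} for o in model.outputs],
    }


model = LinkedModel(
    iri="https://example.org/models/credit-score",
    abstract_model="https://example.org/papers/credit-score",
    source_code="https://example.org/code/credit-score",
    instance="https://example.org/api/credit-score",
    inputs=("https://example.org/data/loan-book-snapshot",),
    outputs=("https://example.org/data/credit-scores",),
)

napkin = LinkedModel(iri="https://example.org/models/napkin",
                     abstract_model="", source_code="", instance="")

print(describe(model))
print(napkin.missing_parts())
```

The napkin entry shows the point of the exercise: a model that cannot name its description, its code and its running instance is immediately visible as incomplete.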

The metadata technologies required for implementing Linked Models are similar to those pertaining to Linked Data, but extend them. The approach also overlaps with so-called Service Oriented Architectures (SOA), but places much more emphasis on documenting and making accessible the mathematical content of the model's data processing function.

Learn more about Linked Models in our presentation at the Dutch Central Bank / TopQuants meeting

Interested in exploring the Linked Models implementation of the OpenCPM? Let us know!