From Big Data to Linked Data and Linked Models
The Big Data Problem
As certainly as the sun will set today, the big data explosion will lead to a big clean-up mess.
How do we know? We only have to study the still smouldering last chapter of banking industry history. Currently banks are portrayed as something akin to the village idiot as far as technology adoption is concerned, and there is certainly a nugget of truth to this. Yet it is also true that banks, in many jurisdictions and across trading styles and business lines, adopted data-driven models a long time ago. Long enough ago, in fact, that we have already observed how it all went pear-shaped, Great Financial Crisis and all.
The evidence of how an industry may struggle to contain its data-related liabilities and risks goes by the code name BCBS 239. What is BCBS 239? It is a document from the Basel Committee on Banking Supervision that sets out regulator-mandated Principles for effective risk data aggregation and risk reporting. It was prompted by the inability of the financial industry to effectively manage data flows linked to important business objectives.
This important precedent exemplifies the other, darker side of the big data coin. After the big data iron and the big data software stack are purchased and installed and the big data sets start to flow, commensurately big decisions need to be taken on the basis of said data. Yet big decisions always carry big risks. Data Quality, Data Provenance (accountability, reproducibility) and other such boring-sounding and poorly understood requirements suddenly become essential to prevent the firm from turning the big-data-based bonanza into a big nightmare.
It gets worse.
While sometimes data do sing and dance by themselves, we still live in a world where specialists must cajole the data into producing useful signals, metrics or other distilled pieces of information. This process goes by various names: analytics, modeling, quantitative analysis, rocket science or data science voodoo… This step is also fraught with risks, in fact significantly higher ones than those linked to the quality of the raw data.
It is a virtual certainty that if you give the exact same data set to two different analysts, they will produce different models and metrics.
In this space too, the promotion of artificial intelligence and machine learning as the great untapped quantification programme leads to uncharted territory where not only will the model outcomes be different, but the analysts will also have no idea why this is so!
The Linked Data Solution
There is actually an (almost) magic wand to solve the data problem. The prosaic, old-style summary answer is: Metadata. There is a more comprehensive, forward-looking and informative name, namely Linked Data. It was conceived by none other than Tim Berners-Lee, the same person who brought us the Web (on which you read this post).
Linked Data helps solve the data provenance problem by elevating data into first-class citizens in an interconnected web. Data get to have a (web) name and address, a full bio, including pedigree and friends. There is a growing and maturing set of Semantic Web tools that can be used to aggregate, query and use such metadata.
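To make the "name, address and pedigree" idea a bit more tangible, here is a minimal sketch in Python. It assumes a JSON-LD style description that borrows terms from the W3C PROV and Dublin Core vocabularies; every `https://example.org/` address, the data set titles and the helper names `describe_dataset` and `pedigree` are hypothetical and only serve the illustration.

```python
import json

# Namespace prefixes for the Dublin Core and W3C PROV vocabularies.
CONTEXT = {
    "dcterms": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
}

# Hypothetical web addresses for three data sets (assumptions, not real URLs).
RAW = "https://example.org/data/loans-raw"
CLEAN = "https://example.org/data/loans-clean"
REPORT = "https://example.org/data/risk-report"
TEAM = "https://example.org/people/risk-data-team"


def describe_dataset(uri, title, derived_from, attributed_to):
    """Return a JSON-LD style 'bio' for one data set."""
    return {
        "@context": CONTEXT,
        "@id": uri,                      # the data set's web name and address
        "@type": "prov:Entity",
        "dcterms:title": title,
        "prov:wasDerivedFrom": [{"@id": u} for u in derived_from],  # pedigree
        "prov:wasAttributedTo": {"@id": attributed_to},             # accountability
    }


def pedigree(uri, catalogue):
    """Follow prov:wasDerivedFrom links and list each upstream data set once."""
    seen = []
    stack = [uri]
    while stack:
        current = stack.pop()
        for parent in catalogue.get(current, {}).get("prov:wasDerivedFrom", []):
            parent_uri = parent["@id"]
            if parent_uri not in seen:
                seen.append(parent_uri)
                stack.append(parent_uri)
    return seen


catalogue = {
    RAW: describe_dataset(RAW, "Raw loan tape", [], TEAM),
    CLEAN: describe_dataset(CLEAN, "Cleaned loan tape", [RAW], TEAM),
    REPORT: describe_dataset(REPORT, "Aggregated risk report", [CLEAN], TEAM),
}

print(json.dumps(catalogue[REPORT], indent=2))
print(pedigree(REPORT, catalogue))
```

In practice a dedicated Semantic Web toolkit would store and query such descriptions; the plain dictionaries here only show the shape of the metadata.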
The good news (for BCBS 239 enthusiasts) is that this technology is so vital for the broader economy that it will certainly find increasing support and development. The bad news is that Linked Data do not sing and dance by themselves either. While we are well on our way to taming data management, Model Governance is far less advanced.
The Linked Models Solution
The challenge around governance and management of quantitative models is huge: we do not even have a good definition of what a model is. Fun fact: if you want to experience large-scale communication breakdown, take a random organization and try to collect all the models being used.
- Is an Excel sheet a model?
- Is the Excel sheet a different model after you hit F9? (think about this one).
- Is the napkin where the CEO scribbled the ball-park economics of a business deal a model?
- Is a procedure stored in a vendor database system a model?
- Is a calculation done in a Bloomberg terminal a model?
- Is the expert opinion of a collection of silver haired individuals a model?
- What if said experts express this opinion as a score?
The problem is that models of one sort or another are so useful and ubiquitous, they are all over the place. In fact IT people will call the mere schema of a database a (data) model, adding to the confusion.
The solution we are proposing is not unlike the Linked Data based solution to the Big Data mess. Namely, we promote models to be first-class citizens in an interconnected web:
A Linked Model has a name and an address, a full bio, including pedigree and friends
Hence, away with napkins, sheets and throwaway models of any sort if you don’t want to be facing a calamity a few years down the line. But let us be a bit more specific (as a well-known Professor says, this section is - more - wonkish).
A Linked Model is not one but three (3) things at the same time. We call this DOAM (semantic Description Of A Model):
- There is an abstract model: a paper or other description of a Function that takes specific Linked Data as input and produces specific Linked Data as output. This description is available somewhere on the web, although it may be private. It may also be machine-readable or translatable into code.
- There is the model source code: the pieces of programming that are required to implement an abstract model, again available somewhere on the web. Again, it may be private or public.
- There is the model instance: the actual application that executes the model source code (produces live results), available (via an API) somewhere on the web.
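The three facets above can be sketched as linked records. This is only an illustration under stated assumptions: the post does not specify a DOAM vocabulary, so the class names, field names and every `https://example.org/` address below are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class AbstractModel:
    uri: str                  # where the paper or description lives
    inputs: Tuple[str, ...]   # addresses of the Linked Data consumed
    outputs: Tuple[str, ...]  # addresses of the Linked Data produced
    public: bool = False      # the description may be private


@dataclass(frozen=True)
class ModelSource:
    uri: str                  # where the source code lives
    implements: str           # address of the abstract model it implements
    public: bool = False      # the code may be private or public


@dataclass(frozen=True)
class ModelInstance:
    api_url: str              # API through which live results are served
    runs: str                 # address of the source code being executed


@dataclass(frozen=True)
class LinkedModel:
    name: str
    abstract: AbstractModel
    source: ModelSource
    instance: ModelInstance

    def broken_links(self) -> List[str]:
        """Report where the three facets fail to point at each other."""
        problems = []
        if self.source.implements != self.abstract.uri:
            problems.append("source does not implement the abstract model")
        if self.instance.runs != self.source.uri:
            problems.append("instance does not run the described source")
        return problems


model = LinkedModel(
    name="Hypothetical credit score",
    abstract=AbstractModel(
        uri="https://example.org/models/credit-score/paper",
        inputs=("https://example.org/data/loans-clean",),
        outputs=("https://example.org/data/scores",),
    ),
    source=ModelSource(
        uri="https://example.org/models/credit-score/code",
        implements="https://example.org/models/credit-score/paper",
    ),
    instance=ModelInstance(
        api_url="https://example.org/api/credit-score",
        runs="https://example.org/models/credit-score/code",
    ),
)

# A deployment that quietly runs some other code breaks the chain.
drifted = replace(
    model,
    instance=ModelInstance(
        api_url="https://example.org/api/credit-score",
        runs="https://example.org/models/other/code",
    ),
)

print(model.broken_links())
print(drifted.broken_links())
```

The point of the sketch is the cross-references: each facet carries the address of the one it depends on, so a napkin or throwaway sheet with no such links is easy to spot.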
The metadata technologies required for implementing Linked Models are similar to, but extend, those pertaining to Linked Data. They also overlap with so-called Service Oriented Architectures (SOA), but place much more emphasis on documenting and making accessible the mathematical content of the data processing function of the model.
Learn more about Linked Models in our presentation at the 2015 Dutch Central Bank / TopQuants meeting.
Interested in exploring the Linked Models implementation of the OpenCPM? Let us know!