Open Source Risk Data with MongoDB and Python
Open source software is all the rage these days in IT, and the concept is making rapid inroads in all parts of the enterprise. An earlier comprehensive survey by Gartner, Inc. found that by 2011 more than half of the organizations surveyed had adopted open-source software (OSS) solutions as part of their IT strategy. According to open source advisory firms, this percentage may by now have exceeded the 75% mark.
In the domain of risk management we are only beginning to see the benefits of open source (e.g. with the increased adoption of R) and there is much learning and building to do to harvest the huge potential benefits of open source. The tide that raises all boats in our corner of the universe is none other than the infamous big data phenomenon, as much of the big data software stack is actually open source.
In this post we cover some powerful open source big data tools that lay out an accelerated path to practical implementations in risk management.
MongoDB, the darling of NoSQL databases
A good part of risk management relies on risk data, and the availability and quality of such data has historically been a sore point. This has led to a regulatory wake-up call under the banner of “Principles for effective risk data aggregation and risk reporting”, otherwise known as BCBS 239.
Emerging database technologies going under the name of “NoSQL” promise to provide significant further ammunition for the fight against data entropy (that is, data mess). The name NoSQL already suggests that these databases are not of the typical Structured Query Language variety. Indeed they dispense with both the SQL syntax and that most venerable of data structures, the database table. Instead they adopt novel means of organizing data.
MongoDB, currently the most popular of the NoSQL databases, uses the concept of a document collection. A collection is like a table, except that each document in the collection can have a very flexible structure, i.e., the structure can be adapted to the data instead of the other way around. That adaptation can lead to extraordinary performance gains for some data operations (but, of course, is not a solution for every conceivable database problem).
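To make the idea of a flexible document collection concrete, here is a minimal sketch that models a collection as a plain Python list of dictionaries. It deliberately does not talk to a MongoDB server; the loan fields (loan_id, balance, sector, collateral, guarantors) and the find helper are hypothetical names chosen for illustration only. With a running server and the pymongo driver, dictionaries of this shape could be stored via a collection's insert_many method and queried via find.

```python
# A plain-Python stand-in for a document collection: a list of dicts.
# All field names and values below are hypothetical illustrations.
loans = [
    {"loan_id": "L001", "balance": 250000.0, "sector": "Retail"},
    # This document carries a nested sub-document the others lack.
    {"loan_id": "L002", "balance": 1200000.0, "sector": "Shipping",
     "collateral": {"type": "vessel", "value": 2000000.0}},
    # This one has no sector at all, but lists its guarantors.
    {"loan_id": "L003", "balance": 80000.0, "guarantors": ["G17", "G42"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria.

    A simplified imitation of an equality query; a missing field is
    treated as None and therefore never matches a non-None value.
    """
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]


shipping = find(loans, sector="Shipping")
with_collateral = [doc for doc in loans if "collateral" in doc]

print([doc["loan_id"] for doc in shipping])
print([doc["loan_id"] for doc in with_collateral])
```

Note that no schema had to be declared up front: each loan simply carries the fields that are known for it, which is the flexibility the paragraph above describes.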
Python, the Swiss Army knife of data programming
Python is rapidly becoming a popular general-purpose programming language, with particular strengths in data processing and machine learning.
Being a programming language, Python is not directly comparable to R (the open source statistical computing environment), but should instead be compared with Java, C++ and the like. While vanilla Python code is slow compared to these systems-oriented languages, Python compensates by providing specialized modelling functionality via optimized libraries. Therein lies its power, because by now there are excellent libraries for practically any imaginable computational task (by no means limited to statistical modelling work).
One of the most important such libraries is NumPy, a general toolkit for working with arrays and matrices, which (among other things) does a very good job of emulating MATLAB.
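As a small illustration of that array-oriented style, the sketch below uses NumPy for element-wise arithmetic, a matrix-vector product and a linear solve. The exposure figures and the matrix are made-up numbers, and the MATLAB comparisons in the comments are loose analogies rather than exact equivalences.

```python
import numpy as np

# Hypothetical exposures (in millions) for three loans.
exposures = np.array([10.0, 25.0, 15.0])

# Vectorized arithmetic: no explicit loop is needed to compute shares.
shares = exposures / exposures.sum()

# A small, made-up 2x2 system A x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear solve, loosely analogous to MATLAB's A \ b.
x = np.linalg.solve(A, b)

# Matrix-vector product with the @ operator; this recovers b.
b_check = A @ x

print(shares)
print(x)
```

The heavy lifting happens inside NumPy's compiled routines, which is how Python sidesteps the slowness of plain interpreted loops.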
OpenCPM, an open framework for credit portfolio management
As part of the open source risk model framework sponsored by Open Risk, we are developing OpenCPM. This framework aims to provide an open source solution for typical credit portfolio management tasks. Currently, at the Open Risk Academy, we offer three free tutorials as part of the OpenCPM course cycle:
- Managing Loan Portfolios Using MongoDB offers an introduction to MongoDB and how to use it to develop document stores for loan data. The examples use both the mongo shell and the Python API.
- Loan Level Templates Using Python is a tutorial on working with loan data templates. We use the ECB SME reporting guidelines as an example of a fully specified template. The tutorial shows how to use Python to connect to and work with spreadsheet data.
- Concentration Measurement using Python is a tutorial on building fast risk analytics using the specialized NumPy and SciPy Python libraries. In this case the focus is on calculating concentration indexes from portfolio exposure data.
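To give a flavour of the last topic, here is a minimal sketch of one widely used concentration index, the Herfindahl-Hirschman index (the sum of squared exposure shares), computed with NumPy. It is an illustrative assumption rather than the tutorial's own code; the function name herfindahl_index and the sample portfolios are hypothetical.

```python
import numpy as np


def herfindahl_index(exposures):
    """Return the Herfindahl-Hirschman index of a set of exposures.

    The index is the sum of squared portfolio shares. It ranges from
    1/n (n equal exposures) up to 1 (a single exposure).
    """
    exposures = np.asarray(exposures, dtype=float)
    total = exposures.sum()
    if np.any(exposures < 0) or total <= 0:
        raise ValueError("exposures must be non-negative with a positive sum")
    shares = exposures / total
    return float(np.sum(shares ** 2))


# Hypothetical portfolios: one evenly spread, one dominated by a single name.
even_portfolio = [25.0, 25.0, 25.0, 25.0]
concentrated_portfolio = [97.0, 1.0, 1.0, 1.0]

hhi_even = herfindahl_index(even_portfolio)
hhi_concentrated = herfindahl_index(concentrated_portfolio)

print(hhi_even, hhi_concentrated)
```

The evenly spread portfolio scores the minimum possible value for four exposures, while the concentrated one scores close to 1, which is the behaviour one would expect from a concentration measure.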
We welcome any feedback!