Synthetic Risk Data

The challenge with historical risk data

Historical risk data are vital for a host of risk management activities, from assessing the performance of different types of risk all the way to constructing sophisticated risk models. Data inputs are so important that, for risk models affecting significant decision making or external reporting, regulators even prescribe minimum requirements for the type and quality of the historical risk data used.

Securing adequate risk data can be challenging for a number of reasons. To name a few:

  • The intrinsic scarcity of risk events: There might be a genuine shortage of data because the number of events in the observation window is small.
  • The sensitivity of client-related data: Properly filtered and anonymized data sets can in many cases make it quite difficult to identify any concrete entity, but the extra work required to achieve this might dissuade publication.
  • A general reluctance to release information: Even when data sets are no longer commercially sensitive (e.g., older business lines or portfolios), there might be a reluctance to reveal any internal information (see also the previous point on the extra work required).
  • The quality of the existing data: Even if none of the above constraints applies, the collection, processing and availability of historical risk data may suffer from data quality problems.

Utilizing synthetic risk data

Synthetic Risk Data are computer-generated risk data (produced using generative machine learning algorithms) that:

  • Refer to the same client / product / business characteristics tracked by a production system
  • Adopt the same lifecycle typology (possible events and measured elements)
  • Conform to the actual schema (data template) that is being emulated
  • Have the same or similar statistical properties as the risk data set that is being emulated
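
The last two properties can be checked mechanically. The following is a minimal sketch in Python, not a description of any production tooling: the schema, the field names, the chosen summary statistics and the 10% tolerance are all assumptions made for illustration.

```python
import statistics

# Hypothetical schema (an assumption): field name -> expected Python type.
SCHEMA = {"client_id": int, "product": str, "balance": float, "defaulted": bool}


def conforms_to_schema(record, schema=SCHEMA):
    """True if the record has exactly the schema's fields with the expected types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def summary(records):
    """A few summary statistics used to compare two data sets."""
    balances = [r["balance"] for r in records]
    return {
        "mean_balance": statistics.mean(balances),
        "stdev_balance": statistics.stdev(balances),
        "default_rate": sum(r["defaulted"] for r in records) / len(records),
    }


def similar(reference, synthetic, rel_tol=0.1):
    """True if every summary statistic agrees within a relative tolerance."""
    ref, syn = summary(reference), summary(synthetic)
    return all(abs(ref[k] - syn[k]) <= rel_tol * abs(ref[k]) for k in ref)


def make(client_id, balance, defaulted):
    """Build one illustrative record (all values are made up)."""
    return {"client_id": client_id, "product": "mortgage",
            "balance": balance, "defaulted": defaulted}


reference = [make(1, 100.0, False), make(2, 200.0, True),
             make(3, 300.0, False), make(4, 400.0, False)]
synthetic = [make(1, 105.0, False), make(2, 195.0, False),
             make(3, 310.0, True), make(4, 390.0, False)]

schema_ok = all(conforms_to_schema(r) for r in synthetic)
stats_ok = similar(reference, synthetic)
```

Comparing a handful of moments is a deliberately crude notion of "similar statistical properties"; a realistic check would also look at full distributions and at dependencies between fields.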

How is it done?

We offer many useful variations, but in a fully simulated lifecycle the basic steps of synthetic risk data creation are:

  • Generation of a sufficiently detailed risk system model
  • Simulation of the system
  • Production of simulated risk events
  • Compilation and distribution of synthetic risk data snapshots in the required data formats
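
The four steps above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not the actual generator: the `Loan` model, the constant monthly default probability, the portfolio size and the CSV output format are all hypothetical choices made to keep the example short.

```python
import csv
import io
import random
from dataclasses import dataclass


@dataclass
class Loan:
    """One entity in the hypothetical risk system model."""
    loan_id: int
    balance: float
    monthly_default_prob: float  # assumed constant over time, for illustration


def build_model(n_loans, rng):
    """Step 1: generate a (toy) risk system model."""
    return [
        Loan(
            loan_id=i,
            balance=round(rng.uniform(1_000, 50_000), 2),
            monthly_default_prob=rng.uniform(0.001, 0.02),
        )
        for i in range(n_loans)
    ]


def simulate(loans, n_months, rng):
    """Steps 2 and 3: simulate the system month by month and record risk events."""
    events = []
    active = {loan.loan_id: loan for loan in loans}
    for month in range(1, n_months + 1):
        for loan in list(active.values()):  # copy, since defaults are removed
            if rng.random() < loan.monthly_default_prob:
                events.append({"loan_id": loan.loan_id, "month": month,
                               "event": "default", "balance": loan.balance})
                del active[loan.loan_id]  # a loan can default only once
    return events


def compile_snapshot(events):
    """Step 4: compile the events into a snapshot (CSV text in this sketch)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["loan_id", "month", "event", "balance"])
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


rng = random.Random(42)  # fixed seed so the run is reproducible
loans = build_model(n_loans=500, rng=rng)
events = simulate(loans, n_months=24, rng=rng)
snapshot = compile_snapshot(events)
```

Passing a seeded `random.Random` instance through every step keeps the whole lifecycle reproducible, which helps when the same synthetic snapshot has to be regenerated later.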

Further Reading / Next Steps

Read more about synthetic credit risk data in this blog post.

We are rolling out a synthetic risk data API on a trial basis at the demo site, where instructions for generating and downloading a variety of synthetic risk data are provided. If you are interested in using this service (or simply want to learn more about synthetic risk data), we encourage you to contact us.