Some of you may have heard me talk about the Simulacrum before, when I referred to it as “the best idea I will ever have”.
For those who haven’t, the Simulacrum is a ‘fake’ copy of lots of detailed information about patients with cancer, which has been collected, checked and securely stored by Public Health England (PHE). Rather than try to remove information to make patients non-identifiable (with the risk that someone manages to identify people, as the governor of Massachusetts found out), we simply made up lots of data – but made it up so that, for example, the distribution of lung cancer by age is the same in the simulated population as in the real population. It uses the same table format, the same data items and the same data dictionary as the real data, but contains no information about actual patients. If you want to know more, we now have a lovely website – including an initial data release and some sample queries. The work isn’t finished yet – for example, there is currently no radiotherapy data, and we are still finalising the journal papers – but it provides a good place to start.
The project website does a reasonable job of explaining what the Simulacrum is, but I thought it would be worth talking about why we developed it.
I have been trying to work with large-scale cancer data for ~10 years now, and one of the persistent problems is that we cannot get all the pieces together. I think the key pieces are tools, data and people – but getting all three together is hard. University departments (such as the Dept. of Computing at Imperial College, where I have one foot) have clever people and tools to spare – but no access to the data, and little access to patients to help define questions. In the NHS & PHE, we have access to the data and patients, but installing tools and getting the right technical people in place are both difficult. Given that the aim of our lab is to use computational approaches to solve clinical problems, this has been a bugbear of ours for a long while. The aim of Simulacrum is to change this.
By using simulated data, we can be sure that the data follows the right “shape” – that is, the table and field names are correct, the data has the right sort of values, etc. The values associated with individuals are mostly correct – we know that there are some deviations, but most of the common bivariate distributions are correct to within a few percent. One advantage of this is that if you want to work with PHE data (e.g. if you need to know exactly how many men and women had lung cancer in 2016), then you can build your code against the Simulacrum, and when it works, apply for data access in the usual way through PHE, and your code should run against that data as well.
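As a rough illustration of that "build against the Simulacrum first" workflow, here is a minimal sketch that counts lung cancer diagnoses in 2016 by sex. Everything concrete in it is an assumption made for the example: the table name `sim_av_tumour`, the columns `sex`, `site_icd10_3char` and `diagnosis_date`, the sex coding, and the handful of toy rows, which stand in for a local copy of the data release. Check the data dictionary on the project website for the real names before reusing any of it.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical tumour table.

    The schema and rows are invented for illustration; they are not
    the real Simulacrum schema or data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sim_av_tumour ("
        "tumour_id INTEGER PRIMARY KEY, sex TEXT, "
        "site_icd10_3char TEXT, diagnosis_date TEXT)"
    )
    rows = [
        (1, "M", "C34", "2016-03-01"),
        (2, "F", "C34", "2016-07-15"),
        (3, "M", "C34", "2016-11-30"),
        (4, "F", "C50", "2016-05-20"),  # breast, excluded by the site filter
        (5, "M", "C34", "2015-12-31"),  # lung, but diagnosed before 2016
    ]
    conn.executemany("INSERT INTO sim_av_tumour VALUES (?, ?, ?, ?)", rows)
    return conn


def count_by_sex(conn, site, year):
    """Count tumours of one ICD-10 site diagnosed in one calendar year, by sex."""
    query = (
        "SELECT sex, COUNT(*) FROM sim_av_tumour "
        "WHERE site_icd10_3char = ? "
        "AND diagnosis_date >= ? AND diagnosis_date < ? "
        "GROUP BY sex ORDER BY sex"
    )
    # ISO-formatted date strings compare correctly as text.
    start, end = f"{year}-01-01", f"{year + 1}-01-01"
    return dict(conn.execute(query, (site, start, end)).fetchall())


# C34 is the ICD-10 code for malignant neoplasm of bronchus and lung.
counts = count_by_sex(build_toy_db(), "C34", 2016)
print(counts)
```

The point of writing it this way is that only the connection changes between environments: if the table and column names match the real data dictionary, the same query function can be pointed at the real data once access has been granted.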
Important and useful as that is – and it has the potential to substantially de-risk projects and help both academic and commercial projects develop more easily – my main aim was to help transform the application of new computing approaches to cancer data.
In order to do that, we needed a way of allowing the computing community to get their hands on the data and have a play with it. Outsiders often underestimate how much of data analysis and data science is driven by ‘playing’ with the data – seeing what it looks like, where the gaps are, what is trivially easy, and what isn’t. This is what Simulacrum enables. Although Simulacrum isn’t a perfect copy, from a methodological perspective, it often doesn’t matter if the proportion of patients who have fourth-line chemotherapy is out by 10%. Simulacrum provides an external, easily sharable, publicly available reference dataset for technical development. If you invent a new technique for survival prediction, you now have a very large dataset to test it on. This points to the second main use for Simulacrum – as a teaching resource. We have been doing some work recently on applying machine learning to survival analysis, and it has been really difficult to find large datasets with clearly defined parameters (we have a small paper solving some of this coming out shortly).
Simulacrum could not have existed without the work & insight of a number of people (better described here). There are already some simple SQL queries on the website, and over the next few months, we will be sharing code snippets to demonstrate how to use the Simulacrum. We have some particular projects in mind with some of our mathematical collaborators who have been waiting for the data, but the advantage of the data being openly available is that anyone can use it, and I am looking forward to being surprised by the uses that people put it to. One of my hopes is that we are able to develop a community around the use of Simulacrum and share code and tips (Git repository coming with the first code examples in a few weeks). Those of you interested in this are welcome to drop me an email, and if you are an MRes/PhD student interested in cancer data, please do get in touch.
My hope – and one aim of the Simulacrum project – was that we could get really bright PhD students and post-docs to work on large-scale cancer data. Up until now, that has been formidably difficult. With the first release of the Simulacrum, I hope that has become a little bit easier.
Imperial College London