In the era of big data, it is easy to imagine that we have all the information we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg. Just as much of the universe is composed of dark matter, invisible to us but nonetheless present, the universe of information is full of dark data that we overlook at our peril. In Dark Data, data expert David Hand takes us on a fascinating and enlightening journey into the world of the data we don’t see.
What are dark data?
DH: Dark data are data you do not have. They might be data you know you don’t have, like missing answers on a form, or data you don’t know you don’t have, like the number of dissatisfied customers who didn’t bother to complain. But beyond that simple binary classification, dark data can occur in a wide range of ways, some obvious, some subtle. For example, while simple summary statistics tell you some things about your data, they conceal other aspects. And definitions designed for one purpose might be dramatically misleading for another. Counterfactuals – data you don’t have but would like to have – tell you what would have happened under different circumstances. In all, my book describes fifteen kinds of dark data to keep an eye open for.
Why are dark data important?
DH: Dark data are important because, if the data actually available in your database, stored in your computer, written in your notebooks, or posted in your spreadsheet are only partial and hide important information, then your analysis is likely to mislead.
There is a myth that small amounts of missing data are not a problem. In the world of “big data”, so the claim goes, the vast masses of data that now accumulate automatically will dilute away any errors arising from the small amounts that are missing. But this is wrong: those missing values could be crucial to understanding what is going on. Customers who do not return are no longer around to contribute data, but ignoring them could lead to a gross misunderstanding of how to make your company successful. An algorithm for diagnosing illness that had been trained on data which did not include a rare but fatal disease would be bad news for those with that illness.
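A toy simulation (my sketch, not from the book) makes the point concrete. Suppose dissatisfied customers tend to leave before they ever enter your database: the satisfaction scores, thresholds, and departure rate below are all invented for illustration, but the effect they show is general.

```python
import random

random.seed(0)

# Hypothetical population: true satisfaction scores for 100,000 customers,
# uniform on 0..10. Assume dissatisfied customers (score < 4) mostly leave
# without a trace; only 20% of them ever appear in the data we analyse.
population = [random.uniform(0, 10) for _ in range(100_000)]
observed = [s for s in population if s >= 4 or random.random() < 0.2]

true_mean = sum(population) / len(population)
observed_mean = sum(observed) / len(observed)

print(f"true mean satisfaction:     {true_mean:.2f}")
print(f"observed mean satisfaction: {observed_mean:.2f}")
# The observed mean overstates satisfaction. Collecting ten times more data
# would not shrink the gap: the data are missing systematically (the absent
# customers differ from the present ones), so this is bias, not sampling error.
```

No amount of extra volume dilutes the distortion, because every additional record is drawn through the same selective filter.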
Are the dangers of dark data getting worse?
DH: Certainly the dangers have always been with us. It is impossible to know everything, so necessarily there are things we don’t know. The question is whether these missing things matter, and whether the headlong rush towards a data-driven society is aggravating the problem.
There are reasons to think that things might be getting worse. While it is certainly true that the ready (and automatic) acquisition of large data sets, coupled with the power of modern statistical, machine learning, and AI tools, holds great promise for enhancing the human condition, these advances do not come without challenges. In particular, while computers equip us with awesome powers, they also force us to rely on those machines. We need them to provide the statistical summaries, the graphical plots, and the output of the algorithms. This makes the computer a necessary intermediary between us and the data. While it acts as a lens revealing those data, it also acts as a wall between us and them, injecting a fundamental opacity into data analysis, with light shed only in those places where we can peer through the wall.
Can you give some examples of areas impacted by dark data, and how those areas are affected?
DH: Every domain is at risk from dark data. In business you will have data on how your customers behave, but expansion requires understanding of how possible other customers are likely to behave. In clinical trials of new medicines you need to know why patients drop out – is it because the treatment is having no effect, or is it perhaps because the treatment was completely effective and the condition has been cured? In astrophysics we cannot see all the stars in the sky, so what if the ones we cannot see (the literally dark data) are different from the ones we can? When humans are involved – in areas such as economics and public policy, for example – the situation is even more complicated. Humans react to the circumstances in which they find themselves, and even sometimes to the fact that you are observing or measuring them. This means that the data you collect are not the data you would have collected had you not undertaken the study. The potential for complications and misunderstandings arising from dark data are obvious.
How should the problem of dark data be tackled?
DH: There are three basic aspects to handling dark data problems: preventing the problem from arising in the first place, detecting the presence (or perhaps I should say “absence”) of dark data, and correcting or at least making allowance for it when it arises. Think about the provenance of the data and the trustworthiness of the source. Have the data undergone some kind of cleaning process before you get your hands on them? Since, by definition, “cleaning” means changing, could those changes have removed things of crucial interest to you? I’ve certainly seen that happen more than once!
Do the data conform to what you would expect? Are the averages and the distributional shapes reasonable? Some sophisticated detection strategies have been developed – for example, larger effects being associated with smaller samples suggests some sort of selection mechanism is at work.
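That selection signature can be shown with a small simulation (again my sketch, with invented numbers: the true effect of 0.2, the sample-size range, and the "impressive result" reporting threshold of 0.35 are all assumptions for illustration).

```python
import random
import statistics

random.seed(1)

# Hypothetical scenario: 5,000 studies all estimate the same true effect,
# with sample sizes from 10 to 500, but only studies whose estimate looks
# "impressive" (here, above 0.35) ever get reported.
true_effect = 0.2
reported = []
for _ in range(5000):
    n = random.randint(10, 500)
    # Mean of n noisy observations: standard error shrinks like 1/sqrt(n)
    estimate = random.gauss(true_effect, 1.0 / n**0.5)
    if estimate > 0.35:  # the selection mechanism
        reported.append((n, estimate))

small = [e for n, e in reported if n < 100]
large = [e for n, e in reported if n >= 100]
print(f"mean reported effect, small studies: {statistics.mean(small):.2f}")
print(f"mean reported effect, large studies: {statistics.mean(large):.2f}")
# Small studies are noisier, so only their overshoots clear the bar: the
# largest reported effects cluster among the smallest samples, even though
# every study measured the same underlying effect.
```

Seeing that pattern in a collection of results is a warning that you are looking at a selected subset, not at everything that was measured.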
When trying to adjust for dark data, you must always bear in mind that statisticians cannot perform miracles. Missing data, distorted data, and inaccurate data are bad news, and correction for these problems can only be made at the expense of assumptions about why they have arisen. If that were not the case, life would be simplicity itself: it would mean that if you gave me a data set, no matter how corrupted or small (three data points, all incorrect?), I could use just those to draw valid conclusions about the population they were drawn from. That is clearly nonsensical.
You wrote in the book that some kinds of dark data are deliberately generated. Can you say more about that?
DH: Deliberate generation of dark data arises in two, almost diametrically opposite, scenarios.
The first is when fraudsters lay a trail of dark data, concealing the truth and leading you to make decisions that favour them. Regrettably, this happens in most walks of life, though in some more than others. Financial fraud of various kinds is perhaps the most common – people falsifying accounts and making up data, for example, or insider trading, where they make use of data which should be kept dark – but it arises in other areas as well.
The other, and in a sense completely opposite, scenario involves the strategic application of ignorance to facilitate discovery. In clinical trials researchers deliberately conceal which treatment is which from the clinicians treating the patients, so that they will not be tempted to treat different groups differently (so-called “blinding” – again a very literal darkening of data). In survey work, a sample of data is taken – which is, of course, equivalent to treating all the rest of the data as dark. More advanced statistical methods can be thought of as generating data which might have been, and can lead to improved estimates, better predictions, and greater understanding. Some of these advanced ideas are outlined in the book.
In fact, you yourself use the idea of protecting data by means of dark data – in your passwords, for example. Sometimes you want to keep your data dark, and you do that with dark data.
What first got you interested in this area?
DH: I think it was a growing awareness of how issues of poor data quality impacted the conclusions people were drawing, even from highly sophisticated (and valid) analyses. As I worked in different areas, so I saw the same dark data problems arising – in medical research, in consumer credit, in manufacturing, in financial trading, and so on. Everywhere in fact. The bottom line is that, no matter how clever you are, and no matter how advanced the statistical models and algorithms you use, if you are ignorant of distortions in your data then your conclusions are likely to be wrong.
David J. Hand is emeritus professor of mathematics and senior research investigator at Imperial College London, a former president of the Royal Statistical Society, and a fellow of the British Academy. His many previous books include The Improbability Principle, Measurement: A Very Short Introduction, Statistics: A Very Short Introduction, and Principles of Data Mining.