“Complete bollocks”, says David Spiegelhalter, Winton Professor for the Public Understanding of Risk at Cambridge University.
“Big data is bullshit”, says Harper Reed, CTO of the Obama 2012 re-election campaign. “The ‘big’ is there for pure marketing. This is about you buying big, expensive servers and whatnot.”
Certainly, “Big Data” is a vague term that has been aggressively promoted by the industry and is now facing a backlash. At its heart, we define Big Data as an approach, enabled by new database technologies, which holds that statistical correlation can provide all the answers if enough data is supplied. The idea was memorably described in “The End of Theory”, a 2008 article published in Wired Magazine. Big Data asserts that causal models and other statistical methods are no longer necessary, because Big Data techniques can find correlations in such enormous amounts of data that they can answer questions without the need for models and the scientific method.

This hubris is reminiscent of the hype surrounding value-at-risk (VaR) that took hold in financial institutions in the 2000s. In a nutshell, VaR is a number – the maximum amount that a trader may lose in one day, predicted with 95% confidence. Many institutions still cling to this disarmingly simple number even though its reliability has been repeatedly disproved, especially at the moments it is needed most, with horrifying outcomes. In 1992, 2000 and 2008, extraordinary events made prices jump around so much that all the correlations that had been meticulously captured by risk managers went right out of the window. This sorry tale is told in a most entertaining way in Nassim Nicholas Taleb’s bestselling book “The Black Swan”. Big Data seems to have been born in a similar fog of faith that raw correlations are all that is needed to make decisions.
Big Data techniques are geared to capturing high-velocity data and then using machine learning algorithms to find correlations. While this is a fascinating pursuit, it does not follow that once correlations are located, hypotheses become irrelevant. Using Big Data techniques to discover insight can certainly be valuable. To be effective, however, we must remain vigilant not to confuse correlation with causation, and the best technique we have for that has existed for centuries – the scientific method. Finding out what-causes-what is hard and requires hypotheses to be formulated, tested and, where possible, disproved. Finding out what-correlates-with-what is much easier, but it can be dangerous and fragile. The chance of finding correlations that look statistically robust but genuinely occur only by chance increases with the amount of data processed, especially when it comes from a single source.
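The multiple-comparisons effect described above is easy to demonstrate. The following sketch (pure standard-library Python; all numbers are illustrative) generates hundreds of series of pure noise and counts how many pairs nonetheless show a “strong” correlation by chance alone:

```python
# Sketch: with enough variables, chance correlations look "significant".
# Every series here is independent random noise - any correlation found is spurious.
import random

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n_obs, n_vars = 50, 200
series = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# Compare every pair of independent noise series; count the "strong" correlations.
strong = sum(
    1
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(pearson(series[i], series[j])) > 0.35
)
print(f"{strong} 'strong' correlations found among pure noise")
```

With 200 variables there are nearly 20,000 pairs to compare, so dozens of spurious “discoveries” appear even though no real relationship exists – exactly the trap a single large data source sets for an analyst hunting correlations without a hypothesis.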
Big Data sets can be extremely large. Facebook alone generates 3 billion ‘likes’ per day. However, just because these data sets are large does not mean they are without bias. Making extrapolations and predictions remains unreliable. Also, there still remains the problem of data quality. “There are lots of small data problems in big data. They don’t disappear just because you have got lots of it. In fact, they get worse”, Spiegelhoffer reminds us.
Streetbump is a mobile app that uses a phone’s accelerometer to detect potholes. When a user drives over a hole in the road, the app records its location, and as citizens drive around the city a map of potholes is generated and sent to City Hall in real time. At first sight this is a brilliantly simple solution that would have been inconceivable just a few years ago. However, as Kate Crawford, Principal Researcher at Microsoft and visiting professor at MIT, points out, “what Streetbump really produces is a map of potholes that systematically favours young, affluent areas where more people own smartphones.” This problem of sampling bias is well understood by statisticians, but much of the Big Data analysis going on today fails to consider it properly.
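The Streetbump effect can be simulated in a few lines. In this sketch (the neighbourhood names and ownership rates are invented for illustration) both areas have identical road conditions, yet the report counts mirror smartphone ownership rather than potholes:

```python
# Sketch of Streetbump-style sampling bias: potholes occur equally in both
# areas, but only app users generate reports. All figures are made up.
import random

random.seed(1)

# Share of drivers carrying the app in each (hypothetical) neighbourhood.
app_ownership = {"affluent": 0.8, "less_affluent": 0.2}

reports = {name: 0 for name in app_ownership}
for name, share in app_ownership.items():
    for _ in range(1000):            # 1,000 pothole encounters in each area
        if random.random() < share:  # a report is filed only if an app user hits it
            reports[name] += 1

print(reports)  # counts track phone ownership, not road condition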
We think such work has the potential to be extremely valuable. Big Data techniques may be best used to locate the right questions to ask, rather than the answers. Flaws in the technique and other risks can be ameliorated by following these three approaches:
1. Remain vigilant to confusion between correlation and causation. Consider sampling bias. Don’t abandon theories: think of Big Data as a way to come up with better theories, rather than as an end in itself that provides the answers.
2. Systematically fix data quality problems. At scale, this means automation. Use a data quality rules engine to scrub your data before running analysis. Capture data quality metrics.
3. Take advantage of the new capabilities to consume extremely large data sets, but don’t forget that smaller data sets can also contain value, especially when blended accurately with those large sets. Many “Big Data” platforms are poor at blending multiple data sets into composites because they are focused on high-velocity, large-scale sets.
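Point 2 above can be made concrete with a minimal sketch of a data-quality rules engine. The rules, field names and records here are hypothetical examples, not any particular product’s API; the point is the shape: declare rules once, scrub every record against them, and capture metrics as you go:

```python
# Minimal sketch of a data-quality rules engine (point 2).
# Records, rules and field names are invented for illustration.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": -5, "email": "bad-address"},   # fails both rules
    {"id": 3, "age": 51, "email": "c@example.com"},
]

# Each rule: (name, predicate that a clean record must satisfy).
rules = [
    ("age_in_range", lambda r: 0 <= r.get("age", -1) <= 120),
    ("email_has_at", lambda r: "@" in r.get("email", "")),
]

clean, metrics = [], {name: 0 for name, _ in rules}
for rec in records:
    failures = [name for name, check in rules if not check(rec)]
    for name in failures:
        metrics[name] += 1      # capture data-quality metrics as we scrub
    if not failures:
        clean.append(rec)

print(len(clean), "clean records;", metrics)
```

Running the scrub before any analysis – and keeping the failure counts – gives you both cleaner input and a measurable view of how bad the problem is, which is exactly what gets worse at scale.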
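Point 3, blending a small curated data set with a large one, can also be sketched simply. The store and event records below are invented; the pattern is a key-based join where the small set supplies an attribute the large set lacks:

```python
# Sketch of blending a small reference set with a large event set (point 3).
# All field names and values are hypothetical.
events = [  # stand-in for a high-volume feed
    {"store_id": "s1", "sales": 120},
    {"store_id": "s2", "sales": 75},
    {"store_id": "s1", "sales": 40},
]
stores = {  # small, hand-curated data set that adds context
    "s1": {"region": "North"},
    "s2": {"region": "South"},
}

# Blend: join each event to its reference record by key.
blended = [{**e, **stores.get(e["store_id"], {"region": "unknown"})} for e in events]

# Aggregate by the attribute only the small set provides.
totals = {}
for row in blended:
    totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
print(totals)
```

The small data set is what makes the aggregation meaningful: without it, the high-velocity feed can only be sliced by fields it already carries.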
By Mark Blakey, Co-Founder, Misato Systems.