A New Big Data Paradigm for the Zettabyte Era

By Dr. Andreas Freund | 6 August 2015

Introduction

Big Data, Big Data, Big Data, Big Data… the rallying cry of the digitisation of our world and the foundation of a transformation the world has not seen since humans tamed fire, now being driven by the 6 Ds of the Digital Revolution – Digitisation, Deception, Disruption, Dematerialisation, Demonetisation and Democratisation. Big Data has become the lifeblood of business models that are disrupting industries from cabs (Uber, Lyft etc.) to hotels (Airbnb and its clones). If you cannot monetise your data, your business will soon be in trouble.

As you might have heard by now, we live in exponential times (see figure below): in the 21st century we will advance as much as we did in the last 20,000 years. That creates significant mindset challenges for us, because human evolution has made us quasi-linear perceivers and thinkers. We are not used to seeing and understanding the impact of exponential change, such as the following:

Airbnb adds 20,000 rooms every two weeks; Uber increased its ridership from 1 million rides a day globally at the beginning of 2015 to 1 million rides a day in China alone; and Cisco extrapolates in its Visual Networking Index (VNI) report, “The Zettabyte Era”, that global internet traffic will reach 168 exabytes a month (1 exabyte = 1 billion gigabytes) by 2019. Cisco expects traffic to roughly double every three years, driven increasingly by the Internet of Things (IoT), which it estimates will reach somewhere between 25 and 50 billion connected devices by 2020.

Given this exponential growth, it is clear that current thinking about how we approach this phenomenon needs to change radically. This is true, in particular, for Big Data, the tool for unlocking and creating value from data in the 21st century. How will we comb through the expected data avalanche to find the next terrorist threat, discover the next new drug or spot the next cyberattack before it happens?

Figure 1 – from Wisdomchief.com

As with everything now, we need to think exponentially about Big Data to address the challenge of exponential change. What can we do radically differently to manage the exponential Big Data challenge?

The New Big Data Paradigm

Speed-to-Recommendation, not Speed-to-Insight, will be the critical metric of the exponential Big Data era. This means that Big Data will have to transform into Fast Big Data to meet this challenge. But what does that mean?

In order to minimise Speed-to-Recommendation, we will have to:

  • Recognise that not all data is created equal and that, just as when we fast-forward through a movie to a particular scene, we need to look for real-time “signals” of interesting patterns that meet our current needs (the scene from which we want to start watching);
  • Categorise and rank data by source and type, so that we can create specific, intelligent data sensors optimised to recognise patterns by type and source, just as our human eyes and ears are;
  • Combine patterns from those data sensors and look for larger patterns (for example, is the sound coming from a car engine consistent with how the engine looks?), or use signals from one type of data sensor to trigger a search for specific signals from other data sensors, just as you would look for an ambulance when you hear sirens approaching (a minimal sketch of this idea follows the list);
  • Deploy the data sensors directly at the data stream, just as particle detectors are built around the collision point to collect their data; and
  • Continuously update and refine the data sensors as they learn more about old and new patterns in the data streams, just as we learn every day to recognise new sounds and images.
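
One way to picture such a data sensor, purely as an illustrative sketch (the class names, patterns and rules below are hypothetical, not an existing product or API), is as a small pattern matcher specialised for one data type, paired with a combiner that looks for larger, cross-sensor patterns:

```python
import re


class DataSensor:
    """Hypothetical sensor specialised for one data source/type.

    It scans an incoming record for "signals" -- patterns it has been
    configured (or trained) to recognise -- and emits the matches.
    """
    def __init__(self, name, patterns):
        self.name = name
        self.patterns = [re.compile(p) for p in patterns]

    def detect(self, record):
        # Return the labels of all known patterns present in this record.
        return [p.pattern for p in self.patterns if p.search(record)]


class SignalCombiner:
    """Looks for larger patterns across the outputs of several sensors,
    e.g. "siren" heard AND "flashing lights" seen => "ambulance nearby"."""
    def __init__(self, rules):
        # rules: {composite_label: set of signals that must all be present}
        self.rules = rules

    def combine(self, signals):
        found = set(signals)
        return [label for label, required in self.rules.items() if required <= found]


# Toy usage: one audio-like and one video-like stream of text records
audio_sensor = DataSensor("audio", [r"siren"])
video_sensor = DataSensor("video", [r"flashing lights"])
combiner = SignalCombiner({"ambulance nearby": {"siren", "flashing lights"}})

signals = audio_sensor.detect("loud siren approaching from the east") \
        + video_sensor.detect("flashing lights in the rear-view camera")
print(combiner.combine(signals))   # ['ambulance nearby']
```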

If this sounds as if Fast Big Data works like a human being, that is because it does. In order to process data and make real-time decisions, we need to mimic what human beings do as they take in their surroundings and act on the signals and patterns they see, hear, feel and otherwise sense.

Pieces of this new Big Data paradigm are already in place but are not yet connected; others are still missing.

What do we have?

  • The concept of online, nearline and offline data is already employed by companies such as Google, Facebook, Amazon and LinkedIn, primarily for large-scale data processing, for AI-optimised search engines (Google, Yahoo, Microsoft) and for recommendations (LinkedIn, Facebook). In a Fast Big Data context, it means the following (illustrated in the sketch after this list):

- Online: Finding signals in data streams and comparing and matching them against existing patterns, in order to make recommendations for real-time decision making and decision support
- Nearline: Finding new, interesting patterns, either in the extracted real-time signals or in data that is not time-critical, to be used for offline AI learning or for direct insights
- Offline: Using data lakes to train the next iteration of narrow and general-purpose AI systems, which are then promoted into the nearline and online systems.

  • The new Now on Tap feature in Google’s forthcoming Android OS update, currently called Android M, is a dedicated Big Data AI solution that delivers recommendations or answers based on the context of the individual. In addition, the overall, general-purpose AI system will learn from every human it interacts with and use those learnings to refine each individual’s AI, creating a self-reinforcing feedback loop.
  • In-memory frameworks that combine streaming data processing with analytics, as realised in, for example, SparkR (the latest addition to Apache’s Spark framework) and SAP HANA, allow “smart data sensors” to be built directly on top of online data streams (see the sketch after this list).
  • ISPs and Telco carriers are already analysing their network traffic to find patterns to optimise their network capacity and to monetise usage patterns by creating new offers.
  • Google’s Fiber project will bring streaming data together with Google’s data processing and AI capabilities. This will create added value for the organisations using its network, because Google will monetise not only the fiber itself but also the insights and recommendations derived from the data crossing its network.
  • Model-building automation, a critical ingredient of a Fast Big Data environment, is becoming more robust, with commercial tools such as SAS Factory Miner and SkyRELR, and proprietary tools built at Google, Facebook, VIV and others.
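
To make the online/nearline/offline split and the idea of a “smart data sensor” sitting directly on an online data stream more concrete, here is a minimal sketch using Spark Streaming’s Python API (the article mentions SparkR; Python is used here purely for illustration). The host and port, the plain-text numeric feed, the 90.0 threshold and the output path are assumptions made for this sketch, not features of any product mentioned above:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Online tier: a "smart data sensor" sitting directly on the stream.
# Assumption: a plain-text feed of numeric sensor readings arrives on
# localhost:9999 (e.g. pushed there with `nc -lk 9999` for testing).
sc = SparkContext("local[2]", "FastBigDataSensorSketch")
ssc = StreamingContext(sc, 5)          # 5-second micro-batches

readings = ssc.socketTextStream("localhost", 9999).map(float)

# Signal detection: flag readings that match a simple "interesting pattern"
# (here just a threshold; in practice a trained model would score them).
alerts = readings.filter(lambda value: value > 90.0)

# Online: surface the signals immediately for real-time recommendations.
alerts.pprint()

# Nearline/offline hand-off: persist the extracted signals so that slower
# pattern mining and model (re)training can pick them up later.
alerts.saveAsTextFiles("/tmp/extracted_signals")   # hypothetical path

ssc.start()
ssc.awaitTermination()
```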

What do we still need?

  • Connect narrow and general-purpose AIs in different environments, and at different layers of the abstraction hierarchy, with one another to enable cross-AI learning, much as humans learn from one another
  • Create open data stores globally by building more and better data APIs that allow AI systems to comb through offline data and learn – the equivalent of a global library
  • Build a hierarchy of narrow AI systems, from the origin of the data, through ever larger geographic/environmental scales, up to a global view that looks for larger and larger patterns, just as humans do. Then connect the narrow AIs to general-purpose AIs that take the important patterns and use them to learn and abstract across functions, much as the sciences combine mathematics with experiments to advance our knowledge (a conceptual sketch follows this list)
  • Use nanotechnology to push as much analytical power to the data edges as possible – the equivalent of specialised human cells for audio, visual and touch/taste input. Initial attempts at this have been successful in Japan, at the University of Tokyo (Yamashita) and the Tokyo Institute of Technology
  • Leverage emerging Blockchain 2.0 technologies to build decentralised streaming frameworks, e.g., SparkR meets the Blockchain.
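
The hierarchy of narrow AIs described above can be pictured, very schematically, as local pattern recognisers that forward only their significant patterns to a higher-level aggregator that looks for larger, cross-node patterns. The sketch below is purely conceptual; all class names, locations and signals are hypothetical:

```python
class NarrowAI:
    """Hypothetical narrow AI: recognises patterns in one local data stream."""
    def __init__(self, location, interesting):
        self.location = location
        self.interesting = set(interesting)

    def significant_patterns(self, observations):
        # Forward only the patterns this node considers significant.
        return {(self.location, o) for o in observations if o in self.interesting}


class AggregatorAI:
    """Hypothetical higher-level AI: looks for larger patterns across nodes."""
    def correlate(self, pattern_sets):
        merged = set().union(*pattern_sets)
        # Toy "larger pattern": the same signal seen at several locations.
        by_signal = {}
        for location, signal in merged:
            by_signal.setdefault(signal, set()).add(location)
        return {s: locs for s, locs in by_signal.items() if len(locs) > 1}


# Toy usage: two edge nodes both report unusual traffic -> a regional pattern
node_a = NarrowAI("paris-edge", ["unusual traffic"])
node_b = NarrowAI("lyon-edge", ["unusual traffic"])
regional = AggregatorAI()

patterns = [node_a.significant_patterns(["unusual traffic", "noise"]),
            node_b.significant_patterns(["unusual traffic"])]
print(regional.correlate(patterns))
# {'unusual traffic': {'paris-edge', 'lyon-edge'}}  (set order may vary)
```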

Where does this leave us? We are not there yet. In the coming years we will need to be highly selective about the data we use for real-time decision support, and we will need to understand the associated caveats. But as the technologies rapidly evolve and mature, we will be able to make the transition from Big Data to Fast Big Data over the next 10 years.

All opinions expressed in this article are my own.
 

By Dr. Andreas Freund, bobsguide Contributor
