Hazelcast, the leading open source in-memory data grid (IMDG) with hundreds of thousands of installed clusters and over 17 million server starts per month, today announced a new solution that integrates Hazelcast IMDG with Apache Spark, an open source data processing engine that enables data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. By combining the two technologies, developers now have access to an open source solution that provides data storage and compute capabilities for big data requirements well beyond the historical limitations of a single Java Virtual Machine (JVM).
One of the key driving forces behind the widespread developer adoption of Apache Spark is its set of easy-to-use Application Programming Interfaces (APIs) for operating on large datasets, one of which is the Resilient Distributed Dataset (RDD). At its core, an RDD is an immutable distributed collection of data elements, partitioned across the nodes in a cluster to provide fault tolerance and parallel access to data. Both of these key features are a natural fit with Hazelcast, as they are essential building blocks for any performant distributed compute capability.
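The two properties above — immutability and partitioning — can be illustrated with a minimal Python sketch. This is not Spark's API: `MiniRDD` and its methods are invented for the example (though `parallelize`, `map`, and `collect` deliberately echo Spark's names), and it models only the idea that transformations produce a new collection while partitions are processed independently and in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

class MiniRDD:
    """Toy model of an RDD: an immutable collection split into partitions.

    Illustrative only -- this is not Spark; the names are borrowed from
    Spark's API to make the analogy clear.
    """

    def __init__(self, partitions):
        # Nested tuples keep both the collection and each partition immutable.
        self.partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions=4):
        # Round-robin split of the input across partitions.
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # A transformation returns a *new* MiniRDD; the original is left
        # untouched, which is what makes Spark's lineage-based fault
        # tolerance possible. Each partition is mapped in parallel.
        with ThreadPoolExecutor() as pool:
            return MiniRDD(pool.map(lambda p: [fn(x) for x in p],
                                    self.partitions))

    def collect(self):
        # Gather all partitions back into a single local list.
        return [x for part in self.partitions for x in part]

rdd = MiniRDD.parallelize(list(range(10)), num_partitions=3)
doubled = rdd.map(lambda x: x * 2)
```

Because the split is round-robin, `collect()` does not preserve the original ordering — a reminder that, as in Spark, per-partition work happens independently of any global order.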
To demonstrate the potential of integrating Apache Spark into a Hazelcast IMDG application, an example sports betting application, BetLeopard, has been developed. Put simply, BetLeopard is a betting engine that scales across multiple JVMs by sharing events via Hazelcast IMDG partitions, with a query engine that uses Spark to provide real-time risk analysis and analytics for future events. The combination of Hazelcast's advanced in-memory compute capabilities and distributed store with Spark's query and analytics capabilities creates a powerful gaming solution. In addition, the integration provides a solid base for the next generation of JVM applications.
Hazelcast IMDG is well-known for its interoperability and is integrated with dozens of other software technologies, including Spark. Hazelcast provides clients for several programming languages, including Java, .NET/C#, C++, Python, Node.js and Scala, while Spark supports Java, Scala, Python and R out of the box. Consequently, Hazelcast and Spark can easily be used across stacks that comprise multiple languages.
Greg Luck, CEO of Hazelcast, said: “The feedback we get back from the community is that any big data solution needs to be able to distribute processing and storage across machines whilst maintaining a flexible and convenient programming interface. Without these functionalities, it becomes impossible to build enterprise applications which are expected to process more and more data. We believe Hazelcast and Spark provide a compelling open source alternative which has been designed based on developer engagement.”
Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm. First, Spark provides a comprehensive, unified framework for big data processing across data sets that are diverse both in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data). It also enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster even when running on disk. From version 3.7 onwards, Hazelcast IMDG has shipped an open source connector that allows Hazelcast to be used as a storage medium for Spark.
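What makes an IMDG a natural storage medium for Spark is that a distributed map's keys already hash deterministically to partitions, so each partition can be handed to a task and read in parallel. The sketch below illustrates that idea only: the `partition_id` and `partitioned_view` helpers are invented for the example, and Python's `hash()` stands in for Hazelcast's real algorithm, which hashes the serialized key. The default partition count of 271 is Hazelcast's actual default.

```python
from collections import defaultdict

PARTITION_COUNT = 271  # Hazelcast's default partition count

def partition_id(key, partition_count=PARTITION_COUNT):
    # Stand-in for Hazelcast's partitioning scheme: the real grid hashes
    # the serialized key; Python's hash() is used purely for illustration.
    return hash(key) % partition_count

def partitioned_view(entries):
    # Group a map's entries by partition id. Each group corresponds to a
    # unit of data that a parallel task (e.g. a Spark task reading from
    # the grid) could process independently.
    parts = defaultdict(list)
    for key, value in entries.items():
        parts[partition_id(key)].append((key, value))
    return dict(parts)

# Hypothetical data: odds for 1,000 sporting events stored in one map.
odds = {f"event-{i}": 1.5 + i * 0.1 for i in range(1000)}
view = partitioned_view(odds)
```

Because every key maps to exactly one partition, the partitioned view covers the whole map with no overlap — the property a connector relies on to divide reads across workers without coordination.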