Google Cloud and Spark: considering Docker for real-time streaming data
We have designed an efficient trading system that runs on a cluster of three nodes in Google Cloud Platform (GCP).
Each node runs one ZooKeeper container and one Kafka broker container, so in total we have three ZooKeeper containers and three Kafka broker containers. The nodes communicate with each other using internal IP addresses rather than ephemeral external IP addresses, which change if the compute node is restarted.
The trading system consumes a streaming topic whose messages are generated in JSON format on one of the compute nodes. For all containers we specified the --net=host option, so Docker uses the host's network stack for the container. The network configuration of the container is then the same as that of the host, and the container shares the service ports that are available to the host. Since each node hosts its own containers, this sharing of ports between container and host works fine.
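As an illustration of the host-networking setup described above, here is a minimal Python sketch that assembles a docker run command line for one Kafka broker. The internal IP addresses, the image name and the KAFKA_* environment variable names are assumptions (they follow the convention of some commonly used Kafka images, not necessarily the one used here); only --net=host comes from the setup described.

```python
from typing import List

# Hypothetical internal IPs of the three GCP nodes (placeholders, not the real ones).
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def zookeeper_connect(nodes: List[str], port: int = 2181) -> str:
    """Join every node's ZooKeeper endpoint into one connection string."""
    return ",".join(f"{ip}:{port}" for ip in nodes)


def kafka_run_command(broker_id: int, nodes: List[str],
                      image: str = "example/kafka") -> List[str]:
    """Build the argv for starting one broker with host networking.

    The image name and environment variable names are assumptions;
    check the documentation of the image actually in use.
    """
    ip = nodes[broker_id - 1]
    return [
        "docker", "run", "-d",
        "--net=host",                      # share the host's network stack
        "--name", f"kafka{broker_id}",
        "-e", f"KAFKA_BROKER_ID={broker_id}",
        "-e", f"KAFKA_ZOOKEEPER_CONNECT={zookeeper_connect(nodes)}",
        # Advertise the internal IP so the other nodes can reach this broker.
        "-e", f"KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://{ip}:9092",
        image,
    ]


if __name__ == "__main__":
    print(" ".join(kafka_run_command(1, NODES)))
```

Because the container uses the host's network stack, the sketch passes no -p port publishing options; the broker listens directly on the host's port.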
A high-level design is shown below.
Unlike a typical Lambda Architecture, we have replaced the batch layer, traditionally a data warehouse such as HBase or Hive, with the flash-optimized Aerospike database, which has very low latency. For redundancy we created a two-node Aerospike cluster. All incoming ticker (trade) data from the Kafka topic is stored in one Aerospike set using the Aerospike-Kafka connector. For the speed layer, we used Spark SQL with Scala and functional programming to look for high-value trades as they stream in. If the price is desirable (according to the decision engine), a notification is sent to the real-time dashboard, and these high-value events are also stored in another Aerospike set (table).
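To make the speed-layer logic concrete, here is a minimal sketch of the filtering step. In the system described this is done with Spark SQL in Scala; the plain Python below only illustrates the idea. The field names ("ticker", "price"), the threshold value and the direction of the comparison are all assumptions standing in for the real message schema and decision engine.

```python
import json
from typing import Iterable, Iterator

# Hypothetical threshold standing in for the decision engine.
PRICE_THRESHOLD = 95.0


def high_value_trades(messages: Iterable[str],
                      threshold: float = PRICE_THRESHOLD) -> Iterator[dict]:
    """Yield parsed trades whose price exceeds the threshold.

    Field names are assumed for illustration. Malformed messages are
    skipped rather than allowed to fail the whole stream.
    """
    for raw in messages:
        try:
            trade = json.loads(raw)
            price = float(trade["price"])
        except (ValueError, KeyError, TypeError):
            continue
        if price > threshold:
            yield trade


if __name__ == "__main__":
    sample = [
        '{"ticker": "IBM", "price": 97.5}',
        '{"ticker": "MSFT", "price": 42.0}',
        "not json",
    ]
    for trade in high_value_trades(sample):
        # In the real system this is where the dashboard notification
        # and the write to the second Aerospike set would happen.
        print(trade)
```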
We installed the Spark binaries (Spark version 2.4.3) directly and did not use Docker for Spark. We started testing in local mode (a single JVM), which served as a smoke test and proved that the design worked. However, there are significant challenges that need to be addressed when running a distributed application like Spark in a multi-host environment. We opted to run Spark in standalone mode. One node runs the Spark master, and the workers run on all three nodes. Each node is "n1-standard-4 (4 vCPUs, 15 GB memory)". In summary, one node runs the Spark master plus a Spark worker, one ZooKeeper container and one Kafka broker container. Each of the remaining two nodes runs a Spark worker, one ZooKeeper container, one Kafka broker container and one Aerospike container.
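The placement described above can be written down as a small Python table, which makes it easy to check the service counts and the total capacity the services share. The node names are hypothetical; the placement and machine size are taken from the description.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Service placement as described in the text; node names are hypothetical.
PLACEMENT: Dict[str, List[str]] = {
    "node1": ["spark-master", "spark-worker", "zookeeper", "kafka"],
    "node2": ["spark-worker", "zookeeper", "kafka", "aerospike"],
    "node3": ["spark-worker", "zookeeper", "kafka", "aerospike"],
}

VCPUS_PER_NODE = 4        # n1-standard-4
MEMORY_GB_PER_NODE = 15   # n1-standard-4


def service_counts(placement: Dict[str, List[str]]) -> Counter:
    """Count how many instances of each service run across the cluster."""
    return Counter(s for services in placement.values() for s in services)


def cluster_capacity(placement: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (total vCPUs, total memory in GB) for the whole cluster."""
    n = len(placement)
    return n * VCPUS_PER_NODE, n * MEMORY_GB_PER_NODE


if __name__ == "__main__":
    print(dict(service_counts(PLACEMENT)))
    print(cluster_capacity(PLACEMENT))
```

The totals come to 12 vCPUs and 45 GB across the three nodes, shared by Spark, Kafka, ZooKeeper and Aerospike, so the cores and memory given to each Spark worker have to leave room for the other services on the same node.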
We thought a lot about it and decided not to rely on Docker for Spark. Most of the literature seems to point to certain concepts but not to real-life dockerisation of Spark. So my question is: how would this design be different or better if we decided to run Spark in Docker containers?