Streaming in Spark: Essential Information

satabdi ray
Apr 6, 2021


Here I am sharing some essential information on Spark Streaming.

(Figure: comparison of stream-processing engines.)

(Figure: comparison of batch processing and stream processing.)

Apache Kafka integration with Structured Streaming:

  1. Kafka is a powerful publisher/subscriber messaging technology: publishers generate messages and publish them to a queue, from which subscribers can then receive them.
  2. When messages are published to Kafka, they are categorised by topic and stored in partitioned, replicated logs.
  3. As a subscriber, we specify the topics we are interested in and receive all the messages on those topics.
  4. Kafka is a distributed technology, which means it is highly scalable and can process many millions of messages per second.
  5. Internally, Kafka uses Apache ZooKeeper for configuration, management, and synchronisation of the servers in its cluster.
  6. Let’s walk through the publisher/subscriber messaging model that Kafka uses with the example below.

The Twitter streaming API could be a publisher. We could then have a subscriber interested in those messages for some kind of analysis, say, tracking Twitter trends. Every message that a publisher publishes goes to a particular topic.

Kafka divides all the messages into categories according to where they belong. These categories are called topics, and subscribers subscribe to the topics they are interested in.
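To make the model concrete, here is a minimal publisher/subscriber sketch using the kafka-python client. The broker address (localhost:9092) and the topic name (twitter-trends) are assumptions for illustration, not details from a real deployment.

```python
# Minimal pub/sub sketch with kafka-python.
# Assumptions: a broker at localhost:9092 and a topic "twitter-trends".
from kafka import KafkaProducer, KafkaConsumer

# Publisher: sends records (key/value pairs) to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("twitter-trends", key=b"user-42", value=b"#spark is trending")
producer.flush()

# Subscriber: names the topics it is interested in and receives
# every message published to them.
consumer = KafkaConsumer(
    "twitter-trends",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating if no message arrives
)
for record in consumer:
    # Each record carries a key, a value, and a timestamp.
    print(record.key, record.value, record.timestamp)
```

Note how the publisher and the subscriber never reference each other, only the topic; that is the decoupling described in the next point.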

7. Kafka, as a distributed streaming platform, is extremely useful because it lets us completely decouple the publishers of messages from their consumers.

8. Kafka can also store streams of records in a fault-tolerant, durable way.

9. Kafka runs on a cluster. Each node in the cluster is called a broker.

10. Each record (message) has a key, a value, and a timestamp.

11. ZooKeeper provides the coordination and synchronisation for the Kafka distributed system. ZooKeeper must be running even to run Kafka standalone on a one-node cluster.

12. In order to run any Kafka consumer, we need to specify a bootstrap server: the server used for the initial connection to Kafka, which the consumer then uses to discover the full cluster.

13. A Spark application needs to know the host and port on which Kafka is running and the topic to which it should subscribe.

14. We can create a Spark session as usual and use it to read from a Kafka source. Because Kafka and Spark are so well integrated, we simply read a stream with the format "kafka"; the only other thing we need to specify is a bootstrap server, a host in our Kafka cluster that Spark can connect to. Spark then uses this bootstrap server to discover the rest of the cluster, as shown in the sketch below.
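Here is a minimal sketch of this in PySpark. The broker address and topic name are the same illustrative assumptions as above, and the spark-sql-kafka connector package must be available to the session:

```python
# Minimal sketch: reading a Kafka topic with Spark Structured Streaming.
# Assumptions: a broker at localhost:9092 and a topic "twitter-trends".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamDemo").getOrCreate()

# Spark only needs a bootstrap server and a topic; it discovers the
# rest of the Kafka cluster from the bootstrap server.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "twitter-trends")
    .load()
)

# Each Kafka record surfaces as key, value, and timestamp columns.
query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```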

Streaming Data Sources: a few are mentioned below.

  1. Akka
  2. Flume
  3. Kafka
  4. Amazon Kinesis

Hope this is useful. Thanks.
