Paddle your own Canoe!

Deep down we realise how true the phrase ‘Paddle your own canoe’ is. In this one life we have been given pretty much everything, the same 24 hours a day, a body that works fine, so we should always be grateful. But we look for the missing things rather than enjoying what we have. It is easy not just to say but to realise as well.

Certain challenges will keep coming in our lives; that is beyond our control. What we can control is our reaction to those challenges. When I was younger, around 10 to 20 years old, I had so…


Use case: get a file, say a CSV of size 5GB–10GB, and process it in Spark.

There are certain websites like Kaggle where you can download files of whatever size you require. Another way to achieve this quickly on your local machine is to follow the steps below.

  1. Download a sample file of about 1GB (that is doable).
  2. Copy the same file in its own location as many times as your required size demands (say you need 10GB out of a 1GB file, then copy it 10 times).
  3. Open the CLI from the file location and run the command below.

cat f1 f2 f3 f4 f5 > MyFile    (MyFile is the new file name)

4. You can see a new…
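Once the concatenated file exists, the Spark side is just a plain CSV read. Below is a minimal sketch; the file name MyFile.csv, the local master and the header/inferSchema options are assumptions. Also note that if the 1GB sample had a header row, every copy contributes an extra header line that you may need to filter out.

import org.apache.spark.sql.SparkSession

object ReadLargeCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadLargeCsv")
      .master("local[*]")               // local mode, just for testing the 5GB–10GB file
      .getOrCreate()

    // header/inferSchema depend on how the sample file was generated
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("path/to/MyFile.csv")

    println(s"Row count: ${df.count()}")
    df.show(5)

    spark.stop()
  }
}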


If there is a requirement to connect a Spark application (written in Scala and running on a separate AWS EMR cluster) to Hive installed on a remote server, then you can refer to the code snippet below (Test.scala). It will help you read from Hive and write a DataFrame back to that Hive server locally, provided you have been given the required connection details.

Try a test connection from a DB client as mentioned below; you should be able to connect to Hive.

  1. First, create a sample table in Hive using SQL for testing the read and the write.
  2. Download HiveJDBC42.jar and add the below…
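The full snippet is truncated above, so here is a minimal sketch of what such a Test.scala could look like when reading and writing over JDBC. The host, port, database, table names and credentials are placeholders, and the driver class name depends on the JDBC jar you use (org.apache.hive.jdbc.HiveDriver is the Apache one; the HiveJDBC42.jar you download may ship its own class).

import org.apache.spark.sql.{SaveMode, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HiveJdbcTest").getOrCreate()

    val jdbcUrl = "jdbc:hive2://<hive-host>:10000/default"   // 10000 is the default HiveServer2 port
    val props = new java.util.Properties()
    props.setProperty("user", "<user>")
    props.setProperty("password", "<password>")
    props.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")  // adjust to match your JDBC jar

    // read the sample table created in step 1 as a DataFrame
    val df = spark.read.jdbc(jdbcUrl, "sample_table", props)
    df.show(5)

    // write the DataFrame back to another table on the same remote Hive
    df.write.mode(SaveMode.Append).jdbc(jdbcUrl, "sample_table_copy", props)

    spark.stop()
  }
}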

If there is a use case to add SFDC as a connector to a Spark application (written in Scala), then you can refer to the code below (Test.scala). It will help you read from SFDC and write back to that SFDC org locally (provided you have been given the required connection details).

  1. First, connect to the SFDC site using the info below.

i. User name  ii. Password  iii. SFDC URL/website/Lightning site/endpoint

2. Once you log in, locate the existing objects/SfObjects/tables at https://xxx-deved.lightning.force.com/lightning/setup/ObjectManager/home

3. Then click on Developer Console (a new window pops up) under the Setup gear icon, where you can query any object using…
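Since the code itself is truncated above, here is a minimal sketch of what such a Test.scala could look like, assuming the open-source springml spark-salesforce connector (com.springml:spark-salesforce). The username, password plus security token, login URL, SOQL query and sfObject name are all placeholders; option names can differ slightly across connector versions, so check the version you add to your build.

import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SfdcTest").getOrCreate()

    // read: the password is usually the account password concatenated with the security token
    val accountsDf = spark.read
      .format("com.springml.spark.salesforce")
      .option("username", "<user name>")
      .option("password", "<password><security token>")
      .option("login", "https://login.salesforce.com")
      .option("soql", "SELECT Id, Name FROM Account")
      .load()
    accountsDf.show(5)

    // write: push a DataFrame back into an existing SFDC object
    accountsDf.write
      .format("com.springml.spark.salesforce")
      .option("username", "<user name>")
      .option("password", "<password><security token>")
      .option("login", "https://login.salesforce.com")
      .option("sfObject", "Account")
      .save()

    spark.stop()
  }
}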


If there is a use case to add Teradata as a connector to a Spark application (written in Scala), then you can refer to the code snippet below (Test.scala). It will help you read from Teradata and write back to it locally (provided you have been given the connection details).

  1. Try a test connection from a DB client as mentioned below.

2. First, create a sample table in Teradata and import data into it from any file system (say, a CSV). You can create the table either as SET or MULTISET.

3. Next, read this table as a DataFrame into your Spark application.

4. After…
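The snippet itself is truncated above, so here is a minimal sketch of what such a Test.scala could look like using plain JDBC with the Teradata driver. Host, database, table names and credentials are placeholders, and the Teradata JDBC jar (terajdbc4.jar) must be on the application classpath.

import org.apache.spark.sql.{SaveMode, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TeradataTest").getOrCreate()

    val jdbcUrl = "jdbc:teradata://<host>/DATABASE=<database>"
    val props = new java.util.Properties()
    props.setProperty("user", "<user>")
    props.setProperty("password", "<password>")
    props.setProperty("driver", "com.teradata.jdbc.TeraDriver")

    // step 3: read the sample SET/MULTISET table as a DataFrame
    val df = spark.read.jdbc(jdbcUrl, "sample_table", props)
    df.show(5)

    // write the DataFrame back to another Teradata table
    df.write.mode(SaveMode.Append).jdbc(jdbcUrl, "sample_table_copy", props)

    spark.stop()
  }
}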


Problem statement: there is a DB2 source and a MongoDB sink; you are reading data from DB2 and writing it to MongoDB.
But while writing, an issue arises because of a type mismatch.

The DataFrame is created by reading from DB2 and has a few columns with float or double datatypes, which the MongoDB sink does not accept (as it expects BSON types).

Solution:
1. We first have to filter out the columns in the DataFrame that have a float/double datatype.
2. On that filtered result, we have to apply a type-conversion operation, as shown in the code snippet below.

// show df reading…
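The snippet is truncated above, so here is a minimal sketch of those two steps. Casting to DecimalType(38, 10) is an assumption; use whatever target type your MongoDB writer accepts.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType, FloatType}

object FloatDoubleConversion {
  def convert(df: DataFrame): DataFrame = {
    // step 1: pick out the columns whose datatype is float or double
    val floatingCols = df.schema.fields.collect {
      case f if f.dataType == FloatType || f.dataType == DoubleType => f.name
    }

    // step 2: cast only those columns; all other columns pass through unchanged
    floatingCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(c, col(c).cast(DecimalType(38, 10)))
    }
  }
}

// usage (hypothetical names): val mongoReadyDf = FloatDoubleConversion.convert(db2Df)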

If there is a use case to add a BigQuery connector (using GCP) to a Spark application (written in Scala), then you can refer to the code snippet below (Test.scala). It will help you read from BigQuery and write a DataFrame back to it locally, provided you have been given the connection details.

  1. First, create a unique project ID in GCP.
  2. Then create a bucket in Cloud Storage where you can upload a sample CSV.
  3. Add the “Storage Admin” role for this bucket.
  4. Create a service account credential [APIs & Services -> Credentials -> Create credentials dropdown -> select Service account].
  5. After…
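The snippet is truncated above, so here is a minimal sketch of what such a Test.scala could look like, assuming the spark-bigquery connector (com.google.cloud.spark:spark-bigquery-with-dependencies) is on the classpath. The project ID, dataset, table, bucket and key-file path are placeholders for the items created in the steps above.

import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BigQueryTest").getOrCreate()

    // service-account key created in step 4
    spark.conf.set("credentialsFile", "/path/to/service-account-key.json")
    // bucket from step 2; the connector stages data here for (indirect) writes
    spark.conf.set("temporaryGcsBucket", "<bucket-name>")

    // read a BigQuery table as a DataFrame
    val df = spark.read
      .format("bigquery")
      .option("table", "<project-id>.<dataset>.<table>")
      .load()
    df.show(5)

    // write the DataFrame back to another BigQuery table
    df.write
      .format("bigquery")
      .option("table", "<project-id>.<dataset>.<table>_copy")
      .save()

    spark.stop()
  }
}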


If there is a requirement to add the cloud warehouse Snowflake as a connector to a Spark application written in Scala, then you can refer to the code snippet below (Test.scala). It will help you read from Snowflake and write a DataFrame back to it locally, provided you have been given the connection details. Try a test connection from a DB client as mentioned below.

  1. First, create a sample table in Snowflake and import data into it from any file system (say, a CSV).
  2. Next, read this table as a DataFrame into your Spark application.
  3. After reading, you can run some queries on it…
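The snippet is truncated above, so here is a minimal sketch of what such a Test.scala could look like using the spark-snowflake connector (net.snowflake:spark-snowflake). The account URL, warehouse, database, schema, table names and credentials are placeholders for the connection details you were given.

import org.apache.spark.sql.{SaveMode, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SnowflakeTest").getOrCreate()

    val sfOptions = Map(
      "sfURL"       -> "<account>.snowflakecomputing.com",
      "sfUser"      -> "<user>",
      "sfPassword"  -> "<password>",
      "sfDatabase"  -> "<database>",
      "sfSchema"    -> "<schema>",
      "sfWarehouse" -> "<warehouse>"
    )

    // step 2: read the sample table as a DataFrame
    val df = spark.read
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "SAMPLE_TABLE")
      .load()
    df.show(5)

    // write the DataFrame back to another Snowflake table
    df.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "SAMPLE_TABLE_COPY")
      .mode(SaveMode.Overwrite)
      .save()

    spark.stop()
  }
}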

If there is a requirement to add a MongoDB connector to a Spark application written in Scala, then you can refer to the code snippet below (Test.scala). It will help you read from MongoDB and write a DataFrame back to it locally, provided you have been given the connection details.

Mongo DB details: 
Database: [database name]
user: [user name]
pass: [password]
Connection string: mongodb://username:password@ec2-xx-xx-xxx-xxx.xx-xx-x.compute.amazonaws.com:27017/
  1. First, create a sample collection in MongoDB and import data into it from any file system (say, a CSV).
  2. Next, read this collection as a DataFrame into your Spark application.
  3. After reading you can do…
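The snippet is truncated above, so here is a minimal sketch of what such a Test.scala could look like with the MongoDB Spark connector (mongo-spark-connector 3.x, whose datasource short name is "mongo"; the newer 10.x connector uses "mongodb" and slightly different option names). The connection string, database and collection names are placeholders for the details listed above.

import org.apache.spark.sql.{SaveMode, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MongoTest").getOrCreate()

    val uri = "mongodb://<username>:<password>@<ec2-host>:27017/"

    // step 2: read the sample collection as a DataFrame
    val df = spark.read
      .format("mongo")
      .option("uri", uri)
      .option("database", "<database name>")
      .option("collection", "sample_collection")
      .load()
    df.show(5)

    // write the DataFrame back to another collection
    df.write
      .format("mongo")
      .option("uri", uri)
      .option("database", "<database name>")
      .option("collection", "sample_collection_copy")
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}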

Here I am sharing some essential information on Spark Streaming.

Comparison with other engines.

Comparison between batch processing and stream processing.

satabdi ray

Data Engineer by profession; loves writing, sharing and learning!
