Google BigQuery connector with Spark/Scala (source/sink)

satabdi ray
2 min read · May 12, 2021

If you need to add the BigQuery connector (on GCP) to a Spark application written in Scala, the code snippet below (Test.scala) shows how to read from BigQuery and write a DataFrame back to it from your local machine, provided you have the connection details.

  1. First, create a unique project ID in GCP.
  2. Then create a bucket in Cloud Storage where you can upload a sample CSV.
  3. Grant the “Storage Admin” role on this bucket.
  4. Create a service account credential (APIs & Services -> Credentials -> Create credentials -> Service account).
  5. After enabling this service account, add a key; this downloads a file (JSON/P12 format) containing a private key. Store it securely, because the key cannot be recovered if lost.
  6. Go to BigQuery, create a sample dataset and table, and import data into it from the Cloud Storage bucket.
  7. Next, read this table into a Spark DataFrame.
  8. After reading, you can run a query on it and write the results to a new table in BigQuery.
  9. The required parameters are projectId, bucket name, dataset, table, and the credential file, which can be Base64-encoded/decoded as shown below.
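
For step 9, here is a minimal sketch (using plain Java APIs, with a placeholder key path) of how the downloaded service account key could be Base64-encoded and decoded back in Scala:

import java.nio.file.{Files, Paths}
import java.util.Base64

object CredentialHelper {
  // Hypothetical path to the downloaded service account key file
  val keyPath = "/path/to/service-account.json"

  // Encode the raw JSON key into a Base64 string
  def encodeKey(): String =
    Base64.getEncoder.encodeToString(Files.readAllBytes(Paths.get(keyPath)))

  // Decode the Base64 string back into the original JSON text
  def decodeKey(encoded: String): String =
    new String(Base64.getDecoder.decode(encoded), "UTF-8")
}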

10. In build.sbt, add the libraries below (the artifact versions depend on the Scala version used in your application).

libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-1.9.17"

libraryDependencies += "com.google.cloud.spark" % "spark-bigquery-with-dependencies_2.11" % "0.20.0"

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.2"

assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.common.**" -> "my_conf.@1").inAll
)
libraryDependencies += "com.google.guava" % "guava" % "27.0.1-jre"

11. Create a Test.scala file (a minimal sketch follows), run it locally using the above setup, and verify that you can connect to BigQuery and perform read/write operations.
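
Here is a minimal sketch of what such a Test.scala could look like. The project, bucket, dataset, table names and the key path are placeholders; the connector's "credentials" option is passed the Base64-encoded service account key from step 9:

import java.nio.file.{Files, Paths}
import java.util.Base64

import org.apache.spark.sql.{SaveMode, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details -- replace with your own values
    val projectId  = "my-gcp-project"
    val bucketName = "my-gcs-bucket"
    val dataset    = "my_dataset"
    val table      = "my_table"
    val keyPath    = "/path/to/service-account.json"

    // Base64-encode the service account key so it can be passed as an option
    val encodedKey = Base64.getEncoder.encodeToString(Files.readAllBytes(Paths.get(keyPath)))

    val spark = SparkSession.builder()
      .appName("bigquery-connection-test")
      .master("local[*]")
      .getOrCreate()

    // Point the GCS connector at the same service account key
    val hconf = spark.sparkContext.hadoopConfiguration
    hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hconf.set("google.cloud.auth.service.account.enable", "true")
    hconf.set("google.cloud.auth.service.account.json.keyfile", keyPath)

    // Read the BigQuery table into a DataFrame
    val df = spark.read
      .format("bigquery")
      .option("parentProject", projectId)
      .option("credentials", encodedKey)
      .option("table", s"$projectId.$dataset.$table")
      .load()

    df.printSchema()
    df.show(10)

    // Run a simple query and write the result to a new BigQuery table,
    // staging the data through the GCS bucket created earlier
    val result = df.groupBy(df.columns.head).count()

    result.write
      .format("bigquery")
      .option("parentProject", projectId)
      .option("credentials", encodedKey)
      .option("temporaryGcsBucket", bucketName)
      .mode(SaveMode.Overwrite)
      .save(s"$projectId.$dataset.${table}_result")

    spark.stop()
  }
}

Running it with sbt run (or submitting the assembled jar with spark-submit) and then checking for the new table in the BigQuery console is a quick way to confirm both the read and the write path work.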

This post is meant as a quick connection test from your local environment to BigQuery on GCP. Hope this helps!
