Read files from Google Cloud Storage Bucket using local PySpark and Jupyter Notebooks

Kashif Sohail
4 min read · Jun 28, 2020

This tutorial is a step-by-step guide to reading files from a Google Cloud Storage bucket in a locally hosted Spark instance using PySpark and Jupyter Notebooks.

Google Cloud Storage (GCS) is a distributed object storage service offered by Google Cloud Platform. It has features like multi-region support, multiple storage classes, and, above all, encryption support, so developers and enterprises can use GCS according to their needs.

Many organizations around the world that use Google Cloud store their files in Google Cloud Storage. These files can come in a variety of formats, such as CSV, JSON, images, and videos, and they live in a container called a bucket. A bucket is just like a drive, and it has a globally unique name. Each account/organization may have multiple buckets. You can manage access using Google Cloud IAM.

GCS can be managed through different tools such as the Google Cloud Console, gsutil (Cloud Shell), REST APIs, and client libraries available for a variety of programming languages (C++, C#, Go, Java, Node.js, PHP, Python, and Ruby).
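As a quick illustration of the client-library route, the official Python package (google-cloud-storage, installed separately) can list the objects in a bucket in a few lines. This is only a sketch; the bucket name is a placeholder and it assumes credentials are already available to the library (for example through the GOOGLE_APPLICATION_CREDENTIALS environment variable, which we set up later in this article).

from google.cloud import storage

# Assumes credentials are discoverable by the library,
# e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = storage.Client()

# "my_bucket" is a placeholder; use your own bucket name.
for blob in client.list_blobs("my_bucket"):
    print(blob.name)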

Google Cloud offers a managed service called Dataproc for running Apache Spark and Apache Hadoop workloads in the cloud. Dataproc has out-of-the-box support for reading files from Google Cloud Storage.

Things are a bit trickier if you are not reading files via Dataproc. In this tutorial, we will use a locally deployed Apache Spark installation to access data in Google Cloud Storage. Here are the details of my experiment setup:

  • OS: Red Hat Enterprise Linux Server 7.7 (Maipo)
  • Apache Spark: 2.4.4
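If you want to confirm which Spark version your local installation exposes to Python, a quick check from a notebook cell (assuming pyspark is importable in that environment) looks like this:

import pyspark
print(pyspark.__version__)  # should print 2.4.4 for the setup used here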

Setup in Google Cloud

First of all, you need a Google Cloud account; create one if you don't have it already. Google Cloud offers a $300 free trial. Navigate to the Google Cloud Storage browser and check whether a bucket is present; create one if you don't have any, and upload some text files to it.

Create a Service Account

To access Google Cloud services programmatically, you need a service account and credentials.

Open Google Cloud Console, go to Navigation menu > IAM & Admin, select Service accounts and click on + Create Service Account.

In step 1, enter a suitable name for the service account and click Create.

Creating a new service account

In step 2, you need to assign roles to this service account. Assign Storage Object Admin to the newly created service account.

Assigning a role to the service account.

Now you need to generate a JSON credentials file for this service account. Go to the service accounts list, click the options menu on the right, and then click on Generate key. Select JSON as the key type and click Create.

A JSON file will be downloaded. Keep this file in a safe place, as it grants access to your cloud services. Remember its path, as we will need it later.
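As an aside, many Google client libraries (including the one sketched earlier) pick up this key file automatically if you expose it through the GOOGLE_APPLICATION_CREDENTIALS environment variable. One way to set it from Python, with a placeholder path, is:

import os

# Placeholder path; point this at the JSON key you just downloaded.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"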

Cloud Storage Connector

Apache Spark doesn’t have out of the box support for Google Cloud Storage, we need to download and add the connector separately. It is a jar file, Download the Connector. Now go to shell and find the spark home directory.

$ echo $SPARK_HOME
Output: /usr/local/spark

Copy the downloaded jar file to $SPARK_HOME/jars/ directory.
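If you would rather not copy files into the Spark installation, an alternative is to point Spark at the connector jar when building the session. This is a sketch; the jar path below is a placeholder for wherever you saved the download.

from pyspark.sql import SparkSession

# "/path/to/gcs-connector-hadoop2-latest.jar" is a placeholder;
# use the actual location of the jar you downloaded.
spark = (
    SparkSession.builder
    .appName("GCSFilesRead")
    .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
    .getOrCreate()
)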

Now everything is set for development. Let's move to the Jupyter Notebook and write the code to finally access the files.

PySpark Code Example

First of all, initialize a Spark session, just as you would normally. The simplest way is shown below.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GCSFilesRead').getOrCreate()

Now Spark has loaded the GCS file system and you can read data from GCS. You need to provide credentials in order to access your desired bucket.

spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","<path_to_your_credentials_json>")
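Depending on your connector and Hadoop versions, you may also need to register the gs:// filesystem implementation and enable service-account authentication explicitly. The properties below are standard GCS connector Hadoop settings, shown here as a sketch in case the keyfile setting alone is not enough in your environment:

conf = spark._jsc.hadoopConfiguration()
# Register the GCS filesystem implementations for the gs:// scheme.
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
# Use service-account credentials for authentication.
conf.set("google.cloud.auth.service.account.enable", "true")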

Now everything is set and we are ready to read the files. Suppose I have a CSV file (sample.csv) placed in a folder (data) inside my GCS bucket, and I want to read it into a PySpark DataFrame. I'll build the path to the file like this:

bucket_name="my_bucket"
path=f"gs://{bucket_name}/data/sample.csv"

The following piece of code will read data from files placed in your GCS bucket, and the result will be available in the variable df. You can read a whole folder or multiple files, and use wildcard paths, as per Spark's default functionality. All you need to do is put "gs://" as a prefix to the path of your files/folders in the GCS bucket.

df=spark.read.csv(path, header=True)
df.show()
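For example, to read every CSV under the data folder in one go, a wildcard path works the same way (the folder layout here is the one assumed above):

# Read all CSV files in the data folder at once.
df_all = spark.read.csv(f"gs://{bucket_name}/data/*.csv", header=True)
df_all.show()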

Beware of the Cost

When you are using a public cloud platform, there is always a cost associated with data transfer outside the cloud. See the Google Cloud Storage pricing for details.


Kashif Sohail

Data Engineer with more than 7 years of experience having exposure to fintech, contact center, music streaming, and ride-hail/delivery industries.