How to read Compressed CSV files from S3 using local PySpark and Jupyter notebook

Kashif Sohail
2 min read · Dec 9, 2019

This tutorial is a step-by-step guide to configuring a Spark instance that you have deployed yourself, whether on an EC2 instance, on a virtual machine hosted in the cloud, or in a local environment.

CSV and JSON are the most commonly used formats in ETL processes. Usually a source system generates CSV files at some defined interval, and these are uploaded to a remote FTP server or a cloud storage service such as AWS S3. The host system processes the CSV files periodically and loads them into the destination data warehouse or data lake. Since CSV is a plain-text format, it is a good idea to compress the files before sending them to remote storage; gzip is widely used for this.

Data engineers prefer to process files stored in an AWS S3 bucket with Spark on an EMR cluster as part of their ETL pipelines. Spark on EMR has built-in support for reading data from AWS S3: you don't need to configure anything beyond the bucket name, access key ID, and secret access key, and you are ready to read and write files from S3. If you are not using EMR and have configured Spark on your own (EC2, a virtual machine, a local computer), reading these files is not as straightforward.

Let’s do it step by step:

First of all, download the hadoop-aws JAR from its official Maven repository.

You need to download a version that matches your Spark installation, or you will run into version-mismatch issues between Spark and the hadoop-aws library. I have tested Spark 2.3.3 with hadoop-aws 2.7.3.
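
If you are not sure which Hadoop build your Spark installation targets, a quick check in a throwaway PySpark session (shown below; the app name is arbitrary) can help you pick a matching hadoop-aws version:

from pyspark.sql import SparkSession

# Print the Spark version and the Hadoop version it was built against,
# so a matching hadoop-aws JAR can be chosen.
spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print("Spark version: ", spark.version)
print("Hadoop version:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()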

Find the Spark home directory using the following command:

$ echo $SPARK_HOME
Output: /usr/local/spark

Copy the downloaded JAR file to the $SPARK_HOME/jars/ directory.
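
If you prefer to do the copy from Python instead of the shell, a minimal sketch looks like this (the download path is only an example; adjust it to wherever you saved the JAR):

import os
import shutil

# Copy the downloaded hadoop-aws JAR into Spark's jars directory.
jar_path = "/home/user/Downloads/hadoop-aws-2.7.3.jar"  # example path, adjust as needed
shutil.copy(jar_path, os.path.join(os.environ["SPARK_HOME"], "jars"))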

Now start your notebook server and create a new Python 3 notebook.

First of all, you need to load the hadoop-aws binaries. Place the following code at the top of your notebook and execute it.

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"

If you are using a different version of hadoop-aws binaries, replace 2.7.3 with that version number.

The above line of code must be executed before creating the Spark session; it makes the library load at the time the Spark session is created.

Create a Spark session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('S3CSVRead').getOrCreate()
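
As an alternative to setting PYSPARK_SUBMIT_ARGS, the same hadoop-aws dependency can be declared on the builder through the spark.jars.packages property; this too must happen before the session (and its JVM) is first created. A sketch:

from pyspark.sql import SparkSession

# Declare the hadoop-aws dependency on the builder instead of PYSPARK_SUBMIT_ARGS.
spark = (SparkSession.builder
         .appName('S3CSVRead')
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())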

In order to access files from S3, we need credentials. Usually these consist of an access key ID and a secret access key.

accessKeyId = "your-access-key-id"
secretAccessKey = "your-secret-access-key"

Register the S3 filesystem implementation in the Hadoop configuration.

# Tell Hadoop which filesystem implementation handles s3a:// URIs
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

Load the credentials into the Hadoop configuration.

spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", accessKeyId)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secretAccessKey)
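
As an aside, hadoop-aws 2.7.x also ships the newer S3A filesystem implementation, which uses its own credential property names. If you prefer that variant over the configuration used in this walkthrough, a minimal sketch would be:

# Use the S3A implementation and its own credential properties instead.
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", accessKeyId)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secretAccessKey)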

Note: It is never a good idea to put credentials directly in code. Always load them from a configuration file or from environment variables.
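
For example, assuming you have exported the standard AWS environment variables before starting the notebook server, you can read them like this:

import os

# Read the credentials from the standard AWS environment variables
# rather than hard-coding them in the notebook.
accessKeyId = os.environ["AWS_ACCESS_KEY_ID"]
secretAccessKey = os.environ["AWS_SECRET_ACCESS_KEY"]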

Now you are all set to read files from S3.

df = spark.read.csv("s3a://bucket-name/path/to/file", header=True)

Show the loaded data.

df.show()
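
Since the files in this scenario are gzip-compressed, note that Spark (through Hadoop's compression codecs) decompresses them transparently based on the file extension, so reading a .csv.gz object looks exactly the same; the path below is only a placeholder:

# Gzip-compressed CSVs are decompressed automatically based on the extension.
df = spark.read.csv("s3a://bucket-name/path/to/file.csv.gz", header=True)
df.show()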
