By default, Spark automatically handles compressed files based on their file extension, including formats like .gz, .bz2, and .lz4. (Note that .zip archives are not supported by the built-in text readers and require special handling.)
Specifically, the compression codec is inferred from the file extension, so no extra configuration is needed. Here's how to use it:
# Directly read a compressed CSV file
df = spark.read \
    .option("header", "true") \
    .csv("path/to/your/file.csv.gz")

# Or read all compressed files in a directory
df = spark.read \
    .option("header", "true") \
    .csv("path/to/directory/*.csv.gz")
This is a convenient feature that makes handling large compressed data files simple: you don't need to decompress files before reading them, because Spark handles decompression automatically. One caveat: gzip files are not splittable, so each .gz file is read by a single task; for very large inputs, prefer a splittable codec such as bzip2, or store the data as many smaller files.
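To see conceptually what Spark's reader is doing, here is a minimal plain-Python sketch (no Spark required): the .gz suffix signals the codec, and the file is decompressed transparently while being parsed as CSV. The file path and sample rows are made up for illustration.

```python
import csv
import gzip
import tempfile

# Create a small hypothetical gzipped CSV file for the demonstration.
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as f:
    path = f.name
    f.write(gzip.compress(b"id,name\n1,alice\n2,bob\n"))

# gzip.open decompresses transparently as the file is read, analogous
# to how Spark infers the codec from the .gz extension.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

The same idea scales up in Spark: the reader sees the extension, picks the matching codec, and streams decompressed records to the CSV parser.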
Let's break down this shell script line by line:
Step 1: Error handling & variables
set -e          # exit immediately if any command fails
TAXI_TYPE=$1    # first argument, e.g. "yellow"
YEAR=$2         # second argument, e.g. 2020