Putting the ingestion script into Docker

Step Snap 1 [jupyter nbconvert]:

jupyter nbconvert --to=script upload_data.ipynb

jupyter nbconvert is a command-line tool from Jupyter
--to=script specifies that we want to convert to a script format (a Python script for a Python notebook)
upload_data.ipynb is the source notebook file that will be converted
The result will be a Python file named upload_data.py
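
To make the conversion less of a black box, here is a rough sketch of the core idea: a notebook is a JSON file, and a script export keeps the code cells. This is not nbconvert's actual implementation, and the real tool handles more cases. The function name notebook_code_cells and the tiny in-memory notebook are hypothetical, made up for illustration.

```python
import json


def notebook_code_cells(notebook_json: str) -> str:
    """Join the source of every code cell, skipping all other cell types."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks) + "\n"


# Hypothetical stand-in for the contents of an .ipynb file
example_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Upload data"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", "print(pd.__version__)"]},
        {"cell_type": "code", "source": "x = 1"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

script_text = notebook_code_cells(example_notebook)
print(script_text)
```

For real notebooks, the jupyter nbconvert command remains the right tool; the sketch only shows why the output is an ordinary .py file.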

Common use cases:

When you want to run your notebook code as a standalone Python script
When integrating code into other projects or systems
For version control (.py files are more git-friendly than .ipynb)

Note that nbconvert also supports other output formats like:

HTML (--to=html)
PDF (--to=pdf)
Markdown (--to=markdown)

Step Snap 2 [argparse]:

Notice that the script uses argparse. Why do we need it in a data engineering process? Is it only for formatting, or does it serve a more essential purpose? This step walks through the logic behind it.

In Data Engineering, argparse has widespread applications. Let me illustrate through several practical scenarios:

ETL Script Parameter Control

import argparse
import pandas as pd

parser = argparse.ArgumentParser(description='Data processing script')
parser.add_argument('--input_path', required=True, help='Input data path')
parser.add_argument('--output_path', required=True, help='Output data path')
parser.add_argument('--date', help='Processing date, format YYYY-MM-DD')
parser.add_argument('--mode', choices=['full', 'incremental'], default='incremental', help='Processing mode')

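args = parser.parse_args()

# Incremental mode filters on a date, so fail fast if none was given
if args.mode == 'incremental' and args.date is None:
    parser.error('--date is required when --mode is incremental')

# ETL processing logic
df = pd.read_csv(args.input_path)
if args.mode == 'incremental':
    # Keep only rows for the requested date (assumes a 'date' column of YYYY-MM-DD strings)
    df = df[df['date'] == args.date]
# Process data...
# index=False avoids writing the DataFrame index as an extra column
df.to_csv(args.output_path, index=False)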
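
To see how these arguments behave without touching any data, the same parser can be fed an explicit list in place of the real command line. This is a minimal sketch: the build_parser helper and the file names in.csv and out.csv are hypothetical, and only the argument definitions come from the script above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the parser from the ETL script so it can be exercised on its own."""
    parser = argparse.ArgumentParser(description='Data processing script')
    parser.add_argument('--input_path', required=True, help='Input data path')
    parser.add_argument('--output_path', required=True, help='Output data path')
    parser.add_argument('--date', help='Processing date, format YYYY-MM-DD')
    parser.add_argument('--mode', choices=['full', 'incremental'],
                        default='incremental', help='Processing mode')
    return parser


parser = build_parser()

# Passing a list to parse_args() simulates what would be typed in the shell
args = parser.parse_args(
    ['--input_path', 'in.csv', '--output_path', 'out.csv', '--date', '2024-01-15']
)
print(args.mode)  # default applies because --mode was not given

# Optional arguments that are omitted come back as None
full_args = parser.parse_args(
    ['--input_path', 'in.csv', '--output_path', 'out.csv', '--mode', 'full']
)
print(full_args.date)

# A value outside `choices` makes argparse print an error and exit with status 2
try:
    parser.parse_args(
        ['--input_path', 'in.csv', '--output_path', 'out.csv', '--mode', 'weekly']
    )
except SystemExit as exc:
    rejected_code = exc.code
```

The takeaway is that argparse is not cosmetic: it validates input before any data is read, so a mistyped mode or a missing path stops the job immediately.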