Mastering Data Science: Beyond the Basics - A Comprehensive Guide to ETL Pipelines (PART 2)
Table of contents
- Introduction
- Designing an ETL Architecture
- Key Steps in Designing an ETL Architecture:
- Project Setup
- Implementation Phase
- Understanding the Code Structure for an ETL Pipeline
- Setting Up Database Connection with base.py
- Defining Table Schemas with tables.py
- Creating Tables with create_tables.py
- Implementing the ETL Process
- Step 1: Data Extraction with extract.py
- Step 2: Transforming Data with transform.py
- Step 3: Loading Data with load.py
- Orchestrating the ETL Process
- Conclusion
Introduction
In the first part of our series, we explored the basics of ETL (Extract, Transform, Load) pipelines, understanding what they are and how they function. In this part, we're delving straight into the more technical aspects of building an ETL pipeline.
Designing an ETL Architecture
Before embarking on any ETL project, it's crucial to design an appropriate architecture.
What is an ETL Architecture?
An ETL architecture is a framework that outlines how data is extracted from various sources, transformed to fit operational needs, and loaded into a database for further analysis and business intelligence.
Key Steps in Designing an ETL Architecture:
1. Identify the Data Source: Recognize where your data is coming from.
2. Determine the Data Format: Ascertain the format in which you'll be extracting the data.
3. Explore the Data: Understand the data to determine the necessary transformation processes.
4. Set Up the Destination Database: Establish the database and tables where the cleaned data will be stored for data scientists and analysts to utilize.
For our project, we will be referencing the ETL diagram provided below.
Schema Design
The next step involves designing the schema. A well-structured schema is vital for efficient data organization and retrieval.
Note: I used draw.io for designing these architecture diagrams. It's a free and user-friendly tool.
Project Setup
For project creation, I recommend PyCharm for its ease of setting up a virtual environment. The following folder structure should be created, as shown in the image below:
1. Pipeline Architecture: Incorporate the schema and architecture diagrams here.
2. Raw: This will contain the unprocessed dataset.
3. Scripts:
- Subfolder 'common': contains the scripts for the database connection and table creation.
- The extract, transform, load, and execute scripts.
Additionally, include .gitignore, README.md, and requirements.txt files.
Implementation Phase
Now that everything is set up, let's proceed with the implementation.
1. Library Installation and Database Setup:
- Install the necessary libraries: SQLAlchemy (a SQL toolkit and ORM for Python), psycopg2 (a PostgreSQL driver), and pandas (a minimal requirements.txt sketch is shown just after this list).
2. Data Preparation:
- In the 'raw' folder, store the dataset to be used. For this tutorial, we're using 'genres_v2.csv', provided as a zip archive.
3. Creating Database Connections and Tables:
In the 'common' subfolder inside 'scripts', create three files:
i. base.py: for setting up the database connection.
ii. create_tables.py: for creating the tables from the metadata.
iii. tables.py: for defining the table structures using SQLAlchemy classes.
4. Scripting the ETL Process:
In the 'scripts' folder, create four scripts: extract.py, transform.py, load.py, and execute.py. These scripts correspond to the different stages of the ETL process, with the execute.py script tying everything together.
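As referenced above, here is a minimal requirements.txt sketch covering the libraries used in this guide. The exact package list follows from the scripts below; version pins are deliberately omitted, so adjust them to your environment (psycopg2-binary is used for convenience; the plain psycopg2 package also works if you have the PostgreSQL build tools installed):

```
# requirements.txt (minimal sketch; pin versions as needed for your environment)
sqlalchemy
psycopg2-binary
pandas
```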
With the architecture and schema designed and the necessary scripts and database setup in place, we are now ready to execute our ETL pipeline. In the following articles, we'll walk through each script and detail how the ETL process is orchestrated using these components.
Understanding the Code Structure for an ETL Pipeline
In this section, we delve into the core components of setting up an ETL pipeline using SQLAlchemy in Python, focusing on establishing a database connection, defining table schemas, and creating tables. These components are crucial for the Extract, Transform, and Load process, specifically tailored for handling Spotify genre data.
Setting Up Database Connection with base.py
The base.py
script is responsible for setting up the database connection using SQLAlchemy. Here's a breakdown of its components:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from sqlalchemy.orm import declarative_base
engine = create_engine("postgresql+psycopg2://admin:password@localhost:5432/spotify_genre")
session = Session(engine)
Base = declarative_base()
`create_engine`: This function initializes a connection to the database. The URL provided specifies the database type and driver (`postgresql+psycopg2`), the user (`admin`), the password (`password`), the host and port (`localhost:5432`), and the database name (`spotify_genre`).
`Session`: A Session
establishes and maintains all conversations between your program and the databases. It represents a 'holding zone' for all the objects that you've loaded or associated with it during its lifespan.
`declarative_base`: This function returns a base class for our table definitions; classes that inherit from it are mapped to database tables and can be used in queries.
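In practice you may prefer not to hard-code credentials in base.py. Below is a minimal sketch of the same setup reading the connection URL from an environment variable; the variable name SPOTIFY_GENRE_DB_URL is purely an assumption for illustration and is not part of the original project:

```python
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import Session, declarative_base

# Read the connection URL from the environment instead of hard-coding it;
# fall back to the local development database used in this tutorial.
db_url = os.environ.get(
    "SPOTIFY_GENRE_DB_URL",  # hypothetical variable name
    "postgresql+psycopg2://admin:password@localhost:5432/spotify_genre",
)

engine = create_engine(db_url)
session = Session(engine)
Base = declarative_base()
```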
Defining Table Schemas with tables.py
In tables.py, we define the structure of our database tables using SQLAlchemy's ORM capabilities:
from sqlalchemy import Column, Float, Integer, String, Identity
import common.base as b
class GenreRawAll(b.Base):
    __tablename__ = "genre_raw_all"
    danceability = Column(String(55))
    energy = Column(String(55))
    key = Column(String(55))
    loudness = Column(String(55))
    mode = Column(String(55))
    acousticness = Column(String(55))
    instrumentalness = Column(String(55))
    liveness = Column(String(55))
    valence = Column(String(55))
    tempo = Column(String(55))
    type = Column(String(55))
    id = Column(String(500))
    uri = Column(String(500))
    track_href = Column(String(500))
    analysis_url = Column(String(500))
    duration_ms = Column(String(50))
    time_signature = Column(String(255))
    genre = Column(String(255))
    track_id = Column(Integer, Identity(start=42, cycle=True), primary_key=True)
class GenreCleanAll(b.Base):
    __tablename__ = "genre_clean_all"
    danceability = Column(Float)
    energy = Column(Float)
    key = Column(Integer)
    loudness = Column(Float)
    mode = Column(Integer)
    acousticness = Column(Float)
    instrumentalness = Column(Float)
    liveness = Column(Float)
    valence = Column(Float)
    tempo = Column(Float)
    type = Column(String(255))
    id = Column(String(500))
    uri = Column(String(500))
    track_href = Column(String(500))
    analysis_url = Column(String(500))
    duration_ms = Column(Integer)
    time_signature = Column(Integer)
    genre = Column(String(255))
    track_id = Column(Integer, Identity(start=42, cycle=True), primary_key=True)
- Each class (`GenreRawAll` and `GenreCleanAll`) corresponds to a table in the database. The class attributes map to columns in the table.
`__tablename__`: Specifies the name of the table in the database.
`Column`: Represents a column in the table. The parameters (like `String`, `Integer`, `Float`) define the data type of the column.
`Identity`: Used for generating unique identifiers for new database entries, ensuring each record has a distinct `track_id`.
The distinction between `GenreRawAll` and `GenreCleanAll` illustrates the transition from raw data to processed data, adhering to the ETL pipeline's transform phase.
Creating Tables with create_tables.py
Finally, the create_tables.py script uses the definitions in tables.py to create the actual tables in the database:
from base import Base, engine
# Importing the table classes registers them on Base.metadata
from tables import GenreRawAll, GenreCleanAll

# Print the names of the tables known to the metadata
for table in Base.metadata.tables:
    print(table)

if __name__ == "__main__":
    # Create all tables registered on the metadata
    Base.metadata.create_all(engine)
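If you want to confirm that the tables were actually created, a quick check with SQLAlchemy's inspector works well. This is a minimal sketch reusing the engine from base.py; the file name check_tables.py is just an assumption for illustration, and it should be run from the 'common' folder so the import resolves:

```python
# check_tables.py (hypothetical helper, not part of the original project)
from sqlalchemy import inspect

from base import engine

# List the tables that exist in the connected database;
# we expect to see genre_raw_all and genre_clean_all.
inspector = inspect(engine)
print(inspector.get_table_names())
```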
These scripts form the backbone of our ETL pipeline's setup phase, enabling efficient data extraction, transformation, and loading processes tailored for Spotify genre analysis.
Implementing the ETL Process
The ETL process is the backbone of data preparation, enabling data analysts and scientists to work with clean, organized data. We'll break down the process into three main scripts: extract.py, transform.py, and load.py, each performing a crucial step in the ETL pipeline.
Step 1: Data Extraction with extract.py
The extract.py
script is responsible for extracting data from a zipped CSV file, cleaning it up by removing unnecessary columns, and saving the processed CSV for further transformation.
Define Paths: It starts by setting up the paths for the base project, source zip file, and destination for the raw data.
Create Directory: A helper function checks if the directory exists where the extracted data will be stored; if not, it creates it.
Extract and Clean CSV: This function extracts the CSV file from the zip archive, reads it into a Pandas DataFrame, drops unnecessary columns, and saves the cleaned data back to a CSV file in the designated raw data directory.
import os
from zipfile import ZipFile

import pandas as pd

base_path = "C:\\Users\\EM\\PycharmProjects\\SpotifyMusicRecommendation-Genre"
source_path = "C:\\Users\\EM\\Desktop\\NEWMOVE\\genres_v2.csv.zip"
raw_path = f"{base_path}/raw/data"


def create_directory_if_not_exist(path):
    """Create the target directory for the raw data if it does not exist yet."""
    os.makedirs(path, exist_ok=True)


def extract_csv(source, raw):
    create_directory_if_not_exist(raw)
    with ZipFile(source, mode='r') as f:
        # The archive contains a single CSV file
        name_list = f.namelist()
        csv_file_path = f.extract(name_list[0], path=raw)
    csv_file = pd.read_csv(csv_file_path, low_memory=False)
    # Drop columns that are not needed downstream
    csv_file = csv_file.drop(['song_name', 'Unnamed: 0', 'title'], axis=1)
    csv_file.to_csv(f'{raw_path}/genres.csv')


def main():
    print("[Extract] start")
    print("[Extract] create directory")
    extract_csv(source_path, raw_path)
    print(f"[Extract] saving data to '{raw_path}'")
    print("[Extract] end")
Step 2: Transforming Data with transform.py
The transformation phase is handled by transform.py, where the data is further processed to fit the database schema.
Imports: It includes the necessary modules for regular expressions, CSV handling, the database models, and session management.
Case Transformation: A simple function to convert strings to lowercase, ensuring consistency in textual data.
Text Cleaning: Uses regular expressions to remove unwanted characters from text, helping to normalize data fields.
Data Preparation for Load: Truncates the genre_raw_all table, reads the cleaned CSV file, applies the text cleaning and case transformation, and bulk-saves the rows in chunks of 1,000 into the raw table.
import csv
import re

from sqlalchemy import text

from common.base import session
from common.tables import GenreRawAll

base_path = "C:\\Users\\EM\\PycharmProjects\\SpotifyMusicRecommendation-Genre"
raw_path = f"{base_path}/raw/data/genres.csv"


def transform_case(string):
    """Normalize text to lowercase."""
    return string.lower()


def clean_text(string_input):
    """Remove quotes, brackets, underscores, digits and other unwanted characters."""
    return re.sub(r"['\"\[\]_()*$\d+/]", "", string_input)


def truncate_table(table):
    """Remove any old data so the raw table only holds the latest snapshot."""
    session.execute(
        text(f'TRUNCATE TABLE {table} RESTART IDENTITY CASCADE;')
    )
    session.commit()


def divide_chunks(l, n):
    """Yield successive chunks of size n from the list l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def transform_new_data():
    with open(raw_path, mode='r', encoding='utf8') as csv_file:
        reader = csv.DictReader(csv_file)
        lines = list(reader)

    genre_raw_objects = []
    # Process the rows in chunks of 1,000 and map each CSV row to a GenreRawAll object
    for chunk in divide_chunks(lines, 1000):
        for row in chunk:
            genre_raw_objects.append(
                GenreRawAll(
                    danceability=row['danceability'],
                    energy=row['energy'],
                    key=row['key'],
                    loudness=row['loudness'],
                    mode=row['mode'],
                    acousticness=row['acousticness'],
                    instrumentalness=row['instrumentalness'],
                    liveness=row['liveness'],
                    valence=row['valence'],
                    tempo=row['tempo'],
                    type=row['type'],
                    id=row['id'],
                    uri=row['uri'],
                    track_href=row['track_href'],
                    analysis_url=row['analysis_url'],
                    duration_ms=row['duration_ms'],
                    time_signature=row['time_signature'],
                    # Apply the case and text-cleaning helpers to the genre label
                    genre=clean_text(transform_case(row['genre'])),
                )
            )

    print("len", len(genre_raw_objects))
    session.bulk_save_objects(genre_raw_objects)
    session.commit()


def main():
    print("[Transform] start")
    print("[Transform] remove any old data from genre_raw_all table")
    truncate_table('genre_raw_all')
    print("[Transform] transform new data available, run transformation")
    transform_new_data()
    print("[Transform] end")
Step 3: Loading Data with load.py
Finally, load.py
manages the loading of transformed data into the database, completing the ETL process.
Data Type Casting and Insertion: This script is responsible for casting data types appropriately and inserting or updating records in the database using SQLAlchemy.
Inserting Transformed Data: It selects data from the transformed dataset, casts each field to the correct data type, and inserts the records into the database.
Deleting Stale Data: Identifies and removes any records from the database that are not present in the latest dataset, keeping the database up-to-date.
from sqlalchemy import cast, Float, Integer, delete, String
from sqlalchemy.dialects.postgresql import insert
from common.base import session
from common.tables import GenreRawAll, GenreCleanAll
def insert_tracks():
    # select track id
    clean_track_id = session.query(GenreCleanAll.track_id)

    # select columns and cast appropriate type when needed
    tracks_to_insert = session.query(
        cast(GenreRawAll.danceability, Float),
        cast(GenreRawAll.energy, Float),
        cast(GenreRawAll.key, Integer),
        cast(GenreRawAll.loudness, Float),
        cast(GenreRawAll.mode, Integer),
        cast(GenreRawAll.acousticness, Float),
        cast(GenreRawAll.instrumentalness, Float),
        cast(GenreRawAll.liveness, Float),
        cast(GenreRawAll.valence, Float),
        cast(GenreRawAll.tempo, Float),
        cast(GenreRawAll.type, String),
        cast(GenreRawAll.id, String),
        cast(GenreRawAll.uri, String),
        cast(GenreRawAll.track_href, String),
        cast(GenreRawAll.analysis_url, String),
        cast(GenreRawAll.duration_ms, Integer),
        cast(GenreRawAll.time_signature, Integer),
        cast(GenreRawAll.genre, String),
    ).filter(~GenreRawAll.track_id.in_(clean_track_id))

    # print number of rows to insert
    print("Tracks to insert: ", tracks_to_insert.count())

    columns = [
        'danceability', 'energy', 'key', 'loudness', 'mode', 'acousticness', 'instrumentalness', 'liveness',
        'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature',
        'genre'
    ]

    # insert the selected rows directly from the query (INSERT ... FROM SELECT)
    stmt = insert(GenreCleanAll).from_select(columns, tracks_to_insert)
    session.execute(stmt)
    session.commit()


def delete_tracks():
    """
    Delete operation: delete any row not present in the last snapshot
    """
    raw_track_id = session.query(GenreRawAll.track_id)
    tracks_to_delete = session.query(GenreCleanAll).filter(~GenreCleanAll.track_id.in_(raw_track_id))

    # print number of rows to delete
    print("Tracks to delete: ", tracks_to_delete.count())

    tracks_to_delete.delete(synchronize_session=False)
    session.commit()


def main():
    print("[Load] Start")
    print("[Load] Inserting new rows")
    insert_tracks()
    print("[Load] Deleting rows not available in the new transformed data")
    delete_tracks()
    print("[Load] End")
Orchestrating the ETL Process
An execute.py
script is used to orchestrate the ETL process, ensuring each step is executed in the correct order: extraction, transformation, and loading.
import extract
import transform
import load
if __name__ == "__main__":
    # run extract
    extract.main()
    # run transform
    transform.main()
    # run load
    load.main()
Conclusion
Through these scripts, we've demonstrated how to implement an ETL pipeline for Spotify genre data, showcasing each step from extraction to loading. This process is crucial for preparing data for in-depth analysis and insights discovery.
For those interested in further exploring this project or trying it with a larger dataset, refer to the GitHub repository links provided.
This guide offers a practical approach to ETL processes, empowering data professionals to handle and prepare data efficiently for analytical purposes.
Stay connected for more insights and tutorials as we navigate the ever-evolving landscape of data science. Your journey to becoming a data expert has only just begun. See you soon!