Comprehensive Analysis of Retail Sales Data Part 3 - Data Preprocessing and Integration (Pandas, SQL, Python Script)
Data preprocessing is a critical step in any data analysis project. It involves transforming raw data into a clean, structured format suitable for analysis. Raw datasets often contain inconsistencies, missing values, outliers, and irrelevant information that can compromise the quality and reliability of insights or model predictions.

On this page, I will walk through the following steps:

- Using Pandas to examine each column for missing values, validate data integrity, identify patterns, and prepare the dataset for further processing.
- Utilizing SQL to create new tables for date-related and invoice-level metrics, as well as analyzing trends and patterns in cancellations.
- Collecting external data with Python scripts from sources such as the Nager.Date Public Holidays API and Wikipedia, gathering information on holidays and significant events.
The dataset represents transactional data from a UK-based non-store online retailer. It contains records of all transactions between December 1, 2010 and December 9, 2011. The company specializes in unique, all-occasion gift items, with a significant proportion of its customers being wholesalers.
Key Characteristics:

- Type of Data: Multivariate, Sequential, Time-Series
- Number of Instances (Rows): 541,909 transactions
- Number of Attributes (Columns): 8
- Attribute Types: Integer, Real, Nominal
- Missing Values: Present in CustomerID and Description
- Source: UCI Machine Learning Repository
Attribute Details:

- InvoiceNo: A 6-digit unique identifier for each transaction. Codes starting with "C" indicate canceled transactions.
- StockCode: A 5-digit code uniquely assigned to each product. Some codes also include 1 to 3 trailing letters.
- Description: The name and properties of the product. Missing in 1,454 entries.
- Quantity: The number of units sold per transaction (integer). Negative values indicate cancellations.
- InvoiceDate: The date and time when the transaction occurred.
- UnitPrice: The price of each product unit (numeric, in GBP).
- CustomerID: A unique identifier for each customer (5-digit nominal number). Missing in 135,080 entries.
- Country: The country of residence of the customer. Unspecified in some entries.
In this section, data preprocessing is performed using Pandas to clean and prepare the dataset for analysis. The order in which the 8 columns are handled is crucial, as changes to one column can affect the others. For example, the Description column was handled last, because earlier preprocessing steps (such as removing invalid or irrelevant data) naturally eliminate many invalid descriptions.

Deciding what constitutes meaningful data and what should be considered invalid is typically a collaborative decision with stakeholders. Since I did not have direct access to them, I made educated guesses about what to keep and what to exclude, based on patterns in the data and the intended analysis goals.
Column 1: InvoiceNo

1. Identifying Canceled Transactions: Creating a new column, Cancelled, to mark canceled transactions based on whether the InvoiceNo starts with the letter "C". We have 3,422 canceled transactions.
2. Standardizing the Format: Extracting the numeric part of InvoiceNo and converting it to an integer, as sketched below.
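A minimal Pandas sketch of both steps. The file name and encoding are assumptions, and the sketch strips the "C" prefix from every row, since the Cancelled flag already preserves that information:

```python
import pandas as pd

# Load the raw dataset; file name and encoding are assumptions for this sketch
df = pd.read_csv("transactions.csv", encoding="ISO-8859-1", dtype={"InvoiceNo": str})

# Step 1: flag canceled transactions (InvoiceNo starting with "C")
df["Cancelled"] = df["InvoiceNo"].str.startswith("C")
print(df["Cancelled"].sum())  # expected: 3422

# Step 2: keep only the numeric part of InvoiceNo and convert it to an integer
df["InvoiceNo"] = df["InvoiceNo"].str.lstrip("C").astype(int)
```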
Column 2: StockCode

1. Identifying Invalid StockCode Patterns: A valid StockCode has five digits, optionally followed by up to two letters (case insensitive). Do we have stock codes that fall outside this pattern? It looks like we do!
2. Removing Invalid StockCodes: All of the above are removed, except for the few out-of-pattern stock codes that are nonetheless valid products (such as "DCGS0076" - "SUNJAR LED NIGHT NIGHT LIGHT" and "DCGS0003" - "BOXED GLASS ASHTRAY"). In total, we have 23 invalid StockCodes. The check might look like the sketch below.
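A possible Pandas implementation, where the regular expression encodes the valid pattern and the whitelist holds the known-valid exceptions mentioned above:

```python
# Valid pattern: exactly five digits, optionally followed by up to two letters
matches_pattern = df["StockCode"].astype(str).str.fullmatch(r"\d{5}[A-Za-z]{0,2}")

# Out-of-pattern codes that are nonetheless valid products
whitelist = {"DCGS0076", "DCGS0003"}

is_valid = matches_pattern | df["StockCode"].isin(whitelist)
print(df.loc[~is_valid, "StockCode"].unique())  # expected: 23 invalid codes

df = df[is_valid].copy()
```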
Column 3: Quantity

1. Identifying Invalid Negative Quantities: Rows where Quantity is negative but the transaction is not canceled (only canceled transactions should have negative Quantity values). In total, we have 1,325 such invalid rows.
2. Removing Invalid Negative Quantities: Inspection shows that all invalid negative Quantities belong to NaN CustomerIDs; they are also all priced at zero and lack a meaningful Description, so they can be removed, as in the sketch below.
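Expressed as a boolean mask:

```python
# Negative quantities are only legitimate for canceled transactions
invalid_neg = (df["Quantity"] < 0) & ~df["Cancelled"]
print(invalid_neg.sum())  # expected: 1325

# These rows also have NaN CustomerID, zero UnitPrice, and no meaningful
# Description, so they can be dropped outright
df = df[~invalid_neg].copy()
```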
Column 4: InvoiceDate

1. Datetime Conversion: Converting InvoiceDate from String to a proper datetime format.
2. Checking for Missing Values: Identifying rows where InvoiceDate is null after the conversion.
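A sketch of both steps; errors="coerce" turns any unparseable value into NaT so it can be counted afterwards:

```python
# Convert the string column to a proper datetime type
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")

# Any rows that failed to parse now show up as NaT
print(df["InvoiceDate"].isna().sum())
```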
Column 5: CustomerID

1. Replacing null values with "UNKNOWN".
2. Removing any decimal points: since the column was originally of type float, we must ensure the ".0" suffix is no longer included.
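Both steps in one chained expression:

```python
# Fill missing IDs, cast to string, and strip the float-derived ".0" suffix
df["CustomerID"] = (
    df["CustomerID"]
    .fillna("UNKNOWN")
    .astype(str)
    .str.replace(r"\.0$", "", regex=True)
)
```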
Column 6: UnitPrice

1. Checking for Zero Values: Free items to non-UNKNOWN customers? 13,878 items (34 unique items) were given to 25 customers for free.
2. Checking for Zero Values: Free items to "UNKNOWN" customers with uncommon descriptions? (A common Description is all-uppercase, which is the typical pattern.)
3. Removing Invalid Zero Values: It is one thing to gift 0-priced items to "UNKNOWN" customers when the rows carry meaningful descriptions. Here, however, the Description column clearly shows that these rows (price = 0 and customer = "UNKNOWN") do not hold meaningful values and must be removed. Examples: (Description: "NaN", "amazon", "Found", and "wrongly sold (22719) barcode"). In total, we have 743 such invalid rows. The checks are sketched below.
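A sketch of these checks; the all-uppercase heuristic for "common" descriptions and the exact aggregations behind the counts are my assumptions:

```python
# Free items given to identified (non-UNKNOWN) customers
free_known = df[(df["UnitPrice"] == 0) & (df["CustomerID"] != "UNKNOWN")]
print(free_known["Quantity"].sum(),        # expected: 13878 items (total units, assumed)
      free_known["StockCode"].nunique(),   # expected: 34 unique items
      free_known["CustomerID"].nunique())  # expected: 25 customers

# "Common" descriptions follow the typical all-uppercase pattern;
# missing descriptions count as uncommon
desc_is_common = df["Description"].fillna("").str.isupper()

# Zero-priced rows for UNKNOWN customers with uncommon descriptions
invalid_zero = (
    (df["UnitPrice"] == 0)
    & (df["CustomerID"] == "UNKNOWN")
    & ~desc_is_common
)
print(invalid_zero.sum())  # expected: 743
df = df[~invalid_zero].copy()
```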
Column 7: Description

1. Checking Consistency with StockCode: Identifying StockCodes with more than one unique description. We have 225 StockCodes with more than one unique Description value.
2. Reconciliation Using the Most Frequent Description: Replacing inconsistent descriptions with the most frequent description for each StockCode, as sketched below.
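A compact way to do this with groupby, using the mode as the "most frequent" description:

```python
# StockCodes mapped to more than one distinct Description
desc_counts = df.groupby("StockCode")["Description"].nunique()
print((desc_counts > 1).sum())  # expected: 225

# Reconcile: map every StockCode to its most frequent Description
most_frequent = df.groupby("StockCode")["Description"].agg(
    lambda s: s.mode().iloc[0]
)
df["Description"] = df["StockCode"].map(most_frequent)
```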
Column 8: Country

1. Checking for Invalid Country Names: It looks like all are valid!
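A quick inspection is enough here, for example:

```python
# Inspect the distinct country values for anything suspicious
print(sorted(df["Country"].unique()))

# The dataset uses the literal value "Unspecified" where the country is unknown
print((df["Country"] == "Unspecified").sum())
```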
Data Overview

After preprocessing, the table is now structured as described above. Below are the unique value counts and a summary of the Transactions dataset:
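These summaries can be produced directly in Pandas:

```python
# Unique value counts per column
print(df.nunique())

# Summary statistics for the numeric columns
print(df.describe())
```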
This section outlines the creation and population of the new tables Date, InvoiceMetrics, and Holidays in the database to facilitate data analysis. Using MS SQL Server for querying, along with Python scripts for web scraping, these tables are designed to extract and organize specific features from the original Transactions table or gather new data from the internet.
To ensure consistency, reusability, and maintainability, the population of the SQL tables is implemented in the form of stored procedures. By encapsulating the logic within stored procedures, the data preprocessing steps can be executed efficiently and repeatedly without manual intervention.
Furthermore, a dedicated SQL query is crafted to analyze patterns in cancellations.
Table Creation

Designing structured tables: Date, which captures detailed time-based attributes, and InvoiceMetrics, which summarizes invoice-level metrics.
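The original DDL is not reproduced here; below is a minimal sketch of what the two definitions might look like in T-SQL, with column names inferred from the attributes mentioned on this page:

```sql
-- Date: detailed time-based components of each InvoiceDate
CREATE TABLE [Date] (
    InvoiceDate DATETIME    NOT NULL PRIMARY KEY,
    [Year]      INT         NOT NULL,
    [Month]     INT         NOT NULL,
    Season      VARCHAR(10) NOT NULL,
    TimeOfDay   VARCHAR(10) NOT NULL
);

-- InvoiceMetrics: one summary row per invoice
CREATE TABLE InvoiceMetrics (
    InvoiceNo  INT           NOT NULL PRIMARY KEY,
    TicketSize DECIMAL(18,2) NOT NULL,  -- total invoice value (assumed definition)
    BasketSize INT           NOT NULL   -- total units per invoice (assumed definition)
);
```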
PROCEDURE: InsertIntoDateTable

A procedure that populates the Date table, a structured table that breaks down each InvoiceDate into detailed components such as Season, Month, Year, Time of Day, and so on.
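A sketch of such a procedure; the season and time-of-day boundaries are assumptions:

```sql
CREATE PROCEDURE InsertIntoDateTable
AS
BEGIN
    INSERT INTO [Date] (InvoiceDate, [Year], [Month], Season, TimeOfDay)
    SELECT DISTINCT
        t.InvoiceDate,
        YEAR(t.InvoiceDate),
        MONTH(t.InvoiceDate),
        CASE
            WHEN MONTH(t.InvoiceDate) IN (12, 1, 2) THEN 'Winter'
            WHEN MONTH(t.InvoiceDate) IN (3, 4, 5)  THEN 'Spring'
            WHEN MONTH(t.InvoiceDate) IN (6, 7, 8)  THEN 'Summer'
            ELSE 'Autumn'
        END,
        CASE
            WHEN DATEPART(HOUR, t.InvoiceDate) < 12 THEN 'Morning'
            WHEN DATEPART(HOUR, t.InvoiceDate) < 18 THEN 'Afternoon'
            ELSE 'Evening'
        END
    FROM Transactions AS t;
END;
```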
PROCEDURE: InsertIntoInvoiceMetrics

A procedure that populates the InvoiceMetrics table by deriving key metrics for each invoice in the Transactions table. This table consolidates essential invoice-level metrics such as TicketSize and BasketSize.
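A sketch, assuming TicketSize is the total invoice value and BasketSize the total number of units:

```sql
CREATE PROCEDURE InsertIntoInvoiceMetrics
AS
BEGIN
    INSERT INTO InvoiceMetrics (InvoiceNo, TicketSize, BasketSize)
    SELECT
        t.InvoiceNo,
        SUM(t.Quantity * t.UnitPrice),  -- TicketSize (assumed definition)
        SUM(t.Quantity)                 -- BasketSize (assumed definition)
    FROM Transactions AS t
    GROUP BY t.InvoiceNo;
END;
```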
Finding Patterns in Cancellations

Analyzing the relationship between canceled and original transactions reveals that, out of 3,422 canceled transactions, 225 have a matching counterpart. This means that for these 225 transactions, there is an exact match in terms of basket items, customer, and ticket size, with one transaction being canceled and the other not. However, no such relationship exists for the remaining 3,197 canceled transactions. Given the lack of consistent patterns in the majority of cases, it is not feasible to derive actionable insights from this analysis.
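One way to search for such counterparts is a self-join on customer and mirrored invoice metrics; the join keys below are my assumptions about what counts as an "exact match":

```sql
-- Canceled invoices with a non-canceled counterpart: same customer,
-- and ticket/basket size mirroring the canceled (negative) values
SELECT DISTINCT
    c.InvoiceNo AS CanceledInvoice,
    o.InvoiceNo AS OriginalInvoice
FROM Transactions AS c
JOIN InvoiceMetrics AS cm ON cm.InvoiceNo = c.InvoiceNo
JOIN Transactions AS o
    ON o.CustomerID = c.CustomerID AND o.Cancelled = 0
JOIN InvoiceMetrics AS om
    ON om.InvoiceNo = o.InvoiceNo
   AND om.TicketSize = -cm.TicketSize
   AND om.BasketSize = -cm.BasketSize
WHERE c.Cancelled = 1;
```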
Web Scraping with Python: Holidays

Creating a new Holidays table that includes public holidays for the countries in the original Transactions table (2010–2011), sourced from the Nager.Date API. Global holidays and major events are then added manually from Wikipedia, ensuring comprehensive coverage of significant dates across all listed countries.
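A minimal sketch of the API call; the country codes shown are illustrative (the real list is derived from the Country column):

```python
import pandas as pd
import requests

years = [2010, 2011]
country_codes = ["GB", "FR", "DE"]  # illustrative subset

holidays = []
for year in years:
    for code in country_codes:
        # Nager.Date v3 public-holidays endpoint
        url = f"https://date.nager.at/api/v3/PublicHolidays/{year}/{code}"
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        for h in resp.json():
            holidays.append(
                {"Date": h["date"], "Country": code, "HolidayName": h["name"]}
            )

holidays_df = pd.DataFrame(holidays)
# holidays_df can then be bulk-inserted into the Holidays table
```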