Pandas DataFrames with DuckDB
Pandas DataFrames with DuckDB, Pandas is widely recognized as one of the most versatile Python libraries for handling structured data.
If you’re already familiar with SQL, you can harness the power of DuckDB to query Pandas DataFrames seamlessly.
Pandas DataFrames with DuckDB
In this article, you will discover how to:
- Utilize DuckDB to query Pandas DataFrames
- Efficiently perform aggregations and filtering
- Conduct joins and window functions within DataFrames
Let’s dive into the details!
Getting Started with DuckDB and Pandas
Before we jump in, ensure that you have DuckDB and Pandas installed:
You can set up your Python development environment or use our Google Colab notebook to follow along.
Creating a Sample Dataset
Let’s create a sample dataset that simulates customer transactions:
import pandas as pd
import duckdb
# Create a sample DataFrame (mimicking real-world shopping data)
data = {
"order_id": [101, 102, 103, 104, 105, 106, 107, 108],
"customer_name": ["Ethan", "Harper", "Lucas", "Mia", "James", "Amelia", "Henry", "Charlotte"],
"product": ["Smartphone", "Smartwatch", "Blender", "TV", "Microwave", "Laptop", "Camera", "Headphones"],
"category": ["Electronics", "Electronics", "Home", "Electronics", "Home", "Electronics", "Electronics", "Accessories"],
"price": [899, 250, 120, 1500, 200, 1300, 700, 180],
"quantity": [2, 4, 3, 2, 2, 1, 3, 4],
"order_date": pd.to_datetime(["2024-03-05 11:10", "2024-03-06 13:50", "2024-03-06 07:30",
"2024-03-07 19:15", "2024-03-08 10:45", "2024-03-09 14:25",
"2024-03-10 09:40", "2024-03-11 16:10"]),
"payment_method": ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"]
}
df = pd.DataFrame(data)
# Display first few rows
df.head()
This DataFrame represents customer orders with essential details including products, categories, prices, and payment methods.
Using DuckDB to Query Pandas DataFrames
With DuckDB, you can easily query a Pandas DataFrame using SQL syntax as if it were a traditional database table.
1. Select All Data from the DataFrame
To fetch all data from the DataFrame, use:
duckdb.query("SELECT * FROM df").df()
This will return the complete dataset in a SQL table format.
2. Total Revenue Calculation
Calculating the total revenue is straightforward. Use the following query:
duckdb.query("""
SELECT SUM(price * quantity) AS total_revenue
FROM df
""").df()
This query calculates the total sales revenue from all orders and yields:
total_revenue
0 10678.0
3. Finding the Most Popular Products
To identify the best-selling products, execute this SQL command:
duckdb.query("""
SELECT product, SUM(quantity) AS total_quantity
FROM df
GROUP BY product
ORDER BY total_quantity DESC
LIMIT 5
""").df()
This provides a ranking of products by total quantity sold.
product | total_quantity |
---|---|
Smartphone | 2.0 |
Smartwatch | 4.0 |
Blender | 3.0 |
Camera | 3.0 |
Headphones | 4.0 |
Analyzing Customer Behavior
Identifying Top Spending Customers
To discover which customers are spending the most, use:
duckdb.query("""
SELECT customer_name, SUM(price * quantity) AS total_spent
FROM df
GROUP BY customer_name
ORDER BY total_spent DESC
LIMIT 5
""").df()
This provides the list of top spenders:
customer_name | total_spent |
---|---|
Mia | 3000.0 |
Henry | 2100.0 |
Ethan | 1798.0 |
Amelia | 1300.0 |
Harper | 1000.0 |
Analyzing Payment Methods
To find out which payment method is most popular, run:
duckdb.query("""
SELECT payment_method, COUNT(*) AS order_count
FROM df
GROUP BY payment_method
ORDER BY order_count DESC
""").df()
This gives you insights into customer payment preferences.
payment_method | order_count |
---|---|
PayPal | 3 |
Credit Card | 3 |
Debit Card | 2 |
Going Deeper: Analyzing Shopping Trends
Finding Peak Shopping Hours
To identify when most purchases occur, extract the hour from order_date
:
duckdb.query("""
SELECT strftime('%H', order_date) AS order_hour, COUNT(*) AS order_count
FROM df
GROUP BY order_hour
ORDER BY order_hour
""").df()
This analysis reveals which hours are busiest for orders.
order_hour | order_count |
---|---|
07 | 1 |
09 | 1 |
10 | 1 |
11 | 1 |
13 | 1 |
14 | 1 |
16 | 1 |
19 | 1 |
Calculating Running Total Revenue
To monitor revenue accumulation over time, utilize a window function:
duckdb.query("""
SELECT order_date, SUM(price * quantity)
OVER (ORDER BY order_date) AS running_revenue
FROM df
""").df()
This query helps you understand revenue growth over time.
Conclusion
In this tutorial, we showed you how to efficiently query Pandas DataFrames using DuckDB, covering topics such as:
- Sales analytics, including total revenue and top products
- Customer behavior analysis and payment method trends
- Time-based analytics to determine peak shopping hours and track revenue growth
DuckDB’s lightweight and high-performance query engine enables you to analyze large Pandas datasets using SQL effectively.
Ready to elevate your data analysis game?
Try integrating DuckDB into your next Python project!