Pandas DataFrames with DuckDB

Pandas DataFrames with DuckDB, Pandas is widely recognized as one of the most versatile Python libraries for handling structured data.

If you’re already familiar with SQL, you can harness the power of DuckDB to query Pandas DataFrames seamlessly.

Pandas DataFrames with DuckDB

In this article, you will discover how to:

  • Utilize DuckDB to query Pandas DataFrames
  • Efficiently perform aggregations and filtering
  • Conduct joins and window functions within DataFrames

Let’s dive into the details!

Getting Started with DuckDB and Pandas

Before we jump in, ensure that you have DuckDB and Pandas installed:

You can set up your Python development environment or use our Google Colab notebook to follow along.

Creating a Sample Dataset

Let’s create a sample dataset that simulates customer transactions:

import pandas as pd
import duckdb

# Create a sample DataFrame (mimicking real-world shopping data)
data = {
    "order_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "customer_name": ["Ethan", "Harper", "Lucas", "Mia", "James", "Amelia", "Henry", "Charlotte"],
    "product": ["Smartphone", "Smartwatch", "Blender", "TV", "Microwave", "Laptop", "Camera", "Headphones"],
    "category": ["Electronics", "Electronics", "Home", "Electronics", "Home", "Electronics", "Electronics", "Accessories"],
    "price": [899, 250, 120, 1500, 200, 1300, 700, 180],
    "quantity": [2, 4, 3, 2, 2, 1, 3, 4],
    "order_date": pd.to_datetime(["2024-03-05 11:10", "2024-03-06 13:50", "2024-03-06 07:30",
                                  "2024-03-07 19:15", "2024-03-08 10:45", "2024-03-09 14:25",
                                  "2024-03-10 09:40", "2024-03-11 16:10"]),
    "payment_method": ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"]
}

df = pd.DataFrame(data)

# Display first few rows
df.head()

This DataFrame represents customer orders with essential details including products, categories, prices, and payment methods.

Using DuckDB to Query Pandas DataFrames

With DuckDB, you can easily query a Pandas DataFrame using SQL syntax as if it were a traditional database table.

1. Select All Data from the DataFrame

To fetch all data from the DataFrame, use:

duckdb.query("SELECT * FROM df").df()

This will return the complete dataset in a SQL table format.

2. Total Revenue Calculation

Calculating the total revenue is straightforward. Use the following query:

duckdb.query("""
    SELECT SUM(price * quantity) AS total_revenue
    FROM df
""").df()

This query calculates the total sales revenue from all orders and yields:

total_revenue
0 10678.0

3. Finding the Most Popular Products

To identify the best-selling products, execute this SQL command:

duckdb.query("""
    SELECT product, SUM(quantity) AS total_quantity
    FROM df
    GROUP BY product
    ORDER BY total_quantity DESC
    LIMIT 5
""").df()

This provides a ranking of products by total quantity sold.

producttotal_quantity
Smartphone2.0
Smartwatch4.0
Blender3.0
Camera3.0
Headphones4.0

Analyzing Customer Behavior

Identifying Top Spending Customers

To discover which customers are spending the most, use:

duckdb.query("""
    SELECT customer_name, SUM(price * quantity) AS total_spent
    FROM df
    GROUP BY customer_name
    ORDER BY total_spent DESC
    LIMIT 5
""").df()

This provides the list of top spenders:

customer_nametotal_spent
Mia3000.0
Henry2100.0
Ethan1798.0
Amelia1300.0
Harper1000.0

Analyzing Payment Methods

To find out which payment method is most popular, run:

duckdb.query("""
    SELECT payment_method, COUNT(*) AS order_count
    FROM df
    GROUP BY payment_method
    ORDER BY order_count DESC
""").df()

This gives you insights into customer payment preferences.

payment_methodorder_count
PayPal3
Credit Card3
Debit Card2

Going Deeper: Analyzing Shopping Trends

Finding Peak Shopping Hours

To identify when most purchases occur, extract the hour from order_date:

duckdb.query("""
    SELECT strftime('%H', order_date) AS order_hour, COUNT(*) AS order_count
    FROM df
    GROUP BY order_hour
    ORDER BY order_hour
""").df()

This analysis reveals which hours are busiest for orders.

order_hourorder_count
071
091
101
111
131
141
161
191

Calculating Running Total Revenue

To monitor revenue accumulation over time, utilize a window function:

duckdb.query("""
    SELECT order_date, SUM(price * quantity)
           OVER (ORDER BY order_date) AS running_revenue
    FROM df
""").df()

This query helps you understand revenue growth over time.

Conclusion

In this tutorial, we showed you how to efficiently query Pandas DataFrames using DuckDB, covering topics such as:

  • Sales analytics, including total revenue and top products
  • Customer behavior analysis and payment method trends
  • Time-based analytics to determine peak shopping hours and track revenue growth

DuckDB’s lightweight and high-performance query engine enables you to analyze large Pandas datasets using SQL effectively.

Ready to elevate your data analysis game?

Try integrating DuckDB into your next Python project!

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *

13 + 7 =

Ads Blocker Image Powered by Code Help Pro

Quality articles need supporters. Will you be one?

You currently have an Ad Blocker on.

Please support FINNSTATS.COM by disabling these ads blocker.

Powered By
Best Wordpress Adblock Detecting Plugin | CHP Adblock