Efficiently Analyze CSV Files Using DuckDB

Efficiently Analyze CSV Files Using DuckDB, DuckDB is a powerful in-memory database tailored for analytical workloads.

It excels at querying and analyzing CSV files, making it a go-to tool for data analysts and data scientists alike.

Efficiently Analyze CSV Files Using DuckDB

In this comprehensive guide, we will walk you through a structured approach to analyzing a sample shopping dataset using DuckDB from the command line.

Getting Started with DuckDB

Installation of DuckDB

If you haven’t installed DuckDB yet, you can easily get started with a simple command in your terminal:

$ curl https://install.duckdb.org | sh

Once installed, launch DuckDB in your terminal. Make sure to replace ‘user’ with your actual username:

$ /home/user/.duckdb/cli/latest/duckdb

For further assistance, refer to the DuckDB installation guide.

Preview the Dataset

Once you have your DuckDB environment set up, you can preview the first five rows of the dataset by running:

SELECT * FROM read_csv('shopping_data.csv') LIMIT 5;

This command will return a preview of the dataset:

┌─────────────┬───────────────┬───────┬───────────────┬────────────────┐
│ customer_id │ customer_name │  age  │ purchase_date │ payment_method  │
│    int64    │    varchar    │ int64 │     date      │    varchar      │
├─────────────┼───────────────┼───────┼───────────────┼────────────────┤
│           1 │ Customer 1    │    56 │ 2024-01-01    │ PayPal          │
│           2 │ Customer 2    │    46 │ 2024-01-02    │ Credit Card     │
│           3 │ Customer 3    │    32 │ 2024-01-03    │ Cash            │
│           4 │ Customer 4    │    60 │ 2024-01-04    │ Cash            │
│           5 │ Customer 5    │    25 │ 2024-01-05    │ Debit Card      │
└─────────────┴───────────────┴───────┴───────────────┴────────────────┘

Analyzing the Shopping Dataset

Check Column Names and Data Types

To understand the structure of your dataset better, you can check the column names and their corresponding data types using:

DESCRIBE SELECT * FROM read_csv('shopping_data.csv') LIMIT 5;

This command will provide information on each column, its data type, and other details:

┌─────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │ column_type │  null   │   key   │ default │  extra  │
├─────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ customer_id     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_name   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ age             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ gender          │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ purchase_amount │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ category        │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ purchase_date   │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ payment_method  │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Generate Summary Statistics

To obtain key statistics for numerical columns like age and purchase amount, use the following SQL query:

SELECT
    MIN(age) AS min_age, MAX(age) AS max_age, AVG(age) AS avg_age,
    MIN(purchase_amount) AS min_purchase, MAX(purchase_amount) AS max_purchase, AVG(purchase_amount) AS avg_purchase
FROM read_csv('shopping_data.csv');

This query will return a summary like this:

┌─────────┬─────────┬───────────┬──────────────┬────────────────────────┐
│ min_age │ max_age │    ...    │ max_purchase │      avg_purchase      │
├─────────┼─────────┼───────────┼──────────────┼────────────────────────┤
│   19    │   61    │     ...   │    485.1     │ 242.93                 │
└─────────┴─────────┴───────────┴──────────────┴────────────────────────┘

Filtering and Sorting Transactions

To find high-spending customers who spent more than $200, sorted by purchase amount in descending order, use:

SELECT customer_name, age, purchase_amount, category
FROM read_csv('shopping_data.csv')
WHERE purchase_amount > 200
ORDER BY purchase_amount DESC;

This query identifies customers and displays their transactions:

┌───────────────┬───────┬─────────────────┬─────────────┐
│ customer_name │  age  │ purchase_amount │  category   │
├───────────────┼───────┼─────────────────┼─────────────┤
│ Customer 16   │    20 │           485.1 │ Groceries   │
| ...           | ...   |           ...   | ...         │
└───────────────┴───────┴─────────────────┴─────────────┘

Grouping and Aggregating Data for Insights

To analyze total sales per product category and gain meaningful insights, execute:

SELECT category, COUNT(*) AS total_purchases, SUM(purchase_amount) AS total_sales, AVG(purchase_amount) AS avg_spent
FROM read_csv('shopping_data.csv')
GROUP BY category
ORDER BY total_sales DESC;

The output will provide an overview like this:

┌─────────────┬─────────────────┬────────────────────┬────────────────────┐
│  category   │ total_purchases │   total_sales      │     avg_spent      │
├─────────────┼─────────────────┼────────────────────┼────────────────────┤
│ Groceries   │               9 │ 2716.71            │ 301.86             │
| ...         | ...             | ...                | ...                │
└─────────────┴─────────────────┴────────────────────┴────────────────────┘

Conclusion

DuckDB makes it easy to analyze large CSV files efficiently from the command line, eliminating the need for complex setups.

This guide has covered everything from installation to basic data analysis techniques, empowering you to leverage DuckDB for your analytical needs.

By using DuckDB, you can transform how you analyze data, ensuring quicker insights and informed decision-making in your data-driven projects.

Happy analyzing!

Pandas DataFrames with DuckDB » FINNSTATS

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *

twelve + 9 =

Ads Blocker Image Powered by Code Help Pro

Quality articles need supporters. Will you be one?

You currently have an Ad Blocker on.

Please support FINNSTATS.COM by disabling these ads blocker.

Powered By
Best Wordpress Adblock Detecting Plugin | CHP Adblock