A Guide to Transformational Modeling with a BigQuery Dataset

Part 3: How to effectively transform raw data with dbt using the ER diagram of a BigQuery information schema.

Norapath Arjanurak

Jul 10, 2024

data-modeling

👋 Hi there, welcome to Datascale: SQL+Data modeling. We help create database diagrams and map out data models from existing schema and SQL queries 🌱. Find us on LinkedIn 🎉

An overview of transformational modeling

In Part 2, we showed how to define tables and constraints using BigQuery DDL and how to visualize a schema on the Datascale platform. In Part 3, we will continue building on the ER diagram from Part 2 and dive into data transformation, focusing on the various modeling layers: base, staging, intermediate, and fact/dimension tables. We will also show how to illustrate the lineage of data as it flows through these transformations.

Based on the diagram above, transformational modeling involves creating different layers of models that process raw data into meaningful business insights. In this blog, we will cover the transformation using modular data modeling with dbt. The main layers are:

1. Staging Layer: the staging layer is responsible for cleaning and filtering the raw data to prepare it for further processing. 

2. Intermediate Layer: the intermediate layer aggregates and standardizes data from the staging models, preparing it for the dimension and fact models.

3. Fact and Dimension Layer: the fact tables contain quantitative data and the foreign keys that relate to the dimension tables. The dimension tables contain the data that describes the fact table from various perspectives (dimensions).

The implementation of each layer differs depending on the organization's business and the specific questions its data needs to answer.

Learn more: the efficient ways to organize data models.

Benefits of the modular data modeling

With a modular approach, it is more convenient for the team to select and extract data at specific points in the pipeline. It gives analytics engineers a strong foundation for a transformation workflow that is easy to navigate and reuse. By dividing the process into these layers, we can cover business and analytical use cases across many dimensions.

Modeling TheLook E-commerce dataset

Before we start, we have provided the GitHub repository for the following implementation here: https://github.com/datascalehq/example-thelook-ecommerce. All the following transformations will be based on TheLook E-commerce dataset, as illustrated in the ER diagram below.

TheLook Ecommerce Data Model

Now, let's go through the implementation of each layer using TheLook E-commerce dataset. 🎉

1. Staging layer

The staging layer consists of the base and staging models. The base models select all columns from the data source (blue columns), while the staging models select specific columns and filter certain rows (green columns).

1.1. Base models

The base models include the raw tables ingested directly from the source (a minimal sketch follows the list):

  1. order_items: bigquery-public-data.thelook_ecommerce.order_items

  2. events: bigquery-public-data.thelook_ecommerce.events
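
For illustration, a base model can be as simple as selecting everything from the source table. The sketch below assumes the public dataset has been declared as a dbt source named thelook_ecommerce; the exact model and source names live in the linked repository.

-- base_order_items.sql: pass the raw source through without any transformation
select
    *
from
    {{ source('thelook_ecommerce', 'order_items') }}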

1.2. Staging models

The staging models standardize the raw data:

  1. Distribution center: data about distribution centers, including their ID, name, latitude, and longitude.

  2. Inventory items: data on inventory items, including item IDs, product IDs, creation and sale dates, and costs.

  3. Orders & order items: order information such as order ID, user ID, status, and key timestamps (creation, shipping, delivery, return).

  4. Products:  product data, including product ID, name, category, brand, retail price, cost, department, SKU, and associated distribution center ID.

  5. Users: user information, including user ID, first and last names, email, age, gender, location details (state, city, country), traffic source, and creation date.

  6. Events: event data like event ID, user ID, session information, timestamps, IP address, location (city, state, postal code), browser, traffic source, URI, and event type.

In this phase, we extract all seven tables contained in the database into models that read data directly from the sources without any transformation. This acts as a defensive layer that protects downstream models from changes made directly to the main database, which pays off in the following phase where heavier transformations occur.
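
As a sketch of what a staging model might look like, the example below selects and renames only the columns described above for stg_orders. The base_orders reference and the exact column list are assumptions for illustration, not the exact repository code.

-- stg_orders.sql: keep only the columns needed downstream and filter out invalid rows
with base_orders as (
    select * from {{ ref('base_orders') }}  -- hypothetical base model name
)
select
    order_id,
    user_id,
    status,
    created_at,
    shipped_at,
    delivered_at,
    returned_at,
    num_of_item
from
    base_orders
where
    order_id is not null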

2. Mart layer

The mart phase is the key stage of data transformation, where data is prepared for analysis and reporting. It is divided into the intermediate and fact/dimension layers.

2.1. Intermediate models

The intermediate layer aggregates and standardizes data from the staging models. We build the following models:

1. int_order_sales: It contains the order details joined with user information, useful for delivery-related analysis.

with order_sales as (
    select
        o.order_id,
        o.user_id,
        o.status as order_status,
        o.created_at as order_created_at,
        o.returned_at as order_returned_at,
        o.shipped_at as order_shipped_at,
        o.delivered_at as order_delivered_at,
        o.num_of_item,
        u.first_name,
        u.last_name,
        u.email,
        u.age,
        u.gender,
        u.state,
        u.city,
        u.country,
        u.traffic_source
    from
        {{ ref('stg_orders') }} o
    left join
        {{ ref('stg_users') }} u
    on
        o.user_id = u.id
)
select * from order_sales

2. int_product_inventory: It contains information about the company's products, including price and storage location.

with product_inventory as (
    select
        p.id as product_id,
        p.name as product_name,
        p.category as product_category,
        p.brand as product_brand,
        p.retail_price as product_retail_price,
        p.cost as product_cost,
        p.department as product_department,
        p.sku as product_sku,
        p.distribution_center_id,
        i.id as inventory_item_id,
        i.created_at as inventory_created_at,
        i.sold_at as inventory_sold_at,
        i.cost as inventory_cost,
        dc.name as distribution_center_name,
        dc.latitude as distribution_center_latitude,
        dc.longitude as distribution_center_longitude
    from
        {{ ref('stg_products') }} p
    join
        {{ ref('stg_inventory_items') }} i
    on
        p.id = i.product_id
    join
        {{ ref('stg_distribution_centers') }} dc
    on
        p.distribution_center_id = dc.id
)
select * from product_inventory

3. int_user_orders: It contains user details along with historical data on what orders were made and when, useful for marketing purposes such as identifying peak purchasing times.

with user_orders as (
    select
        u.id as user_id,
        u.first_name,
        u.last_name,
        u.email,
        u.age,
        u.gender,
        u.state,
        u.city,
        u.country,
        u.traffic_source,
        o.order_id,
        o.status as order_status,
        o.created_at as order_created_at,
        o.returned_at as order_returned_at,
        o.shipped_at as order_shipped_at,
        o.delivered_at as order_delivered_at,
        o.num_of_item
    from
        {{ ref('stg_users') }} u
    left join
        {{ ref('stg_orders') }} o
    on
        u.id = o.user_id
)
select * from user_orders

2.2. Fact and Dimension models

Building on the base, staging, and intermediate models above, this layer organizes the data to facilitate analysis and reporting. By combining these layers, we can query, filter, and aggregate only the data needed to derive business insights. The following is our implementation of the fct/dim tables.

1. dim_orders: It tracks the order details for every order from users, useful for marketing purposes to identify which products generate the most transactions.

2. dim_products: It provides product details which can be used to track product information, such as engagement among users and delivery times based on location and past shipping data.

3. dim_inventory_items: It contains information about the inventory, including product prices and storage locations.

4. dim_users: It contains user details along with historical data on orders, useful for marketing purposes to determine peak purchasing times.

5. dim_event_user: It tracks events and participating users, helping to identify which events engage users the most and which products sell best during events.

6. fct_sales: A fact table that tracks every piece of sales information, useful for customer service and parcel tracking (a simplified sketch follows the list).
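
To make the layer concrete, below is a hedged sketch of what fct_sales could look like, joining order items to the intermediate models above. The stg_order_items reference and the column choices are illustrative assumptions rather than the exact repository code.

-- fct_sales.sql (illustrative sketch): one row per order item,
-- enriched with order, user, and product context
with order_items as (
    select * from {{ ref('stg_order_items') }}  -- assumed staging model for order items
),
order_sales as (
    select * from {{ ref('int_order_sales') }}
),
products as (
    -- deduplicate the inventory-level intermediate model down to product level
    select distinct
        product_id,
        product_name,
        product_category,
        distribution_center_name
    from {{ ref('int_product_inventory') }}
)
select
    oi.id as order_item_id,
    oi.order_id,
    oi.product_id,
    oi.sale_price,
    os.user_id,
    os.order_status,
    os.order_created_at,
    os.order_delivered_at,
    p.product_name,
    p.product_category,
    p.distribution_center_name
from
    order_items oi
left join
    order_sales os
on
    oi.order_id = os.order_id
left join
    products p
on
    oi.product_id = p.product_id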

By implementing these fact and dimension tables, we create a data transformation workflow that effectively supports detailed business analysis. It also gives data practitioners specific, well-defined sources to extract from and work with, making data management easier.

To conclude, the transformational modeling process we discussed ensures that data is organized and processed efficiently through its layers: base, staging, intermediate, and fact/dimension tables. This approach creates an effective structure that allows for meaningful analysis that drives business insights and decisions.

Visualize the lineage in data model

To visualize each data model and understand the relationships between different tables, we use the Datascale platform (same as in Part 2). The ER & Lineage diagrams generated on Datascale provide a clear view of how data flows and how the tables relate to each other.

In the Datascale platform, you can add a new SQL model and paste the DDL/SQL statements of your tables into the workspace. Then, click on the “Diagram” button on the right panel. It will automatically generate the data lineage diagram from this data model.

We will visualize the dim_products table as an example using the SQL code provided in the GitHub repository.
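
For reference, a simplified sketch of such a dim_products model is shown below; the repository version may differ. Pasting SQL like this into the workspace is what produces the column-level lineage described next.

-- dim_products.sql (simplified sketch): one row per product with pricing and distribution-center context
with product_inventory as (
    select * from {{ ref('int_product_inventory') }}
)
select distinct
    product_id,
    product_name,
    product_category,
    product_brand,
    product_retail_price,
    product_cost,
    product_department,
    distribution_center_name,
    distribution_center_latitude,
    distribution_center_longitude
from
    product_inventory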

This visualization clearly illustrates the relationships between the tables referenced in the query. The display is also at column level, which makes the data dependencies easy to trace.

Conclusion

By following this transformational modeling process, we ensure that the data is clean, organized, and ready for comprehensive analysis and reporting. The ER diagram generated in the Datascale platform provides a clear visualization of the relationships and structure of the query. These implementations are good practices for analytics engineers to create an effective data transformation workflow.

What's next?

This concludes Part 3 of our blog series on transformational modeling and data lineage. In the next part, we will explore how to create actionable business insights using these transformed data models and implement a real business workflow.
