Combining Google Analytics and Shopify: Creating an eCommerce Data Warehouse in BigQuery

  ●   February 15, 2023 | Analytics
Written by
February 15, 2023 | Analytics

Google Analytics is a great tool that allows you to understand the user behaviour of those visiting your website. Whereas, Shopify is fantastic at providing detailed coverage of performance from an eCommerce perspective. Both are great at what they do, but there comes a point where the data from Shopify and Google Analytics (GA) in isolation fails to provide the deeper insights you might want. Whilst each has the ability to provide powerful information, there is a much greater level of insight that can be gained by combining the data.

In this particular example, we will cover how we achieved this for our friends over at Pooch & Mutt, a UK-based company that produces natural and healthy dog food. We wanted to enhance insights by taking the data collected in Google Analytics and using it in conjunction with eCommerce data from Shopify, as well as the subscription data from Recharge and joining this with legacy subscription data from Bold. Using this variety of data sources, we could build a richer, more complete picture.

This blog will seek to provide an overview of how we aggregate this data, and then outline the thought processes behind combining it to extract and visualise these data-driven insights. Specifically, we will explain how we went about answering the following three questions;

  1. Which channels drive subscription sign-ups?
  2. Basket combinations beyond GA; what are the most popular baskets for subscribers and non-subscribers and which are the most profitable?
  3. How long and how many purchases on average does it take before a user becomes a subscriber?

Methodology

Before we get into answering these questions, let’s first cover how we collected all the data from their respective platforms into a centralised location and built a table we could use to collate this data.

Data Structure

We chose to use BigQuery (BQ) to centralise the data because of its versatility and capability to handle the combination of datasets. The flowchart below highlights the general connections between data and the way in which this was achieved:

Stitch

Stitch is an affordable tool that allows you to extract and load data into pre-built data warehouses.

We easily connected to Shopify and Recharge through Stitch, where we simply selected the tables we wanted. In this case, the most important tables are ‘customers’ and ‘subscriptions’ from Recharge, and ‘orders’ from Shopify.

This information is then synced with BQ every 6 hours through Stitch’s automated scheduling process.

Bold

For historical subscriber data, we took a final export and uploaded this to BQ via Google Sheets.

Google Analytics

Whilst Stitch can also be used for this, we opted to use RStudio to get this data into BQ. If you’re interested in learning more about how you can do this, you can read a blog by our very own R Studio expert Danny Smith on how to use GA API with R.

Building a Master Table

Once we had all the required data in BQ, we could begin combining it. The steps we took to do this are as follows:

  1. We combined and aligned the legacy subscriber data from Bold with the current subscriber data from Recharge.
  2. We then pulled through eCommerce data from GA that includes order IDs (e.g. SPFYXXXXX), along with source, medium and channel.
  3. We created new columns based on the tables of GA and unionised subscriber data before joining onto Shopify order data. In this case, some of these additional columns included;
    1. ‘Active’ subscriber identifier 
    2. Original date of subscription
    3. Source, medium & channel of order
    4. First or repeat customer order
  4. Next, we selected the relevant fields from the Shopify order data that we wanted to join this new data to. Some key examples are;
    1. Customer ID
    2. Order ID
    3. Order Date
    4. SKU
    5. Item Quantity
    6. Item Price
    7. Total Order Price
    8. Discount Code
    9. Discount Amount
    10. Tag (Used as an additional method to categorise subscribers vs non-subscribers)
  5. The final stage here was to create a few more additional columns to more easily deal with duplicate rows for different items in the same order;
    1. Avgitemprice
    2. Avgdiscount
    3. Subscriber Status (tags + if they’re present in the Bold & Recharge table)

We then scheduled this query to run once a day and used the result as the master table for additional queries. Doing this meant we didn’t have to run the full query multiple times, as it’s a static table that only updates when scheduled, rather than every time a request is made through the Looker Studio Dashboard.

Insights

Now that we’ve covered the methodology and outlined the steps taken to get all the data we need, we can move on to discussing how we used this data to answer our questions.

Which channels drive subscription sign-ups?

  1. To calculate this, we looked at the first orders placed by customers, using the field we created in the master table that returns either ‘First’ or ‘Repeat’ based on the first order date.
  2. Using the channel grouping & subscriber status additional columns within the master table, we are able to look at the first orders of people that went on to become subscribers, either on the first purchase or repeat purchase, and what channel they were attributed to in GA.
  3. Removing the step to look only at subscribers, we could also see what channels predominantly drive first purchases.

Basket combinations beyond GA; what are the most popular baskets for subscribers and non-subscribers, and which are the most profitable?

  1. Using the STRING_AGG function, we were able to combine the different products from single orders in a single row, separated by ‘ | ’. This essentially creates a basket for each order.
  2. Along with this information, we summed the revenue for each of these items as ‘Basket Total’, with the aim being to have a row per order with all of the relevant information.
  3. Of course, there is a lot of variance between products (flavour, size, etc.) so we added a column that stripped out this variance to look at the base product basket combinations.
  4. We then used ‘ROW_NUMBER() OVER (PARTITION BY Transaction_ID) AS Identifier’ to assign sequential values to each order. This was then used as a filter in the dashboard to isolate single rows that still contained all of the information.

How long and how many purchases on average does it take before a user becomes a subscriber?

  1. Using our master table, we pulled all customer IDs that we knew had been subscribers and took their first subscription date.
  2. We then took the minimum order date and used the DATE_DIFF function to calculate the difference between these dates in ‘days’.
  3. The final step pulled in the channel and the average number of days before subscribing, which resulted in a table showing the average number of days before customers typically became subscribers.
  4. To calculate the average number of orders customers placed before becoming subscribers, we simply divided the count of unique order IDs by the count of unique customer IDs where they were ‘Subscribers’ and they had order dates less than the minimum subscription date.

These are just some surface-level examples of the views we generated using these reports. With the data, there are endless possibilities – other examples from this project include looking at customer lifetime value from various perspectives and assessing the use of discount codes and their impact across the lifetime behaviour of customers. With these insights, they can be used to supplement business decisions. For example, knowing that it takes an average of 3 orders before becoming a subscriber, you can look to share offers and incentives that encourage customers to become subscribers after 2 purchases. The results of this can then be measured at a later time.

Summary

This case study demonstrates the potential for incredibly valuable data-driven insights that can be gained through the combination of multiple data sources, from GA to Shopify and beyond. Once this infrastructure is created, it can be maintained dynamically and used to report on numerous different aspects of the business.

If the questions we answered in this blog are similar to those that you might be interested in for your own business, you can get in touch with us by filling out our contact form.

Our Blog