How to Upload Data to BigQuery with R and Python

June 6, 2023 | Analytics

The world of web analytics continues to hurtle towards the fateful 1 July date when Universal Analytics stops processing data and is replaced by Google Analytics 4 (GA4). One of the key changes is that in GA4 you can only retain user- and event-level data in the platform for a maximum of 14 months. This is a major change from UA, but in exchange you can export GA4 data to BigQuery for free, up to a limit.

BigQuery is an extremely useful resource for data storage well beyond GA4, and with it becoming more important than ever in a few months, now is as good a time as any to start using it for your data storage needs. Often, it is preferable to manipulate the data in some way before uploading it. For this, we recommend using a script written in either R or Python, especially if the manipulation needs to be repeated. You can also upload data to BigQuery directly from these scripts, and that is exactly what this blog is going to guide you through.

Uploading to BigQuery from R

R is an extremely powerful language for data science and the easiest to work with for uploading data to BigQuery. The first step is to import all the necessary libraries. For this tutorial, we will need the following libraries:

library(googleAuthR)

library(bigQueryR)

If you haven’t used these libraries before, run install.packages("<PACKAGE NAME>") in the console to install them.

Next, we must tackle what is often the trickiest and consistently the most frustrating part of working with APIs – authorisation. Fortunately, with R, this is relatively simple. You will need a JSON file containing authorisation credentials, which you can get from the Google Cloud Console, the same place where BigQuery lives. First, navigate to the Google Cloud Console and click on ‘APIs & Services’.

Next, click ‘Credentials’ in the sidebar.

On the Credentials page, you can view your existing API keys, OAuth 2.0 Client IDs and Service Accounts. You’ll want an OAuth 2.0 Client ID for this, so either hit the download button at the end of the relevant row, or create a new ID by clicking ‘Create Credentials’ at the top of the page. You also need to make sure the account you’ll be authenticating with has permission to view and edit the relevant BigQuery project – to do this, open the sidebar, hover over ‘IAM & Admin’ and click on ‘IAM’. On this page, you can grant that account access to the relevant project using the ‘Grant Access’ button at the top of the page.

With the JSON file obtained and saved, you can pass the path to it with the gar_set_client() function to set your credentials. The full code for authorisation is below:

googleAuthR::gar_cache_empty()

googleAuthR::gar_set_client("C:\\Users\\Tom Brown\\Documents\\R\\APIs\\credentials.json")

bqr_auth(email = "<your email here>")

Obviously, you will want to replace the path in the gar_set_client() function with the path to your own JSON file, and insert the email address you use to access BigQuery into the bqr_auth() function.

Once authorisation is all set up, we need some data to upload to BigQuery, and this data needs to be in a dataframe. For the purposes of this article, I am going to create some fictional data with a number of locations and counts of sales, but most likely you will be reading real data from a .csv file or a spreadsheet. To read data from a .csv file, simply use the read.csv() function, passing the path to the file as an argument:

data <- read.csv("C:\\Users\\Tom Brown\\Documents\\Semetrical\\Blogs\\mycsv.csv")

Alternatively, if you have your data stored in a spreadsheet, your method will vary depending on where this spreadsheet is located. If your spreadsheet is stored in Google Sheets, you can read its data into R using the googlesheets4 library:

library(googlesheets4)

data <- read_sheet(ss = "<spreadsheet URL>", sheet = "<name of tab>")

If you have not used this package before, you’ll have to run install.packages("googlesheets4") in the console before running your code.

If your spreadsheet is in Excel format, you’ll need the readxl library. It is installed as part of the tidyverse – a collection of packages I recommend, as it contains a huge number of functions that make data manipulation in R that much easier – but it is not loaded by library(tidyverse), so you need to attach it explicitly:

library(readxl)

data <- read_excel("C:\\Users\\Tom Brown\\Documents\\Semetrical\\Blogs\\myxl.xlsx")

And once again, make sure to run install.packages("tidyverse") if you haven’t before – this installs readxl along with the rest of the tidyverse!

The final step is to upload the data to BigQuery. For this, you’ll need somewhere in BigQuery to put it. Your table will live within a dataset, which in turn lives within a project, and you’ll need the names of all three in the following format:

bqr_upload_data("<your project>", "<your dataset>", "<your table>", <your dataframe>)

In my case, this means my code reads:

bqr_upload_data("my-project", "test2", "blogtestR", data)

If your table does not exist yet, do not worry – the code will create it for you. Don’t forget to insert the names of your own project, dataset and table into the code above (within the quotation marks) and make sure you are uploading the correct dataframe! Once this is done, you should see your data in BigQuery.

As a final step, let’s say you have additional data that you would like to add to the table. For example, in my data above, say I forgot to include a couple of locations from the continent, and I want to add them to BigQuery without overwriting the existing data. For this, bqr_upload_data() has a parameter called writeDisposition, which has two settings: "WRITE_TRUNCATE" and "WRITE_APPEND". The former tells bqr_upload_data() to overwrite the existing data in the table, while the latter tells it to append the new data. Thus, to upload this new data, I’ll write:

bqr_upload_data("my-project", "test2", "blogtestR", data2, writeDisposition = "WRITE_APPEND")

And sure enough, in BigQuery we can see that our data has some new roommates.

Uploading to BigQuery from Python

In Python, things are a little different. Once again, we will need to import some packages, so let’s begin with these:

import pandas as pd

from google.cloud import bigquery

from google.oauth2 import service_account

Authorisation in Python is a little more involved. Once again, we will need a JSON file containing credentials. As above, navigate to the Google Cloud Console and click on ‘APIs & Services’, then click ‘Credentials’ in the sidebar. This time, at the bottom of the page, there will be a section called ‘Service Accounts’.

There you can either download the key for your service account, or, by clicking on ‘Manage Service Accounts’, create a new key or a new service account and download its credentials.

You will then want to ensure that your service account has permission to access and edit your BigQuery project. Once again, navigate to the IAM page under ‘IAM & Admin’ in the sidebar, and there you can grant your service account access to the relevant project by using the ‘Grant Access’ button at the top of the page.

As soon as you have that sorted out, you can write the authorisation code:

bqcreds = service_account.Credentials.from_service_account_file('myjson.json', scopes = ['https://www.googleapis.com/auth/cloud-platform'])

client = bigquery.Client(credentials=bqcreds, project=bqcreds.project_id)
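If you want to check that authorisation has worked before going any further, one quick optional check (not something you need for the upload itself) is to list the datasets the client can see – if anything is wrong with the credentials or project, this will error immediately:

# Optional sanity check: list the datasets visible to the authorised client.
for dataset in client.list_datasets():
    print(dataset.dataset_id)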

Next, you’ll have to get your data into a dataframe. Dataframes belong to the pandas package, and are very simple to create. To read in from a CSV, follow this example:

data = pd.read_csv('C:\\Users\\Tom Brown\\Documents\\Semetrical\\Blogs\\mycsv.csv')

Obviously, you’ll need to replace the path above with that to your own CSV file. To read in from an Excel file, follow this example:

data = pd.read_excel('C:\\Users\\Tom Brown\\Documents\\Semetrical\\Blogs\\myxl.xlsx', sheet_name='mytab')

Reading in from Google Sheets is trickier and requires another round of authorisation. We will need to import some new packages and reuse the service account JSON key file we downloaded above. Note that the service account will also need access to the spreadsheet itself – the simplest way is to share the sheet with the service account’s email address. You can then follow this code to authorise and read your data:

import gspread

from oauth2client.service_account import ServiceAccountCredentials

credentials = ServiceAccountCredentials.from_json_keyfile_name('myjson.json', scopes = ['https://spreadsheets.google.com/feeds'])

gc = gspread.authorize(credentials)

ss = gc.open_by_key('<spreadsheet key>')

sheet = ss.worksheet('<name of tab>')

data = pd.DataFrame(sheet.get_all_records())
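For this example, I’ll be uploading the same sort of fictional location and sales data as in the R section. If you just want something to experiment with and don’t have a file to hand, you can also build a small dataframe directly in pandas – the values below are made up purely for illustration:

# A small made-up dataframe of locations and sales counts, just for testing.
data = pd.DataFrame({
    'location': ['London', 'Paris', 'Berlin', 'Madrid'],
    'sales': [125, 84, 99, 61]
})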

Once you have your data in your dataframe, it’s time to upload to BigQuery once again! You can do this by following this template:

table_id = "<your project>.<your dataset>.<your table>"

job_config0 = bigquery.LoadJobConfig(write_disposition = 'WRITE_TRUNCATE')

job = client.load_table_from_dataframe(data, table_id, job_config=job_config0)

job.result()
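As in the R section, the write disposition controls what happens to any data already in the table: 'WRITE_TRUNCATE' overwrites it, while 'WRITE_APPEND' adds the new rows to the end. So, to append rather than replace, you would simply change the job config:

job_config0 = bigquery.LoadJobConfig(write_disposition = 'WRITE_APPEND')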

For an example, here is the code I just wrote to upload the data I made earlier:

table_id = "my-project.test2.blogtestPy"

job_config0 = bigquery.LoadJobConfig(write_disposition = 'WRITE_TRUNCATE')

job = client.load_table_from_dataframe(data, table_id, job_config=job_config0)

job.result()

Once this is done, the data should immediately appear in BigQuery!
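If you’d rather confirm this from the script itself than by checking the BigQuery console, one optional way is to fetch the table and print its row count:

# Optional: fetch the table's metadata and report how many rows it now holds.
table = client.get_table(table_id)
print(f"Table {table_id} contains {table.num_rows} rows.")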

There is a lot more that you can do with these functions once you’ve got the hang of them. If you want to take greater control over your analytics setup, Semetrical is here to help! Check out our blog for more information on how to get the most out of your data. Or, for more support on all things analytics, head over to Web Analytics to find out how we can help you.
