Ingesting Streaming Data into BigQuery Directly Using Pub/Sub

Ingesting streaming data from Pub/Sub into BigQuery has never been this easy. Previously, we had to write a pipeline using Apache Beam and run it on Dataflow (an Apache Beam runner). With the May 2022 update from Google Cloud, Pub/Sub subscriptions now offer a "Write to BigQuery" delivery type, a variant of the push operation. Select it when you want messages delivered directly to an existing BigQuery table.

This example shows how to manually load streaming data from Pub/Sub into BigQuery without using Dataflow.

Steps Involved:

  1. Creating a Topic in Pub/Sub.

  2. Creating a Schema in Pub/Sub.

  3. Creating a Subscription in Pub/Sub.

  4. Creating a Table in BigQuery.

  5. Publishing a Message to the Topic.

  6. Message Loaded into the BigQuery Table.


Prerequisites:

First, we need to grant the BigQuery Data Editor (roles/bigquery.dataEditor) role and the BigQuery Metadata Viewer (roles/bigquery.metadataViewer) role to the Pub/Sub service account (a scripted alternative is sketched after these steps).

  1. In the Google Cloud console, go to the IAM page.

  2. Select Include Google-provided role grants.

  3. Filter by Name: Pub/Sub Service Account.

  4. Click Edit for the Pub/Sub Service Account.

  5. In the Edit Permissions pane, click Add another role.

  6. In the Select a role drop-down, enter BigQuery, and select the BigQuery Data Editor role.

  7. Click Add another role again.

  8. In the Select a role drop-down, enter BigQuery, and select the BigQuery Metadata Viewer role.
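
If you prefer to grant these roles programmatically rather than through the console, the following is a minimal Python sketch, assuming the google-cloud-resource-manager library is installed and that PROJECT_ID and PROJECT_NUMBER are placeholders you replace with your own values. The Pub/Sub service account follows the pattern service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com.

from google.cloud import resourcemanager_v3
from google.iam.v1 import iam_policy_pb2, policy_pb2

PROJECT_ID = "my-project"        # placeholder: your project ID
PROJECT_NUMBER = "123456789012"  # placeholder: your project number

# Google-managed Pub/Sub service account for the project.
pubsub_sa = f"serviceAccount:service-{PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com"

client = resourcemanager_v3.ProjectsClient()
resource = f"projects/{PROJECT_ID}"

# Read-modify-write the project IAM policy to add both roles.
policy = client.get_iam_policy(request=iam_policy_pb2.GetIamPolicyRequest(resource=resource))
for role in ("roles/bigquery.dataEditor", "roles/bigquery.metadataViewer"):
    policy.bindings.append(policy_pb2.Binding(role=role, members=[pubsub_sa]))
client.set_iam_policy(
    request=iam_policy_pb2.SetIamPolicyRequest(resource=resource, policy=policy)
)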

Begin here:

1. Creating a Topic in Pub/Sub:

A topic forwards messages from publishers to subscribers.

  • Go to Pub/Sub -> Topics -> Create a topic. Name the topic by entering a Topic ID, then select the Use a schema option; this shows the option to select a Pub/Sub schema.

Creating Topic in Pub/Sub

2. Creating a Schema in Pub/Sub:

A schema is a format that messages must follow. You can create schemas as standalone resources, associate schemas with Pub/Sub topics, and use them to validate the structure of published messages. 

  • If you haven’t yet created a schema for your input/streaming data, go to the Select a Pub/Sub schema drop-down (visible while creating the topic) and click Create new schema. This opens the Schemas section of Pub/Sub, where you can enter the details of the schema you want to create.

Creating Schema in Pub/Sub
  • In my case, I set the Schema ID to “PubSub_Bigquery” and, inside the schema definition, declared the fields EmployeeName (String), EmpSalary (Integer), and EmpPhoneNo (String).

  • After entering these details, click the CREATE button. This creates a schema with the specified field names for the input we are going to ingest (a programmatic sketch of steps 1 and 2 follows below).
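
For reference, these two steps can also be scripted with the Pub/Sub client library. This is a minimal sketch, assuming an Avro schema equivalent to the fields above and placeholder project, schema, and topic IDs; note that in the API the schema must exist before the topic that references it, so it is created first.

from google.cloud import pubsub_v1
from google.cloud.pubsub import SchemaServiceClient
from google.pubsub_v1.types import Encoding, Schema

PROJECT_ID = "my-project"        # placeholder
SCHEMA_ID = "PubSub_Bigquery"
TOPIC_ID = "Sample_Topic_Schema"

# Avro definition equivalent to EmployeeName (String), EmpSalary (Integer), EmpPhoneNo (String).
avro_definition = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "EmployeeName", "type": "string"},
    {"name": "EmpSalary", "type": "int"},
    {"name": "EmpPhoneNo", "type": "string"}
  ]
}
"""

# Create the schema resource.
schema_client = SchemaServiceClient()
schema_path = schema_client.schema_path(PROJECT_ID, SCHEMA_ID)
schema_client.create_schema(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avro_definition),
        "schema_id": SCHEMA_ID,
    }
)

# Create the topic with the schema attached; published messages must match the schema.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
publisher.create_topic(
    request={
        "name": topic_path,
        "schema_settings": {"schema": schema_path, "encoding": Encoding.JSON},
    }
)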

3. Creating a Subscription in Pub/Sub:

A subscription directs messages on a topic to subscribers. Messages can be pushed to subscribers immediately, or subscribers can pull messages as needed.

  • After creating the topic and schema in Pub/Sub, we can create a subscription and attach it to the topic we have already created.

  • Go to the Subscriptions section within Pub/Sub -> Create subscription.

  • Enter a Subscription ID and select the Cloud Pub/Sub topic we created earlier; in our case it is Sample_Topic_Schema.

Creating Subscription in Pub/Sub
  • Here is the new Pub/Sub feature: under the Delivery type option, “Write to BigQuery” lets Pub/Sub deliver messages directly to an existing BigQuery table.

  • Set the delivery type to Write to BigQuery, select the project you are using, and select the dataset in which you are going to create the table (or create a new dataset).

  • Select the “Use topic schema” option; when enabled, the topic schema is used when writing to BigQuery. Otherwise, Pub/Sub writes the raw message bytes to a column called data in BigQuery.

  • You can also set a message retention duration or an expiration period under more options, but for this example we are ready to create the subscription (see the sketch after this list).
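
The same console flow can be expressed with the client library. Below is a minimal sketch, assuming the topic created above and placeholder project, subscription, dataset, and table names; the use_topic_schema flag mirrors the “Use topic schema” option.

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"                   # placeholder
TOPIC_ID = "Sample_Topic_Schema"
SUBSCRIPTION_ID = "Sample_BQ_Subscription"  # placeholder
BIGQUERY_TABLE = f"{PROJECT_ID}.my_dataset.employee_table"  # placeholder: project.dataset.table

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

# Delivery type "Write to BigQuery": messages go straight to the table.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=BIGQUERY_TABLE,
    use_topic_schema=True,  # write fields per the topic schema instead of raw bytes in a "data" column
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
    print(f"Created BigQuery subscription: {subscription.name}")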

4. Creating a Table in BigQuery:

  • If your BigQuery table does not exist yet, click the navigate to BigQuery link while creating the subscription and create a new table in the project and dataset mentioned above.

Creating Table in BigQuery
  • In the Create table dialog we create an empty table, specifying the destination project we are using, the dataset we created, and the name of the table we want to create.

  • Under Schema, click the ADD FIELD button and enter the same fields as in your Pub/Sub schema. When Pub/Sub writes messages to BigQuery, it compares the Pub/Sub and BigQuery schemas, and the data is loaded into the BigQuery table only if they match.

  • Click CREATE TABLE (a programmatic sketch follows below).
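
Creating the same empty table can also be scripted with the BigQuery client library. This is a minimal sketch, assuming placeholder project, dataset, and table names and the three fields used in the Pub/Sub schema.

from google.cloud import bigquery

PROJECT_ID = "my-project"    # placeholder
DATASET_ID = "my_dataset"    # placeholder
TABLE_ID = "employee_table"  # placeholder

client = bigquery.Client(project=PROJECT_ID)

# Column names and types mirror the Pub/Sub schema so the subscription's schema check passes.
schema = [
    bigquery.SchemaField("EmployeeName", "STRING"),
    bigquery.SchemaField("EmpSalary", "INTEGER"),
    bigquery.SchemaField("EmpPhoneNo", "STRING"),
]

table = bigquery.Table(f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}", schema=schema)
table = client.create_table(table)
print(f"Created table {table.full_table_id}")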

5. Publishing a Message to the Topic:

  • Inside the Topics section of Pub/Sub, select the topic we created; here you will see the subscription attached to it, which confirms that our subscription is attached to the topic.

  • Below is a screenshot of the Subscriptions tab inside the Topics section of Pub/Sub, where we can see the subscription we already created.

If we publish a message to this topic, it will be delivered to the subscriptions in this list.

Publishing message in Topic
  • Now we are ready to publish a message to the topic: click the Messages tab, then Publish message, and pass the message in the JSON format shown below (a publishing sketch follows the screenshot).

Publishing message in Projects
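
Publishing can also be done programmatically. Below is a minimal sketch, assuming the topic above with JSON encoding selected when the schema was attached; the record values are illustrative only.

import json
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"         # placeholder
TOPIC_ID = "Sample_Topic_Schema"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# The message must match the topic schema (EmployeeName, EmpSalary, EmpPhoneNo).
record = {"EmployeeName": "Jane Doe", "EmpSalary": 50000, "EmpPhoneNo": "555-0100"}
future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
print(f"Published message ID: {future.result()}")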

6. Message Loaded into BigQuery:

  • After entering the message manually, publish it to the topic; the subscription receives the message and loads the data into the assigned BigQuery table.

  • We can then check the BigQuery table: under the Preview section of the table we can see the new row that has been loaded (or query it, as sketched below).

Message Loaded into BigQuery
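
Besides the Preview tab, the row can be verified with a quick query. This is a minimal sketch using the same placeholder project, dataset, and table names as above.

from google.cloud import bigquery

PROJECT_ID = "my-project"  # placeholder
client = bigquery.Client(project=PROJECT_ID)

# The newly published record should appear as a row in the table.
query = f"SELECT EmployeeName, EmpSalary, EmpPhoneNo FROM `{PROJECT_ID}.my_dataset.employee_table`"
for row in client.query(query).result():
    print(dict(row))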

This feature brings simplicity to streaming data ingestion. It also reduces cost, since products like Dataflow and Composer are no longer required for this ingestion path, and it provides stability and scalability.
