MongoDB to BigQuery Data Ingestion & Post-processing using Dataflow & Eventarc

What are Data Pipelines and Why is there a need for Data Warehousing?

A data pipeline is a type of application that processes data through a sequence of connected processing steps. As a general concept, data pipelines can be applied to data transfer between any systems, for example from several different data sources into one data warehouse.

Most organizations maintain many different systems, and each organization has unique rules and processes for managing the data contained within those systems.

Organizations require one solution to store, manage, process, and handle data in one place. This is where a data warehouse such as BigQuery comes into play.

BigQuery is a fully managed data warehouse designed for storing, managing, and processing data, as well as running analytical workloads at any scale.

What is Dataflow and how does it help in migrating data from MongoDB to BigQuery?

Dataflow is a unified stream and batch processing service that is serverless, fast, and cost-effective. Its serverless approach removes operational overhead from data workloads, and it is very efficient at implementing streaming transformations, which makes it a great fit for moving data from one platform to another while applying any required changes to the data model.

Migrating data from MongoDB to BigQuery has never been this easy. Previously, moving data from MongoDB to BigQuery meant either writing custom Dataflow pipelines with the Apache Beam libraries or relying on third-party ETL tools such as Stitch and Hevo Data.

Now, with recent updates, MongoDB and Google Cloud have announced a set of open-source Dataflow templates. "MongoDB to BigQuery" is a batch pipeline that reads documents from MongoDB and writes them to BigQuery.

This document describes how you can migrate your upstream data from MongoDB to Google's data warehouse, BigQuery, using the MongoDB to BigQuery (batch) Dataflow template.

Steps involved:

  1. Creating the Dataflow job from a template.

  2. Running the Dataflow job.

  3. Acknowledging the updates in BigQuery via email using Eventarc and SendGrid.

1. Creating the Dataflow Job from a Template:

To create a job in Dataflow, we first need to choose a template. In our case, for batch data loading, we need to choose the "MongoDB to BigQuery" template under the batch templates option. We can also search for the template directly using the filter option, as shown in the screen below.

Requirements for this pipeline

  • The target BigQuery dataset must exist.

  • The source MongoDB instance must be accessible from the Dataflow worker machines.

MongoDB to BigQuery

There are two options available for the MongoDB to BigQuery template:

1. MongoDB to BigQuery CDC (Change Data Capture): a streaming pipeline that works together with MongoDB change streams for incremental loads.

2. MongoDB to BigQuery (batch): a batch pipeline that reads documents from MongoDB and writes them to BigQuery as specified by the userOption parameter.

BigQuery Batch load

Required Parameters:

As a best practice, it is advised to set up a VPN connection between the GCP project and the MongoDB database; this provides more security for data in transit over the internet.

Typically, an additional 50-80% cost saving can be achieved by disabling public IP usage while running Dataflow jobs. This can be done using the --disable-public-ips flag. This option also makes data processing more secure, as the Compute Engine worker machines are not publicly accessible.
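Besides the Cloud Console flow shown in the screenshots, the same job can be launched programmatically. Below is a minimal sketch that calls the Dataflow Flex Template launch API through the google-api-python-client library. The project, region, connection string, table names, and template path are placeholders, and the parameter names used here (mongoDbUri, database, collection, outputTableSpec, userOption) should be verified against the current template documentation. The ipConfiguration setting mirrors the --disable-public-ips option mentioned above.

```python
# Hypothetical sketch: launch the "MongoDB to BigQuery" Flex Template through the
# Dataflow REST API (v1b3). All identifiers below are placeholders.
from googleapiclient.discovery import build

PROJECT = "my-gcp-project"
REGION = "us-central1"

dataflow = build("dataflow", "v1b3")

launch_body = {
    "launchParameter": {
        "jobName": "mongodb-to-bigquery-batch",
        # Google-provided template path; check the docs for the current version.
        "containerSpecGcsPath": f"gs://dataflow-templates-{REGION}/latest/flex/MongoDB_to_BigQuery",
        "parameters": {
            "mongoDbUri": "mongodb+srv://user:password@cluster.example.net",  # source connection URI
            "database": "sample_db",                                          # source database
            "collection": "sample_collection",                                # source collection
            "outputTableSpec": f"{PROJECT}:analytics.mongodb_import",         # target BigQuery table
            "userOption": "FLATTEN",                                          # FLATTEN or NONE
        },
        "environment": {
            # Equivalent of --disable-public-ips: workers use private IPs only.
            "ipConfiguration": "WORKER_IP_PRIVATE",
        },
    }
}

response = (
    dataflow.projects()
    .locations()
    .flexTemplates()
    .launch(projectId=PROJECT, location=REGION, body=launch_body)
    .execute()
)
print("Launched Dataflow job:", response["job"]["id"])
```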

2. Running the Dataflow Job:

After specifying all the above parameters, we can run the job.

Running the Dataflow job

After clicking the Run Job button, the job is queued and starts executing once its status changes to Running. Wait for the job to complete successfully; once the status changes to Succeeded, you will be able to see the data in the BigQuery table. The status can also be checked programmatically, as sketched below.
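If the load is part of an automated workflow, the job status can be polled through the Dataflow jobs.get API. This is a minimal sketch assuming the google-api-python-client library and placeholder project, region, and job ID values (the job ID is returned when the job is launched):

```python
# Minimal sketch: wait for a Dataflow batch job to reach a terminal state.
import time

from googleapiclient.discovery import build

PROJECT, REGION = "my-gcp-project", "us-central1"
JOB_ID = "2024-01-01_00_00_00-1234567890123456789"  # placeholder job ID

dataflow = build("dataflow", "v1b3")
TERMINAL_STATES = {"JOB_STATE_DONE", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

while True:
    job = (
        dataflow.projects()
        .locations()
        .jobs()
        .get(projectId=PROJECT, location=REGION, jobId=JOB_ID)
        .execute()
    )
    print("Current state:", job["currentState"])
    if job["currentState"] in TERMINAL_STATES:
        break
    time.sleep(30)  # poll every 30 seconds
```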

BigQuery table

After successfully running the job, we can see the data loaded into our BigQuery table.

BigQuery Tables

Under the specified BigQuery table name, we can see the data that has been loaded.

Validate the data ingestion through schema validation or by comparing the number of rows loaded, for example as in the sketch below.
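A quick row-count comparison can be done with a few lines of Python. This sketch assumes the pymongo and google-cloud-bigquery client libraries; the connection string, database, dataset, and table names are placeholders:

```python
# Minimal row-count validation: source MongoDB collection vs. target BigQuery table.
from google.cloud import bigquery
from pymongo import MongoClient

mongo = MongoClient("mongodb+srv://user:password@cluster.example.net")  # placeholder URI
source_count = mongo["sample_db"]["sample_collection"].count_documents({})

bq = bigquery.Client(project="my-gcp-project")
query = "SELECT COUNT(*) AS row_count FROM `my-gcp-project.analytics.mongodb_import`"
target_count = next(iter(bq.query(query).result())).row_count

print(f"MongoDB documents: {source_count}, BigQuery rows: {target_count}")
assert source_count == target_count, "Row counts differ - investigate the load"
```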

Using the Dataflow template simplifies stream/batch data ingestion. It also reduces cost, since it can replace third-party tools such as Stitch and Hevo Data for stream/batch data ingestion from MongoDB to BigQuery, and it provides stability and scalability.

Monitoring the Dataflow job:

When you run your pipeline using the Dataflow-managed service, you can view that job and any others by using Dataflow's web-based monitoring user interface. The monitoring interface lets you see and interact with your Dataflow jobs.

To access a job's monitoring charts, click the job name within the Dataflow monitoring interface. The Job details page is displayed, which contains the following information:

  • Job graph: visual representation of your pipeline

  • Execution details: tool to optimize your pipeline performance

  • Job metrics: metrics about the running of your job

  • Autoscaling: metrics related to streaming job autoscaling events

  • Job info panel: descriptive information about your pipeline

  • Job logs: logs generated by the Dataflow service at the job level, where we can check for any errors or warnings captured during the execution of the job

  • Worker logs: logs generated by the Dataflow service at the worker level

  • Diagnostics: table showing where errors occurred along the chosen timeline and possible recommendations for your pipeline

  • Time selector: tool that lets you adjust the timespan of your metrics
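The same job information can also be retrieved programmatically, which is useful for alerting or reporting. Below is a hedged sketch that reads job metrics through the Dataflow jobs.getMetrics API, again assuming the google-api-python-client library and placeholder identifiers:

```python
# Minimal sketch: list the metrics of a Dataflow job via the v1b3 API.
from googleapiclient.discovery import build

PROJECT, REGION = "my-gcp-project", "us-central1"
JOB_ID = "2024-01-01_00_00_00-1234567890123456789"  # placeholder job ID

dataflow = build("dataflow", "v1b3")
metrics = (
    dataflow.projects()
    .locations()
    .jobs()
    .getMetrics(projectId=PROJECT, location=REGION, jobId=JOB_ID)
    .execute()
)

# Each metric update has a structured name plus a scalar or distribution value.
for metric in metrics.get("metrics", []):
    print(metric["name"]["name"], metric.get("scalar"))
```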

Monitoring the Dataflow job

You can view the charts in the Job metrics section, which is subdivided into the following groups of metrics:

1. Overview metrics

  • Throughput: 

Throughput is the volume of data that is processed at any point in time. This per-step metric is displayed as the number of elements per second.


2. Resource metrics:

  • CPU utilization:

CPU utilization is the amount of CPU used divided by the amount of CPU available for processing. This per-worker metric is displayed as a percentage. The dashboard includes the following four charts:

  • CPU utilization (All workers)

  • CPU utilization (Stats)

  • CPU utilization (Top 4)

  • CPU utilization (Bottom 4)

Google Cloud Dataflow metrics
  • Memory utilization:

Memory utilization is the estimated amount of memory used by the workers in bytes per second. The dashboard includes the following two charts:

  • Max worker memory utilization (estimated bytes per second)

  • Memory utilization (estimated bytes per second)

Memory utilization metrics

3. Acknowledging the Updates in BigQuery via Email Using Eventarc and SendGrid:


Eventarc is a Google Cloud service that can be used to capture certain events and trigger a corresponding reaction to those events.

Eventarc can capture BigQuery insert/update and other events from the Cloud Audit Logs. These captured events can then be passed on to a Cloud Function, Cloud Run, GKE, or Compute Engine workload for any kind of post-processing. For Cloud Functions, Eventarc triggers are supported on 2nd generation functions only.

In this case, changes in BigQuery are the events we capture and acknowledge by sending an email to a group of interested users.

Below we show how to create such a trigger in Eventarc. Here BigQuery is selected as the event provider with the InsertJob event.

Google Cloud database Event tracker

Filtering for specific resources is available under the Resource option; for a specific dataset and table, choose the specific resource and provide the path of the intended resource.

Google Cloud Functions

After the event is captured, the event destination is set to another Cloud Function, which uses SendGrid as the email agent to notify interested users when the BigQuery insert job completes. A minimal sketch of such a function follows below.
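This is a hedged sketch of the destination function, written as a 2nd generation (CloudEvent) Cloud Function in Python. The sender address, recipient list, and the SENDGRID_API_KEY environment variable are placeholders, and the audit-log fields read from the event payload should be adapted to your needs:

```python
# Minimal sketch: Cloud Function (2nd gen) that receives the BigQuery InsertJob
# audit-log event routed by Eventarc and sends an email through SendGrid.
import os

import functions_framework
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail


@functions_framework.cloud_event
def notify_bigquery_insert(cloud_event):
    # For audit-log events, the CloudEvent payload is the log entry itself.
    payload = cloud_event.data.get("protoPayload", {})
    method = payload.get("methodName", "unknown method")
    resource = payload.get("resourceName", "unknown resource")

    message = Mail(
        from_email="alerts@example.com",      # placeholder sender
        to_emails=["data-team@example.com"],  # placeholder recipients
        subject="BigQuery insert job completed",
        html_content=f"<p>Captured <b>{method}</b> on <code>{resource}</code>.</p>",
    )
    SendGridAPIClient(os.environ["SENDGRID_API_KEY"]).send(message)
```

No polling is needed with this design: Eventarc routes the audit-log event to the function as soon as BigQuery records the insert job.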

Conclusion:

This document shared a basic idea of how you can migrate your upstream data from MongoDB to Google's data warehouse, BigQuery, using the MongoDB to BigQuery (batch) Dataflow template, which offers extra benefits over using any other third-party ETL tool.

This approach is cost-efficient and much more reliable compared to other ETL tools.

You can easily load data into BigQuery, where you can perform further analysis and derive useful insights.

Acknowledgement of the changes in BigQuery is handled by Eventarc, which we can easily connect with Cloud Functions to add features like email and SMS alerts.
