Take the example of a company that has amassed a large amount of data housed across numerous databases and applications. Because of the complexity of the data storage, producing analytics reports takes a long time. To improve reporting efficiency, the team chooses to replicate all of the data to BigQuery. In the past, this would have been a significant, expensive, and time-consuming effort.

The team can now use Datastream and Terraform to automate replication instead of carefully setting up each data source's replication by hand. They write a few configuration files that match the organization's setup, prepare a list of data sources, and presto! Within minutes, replication commences and data starts flowing into BigQuery.

For a high-level overview of Datastream, we suggest reading this post or our most recent announcement of the launch of Datastream for BigQuery.

Terraform is a popular Infrastructure as Code (IaC) tool. Terraform manages infrastructure through configuration files, which makes it safer, more reliable, and simple to automate.

Terraform support in Datastream, introduced in mid-February 2023, removes obstacles and enables several useful patterns, including:

- Policy compliance management – Terraform can be used to enforce compliance and governance rules on the resources that teams provision.
- Automated replication – Datastream processes can be automated with Terraform. This is beneficial when you need automated replication, replication from many data sources, or replication of a single data source to multiple destinations.
Let’s walk through an example where the data source is a PostgreSQL database and configure Datastream replication from PostgreSQL to BigQuery using Terraform, step by step.
Limitations:
Datastream only replicates data to the BigQuery data warehouse in the same Google Cloud project, so make sure to create your Datastream resources in the project where you intend the data to reside.
Requirements:
Before moving on, we must enable the Datastream API in the project. Verify that the Datastream API is enabled on the APIs & Services page of the Google Cloud console.
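Enabling the API can itself be managed with Terraform rather than through the console. The sketch below assumes a `project_id` variable is defined elsewhere in your configuration; `google_project_service` is a resource from the official Google provider, so verify the attributes against the current provider documentation:

```hcl
# Enable the Datastream API in the target project.
# Assumes var.project_id is declared elsewhere in your configuration.
resource "google_project_service" "datastream" {
  project = var.project_id
  service = "datastream.googleapis.com"

  # Leave the API enabled even if this Terraform resource is destroyed.
  disable_on_destroy = false
}
```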
- Ensure that the Terraform CLI is installed by following the CLI installation instructions.
- With a few minor adjustments, the instructions in this article can also be applied to a MySQL or Oracle database. Use our guides for configuring MySQL or Oracle instead, and skip the Postgres-specific settings.
- A Postgres database instance with some initial data is obviously necessary. You can create a new Postgres instance by following the Cloud SQL for Postgres quickstart tutorial.
- We must make sure that PostgreSQL is set up for Datastream replication. This entails enabling logical replication and, optionally, defining a dedicated user just for Datastream. Check our documentation on configuring PostgreSQL, and make sure to note the publication name and replication slot.
- Finally, connectivity between your database and Datastream must be established. Choose the connectivity type that best suits your configuration by consulting the Network connectivity options guide.
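With the prerequisites in place, a minimal Terraform sketch of the PostgreSQL-to-BigQuery stream might look like the following. The hostname, credentials, publication, and replication slot values are placeholders you would replace with your own; the resource schemas come from the Google Terraform provider, so double-check field names against the current provider documentation before applying:

```hcl
# Connection profile for the source PostgreSQL instance.
resource "google_datastream_connection_profile" "postgres" {
  display_name          = "postgres-source"
  location              = "us-central1"
  connection_profile_id = "postgres-source"

  postgresql_profile {
    hostname = "10.0.0.5"        # placeholder: your Postgres host
    port     = 5432
    username = "datastream_user" # placeholder: the user created for Datastream
    password = var.postgres_password
    database = "appdb"
  }
}

# Connection profile for the BigQuery destination.
resource "google_datastream_connection_profile" "bigquery" {
  display_name          = "bigquery-destination"
  location              = "us-central1"
  connection_profile_id = "bigquery-destination"

  bigquery_profile {}
}

# The stream itself: replicate from Postgres into BigQuery.
resource "google_datastream_stream" "postgres_to_bigquery" {
  stream_id     = "postgres-to-bigquery"
  display_name  = "postgres-to-bigquery"
  location      = "us-central1"
  desired_state = "RUNNING"

  source_config {
    source_connection_profile = google_datastream_connection_profile.postgres.id

    postgresql_source_config {
      publication      = "datastream_publication"      # from your Postgres setup
      replication_slot = "datastream_replication_slot" # from your Postgres setup
    }
  }

  destination_config {
    destination_connection_profile = google_datastream_connection_profile.bigquery.id

    bigquery_destination_config {
      data_freshness = "900s" # how stale BigQuery data may be, in seconds

      source_hierarchy_datasets {
        dataset_template {
          location = "us-central1"
        }
      }
    }
  }

  # Backfill all existing data before streaming ongoing changes.
  backfill_all {}
}
```

A `terraform apply` of a configuration like this creates both connection profiles and starts the stream; repeating the stream block per source is what makes bulk replication from many databases practical.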
Building streaming data pipelines on Google Cloud
To acquire, analyse, and then store data for later analysis, many customers construct streaming data pipelines. We’ll concentrate on the typical pipeline architecture pictured below. It has three steps:
1. Data sources deliver messages containing data to a Pub/Sub topic.
2. Pub/Sub buffers the messages and forwards them to a processing component.
3. The processing component processes the data and stores it in BigQuery.
We’ll examine three options for the processing part, ranging from simple to complex: a BigQuery subscription, a Cloud Run service, and a Dataflow pipeline.
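The simplest of the three options wires Pub/Sub directly into BigQuery with no processing code at all. As a sketch in Terraform (matching the IaC approach used earlier), a BigQuery subscription can be declared as follows; the topic, project, dataset, and table names are placeholders, and the table is assumed to already exist:

```hcl
resource "google_pubsub_topic" "events" {
  name = "events-topic" # placeholder topic name
}

# A BigQuery subscription: Pub/Sub writes each message straight
# into the table, with no intermediate processing component.
resource "google_pubsub_subscription" "events_to_bq" {
  name  = "events-to-bigquery"
  topic = google_pubsub_topic.events.id

  bigquery_config {
    # Format: {project}.{dataset}.{table}
    table            = "my-project.analytics.events"
    use_topic_schema = true # map topic schema fields to table columns
  }
}
```

A Cloud Run service or Dataflow pipeline replaces this subscription when per-message transformation or enrichment is needed before the data lands in BigQuery.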
Typical use cases
Let’s examine a few real-world use cases for streaming data pipelines before delving more deeply into the implementation specifics:

- Processing ad clicks. Ad clicks are received, fraud-prediction algorithms run on each click, and the clicks are then either discarded or saved for later study.
- Canonicalizing data formats. Data is collected from different sources, canonicalized into a single data model, and stored for subsequent analysis or additional processing.
- Telemetry collection. User activity is tracked and current information displayed, such as the number of active users or the typical session length broken down by device type.
- Change data capture. All updates from a database are recorded to BigQuery over Pub/Sub.
BigQuery and The Denodo Platform collide
Given the similarities between the Denodo Platform and Google BigQuery, it makes sense that the Denodo Platform received the Google Cloud Ready – BigQuery certification earlier this month.
The Denodo Platform meets BigQuery
BigQuery, the cloud-based enterprise data warehouse (EDW) on Google Cloud, enables lightning-fast query response across petabytes of data, even when some of that data is stored outside of BigQuery in on-premises systems. The Denodo Platform, powered by data virtualization, enables real-time access across disparate on-premises and cloud data sources without replication.

For users of the Denodo Platform on Google Cloud, BigQuery certification offers confidence that the Denodo Platform’s data integration and data management capabilities work seamlessly with BigQuery, as Google only confers this designation on technology that meets stringent functional and interoperability requirements.

Beyond storage “elbow room,” BigQuery offers Denodo Platform users on Google Cloud additional analytical capabilities, including pre-built machine learning (ML) tools like Apache Zeppelin for Denodo, GIS, business intelligence (BI), and other kinds of data analysis tools.

And it gets better.
BigQuery and The Denodo Platform on Google Cloud
Together, BigQuery and the full capability of the Denodo Platform make it simple to access more data with a single tool. The Denodo Platform can send data in real time using BigQuery’s cloud-native APIs, enabling seamless data transfer between on-premises, cloud, and Google Cloud Storage data sources.

The Denodo Platform’s query pushdown optimization, combined with enhanced BigQuery compatibility, processes large big-data workloads more quickly and effectively. BigQuery can also serve as a high-speed caching database for the cloud-based Denodo Platform to increase performance, which supports advanced optimization methods such as multi-pass execution based on intermediary temporary tables.

Customers benefit from Google Cloud’s flexible pricing, which enables them to start small with BigQuery and scale as necessary.
Realization of the Dream of a Hybrid Data Warehouse
Let’s have a look at one possible integration between BigQuery and the Denodo Platform on Google Cloud. The two technologies offer a hybrid (on-premises/cloud) data warehouse deployment in the architecture shown below.
I want to draw your attention to a few items in this diagram (see the numbered circles). One can:
The Denodo Platform on Google Cloud + BigQuery
1. Transfer your relational data to BigQuery for offline analytics and interactive querying.
2. Move your relational data from large-scale databases and apps to Google Spanner when you require high I/O and global consistency.
3. Transfer your relational data from web frameworks and existing apps to Google Cloud SQL.
4. Create a single centralised data hub by combining all of these sources with the relational data currently housed on-premises in a conventional data warehouse.
5. Perform real-time queries on virtual data from other programmes.
6. Create operational reports and analytical dashboards on top of the Denodo Platform to gain insights from the data, then utilise Looker or other BI tools to serve thousands of end users.
Reference:
https://cloud.google.com/blog/products/data-analytics/denodo-platform-and-google-cloud-bigquery
