Optimizing BigQuery Costs: A Comprehensive Guide

Brief Summary

Effective cost management in BigQuery is crucial for organizations leveraging large-scale data analytics. This article focuses on strategies to reduce costs and improve efficiency in BigQuery usage. We’ll explore various billing models, provide insights into cost analysis, and offer practical steps to optimize your BigQuery environment. By implementing these strategies, you can achieve significant cost savings while maintaining high-performance data analytics capabilities.

1. Digital Analytics

Web analytics is the foundation for measuring long-term success and increasing the efficiency of your (online) marketing activities. It provides the metrics necessary for data-driven decision-making and continuous improvement.

2. Storage Billing Model

In BigQuery, there are two billing models for storage: logical and physical. Depending on the type of data, one is more cost-effective than the other. The default is logical (based on uncompressed data size), but for analytics data the physical model (based on compressed size) is often cheaper, since the data is highly repetitive and therefore compresses well.
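
As a rough sketch of the trade-off, the monthly cost under both models can be estimated from the data volume and the compression ratio. The prices below are illustrative per-GiB rates (matching those used in the comparison query later in this article); actual prices depend on your region:

```python
# Illustrative storage prices per GiB and month (assumed EU rates; verify
# against the current BigQuery price list for your region).
ACTIVE_LOGICAL = 0.02
LONG_TERM_LOGICAL = 0.01
ACTIVE_PHYSICAL = 0.04
LONG_TERM_PHYSICAL = 0.02

def monthly_storage_cost(active_gib: float, long_term_gib: float,
                         compression_ratio: float) -> tuple[float, float]:
    """Return (logical_cost, physical_cost) for one dataset.

    compression_ratio = logical bytes / physical bytes; repetitive
    analytics data often compresses much better than 2x.
    """
    logical = active_gib * ACTIVE_LOGICAL + long_term_gib * LONG_TERM_LOGICAL
    physical = (active_gib / compression_ratio) * ACTIVE_PHYSICAL \
        + (long_term_gib / compression_ratio) * LONG_TERM_PHYSICAL
    return logical, physical

# 1,000 GiB active + 1,000 GiB long-term data compressing 10x:
logical, physical = monthly_storage_cost(1000, 1000, 10)
```

Since physical storage costs twice as much per GiB in these rates, physical billing wins exactly when the compression ratio is better than 2x, which is why it usually pays off for repetitive analytics exports.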

There is a disadvantage to the physical billing model: time travel and fail-safe storage are billed, which is not the case under logical billing. More details can be found here: https://cloud.google.com/blog/products/data-analytics/new-bigquery-billing-model-helps-reduce-physical-storage-costs

3. Compute Billing Model

The following compute billing models are possible in BigQuery:

  • On-demand
  • Capacity

The default is on-demand, where BigQuery automatically scales and charges based on processed data volumes, ensuring queries are executed as quickly as possible.

With capacity billing, you can create a reservation (not to be confused with commitments). When a query is executed, it consumes compute up to the reservation’s slot capacity; if the reservation is too low, query execution takes longer. Capacity is billed by slot-time and scales to zero, so you only pay for what you use.
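
The slot-time arithmetic is straightforward; a small sketch (the per-slot-hour rate is an assumption matching the job-cost query later in this article, not an official price):

```python
# Assumed standard-edition price per slot-hour; check current pricing.
STANDARD_SLOT_HOUR_PRICE = 0.04

def capacity_cost(total_slot_ms: int,
                  price_per_slot_hour: float = STANDARD_SLOT_HOUR_PRICE) -> float:
    """Convert a job's total slot-milliseconds into a capacity-billing cost."""
    slot_hours = total_slot_ms / 1000 / 60 / 60  # ms -> s -> min -> h
    return slot_hours * price_per_slot_hour

# A job that consumed 7,200,000 slot-ms, i.e. 2 slot-hours:
cost = capacity_cost(7_200_000)
```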

More information on pricing models can be found here: https://cloud.google.com/bigquery/pricing#analysis_pricing_models

4. Introduction to BigQuery Cost Optimization

Many organizations struggle with managing BigQuery costs and efficiency. The main issues are a lack of overview on cost-driving factors and high usage costs. We’ll focus on scenarios involving BigQuery with multiple regularly running queries or Dataform executions (compute costs), datasets exceeding 1 TB (storage costs), and Universal Analytics or GA4 data. To address these challenges, we first need to create a comprehensive cost overview. We’ll use three key queries to build a dashboard:


Logical vs. physical storage billing comparison

This query works with the INFORMATION_SCHEMA storage views; the only thing you need to change is the region your datasets are hosted in:

DECLARE active_logical_gib_price FLOAT64 DEFAULT 0.02;
DECLARE long_term_logical_gib_price FLOAT64 DEFAULT 0.01;
DECLARE active_physical_gib_price FLOAT64 DEFAULT 0.04;
DECLARE long_term_physical_gib_price FLOAT64 DEFAULT 0.02;

WITH storage_sizes AS (
  SELECT
    table_schema AS dataset_name,
    -- Logical
    SUM(active_logical_bytes) / POWER(1024, 3) AS active_logical_gib,
    SUM(long_term_logical_bytes) / POWER(1024, 3) AS long_term_logical_gib,
    -- Physical
    SUM(active_physical_bytes) / POWER(1024, 3) AS active_physical_gib,
    SUM(active_physical_bytes - time_travel_physical_bytes - fail_safe_physical_bytes) / POWER(1024, 3) AS active_no_tt_no_fs_physical_gib,
    SUM(long_term_physical_bytes) / POWER(1024, 3) AS long_term_physical_gib,
    -- Restorable previously deleted physical
    SUM(time_travel_physical_bytes) / POWER(1024, 3) AS time_travel_physical_gib,
    SUM(fail_safe_physical_bytes) / POWER(1024, 3) AS fail_safe_physical_gib
  FROM
    `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
  WHERE
    total_logical_bytes > 0
    AND total_physical_bytes > 0
    -- Base the forecast on base tables only for highest precision results
    AND table_type = 'BASE TABLE'
  GROUP BY 1
)
SELECT
  dataset_name,
  -- Logical
  ROUND(
    ROUND(active_logical_gib * active_logical_gib_price, 2) +
    ROUND(long_term_logical_gib * long_term_logical_gib_price, 2)
  , 2) AS total_logical_cost,
  -- Physical
  ROUND(
    ROUND(active_physical_gib * active_physical_gib_price, 2) +
    ROUND(long_term_physical_gib * long_term_physical_gib_price, 2)
  , 2) AS total_physical_cost,
  -- Improvement
  ROUND(
    ROUND(active_logical_gib * active_logical_gib_price, 2) +
    ROUND(long_term_logical_gib * long_term_logical_gib_price, 2)
  , 2) - ROUND(
    ROUND(active_physical_gib * active_physical_gib_price, 2) +
    ROUND(long_term_physical_gib * long_term_physical_gib_price, 2)
  , 2) AS diff
FROM
  storage_sizes

Figure 1: Logical vs. physical storage billing comparison (Original code by Ash Broadley)

Job cost analysis by creator and type

SELECT
  *,
  total_bytes_billed / 1000 / 1000 / 1000 AS total_gb_billed,
  -- On-demand: price per TB scanned
  total_bytes_billed / 1000 / 1000 / 1000 / 1000 * 6.25 AS on_demand_costs,
  -- Capacity: slot-hour rates for the standard and enterprise editions
  total_slot_ms / 1000 / 60 / 60 * 0.04 AS standard_model_costs,
  total_slot_ms / 1000 / 60 / 60 * 0.06 AS enterprise_model_costs
FROM
  `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE
  state = 'DONE'

Figure 2: Job cost analysis by creator and type (Original code by Ash Broadley)

Dataset storage type breakdown (long-term vs. active)

SELECT
  table_schema,
  ROUND(SUM(ROUND(active_logical_bytes / POWER(1024, 4), 2))) AS active_logical_tb,
  ROUND(SUM(ROUND(long_term_logical_bytes / POWER(1024, 4), 2))) AS long_term_logical_tb,
  ROUND(SUM(ROUND((active_logical_bytes + long_term_logical_bytes) / POWER(1024, 4), 2))) AS total_logical_tb
FROM
  `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY table_schema
ORDER BY total_logical_tb DESC

Figure 3: Dataset storage type breakdown (long-term vs. active) (Original code by Ash Broadley)

These queries will provide the insights needed to optimize BigQuery usage and reduce costs effectively.

5. Dashboard

We’ve created a Looker Studio dashboard that allows for very granular evaluations. Without going into depth about the dashboard itself: built on the queries above, it provides valuable insights into the available and currently chosen billing models.

The dashboard gives summaries of costs per query. It also displays a total summary:

Figure 4: Cost by pricing model

In this example, we can see that the on-demand cost is above €1,000, while the standard model cost for the same queries is only €172, a reduction of 83%. Choosing the enterprise model would cost €258 and provide some additional features over the standard model.
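
The percentages shown in the dashboard reduce to simple arithmetic; recomputing the example figures above:

```python
def savings_pct(baseline: float, alternative: float) -> float:
    """Relative cost reduction of `alternative` versus `baseline`, in percent."""
    return (baseline - alternative) / baseline * 100

# Example figures from the dashboard: on-demand ~1,000,
# standard edition 172, enterprise edition 258 (all in EUR).
print(round(savings_pct(1000, 172)))  # 83 -> the 83% reduction cited above
print(round(savings_pct(1000, 258)))  # 74 -> enterprise is still far cheaper
```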

Similarly, the dashboard provides insights into the storage cost of each dataset, with a summary row like this:

Figure 5: Cost by storage billing model

In this example, we see that physical billing, if applied to all datasets, would reduce storage cost by 86% compared to the default logical billing.

More details on cost by pricing model can be found here: https://cloud.google.com/bigquery/docs/editions-intro

6. Defining Actions from Insights

Through the dashboard, you can identify which datasets can be switched to physical billing to save costs.

The third query mentioned above also shows the distribution of data in the datasets between active and long-term storage, indicating which data is being used and which is not.

Often data, especially Universal Analytics (UA) data, is no longer used. This data can be archived in Google Cloud Storage (GCS) to save costs.

Depending on the situation, you can switch to physical billing or move the data to an archive bucket.

7. Changing Storage Billing

Switching from logical to physical billing (or vice versa) can be easily done in the GCP Console. Open BigQuery, select the dataset you want to change, and click on “Edit Details.” Under advanced options, you’ll find the storage billing model. Select physical and click save. Once saved, the billing model is changed, and no further action is required.
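
The same switch can also be made with a DDL statement instead of the console; a sketch using BigQuery’s ALTER SCHEMA syntax (my_dataset is a placeholder for your dataset name):

```sql
-- Switch the (hypothetical) dataset my_dataset to physical storage billing.
ALTER SCHEMA my_dataset
SET OPTIONS (storage_billing_model = 'PHYSICAL');
```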

8. Changing Compute Billing

While storage billing is quick and easy to change, compute billing requires a bit more effort: there are significantly more options, and the pricing model can be chosen according to your requirements.

In addition to on-demand billing, there are several variants of capacity pricing (BigQuery editions): Standard, Enterprise, and Enterprise Plus. The exact differences are listed here: https://cloud.google.com/bigquery/docs/editions-intro#editions_features

In most cases, the Standard model is sufficient. To use Standard capacity pricing, open the GCP Console and go to “Capacity management” under the BigQuery menu, where you can create a reservation. It’s best to check the queries from recent months in the job explorer to see how many resources they used, and choose the reservation size based on that. The reservation can be deleted at any time, unlike a commitment, which we won’t go into here. Once the reservation is created, you need to activate it: click the three dots (Actions column) and then “Create assignment.” Here you can select the job type the reservation is applied to. Typically, “Query” is the biggest cost item, but depending on your use cases, the remaining job types should also be assigned.
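
To translate past job statistics into a reservation size, one simple heuristic is to look at each job’s average slot usage (total_slot_ms divided by its runtime) and add some headroom. The sketch below is our own rule of thumb, not an official sizing recommendation, and the 50-slot rounding step is an arbitrary choice for the example:

```python
import math

def avg_slots(total_slot_ms: int, elapsed_ms: int) -> float:
    """Average number of slots a job used over its runtime."""
    return total_slot_ms / elapsed_ms

def suggest_reservation(job_stats: list[tuple[int, int]],
                        headroom: float = 1.2,
                        step: int = 50) -> int:
    """Peak average slot usage plus headroom, rounded up to `step` slots."""
    peak = max(avg_slots(slot_ms, elapsed_ms) for slot_ms, elapsed_ms in job_stats)
    return math.ceil(peak * headroom / step) * step

# (total_slot_ms, elapsed_ms) pairs, e.g. derived from INFORMATION_SCHEMA.JOBS:
jobs = [(3_600_000, 60_000), (900_000, 30_000)]  # 60 and 30 average slots
print(suggest_reservation(jobs))  # 100
```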

Once the assignment is applied, billing will be based on capacity pricing.

9. Conclusion

By implementing these strategies, you can significantly optimize your BigQuery costs while maintaining high performance. Regular monitoring and adjustments based on usage patterns will ensure ongoing cost efficiency. Remember, the goal is not just to reduce costs, but to maximize the value you get from your BigQuery investment.

For further assistance or questions about BigQuery cost optimization, please don’t hesitate to contact us.
