Internals of BigQuery

The author has already provided a detailed introduction to the internal structure of BigQuery, and we covered some of the core concepts in the previous section. So here we just summarize the content and logic of the video for review purposes. Please also review the links the author provided to help digest the concepts further :)

Step Snap 1 [Introduction to BigQuery Internals]

Overview:

While you can use BigQuery effectively with just best practices like clustering and partitioning, understanding its internals can be highly beneficial for designing robust data products. The video explains that BigQuery’s architecture is built around three key components:

Colossus Storage: A cost-effective distributed storage system that holds data in a columnar format
Jupiter Network: A high-speed internal network that connects compute and storage
Dremel Execution Engine: A distributed query engine that decomposes and processes queries in parallel

This high-level architecture is what enables BigQuery to offer both performance and scalability, even as your data grows.
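To make the "decompose and process in parallel" idea concrete, here is a minimal Python sketch of a scatter-gather aggregation. It is only a toy model, not how Dremel is actually implemented: the function names, the round-robin sharding, and the use of a local thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(shard):
    """Leaf step: each worker aggregates only its own shard of values."""
    return sum(shard)


def distributed_sum(values, num_workers=4):
    """Split the input into shards, aggregate them in parallel, then combine."""
    # Hypothetical sharding: round-robin slices stand in for blocks of stored data.
    shards = [values[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, shards))
    # Root step: merge the partial results into the final answer.
    return sum(partials)


event_values = list(range(1, 101))
print(distributed_sum(event_values))  # 5050
```

The point of the sketch is the shape of the computation: many workers each handle a small slice, and only the small partial results travel back to be merged.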

Step Snap 2 [Colossus Storage: The Backbone of Data Storage]

Initial Explanation:

BigQuery stores data in Colossus, a separate, inexpensive storage system that holds data in a columnar format. Because storage is separated from compute, the two scale and are billed independently: storing large volumes of data is cheap, and most of the cost comes from the compute used during query execution.

Code Demonstration:

Here’s an example of how you might create a table in BigQuery that leverages best practices (partitioning and clustering), which in turn optimize how data is stored and retrieved from Colossus:

CREATE TABLE my_dataset.my_table (
  user_id INT64,
  event_type STRING,
  event_date DATE,
  event_value FLOAT64
)
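-- One partition per day, so queries filtered on event_date can skip other days
PARTITION BY event_date
-- Within each partition, rows are sorted by user_id
CLUSTER BY user_id;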

Additional Explanation:

Columnar Format Advantages: Storing data column-wise means that when you run queries that target specific columns (for example, aggregating event values), BigQuery only reads the necessary columns. This reduces I/O, speeds up queries, and saves on compute costs.
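The difference can be sketched with a toy model in Python. This is not BigQuery's real storage format: the sample rows, the helper names, and the "values read" counters are all hypothetical, and only serve to compare how much data a row layout and a column layout must touch for the same aggregation.

```python
# Sample rows shaped like the table above (values are made up for illustration).
rows = [
    {"user_id": 1, "event_type": "click", "event_date": "2024-01-01", "event_value": 2.5},
    {"user_id": 2, "event_type": "view", "event_date": "2024-01-01", "event_value": 1.0},
    {"user_id": 1, "event_type": "click", "event_date": "2024-01-02", "event_value": 4.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(rows):
    """A row layout has to read every field of every row, whatever the query needs."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column layout reads only the columns the query references."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(rows)
total = sum(columns["event_value"])  # equivalent of SELECT SUM(event_value)

print(total)                                              # 7.5
print(values_read_row_store(rows))                        # 12
print(values_read_column_store(columns, ["event_value"])) # 3
```

Even in this tiny example the column layout touches a quarter of the values; on a wide table with many columns the gap grows accordingly.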