How to get started with BigQuery

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
ever wondered how to unlock your data insights with bigquery stay tuned to find out more hi everyone i'm matild and i'm valeria and we're both customer engineers here at google cloud welcome back to the technical guides for startups a video series where we help startups to start build and grow their businesses successfully and sustainably on google cloud platform in our previous video we touched on firestore and datastore to document based databases optimized for your web and mobile applications and today we will be talking about bigquery google cloud's fully managed data warehouse that enables scalable analysis over petabytes of data in this episode we are going to cover what is bigquery and where to use it we look into building data etl pipelines with bigquery a goal over advanced features such as bigquery ml and bigquery bi engine then we will talk about bigquery pricing and optimization and take a quick tour into bigquery console and we will finish up with a customer story all right so let's get started with what bigquery actually does bigquery is at the core of google's data cloud it is a data warehouse designed to support all kinds of data sets including massive data sets such as log data from thousands of retail systems or iot data from millions of vehicle sensors across the globe this very same architecture is used for powering some of google's public products organizations rely on bigquery to aggregate data from disparate sources process it and make it readily available for data analysis that support decision making and innovation storing and querying massive data sets can be time consuming and expensive without the right hardware and infrastructure as a data warehouse bigquery is designed to help you ingest store analyze and visualize big data with ease you can simply ingest your data as batch loads or streaming for real-time insights bigquery supports both structured and semi-structured data such as json files for example with bigquery you can store massive data sets up to petabytes of data in a structured tabular format there are two different objects to notice in bigquery data sets and tables tables will be your usual data in tabular format and data sets will regroup those tables as folders to avoid storing duplicates of data you can also federate bigquery with other google cloud databases such as cloud storage without copying the data in bigquery storage but simply making a connection between products in bigquery you can use standard sql for easy and super fast querying on those massive data sets for aggregation from different sources and transforming your data with data manipulation language finally for visualization bigquery easily integrates with reporting tools such as looker spreadsheets google's at no cost data visualization tool data studio or many other products from our partner ecosystem let's dive into the different characteristics of bigquery what is bigquery google bigquery is fully managed so you can focus on analyzing your data up to petabyte scale without worrying about the scaling of the underlying infrastructure google bigquery is ml predictive enabling your data scientists or analysts to build and operationalize ml models with only a few lines of sql you can train and execute your models on data stored in bigquery without needing to move it around google bigquery is real time stream data and accelerate business decisions by getting your analysis on the spot bigquery is multi-cloud and with bigquery omni you can ingest and analyze data from other cloud providers so as you learned bigquery is a great tool when it comes to data analysis and enables users with quick access to the data insights with no extensive programming skills but what about large data sets management and machine learning as you know boss required knowledge of ml frameworks quite oftenly this excludes data analysts who understand the data but have limited ml knowledge and programming expertise bigquery ml empowers data analysts to use machine learning through existing sql tools and skills without needing to move your data out of bigquery with bigquery ml you can train and deploy machine learning models directly using sql bigquery ml takes care of data extracting model training and deploying for you that means you've got data storage data analytics and machine learning all within bigquery as your business data keeps growing in volume over time your data analytics pipeline have to be scalable to adapt to the rate of change let's see how bigquery can help us build an atl pipeline to accommodate the scale but first let's define what is a pipeline and what does etl mean a data pipeline is a series of data processing steps each step delivers an output that is an input to the next step etl refers to a specific type of data pipeline and it stands for extract transform and load another popular pattern is elt extract load and transform the transformation can be done inside of bigquery extract refers to pulling data out of source transform is about modifying the data so that it can be loaded into the destination and load is about inserting the data into the destination this is a common way for organizations to combine data from multiple systems into a single data warehouse etl can be used to store legacy data or as more typically today aggregate and process data to analyze and drive business decisions bigquery is a great choice for your etl data pipeline let's take a look at the architectural components of an etl solution for google bigquery using google cloud dataflow google cloud pops up and google cloud app engine chrome as building blocks as an example we can build a task orchestrator using google app engine chrome service google cloud pops up control topic and google cloud data flow in streaming mode then we will be using google cloud data flow for importing bounded bachelor data from sources such as relational google cloud sql database for example mysql or postgresql we as a gdbc connector and then we will be storing the files in google cloud storage then google cloud dataflow will be used for importing unbalanced streaming data from google cloud pops up data ingestion topic and after that bigquery will be used for storing staging and final data sets an additional etl transformation will be enabled via google cloud data flow and ended sql statements an interactive dashboard in data studio can be connected to bigquery just to visualize your data insights all these components are examples of fully managed services on google cloud platform with this architecture there is no infrastructure for you to deploy manage secure or scale and you only pay for what you use when it comes to data lifecycle management it's important to take into account data governance principles data governance is everything you do to ensure your data is secure private accurate available and usable inside your data warehouse we recommend that you build your data governance practice around three key components choosing the right framework that enables people to define agree to and enforce data policies creating effective processes for control oversight and administration over all data assets across on-prem cloud storage and data warehouses platforms and of course choosing the right tools and technologies for overseeing and supporting data policy compliance bigquery offers built-in data protection at scale within bigquery users can assign data set level and project level permissions to help govern data access and also we have data encryption set by default to ensure data security at all levels so what does it cost you to use bigquery let's take a look at pricing you pay for storage queries and some ingestion and export use cases storage costs are based on the amount of data stored and has two rates based on when the data was modified for the last time tables or partitions in long-term storage meaning that they weren't modified in the last 90 days see their storage price drop by 50 query costs are either on demand meaning that you'll be charged for the number of bytes processed by query the first terabyte process is offered or flat rate with flat rate you purchase dedicated processing capacity to run queries at a reduced price those resources are called slots they are virtual cpus with flat rates you can make commitments by the second the month or the year the rule of thumb is that you should consider flat rate once your bigquery processing cost exceeds 10k of dollars of widgets note that you are not charged for queries that return an error or for queries that retrieve results from the cache loading and exporting batch data to bigquery is at no cost you'll be paying for streaming inserts and exports and the use of the bigquery api you can find the link to detailed pricing in the description and what about performance and cost there are several ways to optimize them in bigquery let's start with query processing you should only query the data you need first use the preview option to explore your data before you write a query it's free when you write a query avoid the select all statement filter your query as early and as often as possible and avoid sql entity patterns such as joins that generate more outputs than inputs finally check how much your query is going to be charged by seeing the number of bytes processed in the ui or do dry rounds to get an estimation of your costs use cash intelligently bigquery writes all your query results to your table you either explicitly identify the destination table or it is cached as a temporary table temporary cached results tables are maintained per user per project there are no storage costs for temporary tables enforce cost controls to avoid human errors enforce max limits on bytes processed per query or at the user and project level partitioning and clustering your tables improves query performance and can help reduce costs by reducing the amount of data processed by queries you can partition by ingestion time date time stamp or integra range for example active assist recommendations guide you through the optimal billing model based on your usage the bigquery slot recommender helps on-demand customer estimate the number of slots when migrating to a flat rate pricing and does so by providing four options on a cost performance scale concerning storage keep in mind to enforce data retention policies by setting up expiration at the data set table or even the partition level also make sure you leverage federated data access you can avoid duplicate copies of data by connecting bigquery to google cloud storage and avoiding bigquery storage costs now let's take a quick tour to the console in this demo we will explore a public data sets for new york city city bike trips in bigquery the eric is as follows here i can see my project i can click on my data set and my data set here contains two tables city bike stations and cd bike trips i can visualize the schema of a table which gives me the different columns their type and their description in details i can view the size of a table and the number of rows here you can see we have a large data set of more than 33 million rows the preview window and this is my favorite feature allows you to visualize a part of the table and see what the data looks like without performing any query let's try to do a query now it is very straightforward you simply have to click on the query button and compose your query with sql language so here i'm gonna query the whole table i am gonna disable the cache so you can see how long it takes to query the whole table in bigquery you can also visualize the number of bytes bigquery will process with your query so you can keep an eye on billing so here let's wait and see how long it takes for bigquery to query the whole 33 million row table so here as you can see it only took 29 seconds to query the whole table all right so here let's do another query and see how fast bigquery is for more complex queries as you can see as soon as you group or add a condition both the duration and amount of bytes shuffled dramatically decrease you can perform any and very complex sql query you want to make your analysis and extract your insights once you have gathered the information you need you can save your results either in another bigquery table google sheets or csv or json file to share externally within your drive or through a local file let's export our results to a bigquery table i can now see the table added to my data set and my project the location of your table will be the one of your data set by inheritance the dataset can be located in a specific region london or a multi-region larger geographic area like the us location of your data is an important concept for query processing and loading and extraction of your data be sure to try the query sandbox it lets you explore bigquery capabilities at no cost and confirm that it fits your needs the sandbox lets you experience bigquery without providing a credit card creating a billing account or enabling billing for your project and if you wanted to see a real-life example of a big query power in action let's take a look at card.com an e-commerce platform which helps to democratize e-commerce by giving brands of all sizes the full capabilities they need to take on the world's largest online retailers supported by the startup program by google cloud and google cloud solutions their end-to-end environment empowers retailers to keep more of their revenue set up proven strategies for managing all aspects of their business and act on meaningful insights from customers data every step of the way so all in all bigquery is a go-to solution when it comes to getting concise from your data we hope that this has given you a good understanding on how you can get started on bigquery coming up in our next session we'll be covering how to run spark and hadoop jobs on google cloud and we will talk more specifically about dataproc for hadoop clusters and serverless park don't forget to like and subscribe to our youtube channel click on the bell icon to be notified each time a new video is posted thank you stay tuned and we will see you very soon in the next video [Music]
Info
Channel: Google Cloud Tech
Views: 41,018
Rating: undefined out of 5
Keywords: Google Cloud, Startup, Startups, GCP, VC, Series
Id: BH_7_zVk5oM
Channel Id: undefined
Length: 17min 19sec (1039 seconds)
Published: Thu Jul 28 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.