Google BigQuery Tutorial | Analyze Data in BigQuery | Google Cloud Platform Training | Edureka

Video Statistics and Information

Captions
[Music] Storing and querying massive data sets can be time consuming and expensive without the right hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure, so you can simply move your data into BigQuery and let it handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view and query your data.

Hello everyone, this is Dhruv from Edureka, and I welcome you all to this session where I will be talking about Google BigQuery. So without any further ado, let's take a look at today's agenda. We will start this session by first understanding why there is a need for Google BigQuery and what it actually is. Then we will look at some of its key features and its high-level architecture. Moving ahead, we will get an overview of BigQuery's storage as well as techniques for ingesting data into it. Furthermore, we will also look at the different pricing models for BigQuery and try our hands at implementing it practically on Google Cloud Platform. Finally, we will look at some of its use cases and a case study. Before we begin, do consider subscribing to our YouTube channel and hit the bell icon to stay updated on trending technologies, and if you are interested in online training and certification in Google Cloud Platform, check out the link given in the description box below.

First, let's understand why Google released BigQuery and why you would use it instead of a more established solution. Google BigQuery was designed as a cloud-native data warehouse, built to address the needs of data-driven organizations in a cloud-first world. BigQuery is GCP's serverless, highly scalable, and cost-effective cloud data warehouse. It provides both batch and streaming modes to load data, and it also allows importing data directly from certain software-as-a-service applications using the BigQuery Data Transfer Service. Second is ease of implementation: building your own data warehouse is expensive, time consuming, and difficult to scale, whereas with BigQuery you just need to load data and pay only for what you use. Third is speed: it processes billions of rows in seconds and handles real-time analysis of streaming data.

Now that you have understood why we need BigQuery, let's understand what it actually is. BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a platform as a service that supports querying using ANSI SQL. It also has built-in machine learning capabilities, so you can build and operationalize machine learning solutions with simple SQL. You can easily and securely share insights within your organization and beyond as datasets, queries, spreadsheets, and reports. BigQuery allows organizations to capture and analyze data in real time using its powerful streaming ingestion capability, so that your insights are always current, and it's free for up to one terabyte of data analyzed each month and 10 GB of data stored. The BigQuery service manages the underlying software as well as infrastructure, including scalability and high availability.
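To make this concrete, here is a minimal sketch (not from the video) of running a standard SQL query against one of BigQuery's public datasets with the Python client library. The project ID is a placeholder you would replace with your own, and the query stays well within the free tier.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# Assumes application-default credentials; "my-gcp-project" is a placeholder.
client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Runs the query on BigQuery's infrastructure and waits for the result rows.
for row in client.query(sql).result():
    print(row.name, row.total)
```

The same statement can be pasted directly into the BigQuery web console; the client library simply submits it as a query job on your behalf.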
Now let's look at some of the key features of BigQuery. First of all, it's serverless: serverless data warehousing gives you the resources you need when you need them, so with BigQuery you can focus on your data and analysis rather than operating and sizing computing resources. Second is petabyte scale: BigQuery is fast and easy to use on data of any size, and you get great performance on your data while knowing you can scale seamlessly to store and analyze petabytes more without having to buy more capacity. Third is real-time analytics: BigQuery's high-speed streaming ingestion API provides a powerful foundation for real-time analytics, letting you analyze what's happening now by making your latest business data immediately available for analysis. Fourth is flexible pricing models: BigQuery enables you to choose the pricing model that best suits you; on-demand pricing lets you pay only for the storage and computation that you use, while flat-rate pricing enables high-volume users or enterprises to choose a stable monthly cost for analysis. Fifth is automatic high availability: free data and compute replication in multiple locations means your data is available for query even in the case of extreme failure modes, and BigQuery transparently and automatically provides durable, replicated storage and high availability with no extra charge and no additional setup. Sixth is data encryption and security: you have full control over who has access to the data stored in BigQuery; it is easy to maintain strong security with fine-grained identity and access management through Cloud IAM, and your data is always encrypted at rest and in transit. Seventh is standard SQL: BigQuery supports a standard SQL dialect which is ANSI:2011 compliant, reducing the need for code rewrites and allowing you to take advantage of advanced SQL features, and it provides free ODBC and JDBC drivers so your current applications can interact with BigQuery's powerful engine (a relief for people like me who hate having to learn new tools every day). Eighth is a foundation for AI: BigQuery provides a flexible, powerful foundation for machine learning and artificial intelligence; besides bringing ML to your data with BigQuery ML, integrations with Cloud ML Engine and TensorFlow enable you to train powerful models on structured data, and BigQuery's ability to transform and analyze data helps you get your data in shape for machine learning. Ninth and last is a foundation for BI, that is, business intelligence: BigQuery forms the data warehousing backbone for modern BI solutions and enables seamless data integration, transformation, analysis, visualization, and reporting with tools from Google and its technology partners.

Let's now dive deeper into BigQuery and get an architectural understanding of it. Under the hood, BigQuery employs a vast set of multi-tenant services driven by low-level Google infrastructure technologies like Dremel, Colossus, Jupiter, and Borg. It takes more than just a lot of hardware to make your queries run fast. BigQuery requests are powered by the Dremel query engine, which orchestrates your query by breaking it up into pieces and reassembling the results. Dremel turns your SQL query into an execution tree: the leaves of the tree are called slots and do the heavy lifting of reading the data from Colossus and doing any computation necessary, while the branches of the tree are mixers, which perform the aggregation. In between is shuffle, which takes advantage of Google's Jupiter network to move data extremely rapidly from one place to another. The mixers and slots are all run by Borg, which doles out hardware resources. Dremel dynamically apportions slots to queries on an as-needed basis, maintaining fairness among multiple users who are all querying at once; a single user can get thousands of slots to run their queries.
Dremel is widely used at Google, from Search to Ads and from YouTube to Gmail, so there's great emphasis on continuously making Dremel better. BigQuery users get the benefit of continuous improvements in performance, durability, efficiency, and scalability, without the downtime and upgrades associated with traditional technologies.

BigQuery also relies on Colossus, which is Google's latest-generation distributed file system. Each Google data center has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery, and distributed management. Colossus is fast enough to allow BigQuery to provide similar performance to many in-memory databases, while leveraging much cheaper yet highly parallelized, scalable, durable, and performant infrastructure. BigQuery leverages a columnar storage format and compression algorithm to store data in Colossus in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery users to scale to dozens of petabytes in storage seamlessly, without paying the penalty of attaching much more expensive compute resources, as is typical with most traditional databases.

To give you thousands of CPU cores dedicated to processing your tasks, BigQuery takes advantage of Borg, which is Google's large-scale cluster management system. Borg clusters run on dozens of thousands of machines and hundreds of thousands of cores, so a query that used 3,300 CPUs only used a fraction of the capacity reserved for BigQuery, and BigQuery's capacity is only a fraction of the capacity of a Borg cluster. Borg assigns server resources to jobs; the job in this case is the Dremel cluster. Machines crash, power supplies fail, network switches die, and a myriad of other problems can occur while running a large production data center. Borg routes around these failures, and the software layer is abstracted: at Google's scale, thousands of servers will fail every single day, and Borg protects users from those failures. Someone could unplug a rack in the data center in the middle of running your query and you would never notice the difference.

Besides the obvious need for resource coordination and compute resources, big data workloads are often throttled by networking throughput. Google's Jupiter network can deliver one petabit per second of total bisection bandwidth, allowing it to efficiently and quickly distribute large workloads. Jupiter networking infrastructure might be the single biggest differentiator in Google Cloud Platform: it provides enough bandwidth to allow 100,000 machines to communicate with any other machine at 10 Gbps. The network bandwidth needed to run our query would use less than 0.1 percent of the total capacity of the system. This full-duplex bandwidth means that locality within the cluster is not important: if every machine can talk to every other machine at 10 Gbps, racks don't matter. Traditional approaches to separation of storage and compute include keeping data in an object store like Google Cloud Storage or AWS S3 and loading the data on demand to virtual machines. This approach is often more efficient than co-tenant architectures like HDFS, but it is subject to local virtual machine and object storage throughput limits. Jupiter allows BigQuery to bypass this process entirely and read terabytes of data in seconds, directly from storage, for every SQL query. In the end, these low-level infrastructure components are combined with several dozen high-level technologies, APIs, and services like Bigtable, Spanner,
and Stubby to make one transparent, powerful analytics database: BigQuery. The ultimate value of BigQuery is not in the fact that it gives you incredible computing scale; it's that you are able to leverage this scale for your everyday SQL queries without ever so much as thinking about software, virtual machines, networks, and disks. BigQuery is truly a serverless database, and BigQuery's simplicity allows customers with dozens of petabytes to have a nearly identical experience to its free-tier customers.

Now that we have reviewed the high-level architecture of BigQuery, we need to look at BigQuery's storage organization and format, so let's dive right into it. First, let's understand the BigQuery resource model. BigQuery organizes data tables into units called datasets, and these datasets are scoped to your GCP project. These multiple scopes (project, dataset, and table) help you structure your information logically: you can use multiple datasets to separate tables pertaining to different analytical domains, and you can use project-level scoping to isolate datasets from each other according to your business needs. The first level is projects: projects are root namespaces for objects, and they contain multiple datasets, jobs, access control lists, and IAM roles; they also control billing, users, and user privileges. Second is datasets, which are collections of related tables or views together with labels and descriptions; they allow access control at the dataset level and define the location of the data, which is either multi-regional (such as US or EU) or regional (such as asia-northeast1). Then there are tables, which are collections of columns and rows stored in managed storage and defined by a schema with strongly typed columns of values; they also allow access control at the table level and the column level. Next are views, which are virtual tables defined by a SQL query and which allow access control at the view level. The last one is jobs, which are actions run by BigQuery on your behalf, like loading data, exporting data, copying data, or querying data; jobs are executed asynchronously (a small sketch of creating these resources follows below).

The next thing we have is storage management, so let's review how BigQuery manages the storage that holds your data. Traditional relational databases like MySQL store data row by row, meaning record-oriented storage; this makes them good at transactional updates and OLTP (online transaction processing) use cases. BigQuery, on the other hand, uses columnar storage, where each column is stored in a separate file block; this makes BigQuery an ideal solution for OLAP (online analytical processing) use cases. You can stream data easily into BigQuery tables and update or delete existing values; BigQuery supports mutations without limits, mutations like INSERT, UPDATE, MERGE, or DELETE. BigQuery uses variations and advancements on columnar storage: internally, it stores data in a proprietary columnar format called Capacitor, which has a number of benefits for data warehouse workloads. BigQuery uses a proprietary format because the storage engine can then evolve in tandem with the query engine, which takes advantage of deep knowledge of the data layout to optimize query execution. Each column in the table is stored in a separate file block, and all the columns are stored in a single Capacitor file, which is compressed and encrypted on disk.
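As a rough illustration of the project, dataset, and table hierarchy described above, here is a hedged sketch using the Python client. The dataset and table names are hypothetical placeholders, not anything shown in the video.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")   # project: the root namespace

# Dataset: a collection of tables/views with a location and access controls.
dataset = bigquery.Dataset("my-gcp-project.demo_bigquery")
dataset.location = "US"                               # multi-regional location
client.create_dataset(dataset, exists_ok=True)

# Table: columns and rows in managed storage, defined by a strongly typed schema.
schema = [
    bigquery.SchemaField("tag", "STRING"),
    bigquery.SchemaField("tag_count", "INTEGER"),
]
table = bigquery.Table("my-gcp-project.demo_bigquery.top_ten_tech", schema=schema)
client.create_table(table, exists_ok=True)
```

Queries, loads, exports, and copies against these resources would all run as asynchronous jobs, which is the last resource type listed above.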
BigQuery uses query access patterns to determine the optimal number of physical shards and how the data is encoded. The actual persistence layer is provided by Google's distributed file system, Colossus, where data is automatically compressed, encrypted, replicated, and distributed. Colossus ensures durability by using erasure encoding to store redundant chunks of data on multiple physical disks. This is all accomplished without impacting the computing power available for your query: separating storage from compute allows you to scale to petabytes of storage seamlessly, without requiring additional expensive compute resources.

There are a number of other benefits to decoupling compute and storage. You can also take advantage of long-term storage. You can load data into BigQuery at no cost, because BigQuery storage costs are based on the amount of data stored (the first 10 GB is free each month) and on whether the storage is considered active or long-term. If you have a table or partition that was modified in the last 90 days, it is considered active storage and incurs a monthly charge for data storage at BigQuery storage rates. If you have a table or partition that has not been modified for 90 consecutive days, it is considered long-term storage, and the price of storage for that table automatically drops by 50 percent, to roughly the same cost as Cloud Storage Nearline. The discount is applied on a per-table, per-partition basis, and if you modify the data in the table, the 90-day counter resets. A best practice when optimizing cost is to keep your data in BigQuery rather than exporting your older data to another storage option such as Cloud Storage, and to take advantage of BigQuery's long-term storage pricing. This means not having to delete old data or architect a data archival process; since the data remains in BigQuery, you can also query older data using the same interface, at the same cost levels, and with the same performance characteristics.

Let's now learn how to load, or ingest, data into BigQuery and analyze it. There are multiple ways to load data into BigQuery depending on your data sources, data formats, load methods, and use cases: the first is batch ingestion, the second is streaming ingestion, and the third is the Data Transfer Service. The first one is batch ingestion. Batch ingestion involves loading large, bounded data sets that don't have to be processed in real time. They are typically ingested at specific regular frequencies, and all the data arrives at once or not at all. The ingested data is then queried for creating reports or combined with other sources, including real-time data. BigQuery batch load jobs are free: you only pay for storing and querying the data, not for loading it. For batch use cases, Cloud Storage is the recommended place to land incoming data; it is a durable, highly available, and cost-effective object storage service. Loading from Cloud Storage into BigQuery supports multiple file formats, like CSV, JSON, Avro, Parquet, and ORC.
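Here is a minimal sketch of such a batch load job from Cloud Storage using the Python client, assuming a hypothetical bucket and a CSV file with a header row. As noted above, the load itself is free; you pay only for storage and querying.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # JSON, Avro, Parquet, and ORC also work
    skip_leading_rows=1,                      # skip the CSV header row
    autodetect=True,                          # infer the schema from the file
)

load_job = client.load_table_from_uri(
    "gs://my-landing-bucket/events/2021-07-01.csv",   # hypothetical source file
    "my-gcp-project.demo_bigquery.events",            # hypothetical destination table
    job_config=job_config,
)
load_job.result()   # block until the batch load job finishes

table = client.get_table("my-gcp-project.demo_bigquery.events")
print(f"Loaded {table.num_rows} rows")
```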
The second one is streaming ingestion. Streaming ingestion supports use cases that require analyzing high volumes of continuously arriving data with near real-time dashboards and queries. Tracking mobile app events is one example of this pattern: the app itself, or the servers supporting its back end, could record user interactions to an event ingestion system such as Cloud Pub/Sub and stream them into BigQuery using data pipeline tools such as Cloud Dataflow, or you can go serverless with Cloud Functions for low-volume events. You can then analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real time. BigQuery streaming ingestion allows you to stream your data into BigQuery one record at a time by using the tabledata.insertAll method. The API allows uncoordinated inserts from multiple producers. Ingested data is immediately available to query from the streaming buffer within a few seconds of the first streaming insertion, although it might take up to 90 minutes for the data to become available for copy and export operations. One of the common patterns for ingesting real-time data on Google Cloud Platform is to read messages from a Cloud Pub/Sub topic using a Cloud Dataflow pipeline that runs in streaming mode and writes to BigQuery tables after the required processing is done. The best part of a Cloud Dataflow pipeline is that you can use the same code for both streaming and batch processing, and Google will manage the work of starting, running, and stopping compute resources to process your pipeline in parallel. The reference architecture you can see here covers this use case; I hope you have understood by now how Cloud Dataflow, Cloud Pub/Sub, and Cloud Storage work together to build such a pipeline. Please note that you have options beyond Cloud Dataflow to stream data to BigQuery: for example, you can write streaming pipelines in Apache Spark and run them on a Hadoop cluster such as Cloud Dataproc using the Apache Spark BigQuery connector, or you can call the streaming API in any client library to stream data into BigQuery.
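Here is a hedged sketch of calling the streaming API through the Python client library, which wraps the tabledata.insertAll method mentioned above; the table name and payload are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Rows streamed one at a time (or in small batches) become queryable from the
# streaming buffer within a few seconds of the insert.
rows = [
    {"user_id": "u123", "event": "page_view", "ts": "2021-07-09T10:15:00Z"},
    {"user_id": "u456", "event": "search",    "ts": "2021-07-09T10:15:02Z"},
]

errors = client.insert_rows_json("my-gcp-project.demo_bigquery.app_events", rows)
if errors:
    print("Some rows failed to insert:", errors)
```

In a real pipeline this call would typically sit behind Pub/Sub and Dataflow, as described above, rather than being invoked directly by each producer.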
The third one is the Data Transfer Service. The BigQuery Data Transfer Service (DTS) is a fully managed service to ingest data from Google software-as-a-service apps such as Google Ads, from external cloud storage providers such as Amazon S3, and from data warehouse technologies such as Teradata and Amazon Redshift. DTS automates data movement into BigQuery on a scheduled and managed basis, and it can be used for data backfills to recover from any outages or gaps. Think of the Data Transfer Service as an effortless data delivery service to import data from applications into BigQuery.

Let's now look at the pricing criteria and models of Google BigQuery. BigQuery pricing has two main components. The first is analysis pricing, which is the cost to process queries, including SQL queries, user-defined functions, scripts, and certain data manipulation language (DML) and data definition language (DDL) statements that scan tables. The second is storage pricing, which is the cost to store data that you load into BigQuery. Each project that you create has a billing account attached to it, and any charges incurred by BigQuery jobs run in the project are billed to that attached billing account; the storage charges are also billed to it. You can view BigQuery costs and trends by using the Cloud Billing reports page in the Cloud Console.

Let's discuss the first one, the analysis pricing models. BigQuery offers a choice of two pricing sub-models for running queries. The first is on-demand pricing: with this model you are charged for the number of bytes processed by each query, and the first one terabyte of query data processed per month is free (see the dry-run sketch below for estimating this before you run a query). The second is flat-rate pricing: with this model you purchase slots, which are virtual CPUs; when you buy slots, you are buying dedicated processing capacity that you can use to run queries. Slots are available in the following commitment plans: flex slots, where you commit to an initial 60 seconds; a monthly plan, where you commit to an initial 30 days; and an annual plan, where you commit to 365 days. With monthly and annual plans you receive a lower price in exchange for a longer-term capacity commitment. You can combine both models to fit your needs: with on-demand pricing you pay for what you use, but your queries run using a shared pool of slots, so performance can vary; with flat-rate pricing you purchase guaranteed capacity at a discounted price in exchange for a longer-term commitment.

Let's briefly understand the first one, on-demand analysis pricing. By default, queries are billed using the on-demand pricing model. With on-demand pricing, BigQuery charges for the number of bytes processed, whether the data is stored in BigQuery or in an external data source such as Cloud Storage, Drive, or Cloud Bigtable; on-demand pricing is based solely on usage. This is how the on-demand pricing structure looks: on-demand queries are charged at five dollars per TB, and the first 1 TB per month is free. If you pay in a currency other than US dollars, the prices listed in your currency on the Cloud Platform SKUs page apply; there you can choose your currency and filter by product, SKU name, SKU ID, service region, or service.

Now let's briefly discuss the second one, flat-rate pricing, which is available for high-volume customers who prefer a stable monthly cost. BigQuery offers flat-rate pricing for customers who prefer a stable cost for queries rather than paying the on-demand price per terabyte of data processed; to enable flat-rate pricing, you use BigQuery Reservations. When you enroll in flat-rate pricing, you purchase dedicated query processing capacity measured in BigQuery slots. Your queries consume that capacity and you are not billed for bytes processed; if your capacity demands exceed your committed capacity, BigQuery queues up the extra work and you are not charged additional fees. First, there are flex slots, which have short-term commitments: flex slots are a special commitment type whose commitment duration is only 60 seconds, you can cancel them anytime thereafter, and you are charged only for the seconds your commitment was deployed. Flex slots are subject to capacity availability: when you attempt to purchase flex slots, the success of the purchase is not guaranteed, but once your commitment purchase succeeds, your capacity is guaranteed until you cancel it. The table shown here gives the cost of the flex slot commitment: the hourly cost is 4 dollars for 100 slots, which works out to about 2,920 US dollars per month based on an average of 730 hours per month. Then there are monthly flat-rate commitments; the table shows the cost of a monthly slot commitment, around 2,000 dollars for 100 slots. There are also annual flat-rate commitments: the table shows the cost of an annual slot commitment, where the monthly cost is around 1,700 dollars for 100 slots, paid in US dollars for the US multi-region.
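Since on-demand pricing charges by bytes processed, it is worth estimating a query before running it. Below is a small sketch using a dry run with the Python client; the five-dollars-per-TB rate quoted above is treated as an assumption, since on-demand rates can vary by region and change over time.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT tags
    FROM `bigquery-public-data.stackoverflow.posts_questions`
"""

# A dry run validates the query and reports the bytes it would scan,
# without actually executing it or incurring query charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tb_processed = job.total_bytes_processed / 1024 ** 4
print(f"Would process {tb_processed:.4f} TB, "
      f"roughly ${tb_processed * 5:.2f} at an assumed $5/TB on-demand rate")
```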
Let's now move on to the second major pricing category, storage pricing. Storage pricing is the cost to store data that you load into BigQuery, and you pay for active storage and long-term storage. Active storage includes any table or table partition that has been modified in the last 90 days, while long-term storage includes any table or table partition that has not been modified for 90 consecutive days; the price of storage for such a table automatically drops by approximately 50 percent. There is no difference in performance, durability, or availability between active and long-term storage. The first 10 GB of storage per month is free; in the pricing table you can select the operation, active storage or long-term storage, and the per-GB pricing is shown: active storage is 0.020 dollars per GB and long-term storage is 0.010 dollars per GB, so you can see how much is saved with long-term storage, and in both cases the first 10 GB is free each month.

Now that you have a theoretical and architectural understanding of the Google BigQuery service, let's see a practical implementation by trying our hands at running it on Google Cloud Platform. You can just go to the Google Cloud Platform console. I already have a Google Cloud Platform account; if you don't have one, create one, as it's a very good platform to have an account on. You do have to give your credit or debit card details; it will debit one rupee, which is refunded instantly, and you will get 300 dollars of free credit for 90 days. You can use that credit for the demo I'm showing you and for various other services on Google Cloud Platform. You can see my free trial is over, but the demo I'm going to show you doesn't cost anything anyway, so let's start. We can go to BigQuery; you will find it under the Big Data services. Here's Big Data, and this is BigQuery, so just open it. What you have to do is create a dataset: you can click here and create a dataset under the project, giving it a dataset ID like demo_bigquery, then click Create Dataset, and then create a table in it. One note: the very first step is actually to create a project; it's a Google Cloud Platform project, like my project named demo (I have other projects as well, and you can create a new project from here); only then can you start with BigQuery. So you can create a table under the dataset: this is the project named demo, then the dataset named demo_bq (meaning BigQuery), and then the table name. What we are going to do today is import a public dataset named Stack Overflow; I hope you know about the Stack Overflow website, where questions about all kinds of technologies are posted and answered by the people who engage there. What we are going to find out is which technology had the most questions posted over the last decade. So we can give the table a name like top_ten_tech, since we have to find the top 10 technologies.
And yes, the table has been created, so here it is: inside the project we have demo, and in demo we have the top_ten_tech table. Now what we have to do is add data. You can go to Add Data, where you can pin a project or insert a dataset; if you have an external dataset, very small or very big, it depends on what kind of dataset you want to work with, but right now we are going with public datasets, so we can just import one. There are various datasets provided here; you can see the latest one is the COVID-19 public dataset, which has information about patients and how the infected case counts are rising or falling, with all the statistics and analytics on the COVID-19 data by state and country. But we are going with Stack Overflow, so we will just type "stack overflow" here, and here it is. You can see the overview: Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers; we know all these things. Then there are samples: a few examples of SQL queries you can run against the data, covering the questions posted, the answers, and so on. Now we view the dataset, which takes us to another BigQuery window, back to the query page. So here we are: the dataset has been imported and stackoverflow is here. If you open it, there are several tables, like badges, comments, and post_links; there are multiple tables in there, but we want the questions, that is, how many times questions were posted, so we will choose the table named posts_questions. As you open it, let's expand the view. First of all, the schema is shown for the table: you can see the column names (you could call them features), their data types, and whether they are nullable; everything is given. Then there are the details of the table: you can see its size is really huge, not small in any sense, 33.4 GB is a lot, and there are about 20 million rows. You can also see the dates, when it was created and last modified, and the data location; everything is given here. Then you can see the preview as well, which shows what these columns actually contain: accepted_answer_id, comment_count, and all the other columns are there, but the most important one for us is tags. Using this tags column, we have to write a query to find the top 10 technologies. So let's start by querying the table. We have to select and split the tags, working on this tags column, so we split the tags text. A default limit of 10,000 is given, which we can remove, and then we specify where to extract from.
Because we are looking at the last decade, we need a date column to filter on; there is this creation_date column, so we extract from there. We only want the year, so we extract the year from creation_date. We also have to work with the flattened tags: we select the flattened tags and a tag count, and join them in below. The tags won't be flattened on their own, so we cross join with them; you can use UNNEST for that, joining the table with its own flattened tags, which works like a self-join, you could say. Then you group by the flattened tags and order by tag count descending, because we need the technology with the maximum count to come first, and finally we limit it to 10 because we want the top ten. Now it can be run: the validator shows the query will process 73.9 MB, so we can just run it. You can see all the results now, all the technologies: JavaScript is topping the list, then there's Java, then there's Python. The count for JavaScript is about 2.1 million tag occurrences, followed by others like CSS and C++; for the last decade you can see how many times each technology was tagged. So we are successful here, and the job information is all given. What we can do now is save the results. You can save them to a table, like the table we created, or save them as a CSV or JSON file, or even to Google Sheets, but we will save them as a BigQuery table; the table name was top_ten_tech, so we save it to that table. You can see the job is running, and then the table has been created and the result has been saved to it. Now if you go to the project, we have this dataset, and under it this table; you can see its details, like the flattened_tags and tag_count columns, the total table size of 148 bytes, and the preview showing the saved table. Before saving the view, note that you can also share this table, copy the table, delete the table, export it to Data Studio, or export it to Google Cloud Storage; all these things you can do. Then you can save a view: if you save this view, it shows under Views how the query behind the table works and what it outputs. You can also schedule the query, choosing how you want it to run, weekly or daily or monthly, however you want to schedule it. And then there are the query settings: you can choose the query engine, like the Cloud Dataflow engine or the BigQuery engine, and if you run a query and forget to save the results, they are kept in a temporary table that is created for you. After that there are the job priority, the destination table, and all the other settings you can choose.
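The exact statement isn't fully visible on screen, but a reconstruction along the lines the demo describes (split the pipe-delimited tags, flatten them with UNNEST in a cross join, restrict to roughly the last decade, group, order, and limit) would look something like the sketch below; the year boundary and the pipe delimiter are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT
      flattened_tags AS tag,
      COUNT(*) AS tag_count
    FROM
      `bigquery-public-data.stackoverflow.posts_questions`,
      UNNEST(SPLIT(tags, '|')) AS flattened_tags      -- flatten the tags column
    WHERE EXTRACT(YEAR FROM creation_date) >= 2011    -- roughly the last decade
    GROUP BY flattened_tags
    ORDER BY tag_count DESC                           -- highest counts first
    LIMIT 10                                          -- top 10 technologies
"""

for row in client.query(sql).result():
    print(row.tag, row.tag_count)
```

Pasting the same SQL into the console query editor, as in the demo, works identically; the Python wrapper just mirrors the scheduled-job style used later in the case study.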
One more cool feature I want to show you, and I can show it only if I run the query again, so let's run the query once more and come down to the execution details. What you can see here is how your query is split into multiple stages to get the results. These are the worker units doing the reading, computing, and writing, a concept broadly similar to how a Hadoop or Google File System cluster distributes work. You can see how the work is divided: the input stage reads about 20 million rows, sorting brings that down to about 2 million rows, and the output stage reduces it to around 130; in the final stage, after ordering in descending order and limiting to 10, it comes down to the top 10 technologies. I hope you have understood it; you can see how everything is provided right there on the dashboard, and none of it takes much time, which is the thing about BigQuery: it's very fast.

Let's now understand some of the use cases for BigQuery. The first one is migrating data warehouses to BigQuery: you can solve for today's analytics demands and seamlessly scale your business by moving to Google Cloud's modern data warehouse, and you can streamline your migration path from Netezza, Oracle, Redshift, Teradata, or Snowflake to BigQuery and accelerate your time to insights. The second use case is predictive analytics, which helps you predict future outcomes more accurately and discover opportunities in your business. Google's smart analytics reference patterns are designed to reduce time to value for common analytics use cases, with sample code and technical reference guides; you can learn how BigQuery and BigQuery ML can help you build an e-commerce recommendation system, predict customer lifetime value, and design propensity-to-purchase solutions (see the sketch below). The third is that you can bring any data into BigQuery: make analytics easier by bringing together data from multiple sources in BigQuery for seamless analysis. You can upload data files from local sources, Google Drive, or Cloud Storage buckets, and take advantage of the BigQuery Data Transfer Service, Data Fusion plugins, or Google's industry-leading data integration partnerships; you have ultimate flexibility in how you bring data into your data warehouse.
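For the predictive analytics use case, BigQuery ML lets you train and use a model with plain SQL. Below is a hedged sketch of a propensity-to-purchase style logistic regression; the training table and its columns are entirely hypothetical, not part of the video.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

create_model = """
    CREATE OR REPLACE MODEL `demo_bigquery.purchase_propensity`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['purchased']) AS
    SELECT
      purchased,          -- label: did the visitor buy?
      pageviews,
      time_on_site,
      device_category
    FROM `my-gcp-project.demo_bigquery.visitor_sessions`   -- hypothetical table
"""
client.query(create_model).result()   # training runs entirely inside BigQuery

predict = """
    SELECT predicted_purchased, predicted_purchased_probs
    FROM ML.PREDICT(MODEL `demo_bigquery.purchase_propensity`,
                    TABLE `my-gcp-project.demo_bigquery.new_sessions`)
"""
for row in client.query(predict).result():
    print(row.predicted_purchased)
```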
Let's now see a case study for Google BigQuery; it's the final topic, and then we will wrap up. Safari Books Online used BigQuery to solve a few key challenges: building detailed business dashboards to spot trends and manage abuse, improving sales team effectiveness through sales intelligence, and enabling ad hoc querying to answer specific business questions. We will go through all of these step by step. Safari chose BigQuery over other technologies because of the speed of querying using a familiar SQL-like language and the lack of required maintenance.

Safari Books Online has a large and diverse customer base, constantly searching for and accessing a growing library of over 30,000 books and videos from an increasing array of desktop and mobile devices. This activity stream contains powerful knowledge which they could use to improve their services and increase profitability, but it was locked up in mountains of usage data: trends such as top users, top titles, and connecting the dots for sales inquiries. Safari's usage data was much too massive to query online in its entirety with their previous toolset. Analysis could be done with third-party web analytics tools such as Omniture, but those tools lacked the ability to query and explore record-level data in real time, and in addition they didn't provide a great back end for developing visualizations. SQL-like querying had to be done on smaller chunks of the data and was labor intensive and slow. They were impatient waiting for MySQL queries to finish and were often in doubt as to whether they would finish at all; once you reach database client timeouts of 10 minutes, you are in the realm of not being able to do ad hoc analysis. Answering operational questions can be time sensitive, so speed is important, of course. Safari also played with Hadoop, but it proved to take a lot of resources to maintain, so they ended up putting it on the back burner for future projects. Then they learned about Google BigQuery through a video from Google I/O, determined it seemed perfect for their use case, and decided to try it out.

The first step is shown in this diagram, which tells you how Safari got their data into BigQuery. No matter what database platform you use, the extract-transform-load step can be a challenge depending on your data sources, and BigQuery is no exception, I would say. The diagram shows an overview of how the data flows through the landing and ETL servers leading up to BigQuery. The usage data comes from content delivery networks (CDNs) and web application server logs. This data was packaged into time-based batched chunks and automatically copied into a landing directory. It needed some transformation to load into BigQuery in a format where they could get the most use out of it. Here are the basic steps Safari went through: get a list of all source files previously loaded into BigQuery, validate the list of source files waiting to be loaded, transform the files, copy the files to Google Cloud Storage, load the files into BigQuery, and finally use the data once it is in BigQuery.

BigQuery proved to be a good platform for dashboard data and offers the capability of drilling further into the data behind the dashboard with the BigQuery browser-based UI for ad hoc analysis. Here you can see a dashboard used internally by Safari's operations team to monitor top users; if anything raises any questions, you can get more information about a user, a title, or an IP address by querying with the BigQuery browser tool. You can also see the dashboard Safari uses internally to keep an eye on trending top titles; again, if anything in the dashboard raises questions, BigQuery can answer them for you. The data is fed to these dashboards by an intermediary scheduled job executed via cron. This script queries BigQuery with the bq command-line tool, gets the result in JSON, and then contacts some other web services to get more information which is not stored in BigQuery, such as the user data and book thumbnail images; then it is all matched up and presented in the dashboard. This is a good pattern, as you wouldn't want to query BigQuery every time the dashboard loads, since that would get expensive; it is best to do the analytical heavy lifting with BigQuery and then store the results in something like a cheap LAMP stack for mass consumption.
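Safari's actual feeder was a cron script built on the bq command-line tool, but the same pattern is easy to sketch with the Python client: run the analytical heavy lifting in BigQuery on a schedule, dump the small result set to JSON, and let a cheap web stack serve that file. The table and column names here are placeholders, not Safari's real schema.

```python
import json
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT user_id, COUNT(*) AS page_views
    FROM `my-gcp-project.usage.web_logs`     -- hypothetical usage-log table
    WHERE DATE(ts) = CURRENT_DATE()
    GROUP BY user_id
    ORDER BY page_views DESC
    LIMIT 50
"""

# The expensive aggregation happens once, inside BigQuery.
top_users = [dict(row) for row in client.query(sql).result()]

# Cache the small result set; the dashboard reads this file instead of
# hitting BigQuery on every page load.
with open("top_users.json", "w") as f:
    json.dump(top_users, f, default=str)
```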
And then the second advantage Safari got from BigQuery is sales intelligence, which is another use case. BigQuery was great for combing Safari's web logs for leads which came into their sales department. This allowed them to relate a lead to other accounts in the system, or gauge the level of interest someone may have in Safari by the number of books they had read on the site in the past month. When a lead is created in Safari's CRM system, BigQuery asynchronously searches through the logs with a query like the one you can see written here. The result is returned quickly and attached to the lead record, allowing Safari's sales department to gain intelligence and contact the lead without having to wait around for the information. This gives them a nice summary of what users at the lead's IP address have done; while IP addresses are not perfect identifiers, they are much better than nothing. You can see in the usage history that someone from the lead's IP address has anonymously (user ID 0) viewed 322 pages of their books, and some other users who already have accounts with Safari are active from that address. Rather than use the BigQuery command-line interface, this use case was better suited to server-to-server OAuth, the authorization protocol Google itself uses. Using the Google APIs Client Library for PHP, this enabled them to create a web service which matches up data from different sources and returns the result to the CRM; one of those sources is BigQuery. One amazing testament to the speed of BigQuery is that they didn't even have to implement the web service asynchronously: it returns the results from BigQuery within the timeout period of the web service call that the CRM makes.

The other use case Safari took advantage of through BigQuery is pure ad hoc querying, driven not by dashboards but by specific business questions. When Safari released their Android apps, it was easy to run a query on their usage data to see Android usage grouped by day in the Google BigQuery browser tool, and it was easy to export the query results as a CSV file and load them into Google Spreadsheets to make a chart; you can see here the graph showing when they released their Android apps. And here's an example of looking at their top users by number of searches during a particular hour of the day. There really is no limit to the kinds of insights you can gain from investigating your data from an ad hoc perspective; there is a huge sense of power in being able to answer any question about billions of rows of data in seconds. For example, you can find out whether that strange traffic on your site is a foreign spam bot or a malicious user. As recommended by Google, they divided their data up into years; you can see this in the table. Normally it would be a bit of an inconvenience to have to write a union statement between the tables, but BigQuery has a nice shortcut for this, which creates a union between the tables (see the sketch below for the modern standard SQL equivalent).
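The video doesn't show the exact shortcut, and the legacy SQL trick it refers to has since been superseded; in today's standard SQL, the usual equivalent for time-sharded tables is a wildcard query with _TABLE_SUFFIX. A hedged sketch, assuming hypothetical per-year tables named usage_2011 through usage_2020 with identical schemas:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

sql = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-gcp-project.usage.usage_*`           -- matches usage_2011, usage_2012, ...
    WHERE _TABLE_SUFFIX BETWEEN '2011' AND '2020' -- pick which yearly shards to union
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.user_id, row.events)
```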
So they just stack on as many years back as they need to go, taking their time-sharded tables and unioning them together, since they have kept their schemas the same. At the end of each year they archive the active table and start a new one. This is the best of both worlds, as it provides the accessibility of having all your data in one big table with the performance of having the tables separate; as they grow, it will make sense to further shard the data into monthly increments instead of yearly ones. Now, the summary of this case study: generating meaningful knowledge out of massive amounts of data in a timely manner is a challenge that Safari Books Online, like many businesses, needed to solve. Being able to keep on top of trends as they happen, while you can still do something about them, instead of months after the fact, is very valuable. This led to lower abuse problems and top-line revenue increases, by gathering marketing intelligence on trending topics and collecting lead intelligence to close sales more effectively. That was the summary of the case study. With this we come to the end of today's session on Google BigQuery. I hope you had a great time learning and understanding it, and if you have any queries, please feel free to leave them down in the comment section below. Until next time, thank you. I hope you have enjoyed listening to this video; please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 21,459
Keywords: yt:cc=on, google bigquery, bigquery, big query, google cloud platform, bigquery tutorial, bigquery console, what is bigquery, gcp bigquery, google bigquery tutorial, google cloud, bigquery sql tutorial, google bigquery introduction, google bigquery pricing, what is google bigquery, what is cross join in sql, what is cross join in sql with example, unnest bigquery, sql in bigquery, bigquery datatypes, gcp console, bigquery create table, bigquery extract month from date, edureka
Id: jaNjk_jNdJ8
Length: 48min 9sec (2889 seconds)
Published: Fri Jul 09 2021