AWS Sydney Summit: Snowflake: Your Data. No Limits.

Video Statistics and Information

Captions
So I want to start you off with a quote. Leonardo da Vinci once said, "Simplicity is the ultimate sophistication," and on that theme of simplicity I want to show you today just how easy it is to work with Snowflake. The reason we can work with it in such a simple way is our architecture, which has been built from the ground up specifically to leverage the cloud. It's an extremely sophisticated architecture, and pretty much every feature I show you today boils back to it and to how we were able to build it fresh. It was originally built for AWS; we are multi-cloud today, but again, originally for AWS.

I'll give you a bit of background on the company: where we've come from, a little about the history and the founders, and what we are as a product. Snowflake was founded in 2012 by industry veterans with over 120 patents to their name. Two of our founders, Benoit and Thierry, came from deep in the bowels of Oracle as very senior leaders on its development team. Our third founder, Marcin, is literally one of the people who pioneered vectorized query processing. So we've got some incredibly smart talent that built Snowflake, again from the ground up, on a brand-new architecture. We currently have about a billion USD in venture capital funding from some huge Silicon Valley investors, taking our valuation to four and a half billion USD, and we're one of the fastest-growing SaaS companies of all time, if not the fastest. That flows into the employee count: over 1,300 employees today. When I started with the company about nine months ago I think I was number 520, so that's phenomenal growth in a really short space of time. We got our first customers on board in 2014 and went GA in 2015, having spent our first three years in stealth mode. Again, we built this product from the ground up; it's not based on anything else. I'm going to dive into the architecture today and talk about how Snowflake is different from every other on-prem solution, and even from other cloud solutions.

A few fun facts about Snowflake to kick us off before we get into the architecture. Of course I'm going to do a really thorough demo today as well, so I'll try to get through the slides as quickly as I can. Queries processed per day globally in Snowflake: 165 million. This is a huge system. Our largest single table, which belongs to Adobe: 68 trillion records. The largest number of tables within a single database: in excess of 200,000. The single customer with the most data is Capital One, one of the biggest banks in the US:
55 petabytes. They don't need to spin any of that down; it's all analytically available at their fingertips whenever they need it. And the most concurrent users on a system, in one account, at any given point in time: in excess of 10,000.

So what is Snowflake? We are a data warehouse, and we are ANSI SQL standard, so when you open Snowflake from day one it's going to look and feel like a very familiar SQL data warehouse, except that we've been built specifically for the cloud, and of course we're delivered as a service; we are a SaaS product, as I mentioned earlier. Throughout the presentation and the demo today, take note of all of the things I'm no longer having to do, all of the data janitorial work you've come to expect from other systems in the past, like writing indexes and managing partitions in your database. That's all gone. There is no performance tuning to be done in Snowflake: you simply pick up your data, pour it in, and it just works, going back to that vision of simplicity.

So why Snowflake? Why do you need yet another data warehouse? I want to give you six reasons why data warehouses today are just not working. Most data warehouses you pick up will solve one or two of these issues; Snowflake is the only one that's going to give you all six.

Complexity: I've already spoken to the simplicity piece. You don't need to think about provisioning for certain workloads anymore, and you don't need to worry about how much compute you're running; Snowflake flexes to any size you need, whenever you need it, and you'll see this in the demo as well. You've got a hundred megabytes today and a petabyte tomorrow? There's absolutely no capacity planning to be done. You don't need to run overnight jobs, because we have zero contention in the system: you can spin up an unlimited number of compute clusters over a single copy of the data.

Inadequate elasticity: in Snowflake you can scale both vertically, for performance, and horizontally, for concurrency. The concurrency piece is one of the biggest reasons Capital One moved across to Snowflake from other cloud products.

Rigid cost: we bill you per second. Think about running a query for two minutes: once that query finishes, all of your data is of course persisted, but you pay only for the two minutes you were running the query. (There's a rough worked example of the billing arithmetic after this section.)

Diversity of data: you no longer need to spin up siloed data marts because you've got contention in the system; Snowflake flexes to any size you need, whenever you need it, so it's all at your fingertips and all of your data is always analytically available. We can also handle semi-structured data: we can natively ingest into our VARIANT column, which is currently patent-pending and unique to Snowflake, so we natively ingest ORC, Avro, Parquet, XML, and JSON. We actually optimize and columnarize those payloads on load, and you can then interrogate them by simply referencing the key-value pairs that sit within them, using simple SQL syntax. It just works.

Data sharing, also very unique to Snowflake: we are now multi-cloud, so yes, we started on AWS, we've gone live on Azure as well, and by the end of the year we're going to be live on GCP too. Imagine a world where I've got an account on AWS and my customer has an account on GCP, and they want to share a table with me. Without ever having to leave Snowflake, I can select star from their table in another cloud. There is no one else doing this at the moment, and this is what truly excites our executives today.
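As a rough worked example of that per-second billing, using the rates mentioned later in the talk (an extra-small warehouse is one node and consumes one credit per hour, so an extra-large at 16 nodes consumes 16 credits per hour): a two-minute query on an extra-large would cost roughly 16 credits/hour × (120 / 3600) hours ≈ 0.53 credits, keeping in mind that each time a warehouse resumes there is a 60-second minimum charge before per-second billing kicks in.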
So how does Snowflake make things easier? I want to give you a bit of background on traditional architectures, then talk about Snowflake's architecture, which has been completely redesigned from the ground up, and explain how it's a little bit different.

Quickly, then: shared disk. With shared disk, I/O ops typically become the bottleneck as you start to scale. Both of these architectures were invented back in the mid-80s; a lot of people use the term "vintage" for something between thirty and a hundred years old, so I'd call both of these architectures fairly vintage today. They've been around for a long time and they work really well, but they have their challenges. With shared-nothing, because your storage layer and your compute layer are so tightly coupled, if you want to scale you're forced to scale both aspects: you want more CPU, you've got to take more storage with it; you want more storage, you've got to take more CPU. You don't really have that flexibility, so it's highly scalable but not exactly elastic.

With Snowflake we've separated those two layers completely, and we now have a new architecture we call multi-cluster shared data: you can spin up as many compute nodes as you want, in as many different siloed clusters as you want, over a single copy of the data, without ever having to replicate across multiple environments. Here's how that architecture looks. At the core, at the storage layer, we leverage S3, which is effectively infinitely expandable; the size of S3 in Snowflake is the size of AWS, so you can put as much in there as you want. That's why Capital One have 55 petabytes in their account, always analytically available. Then you can spin up as many compute clusters as you want over that single copy of the data, always with zero contention. So: unlimited storage scalability without ever having to refactor, and you can resize your clusters instantly. Got a four-node cluster and want to go to eight nodes? That takes a couple of seconds; you just hit a button and it's done. I'm going to jump into the demo really soon and show you exactly how that works.

Sounds great, show me more. This is the data set we're going to use for the demo today. I'm going to jump into Snowflake and then switch across into Tableau to visualize the data. We're going to load up a bunch of records from the Citi Bike NYC data set, a publicly available data set you can get access to too. Essentially we're going to look at the trips table: you pick up a bicycle at location A, you drop it off at location B, and you pay for a trip. I've created simple DDL, a table structure that I'm going to copy some data into out of an S3 bucket; this is my bucket, and I've got a bunch of zipped CSV files in there. The DDL and the stage look roughly like the sketch below. Then I'm going to jump over into the demo and show you this live.
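A minimal sketch of that setup, assuming illustrative column names and a placeholder bucket URL (the real Citi Bike trips schema has more columns, and a stage would normally also need credentials or a storage integration):

    create table trips (
      tripduration      integer,
      starttime         timestamp,
      stoptime          timestamp,
      start_station_id  integer,
      end_station_id    integer,
      bikeid            integer,
      usertype          string
    );

    -- external stage pointing at the bucket of gzipped CSV files
    create stage citibike_trips
      url = 's3://citibike-demo-bucket/trips/'          -- placeholder URL
      file_format = (type = csv compression = gzip);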
being S3 and this bit as being EC2, although while I talk about EC2 and S3, you never physically have to interact with those layers; we manage all of that for you. Here is my Citi Bike database that I've created. Just checking my time here; all right, we're good. I've created the trips table like I showed you before; it just has a bunch of empty columns in it right now. I'm going to load some data into this trips table and then we'll continue with the demo.

For the purposes of the demo I'm going to use two personas today. I'm currently logged in as John, who's our DBA, as you can see up here, sitting inside the DBA role. The second persona is going to be Jane, and she's going to be managing Tableau, so as we jump across into Tableau we'll have different permissions and a different role; she'll manage the BI side of things and we'll do a little bit of correlation analysis and things like that.

I'm going to load up some data, but before I do that I need some compute, because right now in my Snowflake account I'm paying for absolutely nothing: I've got no storage and I'm running no compute. All I've done is run DDL, which is free. So I'm going to create a virtual warehouse cluster, and I'm going to call it the load warehouse; we'll use it to load some data. The very first option I get here in Snowflake is the t-shirt sizing we use to select the size of your cluster. This is what we call vertical scalability: how big are your data sets, how big is the problem you want to tackle? You scale vertically to address that. The default setting here is extra-large, and we double with each increment: an extra-small is one node, which also consumes one credit per hour, a small is a two-node, a medium is a four-node, and it continues to double all the way up to a 4XL with 128 EC2 nodes in a single cluster. I'm going to spin up an extra-large, again our default value, with 16 nodes in it.

I'm not going to use multi-cluster warehouses right now, but just to tell you what they are: this is how you scale horizontally, for concurrency. Imagine I had a medium warehouse, four nodes in my cluster, and on a Monday morning I had a thousand people using the system all at the same time. They're not asking harder questions over bigger data sets; there are just more people on the system. I don't want to scale up for performance, because that's not going to help me in this scenario; I want to scale out for concurrency. If I set my upper boundary to ten, I'm telling Snowflake it's allowed to spin up additional copies of that same medium-sized warehouse in order to drain my query queue faster and reduce queueing. In Snowflake there is no queueing if you have this set up correctly; it doesn't matter how many users you have. Again, Capital One: over ten thousand concurrent users. I'm going to set this to one, because I'm the only person on the system today. I'll be jumping across to Kelly later, who's going to help me out with some data sharing, but one will be enough for us today.

I'm also going to set my auto-suspend to five minutes. This means that if I don't use my virtual warehouse cluster for five minutes, it will automatically shut itself down and I'll stop paying for it. Keep in mind that because this layer is separated from storage, the S3 layer where all of our storage resides is always going to be analytically available whenever you need it, even when your compute clusters shut down. So let's set this to five minutes. I'll leave auto-resume ticked, which means the very second I fire a SQL statement at this cluster it will instantaneously wake back up, and we'll see that today as well. I could also do all of this in SQL (roughly the sketch below); I'm just using the UI now to make it nice and simple, and I will do it later with SQL as well.
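In SQL, the equivalent of what the UI is doing would look roughly like this (the warehouse name mirrors the demo; the settings are the ones just described):

    create warehouse load_wh
      warehouse_size    = 'XLARGE'   -- 16 nodes, the default mentioned above
      min_cluster_count = 1
      max_cluster_count = 1          -- multi-cluster scale-out not needed today
      auto_suspend      = 300        -- seconds: shut down after 5 idle minutes
      auto_resume       = true;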
So: 16 nodes, now running and active. That took me probably a hundred milliseconds, and we're good to go. We're going to load up some data. Let me set my context (sorry, I was in an earlier sheet; here we go, just setting my context here): I'm using the DBA Citi Bike role, I'm in the Citi Bike database, and this is my production environment. I have a staging bucket, which is literally an S3 bucket that I've connected to my account, and it has a bunch of zipped CSV files inside it; this is what I'm going to load into the trips table I showed you before.

We're going to use the COPY command, which literally says: copy into the trips table from the stage, for the files matching 2018. There's about five years of data inside this stage right now, but first I just want to load up 2018. That's loaded for us in about ten seconds; we can see the query history off to the side, and I can even do a bit of a zoom for you. So we've loaded that in about ten seconds, and we now have 130 million records inside our trips table. We do have a lot more records to come, because there are still four years to go.

But before loading those four years, I wanted to show you how we can scale in Snowflake, and how fast we can flex the system. What I want to do is copy the rest of those years into the trips table using a larger compute size, but I only want to pay for that compute for the very seconds I'm running the copy command. Let me show you this first: this is what we've loaded up, just a quick top 20 to show you what the table looks like, a basic structured table as you'd expect.

Now we'll load the rest of the data. The first thing I'm going to do is alter the warehouse to a 3XL, adding 48 nodes of EC2 to my cluster virtually instantly. I run this, it takes a couple of milliseconds, and you'll see I'm now running a 3XL. I'm now going to copy into the trips table the remaining four years we haven't ingested yet, but I'm doing it on a 64-node EC2 cluster, all in a single cluster. This is only going to take a few seconds, and the moment it's finished loading I'm going to spin the cluster back down to a large. I started with an XL, which was 16 nodes; when I spin back down to a large I'll have 8 nodes in my cluster. So, this has finished loading; it's taken about 20 seconds. I'll scale the warehouse back down, dropping 56-odd nodes, and if I do a quick count I've now got a billion records inside my trips table.

So we've just loaded a billion records into Snowflake, from zipped CSV files sitting in an S3 bucket, in about 30 seconds, and that data is now analytically available. We're only using 8 nodes of EC2 now, really small compute, because we've scaled right back down, but now that the data is inside our architecture I can query it really fast. I've written a very basic aggregation here that aggregates all of the trips by hour of the day and pulls out a bunch of metrics I want to see about this trips table. I've summarized all of those metrics, and I've run that aggregation over a billion records with only 8 nodes of EC2. In SQL, the whole load-and-flex sequence is roughly the sketch below.
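A sketch of that sequence, assuming the illustrative stage from earlier and simple file-name patterns for the yearly files (the exact patterns used in the demo aren't shown):

    -- load just the 2018 files on the XL warehouse
    copy into trips
      from @citibike_trips
      pattern = '.*2018.*';

    -- flex up to 64 nodes, load the remaining years, then flex back down
    alter warehouse load_wh set warehouse_size = 'XXXLARGE';
    copy into trips
      from @citibike_trips
      pattern = '.*201[4-7].*';
    alter warehouse load_wh set warehouse_size = 'LARGE';

    select count(*) from trips;   -- roughly one billion rows at this point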
Now I'm going to suspend my cluster; I don't need it anymore. I'm going to go across to Tableau and use a different cluster, so I'll manually suspend this one. My compute is now turned off, but all of that data is inside Snowflake and we can use it whenever we want. This warehouse is currently set to a large, which is 8 nodes; I'm going to spin it down to a medium, which is 4 nodes, while it's still turned off, because it's just a logical object. The very second I fire a query back at this object, it's going to wake up with 4 nodes next time, and that's all I'll need for the rest of the demo.

So now I'm going to switch across to Tableau. Actually, very quickly before I do that: you'll notice that in my virtual warehouses tab I've just got the load warehouse, four nodes of EC2, which is suspended, so I'm not paying for it right now. I want some separate compute to throw at Tableau, because I don't want any contention in my system, so I'm going to spin up two additional virtual warehouse clusters, but I'm going to spin them up in a suspended state; I've just created logical objects. So I've got two more clusters now, an extra-large and a 2XL, 16 and 32 nodes, but they don't exist yet; they're just logical objects, suspended. The very second I open up Tableau, though, they're going to wake up. In SQL, those steps look roughly like the sketch below.
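A sketch of the suspend, resize, and pre-create steps just described (warehouse names are illustrative):

    alter warehouse load_wh suspend;
    alter warehouse load_wh set warehouse_size = 'MEDIUM';   -- 4 nodes next time it resumes

    -- BI warehouses created as logical objects only; no compute runs until first use
    create warehouse bi_wh
      warehouse_size = 'XLARGE'  auto_resume = true initially_suspended = true;
    create warehouse bi_heavy_wh
      warehouse_size = 'XXLARGE' auto_resume = true initially_suspended = true;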
So let's do that now. I'm going to log in here as Jane, as I mentioned before. Let's see if we can get this to sign in; I'll do a quick refresh here, and refresh again, and we'll see that the BI warehouse, which is an extra-large, is now started. The very second I opened up Tableau, because Jane had the correct access, she instantiated those 16 nodes of EC2 through Tableau into Snowflake. Now, Tableau is the BI tool I'm demoing today, but you could do this with Power BI or Looker or Qlik or whatever you want; we've got native connectivity with all of these tools through ODBC and JDBC, and they all work. For the purposes of the demo we're using Tableau.

Switching back to Tableau, you'll now see that I've streamed in, live, the one billion records; it's in the top corner there. The really cool thing about this: I can remove this little panel here, go full screen, and I've got my pickup locations on the left-hand side of the screen and my drop-offs on the right. As I start to click around in this map, you'll see that each individual click generates all of the correlated drop-off locations associated with that pickup location. Each time I click around in Tableau, without running a compute cluster that's particularly big or heavy over a billion rows of data, Tableau is actually sending SQL statements back to Snowflake and only bringing back down the results I need, which makes this incredibly fast. The reason it's so fast is that Snowflake is sitting behind Tableau and processing all of the compute as I interact with the chart.

Now that I've generated a few clicks (again, each click is a SQL statement that gets automatically generated), I can jump across to my query-time dashboard, where I've tapped into the metadata repository of my account. We collect absolutely every statistic in Snowflake associated with every query you run; anything you do in Snowflake generates a unique query ID by default. Those unique query IDs are represented here by bubbles, so each one of these bubbles is one of the clicks I've just done. There was a little bit of caching as we opened the workbook for the first time, but as I was interacting with that chart you can see my query results were coming back in about 0.9 of a second. This is all coming through live, and you can track it on the dashboard. This is a very basic representation of the metadata; you can go all the way down to a sub-query level and ask, say, how long did I spend scanning cache, and all that kind of cool stuff, but we're keeping this high-level. You can actually see the physical SQL statement that Tableau automatically generated, sent back to Snowflake, and then brought the results back down from. So that's pretty cool.

What we want to do now, now that Jane has been playing around with this data set for a bit, is add something more to it: we want to overlay some weather data so we can do a little bit of correlation analysis. So she goes back to John and says, "OK John, I want you to load up some weather data for me." John comes back into Snowflake, and luckily enough we have some weather data ready to go as well. I'll set my sheet's context here.

The first thing I'm going to do before I start playing with my weather data: I'm currently in production, and I want to be a good DBA; I don't want to wreck anything in my production environment. So I want to create a clone of my production environment so that I have a dev environment to play with. I'm going to run what's called a zero-copy clone, which takes me about a second and a half, and if I go back to my database tab you'll see that I've got citibike and citibike_dev: that's my prod and my dev. Now, the trips table, with a billion records in it, is about 15 gigabytes in size, but the total size on disk is not 30 gigabytes; it's 15. We didn't replicate the data, we only replicated the logical object that is the database, and if we then make changes to either environment we're simply creating deltas of those changes. We never duplicate in Snowflake, so it's very cost-effective.

I'll go back to my worksheets. Now that I've got that environment up and running, I'm going to spin up another virtual warehouse cluster, another sixteen nodes of EC2. I run this and within about 300 milliseconds my 16 nodes are up and running. I have another S3 staging bucket that I've pre-prepared; this one just has a bunch of zipped JSON files in it, and I'm going to show you how Snowflake handles semi-structured payloads. Again, we natively ingest ORC, Avro, Parquet, XML, and, in this case, JSON. Before I load it, I just want to quickly show you what this JSON looks like: it's fairly complex JSON with a bunch of nested key-value pairs, and we've got arrays in there as well; we'll be running some SQL queries on that after I've ingested it.

First we're going to create the weather table, and then I'm going to do the exact same trick I did last time: I'll add an additional 48 nodes to my cluster in a couple of hundred milliseconds, so it's now a 3XL, and then I'm going to ingest all of that weather data, all of that JSON payload, into my VARIANT column. The clone and the load look roughly like the sketch below.
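A sketch of those steps, assuming an illustrative stage name for the JSON bucket and the single-VARIANT-column weather table described:

    create database citibike_dev clone citibike;            -- zero-copy clone, metadata only

    create warehouse dev_wh warehouse_size = 'XLARGE';       -- 16 nodes

    create table weather (v variant);                        -- one VARIANT column for the raw JSON

    alter warehouse dev_wh set warehouse_size = 'XXXLARGE';  -- flex to 64 nodes for the load
    copy into weather
      from @weather_json_stage
      file_format = (type = json);
    alter warehouse dev_wh set warehouse_size = 'XLARGE';    -- and back down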
actually columnarizing the key-value pairs that exist within the JSON. So if you've got 50 key-value pairs inside your JSON, inside your semi-structured payload, we identify all of them and we columnarize them under the covers, and because we do that we get two things: columnar compression, which is massive, and massive performance gains as well, because we're no longer having to do string searches within that semi-structured payload; we're actually referencing those individual key-value pairs themselves with simple SQL syntax. Snowflake has, of course, added some syntax to the ANSI SQL standard so that we can handle these semi-structured payloads.

So that's run for about 56 seconds. I'm going to spin back down to an extra-large; I don't need all of those nodes anymore, so let's drop 48 nodes out of my cluster so we're no longer paying for them. Keep in mind those additional 48 nodes: yes, it's a lot of compute, but I paid for it for one minute, and I ingested, what have we got here, I think about half a billion records of JSON; there we go, 430 million records of JSON, natively ingested into Snowflake. Now that they're inside our architecture we don't need all that compute anymore, so we're just running an extra-large. I can show you what that looks like; we've also stripped out the timestamp here to make this perform a little bit better, and we did that on ingest.

Let's have a look at the JSON. The JSON still looks like JSON; we haven't actually done anything to change it yet, but underneath the covers it's been optimized. So let's dive inside this VARIANT column. I'm literally referencing the city-name key-value pair within that semi-structured payload to search for the word "New York", and I'm also reaching inside the weather array for .main where that is equal to "snow". I've just queried half a billion records of JSON, I've found New York, and if I go down to the bottom here, inside my weather array, I've got main and I've found snow. So, very rapidly, and without too much compute, once you've ingested into Snowflake's architecture you can start to interrogate these semi-structured payloads. You no longer need to write a custom Python script or some other scripting language to clean them up; you just throw them into Snowflake and do all your transformation inside Snowflake. It's too easy. And keep in mind, we've now worked with a billion records of structured data and half a billion records of semi-structured data. What have I not done? No indexes, no partitions, no vacuuming: there's no performance tuning to be done at all. Snowflake just works; it's simple.

Now we're going to do something a little bit more complex. This is an additional feature in Snowflake called a lateral flatten, which is very similar to an UNPIVOT in SQL Server or Oracle terminology: we're essentially flattening out the array, because I know it has one-to-many values in it (the conditions array), and once I've flattened out that array I aggregate it back up. So I've run a heavy GROUP BY and a lateral flatten, on only 16 nodes, over half a billion records of JSON, and it's taken me six seconds. This is too easy. The queries look roughly like the sketch below.
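A sketch of those two queries, assuming key names along the lines of the OpenWeatherMap-style payload the demo describes (a city.name key and a weather array with a main field); the exact keys and predicates in the demo may differ:

    -- path notation straight into the VARIANT column
    select v
    from weather
    where v:city.name::string = 'New York'
      and v:weather[0].main::string = 'snow';

    -- flatten the weather/conditions array, then aggregate it back up
    select f.value:main::string as conditions,
           count(*)             as observations
    from weather,
         lateral flatten(input => v:weather) f
    group by 1
    order by 2 desc;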
So now, I'm happy with this; I know my weather data is working, I've tested it, it's all good to go, and I'm going to load it back into Tableau. I'll clone my environment back to production: I'm done with dev, I don't need it anymore. I run the clone, which takes about half a second, then I drop my citibike_dev environment, and I drop my dev warehouse as well so I'm no longer paying for those compute nodes, and then I set my context back to production. You'll see I'm back in the load warehouse, which is a medium; it's currently turned off, so I'm not paying for it; I've just set it back as the logical object we're about to fire up. And I'm back in the production database, so I'm using production data.

What I'm going to do now is reach inside that VARIANT payload, over and over again, to get all of the different key-value pairs I want to send back to Tableau. Tableau isn't going to know what to do with a VARIANT, because it's a column type unique to Snowflake, so I'm going to grab all of the pieces of that VARIANT I want, feed them into different columns, do a bunch of conversions and so on, and represent that as a view (roughly the sketch after this section).

Now, have a look at this: when I execute this DDL statement, and it's very important that this is only DDL, you'll see that my virtual warehouse cluster hasn't even turned on yet. I just created a view, and that was free; you can do that with no compute and no data if you want, so you can play around with DDL while your account is "sleeping", so to speak. But the very second I run a query on top of that view, you'll see that this load warehouse, which is currently not green because it's turned off, will wake up: the moment I throw a SQL statement at it, it instantaneously wakes up and executes my statement for me. I don't even need to think about provisioning those nodes; it just happens, because I told my virtual warehouse cluster to auto-resume the second I interact with it.

So now I've got just the top 20 rows to show you what this view looks like. On the left-hand side we have all of the trips data we dealt with before, and off to the right-hand side, in a really nice, clean, structured view, we've pulled all of the different key-value pairs out of that JSON payload and simply left-joined them onto the side of that table. Super simple, and we're now going to send this back to Tableau and do the correlation analysis I mentioned earlier. Let's do a quick count star: we should still have our billion records here, and we do; if I zoom in, there you go, just over a billion records.
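A sketch of the kind of view described, assuming illustrative key names in the weather payload (v:time, v:main.temp, v:weather[0].main) and an hour-level join back to the trips table; the real demo view pulls out more columns and may join differently:

    create or replace view trip_weather_vw as
    select t.*,
           w.v:city.name::string       as city,
           w.v:main.temp::float        as temp_celsius,
           w.v:weather[0].main::string as conditions
    from trips t
    left join weather w
      on date_trunc('hour', t.starttime) = date_trunc('hour', w.v:time::timestamp);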
cold. Let's grab that average temperature and make it a line so it's a bit more obvious, and let's go a little more granular and make it daily. Now, like any good demo, hopefully we've got a pretty strong correlation here between how many people are riding their bikes in Manhattan and the weather. When we're getting up around 80,000 riders, the weather's around 25 Celsius, which makes a lot of sense. We've still got some crazy people down here: our smallest number is about 2,000 people riding their bikes when it gets down to about negative 20 Celsius, which is pretty damn impressive.

Now I'm going to flick back across into Snowflake and show you data sharing, but before I do, just to call this out really quickly: that's a billion records of structured data in that trips table, overlaid with a view that hasn't been materialized (it's just a standard view) over half a billion records of JSON, and it's in Tableau and it just works. We're not even throwing that much compute at it; it's nice and easy. And those JSON files can change their structure every day: just continue to load them into Snowflake, and as long as you've still got the same key-value pairs inside them, the view will keep working indefinitely, and any paths that aren't there simply come back as nulls. So it works; it's very flexible and very simple to work with.

I'm going to run through data sharing really quickly. The whole concept of data sharing is that if I have an account and my customer has a Snowflake account, then without ever having to leave my account I can select star from their account, if they've given me access to do so. You can share whatever you want to share, and if your customer doesn't have a Snowflake subscription you can still share data with them: you can spin up what's called a reader account, and that's what I'm going to do now.

I'll set my context here, and I'm going to quickly create the Citi Bike share and grant a bunch of objects to it, including the trip weather view we just created. Let's have a look at these shares: we've got the Jersey City Housing Authority and the New York City Housing Authority, two different parties we want to share different types of data with; they've got different subscriptions on this table. I'm going to add both of their unique URL pointers to my share, so we can actually send them a URL, they can open it anywhere in the world, and they can access the data I've allocated to them. Let's check that's worked: there are our two pointers that have been added, and these are the objects we're sharing with the reader account. The provider-side DDL looks roughly like the sketch below.
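A sketch of the provider-side setup, with placeholder account identifiers (the reader accounts themselves would first be created with CREATE MANAGED ACCOUNT, which isn't shown here, and views shared this way are typically defined as SECURE views):

    create share citibike_share;
    grant usage  on database citibike           to share citibike_share;
    grant usage  on schema   citibike.public    to share citibike_share;
    grant select on table citibike.public.trips to share citibike_share;

    -- attach the consumer accounts (placeholders for the two housing-authority readers)
    alter share citibike_share add accounts = jcha_reader, nycha_reader;
    show grants to share citibike_share;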
I'm actually going to run this part of the demo with one of my colleagues, who's sitting in the front row; she's going to lock down this data share as I start to interact with it. So I'm going to log in to this reader account now; it's my NYCHA, New York City Housing Authority, account. I literally click on the URL and it opens a new login for me, and again, you could send this to your customer anywhere in the world and they would simply log into Snowflake. When I log in here I should be able to see the share that's been shared with my reader account, and the data itself won't be replicated; this is just a window into my master account. I've allocated a compute cluster to these guys so that I can track their usage, and if you want to, you can do what Nielsen do today: they actually monetize this and charge it back to their customers as they consume the data on the fly, which is absolutely phenomenal, and we encourage it.

Let's have a quick look at the shares here. I can see the Citi Bike share coming through, so I'm going to create a database on top of that share, import the privileges I've assigned, and then see what comes through; I should get my billion records. Right now they're streaming straight through into my reader account, and again I'm just running a very small four-node cluster here. I've just done a quick count star, and there are my billion records (the consumer-side steps are sketched after this section). Now I'm going to run an aggregation on all of the different subscription types that are in that share, and the first thing we notice is that there's a whole lot of nulls. I want to clean those nulls up; that's dirty data, about 900 million records of it. Let me do a quick zoom for those of you up the back so you can see it. We want to remove those 900 million records of dirty data, and Kelly will have already done that for me, because she's logged into another instance of my account and working on it simultaneously. So if I run this again, we'll see those nulls no longer exist, because she just ran a DML statement to clean them up; they've now been deleted. This is a live data share, we're not replicating data here, so I'm looking at exactly what's happening in production. Those nulls are now gone, and if I run my count star again I've got 129 million rows, down from the billion, because the rest have been deleted.

What I want to do now is restrict my access, because I can currently see all of the different subscription types in this trips table, and I want it so that I can only see the New York City Housing Authority subscription lines. If I go down here, Kelly will now have granted me access to the secure trips view, which should expose only the rows inside my subscription, and now that she's granted me access to that, I'm pretty sure that if I try to run the previous query again it's going to fail, because she's revoked my access to the full data share, which is exactly what I want to happen. I don't want to be able to see everything; I only want to see the records relevant to the New York City Housing Authority, and that's what she's done for me: every single record that's like the word "NYCHA" I now have access to, and the rest of the table I don't. So this is a live data share, and we're interacting with it live, on the fly.
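A sketch of the consumer side (mounting the share as a database in the reader account) plus the kind of secure view a provider can use for row-level restriction. The account locator, role, column name, and predicate are illustrative; the demo doesn't show its exact secure-view definition, which would typically key off current_account() or an entitlements table:

    -- in the reader account
    show shares;
    create database citibike from share provider_acct.citibike_share;
    grant imported privileges on database citibike to role sysadmin;
    select count(*) from citibike.public.trips;

    -- on the provider side: a secure view exposing only the consumer's own rows
    create secure view citibike.public.secure_trips_vw as
    select *
    from citibike.public.trips
    where subscription_type = 'NYCHA';   -- illustrative predicate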
Now, I know we're short on time, so I'm going to race ahead. That is data sharing. I want to show you, not Tableau, but one of the coolest features in Snowflake, which is our continuous data protection and time travel. This is completely unique to Snowflake, and I always like to say it's one of the main reasons I decided to join the company: when my boss first did this demo, and I saw him do what I'm about to show you, I went up to Pete, our VP of APAC, and said, "How can I join this company fast enough?" It's just that cool. Anyway, that's my backstory. Let me show you how many records are in the table, if this decides to show up here. All right, it's not rendering, but we've got 129 million rows, or 130-odd million rows, in this table; there we go, it's showed up now, don't blame that on the resolution. So, 129 million records; I'll make this a little smaller and give you a zoom so you can see them. There they are.

I'm going to drop the table, so those records are gone from my production environment. You've probably already guessed what I'm about to do here: in Snowflake you never again need to back up data, and that's one of the features we always call out among all of the things you no longer have to do. There are actually three reasons why you would ever do a backup; this covers only one of them, and I'll talk to the other two if we have time. Essentially: if you make a mistake in production, what do you do? You need to fix it. So we now have no records in this table, and I'm going to get an error here because of course we've dropped it; the table's gone. In any other data warehouse you're looking at restoring from backups, which could take you, I don't know, I'll be nice and say eight hours, and that's being nice. In Snowflake I just say UNDROP TABLE. That took what, 67 milliseconds, and there are my 130 million rows, immediately back. You can't make a mistake in Snowflake; it's not possible. Any time you do anything in Snowflake, you generate a unique query ID by default, and with that unique query ID you can fix any problem, no matter how complex. You're welcome to come over to our stand after this demo, I can show you how complex those problems get, and we will fix them, I promise you.

What we're going to do now is something a little more complex. You'll remember that Kelly deleted 900 million records from my table through DML. That was production-grade data; I want it back, it should not have been deleted. My customers called me up and said, "James, that's important, you have to bring that back." So all I need to do, again talking about that unique query ID, is search through the query history in my information schema for the last query that was like the word "delete". I search for that and set it into a session variable called query_id. I'll zoom back out and show you the query ID we've just found; there it is, that's the DELETE statement Kelly ran that dropped those 900 million rows out of my table.

Now I can use the feature called time travel. I'm giving you two SELECT COUNT(*) statements here: one is against the current table, and one literally uses the word BEFORE with that query ID, to show us how the table looked before that statement ran. I run this, and I've got my 129 million rows, but I've also still got my billion records. They're still there; we never actually lost them. The DML didn't kill them, because we retain everything in Snowflake for up to 90 days.

Now it's time to fix it. I'm going to create a table called trips_before to get the table back to the way it was: I'll clone the current trips table, but add in time travel and say BEFORE the statement ID of that query. In just under a second I've created that table, because again this is just a DDL, metadata-only operation; I'm not actually doing any replication, so it all runs fast. Then I swap the trips table with the trips_before table, so I'm back in my production table, and then I drop trips_before because I don't need it anymore. And this is the ta-da moment, where we run the count star on the trips table as it was before I corrupted it, and my billion rows are back in my trips table. So again, any single thing you do in Snowflake can be fixed. It doesn't matter if you've started making changes to that table over time; we can still fix it through joins and other intricate ways of working with time travel. So that's pretty cool. The recovery steps look roughly like the sketch below.
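A sketch of those recovery steps. UNDROP and the BEFORE(STATEMENT => ...) clause are standard Snowflake time-travel features; the query-history lookup mirrors what the demo describes, though the exact filter text is illustrative:

    undrop table trips;

    -- find the query ID of Kelly's DELETE and stash it in a session variable
    set query_id = (
      select query_id
      from table(information_schema.query_history())
      where query_text ilike 'delete%'
      order by start_time desc
      limit 1
    );

    -- compare the table now vs. as it was before that statement ran
    select (select count(*) from trips)                                 as rows_now,
           (select count(*) from trips before (statement => $query_id)) as rows_before;

    -- rebuild the pre-delete table as a zero-copy clone, then swap it into place
    create or replace table trips_before clone trips before (statement => $query_id);
    alter table trips swap with trips_before;
    drop table trips_before;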
I'm going to jump back to the slides really quickly and race through them, because I know I've only got about two minutes to go. This is just a very quick summary of all of the features I went through today; anyone who wants these slides, we can give them out. Just have a look at some of the logos up there: we've got a very rich ecosystem. Snowflake is a cloud-only data warehouse, and these are just some of the many tech integrators that have built native connectivity into Snowflake, leveraging ODBC, JDBC, and Python. You can see there's a bunch there for ETL and all of the different BI tools you can use, plus data science: we've got an incredibly fast Spark driver, so if you work with machine learning and AI, do it through Snowflake with our pushdown and it's phenomenally fast; you can leverage the power of our compute. That's what's so unique about Snowflake, that compute layer: you saw me spin up 64 nodes of EC2 just like that, run a query, pay for it for a minute, and shut it down. All of the features are up there; I won't go through them again.

What are customers doing with Snowflake today? Just to give you a bit of a look, and I've probably already mentioned this: customers today, due to contention on their systems, are having to spin out data marts. That's no longer the case in Snowflake; you don't need to spin out multiple data marts to deal with contention, you simply load everything into Snowflake. You can have as many petabytes as you want; it doesn't matter. We actually pass through the cost of S3, since our storage layer is S3: $25 per terabyte per month. It is phenomenally the cheapest data warehouse on the market today from a storage perspective, and in other ways too, but especially storage: we utilize that blob store, but we make it hyper-fast so that everything is always analytically available to you.

If you think about all of the data that exists in the world (and I know I've run over time, I'll just be another two seconds), you can really break it into three parts: structured, semi-structured, and unstructured. Snowflake works perfectly with the first two, from a data lake perspective. Just to really race through this, I'm not going to talk through these customers, although I've got some awesome stories: Capital One, Goldman Sachs, all of these guys have gone all-in on Snowflake and they're having massive wins out of that migration. There are super cool stories there, but I won't harp on them.

We're being recognized by all the leaders; we're on all of these quadrants, and you know who all these analysts are; we're up there, Snowflake is a leader. There's so much to remember, and this is where I quickly call out George Miller, the psychologist who in the 1950s did a lot of work researching human short-term memory. He came up with Miller's number: most people can hold between, what is it, five and nine things in short-term memory, seven plus or minus two. That's Miller's number. The six things I talked about at the start: please remember those. They're all things that Snowflake can bring to your organization; everyone else can do one or two, and we can do all six. Sign up for free on our website: you get 30 days free and 400 USD of free credits, so just try it out. It's that easy.
And you're not alone: we've got a bunch of amazing tech partners, so talk to these guys. They're born in the cloud, they're amazing, they're some of AWS's best tech partners, and they will help you on your journey with Snowflake. So thank you, guys.
Info
Channel: Snowflake Inc.
Views: 24,439
Rating: 4.8518519 out of 5
Keywords: snowflake, redshift, aws, data, warehousing, date, warehouse, cloud, build, native, amazon, web, services, computing, google, bigquery, hadoop, teradata, neteeza, JSON, frank, slootman, saas, technology, case, study, video, testimonial, gdpr, sydney, summit
Id: A5-wgSjRrcE
Length: 44min 23sec (2663 seconds)
Published: Mon Jun 24 2019