DC_THURS on Open Cloud Lake House w/ Ori Rafael (Upsolver)

Video Statistics and Information

Captions
Pete: Welcome back to DC_THURS, everyone. I'm Pete Soderling, founder of Data Council and the Data Community Fund, and I'm excited to be your guide through the crazy, sprawling, expanding modern data ecosystem. As you know, we have regular guests on the show who are typically open-source data tool contributors, authors, or startup founders in the data ecosystem, and today I'm very excited to welcome Ori Rafael, co-founder and CEO of Upsolver. Ori, welcome to the show.

Ori: Thanks, Pete. Great to be here — thank you for having me.

Pete: Ori, I want to start by talking a little about your background, which is something we typically do on the show. Tell us about your early career and how you got into data.

Ori: My first job was in Israel. I went to the army for six years, which I spent in Israeli intelligence, and my first job was as a database engineer. So I started as a person in this community: I was given the choice between engineering and data, I chose data, and I never looked back. We had thousands of database instances, so we had quite a decent scale, our DBA team was very big, and we were basically the pivot point for all data analysis. It was a very interesting first position to have at the age of 21.

Pete: Were those the days when Oracle was big, or did you use something else?

Ori: Oracle and SQL Server at the time. Oracle was the biggest for us, and I started with Oracle.

Pete: Got it. And did you start to use open-source data tools at some point after that?

Ori: We evaluated Hadoop in different ways, but it was always more on the marginal side. After that, when NoSQL started, we used Cassandra, Elasticsearch, Redis — a whole bunch of open-source tools.

Pete: What were some of the key learnings from that early part of your career? And how long did you spend in Israeli intelligence, by the way?

Ori: I spent six years — actually seven, eventually — and I finished leading all of their data integration, so I started in databases and eventually moved toward ESB, or ETL, depending on the era. What are my key takeaways? It's a big question. My first days were Oracle days, when everything would be solved with Oracle, and I remember learning the hard way that it's hard to be locked in to one vendor: you have to use it even when it's not the right tool, you have to pay a lot, and you're dependent on the vendor when there are bugs — I think every person on my team had bugs listed under their name with Oracle. We really took Oracle to the extreme, because it was still a traditional RDBMS and we had a significant amount of scale and a lot of use cases. I remember those early days as a DBA, but I also remember how many different users in the intelligence community used the databases directly — technical and semi-technical people — while Hadoop was always something on the side that some remote team of experts would operate. And I remember how much databases allowed us to iterate fast. Every time I come to a problem, I usually think about it first in database terms rather than in code terms. SQL is very natural for me, and it was natural for a huge number of users — that's something I took away from my experience there.

Pete: Got it, that makes sense. And where did you go from the Israeli army?
Ori: I spent a few years in sales. I really wanted to create my own startup, and coming out of the army — which is a non-profit organization — I felt I needed to better understand what's going on in the market and why companies act the way they act. So I did three years of B2B sales, and then I founded Upsolver.

Pete: Got it, awesome. I'm excited to chat about Upsolver later in the show and to understand what you're building there and some cool things about the product. But before we do that, I wanted to ask you about this open lake house concept. I know it's something you've been talking and writing about — you have a blog post on it that I was able to check out. First of all, what's the big deal with the lake house? "Lake house" sounds like we're frankensteining a couple of words — a couple of concepts — together, as we sometimes do in technology. Is this literally a combination of a data lake and a cloud data warehouse, or how should we think about what "lake house" means?

Ori: I understand the confusion completely. You can decide that a lake house is a database, in which case it's closer to the warehouse, or you can decide that it's an architecture, in which case it's closer to a data lake. Companies like Snowflake interpret the lake house as a database — basically a database built on lake foundations, but still a database just like Oracle: a closed box, internal compute, proprietary storage, everything very specific to Snowflake, just built on lake infrastructure to make it easier to handle and more scalable. On the other side there are the Databricks of the world, and Databricks is of course built on data lake infrastructure — but is it the "house"? Is it usable, is it easy, is it queryable? That usability has always been the challenge for the data lakes of the world. So a person thinking about the lake house as a data lake wants to make the cost-effective lake usable, and a person coming from the house wants to make the easy-to-use house scalable and more cost effective. Different companies interpret the term "lake house" in different ways, and that's why I added the word "open." I interpret it as a lake. I think storage is the most important piece of a database — it dictates your performance, cost, and maintenance — so given that, I think taking the data lake, which didn't succeed in the past but offered all of those benefits, and simply making it usable is the right way to go.

Pete: Got it. For sure there's a lack of efficiency, and there are challenges around data lake architectures historically. It seems like we have a sort of reference implementation of a data warehouse, a la Snowflake or Redshift, that feels more battle-tested and is seeing industry-wide adoption, but on the data lake side maybe we're a little further behind in actually understanding a mature implementation of a data lake. So would you say that the lake house is an advancement of some of the older, legacy data lake concepts brought into the modern era — would you agree with that or not?
Ori: I would definitely agree with that. A data lake is harder to implement because it's an architecture and not a single product. I'm not getting one box where on one side I have ingest and on the other side I have SQL — that's basically a database, and it's easy for the organization to understand; if you look at the architecture diagram, there's usually one thing. If you look at a data lake, you need to stitch the solution together from multiple services, and that of course adds management complexity. Is that a bad thing? If I said it was, I would also have to say that microservices are a bad thing, because it's harder to manage a bunch of microservices than to manage one monolith. Of course it's harder to do — but what are the benefits? I think there's a lot of similarity between the concept of an open lake house and the concept of microservices: different, smaller services, not being tied to one vendor, being able to scale each service elastically, and not being as locked in as we were to databases — something I felt in my early career. That's the way I'm thinking about it.

Pete: I see. So to extend that analogy — essentially what you're also saying, conversely or as an extension, is that the data warehouse model is more similar to the monolith model, where you're talking to one vendor for everything from query to compute to storage. So perhaps your argument is that the data warehouse, even today, is very much a monolith, whereas the data lake is much more like a microservices-style, decoupled architecture.

Ori: Completely — I completely stand behind that. Look at the websites of some of the big companies. For a long time Databricks was the "unified analytics platform" — unified means you use that one tool. For Snowflake, you see the Snowflake compute engine and a whole bunch of use cases on top: data lake, data warehouse, data engineering, all of them. It's great — but why do you have to use Snowflake compute? Because you're using Snowflake storage. And when it says "data lake on top of Snowflake," maybe that's not the tool you want. I'm not saying they're not doing a good job, but maybe it's not the best tool for you, for your preferences, for your users — and then what do you do? You're kind of stuck; you don't have another option. Snowflake is a great tool, but eventually it's the same circle, the same dynamics I had with Oracle 15 years ago, the same dynamic its users have with it. And I loved Oracle, I enjoyed working with it — but there were limitations.

Pete: Got it. Well, let's talk about the data lake side, or the lake house side. What are some of the challenges you see folks typically having in getting efficiency out of their data lake architectures? Before we talk about the open standard you're proposing, let's talk a little about some of the real-world challenges on the lake side of the house, so to speak.
Ori: The problems on the lake side can range from data correctness to performance, and the reason is that a database really kept you safe. I did my inserts, updates, and deletes, I communicated with a table, and underneath that table there was some kind of file system I wasn't even aware of — I didn't touch it, I didn't need to manage it, I didn't need to make sure it was correct. I trusted Oracle to give me all of that. Now I'm working with a data lake, and basically I'm writing code on top of a raw file system — and this is why databases were invented, so people wouldn't screw up their entire file system. So the engineering on the data lake side is, first, more complex, and second, riskier. You have to know what you're doing, because you're putting on your own shoulders the same responsibility you once gave to Oracle, which is pretty good at doing that kind of work. Understanding how to build a proper file layout — how to partition it, what file formats to use, what compression, what file sizes — all of that falls on the company implementing the data lake, and they need to know how to do it. The second thing is how to manage the human process. What I saw, at least before Upsolver, is that you have a data analyst team, a data science team, and a core data engineering team, and you trust the core data engineering team with the risky and complex operations. The problem is then getting that work into the hands of the analysts and the scientists — you're stuck going only through coding engineers, and in my experience coding engineers are super hard to hire and their time is super valuable, so this is definitely not what I want to spend their time on. Giving access to the data to all of those database-style users is something very important to do, and it's a little bit missing with data lakes today. That's data lake 3.0: let's give them the same developer access we had with databases.
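To make the file-layout point concrete, here is a minimal sketch in Athena-style SQL of the kind of work Ori is describing — turning raw JSON events into compressed, partitioned Parquet. The table names, columns, and bucket are hypothetical, not taken from the interview:

    -- Hypothetical raw table: JSON events landed on S3 by some ingest process.
    CREATE EXTERNAL TABLE raw_events (
        event_id   string,
        user_id    string,
        event_type string,
        event_time timestamp
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://example-bucket/raw/events/';

    -- Rewrite the raw data as compressed, partitioned Parquet so query engines
    -- scan only the partitions and columns they actually need.
    CREATE TABLE events_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://example-bucket/curated/events/',
        partitioned_by = ARRAY['event_date']
    ) AS
    SELECT event_id,
           user_id,
           event_type,
           event_time,
           date(event_time) AS event_date   -- partition column goes last
    FROM raw_events;

Choosing the partition key, the compression codec, and the target file sizes — and keeping those choices right as the data grows — is exactly the responsibility a database used to hide from its users.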
Pete: Got it. So what about the so-called open, or open-source, data lake formats? There's Delta Lake, there's Iceberg, there's Hudi — a bunch of formats in the open source on the data lake side. I think you don't necessarily see all of those formats as being as open as their names imply. Talk to us a little about that.

Ori: First of all, there's a difference between a Hudi, which came from Uber, and an Iceberg, which came from Netflix, versus Delta Lake, which came from Databricks. Netflix and Uber have different incentives. They basically built a metadata store that gives more capabilities than the traditional Hive metastore — they wanted ACID compliance, inserts, updates, and deletes in a data lake — and they open-sourced it, so every person who uses that metadata store gets those capabilities. The only problem I see is that it's not as widely or strongly adopted as the Hive metastore, so there's usually a compatibility issue when you use one of those tools. Delta Lake is a different story. Delta Lake is basically Databricks' format for a data lake; it also solves the same types of problems, but Delta Lake is only partially open source, not fully, and it still has the same compatibility issues. If a query engine hasn't found the right way to integrate with Delta Lake, you can't really use Delta Lake with it. For example, Athena-to-Delta Lake connectivity has been discussed in a lot of blogs, and usually what you need to do is spin up an additional integration between your data lake and the Glue catalog so that Athena can read it — which is a somewhat convoluted path. So I feel the need is there: people need more capabilities out of their Hive metastores — updates, deletes, ACID. There's a big need around it, and I feel this is the area where the data lake market needs to mature the most. Do I think there's a solution in the market today? Not yet; it's still a little early. And I would put Delta Lake, by the way, in the same bucket as Snowflake: they're creating a file system that's very specific to Databricks. I can't quote Databricks' statistics, and this is word of mouth rather than well-backed data, but I've heard that more than 80 percent of the people using Delta Lake are using it with Databricks. So it's a very similar play to what Snowflake is doing — and I'm sure you've seen all the news around the Snowflake and Databricks competition. They're competing because Delta Lake is a competitive move against Snowflake, and it's more in the mold of the traditional data warehouse than of an open data lake, in my humble opinion.

Pete: So you're arguing that we need this notion of an open cloud data lake, or open cloud lake house, because of these tightly coupled interactions that show up even in certain kinds of data lake architectures. Talk to us about the principles of this open lake house you're proposing, because I think this is really interesting.

Ori: An open lake house has three layers: the engines that are writing into the lake, the engines that are querying it, and in the middle your data store. If I had to define an open lake house: if you're using an open-source storage format like Apache Parquet or ORC, and you're using an open metadata store like the AWS Glue catalog or the Apache Hive metastore, then you're good — all query engines are able to query that data. And of course it's on object storage, which I didn't mention — so data on S3 with a table mapped to it in the Glue catalog would be an open lake house. There is no query engine in the world that would not be able to consume that data.

Pete: So it doesn't matter if you're Athena or Presto or even Snowflake — you can query that data on the object storage directly from any compatible query engine, because the file storage itself is based on open formats.

Ori: Yeah. And think about the potential for data sharing between engines. One of our public case studies with AWS is a company named ironSource. They have a velocity of 3 million events per second, which is a lot of data, and they're using three different query engines at the same time on the same copy of the data — BI liked one thing, the analytics team liked another, and the data science team liked a third. Maybe the cost made sense with one tool at the beginning of their journey, but then they decided to move to another tool because they wanted their own cluster. Things evolve, and if all the query engines can share the same copy of the data, you get a lot more agility and you don't have the kind of lock-in we had 15 years ago.
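As a rough sketch of what that shared, open layer looks like in practice — with a made-up database, table, and bucket — a single definition in an open catalog (Glue or a Hive metastore) over plain Parquet files is all that different engines need in order to query the same copy of the data:

    -- One table definition in the shared catalog; names are illustrative.
    CREATE EXTERNAL TABLE analytics.page_views (
        view_id   string,
        user_id   string,
        url       string,
        view_time timestamp
    )
    PARTITIONED BY (view_date date)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/curated/page_views/';

    -- Because the files are plain Parquet and the table lives in an open
    -- catalog, the same query can run from Athena, Presto/Trino, Spark SQL,
    -- or a warehouse engine with external-table support, with no extra copy.
    SELECT url, count(*) AS views
    FROM analytics.page_views
    WHERE view_date = DATE '2021-04-01'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;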
Pete: And what are your opinions on the storage layer options? How do you compare formats, and do you have any insight into how a team might pick the right open format for their storage?

Ori: It's usually a discussion between Apache Parquet and Apache ORC, and there isn't a lot of depth to it. Every time we've tried to check whether one is better than the other, we find marginal, use-case-dependent differences. We decided to standardize on Apache Parquet because it's more common, and because we haven't seen big differences when using ORC. So we prefer Parquet — not because the other technology isn't good, it was just more popular.

Pete: Got it. And how popular do you think this open cloud lake house pattern is becoming? You obviously feel the world is ready for it — you're out talking about it and evangelizing it. Are you seeing adoption tick up the way you would hope?

Ori: Yeah, for sure. This is the golden age of query engines. In the past you needed to spin up your own Presto on something like EMR — a cluster to manage every time — and now you have Athena, which is a very well adopted service on AWS. The number of Athena users, the number of companies using that service — I think it's one of AWS's fastest-growing services, and that's a good sign. You can see Starburst, their success and their entire recent fundraising; you can see Dremio; you can see Snowflake, BigQuery, Redshift, and Azure Synapse all offering their own query engines that can query open data. If I had needed to make a list of query engine options three years ago, I would have had one or two, and they weren't very easy — now I have something like ten. I think we're going to see more and more query engines, and also query engines that are more use-case specific. For example, if I could query the data from S3 and get Apache Druid-like performance, that's a very nice service: I don't need to get all my data into Druid, which is the hard part, and I still get very low latency on top of S3. That could be a query engine that adds indexing — that's an example of a startup. You could have a query engine that reads the data and follows the Elasticsearch pattern, or the Cassandra pattern. So I see this open lake house concept as: you have your store on S3 — you, the customer, own that store — and on top of that store you have a whole bunch of APIs: a SQL API, a low-latency SQL API, an Elasticsearch-style API, NoSQL — all the databases we know. Eventually these are APIs to data, and the data should be decoupled away from them. That's five, ten years from now.

Pete: I like that vision. I think you're right — it is a golden era for query engines, and this format could potentially make a lot of sense. I want to go to a question, because Daniel has jumped in with one: "How do you handle, in Scala terms, covariant needs — all engines can read it — and contravariant needs, meaning a system that needs metadata not in other formats to get good performance?" Do you have any thoughts on that?

Ori: Can you repeat that? Covariance? Maybe I misunderstood.

Pete: I think he's using Scala terms — covariant needs and contravariant needs. I guess I'm not as familiar with the terms.

Ori: Same here, but maybe Daniel can jump in and clarify.

Pete: His example is regarding a contravariant system — one that needs metadata, not just the open formats themselves, to get good performance out of the system.
Ori: I think — I hope — I understand what he's saying, and he's right that one of the big caveats of getting data directly from an object store like S3 is low-latency performance. You need to go to the metadata store — the Hive metastore, which is basically MySQL — query it, get the answer back, go to the files, look into the Parquet files, and understand which ones you need to open. So there are a lot of metadata round trips; every time you go to a Presto there's a price you pay up front before the query can run, and that's a disconnect from real-time dashboarding. Daniel is right, and as I said, I see metadata stores as still a little early: we need to push the same level of statistics we had in databases into the metadata stores, so we won't need to go to every single file. There's a lot of room for optimization there, and I think Dremio and Starburst are already doing some of that work. I completely acknowledge what Daniel said — this is a challenge — and I think that once all query engines are able to do sub-second queries from S3, that's going to close a big gap compared to traditional databases.
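One common way to cut down those round trips — reusing the illustrative page_views table from the earlier sketch — is to make sure queries filter on the partition column, so the planner can prune partitions from catalog metadata instead of listing and opening files across the whole table:

    -- An unfiltered query forces the engine to resolve every partition
    -- and list every file under the table's S3 prefix before it can run.
    SELECT count(*) FROM analytics.page_views;

    -- A predicate on the partition column lets the planner prune partitions
    -- using catalog metadata alone, so far fewer S3 listings and Parquet
    -- footer reads happen before execution starts.
    SELECT user_id, count(*) AS views
    FROM analytics.page_views
    WHERE view_date BETWEEN DATE '2021-04-01' AND DATE '2021-04-07'
    GROUP BY user_id;

Partition-level and column-level statistics in the catalog push this further — which is the "database-grade metadata" gap Ori is describing.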
Pete: Very interesting. I want to touch on the third part of the architecture you mentioned, because we've talked about the storage and the query engine layers, and this is a bridge into talking more about what Upsolver does — obviously the data has to get into the lake or the warehouse in some way in the first place, and I know you have some thoughts on that. Tell us, first of all, what Upsolver does, why you decided to start the company, and how it fits in.

Ori: I met my partner for Upsolver in Israeli intelligence — I think he was the CTO of the largest data science group in the entire army at the time. We both knew databases very well, and we decided to work on a problem in the advertising space. By the way, the advertising space and the intelligence space are pretty similar: you're trying to understand what people are doing. So we started there, but there were big amounts of data and we needed to do a lot of machine learning, and two things pushed us to choose a data lake: the first was cost, because of the volume, and the second was machine learning, which is not a well-suited use case for a database. We looked at the tasks we needed to do, and I remember looking at some analytics breakthrough we needed to make on traffic and thinking, this would take me three hours to do in SQL as a DBA myself. Then we needed to go find a data engineer — which was hard, because it wasn't very interesting work to do — and when we brought that person in, it took them about 30 days to do it with Spark. What brought me into this business is that I want data lakes to be more like databases in their time to value. That's a pain I experienced personally. We actually built internal tools for ourselves so we could iterate faster on top of S3, and eventually we decided to turn those tools into Upsolver. We see ourselves — I hope, at least — as our own users, and we can relate to any person coming from the database world into the data lake world. That's why we started the company.

Pete: So it's very much a contrasting approach to using the old Hadoop and Spark tools — based on your familiarity with and appreciation for SQL, and the belief that data engineers could benefit from better tools to move data through their systems.

Ori: For sure. Back in the Hadoop days, some would call a data lake a science experiment — something you do for one use case with too much data, when you're only thinking about cost and you don't care about usability. In the last few years it has been proven that the future of databases is data lake infrastructure. This is not a science experiment: we need to store data on object storage. And now it's not the early adopters who did the science experiment with Hadoop who are moving to Spark — it's all the database people who are also moving to a data lake. They don't know Hadoop, they don't know Spark, they come from databases, but they have the same needs. We need to build a data store on top of data lake infrastructure, on top of object storage, preferably in the cloud. How are we going to do that? We're not speaking the language of Hadoop, and we're not really interested in doing so — it doesn't promote the value of the business. The value of the business is getting insights out of data. That's our crowd.

Pete: That's a fascinating vision. So you see all the DBAs in the world essentially moving to object storage in the cloud — and yes, they're going to need to learn some new tools and understand the differences between the query engines, and as you mentioned there need to be some advancements there — but your argument is that the data lake is essentially poised to become the new database of the cloud.

Ori: I try not to say things I'm not qualified to say, but the data lake is definitely going to be the data store for all databases. There's no doubt about it.

Pete: But will data lakes continue to live in parallel with data warehouses? How do you see that evolving over time?

Ori: When I say "data lake," let's not confuse the audience: when I say data lake infrastructure, let's call it object storage from now on. I see all warehouses running on object storage, and I see all lakes working on object storage. Once you move to object storage, that changes the way a database works with local storage, replication, copies, and all of that — that's the change I mean is going to happen. Will data lakes and data warehouses coexist? I feel like there's no other option. If I use only a warehouse, I'm stuck with the lock-in again — it's the same problem, and I don't want that. I want to use the data lake, but maybe the data lake doesn't serve my use case end to end because it's not usable enough, so I put a data lake in front of my warehouse as a staging area and load only the relevant data into the warehouse. Then I'm also not locked into the warehouse, because all my data is in the lake. The obvious question is: if all my data is in the lake, can't I just query it there? You can — there are some limitations, and over time those limitations are going to become fewer and fewer. So I think the warehouse and the lake already coexist today, with the lake as the staging area in front of the warehouse, and I believe we're going to see more and more analytics use cases happening directly in the lake — because why send the data somewhere else? — and the warehouses of the world are already giving an answer for that, in the form of a query engine that can work on external stores — external tables, in a sense.
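Syntax differs by engine, but as one simplified illustration of the "warehouse querying the lake through external tables" idea — assuming the illustrative Glue database from the earlier sketches and a placeholder IAM role — Amazon Redshift can attach the lake's catalog as an external schema and join lake data with tables that live inside the warehouse:

    -- Attach the Glue database that holds the lake tables as an external
    -- schema (the role ARN and database name are placeholders).
    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::111122223333:role/example-spectrum-role';

    -- Query the Parquet files on S3 directly, joined with a hypothetical
    -- table stored inside the warehouse itself.
    SELECT w.customer_segment, count(*) AS views
    FROM lake.page_views v
    JOIN warehouse_customers w ON w.user_id = v.user_id
    WHERE v.view_date = DATE '2021-04-01'
    GROUP BY w.customer_segment;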
Pete: Yeah — especially if, as you're saying, some of the optimizations can be handled at the query layer for different kinds of analytical use cases. There might be some limitations there, because typically an integrated, special-purpose analytics database controls the format of the data on disk and does its own query optimization, so perhaps there are limits to decoupling the query from the storage and to what can be achieved at the query layer alone. But I guess we'll see over time how that model shakes out.

Ori: Probably. The piece the query engine can add is indexing. They're not changing the data, because the data is shared by others — they're not allowed to do that — but they add their own index, and when they query, sometimes they hit their index and sometimes they hit the raw data that's available to everyone. That's something I've actually seen with several companies.

Pete: Yep, for sure. Well, going back to this SQL ETL use case — talk to us about the common use cases you see emerging around the SQL ETL pattern. I'm curious to hear more about how companies use it.

Ori: Let me start with the why. The Hadoop person moving to the cloud data lake goes from Hadoop to Spark — Java becomes Java, Scala, and Python, so there's a change, but it's basically the same interaction. But the DBA, the database practitioner, the analyst — they're not going to write that code, and they don't know all the best practices. They do understand SQL; that's something very familiar to them. So what Upsolver did is basically create a SQL language for ETL. Every statement you see there is essentially: I have this source column and this target column, and I'm going to apply some transformation on the source to get it into the target. The reason we chose SQL is, first, that people already have experience with SQL — if they want to do a transformation, they're going to look at Upsolver and ask, "should I learn this company's new user experience, regardless of how good it is, or can I use the knowledge I already have to write the transformation?" That was the first reason, and once we did it, we saw a flow of data analyst users, for example, working with Upsolver — people who wouldn't have done it before — because SQL really is their language. The second reason was to show that the system is powerful enough. You're looking at this tool from a company called Upsolver and asking, can this really handle my use case, or is this a nice visual UI where eventually I'm going to hit a glass ceiling? What we did was give the full power of SQL: you can do ETL joins, ETL aggregations, time windows — basically every operation you can do in a database, you can do at the ETL layer with Upsolver. It spoke to expressiveness. Adding SQL to our platform was maybe one of the biggest releases we've ever done, and it impacted go-to-market almost immediately, for those reasons: it became easier, and it was clear how to explain why it's powerful.
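To give a feel for the pattern Ori describes — written in generic SQL for illustration, not Upsolver's actual product syntax, with hypothetical tables and columns — a declarative ETL job is essentially a mapping from source columns to target columns, with joins, aggregations, and time windows along the way:

    -- Read raw events, enrich with a join, aggregate over a one-minute
    -- window, and write the result into a target table.
    INSERT INTO target_sessions_per_minute
    SELECT date_trunc('minute', e.event_time) AS minute_start,  -- time window
           c.country,                                           -- joined dimension
           count(DISTINCT e.user_id) AS active_users,           -- aggregation
           count(*)                  AS events
    FROM raw_events e
    JOIN user_countries c ON c.user_id = e.user_id              -- ETL join
    GROUP BY date_trunc('minute', e.event_time), c.country;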
Pete: That's really neat. So you use SQL for the transformation of the data as well — is this competitive with dbt in that sense?

Ori: Just hang on, let me reconnect the stream — Pete, I'm not sure I can hear you right now. The last thing I heard was that you asked me about dbt and Upsolver, because we're both SQL based. Did you get my answer, or do you want me to repeat it?

Pete: Could you reiterate it? I missed that — we had some technical difficulties here, sorry about that.

Ori: No problem. I think dbt and Upsolver are solving the problem in a similar way, but on different parts of the ecosystem. dbt plugs in on top of your warehouse: you get a very powerful compute engine from your cloud warehouse, and dbt helps you orchestrate work on top of that engine — you write SQL, you orchestrate jobs, and you use the SQL you already know. Same concept as Upsolver. Upsolver works on top of object storage, on top of S3. We don't have a compute engine to push the SQL down to, so we have to parse and execute that SQL ourselves — Upsolver built at least half of the logic you have to have in a database just to compute the SQL. So if you want to do transformations on top of S3, in the lake, you're going to use Upsolver; if you're going to do transformations within the warehouse, you're going to use dbt. It's a similar idea, but the question is where you want to run your transformations, and of course there are advantages and disadvantages on both sides.

Pete: And what are the advantages of doing the transformations over the object storage itself?

Ori: Cost would be much, much lower — on the order of a magnitude lower. ETLs are not very cost effective when they run on warehouses. The second is latency: the way you process data in warehouses is by running SQL over chunks of data, while Upsolver processes each event separately, so we can deliver the final data with much lower latency. And — third time's the charm — to finish answering your question: cost is one thing, latency is a second, and I think the more traditional analysts might still prefer to do the transformation within their warehouse, because that's how we used to work in Oracle. That's also what creates the lock-in: the refined data you create at the end is only available inside that database. You're not creating that refined layer up front, so you can't send it to other databases — it lives only within your cloud warehouse. So the advantages on the lake side are cost, lock-in, and latency; the advantage on the other side is that this is how people were used to doing things in Oracle, which is good and familiar.

Pete: Got it. Well, I want to switch gears a little, because I know that from an engineer-founder perspective there are always challenges, especially in going to market. Upsolver is a really interesting tool — almost something so unique that people have never seen it before, or never thought it was possible, because it carves out a new space in the data workflow with an old standby, SQL, but it's basically creating a new space. From a founder and CEO perspective, what were some of the biggest lessons you learned from bringing Upsolver to market?
Ori: The biggest lesson by far is that you need to measure and attack friction at every point, at every level possible. I'll make it concrete. I remember, when we were getting started, the question of how you deploy Upsolver: are you using it as a cloud service, or are you running it in your own private VPC? In 2017 that was a legitimate question — that model was not that popular back then — and the team said, well, you can go to the customer's security team and explain and do everything, but explaining takes a lot of time, and we didn't have the time, effort, and money to go and explain. So from a product perspective we've been looking at product-led sales from day one: how can people start using it while being asked as few questions as possible, and how do we avoid those friction points? I think SQL was another one. I heard analysts say after POCs, "it's very nice what you've built here, but I know SQL — this is the language of databases — why are you serving me something new?" I heard that feedback a couple of times before we decided to work on SQL. I have a list of at least ten things in Upsolver that I feel we built just to eliminate friction, even in places where we could be considered right — because being right isn't winning. That's the biggest lesson.

Pete: Got it — fascinating lessons with these innovative products. It's really interesting to hear your story, Ori. I'm curious, what's your vision for this open lake house concept? What sorts of feedback and involvement are you looking for from the community?

Ori: I hope people will get educated on what an open lake house is, and that they'll look at their architecture and ask: is this following a good design pattern? I want an open lake house because that's what I believe in — I want to be future-proof, I want to be able to work with multiple engines, I want to reduce costs and understand what it's costing me. And I hope the people on the other side, going the data warehouse route all the way, will also come to understand the other options. So what I hope people get from the open lake house content we're creating is education, and I think people are very thirsty for that information: how do I architect for the cloud, how is analytics different in the cloud, what abstractions do I have that I didn't have in the on-premises days, what's a good reference architecture and what's a bad one, how did other companies do it? All of that information is very important and in demand, and I hope we'll be able to address it.

Pete: It's very important, as you say, for engineers and implementers to understand the differences in the tools they're using and how they fundamentally work — it's very hard to make these decisions without a clear understanding of the design patterns and the architectural differences. It's been really fascinating to have this conversation with you and to deconstruct some of the aspects of what an open lake house means. I really appreciate you taking the time to chat with us. And if folks want to try Upsolver, where can they do that?

Ori: There are several options. The first one is to go to the website, click "Start Free," and just work with the platform. There's a community edition that is free forever.
It has all the features except scale, because we have to pay for that — but other than that you can use everything. We have a few quick starts, and we're there to help; you can start experimenting with the platform on your own. That's option one. Option two is that we run Dev Days on a monthly basis, where we give guided, hands-on labs on how to take a use case and implement it in Upsolver — today on AWS, in the future on Azure and Google Cloud as well. You're building something end to end, not just in Upsolver: you take data that's just been dumped on object storage and eventually make it queryable in SQL, and you do everything in 90 minutes on your own.

Pete: That's very cool.

Ori: Yeah, Dev Day has been popular. So that's option number two. Option number three: talk to us.

Pete: Awesome. Well, we have another question from the audience that I want to take before we sign off, if you have the time.

Ori: Sure.

Pete: Manu is asking: what's the minimal POC footprint for Upsolver if I want to try it on my own super-private data? What's the simplest and fastest way to try Upsolver?

Ori: He said simple and super-private data. For the super-private data I would do a private VPC. When you go into the platform after you sign up, there's a CloudFormation script that we launch — I'm assuming it's AWS; there's an equivalent on Azure. Once you do that, you can choose private VPC, and then Upsolver is deployed into your own account. The way it works is that we're able to manage the cluster in Manu's account — basically spin EC2 instances up and down — but our employees don't get access to the S3 bucket, so we can never access the data; we can't even access the instances. So Manu is guaranteed at the AWS level that we will not touch their data, but he'll still be able to play with Upsolver. That's the trick.

Pete: Awesome, that sounds great. And Ori, if folks have other questions for you, what's the best way to reach out? I don't think you're on Twitter, are you?

Ori: I think I am on Twitter.

Pete: Oh, okay — but ori@upsolver.com, correct?

Ori: Correct. I always answer my email, and LinkedIn and Twitter all work too.

Pete: Okay, awesome — email is the best one. Well, we're happy to help you advertise this open cloud lake house format; I think it's super interesting and a thoughtful architecture that more folks need to consider. I'm really grateful you took the time to come and share with us today, and we'll continue to send feedback from our community and send folks directly to you in case they're interested in being helpful, or in joining any sort of consortium that I think you might secretly be interested in starting.

Ori: Sounds good.

Pete: Thanks for joining us, Ori — we'll talk to you soon.

Ori: Bye.

Pete: And to everyone else out there, thanks for your patience with our minor technical difficulties today — I'm glad you stayed tuned to the end of the show. Next week we'll have another awesome guest on DC_THURS: we'll be talking to Tom Baeyens, co-founder and CTO of Soda Data, about data monitoring, so don't miss that episode. Don't forget to subscribe to the YouTube channel and hit the bell icon to get notifications about our next episodes, and we'll look forward to seeing you next week here on DC_THURS. Thanks, everyone.
Info
Channel: Data Council
Views: 245
Rating: 5 out of 5
Keywords: data engineering, data pipelines, data catalogs
Id: YFBKNsU36p0
Length: 46min 23sec (2783 seconds)
Published: Thu Apr 22 2021