Ask Me Anything about Photon / Databricks SQL

Captions
Kelly: All right, hey everyone, we are now live with our Ask Me Anything about Photon and Databricks SQL. I'm Kelly, and I'll be moderating this session. Please go ahead and put any and all questions you have in the chat; we're going to address as many as we can in the next 30 minutes. With that, I'll turn it over to our panel to introduce themselves.

Alex: Hello everyone, my name is Alex and I'm the tech lead on Photon. I'm happy to tell you anything about it.

Aleister: Hi there, I'm Aleister McGowan, a solutions architect, and I help customers get up and running and familiar with the platform.

Mustafa: Hi, my name is Mustafa. I'm the tech lead for runtime performance at Databricks.

Miranda: Hi there, Miranda Luna. I'm the PM for the Databricks SQL UI.

Kelly: Awesome, thanks everyone. Again, please feel free to put any and all of your questions in the chat. I'm going to start off with a question for Miranda, actually. This morning Reynold gave a talk about Databricks SQL, which, from my understanding, used to be called SQL Analytics. Can you explain the change that happened there and what it means?

Miranda: Yeah, definitely. What you're going to see when you log in and navigate to the SQL tab today is the same product you know and love, so nothing has fundamentally changed; we haven't taken anything away or pivoted what you've already had set up. But one thing we wanted to highlight is that there are several components that go along with Databricks SQL. There's the endpoint component: I can use my SQL endpoint with my Tableau dashboards, with Power BI, or with any additional client I'd like. But there's also the experience of actually building a Redash dashboard, scheduling alerts, and integrating those queries and visualizations into different views. We want to be able to distinguish those, because we know there are customers using one or the other and not necessarily both, and we want to speak to and provide a tailored offering and experience for each. So that's what's behind the name change: Databricks SQL is still the same SQL Analytics you know and love, but "Databricks SQL" refers specifically to that endpoints experience and hooking into other BI tools, whereas Redash is going to be your classic dashboarding, queries, and visualizations.

Kelly: Thanks for that. And because we're also talking about Photon, I want to start with a question for Alex. Alex, can you level-set for us: what is Photon, and how does it differ from Spark?

Alex: Yeah, sure. Photon is a new execution engine, built completely from scratch and written in C++. It replaces the execution portions of Spark in order to gain a lot more efficiency — basically, to use the underlying hardware more efficiently. In terms of our vision, this is supposed to help accelerate all kinds of workloads, from ETL to SQL to BI-style queries. Overall, the idea is really to replace the core execution engine and drive more efficiency, to provide better value for the cloud machines you're paying for by utilizing the underlying hardware better. Photon uses a well-known technique from database engines called vectorization: a certain way of structuring the engine to gain more efficiency out of the low-level hardware. It's more friendly to how modern CPUs work; you can batch the data and crunch through it in a more efficient way.
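To make the vectorization idea concrete, here is a minimal, purely illustrative Scala sketch (Photon itself is written in C++ and is far more sophisticated): processing values one row at a time pays per-row overhead, while a tight loop over a batch of column values is friendly to CPU pipelining and SIMD.

```scala
// Illustrative only -- contrasts row-at-a-time with batch-at-a-time (vectorized)
// processing; Photon's real kernels are C++ and operate on columnar batches.
object VectorizationSketch {
  // Row-at-a-time: per-row closure dispatch and boxing overhead.
  def addRowAtATime(a: Array[Long], b: Array[Long]): Array[Long] =
    a.indices.map(i => a(i) + b(i)).toArray

  // Vectorized: one tight loop over a column batch, which the compiler/JIT
  // can unroll and map onto SIMD instructions.
  def addVectorized(a: Array[Long], b: Array[Long]): Array[Long] = {
    val out = new Array[Long](a.length)
    var i = 0
    while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
    out
  }

  def main(args: Array[String]): Unit = {
    val a = Array.tabulate(4096)(_.toLong) // 4096: a typical batch size
    val b = Array.fill(4096)(1L)
    assert(addRowAtATime(a, b).sameElements(addVectorized(a, b)))
  }
}
```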
Alex (continued): Photon is also written from scratch in C++ to remove some of the performance shackles we've had running in the JVM.

Kelly: I think you actually addressed this, but just to make it super explicit, since a question came in: is Photon only for SQL, or can it be used from Scala as well — for example, the DataFrame API or UDFs?

Alex: The Photon engine is fully compatible with the DataFrame API as well as the SQL API. It currently does not support UDFs natively. That said, the Photon engine is built in a way that, from a user's point of view, you don't really have to change your code to take advantage of it. Operations that are supported in Photon natively run in Photon, and other operations within the same query that are not supported will fall back to the old engine. It's transparent to the user: you can submit a single job or query, and parts of it may run in the old engine while other parts run on the new engine. So even if some parts aren't supported, you can still make use of Photon, and there's no change required on your side; the falling back is transparent. If you want to see which parts of the query can be accelerated by Photon, you can look at the explain plan; it will show you which parts can run in Photon and which parts will fall back to the current Spark engine.

Kelly: That's a good segue to another question I see here: can you use the Spark UI with Photon?

Alex: You can, yes. You can use the Spark UI to look at your Photon queries, and you can look at the individual operators; we provide a rich set of metrics for you to understand performance. In the Spark UI itself we even highlight the parts of the plan that can run with Photon in a different color, so you can very quickly see visually which parts run in Photon versus the Spark engine.
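A quick way to see the plan inspection Alex describes. This is a sketch assuming a Databricks notebook or cluster with Photon enabled, where `spark` is predefined; in such an environment, photonized operators typically show up as `Photon`-prefixed nodes in the physical plan, while everything else runs on the classic engine.

```scala
// Sketch, assuming a Photon-enabled Databricks environment (`spark` predefined).
import org.apache.spark.sql.functions._

val df = spark.range(1000000L)
  .withColumn("k", col("id") % 10)
  .groupBy("k")
  .agg(sum("id").as("total"))

// Inspect the physical plan: Photon-eligible stages appear as Photon* nodes;
// unsupported parts fall back to the regular Spark operators.
df.explain()
```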
Kelly: I have a general follow-up question, for anyone: is there any time or reason why I wouldn't want to use Photon in conjunction with the Databricks SQL offering?

Alex: For the Databricks SQL offering, I do think Photon is the best choice by default, because based on our experience it provides superior efficiency for these use cases. That said, Photon is a new engine written completely from scratch, and there may be certain types of operations that aren't supported yet, or certain query patterns that we haven't optimized for yet. So there is the possibility of falling back to the existing engine in edge cases, or in extreme scenarios where many of the operations aren't supported by Photon yet. But in general we'd recommend Photon as the default engine for Databricks SQL; we feel pretty good about the feature coverage and the performance we're offering.

Kelly: Awesome, and I just got our first performance question: what workloads do we expect Photon to help with, and what kind of speedups do we expect with this new engine?

Mustafa: Photon is intended to speed up CPU-bound operations — operations where we're spending a lot of time processing data. Looking at Spark, which is mostly written in Scala and Java, we identified some inefficiencies where we end up spending a lot more time while the system is busy processing, and in the cloud, time is money: the faster we process things, the more efficient Databricks will be and the more value we're adding for our customers. So Photon is primarily focused on making CPU-intensive operations more efficient. At the same time, we're making improvements across the stack. One is reading data, especially reading a large number of small files: quite often users write data to partitioned or bucketed tables, or have some form of continuous ingestion that creates a large number of small files, so lately we've been working on making the read path that deals with large numbers of small files very efficient. We're also fine-tuning prefetching across clouds: Databricks runs on GCP, Azure, and AWS, and for each of these clouds we ran experiments to understand how they behave, and we're tuning Databricks to run best on each of them. At the same time, we're making improvements for data extraction. We have a new feature called Cloud Fetch: if you're pulling a large amount of data out of Databricks, the system writes it out to the object store, and the ODBC client reads it with multiple threads. From what we've seen, this is tremendously useful, especially when doing extracts into BI tools like Power BI and Tableau.

Kelly: I'm getting a lot of Photon-based questions. This one is: does Photon work in conjunction with adaptive query execution?

Alex: It does, yes. I believe as of today all of the adaptive execution features are supported in Photon as well. We have tested it on the standard benchmarks like TPC-DS, and I think many of the numbers we report actually reflect running with adaptive execution on.
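For reference, adaptive query execution is driven by standard Spark 3 configuration flags, so a sketch like the following (again assuming a notebook where `spark` is predefined) applies whether Photon or the classic engine is doing the execution:

```scala
// Standard Spark 3 AQE settings; per the discussion above, Photon consumes
// the plans that adaptive execution re-optimizes at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```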
Kelly: I have a pretty much yes/no question for y'all: is Photon just a part of Databricks SQL, or is it also going to be part of the Databricks Runtime, with Scala notebooks as well?

Alex: That's a great question. Photon will actually be available in both, and both will be publicly available as a public preview in the coming weeks — for Databricks SQL and for regular Databricks clusters.

Kelly: Perfect, and that actually answers the follow-up question I just got about when this will be available for everyone to try out. I do have a question then about Databricks SQL; this might be more for our solutions architects or product managers on the line. What use cases have you seen so far on Databricks SQL, formerly known as SQL Analytics? Aleister, do you want to speak to some of your customers?

Aleister: Absolutely. What I'm mostly finding is that there's a growing segment of people who don't want to wait for their data to be curated and pushed into a warehouse; they need access to the whole of the data in the lake. The problem is that a notebook interface traditionally isn't a very SQL-analyst-friendly tool — the first time I ever saw a notebook, it didn't make any sense to me; I didn't know what to put in there. With the query editor, being able to look at your data catalog on the left, input some SQL at the top, get your results, and add some visualizations is a much more natural way to work. So it's been a really good place for those SQL analysts to go and operate. Additionally, you get some of those extra performance benefits and ease of management: it's really simple to set up and use an endpoint, and being able to hook your third-party BI tools up to that endpoint is a better experience for your users and your administrators compared to a traditional all-purpose cluster. Was there anything else you wanted to add, Miranda?

Miranda: No, I think you hit the nail on the head. This is the best way to get data exposed to Tableau or Power BI, or, if you want to do some quick exploratory analysis right within Databricks, you're all set and ready to go.

Kelly: Awesome. I'm getting a lot of questions about Photon again; it seems like that's the exciting part here. As a follow-up to our question about performance: are you benchmarking against things like data warehouses? This question points out that performance is a really important part of a lakehouse, so what kind of benchmarking is being done, and how is it being approached?

Mustafa: Definitely. As we develop Databricks, Spark, Photon, and so on, we try to understand what customers need in terms of efficiency, throughput, scalability, and so on. With the new features we're building, like Photon, we're able to match or beat the state-of-the-art data warehouses out there in the market in terms of CPU efficiency, and at the same time our offerings tend to have significantly better price-performance. We do these comparisons in two different ways. The first is taking a data warehouse and storing the data natively in that warehouse; in these scenarios we're able to perform better, with significantly better price-performance than these traditional data warehouses. At the same time, we have a significantly bigger advantage when we're querying data that's actually in the data lake. Databricks natively separates compute and storage: you can spin up as much compute as needed, while the storage always lives in S3, ADLS, or Google Cloud Storage. We've had this ability to separate compute and storage for a long time; it's how Databricks operates natively, and we do it on Delta, which is an open-source format. These other systems struggle significantly when reading formats like Parquet and ORC in the data lake, so this is where Databricks shines, sometimes even by an order of magnitude compared to these other systems.
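The compute/storage separation Mustafa describes boils down to querying open formats directly at their cloud-storage paths. A minimal sketch — the bucket paths and the `event_type` column are hypothetical placeholders:

```scala
// Hypothetical object-store paths; Delta and Parquet are both read directly
// from cloud storage, with compute scaled independently of the data.
val events = spark.read.format("delta").load("s3://my-bucket/events_delta/")
val legacy = spark.read.parquet("s3://my-bucket/events_parquet/")

events.groupBy("event_type").count().show()
```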
Kelly: That's awesome. I also want to remind everyone: please feel free to put your questions in the chat. We have 15 more minutes, so we can definitely get to them. Another question about Photon: does Photon support structured streaming workloads?

Alex: Good question. As of today, streaming isn't an area we have particularly focused on; we've really put our emphasis on supporting the operators and expressions available in SQL batch workloads. That said, we are aware that structured streaming is an important use case, and we will definitely come back to it in the future and add support. Today — like I said, it's not that it doesn't work at all, it's just that we haven't focused on it — there may be certain operations, like stateful aggregations, that would need a complete reimplementation in Photon, which we haven't done so far. Some of the more common use cases, like continuous merges, can be made to work with Photon today. I'd say it's more that something may work depending on the case, but it's not an area we've focused on so far; we will revisit it in the future.

Kelly: That makes a lot of sense, and it's actually a good lead-in to a follow-up: would you mind telling us a little about where your team is investing its time, and what the roadmap looks like for Photon in terms of support?

Alex: Certainly. Our focus really is on expanding the feature coverage. Some of the big things currently not supported in the Photon engine are complex types — structs, arrays, and maps; we have limited support for them, but they're not quite complete yet. Sorting and window functions are two other operators not yet supported in the Photon engine, but we're actively working on those; they're on our immediate roadmap, hopefully within the coming quarters. UDFs are another area where Photon will fall back to the old engine, but that's another thing we're looking at right now. So it's really about coverage: in SQL there are a lot of different built-in functions, and we're looking closely at customer workloads to understand which operations are commonly used, prioritizing the work based on the workloads we're seeing, so that the commonly used operations are supported. One of the exciting things we more or less completed recently, just to give you an example, is full support for date and timestamp functions — these are very, very commonly used — and we're going through the different data types, operators, and expressions and adding coverage one by one. So, to recap, the big ones we're focusing on in the coming quarters are sort, window, and complex types, and later on UDFs.
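As an example of the date/timestamp coverage Alex mentions, expressions like these are ordinary Spark SQL functions; per the discussion above they can run natively in Photon, though whether a given build photonizes each one is version-dependent:

```scala
import org.apache.spark.sql.functions._

// Common date/timestamp expressions of the kind discussed above.
val ts = spark.sql("SELECT TIMESTAMP '2021-09-22 10:30:00' AS ts")
ts.select(
  date_trunc("hour", col("ts")).as("hour_bucket"),
  dayofweek(col("ts")).as("day_of_week"),
  unix_timestamp(col("ts")).as("epoch_seconds")
).show()
```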
Kelly: Awesome. I'm going to ask one more Photon question and then jump over to Databricks SQL for a bit. Can you explain the difference between Catalyst and Photon — where they differ and where they interact?

Alex: Good question. They are different components of the stack, so it might be helpful to quickly walk through the lifecycle of a query to understand where the different pieces fit in. When a query comes into the system, it first goes into the Spark driver, where it gets parsed and analyzed, and then it goes into Catalyst, where we build an optimized execution plan. This remains the same for the old Spark engine as well as for Photon. Photon comes in on the physical plan produced by Catalyst: we do an additional pass over the plan to identify which parts are eligible to run in Photon. Then, when it comes to actually executing that plan and submitting the tasks out to the worker nodes, that's where the Photon engine kicks in, at this very last step of task execution. In that sense they're really complementary: Catalyst produces the plan, which feeds into the Photon execution engine; and for adaptive execution we feed back into Catalyst, which regenerates the plan and re-photonizes it, and so on.

Kelly: Awesome, that also answers one of the next questions that came in, about whether Photon has a built-in optimizer. A follow-up, though, which bridges Photon and Databricks SQL: does Photon use ANSI SQL syntax? Could someone address how the APIs for Photon are Spark APIs, and talk a little about what that looks like?

Alex: That's a good question. In terms of direction, we really want to make SQL a first-class citizen as part of the Databricks SQL offering, and that does include more standard-conforming SQL syntax as well as behavior and error reporting. That's an area being actively worked on, not just at Databricks but in the Spark community as a whole. It also affects Photon in particular when implementing certain behaviors. One canonical example: if you're doing an aggregation that may overflow, the usual ANSI SQL behavior is to fail the query, whereas historically Spark has been more forgiving and would return a null result or something like that. So being more standard-conforming is certainly an active area of development, I would say.
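The overflow behavior Alex contrasts is controlled in open-source Spark 3 by the `spark.sql.ansi.enabled` flag; a minimal sketch:

```scala
// With ANSI mode off (the historical default), an overflowing cast returns
// null; with it on, the query fails instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
// This would raise an overflow error under ANSI mode, so it is left commented:
// spark.sql("SELECT CAST(12345678901234567890 AS INT)").show()

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(12345678901234567890 AS INT)").show() // -> null
```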
Kelly: Awesome. Because we've spent so much time on Photon, I want to jump over to Databricks SQL for a moment. I asked Alex about his team's focus for the next two quarters or so for Photon; Miranda, would you mind talking a little about what's coming up for Databricks SQL in the near future?

Miranda: Yeah, absolutely. I think what you're going to see is a few different investment areas for us, and hopefully that aligns with a bit of what you heard in Reynold's keynote this morning. First and foremost, we want to make sure we're investing in performance: we want compute to be readily available and fast to start — you shouldn't be waiting tens of seconds or minutes to start an endpoint — and, exactly as Reynold was talking about in the keynote today, we want to continue to improve performance with higher numbers of concurrent users and over small or poorly organized data. As a second area of emphasis, we really want to improve the onboarding experience: if you're a net new customer for Databricks SQL, we want to make sure you're able to get started very quickly and easily, onboard your data, give the correct permissions to users, and understand what you need to do to administer that area of the Databricks experience. Thirdly — this is also something Reynold and John touched on in the keynote today — we want to make sure we're building a first-class query editor and experience for when you actually want to work with your data and extract insights. You're going to see the investments that were previewed today, like the tabbed editor with better autocomplete, improved schema browsing, and a lot of little productivity shortcuts — like John demoed, where you can just click and get an immediate quick select preview of your data. Those sorts of improvements will continue to be a focus area for us. And last but certainly not least, we were able to incorporate a wonderful technology from a company that joined the Databricks portfolio, Redash. It provides an awesome dashboarding experience today, but we want to bolster it a bit more, whether that's through additional visualizations, conditional formatting of those visualizations, more flexible layouts, improving the alerts experience, or integration into different parts of your workflows. Those, I'd say, are the four areas where you're going to see us really place an emphasis in the coming quarters.

Kelly: Awesome. A follow-up question around SQL analytics functionality: it sounds like with Redash we have this great SQL editor and SQL UI, but what about connections for things like drag-and-drop BI tools? How would I connect my SQL endpoint in Databricks SQL to something like Power BI or Tableau?

Miranda: Yep. We obviously provide a query editor out of the box in Databricks SQL, and we provide a dashboard experience, which is great. But we also know we have plenty of customers who are already using Tableau — that's their system of record for reporting — and plenty of customers using Power BI as theirs. We don't want to upend the apple cart; we want to make sure they can bring the data to the pane of glass through which they want to view it. So we've built very specific connectors with a number of different BI tools — certainly Tableau and Power BI included, but you saw the wall of logos that Reynold spoke to earlier today. We're certainly going to continue to make sure that experience is as snappy as possible. We have a number of customers who go the direction of running extracts, and others who want to do live queries; regardless of what decision a customer makes about how to make their data available to a BI tool, we're going to be focusing, under that performance umbrella, on making sure it's as snappy an experience as possible and that there aren't delays encountered.
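For the bring-your-own-tool path Miranda describes, any JDBC/ODBC client can talk to a SQL endpoint. A hedged sketch using plain JDBC from Scala: it assumes the Databricks (Simba) JDBC driver is on the classpath, the URL shape follows the connection details the product shows for an endpoint (check the driver docs for your version), and the host, HTTP path, and token are placeholders to copy from your workspace.

```scala
import java.sql.DriverManager

// Placeholders -- copy the real values from your endpoint's connection details.
val url = "jdbc:spark://<workspace-host>:443/default;transportMode=http;" +
  "ssl=1;httpPath=<endpoint-http-path>;AuthMech=3;UID=token;PWD=<access-token>"

val conn = DriverManager.getConnection(url)
val rs   = conn.createStatement().executeQuery("SELECT current_date() AS today")
while (rs.next()) println(rs.getString("today"))
conn.close()
```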
Kelly: And a question I just got following up on that: are there any plans within Databricks to add a GUI-based approach to Databricks SQL? Could whoever asked that elaborate on what specifically they'd like to see? I'll keep an eye out and follow up on it. One more question about Databricks SQL, then, before I jump back over to Photon: can someone tell me a little about what kind of security and governance controls Databricks SQL offers now?

Miranda: Sure. Today you're going to see a number of different controls for an admin who needs to ensure that only the right users and groups are accessing the right datasets. Over time, you're going to see Unity Catalog synchronize with the efforts we're making under Databricks SQL, so that major investment in governance will raise the water level in Databricks SQL as well. But today, certainly, if you're an admin you have that same control over users and groups in terms of what data they can access. If you create a query or a dashboard within Databricks SQL, you have the power to determine whether another user can run that dashboard or query as yourself and receive the same results you do — so they don't open a dashboard and get totally blank charts — or you can have them run it with their own credentials. You can also share can-run, can-view, and can-edit permissions, and we'll be enhancing that as time goes on.
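The user and group controls Miranda describes surface as SQL `GRANT` statements on access-control-enabled clusters and endpoints. A hedged sketch, with the database, table, and group names as hypothetical placeholders:

```scala
// Hypothetical database, table, and group names; requires a cluster or
// endpoint with table access control enabled.
spark.sql("GRANT USAGE ON DATABASE finance TO `analysts`")
spark.sql("GRANT SELECT ON TABLE finance.sales TO `analysts`")
```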
Kelly: Awesome. There have been quite a few follow-ups to the discussion of how Photon and Catalyst interact. I have a question now about the Tungsten engine: is Photon just a better version of Tungsten? How are they differentiated?

Alex: Good question. I would say, to a first approximation, yes: you can think of Photon as a replacement for the Tungsten engine. Tungsten is based more on the idea of runtime code generation, whereas Photon is based more on the idea of vectorization, and it also runs native code instead of running inside the JVM, with all the benefits and the access to lower-level hardware features that brings.

Kelly: And another Photon question: have you investigated anything like hardware acceleration to speed up Spark, and if so, what specific areas have you been looking into?

Alex: The discussion of specialized hardware is certainly an interesting area to look at. As of now, we don't see it as necessarily being the best solution in terms of price-performance for what our customers are looking at. That said, it's a space we're actively watching, to see whether the balance eventually shifts toward it providing more competitive price-performance.

Kelly: I asked earlier how Databricks SQL has been received in the field; I want to ask the same question for the Photon engine. It's been out in preview for a little while — what kind of speedups or performance boosts have we been seeing in the field?

Mustafa: Customers have been seeing great improvements with regard to speed — specifically, speedups and reductions in overall cost. In many cases customers look at their recurring ETL pipelines that run for long periods of time, ingesting and writing large amounts of data. Typically, when they migrate these, or even their standard analytics workloads, to Photon, they see a very big reduction in overall response time, which is essential to make sure the business runs on time, and a big improvement in price-performance. The speedup varies anywhere between 2x and up to an order of magnitude. Photon also offers improvements to how we process and write data: with Photon we have a more efficient C++ implementation of the writer, which is significantly faster than the current Spark one.

Kelly: Awesome. That was everyone's last chance to get in any last-minute questions, and we're going to go ahead and wrap things up here, as we're at the top of the hour. I want to thank everyone for coming to this Ask Me Anything, and I really want to thank all of our panelists for the great answers. These are two things I'm very excited about; as someone who works in the field, seeing them in practice is awesome. I hope everyone enjoys the next two days of the summit, and have a great rest of your evening, everyone.
Info
Channel: Databricks
Views: 2,121
Rating: 5 out of 5
Keywords: Databricks, Photon, SQL Analytics
Id: impQm4btSpE
Length: 30min 39sec (1839 seconds)
Published: Wed Sep 22 2021