Distributed Computation with WASM and WASI - Bailey Hayes & Carl Sverre, SingleStore

Captions
Hey everyone, today we're going to be talking about distributed computation with Wasm and WASI. My name is Bailey Hayes; I'm a software engineer at SingleStore. I've been working with WebAssembly for several years now. At my last company I worked on the more traditional use case for WebAssembly, which was a data visualization library running in the browser.

Hi, my name is Carl. I've been working on databases for the last eight and a half years at a company called SingleStore, where I work with Bailey; we're both software engineers there. Over the last couple of months I've been deep-diving into the world of Wasm with Bailey, and it's been a lot of fun. We're really excited to share our vision of distributed computation with Wasm and WASI, especially in relation to data.

So let's talk about apps really quick. We just had a great presentation from Liam, which was really interesting, about capabilities and how applications can become abstracted from their requirements. I think we can also look at this from the perspective of state. Applications have state; we all know that. Applications usually start off with very little state when you're in development on your computer, and that gets a lot more interesting as apps grow. Applications tend to have a huge explosion of data, especially modern applications. We're seeing a huge surge of data-intensive applications, and data-intensive applications use a lot of state. If you look at any modern application, you have things like databases, blob storage, caching layers, queues: data everywhere. The interesting thing about data everywhere is that it leads to data moving everywhere. In the last eight years working at SingleStore, I've seen a lot of companies run into the same problem: they start moving huge amounts of data between their data layer and their application layer. With the surge in data-intensive applications, we see this weird thing where the database isn't always able to run the right type of computation, so apps start moving not just the answer to a question but the actual data required to answer that question into the application, where they run custom processing and custom code. This is really inefficient: it's energy inefficient, it's computationally inefficient, and it's slow. We believe there's potentially a better way to do this.

So the ultimate question is: can we move compute to data? It's a pretty easy idea but a really hard thing to execute. We believe that Wasm might actually be a great way to represent computation such that we can move the computation to the data, as opposed to moving the data to the computation. Today we're going to present our idea for how this might look in the world of Wasm, and really engage with the community, hopefully especially tonight, and talk with all of you about some of our ideas in this space and where we want to take it. So Bailey, why do you think Wasm is such a great fit for this?

Well, we're all here, so we already know a lot of the benefits of Wasm. But for us, a multi-tenant managed-service database that runs across clouds and is distributed, the secure-by-design feature of Wasm is probably the most important component. You wouldn't want your data to leak to somebody else, say from one bank to another. So that's the first feature we looked at: a Wasm module is only going to have the capabilities that we granted it. Because we're distributed, we also want a lightweight runtime and lightweight modules, so that it's easy to move them across nodes. Near-native performance, but also very predictable performance, is another key component for running Wasm as compute within a database or any other type of storage. And the last thing, which is probably the most important one to me, because I'm not an expert in SQL like somebody else on this stage, is that I really love the idea of being able to write in the best tool, the best language for the job. I can use all the testing and CI/CD provided by my favorite tooling and language of choice, in a host-agnostic way.

That's our goal with WASI-Data, which I should just say: introducing WASI-Data, our proposal for how to do this. I want to be able to take any language, like Rust or Python, say "I'm using WASI-Data," and have that give me the framework for distributing my computation and pushing that compute down into a distributed database or some other type of distributed runtime.

So WASI-Data is partially something that we have built and are actively working on, and partially a proposal for where we see this going. We've broken it up into four stages. In stages one and two we focus on building the building blocks of computation. In stage one we basically just want lambda functions: functions that map from a single piece of data to potentially another single piece of data. In stage two we want procedural extensions: the ability to define a procedural, multi-stage algorithm or data application; we'll give you an example of that in a little bit. Stage three is about relational extensions: using Wasm to actually extend and augment relational query execution systems. Inside a database you have things like logical query plans, which describe the way the answer is computed over a distributed system, and we want to be able to extend those with Wasm. In stage four we want to go even further and look at arbitrary computational DAGs, define those in Wasm, and be able to execute them on many different data systems and distributed systems. We'll talk about each of these stages in more detail.

Let's start with the simplest one: defining functions. You have things like scalars, value in, value out. We want to operate on records, which for us might mean a row in a database but could also be structs in your language, and then be able to work on batches and vectors of these. This is basically the simplest unit of compute we can start with. I've got three examples. The first is translating English to French: say we're streaming in comments, and we want to make it really easy within our database to produce output in whatever language our users are using. The second example is sentiment analysis, where you read in text and decide whether the sentiment is, say, positive, negative, or neutral. Say a customer had a really bad experience: we would want to mark those negative comments for follow-up, like in some type of Google review. The third example is risk analysis: say you're bundling up a bunch of loans and you want to decide whether it's too risky to offer that bundle. All of these would typically depend on some type of machine learning model that we've already trained and compiled to a Wasm module.

That part was really simple, basically value in, value out. But we also want to be able to define procedures, which means creating distributed algorithms in a multi-stage way. We need to be able to define functions and create a plan. We're just passing in a string right now; this is definitely something we want to improve in the later stages. In our case the plan string is going to be a SQL query for our database, SingleStore, but it could also be, say, a MongoDB query. Let's look at that same translate example from before: we check whether we have any users whose language is French, and if we do, we update the comments so that they're in French. That's the idea: a really simple algorithm. Coming up with examples for this is sometimes surprisingly hard.
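As a rough stand-in for what such a scalar, value-in/value-out function looks like (the demo compiled a trained model to a Wasm module; this toy lexicon scorer is invented for illustration and only mimics the shape of the output shown later: positive, negative, neutral, plus a compound score), here is a minimal sketch in Python:

```python
# Value-in/value-out "scalar function" sketch: text in, sentiment record out.
# The word lists and the compound-score formula are invented for illustration.
POSITIVE = {"love", "great", "good", "amazing"}
NEGATIVE = {"bad", "hate", "terrible", "awful"}

def sentiment(text: str) -> dict:
    words = text.lower().split()
    total = len(words) or 1
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "pos": pos / total,
        "neg": neg / total,
        "neu": (total - pos - neg) / total,
        "compound": (pos - neg) / total,  # net polarity in [-1, 1]
    }

print(sentiment("i love wasm")["compound"] > 0)   # True: positive
print(sentiment("covid is bad")["compound"] < 0)  # True: negative
```

Because it is a pure function of its input, a unit like this can run unchanged wherever the data lives, which is the whole point of the "push compute to data" pitch.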
That was a very contrived example, but we just wanted to show that we want to be able to define a procedural set of steps in Wasm and run it.

Then it gets really interesting. Once we have the basic foundation, the lambda function that can work on a unit of compute, and the higher-level idea of a Wasm module as an orchestration of distributed computation, we need to actually be able to define arbitrary distributed computation inside Wasm. We propose doing this in two different ways, and potentially both.

The first is extending relational systems with Wasm. For anyone who isn't super into databases: most databases, especially relational databases, use relational algebra, and they have things called logical query plans and physical query plans. A logical query plan basically defines a distributed computation. If in your SQL engine you say something like SELECT * FROM foo JOIN bar, internally it might look like some type of tree in the engine. But sometimes these plans are limited: you might want to do a distributed operation that is a little bit more unique to your particular workflow, and in those cases you usually have to pull the data out of the distributed system, out of the database, into something like Apache Spark, where you can go write something completely custom. We propose the opposite: allow you to augment relational query plans using Wasm and push that actual plan directly into the engine. We're looking at ways to do this, and we're really excited about exploring it with the community, so this is definitely future work.

Similarly, as I mentioned earlier, we want to be able to represent computational DAGs, just arbitrary directed acyclic graphs of computation, which would allow you to do really complex workflows. If you could imagine defining a workflow entirely in Wasm, from the actual coordination aspect all the way down to the individual components in that workflow, that would be amazing. And if you could do that in an abstract way that could run on many different distributed systems, that would be even more amazing. That goes back to what I think Liam was saying about continuing to abstract out those fundamental non-functional components and focusing on the thing that really makes your app work, and we believe that is true for databases too.

Hey Carl, why wouldn't we use something like Argo Workflows? Yeah, there are a million workflow engines and a million ways to do all this stuff, but none of them allow you to push the computation all the way into the database. As you said at the beginning, Wasm is a fantastic way of representing computation, and we really believe that by defining these things inside a Wasm module and pushing that into these distributed databases and distributed computation systems, that's where we're going to see a huge win.

All right, so now a demo; we'll give you a live demo. Here we go, I think the code's still good. Remember, we keep going back to that sentiment analysis example. This is our Rust code for doing that sentiment analysis. We're building right now just on witx-bindgen and the interface types proposal, so there wasn't a ton we had to do that's unique to us to be able to run this in a distributed way. Carl wrote this macro, the WASI-Data interface macro, and that basically creates the bindings that we need. So let's run it like a data scientist, which I am not. I've got a Python notebook here, and I'm using a framework called Dask; if you're familiar with Spark, this is kind of the Python equivalent. The data we're operating on looks a little bit like this: it's a table with comments from Stack Overflow. We wanted to show you something with real data and maybe even do a real type of analysis on that data.
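To make the "plan as a tree" idea above concrete, here is a toy sketch of a logical plan for SELECT * FROM foo JOIN bar, represented as a small tree of operator nodes and evaluated bottom-up. This is purely illustrative: the node names, plan format, and naive join are invented, not anything from a real engine.

```python
# Toy logical-plan tree: each node names an operator and its inputs.
# Illustration of the concept only, not a real engine's plan format.
from dataclasses import dataclass, field

@dataclass
class Plan:
    op: str                                  # e.g. "scan" or "join"
    args: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)

def execute(node, tables):
    """Walk the plan bottom-up; leaves read tables, inner nodes combine."""
    kids = [execute(child, tables) for child in node.inputs]
    if node.op == "scan":
        return tables[node.args["table"]]
    if node.op == "join":                    # naive equi-join on a shared key
        key = node.args["on"]
        left, right = kids
        return [{**l, **r} for l in left for r in right if l[key] == r[key]]
    raise ValueError(f"unknown operator {node.op}")

# SELECT * FROM foo JOIN bar (joined on "id") as a tree:
plan = Plan("join", {"on": "id"},
            [Plan("scan", {"table": "foo"}), Plan("scan", {"table": "bar"})])
tables = {"foo": [{"id": 1, "a": "x"}], "bar": [{"id": 1, "b": "y"}]}
print(execute(plan, tables))   # [{'id': 1, 'a': 'x', 'b': 'y'}]
```

The proposal in the talk amounts to letting a Wasm module contribute new operator types to a tree like this, instead of forcing you to pull the data out to an external system.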
I've got my text, and I can see how large my dataframe is; you can think of a dataframe as almost like a table. I've got a thousand rows, and that's really common: you pull a small set of the data, iterate on it locally on your host, which might be Windows or Mac or Linux, and come up with some of your ideas. We're going to add our sentiment analysis Wasm module. Right now I have it say "I love wasm," which is a really positive sentiment; it's not completely positive because it doesn't actually know what Wasm is. But if we say something negative, like "COVID is bad," we'll see that we now get a negative sentiment. You might notice this compound field here: it basically combines the positive, negative, and neutral traits of the phrase. So that just kind of proved that my Wasm module worked.

Now we're going to operate on all those thousand rows, and we get output that looks like this. By playing around with this I formed my own hypothesis: I think that on Stack Overflow, neutral statements will have the highest score, because Stack Overflow is supposed to be giving you answers, and those answers should be technical and neutral. You're probably familiar with upvoting and downvoting posts, maybe on Reddit or even Stack Overflow, but comments also have upvoting and downvoting.

Because I want to operate on the full set of the data, I'm now going to connect to our database, which we've also added Wasm support to, and that full set of data is 82 million rows. Like before, I'm going to take the exact same Wasm module that I was running in Dask and run it in SingleStore. The first step is letting SingleStore know about my Wasm module that's sitting here in this workspace, and we'll do the exact same test as before: "hello Wasm Day, I love wasm," and that's mostly positive; "hello" itself is kind of a neutral statement. So I am curious: "read the docs." Do you think that's neutral? Negative? How do you feel? Anybody? Neutral? Okay, it's true, it's neutral, but that just shows you computers don't know about passive-aggressive.

Now let's do something really cool. I want to run distributed computation; that was the promise of the whole talk, and that's what this query does. Our database is distributed: we're taking our Wasm module and running it across all of our nodes, across 82 million rows of data, and we get back very quickly a query that uses analytics. It's pretty simple analytics here, but the idea is that we bucket all the rows and group them by their score, so if the score ranges from say 10 to 20, that's one bucket. When we look at this, we see scores that are both very negative and very positive, so from a quick peek at the data I'm maybe not able to draw any conclusions. The next thing you do is plot it. We're going to render a really simple graph with a trend line using ordinary least squares (can you tell I'm not a database or data science person?). We actually see two really interesting trends in this graph. We're plotting our buckets, so that's the x-axis, and on the left-hand side we're plotting polarization, going from zero to one: zero being completely neutral, one being very negative or very positive. When we graph this with the trend line, we see first that it supports my hypothesis that higher scores are more neutral. But something else interesting came out: they also seem to lean toward neutral-positive; perhaps negative comments are more likely to get downvoted. So I think that's pretty cool, and it shows a little bit of the power of being able to run Wasm in two totally different frameworks, and I would love to see it show up in others. So if you're into it, come talk to us.
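The bucketing analysis just described can be sketched in plain Python. Everything here is an assumption for illustration: the bucket width, the choice of |compound| as the polarization measure, and the sample rows; in the demo the equivalent work was a distributed SQL GROUP BY running over 82 million rows.

```python
# Bucket comment rows by score and average their polarization per bucket.
# Polarization is taken as |compound|: 0 = neutral, 1 = fully polarized.
from collections import defaultdict

def polarization_by_bucket(rows, bucket_size=10):
    buckets = defaultdict(list)
    for row in rows:
        bucket = row["score"] // bucket_size * bucket_size  # 10..19 -> 10
        buckets[bucket].append(abs(row["compound"]))
    return {b: sum(vals) / len(vals) for b, vals in sorted(buckets.items())}

rows = [
    {"score": 3,  "compound": -0.8},   # low-score, strongly negative comment
    {"score": 5,  "compound": 0.9},    # low-score, strongly positive comment
    {"score": 42, "compound": 0.1},    # high-score, nearly neutral comment
    {"score": 47, "compound": 0.0},    # high-score, neutral comment
]
result = polarization_by_bucket(rows)
print(result)  # the low bucket is far more polarized than the high bucket
```

Plotting the bucket keys against these averages, with a fitted trend line, is exactly the graph described in the talk.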
Just something to point out about that graph: it was generated by pushing the sentiment code, which we compiled to Wasm, into a distributed system running in Google Cloud. So instead of pulling 82 million comments down to our laptop to generate that graph, we are pushing the compute to Google Cloud to generate it, and all we're pulling down to the laptop is essentially a couple of data points. That is the future of combining Wasm and data. It's sick. And our whole goal is that we were able to do this in a runtime-agnostic way. Our example is pushed up to GitHub under SingleStore Labs; our last slide will have a resource link. We also just want to say that we want to work with machine learning, so being able to work with neural networks and maybe even do task-parallel operations is very important to us.

So in conclusion: Wasm and distributed is a fantastic thing. As Liam said, it's the future, and we really, really believe that, but we believe that data has to be involved. WASI-Data is really just a proposal, an idea: that we could actually push computation down into distributed systems, databases, and data systems, so that data apps can stop pulling data out. Keep the data where it is, keep it efficient, and leverage Wasm for its superpowers. So that's our talk. Any questions? And please scan the QR code for the GitHub repos, and hit us up on Twitter or GitHub if you want to reach out. Thank you. [Applause]

So, any questions? And can we get questions from virtual as well; how does that work? Who's running that? Yeah, they should in theory be asking questions either in the Slack, if they wish, or in the chat in the virtual environment, but I hadn't seen any questions during the talk, I think because it was so interesting; let me check again. If any questions appear in the Slack: the Slack is available all week, so if you have questions while you're chewing on everything you saw, Carl and Bailey will drop into the Slack and you can ask your questions there. If right now isn't... there's a question over there; go ahead and say your question, and we'll repeat it.

[Audience question about] essentially limiting the resources that a single Wasm function can utilize during execution: do you have a strategy for dealing with things like that? So, to repeat the question to make sure everybody can hear: the question was basically, do we have strategies for limiting how much computation, or how many resources, a Wasm function is using during execution? I'll let Bailey take that one.

Yeah, I love that question. Crypto has a really similar problem, where they want to compute the gas: how many cycles are going to run on the CPU. And yeah, that's super important for us, because you're running in a managed service, just like maybe a serverless function; if you squint and tilt your head, a lot of the use cases are really the same. We haven't done anything yet; we would love to take advantage of the things crypto is doing in that space. Something I haven't seen yet is throttling, which is something else I would really like to check out. If we could find a way to predict the performance: a lot of these machine learning operations are long-running, like they're just always running, perhaps, and perhaps you only want to run them when there's downtime. So being able to say, okay, I know what my gas is going to be, I know this is kind of low priority, I want to set it to this exact set of resource limits: yes, absolutely, that's a huge use case for us.

All right, if I may: James is asking a quick question, and then we'll move on to the next presentation. He wants to know, in the Slack: how does this compare with the current MapReduce feature set and that kind of approach? If you have a quick answer, great; if you want to elaborate, you can do it in the Slack afterwards too.

The quick answer is basically: in the current approach, with what we have inside SingleStore and inside Dask, what we've implemented, you can actually build MapReduce using a combination of your functions, which can act like a mapper, and the actual frameworks themselves. SingleStore and Dask both provide solutions for doing the reductions, groupings, and shuffles that you need to run large-scale MapReduce operations. In our sentiment analysis example we were using it like a MapReduce operation: we were mapping each of the comments to sentiment records containing the sentiment scores, and we were reducing by aggregating those sentiment scores; so you could think about that as a MapReduce. But in the future, as we go toward relational extensions and workflow extensions, we want to be able to define arbitrary custom shuffling routines, custom re-partitioning routines, and things like that, and that's where you start to be able to build completely custom things.

Yeah, we didn't talk about Substrait at all. That's true: we are working with a community called Substrait, and if you're interested in learning more about providing abstract logical plans, I definitely recommend checking them out. We'll link to them in our resources section as well.
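The MapReduce framing in that answer can be sketched in plain Python: map each comment to a (key, value) pair, shuffle by key, then reduce each group by aggregating. The sample rows and the choice of score buckets as keys are invented for illustration; in the demo the equivalent steps were a Dask operation and a SQL aggregate.

```python
# MapReduce view of the sentiment demo: map each comment to
# (bucket, sentiment score), then reduce each bucket by averaging.
from itertools import groupby
from operator import itemgetter

def map_comment(row):
    # stand-in for the Wasm sentiment UDF acting as the mapper
    return (row["score"] // 10 * 10, row["compound"])

def reduce_bucket(pairs):
    scores = [score for _, score in pairs]
    return sum(scores) / len(scores)

comments = [
    {"score": 12, "compound": 0.4},
    {"score": 15, "compound": 0.2},
    {"score": 31, "compound": -0.5},
]

# "shuffle": sort by key so groupby can gather each bucket's records together
mapped = sorted(map(map_comment, comments), key=itemgetter(0))
reduced = {key: reduce_bucket(group)
           for key, group in groupby(mapped, key=itemgetter(0))}
print(reduced)
```

The point made in the answer is that the framework supplies the shuffle and grouping machinery, while the Wasm module only needs to supply the mapper (and, later, custom shuffle or repartition routines).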
Okay, we have more questions here or in the Slack; we've got just a few more minutes. Question here? Yep, go ahead. I have a bunch of questions! One of the slides showed some code you wrote with a particular macro for generating the interface types, I believe. Have you open-sourced that code? And would you be able to elaborate briefly on what that workflow looks like for passing data into the WebAssembly function and potentially passing it back out? Is that still a somewhat painful process, or does the code you showed get rid of a lot of that?

So this code leverages witx and witx-bindgen 100%. Essentially, this macro takes a mod written in Rust; the goal is ergonomic simplicity for people who might just want to write a simple Rust function and don't necessarily want to learn anything more than that. We want to simplify it to literally: import the WASI-Data interface, annotate your Rust mod (I think it's up on the screen) with some basic structs and so on, and this macro will spider the mod and export all the functions using the canonical ABI, the same as witx-bindgen. The advantage this gives you is on the host side: when we implemented the Dask side, for example, all we had to do was take the existing witx-bindgen Python module and essentially load the witx. What we're exploring is ways of eliminating the witx stage, because we really want to create ergonomic simplicity. That's not entirely about WASI-Data; it's more general for the ecosystem: we want to be able to write a simple Rust function and just deploy it, with good structural types, real types. And we did that for both Rust and Python.

Really quickly, can we recapitulate the question he asked, so everybody's got it? It's so hard to remember. Yes, the question was about this Rust interface that we built: what is the way in which it communicates with the host, and how does it interact with types and so on? You asked it much better than we did.

We have just a couple more minutes, so if you want to launch another question; we are watching the Slack and the meeting as well. There's a question online: how would you handle failure recovery during a batch process? Would the state of the process be stored somewhere to be able to restart? How are you thinking about that?

That's a great question. One of the cool things about Wasm is that I'm able to inspect things before I ever actually run them, which is kind of how crypto computes gas ahead of time, like with Ethereum. For us, I could do things like saying "I'm about to execute a transaction" and annotating that before I execute it. Wasm is basically a stack-based machine, so I could probably replay things if it fails. I can also freeze my Wasm module where it is: say I have some kind of network or hardware failure happening on a node, I can move that exact Wasm module to a different node, and, all of this of course being in theory, I would be able to replay that transaction. So it's kind of similar to Spark's RDDs, resilient distributed datasets. That's the idea, and that's one of the cool things with Wasm.

Cool. Well, I think that's all the time we have for questions. Thank you all for attending our talk, and please, if you got the QR code... ah, yeah, okay, never mind; anyways, there's the QR code, and we're all good. Thanks everyone. [Applause]
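The replay idea in that last answer can be sketched abstractly. This is just the generic checkpoint-and-restart pattern, with an in-memory dict standing in for durable checkpoint storage and a deliberately flaky processing function standing in for a node failure; it is not anything SingleStore has implemented.

```python
# Minimal checkpoint-and-replay sketch for a batch job: persist the index of
# the last completed batch so a restarted worker resumes where it left off.
def run_batches(batches, process, checkpoint):
    for i in range(checkpoint.get("done", 0), len(batches)):
        process(batches[i])
        checkpoint["done"] = i + 1   # record progress only after success

processed = []
failures = {"b": 1}                  # simulate one node failure on batch "b"

def flaky(batch):
    if failures.get(batch, 0) > 0:
        failures[batch] -= 1
        raise RuntimeError("node failure")
    processed.append(batch)

checkpoint = {}
try:
    run_batches(["a", "b", "c"], flaky, checkpoint)
except RuntimeError:
    run_batches(["a", "b", "c"], flaky, checkpoint)  # replay from checkpoint

print(processed)   # ['a', 'b', 'c']  (batch "a" was not reprocessed)
```

In the Wasm framing from the answer, the frozen module plus the checkpoint would be what moves to a healthy node before the replay.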
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 296
Id: TqZsbrvhD54
Length: 27min 51sec (1671 seconds)
Published: Fri Oct 29 2021