NoSQL Distilled to an hour by Martin Fowler

Captions

Okay, well, welcome to the conference. The talk I usually give at these kinds of conferences is generally an overview of NoSQL databases, based a lot on what I learned and discovered when writing my book, NoSQL Distilled. But since this is a conference that's more aimed at NoSQL databases, I want to get a bit of a sense of your level of knowledge so that I can shade things a little bit in the talk.

So how many people is this true for: you've used at least one NoSQL database in some kind of real situation? Good, a good number of you. Okay, how many people haven't used any NoSQL databases at all and came here to discover when they might be useful? So it's a much smaller amount. Okay, and some of you who put your hand up already might put your hand up for this: you've got a sense of the breadth of different NoSQL databases and when to use them in different places. How many would say they have that? Quite a good number of you.

Okay, so that makes a little bit of a difference to how I'll go through my talk. The talk that I normally give on NoSQL databases assumes the second question: people who aren't that familiar with them. Obviously, in your case, you're a more familiar audience, as I had suspected might be the case. So I'm going to use very much the same structure of talk and the same slides, but one of the main things I'm going to do is talk a little bit about why I choose to explain things the way I do. I suspect that for many people one of the challenges is getting other people to understand what NoSQL databases are about. Even if you're familiar with it yourself, you have to talk about this kind of thing to other people, and that's where I think I can perhaps be most useful: in talking about how I approach that, with the thought that you might find the same approach useful. So for those particularly in the last question, you're not going to learn anything about NoSQL databases from me
beyond what you probably knew about anyway, but what you can hopefully learn a little bit about is how I think about explaining the concepts.

So, as with many things like this (I do like to have batteries in my clicker), I like to start by looking at some of the history, because knowing the history of something can tell you a good bit about where you are with it. Even if you're familiar with the ground, it's good to know how things happened, because that often explains why they are the way they are now. I'm not going to go back too far, but I will say that for a lot of us much of this comes with the rise of relational databases. I am just about old enough to remember when relational databases were the new hot thing. That's when my career started, and people were arguing about whether relational databases had a future or not. Some of you might remember that, but I don't see that much gray hair around, so most of you probably don't.

Relational databases have provided a lot of very valuable things for us. They obviously give us a persistence mechanism. They help with integration. They have a standard query approach, so you can query all your different SQL databases the same way. They provide transactions, which give you a useful tool to help with concurrency management. And they have a great deal of good reporting capabilities. They're not as fantastic as some people liked to claim early on, but they have still been a very valuable tool.

But they do have their difficulties, and one particular difficulty that was very obvious as I got working with them is this: when you're thinking of a typical application where you might be showing some data on a user interface, the structure of the data in the user interface is different from the way in which you have to structure the data in the database. Everything has to be partitioned out into individual rows in tables, because
the tabular nature is the essence of what relational databases are about. You're always dealing with tables (well, relations, if you want to use the mathematical term, but I always think of them as tables), and that doesn't necessarily fit how you want to work with things in memory. Whether it's displaying to a UI or even manipulating data in memory, you often have richer structures. In particular, you tend to have hierarchical structures, which make a lot of sense for working in memory, but you can't represent those easily in the database. This is what's often been referred to as an impedance mismatch: what you have in memory and what you have in a database are different.

Now, this is particularly talked about in terms of object-oriented systems; we talk about the object-relational mismatch problem. But it's not really about objects. It's really the difference between in-memory data structures and the database, because the same thing happens to you if you're using functional programming: you've got no easy mapping between the maps and lists of functional programming and the tables of the database. And even if you're just using regular data structures in C or something of that kind, you still have the same mismatch problem.

This mismatch problem was the reason why a lot of people felt at a certain point in time that object databases would be a big thing. Many people, including myself, felt that object databases would in fact replace relational databases. Well, we know how that turned out. As it happened, relational databases remained dominant. Object databases haven't died out completely, but they are a tiny, tiny niche and you very rarely come across them in practice, which is sad. I had some good experiences with object databases, but that was how it worked out.

There are many reasons that have been given as to why object databases, despite the fact that they didn't suffer from this impedance mismatch problem, despite the fact that they gave us many useful things that
we wanted (I mean, they handled persistence and transactions and things of that kind just as well as relational databases), did not succeed as much as we'd hoped. My hypothesis for this is that it really comes down to the integration role that relational databases play in many organizations. Many organizations integrate between different applications by sharing a database; different applications make different SQL calls and talk to the same tables. How many of you work in organizations that do this, that integrate between applications through a relational database? Yeah, so it's quite familiar to everyone. And that's where object-oriented databases broke down. They couldn't play in that kind of space, because their whole point was really to not use the simple relational tabular model. As a result, they just didn't fit.

This is an important question, then, for anything else that wants to take on the role of replacing SQL databases. In many organizations, how are you going to get around this integration problem? I think there is a route for NoSQL to go that way, and I'll come to that a bit later on, but this, I think, is a very central question for anyone thinking of using a non-relational technology. So relational domination continued.

I tell this piece of history because it helps set an important piece of context that I'll be using later on. The impedance mismatch problem is, I think, quite a serious one, and it affects a lot of what we want to do with NoSQL databases. And the fact that it wasn't enough to break relational databases, because of the role of integration, sets up an important part of the forces that are in play before we begin to look at the NoSQL world.

But then the question comes in: why has NoSQL got interesting at all? Why did people not look at it and say, oh, that's just like object databases, I'm not even going to go near there? The reason, I think, that we're now beginning
to ask about some alternative to SQL is entirely due to one effect, which is the internet bringing a great deal of traffic to certain websites around the world. I'm thinking here particularly of the early big ones: the Googles, the eBays, the Amazons and the like. When you're faced with a lot of traffic, you've got to grow to support that traffic. One possibility is to just build a bigger computer, but the problem is that, to start with, it's expensive, and secondly there's only so big you can really grow. So what everybody did was say: well, instead of having one big computer, let's have lots and lots and lots of computers.

At this point we run into the problem with SQL databases, because SQL databases have really been designed with the one-computer notion in mind. They have not been designed with lots and lots of databases in mind. Now, that doesn't mean that people don't use SQL with lots and lots of databases. I've talked with people in big organizations, including eBay and Betfair, where they have huge farms and they're using relational databases. And the common thing that I hear whenever someone talks about how they've dealt with that is "unnatural acts". They've had to bend SQL databases in so many different directions that they've lost most of the benefits of using relational databases in the first place. That's because the mental model of relational databases just doesn't fit in this distributed world.

So two organizations decided to do something about that: Google and Amazon. They produced their own data storage approaches that work really quite differently from relational databases. These were, at least originally, very closed, very much proprietary systems that only really got known about through the back channels of conversation and then through some moderately revealing papers from the two organizations. But that created the whole frisson of interest that leads to the word
NoSQL. So again, stepping back and saying how I'm explaining this: I'm explaining what the crucial force is that makes NoSQL different from object-oriented databases. It's the high traffic, the need to run on lots and lots of nodes. There's an irony there, because not all NoSQL is about large traffic; in fact, I would argue that most of it actually isn't. But that was the force that made people say we need something other than relational.

The next thing is: where does this word NoSQL come from? One of the things that hopefully everybody has cottoned on to by now is that NoSQL is a completely arbitrary term. It doesn't really have any meaning, and a lot of people don't realize this, so I explain the background. It really comes from Johan Oskarsson, who was working in London. He'd been using Hadoop and was interested in some of these Bigtable-ish, Dynamo-style databases. There was some activity going on in San Francisco; he knew a bit about it but didn't know that much, so to find out more: let's organize a little meeting to get people together and exchange ideas. And of course, if you're going to do that in the modern age, what's the most important thing you have to have in order to make this happen? The Twitter hashtag. And that's it; that's where NoSQL comes from. It's just a Twitter hashtag for a meeting.

There was no clever person who sat down and said, we need a new category of databases and I'm going to define them with these characteristics, or anything like that. Nothing like that at all. Just: oh, here's a hashtag for a meeting. And then people latched on to it. There was a bunch of database people who showed up at that meeting, but then a bunch of other people cottoned on, grabbed that NoSQL hashtag and said, well, that's a good phrase, let's run with it. So it's a completely accidental piece of terminology, and the fact that we're here at a NoSQL conference is really, in a way, rather silly, because we're not talking about anything that's well
defined.

So I would like to have a definition of NoSQL. If I'm going to talk about it, if I'm going to write a book about it, I like to define what I'm talking about. But I think it's impossible to come up with a definition. All one can do is observe some common characteristics that I think are important: common things that the databases referred to as NoSQL databases tend to have. Notice the very backwards way I'm approaching this. I'm looking out in the world, noticing that certain databases tend to be called NoSQL databases, either by themselves or by people working with them, and asking what the common features of these databases are. That gives me a set of characteristics, and these are the ones that I like to pick.

Fairly trivially, they're not relational. Notice it's not that they're not SQL; it's that they're not relational. The very first usage of the term NoSQL was to describe a relational database whose query language wasn't SQL but instead used UNIX pipes and filters. It was totally unconnected with the NoSQL world that has come since, and came many years before it.

Another feature is that they tend to be open source. Bigtable and Dynamo weren't, and there are certainly databases out there that call themselves NoSQL that aren't open source, but most of the ones that people talk about are open source. I think this is a very interesting part of the whole thing: the fact that this is very much an open-source-driven movement.

Cluster friendliness, the ability to run on large clusters, is another common characteristic. Many of the NoSQL databases do this very well, but one particular category, the graph databases, doesn't tend to do it particularly well. That's not necessarily a bad thing; it's just a different thing.

I define them as being part of the 21st-century web, and that's partly a timing thing. We could look at really old databases, or data storage technologies, like ISAM
and MUMPS that were around before I began. All of these older data storage technologies that the relational ones replaced: could we call them NoSQL? You can't query an ISAM data store with SQL, so does that mean it's NoSQL? Well, no, because NoSQL is a term that really goes with what was happening in the early 2000s. It really comes out of that period.

And then the last point is that they're pretty much all schemaless. They don't have a set schema that you have to fit the data within, and that's a whole other interesting characteristic in itself.

So when explaining NoSQL databases to people, I really have to stress the fact that there is no strict definition. It's an accidental term that people latched on to, and it leads to this very, very mixed view of things. Nothing very consistent here, nothing terribly logical about it. If some student came up with this, they would get an F, right? But we've got it, and we're stuck with it, and there's nothing we can do about it.

We could also think about the ties into the whole Big Data world: Hadoop, data analytics and things of that kind. I don't tend to go there particularly when I'm focusing on NoSQL, although there is obviously a big synergy between the two areas, and we see it at this conference and a lot of other conferences. That's because NoSQL databases geared to handle very large amounts of data are obviously a good fit for many of these kinds of problems, but I think it's more to do with the fact that when we're trying to analyze complex data in interesting ways, the relational model is a good tool some of the time but not the rest of the time, and therefore we need non-SQL approaches to help us in those other times.

And that rather logically moves us to talk about the data model. This is one of the natural ways of thinking about what's different in NoSQL databases: we've got different data models. And again, we haven't really got one data model; we have several.
So I bring up this screen with what I would say are probably the most often talked about NoSQL databases. It's not a comprehensive list, just the ones that have struck my consciousness as the most commonly talked about. They're often divided up in terms of different data models, but as we'll see, this division can be somewhat arbitrary.

So what are these data models? Key-value is fairly straightforward: I have some lump of data, I access it by a key, I look it up. If you've used DBM or things like it, these are very old, right? They've been around a long time.

Another category is document databases, where you have some kind of structured data form. I always find it odd that these are called document databases, because they're not like any document that I'm used to. They don't look like Excel or a page on a website; they don't have text in them. I'm used to documents being a mix of text and data, but no, these are just purely structured data. But we call them documents, for whatever reason.

An interesting point about documents is that they don't have a fixed schema, so you can put any kind of data you like in there. This is obviously always true of a key-value database, because in a strict key-value database the value is a completely opaque blob of data. Obviously it can't have a schema, because it's just a big hunk of data. But the document databases, even though they've got a structure, still have no schema to them.

But as I always like to point out when we're talking about schemalessness (and I could spend a lot more time on this, and do in different talks, but for this one I've had to pare it down to a minimum): just because there isn't a schema in the database doesn't mean there isn't a schema in your application. If you're writing code like this, you are assuming that somewhere there is a field called price and a field called quantity; otherwise things are going to
break. What's happening is that I have an implicit schema. So it's not really schemaless; your schema is implicit. And this is actually, most of the time, a bad thing, because it means that if you want to figure out how to manipulate your database, you've got to figure out what the schema is. Either you've got to derive it by looking at the data and trying to think, oh, they call it that, they call it quantity, they don't call it qty; or you've got to search around in the code to find where the code is manipulating the data, so you can see what the implicit schema is. This is not a good thing.

Now, it has some benefits, and there are some very valuable benefits to being schemaless, but there is definitely a curse that goes with it, and it's something that tends to be underplayed a lot by people who talk about schema-free databases. There is definitely a downside. You've still got to pay attention to what your schema is. You've still got to make sure that you understand it and allow people to see how they're meant to work with it.

It's also an issue in terms of migrating data over time. I hear people say that data migration, upgrading schemas, is easy with NoSQL databases because there's no schema. Well, actually, you've got one or two extra tools in your toolkit that help a little bit, but most of the problems are still the same. At some point you're still going to have to do the same kind of data migration on a schemaless database that you do with a fixed-schema database. That kind of stuff tends not to hit people until later on in their project, but it's a common refrain I hear from people who are a year or a year and a half into the project: oh, we thought schemalessness was going to mean we didn't have to do these things, but we did. So be wary of the schemaless stuff and don't overstress it.

So I've talked so far about two of the four data models, document and key-value, and I've kind of diverted a
little bit to talk about the implicit schema. But now I'm going to go back to looking at this key-value versus document thing, and I'll immediately point out that the boundary between the two is actually quite blurry. In fact, you'll hear people describe different databases as being document or key-value depending on who the speaker is. Some people might say, oh, Riak is a key-value database, and somebody else, oh no, Riak's a document database, and you go: huh, what's going on? How can they be classified so arbitrarily? It's because this boundary line is very blurry.

Key-value databases, for instance, might have metadata that you can attach to the values, which makes them begin to look a bit like documents, because they have all these extra fields of metadata placed on them; that's kind of document-ish. Similarly, with a document database there's no reason why you can't have some special field that acts as an ID or a key into that document. And then you find that most people, when they're accessing data through the document database, are actually using the key. They're not using the rest of the document, so they're treating it like a key-value database even though it's a document database. So things get very blurry between the two. I actually don't think the distinction between key-value and document databases is terribly useful. It's a kind of hint as to what to expect, but it's not necessarily a big factor.

What's interesting is what the two have in common, and the term that I use for this, and am trying to encourage other people to use, is an aggregate-oriented database. That's a little bit of a strange term, perhaps, if you're not familiar with it. What do I mean by aggregate? Well, I'm taking the term aggregate from Eric Evans, who wrote the book Domain-Driven Design. How many people have come across Domain-Driven Design? It's a good book; not an easy book, but a very good book. And with Domain-Driven Design, he wrote
this book in the context where you're building an object-oriented system and using relational databases, so this is long before NoSQL. He noticed that when you're operating in that world, you often like to deal not with individual objects but with clusters of objects that are related together. So if you want to pull information about an order from a database, you typically want all the line items on the order as well: two of this, five of that, six of the other. You pull the whole aggregate together. Typically, when you're interacting with a database, you don't want to think about it at an individual object or individual row level; you want to think about it at the aggregate level. And in particular, when you're using transactions, it's often not a good idea to allow those transactions to go across these aggregates. So he built up this whole approach of saying we think in terms of aggregates.

Now, if you take these aggregates from his work and bring them over to the NoSQL world, they are really the values of the key-value database and the documents of the document database. We've got some kind of hierarchically structured, fairly rich clump of data that you can represent as maps and lists in some kind of way, and you deal with it by taking a whole aggregate from the store into memory and updating it when you push it back. So you think at the aggregate level as you're talking with your data storage. It matches very well with the Domain-Driven Design notion of aggregates, because you save a whole document or a whole value and pull it back. I think this is the really interesting common characteristic: the fact that you operate at an aggregate level rather than at an individual row and table level.

For many applications this is a very natural match. I was talking with folks at The Guardian newspaper, who have started to use a document database in a lot of their work, and they said: well, the document model was a
natural match for us, because the article makes a natural aggregate unit. Instead of having to worry about how to spray it across all the different database tables, we just save the article and pull the article back. So in many situations an aggregate makes a natural choice.

If you're using a column-family database, you have a similar aggregate structure, but now the aggregate is identified by a combination of the row key and the column family name. It's a bit more complex, because column-family databases are a more complex model, but the basic idea of an aggregate still holds: you still have this unit, larger than a row or an object, that you're pushing back and forth. When I talk about NoSQL databases, that's all I say about column-family databases, because they're quite complicated to explain. What I really want people to understand is this notion of pulling aggregates back and forth. Whether it's a column-family, a key-value or a document database is an important detail if you're actually sitting working with them, but if you're just trying to get a basic understanding of what these databases are about, it's the aggregate message that's really important.

The thing is that instead of saving into tables, we can save whole aggregates onto disk, and this is really valuable when it comes to distribution. This is why it ties in with the whole idea of running on clusters. If you want to run things on clusters, you want to make sure that all of the aggregate is on the same node. You don't want to be doing a whole bunch of separate remote calls to gather up the information for one aggregate. And that's, of course, one of the problems with SQL: it's hard to get everything with one SQL query; you usually need a few. But the great thing about these databases is that you have the aggregate, and the database knows about the aggregation. A SQL database doesn't know that these orders and these line items are connected together in
this way. Well, it kind of knows, through the referential links and all the rest of it, but it doesn't really know; it doesn't know how to keep everything together. With an aggregate-oriented database, the aggregate is very clear, and as a result it can be managed across a network. You can put these aggregates on this machine and those aggregates on that machine, and you know that people are always going to ask for one aggregate, so you go and find it and bring it back. Managing across a cluster is now much, much easier. That's the heart of the cluster friendliness: it's the role of the aggregate.

But there's a downside. The downside comes, if we think about the order example, when somebody wants to see a report that says: give me the revenue for the products over different times. What's happening here? When you're producing this report, you no longer want to see an order as an aggregate of line items. The aggregation shifts and becomes something else. If you've got an aggregate-oriented database, your aggregates are now completely the wrong way around for what you want. You're stuffed, is the technical term. Well, actually, you're not stuffed, but you have to start using MapReduce algorithms, which is effectively the same thing as being stuffed. That's basically why MapReduce comes into play, right? Because now you have to reshuffle your aggregates just to satisfy somebody else.

Now, it's a price worth paying, because if you're dealing with large enough amounts of data, you can't hold it all in one nice relational structure, not without those unnatural acts I talked about early on. So you're going to have to bend somehow, and bending this way is probably better than trying to force-fit the relational approach. But the really vital message here is that aggregate-oriented databases work best when you have a clear aggregate that you're manipulating all the time. As soon as you want to look at that data in something other than
that aggregate, you're going to have to pay a cost for the simplicity of the usual aggregate back-and-forth.

So that's aggregate-oriented databases, and that's a large bunch of the NoSQL category. But there is one category that looks a bit different, and that is graph databases, which are nothing like aggregate databases whatsoever. In fact, I had a couple of reviewers, when I was writing the NoSQL book, say: why are you talking about graph databases at all? They're nothing like any of the others; you shouldn't be calling those NoSQL databases. But whenever you go to a NoSQL database conference, there are always some graph database people there, aren't there? You're there; I can see you somewhere. You may be hiding, but you're there, saying: Neo4j, we're NoSQL, yes. All right. And it's because it's this totally arbitrary thing. If I had been coming up with definitions, I would have had aggregate-oriented databases and everything would be clear and straightforward. But nobody asked me, so we get the graph databases.

Graphs are a whole different animal. In many ways, aggregates are about taking all that relational data and building it into these bigger clumps and managing the clumps, aggregating stuff together. Graphs go the other way. They look at relational databases and say: let's split those big hunking relational rows into even smaller pieces, and let's focus on lots of relationships between little things. So you get these complicated graph structures, and you say: well, if I want to look at a graph like this, I want to query in terms of the graph. So they come up with graph query languages, which are very much dependent on having graphs, and they design the data storage to be able to navigate through graphs easily.

People often have the mistaken thought that relational databases are about making relations between different pieces of data. But when you actually want to form
these relationships with foreign keys, relational databases struggle, and the more joins you have, the slower your queries get. It's not as bad as it was when I was younger, when the rule of thumb was never to have more than three joins in a query, because if you did, you knew that DB2 would kind of grind to a halt. But it's still an effort to put a lot of joins in, and the graph database people say: hey, we do joins. We'll do hundreds of joins; we don't care. We're join-happy, join-friendly people. And so they do all this stuff.

This is, by the way, a reason why there will never be a standard query language for NoSQL databases: they're too different. A graph database can't be queried in the same way you're going to query an aggregate-oriented database. You could argue that aggregate-oriented databases don't really need much of a query language anyway, because the most important thing is getting by the key, and if you haven't got a key, it's some metadata-oriented kind of query. You're not doing the big complicated things that SQL wants to do. So a query language is less important for aggregate-oriented databases. It's very important for a graph database; in fact, one would argue that the real strength of a graph database is what you can do with its query language. But its query language is always going to be different from that of a relational or an aggregate-oriented system. And whatever you do, you're schemaless, because everybody likes to be schemaless these days, but I've already ranted about schemalessness, so that's enough of that.

So that's the first part of how I like to explain NoSQL databases to people: focusing on the data models, why the data models are different, and the fact that they are quite different. For the second thing, I like to go into talking about consistency, and I always start off by bringing up this, which is the common way consistency is described in the NoSQL world: you know, NoSQL, we're not ACID, we're BASE. BASE is an
acronym that is so bad it makes ACID look meaningful, and what I always like to point out is that thinking along these lines is just not a good idea. You don't want to go here. So I start by saying the whole point of relational databases is you're taking what's often treated as a simple logical lump of data, the aggregate, and you're splitting it across lots of rows. And we have our aggregate-oriented databases, etc., and they have advantages because of the fact that you store the whole aggregate on its own. Obviously graph databases are different. Graph databases absolutely need to be ACID, so this is the first reason why "NoSQL equals BASE" is a misnomer, because the graph database people, they will do ACID transactions. They have to, because they're breaking things into little stuff. Whenever you break larger things into little things, you need this whole kind of ACID transaction thing. So really, when we're talking about consistency, we kind of tell the graph boys, "OK, we're not going to talk about you. You're boring, you're the same as everybody else. We'll focus on the different stuff," which is the aggregate-oriented world. So I made the comment earlier on that aggregates line up with transaction boundaries in domain-driven design, and here is again a very nice synergy between the two. The aggregate is really the essence of what we're talking about in terms of transactions, and a very important point that I like to stress to people is: if you want atomic updates in an aggregate-oriented database, you have them, but you only have them within one aggregate. So what you can't do is say, "I've got an order over here and an order over there, and I want to manipulate both of them and update them within a single transaction." Because I'm touching two aggregates, I can't do that. I have to only touch one aggregate in order to get an atomic update. Now that's, of course, less of an issue in many ways because of the fact that the aggregates are richly structured in
themselves, but it obviously is a bit of an issue. But what a lot of people forget is that ACID doesn't help them that much even in traditional databases. They're just so used to it that they don't think about it. So the example I like to give is: we've got a run-of-the-mill web setup, somebody using a browser, it's touching a server, that's touching a database, and I've got two users, both accessing the same data. So they're both saying, "OK, we'd both like to get the same order," for instance. Then one of them is going to update it, then the other person's going to update it, and what we've got is a potential problem, because you've got lost updates or whatever. Now, transactions don't really solve this problem. I mean, they can, if you open up the transaction for the entire interaction. How many people would do this in a real production system with moderate volumes? What a shock, nobody does, because of course you don't want to have long-running transactions that are open while people are going off to lunch. Not a good idea. So what we do is we only have the transaction at the point of update, but of course that leads to exactly the same problem that we'd have by having no transactions. We can still get somebody overwriting somebody else's update and still get a lost update. So how do we deal with this? We deal with it all the time. We use what I call an offline lock, and a good way of doing that is we get a version stamp on whatever it is that we're dealing with, and we make sure that when we update something, we supply the version stamp of what we read. Then that increments the version stamp, and then when the next person does an update and they attempt to post, we see there's a version mismatch, looking at the ETags or whatever mechanism we're using, and we say, "Oops, we've got some kind of conflict here," and we deal with it. We actually handle it typically within our applications, or maybe in some kind of framework, the object-relational mapping layer, but it's
not done by the database, because we can't hold the database transaction open that long. And typically a good way of doing it is to again think in terms of an aggregate and a transaction boundary. So really, in practice there's much less difference in terms of consistency than you might think. The database can help us to a certain degree: it will enforce that updating an aggregate is done in an atomic way. That's good, but we have to manage cross-aggregate things, and the same technique of using offline locks is every bit as useful with these databases as it is when working with a relational database. So the difference in practice doesn't end up being that much. Transactions are a useful tool, but they don't solve all the concurrency problems that we have to deal with. So in fact that switch is not such a big deal as many people think. But that's only one form of consistency. I like to divide consistency into two forms. There's what I call logical consistency, which is what I've just talked about, which can happen on a single node. If I've only got one computer, I get logical consistency issues as soon as I get more than one user of that computer. But we have a different form of consistency problem now, which I call replication consistency. The whole beauty of running on a cluster is that I can send data off to lots of different nodes on the cluster, and if I can do that, often it's useful to replicate data so that I don't have to talk to that machine over there if I can get to this machine over here much quicker. Replication: good thing, we like it, nice fast rapid response times. But it introduces a whole different family of consistency problems, and this is unconnected, really, with ACID, because this is what happens when you're replicating data. It's going to hit you whatever kind of database technology you're using. So again, I always like to illustrate problems with examples, so here's the example in this case. Imagine it's me and my co-author Pramod, he was the one
who actually knows about databases. We're both trying to book a room, going to book the last room in a hotel, and he's in India, I'm in America, and we're tapping away and we're talking to our local nodes, which of course are in different countries. So we both send in our request for a reservation, and the two local nodes then will communicate, bum ba-bum, and they'll toss a coin or whatever it is they do to resolve the problem, and they'll say, "OK, we have to solve this," so say Pramod gets the room. So now what happens if I take this connection line and I break it? Now there's some problem in the internet, India is cut off from the US, whatever it is, there's a breakage. Now what happens when we both try to do these things? Well, we actually have two alternatives, two different things that we can choose to do. One thing is to say: we've got no connection, so we cannot do anything. We can't guarantee that we don't double-book a room, so we're not doing anything at all. No rooms are going to be booked, the hotel is unavailable. That's one option. The second option is to say, "Yeah, we'll take your room bookings, no problem," and then you get a double-booked room, and then you've got to figure out how to sort it out. And this is consistency versus availability, and this is the essence of what it's about. And an important thing to know is you've got to choose between one or the other. You can't have both. If your network connection goes down, you have to decide which strategy you can operate with. But more important than that is: it's not for you as a programmer to decide which it is that you're going to go with. It's actually a business decision. You can't as a programmer say, "If the internet connection goes down, we're completely stopping all our hotel bookings." All right, a lot of business people say, "But the internet's going down all the blasted time. We can't do that, we've got to run a business here. And actually, we've double-booked hotel
rooms for years. We have ways of doing it. We have special little buffers in the hotel, sets of certain rooms that we don't book, some people don't turn up, we manage things that way. I mean, we've done this for decades. Don't you computer people tell me that I suddenly have to stop making money just because I'm doing that." I mean, the canonical case of this, of course, is Amazon. They wanted to make sure you could always put stuff in your shopping cart, because what is the most important activity in America? Shopping. Nothing must stop you shopping. So they go ahead and do that. In other cases you do need the consistency. There are cases when you do want to avoid a double booking, but it varies by business. And by the way, if anybody tells you, "We must be absolutely consistent in financial matters, because no financial institution would allow you to be inconsistent," they've clearly never worked in the banking sector for real, because, by the way, when you transfer money from one bank to another, they're not using two-phase commit to do that. They've lived with all of these problems for decades. So the consistency/availability choice is a business choice based on what you're going to do when your communications go down. And this is the problem I have with the CAP theorem. The CAP theorem is often stated as, "We have these three properties, you've got to pick two of them, because you can't have all three." I think that's really thinking about the problem the wrong way around. The way I think about the problem is saying: if you have a system that can partition, that can have breaks, then you have to choose between consistency and availability. It's not an issue if you've got a single-node system, right? You don't worry about the fact that, you know, one bus isn't going to talk to the other. I mean, it can happen, but it's not on your list of priorities. But if you're talking across a network, particularly a wide area network, then you begin to worry about these things, and then you have to say, what am I
going to do when I partition: am I going to shut everything down, or am I going to keep running, accepting inconsistencies, and deal with them later? You have to choose. And by the way, it's not a binary choice. There's a whole gradation of possibilities. You can allow a certain amount of consistency, a certain amount of availability. Things like, you know, when you start to talk about dealing with quorums and things of that kind, that's all about making a choice in between the two. But there's always some degree of choice between consistency and availability. Although actually, most of the time it isn't really about availability at all. I mean, at the limit it is, but a lot of the time it's actually more an issue about response time, because if I want to book my room and I'm talking to the American server, it's going to be slow if it has to talk to the Indian server and the Brazilian server and the Nigerian server and all the other servers scattered around the world. That slows things down. Yes, it can make sure everything's consistent, but now I get a slower response, and that's a problem, because we know that if people have to wait for a response, they go to somebody else. So often what you're doing is trading off consistency versus response time, saying, "OK, how much am I prepared to deal with here?" And this is really the age-old thing about the trade-offs between safety and liveness, which is a classic problem of concurrent programming. The crucial thing is to remember that when you have replicated data, when you have any kind of distributed system, you have this issue about how to deal with what happens when things break in the network, and you just have to cope with that. Thank you. Ooh, I get an extra five? I've only got ten minutes on my clock. That's cool. So there's a lot more things I can talk about. This is only the beginning of discussing issues around consistency and how things have to change. I don't have time to go into all the other stuff, so I'm going to skip over that.
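The "gradation of possibilities" mentioned with quorums can be made concrete with a little arithmetic. What follows is a minimal sketch and not something shown in the talk: it assumes the rule of thumb commonly used by quorum-replicated stores, that with `n` replicas, writes acknowledged by `w` of them and reads consulting `r` of them, a read is guaranteed to overlap the latest acknowledged write when `r + w > n`. The function names are hypothetical.

```python
def quorum_overlaps(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def unreachable_replicas_tolerated(n, quorum):
    """How many replicas can be unreachable while an operation needing `quorum` acks still succeeds."""
    return n - quorum


# Three replicas, writes and reads both wait for two of them:
# reads overlap writes, and one replica may be cut off.
strict = quorum_overlaps(3, 2, 2)
strict_slack = unreachable_replicas_tolerated(3, 2)

# Three replicas, writes and reads each wait for only one:
# faster and more available, but a read may miss the latest write.
relaxed = quorum_overlaps(3, 1, 1)
relaxed_slack = unreachable_replicas_tolerated(3, 1)
```

Raising `w` and `r` buys consistency at the price of waiting for more (possibly distant) nodes; lowering them buys response time and availability at the price of possible stale reads. Where to sit on that scale is the business decision described above.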
What I'm trying to do when I do this bit of the talk is bring out the fact that primarily it's not the ACID/BASE thing. It's not about some big difference in that sense. Really, as I said, with logical consistency it's really not that different whether you've got an ACID relational system versus an aggregate-oriented system. I mean, the boundaries are slightly different, but you really pretty much have to do the same things. The really more complicated stuff is to do with replication consistency, and that occurs everywhere. I mean, what is a cache? It's replicated stuff. All of the consistency issues I just talked about are the same with handling caches, and if anybody out there doesn't think that they're having to deal with replication consistency problems on a website, and they get any traffic at all, they're either doing something really weird or they're just ignoring all their caches that are flying all around the internet. We have to worry about this stuff, and of course cache invalidation is one of the two hard problems in computer science, so it isn't easy, but we have to deal with it. So you have to think about a whole different set of consistency issues, and we have to think about them in business terms. I really want to stress that point: it's a business decision where you draw your line between consistency and availability. It's not a technical decision. So those are the two main things that people need to know about: how to think about consistency differently, and what the different data models are. At that point you can begin to address the question of, well, when and why should people consider using a NoSQL system? Well, obviously, to say why to look at NoSQL, you've got to say what are the things that drive you towards it, and I think there are two main drivers. The one that gets a lot of attention is the one that caused this whole thing to get interesting in the first
place: the fact that you've got large amounts of data, and that obviously makes you head in that direction, and usually towards the aggregate-oriented, running-on-a-cluster kind of things. That may have been the first thing that kind of broke relational dominance, that allowed us to say there are circumstances when we can't use a relational database. That was the crack. But once you've opened that question up, once you've said, "Oh, I don't have to use relational for everything," then other things begin to slip through the crack, right? So again, the graph database is a big example of this. They don't necessarily deal with huge amounts of data any better in sheer volume than relational databases do, but once you've asked the question, "Is relational the right fit for your problem?", and you begin to realize, "Oh, I've got lots of interconnected data," the crack is opening up and the graph database is going right the way through it, because they can see there's an advantage here. And this leads to the other reason for using NoSQL: that it makes your development easier. And actually, for most of the places that I've talked to that are using NoSQL, it's the easier development that is driving it rather than the large amounts of data. So if I look at where we've seen aggregate-oriented databases applied, The Guardian newspaper is a great example. They weren't having data volume problems. They were just finding it too much of a pain in the neck to deal with a relational database, and they've got a natural aggregate, so they go for a document database. Just last week I published on my website a case study of a system we've been developing in California with the Gap to handle their purchase orders, and purchase orders are again a natural aggregate. So what they used on that project was again a document database, as it turned out, and they said it was great. They didn't have to worry about all the relational database complications of mapping that data backwards and forwards. They have a purchase order, they read it off from the database, they
manipulate it, they throw it back in the database. Much easier development. Suddenly a whole host of data issues went away. That's what was driving it. Now, if I look at other projects that we've done at ThoughtWorks, that's the constant picture. There are occasional cases where we have the big data sets, but much more often it's the easier development that is the large thing. We've used Neo4j a great deal, we've got quite a few projects that have used it, and again, if you've got graph, interconnected-style data, yeah, you can do it in a relational database, but it's such a frigging pain in the neck you'd be stupid to, now that we've got an option, a decent database that can actually handle graphs. That's a much better way to go. And we've done some really sophisticated things with genetic processing and stuff using Clojure. It's really funky stuff using that kind of thing. But the main reason that people said we couldn't do this was because databases are used for integration. And here we've been helped by a happy coincidence, because at the same time as interest in NoSQL has been growing, people are beginning to think, "Well, instead of integrating through shared databases, we should use this web service stuff and get all RESTful and URL-y and resource-like." And as it turns out, this is perfect, because if you can wrap your data behind some kind of service like this, then nobody cares what the database technology is anymore. Suddenly you can use SQL, you can use graphs or aggregate data, it doesn't matter, as long as you can supply your appropriate resource endpoints and all the rest of it. And that growth in popularity of using services to integrate is really, I think, at the heart of what's overcoming that degree of the problem. So as well as having the hammer that's made the crack in terms of large amounts of data, we've also got the web service integration that's giving us more options in terms of how we communicate. So does this mean the future is NoSQL? My answer
is no. The future is what I call polyglot persistence: the fact that now we have a choice of data storage technologies to use, of which relational is one, probably the biggest. It will probably be the biggest for many years, if not indefinitely. But the point is we have a choice. We have to decide what is the appropriate data storage technique for our application, or indeed have an application that's got a whole bunch of different techniques and technologies in place for different parts of it. And that is, I think, where the future lies. The bad news, in a way, is we've got to start thinking about it. And of course we talk about opportunities, and the joke is that an opportunity and a problem are the same thing. So there's a whole bunch of issues that we have to deal with in this future. The biggest one is that we've got to make decisions now. We can't just say, "Oh, the corporate standard is Oracle." We have to say, "What is the right database for this kind of problem?" So when thinking about what kind of project a NoSQL database might be useful for, I use these drivers as my base. I say, well, what does easier development give you? It gives you more rapid time to market, faster cycle time, new ideas out quicker, because there's less friction in our development. So that's one of the reasons to use NoSQL. And again, if I've got lots and lots of data, then a data-intensive application is typically what drives it. I also argue that it's not just that you have to have these properties. I think NoSQL is more useful in what I call strategic projects. What I mean by this is that strategic projects are projects that make a real difference to the business or the underlying organization, as opposed to stuff that just keeps stuff running in the background. Payroll is a utility. You just want it to work. You don't really care that you do payroll better than your competitors. Strategic stuff is stuff you want to do better than your competitors. The reason why I think NoSQL is
more suited towards strategic stuff is basically because it's immature. The tools aren't there, people don't know about it, it's more risky. So if you've got something that isn't compelling to you, it is just a utility, keep the lights on, you don't want to take on some newfangled technology that's still in its early days. So you want to actually focus it on where it's going to make a difference. If you see a business opportunity where bringing in a graph database is going to allow you to analyze the data more effectively, more quickly, and give you a competitive advantage, that's where you want to take it. You don't want to say, "Oh, I want to do this boring task that no one cares about, let's use some newfangled database that nobody knows to do it." So that's how I look at this and how I sum up the differences. I've actually timed myself beautifully to my own timer, which was a 10:30 finish, which means I have five minutes according to you in which I can avoid answering some questions. So please fire away. Yep. So the question is why have I not put triple stores on here. Well, mainly because most of the stuff that I've seen when people are talking about NoSQL databases has not been talking much about triple stores. Now, there are a number of these things. They've been around for ages, kind of like the object databases have, and not got out of a niche. They didn't tend to get labeled as NoSQL particularly, and that's the totally arbitrary reason I left them off. It means a lot of things are left off in that discussion. But what it does mean is, as we make the shift to polyglot persistence, it goes beyond what we look at at the moment, or maybe even in the future, as NoSQL databases. I mean, as we say, "Oh, we have to now think about choosing what is the appropriate data storage," that raises all sorts of questions. I mean, I personally have a hope that object databases have a reappearance, because for many problems object databases are very well suited. It would be nice to see them come back. I
also think there are other things that we often forget about. People, I think, underestimate the file system, which is a really good key-value database with hierarchically ordered keys. You can do a lot with a file system, and they're very, very common and quite fast and available, and have got really very nice these days. Caching is very good on most operating systems. People underestimate what you can do with the file system. I think there's a particularly interesting class of applications that could use technologies like Git to provide versioned access to information, where you really want to make sure that every change is properly logged. I mean, there's a lot of ideas you can take from version control systems, and that ends up leading you into the whole area of temporal databases and things like that. I'm quite interested by what Datomic is doing, for instance. So there's a whole universe out there of interesting non-relational database approaches which are not necessarily classified as NoSQL, and really it's a totally arbitrary word, so who cares. What I'm talking about with this talk and the book and things like that is kind of the first step, the things that are going on most noticeably under that NoSQL flag, but I think that's just the advance guard of what's going to be a much broader range of data storage things out there. No, I don't think the future is some kind of super-database that could be both a graph and a relational database and several other things, because the things are just different models, and so inevitably you're never going to have one query language that's going to work on a relational database, on a graph database, and on an aggregate-oriented database, I mean, without a huge amount of compromises, because you have to think about things differently. I mean, will some databases do some of this? We're seeing this to some degree with Postgres at the moment, saying we can support embedded data in JSON
data and things of that kind. Yeah, that will happen, but I think we have to think about the different data models as different things, because when the data model is different enough, and when the consistency needs are different enough, you have to think about them in a different kind of way. And so I think the idea that there'll be some kind of single super-database is unlikely. OK. The change can be driven in a number of different ways. I mean, sometimes it is very much driven because, well, our data usage isn't really matching what relational can do, so we have to push for something different. Sometimes you get the effect which I call the NoDBA effect, which is where our database group is such a pain in the neck that we want to use a non-relational database, because that way it's not classified as a database and we don't have to talk to them. And you laugh, but I've heard several cases where that has exactly been the reason, and as often as not they can get changes made much faster, they can go much more quickly, and the business is supporting them. And you know, if the business is supporting you, it doesn't matter what some technology group wants to say. If the business is determined enough, they will go find a way around. And so that kind of happens in some routes as well. So those are, I guess, the most common signs I've seen so far, between the two. OK, I'll be around to address more questions. Thanks again. Thank you.
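The offline lock with version stamps described earlier in the talk can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the speaker's implementation or any particular database's API: the in-memory `AggregateStore`, its method names and the `VersionConflict` exception are all hypothetical. Each aggregate carries a version stamp, a writer must supply the stamp it read, and a stale stamp is rejected instead of silently overwriting the other user's update.

```python
class VersionConflict(Exception):
    """Raised when an update supplies a version stamp that is no longer current."""


class AggregateStore:
    """Toy in-memory store of aggregates keyed by id (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (version stamp, aggregate)

    def create(self, key, aggregate):
        self._data[key] = (1, dict(aggregate))

    def read(self, key):
        version, aggregate = self._data[key]
        # Hand back a copy so each caller edits its own snapshot.
        return version, dict(aggregate)

    def update(self, key, read_version, aggregate):
        current_version, _ = self._data[key]
        if read_version != current_version:
            # Someone else updated the aggregate since this caller read it.
            raise VersionConflict(f"{key}: read v{read_version}, current v{current_version}")
        self._data[key] = (current_version + 1, dict(aggregate))
        return current_version + 1


store = AggregateStore()
store.create("order-1", {"status": "new"})

# Two users read the same order, as in the browser example.
version_a, order_a = store.read("order-1")
version_b, order_b = store.read("order-1")

# The first update succeeds and increments the version stamp.
order_a["status"] = "paid"
store.update("order-1", version_a, order_a)

# The second update carries the old stamp, so the conflict is detected.
order_b["status"] = "cancelled"
try:
    store.update("order-1", version_b, order_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

What to do once the conflict is detected (retry, merge, or ask the user) is left to the application, which matches the point in the talk that this is typically handled outside the database.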

Info

Channel: NoSQL matters Conference

Views: 72,396

Keywords: NoSQL (Software Genre), Cologne (City/Town/Village)

Id: ASiU89Gl0F0

Length: 63min 19sec (3799 seconds)

Published: Wed Dec 11 2013