Introduction to NoSQL • Martin Fowler • GOTO 2012

Video Statistics and Information

Reddit Comments

The biggest thing you need to understand about NoSQL is that it is a collection of completely different technologies that have nothing to do with each other besides not being relational databases.

It's kinda like if someone started a 'NoObject' programming movement that grouped C, Lisp, Fortran, and assembly together.

πŸ‘οΈŽ︎ 22 πŸ‘€οΈŽ︎ u/defcon-12 πŸ“…οΈŽ︎ Mar 10 2014 πŸ—«︎ replies

You should have put the title in the title.

Put your personal opinion or whatever in a comment, not the title.

πŸ‘οΈŽ︎ 14 πŸ‘€οΈŽ︎ u/x-skeww πŸ“…οΈŽ︎ Mar 10 2014 πŸ—«︎ replies

And this video may help you really understand the NoSQL crowd:

http://www.youtube.com/watch?v=b2F-DItXtZs

;)

πŸ‘οΈŽ︎ 20 πŸ‘€οΈŽ︎ u/willvarfar πŸ“…οΈŽ︎ Mar 09 2014 πŸ—«︎ replies

Is the NoSQL bandwagon still accepting hop-ons?

πŸ‘οΈŽ︎ 3 πŸ‘€οΈŽ︎ u/[deleted] πŸ“…οΈŽ︎ Mar 10 2014 πŸ—«︎ replies

A couple of notes for anyone who has seen the video:

  • I am pretty surprised that "data discovery" is not a NoSQL category. For example, Lucene is an open-source search library that utilizes schema-less persisted storage. I personally would have added that in there.

  • If you are deploying to the cloud, you can get started with NoSQL rather easily and have the vendor maintain the availability/redundancy for you: Azure Table Storage, Amazon SimpleDB/DynamoDB, and Google Bigtable.

  • There are a lot of private NoSQL vendors. I would almost challenge what Martin says in the video about NoSQL DBs being categorizable as open source. Even if you do have a fully open-source DB server, enterprise support and drivers for a lot of these NoSQL databases are provided by private companies. Depending on your deployment, it is not 100% open source.

  • Licensing. Some of the "open source" NoSQL databases don't have super friendly licensing. I wanted to use Neo4j for a project, but their licensing for commercial products can be expensive. http://www.neo4j.org/learn/licensing

  • Classic RDBMSs are starting to add more NoSQL-like features. For example, SQL Server 2012 has columnstore indexes and SQL Server 2014 has an in-memory engine. Postgres has foreign data wrappers that can federate queries across NoSQL databases.

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/bartczernicki πŸ“…οΈŽ︎ Mar 11 2014 πŸ—«︎ replies
Captions
So welcome to the track on NoSQL databases, although it seems like we've already had a couple of tracks on NoSQL databases. My name is Martin Fowler. Stephen Oscars is hosting the track, and he asked me to kick things off. Most of this track is going to be about the practical experience of people making use of NoSQL databases, but this talk is the exception, because it's really an introduction to what NoSQL databases are all about. I'm going to do my best to cram into 50 minutes as much useful information as I can, to give you a context for understanding a lot of what goes on in the later talks.

In the first part I'm going to talk a little bit about the history of NoSQL databases, because as with many things, to understand why something is the way it is, it's useful to know how on earth it got there in the first place. When I started in the computer industry in the mid 80s, it was just at the point when relational databases were really coming in and beginning their rise. It's kind of hard to imagine that there was a time without relational databases, but I remember when they were the new hot thing and people were arguing about whether they would be any good or not. They've brought us many benefits, obviously. They look after the persistence of our data. They're also very important in that they manage concurrency through transactions. SQL has become a de facto standard language for talking to these databases; it's not entirely standard, but it's standard enough that once you know SQL you can talk to these different tools. They've also become very important for many organizations for integration and reporting, which, as we'll see, has both its up and down sides.

So SQL databases are a really good thing, but they also have some problems, and the most obvious problem is one that most application developers run into as they're working with them: we assemble structures of objects in memory, often as a kind of cohesive whole, and then in order to save it off to the database we have to strip it apart into bits so that it goes into individual rows in individual tables. A single logical structure, for our user interface and for our in-memory processing, ends up being splattered across lots and lots of tables. This is referred to as the impedance mismatch problem: we have these two very different models of how to look at things, and having to map between them causes difficulties. This is what leads to object-relational mapping frameworks and all that kind of stuff.

Now, the impedance mismatch problem was sufficiently awkward that in the mid-90s people said: we think relational databases are going to go away and object databases are going to come in; that way we can take our in-memory structures and save them directly to disk without any of this mapping between the two. But we know what happened there. We didn't see object databases take over; people like me, who thought they were going to be the dominant thing in the future, were wrong (and yet you still listen to me, oh well). We argue endlessly about why object databases didn't fulfill that potential, and I think at the heart of it is the fact that SQL databases had become an integration mechanism: many people integrated different applications through SQL databases, and as a result that made it very hard for any other kind of technology to come in. That led to relational continuing to be dominant right through into the 2000s.
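A minimal sketch of that mismatch, assuming a hypothetical order model (none of these class or table names come from the talk):

```python
from dataclasses import dataclass

# One cohesive in-memory structure...
@dataclass
class LineItem:
    product: str
    quantity: int
    price: float

@dataclass
class Order:
    order_id: int
    customer: str
    items: list

order = Order(1, "alice", [LineItem("book", 2, 9.99), LineItem("pen", 1, 1.50)])

# ...that relational storage forces us to strip apart into rows for
# separate tables, to be reassembled later with a join (or an ORM doing one):
orders_row = (order.order_id, order.customer)
line_item_rows = [(order.order_id, i.product, i.quantity, i.price)
                  for i in order.items]
# SELECT * FROM orders JOIN line_items USING (order_id) WHERE order_id = 1
```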
So relational has had 20 years of complete dominance, certainly of the enterprise data space and plenty of others as well. We saw with the science work at the Large Hadron Collider that they didn't really want to use relational databases, but they had to, to some degree at least.

What changed, really, was the rise of the internet, and particularly sites that have lots and lots of traffic: the big internet sites such as an Amazon or a Google or an eBay, or something of that kind. As you get large amounts of traffic coming into your site, what do you do when you need to scale things? One obvious route is to scale up with bigger boxes, but that approach has problems: it costs a lot, and there are real limits as to how far you can go. So, as I hope you all know, a lot of organizations, most famously Google, use a completely different approach: lots and lots of little boxes, basically CPUs, motherboards, disks, commodity hardware, all thrown into massive grids. But here there's an issue for the data storage. SQL was designed to run on those big boxes, designed to run as a single-node system; it does not work very well with large clusters of little boxes. Several of the big data players understood this. I've talked to several people who have attempted to take relational databases and run them across clusters, and the usual term that comes up in conversation when they describe how they tried to do it is "unnatural acts". It's very hard to do. So a couple of organizations said: we've had enough of this, we need to do something different, and they developed their own data storage systems that were really quite different from relational databases. They started talking a little bit about that, published papers about what they were up to, and it is this that really inspired a whole new movement of databases, which is the NoSQL movement.

Now it's important at this point to talk a little bit about where the term NoSQL comes from. A lot of people complain about it, quite reasonably, because it's a really odd term: trying to define a movement by something it's not. The origin is really very simple. There was this guy in London, Johan Oskarsson, who did a lot of work with Hadoop and things like that. He had to go to a conference in California, and he wanted to take a look at all of the various interesting databases that were poking around at the time, so he proposed a meetup, a little meeting where people could discuss ideas. And of course, if you're going to do that in the late 2000s, you absolutely need something that's really, really important: you need a Twitter hashtag. So he asked around: what would be a good hashtag? It's got to be short, and it's got to be unique so you can easily search on it. And a guy came up with the hashtag #nosql. That's all NoSQL was ever meant to be: a Twitter hashtag to advertise a single meeting at one point in time. The fact that it has now become the name of the whole movement was completely accidental; nobody thought that was going to be the case. But this is the way language often goes: it's very unpredictable, fits and starts. So there was a whole bunch of people who turned up to that meeting; by the way, the list of databases represented there is not what we'd call the whole set of NoSQL databases, since a lot of databases that weren't at that meeting are now considered part of the NoSQL umbrella. This inevitably leads to the question: what is the definition of NoSQL?
This is something that I had to think about when writing a book about the subject; it's kind of important, if you're going to write a book about something, to define what it is you're writing about. My conclusion is that we cannot define NoSQL databases, because of this very odd history. What we can do is identify some common characteristics of NoSQL databases, and there's a whole bunch of these. Obviously, NoSQL databases are not relational; it's actually more about non-relational than it is about no SQL. There's also a strong leaning towards cluster-friendliness, the ability to run on large clusters, because that's where the original spark, through Google and Amazon, came from; but that's not an absolute characteristic, as there are some NoSQL databases that aren't really focused on running on clusters. Most of these databases, rather interestingly, are open source; most of the things we generally call NoSQL databases are open source. There are commercial tools that like to call themselves NoSQL databases, and maybe over time that will change so that open source is no longer a common characteristic, but it still is at the moment. Perhaps most importantly, they're all things that have come out of the 21st-century web site culture. There are plenty of databases out there, going back long before relational databases, that do not use SQL or the relational model, but we don't really call such things as IMS or MUMPS (for those who've heard of either of those) NoSQL databases. So those are what I see as the common characteristics; I'll mention the last one in a moment.

Now, one of the things that's interesting about NoSQL databases is that they use different data models to the relational model, obviously, since the name says that. And if we plot a picture of the most commonly referred-to NoSQL databases, typically what we see is that they get divided into four broad chunks based on their data model. Let's dig into these data models a little bit more.

The simplest data model to talk about is that of the key-value store. The basic idea is that you have a key, and you go to the database and say: grab me the value for this key. The database knows absolutely nothing about what's in that value. It could be a single number, it could be some complex document, it could be an image; the database doesn't know and doesn't care. You can view this basically as just a hash map, but persisted to disk. Simple as that.
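A minimal sketch of that idea, using `shelve` from the Python standard library as a stand-in for a real key-value store (my choice of illustration, not a tool named in the talk):

```python
import shelve

# A persistent hash map: opaque values addressed only by key.
with shelve.open("kv_demo") as store:
    store["order:42"] = {"customer": "alice",
                         "items": [{"product": "book", "qty": 2}]}
    # The store neither knows nor cares what the value contains; the only
    # question you can ask is: give me the value for this key.
    print(store["order:42"])
```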
Another data model that's very common is the document data model. The document data model thinks of the database as a store of a whole mass of different documents, where each document is some complex data structure. Usually that data structure is represented as JSON, because JSON is fashionable these days; I mean, you could do it in XML, but who wants to be seen wearing XML in public? No one. So we have these different documents all floating around, and the usual document database will allow you to say: give me the documents that have these fields with these values. You can query into the document structure, and you can usually retrieve portions of a document or update portions of a document. So the big difference from the key-value store, where the value is completely opaque, is that the document is much more transparent.

One thing to notice right away about document databases, and indeed all NoSQL databases, is that they don't tend to have a set schema. With a relational database, you can only put data into the database if it fits the schema you've defined for that database. With almost all NoSQL databases, you can basically shove in anything you like, and the NoSQL people will talk endlessly about how this increases your flexibility, how it makes it easier to migrate data over time, how it's all absolutely wonderful. And as usual, that's not the entire truth. Usually when you're talking to a database, you want to get some specific pieces of data out of it: you say, I would like the price, I would like the quantity, I would like the customer. As soon as you do that, you're setting up an implicit schema. You are assuming that an order has a price field; you are assuming that the order has a quantity field; you're assuming that it's called "price" and not "cost" or "priceToCustomer" or whatever else it might be. That implicit schema is still in place, and you've got to manage it in many ways with a similar approach to the way you manage a relational database's stricter schema. So "schemaless" is really a bit of a wussy term here. Now, having no fixed storage schema does give you some options that you don't get with relational databases; there is a difference, and there are advantages in terms of flexibility. But you can't ignore the fact that you're always dealing with an implicit schema. The only time you don't have to worry about one is if you do something like: give me all the fields in this record and throw them up on the screen, field name, value. There's the occasional time you want to do that, but most of the time you actually want to do something more interesting.
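A minimal sketch of that implicit schema, with made-up field names:

```python
# Two documents in the same "schemaless" collection; nothing stops
# the field names from drifting apart.
orders = [
    {"id": 1, "price": 9.99, "quantity": 2},
    {"id": 2, "cost": 4.50, "qty": 1},   # same concepts, different names
]

# The moment code asks for specific fields, an implicit schema appears:
# this line *assumes* every order has "price" and "quantity", and the
# second document would make it blow up with a KeyError.
# total = sum(o["price"] * o["quantity"] for o in orders)
```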
So I've talked about two data models, key-value and document, and I've presented them as two quite different things, but actually the line between the two is a hell of a lot fuzzier than that. Many key-value stores allow you to store metadata about the value, which of course allows you to build more complicated indexes. If you want to get all the orders for a particular customer, you don't want to search every order in the database to find them, the moral equivalent of a table scan; you want an index on that. So key-value databases typically allow you to store various metadata, which kind of makes them feel a bit like document databases. And in a document database, yes, you can run all sorts of queries against things, but often there's an ID, and often when you look a document up you do it by saying, give me the thing with this particular ID, and that ID is effectively the same as the key in a key-value store. So the boundary between a key-value and a document database is, as I said, somewhat blurry; I've often heard a particular database sometimes described as key-value and sometimes described as document. In reality I wouldn't worry too much about the difference between them: think of it as a first approximation to work with, but it's not actually that important as things go on.

What is important, though, is that both key-value and document databases have this common notion that you're taking some complex structure and saving it as a single unit into the database, whether it be a relatively transparent document or a completely opaque value. That commonality made me think: well, we really need some term to describe databases that work like that. And so for the book I came up with a term: the aggregate-oriented database, a database that allows you to store these big, complex structures. Where did the term aggregate come from? It comes from this book here, written by Eric Evans: Domain-Driven Design. How many people have read Domain-Driven Design? Hopefully a good few of you; it's an excellent book. It really talks about how to think about modeling domains, and one of the key concepts in the early part of Domain-Driven Design is that often, when we want to model things, we have to group them together into natural aggregates, because when we're talking to a database, even a relational database, it makes sense to think of those aggregates when storing and retrieving data. If we're modeling orders, for instance, usually we'll have separate classes for the order and the line items; that's pretty standard object modeling 101. But we think of the order as a whole thing, a single unit. So an aggregate may be many classes, it may be quite a complex structure, but when we're persisting it or retrieving it, we think of it as one thing to pass back and forth. Now, in a relational database we have to splatter that aggregate across a whole bunch of tables, but the nice thing about an aggregate-oriented database is that we can save the aggregate as a single unit in the terms of the database itself. So for a key-value database, the aggregate is the value; in a document database, the aggregate is the document; and that becomes the single unit that we move back and forth. I certainly find this a much easier way to think about the commonality of these classes of databases.

Now, the third data model I'm going to briefly describe is that of column-family databases. This is the more complicated data model of these. It's another aggregate-oriented model; however, the column-family database basically says: we have a single key, they call it a row key, and within that we can store multiple column families, where each column family is a combination of columns that fit together. The column family here is effectively your aggregate, and you address it by a combination of the row key and the column family name. Now, column families can also look rather different; look at the lower one here, which is effectively a list of items, the various orders for a customer. That doesn't feel so much like the typical record structure you might know, but it is of course the same as storing an array in a document, or something of that kind. So again you get something of that rich structure that you can build in here. Column-family databases give you a slightly more complex data model to work with, but the benefit you get is again in terms of retrieval: you can more easily pull individual columns and things of that kind out. But again, the broad data model is that of an aggregate-oriented picture.
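A minimal sketch of that addressing scheme, modeling a column-family store naively as nested dictionaries (the row key and family names are made up):

```python
rows = {
    "customer:alice": {                                    # row key
        "profile": {"name": "Alice", "city": "Aarhus"},    # column family
        "orders": {"order:1": "2012-05-01",                # list-like family:
                   "order:2": "2012-06-12"},               # one column per order
    }
}

# The aggregate is addressed by (row key, column family)...
profile = rows["customer:alice"]["profile"]
# ...and individual columns can be pulled cheaply out of the aggregate.
print(profile["city"])
```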
So the great thing about this is that now, when you're taking your aggregate in memory, instead of spreading it across lots of individual records you get to store the whole thing in the database in one go, and the database knows what your aggregate boundaries are. Where this becomes really useful is when we talk about running the system across clusters, because if you're going to distribute data, what you want to do is distribute the data that tends to be accessed together, and the aggregate tells you exactly what data is going to be accessed together. So by placing different aggregates on different nodes across your cluster, you know that when somebody says, give me the details of this particular order, you're only going to go to one node on the cluster, instead of shooting around goodness knows how many to pick up different rows from different tables. So aggregate orientation naturally fits in very nicely with storing data on large clusters, and that's of course part of the whole thing with Bigtable and Dynamo: both effectively went for a cluster-oriented approach, Bigtable very much a column-family style, Dynamo much more a key-value store, and both make running on clusters efficiently way more straightforward. That's really been, as I said, the driving factor here.

However, nothing is perfect, and aggregate orientation isn't always a good thing. Let's imagine we've got our order system and we want to look at the data differently. We want to say: given a particular product, tell me the revenue, tell me the past revenue. We now don't care about orders at all; we only care about what's going on with the individual line items of many orders, grouping them together by product. Effectively what we're doing is changing the aggregation structure, from one where orders aggregate line items to one where products aggregate line items: the product now becomes the root of the aggregate. In a relational database this is straightforward, we just query the data differently; it's very easy to rearrange the data into whatever structures we might want in different cases. With an aggregate-oriented database it's a pain in the neck. You can do it, and what people typically do is run MapReduce jobs to rearrange all the data into different aggregate forms, probably keep those persistent, or maybe even do incremental updates; but it's always going to be more complicated. So being aggregate-oriented is an advantage if most of the time you're using the same aggregate to push data back and forth into persistence; it's a disadvantage if you want to slice and dice your data in different ways.
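A minimal sketch of that rearrangement as a map step and a reduce step, run in-process here on made-up order data:

```python
from collections import defaultdict

# Data aggregated one way: orders are the roots, line items inside them.
orders = [
    {"id": 1, "items": [{"product": "book", "revenue": 20.0},
                        {"product": "pen",  "revenue": 1.5}]},
    {"id": 2, "items": [{"product": "book", "revenue": 10.0}]},
]

# Map: emit a (product, revenue) pair for every line item of every order.
pairs = [(i["product"], i["revenue"]) for o in orders for i in o["items"]]

# Reduce: regroup by product, which becomes the new aggregate root.
revenue_by_product = defaultdict(float)
for product, revenue in pairs:
    revenue_by_product[product] += revenue

print(dict(revenue_by_product))   # {'book': 30.0, 'pen': 1.5}
```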
So what I've done so far is cover some of these models: I've basically taken document, column-family, and key-value and lumped them together under this aggregate-oriented category, and I think that's a useful abstraction, at least at the level of what I can say in 50 minutes. There's one very noticeable outlier, though, and that is graph databases. Graph databases are not aggregate-oriented at all; they use a completely different data model. A graph database's data model is basically that of a node-and-arc graph structure (not a bar chart or anything like that, but nodes and arcs), something that hopefully will be familiar at least from a few boring computer science classes. The nice thing about a graph database is that it's very good at handling movement across relationships between things. With relational databases you might think, with the word "relation" in there, that they'd be good at handling relationships, but of course relation doesn't mean relationship; it means something in set theory, and actually relational databases are not terribly good at jumping across relationships. You have to set up foreign keys and you have to do joins, and if you do too many joins you can get in a mess. If you've ever modeled a graph structure, or a hierarchy (a special form of graph structure), in a relational database, you'll have had this experience: it's not straightforward. Relational databases aren't good at this. So graph databases come in and say: we can handle jumping around relationships left, right, and center; we make it easy to do, and we optimize to make it fast to do that kind of thing. Furthermore, we can come up with an interesting query language that is designed around querying graph structures. This kind of query here, this is Cypher, from Neo4j; it's all about saying, given a certain graph structure, let me use that structure to express a more complex query. You can do some very interesting graph-oriented queries in graph databases, things that would be very difficult to write in SQL, as well as being a pig in terms of performance.

So in many ways you can think of the two as having gone in opposite directions: aggregate-oriented databases take a lot of stuff that's scattered around and put it into bigger lumps, while graph-oriented databases break things apart into even smaller units and let you play with those smaller units more carefully. You can still model relationships in aggregate-oriented databases, just as you can in relational databases (you basically refer to IDs in different documents), but it's a lot more messy. So part of your decision as to whether a NoSQL database is going to be interesting to you is: how do you work with your data? Do you tend to work with the same aggregates all the time, which would lead you towards an aggregate-oriented approach? Do you want to really break things up and jump across lots and lots of relationships in a complex structure, which would lead you to a graph approach? Or is the tabular structure working well for you, in which case you want to stay with a relational approach? So NoSQL divides into those two categories, and all of these are schemaless: graph databases as well allow you to add any bits of data to any node, so you have all that flexibility, but with the same caution about implicit schemas. So that is kind of half of the picture, the data model part.
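A minimal sketch of that relationship-hopping, on a plain in-memory node-and-arc structure; the Cypher in the comment is illustrative, not the query from the slide:

```python
# Arcs out of each node: who knows whom.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

# "People two KNOWS-hops from alice": in SQL this is a self-join per hop;
# in a graph store it's just following arcs, roughly the Cypher
#   MATCH (a {name: 'alice'})-[:KNOWS*2]->(fof) RETURN DISTINCT fof
fof = {person for friend in graph["alice"] for person in graph[friend]}
print(fof)   # {'dave', 'erin'}
```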
Now I'm going to move on to another issue, which is consistency: effectively, dealing with lots of people trying to modify the same data at the same time. You've probably heard something like this: relational databases are ACID, they do the familiar ACID transactions that we all know and love (atomic, consistent, isolated, durable), and NoSQL databases don't do any of that kind of thing. And of course the NoSQL people will say: ah, we do BASE, which is an even more contrived and meaningless acronym, and I won't even attempt to tell you what it stands for because I can only remember it on Tuesdays. But basically what it boils down to is this: if you've got a single unit of information and you have to split it across several tables, what you don't want is to be caught in a position where you only get to write half the data and somebody else reads it, or you write half the data and somebody else takes the same order and writes a different half, and things get really messy. In that kind of situation you need a mechanism that effectively gives you atomic updates, and that's really what transactions are all about: atomic updates, so that you either succeed or fail, and nobody comes in in the middle and messes things up.

Now, when it comes to our nicely organized set of NoSQL databases, the first thing to point out is that graph databases do tend to follow ACID updates, which makes sense: they decompose the data even more than relational databases do, so they've got even more need to use transactions to wrap things together. So if anybody tells you that NoSQL databases don't do ACID, you now know an immediate rejoinder: ah, but graph databases do. Aggregate-oriented databases, on the other hand, actually don't need transactions as much, because the aggregate is a bigger, richer structure. In fact, if you read the Domain-Driven Design book, one of the things it points out is that aggregates in domain-driven design are transaction boundaries: you shouldn't let transactions cross aggregate boundaries, because if you do, it'll just be complicated to manage the concurrency of your system. So the domain-driven design community, from the beginning, even before NoSQL, said: keep your transactions within a single aggregate. That's effectively what you get in the world of aggregate-oriented databases: any single aggregate's update is going to be atomic, isolated, and consistent within itself. It's only when you update multiple documents in a document-oriented database that you have to worry about the fact that you haven't got ACID transactions, and that problem occurs much more rarely than you'd think. So that's the first line about ACID versus BASE: some NoSQL databases are fully ACID anyway, and the aggregate-oriented ones that aren't are ACID within the aggregate, which is what really matters.

But there's a bit more to thinking about consistency even than that, because even in a relational world, ACID transactions don't mean we get to be completely consistent and never have to worry about update anomalies. I'll walk you through what is hopefully a very familiar scenario to point this out, and also to illustrate how you deal with some of it. Imagine we have a typical multi-layered system: a person talking to a browser, the browser talking to a server, the server talking to a single database, and two people talking to the same data in the same database at the same time, although through different browsers and servers. Here's the basic little scenario. We begin with both people, left and right, taking the same piece of data with a GET request; essentially, they bring it up onto the browser screen. Now the human beings go: I need to make some changes to this. Eventually the guy on the left (I always get my left and right confused) says, OK, I've got my updated data, let's POST some changes; and shortly afterwards the guy on the right says, I've updated my data, now let's POST some changes. Now, if we just let that happen like that, without warning, we get a conflict. This is a write-write conflict: two people have updated the same piece of information, they weren't aware of each other's updates, and they've got themselves in trouble. ACID to the rescue, right? What do we do? Well, to prevent this conflict we wrap the entire interaction, from getting the data onto the screen to posting it back again, in a transaction. That way we make sure the database will ensure that we don't get a conflict: effectively one of them will be told, you've got to do this again, retrieve your data again. We don't get conflicts; problem solved. How many people do this on your production systems? Yeah. Occasionally you can get away with this; most of the time you can't. Why? Because holding a transaction open for that length of time, while you've got a user looking at and updating the data through the UI, is really going to suck the performance out of your system. And I want to stress: you can do this in some circumstances. If your performance needs are really very minor, and you've only got a handful of people using the system at once, you might be able to get away with this approach, and it is advantageous to do so, because a whole lot of problems go away if you do. But most systems can't afford to hold transactions open that long.
In fact, most people who write about building transactional systems like this will tell you never to do it: don't hold transactions open across a user interaction. What they say instead is to wrap the transaction around just that last bit, the update of the database. And that's a good thing, because it stops the collision where one half-done update mixes with another half-done update, where you get some tables updated over here and some different tables updated differently over there and the result is an inconsistent mess. But you still effectively get a conflict, because the two people made updates to the same piece of information without knowing the other person did. And this is what might typically happen even in an aggregate-oriented database, if you have to modify more than one aggregate: one person modifies the first aggregate and then moves on to the second, the other person goes the other way around, and as a result you can end up inconsistent between aggregates.

Now, if you've come across this, which you probably have, you probably also know how to solve it: you use a technique which, in one of my previous books, I referred to as an offline lock. The usual way of implementing this is to give each data record, or each aggregate at least, a version stamp. When you retrieve the data, you retrieve the version stamp along with it, and when you POST, you provide the version stamp of what you read from. For the first guy, everything works out fine and the version stamp gets incremented. When the second person tries to POST, they've still got the old version stamp, so you know something's up, and you can apply whatever conflict resolution approach you choose. You use the same basic technique when working with a NoSQL database. The nice thing is that you don't have to worry about transactions for this problem so much, because the aggregate gives you that natural unit of update: it is your transaction boundary. But once you cross aggregates, you've got to think about juggling version stamps and doing something of that kind. It's not really very different from what you have to do with a relational database, because offline locks force you to do this juggling with version stamps anyway. So yes, you don't get ACID transactions to the same degree that you do with a relational database, but the impact is not as great as some people think, because we actually have to deal with this stuff all the time anyway.
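A minimal sketch of that version-stamp scheme (the store shape and field names are made up for illustration):

```python
store = {"order:42": {"version": 1, "data": {"qty": 2}}}

def read(key):
    record = store[key]
    return record["version"], dict(record["data"])

def post(key, read_version, new_data):
    record = store[key]
    if record["version"] != read_version:
        raise ValueError("conflict: aggregate changed since you read it")
    record["data"] = new_data
    record["version"] += 1          # stamp increments on every update

v_left, _ = read("order:42")        # left-hand user reads
v_right, _ = read("order:42")       # right-hand user reads the same data
post("order:42", v_left, {"qty": 3})        # first POST succeeds, version -> 2
try:
    post("order:42", v_right, {"qty": 5})   # stale stamp: something's up
except ValueError as e:
    print(e)    # now apply whatever conflict resolution you choose
```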
Now, when we talk about consistency, I find it useful to think about two kinds of consistency. The consistency I've been talking about so far is what I call logical consistency; these issues occur whether you're running on a cluster of machines or on one single machine. When you start spreading data across multiple machines, though, that can introduce more problems. When it comes to distributing data, you can talk about it in two ways. One is sharding: taking one copy of the data and putting different pieces on different machines, so that each piece of data lives in only one place but you're using lots of machines. Sharding doesn't really change the picture very much: you still get the same logical consistency problems that you do on a single machine. They're exacerbated to some degree, but the basic problems are the same. Another common thing to do with clusters, however, is to replicate data: to put the same piece of data in lots of places. This can be advantageous in terms of performance, because now you've got more nodes handling the same set of requests, and it can also be very valuable in terms of resilience: if one of your nodes goes down, the other replicas can keep going. Hence people talk a lot about availability and resilience with these cluster-oriented approaches. However, as soon as you replicate data, a new class of consistency problem comes in. Again I'll illustrate with a simple example.

Here we have two people, myself and my co-author Pramod, and we both want to book a particular hotel room. We send in our booking requests, and we happen to be on different continents: Pramod's in India, I'm in the US. We send our requests to our local processing nodes. The nodes at this point need to communicate about what's going on, and the system as a whole needs to come to some kind of decision, essentially ensuring that one of us has to sleep on the streets (in this case, me). This is what happens 99.99-whatever percent of the time. However, let's take a variation on this example. We both want to book the hotel room, but now the communication line has gone down; the two nodes cannot communicate. We send in our requests. What happens? Well, there are two broad alternatives. Alternative one: the system says, our communication line's gone down, sorry, we can't take your hotel bookings at the moment, please try again later. The other alternative: the system says, yes, we'll accept your booking, thank you very much, because we're really reliable and up to date and all the rest of it, and then proceeds to double-book the hotel room. I'm not that friendly with Pramod; we may be good friends, but there are limits, and we may not want to share that hotel room. So basically what we're seeing is a choice: a choice between consistency, which means I'm not going to do anything if my communication line's down, and availability, which says yes, I'm going to keep going, but at the risk of introducing inconsistent behavior.

The vital thing to realize is that this is a choice, and it's a choice that can only be made by knowing the business rules, the domain rules, that you're working with. It may sound really awful to say we're going to double-book a hotel room, possibly with complete strangers; that would be bad. But actually, maybe the hotels have ways of dealing with this: maybe they keep a block of rooms available until the last moment for emergencies and can just use one of those, or maybe they just send an apologetic, groveling letter and some frequent-sleeper points to try to make me happy. There are various ways in business that people deal with inconsistencies as they crop up. Now, I'm not saying you should always go for availability over consistency, but what is true is that it's always a domain choice. It is the business people who have to decide what's more important: the risk of double-booking the last room in the hotel, or the fact that we have to bring down the site and say sorry, we can't accept any orders at the moment, which is kind of bad for business.
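A minimal sketch of those two alternatives; the policy flag and the bookkeeping are made up for illustration:

```python
def book_room(node_bookings, room, guest, partitioned, prefer_availability):
    """One node handles a booking while the link to its peer is down."""
    if partitioned and not prefer_availability:
        # Choose consistency: refuse to act without coordination.
        return "sorry, we can't take bookings right now"
    # Choose availability: accept locally; conflicts get reconciled later
    # by a business rule (a spare room, a groveling letter, some points...).
    node_bookings.setdefault(room, []).append(guest)
    return "booked"

us, india = {}, {}
print(book_room(us, "room 101", "martin", partitioned=True, prefer_availability=False))
print(book_room(us, "room 101", "martin", partitioned=True, prefer_availability=True))
print(book_room(india, "room 101", "pramod", partitioned=True, prefer_availability=True))
# Both "available" nodes said yes: room 101 is now double-booked, and the
# business, not the database, decides how to deal with it.
```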
This is one of the things that drove Dynamo: they wanted to make sure that the shopping cart was always available, that you can always put things in the shopping cart. Why is this? Because it's America, and what's the most important thing to do in America? Shopping. We must maintain our retail destiny; we must always be able to shop. And what happens? You come to checkout and go: why is this item in here twice? I'm sure I put the so-and-so in here. Ah, computers, they make mistakes; let me just fix it. The worst that can happen is that you actually send out the order, you get duplicate stuff, you ring up Amazon, they say sorry, sorry, sorry, and they take it all back. Much better than somebody not being able to shop for a few seconds. So the point is, it's a business choice.

This then ties into something you'll hear about endlessly whenever someone talks about this stuff, which is the CAP theorem. Who has heard of the CAP theorem? How many people understand the CAP theorem? Some of you. It's actually pretty straightforward, but it's usually described in a way that I don't think is terribly useful: there are these three concepts up here, and you get to pick any two. That is true, but I think it's easier to reformulate it. It's a bit clearer if you say: if you've got a system that can get a network partition, which basically means communication between different nodes in a cluster breaking down (and if you have a distributed system, by the way, you are going to get network partitions), then when a partition happens, you have a choice: do you want to be consistent, or do you want to be available? That's really what the CAP theorem boils down to. If you've got a single database running on a single server, there's not going to be a partition; you don't have to worry; you can be as available as that node is, and you're going to be consistent, you can maintain everything. But as soon as you have a distributed system, you have to make that choice.

But it isn't a single binary choice right across your system; you actually have a spectrum. You can trade off levels of consistency and availability. I'm not going to go into how; just trust me, you can. Furthermore, it can vary depending on the particular operation you want to do: certain operations can be highly consistent, while others are highly available. Any of the databases that do this kind of stuff will give you all the knobs and tweaks for it, and you're going to have to learn how to make those trade-offs. And actually, most of the time you aren't trading off consistency versus availability: it's not availability that's the issue, and it's not even network partitions that are the issue. A lot of the time what you're trading off is consistency versus response time, because the more consistency you want across a cluster of nodes, the more nodes have to get involved in the conversation. Think again of that hotel case: the two nodes had to communicate, and that slows down the response time. So you might say: even if the network is up, I'm going to let each node book its own hotel rooms and sort it out later, because even with the network up, that gets me the faster response time, rather than doing all the communication needed to get consistency. And again, that's a business decision. Another thing Amazon said was: we want people to always be able to shop fast (because what's the most important thing in America? shopping); therefore we want really rapid response times, and even if all the nodes are available and we could give a completely consistent answer, we want to be quick.
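The talk doesn't spell out the knobs, but one common knob in Dynamo-style stores is read and write quorum sizes; a minimal sketch, assuming N replicas and per-operation R and W:

```python
N = 3   # replicas holding each piece of data

def read_sees_latest_write(r, w, n=N):
    # If every read set must overlap every successful write set (R + W > N),
    # a read is guaranteed to include the latest value. Smaller R or W means
    # fewer nodes in the conversation, so faster responses but possibly stale
    # data: consistency traded against response time.
    return r + w > n

print(read_sees_latest_write(r=2, w=2))   # True: consistent, more coordination
print(read_sees_latest_write(r=1, w=1))   # False: fast, may be stale
```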
It also helps that merging shopping carts, dealing with the inconsistency of shopping carts, is relatively easy: oh, they asked for this over here and for that over there; well, clearly they want both, because this is America, everybody wants everything. Taking stuff out of shopping carts? Why would we want to encourage that? In fact, this is a broader trade-off in computing: it's really just another aspect of the general concurrency trade-off between safety and liveness, and if you've taken concurrency classes and heard people talk about that, this should seem fairly familiar couched in those terms.

Now, what I really wanted to do with this little segment on consistency was give you a feel for how consistency is different, particularly in the aggregate-oriented NoSQL world, as opposed to how you may have thought about consistency so far. There are a lot of topics I could have covered here that I just haven't got time for. The important thing to take away is that you have to think about consistency issues differently, essentially because you've got this different data model and the possibility of replicated data; in particular, you have to think about it in terms of this consistency-availability trade-off, and it's not up to just us as techies to make that decision. It is actually up to the way the business wants to work as to where we make these trade-offs. And if you want more, well, I'm going to tell you to buy my book anyway, so you know what to do.

OK, so in the last little segment I'm going to talk a bit about when and why you might want to use a NoSQL database. The way I think of it, there are two drivers that push us towards one. The first is the one I've already talked about as the real driver for the whole NoSQL movement itself: you've got to deal with large amounts of data. If you've got more data than you can comfortably or economically fit onto a single database server, you're going to have to deal with some pain. You can either take the pain of trying to run a relational database across a cluster, or you can go into this new NoSQL stuff, and most of the time I think I'd go for the NoSQL stuff, because running relational databases across clusters is still somewhat of a black art. So big amounts of data is a big issue. Now, some people have said (and one of the review comments on my book was) yeah, but only very few organizations have to worry about this stuff; if you're Google or Amazon, yes, pretty much everybody else, no. As I read that, what I heard in my head was "640K is enough for almost everybody". The reality is that there are tons of data coming at us, and every organization is going to be capturing and processing more and more of it, so this large-scale data problem is only going to grow. That is a factor, but it's actually not the main reason I think most people go to NoSQL. There was a survey I saw in the track on Monday that pointed out that most people actually aren't interested in big amounts of data with NoSQL databases; what they want is to be able to develop more easily.
A good example of this is some friends of mine who work on The Guardian newspaper and website. How many people have heard of The Guardian? A good English-language newspaper; many of you, good. They're dealing with articles: saving articles, updating articles, pushing articles back and forth. The article, for them, is a natural aggregate. Spreading an article's data and metadata across a relational database is a pain in the neck; it's awkward. But taking it as a single thing, a single article, and pushing it into the database is much more straightforward: the impedance mismatch problem is drastically reduced if you've got a natural aggregate. Many of the projects I've talked to at ThoughtWorks that have used a NoSQL database have gone that route. They've said: our data model doesn't really fit very well with relational; one of these NoSQL options is better. It might be a natural aggregate, in which case they've gone the aggregate-oriented route, or it might be something that feels much like a graph structure, in which case they go the graph database route. And that, I think, is the most common reason at the moment why people are using NoSQL databases: you effectively get rid of that impedance mismatch problem.

Now, of course, that raises a question: that was the promise of object databases too. They were going to get rid of the impedance mismatch problem, but they got clobbered because databases were being used for integration. Why is that same problem not hitting us now? Well, it is hitting us, but it's greatly reduced, because more and more people are saying: we don't want to integrate that way; we want to hide our databases inside a broader application or service, and then use some kind of service-oriented interaction between the two. That may be web services, or it can be something as really disgusting as SOAP on ESBs with God knows what thrown in, but the point is that applications are controlling access to their data. And if you're in a scenario where you can do that, where you can effectively encapsulate your database, then the integration issue becomes a lot less serious, and that, I think, is a very important enabler that makes it possible for NoSQL databases to thrive. This is good practice anyway: even if you've got relational databases, you do not want to be integrating through integration databases; they cause no end of trouble, believe me, if you haven't experienced it yourself. It's so much better to encapsulate like that, and if you do, you've got much more freedom in what database to use. I think that's going to be a very important driver towards all this.
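A minimal sketch of that encapsulation: other applications talk to a service interface, never to the database, so the store behind it is free to change (the names here are made up, not from The Guardian's system):

```python
class ArticleService:
    """The application controls all access to its data."""

    def __init__(self, store):
        self._store = store            # could be a document store client,
                                       # a relational gateway, anything

    def save(self, article):
        self._store[article["id"]] = article

    def find(self, article_id):
        return self._store.get(article_id)

service = ArticleService(store={})     # plain dict standing in for a database
service.save({"id": "a1", "headline": "An article is a natural aggregate"})
print(service.find("a1")["headline"])
```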
Another thing that's encouraging people to use these databases is analytics. We all know about data warehousing. The usual data warehousing project, as far as I can tell, is that a salesman turns up from one of the big companies and says: oh, you want to do data warehousing? Well, here's a project plan by which every piece of data you could possibly have in your organization is all put into one place so that everybody can get at it easily; it's a multi-year project with lots and lots of very diverse stakeholders. We know that story. Have people come across big data warehousing projects that they felt succeeded? There's usually one or two who are prepared to admit it, but most of them go badly. What we look for instead is a different approach that says: let's focus on one particular problem and work out how we grab the data for that. And the data, by the way, might not be in well-known relational or even NoSQL stores; it might be scattered around in log files, or in what truly runs most enterprises, which is Excel spreadsheets. Let's get at that data, poke it, and pull it together, and NoSQL databases play an important role in this. The graph databases allow you to easily do graph-like analytics on the data, which is really quite nice. The aggregate-oriented databases are generally less good at this, because they can't slice and dice so well, but what they can do is store large quantities of data, so if you are pulling stuff off devices or log files or the like, they become very attractive. And of course that's what's given a big advantage to Amazon, because they're able to mine all this information.

So with all of this, does this mean that NoSQL is the future of databases, that relational databases will disappear and we're all going to be doing NoSQL stuff? I don't think so. I think the future is something I refer to as polyglot persistence, and what that means is that there's going to be room for lots and lots of different kinds of databases, with relational databases still playing a big role. If you're building an application, maybe you'll use several different databases as part of it; certainly across an organization you'll use lots of databases. What you're doing is choosing the appropriate database for the nature of the problem you're working on, and because there are different natures of problems suited to different data stores, the idea that whatever your problem is, the answer is a relational database, will go away. Now, this is great; it gives us lots of opportunities for the future. But as every cynic knows, every opportunity is really a problem, and there are plenty of them. You've now got to think about this kind of stuff: you've got to decide what the appropriate NoSQL database is for a problem; you've got to deal with organizational issues (relational DBAs are not going to like this; in fact, for some people that's a big advantage, but let's not go down that route); NoSQL databases are immature, lacking the tools and the experience and the knowledge of how to work with them well that we've built up over twenty years of relational databases; and all of these consistency issues can still end up biting you.

So when it comes to what kind of project I'd dive in with, start with the drivers. If you want rapid time to market and fast cycle time, if ease of development is really important, and if you can get that with a NoSQL database, that's a reason to go with one. Similarly, if you've got a very data-intensive project, then obviously NoSQL's ability to deal with large amounts of data is very important. But I think there's another overriding question as well: is your project really important to the competitive advantage of your business, what I refer to as a strategic project? Because if it's a strategic project, then it's worth taking on the extra risk, the unknowns, of dealing with an immature and not-so-well-known technology, which is what NoSQL databases are. If, on the other hand, you've got what I call a utility project, something kind of straightforward that's not really vital to the business's operation, then that may not be the best place to bring in an unknown like this; in that kind of situation you're probably better off with the familiar, at least for a few years. But there are lots of strategic projects out there, and certainly our experience over the last two or three years at ThoughtWorks has been very positive with NoSQL databases.
I've heard remarkably few complaints, and ThoughtWorkers always complain about what they're working with, so I certainly am very much convinced that NoSQL databases have an important part to play in the spectrum of future development, and the rest of the talks in this track will explore different ways in which they've been used. So I hope you found that helpful. If you want more depth, the book is very thin: my target was 150 pages and I only missed it by two, so it's 152 pages, a quick overview, a bit more than what I just gave you, and I hope it will be handy. There's also a nosql page on my website where I collect together various other things that I've done or talked about in terms of NoSQL. And thank you for listening to me.
Info
Channel: GOTO Conferences
Views: 873,988
Rating: 4.9315138 out of 5
Keywords: Software Development, Martin Fowler, NoSQL (Software Genre), NoSQL Database, GOTO, GOTOcon, GOTO Conference, GOTO Conferences, Software (Industry), Software As A Service (Industry), Software Engineering (Industry), Software Development (Industry), Software Testing (Industry), Computer Science (Field Of Study), Programming Language (Software Genre), Database (Software Genre), Videos for Developers
Id: qI_g07C_Q5I
Length: 54min 51sec (3291 seconds)
Published: Tue Feb 19 2013