Solr 4: The NoSQL Database

Captions
Let's get started. Just a little bit about myself: my name is Yonik Seeley. I created Solr back in 2004 while working for CNET Networks, and a couple of years later, at the beginning of 2006, CNET contributed Solr to the Apache Software Foundation, open-sourcing it.

I also wanted to take a second to explain why my slides are so plain — no borders or anything like that. I work out of New Jersey, which as you might know got slammed by Hurricane Sandy this last week, so I've been without power the whole week, writing these slides hooked up to a generator that was also hooked up to my fridge. No internet, of course, so no downloading of slide templates. Them's the breaks.

So what is NoSQL? For many people at this point it's something they know when they see it, but it's basically a whole class of non-traditional data stores that don't use SQL and don't give ACID guarantees, or at least not full ACID guarantees. In return, they've been designed around different trade-offs, so they can do things like provide better scalability. Something most of these NoSQL solutions have in common is that they're distributed, fault-tolerant architectures; the genesis of some of the NoSQL engines is really that traditional databases did not scale out that well.

This is a slide of what the search architectures of some of the earliest users of Solr looked like, and I wanted to make the case that even back then it looked like NoSQL, at least from the read side. How this worked: normally you'd have a Solr master, with something grabbing data from the database and shoving it into that master. Periodically there would be index replication out to the Solr searchers — the search tier — which sat behind a load balancer and was queried by the app tier to generate dynamic HTML; basically your web app. What people often did was shove all of the data their web apps would actually need into Solr, not just what was needed for search, so that their web apps stopped relying on the database altogether. They weren't using Solr just as a search index — doing the search, getting document IDs, and then going back to a database for more document data — they were using Solr for everything. So again, from the read side this looks a lot like a NoSQL store; from the update side, not so much.

Even back in 2008 we started thinking about what we wanted, and we knew we wanted more. We wanted automatic distributed indexing: the ability to send a document to Solr and have Solr figure out where that document should live. We'd had distributed search for a while, but it was really up to the user to figure out how to break up that big index into multiple small indexes; we didn't do it for them. We wanted high availability for writes: on the last slide you'll notice there's a single Solr master, and if that master goes down you lose the ability to write until you somehow create another master. We wanted near-real-time search: Lucene works on point-in-time snapshots of the index, and near-real-time search is just getting to the point where you can create new snapshots very rapidly — say, once a second. That didn't work well with the older architecture, because index replication sat in between and introduces a lot of latency.
We wanted real-time get: sometimes, when you want to use something as a data store, even near-real-time isn't good enough — you always want to get the latest version of a document, and you don't want to have to worry about index snapshots or any of that. And of course optimistic concurrency is something I always wanted, because it enables a whole class of applications where you do need some sort of locking.

The effort to add all of this distributed functionality was called SolrCloud, and we really started from the ground up, designing around the problems we wanted to solve. We didn't start from the CAP theorem and decide whether we wanted A or C or any of that. If you're not familiar with the CAP theorem: C stands for consistency, A is for availability, and P is for partition tolerance, and the saying goes "choose two." There's been some confusion around that, though. Cloudera actually had a great blog post on this, called "CAP Confusion" — just Google for "Cloudera CAP confusion" — and they basically made the case that you don't really get to choose: you have to choose P, partition tolerance, because there are multiple kinds of partitions. The one people normally think about is two groups of machines that are both up but can't talk to each other — the classic network partition — but they pointed out that a machine going down is also a form of partition, and that happens all the time. Networks aren't perfect either: sometimes you lose a message, and that's a form of partition too. So in the real world you have to take the P in the CAP theorem, and the real choice is how much you trade off availability against consistency.

What we ended up with was a CP system. I didn't think that's what we'd end up with when we started; I was looking at systems like Dynamo and Cassandra, and eventual consistency looked really cool, but I got to the point where I realized that eventual consistency just seems fundamentally incompatible with things like optimistic concurrency. So we had to make a choice, and we chose to go more the consistency route. That doesn't mean we sacrificed availability too much — I think we still do pretty well there. For instance, a shard has multiple replicas, and we don't lose availability of that shard until all of its replicas are gone. In a network partition, if there's a big group and a small group, one side is always going to get the quorum, and that part remains available for reads and writes; it's just the other part that becomes unavailable. Mark Miller has a SolrCloud talk tomorrow, and if you want a more in-depth look at the SolrCloud architecture I'd encourage you to attend that.

So Solr 4 was released a few weeks ago. It's been a long time in the making — years — and it's the first release that has all of this SolrCloud and distributed indexing functionality in it. I wanted to take this opportunity to reset how people view Solr and really make the pitch that it's now a complete NoSQL platform: a document-oriented NoSQL search platform. It's data-format agnostic — we don't force you to use any particular data format; you can use a whole bunch of them.
It's distributed, of course: if you have more documents than can fit on one machine, you can split things up into multiple shards, and distributed search will query over those shards and make it look like it all came from a single index. It's fault-tolerant and highly available, with no single points of failure. We support atomic updates — basically updating existing documents in the index — and optimistic concurrency, and then of course all the goodness you'd expect from Lucene: full-text search, hit highlighting, and tons of specialized queries that have been in Solr for some time plus some new ones — faceting, grouping, join, spatial search, function queries.

For those of you who may be new to Solr, I just wanted to put up a slide on how easy it is to get started. All you do is download the binary distribution, unzip it, cd into the example directory, and run "java -jar start.jar". The example directory is really the stock Solr server — I think there's actually an issue open to rename it to "server" instead of "example". Then if you go to the Solr URL once it's up and running, you'll see our spiffy new admin UI, which is miles ahead of the old one. I only have one slide of it here, but you should go fire up Solr, play around, and explore it — it's pretty cool.

Now that we have Solr up and running, we can start using it right away, and this is the simplest example of how to add and retrieve a document — sort of what you'd start with in any key-value store. We can just use curl to talk HTTP and send our JSON for a book to the update URL. For those of you familiar with Solr 3, you might notice that the URL doesn't have /update/json: one of the new changes in Solr 4 is a unified update handler, so all updates can go to /update and dispatching is done off the content type — which means the Content-Type HTTP header is important. Now that we've added our book, we can simply turn around and get it back by querying the /get URL with the ID of the book, and Solr has also added a _version_ field that's used for versioning.

If you're new to Solr, none of this is probably surprising. If you aren't new to Solr, you might think, "Hey, wait — don't you have to do a commit or something to make things visible?" That /get URL is actually the real-time get handler, so you always get the latest version of a document without having to worry about commits or anything like that. What's actually happening under the covers is that we now have a transaction log, and if you're asking for uncommitted data, the read is serviced out of the transaction log.
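Roughly, that quick start and the add-and-get round trip look like this (this assumes the stock example server on localhost; title and author are fields from the example schema, and the document values are just illustrative):

    # start the stock example server; the admin UI ends up at http://localhost:8983/solr/
    cd example
    java -jar start.jar

    # add a book -- note the Content-Type, since /update dispatches on it
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"book1", "title":"A Book About Search", "author":"J. Developer"}]'

    # get it straight back via real-time get, no commit required
    curl 'http://localhost:8983/solr/get?id=book1'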
Now that we've added our document, we can update it. Let's say we want to add two new fields: a publication year and an ISBN. We can just add those fields to the existing document using this new syntax. Also note that we don't need to add these fields to the schema first or predefine them in any way; instead we're using a feature called dynamic fields. It's configured so that any field name it hasn't seen before ending in _i is an integer, and any field ending in _s is a string, so you can support an effectively infinite number of fields and just use them on the fly. The second example, on the bottom, shows how we can increment an existing integer field with an atomic increment; for the category field we're adding a new value — category is multivalued, so if it already had values, this "add" appends a new one instead of overwriting the existing ones. Now let's say we got our ISBN a little wrong: we can use "set" to overwrite the previous value, and you can also use "set" with a null value to remove a field.

Just a word about "schemaless," since that term has been thrown around a lot recently. I wanted to make the point that there's really no such thing as truly schemaless for anything based on Lucene, because fields with the same name have to be indexed the same way or things will just blow up. One thing you could do — a close approximation, which Solr doesn't do, but you could — is try to guess the type of a new field based on its value, and my point is that this can be troublesome. Guessing gets it wrong sometimes: maybe you're adding a title field, but the title of the first book you add happens to look like a date, and if you guess "date," then the next time the title is something that isn't a date, it blows up. Numerics have problems too: if the first value you add to a numeric field is 0 and we guess it's an integer, then when you try to use 1.5 — oops, floating-point values are indexed differently — it's going to blow up. It's not even efficient: say we could guess "correctly" — somebody sends in a 1, so we call it an integer — but what if they later send a larger number that only fits in a long? If you want to guess, you can't really use integers and floats at all; you have to use longs and doubles just to avoid overflows later.

So that's why we use dynamic fields instead — we get the best of both worlds. When people say they want "schemaless," they normally want two things: they want to use whatever fields they want without predefining them all up front (they don't even know what all the fields will be), and they want each document to be able to have a different set of fields. We have both of those things. So even though we haven't traditionally advertised Solr as schemaless — it does have a schema, so it's kind of hard to call it that — it has the essential elements people are normally looking for. That's the point I'm trying to make.
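Here's a rough sketch of what those atomic updates look like on the wire; the dynamic field names (pubyear_i, isbn_s, copies_i) are just illustrative, and cat is the multivalued category field from the example schema:

    # add two new fields to the existing document (dynamic fields: *_i is an int, *_s a string)
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"book1", "pubyear_i":{"set":2000}, "isbn_s":{"set":"978-0-0000-0000-0"}}]'

    # atomically increment an integer field, and add another value to the multivalued cat field
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"book1", "copies_i":{"inc":1}, "cat":{"add":"fantasy"}}]'

    # fix a value with set, or remove a field entirely by setting it to null
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"book1", "isbn_s":{"set":null}}]'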
So, optimistic concurrency — pretty cool stuff. It's a conditional update based on version, and it normally requires a little more work on the client side. Step one is for the client to /get the document to obtain the latest version. Step two is to modify that document however you want, except leave the _version_ field alone, exactly as Solr returned it. Step three is to send the modified document back to Solr. If it succeeds, everything's good; but it can fail, and it will fail if somebody else updated the document while you had it checked out. If it does fail, you go back to step one, re-get the document, make your modifications, and try again.

You invoke the optimistic concurrency functionality just by specifying that _version_ field in any update, and this is a table of the update semantics for different version values. If the version is greater than 1, it's optimistic concurrency just as I described: the document's version in the index must match exactly what you specify. Then there are some special values: if the version is exactly 1, we only care that the document exists — we don't care what version it is. If the version is less than 0, we're saying the document must not exist. And if the version is 0, those are the old semantics: just overwrite if it exists.

Here's an optimistic concurrency example. We do a /get on book2 and get back its version along with it. We're modeling checking out a copy of a book, so locally we decrement copies-in and increment copies-out, and then we send it back to the update URL. Note that if you don't want to specify the version as a document field for some reason, you can also specify it on the update URL. When there's an error, this is what it looks like: we use HTTP 409, Conflict. In this example I've also used curl -i, which echoes back the response headers in addition to the response body, so you can verify that 409 is set as the status code in the header too. In your HTTP client you can either check the actual HTTP status code or parse the JSON body to see the status code there.

We've also simplified the delete syntax a little. The original motivation was partly to support the optimistic concurrency stuff — to give you a place to put that version — but we made it simpler as well. For a single delete by ID, you just give "delete" and the ID of the document; for multiple deletes by ID you can give an array of IDs; and delete-by-query hasn't changed — it's just shown for completeness.

Durable writes: Lucene has a commit that makes sure all of your updates prior to it are flushed to stable storage — flushed to disk. If you do a bunch of updates, don't do a commit, and kill -9 the JVM, you've lost everything since the last commit; that's how it works. But Solr now maintains its own transaction log, and we use it for a bunch of things. I've already mentioned servicing real-time get, but we can also recover from a crash: if you do a bunch of updates, don't do a commit, kill -9 the JVM, and then restart the Solr server, it will replay the transaction log to recover those updates and then do a commit, so no data is lost. That's just on a single node, though, and a single node can always go down — the disk can die and never come back. That's where distributed mode comes in: when you send a document to the cluster, it's forwarded to multiple nodes before the call returns to you, so even if one node goes down, never to return, we still have multiple copies of your data and we haven't lost anything.
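A sketch of the optimistic concurrency round trip and the new delete forms; the copies_in_i/copies_out_i fields and the _version_ value are made up for illustration:

    # 1. fetch the latest copy of the document, including its _version_
    curl 'http://localhost:8983/solr/get?id=book2'
    #    -> {"doc":{"id":"book2","copies_in_i":4,"copies_out_i":1,"_version_":1417100000000000000}}

    # 2-3. modify it locally, leave _version_ exactly as returned, and send it back;
    #      if someone else updated the document in the meantime, you get HTTP 409 Conflict
    curl -i http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"book2", "copies_in_i":3, "copies_out_i":2, "_version_":1417100000000000000}]'
    # special values: _version_=1 means "must exist", a negative value means "must not exist",
    # and 0 gives the old overwrite-if-exists behavior

    # simplified deletes: a single ID, an array of IDs, or a query
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '{"delete":"book1"}'
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '{"delete":["book1","book2"]}'
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '{"delete":{"query":"author:nobody"}}'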
We've also added soft commit, which is for near real time — you'll often see it abbreviated as NRT. What it does is decouple update visibility from update durability. Previously a commit did both: it made sure everything was flushed to stable storage, and it also opened a new index searcher so you could get a new view of your index. For near real time — for turning things around faster — a soft commit is only concerned with update visibility: it does whatever is necessary to give you a new view of the index, but it doesn't worry about the durability side. commitWithin now implies a soft commit, because that's the way most people used it, so we just made it the default.

If you look in solrconfig.xml now, we auto-commit by default — a hard auto-commit every 15 seconds with openSearcher set to false. Because of openSearcher=false, this has nothing to do with update visibility; it's only about update durability. We do it because we have to keep a transaction log of all the uncommitted data, and that can grow pretty big, so one way to keep it smaller is to do a commit for you every 15 seconds — that caps the amount of stuff we have to keep in the transaction log. There's also a bit of a RAM issue: for every record in the transaction log we keep a small in-memory pointer to that record, keyed by ID, for things like real-time get, and if you're indexing 50,000 documents a second that can really start adding up RAM-wise. So that's another reason we make sure a commit happens every once in a while.
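From the request side, a sketch of how visibility and durability get triggered; the solrconfig.xml default mentioned above is shown in the comment, and the document fields are illustrative:

    # explicit commits from the client: soft commit = visibility only, hard commit = durability too
    curl 'http://localhost:8983/solr/update?softCommit=true'
    curl 'http://localhost:8983/solr/update?commit=true'

    # or ask for visibility within N milliseconds of an add (commitWithin now implies a soft commit)
    curl 'http://localhost:8983/solr/update?commitWithin=1000' -H 'Content-Type: application/json' -d '
    [{"id":"book3", "title":"Yet Another Book"}]'

    # meanwhile the stock solrconfig.xml hard auto-commits every 15 seconds with
    #   <autoCommit><maxTime>15000</maxTime><openSearcher>false</openSearcher></autoCommit>
    # which bounds the transaction log without changing what searches can see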
All right, so let's switch gears to some of the new queries in Solr 4. I added a little too much material here and probably won't have time to get to it all, but I did arrange it so that some of the stuff near the end has been around for a while even though it's new in Solr 4, so if I don't get to it, that's okay.

There's new spatial support. This barely made it into 4.0 — it's very new. Geospatial search can now support multiple values per field, and you can index shapes other than points, so you can index actual polygons or circles. There's Well-Known Text (WKT) support — that's just the standard way of specifying polygons and such — and that comes via a JTS jar that you have to download and drop in yourself, because I believe it's LGPL; that's why we don't ship it by default.

For indexing, this slide gives an idea of what the values look like when you index these new types: the examples index a point, then a circle, then a polygon. For searching, it looks pretty different from the existing spatial stuff: you just do a filter query on the geo field — the field where you indexed the shapes — asking whether anything indexed in that field intersects a given shape, which is a rectangle in the first example and a polygon in the second.
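Roughly what that looks like, assuming a field named geo configured with the new spatial field type (plus the optional JTS jar if you want WKT polygons) — the shapes and coordinates are just examples:

    # indexing: a point, a circle, and a WKT polygon in the same spatial field
    curl http://localhost:8983/solr/update -H 'Content-Type: application/json' -d '
    [{"id":"place1", "geo":"43.17614,-90.57341"},
     {"id":"place2", "geo":"Circle(4.56,1.23 d=0.0710)"},
     {"id":"place3", "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"}]'

    # searching: filter to anything that intersects a bounding box, or a polygon
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=*:*' \
         --data-urlencode 'fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"'
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=*:*' \
         --data-urlencode 'fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"'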
We also have some new relevancy function queries. They're called relevancy functions because they expose the lower-level index statistics — for those of you who were in this morning's talk, the statistics Lucene uses for scoring — as function queries, so you can do some cool things with them; recall that in Solr you can do things like sort by a function. docfreq returns the number of documents that contain a term; I've marked it as a constant because it will be the same for every document in a given snapshot — it only depends on the term. termfreq tells you how many times a term appears in a document, so you could do a query and then sort by the number of times a term appears in a particular field. totaltermfreq is the number of times the term appears in the field across the whole index, and sumtotaltermfreq is the sum of that over all terms — the easiest way to think of it is the total number of tokens indexed for the field. Then idf, tf, and norm come from the Lucene similarity configured for the field; for example, tf is by default the square root of the term frequency. So you can experiment with these, together with function queries and boosting, and essentially create your own similarity — your own ranking function — on the fly. And maxdoc and numdocs are just the number of documents in the index.

We've added some rudimentary boolean and conditional functions. There are the constants true and false; an exists function that returns true if the field has a value; and an if function that yields its second argument when the first argument is true, and its third argument otherwise. So in this example, if my_field has a value it returns 100; otherwise it returns field2 plus field3. There's def, for default values: if a document has no value for the field, it returns the second argument, which can be another function or a field — it doesn't have to be a constant. And we have the other boolean functions you'd expect: not, and, or, xor.

Pseudo-fields are a cool new feature — you saw a glimpse of them in Erik Hatcher's presentation. Pseudo-fields are a way of returning other information in your search results alongside the stored fields of the documents, so they look just like field values on the documents you get back. We've added the ability to directly request functions: the field list (fl) parameter, where you normally just list the fields you're interested in, can now also list functions, and you get those values back as well. We've added field-name globs — you may not even know the full set of fields you're using, because of dynamic fields, so now you can specify patterns of field names you want back. We also allow multiple fl parameters, which is useful if you already have a request with a big fl parameter: instead of fetching it, modifying it, and putting it back, you can just add another fl parameter, which can also be more readable than one long list. And we can do aliasing, which is just renaming things. By default, if you request a function, the name of the pseudo-field that comes back is the function itself; that can be nice because you know exactly how the value was derived, but it may not be what you want if the function is really big. In the aliasing example we're asking for the loc field back but saying "name it location," and we're asking for the geodist result — how far this document is from the point we asked about — but naming it dist.

We also have augmenters. Augmenters also add pseudo-fields to documents alongside the stored fields, but they're more complex. We have a couple implemented: for example, [explain] adds the Lucene scoring explanation right in with the document, instead of correlating it by ID in a separate section of the response, and [shard] adds information about which shard the document came from in a distributed request.

So here's the pseudo-fields example. We're doing a query for "solr," and we're asking for a termfreq function — how many times does "apache" appear in the text field — and saying "put that in the apache_mentions field." The next line just shows how you can stuff a constant into the results coming back. The line after that, inStock,not(inStock), shows that if you use a function and don't give it an alias, the label is just the function itself — so in the response you see "not(inStock)": false. And the last part is an interesting thing you can do: the query function yields, as its result, the relevancy score for this document against another query. So we may be querying for "solr," but we're also asking for relevancy scores computed against a different query — in this case text:search.
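Pulling a few of those pieces together into one request — the field names (text, inStock, my_field, field2, field3, popularity) and the aliases are illustrative:

    # sort by how often a term occurs, and return a mix of pseudo-fields:
    # an aliased function, a field-name glob, a constant, an unaliased function, and [explain]
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=text:solr' \
         --data-urlencode "sort=termfreq(text,'apache') desc" \
         --data-urlencode "fl=id,*_i,apache_mentions:termfreq(text,'apache'),my_constant:10" \
         --data-urlencode 'fl=inStock,not(inStock),[explain]'

    # conditional and default-value functions, plus a relevancy score against a second query
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=text:solr' \
         --data-urlencode 'fl=id,if(exists(my_field),100,sum(field2,field3)),def(popularity,1)' \
         --data-urlencode 'fl=searchscore:query($qq)' --data-urlencode 'qq=text:search'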
Pseudo-join is something that was implemented quite a while ago but hasn't been in a released version, so it's new for Solr 4. It works essentially by translating one set of document IDs to another set, which means it's really only useful for filters. In this example we have blog objects and post objects, and they're related because the post objects have a blog_id field whose value matches the blog's id field — the blog_id field essentially points to the id field of its parent. The way you normally want to read these joins is from right to left. So if we wanted to restrict to blogs mentioning Netflix, we'd start with the right-hand side, body:netflix — querying for Netflix — and then follow from the blog_id field to the id field, and that gets us back to blog objects.

There are some interesting things you can do with this. If we wanted to show only posts from blogs started after 2010, we'd first query the blog objects — started:[2010 TO *] — and then follow from the id field to the blog_id field on the posts, which gives us a filter for post objects; then we run our query with that filter. The second case is even more interesting: it's essentially a self-join. We're saying that if any post in a blog mentions "embassy" — with our intelligence hat on — then we want to search all posts in that blog for "bomb." We search for embassy, find the posts that mention it, and then join from blog_id back to the same blog_id field, which selects all posts sharing the same blog_id. That's just the filter part; then we run our query. The last one shows how you can follow multiple relationships: if any blog post mentions embassy, search all emails for the same blog for bomb. This assumes we also have email objects — so it's blogs, posts, and emails. We start by searching the posts, follow the field trail to the blog objects — the blog object tells us the user's email ID — and then in the second join, again reading right to left, we follow that to the actual email objects; that's our filter, and then we run our query for bomb.
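A sketch of those join filters using the {!join} syntax — reading right to left, the query after the closing brace selects the "from" documents, and their from-field values are followed to the to-field; blog_id, id, body, and started are the field names from the example:

    # only posts from blogs started after 2010: match blogs, then follow id -> blog_id
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=body:hadoop' \
         --data-urlencode 'fq={!join from=id to=blog_id}started:[2010 TO *]'

    # self join: for any blog where some post mentions embassy, search all of that blog's posts for bomb
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=body:bomb' \
         --data-urlencode 'fq={!join from=blog_id to=blog_id}body:embassy'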
Cross-core join — I think Erik already went over this. It only works within a single Solr server; it's kind of specialized, and it's just a way to separate your objects into two different cores — one big and one small, or one that changes rapidly — and still do a join across them. It works just like a regular join.

Sometimes join might be the right thing to use and sometimes grouping might be; you can often imagine solving the same problem with either one, so I put together a table of some of the differences. Our pseudo-join implementation is currently O(number of terms in the join field), so it's not great when you're joining on unique IDs — it's kind of slow for that — but it works on multivalued fields, and grouping does not. As I said, join really only works for filters, because no information transfers across it — it's just mapping one ID set to another — but joins are chainable; you can use several in a row. Joins affect which documents match a request, so if you start with posts but want counts of blogs, you do the join and you've essentially changed units, and your facet results will have the right numbers. That used not to be the case for grouping: when grouping started out it was just a way of grouping the results, and it didn't affect which results matched the query for the purposes of faceting. That's changed a little for Solr 4.0 — you can now do something called post-group faceting, which uses the most relevant document in each group for faceting.

Pivot faceting finds the top constraints for a field, and then for each of those constraints it finds the top constraints for another field, and so on — you can nest it any number of levels deep. In this example we have facet.pivot=cat,inStock, so we're finding the top constraints for category, and then for each of those we find the top constraints for inStock. Of course inStock is just a boolean, so we get the full table in this case.

Per-segment single-valued faceting is a new faceting method — well, it's not so new; I actually implemented it a year and a half or two years ago, but it's new for Solr 4. It's more NRT-friendly, but it's unfriendly in general: you really only want to use this method if you're committing multiple times a second; otherwise it's currently slower. So how does this faceting work? Let's take a step back and talk about field caches and Lucene's segmented architecture. Normally, when you add a couple of documents to the index and do a commit, most segments are unchanged and Lucene just adds a small new segment. Solr's faceting normally builds and uses field cache elements at the top level of the index, so if anything changes, the whole field cache element has to be repopulated — and that's not good if you're committing very rapidly. The idea behind per-segment faceting is that if we use field cache entries at the segment level, like Lucene does for sorting, then we only have to build a small new field cache element when we get a small new segment, instead of rebuilding for the whole index. So that's what this method does. It's multi-threaded: we create an accumulator array for each segment of the index, we have a field cache entry for each segment, and we count up the counts to get our facet counts. The green box in the middle at the bottom, though, is the killer — it's the step that exists here but not in the faceting that uses a top-level field cache element, the non-NRT faceting. We need it because we now have multiple accumulator arrays and have to merge them somehow: the first slot in accumulator array 1 doesn't correspond to the first slot in accumulator array 2 — they aren't necessarily for the same word — so the only way to correlate them into a single count array is by essentially merging the field caches, and as we do that, we push things through another priority queue to find the top constraints. You can control the number of threads used for counting, but it doesn't really seem to make much of a difference, because counting isn't the slow part; the slow part is normally that merge of the field caches to correlate the counts and get global counts instead of per-segment counts. So in some testing I did a while ago, unless I was committing multiple times a second, this new method was slower than the existing faceting methods, and it has higher memory use because of the multiple accumulator arrays and because the per-segment Lucene field cache elements are larger — they don't share strings between them.
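For reference, the request side of those two faceting features — cat, inStock, and manu_exact are fields from the example schema, and facet.method=fcs is, as far as I recall, how you select the per-segment single-valued method (with an optional threads local parameter):

    # pivot faceting: top categories, and within each category the inStock breakdown
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=*:*' \
         --data-urlencode 'facet=true' --data-urlencode 'facet.pivot=cat,inStock'

    # per-segment single-valued faceting (the NRT-friendly method described above)
    curl http://localhost:8983/solr/query -G --data-urlencode 'q=*:*' \
         --data-urlencode 'facet=true' --data-urlencode 'facet.field={!threads=2}manu_exact' \
         --data-urlencode 'facet.method=fcs'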
And I think I actually finished on time — questions?

Yeah, so the question was: if one replica is not available when you're forwarding documents during an update, how do you reconcile that later? When a node comes back up, it puts itself into recovery mode, and the first thing it does in recovery is talk to the leader and do something called a peer sync: it essentially asks, "did I miss any updates?" and they compare to see whether it's too far out of date. If it only missed a few updates and came back quickly, it can just get those updates from whoever is the leader at that time and bring itself up to date before setting itself to the active state. If too much has happened in the meantime — a whole bunch of updates have come in and we're beyond the window that peer sync can fix — then we do whole-index replication. The way that works is that the node coming up starts buffering updates, does a full index replication from the leader, and then replays the buffered updates it received while it was replicating; at that point it's back in sync and becomes active.

So the question was about atomic updates and what happens at the Lucene level. There are some caveats I didn't mention: all source fields must be stored for atomic updates to work. Because Lucene doesn't currently support any notion of update, what happens behind the scenes is that Solr retrieves the document, makes the changes, and re-indexes it — and in order to retrieve the whole thing and re-index it, all source fields must be stored.

Yes, so the question was about the missing piece around shard resizing. What we're working on is shard splitting. Right now you have to decide up front how many shards you want to start with — how many pieces you want to split your index into. You can add and remove replicas for those easily, but you can't easily create more shards and have Solr automatically decide where documents go. You can do it if you're doing custom sharding — for example, time-based sharding, where you can insert new shards at any point — but then you're taking on the responsibility of deciding which documents go into which shards yourself; Solr can't do it via hashing and so on. So yes, index splitting is in progress; it's TBD — we just haven't done it yet.

One more. So the question was why we don't just add the type along with the field when we add a new field, instead of depending on naming conventions for dynamic fields. I think that is, at least in my head, on the roadmap as a way of adding new fields to the index, but it'll require a bit of work in cloud mode, because I think we'd have to keep it in ZooKeeper, and we'd probably have to do it optimistically too: we have to make sure two different nodes don't try to define the same field at the same time and end up with two different field definitions in the index — you have to make one of them fail. So it has to be coordinated on a cluster-wide basis. It's going to be a little bit of work, but we'll get there.

And I think we're out of time.
Info
Channel: LuceneSolrRevolution
Views: 38,512
Rating: 4.9053254 out of 5
Id: WYVM6Wz-XTw
Length: 48min 29sec (2909 seconds)
Published: Mon Feb 25 2013