Dropbox System Design | Google Drive System Design | System Design: File Sharing and Upload

Captions
Hi, my name is Narendra L, and in this session let's learn the system design for Dropbox, or Google Drive, or any file sharing and upload service. Here are the features we want to support in this design. A user should be able to upload and download the files in a sync folder, or upload and download a single file. Along with that, the user should be able to delete, update, and rewrite files. Another very important feature is the history of updates: the user wants to look at the history of updates to a file. For example, if he has a text file and has made a lot of updates to it, he wants to see what update he made one week ago, one day ago, or what the latest update is. These are the main features to support. We may also want to support API integration for our service, so that using an API you can upload and download files programmatically.

For now I'm not going to talk much about web clients. In Google Drive you can use the web interface to upload and download, but that simply increases the complexity of the design. Instead, I'm going to assume there is a sync client, an application you install on your desktop or mobile, which is always watching a particular folder (the sync folder) or a number of folders, monitoring changes to the files, and uploading them. That's the whole idea of today's system design for Dropbox, Drive, or any file sharing service.

Now let's talk about the scale of the service we are planning to design. I don't actually know Google Drive's numbers, which could be enormous, so I'm going to take the numbers I found for Dropbox. They have about 10+ million unique users, and about 100 million requests happen per day. There are a lot of writes and reads happening in the system: writes include updates and new file uploads, and reads are downloads or just viewing a file through their services.

Here is the system design diagram. You just saw the Dropbox system design diagram, and it is full of complex components. Before I explain each of these components individually, I want to explain what the core problem in file uploading and downloading services really is, because there are a lot of assumptions about designing Dropbox-like services. Whenever I ask people how they would design a Dropbox-like service, they say all you need is some cloud storage: you just upload the file and download the file whenever you want, simple. But that's not how it works; the problem is quite different. I'm going to explain exactly what the problem is and how we are going to solve it. To explain, consider this is you, and you have a file of about 20 MB.
It's an enormous text file with a lot of text content inside. For the first time, you don't have any service as such: you have a file and a cloud storage service, say Amazon S3 or something similar. One important point: in this session we are not going to talk about how to actually build the cloud file storage itself. Let's just use Amazon S3 or any similar service that keeps files in the cloud and scales like anything; we concentrate only on how to build the Dropbox-like service on top. We just put the file, blob, or anything into the cloud service, the files are saved, and whenever we want we can retrieve them back.

So now we have only three components: you, the file, and the cloud. The first time, you have this 20 MB file and you want to share or upload it. You take this file, upload it into the cloud, and now the 20 MB file is there. All good; you get a unique link or URL and you share it with your friends. The next time you want to update this file, two things might happen: you might have a local copy, or, if you don't, you download it back to your laptop and edit it. Once you edit it, you want to upload it back to the cloud so you can share it again or save it. So you upload it again. On the cloud side we can do one of two things: overwrite the existing file, or create one more copy of the file. Since we want to support history of updates, let's keep a copy: version 2 and version 1, each 20 MB, are both in the cloud.

Now observe carefully. The first time we uploaded the file to the cloud we used 20 MB of bandwidth, and the second time we used another 20 MB, so 40 MB of bandwidth in total, and in cloud storage we are consuming almost 40 MB. If I make one more update, I upload once again, use another 20 MB of bandwidth, and create one more 20 MB copy of the file, version 3.

Do you see the problem here? Let me explain. The bandwidth consumption for the three uploads, even if we made just a one-character correction, or even a single-bit change, is about 60 MB, and just to keep historical versions of the file we are consuming about 60 MB of storage. Cloud storage is costly: you pay dollars for every MB you save in the cloud. And what if this were a 2 GB file? How much time would it take to upload and download a 2 GB file every time? Far too much. With three versions of a 2 GB file we would easily consume about 6 GB of bandwidth and 6 GB of cloud storage.
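To make the cost concrete, here is a minimal sketch of this naive approach in Python, assuming the boto3 S3 client and hypothetical bucket and file names. Every version re-uploads and stores the entire file, so both bandwidth and storage grow linearly with the number of versions times the full file size.

```python
import os
import boto3  # AWS SDK for Python; assumes credentials are configured

s3 = boto3.client("s3")
BUCKET = "my-file-share"          # hypothetical bucket name
LOCAL_FILE = "notes.txt"          # hypothetical 20 MB text file

def upload_new_version(version: int) -> None:
    """Naive sync: ship the whole file for every version."""
    key = f"files/notes.txt/v{version}"   # keep one full copy per version
    s3.upload_file(LOCAL_FILE, BUCKET, key)

size_mb = os.path.getsize(LOCAL_FILE) / (1024 * 1024)
for v in range(1, 4):                     # three versions of the same file
    upload_new_version(v)

# 3 versions x full file size: ~60 MB of bandwidth and ~60 MB of storage
print(f"bandwidth: ~{3 * size_mb:.0f} MB, storage: ~{3 * size_mb:.0f} MB")
```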
Given this problem, let's look at the specific issues. The first one is upload concurrency. Say you try to automate the upload and download: consider a service that monitors changes to this file and syncs automatically. What it does is look at the file as a whole, and whenever you click a button to sync, the whole file is uploaded to the cloud. Could I write a multi-threaded or multi-process script to sync the file so that the operation is faster? Not really: because it's one whole file, I can't just write a multi-threaded script to upload a single file faster. That's the concurrency problem: you can't use concurrency to make the upload or download operation faster.

Now let's come to latency. If uploading takes, say, one second per 1 MB of file, then for 20 MB we have to wait twenty seconds to upload, and another twenty seconds to download. We can't optimize that, because the whole file is uploaded or the whole file is downloaded. Then there is bandwidth: even if I just edited one character in the file, I am uploading about 20 MB to the cloud. And there is the storage problem: just to keep the history of versions, I have to keep a whole extra copy of the file because we edited one character in this 20 MB file. This is clearly a bad idea, but it is what we initially think of when we want to design a file upload or sharing service.

So now let's sit back, relax, and design this in a different way where all of these problems are solved. Instead of thinking of the file as one whole entity, let's think differently: let's break this file into smaller portions. Take this 20 MB file and break it into, say, ten different pieces. We call each of these pieces a chunk, so there are chunks one, two, three, and so on up to ten, and each chunk is 2 MB in size.

Now let's revisit the whole scenario we just discussed. The very first time, I have a text file of 20 MB and I want to upload it. Obviously, for the very first upload to file storage I have to upload all of these chunks to the cloud; it takes 20 MB of bandwidth and 20 MB of storage. So version 1 of the file took 20 MB. All good. Now the fun part: I just updated one character in the file, in the position covered by chunk number five, and I want to sync the file to the cloud.
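Here is a minimal sketch of that chunking step, assuming fixed 2 MB chunks and SHA-256 as the chunk fingerprint (the session only says "a hash"; the exact algorithm is my assumption). Each chunk's hash acts as its unique identifier, which is what lets us detect that only chunk five changed.

```python
import hashlib

CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB per chunk

def chunk_file(path: str) -> list[dict]:
    """Split a file into fixed-size chunks and fingerprint each one."""
    chunks = []
    with open(path, "rb") as f:
        order = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append({
                "order": order,                            # position for re-stitching
                "hash": hashlib.sha256(data).hexdigest(),  # unique id for the chunk
                "size": len(data),
            })
            order += 1
    return chunks

# Comparing the hash lists of two versions tells us exactly which
# chunks changed, so only those need to be re-uploaded.
old = chunk_file("notes_v1.txt")   # hypothetical paths
new = chunk_file("notes_v2.txt")
changed = [c for c in new if c["hash"] not in {o["hash"] for o in old}]
print(f"{len(changed)} of {len(new)} chunks need uploading")
```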
What do I need to do? Instead of uploading the whole file to the cloud, I just need to sync this fifth chunk, because that is where the update happened. The chunk I'm syncing is only 2 MB, and on the cloud side, instead of keeping a whole new copy of the file, I just save this 2 MB chunk and mark it as version 2 of chunk five. How cool is that? Across both versions we consumed only about 22 MB of cloud storage, and about 22 MB of bandwidth as well.

And now the very first problem, the concurrency problem: how do we solve that? Consider I write a script, say in Python, which has five threads in it, and each thread can take a chunk and upload it. As I mentioned, at one second per 1 MB, uploading the whole file used to take about 20 seconds. How much time does it take now? With five threads running in parallel over the chunks, it takes only about 4 seconds. Where we used to spend 20 seconds, we now spend about 4; that's the optimization. Latency drops because we can ship the chunks to the cloud in parallel.

And for storage, instead of saving the whole file again, we only save the chunks: the very first version written to the cloud is the ten chunks of the file, the second version added only one chunk, and if the third update touches, say, these two particular chunks, I only need to sync those two chunks. That sync uses only 4 MB of bandwidth, and for the third version we only keep the two 2 MB chunks which were updated.

Now suppose you have one more mobile device where you want to keep syncing the same files. There, too, the bandwidth savings apply: in the earlier approach, syncing the file three times would have moved about 60 MB to your mobile device, but in this approach it consumes only about 26 MB, and on mobile as well you can use concurrency to download these chunks, so syncing is faster. This is the radical approach which helps us design a fast, next-generation file sharing and upload service.

Now you might be thinking: we somehow divided this file into chunks, but how do we track all of them? How do I know which chunk belongs to which file? I just explained it with one file; if we had 10 or 15 files, there would be chunks all over the cloud. To solve that problem, we need one more file, called the metadata file. This file contains the chunk information: we take a hash of each chunk and record all of the hashes, ten different hashes, along with their locations or some reference to each chunk. This metadata file can also be synced to the cloud, and downloaded whenever required. Basically, the metadata file gives you the index to all the chunks of a file, so we can download all of the chunks, even using concurrency so the download is faster, and stitch these chunks back together to get the complete file.
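A minimal sketch of that concurrent sync step, assuming Python's standard-library thread pool and a hypothetical upload_chunk function standing in for the real S3 call. It uploads only the changed chunks in parallel and builds a manifest, the metadata file, mapping each chunk's order and hash to the URL the cloud hands back.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def upload_chunk(chunk: dict) -> str:
    """Hypothetical cloud call; in practice this would be an S3 PUT."""
    # e.g. s3.put_object(Bucket=..., Key=chunk["hash"], Body=data)
    return f"https://cloud.example.com/chunks/{chunk['hash']}"

def sync(changed_chunks: list[dict], workers: int = 5) -> list[dict]:
    """Upload only the changed chunks, five at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        urls = list(pool.map(upload_chunk, changed_chunks))
    for chunk, url in zip(changed_chunks, urls):
        chunk["url"] = url
    return changed_chunks

# The metadata file: an index of every synced chunk, in stitch-back order.
# `changed` comes from the chunking sketch above.
manifest = {"file": "notes.txt", "version": 2, "chunks": sync(changed)}
print(json.dumps(manifest, indent=2))
```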
This is the very core and most important part you need to understand in designing any file upload or download service, and with this approach all of the problems we listed go away. One such service is HDFS: even HDFS deals with gigabytes to petabytes of files, and it does the same thing. It divides a file into 64 MB chunks and distributes them across different machines, so it can provide backup and easily keep duplicate copies of the file. In our case too, if we want to distribute data across different machines, it is much easier to work with small chunks of a big file. That distribution actually happens inside the cloud and is taken care of by Amazon S3, and it is much easier for S3 to handle small chunks than one very big file. These are the advantages of chunk-based operation instead of working with the whole file.

Now let's look at the very basics of the system design for a file uploading service like Dropbox or Drive. As I explained, you will have a client. The client can be an application installed on your laptop, computer, or mobile phone, or it could even be a web app you open in a browser; for now let's assume the client is installed on your phone or laptop. What are the basic components the client should have to provide a seamless flow and sync of files between devices? Basically, we have about four different components in the client itself. I have represented only one client here, but there may be a second client, a third client, and so on, and all of these clients belong to the same user; let's assume that and proceed with our design.

So we have four components in the client, and there is the folder containing the files, which the client is actively monitoring for changes and updates to the files under it. We also have a cloud storage service: for simplicity, consider Amazon S3 or any other blob store, which just saves the chunks of the files. Then there is the messaging service: a queuing or asynchronous messaging service, for which you can use RabbitMQ, ZeroMQ, or Kafka. Then there is a sync service, which I'll explain shortly, and a database plus a cache; I've represented just one DB here, and that's where our metadata will reside.

Now let me give a big picture of how it all works, and then zoom in. In the client we have a watcher, a chunker, an indexer, and an internal DB. The very first thing: this is the folder, and we have configured the client to keep monitoring this particular folder and the files in it for changes. At the very start the folder is empty, and the watcher is watching it. As soon as we put some files into this folder, the watcher gets notified that a couple of files have been added.
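As a sketch of the watcher component, here is a minimal version using the third-party watchdog package (my choice; the session doesn't name a library). It watches the sync folder and hands changed paths off to the chunker/indexer pipeline.

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

SYNC_FOLDER = "/home/user/Dropbox"   # hypothetical sync folder

class Watcher(FileSystemEventHandler):
    """Notifies the chunker/indexer whenever a watched file changes."""

    def on_created(self, event):
        if not event.is_directory:
            self.notify(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            self.notify(event.src_path)

    def notify(self, path: str):
        # Hand the changed path to the chunker (chunk_file from the
        # earlier sketch) and on to the indexer.
        print(f"change detected, chunking {path}")

observer = Observer()
observer.schedule(Watcher(), SYNC_FOLDER, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)        # keep the watcher process alive
except KeyboardInterrupt:
    observer.stop()
observer.join()
```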
What the watcher does is notify the chunker and the indexer that some changes have happened in the folder, and it also passes the paths of the changed files to them. The chunker, as the name indicates, breaks the file into multiple chunks and uploads them to the cloud, to S3. Before doing so, it computes the hash of each chunk, which acts as a unique identifier for the chunk, and then the chunk is uploaded. That hash, along with the URL we get back from the cloud service after uploading the chunk, is handed to the indexer. The indexer receives the URL and the hash of each chunk from the chunker, and updates that information in the internal DB against the file to which those hashes belong.

So far so good: we have uploaded the chunks of the file to the cloud, which means the complete file is effectively there, and the indexer has saved the hashes and the URLs of those chunks in the internal DB. Next, the indexer notifies the messaging service: here is a file we newly saw in the folder, we have broken it into these many chunks, and we have uploaded them; all of this information is passed to the messaging service, which is just a queue. The indexer sends the message into the queue, and that message is picked up by the sync service, the synchronization service.

Why do we need this? As I mentioned, there can be multiple clients belonging to the same user. As soon as we add a file into the folder, the same change should be replicated on the other devices; that's the main reason the indexer notifies the messaging service and the sync service that a modification, addition, or deletion of a file has happened. The sync service then updates the metadata in the metadata database. Usually this is MySQL, because we need the metadata to be consistent: the data about the chunks and the hashes should be stored in a consistent manner. So for now let's say we use MySQL, and all of this information is updated in the database. The sync service also sends messages back into one more queue sitting in the messaging service, and those messages are broadcast to the other clients, carrying the information about the file updates to every client belonging to that particular user.

Now these other clients learn that a file addition has happened in this folder. They obviously have the same components, and the indexer there fetches those chunks back from the cloud and rebuilds the file, syncing it onto their devices. So as soon as the messaging service delivers the messages to these clients, the indexer on each one fetches all of the chunks of the file, say the chunks of these two files, and recreates the files inside the folder, so an exactly similar copy of these folders and files is duplicated on the other devices: device one, device two, and device three.
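Stepping back to the indexer's internal DB, here is a minimal sketch using SQLite as a stand-in (the session just says "internal DB"; the table layout is my assumption). It records, per file, each chunk's order, hash, and cloud URL, which is everything a client needs to rebuild the file later.

```python
import sqlite3

db = sqlite3.connect("client_index.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        file_path   TEXT,
        chunk_order INTEGER,
        chunk_hash  TEXT,
        url         TEXT,
        version     INTEGER,
        PRIMARY KEY (file_path, chunk_order, version)
    )
""")

def index_chunks(file_path: str, version: int, chunks: list[dict]) -> None:
    """Record hash + URL for every chunk of a file version."""
    db.executemany(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?)",
        [(file_path, c["order"], c["hash"], c["url"], version) for c in chunks],
    )
    db.commit()

def chunks_for(file_path: str, version: int) -> list[tuple]:
    """Everything needed to download and stitch the file back together."""
    return db.execute(
        "SELECT chunk_order, chunk_hash, url FROM chunks "
        "WHERE file_path = ? AND version = ? ORDER BY chunk_order",
        (file_path, version),
    ).fetchall()
```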
Now the same copy of the files is replicated on all three devices; that's how it works. Next I'm going to explain each of these components in a little more depth.

Let's start with the messaging service. You might be thinking: why do we need a queue here? Whenever we see that some file updates have happened, why can't we just talk to the sync service synchronously? Why do we need an asynchronous messaging service? Let me answer that. Here is a more in-depth view of the messaging service. You can think of the whole thing as containing a number of queues: for simplicity, if there are n clients, there will be about n queues, plus a few more for different purposes. There are two types of queues in it: request queues and response queues.

Consider there are three clients, as I mentioned earlier: clients one, two, and three, and we have a file. Client one, which is watching the file, saw that some change happened, or that the file was newly added. This client uploads the chunks to the cloud service, and now it knows all the information about the chunks and the metadata. What it does is post all of that metadata information to the request queue of the messaging service. Why does this need to be asynchronous? Because these devices might be connected to the internet, or they might not. We can't rely on a synchronous call to send updates about the files or to sync them; a device might be connected, or it might not be. We need a mechanism where these updates are buffered, sent on to the server, and delivered asynchronously. That is the only reason we need messaging services and queues.

So client one posts the metadata information to the request queue, and it stays there. The request queue is connected to the sync service, and the message is read by the sync service. As I mentioned earlier, the sync service updates the information in the metadata store, the metadata DB and cache. It then broadcasts the update to all the other clients of the same user registered in our DB: whatever information it reads from the request queue, it broadcasts to the response queues. The same metadata information is sent to response queue one, response queue two, and response queue three, and sits inside those queues.

The queue is what lets us buffer these updates when clients are disconnected: the message stays in the queue and is never lost. If this were a synchronous service, the sync service would have to call each client directly, over HTTP or similar, and that doesn't work, because those clients might not be connected to the internet or to the server. Instead, we write all of the metadata information to the response queues, and whenever a client connects, the metadata is delivered.
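Here is a small in-memory sketch of that fan-out pattern, using Python's queue module to stand in for a real broker like RabbitMQ or Kafka. The point it illustrates is buffering: each device has its own response queue, so updates for an offline device simply wait there until it reconnects.

```python
import queue

request_q = queue.Queue()                     # clients -> sync service
response_qs = {d: queue.Queue() for d in     # sync service -> each device
               ("device1", "device2", "device3")}

def client_publish(device: str, metadata: dict) -> None:
    """A client posts chunk metadata after uploading to the cloud."""
    request_q.put({"from": device, "meta": metadata})

def sync_service_step() -> None:
    """Read one update, persist it, and fan out to all other devices."""
    msg = request_q.get()
    # ... update the metadata DB and cache here ...
    for device, q in response_qs.items():
        if device != msg["from"]:            # don't echo to the sender
            q.put(msg["meta"])               # buffered until device polls

client_publish("device1", {"file": "notes.txt", "changed_chunks": [5]})
sync_service_step()
# device2 comes online later and drains its queue; nothing was lost.
print(response_qs["device2"].get_nowait())
```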
As soon as these clients receive that metadata information, they know what to do with it: they know the URLs from which to fetch the chunks, and they update their respective files. If they don't have a file yet, they download the whole file and keep it; otherwise they just download the chunks which were updated from client one and patch the file.

Now let's learn what information is stored in the metadata, and look at the metadata database and its internal architecture; at Dropbox they call it Edgestore. So what do we want to store in the metadata database, and why do we need it? First, the metadata contains the information about the chunks and their hashes, and it contains the information about the file and its version, because every chunk is mapped to a particular version of the file. Along with that, it also contains information about the user, the workspace they are working in, and which client they are using.

Now, what kind of database should we use for the metadata store? We could use an RDBMS or NoSQL, but irrespective of that, what we importantly need is consistency, because multiple clients will be working on the same file. We definitely need the metadata to be stored consistently: even though we are not saving the actual file content in the DB, the metadata itself represents the chunks and the file, so it has to be consistent across different clients, and we need a DB which behaves in a consistent manner. The very good thing about an RDBMS is that consistency comes built in. You can go for NoSQL, but the problem there is eventual consistency, and that might mess up your files or chunks: you might wonder, okay, what happened to my file, where is my version, I edited this and it's gone, this part is there and that part is gone. All of these problems come up if you use NoSQL. If you still want to use NoSQL, you have to build a layer on top of it which gives you consistent writes; the attraction is that NoSQL scales well as you store more data. Nevertheless, the Dropbox guys chose MySQL and scaled it to their existing user base of about 10 to 15 million users and tens of millions of files' worth of metadata, and I'm going to explain how they did it; as I mentioned, it's called Edgestore, and its architecture is shown here.

Before learning how Edgestore works, we should look at the structure of the information we want to store, the actual metadata. It is basically a JSON document containing the chunk information: the chunk ID and the chunk order, because if we break the file into multiple pieces we also need to know the order in which the chunks should be stitched back to recover the original file. Then there is the object information, which has the version number, a flag for whether the object is a folder or a file, and obviously the file name, the extension, the size, and the path, plus a little information about the user and the device.
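Putting those fields together, one file's metadata record might look roughly like this (the field names come from the session, but the exact schema is my assumption):

```python
# Hypothetical metadata record for one file.
metadata_record = {
    "object": {
        "file_name": "notes.txt",
        "extension": "txt",
        "is_folder": False,
        "path": "/sync/notes.txt",
        "size_bytes": 20 * 1024 * 1024,
        "version": 2,
    },
    "chunks": [
        {"chunk_id": "<sha256-of-chunk>", "chunk_order": i,
         "url": f"https://cloud.example.com/chunks/{i}"}
        for i in range(10)
    ],
    "user": {"user_id": "u-12345", "workspace": "personal",
             "device": "laptop-1"},
}
```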
Now let's learn the internal architecture of the metadata service itself; it is not as simple as it looks. From the diagram you might think it is just a DB, perhaps with one more layer of caching or a separate caching component, but it isn't that easy. Let's go ahead and design it using MySQL. When we think of an RDBMS, it is difficult to scale: to scale we either have to shard, or set up master-slave replication, and so on, and those are all painful. What the Dropbox guys did is use sharding to distribute the data across multiple MySQL instances. They don't just have a few MySQL databases; they have thousands of databases, and not just the metadata but a lot of their other information is also stored in Edgestore.

What are the basic problems when you use multiple sharded databases directly? Suppose this system didn't exist and we just had the metadata sharded across multiple databases. For developers it is a significant burden to keep validating the schema and everything by hand; it's a difficult task. Next, whenever the databases are nearly full, we have to keep adding more databases to accommodate more metadata, and then we may need to rebalance or reshard all of the data, which is a real pain. And the real challenge is managing these machines to be always available, 24/7. These are genuinely difficult problems.

To get out of these problems, what the Dropbox guys actually did is build a wrapper around the sharded databases and expose APIs for the clients that talk to Edgestore, the metadata store. The clients are written in languages like Go or Python, and instead of interacting with the databases directly, a client interacts with the Edgestore wrapper, which provides an ORM, an object-relational mapper, so the sharding is abstracted away. The client just uses the ORM and interacts with Edgestore; how easy is that? There is an engine which transforms all of these ORM calls and then interacts with MySQL; everything happens internally.

Also, obviously, we don't want to hit the database every time we fetch the information about a file and its chunks; it's not a good idea to query the DB for every fetch, so we have a cache in between. Without this system we would have had a separate cache: first check the cache, and if the data is not there, go to the database and fetch it. Edgestore provides all of this out of the box: the developer doesn't need to check the cache and then query the DB, because the Edgestore wrapper itself figures out whether that particular information is already in the cache, and only if it is not there does it go to the engine, translate the ORM call, and fetch the queried information from the database. That is how convenient a proper wrapper over plain MySQL is.
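A tiny sketch of the two ideas that wrapper hides, cache-aside reads and hash-based shard routing, with in-memory dicts standing in for the cache and the MySQL shards (the real Edgestore internals aren't described in this talk, so this is illustrative only):

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for MySQL shards
cache = {}                                     # stand-in for a memcache tier

def shard_for(key: str) -> dict:
    """Route a key to a shard by hashing, so data spreads evenly."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % NUM_SHARDS]

def get_metadata(key: str):
    """Cache-aside read: cache first, then the owning shard."""
    if key in cache:
        return cache[key]                      # cache hit, no DB round trip
    value = shard_for(key).get(key)            # cache miss: query the shard
    if value is not None:
        cache[key] = value                     # populate cache for next time
    return value

def put_metadata(key: str, value) -> None:
    """Write to the owning shard and invalidate the stale cache entry."""
    shard_for(key)[key] = value
    cache.pop(key, None)
```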
Another advantage is that we never need to know about the underlying databases, and Edgestore by default provides transaction isolation between the different queries the clients make, so we don't need to worry about taking locks or managing transactions ourselves when writing something into the metadata store; that too comes out of the box, provided by the Edgestore wrapper and engine layer.

Now let's talk about how to serve this data at scale. As we know, we have more than 10 million users, and we'll have more than 100 million files, so how do we serve all of this data at scale, lightning fast? Considering we are using Amazon S3, which has a presence in nine-plus regions, the chunk data can be distributed using Amazon's own CDN. But we also need a strategy based on the user's home location: if the user is from, say, Singapore, we should place their files somewhere near that location. Amazon has a region in Singapore itself, so we can keep all of that user's chunk files in the Singapore data center, and whenever that user wants to sync, we can deliver it as fast as possible. That part is essentially solved for us by Amazon's services.

But what about the metadata? Without the metadata, no client can download anything; the metadata is the key information needed to download any file. Since we are the ones storing all this metadata in Edgestore on MySQL, we have to place this information close to the user as well. To do that, we either use CDN services like Akamai or Fastly, or we build our own CDN. If we build our own, first we need to find out where our users are: if there are more users from Asia, we obviously need to place a data center or some servers near Asia for the Asian users. How do we figure that out? We can use simple machine learning methods like k-means clustering, or similar algorithms, to find the groups or clusters of users in different regions, and place a server near each region. Dropbox actually did much the same: they clustered their users and placed CDN or edge servers near them, and those servers hold the metadata information, because without the metadata we can't download any of the files from Amazon S3.
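As a sketch of that placement step, here is k-means over user coordinates using scikit-learn (my library choice; note that real geo-clustering would have to handle longitude wrap-around, which plain k-means on latitude/longitude ignores). The cluster centers suggest where to put edge servers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (latitude, longitude) of user logins.
users = np.array([
    [1.35, 103.82], [1.29, 103.85],      # Singapore area
    [48.85, 2.35],  [52.52, 13.40],      # Western/Central Europe
    [37.77, -122.42], [40.71, -74.01],   # US coasts
])

# One cluster per planned edge-server region.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(users)

for center, count in zip(km.cluster_centers_, np.bincount(km.labels_)):
    print(f"candidate edge site near {center.round(2)} serving {count} users")
```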
Look at this first picture, where the metadata and the chunk data are both stored in the US region. If a user from Europe wants to download a file, it takes about 700 milliseconds. But if we place a metadata server somewhere in Europe itself, that user can get the metadata and then contact the actual Amazon S3 cloud within about 30 milliseconds. That's how we dramatically improve the latency, and that's the advantage of having a CDN.

So far we have learned how a file sharing and upload service works and its system design. Before I end this session, I want to give some ideas on how we can implement a search engine, a full-text search engine, for our service: we want to provide a search feature in the client application our users have installed on their laptop or mobile phone. How do we do that, and what are the different ways to achieve it? I just want to throw out some ideas for you to read about further, because there are a lot of articles on implementing search engines using machine learning and NLP, natural language processing.

First, we have a lot of files with us. We could do the processing in the client before uploading, but that is a very tedious process if we want to apply the various machine learning models there. What we can do instead is have a background process or scheduled job in our data warehouse or data centers, which looks at all of the files the users are uploading and, on a weekly basis or in batches, extracts the text information from them. If all the files our users upload were plain text files, it would be easy to extract all of the text and provide the search feature. But we know people upload not just text files but PDFs, images, and sometimes videos too, and if we really want to provide search, look at Dropbox or Evernote: they even provide text search inside images. How do we go about that kind of processing?

If it is a text file, we can easily extract all of the text and do natural language processing: apply techniques like tokenization, removing the stop words (articles and words like "the", "and", "or"), finding synonyms using WordNet, which is a lexical dictionary freely available on the internet, and then stemming and lemmatization to get to the root words, understand the actual context of a sentence, and give better search results.
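For that plain-text path, here is a minimal sketch of the pipeline using NLTK (my library choice; the session names the techniques, tokenization, stopword removal, WordNet synonyms, stemming and lemmatization, but no specific tools):

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required corpora.
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP = set(stopwords.words("english"))
lemmatize = WordNetLemmatizer().lemmatize

def index_terms(text: str) -> dict:
    """Tokenize, drop stopwords, lemmatize, and expand with synonyms."""
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    terms = {lemmatize(t) for t in tokens if t not in STOP}
    # Expand each term with WordNet synonyms so a search for "picture"
    # can also match documents that say "photo".
    synonyms = {name.lower()
                for t in terms
                for syn in wordnet.synsets(t)
                for name in syn.lemma_names()}
    return {"terms": terms, "synonyms": synonyms - terms}

print(index_terms("The photos were uploaded and shared with friends"))
```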
But if the file is a PDF, an image, or a video, how do we handle that? There are different strategies. We can parse the PDF directly, and if we can't, we can take a screenshot of each page and convert it into an image, so both PDFs and images end up in image form. Then we have to extract the textual information present inside those images. With a PDF we definitely expect a lot of textual information; with an image we might or might not have any, so first we should have a machine learning algorithm which can detect whether there is text in the image at all. We can use convolutional neural networks, CNNs, which are very good at image processing, and then leverage optical character recognition, OCR, to figure out the letters. In practice, if the image is tilted, the text is not always properly horizontally oriented; it could be tilted in any direction, so people these days use deep neural networks to rotate the image first, so that we can extract the text region and then feed it to OCR or a convolutional network to work out what textual information is there. We can then tag that particular position in the image, so when a user searches, we can show the image with the matching text highlighted, which is a way cooler feature to provide in your product. Basically, we have to keep watching the signals from user interactions and file changes, and keep re-running all of these processes I just discussed to keep the search indexes up to date.

I think I've covered most of the information needed to design a file sharing and upload service, and it's time to end this session. As usual, if you guys have any suggestions, please do send them to me, and if you want me to make a video on any particular system design, drop me a comment and I will pick one for next week. Thank you. As usual, please don't forget to subscribe to the channel; it really encourages me to keep producing more of these system design videos. Also, please tell your friends about the channel and get them to subscribe too. Thanks a lot.
Info
Channel: Tech Dummies Narendra L
Views: 164,233
Keywords: interview preparations, software interview preparation, developer interview questions, Facebook interview question, google interview question, system design, dropbox system design, google drive system design, file upload and sharing system design, file sharing and upload sysmte design, System design and analysis, system design mit, software architecture, architecture diagram
Id: U0xTu6E2CT8
Length: 45min 32sec (2732 seconds)
Published: Wed Dec 05 2018