Hi, I’m Jared Hillam.
If you’ve been accustomed to working with traditional SQL databases, hearing somebody talk about Hadoop can sound like a crazy mess. So in this video I’m going to try to simplify just 3 of the many differences between Hadoop and traditional SQL sources, which hopefully will provide context as to where each medium is used.
The first difference I want to point out is schema on write vs. schema on read. When we move data from SQL database A to SQL database B, we need to have some information on hand before we write to database B. For example, we have to know what the structure of database B is and how to adapt the data from database A to fit that structure. Additionally, we have to ensure that the data being transferred meets the data types that the database is expecting. If we attempt to load something that does not meet what database B is expecting, then it will spit out errors and reject the data. This is what we call schema on write. Hadoop, on the other hand, has a schema on read approach. When we write data into what’s called the Hadoop Distributed File System, we just bring it in without dictating any gatekeeping rules. Then when we want to read the data, we apply rules in the code that reads the data rather than preconfiguring the structure of the data ahead of time.

Now the concept of schema on write vs. schema on read has profound implications for how the data is stored in Hadoop vs. SQL, which leads us to our second difference. In SQL the data is stored in a logical form with interrelated tables and defined columns. In Hadoop, the data is a compressed file of text or any other data type. However, the moment data enters Hadoop, the file is replicated across multiple nodes in the Hadoop Distributed File System. So for example, let’s say we’re loading Twitter data, and we have a large Hadoop cluster of 1,000 servers. My Twitter data might be replicated across 60 of them, along with all the other profiles of Twitter users. Hadoop keeps track of where all the copies of my profile are. This seems like a waste of space, but it’s actually the secret sauce behind the massive scalability magic in Hadoop, and this leads us to our 3rd difference.
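To make that schema on read idea a bit more concrete, here’s a minimal Java sketch. It assumes a hypothetical tab-delimited file called tweets_raw.txt that was landed as-is, with no structure enforced when it was written; the field names and layout are made up for illustration and aren’t anything Hadoop dictates.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Schema-on-read illustration: the raw file was stored without any structure
// being enforced at load time; the "schema" (field order and types) lives
// entirely in this reading code. The file name and layout are hypothetical.
public class SchemaOnReadExample {
    record Tweet(String user, long timestamp, String text) {}

    public static void main(String[] args) throws IOException {
        List<String> rawLines = Files.readAllLines(Path.of("tweets_raw.txt"));
        for (String line : rawLines) {
            // Interpret each line only now, at read time: user \t epoch \t text
            String[] fields = line.split("\t", 3);
            if (fields.length < 3) continue;  // nothing rejected the row at load time,
                                              // so the reader tolerates malformed rows here
            Tweet t = new Tweet(fields[0], Long.parseLong(fields[1]), fields[2]);
            if (t.text().toLowerCase().contains("unhappy")) {
                System.out.println(t.user() + ": " + t.text());
            }
        }
    }
}
```

The point is that the schema lives in the reading code. If the layout of the raw feed changes, you change this reader, not a load-time contract that would otherwise reject the data.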
When thinking about Big Data solutions like Hadoop, think of something architected for an unlimited number of servers. So let’s stick with our earlier example and say we have one thousand servers in our Hadoop cluster. Now, imagine I’m searching
Twitter data, and I want to see all the sentences that have the word “unhappy”. The query in Hadoop will come in the form of a Java program, and that program defines the request and distributes the calculation of that search across all 60 replicated copies of my profile in HDFS. However, instead of each copy conducting the exact same search, the Java code will break apart the workload so that each server is working on just a portion of my Twitter history. As each copy of my Twitter data finishes its assigned segment of history, the answers are delivered to a reducer program on the same cluster, which is responsible for adding up all the tallies and producing a consolidated list.
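If you’re wondering what that kind of Java program actually looks like, here’s a rough sketch written against Hadoop’s standard MapReduce API: a mapper that flags tweets containing the word “unhappy” and a reducer that adds up the tallies into a consolidated list. The class names, the tweet file layout, and the input and output paths are assumptions for illustration, not part of any particular deployment.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example: find tweets containing "unhappy" and tally duplicates.
public class UnhappyGrep {

  public static class UnhappyMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Each mapper only sees the blocks of the file stored near it,
      // so the search is split across the cluster automatically.
      if (line.toString().toLowerCase().contains("unhappy")) {
        context.write(line, ONE);  // emit the matching sentence with a tally of 1
      }
    }
  }

  public static class TallyReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text sentence, Iterable<IntWritable> tallies, Context context)
        throws IOException, InterruptedException {
      // Consolidate: identical sentences found by different mappers are added up here.
      int sum = 0;
      for (IntWritable t : tallies) {
        sum += t.get();
      }
      context.write(sentence, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "unhappy search");
    job.setJarByClass(UnhappyGrep.class);
    job.setMapperClass(UnhappyMapper.class);
    job.setReducerClass(TallyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/tweets
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You’d typically package this into a jar and submit it with something like hadoop jar UnhappyGrep.jar UnhappyGrep /data/tweets /data/output; Hadoop then ships the code out to the nodes that hold the data blocks rather than pulling the data back to one machine.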
Now let’s say that 1 of those 60 servers breaks down or has some sort of issue while processing my request. The question now is, should I hold up the entire response to the user because I won’t have a complete data set yet? The answer to this question is one of the primary differentiators between Hadoop and SQL. Hadoop would say no; it would provide the user an immediate answer, and eventually it would have a consistent answer. SQL would say yes; we must have complete consistency across all the nodes before we release anything to the user, which is called a 2 phase commit. Now, neither approach is right or wrong, as both have an important role to play based on the type of data being used. The eventual consistency methodology in Hadoop is a far more realistic method of reading continuously updating feeds of unstructured data across 1,000s of servers, while the 2 phase commit methodology for SQL databases is well suited for managing and rolling up transactions so we’re sure we get the right answer.

However, with Hadoop we have a caveat. Because the query is literally a mapping and answer-consolidation program which propagates to a flexible number of servers, the sky is the limit on how creative we can get with that program. But as you would guess, this Java-based query also increases the complexity of talking to Hadoop. For this reason, you’ll find a lot of packaged distributions of Hadoop that provide some structure and focus to the Hadoop world. For example, Facebook created something called Hive that allowed their team members who didn’t know how to write Java code to write queries in standard SQL. Hadoop is so flexible that Facebook was able to build a program to essentially mimic SQL behavior on demand. This flexibility is also one of the reasons it’s so hard to nail down a single practice or even a “typical” deployment type.
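Since the phrase 2 phase commit can sound abstract, here’s a stripped-down Java sketch of the idea: ask every node whether it can commit, and only release the result if every single one says yes. The Participant interface and coordinator class are hypothetical names for this illustration, and real databases layer a lot of failure handling on top of this.

```java
import java.util.List;

// Simplified illustration of the two-phase commit idea described above.
// These names are made up for the sketch; they are not part of Hadoop
// or of any specific SQL engine.
interface Participant {
    boolean prepare(String txId);   // phase 1: can you commit this transaction?
    void commit(String txId);       // phase 2a: everyone voted yes, so commit
    void rollback(String txId);     // phase 2b: someone voted no, so undo
}

public class TwoPhaseCommitCoordinator {
    // Returns true only if every node votes "yes" and then commits;
    // otherwise every node rolls back and nothing is released to the user.
    public static boolean run(String txId, List<Participant> nodes) {
        for (Participant node : nodes) {
            if (!node.prepare(txId)) {              // phase 1: collect votes
                for (Participant n : nodes) n.rollback(txId);
                return false;                       // no partial answers, ever
            }
        }
        for (Participant node : nodes) node.commit(txId);  // phase 2: commit everywhere
        return true;
    }
}
```

Contrast that with the Hadoop behavior described above, where an answer can go back to the user before every node has reported in.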
The good news is that some of the leading tools for data management and analysis are now beginning to natively write MapReduce programs and provide pre-packaged schemas on read, so that organizations don’t have to hire expensive data scientists to get value from this powerful architecture. Intricity can help introduce you to some of these powerful platforms that allow for codeless access to your unstructured data. These are sources like your website’s log files, unstructured survey responses, or even using Hadoop as an Operational Data Store. I recommend you reach out to Intricity and talk with one of our specialists. We can help you lower the cost of entry into this dynamic way of consuming data, while bringing value to your current infrastructure.