1. Why Move Away from Relational Databases?
Understanding the limits of relational database management systems
The downfalls of relational databases such as SQL Server, Oracle, or My SQL is that as more and more data becomes available and companies and organizations want to embark on big dat projects, they’re running into limits around using relational databases.
The first is Scalability. Many companies have projects that are in the gigabytes. This can be very expensive and complex and difficult. Also, some of these big data projects have different kinds of needs around data ingests or speed. Sometimes customers want real time ingests.
And there are other considerations around query ability, application of sophisticated processing like machine learning.
The Hadoop ecosystem is designed to solve a different set of data problems than those of relational databases. One of the core components of Hadoop is an alternate file system called HDFS or Hadoop File System.
Hadoop itself is actually not a database. It is an alternative file system with a processing library.
So, really when you think about bringing Hadoop in as a solution, it’s gonna be in addition to your existing RDBMS, not as a replacement for it.
So, Hadoop itself is most commonly implemented with something called HBase.
Now, so this is based on technology that was developed originally at Google to index the entire internet, which is called the GFS or the Google File System. What Google did about 10 years ago is they wrote a whitepaper on how they created this file system and the open-source community took the information from this whitepaper and made it was part of the basis of Hadoop. So if you hear GFS and HDFS, they are very, very similar implementations.
HBase is a NoSQL database that is very commonly used with Hadoop solutions. It is a wide column store. And what that means is it’s a database that consist of a key and then one to n number of values.
Introducing CAP (consistency, availability, partitioning)
To understand more about the use cases for Hadoop ecosystem we’re going to take a look at what’s called CAP theory or CAP theorem which is a way to understand the different categories and classifications of databases.
The first aspect of CAP theory is the idea of consistency. The concept is that there are certain database solutions that allow for very high data consistency. Another way to think about this is that the solution supports transactions. An example of transaction would be if you had two data modification operations and they were combined as a unit. So withdrawing money, for example, out of a savings account, and then adding that money into a checking account. You would want both of those changes to occur successfully, or neither. Otherwise our data would be inconsistent.
The second aspect of CAP theory is availability. Another way to say that is up-time. What this means is that you have the ability to make copies of the data so that if one copy goes down in one location, the data will still be available for some or all of your users.
The third aspect of CAP theory is portioning. Another way to think about that is scalability. What that means is that you can split your set of data across multiple processing locations or physical machines or virtual machines so that you can continue to grow the amount of data that you work with.
Traditional RDBMS systems are known for having consistency and availability, but have difficulties at the highest levels of partitioning. CAP theory says that the database systems can really only meet two of the three aspects of CAP theory.
This is where Hadoop comes into play because as I mentioned earlier the data that’s becoming available for businesses and other companies is growing larger and larger and larger, so partitioning and the complexities around partitioning and the expense is causing companies to look at database solutions that support that aspect, and maybe they don’t have a need for the other two aspects to be fully implemented.
Hadoop is designed for scalability. Hadoop is designed to run on commodity hardware. So cheap servers, really old servers, I see this very commonly. It is also designed for partitioning in that it makes three copies of the data by default and if any of the copies become bad because the hardware becomes bad or corrupt you can just pull that old hardware out and then put the new hardware in. The Hadoop file system will automatically manage that copy process. This goes to another property of Hadoop, which is flexibility, or availability. Because it runs on commodity hardware you can scale a Hadoop cluster nearly infinitely. And again, if you remember where HDFS file system came from, it makes sense. And by Google, originally to index the entire internet, so it’s infinitely scalable.
The number one user of Hadoop is Yahoo! The number two is Facebook. These companies obviously have huge datasets, and they are taking advantage of the cost saving that they get, scaling all of the data out on commodity hardware. And of course they also want high availability because it’s their business to be online and available all the time.
Understanding Big Data
What I find is that a lot of businesses have wrong information about Hadoop and they think of it as a replacement for a relational database. As an architect, I really haven’t encountered any customers for whom they don’t need some kind of relational database.
This is the world of Big Data projects and let me make some examples of this. This is data that will be batch-processed. So in other words, processed as a group rather than individually-queried. And it’s often a great fit for Hadoop.
2. What Is Hadoop?
Hadoop consists of two components, and oftentimes is deployed with other projects as well. What are those components? The first one is open-source data storage or HDFS which stands for Hadoop File System. The second one is a processing API which is called MapReduce. Most commonly in professional deployments Hadoop includes other projects or libraries, and these are many, many different libraries.
One of the libraries are HBase, Hive, and Pig. In addition to understanding the core components of Hadoop it’s important to understand what are called Hadoop Distributions.
The first set of distributions are 100% open source, and you’ll find those under the Apache Foundation. The core distribution is called Apache Hadoop and there are many, many different versions.
There are commercial versions that wrap around some version of the open source distribution and they will provide additional tooling and monitoring and management along with other libraries. The most popular of these are from companies Cloudera, Hortonworks, and MapR.
In addition to that, it’s quite common for businesses to use Hadoop clusters on the cloud. The cloud distribution that I use most often are from Amazon Web Services or from Microsoft with the Windows Azure HDInsight.
When you’re using a cloud distribution you can use an Amazon Distribution which implements the open source version of Hadoop, so Apache Hadoop on AWS with a particular version, or you can use a commercial version that is implemented on the AWS cloud such as MapR on AWS.
Examples of using Hadoop are as follows. One is customer churn analysis. It costs a lot more to gain a new customer rather than to keep a current one, so it’s in the best interest of many companies to collect as much information as possible. And also behavioral, what were the activities the customer was doing shortly before they left so that they can reduce the amount of customers that re leaving.
Hadoop Solutions make use of behavioral data so that companies can make better decisions.
Facebook is the largest known user of Hadoop or the largest public user of Hadoop. New York Times, Federal Reserve Board, IBM, and Orbitz Travel Company and there are literally hundreds of companies that are making use of Hadoop in augmenting their line of business data with behavioral data to make better decisions.
Understanding the difference between Hadoop and HBase
One of the confusing things about working with the Hadoop ecosystem is there are a tremendous number of parts and pieces, libraries, projects, terms, new words, phrases, it’s really easy to get core concepts misunderstood and one of the concepts that I actually didn’t understand the first, when I was working with Hadoop is Hadoop vs. HBase.
Hadoop core ecosystem consists of two parts. It consists of a new type of a file system, HDFS, and a processing framework, map reduce.
You can see that we’ve got representations of files and there are four files here and as I mentioned in a previous movie, each file by default is replicated three times in the Hadoop ecosystem on three different pieces of commodity hardware. So you can think of them as cheap servers.
I like to say that map reduce to Hadoop is kind of like C++ to object oriented programming. Map reduce is written in java, and customers who are working with Hadoop really don’t want to query or work with Hadoop at that level of abstraction. So one such solution is working with a library HBase. The way that this looks is a wide column store. So you can see on the right here you have a table with one ID column and then a data column. And there’s really no requirement for any particular values in the data column, that’s why it’s called wide column.
A lot of people think that HBase or query language for it, which is called Hive, is actually part of Hadoop. And although it often is in practical implementations, they are two separate things.
3. Understanding the Hadoop Core Components
Understanding Java Virtual Machines
Hadoop processes or execution activities run in separate JVMs. JVMs basically a process for executing Java bytecode in an executable program. It’s a little section of the program that runs.
Traditionally in database processing systems state is shared. The different Hadoop processes run in separate JVMs.
Exploring Hadoop Distributed File System (HDFS) and other file systems
The default file system is HDFS, which we talked about in a previous movie, accounts for a larger chunking of the data and is triple replicated by default.
The HDFS file system has two modes for implementation. Fully-distributed, which will give you the three copies or Pseudo-distributed which will use the HDFS File System but is designed for testing and will be implemented in a single node on a single machine.
As an alternative to HDFS you can run Hadoop with the Regular File System. This is called the Standalone mode. And it’s a great way, when you’re just first learning about the MapReduce Programming Paradigm. You’re reducing the complexity by just working with your Regular File System.
Alternatively, when you’re deploying Hadoop to production, particularly if you’re deploying on a public Cloud, it’s really common to use a file system that’s on that Cloud.
For example, in Amazon, the S3 File System or in Azure the BLOG storage. Which is similar to the Standalone mode in that you are not using HDFS, you’re using a Regular file system but choosing a Cloud based file system.
If you deploy Hadoop in Single node you’re going to use the Local file system and a single JVM for all the Hadoop processes.
If you deploy in Pseudo-distributed mode you’re going to use HDFS and the Java daemons are going to run all the processes on a single machine.
If you run in Fully-distributed mode you’re going to use HDFS, it’s going to be triple replicated and the daemons are going to run in various locations depending on where you choose to place them. So you can see in this particular drawing we’re in Fully-distributed mode. We have three separate physical servers. On each server we have various daemons and they’re represented in green. So you can see we’ve got Task Tracker on each one. And then we have a Job Tracker daemon that is actually implemented