Fundamentals of Software Version Control



Michael Lehman

1. Overview of Software Version Control

Overview of software version control

Version Control is the process of keeping track of your creative output as it evolves over the course of a project or product. It tracks what changed, who changed it, and why.

Understanding version control concepts

As you work, each new version is kept in a simple database-like system. A Version Control system is an application, or a set of applications, that manages creating a repository: the specialized database where your files and their history are stored. In some systems this is an actual database; in others it is a set of small change files, each identified by some identifier and managed by the software. The next concept is the working set. Your files and folders can be organized into projects within the repository, much as you manage files on your hard drive by keeping them in folders.

The files on your hard drive, from the point of view of the Version Control software, are your working set. You change the files in the working set and then update the repository to reflect those changes, automatically creating a backup and noting the purpose of the changes. That process is called an add when you are adding new files, and a commit or check-in for existing files. Once you have added or updated files in the repository, you can easily revert your working set to any version of the data in the repository.

If you’re working by yourself, you can grab an older version. If you are working in a team, you can grab files that have been updated by your team members; that process is called either updating or checking out. Finally, as we talked about before, you can mark the state of an entire project tree or an entire repository with an identifier, called either a tag or a label depending on the vendor, when significant events occur such as shipping a release or delivering it to a client.

2. Terminology 

Repository – the database where your changes are tracked;

Working set – your local files with potential changes not yet in the repository;

Add and check-in – ways of getting data into the repository;

Checkout and update – ways of getting data from the repository into your working set;

Revert or rollback – ways of updating your working set to match specific versions of the repository.

When sending changes from one repository to another, you can do it via two methods: one uses HTTP or SSH to communicate the information to a remote server, the other extracts data from a repository into a form that can be sent by email. Push sends the changes from your repository to the server’s repository, and pull gets the changes from the server and adds them to your local repository. As I said, this is most often done via a server listening on HTTP or via SSH.
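In Git, for example, push and pull look like the following sketch. The “server” here is just a bare repository on the local disk standing in for a remote host reached over HTTP or SSH; the paths and names are made up for illustration.

```shell
# A bare repository stands in for the server (in practice this would be
# a host reached over HTTP or SSH, such as GitHub)
git init --bare /tmp/server.git

# A local repository with one commit
git init -b main /tmp/work
cd /tmp/work
git config user.name "Me"
git config user.email "me@example.com"
echo "hello" > notes.txt
git add notes.txt
git commit -m "First commit"

# push: send local commits up to the server's repository
git remote add origin /tmp/server.git
git push origin main

# pull: bring down any changes the server has that we don't yet have
git pull origin main
```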


3. Distributed Version Control

Typically, you initialize the repository yourself, because the repository is on your local box. Similar to Centralized Version Control, you still have a working set, and you add things and update them. In addition, you can have an optional remote repository, which is a way of backing things up. Typically, people who use Git or Mercurial will use a hosted system like GitHub to back up their local repository into the cloud, and that’s done, as we mentioned just before in terminology, by using push and pull.

4. Version Control Concepts

Getting files in and out of a repository

The first thing you need to know to use Version Control is how to create either the entire repository, or the portion of the repository for the project you’re working on. With Distributed Version Control, you’ll create the repository yourself, at least on your local machine, and then you may synchronize it with another repository created on the server by an administrator. You’ll create the repository by using a command that’s sometimes called initialize and sometimes called create.
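In Git this is the init command; the directory name myproject is just an example:

```shell
# Create a new project directory and turn it into an empty repository
mkdir myproject
cd myproject
git init    # creates the hidden .git directory that holds the repository

# Mercurial spells the same operation: hg init
```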

And what this does is create an empty space for you to put your files and directories in, as we did in our sample before where we simply created myprog.c and checked it in. If your files already exist, meaning you’re adding an existing project to Version Control, you can use the Add feature of your Version Control software to bulk-add all the files and directories in your working set and then check them in. In Distributed Version Control, there’s usually a layer between your file system and the repository called the staging area. Earlier I ran git commit -a -m; the -a flag automatically stages changes to files the repository already tracks, so those edits are committed without a separate add step.
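A sketch of that flow in Git, reusing the myprog.c example (the repository setup lines are included just so the commands run on their own):

```shell
git init -q myproject && cd myproject
git config user.name "Me" && git config user.email "me@example.com"

# Bulk-add existing files into the staging area, then check them in
echo 'int main(void) { return 0; }' > myprog.c
git add .
git commit -m "Add existing project files"

# For later edits, -a stages modifications to already-tracked files so
# the change can be committed in one step (new files still need git add)
echo '/* updated */' >> myprog.c
git commit -a -m "Update myprog.c"
```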

Saving changes and tracking history

Once you know how to get your files in and out, the typical workflow when using any Version Control system looks like this. At time T1, there’s a file in your repository that contains A, B, and C. Let’s say you check it out, delete B, and add D. When you save your files back into the repository, you use a command called commit or check-in. When you tell the Version Control system to save your updated file, it will ask you for a short description of the changes you’ve made. This is often called a commit message, and it is a crucial part of the tracked history.

Reverting to a prior version


This is one of the places where Version Control systems can really help you out. So you’ve got your file in the repository at time T1, you would check it out or update it as necessary, and you begin to make some changes.

You delete feature B and add feature D and then you realize that’s not the way you want it. So with Version Control software, you can go back and say please overwrite the file in my working set with the most recently saved one, or even overwrite the file or the entire working set with the version from last Thursday. This is sometimes called reverting, and in some systems it’s called rollback. Now, the key thing is that you have to identify the specific version you want to roll back to. Each time you commit or check in, the Version Control Software creates a change set identifier.
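In Git, reverting the working set might look like this sketch, mirroring the A, B, C example above (file names and messages are made up):

```shell
git init -q demo && cd demo
git config user.name "Me" && git config user.email "me@example.com"

echo "A B C" > file1.txt
git add file1.txt && git commit -m "T1: features A, B, C"

# Delete B and add D in the working set, then think better of it
echo "A C D" > file1.txt

# Overwrite the working-set copy with the most recently committed version
git checkout -- file1.txt
cat file1.txt    # shows "A B C" again

# To roll back to an older version, name its change set identifier, e.g.:
#   git checkout <hash> -- file1.txt    (hashes are listed by "git log")
```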

Creating tags and labels


Sometimes you want to be able to identify the entire state of the repository at a particular point in time, such as a customer release or a major milestone of the project. Rather than writing down a cryptic change set identifier, Version Control systems allow you to supply a human readable name describing the current state of the whole repository or project, in our case here, V1.0. This is called tagging in some systems and labeling in others.
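In Git this looks like the following; the tag name V1.0 matches the example above, and the setup lines just give us a commit to tag:

```shell
git init -q demo && cd demo
git config user.name "Me" && git config user.email "me@example.com"
echo "release contents" > app.txt
git add app.txt && git commit -m "Ship release"

# Attach a human-readable name to the current state of the repository
git tag V1.0

git tag    # lists tags: V1.0
# Later, "git checkout V1.0" restores the working set to exactly that state
```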

Branching and merging


You start out with your main branch and your existing files, and you create a new branch, and now you have another set of identical files that you can make changes to. Because all branching means is that you’re taking the current state of the repository and creating a new copy.

In the new branch, as opposed to the main branch–which is also sometimes called the trunk–you can make any changes you want and they don’t affect the product that is already working. This is a perfect sandbox in which you can experiment and refactor code. You can delete old, unnecessary features and add brand-new ones. You really have full freedom to do anything you want.

In our example, a repository contains a file containing features A, B, and C. Once you branch you can see the main trunk still contains A, B, and C, and so does the branch.

Now, at T2 we may delete C and add feature D. And this doesn’t have anything to do with what we can do in the main branch, where we might, for example, add feature F. And back in the branch we might delete feature B now and add feature E. And this is all without changing our main production code.

Hand in hand with branching is the reverse process, called merging. This allows you to take changes that you’ve made in a branch and add them back into the main trunk or into another branch. So let’s take a look. We have our main branch, there is our main file, we’ll create a new branch, as we did before, feature 1, and we’ll take the copy and we’ll add feature D. And now back in the main branch, we’ll add feature F. And here in parallel, we’re going to add feature E in the feature 1 branch.

So as you do these changes, you might decide you like both sets of changes and want to combine them. In this case, we don’t have any conflicting changes because we didn’t delete anything, we just added. So now we ask our Version Control system to merge the two branches together, and now we end up with a new file 1, checked into the main branch with all the features from both the main branch and the feature branch.
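The whole branch-and-merge story can be sketched in Git like this. To keep the merge conflict-free, as in the example, the branches only add things, and each feature lives in its own file (all names here are invented):

```shell
git init -q -b main demo && cd demo
git config user.name "Me" && git config user.email "me@example.com"
echo "features A B C" > file1.txt
git add file1.txt && git commit -m "main: A, B, C"

# Branch: a new line of development starting from the current state
git checkout -b feature1
echo "feature D" > featureD.txt
git add featureD.txt && git commit -m "feature1: add D"
echo "feature E" > featureE.txt
git add featureE.txt && git commit -m "feature1: add E"

# Meanwhile, main gains feature F in parallel
git checkout main
echo "feature F" > featureF.txt
git add featureF.txt && git commit -m "main: add F"

# Merge: combine both sets of changes back into main
git merge feature1 -m "Merge feature1 into main"
ls    # file1.txt featureD.txt featureE.txt featureF.txt
```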

When there are conflicts, such as two changes in each branch that effectively did the same thing, for example, if we deleted feature B in both the main branch and the feature 1 branch, the Version Control Software sometimes can’t tell exactly what you meant to do and so therefore it declares a merge conflict and requests that you resolve this manually.







Asking Great Data Science Questions



Doug Rose


One of the most important parts of working in a data science team is discovering great questions. To ask great questions you have to understand critical thinking. Critical thinking is not about being hostile or disapproving. It’s about finding the critical questions.

These questions will help you shake out the bad assumptions and false conclusions that can keep your team from making real discoveries.

2. Apply Critical Thinking

Harness the power of questions

Your team needs to be comfortable in a world of uncertainty, arguments, questions, and reasoning. When you think about it, data science already gives you a lot of the information. You’ll have the reports that show buying trends. There will be terabytes of data on product ratings. Your team needs to take this information and ask the interesting questions that create valuable insights.

A good question will challenge your thinking. It’s not easily dismissed or ignored. It forces you to unravel what you already neatly understood. It requires a lot more work than just listening passively.

Panning for gold

The critical in critical thinking is about finding the critical questions. These are the questions that might chip away at the foundation of the idea.

It’s about your ability to pick apart the conclusions that are part of an accepted belief.

Here’s an example:

At the end of the month, the data analysts ran a report that showed a 10% increase in sales. It’s very easy here to make a quick judgment. The lower prices encouraged more people to buy shoes. The higher shoe sales made up for the discounted prices. It looks like the promotions worked. More people bought shoes and the company earned greater revenue. Many teams would leave it at that. Your data science team would want to apply their critical thinking. Remember, it’s not about good or bad, it’s about asking the critical questions. How do we know that the jump in revenue was related to the promotion? Maybe if you hadn’t had the promotion, the same number of people would have bought shoes.

What data would show a strong connection between the promotion and sales? You can even ask the essential questions. Do your promotions work? Would the same number of people have bought shoes? Everyone assumes that promotions work. That’s why many companies have them. Does that mean that they work for your website? These questions open up a whole new area for the research lead. When you just accept that promotions work, everything is easy: these worked, so let’s do more promotions. Instead, the research lead has to go in a different direction.

How do we show that these promotions work? Should we look at the revenue from the one-day event? Did customers buy things that were on sale? Was it simply a matter of bringing more people into the website? This technique is often called panning for gold. It’s a reference to an early mining technique. It’s when miners would sift through sand looking for gold. The sand here are all the questions that your team asks. The research lead works with the team to find the gold nuggets that are worth exploring. The point of panning for gold is that you’ll have a lot of wasted material.

There’ll be a lot of sand for every nugget of gold. It takes patience to sift through that many questions. Don’t be afraid to ask big whys.

Focus on reasoning

That’s why a key part of critical thinking is understanding the reasoning behind these ideas. Reasoning is your beliefs, evidence, experience, and values that support conclusions about the data. It’s important to always keep track of everyone’s reasoning when working on a data science team.

Everyone on your team should question their own ideas. Everyone should come up with interesting questions and explore the weak points in their own arguments.

A University of California physicist named Richard Muller spent years arguing against global climate change. Much of his work was funded by the gas and oil industry.

Later, his own research found very strong evidence of global temperature increases. He concluded that he was wrong and that humans are to blame for climate change. Muller saw the facts against him were too strong to ignore, so he changed his mind. He didn’t do it in a quiet way or seem ashamed. Instead he wrote a long op-ed piece in the New York Times that outlined his initial arguments and why his new findings showed that he was wrong. That’s how you should apply critical thinking in your data science team.

3. Encourage Questions

Run question meetings

This is sometimes called a question-first approach. These meetings are about creating the maximum number of questions.

Identify question types

If you run an effective question meeting, then you’ll likely get a lot of good questions. That’s a good thing. Remember, you want your team to be panning for gold. They should be going through dozens of questions before they find a few that they want to explore. The more ideas you can expose, the better. Then you can decide which ones are best to explore. Just like the early miners who panned for gold, you want to be able to sort out the gold from the sand. You want to know how to separate good questions from those that you can leave behind.

You don’t want your team asking too many open questions, it’ll make everyone spend too much time questioning, and not enough time sorting through the data. On the flip side, you don’t want the team asking too many closed ended questions. Then the team will spend too much time asking smaller easier questions, without looking at the big picture. Once you’ve identified whether your question is open or closed, you’ll want to figure out if it’s essential. When it’s essential it gets to the essence of an assumption, idea, or challenge.

Organize your questions

Below them, you can use yellow notes for non-essential questions. Remember that these are questions that address smaller issues. They’re usually closed questions with a quicker answer. Finally, you can use white or purple stickies for results. These are little data points that the team discovered which might help address the question. There are several benefits to having a question wall: it will help your team stay organized and even prioritize their highest-value questions.

Create question trees

Remember, data science uses the scientific method to explore your data. That means that most of your data science will be empirical. Your team will ask a few questions, then gather the data; then they’ll react to the data and ask a series of new questions.

When you use a question tree it will reflect how the team has learned. At the same time, it will show the rest of the organization your progress.

Find new questions

You want to focus your questions on six key areas. The six key areas are questions that clarify key terms, root out assumptions, find errors, see other causes, uncover misleading statistics, and highlight missing data. If you discuss these six areas, then you’re bound to come up with at least a few questions.

4. Challenge the Team

Clarify key terms

You need to carefully look at the reasoning behind your ideas and then question it. That way you’ll have a better understanding of everyone’s ideas.

Root out assumptions

Remember that correlation doesn’t necessarily mean causation. The key is to focus on identifying where they are. An assumption that’s accepted as fact might cause a chain reaction of flawed reasoning. Also keep in mind that an assumption isn’t just an error to be corrected. It’s more like an avenue to explore.

Find errors

There are key phrases that you might want to clarify. There’s also assumptions which might connect incorrect reasoning to false conclusions. Once you peel back these assumptions and clarify the language, you should be left with the bare reasoning. In many ways, now you’re asking more difficult questions. 

Challenge evidence

In fact, your data science team might be one of the only groups in the organization that’s interested in questioning well established facts. When you’re in a data science team, each time you encounter a fact, you should start with three questions. Should we believe it? Is there evidence to support it? How good is the evidence? Evidence is well established data that you can use to prove a larger fact. Still, you shouldn’t just think of evidence as proving or disproving the facts. Instead, try to think of the evidence as being stronger or weaker.

The important thing to remember is that facts are not always chiseled in marble. Facts can change as the evidence gets stronger or weaker. When you’re working in a data science team, don’t be afraid to question the evidence. Often it will be a great source of new insights.

See the other causes

It’s easy to say that correlation doesn’t imply causation; it’s not always easy to see it in practice. Often you see cause and effect and there’s no reason to question how they relate. Sometimes it’s difficult to see that an outcome that happens after something is different from an outcome that happens because of it.

If they don’t make sense then you should investigate the connection. Some of your best questions might come from eliminating these rival causes and finding an actual cause.

Uncover misleading statistics

When you’re in a question meeting, your team should closely evaluate statistical data. They should question the data and be skeptical of statistics from outside the team. Say a person on your data science team suggests that as many as half of your customers run with their friends. The best way to sort this out is to separate the statistic from the story. With the running shoe website, you had two stories. One says that customers like their friends to save money. The other says that customers run with their friends.

Highlight missing information

The first thing is to try and understand the reason that information is missing. Maybe there was no time or limited space in their report.

5. Avoid Obstacles

Overcome question bias

Questions are at the heart of getting insights from your data. It takes courage to ask a good question.









Database Fundamentals: Core Concepts



Adam Wilbert

1. Understanding Database Storage Models

What are databases?

Databases are at the core of our modern technology, and it’s important to understand exactly what they are and the benefits that they bring to organizing a world of information. A database is a computer file that follows a specific structure and rules in order to allow the input, organization, and, most importantly, retrieval of data very quickly. It does this by organizing data into tables that can be sorted and filtered in very flexible ways.

So a database is just a structured collection of information.

The database management system, or DBMS, performs three very important tasks.

First, it helps us create the structural rules that our data will adhere to. This helps keep the data organized and consistent, and provides predictable results when it comes time to retrieve the information that we previously stored.

Second, the DBMS helps load data into the framework we’ve already established. Things like writing data to the tables and ensuring that it conforms to the established rules as well as helping users sort through the data to find trends or produce reports.

Finally, the Database Management System provides additional support for the database such as tracking users and providing log-in credentials, performing maintenance and routing backups, as well as a host of additional administrative tasks that protect and secure your data.

Once the database is created and we’ve started placing data into the established structure, we’ll need ways of getting the information back out.

The DBMS accomplishes this through queries, which allow you to sort, filter, organize, and summarize your data in nearly endless ways.

Beyond just a collection of data, the DBMS supports a highly structured and efficient storage mechanism that allows you to enter, organize, protect, and retrieve information.

Understanding relational databases

By far, the most common type of database format follows the relational model.

The relational database builds on the organizational principles of the flat file system and the connected nature of the hierarchical system, but adds the ability to connect multiple tables together without restriction on the number of parent and child relationships.

The main idea behind a relational database is that your data gets broken down into common themes, with one table dedicated to describing the records of each theme.

This means that a single wide data table, one with lots and lots of columns, will become several smaller tables with fewer columns in each one.

By using unique identifiers for each record, we can relate one table to another. These identifiers are called key fields, and they are the glue that holds the entire system together.

A properly configured relational database can be a treasure trove of useful information that can be used to help guide business decisions or gain a better understanding of a complex system.

Exploring database fundamentals

Let’s talk about referential integrity. We know that key values are used to tie multiple related tables together. Referential integrity means that if the database is expecting a relationship on a particular field, then the corresponding value must already exist in the parent table before it will allow a change to the child.

In other words, you don’t ever want to be in a situation where a customer ordered a product that doesn’t actually exist in your inventory.

Or, a pay raise is given to an employee ID that was never issued to an actual person.

Referential integrity will protect you from these types of phantom connections by checking the existence of the item you’re referring to.
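Here is a small sketch of referential integrity in action, using the SQLite command-line shell as a stand-in for any relational DBMS (the table names and /tmp path are invented; note SQLite requires foreign keys to be switched on per connection):

```shell
sqlite3 /tmp/shop.db <<'SQL'
PRAGMA foreign_keys = ON;
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id)
);
INSERT INTO products VALUES (1, 'Running shoes');
INSERT INTO orders   VALUES (100, 1);   -- accepted: product 1 exists
INSERT INTO orders   VALUES (101, 99);  -- rejected: no product 99
SQL
```

The last insert fails with a FOREIGN KEY constraint error: exactly the protection against phantom connections described above.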

SQL is the standard language that relational databases use in order to create the data structures, enter and update data, and write queries to ask questions of the data set.

Though the core SQL language is part of the American National Standards Institute (ANSI) standard, each DBMS vendor applies their own tweaks and enhancements to the base language in order to distinguish their product from everyone else’s.

In the case of Microsoft SQL Server, the particular flavor or dialect, is called Transact-SQL or T-SQL.

2. Building Database Servers

Understanding the role of the server

When working with databases, the term “server” gets thrown around quite a bit, so it’s important to understand what it means. A database server can be either a dedicated machine or a virtualized machine that is running the database management software. You’ll hear this referred to as an instance of the server, and multiple instances, that is, multiple separate installations, can run on a single machine at the same time. That’s because when installing the server software, you give each instance a unique name so they can function alongside other instances without getting all tangled up.

3. Understanding Data Definition Language (DDL)

Using DDL statements to create database objects

Data Definition Language or DDL is used to define data structures in SQL Server. These statements create and manipulate database objects and use the keywords: USE, CREATE, ALTER, DROP, TRUNCATE, and DELETE.
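A hedged sketch of these statements, run through the SQLite shell rather than SQL Server (USE and TRUNCATE are left as comments because they are T-SQL statements SQLite doesn’t implement; the table and /tmp path are invented):

```shell
sqlite3 /tmp/demo.db <<'SQL'
-- USE SalesDb;                  -- T-SQL: switch the active database
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
ALTER TABLE customers ADD COLUMN email TEXT;
DELETE FROM customers;           -- removes rows, keeps the table
-- TRUNCATE TABLE customers;     -- T-SQL: faster removal of all rows
DROP TABLE customers;            -- removes the table itself
SQL
```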




Notes From Hadoop Fundamentals


Hadoop Fundamentals

Lynn Langit

1. Why Move Away from Relational Databases?

Understanding the limits of relational database management systems

The downfall of relational databases such as SQL Server, Oracle, or MySQL is that as more and more data becomes available, and companies and organizations want to embark on big data projects, they run into limits around using relational databases.

The first is scalability. Many companies have projects that are in the gigabytes, and scaling a relational database to that size can be very expensive, complex, and difficult. Also, some of these big data projects have different kinds of needs around data ingest, or speed; sometimes customers want real-time ingest.

And there are other considerations around queryability and the application of sophisticated processing like machine learning.

The Hadoop ecosystem is designed to solve a different set of data problems than those of relational databases. One of the core components of Hadoop is an alternate file system called HDFS, the Hadoop Distributed File System.

Hadoop itself is actually not a database. It is an alternative file system with a processing library.

So, really when you think about bringing Hadoop in as a solution, it’s gonna be in addition to your existing RDBMS, not as a replacement for it.

So, Hadoop itself is most commonly implemented with something called HBase.

Now, this is based on technology that was developed originally at Google to index the entire internet: the GFS, or Google File System. About 10 years ago, Google wrote a whitepaper on how they created this file system, and the open-source community took the information from that whitepaper and made it part of the basis of Hadoop. So if you hear GFS and HDFS, they are very, very similar implementations.

HBase is a NoSQL database that is very commonly used with Hadoop solutions. It is a wide column store, which means it’s a database that consists of a key and then one to n values.

Introducing CAP (consistency, availability, partitioning)

To understand more about the use cases for Hadoop ecosystem we’re going to take a look at what’s called CAP theory or CAP theorem which is a way to understand the different categories and classifications of databases.

The first aspect of CAP theory is the idea of consistency. The concept is that there are certain database solutions that allow for very high data consistency. Another way to think about this is that the solution supports transactions. An example of a transaction would be two data modification operations combined as a unit: withdrawing money out of a savings account, for example, and then adding that money into a checking account. You would want both of those changes to occur successfully, or neither; otherwise your data would be inconsistent.
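That savings-to-checking transfer can be sketched as a SQL transaction; here it runs through the SQLite shell (the table, account names, and amounts are invented for illustration):

```shell
sqlite3 /tmp/bank.db <<'SQL'
CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER);
INSERT INTO accounts VALUES ('savings', 100), ('checking', 0);

-- The two updates form one unit: both succeed or neither does
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 40 WHERE name = 'savings';
UPDATE accounts SET balance = balance + 40 WHERE name = 'checking';
COMMIT;

SELECT name, balance FROM accounts ORDER BY name;
SQL
```

If anything fails between BEGIN and COMMIT, a ROLLBACK leaves both balances untouched, which is the consistency guarantee being described.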

The second aspect of CAP theory is availability. Another way to say that is up-time. What this means is that you have the ability to make copies of the data so that if one copy goes down in one location, the data will still be available for some or all of your users.

The third aspect of CAP theory is partitioning. Another way to think about that is scalability. What that means is that you can split your set of data across multiple processing locations, physical machines, or virtual machines so that you can continue to grow the amount of data that you work with.

Traditional RDBMS systems are known for having consistency and availability, but have difficulties at the highest levels of partitioning. CAP theory says that a database system can really only meet two of the three aspects.

This is where Hadoop comes into play because as I mentioned earlier the data that’s becoming available for businesses and other companies is growing larger and larger and larger, so partitioning and the complexities around partitioning and the expense is causing companies to look at database solutions that support that aspect, and maybe they don’t have a need for the other two aspects to be fully implemented.

Hadoop is designed for scalability. It is designed to run on commodity hardware, that is, cheap servers, even really old servers; I see this very commonly. It is also designed for partitioning: it makes three copies of the data by default, and if any copy becomes bad because the hardware fails or corrupts, you can just pull the old hardware out and put new hardware in. The Hadoop file system will automatically manage that copy process. This goes to another property of Hadoop, which is availability. Because it runs on commodity hardware, you can scale a Hadoop cluster nearly infinitely. And if you remember where the HDFS file system came from, it makes sense: it was designed by Google, originally to index the entire internet, so it is built to scale.

The number one user of Hadoop is Yahoo! The number two is Facebook. These companies obviously have huge datasets, and they are taking advantage of the cost saving that they get, scaling all of the data out on commodity hardware. And of course they also want high availability because it’s their business to be online and available all the time.

Understanding Big Data

What I find is that a lot of businesses have the wrong information about Hadoop and think of it as a replacement for a relational database. As an architect, I really haven’t encountered any customers who don’t need some kind of relational database.

This is the world of Big Data projects, and let me give some examples. This is data that will be batch-processed, in other words, processed as a group rather than individually queried. And it’s often a great fit for Hadoop.

2. What Is Hadoop?

Introducing Hadoop

Hadoop consists of two components, and oftentimes is deployed with other projects as well. What are those components? The first one is open-source data storage, HDFS, which stands for Hadoop Distributed File System. The second one is a processing API called MapReduce. Most commonly, professional deployments of Hadoop include other projects or libraries as well, and there are many, many different libraries.
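MapReduce itself runs across a cluster, but the map-shuffle-reduce shape can be mimicked with an ordinary shell pipeline. This word-count sketch is only an analogy, not Hadoop code:

```shell
# map: emit one word per line     (like mappers emitting key/value pairs)
# sort: bring equal keys together (like the shuffle phase)
# uniq -c: count each group       (like reducers aggregating per key)
printf "apple banana apple\nbanana apple\n" |
  tr ' ' '\n' |
  sort |
  uniq -c
```

The output is a count next to each distinct word: 3 for apple, 2 for banana.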

Among these libraries are HBase, Hive, and Pig. In addition to understanding the core components of Hadoop, it’s important to understand what are called Hadoop distributions.

The first set of distributions are 100% open source, and you’ll find those under the Apache Foundation. The core distribution is called Apache Hadoop and there are many, many different versions.

There are commercial versions that wrap around some version of the open source distribution and they will provide additional tooling and monitoring and management along with other libraries. The most popular of these are from companies Cloudera, Hortonworks, and MapR.

In addition to that, it’s quite common for businesses to use Hadoop clusters on the cloud. The cloud distribution that I use most often are from Amazon Web Services or from Microsoft with the Windows Azure HDInsight.

When you’re using a cloud distribution you can use an Amazon Distribution which implements the open source version of Hadoop, so Apache Hadoop on AWS with a particular version, or you can use a commercial version that is implemented on the AWS cloud such as MapR on AWS.

Examples of using Hadoop are as follows. One is customer churn analysis. It costs a lot more to gain a new customer than to keep a current one, so it’s in the best interest of many companies to collect as much information as possible, including behavioral data: what activities was the customer doing shortly before they left? With that insight, companies can reduce the number of customers that are leaving.

Hadoop Solutions make use of behavioral data so that companies can make better decisions.

Facebook is the largest known user of Hadoop, or at least the largest public user. The New York Times, the Federal Reserve Board, IBM, and the travel company Orbitz use it as well, and there are literally hundreds of companies making use of Hadoop to augment their line-of-business data with behavioral data and make better decisions.

Understanding the difference between Hadoop and HBase

One of the confusing things about working with the Hadoop ecosystem is that there are a tremendous number of parts and pieces: libraries, projects, terms, new words and phrases. It’s really easy to misunderstand core concepts, and one of the concepts that I didn’t understand at first when I was working with Hadoop is Hadoop vs. HBase.

The Hadoop core ecosystem consists of two parts: a new type of file system, HDFS, and a processing framework, MapReduce.

You can see that we’ve got representations of files; there are four files here, and as I mentioned in a previous movie, each file by default is replicated three times in the Hadoop ecosystem, on three different pieces of commodity hardware. So you can think of them as cheap servers.
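To make the replication idea concrete, here is a toy sketch in plain Python, not HDFS’s actual placement code: a file is split into fixed-size blocks and each block is assigned to three distinct nodes, mimicking the default replication factor of 3. The block size, node names, and round-robin placement are all illustrative assumptions.

```python
# Toy illustration of HDFS-style block replication (not real HDFS code).
BLOCK_SIZE = 4   # toy block size; real HDFS blocks are tens of megabytes
REPLICATION = 3  # HDFS replicates each block three times by default

def place_blocks(data: bytes, nodes: list[str]) -> dict[int, list[str]]:
    """Map each block index to the nodes holding its replicas."""
    placement = {}
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for i in range(num_blocks):
        # Simple round-robin placement across distinct nodes.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
        placement[i] = replicas
    return placement

placement = place_blocks(b"hello hadoop", ["node1", "node2", "node3", "node4"])
for block, replicas in placement.items():
    print(block, replicas)
```

The point of the sketch is only that losing any single node still leaves two copies of every block, which is what lets Hadoop run safely on cheap commodity servers.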

I like to say that MapReduce is to Hadoop kind of like C++ is to object-oriented programming. MapReduce is written in Java, and customers working with Hadoop really don’t want to query or work with Hadoop at that level of abstraction. One such higher-level solution is the HBase library. HBase is a wide column store: you can see on the right a table with one ID column and then a data column, and there’s really no requirement for any particular values in the data column; that’s why it’s called a wide column store.
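The MapReduce programming model itself can be sketched in a few lines of plain Python, with no Hadoop involved: a map phase emits (key, value) pairs, the framework shuffles the pairs by key, and a reduce phase aggregates each group. The word-count example and function names below are illustrative, not Hadoop’s actual API.

```python
# A minimal sketch of the MapReduce model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(document: str):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group; here, sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # e.g. {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce functions run as Java tasks spread across the cluster, but the logical flow is exactly this, which is why higher-level tools like HBase and Hive exist to hide it.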

A lot of people think that HBase, or the query language often used alongside it, Hive, is actually part of Hadoop. Although they often appear together in practical implementations, they are separate projects.

3. Understanding the Hadoop Core Components

Understanding Java Virtual Machines

Hadoop processes, or execution activities, run in separate JVMs. A JVM is basically a process for executing Java bytecode in an executable program: a little section of the program that runs.

Traditionally, in database processing systems, state is shared; the different Hadoop processes, by contrast, run in separate JVMs.

Exploring Hadoop Distributed File System (HDFS) and other file systems

The default file system is HDFS, which we talked about in a previous movie. It stores data in larger chunks than a regular file system and is triple-replicated by default.

The HDFS file system has two implementation modes: fully-distributed, which gives you the three copies, and pseudo-distributed, which uses the HDFS file system but is designed for testing and is implemented on a single node on a single machine.

As an alternative to HDFS, you can run Hadoop with the regular file system. This is called standalone mode, and it’s a great way to start when you’re first learning the MapReduce programming paradigm: you reduce complexity by just working with your regular file system.

Alternatively, when you’re deploying Hadoop to production, particularly if you’re deploying on a public Cloud, it’s really common to use a file system that’s on that Cloud.

For example, in Amazon, the S3 file system, or in Azure, Blob storage. This is similar to standalone mode in that you are not using HDFS; you’re using a regular file system, just a cloud-based one.

If you deploy Hadoop in single-node (standalone) mode, you’re going to use the local file system and a single JVM for all the Hadoop processes.

If you deploy in pseudo-distributed mode, you’re going to use HDFS, and the Java daemons are going to run all the processes on a single machine.


If you run in fully-distributed mode, you’re going to use HDFS, it’s going to be triple-replicated, and the daemons are going to run in various locations depending on where you choose to place them. So you can see in this particular drawing we’re in fully-distributed mode. We have three separate physical servers. On each server we have various daemons, represented in green. You can see we’ve got a Task Tracker on each one, and then we have a Job Tracker daemon that is implemented on just one of the servers.




Notes from SCRUM: A Breathtakingly Brief and Agile Introduction


SCRUM: A Breathtakingly Brief and Agile Introduction

Chris Sims & Hillary Louise Johnson

What is SCRUM?

A scrum team typically consists of around seven people who work together in short, sustainable bursts of activity called sprints, with plenty of time for review and reflection built in. One of the mantras of scrum is “inspect and adapt,” and scrum teams are characterized by an intense focus on continuous improvement—of their process, but also of the product.


Scrum recognizes only three distinct roles: product owner, scrum master, and team member:

Product Owner

The product owner is responsible for maximizing the return the business gets on this investment (ROI).

One way that the product owner maximizes ROI is by directing the team toward the most valuable work, and away from less valuable work. That is, the product owner controls the order, sometimes called priority, of items in the team’s backlog. In scrum, no one but the product owner is authorized to ask the team to do work or to change the order of backlog items.

Another way that the product owner maximizes the value realized from the team’s efforts is to make sure the team fully understands the requirements. If the team fully understands the requirements, then they will build the right thing, and not waste time building the wrong thing. The product owner is responsible for recording the requirements, often in the form of user stories (e.g., “As a <role>, I want <a feature>, so that I can <accomplish something>”) and adding them to the product backlog. Each of these user stories, when completed, will incrementally increase the value of the product. For this reason, we often say that each time a user story is done we have a new product increment.

The Product Owner Role in a Nutshell:

  • holds the vision for the product
  • represents the interests of the business
  • represents the customers
  • owns the product backlog
  • orders (prioritizes) the items in the product backlog
  • creates acceptance criteria for the backlog items
  • is available to answer team members’ questions

Scrum Master

The scrum master acts as a coach, guiding the team to ever-higher levels of cohesiveness, self-organization, and performance. While a team’s deliverable is the product, a scrum master’s deliverable is a high-performing, self-organizing team.

The scrum master is the team’s good shepherd, its champion and guardian, facilitator, and scrum expert. The scrum master helps the team learn and apply scrum and related agile practices to the team’s best advantage. The scrum master is constantly available to the team to help them remove any impediments or road-blocks that are keeping them from doing their work. The scrum master is not—we repeat, not—the team’s boss. This is a peer position on the team, set apart by knowledge and responsibilities, not rank.

The scrum master role in a Nutshell:

  • scrum expert and advisor
  • coach
  • impediment bulldozer
  • facilitator

Team Member

High-performing scrum teams are highly collaborative; they are also self-organizing. The team members doing the work have total authority over how the work gets done. The team alone decides which tools and techniques to use, and which team members will work on which tasks. The theory is that the people who do the work are the highest authorities on how best to do it. Similarly, if the business needs schedule estimates, it is the team members who should create these estimates.

A scrum team should possess all of the skills required to create a potentially shippable product. Most often, this means we will have a team of specialists, each with their own skills to contribute to the team’s success.  However, on a scrum team, each team member’s role is not to simply contribute in their special area. The role of each and every team member is to help the team deliver potentially shippable product in each sprint. Often, the best way for a team member to do this is by contributing work in their area of specialty. Other times, however, the team will need them to work outside their area of specialty in order to best move backlog items (aka user stories) from “in progress” to “done.” What we are describing is a mindset change from “doing my job” to “doing the job.” It is also a change in focus from “what we are doing” (work) to what is getting done (results).

The Team Member Role in a Nutshell:

  • responsible for completing user stories to incrementally increase the value of the product
  • self-organizes to get all of the necessary work done
  • creates and owns the estimates
  • owns the “how to do the work” decisions
  • avoids siloed “not my job” thinking

So, how many team members should a scrum team have? The common rule of thumb is seven, plus or minus two. That is, from five to nine. Fewer team members and the team may not have enough variety of skills to do all of the work needed to complete user stories. More team members and the communication overhead starts to get excessive.

Scrum Artifacts

These are the tools we scrum practitioners use to make our process visible.

The Product Backlog

The product backlog is the cumulative list of desired deliverables for the product. This includes features, bug fixes, documentation changes, and anything else that might be meaningful and valuable to produce. Generically, they are all referred to as “backlog items.” While backlog item is technically correct, many scrum teams prefer the term “user story,” as it reminds us that we build products to satisfy our users’ needs.

The list of user stories is ordered such that the most important story, the one that the team should do next, is at the top of the list. Right below it is the story that the team should do second, and so on. Since stories near the top of the product backlog will be worked on soon, they should be small and well understood by the whole team. Stories further down in the list can be larger and less well understood, as it will be some time before the team works on them.

Each item, or story, in the product backlog should include the following information:

  • Which users the story will benefit (who it is for)
  • A brief description of the desired functionality (what needs to be built)
  • The reason that this story is valuable (why we should do it)
  • An estimate as to how much work the story requires to implement
  • Acceptance criteria that will help us know when it has been implemented correctly
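The fields listed above can be modeled as a simple record. This is only an illustrative sketch; the field names and example stories are assumptions, not a standard scrum schema.

```python
# Illustrative model of a product backlog item and an ordered backlog.
from dataclasses import dataclass, field

@dataclass
class BacklogItem:
    who: str          # which users the story will benefit
    what: str         # brief description of the desired functionality
    why: str          # why this story is valuable
    estimate: int     # how much work the story requires (e.g. story points)
    acceptance_criteria: list[str] = field(default_factory=list)

# The list order itself encodes priority: backlog[0] is what the team does next.
backlog = [
    BacklogItem("shopper", "save items to a wishlist", "return to them later", 3,
                ["wishlist persists across sessions"]),
    BacklogItem("admin", "export a monthly sales report", "track revenue", 5),
]
```

Keeping priority as list order, rather than a numeric field, mirrors how the product backlog works: the product owner reorders items, and the team always pulls from the top.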

The Sprint Backlog

The sprint backlog is the team’s to-do list for the sprint. Unlike the product backlog, it has a finite life-span: the length of the current sprint. It includes all the stories that the team has committed to delivering this sprint, along with their associated tasks. Stories are deliverables, and can be thought of as units of value. Tasks are things that must be done in order to deliver the stories, and so tasks can be thought of as units of work. A story is something a team delivers; a task is a bit of work that a person does. Each story will normally require many tasks.

A burn chart shows us the relationship between time and scope. Time is on the horizontal X-axis and scope is on the vertical Y-axis. A burn up chart shows us how much scope the team has completed over a period of time.

A burn down chart shows us what is left to do. In general, we expect the work remaining to go down over time as the team gets things done.

Scope changes appear as vertical lines on the burn down chart: a vertical line up when we add new work, or down when we remove some work from the plan.
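The numbers behind a burn-down chart are easy to compute by hand. The sketch below uses made-up figures to show how completed work moves the remaining-work line down each day, while a scope change mid-sprint produces the upward jump described above.

```python
# Data behind a burn-down chart: remaining work per day (numbers are made up).
initial_scope = 40                          # task-hours committed at sprint start
completed_per_day = [0, 8, 6, 10, 7, 9]     # work finished each day
scope_added_per_day = [0, 0, 5, 0, 0, 0]    # day 2: new work discovered and added

remaining = []
left = initial_scope
for done, added in zip(completed_per_day, scope_added_per_day):
    left = left - done + added
    remaining.append(left)

print(remaining)  # [40, 32, 31, 21, 14, 5]
```

Plotting `remaining` against the day number gives the burn-down line; note how day 2 barely drops (6 hours done, but 5 added) even though the team worked at its normal pace.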

Task Board

When the team’s tasks are visible to everyone from across the room, you never have to worry that some important piece of work will be forgotten.
The simplest task board consists of three columns: to do, doing, and done.
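A three-column board is simple enough to model as a dictionary of lists, where moving a task is just removing it from one column and appending it to another. This is only an illustrative sketch of the idea, not a tool recommendation.

```python
# The simplest task board: three columns, tasks move left to right.
board = {
    "to do": ["write help text", "design new screen"],
    "doing": [],
    "done": [],
}

def move(board: dict, task: str, src: str, dst: str) -> None:
    """Move a task from one column to another."""
    board[src].remove(task)
    board[dst].append(task)

move(board, "design new screen", "to do", "doing")
move(board, "design new screen", "doing", "done")
```

Whether the board is sticky notes on a wall or a dictionary like this, the value is the same: every piece of work is visible in exactly one column at all times.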

Definition of Done

The team’s definition may include things like: code written, code reviewed, unit tests passing, regression tests passing, documentation written, product owner sign-off, and so on. This list of things that the team agrees to always do before declaring a story done becomes the team’s “definition of done.”
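Because the definition of done is a checklist where every item must pass, it maps naturally onto an all-items-true check. The checklist entries below mirror the examples in the text; the function and structure are an illustrative sketch.

```python
# A definition of done as a checklist: a story is done only when every item passes.
DEFINITION_OF_DONE = [
    "code written",
    "code reviewed",
    "unit tests passing",
    "regression tests passing",
    "documentation written",
    "product owner sign-off",
]

def is_done(story_status: dict) -> bool:
    """A story counts as done only if every checklist item is satisfied."""
    return all(story_status.get(item, False) for item in DEFINITION_OF_DONE)

status = {item: True for item in DEFINITION_OF_DONE}
print(is_done(status))        # True: everything checked off
status["code reviewed"] = False
print(is_done(status))        # False: one unchecked item blocks "done"
```

The design choice worth noting is the `.get(item, False)` default: anything the team forgot to record counts as not done, which matches the spirit of the checklist.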

The Sprint Cycle

The sprint cycle is the foundational rhythm of the scrum process. Whether you call your development period a sprint, a cycle or an iteration, you are talking about exactly the same thing: a fixed period of time within which you bite off small bits of your project and finish them before returning to bite off a few more.

The shorter the sprint cycle, the more frequently the team is delivering value to the business.

Sprint Planning Meeting

Part One: “What will we do?”

The goal of part one of the sprint planning meeting is to emerge with a set of “committed” stories that the whole team believes they can deliver by the end of the sprint. The product owner leads this part of the meeting.

Note the separation in authority: the product owner decides which stories will be considered, but the team members doing the actual work are the ones who decide how much work they can take on.

Part 2: “How will we do it?”

In part two of the sprint planning meeting, the team rolls up its sleeves and begins to decompose the selected stories into tasks. Remember that stories are deliverables: things that stakeholders, users, and customers want. In order to deliver a story, team members will have to complete tasks. Tasks are things like: get additional input from users; design a new screen; add new columns to the database; do black-box testing of the new feature; write help text; get the menu items translated for our target locales; run the release scripts.

The output of the sprint planning meeting is the sprint backlog: the list of all the committed stories, with their associated tasks.

Daily Scrum

The daily scrum, sometimes called the stand-up meeting, is:

Daily. Most teams choose to hold this meeting at the start of their work day. You can adapt this to suit your team’s preferences.

Brief. The point of standing up is to discourage the kinds of tangents and discursions that make for meeting hell. The daily scrum should always be held to no more than 15 minutes.

Pointed. Each participant quickly shares:

  1. Which tasks I’ve completed since the last daily scrum.
  2. Which tasks I expect to complete by the next daily scrum.
  3. Any obstacles that are slowing me down.

The goal of this meeting is to inspect and adapt the work the team members are doing, in order to successfully complete the stories that the team has committed to deliver.

Story Time (Backlog Refinement)

In this meeting, the team works with the product owner on the stories in the product backlog. Note that these are not the stories in the current sprint; those stories are now in the sprint backlog. We recommend one hour per week, every week, regardless of the length of your sprint. Topics include:

Story Splitting

Stories at the top of the product backlog need to be small. Small stories are easier for everyone to understand, and easier for the team to complete in a short period of time. Stories further down in the product backlog can be larger and less well defined. This implies that we need to break the big stories into smaller stories as they make their way up the list.

Inspect & Adapt, Baby

Experience is the best teacher, and the scrum cycle is designed to provide you with multiple opportunities to receive feedback—from customers, from the team, from the market—and to learn from it. What you learn while doing the work in one cycle informs your planning for the next cycle. In scrum, we call this “inspect and adapt”; you might call it “continuous improvement”; either way, it’s a beautiful thing.

All Excerpts From

Sims, Chris & Hillary Louise Johnson. “SCRUM: A Breathtakingly Brief and Agile Introduction.” DYMAXICON. 
This material may be protected by copyright.