The greatest enemy of knowledge is not ignorance, it’s the illusion of knowledge
In this lab, you’ll use Amazon Web Services to set up a three-node Elastic MapReduce (EMR) cluster, which you can then use for any or all of the class exercises. NOTE: This lab only covers how to set the cluster up. To manage costs, you must shut down the cluster at the end of the class day. If you want to run the labs in the cluster, you’ll need to re-run these setup steps at the start of every day (this should only take 5 minutes once you get the hang of it).
Log into the EMR console at AWS. You will need to have created an AWS account first, and for that you will need to provide:
Use the instructions provided by AWS to start up your cluster. Use the following values in place of the AWS ones:
After you press the “Create Cluster” button, your cluster will go into “Starting” mode (as per the following screenshot):
It can take 5 to 15 minutes for the cluster to start (if you chose all applications, it will take up to 15). This might be a good time to get up and grab a coffee.
Verify that your cluster has gone into “Waiting” mode (as per the screenshot below):
Continue ONLY when your screen shows the “Waiting” status for the cluster, as above.
In EMR, running a job is done with “steps”. Because of the special setup involved with EMR, you cannot easily just SSH into the master node and run “hadoop jar” commands; you have to run “steps” instead.
First, select “Steps / Add Step” from the EMR interface:
In the dialog that pops up, copy and paste the following into the appropriate fields (“Step Type: Custom Jar” is selected by default):
Jar Location: s3://learn-hadoop/exercises/code/wordcount/WordCount.jar
Arguments: WordCountDriver s3n://learn-hadoop/exercises/data/shakespeare/ s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/
All filled out, the “Steps” should look something like this:
Go ahead and click “Add”, then watch for the job to complete. “Job Status” will go from “Pending” to “Running” to “Completed”, and the interface will look like this:
Keep hitting the “in page” refresh icon until “Log files” populates (this may take 3-5 minutes). Once you see “Log files”, you can select each one to see the actual logs. When you then browse to the output location you gave in the arguments (YOUR-BUCKET/NONEXISTENT-FOLDER), you’ll see several “part-r-0000x” files along with a zero-byte “_SUCCESS” file (indicating the job ran OK). The “part-r-0000x” files are the individual reducer outputs. You can click each to download and review the contents.
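Once the part files are downloaded to a local folder, they can be combined and ranked from the command line. The sketch below is illustrative only. It assumes the reducer output uses Hadoop's default text format of one word, a tab, and a count per line, which is not confirmed for this particular WordCount.jar. It also creates two made-up sample files in a scratch directory to stand in for the real downloads. The function name top_words is hypothetical.

```shell
# Scratch directory standing in for the folder where you saved the downloads.
workdir="$(mktemp -d)"

# Made-up sample data in the ASSUMED "word<TAB>count" format.
printf 'hamlet\t3\nthe\t10\n' > "$workdir/part-r-00000"
printf 'king\t5\nto\t7\n' > "$workdir/part-r-00001"

# top_words DIR N: merge all reducer outputs in DIR and print the N most
# frequent words, sorted by the numeric count in the second tab-separated field.
top_words() {
  tab="$(printf '\t')"
  cat "$1"/part-r-* | sort -t "$tab" -k2,2nr | head -n "$2"
}

top_words "$workdir" 3
```

With real downloads, point top_words at the folder that holds your part files instead of the scratch directory.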
You’re done with the “required” part of this lab. You may just choose “Terminate” for your cluster right now as per the screenshot below.
To get full use of the cluster, you’ll want to establish a “tunnel” (secure channel) to the web front-ends like HUE, Spark history, etc. Click “Enable Web Connection”.
This will pop up a set of instructions for establishing a tunnel. On Windows you’ll need to understand PuTTY (and how to convert the “spark-class.pem” key to a PPK file that PuTTY can use), but the process is much simpler on Mac/Linux.
The tunnel command should look something like:
ssh -i ~/spark-class.pem -ND 8157 hadoop@[master-node-dns]
Note: The “8157” port is just an open, unused port… you can use any open, unused port for this purpose, but we know that 8157 is free.
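As a rough sketch of how the pieces of the tunnel command fit together, the snippet below assembles it from three variables and prints it rather than running it. The key path, the port, and the Master DNS value are placeholders (the DNS name is simply the example used elsewhere in this lab), and tunnel_cmd is a hypothetical helper name.

```shell
# All three values are placeholders: substitute your own key file, port and Master DNS.
KEY_FILE="$HOME/spark-class.pem"
TUNNEL_PORT=8157
MASTER_DNS="ec2-54-153-36-108.us-west-1.compute.amazonaws.com"

# tunnel_cmd: print (not run) the ssh command for the tunnel.
#   -i  identity (private key) file
#   -N  do not run a remote command, only forward ports
#   -D  open a local dynamic (SOCKS) forwarding port
tunnel_cmd() {
  printf 'ssh -i %s -N -D %s hadoop@%s\n' "$KEY_FILE" "$TUNNEL_PORT" "$MASTER_DNS"
}

tunnel_cmd
```

Printing the command first makes it easy to check the values before you paste it into a terminal. Note that -ND in the lab's command is just the two flags -N and -D written together.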
Once the tunnel is established, you can access the following web GUIs as though they were local:
HUE interface: http://<your-master-dns>:8888/
Namenode UI: http://<your-master-dns>:50070/
Resource Manager UI: http://<your-master-dns>:8088/
Spark History UI: http://<your-master-dns>:18080/
So, for example, if your Master DNS is “ec2-54-153-36-108.us-west-1.compute.amazonaws.com”, your connection to HUE would be http://ec2-54-153-36-108.us-west-1.compute.amazonaws.com:8888/
If you like, you can play with the HUE interface, and explore sections like Pig, File Browser, Sqoop, etc. Please note that if you plan on running any of the following labs on the EMR cluster, you will need to either run them as “steps” (for standard MapReduce) or log into the MASTER node and run them from there (for Hive, Pig, Spark labs).
Again, you have the option of just using the HUE interface to submit jobs of various types.
Using Spark in the Hadoop Ecosystem by Rich Morrow Published by Infinite Skills, 2016, https://www.safaribooksonline.com/library/view/using-spark-in/9781771375658/
If you are truly going to be exceptional, if you’re truly going to be radical, you have to be willing to stand out. You have to be willing to be hated, you have to be willing to be controversial, you have to be willing to have certain people talking about you and you walk in the room and all the conversation stop. You have to be okay with that. Because if you’re radical that’s going to happen.
The possession of knowledge does not kill the sense of wonder and mystery. There is always more mystery.
Apache Maven is a software project management and comprehension tool based on the concept of a project object model, or POM. Maven can manage a project’s build, reporting, and documentation from a central piece of information. A more comprehensive definition of Apache Maven is that it is a project management tool which encompasses a project object model.
It also ensures that programmers always get the most recent version of compilers, et cetera. Most Java projects rely on other projects and open source frameworks to function properly. It can be cumbersome to download these dependencies manually and keep track of their versions as you use them in your project. Maven provides a convenient way to declare these project dependencies in a separate, external pom.xml file. It then automatically downloads these dependencies and allows you to use them in your project. This greatly simplifies project dependency management.
It is important to note that in the pom.xml file you specify the what, and not the how. The pom.xml file can also serve as a documentation tool, conveying your project dependencies and their versions. Software developers refer to Maven as a build tool, since it is used to build deployable artifacts from source code. On the other hand, if you asked a project manager, they might call it a project management tool, since it follows a development life cycle. In reality, it is both.
Go to maven.apache.org
From this page we can either use the link on the left that says Download, or we can just use the link in the middle where it says Use Download.
Let’s look quickly at the system requirements. Probably one of the most important ones is that it requires Java Development Kit or JDK 1.7 or higher.
We’ll take a look at that in a minute, but for right now, under the link we have Binary tar file, a Binary zip file, and two source files.
I’m gonna go ahead and download the apache-maven-3.3.9-bin.zip file. It doesn’t take long to download and you can see it’s only 8.2 MB.
The first thing we need to do is extract the archive file that I just downloaded, so I’m gonna go down to my archive, and I’m gonna click on it and I’m going to click on Extract, and I’m gonna say Extract all. The Maven website recommends that you extract the file to the Program Files directory on your C: drive so I’m gonna do that. I’m gonna click on Browse, I’m gonna go into my C: drive, to my Program Files, and I’m going to go ahead and say Select Folder, and I’m gonna say Extract. Now I have a folder called apache-maven-3.3.9, and if I open it up, I’ll see there’s a bin folder, a boot folder, a conf folder, and a lib folder. At this point, I’ve completed my download. The instructions for using Maven depend on whether you’re running a Windows machine or a Mac OS or Linux machine.
Before the installation we must verify our Java version from the command line using java -version. Remember it must be 1.7 or higher.
So do this:
cmd -> java -version
To make life easier, we need to update our environment variables. We can do that from the command window, but then we’d have to update them every time we open a new window, or we can go to the control panel.
For now, let’s use the control panel. I’m going to go back to start again and type “control panel”. From here, I’m going to go to System and Security. I want to go to System. Now, I need to go to my Advanced system settings. Then you’ll see at the bottom it says Environment variables. Let’s click on Environment variables. The top half of the screen shows the user variables specific to the producer profile. The bottom half shows my system variables.
What we want to do here is we want to add a variable to indicate where the Java home is. This is where we stored our JDK.
Let’s add a new system variable. We’re going to click on new and we’re going to call it JAVA_HOME. The value for the variable will be the path that takes me to my JDK folder. In our case it’s going to be c:\program files\java\jdk, mine was 1.8.0_91 and I’m going to click OK.
Now you’ll see it’s added to your list of system variables. The next thing we need to do is we need to add our new Apache Maven directory, which is also in my program files. Let’s go back and take a look. If I go to my C drive, to Program Files, the very first folder is Apache Maven 3.3.9. Inside there is the bin folder. That’s what I need, I need to know that path. Let me close that. What I want you to do is to click on the path variable that already exists and say “edit”.
We’re going to add a new entry to it. It’s going to say c:\program files\apache-maven-3.3.9, and this time I also want to include \bin. Now, my environment knows where to find the Maven commands. I’m going to click OK. I’m going to click OK again, and OK again, and I’ll close my control panel.
Now, we do need to open a new command prompt. If you have any command prompt windows open, go ahead and close them and start a new window so it will pick up those environment changes that we just made. Now, from the command prompt, I’m going to type mvn -v and hit enter. We can see that this is Apache Maven 3.3.9. It tells me my Maven home folder. It gives me the Java version that we’re using, the Java home, as well as the default locale and operating system name.
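The walkthrough above is for Windows. On Mac or Linux (or in a Unix-style shell on Windows), a similar sanity check can be sketched as below. The check_tool helper is a hypothetical name, and the commented export lines use made-up paths that you would replace with wherever your JDK and Maven actually live.

```shell
# Hypothetical paths, shown only as a pattern. Uncomment and adjust for your machine.
# export JAVA_HOME="/path/to/your/jdk"
# export PATH="$PATH:/path/to/apache-maven-3.3.9/bin"

# check_tool NAME: report whether a command can be found on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH"
  fi
}

check_tool java
check_tool mvn
```

If both tools are reported as found, java -version and mvn -v should then work from the same shell.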
At this point I’m ready to get started using Maven on my Windows machine.
Maven uses the concept of a Project Object Model, or POM.
This model has a set of standards, a project lifecycle, a dependency management system, and logic for executing plugin goals at certain phases in the lifecycle process. One of the things that makes Maven so powerful is that it relies on the concept that projects are set up with default behaviors. For example, the pom.xml file is always located in the base directory. The source code must be in a certain directory. Resources necessary for the project are in another folder or directory. Test cases are in a specifically named folder. And a target folder is always created that’s used for the final JAR file.
As you can see, the base directory is called calculator. Inside calculator we have our src folder, as well as our target folder, and our pom.xml file.
Inside the src folder is where we’ll find the main folder, with our Java source code and any needed resources, and the test folder, which contains the Java test programs and any resources needed for them. Finally, the JAR file will be stored in the target folder. This folder structure is an important example of how Maven has adopted convention over configuration. By always using a standard folder structure, it allows developers to concentrate on coding. Once the code and resources are placed in the correct directories and the POM file is updated, Maven handles the rest.
A project model includes a project description, a unique set of coordinates, project attributes, the project’s license information, the project version, any authors or contributors to the project, and a list of project dependencies.
Before we go further, let’s take a look at a sample POM file. This is the file for the calculator project. The POM file is stored as an XML file. XML files use tags similar to HTML. In the case of Maven, we have tags such as groupId, artifactId, packaging, version, etc. The artifactId is used for the name of the program, in our case calculator. Since it’s a Java program, the packaging is jar, so the build creates a JAR file. And the version in this case is 1.0. The description, name, and URL are all optional. Below that are the dependencies. When you create a sample program using Maven, it automatically adds a JUnit dependency to allow us to do unit testing for our Java program.
You might have noticed the three red asterisks next to the three fields: group ID, artifact ID, and version. That’s because these three fields together make up what we call the coordinates of the project, and they must be a unique combination. So, if I wanted to create a second version of my calculator project, I’d have to change the version number from 1.0 to 2.0, or 1.1, or something, to make it unique.
When using Maven, it’s important to understand the Maven life cycle. Let’s take a look at a high-level overview of the flow when using Maven. Maven starts by generating a project. A project consists of a POM, or Project Object Model, and source code that’s assembled in the Maven standard directory layout. Next, we execute Maven with a life cycle phase as an argument, which prompts Maven to execute a series of plugin goals. After that, we can install a Maven artifact into our local Maven repository. And finally, we can run the app.
Let’s take a closer look at the default life cycle phases.
One of the phases is Validate. Validate is used to validate the project to make sure it is correct and all necessary information is available.
Another phase is Compile. We compile the source code of the project.
Test: Test the compiled source code using a suitable unit testing framework. These tests should not require the code to be packaged or deployed.
Package: Take the compiled code and package it in its distributable format. For example, a Java program will be packaged as a JAR (Java archive) file.
Integration-test. Process and deploy the package if necessary into an environment where integration tests can be run.
Verify runs any checks to verify the package is valid and meets quality criteria.
Install. Install the package into the local repository for use as a dependency in other projects locally.
And finally, Deploy. This is done in an integration or release environment. It copies the final package to the remote repository for sharing with other developers and projects.
Plugin goals can be attached to each lifecycle phase. As Maven moves through the phases in a lifecycle it will execute the goals attached to each particular phase. Each phase may have zero or more goals bound to it. For example, when we run mvn install we will see that more than one goal is executed.
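The way a single phase argument pulls in every earlier phase can be sketched in a few lines of shell. This is only a model of the behavior, not how Maven is implemented. The phase list is the simplified one given above (Maven's real default lifecycle contains additional intermediate phases), and phases_through is a hypothetical helper name.

```shell
# Simplified default lifecycle, in the order listed in the text above.
PHASES="validate compile test package integration-test verify install deploy"

# phases_through PHASE: print every phase that runs when PHASE is requested,
# i.e. all phases from the start of the lifecycle up to and including PHASE.
phases_through() {
  target="$1"
  case " $PHASES " in
    *" $target "*) ;;
    *) echo "unknown phase: $target" >&2; return 1 ;;
  esac
  for phase in $PHASES; do
    echo "$phase"
    if [ "$phase" = "$target" ]; then
      break
    fi
  done
}

# Models "mvn install": everything from validate through install.
phases_through install
```

In this model, asking for package stops after package, which is why running the test phase on its own never produces a JAR.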
The local repository is usually located in your home directory, in a folder called .m2. This directory contains your Maven repository. When you download a dependency from a remote Maven repository, Maven stores a copy of the dependency in your local repository. In addition, it also places there a copy of the jar file and the pom.xml file for each project you install. Let’s take a look at both of these.
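To make the local repository layout concrete, the sketch below works out where an installed JAR would be expected, assuming the default repository location of .m2/repository under your home directory. The groupId dots become folder separators, followed by the artifactId and the version. The coordinates used here (com.lynda, calculator, 1.0) are example values, and local_repo_path is a hypothetical helper name.

```shell
# local_repo_path GROUP_ID ARTIFACT_ID VERSION: print the expected location of
# the JAR in the default local repository (~/.m2/repository).
local_repo_path() {
  group_dir="$(printf '%s' "$1" | tr '.' '/')"   # com.lynda -> com/lynda
  printf '%s/.m2/repository/%s/%s/%s/%s-%s.jar\n' \
    "$HOME" "$group_dir" "$2" "$3" "$2" "$3"
}

local_repo_path com.lynda calculator 1.0
```

The matching POM sits in the same folder with a .pom extension in place of .jar.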
As you can see, there’s a tag called dependencies, and inside it there is one dependency called junit. The three tags groupId, artifactId, and version are the coordinates that make this particular dependency unique. The scope identifies what part of the life cycle this dependency is going to be used in. In this case, it’s the test phase. It is easy to add additional project dependencies by updating this pom.xml file. By adding a list of dependencies here in one place, it is also easy for someone to identify what dependencies are required for this particular project. Finally, by including the dependencies in this external file, it is easy to update the version numbers in one place as dependencies might change.
The POM file contains all the information about a project. The file is stored with an .xml extension. Here’s an example of a pom.xml file that has the minimum amount of information required. As you can see, it has a groupId, an artifactId, and a version. Remember, those three things make up the Maven coordinates and are required for all projects.
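A minimal POM along these lines can be written out from the shell as a quick experiment. The sketch below creates it in a scratch directory so nothing existing is overwritten. The coordinate values are examples only, and note that besides the three coordinates Maven also expects the modelVersion element, which is 4.0.0 for current POMs.

```shell
# Scratch directory so no existing pom.xml is touched.
pom_dir="$(mktemp -d)"

# Write a minimal POM: modelVersion plus the three coordinates (example values).
cat > "$pom_dir/pom.xml" <<'EOF'
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.lynda</groupId>
  <artifactId>calculator</artifactId>
  <version>1.0</version>
</project>
EOF

# Show just the three coordinate lines.
grep -E '<(groupId|artifactId|version)>' "$pom_dir/pom.xml"
```

Because no packaging element is given, a project like this would fall back to the default jar packaging.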
A plugin is a collection of one or more goals, and a goal is a unit of work in Maven.
Maven comes with several core plugins.
These core plugins include the JAR plugin, which creates JAR (Java Archive) files; the Compiler plugin, which contains goals for compiling source code and unit tests; and the Surefire plugin, which is used for executing unit tests and generating reports.
NetBeans always puts projects inside of the NetBeans folder that’s created in the My Documents folder, within my producer, within my users, on my C drive. The default group ID is com.mycompany. I’m gonna change mine to com.lynda.
On the left hand side you can see the mavenhelloworld project, so let’s go ahead and take a look at the source packages. Inside there we have our com.lynda.mavenhelloworld. Let’s open that, and there’s our App.java. And as you can see, I didn’t do any coding. Maven automatically created this little simple app that just says “Hello World”. On the left hand side of the project folder, you can see there’s the source packages, the test packages, any dependencies, test dependencies, Java dependencies, and project files.
Let’s go ahead and run this new application by clicking on the green run arrow. In my output window I gotta scroll up a little bit. And there’s my Hello World output. Even if you don’t want a Hello World application, using the quickstart to create the shell means you can now go in, make changes to the Java application, and create your own.
It’s a great way to get the file and folder structure set up for a Maven project. If you have NetBeans installed, go ahead and give it a try.
In this challenge you’re gonna use Maven to create a new project. You’ll do this using the Maven Archetype Tool from the command prompt. The new project will be a very small, simple web application.
Here is the command that you’ll need to use at the command line:
mvn archetype:generate -DgroupId=com.lynda -DartifactId=sampleWeb -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false
I’m using com.lynda for the groupId, but you can use your organization ID if you want. The artifactId is the name of the program; I’m calling mine sampleWeb. The archetypeArtifactId of maven-archetype-webapp tells Maven that we want to create a web app. Finally, interactiveMode=false tells Maven not to prompt for any values. Once you enter the command and hit Enter, Maven will create the project for you. From there you can navigate to your directory, look for the sampleWeb/src/main/webapp folder and launch the index.jsp file.
Let’s talk a little bit about unit testing. As I’m sure you’ll agree, unit testing is a critical step in any programming project. What’s really nice about Maven is that it provides built-in support for unit testing. JUnit is used to easily test our application. When we first created our project using the quickstart archetype, it automatically created a test directory with a test application.
There are times when you need to add dependencies. This is one of the benefits of using Maven. It makes adding dependencies easy. Remember, Maven supports both internal and external dependencies. Whenever a project references a dependency that isn’t available in a local repository, Maven will download the dependency from a remote repository into the local repository. So far, all of our projects have included the JUnit dependency. It is sometimes going to be necessary to add other dependencies required by your project.
Let’s say we’ve added some logging to our code for debugging purposes, and we need to add the Log4j as a dependency. Let’s add this dependency to our calculator project. In order to add the dependency, we need to edit the pom.xml file. So I’m going to go to my file explorer, I’m already in my calculator project. I’m going to go into my pom file. To open the file, I’m going to right click and say edit with Notepad++. And right below the dependency for JUnit, I’m going to add a new dependency.
The groupId is going to be log4j. The artifactId is also log4j. The version number is 1.2.17. It’s always a good idea to check the version number by going to Google and looking up the Log4j version. And, finally, the scope parameter: the logging is needed when the code is compiled, so the scope is compile.
And we don’t want to forget to end our dependency tag. Okay, let’s save our pom file, and now we can go ahead and run our calculator program again. In the command prompt I’m in the calculator folder so I’m going to just go ahead and run mvn install.
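For reference, the dependency entry described above would look roughly like the fragment printed by this sketch. It goes inside the dependencies element of pom.xml, next to the JUnit entry. The log4j_dependency function is just a hypothetical wrapper so the fragment can be printed and checked from the shell.

```shell
# log4j_dependency: print the XML fragment to paste inside <dependencies>.
log4j_dependency() {
cat <<'EOF'
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.17</version>
  <scope>compile</scope>
</dependency>
EOF
}

log4j_dependency
```

Note that the coordinates are lowercase, and that compile is also the scope Maven assumes when no scope is given.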
Maven makes adding resources really easy. It is often helpful to allow your program to read test input from a file. To use a file for testing, you must add the file to the resources folder in your test directory. Then we must update our code to read from the file, and I’m gonna add print line statements to help with debugging to make sure that everything is working as expected.
We have our main folder, but then we also have our test folder, which was created by Maven. Inside test, you should have a java folder, but you’re probably gonna need to add a resources folder. To do that, you can right click and say New, Folder, and just name it resources.
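The same folder structure can be created from a shell instead of the file explorer. The sketch below builds the standard Maven layout for a project called calculator inside a scratch directory, including the test resources folder, and drops in a made-up input file. The file name input.txt and its contents are purely illustrative.

```shell
# Build the layout in a scratch directory so nothing existing is touched.
project_root="$(mktemp -d)/calculator"

# Standard Maven directory layout: main and test, each with java and resources.
mkdir -p "$project_root/src/main/java" \
         "$project_root/src/main/resources" \
         "$project_root/src/test/java" \
         "$project_root/src/test/resources"

# Hypothetical test input file placed where test code would look for resources.
printf '2 3\n' > "$project_root/src/test/resources/input.txt"

# List the directories that were created.
find "$project_root" -type d | sort
```

In a real project you would run the mkdir from the project's base directory, next to pom.xml, rather than in a temporary folder.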
The last part of the process is packaging your application. The packaging information is stored in your pom.xml file. Some sample packaging types include jar, for Java Archive files; war, for Web Archive files; and ear, for Enterprise Archive files. Remember, the default is a jar file. If the type is omitted, Maven will automatically create a jar file. Let’s take a look at the pom.xml for our calculator app. Remember, the last time we updated it, we added the Log4j dependency, which starts on line 18.
But if we go back up to where the Maven coordinates are, embedded within the coordinates is, on line 6, a packaging tag. That packaging tag says jar, because our application was a Java application. So it will automatically create a Java archive. When we’re ready, we can go to the Command Prompt, and from within the base project folder, we can type mvn package, and it will create the jar file. The package phase also runs as part of mvn install, but not as part of mvn test, because test comes before package in the lifecycle.
But now we have our jar file, we have a copy of it in our local repository, and we’re ready to go. So remember, when you’re ready to package your application, check your pom.xml file to see what packaging type you have declared.
Most US residents know the United States Census Bureau as the government organization that counts the American population every 10 years. In fact, the Census Bureau, which is part of the Department of Commerce, is tasked with gathering a wide range of data for individuals, government entities at the national, state, and local levels, as well as industry.
You can see that on the main page there is an easy link to search for community facts where you can find popular facts such as population or income about a particular community.
It’s easy to think that policy and business decisions are made based on national or international data, but most governments and businesses operate at the state and local level. As the name implies, this data focuses on information at the state and county level within the United States.
The United States Census Bureau is well known for gathering information about U.S. citizens, but it also collects and analyzes many other categories of data. You can use the Censtats Databases to find trade data that could help you analyze foreign and domestic markets. You can look at County Business Patterns, both by Standard Industrial Classification and by the North American Industry Classification System, the latter starting in 2003. And you can also look at International Trade Data.
One very useful way to analyze your data is by plotting it geographically. If you don’t have access to a full updated geographic information system, you can download the TIGER map shape files from the United States Census Bureau’s geography website.
A lot of products and services appeal to certain demographic groups more than others. Some television advertisers covet the 18 to 29 year-old group because they tend to spend their disposable income on items that are more fun than practical, but other companies offer services to individuals who are over 50 years of age. Judging the size of each age group, even down to a single year, helps companies estimate the potential reach of their goods and services. This is a Census Bureau site, so you can see links to other data areas, such as Topics by Population or Economy, grouped by Geography, the Library, Data, and also information about the Census Bureau.
Every public company in the United States, meaning every company that offers shares of stock for sale on the Exchange, must file certain documents with the United States Securities and Exchange Commission. These filings include financial accounting statements, commentary on how the numbers in the statements were derived, and disclosures of executive compensation.
The American judicial system covers a wide variety of areas, ranging from court hearings, to corrections, and with many areas in between. The US Department of Justice oversees the country’s programs at the federal level and through the Bureau of Justice Statistics gathers data at the federal and state levels.
No one likes to pay taxes, but the good news is that the filings generate a lot of useful data. In the U.S., the Internal Revenue Service provides data collections through its Tax Statistics service, which you can find on-line through the IRS’s website.
If you do business in or with the United States, it’s important to keep careful track of the country’s economic trends. The Department of Commerce’s Bureau of Economic Analysis gathers statistics relating to the U.S. economy that let you gain useful insights into the state of personal and business economic health in the U.S.
Started in 1997, FedStats is a U.S. government website that aggregates links to and statistics generated by government agencies. The benefit of looking for data through FedStats is that you don’t need to know which agency produced a particular statistic. In addition to the latest news, which you can see in this section here, you also have links to other U.S. government agencies: The Bureau of Economic Analysis, Bureau of Justice Statistics, Bureau of Labor Statistics, and so on.
Education provides the foundation for a productive society. The United States Department of Education gathers statistics from educational institutions around the U.S. and makes them available through the Department’s data and research site, enabling analysts to examine and evaluate education in the U.S.
You can get information on the salaries paid in various fields, examine employment trends, and look through the Occupational Outlook Handbook which looks at the future prospects for a variety of professions. That and there’s a lot of other data available as well.
The Bureau of Transportation Statistics, which is part of the Department of Transportation, gathers statistics on highway, water, rail, and intermodal transportation in the US. If you work in manufacturing, or need a baseline for national and international transportation trends, this website provides the data you need to make good decisions.
The U.S. federal government, like almost all governments around the world, lets its citizens register inventions, trademarks, and other forms of intellectual property to protect those valuable ideas against unauthorized use. In general terms, patents protect processes while trademarks protect words, phrases, and images that are used to identify companies, products, and services. If you want to search for existing patents and trademarks in the U.S., you can go to the U.S. Patent and Trademark Office.
Which gives you information on the world economy, health, such as life expectancy, education, and so on.
CIA, which is the United States Central Intelligence Agency, is tasked with gathering, analyzing, and assessing information about foreign countries. The goal of this gathering is to discover the current conditions in and intentions of countries other than the U.S. As part of its outreach to U.S. citizens, the CIA publishes its World Factbook, which offers basic information about the government, citizens, and economies of countries throughout the world.
The United Nations, or UN, is an international organization that provides a forum for its 193 member states to express their views on relevant issues and to coordinate action. As part of its mission, the UN provides access to national data services and also its own data collections.
Statistics Canada, or StatCan, as it’s known informally, is a Canadian government agency that gathers statistics from industry and government institutions around Canada, and makes those numbers and commentary available on their website.
Eurostat, which is the European Union’s Statistics Agency, provides access to data about the countries of the European Union, and as a subset of those entities, the countries of the Euro zone. The latter group comprises countries such as Germany, France and Finland that have adopted the Euro as their national currency.
The Organisation for Economic Co-operation and Development, or OECD, is an international organization that promotes policies to improve the economic and social well-being of people around the world. Data gathering informs policy creation and evaluation, of course, so the OECD shares its data online for free.
Search engines provide ways to find data based on certain search terms. You might search for stock prices, oil production figures, or other benchmarks you use in your business. One search engine, Quandl, provides both a search interface and a set of curated data collections to streamline the discovery process.
Inforum is the inter-industry forecasting project at the University of Maryland. The Inforum team builds models to forecast future performance of the US and other economies. On their website you can find details of the econometric models they use, the data they work with, and links to software that let you work with their data as well.
Google is an exceptionally popular search engine, but the company also makes public data available through their Google Public Data collection. As of this recording, the collection contains 136 data sets covering a wide variety of economic and technology topics.
Amazon.com is best known as the internet’s book seller of choice, but it also provides Cloud Computing Services such as remote data storage and processing. As part of its operations, Amazon provides access to big data sets in several scientific fields.
Data.gov is the United States Federal Government’s data clearing house, where you can find links to data from all federal websites and many state and local collections, as well. If you’re not sure which government site or bureau has the data you’re looking for, searching at data.gov is a great place to start.
I mentioned Google’s public data service in another part of this course. In this movie I’d like to point you to the Google Ngram Viewer. An Ngram is a series of characters of a given length. For example, a two-character string is a bigram, a three-character string is a trigram, and so on. If you perform linguistic analysis and want to search word usage in books published from 1800 to about 2008, the Google Ngram Viewer, which finds the popularity of certain character strings and words in those books, is a great tool to keep in mind.
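To make the bigram and trigram idea concrete, here is a small sketch that lists every character n-gram of a given length in a string, following the character-based definition given above. It only illustrates the concept and has nothing to do with how the Ngram Viewer itself is built. The function name ngrams and the sample word are arbitrary choices.

```shell
# ngrams TEXT N: print every run of N consecutive characters in TEXT, one per line.
ngrams() {
  text="$1"
  n="$2"
  len=${#text}
  start=1
  # Slide a window of width N across the string, one character at a time.
  while [ $((start + n - 1)) -le "$len" ]; do
    printf '%s\n' "$text" | cut -c "${start}-$((start + n - 1))"
    start=$((start + 1))
  done
}

# Bigrams of "maven": ma, av, ve, en
ngrams "maven" 2
```

Calling ngrams "maven" 3 gives the trigrams mav, ave and ven in the same way.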
The Corpus of Contemporary American English tracks English word usage in books, magazines, television, films and other media.
Version Control is the process of keeping track of your creative output as it evolves over the course of a project or product. It tracks what is changed, it tracks who changes it, and it tracks why it was changed.
As you work, each new version is kept in a simple database-like system. A Version Control system is an application or a set of applications that manage creating a repository, which is the name of the specialized database where your files and history are stored. In some systems, this is in an actual database, and in some systems this is done by creating a set of small change files which are identified by some identifier and managed by the software. The next is working set. Your…