USING SPARK in the HADOOP ECOSYSTEM
by Rich Morrow
Starting EMR on AWS
In this lab, you’ll use Amazon Web Services (AWS) to set up a 3-node Elastic MapReduce (EMR) cluster, which you can then use for any or all of the class exercises. NOTE: This lab covers only how to set the cluster up. To manage costs, you must shut down the cluster at the end of each class day. If you want to run the labs on the cluster, you’ll need to re-run these setup steps at the start of every day (it should only take 5 minutes once you get the hang of it).
Log into the EMR console at AWS. You’ll need to have created an account first, which requires:
- A credit card
- A valid email address
Use the instructions provided by AWS to start up your cluster. Use the following values in place of the AWS ones:
- Cluster Name: use your “sign-in” name (e.g., “lab cluster”, all lowercase)
- Logging: Enable
- S3 Folder: s3://aws-logs-YOUR-ACCT-YOUR-REGION/elasticmapreduce/
- Launch Mode: Cluster
- Vendor: Amazon
- Release: emr-4.4.0 (or whatever is latest)
- Applications: All applications (including Hive, Pig, Hue, Spark, etc)
- Instance Type: xlarge
- Number of instances: 3
- EC2 keypair: you will need to create a keypair if you wish to log in to the nodes
- Permissions: default
- EMR role: EMR_DefaultRole
- EC2 instance profile: EMR_EC2_DefaultRole
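If you prefer the command line, the console settings above map onto a single AWS CLI call. This is a hedged sketch, not the official lab procedure: the keypair name, log bucket, and instance type are placeholders you’d replace with your own values. The sketch builds the command as a string and prints it so you can inspect it before running it in a shell where the AWS CLI is configured:

```shell
#!/bin/sh
# Hypothetical AWS CLI equivalent of the console settings above.
# KeyName, the log bucket, and the instance type are placeholders.
CREATE_CMD='aws emr create-cluster \
  --name "lab cluster" \
  --release-label emr-4.4.0 \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-KEYPAIR \
  --use-default-roles \
  --log-uri s3://aws-logs-YOUR-ACCT-YOUR-REGION/elasticmapreduce/'

# Print the command for review rather than executing it here.
echo "$CREATE_CMD"
```

Note that --use-default-roles selects EMR_DefaultRole and EMR_EC2_DefaultRole, matching the console defaults listed above.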
After pressing the “Create Cluster” button, your cluster will go into “Starting” mode (as in the following screenshot):
It can take 5–15 minutes for the cluster to start (if you chose all applications, it will take closer to 15). This might be a good time to get up and grab a coffee.
Verify that your cluster has gone into “Waiting” mode (as per the screenshot below):
Continue ONLY when your screen shows the “Waiting” status for your cluster.
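The same status check can be done from the command line. This is a hypothetical sketch: j-XXXXXXXXXXXX stands in for your actual cluster ID (shown in the EMR console), and the sketch only prints the command it describes:

```shell
#!/bin/sh
# Hypothetical status poll; replace j-XXXXXXXXXXXX with your cluster ID.
# When run for real, the command prints STARTING, WAITING, etc.
STATUS_CMD='aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX \
  --query "Cluster.Status.State" --output text'

echo "$STATUS_CMD"
```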
Smoke test the cluster
In EMR, running a job is done with “steps”. Because of the special setup involved with EMR, you cannot simply SSH into the master node and run “hadoop jar” commands; instead, you submit “steps”.
First, select “Steps / Add Step” from the EMR interface:
In the dialog that pops up, copy and paste the following into the appropriate fields (“Step Type: Custom Jar” is selected by default):
Jar Location: s3://learn-hadoop/exercises/code/wordcount/WordCount.jar
Arguments: WordCountDriver s3n://learn-hadoop/exercises/data/shakespeare/ s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/
- The “Jar Location” is the path to a JAR file that has already been compiled for you
- In “Arguments”, the first argument (WordCountDriver) is the name of the class containing the main method where Hadoop starts processing
- The second argument (s3n://learn-hadoop/exercises/data/shakespeare/) is the path to the INPUT data in S3 (this is a public data set accessible to everyone)
- The third argument (s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/) is the OUTPUT path (a bucket AND folder in your own AWS account). EMR will create the folder and drop the output there when the job is done; the job will fail if the folder already exists. MAKE ABSOLUTELY SURE YOU INCLUDE THE TRAILING SLASH ON THAT OUTPUT PATH.
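The “Add Step” dialog has a CLI counterpart as well. In this hedged sketch, j-XXXXXXXXXXXX and YOUR-BUCKET are placeholders for your cluster ID and output bucket; the jar path and arguments are the same ones described above, and the sketch prints the command instead of running it:

```shell
#!/bin/sh
# Hypothetical CLI equivalent of the "Add Step" dialog.
# j-XXXXXXXXXXXX and YOUR-BUCKET are placeholders to substitute.
STEP_CMD='aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="WordCount",\
Jar=s3://learn-hadoop/exercises/code/wordcount/WordCount.jar,\
Args=[WordCountDriver,s3n://learn-hadoop/exercises/data/shakespeare/,s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/]'

echo "$STEP_CMD"
```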
All filled out, the “Steps” should look something like this:
Go ahead and click “Add”, then watch for the job to complete. “Job Status” will go from “Pending” to “Running” to “Completed”, and the interface will look like this:
Keep hitting the in-page refresh icon until “Log files” populates (this may take 3–5 minutes). Once you see “Log files”, you can select each one to see the actual logs. When you then browse to your OUTPUT bucket/folder, you’ll see several “part-r-0000x” files (the individual reducer outputs) along with a zero-byte “_SUCCESS” file indicating the job ran OK. You can click each file to download and review its contents.
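You can also list the output files from the command line instead of the S3 browser. YOUR-BUCKET and the folder name are the placeholders from the step arguments above; the sketch prints the command for you to run once the job has completed:

```shell
#!/bin/sh
# Hypothetical output check; YOUR-BUCKET/NONEXISTENT-FOLDER is the
# output path you gave the step. A successful run shows part-r-0000x
# files plus the zero-byte _SUCCESS marker.
LS_CMD='aws s3 ls s3://YOUR-BUCKET/NONEXISTENT-FOLDER/'

echo "$LS_CMD"
```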
Terminate the cluster
You’re done with the “required” part of this lab. You can simply choose “Terminate” for your cluster right now, as in the screenshot below.
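Termination, too, has a one-line CLI form. As before, this is a sketch with j-XXXXXXXXXXXX as a placeholder for your cluster ID, printing the command rather than executing it:

```shell
#!/bin/sh
# Hypothetical CLI termination; replace j-XXXXXXXXXXXX with your cluster ID.
TERM_CMD='aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXX'

echo "$TERM_CMD"
```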
Optional Exercise: Enabling Web Connections
To get full use of the cluster, you’ll want to establish a “tunnel” (secure channel) to the web front-ends like HUE, Spark history, etc. Click “Enable Web Connection”.
This will pop up a set of instructions for establishing the tunnel. On Windows you’ll need to understand PuTTY (and how to convert the “spark-class.pem” key to a PPK file that PuTTY can use); the process is much simpler on Mac/Linux.
The tunnel command should look something like:
ssh -i ~/spark-class.pem -ND 8157 hadoop@[master-node-dns]
Note: Port 8157 is just an open, unused port; you can use any open, unused port for this purpose, but we know that 8157 is free.
Once the tunnel is established, you can access the following web GUIs as though they were local:
HUE interface: http://<your-master-dns>:8888/
Namenode UI: http://<your-master-dns>:50070/
Resource Manager UI: http://<your-master-dns>:8088/
Spark History UI: http://<your-master-dns>:18080/
So, for example, if your Master DNS is “ec2-54-153-36-108.us-west-1.compute.amazonaws.com”, your connection to HUE would be http://ec2-54-153-36-108.us-west-1.compute.amazonaws.com:8888/
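One caveat worth knowing: ssh -D opens a SOCKS proxy rather than a direct port forward, so your browser (or tool) must be pointed at localhost:8157 for these URLs to resolve through the tunnel. As a hypothetical command-line check of the tunnel (using the example Master DNS above as a stand-in for your own), this sketch prints a curl invocation that fetches the HUE page through the SOCKS proxy:

```shell
#!/bin/sh
# Hypothetical tunnel check: fetch the HUE page through the SOCKS proxy.
# Assumes the ssh -ND 8157 tunnel from the previous step is running;
# the master DNS below is the example from the text -- use your own.
CHECK_CMD='curl --socks5-hostname 127.0.0.1:8157 \
  http://ec2-54-153-36-108.us-west-1.compute.amazonaws.com:8888/'

echo "$CHECK_CMD"
```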
If you like, you can play with the HUE interface and explore sections like Pig, the File Browser, Sqoop, etc. Please note that if you plan to run any of the following labs on the EMR cluster, you will need to run them either as “steps” (for standard MapReduce) or by logging into the MASTER node and running them from there (for the Hive, Pig, and Spark labs).
Again, you have the option of just using the HUE interface to submit jobs of various types.
At the very end of every day, please make 100% sure you terminate your clusters on AWS.
STOP HERE — THIS IS THE END OF THE EXERCISE
Using Spark in the Hadoop Ecosystem by Rich Morrow Published by Infinite Skills, 2016, https://www.safaribooksonline.com/library/view/using-spark-in/9781771375658/