Hi all, this is a good matrix to keep track of your skill set as you advance in your field.
Hi all, this is a good matrix to keep track of your skill set as you advance in your field.
Computer Scientist’s Checklist
Tip #1: Java is very accessible and all the following are available for free.
The steps you take may slightly vary depending on your familiarity with Java and its tools.
Google search, good blogs and online tutorials are your friends in setting up the above 6 items. Even with 13+ year experience in Java, researching on Google.com is integral part of getting my job done as a Java developer. As an experienced Java developer, I can research things much faster. You will improve your researching skills with time. You will know what key words to search on. If you are stuck, ask your mentor or go to popular forums like Javaranch.com to ask your fellow Java developers.
Tip #2: Start with the basics first.
Enterprise Java has hundreds of frameworks and libraries and it is easy for the beginners to get confused. Once you get to a certain point, you will get a better handle on them, but to get started, stick to the following basic steps. Feel free to make changes as you see fit.
Tip #3: Once you have some familiarity and experience with developing enterprise applications with Java, try contributing to open source projects or if your self-taught project is non trivial, try to open source your self-taught project. You can learn a lot by looking at others’ code.
Tip #4: Look for volunteer work to enhance your hands-on experience. Don’t over commit yourself. Allocate say 2 to 3 days to build a website for a charity or community organization.
Tip #5: Share your hands-on experience gained via tips 1-4 in your resume and through blogging (can be kept private initially). It is vital to capture your experience via blogging. Improve your resume writing and interviewing skills via many handy posts found in this blog or elsewhere on the internet. It is essential that while you are working on the tips 1-5, keep applying for the paid jobs as well.
Tip #6: Voluntary work and other networking opportunities via Java User Groups (JUGs) and graduate trade fairs can put you in touch with the professionals in the industry and open more doors for you. The tips 1-5 will also differentiate you from the other entry level developers. My books and blog has covered lots of Java interview questions and answers. Practice those questions and answers as many employers have initial phone screening and technical tests to ascertain your Java knowledge, mainly in core Java and web development (e.g. stateless HTTP protocol, sessions, cookies, etc). All it takes is to learn 10 Q&A each day while gaining hands-on experience and applying for entry level jobs.
Here’s an excerpt I found very interesting from John Sonmez’s newly released book, ‘the Complete Software Developer’s Career Guide’. It is a good read and I recommend it.
I think I might try this option, and I think you guys might want to give this some thought too.
CREATE YOUR OWN COMPANY
Many people laugh when I tell them this idea of gaining experience when you don’t have any, but it’s perfectly legitimate.
Way more companies than you probably realize are actually run by a single person or a skeleton staff of part-time workers or contractors.
There is absolutely no reason why you cannot create your own software development company, develop an application, sell or distribute that app, and call yourself a software developer working for that company.
You can do this at the same time you are building your portfolio and learning to code.
If I were starting out today, I’d form a small company by filing for an LLC, or even just a DBA (Doing Business As) form (you don’t even need a legal entity), and I’d build an app or two that would be part of my portfolio. Then, I’d publish that app or apps in an app store or sell it online in some way.
I’d set up a small website for my software development company to make it look even more legit.
Then, on my resume, I’d list the company and I’d put my role as software developer.
I want to stress to you that this is in no way lying and it is perfectly legitimate. Too many people think too narrowly and don’t realize how viable and perfectly reasonable of an option this is.
I would not advocate lying in any way.
If you build an application and create your own software development company, there is no reason why you can’t call yourself a software developerfor that company and put that experience on your resume—I don’t care what anyone says.
Now, if you are asked about the company in an interview, you do need to be honest and say it is your own company and that you formed it yourself.
However, you do not need to volunteer this information.
I don’t think being the sole developer of your own software company is a detriment either.
I’d much rather hire a self-starter who formed their own software company, built an app, and put it up for sale than someone who just worked for someone else in their career.
I realize not all employers will think this way, but many will. You’d probably be surprised how many.
In this lab, you’ll use Amazon Web Services to set up a 3 node Elastic MapReduce (EMR) cluster which you can then use for any/all of the class exercises. NOTE: This lab just really covers how to set the cluster up. To manage costs, you must shut down the cluster at the end of the class day. If you want to run the labs in the cluster, you’ll need to re-run these set up steps at the start of every day (should only take 5 mins once you get the hang of it).
Log into the EMR console at AWS. You will have needed to create an account first and for that you will need to provide:
Use the instructions provided by AWS to start up your cluster. Use the following values in place of the AWS ones:
After pressing “Create Cluster” button, your cluster will go into “Starting” mode (as per the following screenshot):
It can take up to 5-15 mins for the cluster to start (if you chose all applications, it will take up to 15). Might be a good time to get up & grab a coffee
Verify that your cluster has gone into “Waiting” mode (as per the screenshot below):
Continue ONLY when your screen looks like above “Waiting” status for cluster
In the EMR running a job is done with “steps”. Because of the special setup involved with EMR, you cannot easily just SSH into the master node and run “hadoop jar” commands. You have to run “steps”.
First, select “Steps / Add Step” from the EMR interface:
In the dialog that pops up, copy and paste the following into the appropriate fields (“Step Type: Custom Jar”) is selected by default:
Jar Location: s3://learn-hadoop/exercises/code/wordcount/WordCount.jar
Arguments: WordCountDriver s3n://learn-hadoop/exercises/data/shakespeare/ s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/
All filled out, the “Steps” should look something like this:
Go ahead & click “Add”, then watch for the job to complete. “Job Status” will go from “Pending” to “Running” to “Completed” and interface will look like:
Keep hitting the “in page” refresh icon until the “Log files” populates (may take 3-5 minutes). Once you see “Log files”, you can select each, to see the actual logs, and when you now browse to OUTPUT-BUCKET/FOLDER, you’ll see several “part-r-0000x” files along with a “_SUCCESS” zero byte file (indicating the job ran OK). These are individual reducer outputs. You can click each to download and review the contents.
You’re done with the “required” part of this lab. You may just choose “Terminate” for your cluster right now as per the screenshot below.
To get full use of the cluster, you’ll want to establish a “tunnel” (secure channel) to the web front-ends like HUE, Spark history, etc. click “Enable Web Connection”
This will pop a set of instructions to establish a tunnel. You’ll need to understand Putty (and how to convert the “spark-class.pem” key to a PPK which Putty can use) to enable on windows, but the process is much more simple for Mac/Linux
the tunnel command should look something like:
ssh -i ~/spark-class.pem –ND 8157 hadoop@[master-node-dns]
ssh -i ~/spark-class.pem –ND 8157 email@example.com
Note: The “8157” port is just an open, unused port… you can use any open, unused port for this purpose, but we know that 8157 is free.
Once the tunnel is established, you can access the following web GUIs as though they were local:
HUE interface: http://<your-master-dns>:8888/
Namenode UI: http://<your-master-dns>:50070/
Resource Manager UI: http://<your-master-dns>:8088/
Spark History UI: http://<your-master-dns>:18080/
So, for example, if your Master DNS is “ec2-54-153-36-108.us-west-1.compute.amazonaws.com”,your connection to HUE would be http://ec2-54-153-36-108.us-west-1.compute.amazonaws.com:8888/
If you like, you can play with the HUE interface, and explore sections like Pig, File Browser, Sqoop, etc. Please note if you plan on running any of the following labs on the EMR cluster, you will need to run them as either “steps” (for standard MapReduce) or you will need to log into the MASTER node and run them from there (for Hive, Pig, Spark labs).
Again, you have the option of just using the HUE interface to submit jobs of various types.
Using Spark in the Hadoop Ecosystem by Rich Morrow Published by Infinite Skills, 2016, https://www.safaribooksonline.com/library/view/using-spark-in/9781771375658/
The downfalls of relational databases such as SQL Server, Oracle, or My SQL is that as more and more data becomes available and companies and organizations want to embark on big dat projects, they’re running into limits around using relational databases.
The first is Scalability. Many companies have projects that are in the gigabytes. This can be very expensive and complex and difficult. Also, some of these big data projects have different kinds of needs around data ingests or speed. Sometimes customers want real time ingests.
And there are other considerations around query ability, application of sophisticated processing like machine learning.
The Hadoop ecosystem is designed to solve a different set of data problems than those of relational databases. One of the core components of Hadoop is an alternate file system called HDFS or Hadoop File System.
Hadoop itself is actually not a database. It is an alternative file system with a processing library.
So, really when you think about bringing Hadoop in as a solution, it’s gonna be in addition to your existing RDBMS, not as a replacement for it.
So, Hadoop itself is most commonly implemented with something called HBase.
Now, so this is based on technology that was developed originally at Google to index the entire internet, which is called the GFS or the Google File System. What Google did about 10 years ago is they wrote a whitepaper on how they created this file system and the open-source community took the information from this whitepaper and made it was part of the basis of Hadoop. So if you hear GFS and HDFS, they are very, very similar implementations.
HBase is a NoSQL database that is very commonly used with Hadoop solutions. It is a wide column store. And what that means is it’s a database that consist of a key and then one to n number of values.
To understand more about the use cases for Hadoop ecosystem we’re going to take a look at what’s called CAP theory or CAP theorem which is a way to understand the different categories and classifications of databases.
The first aspect of CAP theory is the idea of consistency. The concept is that there are certain database solutions that allow for very high data consistency. Another way to think about this is that the solution supports transactions. An example of transaction would be if you had two data modification operations and they were combined as a unit. So withdrawing money, for example, out of a savings account, and then adding that money into a checking account. You would want both of those changes to occur successfully, or neither. Otherwise our data would be inconsistent.
The second aspect of CAP theory is availability. Another way to say that is up-time. What this means is that you have the ability to make copies of the data so that if one copy goes down in one location, the data will still be available for some or all of your users.
The third aspect of CAP theory is portioning. Another way to think about that is scalability. What that means is that you can split your set of data across multiple processing locations or physical machines or virtual machines so that you can continue to grow the amount of data that you work with.
Traditional RDBMS systems are known for having consistency and availability, but have difficulties at the highest levels of partitioning. CAP theory says that the database systems can really only meet two of the three aspects of CAP theory.
This is where Hadoop comes into play because as I mentioned earlier the data that’s becoming available for businesses and other companies is growing larger and larger and larger, so partitioning and the complexities around partitioning and the expense is causing companies to look at database solutions that support that aspect, and maybe they don’t have a need for the other two aspects to be fully implemented.
Hadoop is designed for scalability. Hadoop is designed to run on commodity hardware. So cheap servers, really old servers, I see this very commonly. It is also designed for partitioning in that it makes three copies of the data by default and if any of the copies become bad because the hardware becomes bad or corrupt you can just pull that old hardware out and then put the new hardware in. The Hadoop file system will automatically manage that copy process. This goes to another property of Hadoop, which is flexibility, or availability. Because it runs on commodity hardware you can scale a Hadoop cluster nearly infinitely. And again, if you remember where HDFS file system came from, it makes sense. And by Google, originally to index the entire internet, so it’s infinitely scalable.
The number one user of Hadoop is Yahoo! The number two is Facebook. These companies obviously have huge datasets, and they are taking advantage of the cost saving that they get, scaling all of the data out on commodity hardware. And of course they also want high availability because it’s their business to be online and available all the time.
What I find is that a lot of businesses have wrong information about Hadoop and they think of it as a replacement for a relational database. As an architect, I really haven’t encountered any customers for whom they don’t need some kind of relational database.
This is the world of Big Data projects and let me make some examples of this. This is data that will be batch-processed. So in other words, processed as a group rather than individually-queried. And it’s often a great fit for Hadoop.
Hadoop consists of two components, and oftentimes is deployed with other projects as well. What are those components? The first one is open-source data storage or HDFS which stands for Hadoop File System. The second one is a processing API which is called MapReduce. Most commonly in professional deployments Hadoop includes other projects or libraries, and these are many, many different libraries.
One of the libraries are HBase, Hive, and Pig. In addition to understanding the core components of Hadoop it’s important to understand what are called Hadoop Distributions.
The first set of distributions are 100% open source, and you’ll find those under the Apache Foundation. The core distribution is called Apache Hadoop and there are many, many different versions.
There are commercial versions that wrap around some version of the open source distribution and they will provide additional tooling and monitoring and management along with other libraries. The most popular of these are from companies Cloudera, Hortonworks, and MapR.
In addition to that, it’s quite common for businesses to use Hadoop clusters on the cloud. The cloud distribution that I use most often are from Amazon Web Services or from Microsoft with the Windows Azure HDInsight.
When you’re using a cloud distribution you can use an Amazon Distribution which implements the open source version of Hadoop, so Apache Hadoop on AWS with a particular version, or you can use a commercial version that is implemented on the AWS cloud such as MapR on AWS.
Examples of using Hadoop are as follows. One is customer churn analysis. It costs a lot more to gain a new customer rather than to keep a current one, so it’s in the best interest of many companies to collect as much information as possible. And also behavioral, what were the activities the customer was doing shortly before they left so that they can reduce the amount of customers that re leaving.
Hadoop Solutions make use of behavioral data so that companies can make better decisions.
Facebook is the largest known user of Hadoop or the largest public user of Hadoop. New York Times, Federal Reserve Board, IBM, and Orbitz Travel Company and there are literally hundreds of companies that are making use of Hadoop in augmenting their line of business data with behavioral data to make better decisions.
One of the confusing things about working with the Hadoop ecosystem is there are a tremendous number of parts and pieces, libraries, projects, terms, new words, phrases, it’s really easy to get core concepts misunderstood and one of the concepts that I actually didn’t understand the first, when I was working with Hadoop is Hadoop vs. HBase.
Hadoop core ecosystem consists of two parts. It consists of a new type of a file system, HDFS, and a processing framework, map reduce.
You can see that we’ve got representations of files and there are four files here and as I mentioned in a previous movie, each file by default is replicated three times in the Hadoop ecosystem on three different pieces of commodity hardware. So you can think of them as cheap servers.
I like to say that map reduce to Hadoop is kind of like C++ to object oriented programming. Map reduce is written in java, and customers who are working with Hadoop really don’t want to query or work with Hadoop at that level of abstraction. So one such solution is working with a library HBase. The way that this looks is a wide column store. So you can see on the right here you have a table with one ID column and then a data column. And there’s really no requirement for any particular values in the data column, that’s why it’s called wide column.
A lot of people think that HBase or query language for it, which is called Hive, is actually part of Hadoop. And although it often is in practical implementations, they are two separate things.
Hadoop processes or execution activities run in separate JVMs. JVMs basically a process for executing Java bytecode in an executable program. It’s a little section of the program that runs.
Traditionally in database processing systems state is shared. The different Hadoop processes run in separate JVMs.
The default file system is HDFS, which we talked about in a previous movie, accounts for a larger chunking of the data and is triple replicated by default.
The HDFS file system has two modes for implementation. Fully-distributed, which will give you the three copies or Pseudo-distributed which will use the HDFS File System but is designed for testing and will be implemented in a single node on a single machine.
As an alternative to HDFS you can run Hadoop with the Regular File System. This is called the Standalone mode. And it’s a great way, when you’re just first learning about the MapReduce Programming Paradigm. You’re reducing the complexity by just working with your Regular File System.
Alternatively, when you’re deploying Hadoop to production, particularly if you’re deploying on a public Cloud, it’s really common to use a file system that’s on that Cloud.
For example, in Amazon, the S3 File System or in Azure the BLOG storage. Which is similar to the Standalone mode in that you are not using HDFS, you’re using a Regular file system but choosing a Cloud based file system.
If you deploy Hadoop in Single node you’re going to use the Local file system and a single JVM for all the Hadoop processes.
If you deploy in Pseudo-distributed mode you’re going to use HDFS and the Java daemons are going to run all the processes on a single machine.
If you run in Fully-distributed mode you’re going to use HDFS, it’s going to be triple replicated and the daemons are going to run in various locations depending on where you choose to place them. So you can see in this particular drawing we’re in Fully-distributed mode. We have three separate physical servers. On each server we have various daemons and they’re represented in green. So you can see we’ve got Task Tracker on each one. And then we have a Job Tracker daemon that is actually implemented
This blog is about so many different things. It’s about things that make you go ‘What’! And then when you don’t want to except those things that’s the part of the blog that’s in the second page it’s called “Oh hell no”! Then it’s like “Hold up”, because maybe I didn’t think that through, maybe I do want to know about it. And then it’s like “huh?? Oh okay.” You know what I mean?
I took that from the show Impractical Jokers https://www.youtube.com/watch?v=kgsP_WAFbu0
I think that that’s the kind of conversations developers have in their day-to-day life.
For example, developers have to learn new tools everyday in order to keep up with the changing field. Some ideas may make you have the first part of the conversation, “what! I have to learn this stuff!.” Then you hear how complex it is and it makes you say, “Oh hell no!” But then you realize that it actually makes sense and that it is not as complex, and that makes you go, “Hold up.” But then as you try to learn the tool you get into some obstacles, making you say “huh?” Then you realize how cool the new tool is and how much it simplifies your job more, and that makes you say “oh okay.”
For example, I just learnt a new tool for testing called the JMockit. JMockit is primarily used to mock objects, not an instance of an object, but objects themselves like classes and interfaces. But the problem is that it is not as simple as you would like it to be. It does not conform to object-oriented rules, and thus it feels so unnatural for a developer to use. This is the part where you have the “what! oh hell no!!” conversation with yourself. One thing that I find annoying is that debugging a JMockit unit test can be very difficult, because the internals of how mock object results are returned are not well documented. For example, if arguments to a mock object invocation do not properly implement “equals(Object other)” then your invocation may not match any expected invocation and will return null instead of your intended result. It is very difficult to step through the mock object framework matching code to find the particular argument that is failing to match. But then there’s so many other pros of using it, so you give JMockIt an open mind. That’s when you go “Hold up.” Sure enough JMockit provides well documentation for other functionalities.
In my opinion JMockit is definitely worth learning. However, the author of unheededwarnings.blogspot.com, Richard, advises that a simpler framework be used if available. But he also says that “JMockit is probably the simplest mock framework to use after you master its unusual API.” So the conversation would end in ” Huh? ? Oh okay.”