Hi all, this is a good matrix to keep track of your skill set as you advance in your field.
Computer Scientist’s Checklist
Tip #1: Java is very accessible and all the following are available for free.
The steps you take may slightly vary depending on your familiarity with Java and its tools.
Google search, good blogs, and online tutorials are your friends in setting up the above 6 items. Even with 13+ years of experience in Java, researching on Google.com is an integral part of getting my job done as a Java developer. As an experienced Java developer, I can research things much faster. You will improve your researching skills with time, and you will learn which keywords to search on. If you are stuck, ask your mentor or go to popular forums like Javaranch.com to ask your fellow Java developers.
Tip #2: Start with the basics first.
Enterprise Java has hundreds of frameworks and libraries, and it is easy for beginners to get confused. Once you get to a certain point, you will get a better handle on them, but to get started, stick to the following basic steps. Feel free to make changes as you see fit.
Tip #3: Once you have some familiarity and experience with developing enterprise applications with Java, try contributing to open source projects or, if your self-taught project is non-trivial, try to open source it. You can learn a lot by looking at others’ code.
Tip #4: Look for volunteer work to enhance your hands-on experience. Don’t over commit yourself. Allocate say 2 to 3 days to build a website for a charity or community organization.
Tip #5: Share your hands-on experience gained via tips 1-4 in your resume and through blogging (can be kept private initially). It is vital to capture your experience via blogging. Improve your resume writing and interviewing skills via the many handy posts found in this blog or elsewhere on the internet. It is essential that you keep applying for paid jobs while you are working on tips 1-5.
Tip #6: Voluntary work and other networking opportunities via Java User Groups (JUGs) and graduate trade fairs can put you in touch with professionals in the industry and open more doors for you. Tips 1-5 will also differentiate you from other entry-level developers. My books and blog have covered lots of Java interview questions and answers. Practice those questions and answers, as many employers have initial phone screenings and technical tests to ascertain your Java knowledge, mainly in core Java and web development (e.g. stateless HTTP protocol, sessions, cookies, etc.). All it takes is to learn 10 Q&As each day while gaining hands-on experience and applying for entry-level jobs.
Here’s an excerpt I found very interesting from John Sonmez’s newly released book, ‘the Complete Software Developer’s Career Guide’. It is a good read and I recommend it.
I think I might try this option, and I think you guys might want to give this some thought too.
CREATE YOUR OWN COMPANY
Many people laugh when I tell them this idea of gaining experience when you don’t have any, but it’s perfectly legitimate.
Way more companies than you probably realize are actually run by a single person or a skeleton staff of part-time workers or contractors.
There is absolutely no reason why you cannot create your own software development company, develop an application, sell or distribute that app, and call yourself a software developer working for that company.
You can do this at the same time you are building your portfolio and learning to code.
If I were starting out today, I’d form a small company by filing for an LLC, or even just a DBA (Doing Business As) form (you don’t even need a legal entity), and I’d build an app or two that would be part of my portfolio. Then, I’d publish that app or apps in an app store or sell it online in some way.
I’d set up a small website for my software development company to make it look even more legit.
Then, on my resume, I’d list the company and I’d put my role as software developer.
I want to stress to you that this is in no way lying and it is perfectly legitimate. Too many people think too narrowly and don’t realize how viable and perfectly reasonable of an option this is.
I would not advocate lying in any way.
If you build an application and create your own software development company, there is no reason why you can’t call yourself a software developer for that company and put that experience on your resume. I don’t care what anyone says.
Now, if you are asked about the company in an interview, you do need to be honest and say it is your own company and that you formed it yourself.
However, you do not need to volunteer this information.
I don’t think being the sole developer of your own software company is a detriment either.
I’d much rather hire a self-starter who formed their own software company, built an app, and put it up for sale than someone who just worked for someone else in their career.
I realize not all employers will think this way, but many will. You’d probably be surprised how many.
In this lab, you’ll use Amazon Web Services to set up a 3-node Elastic MapReduce (EMR) cluster which you can then use for any or all of the class exercises. NOTE: This lab only covers how to set the cluster up. To manage costs, you must shut down the cluster at the end of the class day. If you want to run the labs in the cluster, you’ll need to re-run these setup steps at the start of every day (this should only take 5 minutes once you get the hang of it).
Log into the EMR console at AWS. You will have needed to create an account first and for that you will need to provide:
Use the instructions provided by AWS to start up your cluster. Use the following values in place of the AWS ones:
After pressing “Create Cluster” button, your cluster will go into “Starting” mode (as per the following screenshot):
It can take 5-15 minutes for the cluster to start (if you chose all applications, it will take up to 15). This might be a good time to get up and grab a coffee.
Verify that your cluster has gone into “Waiting” mode (as per the screenshot below):
Continue ONLY when your screen shows the “Waiting” status for the cluster, as above.
In EMR, running a job is done with “steps”. Because of the special setup involved with EMR, you cannot easily just SSH into the master node and run “hadoop jar” commands. You have to run “steps”.
First, select “Steps / Add Step” from the EMR interface:
In the dialog that pops up, copy and paste the following into the appropriate fields (“Step Type: Custom Jar” is selected by default):
Jar Location: s3://learn-hadoop/exercises/code/wordcount/WordCount.jar
Arguments: WordCountDriver s3n://learn-hadoop/exercises/data/shakespeare/ s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/
All filled out, the “Steps” should look something like this:
Go ahead and click “Add”, then watch for the job to complete. “Job Status” will go from “Pending” to “Running” to “Completed”, and the interface will look like:
Keep hitting the “in page” refresh icon until the “Log files” populates (may take 3-5 minutes). Once you see “Log files”, you can select each, to see the actual logs, and when you now browse to OUTPUT-BUCKET/FOLDER, you’ll see several “part-r-0000x” files along with a “_SUCCESS” zero byte file (indicating the job ran OK). These are individual reducer outputs. You can click each to download and review the contents.
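The same step can, in principle, be submitted from a script instead of through the console. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials are configured. The helper names build_wordcount_step and submit_step, the step name, the region default, and the cluster ID are placeholders chosen for illustration and are not part of the lab; the jar location and arguments are the ones listed above.

```python
def build_wordcount_step(output_uri, name="WordCount"):
    """Build a step definition that mirrors the console fields above."""
    return {
        "Name": name,  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://learn-hadoop/exercises/code/wordcount/WordCount.jar",
            "Args": [
                "WordCountDriver",
                "s3n://learn-hadoop/exercises/data/shakespeare/",
                output_uri,  # must point at a folder that does not exist yet
            ],
        },
    }


def submit_step(cluster_id, step, region="us-west-1"):
    """Submit the step to a running cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region)  # region is an assumption
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


if __name__ == "__main__":
    step = build_wordcount_step("s3n://YOUR-BUCKET/NONEXISTENT-FOLDER/")
    print(step["HadoopJarStep"]["Args"])
    # submit_step("j-XXXXXXXXXXXXX", step)  # placeholder cluster ID
```

Keeping the step definition separate from the submission call makes it possible to inspect the arguments before anything is sent to AWS.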
You’re done with the “required” part of this lab. You may just choose “Terminate” for your cluster right now as per the screenshot below.
To get full use of the cluster, you’ll want to establish a “tunnel” (secure channel) to the web front-ends like HUE, Spark history, etc. Click “Enable Web Connection”.
This will pop up a set of instructions to establish a tunnel. You’ll need to understand Putty (and how to convert the “spark-class.pem” key to a PPK which Putty can use) to enable this on Windows, but the process is much simpler for Mac/Linux.
The tunnel command should look something like:
ssh -i ~/spark-class.pem -ND 8157 hadoop@[master-node-dns]
Note: The “8157” port is just an open, unused port… you can use any open, unused port for this purpose, but we know that 8157 is free.
Once the tunnel is established, you can access the following web GUIs as though they were local:
HUE interface: http://<your-master-dns>:8888/
Namenode UI: http://<your-master-dns>:50070/
Resource Manager UI: http://<your-master-dns>:8088/
Spark History UI: http://<your-master-dns>:18080/
So, for example, if your Master DNS is “ec2-54-153-36-108.us-west-1.compute.amazonaws.com”, your connection to HUE would be http://ec2-54-153-36-108.us-west-1.compute.amazonaws.com:8888/
If you like, you can play with the HUE interface, and explore sections like Pig, File Browser, Sqoop, etc. Please note that if you plan on running any of the following labs on the EMR cluster, you will need to either run them as “steps” (for standard MapReduce) or log into the MASTER node and run them from there (for Hive, Pig, Spark labs).
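The port list above can be captured in a small helper so the addresses do not have to be typed by hand. This is only an illustrative sketch; web_ui_urls is a hypothetical function name, and the ports are simply the ones listed above.

```python
# Ports taken from the list of web GUIs above.
WEB_UI_PORTS = {
    "HUE interface": 8888,
    "Namenode UI": 50070,
    "Resource Manager UI": 8088,
    "Spark History UI": 18080,
}


def web_ui_urls(master_dns):
    """Return a mapping of UI name to URL for the given master node DNS name."""
    return {
        name: f"http://{master_dns}:{port}/"
        for name, port in WEB_UI_PORTS.items()
    }


if __name__ == "__main__":
    dns = "ec2-54-153-36-108.us-west-1.compute.amazonaws.com"
    for name, url in web_ui_urls(dns).items():
        print(f"{name}: {url}")
```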
Again, you have the option of just using the HUE interface to submit jobs of various types.
Using Spark in the Hadoop Ecosystem by Rich Morrow Published by Infinite Skills, 2016, https://www.safaribooksonline.com/library/view/using-spark-in/9781771375658/
Most US residents know the United States Census Bureau as the government organization that counts the American population every 10 years. In fact, the Census Bureau, which is part of the Department of Commerce, is tasked with gathering a wide range of data for individuals, government entities at the national, state, and local levels, as well as industry.
You can see that on the main page there is an easy link to search for community facts where you can find popular facts such as population or income about a particular community.
It’s easy to think that policy and business decisions are made based on national or international data, but most governments and businesses operate at the state and local level. As the name implies, this data focuses on information at the state and county level within the United States.
The United States Census Bureau is well known for gathering information about U.S. citizens, but it also collects and analyzes many other categories of data. You can use the Censtats Databases to find trade data that could help you analyze foreign and domestic markets. You can look at County Business Patterns, both by Standard Industrial Classification and by the North American Industry Classification System, the latter starting in 2003. And you can also look at International Trade Data.
One very useful way to analyze your data is by plotting it geographically. If you don’t have access to a full updated geographic information system, you can download the TIGER map shape files from the United States Census Bureau’s geography website.
A lot of products and services appeal to certain demographic groups more than others. Some television advertisers covet the 18 to 29 year-old group because they tend to spend their disposable income on items that are more fun than practical, but other companies offer services to individuals who are over 50 years of age. Judging the size of each age group, even down to a single year, helps companies estimate the potential reach of their goods and services. This is a Census Bureau site, so you can see links to other data areas, such as Topics by Population or Economy, grouped by Geography, the Library, Data, and also information about the Census Bureau.
Every public company in the United States, meaning every company that offers shares of stock for sale on a stock exchange, must file certain documents with the United States Securities and Exchange Commission. These filings include financial accounting statements, commentary on how the numbers in the statements were derived, and disclosures of executive compensation.
The American judicial system covers a wide variety of areas, ranging from court hearings, to corrections, and with many areas in between. The US Department of Justice oversees the country’s programs at the federal level and through the Bureau of Justice Statistics gathers data at the federal and state levels.
No one likes to pay taxes, but the good news is that the filings generate a lot of useful data. In the U.S., the Internal Revenue Service provides data collections through its Tax Statistics service, which you can find on-line through the IRS’s website.
If you do business in or with the United States, it’s important to keep careful track of the country’s economic trends. The Department of Commerce’s Bureau of Economic Analysis gathers statistics relating to the U.S. economy that let you gain useful insights into the state of personal and business economic health in the U.S.
Started in 1997, FedStats is a U.S. government website that aggregates links to and statistics generated by government agencies. The benefit of looking for data through FedStats is that you don’t need to know which agency produced a particular statistic. In addition to the latest news, which you can see in this section here, you also have links to other U.S. government agencies: The Bureau of Economic Analysis, Bureau of Justice Statistics, Bureau of Labor Statistics, and so on.
Education provides the foundation for a productive society. The United States Department of Education gathers statistics from educational institutions around the U.S. and makes them available through the Department’s data and research site, enabling analysts to examine and evaluate education in the U.S.
You can get information on the salaries paid in various fields, examine employment trends, and look through the Occupational Outlook Handbook, which looks at the future prospects for a variety of professions. There’s a lot of other data available as well.
The Bureau of Transportation Statistics, which is part of the Department of Transportation, gathers statistics on highway, water, rail, and intermodal transportation in the US. If you work in manufacturing, or need a baseline for national and international transportation trends, this website provides the data you need to make good decisions.
The U.S. federal government, like almost all governments around the world, lets its citizens register inventions, trademarks, and other forms of intellectual property to protect those valuable ideas against unauthorized use. In general terms, patents protect processes while trademarks protect words, phrases, and images that are used to identify companies, products, and services. If you want to search for existing patents and trademarks in the U.S., you can go to the U.S. Patent and Trademark Office.
Which gives you information on the world economy, health, such as life expectancy, education, and so on.
The CIA, which is the United States Central Intelligence Agency, is tasked with gathering, analyzing, and assessing information about foreign countries. The goal of this gathering is to discover the current conditions in and intentions of countries other than the U.S. As part of its outreach to U.S. citizens, the CIA publishes its World Factbook, which offers basic information about the government, citizens, and economies of countries throughout the world.
The United Nations, or UN, is an international organization that provides a forum for its 193 member states to express their views on relevant issues and to coordinate action. As part of its mission, the UN provides access to national data services and also its own data collections.
Statistics Canada, or StatCan, as it’s known informally, is a Canadian government agency that gathers statistics from industry and government institutions around Canada, and makes those numbers and commentary available on their website.
Eurostat, which is the European Union’s Statistics Agency, provides access to data about the countries of the European Union, and as a subset of those entities, the countries of the Euro zone. The latter group comprises countries such as Germany, France, and Finland that have adopted the Euro as their national currency.
The Organisation for Economic Co-operation and Development, or OECD, is an international organization that promotes policies to improve the economic and social well-being of people around the world. Data gathering is important in policy creation and evaluation, of course, so the OECD shares its data online for free.
Search engines provide ways to find data based on certain search terms. You might search for stock prices, oil production figures, or other benchmarks you use in your business. One search engine, Quandl, provides both a search interface and a set of curated data collections to streamline the discovery process.
Inforum is the inter-industry forecasting project at the University of Maryland. The Inforum team builds models to forecast future performance of the US and other economies. On their website you can find details of the econometric models they use, the data they work with, and links to software that lets you work with their data as well.
Google is an exceptionally popular search engine, but the company also makes public data available through their Google Public Data collection. As of this recording, the collection contains 136 data sets covering a wide variety of economic and technology topics.
Amazon.com is best known as the internet’s book seller of choice, but it also provides Cloud Computing Services such as remote data storage and processing. As part of its operations, Amazon provides access to big data sets in several scientific fields.
Data.gov is the United States Federal Government’s data clearing house, where you can find links to data from all federal websites and many state and local collections as well. If you’re not sure which government site or bureau has the data you’re looking for, searching at data.gov is a great place to start.
I mentioned Google’s public data service in another part of this course. In this movie I’d like to point you to the Google Ngram Viewer. An Ngram is a series of characters of a given length. For example, a two-character string is a bigram. A three-character string is a trigram, and so on. If you perform linguistic analysis and want to search word usage in books published from 1800 to about 2008, the Google Ngram Viewer is a great tool to keep in mind. It finds the popularity of certain character strings and words in books that were published from about 1800 to 2008.
The Corpus of Contemporary American English tracks English word usage in books, magazines, television, films, and other media.
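To make the bigram and trigram definitions above concrete, here is a minimal sketch that splits a string into its character n-grams. The function name char_ngrams and the sample strings are made up for illustration; this is not how the Ngram Viewer itself is implemented.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("data", 2))  # ['da', 'at', 'ta']
    print(char_ngrams("data", 3))  # ['dat', 'ata']
    # Counting repeated n-grams is the basis of a popularity measure.
    print(Counter(char_ngrams("banana", 2)))
```

The same slicing works on a list of words instead of a string, which gives word-level n-grams of the kind used when looking up phrases.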
One of the most important parts of working in a data science team is discovering great questions. To ask great questions you have to understand critical thinking. Critical thinking is not about being hostile or disapproving. It’s about finding the critical questions.
These questions will help you shake out the bad assumptions and false conclusions that can keep your team from making real discoveries.
Your team needs to be comfortable in a world of uncertainty, arguments, questions, and reasoning. When you think about it, data science already gives you a lot of the information. You’ll have the reports that show buying trends. There will be terabytes of data on product ratings. Your team needs to take this information and ask the interesting questions that create valuable insights.
A good question will challenge your thinking. It’s not easily dismissed or ignored. It forces you to unravel what you already neatly understood. It requires a lot more work than just listening passively.
The critical in critical thinking is about finding the critical questions. These are the questions that might chip away at the foundation of the idea.
It’s about your ability to pick apart the conclusions that are part of an accepted belief.
Here’s an example:
At the end of the month, the data analysts ran a report that showed a 10% increase in sales. It’s very easy here to make a quick judgment. The lower prices encouraged more people to buy shoes. The higher shoe sales made up for the discounted prices. It looks like the promotions worked. More people bought shoes and the company earned greater revenue. Many teams would leave it at that. Your data science team would want to apply their critical thinking. Remember, it’s not about good or bad; it’s about asking the critical questions. How do we know that the jump in revenue was related to the promotion? Maybe if you hadn’t had the promotion, the same number of people would have bought shoes.
What data would show a strong connection between the promotion and sales? You can even ask the essential questions. Do your promotions work? Would the same number of people have bought shoes? Everyone assumes that promotions work. That’s why many companies have them. Does that mean that they work for your website? These questions open up a whole new area for the research lead. When you just accept that promotions work, then everything is easy. These worked, so let’s do more promotions. Instead, the research lead has to go in a different direction.
How do we show that these promotions work? Should we look at the revenue from the one-day event? Did customers buy things that were on sale? Was it simply a matter of bringing more people into the website? This technique is often called panning for gold. It’s a reference to an early mining technique, in which miners would sift through sand looking for gold. The sand here is all the questions that your team asks. The research lead works with the team to find the gold nuggets that are worth exploring. The point of panning for gold is that you’ll have a lot of wasted material.
There’ll be a lot of sand for every nugget of gold. It takes patience to sift through that many questions. Don’t be afraid to ask the big whys.
That’s why a key part of critical thinking is understanding the reasoning behind these ideas. Reasoning is your beliefs, evidence, experience, and values that support conclusions about the data. It’s important to always keep track of everyone’s reasoning when working on a data science team.
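One way to make the question “would the same number of people have bought shoes anyway?” concrete is to compare the promotion month against a baseline that had no promotion. The sketch below uses made-up numbers purely for illustration. The figures, the lift helper, and the choice of last year’s same-month growth as the baseline are all assumptions, not results from the shoe example.

```python
def lift(before, after):
    """Relative change from before to after, e.g. 0.10 for a 10% increase."""
    return (after - before) / before


if __name__ == "__main__":
    # Hypothetical monthly unit sales, invented for illustration only.
    promo_year = {"previous_month": 1000, "promo_month": 1100}
    prior_year = {"previous_month": 900, "same_month": 972}  # no promotion ran

    observed = lift(promo_year["previous_month"], promo_year["promo_month"])
    seasonal = lift(prior_year["previous_month"], prior_year["same_month"])

    print(f"Observed increase with promotion: {observed:.1%}")
    print(f"Increase in the same month last year: {seasonal:.1%}")
    print(f"Increase not explained by seasonality: {observed - seasonal:.1%}")
```

If most of the increase also shows up in a period without a promotion, the promotion is a weaker explanation than the headline number suggests. A comparison like this is only one possible check, and it raises its own questions about whether the two periods are really comparable.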
Everyone on your team should question their own ideas. Everyone should come up with interesting questions and explore the weak points in their own arguments.
A University of California physicist named Richard Muller spent years arguing against global climate change. Much of his work was funded by the gas and oil industry.
Later, his own research found very strong evidence of global temperature increases. He concluded that he was wrong and that humans are to blame for climate change. Muller saw the facts against him were too strong to ignore, so he changed his mind. He didn’t do it in a quiet way or seem ashamed. Instead he wrote a long op-ed piece in the New York Times that outlined his initial arguments and why his new findings showed that he was wrong. That’s how you should apply critical thinking in your data science team.
This is sometimes called a question-first approach. These meetings are about creating the maximum number of questions.
If you run an effective question meeting, then you’ll likely get a lot of good questions. That’s a good thing. Remember, you want your team to be panning for gold. They should be going through dozens of questions before they find a few that they want to explore. The more ideas you can expose, the better. Then you can decide which ones are best to explore. Just like the early miners who panned for gold, you want to be able to sort out the gold from the sand. You want to know how to separate good questions from those that you can leave behind.
You don’t want your team asking too many open questions. It’ll make everyone spend too much time questioning and not enough time sorting through the data. On the flip side, you don’t want the team asking too many closed-ended questions. Then the team will spend too much time asking smaller, easier questions without looking at the big picture. Once you’ve identified whether your question is open or closed, you’ll want to figure out if it’s essential. When it’s essential, it gets to the essence of an assumption, idea, or challenge.
Below them, you can use yellow notes for non-essential questions. Remember that these are questions that address smaller issues. They’re usually closed questions with a quicker answer. Finally, you can use white or purple stickies for results. These are little data points that the team discovered which might help address the question. There are five benefits to having a question wall. This will help your team stay organized and even prioritize their highest-value questions.
Remember, data science is using the scientific method to explore your data. That means that most of your data science will be empirical. Your team will ask a few questions, then gather the data, then they’ll react to the data and ask a series of new questions.
When you use a question tree it will reflect how the team has learned. At the same time, it will show the rest of the organization your progress.
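A question tree like this can also be kept in a simple data structure alongside the physical wall. The sketch below is one possible shape, assuming each question records whether it is essential, whether it is open or closed, any results found, and its follow-up questions. The class name Question, its fields, and the sample result text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    essential: bool = False   # essential vs. non-essential question
    open_ended: bool = True   # open vs. closed question
    results: List[str] = field(default_factory=list)          # data points found
    children: List["Question"] = field(default_factory=list)  # follow-up questions

    def add(self, child: "Question") -> "Question":
        """Attach a follow-up question and return it for further chaining."""
        self.children.append(child)
        return child

    def render(self, depth: int = 0) -> List[str]:
        """Return the tree as indented lines: question, its results, then children."""
        tag = "essential" if self.essential else "non-essential"
        lines = ["  " * depth + f"[{tag}] {self.text}"]
        lines += ["  " * (depth + 1) + f"(result) {r}" for r in self.results]
        for child in self.children:
            lines += child.render(depth + 1)
        return lines


if __name__ == "__main__":
    root = Question("Do your promotions work?", essential=True)
    follow_up = root.add(
        Question("Did customers buy things that were on sale?", open_ended=False)
    )
    follow_up.results.append("placeholder for a finding")  # hypothetical result
    print("\n".join(root.render()))
```

Printing the tree after each question meeting gives a lightweight record of how the questions branched over time.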
You want to focus your questions on six key areas. The six key areas are questions that clarify key terms, root out assumptions, find errors, see other causes, uncover misleading statistics, and highlight missing data. If you discuss these six areas, then you’re bound to come up with at least a few questions.
You need to carefully look at the reasoning behind your ideas and then question it. That way you’ll have a better understanding of everyone’s ideas.
Remember that correlation doesn’t necessarily mean causation. The key is to focus on identifying where they are. An assumption that’s accepted as fact might cause a chain reaction of flawed reasoning. Also keep in mind that an assumption isn’t just an error to be corrected. It’s more like an avenue to explore.
There are key phrases that you might want to clarify. There are also assumptions which might connect incorrect reasoning to false conclusions. Once you peel back these assumptions and clarify the language, you should be left with the bare reasoning. In many ways, now you’re asking more difficult questions.
In fact, your data science team might be one of the only groups in the organization that’s interested in questioning well established facts. When you’re in a data science team, each time you encounter a fact, you should start with three questions. Should we believe it? Is there evidence to support it? How good is the evidence? Evidence is well established data that you can use to prove a larger fact. Still, you shouldn’t just think of evidence as proving or disproving the facts. Instead, try to think of the evidence as being stronger or weaker.
The important thing to remember is that facts are not always chiseled in marble. Facts can change as the evidence gets stronger or weaker. When you’re working in a data science team, don’t be afraid to question the evidence. Often it will be a great source of new insights.
It’s easy to say that correlation doesn’t imply causation. It’s not always easy to see it in practice. Often you see cause and effect and there’s no reason to question how they relate. Sometimes it’s difficult to see that an outcome that happens after something is different from an outcome that happens because of something.
If they don’t make sense then you should investigate the connection. Some of your best questions might come from eliminating these rival causes and finding an actual cause.
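A rival cause can be illustrated with a small simulation. In the sketch below, two outcomes never influence each other, yet they are strongly correlated because both depend on a hidden third factor. All names and numbers are invented for illustration, and subtracting the hidden factor directly is a shortcut that only works because the simulation knows exactly how the data was generated.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=2000, seed=0):
    """Simulate two outcomes that share a hidden cause but do not affect each other."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]  # the rival cause
    a = [h + rng.gauss(0, 0.5) for h in hidden]   # outcome A
    b = [h + rng.gauss(0, 0.5) for h in hidden]   # outcome B
    return hidden, a, b


if __name__ == "__main__":
    hidden, a, b = simulate()
    print("Correlation of A and B:", round(pearson(a, b), 2))
    # Remove the hidden cause from both outcomes and check again.
    a_rest = [x - h for x, h in zip(a, hidden)]
    b_rest = [y - h for y, h in zip(b, hidden)]
    print("After removing the hidden cause:", round(pearson(a_rest, b_rest), 2))
```

The first correlation is high and the second is close to zero, which is the pattern to look for when a rival cause, rather than a direct link, explains why two measures move together.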
When you’re in a question meeting, your team should closely evaluate statistical data. They should question the data and be skeptical of statistics from outside the team. Suppose a person on your data science team suggests that as many as half of your customers run with their friends. The best way to sort this out is to separate the statistic from the story. With the running shoe website, you had two stories. One says that the customer likes their friends to save money. The other says that customers run with their friends.
The first thing is to try and understand the reason that information is missing. Maybe there was no time or limited space in their report.
Questions are at the heart of getting insights from your data. It takes courage to ask a good question.