Recently I used R for my course project on data mining. The course didn’t require that we use R or Python. Instead, it was taught on WEKA. But here’s why I think it should be taught with R or Python in future years.
R is a heavy-duty language – R is a powerful scripting language that helps you handle large, complex data sets. I struggled to run WEKA on a dataset of no more than 5 million. WEKA kept crashing, and the same algorithms ran comparatively faster in R and Python. This is partly because R can run on high-performance computing clusters, which can handle a huge number of processes. And since part of data mining involves creating visualizations to better understand the relationships between attributes, R seemed a more natural fit for a data mining course than WEKA. The thing I liked most was the visualization tooling that R comes with. Its graphs and plots are vivid and eye-catching.
Python is user-friendly – Python, like Java, C, and Perl, is one of the easier languages to grasp. Finding and squashing bugs is easier in Python because it is a scripting language. Moreover, Python is an object-oriented language. Like R, Python is a strong performer. The other good thing is that if you are planning to do some fun projects with something called the Raspberry Pi, then Python is the language to learn.
Hadoop – Hadoop is well suited to huge data. Remember the issue I had with WEKA due to the size of my dataset? That problem can be eliminated by using Hadoop. Hadoop splits the dataset into many chunks, spreads them across the machines in a cluster, runs the analysis on each chunk, and then combines the results. Top companies like Dell, Amazon, and IBM that own terabytes of data have little choice but to use something like Hadoop.
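The split, analyze, and combine idea behind Hadoop can be sketched in a few lines of plain Python. This is only a toy illustration of the MapReduce pattern, not Hadoop itself: everything runs in a single process, and the function names (split_into_chunks, map_chunk, reduce_counts, word_count) and the word-count task are made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(records, chunk_size=2):
    """Split the records, count each chunk separately, then combine the results."""
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each chunk would be processed on a different machine;
    # here the chunks are simply processed one after another.
    partial_counts = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    lines = [
        "R is a scripting language",
        "Python is a scripting language",
        "Hadoop is for huge data",
    ]
    print(word_count(lines).most_common(3))
```

The point of the sketch is that map_chunk never needs to see the whole dataset, which is why the work can be spread over many machines when the data is too big for one.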
You need to learn these three tools at a minimum in order to be a good data scientist and to do a good, thorough analysis of a given dataset.