My journey to become a Data Scientist began when I started learning Python. Having been a database developer for almost 10 years, I had heard colleagues discuss different languages and the demand for them in the software market. I noticed that Python was mentioned more and more often in those conversations, and it kept coming up as a 'nice to have skill' during job searches. I had always assumed that Python would be incredibly difficult to learn, since it is one of the fastest growing and most in-demand languages in the market. I could not have been further from the truth.
If you are thinking along the same lines, that Python is tough or that learning and mastering it will take a considerable amount of effort, then now is the time to wipe away those fears! It is one of the easiest languages to learn. Like every programming language, Python is a vast ocean. However, you need not learn everything about it. There are certain parts we will have to concentrate on more - specifically the parts that deal with Data Science / Machine Learning. After learning the Python that is relevant to data science, we will move on to the next relevant topic - Statistics.
We all might have learnt the basics of statistics during our schooling - basics like mean, median, mode, etc. However, it is absolutely necessary to brush up our statistics knowledge to master data science. Along with the basics, we will have to learn about different distributions, F-statistics, p-values, and so on. To work with data, one must understand enough statistics to know which features are important and how to derive meaningful results from the available data. It also helps to understand the mathematics behind a machine learning algorithm, because that makes it easier to tune the algorithm and reach the desired result.
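As a quick refresher on those basics, here is a minimal sketch using Python's built-in statistics module. The scores list is a made-up sample used only for illustration, not data from any real source.

```python
import statistics

# Hypothetical sample: exam scores invented purely for illustration.
scores = [72, 85, 85, 90, 64, 78, 85, 91]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean_score}, median={median_score}, mode={mode_score}")
```

For larger datasets you would more likely reach for a data science library, but the built-in module is enough to check your understanding of the definitions.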
The next step involves studying different machine learning models - regression, KNN, k-means, decision trees and random forests, PCA, recommender systems, etc. Python has numerous libraries associated with machine learning; mastering them will be a major milestone in your data science journey.
In addition to the things mentioned above, one has to know big data technologies like (Py)Spark, Sqoop and Scala to deploy code onto clusters, perform data analysis, and train and test models at scale.
Since I have worked on database-related technologies to extract, transform and load data, I did not spend too much time learning SQL, big data or the associated concepts. However, it is highly advisable to have at least a basic knowledge of relational databases such as Oracle / MS SQL, as well as big data technologies such as Hadoop, Spark and Sqoop.
Last but not least - learning all of the above will not automatically make you a data scientist. One has to practice on real-world datasets as well. There are many websites and forums that provide large, impactful, real-world datasets, which can be used to train our models and hone our skills. Some companies also use these online platforms to host competitions and recruit high-performing talent.
Get ready to travel with me on this long and interesting road to becoming a data scientist!
Nice read! I have already started learning Python since it's a "nice to have skill" in the job market. Looking forward to more :)
Thanks! If you are starting fresh with Python, I would recommend the book 'Learn to Program with Python' by Irv Kalb. It provides a nice overview of the basic functionality in Python. Note, however, that it does not cover the Data Science side of Python. Once you are familiar with the Python basics, you can concentrate more on Python's Data Science libraries.
Will have a detailed post up by today or tomorrow, since I was out last weekend.