Sunday, 19 August 2018

Python - Anaconda & Jupyter notebooks installation

The first step in this journey is to learn Python. As mentioned in my earlier post, Python is one of the easiest programming languages to learn. It has lots of features that make a programmer's life easy.

Conda install


We will use the Anaconda distribution to install the latest version of Python and the associated ML / data science libraries. It is easy to set up on all platforms (Windows / Mac / Linux) and easy to use as well.

Visit the Conda link, download the appropriate version of the Anaconda distribution, and follow the on-screen instructions to set up and install Python.


Anaconda command prompt

Once the Anaconda distribution is installed, you can launch Jupyter notebooks from the command prompt.
On Windows, go to the Start menu and type 'Anaconda prompt' to open a command prompt.
Within the command prompt, type 'jupyter notebook'.


You can launch Jupyter Notebook from any folder on your system. Once you launch it from a particular folder, you can open files from and save files to that location. A new browser window will open, displaying the contents of the directory. From there you can also see which version of Python is installed and create new notebooks accordingly.
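To confirm from inside a notebook which Python version is running and which folder the notebook is working in, a cell along the following lines can help. This is only a minimal sketch using the standard library; the variable names are my own and purely illustrative.

```python
import os
import sys

# Version of the Python interpreter backing this notebook, as "major.minor.micro"
python_version = ".".join(str(part) for part in sys.version_info[:3])

# Current working directory of the kernel; relative file paths are resolved against it
working_directory = os.getcwd()

print("Python version:", python_version)
print("Working directory:", working_directory)
```

Run the cell with Shift+Enter; the two printed lines should match the Python version and folder you expect.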


Hello world program


Your first Hello World program is straightforward and easy to execute. Type the following statement into a notebook cell and run it:

print('Hello World')

You can rename the notebook, set up keyboard shortcuts, save your work, execute statements, and so on. Feel free to explore the notebook interface to get a feel for the working environment.
Explore the documentation for more details.

In the next post, we will start exploring Python as a programming language. Further down the line, we will move on to the data science aspects of Python.

Sunday, 12 August 2018

Tracing my steps

My journey to become a Data Scientist began when I started learning Python. Having been a database developer for almost 10 years, I had heard different people discuss various languages and the demand for them in the software market. I noticed that Python was mentioned more and more often in those conversations, and it kept coming up as a 'nice to have' skill during job searches. I had always assumed that Python would be incredibly difficult to learn, since it is one of the fastest-growing and most in-demand languages in the market. I could not have been further from the truth.


If you also think that Python is tough, or that learning and mastering it will take a considerable amount of effort, now is the time to wipe away those fears! It is one of the easiest languages to learn. Like every programming language, Python is a vast ocean. However, you need not learn everything about it. There are certain parts that we will have to concentrate on more, specifically the parts that deal with Data Science / Machine Learning. After learning the parts of Python that are relevant to data science, we will move on to the next topic - Statistics.


Most of us learnt the basics of statistics in school - mean, median, mode, etc. However, it is necessary to brush up on statistics to master data science. Along with the basics, we will have to learn about different distributions, F-statistics, p-values, and so on. To work with data, one must understand statistics well enough to know which features are important and how to derive meaningful results from the available data. It also helps to understand the mathematics behind a machine learning algorithm, since that makes it easier to tune the algorithm and reach the desired result.
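As a quick refresher, the basic measures mentioned above can be computed with Python's built-in statistics module. This is just a small sketch; the sample values are made up purely for illustration.

```python
import statistics

# Made-up sample values, used only to illustrate the three measures
sample = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(sample)      # sum of the values divided by their count
median_value = statistics.median(sample)  # middle value (average of the two middle values here)
mode_value = statistics.mode(sample)      # most frequently occurring value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

Notice that the mean and the median differ for this sample: the single large value pulls the mean upwards, while the median is less affected by it.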


The next step involves studying different machine learning models - regression, KNN, k-means, decision trees and random forests, PCA, recommender systems, etc. Python has numerous libraries for machine learning; mastering them will be a major milestone in your data science journey.


In addition to the things mentioned above, one has to know big data technologies such as (Py)Spark, Sqoop, and Scala to deploy code onto clusters and perform data analysis, or to train and test models appropriately.

Since I have worked with database technologies to extract, transform, and load data, I did not spend much time learning SQL, big data, or associated concepts. However, it is highly advisable to have at least a basic knowledge of relational databases such as Oracle / MS SQL, as well as big data technologies such as Hadoop, Spark, and Sqoop.

Last but not least, learning all of the above will not automatically make you a data scientist. One has to practice on real-world datasets as well. Many websites and forums provide large, real-world datasets that can be used to train our models and hone our skills. Some companies also use these online platforms to host competitions and recruit high-performing talent.

Get ready to travel with me on this long and interesting road to becoming a data scientist!