Before you start learning machine learning from scratch, there is a piece of advice I would like to give my readers. Machine learning (ML) is a field that needs dedication and the right mindset. We are dealing with data, which is at the core of any machine learning algorithm. If you understand what you need from the data, half your work is done. To help you, I will walk you through a problem that is hosted on Kaggle as a competition. This way, you will learn how Kaggle works and how to write your first machine learning code.
Objectives
- Teach you machine learning
- Get you familiar with the Kaggle platform, so that you are able to take part in competitions.
Let’s start with the basic definition of machine learning. It is the ability to improve behavior based on experience. Machine learning explores algorithms that learn from data and build models from that data.
Now a question arises, “What is a model?”
According to Microsoft,
A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from those data.
According to AWS,
The term ML model refers to the model artifact that is created by the training process.
To my understanding, an ML model is a mathematical equation with computation capabilities that, when trained with data, learns patterns in the data and stores intermediate information so that it can predict or classify whenever we give it unseen data.
I have used many terms here, such as mathematical equation, computation capabilities, training, learning patterns, classification, and prediction. I will explain all of these terms as we go through the process. So now let us learn machine learning from scratch.
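One way to picture this definition is the small sketch below. It assumes a logistic model, which is only one possible choice of equation, and the weights, bias, and feature values are made-up numbers for illustration. In a real model the weights are the "intermediate information" that training would learn from data.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """The model's equation: a weighted sum of the features passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical values: training would normally learn the weights and bias.
weights = [1.5, -0.8]
bias = -0.2
features = [1.0, 3.0]  # one unseen example with two numeric features

probability = predict_probability(features, weights, bias)
label = 1 if probability >= 0.5 else 0  # classification: threshold the probability
print(probability, label)
```

The equation itself never changes. Training only adjusts `weights` and `bias` so that the predictions match the known answers more closely.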
The Process
- Data understanding, Data Preprocessing
- Model Building
- Training
- Model Selection
- Predictions
The steps listed above are the basic steps involved in machine learning. We will take one case study and solve it using machine learning, following each of these steps in turn.
Case Study: Titanic
Here we will start with the problem Titanic – Machine Learning from Disaster. We have to predict survival on the Titanic and get familiar with ML basics. This is a Kaggle competition, so I will guide you on how to get started with a Kaggle competition and learn by doing.
I will give you my understanding of the problem. I suggest you first take a look at the Titanic – Machine Learning from Disaster page and go through the description of the problem. To get started with the problem you can follow this tutorial.
First, you need to join the competition from the competition page. To do this, click the Join Competition button, then click I Understand and Accept.
Now let us move a step at a time and understand the entire process. I’ll start with the data understanding and data preprocessing step.
Data Understanding and Data Preprocessing
For the Titanic problem, we first have to understand the data in the dataset. The question we have to answer in this challenge is, “What sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
A dataset is a collection of data.
To study the data, click on the Data tab.
When you scroll down, you will find three files, namely train.csv, test.csv, and gender_submission.csv. Download the data and open the CSV files. You will find different columns. Let us look closely at each column.
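If you would rather look at the file from code than from a spreadsheet, a minimal sketch with Python's built-in `csv` module could look like the one below. The three rows are made-up stand-ins, not real passengers, so that the sketch runs without a download. It assumes the header row matches the one in the downloaded train.csv.

```python
import csv
import io

# Made-up stand-in rows that reuse the train.csv header.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T-0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T-0003,7.92,,Q
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name (all values are strings)."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))
# With the downloaded file you would instead write:
#     with open("train.csv", newline="") as f:
#         rows = load_rows(f)

columns = list(rows[0].keys())
print(columns)    # the column names from the header row
print(len(rows))  # how many passengers were read
print(rows[0])    # the first passenger as a dict
```

Every value comes back as text, so a column such as Age has to be converted with `float(...)` before doing arithmetic, and an empty string marks a missing value.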
train.csv
Variable | Definition | Key |
---|---|---|
PassengerId | Unique ID of each passenger | 1 – 891 |
Survived | Whether the passenger survived | 0 = did not survive, 1 = survived |
Pclass | Ticket class | 1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class |
Name | Name of the passenger | |
Sex | Gender of the passenger | male, female |
Age | Age of the passenger | |
SibSp | Number of siblings/spouses on the ship | |
Parch | Number of parents/children on the ship | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port from where the passenger boarded the ship | C = Cherbourg, Q = Queenstown, S = Southampton |
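The coded values in the Key column can be turned back into readable text with a few lookup tables. The sketch below is only an illustration: the function name `describe_passenger` and the example row are made up, and the values are kept as strings because that is how they arrive when read from a CSV file.

```python
# Lookup tables based on the Key column of the table above.
SURVIVED_KEY = {"0": "did not survive", "1": "survived"}
PCLASS_KEY = {"1": "1st Class", "2": "2nd Class", "3": "3rd Class"}
EMBARKED_KEY = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def describe_passenger(row):
    """Turn the coded values of one CSV row into a readable sentence."""
    outcome = SURVIVED_KEY[row["Survived"]]
    ticket_class = PCLASS_KEY[row["Pclass"]]
    # A row may have an empty Embarked value, so fall back to a placeholder.
    port = EMBARKED_KEY.get(row["Embarked"], "unknown port")
    return f"Passenger {row['PassengerId']} ({ticket_class}, boarded at {port}) {outcome}."


# A made-up row, not taken from the real file.
example_row = {"PassengerId": "1", "Survived": "0", "Pclass": "3", "Embarked": "S"}
print(describe_passenger(example_row))
```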
Let us look closely at the train.csv dataset and ask some questions of it.
- Is there a correlation between survival and the sex of the passenger?
- Is there a correlation between survival and the age of the passenger?
- Is there a correlation between survival and the number of siblings or spouses on the ship?
- Is there a correlation between survival and the number of parents/children on the ship?
- Is there a correlation between survival and the fare paid by the passenger?
- Is there a correlation between survival and the port of embarkation?
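A simple first look at questions like these is to compare the survival rate of each group in a column. The sketch below is one hedged way to do that with plain Python. The helper `survival_rate_by` is a hypothetical name, and the toy rows are invented purely so the code runs, so their numbers say nothing about the real passengers.

```python
from collections import defaultdict


def survival_rate_by(rows, column):
    """Return the fraction of survivors for each distinct value in `column`."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += 1
        survivors[key] += int(row["Survived"])  # Survived is "0" or "1"
    return {key: survivors[key] / totals[key] for key in totals}


# Invented toy rows; replace them with the rows read from train.csv.
toy_rows = [
    {"Sex": "female", "Survived": "1"},
    {"Sex": "female", "Survived": "1"},
    {"Sex": "female", "Survived": "0"},
    {"Sex": "male", "Survived": "0"},
    {"Sex": "male", "Survived": "0"},
    {"Sex": "male", "Survived": "1"},
    {"Sex": "male", "Survived": "0"},
]

rates = survival_rate_by(toy_rows, "Sex")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

Grouping like this suits columns with a handful of distinct values, such as Sex, Pclass, or Embarked. Continuous columns such as Age and Fare would first need to be split into ranges.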
My hypothesis for each question is described below. The validity of each hypothesis will be checked after the algorithm has made its predictions.
A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. -Statisticshowto.com
- With this question, we will try to find the relationship between the survival rates of men and women. My hypothesis is that women have a better survival rate than men.
- Young adult passengers have a better survival rate than children and older passengers.
- If a person is well off and has siblings or a spouse on board, then they have a better chance of survival.
- If a person is well off and has parents or children on board, then they have a better chance of survival.
- I think the fare is directly related to the ticket class, so we can do away with this feature.
- I think this question can be used to find the relationship between survival and the port of embarkation.
Now that we have stated our hypotheses, we need to test them for validity, which I will do in my next article.