Background Information

This dataset is taken from the Kaggle competition: Titanic: Machine Learning from Disaster

Machine Learning Algorithm used: k-Nearest Neighbors (or k-NN)

Programming language used: Python3

Machine Learning in Python

Import Statements:

First, we need at least the following three packages:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Here, the pandas package allows the Titanic dataset, which is a comma-separated file, to be loaded. From sklearn.model_selection we import the train_test_split function, which splits our dataset into two parts, a training set and a testing set. This must be prepared for the machine learning process.

Finally, the sklearn.neighbors package provides the KNeighborsClassifier, which is the machine learning algorithm we need to predict the survival of the individuals on the Titanic. These are the packages that I needed, but since there are many ways to approach the problem, additional packages may be required.

Ignore Warnings:

To remove the SettingWithCopyWarning when running the code, we can add the line below. This warning occurs because we will be modifying the original dataframe.


pd.options.mode.chained_assignment = None

Function Blocks:


def import_csv(file):
    return pd.read_csv(file)

def drop_cols(df, cols_to_drop):
    df.drop(columns=cols_to_drop, inplace=True)

The first function simply reads a csv file (train.csv or test.csv) and returns a dataframe of its contents. The second drops the specified columns from a dataframe, since some columns may not contribute to the machine learning process. Note that the drop happens in place, so nothing needs to be returned.

Usage:

Reading both the training and testing csv files into pandas dataframes.


train_data = import_csv('train.csv')
test_data = import_csv('test.csv')

Dropping columns that will not contribute to the machine learning process. In my view, the dropped columns are less important than each individual's gender, age and ticket class, because we know that women and children were given priority for the lifeboats.

Social class was also a factor, as wealth would have given the upper and middle classes priority over the lower class. As a side note, the variable test_ids saves all the PassengerId values, since we drop that column for the analysis; it is added back to the final dataset so we can check which individuals survived based on their id.


columns_to_drop = ['PassengerId', 'Name','Ticket','Fare','Embarked', 'SibSp', 'Parch', 'Cabin']
drop_cols(train_data, columns_to_drop)
test_ids = test_data['PassengerId']
drop_cols(test_data, columns_to_drop)

Data Cleaning:

First, we clean up the datasets. The remaining 'Age' column has NaN values where a number should be; we can replace these NaN values with the average of the Age column.


train_avg = train_data['Age'].mean()
train_data.fillna({'Age': train_avg}, inplace=True)
 
test_age = test_data['Age'].mean()
test_data.fillna({'Age': test_age}, inplace=True)

Next, we convert the categorical values in the 'Sex' column into numerical counterparts, assigning 'male' the value 0 and 'female' the value 1. This is necessary because we generally want quantifiable features, which are better suited to the analysis.


train_data['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
test_data['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
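
As a side note, newer pandas releases may warn when replace is called with inplace=True on a single column, since it relies on chained assignment. If such a warning appears, an assignment-based alternative using Series.map performs the same encoding:


train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})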

Machine Learning Process:

Now we move on to training the model. First, we separate the answer column ('Survived') from the features in our training dataset.


X = train_data.drop('Survived', axis=1) # features
Y = train_data['Survived'] # training answer

Then, we pass these variables into the train_test_split function with test_size=0.30, so that 30% of our dataset is held out as testing data.


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)

Now we can train the model by fitting it on our training split. k-NN classifies each sample by a majority vote among its k closest training examples; here we use the two nearest neighbors, hence n_neighbors = 2.


knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X_train, Y_train)
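
The held-out 30% of the split (X_test and Y_test) is not used above, but it offers a quick way to estimate accuracy before submitting to Kaggle. A minimal sketch, including an optional loop to compare a few values of k:


# accuracy on the held-out 30% split
print(knn.score(X_test, Y_test))

# optionally, compare a few k values on the same split
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, Y_train)
    print(k, model.score(X_test, Y_test))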

Now, we can feed in our testing dataset, and for each individual we get an output of whether that individual survived (value of 1) or perished (value of 0).


survived = knn.predict(test_data)

Add the survival column generated above, together with the passenger ids, back to the dataset. This shows whether each individual in the testing dataset would have survived or not, though obviously this is just a prediction.


test_data['PassengerId'] = test_ids
test_data['Survived'] = survived

Now, as per the Kaggle competition requirements, we keep only two columns: 'PassengerId' and 'Survived'. We therefore drop all the other columns, after which the results can be written back to csv format and submitted to the Kaggle competition.


columns_to_drop = ['Pclass', 'Sex', 'Age']
drop_cols(test_data, columns_to_drop)
test_data.to_csv('predict_survival.csv', index=False)

Complete Python Code:


# import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# SettingWithCopyWarning warning removed
# Because we want to modify the original dataframe
pd.options.mode.chained_assignment = None

def import_csv(file):
    return pd.read_csv(file)

def drop_cols(df, cols_to_drop):
    df.drop(columns=cols_to_drop, inplace=True)

if __name__ == "__main__":
    # fetch training and testing datasets
    train_data = import_csv('train.csv')
    test_data = import_csv('test.csv')
    
    # drop unwanted columns
    # save the PassengerIds from test_data first: not a feature, but needed again later
    columns_to_drop = ['PassengerId', 'Name','Ticket','Fare','Embarked', 'SibSp', 'Parch', 'Cabin']
    drop_cols(train_data, columns_to_drop)
    test_ids = test_data['PassengerId']
    drop_cols(test_data, columns_to_drop)
    
    # get the average age and set the NaN age values to the average
    train_avg = train_data['Age'].mean()
    train_data.fillna({'Age': train_avg}, inplace=True)

    test_age = test_data['Age'].mean()
    test_data.fillna({'Age': test_age}, inplace=True)

    # encode Sex values: 'male' -> 0, 'female' -> 1
    train_data['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
    test_data['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
    
    # prepare the features for the machine learning (k-NN)
    X = train_data.drop('Survived', axis=1) # features
    Y = train_data['Survived'] # training answer
    
    # split dataset into training and testing dataset
    # 70% training and 30% testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
    
    # use the training sets for k-NN
    knn = KNeighborsClassifier(n_neighbors = 2)
    knn.fit(X_train, Y_train)
  
    # predict the people who have survived in the test dataset
    survived = knn.predict(test_data)
    
    # add PassengerId and their predicted survival back to test data
    test_data['PassengerId'] = test_ids
    test_data['Survived'] = survived
    
    # drop the other columns, i.e. Pclass, Sex, Age, and write the rest to a csv file
    columns_to_drop = ['Pclass', 'Sex', 'Age']
    drop_cols(test_data, columns_to_drop)
    test_data.to_csv('predict_survival.csv', index=False)

Kaggle Submission Result

After submitting 'predict_survival.csv' to Kaggle, the code above netted a public score of 0.71770.

Improvements

To produce a more accurate result, the features of the dataset should be tuned further, for example by deriving family relations, putting more weight on the wealth of the individuals, or even the distance of each cabin to a lifeboat (if that could be obtained).
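
As a rough sketch of the family-relations idea, the SibSp (siblings/spouses aboard) and Parch (parents/children aboard) columns, which are dropped above, could instead be combined into a single family-size feature before the drop_cols call. Whether this actually improves the score would need testing:


# derive a family-size feature before SibSp and Parch are dropped
for df in (train_data, test_data):
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1    # include the passenger themselves
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int) # flag passengers travelling alone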

Also, I could have added more functions to clean up the code instead of putting everything in the if statement, and added more comments to clarify any code that is still unclear.
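
For instance, the data-cleaning steps (filling missing ages and encoding 'Sex') could be collected into a single helper. A minimal sketch, with the clean_data name being my own choice:


def clean_data(df):
    # fill missing ages with the column average
    df.fillna({'Age': df['Age'].mean()}, inplace=True)
    # encode the categorical Sex column numerically
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    return df

train_data = clean_data(train_data)
test_data = clean_data(test_data)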