Introduction

Today, we will apply machine learning to the famous
MNIST (Modified National Institute of Standards and Technology) dataset. This dataset is a computer vision classic: the task is to recognize handwritten digits and classify each image as the digit it represents.

We will be using the Random Forest Classifier algorithm in Python 3, and the dataset will be taken from the Kaggle competition found here. The usual packages such as pandas and scikit-learn are required, and the dataset comes in the usual csv format.

This is just the basic setup of the machine learning process, and there is always room to improve the classification accuracy, for example by analyzing the features or trying other classification algorithms.

Import Packages

For this example, we only need three imports, and they are the following:

  • pandas – for importing our training and testing datasets (csv format)
  • train_test_split from sklearn.model_selection – used to split our training dataset into two portions, so the model can learn from one portion and be validated on the other using the supervised answers.
  • RandomForestClassifier from sklearn.ensemble – the classification algorithm we will use to recognize our digits.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Main Block


if __name__ == "__main__":

Importing the datasets

In our main block, we import the training and testing datasets. They must be in the same directory as the Python file for the code to work.


train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
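As a quick sanity check after loading, it helps to confirm the layout: train.csv has a 'label' column followed by 784 pixel columns (pixel0 .. pixel783), one row per image, while test.csv has only the pixel columns. A minimal sketch using a tiny made-up stand-in for train.csv (the sample values are invented for illustration):

```python
import io
import pandas as pd

# A tiny stand-in for train.csv: a 'label' column followed by pixel columns
sample_csv = io.StringIO(
    "label,pixel0,pixel1,pixel2\n"
    "5,0,128,255\n"
    "0,0,0,64\n"
)
train_df = pd.read_csv(sample_csv)

print(train_df.shape)          # (2, 4) - two images, label + 3 pixels here
print(list(train_df.columns))  # ['label', 'pixel0', 'pixel1', 'pixel2']
```

With the real files, train_df.shape should be (42000, 785) and test_df.shape should be (28000, 784).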

Preparing the datasets and training the model

Since we have to give each of the 28000 images in the testing dataset an id, we can use Python's range() function to generate a list of ids. Then we put the features into one dataframe (let us call it 'X') and the supervised answers, the labels, into another (let us call it 'Y').

Finally, we split these into two portions, with one portion (70%) being the training set and the other portion (30%) being the testing set.


imageIds = list(range(1,28001))
X = train_df.drop('label', axis=1)
Y = train_df['label']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
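To see what train_test_split actually produces, here is a minimal sketch on a small synthetic dataframe (the column names and values are made up; random_state just makes the shuffle repeatable):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 rows of fake pixel features plus a label column
data = pd.DataFrame({
    'pixel0': range(10),
    'pixel1': range(10, 20),
    'label':  [0, 1] * 5,
})
X = data.drop('label', axis=1)  # features only
Y = data['label']               # supervised answers

# 70/30 split, mirroring the tutorial
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=0)

print(len(X_train), len(X_test))  # 7 3
```

The rows of X_train stay aligned with Y_train (and likewise for the test portion), so the supervised answers always match their features.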

Classifying the unknown images

Now, we set up the Random Forest Classifier model and train it on the training portion. Then, we take the Kaggle testing dataset we imported initially and use our model to classify each image.


clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, Y_train)  
classified = clf.predict(test_df)
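Since train_test_split also gave us held-out answers (Y_test), we can estimate accuracy with clf.score before submitting anything to Kaggle. A minimal sketch on synthetic data (the dataset here is invented, so the score is only illustrative of the workflow, not of MNIST performance):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Fake "images": 200 samples, 20 features; labels depend on two features
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])               # train on the first 150 samples

accuracy = clf.score(X[150:], y[150:])  # mean accuracy on held-out rows
print(accuracy)
```

In the tutorial's code, the equivalent call would be clf.score(X_test, Y_test), giving a rough preview of the public leaderboard score.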

Exporting the result

Now, we can just export the resulting dataframe as a csv file, ready for submission to the Kaggle competition. The submission requires only two columns: the image ids we generated earlier, and the digit each image was classified as by our model.


df = pd.DataFrame()
df['ImageId'] = imageIds
df['Label'] = classified
df.to_csv('submission.csv', index=False)
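The submission dataframe can be checked in memory before writing it to disk, by pointing to_csv at a buffer instead of a filename. A small sketch with made-up ids and predictions standing in for imageIds and classified:

```python
import io
import pandas as pd

# Made-up stand-ins for imageIds and classified
imageIds = [1, 2, 3]
classified = [7, 0, 4]

df = pd.DataFrame()
df['ImageId'] = imageIds
df['Label'] = classified

# Write to an in-memory buffer to inspect the exact csv Kaggle will see;
# index=False drops pandas' row index, keeping only the two columns
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```

The first line of the output should be the header "ImageId,Label", followed by one "id,digit" row per image.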

Complete Code:


# Digit Recognizer

# import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    # importing the datasets
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    
    # prepare training dataset
    imageIds = list(range(1,28001))
    X = train_df.drop('label', axis=1) # features
    Y = train_df['label'] # supervised answer

    # split dataset into training and testing dataset
    # 70% training and 30% testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
   
    # use the training sets for random forest classifier
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, Y_train)  
    
    # classify the images in the test dataset
    classified = clf.predict(test_df)
    
    # create submission file for the Kaggle competition with the predicted answers
    df = pd.DataFrame()
    df['ImageId'] = imageIds
    df['Label'] = classified
    df.to_csv('submission.csv', index=False)

Submission Result

After submitting the ‘submission.csv’ on Kaggle, the model we coded up netted a public score of 0.96342. With more tweaking the accuracy could be improved further, and this makes a good foundation to start from.