Today, we are going to use machine learning on the famous
MNIST (Modified National Institute of Standards and Technology) dataset. This dataset covers computer vision fundamentals: recognizing handwritten digits and classifying each image as the digit it represents.
We will be using the Random Forest Classifier algorithm in Python 3, and the dataset will be taken from the Kaggle competition found here. The usual packages, pandas and scikit-learn, are required, and the dataset must be in the usual CSV format.
This is just the basic setup of the machine learning process, and there is always room for improvement toward a more accurate classification, for example by analyzing the features or trying other classification algorithms.
For this example, we only need 3 packages and they are the following:
- pandas – for importing our training and testing datasets (CSV format)
- train_test_split from sklearn.model_selection – used to split our training dataset into two portions, so our model can learn from one portion with the supervised answers and be checked against the other.
- RandomForestClassifier from sklearn.ensemble – our classification algorithm that will be used in the machine learning process for recognizing our digits.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
```
Importing the datasets
In our main block, we import the training and testing datasets. They must be in the same directory as the Python file for the code to work.
```python
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
```
Preparing the datasets and training the model
Since we have to give each of the 28,000 images in the testing dataset an id, we can use Python's range() function to generate a list of ids. Then we build one dataframe with the features only (let us call it 'X') and another dataframe with the supervised answers (let us call it 'Y'). We split these into two portions, with one portion (70%) as the training set and the other portion (30%) as the testing set.
```python
imageIds = list(range(1, 28001))
X = train_df.drop('label', axis=1)
Y = train_df['label']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
```
Classifying the unknown images
Now, we set up the Random Forest Classifier model and train it on the training portion. Then, we can use our model to classify each image in the Kaggle testing dataset we imported at the start.
```python
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, Y_train)
classified = clf.predict(test_df)
```
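Note that the split above creates X_test and Y_test but never uses them; a quick way to sanity-check the model before submitting is to score it on that held-out 30%. A minimal sketch of the idea, using scikit-learn's built-in load_digits dataset as a stand-in since the Kaggle CSVs may not be on hand:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in for the Kaggle training data (8x8 digit images)
digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)

# fraction of held-out images classified correctly
accuracy = clf.score(X_test, Y_test)
print(accuracy)
```

This gives a rough estimate of accuracy locally, without spending a Kaggle submission.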
Exporting the result
Now, we can just export the resultant dataframe as a CSV file, ready for submission to the Kaggle competition. The submission requires only two columns: the image ids we generated earlier, and the digit our model classified each image as.
```python
df = pd.DataFrame()
df['ImageId'] = imageIds
df['Label'] = classified
df.to_csv('submission.csv', index=False)
```
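Before uploading, it can be worth reading the file back to confirm it has exactly the two required columns. A small sketch with a few hypothetical labels standing in for real predictions:

```python
import pandas as pd

# hypothetical predictions, just to illustrate the submission format
df = pd.DataFrame({"ImageId": [1, 2, 3], "Label": [7, 0, 4]})
df.to_csv("submission.csv", index=False)

# read it back and confirm the column layout Kaggle expects
check = pd.read_csv("submission.csv")
print(list(check.columns))
```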
```python
# Digit Recognizer

# import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    # importing the datasets
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    # prepare training dataset
    imageIds = list(range(1, 28001))
    X = train_df.drop('label', axis=1)  # features
    Y = train_df['label']               # supervised answer

    # split dataset into training and testing portions
    # 70% training and 30% testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)

    # use the training set for the random forest classifier
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, Y_train)

    # classify the images in the test dataset
    classified = clf.predict(test_df)

    # create the submission file for the Kaggle competition with the predicted answers
    df = pd.DataFrame()
    df['ImageId'] = imageIds
    df['Label'] = classified
    df.to_csv('submission.csv', index=False)
```
After submitting 'submission.csv' on Kaggle, the model we coded up netted a public score of 0.96342. With more tweaking this could be improved, and it is a good foundation to build on.
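As one example of such tweaking, a small hyperparameter search over the forest's settings can be sketched as follows. This again uses load_digits as a stand-in dataset, and the parameter values in the grid are arbitrary illustrations, not recommended settings:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()

# illustrative grid: try a couple of forest sizes and depth limits
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 20]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)
```

The best-scoring combination could then be plugged back into the script above before retraining on the full Kaggle training set.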