Introduction

In this project, we predict the future population of the world using data from the past (1960 to 2017). This is achieved using machine learning and, more specifically, regression.

The data for the total population will be taken from World Bank and will be in a comma separated format (csv). The project will be coded up in Python so knowledge of the pandas and scikit-learn packages will be required.

Overall, our goal is to predict each country's population up to 10 years past the end of the data and save the result in the file ‘future_world_population.csv’.

Resource

To begin, download the zip file below and extract the largest csv file, ‘API_SP.POP.TOTL_DS2_en_csv_v2_10307762.csv’. After putting the file in your working directory, rename it to something like ‘world_population.csv’ for convenience.

  • worldbank.org – Total Population (csv) (1960 – 2017) (Download Link)

Importing Packages

The first step is to import the packages needed for the analysis, which at a minimum are pandas and scikit-learn. In addition, matplotlib is used for the data visualizations, which help us ask questions about the data and decide whether it needs reformatting. Lastly, numpy is used to convert the data to numpy arrays and reshape them into different dimensions.


import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Importing the data

Let's first open the csv file in Microsoft Excel and see the layout of the data.

Header of world_population.csv

Here, we see that the first four rows are not part of the data. Hence, when importing the csv file into pandas we can skip the first four rows.


csv_file = 'world_population.csv'
df = pd.read_csv(csv_file, skiprows=4)
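
To make the effect of skiprows concrete, here is a minimal sketch on an in-memory file. The sample text, the country names and the numbers are all made up for illustration; they only mimic the layout described above (four metadata lines, then the header, with a trailing comma on every row).

```python
import io
import pandas as pd

# Made-up miniature of the file layout: four metadata lines, then the real header.
sample_csv = (
    '"metadata line 1"\n'
    '"metadata line 2"\n'
    '"metadata line 3"\n'
    '"metadata line 4"\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Country A","AAA","100","110",\n'
    '"Country B","BBB","200","220",\n'
)

# skiprows=4 discards the metadata so the fifth line becomes the header row.
sample_df = pd.read_csv(io.StringIO(sample_csv), skiprows=4)
print(sample_df.columns.tolist())
print(sample_df)
```

The trailing comma on each row is what produces an extra, empty 'Unnamed: …' column, which is why such a column shows up in the real dataset as well.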

Data Cleaning

Cleaning Columns

Now that we have successfully imported the file, the data must be cleaned in order to reduce the chance of errors. Here, we check which columns are useful and which are not.


print(df.columns)

This prints the columns of the dataset, commonly known as its variables:

['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', 'Unnamed: 62']

We do not want to keep features that do not help with the analysis. These are 'Country Code', 'Indicator Name', 'Indicator Code' and 'Unnamed: 62'.


columns_to_drop = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 62']
df.drop(columns=columns_to_drop, inplace=True)

The only columns left in the dataframe should be the following:

['Country Name', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']

We now have the country name as the identifier and the population count for every year from 1960 to 2017. This will allow us not only to predict the future population of a specific country or of all countries, but also to produce data visualizations.
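
As an alternative sketch, the same result can be reached by selecting the columns to keep rather than naming the ones to drop. This assumes every year column has a purely numeric label; the toy dataframe and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical one-row frame with the same kinds of columns as the real data.
toy = pd.DataFrame({
    'Country Name': ['Country A'],
    'Country Code': ['AAA'],
    '1960': [100],
    '1961': [110],
    'Unnamed: 4': [None],
})

# Keep the identifier plus every column whose label is all digits (the years).
year_columns = [col for col in toy.columns if col.isdigit()]
trimmed = toy[['Country Name'] + year_columns]
print(trimmed.columns.tolist())
```

This approach keeps working if the number at the end of the 'Unnamed' column changes in a newer download, at the cost of being less explicit about what is removed.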

Renaming Columns

A minor change is to change the first column which is named 'Country Name' to 'Country' for convenience.


df.rename(columns={'Country Name': 'Country'}, inplace=True)

Removing NaN Rows

Lastly, we must check for NaN values and we can use the following to print out which rows contain NaN values.


print(df[df.isnull().any(axis=1)])

This prints to the terminal every row that has at least one missing value.

These rows have to be removed before the machine learning stage. This can be achieved using the following code:


df.dropna(inplace=True)
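
A minimal sketch of both steps on a made-up three-row dataframe shows which rows are flagged and what remains after dropping them. The names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: country 'B' is missing its 1960 value.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    '1960': [100.0, np.nan, 300.0],
    '1961': [110.0, 210.0, 310.0],
})

# Rows with at least one missing value.
rows_with_nan = toy[toy.isna().any(axis=1)]
print(rows_with_nan)

# dropna() removes every such row; here it returns a new frame instead of editing in place.
cleaned = toy.dropna()
print(cleaned)
```

Note that dropna removes a country entirely even if only a single year is missing, so it is worth checking how many rows are lost before moving on.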

Data Visualizations

To get a feel for the data we are working with, we can visualize it. Let's say we want to see how Australia's population has changed over the years.

First, we fetch the row with the population data for Australia:


record = df[df['Country'] == 'Australia']
years = record.columns.tolist()[1:]
population = record.values.tolist()[0][1:]
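
The same extraction can also be sketched by indexing on the country name, which avoids slicing off the first element by position. The dataframe below is a made-up stand-in for df, and converting the year labels to integers is an optional extra step, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    '1960': [100, 200],
    '1961': [110, 220],
})

# Using 'Country' as the index lets .loc pull one country's row as a Series.
series = toy.set_index('Country').loc['A']
years = series.index.astype(int).tolist()  # column labels '1960', '1961' -> integers
population = series.tolist()
print(years, population)
```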

We can plot a scatter plot:

Note: To remove scientific notation from the y axis, FormatStrFormatter must be imported alongside the other import statements.


from matplotlib.ticker import FormatStrFormatter

Scatterplot Code:


record = df[df['Country'] == 'Australia'] # get the row with Australia's population data
years = record.columns.tolist()[1:] # get the years
population = record.values.tolist()[0][1:] # get the population for each year
plt.scatter(years, population)  # plot scatter plot
plt.plot(years, population) # line to connect the points
plt.xticks(rotation='vertical') # rotate x axis labels to vertical so they do not overlap
plt.title('Australia\'s population from 1960 to 2017') # set graph title
plt.xlabel('Year') # set x axis label
plt.ylabel('Total Population') # set y axis label
plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.0f')) # turn off scientific notation
plt.show() # display graph

Scatterplot of Australia's population



Here, we see that since the 1960s, Australia's population has been increasing at a steady pace.

Machine Learning in Python

Now, we try to predict every country's population up to 10 years into the future. Since we want to predict the future population of each country, we make a list of countries to iterate through. We also create a temporary dataframe to store the new future population data.


countries = df['Country'].tolist()
temp_df = pd.DataFrame()

With the country list, we can iterate using a for loop (for country in countries:) and prepare each country's population data for the model. The following lines form the start of the loop body:


record = df[df['Country'] == country].drop(['Country'], axis=1)
record = record.T
record.reset_index(inplace=True)
record.columns = ['Year', 'Population']
X = record['Year'].astype(int)  # years come from column labels (strings), so convert them to integers
Y = record['Population']

This will allow us to train the model for each country's population data:


regressor = LinearRegression()
regressor.fit(np.array(X).reshape(-1,1), Y)
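
To see what the fitted model contains, here is a small sketch on made-up data that grows by exactly 10 per year. The variable names are illustrative; LinearRegression, coef_, intercept_ and predict are the regular scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: population grows by exactly 10 each year.
toy_years = np.array([2000, 2001, 2002, 2003]).reshape(-1, 1)  # 2D: one column, one row per sample
toy_population = np.array([100.0, 110.0, 120.0, 130.0])

model = LinearRegression()
model.fit(toy_years, toy_population)

slope = model.coef_[0]  # estimated change in population per year
prediction_2004 = model.predict(np.array([[2004]]))[0]
print(slope, model.intercept_, prediction_2004)
```

The reshape(-1, 1) call is needed because scikit-learn expects the features as a two-dimensional array, even when there is only one feature (the year).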

After fitting the model with the country's training data, we can now predict the following years (2018 to 2028) using a loop over each year:


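for year in range(2018, 2029):
    # predict the population for a single year and round to a whole number
    future_population = round(regressor.predict(np.array([year]).reshape(-1, 1))[0])
    row = pd.DataFrame([[year, future_population]], columns=['Year', 'Population'])
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    record = pd.concat([record, row], ignore_index=True)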
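
The year-by-year loop can also be sketched as a single vectorized call, since predict accepts many rows at once. The training data below is made up (a population that grows by 10 per year from 2010 to 2017), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data for one country: 1000 in 2010, growing by 10 each year.
train_years = np.arange(2010, 2018)
train_population = 1000 + 10 * (train_years - 2010)
model = LinearRegression().fit(train_years.reshape(-1, 1), train_population)

# Predict all future years in one call instead of looping.
future_years = np.arange(2018, 2029)
predictions = model.predict(future_years.reshape(-1, 1))
future = pd.DataFrame({
    'Year': future_years,
    'Population': np.rint(predictions).astype(int),  # round to whole people
})
print(future)
```

This builds the future rows once per country rather than once per year, which also avoids repeatedly growing a dataframe inside a loop.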

Now, all we have to do is organize the dataframe in order to add to the main dataframe:


record = record.T
new_header = record.iloc[0]
record = record[1:]
record.columns = new_header
record.columns.name = None
record.index = [country]
temp_df = pd.concat([temp_df, record])

Then, once the for loop completes we can export the new main dataframe into a csv file as we planned earlier:


df = temp_df
df.to_csv('future_world_population.csv')
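
As a quick check on the export, the file can be read back in. The sketch below uses a made-up one-row dataframe and a temporary directory; the index_label argument is an optional addition, not part of the code above, that names the otherwise unlabeled first column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical wide dataframe indexed by country, like temp_df.
toy = pd.DataFrame({'2017': [100], '2018': [110]}, index=['Country A'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_population.csv')
    # index_label gives the index column a header in the csv file.
    toy.to_csv(path, index_label='Country')
    # Reading it back with index_col restores the country names as the index.
    restored = pd.read_csv(path, index_col='Country')

print(restored)
```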

Scatterplot of Australia's future population



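In the plot, we see that the predicted population of Australia for the years 2018 to 2028 has been added. The population appears to dip from 2017 to 2018 and then increases linearly until 2028. The dip is most likely an artifact of the model rather than a real decline: a single straight line fitted to all years since 1960 can sit below the most recent observations when growth has been speeding up.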
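
The dip at the boundary between observed and predicted values can be reproduced with synthetic data. The sketch below assumes a population curve that grows faster over time (a simple quadratic, chosen only for illustration and not taken from the World Bank data) and compares the last observed value with the first prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, accelerating "population": a quadratic in the number of years since 1960.
observed_years = np.arange(1960, 2018)
observed_population = (observed_years - 1960) ** 2

model = LinearRegression().fit(observed_years.reshape(-1, 1), observed_population)

last_observed = observed_population[-1]               # value for 2017
predicted_2018 = model.predict(np.array([[2018]]))[0]  # first predicted year

# For accelerating growth, the straight line lies below the latest observations,
# so the first prediction is lower than the last observed value.
print(last_observed, predicted_2018, predicted_2018 < last_observed)
```

If this jump is unwanted, one option is to fit the line to the most recent years only, or to use a model that can follow curvature, though either choice changes the predictions and should be judged against the data.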

Complete Python Code:


# World Population Predictor

# import statements
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

def scatter_plot_aus():
    record = df[df['Country'] == 'Australia'] # get the row with Australia's population data
    years = record.columns.tolist()[1:] # get the years
    population = record.values.tolist()[0][1:] # get the population for each year
    plt.scatter(years, population)  # plot scatter plot
    plt.plot(years, population) # line to connect the points
    plt.xticks(rotation='vertical') # rotate x axis labels to vertical so they do not overlap
    plt.title('Australia\'s population from 1960 to 2017') # set graph title
    plt.xlabel('Year') # set x axis label
    plt.ylabel('Total Population') # set y axis label
    plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.0f')) # turn off scientific notation
    plt.show() # display graph
    
def scatter_plot_future_aus():
    record = df[df.index == 'Australia']
    record.columns = record.columns.astype(str)
    years = record.columns.tolist()
    population = record.values.tolist()[0]
    plt.scatter(years, population)
    plt.plot(years, population)
    plt.xticks(rotation='vertical') 
    plt.title('Australia\'s future population from 1960 to 2028')
    plt.xlabel('Year')
    plt.ylabel('Total Population')
    plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
    plt.show()    
    
if __name__ == "__main__":
    # importing the data
    csv_file = 'world_population.csv'
    df = pd.read_csv(csv_file, skiprows=4)
 
    # drop unneeded columns
    columns_to_drop = ['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 62']
    df.drop(columns=columns_to_drop, inplace=True)
   
    # rename columns
    df.rename(columns={'Country Name': 'Country'}, inplace=True)

    # drop rows with NaN values
    df.dropna(inplace=True)

    # plot Australia's historical population from 1960 to 2017
    scatter_plot_aus()

    # machine learning
    countries = df['Country'].tolist()
    temp_df = pd.DataFrame()
    for country in countries:
        # prepare data for the model
        record = df[df['Country'] == country].drop(['Country'], axis=1)
        record = record.T
        record.reset_index(inplace=True)
        record.columns = ['Year', 'Population']
        X = record['Year'].astype(int)  # years come from column labels (strings), so convert them to integers
        Y = record['Population']
        
        # train the model
        regressor = LinearRegression()
        regressor.fit(np.array(X).reshape(-1,1), Y)
        
        # predict the future population for each year and add it back to the current record
        for year in range(2018, 2029):
            future_population = round(regressor.predict(np.array([year]).reshape(-1, 1))[0])
            row = pd.DataFrame([[year, future_population]], columns=['Year', 'Population'])
            # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
            record = pd.concat([record, row], ignore_index=True)
        
        # change narrow dataframe back to a wide one
        record = record.T
        new_header = record.iloc[0]
        record = record[1:]
        record.columns = new_header
        record.columns.name = None
        record.index = [country]
        temp_df = pd.concat([temp_df, record])
    
    # set new dataframe instead of the original
    df = temp_df
    df.to_csv('future_world_population.csv')
        
    # plot new scatterplot of Australia with the future population data
    scatter_plot_future_aus()

Extra

For more machine learning examples, be sure to check out the machine learning example on the Titanic dataset with this link.