Linear Regression Example

Table 1-1 Does money make people happier?

Country          GDP per Capita (USD)    Life Satisfaction
Hungary          12,240                  4.9
Korea            27,195                  5.8
France           37,675                  6.5
Australia        50,962                  7.3
United States    55,805                  7.2

A simple linear regression model:

life_satisfaction = θ0 + θ1 × GDP_per_capita

This model has two parameters, θ0 (the intercept) and θ1 (the slope).

The goal is to find the parameter values that make the model perform best. To do that, we need to specify a performance measure: either a utility function that measures how good the model is, or a cost function that measures how bad it is. For linear regression, a cost function is typically used. It measures the distance between the model's predictions and the training examples, and the objective is to minimize that distance.
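
As a rough illustration of what a cost function does, the sketch below scores two candidate parameter pairs on the five rows of Table 1-1. It assumes the mean squared error (MSE) as the cost function, which the text above does not specify, and the function names and candidate values are made up for this example.

```python
# Rows of Table 1-1: (GDP per capita in USD, life satisfaction)
TABLE_1_1 = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]


def predict(gdp_per_capita, theta0, theta1):
    """Linear model: life_satisfaction = theta0 + theta1 * GDP_per_capita."""
    return theta0 + theta1 * gdp_per_capita


def mean_squared_error(theta0, theta1, rows):
    """Average squared distance between predictions and observed values."""
    errors = [predict(gdp, theta0, theta1) - satisfaction
              for gdp, satisfaction in rows]
    return sum(e * e for e in errors) / len(rows)


# Two hypothetical candidates: a flat line and a line with a positive slope
flat_cost = mean_squared_error(6.34, 0.0, TABLE_1_1)
sloped_cost = mean_squared_error(4.0, 6e-5, TABLE_1_1)

print("flat line cost:  ", flat_cost)
print("sloped line cost:", sloped_cost)
```

The sloped line yields a much lower cost than the flat one, so under this measure it is the better of the two candidates. Training amounts to searching for the pair with the lowest cost.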

We start by feeding the training data into the linear regression algorithm to find the parameters that work best for the data.

 
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model  # the submodule must be imported explicitly

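# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values='n/a')

# Prepare the data
# prepare_country_stats() is a helper defined elsewhere (not shown here); it is
# expected to merge both tables into a single DataFrame with one row per country.
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
# The column names below assume the OECD indicator is spelled "Life satisfaction"
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y="Life satisfaction")
plt.show()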

# Select a linear model
lin_reg_model = sklearn.linear_model.LinearRegression()

# Train the model
lin_reg_model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(lin_reg_model.predict(X_new))  # outputs approximately [[5.96]]
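
To make the training step less of a black box, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. It is fitted only to the five rows of Table 1-1 rather than the full dataset, so its prediction for Cyprus (roughly 5.5) differs from the value printed above. The names fit_simple_linear_regression and TABLE_1_1 are illustrative and not part of any library.

```python
# Rows of Table 1-1: (GDP per capita in USD, life satisfaction)
TABLE_1_1 = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]


def fit_simple_linear_regression(rows):
    """Return (theta0, theta1) minimizing the squared error for one feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Slope: covariance of x and y divided by the variance of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    s_xx = sum((x - mean_x) ** 2 for x, _ in rows)
    theta1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point (mean_x, mean_y)
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


theta0, theta1 = fit_simple_linear_regression(TABLE_1_1)

# Prediction for Cyprus, using the same GDP per capita as above
cyprus_gdp = 22587
cyprus_prediction = theta0 + theta1 * cyprus_gdp

print("theta0:", theta0)
print("theta1:", theta1)
print("Cyprus prediction:", cyprus_prediction)
```

With only five training rows the fitted line is sensitive to each individual country, which is one reason the full dataset gives a different answer.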