Table 1-1 Does money make people happier?
Country | GDP per Capita (USD) | Life Satisfaction |
---|---|---|
Hungary | 12,240 | 4.9 |
Korea | 27,195 | 5.8 |
France | 37,675 | 6.5 |
Australia | 50,962 | 7.3 |
United States | 55,805 | 7.2 |

A simple linear regression model:

$$\text{life\_satisfaction} = \theta_0 + \theta_1 \times \text{GDP\_per\_capita}$$

This model has two parameters, $\theta_0$ and $\theta_1$.
The goal is to find the parameter values that make the model perform best. To do that, we need to specify a performance measure: either a utility function that measures how good the model is, or a cost function that measures how bad it is. For linear regression, a cost function is typically used; it measures the distance between the model’s predictions and the training examples, and training looks for the parameter values that minimize that distance.
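To make the idea of a cost function concrete, here is a minimal sketch that scores two candidate parameter pairs on the five rows of Table 1-1. Mean squared error is an assumed choice of cost (the text only says the cost measures the distance between predictions and training examples), and the function names and candidate values are hypothetical, chosen purely for illustration.

```python
# Training examples copied from Table 1-1: (GDP per capita in USD, life satisfaction).
TABLE_1_1 = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]


def predict(gdp_per_capita, theta0, theta1):
    """Linear model: life_satisfaction = theta0 + theta1 * GDP_per_capita."""
    return theta0 + theta1 * gdp_per_capita


def mse_cost(theta0, theta1, examples=TABLE_1_1):
    """Mean squared error (an assumed cost): lower means a better fit."""
    squared_errors = [(predict(x, theta0, theta1) - y) ** 2 for x, y in examples]
    return sum(squared_errors) / len(squared_errors)


# Two hypothetical candidates, chosen only for illustration.
flat_cost = mse_cost(theta0=6.0, theta1=0.0)     # ignores GDP entirely
sloped_cost = mse_cost(theta0=4.0, theta1=6e-5)  # rises with GDP

print(f"flat line cost:   {flat_cost:.3f}")
print(f"sloped line cost: {sloped_cost:.3f}")
```

On these five rows the sloped candidate gets a lower cost than the flat one. A training algorithm uses exactly this kind of comparison when it searches for the best parameters.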
We start by feeding the training data into the linear regression algorithm to find the parameters that work best for the data.

    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import sklearn.linear_model

    # Load the data
    oecd_bli = pd.read_csv('oecd_bli_2015.csv', thousands=',')
    gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                                 encoding='latin1', na_values='n/a')

    # Prepare the data (prepare_country_stats is a helper assumed to be defined elsewhere)
    country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
    X = np.c_[country_stats["GDP per capita"]]
    y = np.c_[country_stats["Life Satisfaction"]]

    # Visualize the data
    country_stats.plot(kind='scatter', x="GDP per capita", y='Life Satisfaction')
    plt.show()

    # Select a linear model
    lin_reg_model = sklearn.linear_model.LinearRegression()

    # Train the model
    lin_reg_model.fit(X, y)

    # Make a prediction for Cyprus
    X_new = [[22587]]  # Cyprus' GDP per capita
    print(lin_reg_model.predict(X_new))  # outputs [5.9624338]
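
The CSV files and the `prepare_country_stats` helper are not shown here, so the following self-contained sketch fits the same kind of model to only the five rows of Table 1-1. It uses the closed-form least-squares solution for a single feature rather than scikit-learn, and the function name `fit_simple_linear_regression` is hypothetical. The training set is much smaller than the one used above, so its prediction for Cyprus should not be expected to match the printed output.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (theta0, theta1) minimizing the mean squared error for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# The five rows of Table 1-1.
gdp = [12240, 27195, 37675, 50962, 55805]
life_sat = [4.9, 5.8, 6.5, 7.3, 7.2]

theta0, theta1 = fit_simple_linear_regression(gdp, life_sat)

cyprus_gdp = 22587  # Cyprus' GDP per capita, as in the code above
cyprus_prediction = theta0 + theta1 * cyprus_gdp

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3e}")
print(f"predicted life satisfaction for Cyprus: {cyprus_prediction:.2f}")
```

The closed form works here because there is only one input feature. With several features, a library estimator such as `LinearRegression` is the more practical route.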