Auto Miles Per Gallon Data Experiments

This notebook runs the experiments and generates the results and figures for the Auto MPG data. We work with a subset of the data, created by:

  1. selecting a subset of columns to suit our problem setting: three continuous (mpg, acceleration, and horsepower) and three categorical attributes (cylinders, model year, and origin)
  2. removing incomplete records
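
The two preparation steps above can be sketched in pandas as follows. This is an illustration only: `raw_df`, its values, and the extra `weight` column are made up here, and the actual script that produced `auto2.csv` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical raw records (values made up for illustration)
raw_df = pd.DataFrame({
    'mpg': [18.0, 15.0, 25.0],
    'cylinders': [8, 8, 4],
    'weight': [3504.0, 3693.0, 2046.0],    # extra column that is not kept
    'horsepower': [130.0, 165.0, np.nan],  # last record is incomplete
    'acceleration': [12.0, 11.5, 19.0],
    'model year': [70, 70, 71],
    'origin': [1, 1, 1],
})

keep_cols = ['mpg', 'cylinders', 'horsepower', 'acceleration', 'model year', 'origin']

# step 1: keep the six attributes; step 2: drop rows with any missing value
prepared_df = raw_df[keep_cols].dropna().reset_index(drop=True)
print(prepared_df)
```

Because the categorical columns contain no missing values, they keep their integer type, which is what the type-based column selection later in the notebook relies on.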


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

# clean up notebook output by removing warnings about future changes
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# our code packaged for easy use
import detect_simpsons_paradox as dsp
In [2]:
# import the prepared copy of the data
auto_df = pd.read_csv('../../../data/auto2.csv')
auto_df.head()
Out[2]:
    mpg  cylinders  horsepower  acceleration  model year  origin
0  18.0          8       130.0          12.0          70       1
1  15.0          8       165.0          11.5          70       1
2  18.0          8       150.0          11.0          70       1
3  16.0          8       150.0          12.0          70       1
4  17.0          8       140.0          10.5          70       1

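From the output above, we can see that the integer columns are the group-by variables and the float columns are the continuous attributes. The detector function will automatically infer these two roles from the column types when it is given only the data frame, as it is below. We still collect the column labels here so that we can label the results.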

In [3]:
groupbyAttrs = auto_df.select_dtypes(include=['int64'])
groupbyAttrs_labels = list(groupbyAttrs)
print(groupbyAttrs_labels)
['cylinders', 'model year', 'origin']
In [4]:
continuousAttrs = auto_df.select_dtypes(include=['float64'])
continuousAttrs_labels = list(continuousAttrs)
print(continuousAttrs_labels)
['mpg', 'horsepower', 'acceleration']

Results

Now we can run our algorithm and print out the results after a little bit of post-processing to improve readability.

In [5]:
# run detection algorithm
result_df = dsp.detect_simpsons_paradox(auto_df)

# Map attribute index to attribute name
result_df['attr1'] = result_df['attr1'].map(lambda x:continuousAttrs_labels[x])
result_df['attr2'] = result_df['attr2'].map(lambda x:continuousAttrs_labels[x])
# sort for easy reading
result_df = result_df.sort_values(['attr1', 'attr2'], ascending=[1, 1])

# data frames print neatly in notebooks
result_df
Out[5]:
     allCorr  attr1         attr2  reverseCorr  groupbyAttr  subgroup
1   0.423329    mpg  acceleration    -0.818873    cylinders         3
3   0.423329    mpg  acceleration    -0.341214    cylinders         6
4   0.423329    mpg  acceleration    -0.050545   model year        75
5   0.423329    mpg  acceleration    -0.051280   model year        79
0  -0.778427    mpg    horsepower     0.620807    cylinders         3
2  -0.778427    mpg    horsepower     0.013135    cylinders         6
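
To make the columns of this table concrete, here is a minimal sketch of the check that each row represents: the correlation over all rows (`allCorr`) has the opposite sign of the correlation within one subgroup (`reverseCorr`). It is not the `dsp` implementation; `find_sign_reversals` and `toy_df` are hypothetical names, the data are made up, and attributes are referred to by name rather than by index.

```python
import numpy as np
import pandas as pd

def find_sign_reversals(df, x, y, group_col):
    """Return one row per subgroup whose x-y correlation flips sign."""
    all_corr = df[x].corr(df[y])  # Pearson correlation over all rows
    rows = []
    for level, sub in df.groupby(group_col):
        sub_corr = sub[x].corr(sub[y])
        # a reversal means the two correlations have opposite signs
        if np.sign(sub_corr) * np.sign(all_corr) < 0:
            rows.append({'allCorr': all_corr, 'attr1': x, 'attr2': y,
                         'reverseCorr': sub_corr, 'groupbyAttr': group_col,
                         'subgroup': level})
    return pd.DataFrame(rows)

# toy data: y rises with x inside each group, but group 'b' sits
# lower and further right, so the pooled trend falls
toy_df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    'y': [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    'g': ['a', 'a', 'a', 'b', 'b', 'b'],
})
reversals = find_sign_reversals(toy_df, 'x', 'y', 'g')
print(reversals)
```

In the toy data both subgroups are flagged, since each has a positive correlation while the pooled correlation is negative. Subgroups with an undefined correlation (for example, a single row) are skipped, because the sign comparison is false for NaN.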

Plotting

We draw a scatter plot of all the data for each combination of group-by attribute and candidate attribute pair in the results above, with points colored by the group-by attribute. On each plot we add the overall trend line (dashed) and a solid trend line for each subgroup in which Simpson’s Paradox occurs.

In [6]:
print(auto_df.cylinders.unique())
print(auto_df['model year'].unique())
[8 4 6 3 5]
[70 71 72 73 74 75 76 77 78 79 80 81 82]
In [7]:
fig = plt.figure()
# one color and marker per cylinder count
colors = {3: 'red', 4: 'blue', 5: 'purple', 6: 'black', 8: 'green'}
markers = {3: 'x', 4: 'o', 5: 's', 6: '*', 8: 'd'}

# one scatter call per group gives one legend entry per cylinder count
for cyl, group in auto_df.groupby('cylinders', sort=False):
    plt.scatter(group['mpg'], group['acceleration'], color=colors[cyl], marker=markers[cyl], label=cyl)

plt.xlabel('mpg',  fontsize=24)
plt.ylabel('acceleration', fontsize=24)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

plt.legend(prop={'size':15})

# Add correlation line
axes = plt.gca()
x = auto_df['mpg']
y = auto_df['acceleration']

m, b = np.polyfit(x, y, 1)
X_plot = np.linspace(axes.get_xlim()[0],axes.get_xlim()[1],100)
plt.plot(X_plot, m*X_plot + b, '--',color='black')

cylinder3 = auto_df[auto_df['cylinders'] ==3]
cylinder6 = auto_df[auto_df['cylinders'] ==6]
x1 = cylinder3['mpg']
y1 = cylinder3['acceleration']

m1, b1 = np.polyfit(x1, y1, 1)
#print(axes.get_xlim()[0])
#print(axes.get_xlim()[1])
X_plot1 = np.linspace(5,48,100)
plt.plot(X_plot1, m1*X_plot1 + b1, '-', color='red')

x2 = cylinder6['mpg']
y2 = cylinder6['acceleration']

m, b = np.polyfit(x2, y2, 1)
X_plot = np.linspace(5,48,100)
plt.plot(X_plot, m*X_plot + b, '-', color='black')

plt.show()

#fig.savefig('auto1.jpg')
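[Figure: acceleration vs. mpg, colored by cylinders, with the overall trend line (dashed) and the trend lines for the 3- and 6-cylinder subgroups (solid)]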
In [8]:
fig = plt.figure()
# one color and marker per model year
colors = {70: 'coral', 71: 'blue', 72: 'purple', 73: 'orange', 74: 'green', 75: 'black', 76: 'grey', 77: 'gold', 78: 'lightgreen', 79: 'red', 80: 'cyan', 81: 'skyblue', 82: 'pink'}
markers = {70: 'x', 71: 'o', 72: 's', 73: '*', 74: 'd', 75: 'v', 76: '^', 77: '<', 78: '>', 79: '1', 80: '2', 81: '3', 82: '4'}

# one scatter call per group gives one legend entry per model year
for year, group in auto_df.groupby('model year', sort=False):
    plt.scatter(group['mpg'], group['acceleration'], color=colors[year], marker=markers[year], label=year)

plt.xlabel('mpg',  fontsize=24)
plt.ylabel('acceleration', fontsize=24)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

plt.legend(prop={'size':15})


# Add correlation line
axes = plt.gca()
x = auto_df['mpg']
y = auto_df['acceleration']

m, b = np.polyfit(x, y, 1)
X_plot = np.linspace(axes.get_xlim()[0],axes.get_xlim()[1],100)
plt.plot(X_plot, m*X_plot + b, '--',color='black')

# subgroups flagged by the detector: model years 75 and 79
year75 = auto_df[auto_df['model year'] == 75]
year79 = auto_df[auto_df['model year'] == 79]
x1 = year75['mpg']
y1 = year75['acceleration']

# line colors match the scatter colors of each year (75 is black, 79 is red)
m1, b1 = np.polyfit(x1, y1, 1)
X_plot1 = np.linspace(5,48,100)
plt.plot(X_plot1, m1*X_plot1 + b1, '-', color='black')

x2 = year79['mpg']
y2 = year79['acceleration']

m, b = np.polyfit(x2, y2, 1)
X_plot = np.linspace(5,48,100)
plt.plot(X_plot, m*X_plot + b, '-', color='red')

plt.show()

#fig.savefig('auto2.jpg')
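[Figure: acceleration vs. mpg, colored by model year, with the overall trend line (dashed) and the trend lines for model years 75 and 79 (solid)]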
In [9]:
fig = plt.figure()
# one color and marker per cylinder count
colors = {3: 'red', 4: 'blue', 5: 'purple', 6: 'black', 8: 'green'}
markers = {3: 'x', 4: 'o', 5: 's', 6: '*', 8: 'd'}

# one scatter call per group gives one legend entry per cylinder count
for cyl, group in auto_df.groupby('cylinders', sort=False):
    plt.scatter(group['mpg'], group['horsepower'], color=colors[cyl], marker=markers[cyl], label=cyl)

plt.xlabel('mpg',  fontsize=24)
plt.ylabel('horsepower', fontsize=24)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

plt.legend(prop={'size':15}, loc = 1)


# Add correlation line
axes = plt.gca()
x = auto_df['mpg']
y = auto_df['horsepower']

m, b = np.polyfit(x, y, 1)
X_plot = np.linspace(axes.get_xlim()[0],axes.get_xlim()[1],100)
plt.plot(X_plot, m*X_plot + b, '--',color='black')

cylinder3 = auto_df[auto_df['cylinders'] ==3]
cylinder6 = auto_df[auto_df['cylinders'] ==6]
x1 = cylinder3['mpg']
y1 = cylinder3['horsepower']

m1, b1 = np.polyfit(x1, y1, 1)
#print(axes.get_xlim()[0])
#print(axes.get_xlim()[1])
X_plot1 = np.linspace(5,48,100)
plt.plot(X_plot1, m1*X_plot1 + b1, '-', color='red')

x2 = cylinder6['mpg']
y2 = cylinder6['horsepower']

m, b = np.polyfit(x2, y2, 1)
X_plot = np.linspace(5,48,100)
plt.plot(X_plot, m*X_plot + b, '-', color='black')

plt.show()

#fig.savefig('auto3.jpg')
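[Figure: horsepower vs. mpg, colored by cylinders, with the overall trend line (dashed) and the trend lines for the 3- and 6-cylinder subgroups (solid)]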
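
The three plotting cells above repeat the same steps with different column names. As a rough sketch of how they could be folded into one helper, the function below draws the grouped scatter plot, the dashed overall trend line, and a solid trend line per flagged subgroup. `plot_group_trends` and `toy_df` are hypothetical names that are not part of `dsp`, and colors come from matplotlib's `tab20` colormap rather than the hand-picked dictionaries used above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_group_trends(df, x, y, group_col, flagged, ax=None):
    """Scatter y against x by group, then add overall and subgroup trend lines."""
    if ax is None:
        ax = plt.gca()
    cmap = plt.get_cmap('tab20')
    group_colors = {}
    # one scatter call per group, so each group gets one legend entry
    for i, (level, sub) in enumerate(df.groupby(group_col)):
        group_colors[level] = cmap(i % cmap.N)
        ax.scatter(sub[x], sub[y], color=group_colors[level], label=level)
    # dashed line: least-squares fit over all rows
    x_grid = np.linspace(df[x].min(), df[x].max(), 100)
    slope, intercept = np.polyfit(df[x], df[y], 1)
    ax.plot(x_grid, slope * x_grid + intercept, '--', color='black')
    # solid lines: fits restricted to each flagged subgroup
    for level in flagged:
        sub = df[df[group_col] == level]
        slope, intercept = np.polyfit(sub[x], sub[y], 1)
        ax.plot(x_grid, slope * x_grid + intercept, '-', color=group_colors[level])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend()
    return ax

# toy data (made up): each group trends upward, the pooled data downward
toy_df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    'y': [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    'g': ['a', 'a', 'a', 'b', 'b', 'b'],
})
ax = plot_group_trends(toy_df, 'x', 'y', 'g', flagged=['a', 'b'])
```

With the notebook's data, a call such as `plot_group_trends(auto_df, 'mpg', 'acceleration', 'cylinders', flagged=[3, 6])` would correspond to the first figure. The sketch draws trend lines over the observed range of `x` instead of the fixed range 5 to 48 used in the cells above, so the axis limits may differ slightly.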