Iris Data Experiments

This notebook runs the experiments on the Iris data set and generates the figures.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import warnings
import itertools as it

# our code packaged for easy use
import detect_simpsons_paradox as dsp

In [2]:
iris_df = pd.read_csv('../../../data/iris.csv')
iris_df.head()
Out[2]:
sepal length sepal width petal length petal width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

From examining the output above, we know that class is the group-by variable and the float-typed columns are the continuous attributes. First we select the columns accordingly, then print the names of the selected columns of each type to verify.

In [3]:
groupbyAttrs = iris_df.select_dtypes(include=['object'])
groupbyAttrs_labels = list(groupbyAttrs)
print(groupbyAttrs_labels)
['class']
In [4]:
continuousAttrs = iris_df.select_dtypes(include=['float64'])
continuousAttrs_labels = list(continuousAttrs)
print(continuousAttrs_labels)
['sepal length', 'sepal width', 'petal length', 'petal width']

Results

Now we can run our algorithm and print the results after a little post-processing to improve readability.

In [5]:
# run the detection algorithm
result_df = dsp.detect_simpsons_paradox(iris_df)

# map attribute indices to attribute names
result_df['attr1'] = result_df['attr1'].map(lambda x: continuousAttrs_labels[x])
result_df['attr2'] = result_df['attr2'].map(lambda x: continuousAttrs_labels[x])
# sort for easier reading
result_df = result_df.sort_values(['attr1', 'attr2'], ascending=[True, True])

result_df
Out[5]:
allCorr attr1 attr2 reverseCorr groupbyAttr subgroup
0 -0.109369 sepal length sepal width 0.746780 class Iris-setosa
3 -0.109369 sepal length sepal width 0.525911 class Iris-versicolor
6 -0.109369 sepal length sepal width 0.457228 class Iris-virginica
1 -0.420516 sepal width petal length 0.176695 class Iris-setosa
4 -0.420516 sepal width petal length 0.560522 class Iris-versicolor
7 -0.420516 sepal width petal length 0.401045 class Iris-virginica
2 -0.356544 sepal width petal width 0.279973 class Iris-setosa
5 -0.356544 sepal width petal width 0.663999 class Iris-versicolor
8 -0.356544 sepal width petal width 0.537728 class Iris-virginica
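
As a quick sanity check (a minimal sketch using plain pandas, not part of the dsp package), we can recompute one of the reported reversals directly, e.g. for sepal length vs. sepal width:

# recompute the overall and per-class Pearson correlations for one pair
# to confirm the sign reversal reported above (illustrative only)
overall = iris_df['sepal length'].corr(iris_df['sepal width'])
print('overall:', overall)  # negative, matching allCorr above
for name, group in iris_df.groupby('class'):
    # each within-class correlation is positive, matching reverseCorr
    print(name, group['sepal length'].corr(group['sepal width']))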

Plotting

We plot all of the data in scatter plots, for each pair of candidate attributes, colored by the group-by attribute. To each plot we add the overall trend line and a trend line for each occurrence of Simpson's Paradox.

Plotting with the data frame

With the detection results in a data frame, the plots take only a few lines of code.

In [6]:
# build a set to keep only the unique attribute pairs that showed SP
sns.set_context("talk")
sp_meas = set(zip(result_df['attr1'], result_df['attr2']))
for yvar, xvar in sp_meas:
    # per-class scatter plot with one regression line per class
    sns.lmplot(x=xvar, y=yvar, data=iris_df, hue='class', ci=None,
               markers=['x', 'o', 's'], palette="Set1")
    # add a whole-data regression line, but don't cover the scatter data
    sns.regplot(x=xvar, y=yvar, data=iris_df, color='black',
                scatter=False, ci=None)
[Three scatter plots, one per detected attribute pair, each showing the per-class regression lines and the overall (black) trend line.]

Experiments on subsampled data

We run our algorithm on 10%, 30%, 50%, 60%, and 90% subsamples of the data set, with 5 random repeats each, and count the number of detections for each run.

In [7]:
portions = [.1, .3, .5, .6, .9]
num_repeats = 5

# baseline run on the full data set (portion 1, experiment 0)
res_df = dsp.detect_simpsons_paradox(iris_df)
res_df['portion'] = 1
res_df['exp'] = 0

summary_cols = ['portion', 'exp', 'SPcount']
summary_data = [[1, 0, len(res_df)]]

for i, cur_portion in enumerate(np.repeat(portions, num_repeats)):
    # sample and run the detection algorithm
    res_tmp = dsp.detect_simpsons_paradox(iris_df.sample(frac=cur_portion))
    # annotate the results with info on this trial
    res_tmp['portion'] = cur_portion
    res_tmp['exp'] = i + 1
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    res_df = pd.concat([res_df, res_tmp], ignore_index=True)
    summary_data.append([cur_portion, i + 1, len(res_tmp)])

# make a dataframe of the summary stats
summary_df = pd.DataFrame(data=summary_data, columns=summary_cols)

Now we can compute descriptive statistics on the occurrence summary, using the groupby feature.

In [8]:
exp_summary = summary_df.groupby('portion')['SPcount'].describe()
exp_summary
Out[8]:
count mean std min 25% 50% 75% max
portion
0.1 5.0 6.2 1.788854 4.0 6.0 6.0 6.0 9.0
0.3 5.0 9.0 0.707107 8.0 9.0 9.0 9.0 10.0
0.5 5.0 9.0 0.000000 9.0 9.0 9.0 9.0 9.0
0.6 5.0 8.8 0.447214 8.0 9.0 9.0 9.0 9.0
0.9 5.0 9.0 0.000000 9.0 9.0 9.0 9.0 9.0
1.0 1.0 9.0 NaN 9.0 9.0 9.0 9.0 9.0
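
One figure we could add here (a sketch, not part of the original experiments; it assumes summary_df as built above) plots the detection count against the sampled portion, one point per repeat:

# plot the number of SP detections vs. sampled portion, one point per repeat
sns.stripplot(x='portion', y='SPcount', data=summary_df, jitter=True)
plt.ylabel('number of SP detections')
plt.show()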