Iris Data Experiments¶
This notebook runs the experiments on the Iris data set and generates the corresponding figures.
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import warnings
import itertools as it
# our code packaged for easy use
import detect_simpsons_paradox as dsp
In [2]:
iris_df = pd.read_csv('../../../data/iris.csv')
iris_df.head()
Out[2]:
| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
From examining the above, we know that the class column is the group-by variable and the float-type columns are the continuous attributes. First we select the columns accordingly and print the names of the selected columns of each type to verify.
In [3]:
groupbyAttrs = iris_df.select_dtypes(include=['object'])
groupbyAttrs_labels = list(groupbyAttrs)
print(groupbyAttrs_labels)
['class']
In [4]:
continuousAttrs = iris_df.select_dtypes(include=['float64'])
continuousAttrs_labels = list(continuousAttrs)
print(continuousAttrs_labels)
['sepal length', 'sepal width', 'petal length', 'petal width']
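As a quick illustration (a minimal sketch, not part of the pipeline; it assumes the detector compares all unordered pairs of continuous attributes, which is consistent with the attr1/attr2 columns in the results below), we can enumerate the candidate attribute pairs with itertools, which is already imported:

# illustrative only: the six unordered pairs of continuous attributes
# that are candidates for a correlation reversal
list(it.combinations(continuousAttrs_labels, 2))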
Results¶
Now we can run our algorithm and print out the results after a little bit of post-processing to improve readability.
In [5]:
# run detection algorithm
result_df = dsp.detect_simpsons_paradox(iris_df)
# Map attribute index to attribute name
result_df['attr1'] = result_df['attr1'].map(lambda x: continuousAttrs_labels[x])
result_df['attr2'] = result_df['attr2'].map(lambda x: continuousAttrs_labels[x])
# sort for easy reading
result_df = result_df.sort_values(['attr1', 'attr2'], ascending=[True, True])
result_df
Out[5]:
| | allCorr | attr1 | attr2 | reverseCorr | groupbyAttr | subgroup |
|---|---|---|---|---|---|---|
| 0 | -0.109369 | sepal length | sepal width | 0.746780 | class | Iris-setosa |
| 3 | -0.109369 | sepal length | sepal width | 0.525911 | class | Iris-versicolor |
| 6 | -0.109369 | sepal length | sepal width | 0.457228 | class | Iris-virginica |
| 1 | -0.420516 | sepal width | petal length | 0.176695 | class | Iris-setosa |
| 4 | -0.420516 | sepal width | petal length | 0.560522 | class | Iris-versicolor |
| 7 | -0.420516 | sepal width | petal length | 0.401045 | class | Iris-virginica |
| 2 | -0.356544 | sepal width | petal width | 0.279973 | class | Iris-setosa |
| 5 | -0.356544 | sepal width | petal width | 0.663999 | class | Iris-versicolor |
| 8 | -0.356544 | sepal width | petal width | 0.537728 | class | Iris-virginica |
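As a sanity check (a minimal sketch, not part of the detection pipeline), the first row of the table can be reproduced directly with pandas: the overall correlation between sepal length and sepal width is negative, while within the Iris-setosa subgroup it is positive.

# illustrative sanity check of the first detected reversal
overall = iris_df['sepal length'].corr(iris_df['sepal width'])
setosa = iris_df[iris_df['class'] == 'Iris-setosa']
within = setosa['sepal length'].corr(setosa['sepal width'])
print(overall, within)  # roughly -0.109 overall vs. 0.747 within Iris-setosa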
Plotting¶
We plot the data in scatter plots, colored by each group-by attribute, for each pair of candidate attributes. To each plot we add the overall trend line and the trend line for each occurrence of Simpson’s Paradox.
Plotting with the data frame¶
These plots can be generated with simpler code by working directly with the results data frame.
In [6]:
# create a set (to keep only unique pairs) of the attribute pairs that had SP
sns.set_context("talk")
sp_meas = set(zip(result_df['attr1'], result_df['attr2']))
for yvar, xvar in sp_meas:
    sns.lmplot(x=xvar, y=yvar, data=iris_df, hue='class', ci=None,
               markers=['x', 'o', 's'], palette="Set1")
    # add a whole-data regression line, but don't cover the scatter data
    sns.regplot(x=xvar, y=yvar, data=iris_df, color='black',
                scatter=False, ci=None)
Experiments on subsampled data¶
We run our algorithm on 10%, 30%, 50%, 60%, and 90% subsamples of the data set, with 5 random repeats each, and count the number of detections in each run.
In [7]:
portions = [.1, .3, .5, .6, .9]
num_repeats = 5
# baseline run on the full data set
res_df = dsp.detect_simpsons_paradox(iris_df)
res_df['portion'] = 1
res_df['exp'] = 0
summary_cols = ['portion', 'exp', 'SPcount']
summary_data = [[1, 0, len(res_df)]]
for i, cur_portion in enumerate(np.repeat(portions, num_repeats)):
    # sample and run the detection algorithm
    res_tmp = dsp.detect_simpsons_paradox(iris_df.sample(frac=cur_portion))
    # append results with info on this trial
    res_tmp['portion'] = cur_portion
    res_tmp['exp'] = i + 1
    res_df = pd.concat([res_df, res_tmp])
    summary_data.append([cur_portion, i + 1, len(res_tmp)])
# make a dataframe of the summary stats
summary_df = pd.DataFrame(data=summary_data, columns=summary_cols)
Now we can compute descriptive statistics of the occurrence counts, using the groupby feature.
In [8]:
exp_summary = summary_df.groupby('portion')['SPcount'].describe()
exp_summary
Out[8]:
| portion | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 5.0 | 6.2 | 1.788854 | 4.0 | 6.0 | 6.0 | 6.0 | 9.0 |
| 0.3 | 5.0 | 9.0 | 0.707107 | 8.0 | 9.0 | 9.0 | 9.0 | 10.0 |
| 0.5 | 5.0 | 9.0 | 0.000000 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 0.6 | 5.0 | 8.8 | 0.447214 | 8.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 0.9 | 5.0 | 9.0 | 0.000000 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 1.0 | 1.0 | 9.0 | NaN | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
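As an optional follow-up (a minimal sketch, not part of the original experiments), the stability of the detection count across sample fractions can be visualized directly from summary_df:

# illustrative: detection count as a function of the sample fraction
sns.pointplot(x='portion', y='SPcount', data=summary_df)
plt.xlabel('sample fraction')
plt.ylabel('number of detections')
plt.show()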