Timing Experiment

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import warnings
import detect_simpsons_paradox as dsp
import sp_data_util as sp_dat
import time

We will draw samples from a number of clusters according to a Gaussian Mixture Model and add both continuous and categorical noise values.

First we have to set up the number of clusters, samples and extra values.

In [2]:

# set the data size
N = int(10**5)
# use 32 clusters, plus 5 extra continuous attributes and 5 extra categorical attributes
num_clusters = 32
numExtra = 5

Next, we generate cluster means that roughly follow a positive trend, which helps ensure that Simpson's paradox (SP) occurs throughout the dataset.

In [3]:

mu = np.asarray([[1,1],[5,5]])

variance = 1000

# generate rest of the mu
for i in range(num_clusters - 2):
    mu_x = np.random.randint(10, 99)
    mu_y = np.random.normal(mu_x, np.sqrt(variance))
    mu_new = np.asarray([mu_x,mu_y])
    mu = np.append(mu,[mu_new],axis=0)

plt.scatter(mu[:,0],mu[:,1])
plt.show()
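
As a side note, the same cluster means can be drawn without growing the array one row at a time. The following is only a sketch of an equivalent vectorized construction; the seeded generator `rng` and the name `mu_rest` are assumptions introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

num_clusters = 32
variance = 1000

# Assumption: a seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)

# The two fixed means used in the notebook.
mu_fixed = np.asarray([[1, 1], [5, 5]], dtype=float)

# Draw the remaining x-coordinates (integers in [10, 99)), then draw each
# y-coordinate around its x-coordinate to get a positive overall trend.
mu_x = rng.integers(10, 99, size=num_clusters - 2)
mu_y = rng.normal(mu_x, np.sqrt(variance))
mu_rest = np.column_stack([mu_x, mu_y])

mu = np.vstack([mu_fixed, mu_rest])
print(mu.shape)
```

Drawing all means in one call avoids copying the array on every `np.append`, which matters little for 32 clusters but keeps the cell shorter.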

Then we use a function built into our package that takes the number of samples, a list of cluster means, a covariance, and the number of extra attributes, and generates the dataset.

In [4]:

# covariance of each cluster
cov = [[.6,-1],[0,.6]]

# call mixed_regression_sp_extra to generate the data set
latent_df = sp_dat.mixed_regression_sp_extra(N,mu,cov, numExtra)

/home/smb/anaconda2/envs/simpsonsparadox/lib/python3.6/site-packages/sp_data_util/SPData.py:134: RuntimeWarning: covariance is not positive-semidefinite.
  x = np.asarray([np.random.multivariate_normal(mu[z_i],cov) for z_i in z])
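
The warning above is raised because the matrix passed as `cov` is not symmetric, so it is not a valid covariance matrix. A minimal sketch of a check that could be run before sampling is shown below; the helper `is_valid_covariance` and the second, symmetric matrix are hypothetical and are not part of `sp_data_util`.

```python
import numpy as np

def is_valid_covariance(cov, tol=1e-8):
    """Return True if cov is square, symmetric, and positive semidefinite."""
    cov = np.asarray(cov, dtype=float)
    if cov.ndim != 2 or cov.shape[0] != cov.shape[1]:
        return False
    if not np.allclose(cov, cov.T):
        return False
    # eigvalsh assumes a symmetric matrix, which was verified above.
    return bool(np.all(np.linalg.eigvalsh(cov) >= -tol))

# The matrix used in the notebook: not symmetric.
print(is_valid_covariance([[.6, -1], [0, .6]]))       # False

# Hypothetical symmetric alternative with a negative within-cluster trend.
print(is_valid_covariance([[.6, -.3], [-.3, .6]]))    # True
```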

In [5]:

np.random.choice(range(5), 20)

Out[5]:

array([3, 3, 3, 2, 2, 2, 4, 0, 1, 2, 1, 3, 0, 3, 4, 3, 4, 0, 3, 4])

In [6]:

latent_df['cluster'].head()

Out[6]:

0    26
1    27
2     0
3     8
4     8
Name: cluster, dtype: int64

In [7]:

plt.scatter(latent_df['x1'], latent_df['x2'],
            c =  latent_df['cluster'], marker= 'o')
plt.show()

In [8]:

# check the size of the data
latent_df.shape

Out[8]:

(100000, 13)

Since the data is stored in a pandas DataFrame, we can easily sample a subset of the rows. Here we draw a 10% sample to check how that works:

In [9]:

subset_df = latent_df.sample(frac=.1)
print(len(subset_df))
subset_df.head()

Out[9]:

	x1	x2	cluster	con_0	con_1	con_2	con_3	con_4	cat_0	cat_1	cat_2	cat_3	cat_4
47787	40.285355	8.279074	4	42.821083	-5.945982	9.116584	47.626833	-7.233663	70	54	27	91	96
97210	34.170575	4.504379	29	-17.100323	-29.930051	-157.993357	26.581181	74.024540	72	64	23	13	60
42521	19.381192	-21.545576	28	-168.137523	-76.675448	146.536801	168.084883	178.468823	92	16	66	89	76
96452	51.975142	61.556268	11	35.396721	109.826010	98.528868	-37.957128	-179.428724	58	4	70	54	71
47115	42.532394	40.353842	6	91.515160	99.488643	-24.809279	-95.285213	145.379252	52	95	82	75	79
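
If the same subset needs to be drawn again later, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses a small made-up frame, `toy_df`, rather than `latent_df`, purely to illustrate the behaviour.

```python
import numpy as np
import pandas as pd

# Assumption: a toy frame standing in for latent_df.
toy_df = pd.DataFrame({'x1': np.arange(100), 'x2': np.arange(100) * 2})

# Fixing random_state makes the 10% sample repeatable.
sample_a = toy_df.sample(frac=.1, random_state=0)
sample_b = toy_df.sample(frac=.1, random_state=0)

print(len(sample_a))                            # 10
print(sample_a.index.equals(sample_b.index))    # True
```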

Now we can run the timing experiment on random samples ranging from 10% to 100% of the dataset.

In [10]:

# time detection on 10%, 20%, ..., 100% of the data
data_portions = np.linspace(.1,1,10)

time_data = []

for cur_portion in data_portions:
    start_time = time.time()
    dsp.detect_simpsons_paradox(latent_df.sample(frac=cur_portion))
    time_data.append([cur_portion, (time.time() - start_time)])
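
Note that the loop above times the call to `latent_df.sample` together with the detection itself, and uses `time.time()`. One possible variation, sketched here under the assumption that only the detection call should be measured, draws the sample first and uses `time.perf_counter()`, which is intended for measuring short durations. The helper name `time_call` is hypothetical.

```python
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Call func(*args, **kwargs) `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload so the sketch runs on its own.
durations = time_call(sum, range(100_000), repeats=3)
print(len(durations))

# In the notebook this could look like (not run here):
#   subset = latent_df.sample(frac=cur_portion)
#   durations = time_call(dsp.detect_simpsons_paradox, subset, repeats=5)
```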

In [11]:

time_res = pd.DataFrame(data = time_data, columns =['portion of data','time'])
time_res # show the results

Out[11]:

	portion of data	time
0	0.1	3.679361
1	0.2	3.910151
2	0.3	3.797533
3	0.4	4.676100
4	0.5	3.796321
5	0.6	4.221745
6	0.7	4.523880
7	0.8	4.056846
8	0.9	5.073282
9	1.0	5.017606

A single run per portion is not very informative, so we repeat the experiment and compute statistics over the runs. We repeat it 4 more times for a total of 5 runs per portion.

In [12]:

num_repeats = 4

for cur_portion in np.repeat(data_portions,num_repeats):
    start_time = time.time()
    dsp.detect_simpsons_paradox(latent_df.sample(frac=cur_portion))
    time_data.append([cur_portion, (time.time() - start_time)])

In [13]:

time_res = pd.DataFrame(data = time_data, columns =['portion','time'])
len(time_res)

Out[13]:

50

Now we have 50 rows in our result table and can compute the statistics we want. We first group the results by the portion of the data so that we can compute summary statistics, such as the mean and standard deviation, over all trials for each portion.

In [14]:

time_repeats = time_res.groupby('portion')
time_repeats.describe()

Out[14]:

	time
	count	mean	std	min	25%	50%	75%	max
portion
0.1	5.0	4.446497	0.826705	3.679361	3.812636	4.061862	5.313767	5.364857
0.2	5.0	4.256970	0.259700	3.910151	4.060727	4.379942	4.405277	4.528751
0.3	5.0	4.704718	0.929317	3.797533	4.171579	4.323587	5.094294	6.136595
0.4	5.0	3.936846	0.418850	3.666440	3.735602	3.749104	3.856985	4.676100
0.5	5.0	4.456102	0.752906	3.796321	4.026359	4.129114	4.646039	5.682674
0.6	5.0	4.540301	0.607244	4.137453	4.173215	4.221745	4.592580	5.576510
0.7	5.0	4.545214	0.508978	4.006332	4.097922	4.523880	4.911530	5.186406
0.8	5.0	4.289220	0.594578	3.791577	3.839077	4.056846	4.552631	5.205971
0.9	5.0	4.318349	0.453169	3.968818	4.045736	4.099937	4.403975	5.073282
1.0	5.0	4.496278	0.573671	3.903625	3.998321	4.404508	5.017606	5.157329
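
When only a few statistics are needed, `agg` can be used on the grouped column instead of `describe`. The sketch below runs on a small made-up table, `toy_res`, with the same column names as `time_res`; the standard error column `sem` is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: toy timings standing in for time_res.
toy_res = pd.DataFrame({
    'portion': [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
    'time':    [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean, sample standard deviation, and number of trials per portion.
summary = toy_res.groupby('portion')['time'].agg(['mean', 'std', 'count'])

# Standard error of the mean for each portion.
summary['sem'] = summary['std'] / np.sqrt(summary['count'])
print(summary)
```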

We can plot the means to see whether there is a clear trend.

In [15]:

time_repeats.mean().plot()

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fe04763efd0>
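
Besides inspecting the plot, one rough way to quantify a trend is to fit a straight line to the per-portion means. The sketch below copies the mean values from the `describe()` table above and uses `np.polyfit`; a linear fit is an assumption made here for illustration, not an analysis from the original notebook.

```python
import numpy as np

portions = np.linspace(.1, 1, 10)

# Mean times (seconds) copied from the grouped summary table.
mean_times = np.array([
    4.446497, 4.256970, 4.704718, 3.936846, 4.456102,
    4.540301, 4.545214, 4.289220, 4.318349, 4.496278,
])

# Degree-1 least-squares fit: mean_time is approximated by slope * portion + intercept.
slope, intercept = np.polyfit(portions, mean_times, 1)
print(f"slope: {slope:.3f} s per unit portion, intercept: {intercept:.3f} s")
```

Comparing the fitted slope with the per-portion standard deviations in the table gives a sense of whether any trend stands out from the run-to-run variation.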
