API Reference

Detection Code

detect_simpsons_paradox.detect_simpsons_paradox(latent_df, continuousAttrs_labels=None, groupbyAttrs_labels=None)[source]

Detects occurrences of Simpson's Paradox in subgroups of the data.

Parameters:

latent_df : dataframe

data organized in a pandas DataFrame containing both categorical and continuous attributes.

continuousAttrs_labels : list [None]

list of continuous attributes, by column name in the dataframe. If None, all float64 columns in the dataframe are used.

groupbyAttrs_labels : list [None]

list of group-by attributes, by column name in the dataframe. If None, all object and int64 columns in the dataframe are used.
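The dtype-based defaults for the two label arguments can be approximated with pandas' own `select_dtypes`. This is a minimal sketch of the documented behavior, not the library's code; the toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "x": np.array([0.1, 0.4, 0.9], dtype="float64"),
    "y": np.array([1.0, 0.7, 0.2], dtype="float64"),
    "cluster": np.array([0, 1, 1], dtype="int64"),
    "color": pd.Series(["r", "g", "g"], dtype=object),
})

# Documented default for continuousAttrs_labels: all float64 columns.
continuous_labels = list(df.select_dtypes(include=["float64"]).columns)

# Documented default for groupbyAttrs_labels: all object and int64 columns.
groupby_labels = list(df.select_dtypes(include=["object", "int64"]).columns)
```

Note that an integer-coded continuous variable would land in the group-by set under these defaults, so passing explicit label lists is safer when dtypes do not match the intended roles.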

Returns:

result_df : dataframe

The result dataframe stores information about each subgroup in which Simpson's Paradox was detected. TODO: Clarify the return information
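To make the idea of detection concrete, the sketch below flags subgroups whose correlation sign is opposite to the overall correlation sign. It is an assumption about what "detecting" means here, not the implementation of `detect_simpsons_paradox`; the helper name `sign_reversals` and the output column names are hypothetical, since the real return format is not yet documented.

```python
import numpy as np
import pandas as pd

def sign_reversals(df, x, y, group):
    """Return one row per subgroup whose x-y correlation sign opposes the overall sign."""
    overall = df[x].corr(df[y])
    rows = []
    for value, sub in df.groupby(group):
        r = sub[x].corr(sub[y])
        if np.sign(r) * np.sign(overall) < 0:
            rows.append({
                "groupby": group,          # hypothetical output column names
                "subgroup": value,
                "overall_corr": overall,
                "subgroup_corr": r,
            })
    return pd.DataFrame(rows)

# Toy data: each group trends downward, but the offset between
# groups produces an upward trend in the pooled data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],
    "y": [3.0, 2.0, 1.0, 8.0, 7.0, 6.0],
    "g": ["a", "a", "a", "b", "b", "b"],
})
result = sign_reversals(toy, "x", "y", "g")
```

Here both groups are flagged, because the pooled correlation is positive while each within-group correlation is negative.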

Data Generation and Preparation

sp_data_util.simple_regression_sp(N, mu, cov)[source]
Generate synthetic data for the simplest case of group-wise SP of the regression type. Generates data from $k$ clusters with centers mu, each with covariance cov.

mu and cov must induce SP; this function does not make SP happen. Adds 1 noisy dimension.

Parameters:

N : scalar integer

number of samples total to draw

mu : k cluster centers in d dimensions

locations of the clusters

cov : d_1 x d_1 covariance

shared covariance of all subgroup clusters
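Since mu and cov must themselves induce SP, it helps to see one choice that does. The sketch below uses plain NumPy rather than `simple_regression_sp`, leaves out the noisy dimension, and assigns samples to clusters uniformly; the specific values of mu and cov are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed inputs: k = 3 cluster centers in d = 2 dimensions placed along a
# rising line, and a shared covariance with negative within-cluster correlation.
mu = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
cov = np.array([[0.6, -0.5], [-0.5, 0.6]])
N = 300

# Assign each sample to a cluster uniformly, then draw it around that center.
z = rng.integers(0, len(mu), size=N)
x = mu[z] + rng.multivariate_normal(np.zeros(2), cov, size=N)

# Pooled correlation versus correlation inside each cluster.
overall_r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
within_r = [np.corrcoef(x[z == c, 0], x[z == c, 1])[0, 1] for c in range(len(mu))]
```

The pooled correlation comes out positive because the spread of the centers dominates, while every within-cluster correlation is negative. If the centers were instead placed along a falling line, the same cov would not induce SP.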

sp_data_util.noise_regression_sp(N, mu, cov, d_noise)[source]
Generate synthetic data for the simplest case of group-wise SP of the regression type. Generates data from $k$ clusters with centers mu, each with covariance cov.

mu and cov must induce SP; this function does not make SP happen. Adds d_noise noisy dimensions.

Parameters:

N : scalar integer

number of samples total to draw

mu : k cluster centers in d dimensions

locations of the clusters

cov : d_1 x d_1 covariance

shared covariance of all subgroup clusters
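The effect of d_noise on the shape of the data can be sketched as follows. How `noise_regression_sp` actually draws its noise is not documented here, so the standard normal noise below is an assumption, and the `signal` array is only a stand-in for the clustered data.

```python
import numpy as np

rng = np.random.default_rng(1)

N, d_noise = 200, 3

# Stand-in for the clustered two-column data; any N x 2 array works here.
signal = rng.normal(size=(N, 2))

# Noise columns drawn independently of the signal (assumed standard normal).
noise = rng.normal(size=(N, d_noise))

# The noisy dimensions are appended as extra columns.
data = np.hstack([signal, noise])
```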

sp_data_util.generateDataset(N, numClu, numExtra)[source]

Generate a synthetic dataset for timing experiments.

Parameters:

N : scalar

total samples to draw

numClu: number of clusters

numExtra: number of extra categorical columns and continuous columns

sp_data_util.mixed_regression_sp_extra(N, mu, cov, extra, p=None)[source]

Generate synthetic data for the simplest case of group-wise SP of the regression type. mu and cov must induce SP; this function does not make SP happen. Adds 1 noisy dimension and an interacting char attribute.

Parameters:

N : scalar integer

number of samples total to draw

mu : k cluster centers in d dimensions

locations of the clusters

cov : d_1 xd_1 covariance

shared covariance of all subgroup clusters

extra : scalar

number of extra variables to add.

p : vector length k

probability of each cluster
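The role of p can be illustrated by drawing cluster memberships with unequal probabilities. This is a sketch with NumPy, not the body of `mixed_regression_sp_extra`; the probabilities are example values.

```python
import numpy as np

rng = np.random.default_rng(2)

k, N = 3, 1000
p = [0.6, 0.3, 0.1]   # example cluster probabilities; must sum to 1

# Draw a cluster index for every sample according to p.
z = rng.choice(k, size=N, p=p)

# Cluster sizes follow p only approximately, since membership is random.
counts = np.bincount(z, minlength=k)
```

Because memberships are sampled, the cluster sizes vary from run to run around N * p unless the random generator is seeded.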

sp_data_util.sp_plot(df, x_col, y_col, color_col)[source]

Create an SP visualization plot from 2 columns of a df.

sp_data_util.geometric_2d_gmm_sp(r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters=None)[source]

Sample from a Gaussian mixture model with Simpson's Paradox and spread means; return the data in a data frame.

r_clusters : scalar [0,1]
correlation coefficient of clusters
cluster_size : 2 vector
variance in each direction of each cluster
cluster_spread : scalar [0,1]
Pearson correlation of means
p_sp_clusters : scalar in [0,1]
portion of clusters with SP
p_clusters : vector in [0,1)^k, optional
probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
domain_range : [xmin, xmax, ymin, ymax]
planned region for points to be in, means will be in middle 80%
k : integer
number of clusters
N : scalar
number of points
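The way r_clusters and cluster_size jointly determine a cluster's shape follows from the standard identity between correlation and covariance. The sketch below builds the 2 x 2 covariance from those two inputs and shows the documented uniform default for p_clusters; the helper name `cluster_cov` is hypothetical, and whether the library constructs its covariance exactly this way is an assumption.

```python
import numpy as np

def cluster_cov(r, variances):
    """2x2 covariance with per-axis variances and correlation coefficient r."""
    v1, v2 = variances
    off = r * np.sqrt(v1 * v2)   # covariance = r * sigma_1 * sigma_2
    return np.array([[v1, off], [off, v2]])

k = 4
cov = cluster_cov(0.8, [4.0, 1.0])
p_clusters = [1.0 / k] * k       # documented default: uniform membership

# Recover the correlation from the covariance as a sanity check.
r_back = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```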

sp_data_util.geometric_indep_views_gmm_sp(d, r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters=None)[source]

Sample from a Gaussian mixture model with Simpson's Paradox and spread means; return the data in a data frame.

d : integer
number of independent views, groups of 3 columns with sp
r_clusters : scalar [0,1] or list of d
correlation coefficient of clusters
cluster_size : 2 vector or list of d
variance in each direction of each cluster
cluster_spread : scalar [0,1] or list of d
Pearson correlation of means
p_sp_clusters : scalar in [0,1] or list of d
portion of clusters with SP
p_clusters : vector in [0,1)^k or list of d vectors, optional
probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
domain_range : [xmin, xmax, ymin, ymax] or list of d
planned region for points to be in, means will be in middle 80%
k : integer or list of d
number of clusters
N : scalar
number of points, shared across all views
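The "scalar or list of d" convention and the three-columns-per-view layout can be sketched as below. This is an illustration under assumptions: the helper `per_view`, the placeholder sampler, and the column names are hypothetical and not taken from `geometric_indep_views_gmm_sp`.

```python
import numpy as np
import pandas as pd

def per_view(value, d):
    """Repeat a scalar once per view; accept a list only if it has d entries."""
    if isinstance(value, (list, tuple)):
        if len(value) != d:
            raise ValueError("expected %d entries, got %d" % (d, len(value)))
        return list(value)
    return [value] * d

rng = np.random.default_rng(3)
d, N = 2, 50
k = per_view(3, d)            # scalar -> [3, 3]
r = per_view([0.9, 0.5], d)   # already one value per view

views = []
for i in range(d):
    # Placeholder sampler: two continuous columns plus a cluster label per view.
    views.append(pd.DataFrame({
        "x%d" % (2 * i + 1): rng.normal(size=N),
        "x%d" % (2 * i + 2): rng.normal(size=N),
        "cluster_%d" % i: rng.integers(0, k[i], size=N),
    }))

# All views share the same N rows and are joined column-wise.
df = pd.concat(views, axis=1)
```

A parameter that is itself a vector, such as cluster_size, is ambiguous under this convention when d equals the vector length, so passing an explicit list of d vectors avoids misinterpretation.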

sp_data_util.data_only_geometric_2d_gmm(r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters)[source]

Private sampler only; returns raw variables. Utility for sharing in other samplers. Samples from a Gaussian mixture model with Simpson's Paradox and spread means.

r_clusters : scalar [0,1]
correlation coefficient of clusters
cluster_size : 2 vector
variance in each direction of each cluster
cluster_spread : scalar [0,1]
Pearson correlation of means
p_sp_clusters : scalar in [0,1]
portion of clusters with SP
p_clusters : vector in [0,1)^k, optional
probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
domain_range : [xmin, xmax, ymin, ymax]
planned region for points to be in, means will be in middle 80%
k : integer
number of clusters
N : scalar
number of points
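One reading of "means will be in middle 80%" is that cluster means are confined to the central 80% of domain_range along each axis, leaving a 10% margin on every side. The sketch below computes those bounds under that assumption; the helper name `mean_bounds` is hypothetical, and the uniform draw of means is only for illustration and ignores cluster_spread.

```python
import numpy as np

def mean_bounds(domain_range, keep=0.8):
    """Bounds of the central `keep` fraction of the domain on each axis."""
    xmin, xmax, ymin, ymax = domain_range
    margin = (1.0 - keep) / 2.0
    x_lo = xmin + margin * (xmax - xmin)
    x_hi = xmax - margin * (xmax - xmin)
    y_lo = ymin + margin * (ymax - ymin)
    y_hi = ymax - margin * (ymax - ymin)
    return x_lo, x_hi, y_lo, y_hi

rng = np.random.default_rng(4)
x_lo, x_hi, y_lo, y_hi = mean_bounds([0, 20, 0, 10])

# Illustrative means drawn uniformly inside the central region.
means = np.column_stack([
    rng.uniform(x_lo, x_hi, size=5),
    rng.uniform(y_lo, y_hi, size=5),
])
```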