API Reference
Detection Code
detect_simpsons_paradox.detect_simpsons_paradox(latent_df, continuousAttrs_labels=None, groupbyAttrs_labels=None)
Detects occurrences of Simpson's Paradox within subgroups of the data.
Parameters: latent_df : dataframe
data organized in a pandas dataframe containing both categorical and continuous attributes.
continuousAttrs_labels : list [None]
list of continuous attributes by name in the dataframe; if None, all float64 columns in the dataframe are used.
groupbyAttrs_labels : list [None]
list of group-by attributes by name in the dataframe; if None, all object and int64 columns in the dataframe are used.
Returns: result_df : dataframe
dataframe describing the subgroups in which Simpson's Paradox was detected. TODO: Clarify the return information
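The default column selection described for `continuousAttrs_labels` and `groupbyAttrs_labels` can be illustrated with pandas' `select_dtypes`. This is a minimal sketch, not the library's implementation: the helper name `default_attr_labels` and the toy dataframe are hypothetical, and the code only mirrors the documented defaults (float64 columns as continuous attributes; object and int64 columns as group-by attributes).

```python
import pandas as pd

def default_attr_labels(df):
    """Hypothetical helper: pick default labels by column dtype."""
    continuous = list(df.select_dtypes(include=["float64"]).columns)
    groupby = list(df.select_dtypes(include=["object", "int64"]).columns)
    return continuous, groupby

# Toy data; dtypes are set explicitly so the example does not depend on
# platform or pandas-version defaults.
toy_df = pd.DataFrame({
    "x1": [0.1, 0.4, 0.9],
    "x2": [1.5, 1.2, 0.3],
    "cluster": pd.Series([0, 1, 1], dtype="int64"),
    "label": pd.Series(["a", "b", "b"], dtype=object),
})

continuous, groupby = default_attr_labels(toy_df)
print(continuous)  # ['x1', 'x2']
print(groupby)     # ['cluster', 'label']
```

Passing explicit label lists instead would presumably bypass this dtype-based guess, which matters when an integer column is really a measurement rather than a group identifier.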
Data Generation and preparation
sp_data_util.simple_regression_sp(N, mu, cov)
Generate synthetic data for the simplest case of group-wise SP of the regression type. Data are drawn from k clusters with centers mu, each with covariance cov.
mu and cov must induce SP; this function does not make SP happen. Adds 1 noisy dimension.
Parameters: N : scalar integer
total number of samples to draw
mu : k cluster centers in d dimensions
locations of the clusters
cov : d_1 x d_1 covariance
shared covariance of all subgroup clusters
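The requirement that mu and cov must themselves induce SP can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the body of `simple_regression_sp`: the function name `sample_clusters`, the equal-probability cluster assignment, and the specific `mu` and `cov` values are all hypothetical, and no noisy dimension is added here.

```python
import numpy as np

def sample_clusters(N, mu, cov, seed=0):
    """Draw N points from k Gaussian clusters that share one covariance."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    z = rng.integers(0, len(mu), size=N)  # cluster id for each sample
    noise = rng.multivariate_normal(np.zeros(mu.shape[1]), cov, size=N)
    return mu[z] + noise, z

# Centers lie along a decreasing line, while the shared covariance gives
# every cluster an increasing trend: the two trends oppose each other.
mu = [[0.0, 10.0], [5.0, 5.0], [10.0, 0.0]]
cov = [[1.0, 0.8], [0.8, 1.0]]
x, z = sample_clusters(3000, mu, cov)

overall_r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
within_r = [np.corrcoef(x[z == c, 0], x[z == c, 1])[0, 1]
            for c in range(len(mu))]
print("overall:", overall_r)
print("within clusters:", within_r)
```

With these inputs the overall correlation is negative while each within-cluster correlation is positive. If the centers were instead placed along an increasing line, the same sampler would produce no reversal, which is why the choice of mu and cov is left to the caller.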
sp_data_util.noise_regression_sp(N, mu, cov, d_noise)
Generate synthetic data for the simplest case of group-wise SP of the regression type. Data are drawn from k clusters with centers mu, each with covariance cov.
mu and cov must induce SP; this function does not make SP happen. Adds d_noise noisy dimensions.
Parameters: N : scalar integer
total number of samples to draw
mu : k cluster centers in d dimensions
locations of the clusters
cov : d_1 x d_1 covariance
shared covariance of all subgroup clusters
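Appending `d_noise` noisy dimensions might look like the sketch below. The source does not say how the noise is distributed, so this assumes independent standard-normal columns; `add_noise_dims` is a hypothetical helper, not part of `sp_data_util`.

```python
import numpy as np

def add_noise_dims(x, d_noise, seed=0):
    """Append d_noise independent standard-normal columns to x (assumption)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((x.shape[0], d_noise))
    return np.hstack([x, noise])

x = np.zeros((5, 2))                 # stand-in for the SP-carrying columns
x_noisy = add_noise_dims(x, d_noise=3)
print(x_noisy.shape)                 # (5, 5)
```

The original columns are left untouched, so any SP structure they carry survives; the extra columns only give a detector more irrelevant pairs to rule out.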
sp_data_util.generateDataset(N, numClu, numExtra)
Generate a synthetic dataset for timing experiments.
Parameters: N : scalar
total number of samples to draw
numClu : number of clusters
numExtra : number of extra categorical columns and continuous columns
sp_data_util.mixed_regression_sp_extra(N, mu, cov, extra, p=None)
Generate synthetic data for the simplest case of group-wise SP of the regression type. mu and cov must induce SP; this function does not make SP happen. Adds 1 noisy dimension and an interacting char attribute.
Parameters: N : scalar integer
total number of samples to draw
mu : k cluster centers in d dimensions
locations of the clusters
cov : d_1 x d_1 covariance
shared covariance of all subgroup clusters
extra : scalar
number of extra variables to add.
p : vector length k
probability of each cluster
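The role of `p` as a length-k vector of cluster probabilities can be sketched with NumPy's `Generator.choice`. This is an assumption-laden illustration rather than the function's actual code: `assign_clusters` is a hypothetical helper, and the fallback to uniform probabilities when `p` is None is borrowed from the default documented for `p_clusters` in the geometric samplers, not confirmed for this function.

```python
import numpy as np

def assign_clusters(N, k, p=None, seed=0):
    """Draw a cluster id in [0, k) for each of N samples."""
    rng = np.random.default_rng(seed)
    if p is None:
        p = [1.0 / k] * k  # assumed default: uniform over clusters
    return rng.choice(k, size=N, p=p)

p = [0.6, 0.3, 0.1]
z = assign_clusters(10000, k=3, p=p)
freq = np.bincount(z, minlength=3) / z.size
print(freq)  # empirical cluster frequencies, close to p
```

Unequal probabilities change the relative sizes of the subgroups, which in turn changes how strongly the large clusters dominate the overall trend.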
sp_data_util.sp_plot(df, x_col, y_col, color_col)
Create an SP visualization plot from 2 columns of a df.
sp_data_util.geometric_2d_gmm_sp(r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters=None)
Sample from a Gaussian mixture model with Simpson's Paradox and spread means; return the data in a dataframe.
- r_clusters : scalar [0,1]
- correlation coefficient of clusters
- cluster_size : 2 vector
- variance in each direction of each cluster
- cluster_spread : scalar [0,1]
- Pearson correlation of means
- p_sp_clusters : scalar in [0,1]
- portion of clusters with SP
- p_clusters : vector in [0,1)^k, optional
- probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
- domain_range : [xmin, xmax, ymin, ymax]
- planned region for points to be in, means will be in middle 80%
- k : integer
- number of clusters
- N : scalar
- number of points
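Two of the parameters above map onto standard constructions, sketched here under assumptions. `cluster_cov` builds a 2x2 covariance from a correlation coefficient (`r_clusters`) and per-axis variances (`cluster_size`) using the usual identity cov_xy = r * sqrt(var_x * var_y). `mean_bounds` reads "middle 80%" as the central 80% of each axis of `domain_range`, which is one plausible interpretation. Both helper names are hypothetical and neither is taken from `sp_data_util`.

```python
import numpy as np

def cluster_cov(r_clusters, cluster_size):
    """2x2 covariance from a correlation coefficient and per-axis variances."""
    var_x, var_y = cluster_size
    off_diag = r_clusters * np.sqrt(var_x * var_y)
    return np.array([[var_x, off_diag], [off_diag, var_y]])

def mean_bounds(domain_range, keep=0.8):
    """Bounds of the central `keep` fraction of the domain on each axis."""
    xmin, xmax, ymin, ymax = domain_range
    margin_x = (1 - keep) / 2 * (xmax - xmin)
    margin_y = (1 - keep) / 2 * (ymax - ymin)
    return [xmin + margin_x, xmax - margin_x, ymin + margin_y, ymax - margin_y]

cov = cluster_cov(0.5, [4.0, 1.0])
bounds = mean_bounds([0, 20, 0, 10])
print(cov)     # variances 4 and 1 on the diagonal, covariance 1 off it
print(bounds)  # approximately [2, 18, 1, 9]
```

Keeping the means away from the edges of the domain leaves room for each cluster's spread to stay inside the planned region.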
sp_data_util.geometric_indep_views_gmm_sp(d, r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters=None)
Sample from a Gaussian mixture model with Simpson's Paradox and spread means; return the data in a dataframe.
- d : integer
- number of independent views, groups of 3 columns with sp
- r_clusters : scalar [0,1] or list of d
- correlation coefficient of clusters
- cluster_size : 2 vector or list of d
- variance in each direction of each cluster
- cluster_spread : scalar [0,1] or list of d
- Pearson correlation of means
- p_sp_clusters : scalar in [0,1] or list of d
- portion of clusters with SP
- p_clusters : vector in [0,1)^k or list of d vectors, optional
- probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
- domain_range : [xmin, xmax, ymin, ymax] or list of d
- planned region for points to be in, means will be in middle 80%
- k : integer or list of d
- number of clusters
- N : scalar
- number of points, shared across all views
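Several parameters above accept either a single value or a list of d values, one per view. One way to normalize such an argument is sketched below; `per_view` is a hypothetical helper, not a function in `sp_data_util`, and it only handles scalar-valued settings such as `r_clusters`, `cluster_spread`, `p_sp_clusters`, and `k`. Vector-valued settings such as `cluster_size` or `domain_range` would need a different check, since a single setting is itself a list.

```python
def per_view(value, d):
    """Return one setting per view: repeat a scalar, or validate a list of d."""
    if isinstance(value, (list, tuple)):
        if len(value) != d:
            raise ValueError(f"expected {d} values, got {len(value)}")
        return list(value)
    return [value] * d

print(per_view(0.5, 3))        # [0.5, 0.5, 0.5]
print(per_view([2, 3, 4], 3))  # [2, 3, 4]
```

Normalizing every argument to a list of length d up front lets the sampler loop over views once and treat each view like an independent 2D problem.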
sp_data_util.data_only_geometric_2d_gmm(r_clusters, cluster_size, cluster_spread, p_sp_clusters, domain_range, k, N, p_clusters)
Private, sampler only: returns raw variables; a utility for sharing with other samplers. Sample from a Gaussian mixture model with Simpson's Paradox and spread means.
- r_clusters : scalar [0,1]
- correlation coefficient of clusters
- cluster_size : 2 vector
- variance in each direction of each cluster
- cluster_spread : scalar [0,1]
- Pearson correlation of means
- p_sp_clusters : scalar in [0,1]
- portion of clusters with SP
- p_clusters : vector in [0,1)^k, optional
- probability of membership of a sample in each cluster (controls relative size of clusters); default is [1.0/k]*k for uniform
- domain_range : [xmin, xmax, ymin, ymax]
- planned region for points to be in, means will be in middle 80%
- k : integer
- number of clusters
- N : scalar
- number of points