traintestdiff package

Submodules

traintestdiff.core module

class traintestdiff.core.TrainTestDiff(datasets)

Bases: object

Helper class that eases running several distribution analyses on the same datasets

plot_cat_diff(features, col_wrap=3, kind='prop', title=None): See plot_categorical_diff()

plot_cont_diff(features, kind='box', col_wrap=3, size=4, aspect=1, title=None): See plot_continuous_diff()

traintestdiff.core.categorical_longform(datasets, features)

Given datasets and features, returns a long-form representation of them

Parameters:
	datasets (dict) – each key is a dataset name and each value is a `pandas.DataFrame`
	features (list) – a list of feature names (strings) present in the datasets
Returns:	A tidy, long-form dataframe
Return type:	pandas.core.frame.DataFrame
Raises:	`KeyError` – if any of the `features` isn’t present in the dataframes of the `datasets` dict
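To make "long form" concrete, here is a minimal sketch of such a reshaping using only pandas. It is not the library's implementation: the helper name `longform_sketch` and the output column names `dataset`, `feature` and `value` are assumptions chosen for illustration.

```python
import pandas as pd


def longform_sketch(datasets, features):
    """Hypothetical helper (not part of traintestdiff): stack datasets in long form."""
    frames = []
    for name, frame in datasets.items():
        # Selecting a missing feature raises KeyError, matching the documented behaviour.
        melted = frame[features].melt(var_name="feature", value_name="value")
        # Record which dataset each observation came from.
        melted.insert(0, "dataset", name)
        frames.append(melted)
    return pd.concat(frames, ignore_index=True)


train = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
test = pd.DataFrame({"color": ["red"], "size": ["L"]})

tidy = longform_sketch({"train": train, "test": test}, ["color", "size"])
print(tidy)
```

Each row of the result holds one observation of one feature from one dataset, which is the layout that faceted plotting tools generally expect.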

traintestdiff.core.continuous_longform(datasets, features)

Given datasets and features, returns a long-form representation of them

Parameters:
	datasets (dict) – each key is a dataset name and each value is a `pandas.DataFrame`
	features (list) – a list of feature names (strings) present in the datasets
Returns:	A tidy, long-form dataframe
Return type:	pandas.core.frame.DataFrame
Raises:	`KeyError` – if any of the `features` isn’t present in the dataframes of the `datasets` dict

traintestdiff.core.datasets_from_frame(dataframe, feature)

Creates a datasets dict from a dataframe

Given a categorical feature, it creates a dict where each key is a level of the feature and each value is a dataframe. You can then use this datasets dict to plot graphs.

Parameters:
	dataframe (pandas.DataFrame) – the frame to split into a datasets dict
	feature (str) – the feature used for grouping and creating the datasets dict
Returns:	A `dict` where keys are levels of `feature` and values are `pandas.core.frame.DataFrame` from a `dataframe.groupby(feature)`
Return type:	dict
Raises:	`KeyError` – if `feature` is not present in `dataframe`
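The grouping described above can be sketched with plain pandas. This is an illustration of the idea rather than the library's code; the column names `split` and `age` are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({
    "split": ["train", "train", "test"],
    "age": [31, 45, 27],
})

# One entry per level of the grouping feature.
# groupby raises KeyError if the "split" column does not exist.
datasets = {level: group for level, group in frame.groupby("split")}

for name, subset in datasets.items():
    print(name, len(subset))
```

A dict built this way has the same shape as the `datasets` argument expected elsewhere in this module: names as keys, dataframes as values.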

traintestdiff.core.plot_categorical_diff(datasets, features, kind='prop', col_wrap=4, size=4, aspect=1, title=None)

Plots the distribution differences of categorical features in each dataset

Parameters:
	datasets (dict) – a dict where the keys are names and the values are `pandas.DataFrame`
	features (list) – a list of categorical features present in every dataset of `datasets`
	kind (Optional[str]) – one of {count, prop}. Use “count” for the number of occurrences of every level of a feature in every dataset present in `datasets`. Use “prop” for the proportion of that level of a feature.
	col_wrap (int) – how many charts you want per row
	size (float) – height of each facet, in inches
	aspect (float) – aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
	title (str) – the title of the figure
Returns:	a tuple with a long-form dataframe and a matplotlib figure that you can customize
Return type:	(pandas.core.frame.DataFrame, matplotlib.figure.Figure)
Raises:	`KeyError` – if any of the `features` isn’t present in the dataframes of the `datasets` dict
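The difference between the “count” and “prop” kinds can be illustrated without plotting. The following is a rough sketch using pandas only; the helper `level_summary` is hypothetical and is not how the library necessarily computes its values.

```python
import pandas as pd

datasets = {
    "train": pd.DataFrame({"color": ["red", "red", "blue", "green"]}),
    "test": pd.DataFrame({"color": ["red", "blue"]}),
}


def level_summary(datasets, feature, kind="prop"):
    """Hypothetical helper: per-dataset count or proportion of each level."""
    columns = {
        # normalize=True turns raw counts into proportions that sum to 1.
        name: frame[feature].value_counts(normalize=(kind == "prop"))
        for name, frame in datasets.items()
    }
    # Levels that are absent from a dataset show up as 0 after alignment.
    return pd.DataFrame(columns).fillna(0)


props = level_summary(datasets, "color", kind="prop")
counts = level_summary(datasets, "color", kind="count")
print(props)
print(counts)
```

Proportions are usually the more useful view when the datasets differ in size, since raw counts would otherwise make the larger dataset dominate the comparison.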

traintestdiff.core.plot_continuous_diff(datasets, features, kind='box', col_wrap=3, size=4, aspect=1, title=None)

Plots the distribution differences of continuous features in each dataset

Parameters:
	datasets (dict) – a dict where the keys are names and the values are `pandas.DataFrame`
	features (list) – a list of continuous features present in every dataset of `datasets`
	kind (str) – one of {point, bar, box, violin, strip}; the kind of plot to draw
	col_wrap (int) – how many charts you want per row
	size (float) – height of each facet, in inches
	aspect (float) – aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
	title (str) – the title of the figure
Returns:	a tuple with a long-form dataframe and a matplotlib figure that you can customize
Return type:	(pandas.core.frame.DataFrame, matplotlib.figure.Figure)
Raises:	`KeyError` – if any of the `features` isn’t present in the dataframes of the `datasets` dict
