traintestdiff package

Submodules

traintestdiff.core module

class traintestdiff.core.TrainTestDiff(datasets)

Bases: object

Helper class that holds a datasets dict so that several distribution analyses can be run on the same datasets

plot_cat_diff(features, col_wrap=3, kind='prop', title=None)

See plot_categorical_diff()

plot_cont_diff(features, kind='box', col_wrap=3, size=4, aspect=1, title=None)

See plot_continuous_diff()

traintestdiff.core.categorical_longform(datasets, features)

Returns a long-form (tidy) representation of the given categorical features across all datasets

Parameters:
  • datasets (dict) – each key is a dataset name and each value is a pandas.DataFrame
  • features (list) – a list of feature names (strings) present in the datasets
Returns:

A tidy, long-form dataframe

Return type:

pandas.core.frame.DataFrame

Raises:

KeyError – if any of the features isn’t present in the datasets dict
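The long-form shape described above can be illustrated with plain pandas. This is a minimal sketch, not the package's implementation: the function name `longform_sketch` and the column names `dataset`, `feature`, and `value` are assumptions made for illustration.

```python
import pandas as pd


def longform_sketch(datasets, features):
    """Stack the selected features of every dataset into one tidy frame.

    Hypothetical helper: column names are illustrative assumptions.
    """
    frames = []
    for name, frame in datasets.items():
        # frame[features] raises KeyError when a feature is missing
        melted = frame[features].melt(var_name="feature", value_name="value")
        # Record which dataset each observation came from
        melted.insert(0, "dataset", name)
        frames.append(melted)
    return pd.concat(frames, ignore_index=True)


train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
test = pd.DataFrame({"color": ["blue", "blue"], "size": ["S", "S"]})

tidy = longform_sketch({"train": train, "test": test}, ["color", "size"])
print(tidy)
```

Each row of the result is one observation of one feature in one dataset, which is the layout that faceted plotting tools generally expect.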

traintestdiff.core.continuous_longform(datasets, features)

Returns a long-form (tidy) representation of the given continuous features across all datasets

Parameters:
  • datasets (dict) – each key is a dataset name and each value is a pandas.DataFrame
  • features (list) – a list of feature names (strings) present in the datasets
Returns:

A tidy, long-form dataframe

Return type:

pandas.core.frame.DataFrame

Raises:

KeyError – if any of the features isn’t present in the datasets dict

traintestdiff.core.datasets_from_frame(dataframe, feature)

Creates a datasets dict from a dataframe

Given a categorical feature, creates a dict where each key is a level of the feature and each value is a dataframe. The resulting datasets dict can then be used to plot graphs.

Parameters:
  • dataframe (pandas.DataFrame) – the frame used to create the datasets dict
  • feature (str) – the feature used for grouping and creating the datasets dict
Returns:

A dict whose keys are the levels of feature and whose values are the pandas.core.frame.DataFrame groups produced by dataframe.groupby(feature)

Return type:

dict

Raises:

KeyError – if feature is not present in dataframe
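The grouping behaviour described above can be approximated in a few lines of pandas. This is a sketch under the assumption that the function simply iterates over `dataframe.groupby(feature)`; the name `datasets_from_frame_sketch` and the sample columns are hypothetical.

```python
import pandas as pd


def datasets_from_frame_sketch(dataframe, feature):
    """Split a frame into {level: sub-frame} using groupby.

    Hypothetical stand-in for illustration only.
    """
    # groupby raises KeyError when `feature` is not a column
    return {level: group for level, group in dataframe.groupby(feature)}


frame = pd.DataFrame({
    "split": ["train", "train", "test"],
    "age": [31, 45, 28],
})

datasets = datasets_from_frame_sketch(frame, "split")
print(sorted(datasets))
print(datasets["train"].shape)
```

This is handy when train and test rows live in a single frame with a column marking the split, since the resulting dict has the same shape the other functions take as `datasets`.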

traintestdiff.core.plot_categorical_diff(datasets, features, kind='prop', col_wrap=4, size=4, aspect=1, title=None)

Plots the distribution differences of categorical features in each dataset

Parameters:
  • datasets (dict) – a dict where the keys are names and the values are pandas.DataFrame
  • features (list) – a list of categorical features present in every dataset of datasets
  • kind (Optional[str]) – {count, prop}. Use “count” to plot the number of occurrences of each level of a feature in every dataset in datasets; use “prop” to plot the proportion of each level instead
  • col_wrap (int) – how many charts you want per row
  • size (float) – height (in inches) of each facet
  • aspect (float) – Aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
  • title (str) – the title of the figure
Returns:

a tuple containing the long-form dataframe and the matplotlib figure, which can be customized further

Return type:

(pandas.core.frame.DataFrame, matplotlib.figure.Figure)

Raises:

KeyError – if any of the features isn’t present in the datasets dict
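The difference between the two `kind` values can be shown numerically without plotting. The sketch below is an assumption about what is being compared (occurrences of each level versus their share within a dataset), not the package's code; `level_summary` is a hypothetical name.

```python
import pandas as pd


def level_summary(datasets, feature, kind="prop"):
    """Per-dataset count or proportion of each level of one feature.

    Hypothetical helper mirroring the 'count' / 'prop' options.
    """
    if kind not in ("count", "prop"):
        raise ValueError("kind must be 'count' or 'prop'")
    columns = {
        name: frame[feature].value_counts(normalize=(kind == "prop"))
        for name, frame in datasets.items()
    }
    # Levels absent from a dataset show up as 0 rather than NaN
    return pd.DataFrame(columns).fillna(0)


train = pd.DataFrame({"color": ["red", "red", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "blue"]})
datasets = {"train": train, "test": test}

counts = level_summary(datasets, "color", kind="count")
props = level_summary(datasets, "color", kind="prop")
print(props)
```

Proportions are usually the more useful view when the datasets differ in size, because raw counts would make the larger dataset dominate every level.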

traintestdiff.core.plot_continuous_diff(datasets, features, kind='box', col_wrap=3, size=4, aspect=1, title=None)

Plots the distribution differences of continuous features in each dataset

Parameters:
  • datasets (dict) – a dict where the keys are names and the values are pandas.DataFrame
  • features (list) – a list of continuous features present in every dataset of datasets
  • kind (str) – {point, bar, box, violin, strip} The kind of plot to draw.
  • col_wrap (int) – how many charts you want per row
  • size (float) – height (in inches) of each facet
  • aspect (float) – Aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
  • title (str) – the title of the figure
Returns:

a tuple containing the long-form dataframe and the matplotlib figure, which can be customized further

Return type:

(pandas.core.frame.DataFrame, matplotlib.figure.Figure)

Raises:

KeyError – if any of the features isn’t present in the datasets dict
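A box plot, the default `kind`, compares the quartiles of each feature across datasets. The same comparison can be read off as numbers with plain pandas; the following is a rough sketch, and `quartile_summary` and its column names are hypothetical, not part of the package.

```python
import pandas as pd


def quartile_summary(datasets, features):
    """Quartiles of each continuous feature, one row per (dataset, feature).

    Hypothetical helper for illustration only.
    """
    records = []
    for name, frame in datasets.items():
        for feature in features:
            # frame[feature] raises KeyError when the feature is missing
            q25, median, q75 = frame[feature].quantile([0.25, 0.5, 0.75])
            records.append({
                "dataset": name,
                "feature": feature,
                "q25": q25,
                "median": median,
                "q75": q75,
            })
    return pd.DataFrame(records).set_index(["dataset", "feature"])


train = pd.DataFrame({"age": [20, 30, 40, 50, 60]})
test = pd.DataFrame({"age": [30, 40, 50, 60, 70]})

summary = quartile_summary({"train": train, "test": test}, ["age"])
print(summary)
```

A shift in the median or in the spread between `q25` and `q75` from one dataset to another is the kind of difference the box plot makes visible.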

Module contents