traintestdiff package
Submodules
traintestdiff.core module
class traintestdiff.core.TrainTestDiff(datasets)
Bases: object
Helper class to ease distribution analysis on the same datasets.
traintestdiff.core.categorical_longform(datasets, features)
Given datasets and features, returns a long-form representation of them.
Parameters:
- datasets (dict) – each key is a dataset name and each value is a pandas.DataFrame
- features (list) – a list of string features present in the datasets
Returns: A tidy long-form data frame
Return type: pandas.core.frame.DataFrame
Raises: KeyError – if any of the features isn’t present in the datasets dict
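To make "long form" concrete, here is a minimal sketch of the reshaping idea using plain pandas. It is not the library's implementation: the helper name longform_sketch and the output column names dataset, feature, and value are assumptions chosen for illustration.

```python
import pandas as pd


def longform_sketch(datasets, features):
    """Stack the chosen columns of every dataset into one tidy frame.

    Hypothetical helper; output column names are illustrative only.
    """
    frames = []
    for name, frame in datasets.items():
        # frame[features] raises KeyError if any feature is missing.
        melted = frame[features].melt(var_name="feature", value_name="value")
        # Record which dataset each row came from.
        melted.insert(0, "dataset", name)
        frames.append(melted)
    return pd.concat(frames, ignore_index=True)


datasets = {
    "train": pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]}),
    "test": pd.DataFrame({"color": ["red"], "size": ["L"]}),
}
tidy = longform_sketch(datasets, ["color", "size"])
```

Each row of the result holds one observation of one feature from one dataset, which is the shape that faceted plotting tools generally expect.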
traintestdiff.core.continuous_longform(datasets, features)
Given datasets and features, returns a long-form representation of them.
Parameters:
- datasets (dict) – each key is a dataset name and each value is a pandas.DataFrame
- features (list) – a list of string features present in the datasets
Returns: A tidy long-form data frame
Return type: pandas.core.frame.DataFrame
Raises: KeyError – if any of the features isn’t present in the datasets dict
traintestdiff.core.datasets_from_frame(dataframe, feature)
Creates a datasets dict from a dataframe.
Given a categorical feature, it creates a dict where each key is a level of the feature and each value is a dataframe. You can then use this datasets dict to plot graphs.
Parameters:
- dataframe (pandas.DataFrame) – the frame used to create the datasets dict
- feature (str) – the feature used for grouping and creating the datasets dict
Returns: A dict where keys are levels of feature and values are pandas.core.frame.DataFrame from a dataframe.groupby(feature)
Return type: dict
Raises: KeyError – if feature is not present in dataframe
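The grouping described above can be sketched with pandas alone. This is an illustration of the documented behavior under the assumption that the dict is built directly from dataframe.groupby(feature); the helper name datasets_from_frame_sketch and the sample columns are hypothetical.

```python
import pandas as pd


def datasets_from_frame_sketch(dataframe, feature):
    """Return {level: sub-frame} for each level of `feature`.

    Hypothetical stand-in for illustration, not the library's code.
    """
    # groupby raises KeyError when `feature` is not a column.
    return {level: group for level, group in dataframe.groupby(feature)}


frame = pd.DataFrame(
    {"split": ["train", "train", "test"], "age": [31, 45, 28]}
)
datasets = datasets_from_frame_sketch(frame, "split")
```

The resulting dict has the same shape as the datasets argument accepted by the other functions in this module: names as keys, data frames as values.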
traintestdiff.core.plot_categorical_diff(datasets, features, kind='prop', col_wrap=4, size=4, aspect=1, title=None)
Plots the distribution differences of categorical features in each dataset.
Parameters:
- datasets (dict) – a dict where the keys are names and the values are pandas.DataFrame
- features (list) – a list of categorical features present in every dataset of datasets
- kind (Optional[str]) – {count, prop} Use “count” for the count of values for every level of a feature in every dataset present in datasets. Use “prop” for the proportion of that level of a feature.
- col_wrap (int) – how many charts you want per row
- size (float) – height (in inches) of each facet
- aspect (float) – aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
- title (str) – the title of the figure
Returns: a tuple with a long-form data frame and a matplotlib figure to customize
Return type: (pandas.core.frame.DataFrame, matplotlib.figure.Figure)
Raises: KeyError – if any of the features isn’t present in the datasets dict
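The difference between the two kind values can be shown on a toy example. The sketch below assumes that “count” means the number of rows per level and “prop” means that count divided by the dataset's row count; the helper level_summary is hypothetical and does no plotting.

```python
import pandas as pd


def level_summary(datasets, feature, kind="prop"):
    """Per-dataset counts or proportions for each level of `feature`.

    Hypothetical helper illustrating the two `kind` options.
    """
    if kind not in ("count", "prop"):
        raise ValueError("kind must be 'count' or 'prop'")
    normalize = kind == "prop"
    return {
        name: frame[feature].value_counts(normalize=normalize).to_dict()
        for name, frame in datasets.items()
    }


datasets = {
    "train": pd.DataFrame({"color": ["red", "red", "blue", "red"]}),
    "test": pd.DataFrame({"color": ["red", "blue"]}),
}
counts = level_summary(datasets, "color", kind="count")
props = level_summary(datasets, "color", kind="prop")
```

Proportions are usually the better choice when the datasets differ in size, since raw counts of a large training set would dwarf those of a small test set.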
traintestdiff.core.plot_continuous_diff(datasets, features, kind='box', col_wrap=3, size=4, aspect=1, title=None)
Plots the distribution differences of continuous features in each dataset.
Parameters:
- datasets (dict) – a dict where the keys are names and the values are pandas.DataFrame
- features (list) – a list of continuous features present in every dataset of datasets
- kind (str) – {point, bar, box, violin, strip} The kind of plot to draw.
- col_wrap (int) – how many charts you want per row
- size (float) – height (in inches) of each facet
- aspect (float) – aspect ratio of each facet, so that aspect * size gives the width of each facet in inches
- title (str) – the title of the figure
Returns: a tuple with a long-form data frame and a matplotlib figure to customize
Return type: (pandas.core.frame.DataFrame, matplotlib.figure.Figure)
Raises: KeyError – if any of the features isn’t present in the datasets dict
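As a worked example of how size, aspect, and col_wrap interact, the sketch below estimates the overall grid dimensions. It assumes one facet per feature and ignores padding, titles, and legends, so the real figure may differ; facet_grid_inches is a hypothetical helper, not part of the package.

```python
import math


def facet_grid_inches(n_features, col_wrap=3, size=4, aspect=1):
    """Approximate (width, height) in inches of a wrapped facet grid.

    Assumes one facet per feature and no padding between facets.
    """
    cols = min(n_features, col_wrap)
    rows = math.ceil(n_features / col_wrap)
    facet_width = aspect * size  # width of a single facet
    return cols * facet_width, rows * size


# Five features, three per row, 4-inch-tall facets at 1.5 aspect ratio.
width, height = facet_grid_inches(5, col_wrap=3, size=4, aspect=1.5)
```

Here each facet is 6 inches wide and 4 inches tall, and the five facets wrap onto two rows of at most three.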