traintestdiff


Installation

$ pip install traintestdiff

Documentation and Examples

You can find the documentation at https://traintestdiff.readthedocs.io and a Jupyter notebook in example.

Overview

traintestdiff provides a simple way to explore differences between your train, validation and test data. Its main entry point is the class TrainTestDiff, whose only argument is a dict of the datasets you would like to explore.

In this example we’re going to explore the tips dataset provided by Seaborn:

import pandas as pd
import seaborn as sns

from traintestdiff import TrainTestDiff

tips = sns.load_dataset("tips")

# Split the data into train (80% of the rows) and test (the remaining 20%)
train = tips.sample(frac=0.8, random_state=0)
test = tips.drop(train.index)

Once you have your train and test sets, you’re ready to use TrainTestDiff:

datasets = {'train': train, 'test': test}
ttd = TrainTestDiff(datasets)
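Since the overview mentions validation data as well, the dict can presumably hold more than two entries. The following is a minimal sketch of a three-way split built with pandas only; the small DataFrame and the 60/20/20 proportions are hypothetical, and the TrainTestDiff call is left commented out so the snippet runs without the package installed.

```python
import pandas as pd

# Hypothetical stand-in data (not the tips dataset)
df = pd.DataFrame({"x": range(20), "label": ["a", "b"] * 10})

# 60% train, then split the remainder evenly into validation and test
train = df.sample(frac=0.6, random_state=0)
rest = df.drop(train.index)
validation = rest.sample(frac=0.5, random_state=0)
test = rest.drop(validation.index)

datasets = {"train": train, "validation": validation, "test": test}
# ttd = TrainTestDiff(datasets)  # requires traintestdiff to be installed

# Report the size of each split
for name, frame in datasets.items():
    print(name, len(frame))
```

The keys of the dict are the dataset names, so descriptive keys are worth choosing.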

The two main methods are plot_cat_diff and plot_cont_diff: the first one produces a plot of categorical features, and the second one a plot of continuous features.

long_form, fig1 = ttd.plot_cat_diff(features=['smoker', 'day', 'time'])
./examples/cat_diff.png

With plot_cont_diff we can explore the continuous features of the datasets:

longform_cont1, fig2 = ttd.plot_cont_diff(features=["total_bill", "size", "tip"], kind="box")
./examples/cont_diff.png

Long Form data and figures

As you can see from the code, both plot_cat_diff and plot_cont_diff return two values: a pandas.core.frame.DataFrame and a matplotlib.figure.Figure.
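The exact columns of the returned DataFrame are not described in this overview. As a rough illustration of what a long-form (tidy) comparison table can look like, the sketch below builds one with pandas alone. The column names dataset, feature, value and proportion are assumptions made for this example and are not necessarily the ones traintestdiff uses; the data is hypothetical as well.

```python
import pandas as pd

# Hypothetical miniature train and test sets
train = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "No"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
})
test = pd.DataFrame({
    "smoker": ["Yes", "No"],
    "time": ["Dinner", "Dinner"],
})
datasets = {"train": train, "test": test}

# One row per (dataset, feature, value) with the share of rows holding that value
rows = []
for name, frame in datasets.items():
    for feature in ["smoker", "time"]:
        proportions = frame[feature].value_counts(normalize=True)
        for value, proportion in proportions.items():
            rows.append({
                "dataset": name,
                "feature": feature,
                "value": value,
                "proportion": proportion,
            })

long_form = pd.DataFrame(rows)
print(long_form)
```

A table shaped like this is easy to filter, group or pass to a plotting library, which is the usual reason for preferring the long form.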

The DataFrame lets you explore the data in a tidy format, and the Figure lets you tweak how the plot looks. For example, let’s change the title:

fig1.suptitle("The same graph with other title")
fig1
./examples/title.png
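Because the second return value is a regular matplotlib Figure, the usual Figure methods should apply to it. The sketch below uses a stand-in figure created with plt.subplots (an assumption made so the example runs without traintestdiff) to show resizing and saving; the bar values and the title text are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Stand-in for a figure returned by plot_cat_diff or plot_cont_diff
fig, ax = plt.subplots()
ax.bar(["train", "test"], [0.6, 0.4])

# Tweak the figure: new title and a wider canvas
fig.suptitle("Share of smokers per dataset")
fig.set_size_inches(8, 4)

# Save to an in-memory buffer; pass a file path instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)
```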