Stratified Train-Test Split

This function splits a dataset into training and validation sets while preserving the class distribution.

Usage

To use this function, follow the example below:

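import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# stratified_train_test_split is assumed to be imported or defined already

# Load dataset
newsgroups_data = fetch_20newsgroups(subset='all')
df = pd.DataFrame({'text': newsgroups_data.data, 'label': newsgroups_data.target})

# Collect the unique class labels present in the dataset
classes = np.unique(df['label'].values)

# Split dataset: 80% training, 20% validation
train_data, validation_data = stratified_train_test_split(df, classes, train_size=0.8)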

Parameters

  • dataset (Union[pd.DataFrame, datasets.Dataset]): The input dataset.
  • classes (np.ndarray): The array of unique class labels present in the dataset.
  • train_size (Union[float, int]): The size of the training split. A float in the range (0, 1) is interpreted as the proportion of the dataset to include; an integer is interpreted as an absolute number of samples.

Returns

A tuple containing two dictionaries representing the training and validation data splits:

  • Each dictionary contains two keys: 'label' and 'text'.
  • The 'label' key corresponds to a list of class labels.
  • The 'text' key corresponds to a list of text samples.
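
The return format and the stratification idea can be illustrated with a minimal sketch. This is not the implementation of stratified_train_test_split, which is not shown here; the function name stratified_split_sketch, the seed parameter, and the per-class rounding rule are assumptions made for illustration, and only a pandas DataFrame with a fractional train_size is handled.

```python
import numpy as np
import pandas as pd


def stratified_split_sketch(df, classes, train_size=0.8, seed=0):
    """Hypothetical sketch: split each class separately, then merge the parts."""
    rng = np.random.default_rng(seed)
    labels = df["label"].to_numpy()
    train_idx, val_idx = [], []
    for cls in classes:
        # Row positions belonging to this class, in random order.
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Assumption: train_size is a fraction in (0, 1), applied per class.
        n_train = int(round(train_size * len(idx)))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:])

    def to_dict(positions):
        # Build the {'label': [...], 'text': [...]} structure described above.
        part = df.iloc[positions]
        return {"label": part["label"].tolist(), "text": part["text"].tolist()}

    return to_dict(train_idx), to_dict(val_idx)


# Toy data: two classes with five samples each.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(10)],
    "label": [0] * 5 + [1] * 5,
})
train, val = stratified_split_sketch(toy, np.unique(toy["label"]), train_size=0.8)
```

Because the split is made inside each class, both classes keep the same 80/20 ratio in the toy example, which is what "preserving the class distribution" means in practice.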

Notes

  • Ensure that the dataset contains columns named 'label' and 'text' representing the class labels and text samples, respectively.
  • The 'label' column should contain categorical class labels.
  • The 'text' column should contain textual data.
  • If the dataset is a pandas DataFrame, it should have one row per sample and one column per feature.
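
The column requirements above can be checked before splitting. The helper below is a hypothetical sketch, not part of the documented function; the name check_dataset_columns and the extra check for missing labels are assumptions, and it covers only the pandas DataFrame case.

```python
import pandas as pd

REQUIRED_COLUMNS = ("label", "text")


def check_dataset_columns(df):
    """Hypothetical helper: raise ValueError if the DataFrame is not usable."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"dataset is missing required column(s): {missing}")
    # Assumption: a missing label cannot be assigned to any class.
    if df["label"].isna().any():
        raise ValueError("'label' column contains missing values")


good = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
check_dataset_columns(good)  # passes silently
```

Failing early with a clear message is usually easier to debug than a KeyError raised from inside the split.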