Stratified Train-Test Split

stratified_train_test_split splits a labelled text dataset into training and validation sets so that each class appears in both splits in roughly the same proportion as in the full dataset.

Usage

To use this function, follow the example below:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load dataset
newsgroups_data = fetch_20newsgroups(subset='all')
df = pd.DataFrame({'text': newsgroups_data.data, 'label': newsgroups_data.target})
classes = np.unique(df['label'].values)

# Split dataset
train_data, validation_data = stratified_train_test_split(df, classes, train_size=0.8)
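One way to check that a split really preserved the class distribution is to compare per-class proportions before and after. The following is a minimal sketch using only the standard library; `class_proportions` is a hypothetical helper, not part of this function's API, and the toy label lists stand in for `df['label']` and `train_data['label']`.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy labels standing in for df['label'] and train_data['label'].
full_labels = [0] * 10 + [1] * 5
train_labels = [0] * 8 + [1] * 4

full_props = class_proportions(full_labels)
train_props = class_proportions(train_labels)

# Largest absolute difference in class share between the two label sets.
max_drift = max(abs(full_props[c] - train_props.get(c, 0.0)) for c in full_props)
print(f"largest per-class drift: {max_drift:.4f}")
```

With real data, the same comparison would be `class_proportions(df['label'])` against `class_proportions(train_data['label'])`; a small drift is expected because class counts rarely divide evenly.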

Parameters

dataset (Union[pd.DataFrame, datasets.Dataset]): The input dataset.
classes (np.ndarray): The array of unique class labels present in the dataset.
train_size (Union[float, int]): The size of the training split. A float in the range (0, 1) is read as the fraction of the dataset to include; an integer is read as the absolute number of training samples.
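The two forms of train_size can be reduced to a single sample count before splitting. The sketch below shows one plausible interpretation; `resolve_train_count` is a hypothetical helper, and both the rounding rule (truncation toward zero) and the error handling are assumptions rather than documented behaviour of this function.

```python
def resolve_train_count(n_samples, train_size):
    """Hypothetical helper: convert train_size into a number of training samples."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(train_size, bool):
        raise TypeError("train_size must be a float or an int")
    if isinstance(train_size, float):
        if not 0.0 < train_size < 1.0:
            raise ValueError("a float train_size must lie in (0, 1)")
        # Assumption: truncate, so the training set never exceeds the fraction.
        return int(n_samples * train_size)
    if isinstance(train_size, int):
        if not 0 < train_size < n_samples:
            raise ValueError("an int train_size must lie in (0, n_samples)")
        return train_size
    raise TypeError("train_size must be a float or an int")


# Both calls describe the same split of an 8-sample dataset.
count_from_fraction = resolve_train_count(8, 0.75)
count_from_integer = resolve_train_count(8, 6)
```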

Returns

A tuple of two dictionaries, the training split followed by the validation split:

Each dictionary contains two keys: 'label' and 'text'.
The 'label' key corresponds to a list of class labels.
The 'text' key corresponds to a list of text samples.
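To make the return structure concrete, here is a minimal pure-Python sketch of a per-class split that produces two dictionaries of this shape. It is not the implementation behind stratified_train_test_split; `stratified_split_sketch`, its `train_fraction` and `seed` parameters, and the per-class truncation are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split_sketch(texts, labels, train_fraction=0.8, seed=0):
    """Illustrative only: split each class separately, return two dicts."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train = {"label": [], "text": []}
    validation = {"label": [], "text": []}
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Assumption: truncate the per-class training count.
        cut = int(len(indices) * train_fraction)
        for i in indices[:cut]:
            train["label"].append(label)
            train["text"].append(texts[i])
        for i in indices[cut:]:
            validation["label"].append(label)
            validation["text"].append(texts[i])
    return train, validation


# Toy data: 8 samples of class 0 and 4 samples of class 1.
texts = [f"doc {i}" for i in range(12)]
labels = [0] * 8 + [1] * 4
train_data, validation_data = stratified_split_sketch(texts, labels, train_fraction=0.75)
```

Because the split is done class by class, both dictionaries keep the 2:1 ratio of the toy data. With a DataFrame, the inputs could be passed as `df['text'].tolist()` and `df['label'].tolist()`.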

Notes

Ensure that the dataset contains columns named 'label' and 'text' representing the class labels and text samples, respectively.
The 'label' column should contain categorical class labels.
The 'text' column should contain textual data.
If the dataset is a pandas DataFrame, each row should represent one sample, with 'label' and 'text' as columns.
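The column requirements above can be checked before calling the function. A minimal sketch, assuming only that the column names are available as an iterable of strings; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration.

```python
REQUIRED_COLUMNS = ("label", "text")


def missing_columns(column_names):
    """Return the required column names that are absent from column_names."""
    present = set(column_names)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example with plain lists of column names.
ok = missing_columns(["text", "label", "id"])
incomplete = missing_columns(["text"])
if incomplete:
    print(f"dataset is missing required columns: {incomplete}")
```

For a pandas DataFrame the names come from `df.columns`; for a `datasets.Dataset` they come from `dataset.column_names`.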
