ImageLayoutDataset

A PyTorch Dataset for image layout (document) data that handles tokenization, bounding-box alignment, and label alignment for token classification.

Arguments

  • data (List[Dict]): A list of dictionaries containing the image layout data. Each dictionary should contain at least the following keys (an illustrative entry is shown after this list):
    • 'words': List of words in the text.
    • 'bboxes': List of bounding boxes corresponding to each word.
    • 'ner_tags': List of named entity recognition tags.
  • tokenizer: The tokenizer used to encode the words (e.g. a Hugging Face tokenizer).
  • device (str, optional): The device where tensors will be placed. Defaults to 'cuda'.
  • encode (bool, optional): Whether to encode the data during initialization. Defaults to True.
  • tokenize_all_labels (bool, optional): Whether to assign a label to every sub-token of a word or only to its first sub-token. Defaults to False.
  • valid_labels_keymap (Dict, optional): A dictionary mapping valid labels to their corresponding token ids. Defaults to None.
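
For illustration, a single entry in data might look like the sketch below. The values are hypothetical; the bounding boxes follow the word-level [x0, y0, x1, y1] convention used by FUNSD:

# Hypothetical entry in `data` (illustrative values only)
entry = {
    "words": ["Invoice", "Number", ":", "12345"],
    "bboxes": [[82, 41, 148, 58], [152, 41, 210, 58], [214, 41, 220, 58], [226, 41, 270, 58]],
    "ner_tags": [1, 2, 2, 2],  # one integer NER label per word
}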

Methods

  1. tokenize_labels(ner_tags, tokens): Aligns the NER labels with the sub-tokens produced by the tokenizer (a sketch of this alignment follows the list).
  2. tokenize_boxes(words, boxes): Aligns the bounding boxes with the tokens and pads them to match the sequence length.
  3. encode(example): Encodes an example from the dataset.
  4. __getitem__(index): Retrieves an item from the dataset at the specified index.
  5. __len__(): Returns the length of the dataset.
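
The label-alignment step in tokenize_labels is the least obvious of these. The sketch below illustrates the general technique, assuming a Hugging Face fast tokenizer, word-level integer tags, and -100 as the ignore index; the actual implementation may differ in its details:

# Minimal sketch of word-to-sub-token label alignment (illustrative, not the exact implementation)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["Invoice", "Number"]      # hypothetical words
ner_tags = [1, 2]                  # one integer label per word
tokenize_all_labels = False        # mirrors the constructor flag

encoding = tokenizer(words, is_split_into_words=True)
word_ids = encoding.word_ids()     # token index -> source word index (None for special tokens)

labels, previous_word_id = [], None
for word_id in word_ids:
    if word_id is None:
        labels.append(-100)                  # ignore [CLS]/[SEP] in the loss
    elif word_id != previous_word_id or tokenize_all_labels:
        labels.append(ner_tags[word_id])     # label this sub-token with its word's tag
    else:
        labels.append(-100)                  # only the first sub-token of a word keeps the label
    previous_word_id = word_id

print(list(zip(encoding.tokens(), labels)))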

Attributes

  • tokenizer: The tokenizer used for tokenization.
  • device (str): The device where tensors will be placed.
  • valid_labels_keymap (Dict): A dictionary mapping valid labels to their corresponding token ids.
  • tokenize_all_labels (bool): Whether labels are assigned to every sub-token of a word or only to its first sub-token.
  • X (List): The encoded examples, or the raw data if encode is False.

Usage Example

from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from datasets import load_dataset
from few_shot_learning_nlp.few_shot_ner_image_documents.image_dataset import ImageLayoutDataset

# Load the FUNSD dataset
dataset = load_dataset("nielsr/funsd")

# Example tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example data from the FUNSD dataset
data = dataset["train"]

# Initialize the dataset
image_layout_dataset = ImageLayoutDataset(data, tokenizer)

# Get the length of the dataset
print("Dataset length:", len(image_layout_dataset))

# Get an item from the dataset
example_item = image_layout_dataset[0]
print("Example item:", example_item)

# DataLoader example
loader = DataLoader(image_layout_dataset, batch_size=4, shuffle=True)
for batch in loader:
    # The exact structure of `batch` depends on what encode() returns; for
    # dictionary-style items the default collate stacks each field into a tensor.
    for key, value in batch.items():
        print(key, value.shape)
    break