mSOP-765k: A Benchmark for Multi-Modal Structured Output Prediction

Abstract

This paper introduces mSOP-765k, a large-scale benchmark for evaluating multi-modal Structured Output Prediction (mSOP) pipelines. In addition to novel evaluation metrics, the benchmark provides combined training and test datasets with over 765,000 images taken from real-world product advertisements. Each image contains product visualizations, textual information such as the product name and brand, and numerical data such as product weight, price, and discount. All images are annotated with the corresponding structured information in the form of dictionaries containing key-value pairs. An initial baseline evaluation, covering various LLMs and VLMs as well as multi-modal RAG approaches, shows that the proposed benchmark poses a challenging problem that cannot yet be fully solved by state-of-the-art mSOP methods. The benchmark and dataset are available under a Creative Commons license:
https://huggingface.co/datasets/retail-product-promotion/mSOP-765k.
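
For concreteness, each image's annotation is a dictionary of key-value pairs. A single record might look like the following sketch; the field names and values are purely illustrative, based on the attributes named in the abstract (product name, brand, weight, price, discount), and are not the dataset's actual schema.

# Hypothetical annotation record. Field names and values are illustrative
# assumptions only; consult the dataset's parquet files for the real schema.
annotation = {
    "product_name": "Organic Whole Milk",
    "brand":        "Example Dairy Co.",
    "weight":       "1 L",
    "price":        1.99,
    "discount":     0.20,
}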

Data

Usage

You can download the dataset archives with the Hugging Face huggingface_hub library and load the annotations with pandas:

import os
import pandas as pd
import tarfile

from huggingface_hub import hf_hub_download, list_repo_files, login


repo_id     = "retail-product-promotion/mSOP-765k"
extract_dir = "your/path/to/extract/directory"
os.makedirs(extract_dir, exist_ok=True)

# Use your HF access token here
login(token="your/huggingface/token")

# 1. List all files in the repo
files = list_repo_files(repo_id=repo_id, repo_type="dataset")
# 2. Filter for .tar.gz files
tar_files = [f for f in files if f.endswith(".tar.gz")]
# 3. Download and extract each archive
for file in tar_files:
    print(f"Processing: {file}")
    archive_path = hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=file)
    
    extract_path = os.path.join(extract_dir, os.path.dirname(file))
    os.makedirs(extract_path, exist_ok=True)

    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=extract_path)


# 4. Load the train/test annotation tables as DataFrames
df_train = pd.read_parquet(
    hf_hub_download(repo_id=repo_id, repo_type="dataset", filename="train.parquet"),
    engine="pyarrow",
)
df_test = pd.read_parquet(
    hf_hub_download(repo_id=repo_id, repo_type="dataset", filename="test.parquet"),
    engine="pyarrow",
)
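
Once the archives are extracted and the annotation tables are loaded, you can pair a row with its image, for example as in the sketch below. Note that the column name image_path is an assumption made for illustration; inspect df_train.columns for the actual schema.

from PIL import Image

# The annotation columns are not documented here, so check them first.
print(df_train.columns.tolist())

# Hypothetical example: "image_path" is an assumed column name.
row   = df_train.iloc[0]
image = Image.open(os.path.join(extract_dir, row["image_path"]))
print(row.to_dict())  # the structured key-value annotation for this image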

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.