mSOP-765k: A Benchmark for Multi-Modal Structured Output Prediction

Abstract

This paper introduces mSOP-765k, a large-scale benchmark for evaluating multi-modal Structured Output Prediction (mSOP) pipelines. In addition to novel evaluation metrics, the benchmark provides combined training and test datasets with over 765,000 images taken from real-world product advertisements. Each image contains product visualizations, textual information such as the product name and brand, and numerical data such as product weight, price, and discount. All images are annotated with the corresponding structured information in the form of dictionaries containing key-value pairs. An initial baseline evaluation, covering various LLMs and VLMs as well as multi-modal RAG approaches, shows that the proposed benchmark poses a challenging problem that cannot yet be fully solved by state-of-the-art mSOP methods. The benchmark and dataset are available under a Creative Commons license:
https://huggingface.co/datasets/retail-product-promotion/mSOP-765k.
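
For concreteness, each image's annotation is a dictionary of key-value pairs. A single record might look like the following sketch; the field names and values are purely illustrative, based on the attributes named in the abstract (product name, brand, weight, price, discount), and are not the dataset's actual schema.

# Hypothetical annotation record. Field names and values are illustrative
# assumptions only; consult the dataset's parquet files for the real schema.
annotation = {
    "product_name": "Organic Whole Milk",
    "brand":        "Example Dairy Co.",
    "weight":       "1 L",
    "price":        1.99,
    "discount":     0.20,
}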

Data

Usage

You can download the dataset archives with the Hugging Face huggingface_hub library and load the annotations with pandas:

import os
import pandas as pd
import tarfile

from huggingface_hub import hf_hub_download, list_repo_files, login


repo_id     = "retail-product-promotion/mSOP-765k"
extract_dir = "your/path/to/extract/directory"
os.makedirs(extract_dir, exist_ok=True)

# Use your HF access token here
login(token="your/huggingface/token")

# 1. List all files in the repo
files = list_repo_files(repo_id=repo_id, repo_type="dataset")
# 2. Filter for .tar.gz files
tar_files = [f for f in files if f.endswith(".tar.gz")]
# 3. Download and extract each archive
for file in tar_files:
    print(f"Processing: {file}")
    archive_path = hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=file)
    
    extract_path = os.path.join(extract_dir, os.path.dirname(file))
    os.makedirs(extract_path, exist_ok=True)

    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=extract_path)


# 4. Load the train/test annotation tables as DataFrames
df_train = pd.read_parquet(
    hf_hub_download(repo_id=repo_id, repo_type="dataset", filename="train.parquet"),
    engine="pyarrow",
)
df_test = pd.read_parquet(
    hf_hub_download(repo_id=repo_id, repo_type="dataset", filename="test.parquet"),
    engine="pyarrow",
)
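
Once the archives are extracted and the annotation tables are loaded, you can pair a row with its image, for example as in the sketch below. Note that the column name image_path is an assumption made for illustration; inspect df_train.columns for the actual schema.

from PIL import Image

# The annotation columns are not documented here, so check them first.
print(df_train.columns.tolist())

# Hypothetical example: "image_path" is an assumed column name.
row   = df_train.iloc[0]
image = Image.open(os.path.join(extract_dir, row["image_path"]))
print(row.to_dict())  # the structured key-value annotation for this image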

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.