## About this leaderboard
The QueryGym Leaderboard tracks reproducible query-reformulation results across IR benchmarks (BEIR, MS MARCO, TREC DL). Every row is backed by:
- a JSON file conforming to `reproducibility/schema.json` v1,
- a TREC-format `.run.txt` for re-evaluation, and
- the reformulated queries TSV used to produce the run file.

All three live in the repository under `reproducibility/data/runs/{dataset}/{method}/{model}/`.
Citing a number is as simple as linking the commit + the run JSON.
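Each `.run.txt` uses the standard six-column TREC run format (`qid Q0 docid rank score tag`). As a minimal sketch of loading one back for re-evaluation (the format is standard; the function name and path here are illustrative, not part of the toolkit):

```python
from collections import defaultdict

def read_trec_run(path: str) -> dict[str, list[tuple[str, float]]]:
    """Parse a TREC-format run file into {qid: [(docid, score), ...]},
    with each query's documents sorted by descending score."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _q0, docid, _rank, score, _tag = line.split()
            run[qid].append((docid, float(score)))
    for qid in run:
        run[qid].sort(key=lambda pair: pair[1], reverse=True)
    return dict(run)
```

Re-sorting by score rather than trusting the stored rank column mirrors what `trec_eval` itself does.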
## Submitting a result
Run the example pipeline, then use `submit_run.py` and open a PR:

```shell
python examples/querygym_pyserini/pipeline.py \
  --dataset msmarco-v1-passage.trecdl2019 \
  --method query2e --model gpt-4.1-mini \
  --output-dir outputs/dl19_query2e
python -m reproducibility.scripts.submit_run --from-dir outputs/dl19_query2e
make repro-aggregate
git add reproducibility/data/ && git commit -m "..." && git push
gh pr create
```

Full guide: Reproducibility User Guide ↗
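The `make repro-aggregate` step rebuilds the leaderboard rows from the checked-in run JSONs. A hedged sketch of what such an aggregation could look like, assuming only the `reproducibility/data/runs/{dataset}/{method}/{model}/` layout above (the `metrics` field is hypothetical; consult `reproducibility/schema.json` for the real shape):

```python
import json
from pathlib import Path

def aggregate_runs(root: str = "reproducibility/data/runs") -> list[dict]:
    """Collect every run JSON under root into leaderboard rows.
    Assumed layout: {root}/{dataset}/{method}/{model}/*.json."""
    rows = []
    for path in sorted(Path(root).glob("*/*/*/*.json")):
        dataset, method, model = path.parts[-4:-1]
        run = json.loads(path.read_text())
        rows.append({
            "dataset": dataset,
            "method": method,
            "model": model,
            # 'metrics' is an assumed field name, not confirmed by the schema.
            "metrics": run.get("metrics", {}),
        })
    return rows
```

Deriving `dataset`/`method`/`model` from the directory path keeps each run JSON self-locating, so a leaderboard row can be cited by commit plus file path alone.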
## Papers
- WWW 2026 Demos — QueryGym toolkit paper. arXiv 2511.15996 ↗
- SIGIR 2026 Reproducibility Track — multi-LLM baseline reproduction (link TBD).