QueryGym Leaderboard
Reproducible benchmarks for LLM query reformulation.

Run detail

5c68574a029be6c6
Dataset: beir-v1.0.0-arguana
Method: Q2D (FS)
Model: gpt-4.1-nano
Retriever: BGE-base-en-v1.5 (dense)
params_hash: 7b44f539
Queries: 1406

Metrics

ndcg_cut_10 0.6188
recall_100 0.9900

Reproduce this run

Three steps: (1) reformulate the queries with QueryGym's example pipeline, (2) run retrieval with Pyserini, (3) evaluate with trec_eval.

1. reformulate
python examples/querygym_pyserini/pipeline.py \
    --dataset beir-v1.0.0-arguana \
    --method query2doc \
    --model openai/gpt-4.1-nano \
    --steps reformulate \
    --temperature 1 \
    --max-tokens 128 \
    --method-params '{"mode":"fs","num_examples":4,"train_split":"train"}' \
    --output-dir outputs/reproduce
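Before moving on to retrieval, it may help to sanity-check the reformulated topics file. The sketch below assumes the file is a two-column, tab-separated `qid<TAB>query` file (the layout Pyserini accepts for TSV topics); the helper name `check_topics_tsv` is hypothetical and not part of QueryGym.

```python
import csv
import io


def check_topics_tsv(handle, expected_count):
    """Validate a (qid, query) TSV stream and return the parsed rows.

    Assumption: one query per line, exactly two tab-separated columns.
    """
    rows = []
    seen_ids = set()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        qid, query = row
        if not query.strip():
            raise ValueError(f"line {line_no}: empty query for qid {qid!r}")
        if qid in seen_ids:
            raise ValueError(f"line {line_no}: duplicate qid {qid!r}")
        seen_ids.add(qid)
        rows.append((qid, query))
    if len(rows) != expected_count:
        raise ValueError(f"expected {expected_count} queries, found {len(rows)}")
    return rows


# Tiny in-memory demonstration; a real check would open the TSV on disk.
demo = io.StringIO("q1\tfirst reformulated query\nq2\tsecond reformulated query\n")
print(len(check_topics_tsv(demo, expected_count=2)), "queries look well-formed")
```

For this run, the same check would be applied to `outputs/reproduce/queries/reformulated_queries.tsv` with `expected_count=1406`, the query count listed above.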
2. retrieve (BGE-base-en-v1.5)
python -m pyserini.search.faiss \
  --threads 16 --batch-size 128 \
  --index beir-v1.0.0-arguana.bge-base-en-v1.5 \
  --topics outputs/reproduce/queries/reformulated_queries.tsv \
  --encoder BAAI/bge-base-en-v1.5 \
  --output run.txt \
  --hits 1000
3. evaluate
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.100 \
  beir-v1.0.0-arguana-test run.txt
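To make the two reported measures concrete, here is a minimal sketch of nDCG@10 and Recall@100 for ranked lists. It assumes linear gain and a log2 rank discount, which is my understanding of how trec_eval computes `ndcg_cut`, and it averages over every query in the qrels, mirroring the `-c` flag. It is illustrative only and does not replace trec_eval; all function names are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k with linear gain and log2(rank + 1) discount (assumed convention)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_doc_ids, qrels, k=100):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, g in qrels.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_doc_ids[:k])) / len(relevant)


def mean_over_queries(metric, run, all_qrels, k):
    """Average a metric over all judged queries; missing queries score 0."""
    scores = [metric(run.get(qid, []), rels, k) for qid, rels in all_qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: query ids mapped to judged documents and to ranked results.
toy_qrels = {"q1": {"d1": 1}, "q2": {"d7": 1}}
toy_run = {"q1": ["d1", "d2"], "q2": ["d3", "d7"]}
print("nDCG@10:", round(mean_over_queries(ndcg_at_k, toy_run, toy_qrels, 10), 4))
print("Recall@100:", round(mean_over_queries(recall_at_k, toy_run, toy_qrels, 100), 4))
```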

Artifacts

Config

config.json
{
  "method_params": {
    "mode": "fs",
    "num_examples": 4,
    "dataset_type": "msmarco",
    "collection_path": "/mnt/data/son/data/msmarco/collection.tsv",
    "train_queries_path": "/mnt/data/son/data/msmarco/queries.train.tsv",
    "train_qrels_path": "/mnt/data/son/data/msmarco/qrels.train.tsv",
    "train_split": "train"
  },
  "llm_config": {
    "temperature": 1,
    "max_tokens": 128
  },
  "dataset_config": {
    "topics": "beir-v1.0.0-arguana-test",
    "index": "beir-v1.0.0-arguana.flat",
    "num_queries": 1406
  },
  "retrieval": {
    "retriever_id": "bge-base-en-v1.5",
    "paradigm": "dense",
    "params": {
      "encoder": "BAAI/bge-base-en-v1.5"
    }
  }
}
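The `retrieval` block of this config can be turned back into the step 2 command. The sketch below is a hedged illustration: the helper `build_retrieval_argv` is hypothetical, and the rule that the dense index is named `<dataset>.<retriever_id>` is an assumption that happens to match this run. Note that `dataset_config.index` records a different name (`beir-v1.0.0-arguana.flat`), so the sketch does not read that field.

```python
import json
import shlex

# Subset of the config shown above, inlined so the example is self-contained.
CONFIG_JSON = """
{
  "retrieval": {
    "retriever_id": "bge-base-en-v1.5",
    "paradigm": "dense",
    "params": {"encoder": "BAAI/bge-base-en-v1.5"}
  }
}
"""


def build_retrieval_argv(config, dataset, topics_path, output_path="run.txt", hits=1000):
    """Build a pyserini.search.faiss argument list from a run config."""
    retrieval = config["retrieval"]
    if retrieval["paradigm"] != "dense":
        raise ValueError("this sketch only covers dense retrieval")
    # Assumption: the prebuilt dense index is named "<dataset>.<retriever_id>".
    index_name = f"{dataset}.{retrieval['retriever_id']}"
    return [
        "python", "-m", "pyserini.search.faiss",
        "--index", index_name,
        "--topics", topics_path,
        "--encoder", retrieval["params"]["encoder"],
        "--output", output_path,
        "--hits", str(hits),
    ]


argv = build_retrieval_argv(
    json.loads(CONFIG_JSON),
    dataset="beir-v1.0.0-arguana",
    topics_path="outputs/reproduce/queries/reformulated_queries.tsv",
)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would launch the search; the threading and batch-size flags from step 2 are omitted here because the config does not record them.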