BEIR Dataset for Zero-shot Evaluation of IR Models

The BEIR information retrieval (IR) benchmark was introduced in the paper “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models”, published in 2021. The paper brings together 18 publicly available datasets to evaluate 10 IR models. The task: given a query, retrieve the relevant passages/documents as a ranked list. Models are evaluated using nDCG@10.
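
Since every dataset is scored the same way, it is worth pinning down the metric. Below is a minimal sketch of nDCG@10 for a single query, using the standard formulation (linear gain, log2 discount); the doc IDs and relevance grades are invented for illustration.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: doc IDs in the order the system returned them.
    relevance:  dict mapping doc ID -> graded relevance (unjudged docs count as 0).
    """
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the only relevant docs are d1 (grade 2) and d2 (grade 1);
# the system ranked d1 second and missed d2 entirely.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d2": 1}))  # ~0.48
```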

Datasets

The domains and their datasets are:

  • Bio-medical IR: Given a biomedical scientific query, retrieve bio-medical documents.
    • TREC-COVID: An ad-hoc search challenge based on the CORD-19 dataset, which contains scientific articles related to the COVID-19 pandemic.
    • NFCorpus: Natural language queries from NutritionFacts, with medical documents from PubMed as the target corpus.
    • BioASQ: Biomedical semantic QA. Articles from PubMed as the target corpus.
  • Open domain QA:
    • Natural Questions [KPR+]: Given a Google search query, return relevant Wikipedia passages.
    • HotpotQA: Each question requires reasoning over multiple Wikipedia passages to get the answer.
    • FiQA-2018: Financial-domain opinion-based QA, mined from StackExchange posts under the Investment topic.
  • Tweet retrieval:
    • Signal-1M related tweets: Given a news article title, retrieve relevant tweets.
  • News retrieval:
    • TREC-NEWS: Given a news headline, retrieve relevant news articles that provide important context or background information.
    • Robust04: TREC task focusing on poorly performing topics, where queries are single sentences.
  • Argument retrieval:
    • ArguAna Counterargs corpus: Given an argument, retrieve the best counterargument. Scraped from an online debate portal.
    • Touché-2020: A conversational argument retrieval task.
  • Duplicate question retrieval: given a query, retrieve its duplicate questions.
    • CQADupStack: A query is a title + body, drawn from StackExchange subforums.
    • Quora: Duplicate question detection from Quora.
  • Entity retrieval: retrieve Wikipedia pages (title + abstract) for entities mentioned in the query.
    • DBPedia-Entity-v2: Given queries containing entities, retrieve entities from English DBpedia.
  • Citation prediction:
    • SCIDOCS: Given a paper title, retrieve cited papers from a list of 5 cited and 25 uncited papers.
  • Fact checking: a sentence-level claim is the input, and the relevant document passages verifying the claim are the output.
    • FEVER [TVC+]: claims verified against introductory sections of Wikipedia pages.
    • Climate-FEVER: Climate claims verified against Wikipedia articles.
    • SciFact: Verifies scientific claims against scientific paper abstracts.
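
All of the above ship in a unified format (a corpus, a set of queries, and qrels, i.e. relevance judgments) and can be loaded with the authors' beir Python package. A minimal sketch, assuming pip install beir and SciFact as the example dataset; the download URL follows the pattern in the BEIR repository README and may have moved since this was written.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unzip one of the BEIR datasets (SciFact is among the smallest).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus:  doc_id   -> {"title": ..., "text": ...}
# queries: query_id -> query text
# qrels:   query_id -> {doc_id: relevance grade}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "documents,", len(queries), "queries")
```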

Evaluation

The following IR models perform well:

  • BM25 is a strong baseline.
  • docT5query: First train a T5 (base) sequence-to-sequence model on the MS MARCO dataset to generate queries for a given document. Then append the generated queries (up to 40 per document) to each original document in the retrieval corpus, and index the expanded documents with BM25. This improves over plain BM25 by 1.6% (a sketch of the expansion step follows this list).
  • ColBERT (refer to the summary on ColBERTv2)
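
To make the docT5query expansion step concrete, here is a rough sketch using the publicly released doc2query checkpoint on the Hugging Face hub (castorini/doc2query-t5-base-msmarco); the generation settings are my own choices, and the BM25 indexing step is left out.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Public docT5query checkpoint trained on MS MARCO.
name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

doc = "BEIR is a heterogeneous benchmark for zero-shot evaluation of IR models."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,          # top-k sampling yields diverse queries
    top_k=10,
    num_return_sequences=5,  # the paper appends up to 40; 5 keeps the demo cheap
)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Append the generated queries to the document text, then index with BM25 as usual.
expanded_doc = doc + " " + " ".join(queries)
print(expanded_doc)
```
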
Written on December 28, 2022