SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Overview

img

THUIR License made-with-python code-size

Introduction

This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARESsimple and TransformerICT models, and fine-tune all pre-trained models with the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES checkpoint ARESsimple from Google drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries {"id" qid, "ids": token_ids} for each line
  • queries.docdev.jsonl: MS MARCO validating queries {"id" qid, "ids": token_ids} for each line
  • queries.dl2019.jsonl: TREC DL 2019 queries {"id" qid, "ids": token_ids} for each line

Human label files:

  • msmarco-doctrain-qrels.tsv: qid 0 docid 1 for training set
  • dev-qrels.txt: qid relevant_docid for validating set
  • 2019qrels-docs.txt: qid relevant_docid for TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom can be one of the ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3'], each line is an axiomatic score for a query

Quick Start

Note that to accelerate the training process, we adopt the parallel training technique. The scripts for pre-training and fine-tuning are as follow:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python  -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model XGB_DIR

Here model type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path SAVE_DIR \
    --model_path CKPT_NAME

Results

Zero-shot performance:

Model Name MS MARCO [email protected] MS MARCO [email protected] DL [email protected] DL [email protected] COVID EQ
BM25 0.2962 0.3107 0.5776 0.4795 0.4857 0.6690
BERT 0.1820 0.2012 0.4059 0.4198 0.4314 0.6055
PROPwiki 0.2429 0.2596 0.5088 0.4525 0.4857 0.5991
PROPmarco 0.2763 0.2914 0.5317 0.4623 0.4829 0.6454
ARESstrict 0.2630 0.2785 0.4942 0.4504 0.4786 0.6923
AREShard 0.2627 0.2780 0.5189 0.4613 0.4943 0.6822
ARESsimple 0.2991 0.3130 0.5955 0.4863 0.4957 0.6916

Few-shot performance: img

Visualization (attribution values have been normalized within a document): img

Citation

If you find our work useful, please do not save your star and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via [email protected] or open an issue.
  • Code for data preprocessing will come soon. Please stay tuned~
Owner
Jia Chen
My life is a beauty. 🦋
Jia Chen
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

The training code for the 4th place model at MDX 2021 leaderboard A.

The training code for the 4th place model at MDX 2021 leaderboard A.

Chin-Yun Yu 32 Dec 18, 2022
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

3.4k Dec 27, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
It analyze the sentiment of the user, whether it is postive or negative.

Sentiment-Analyzer-Tool It analyze the sentiment of the user, whether it is postive or negative. It uses streamlit library for creating this sentiment

Paras Patidar 18 Dec 17, 2022
Conditional probing: measuring usable information beyond a baseline

Conditional probing: measuring usable information beyond a baseline

John Hewitt 20 Dec 15, 2022
OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

The first online catalogue for Arabic NLP datasets.

Masader The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each datas

ARBML 94 Dec 26, 2022
Protein Language Model

ProteinLM We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing P

THUDM 77 Dec 27, 2022
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization

Line as a Visual Sentence with LineTR This repository contains the inference code, pretrained model, and demo scripts of the following paper. It suppo

SungHo Yoon 158 Dec 27, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

SpeechMix Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together. Introduction For the same input: from datas

Eric Lam 31 Nov 07, 2022
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
Finetune gpt-2 in google colab

gpt-2-colab finetune gpt-2 in google colab sample result (117M) from retraining on A Tale of Two Cities by Charles Di

212 Jan 02, 2023
A collection of models for image - text generation in ACM MM 2021.

Bi-directional Image and Text Generation UMT-BITG (image & text generator) Unifying Multimodal Transformer for Bi-directional Image and Text Generatio

Multimedia Research 63 Oct 30, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022