Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Last update: Oct 26, 2021

Official code repository for the paper "Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation" (SDP@NAACL 2021): https://aclanthology.org/2021.sdp-1.2/

Abstract

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pre-trained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.

Dependencies

Python 3.7.9
PyTorch 1.7.0
Transformers 4.3

Run

1. Installing Anserini

We use Anserini, an open-source information retrieval toolkit.

# install maven
sudo apt-get install maven

# clone and build Anserini
git clone https://github.com/castorini/anserini.git --recurse-submodules
cd anserini/
# before building, change the jacoco version from 0.8.2 to 0.8.3 in pom.xml so the build succeeds
mvn clean package appassembler:assemble

# compile evaluation tools and other scripts
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

2. Data Preprocessing

python 0_0_extract_text.py
python 0_1_convert_qrels_to_binary.py
python 0_2_convert_qrels_to_ndcg_scale.py
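
The preprocessing scripts are not described further here. As a rough illustration of what converting relevance judgments to binary labels can look like, the sketch below assumes the standard four-column TREC qrels layout (query id, iteration, document id, grade) and treats any grade at or above a threshold as relevant. The function name `binarize_qrels` and the `min_relevant` parameter are hypothetical and are not taken from `0_1_convert_qrels_to_binary.py`.

```python
def binarize_qrels(lines, min_relevant=1):
    """Map graded TREC qrels lines to binary labels.

    Assumption: each line is "qid iteration docid grade"; the threshold
    is illustrative and may differ from the repository's script.
    """
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, iteration, docid, grade = parts
        label = 1 if int(grade) >= min_relevant else 0
        out.append(f"{qid} {iteration} {docid} {label}")
    return out


# Toy judgments, made up for illustration.
graded = ["q1 0 d1 2", "q1 0 d2 0", "q2 0 d3 1"]
binary = binarize_qrels(graded)
```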

3. Data Tokenization

python 1_convert_text_to_tokenized.py

4. Abstractive Generation with Stochastic Text Generation

python 2_abstract_summary_multi.py

The output of this abstractive, stochastic generation step is included in this repository (test_pegasus_xsum_4mc.tar.gz).
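
The abstract states that embeddings are stochastically perturbed to obtain more diverse sentences, but this README does not spell out the mechanism. The sketch below shows one common way to perturb a vector, dropout-style masking with rescaling, purely as an assumption about what such a step could look like. The names `dropout_perturb` and `sample_perturbations`, the sample count, and the drop probability are all hypothetical and are not taken from `2_abstract_summary_multi.py`.

```python
import random


def dropout_perturb(embedding, drop_prob, rng):
    """Zero each component with probability drop_prob and rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else x * keep_scale for x in embedding]


def sample_perturbations(embedding, num_samples=4, drop_prob=0.1, seed=0):
    """Draw several independently perturbed copies of one embedding."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [dropout_perturb(embedding, drop_prob, rng) for _ in range(num_samples)]


# Toy vector, made up for illustration.
embedding = [0.5, -1.0, 0.25, 2.0]
variants = sample_perturbations(embedding, num_samples=4, drop_prob=0.5)
```

In a full pipeline, each perturbed representation would be decoded by the pre-trained language model into its own supplementary sentence; that decoding step is omitted here.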

5. Convert to JSON format

This step follows the https://github.com/nyu-dl/dl4ir-doc2query repository.

python 3_concat_collection_summary_to_json.py
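
As a minimal sketch of this step, the code below appends generated sentences to the original document text and serializes the result as one JSON object per line, using the `id` and `contents` fields that Anserini's JSON collection format expects. The helper names `expand_document` and `to_jsonl` and the sample strings are hypothetical; the actual logic in `3_concat_collection_summary_to_json.py` may differ.

```python
import json


def expand_document(doc_id, text, generated_sentences):
    """Concatenate the original text with its generated sentences."""
    contents = " ".join([text, *generated_sentences])
    return {"id": doc_id, "contents": contents}


def to_jsonl(records):
    """Serialize records as JSON lines, one document per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy document and summaries, made up for illustration.
record = expand_document(
    "doc1", "Original abstract.", ["First summary.", "Second summary."]
)
jsonl = to_jsonl([record])
```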

6. Indexing, Retrieval, Evaluation

These steps follow the https://github.com/boudinfl/ir-using-kg#data repository.

sh 4_create_indexes.sh
sh 5_retrieve.sh
sh 6_evaluate.sh
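
The evaluation scripts rely on the compiled evaluation tools. For readers who want to see what a graded-relevance metric computes, the sketch below implements one common formulation of nDCG@k with linear gains and a log2 rank discount. It is an illustration only: the function names are hypothetical, the toy data is made up, and the result is not guaranteed to match the numbers produced by `6_evaluate.sh`.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, grades, k=10):
    """nDCG@k for one query; grades maps doc id to a relevance grade."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments and two rankings of the same documents.
grades = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], grades)
reversed_order = ndcg_at_k(["d3", "d2", "d1"], grades)
```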

Cite

@inproceedings{jeong-etal-2021-unsupervised,
    title = "Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation",
    author = "Jeong, Soyeong  and
      Baek, Jinheon  and
      Park, ChaeHun  and
      Park, Jong",
    booktitle = "Proceedings of the Second Workshop on Scholarly Document Processing",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.sdp-1.2",
    doi = "10.18653/v1/2021.sdp-1.2",
    pages = "7--17"
}

Owner

NLP*CL Laboratory
