BiDR

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.

Requirements

torch==1.7
transformers==4.6
faiss-gpu==1.6.4.post2

Data Download and Preprocess

bash download_data.sh
python preprocess.py

These commands will download and preprocess the MSMARCO Passage and Document datasets; the results will be saved to ./data.
We take the Passage dataset as the example to show the running workflow.

Conventional Workflow

Representation Learning

Train the encoder with random negatives (or set --hardneg_json to provide BM25/hard negatives):

mkdir log
dataset=passage
savename=dense_global_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
|tee ./log/${savename}.log

Unsupervised Quantization

Generate dense embeddings of queries and docs:

data_type=passage
savename=dense_global_model
epoch=20
python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

Run product quantization based on Faiss and evaluate recall performance:

data_type=passage
savename=dense_global_model
epoch=20
python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--ckpt_path ./model/${savename}/${epoch}/ \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100
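
Internally, lightweight_ann.py presumably relies on the Faiss index factory; a minimal sketch of what OPQ96,PQ96x8 quantization and top-k retrieval look like in Faiss (the embedding arrays below are stand-ins, not the repo's actual loading code):

import faiss
import numpy as np

d = 768                                                  # RoBERTa-base embedding dim
doc_embs = np.random.rand(10000, d).astype("float32")    # stand-in for doc embeddings
query_embs = np.random.rand(16, d).astype("float32")     # stand-in for query embeddings

# "OPQ96,PQ96x8": learn a rotation (OPQ), then quantize into 96 subvectors
# with 256 (8-bit) centroids each
index = faiss.index_factory(d, "OPQ96,PQ96x8", faiss.METRIC_INNER_PRODUCT)
index.train(doc_embs)
index.add(doc_embs)

scores, ids = index.search(query_embs, 1000)             # top-1000 candidates per query
faiss.write_index(index, "OPQ96,PQ96x8.index")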

Progressively Optimized Bi-Granular Document Representation

Sparse Representation Learning

Instead of applying unsupervised quantization to the well-learned dense embeddings, the sparse embeddings are learned through contrastive learning, which optimizes global discrimination and helps high-quality answers be covered in the candidate search.
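
The loss_method multi_ce flag used throughout suggests a cross-entropy objective over one positive and multiple negatives; a hedged sketch of such a contrastive loss (not necessarily the repo's exact implementation):

import torch
import torch.nn.functional as F

def multi_ce_loss(q, d_pos, d_negs):
    # q: (B, h) query embeddings; d_pos: (B, h) positives; d_negs: (B, N, h) negatives
    pos = (q * d_pos).sum(-1, keepdim=True)          # (B, 1) positive scores
    neg = torch.einsum("bh,bnh->bn", q, d_negs)      # (B, N) negative scores
    logits = torch.cat([pos, neg], dim=1)            # positive sits at index 0
    target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, target)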

Train

We find that initializing the PQ module with Faiss OPQ yields a significant gain on the MSMARCO datasets. For the largest dataset, the Ads dataset, however, OPQ initialization is redundant and brings no improvement.
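
A hedged sketch of how the rotation matrix and codebooks could be read out of the Faiss OPQ index to initialize a learnable PQ module (the Faiss accessors are real; the surrounding usage is assumed):

import faiss

index = faiss.read_index("OPQ96,PQ96x8.index")             # IndexPreTransform
opq = faiss.downcast_VectorTransform(index.chain.at(0))    # OPQMatrix
rotation = faiss.vector_to_array(opq.A).reshape(opq.d_out, opq.d_in)

pq = faiss.downcast_index(index.index).pq                  # ProductQuantizer
centroids = faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)
# rotation: (768, 768); centroids: (96 subvectors, 256 centroids, 8 dims each)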

dataset=passage
savename=sparse_global_model
python train.py --model_name_or_path ./model/dense_global_model/20 \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq True \
--init_index_path ./data/${dataset}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--partition 96 --centroids 256 --quantloss_weight 0.0 \
|tee ./log/${savename}.log

where ./model/dense_global_model/20 and ./data/${dataset}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index are generated by the conventional workflow.

Test

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index

Dense Representation Learning

The dense embeddings are optimized on the candidate distribution generated by the sparse embeddings. We propose a novel sampling strategy, locality-centric sampling, to enhance local discrimination: construct a bipartite proximity graph and conduct random walk or snowball sampling on it.
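
A minimal sketch of what random-walk batch construction over the query-document bipartite graph might look like, assuming train_hardneg.json maps each query id to its ranked candidate list:

# Hedged sketch of locality-centric sampling: queries visited along a random
# walk share candidate documents, so in-batch negatives are locally hard.
import json, random

with open("train_hardneg.json") as f:          # assumed layout: {qid: [docid, ...]}
    q2d = json.load(f)

d2q = {}                                       # invert edges: doc -> queries
for qid, docs in q2d.items():
    for doc in docs:
        d2q.setdefault(doc, []).append(qid)

def random_walk_batch(start_qid, batch_size, max_steps=10000):
    batch, seen, qid = [start_qid], {start_qid}, start_qid
    for _ in range(max_steps):
        if len(batch) == batch_size:
            break
        doc = random.choice(q2d[qid])          # hop query -> document
        qid = random.choice(d2q[doc])          # hop document -> query
        if qid not in seen:
            seen.add(qid)
            batch.append(qid)
    return batch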

Train

Encode the queries in the train set and generate the candidates for all train queries:

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} \
--mode train

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--mode train \
--save_hardneg_to_json

This command will save train_hardneg.json to output_dir. Then train the dense embeddings to distinguish the ground truth from the negatives in the candidate set:
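
Given the --mink 0 --maxk 200 flags in the command below, negatives are presumably drawn from a rank window of each candidate list; a hedged sketch (same assumed JSON layout as above):

import json, random

with open("train_hardneg.json") as f:
    hardneg = json.load(f)                     # assumed: {qid: ranked candidate doc ids}

def sample_negatives(qid, positives, mink=0, maxk=200, n=1):
    # take candidates ranked in [mink, maxk), skipping known positives
    window = [d for d in hardneg[qid][mink:maxk] if d not in positives]
    return random.sample(window, min(n, len(window)))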

dataset=passage
savename=dense_local_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method {random_walk or snow_sample} \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
--hardneg_json ./data/${dataset}/evaluate/sparse_global_model_20/train_hardneg.json \
--mink 0  --maxk 200 \
|tee ./log/${savename}.log

Test

data_type=passage
savename=dense_local_model
epoch=10

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/post_verification.py \
--data_type ${data_type} \
--output_dir  evaluate/${savename}_${epoch} \
--candidate_from_ann ./data/${data_type}/evaluate/sparse_global_model_20/dev.rank_1000_score_faiss_opq.tsv \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100
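
post_verification.py presumably rescores the sparse candidates with the dense embeddings (sparse embeddings for candidate search, dense embeddings for verification); a hedged sketch of that reranking step, with file loading and id maps assumed:

import numpy as np

def post_verify(query_emb, candidate_ids, doc_embs, topk=100):
    # query_emb: (h,); doc_embs: assumed {docid: (h,) dense vector}
    scores = np.array([query_emb @ doc_embs[d] for d in candidate_ids])
    order = np.argsort(-scores)[:topk]         # rerank candidates by dense score
    return [candidate_ids[i] for i in order], scores[order]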

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
