LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Last update: Jan 12, 2022

Overview

We propose a cross-encoder model (LTR_CrossEncoder) for information retrieval. It re-ranks the candidate articles returned by Elasticsearch to identify the legal texts relevant to a query.

The model achieved an F2 score of 0.747 on the public test set (Legal Text Retrieval, Zalo AI Challenge 2021).
Using Elasticsearch alone, our F2 score is 0.54.
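For readers unfamiliar with the metric, the sketch below shows how an F2 score can be computed. It assumes that F2 is calculated per question and then averaged over all questions, which this README does not state; the function names `f_beta` and `mean_f2` are illustrative and not part of this repository.

```python
def f_beta(predicted, relevant, beta=2.0):
    """F-beta score for a single question.

    `predicted` and `relevant` are collections of article keys,
    for example (law_id, article_id) tuples.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_pos = len(predicted & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(relevant)
    b2 = beta * beta
    # beta = 2 weights recall more heavily than precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f2(predictions, gold):
    """Average of per-question F2 (assumed aggregation, see lead-in).

    Both arguments are dicts keyed by question_id.
    """
    scores = [f_beta(predictions.get(qid, ()), rel) for qid, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Because recall dominates at beta = 2, returning one extra wrong article costs less than missing a relevant one.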

Algorithm design

Our algorithm includes two key components:

Elasticsearch
Cross Encoder Model

Elasticsearch

Elasticsearch is used to retrieve the top-k most relevant articles by BM25 score.
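As a rough illustration of this step, the helper below builds a search request body for the top-k candidates. It is a sketch under assumptions: the indexed field names `title` and `text` are taken from the data format in this README, and the function name `build_bm25_query` is hypothetical. BM25 is the default similarity in recent Elasticsearch versions, so a plain `multi_match` query is scored with it unless the index is configured otherwise.

```python
def build_bm25_query(question, k=50, fields=("title", "text")):
    """Return an Elasticsearch search body for the top-k candidates.

    The field names are assumptions about how the articles were indexed.
    """
    return {
        "size": k,  # number of candidates to return
        "query": {
            "multi_match": {
                "query": question,
                "fields": list(fields),
            }
        },
    }


body = build_bm25_query("...", k=10)
```

The resulting dictionary would be sent to the `_search` endpoint of the article index.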

Cross Encoder Model

Our model takes a query, an article text (passage), and an article title as inputs and outputs a relevance score for that query and article. The higher the score, the more relevant the article. We use the pretrained vinai/phobert-base model with either CrossEntropyLoss or BCELoss as the loss function.
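To make the BCELoss option concrete, the sketch below computes the binary cross-entropy for a single (query, article) pair in plain Python. It only illustrates the formula; the actual training code uses PyTorch, and the helper names `sigmoid` and `bce_loss` are assumptions made for this example.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function mapping a raw score to (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_loss(prob, label, eps=1e-12):
    """Binary cross-entropy for one pair.

    `prob` is the predicted relevance probability and `label` is 1 for a
    relevant article and 0 for a non-relevant one.
    """
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# A confident correct prediction is penalized less than a confident wrong one.
low = bce_loss(sigmoid(3.0), 1)
high = bce_loss(sigmoid(-3.0), 1)
```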

Train dataset

Non-relevant samples in the dataset are taken from the top-10 Elasticsearch results. The training data (train_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ],
        "non_relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
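The format above can be flattened into labeled (question, article) pairs for training. The following is a minimal sketch, assuming the file is valid JSON with exactly the keys shown; `iter_pairs` and `load_training_pairs` are hypothetical helper names, not functions from this repository.

```python
import json


def iter_pairs(records):
    """Yield one labeled example per (question, article) pair."""
    for rec in records:
        for label, key in ((1, "relevant_articles"), (0, "non_relevant_articles")):
            for art in rec.get(key, []):
                yield {
                    "question_id": rec["question_id"],
                    "question": rec["question"],
                    "title": art["title"],
                    "text": art["text"],
                    "label": label,  # 1 = relevant, 0 = non-relevant
                }


def load_training_pairs(path):
    """Read a file in the train_data_model.json format and flatten it."""
    with open(path, encoding="utf-8") as f:
        return list(iter_pairs(json.load(f)))
```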

Test dataset

First, we use Elasticsearch to obtain k candidate articles (the top 50 Elasticsearch results). LTR_CrossEncoder then classifies which of these candidates are actually relevant. The test data (test_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
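The second stage can be pictured as scoring every candidate and keeping those above a cutoff. The sketch below is one possible way to do this, not the repository's implementation: `score_fn` stands in for the trained cross encoder, and the 0.5 threshold and the keep-the-best fallback are assumptions made for illustration.

```python
def rerank(record, score_fn, threshold=0.5):
    """Select the candidates of one test record that the scorer deems relevant.

    `score_fn(question, title, text)` is a placeholder for the cross encoder
    and is expected to return a relevance score in [0, 1].
    """
    scored = [
        (score_fn(record["question"], art["title"], art["text"]), art)
        for art in record["articles"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [pair for pair in scored if pair[0] >= threshold]
    if not kept and scored:
        kept = scored[:1]  # assumed fallback: always return the best candidate
    return {
        "question_id": record["question_id"],
        "relevant_articles": [
            {"law_id": art["law_id"], "article_id": art["article_id"]}
            for _, art in kept
        ],
    }
```

Lowering the threshold trades precision for recall, which tends to matter for a recall-weighted metric such as F2.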

Training

Run the following bash script to train the model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Download them if you want to run inference on a new text file without training the models from scratch. Create a new checkpoint folder, unzip the model file, and place it in that folder. https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash script to run inference on the test dataset:

bash run_predict.sh

Owner

Hieu Duong
