TFIDF-based QA system for AIO2 competition

Last update: Feb 19, 2022

Overview

AIO2 TF-IDF Baseline

This is a very simple question answering system, developed as a lightweight baseline for the AIO2 competition.

In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in the training dataset. In the inference stage, it answers an unseen question by finding the training question most similar to the input, scored by the dot product of their TF-IDF features, and returning that question's answer.

Therefore, in principle, the model cannot predict an answer that does not appear in the training data.
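
The retrieval idea described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the repository's actual code: the function names, the smoothed IDF formula, the L2 normalization, and the toy English data are all assumptions, and questions are assumed to be tokenized already.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Map a token list to an L2-normalized sparse TF-IDF vector (a dict)."""
    # Tokens never seen in training have no IDF weight and are dropped.
    counts = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: tf * idf[tok] for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def build_index(train_questions):
    """Fit IDF weights on tokenized training questions; return (idf, vectors)."""
    n_docs = len(train_questions)
    doc_freq = Counter(tok for q in train_questions for tok in set(q))
    # Smoothed IDF; this particular formula is an assumption for illustration.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return idf, [vectorize(q, idf) for q in train_questions]


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def predict(tokens, idf, vectors, answers):
    """Return the answer attached to the most similar training question."""
    query = vectorize(tokens, idf)
    scores = [dot(query, vec) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best]


# Hypothetical toy data, already tokenized.
train_questions = [
    ["capital", "France"],
    ["author", "Hamlet"],
    ["highest", "mountain", "Japan"],
]
train_answers = ["Paris", "Shakespeare", "Mount Fuji"]

idf, vectors = build_index(train_questions)
print(predict(["highest", "mountain"], idf, vectors, train_answers))
print(predict(["capital", "Germany"], idf, vectors, train_answers))
```

The second query shows the limitation noted above: "Berlin" never appears as a training answer, so the sketch can only return the answer of the closest training question.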

Steps to experiment with the model

Install requirements

$ pip install -r requirements.txt

Train

$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000
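
As a rough guide to what the options in this command suggest, the sketch below filters tokens by part of speech, drops stop words, and caps the vocabulary size. It is an assumption about the intent of `--pos_list`, `--stop_words`, and `--max_features`, not a description of how `train.py` implements them; the helper names and the hand-tagged example tokens are hypothetical, and a real run would obtain the tags from a Japanese morphological analyzer.

```python
from collections import Counter


def filter_tokens(tagged_tokens, pos_list, stop_words):
    """Keep surface forms whose POS is in pos_list and that are not stop words."""
    return [surface for surface, pos in tagged_tokens
            if pos in pos_list and surface not in stop_words]


def build_vocabulary(token_lists, max_features):
    """Keep at most max_features tokens, most frequent first (assumed behavior)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [tok for tok, _ in counts.most_common(max_features)]


# Hypothetical (surface, POS) pairs, tagged by hand for illustration.
tagged = [
    ("日本", "名詞"), ("で", "助詞"), ("一番", "名詞"), ("高い", "形容詞"),
    ("山", "名詞"), ("は", "助詞"), ("何", "名詞"),
    ("でしょ", "助動詞"), ("う", "助動詞"),
]

tokens = filter_tokens(tagged, pos_list={"名詞"}, stop_words={"でしょ", "う"})
vocabulary = build_vocabulary([tokens], max_features=10000)
print(tokens)
print(vocabulary)
```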

Predict

$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl

Building Docker image

$ docker build -t aio2-tfidf-baseline .

Test locally:

$ docker run --rm -v "<data dir>:/app/input" -v "<output dir>:/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl

Save the Docker image to a file:

$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz

License

The code in this repository is open-sourced under the MIT License.

Owner

Masatoshi Suzuki
