Reading Wikipedia to Answer Open-Domain Questions

Related tags

Text Data & NLPDrQA
Overview

DrQA

This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions.

Quick Links

Machine Reading at Scale

DrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of "machine reading at scale" (MRS). In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).

Our experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.

Note that DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents, as described in the retriever README.

This repository includes code, data, and pre-trained models for processing and querying Wikipedia as described in the paper -- see Trained Models and Data. We also list several different datasets for evaluation, see QA Datasets. Note that this work is a refactored and more efficient version of the original code. Reproduction numbers are very similar but not exact.

Quick Start: Demo

Install DrQA and download our models to start asking open-domain questions!

Run python scripts/pipeline/interactive.py to drop into an interactive session. For each question, the top span and the Wikipedia paragraph it came from are returned.

>>> process('What is question answering?')

Top Predictions:
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
| Rank |                                                  Answer                                                  |        Doc         | Answer Score | Doc Score |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
|  1   | a computer science discipline within the fields of information retrieval and natural language processing | Question answering |    1917.8    |   327.89  |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+

Contexts:
[ Doc = Question answering ]
Question Answering (QA) is a computer science discipline within the fields of
information retrieval and natural language processing (NLP), which is
concerned with building systems that automatically answer questions posed by
humans in a natural language.
>>> process('What is the answer to life, the universe, and everything?')

Top Predictions:
+------+--------+---------------------------------------------------+--------------+-----------+
| Rank | Answer |                        Doc                        | Answer Score | Doc Score |
+------+--------+---------------------------------------------------+--------------+-----------+
|  1   |   42   | Phrases from The Hitchhiker's Guide to the Galaxy |    47242     |   141.26  |
+------+--------+---------------------------------------------------+--------------+-----------+

Contexts:
[ Doc = Phrases from The Hitchhiker's Guide to the Galaxy ]
The number 42 and the phrase, "Life, the universe, and everything" have
attained cult status on the Internet. "Life, the universe, and everything" is
a common name for the off-topic section of an Internet forum and the phrase is
invoked in similar ways to mean "anything at all". Many chatbots, when asked
about the meaning of life, will answer "42". Several online calculators are
also programmed with the Question. Google Calculator will give the result to
"the answer to life the universe and everything" as 42, as will Wolfram's
Computational Knowledge Engine. Similarly, DuckDuckGo also gives the result of
"the answer to the ultimate question of life, the universe and everything" as
42. In the online community Second Life, there is a section on a sim called
43. "42nd Life." It is devoted to this concept in the book series, and several
attempts at recreating Milliways, the Restaurant at the End of the Universe, were made.
>>> process('Who was the winning pitcher in the 1956 World Series?')

Top Predictions:
+------+------------+------------------+--------------+-----------+
| Rank |   Answer   |       Doc        | Answer Score | Doc Score |
+------+------------+------------------+--------------+-----------+
|  1   | Don Larsen | New York Yankees |  4.5059e+06  |   278.06  |
+------+------------+------------------+--------------+-----------+

Contexts:
[ Doc = New York Yankees ]
In 1954, the Yankees won over 100 games, but the Indians took the pennant with
an AL record 111 wins; 1954 was famously referred to as "The Year the Yankees
Lost the Pennant". In , the Dodgers finally beat the Yankees in the World
Series, after five previous Series losses to them, but the Yankees came back
strong the next year. On October 8, 1956, in Game Five of the 1956 World
Series against the Dodgers, pitcher Don Larsen threw the only perfect game in
World Series history, which remains the only perfect game in postseason play
and was the only no-hitter of any kind to be pitched in postseason play until
Roy Halladay pitched a no-hitter on October 6, 2010.

Try some of your own! Of course, DrQA might provide alternative facts, so enjoy the ride.

Installing DrQA

Setting up DrQA is easy!

DrQA requires Linux/OSX and Python 3.5 or higher. It also requires installing PyTorch version 1.0. Its other dependencies are listed in requirements.txt. CUDA is strongly recommended for speed, but not necessary.

Run the following commands to clone the repository and install DrQA:

git clone https://github.com/facebookresearch/DrQA.git
cd DrQA; pip install -r requirements.txt; python setup.py develop

Note: requirements.txt includes a subset of all the possible required packages. Depending on what you want to run, you might need to install an extra package (e.g. spacy).

If you use the CoreNLPTokenizer or SpacyTokenizer you also need to download the Stanford CoreNLP jars and spaCy en model, respectively. If you use Stanford CoreNLP, have the jars in your java CLASSPATH environment variable, or set the path programmatically with:

import drqa.tokenizers
drqa.tokenizers.set_default('corenlp_classpath', '/your/corenlp/classpath/*')

IMPORTANT: The default tokenizer is CoreNLP so you will need that in your CLASSPATH to run the README examples.

Ex: export CLASSPATH=$CLASSPATH:/path/to/corenlp/download/*.

If you do not already have a CoreNLP download you can run:

./install_corenlp.sh

Verify that it runs:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()
tok.tokenize('hello world').words()  # Should complete immediately

For convenience, the Document Reader, Retriever, and Pipeline modules will try to load default models if no model argument is given. See below for downloading these models.

Trained Models and Data

To download all provided trained models and data for Wikipedia question answering, run:

./download.sh

Warning: this downloads a 7.5GB tarball (25GB untarred) and will take some time.

This stores the data in data/ at the file paths specified in the various modules' defaults. This top-level directory can be modified by setting a DRQA_DATA environment variable to point to somewhere else.

Default directory structure (see embeddings for more info on additional downloads for training):

DrQA
├── data (or $DRQA_DATA)
    ├── datasets
    │   ├── SQuAD-v1.1-<train/dev>.<txt/json>
    │   ├── WebQuestions-<train/test>.txt
    │   ├── freebase-entities.txt
    │   ├── CuratedTrec-<train/test>.txt
    │   └── WikiMovies-<train/test/entities>.txt
    ├── reader
    │   ├── multitask.mdl
    │   └── single.mdl
    └── wikipedia
        ├── docs.db
        └── docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz

Default model paths for the different modules can also be modified programmatically in the code, e.g.:

import drqa.reader
drqa.reader.set_default('model', '/path/to/model')
reader = drqa.reader.Predictor()  # Default model loaded for prediction

Document Retriever

TF-IDF model using Wikipedia (unigrams and bigrams, 2^24 bins, simple tokenization), evaluated on multiple datasets (test sets, dev set for SQuAD):

Model SQuAD [email protected] CuratedTREC [email protected] WebQuestions [email protected] WikiMovies [email protected] Size
TF-IDF model 78.0 87.6 75.0 69.8 ~13GB

[email protected] here is defined as the % of questions for which the answer segment appears in one of the top 5 documents.

Document Reader

Model trained only on SQuAD, evaluated in the SQuAD setting:

Model SQuAD Dev EM SQuAD Dev F1 Size
Single model 69.4 78.9 ~130MB

Model trained with distant supervision without NER/POS/lemma features, evaluated on multiple datasets (test sets, dev set for SQuAD) in the full Wikipedia setting:

Model SQuAD EM CuratedTREC EM WebQuestions EM WikiMovies EM Size
Multitask model 29.5 27.2 18.5 36.9 ~270MB

Wikipedia

Our full-scale experiments were conducted on the 2016-12-21 dump of English Wikipedia. The dump was processed with the WikiExtractor and filtered for internal disambiguation, list, index, and outline pages (pages that are typically just links). We store the documents in an sqlite database for which drqa.retriever.DocDB provides an interface.

Database Num. Documents Size
Wikipedia 5,075,182 ~13GB

QA Datasets

The datasets used for DrQA training and evaluation can be found here:

Format A

The retriever/eval.py, pipeline/eval.py, and distant/generate.py scripts expect the datasets as a .txt file where each line is a JSON encoded QA pair, like so:

'{"question": "q1", "answer": ["a11", ..., "a1i"]}'
...
'{"question": "qN", "answer": ["aN1", ..., "aNi"]}'

Scripts to convert SQuAD and WebQuestions to this format are included in scripts/convert. This is automatically done in download.sh.

Format B

The reader directory scripts expect the datasets as a .json file where the data is arranged like SQuAD:

file.json
├── "data"
│   └── [i]
│       ├── "paragraphs"
│       │   └── [j]
│       │       ├── "context": "paragraph text"
│       │       └── "qas"
│       │           └── [k]
│       │               ├── "answers"
│       │               │   └── [l]
│       │               │       ├── "answer_start": N
│       │               │       └── "text": "answer"
│       │               ├── "id": "<uuid>"
│       │               └── "question": "paragraph question?"
│       └── "title": "document id"
└── "version": 1.1
Entity lists

Some datasets have (potentially large) candidate lists for selecting answers. For example, WikiMovies' answers are OMDb entries while WebQuestions is based on Freebase. If we have known candidates, we can impose that all predicted answers must be in this list by discarding any higher scoring spans that are not.

DrQA Components

Document Retriever

DrQA is not tied to any specific type of retrieval system -- as long as it effectively narrows the search space and focuses on relevant documents.

Following classical QA systems, we include an efficient (non-machine learning) document retrieval system based on sparse, TF-IDF weighted bag-of-word vectors. We use bags of hashed n-grams (here, unigrams and bigrams).

To see how to build your own such model on new documents, see the retriever README.

To interactively query Wikipedia:

python scripts/retriever/interactive.py --model /path/to/model

If model is left out our default model will be used (assuming it was downloaded).

To evaluate the retriever accuracy (% match in top 5) on a dataset:

python scripts/retriever/eval.py /path/to/format/A/dataset.txt --model /path/to/model

Document Reader

DrQA's Document Reader is a multi-layer recurrent neural network machine comprehension model trained to do extractive question answering. That is, the model tries to find the answer to any question as a text span in one of the returned documents.

The Document Reader was inspired by, and primarily trained on, the SQuAD dataset. It can also be used standalone on such SQuAD-like tasks where a specific context is supplied with the question, the answer to which is contained in the context.

To see how to train the Document Reader on SQuAD, see the reader README.

To interactively ask questions about text with a trained model:

python scripts/reader/interactive.py --model /path/to/model

Again, here model is optional; a default model will be used if it is left out.

To run model predictions on a dataset:

python scripts/reader/predict.py /path/to/format/B/dataset.json --model /path/to/model

DrQA Pipeline

The full system is linked together in drqa.pipeline.DrQA.

To interactively ask questions using the full DrQA:

python scripts/pipeline/interactive.py

Optional arguments:

--reader-model    Path to trained Document Reader model.
--retriever-model Path to Document Retriever model (tfidf).
--doc-db          Path to Document DB.
--tokenizer      String option specifying tokenizer type to use (e.g. 'corenlp').
--candidate-file  List of candidates to restrict predictions to, one candidate per line.
--no-cuda         Use CPU only.
--gpu             Specify GPU device id to use.

To run predictions on a dataset:

python scripts/pipeline/predict.py /path/to/format/A/dataset.txt

Optional arguments:

--out-dir             Directory to write prediction file to (<dataset>-<model>-pipeline.preds).
--reader-model        Path to trained Document Reader model.
--retriever-model     Path to Document Retriever model (tfidf).
--doc-db              Path to Document DB.
--embedding-file      Expand dictionary to use all pretrained embeddings in this file (e.g. all glove vectors to minimize UNKs at test time).
--candidate-file      List of candidates to restrict predictions to, one candidate per line.
--n-docs              Number of docs to retrieve per query.
--top-n               Number of predictions to make per query.
--tokenizer           String option specifying tokenizer type to use (e.g. 'corenlp').
--no-cuda             Use CPU only.
--gpu                 Specify GPU device id to use.
--parallel            Use data parallel (split across GPU devices).
--num-workers         Number of CPU processes (for tokenizing, etc).
--batch-size          Document paragraph batching size (Reduce in case of GPU OOM).
--predict-batch-size  Question batching size (Reduce in case of CPU OOM).

Distant Supervision (DS)

DrQA's performance improves significantly in the full-setting when provided with distantly supervised data from additional datasets. Given question-answer pairs but no supporting context, we can use string matching heuristics to automatically associate paragraphs to these training examples.

Question: What U.S. state’s motto is “Live free or Die”?

Answer: New Hampshire

DS Document: Live Free or Die “Live Free or Die” is the official motto of the U.S. state of New Hampshire, adopted by the state in 1945. It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.

The scripts/distant directory contains code to generate and inspect such distantly supervised data. More information can be found in the distant supervision README.

Tokenizers

We provide a number of different tokenizer options for convenience. Each has its own pros/cons based on how many dependencies it requires, overhead for running it, speed, and performance. For our reported experiments we used CoreNLP (but results are all similar).

Available tokenizers:

  • CoreNLPTokenizer: Uses Stanford CoreNLP (option: 'corenlp'). We used v3.7.0. Requires Java 8.
  • SpacyTokenizer: Uses spaCy (option: 'spacy').
  • RegexpTokenizer: Custom regex-based PTB-style tokenizer (option: 'regexp').
  • SimpleTokenizer: Basic alpha-numeric/non-whitespace tokenizer (option: 'simple').

See the list of mappings between string option names and tokenizer classes.

Citation

Please cite the ACL paper if you use DrQA in your work:

@inproceedings{chen2017reading,
  title={Reading {Wikipedia} to Answer Open-Domain Questions},
  author={Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine},
  booktitle={Association for Computational Linguistics (ACL)},
  year={2017}
}

DrQA Elsewhere

Connection with ParlAI

This implementation of the DrQA Document Reader is closely related to the one found in ParlAI. Here, however, the work is extended to interact with the Document Retriever in the open-domain setting. On the other hand, the implementation in ParlAI is more general, and follows the appropriate API to work in more QA/Dialog settings.

Web UI

Hamed Zaghaghi has provided a wrapper for a Web UI.

License

DrQA is BSD-licensed.

Owner
Facebook Research
Facebook Research
Code for our paper "Transfer Learning for Sequence Generation: from Single-source to Multi-source" in ACL 2021.

TRICE: a task-agnostic transferring framework for multi-source sequence generation This is the source code of our work Transfer Learning for Sequence

THUNLP-MT 9 Jun 27, 2022
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
jiant is an NLP toolkit

jiant is an NLP toolkit The multitask and transfer learning toolkit for natural language processing research Why should I use jiant? jiant supports mu

ML² AT CILVR 1.5k Jan 04, 2023
📔️ Generate a text-based journal from a template file.

JGen 📔️ Generate a text-based journal from a template file. Contents Getting Started Example Overview Usage Details Reserved Keywords Gotchas Getting

Harrison Broadbent 21 Sep 25, 2022
Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

English|简体中文 ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,该框架将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。ERNIE在累积 40 余个典型 NLP 任务取得 SOTA 效果,并在 G

5.4k Jan 03, 2023
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to ach

Keon Lee 237 Jan 02, 2023
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
test

Lidar-data-decode In this project, you can decode your lidar data frame(pcap file) and make your own datasets(test dataset) in Windows without any hug

46 Dec 05, 2022
TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

Alexa 98 Dec 09, 2022
In this repository we have tested 3 VQA models on the ImageCLEF-2019 dataset.

Med-VQA In this repository we have tested 3 VQA models on the ImageCLEF-2019 dataset. Two of these are made on top of Facebook AI Reasearch's Multi-Mo

Kshitij Ambilduke 8 Apr 14, 2022
Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization (ACL 2021)

Structured Super Lottery Tickets in BERT This repo contains our codes for the paper "Super Tickets in Pre-Trained Language Models: From Model Compress

Chen Liang 16 Dec 11, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
A linter to manage all your python exceptions and try/except blocks (limited only for those who like dinosaurs).

Manage your exceptions in Python like a PRO Currently in BETA. Inspired by this blog post. I shared the building process of this tool here. “For those

Guilherme Latrova 353 Dec 31, 2022
2021海华AI挑战赛·中文阅读理解·技术组·第三名

文字是人类用以记录和表达的最基本工具,也是信息传播的重要媒介。透过文字与符号,我们可以追寻人类文明的起源,可以传播知识与经验,读懂文字是认识与了解的第一步。对于人工智能而言,它的核心问题之一就是认知,而认知的核心则是语义理解。

21 Dec 26, 2022
Stuff related to Ben Eater's 8bit breadboard computer

8bit breadboard computer simulator This is an assembler + simulator/emulator of Ben Eater's 8bit breadboard computer. For a version with its RAM upgra

Marijn van Vliet 29 Dec 29, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022