Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Overview

Semantic search through Wikipedia with the Weaviate vector search engine

Weaviate is an open source vector search engine with build-in vectorization and question answering modules. We imported the complete English language Wikipedia article dataset into a single Weaviate instance to conduct semantic search queries through the Wikipedia articles, besides this, we've made all the graph relations between the articles too. We have made the import scripts, pre-processed articles, and backup available so that you can run the complete setup yourself.

In this repository, you'll find the 3-steps needed to replicate the import, but there are also downlaods available to skip the first two steps.

If you like what you see, a on the Weaviate Github repo or joining our Slack is appreciated.

Additional links:

Frequently Asked Questions

Q A
Can I run this setup with a non-English dataset? Yes – first, you need to go through the whole process (i.e., start with Step 1). E.g., if you want French, you can download the French version of Wikipedia like this: https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2 (note that en if replaced with fr). Next, you need to change the Weaviate vectorizer module to an appropriate language. You can choose an OOTB language model as outlined here or add your own model as outlined here.
Can I run this setup with all languages? Yes – you can follow two strategies. You can use a multilingual model or extend the Weaviate schema to store different languages with different classes. The latter has the upside that you can use multiple vectorizers (e.g., per language) or a more elaborate sharding strategy. But in the end, both are possible.
Can I run this with Kubernetes? Of course, you need to start from Step 2. But if you follow the Kubernetes set up in the docs you should be good :-)
Can I run this with my own data? Yes! This is just a demo dataset, you can use any data you have and like. Go to the Weaviate docs or join our Slack to get started.

Acknowledgments

Stats

description value
Articles imported 11.348.257
Paragaphs imported 27.377.159
Graph cross references 125.447.595
Wikipedia version truthy October 9th, 2021
Machine for inference 12 CPU – 100 GB RAM – 250Gb SSD – 1 x NVIDIA Tesla P4
Weaviate version v1.7.2
Dataset size 122GB

Example queries

Example semantic search queries in Weaviate's GraphQL interface

Import

There are 3-steps in the import process. You can also skip the first two and directly import the backup

Step 1: Process the Wikipedia dump

In this process, the Wikipedia dataset is processed and cleaned (the markup is removed, HTML tags are removed, etc). The output file is a JSON Lines document that will be used in the next step.

Process from the Wikimedia dump:

$ cd step-1
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bzip2 -d filename.bz2
$ pip3 install -r requirements.txt
$ python3 process.py

The import takes a few hours, so probably you want to do something like:

$ nohup python3 -u process.py &

You can also download the processed file from October 9th, 2021, and skip the above steps

$ wget https://storage.googleapis.com/semi-technologies-public-data/wikipedia-en-articles.json.gz
$ gunzip wikipedia-en-articles.json.gz

Step 2: Import the dataset and vectorized the content

Weaviate takes care of the complete import and vectorization process but you'll need some GPU and CPU muscle to achieve this. Important to bear in mind is that this is only needed on import time. If you don't want to spend the resources on doing the import, you can go to the next step in the process and download the Weaviate backup. The machine needed for inference is way cheaper.

We will be using a single Weaviate instance, but four Tesla T4 GPUs that we will stuff with 8 models each. To efficiently do this, we are going to add an NGINX load balancer between Weaviate and the vectorizers.

Weaviate Wikipedia import architecture with transformers and vectorizers

  • Every Weaviate text2vec-module will be using a multi-qa-MiniLM-L6-cos-v1 sentence transformer.
  • The volume is mounted outside the container to /var/weaviate. This allows us to use this folder as a backup that can be imported in the next step.
  • Make sure to have Docker-compose with GPU support installed.
  • The import scripts assumes that the JSON file is called wikipedia-en-articles.json.
$ cd step-2
$ docker-compose up -d
$ pip3 install -r requirements.txt
$ python3 import.py

The import takes a few hours, so probably you want to do something like:

$ nohup python3 -u import.py &

After the import is done, you can shut down the Docker containers by running docker-compose down.

You can now query the dataset!

Step 3: Load from backup

Start here if you want to work with a backup of the dataset without importing it

You can now run the dataset! We would advise running it with 1 GPU, but you can also run it on CPU only (without Q&A). The machine you need for inference is significantly smaller.

Note that Weaviate needs some time to import the backup (if you use the setup mentioned above +/- 15min). You can see the status of the backup in the docker logs of the Weaviate container.

# clone this repository
$ git clone https://github.com/semi-technologies/semantic-search-through-Wikipedia-with-Weaviate/
# go into the backup dir
$ cd step-3
# download the Weaviate backup
$ curl https://storage.googleapis.com/semi-technologies-public-data/weaviate-1.8.0-rc.2-backup-wikipedia-py-en-multi-qa-MiniLM-L6-cos.tar.gz -O
# untar the backup (112G unpacked)
$ tar -xvzf weaviate-1.8.0-rc.2-backup-wikipedia-py-en-multi-qa-MiniLM-L6-cos.tar.gz
# get the unpacked directory
$ echo $(pwd)/var/weaviate
# use the above result (e.g., /home/foobar/var/weaviate)
#   update volumes in docker-compose.yml (NOT PERSISTENCE_DATA_PATH!) to the above output
#   (e.g., 
#     volumes:
#       - /home/foobar/var/weaviate:/var/lib/weaviate
#   )    
#
#   With 12 CPUs this process takes about 12 to 15 minutes to complete.
#   The Weaviate instance will be available directly, but the cache is pre-filling in this timeframe

With GPU

$ cd step-3
$ docker-compose -f docker-compose-gpu.yml up -d

Without GPU

$ cd step-3
$ docker-compose -f docker-compose-no-gpu.yml up -d

Example queries

"Where is the States General of The Netherlands located?" try it live!

##
# Using the Q&A module I
##
{
  Get {
    Paragraph(
      ask: {
        question: "Where is the States General of The Netherlands located?"
        properties: ["content"]
      }
      limit: 1
    ) {
      _additional {
        answer {
          result
          certainty
        }
      }
      content
      title
    }
  }
}

"What was the population of the Dutch city Utrecht in 2019?" try it live!

##
# Using the Q&A module II
##
{
  Get {
    Paragraph(
      ask: {
        question: "What was the population of the Dutch city Utrecht in 2019?"
        properties: ["content"]
      }
      limit: 1
    ) {
      _additional {
        answer {
          result
          certainty
        }
      }
      content
      title
    }
  }
}

About the concept "Italian food" try it live!

##
# Generic question about Italian food
##
{
  Get {
    Paragraph(
      nearText: {
        concepts: ["Italian food"]
      }
      limit: 50
    ) {
      content
      order
      title
      inArticle {
        ... on Article {
          title
        }
      }
    }
  }
}

"What was Michael Brecker's first saxophone?" in the Wikipedia article about "Michael Brecker" try it live!

##
# Mixing scalar queries and semantic search queries
##
{
  Get {
    Paragraph(
      ask: {
        question: "What was Michael Brecker's first saxophone?"
        properties: ["content"]
      }
      where: {
        operator: Equal
        path: ["inArticle", "Article", "title"]
        valueString: "Michael Brecker"
      }
      limit: 1
    ) {
      _additional {
        answer {
          result
        }
      }
      content
      order
      title
      inArticle {
        ... on Article {
          title
        }
      }
    }
  }
}

Get all Wikipedia graph connections for "jazz saxophone players" try it live!

##
# Mixing semantic search queries with graph connections
##
{
  Get {
    Paragraph(
      nearText: {
        concepts: ["jazz saxophone players"]
      }
      limit: 25
    ) {
      content
      order
      title
      inArticle {
        ... on Article { # <== Graph connection I
          title
          hasParagraphs { # <== Graph connection II
            ... on Paragraph {
              title
            }
          }
        }
      }
    }
  }
}
Owner
SeMI Technologies
SeMI Technologies creates database software like the Weaviate vector search engine
SeMI Technologies
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录 为了促进中文自然语言处理研究的发展,本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现,可以支持数据并行、模型加速、流水并行的代码。 安装 1、首先安装pytorch等基础依赖,再安装APEX以支持fp16。 p

Tsinghua AI 37 Dec 06, 2022
CoSENT 比Sentence-BERT更有效的句向量方案

CoSENT 比Sentence-BERT更有效的句向量方案

苏剑林(Jianlin Su) 201 Dec 12, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

lang lang is a simple stack based programming language written in Python. It can

Christoffer Aakre 1 May 30, 2022
LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transf

Studio Ousia 587 Dec 30, 2022
A Fast Command Analyser based on Dict and Pydantic

Alconna Alconna 隶属于ArcletProject, 在Cesloi内有内置 Alconna 是 Cesloi-CommandAnalysis 的高级版,支持解析消息链 一般情况下请当作简易的消息链解析器/命令解析器 文档 暂时的文档 Example from arclet.alcon

19 Jan 03, 2023
LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context This Repository contains the code on AVA of our ACM MM 2021 paper: LSTC: Boosting

Tencent YouTu Research 9 Oct 11, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
Use the power of GPT3 to execute any function inside your programs just by giving some doctests

gptrun Don't feel like coding today? Use the power of GPT3 to execute any function inside your programs just by giving some doctests. How is this diff

Roberto Abdelkader Martínez Pérez 11 Nov 11, 2022
Scikit-learn style model finetuning for NLP

Scikit-learn style model finetuning for NLP Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide vari

indico 665 Dec 17, 2022
This is my reading list for my PhD in AI, NLP, Deep Learning and more.

This is my reading list for my PhD in AI, NLP, Deep Learning and more.

Zhong Peixiang 156 Dec 21, 2022
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор СоАвтор – платформа и открытый набор инструментов для редакций и журналистов-фрилансеров, который призван сделать процесс создания контента ма

38 Jan 04, 2023
Library for fast text representation and classification.

fastText fastText is a library for efficient learning of word representations and sentence classification. Table of contents Resources Models Suppleme

Facebook Research 24.1k Jan 05, 2023
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022