This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.

Overview

OpenWebText2

Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.

The plug and play version of OpenWebText2 contains:

  • 17,103,059 documents
  • 65.86 GB of uncompressed text

Download Dataset / Documentation

For further information please visit our documentation.

Acknowledgements

researcher2 wrote much of this code, with inspiration from, and some straight copying of, the scraping code found here.
sdtblck kindly put together the Colab notebook and performed a chunk of the scraping.
leogao2 provided overall design guidance and lm_dataformat, and performed another chunk of the scraping.
Colaboratory VMs helped us with about 10% of our overall scraping.
The Eye hosts our processed datasets.
Read the Docs hosts our documentation.

Comments
  • Fixing an issue with sha256 checking

    The pushshift.pushshift_to_sqlite method passes its arguments to best_download.download_file in the wrong order, so the code crashes. The dataset is therefore not reproducible without this modification.

    Opened by ardacihaner.
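The comment above describes a checksum check that fails because arguments were passed in the wrong positional order. The sketch below illustrates the general idea only: it computes a file's SHA-256 digest with Python's standard hashlib module and compares it to an expected value. The helper names sha256_of_file and verify_download are hypothetical and are not part of best_download or of this repository; making the arguments keyword-only is one possible way to rule out order mix-ups of this kind.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(*, local_path, expected_sha256):
    """Hypothetical helper: the bare * makes both arguments keyword-only,
    so a path and a checksum cannot be swapped by position."""
    return sha256_of_file(local_path) == expected_sha256.lower()


# Small demonstration on a temporary file (stand-in for a downloaded dump).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

demo_checksum = hashlib.sha256(b"example payload").hexdigest()
print(verify_download(local_path=demo_path, expected_sha256=demo_checksum))
os.remove(demo_path)
```

Calling verify_download(demo_path, demo_checksum) positionally raises a TypeError instead of silently comparing the wrong values.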
Releases: v1.0
Owner: EleutherAI