Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Dec 20, 2022

Overview

Spanish Language Models 💃🏻

A repository part of the MarIA project.

Corpora 📃

Corpora	Number of documents	Number of tokens	Size (GB)
BNE	201,080,084	135,733,450,668	570GB

Models 🤖

RoBERTa-base BNE: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
RoBERTa-large BNE: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne
GPT2-base BNE: https://huggingface.co/PlanTL-GOB-ES/gpt2-base-bne
GPT2-large BNE: https://huggingface.co/PlanTL-GOB-ES/gpt2-large-bne
Other models: (WIP)

Fine-tunned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️

RoBERTa-base-BNE for Capitel-POS: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-pos
RoBERTa-large-BNE for Capitel-POS: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-pos
RoBERTa-base-BNE for Capitel-NER: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner
RoBERTa-base-BNE for Capitel-NER: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus (very robust)
RoBERTa-large-BNE for Capitel-NER: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-ner
RoBERTa-base-BNE for SQAC: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-sqac
RoBERTa-large-BNE for SQAC: https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-sqac

Word embeddings 🔤

Word embeddings trained with FastText for 300d:

CBOW Word embeddings: https://zenodo.org/record/5044988
Skip-gram Word embeddings: https://zenodo.org/record/5046525

Datasets 🗂️

Spanish Question Answering Corpus (SQAC) 🦆 : https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC

Evaluation ✅

Dataset	Metric	RoBERTa-b	RoBERTa-l	BETO*	mBERT	BERTIN**	Electricidad***
UD-POS	F1	0.9907	0.9898	0.9900	0.9886	0.9898	0.9818
Conll-NER	F1	0.8851	0.8772	0.8759	0.8691	0.8835	0.7954
Capitel-POS	F1	0.9846	0.9851	0.9836	0.9839	0.9847	0.9816
Capitel-NER	F1	0.8960	0.8998	0.8772	0.8810	0.8856	0.8035
STS	Combined	0.8533	0.8353	0.8159	0.8164	0.7945	0.8063
MLDoc	Accuracy	0.9623	0.9675	0.9663	0.9550	0.9673	0.9493
PAWS-X	F1	0.9000	0.9060	0.9000	0.8955	0.8990	0.9025
XNLI	Accuracy	0.8016	0.7958	0.8130	0.7876	0.7890	0.7878
SQAC	F1	0.7923	0.7993	0.7923	0.7562	0.7678	0.7383

* A model based on BERT architecture.

** A model based on RoBERTa architecture.

*** A model based on Electra architecture.

Usage example ⚗️

For the RoBERTa-base

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For the RoBERTa-large

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

⚖️ Legal Language Model
⚕️ Biomedical and Clinical Language Models

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to make larger models (2) train/evaluate the model in other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Related tags

Overview

Spanish Language Models 💃🏻

Corpora 📃

Models 🤖

Fine-tunned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️

Word embeddings 🔤

Datasets 🗂️

Evaluation ✅

Usage example ⚗️

Other Spanish Language Models 👩‍👧‍👦

Cite 📣

Contact 📧

Owner

Plan de Tecnologías del Lenguaje - Gobierno de España

NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

A Python/Pytorch app for easily synthesising human voices

Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ray-based parallel data preprocessing for NLP and ML.

MMDA - multimodal document analysis

Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Code for text augmentation method leveraging large-scale language models

多语言降噪预训练模型MBart的中文生成任务

中文問句產生器；使用台達電閱讀理解資料集(DRCD)

Sploitus - Command line search tool for sploitus.com. Think searchsploit, but with more POCs

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

TalkNet: Audio-visual active speaker detection Model

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

Machine translation models released by the Gourmet project

PyWorld3 is a Python implementation of the World3 model

A multi-voice TTS system trained with an emphasis on quality

APEACH: Attacking Pejorative Expressions with Analysis on Crowd-generated Hate Speech Evaluation Datasets