A pre-trained language model for social media text in Spanish

Overview

RoBERTuito

A pre-trained language model for social media text in Spanish

READ THE FULL PAPER Github Repository

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. RoBERTuito comes in 3 flavors: cased, uncased, and uncased+deaccented.

We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for this language such as BETO, BERTin and RoBERTa-BNE. The 4 tasks selected for evaluation were: Hate Speech Detection (using SemEval 2019 Task 5, HatEval dataset), Sentiment and Emotion Analysis (using TASS 2020 datasets), and Irony detection (using IrosVa 2019 dataset).

model hate speech sentiment analysis emotion analysis irony detection score
robertuito-uncased 0.801 ± 0.010 0.707 ± 0.004 0.551 ± 0.011 0.736 ± 0.008 0.699
robertuito-deacc 0.798 ± 0.008 0.702 ± 0.004 0.543 ± 0.015 0.740 ± 0.006 0.696
robertuito-cased 0.790 ± 0.012 0.701 ± 0.012 0.519 ± 0.032 0.719 ± 0.023 0.682
roberta-bne 0.766 ± 0.015 0.669 ± 0.006 0.533 ± 0.011 0.723 ± 0.017 0.673
bertin 0.767 ± 0.005 0.665 ± 0.003 0.518 ± 0.012 0.716 ± 0.008 0.667
beto-cased 0.768 ± 0.012 0.665 ± 0.004 0.521 ± 0.012 0.706 ± 0.007 0.665
beto-uncased 0.757 ± 0.012 0.649 ± 0.005 0.521 ± 0.006 0.702 ± 0.008 0.657

We release the pre-trained models on huggingface model hub:

Usage

IMPORTANT -- READ THIS FIRST

RoBERTuito is not yet fully-integrated into huggingface/transformers. To use it, first install pysentimiento

pip install pysentimiento

and preprocess text using pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer

','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji',''] ">
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text, ha)

tokenizer.tokenize(preprocessed_text)
# ['','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','']

We are working on integrating this preprocessing step into a Tokenizer within transformers library

Development

Installing

We use python==3.7 and poetry to manage dependencies.

pip install poetry
poetry install

Benchmarking

To run benchmarks

python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>

Check RUN_BENCHMARKS for all experiments

Smoke test

Test the benchmark running

./smoke_test.sh

Citation

If you use RoBERTuito, please cite our paper:

@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
"Exploring Vision Transformers for Fine-grained Classification" at CVPRW FGVC8

FGVC8 Exploring Vision Transformers for Fine-grained Classification paper presented at the CVPR 2021, The Eight Workshop on Fine-Grained Visual Catego

Marcos V. Conde 19 Dec 06, 2022
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [2021]

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations This repo contains the Pytorch implementation of our paper: Revisit

Wouter Van Gansbeke 80 Nov 20, 2022
The implementation of the lifelong infinite mixture model

Lifelong infinite mixture model 📋 This is the implementation of the Lifelong infinite mixture model 📋 Accepted by ICCV 2021 Title : Lifelong Infinit

Fei Ye 5 Oct 20, 2022
Yolov5 + Deep Sort with PyTorch

딥소트 수정중 Yolov5 + Deep Sort with PyTorch Introduction This repository contains a two-stage-tracker. The detections generated by YOLOv5, a family of obj

1 Nov 26, 2021
Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstrac

2 Apr 14, 2022
Solving reinforcement learning tasks which require language and vision

Multimodal Reinforcement Learning JAX implementations of the following multimodal reinforcement learning approaches. Dual-coding Episodic Memory from

Henry Prior 31 Feb 26, 2022
A general-purpose programming language, focused on simplicity, safety and stability.

The Rivet programming language A general-purpose programming language, focused on simplicity, safety and stability. Rivet's goal is to be a very power

The Rivet programming language 17 Dec 29, 2022
Code release for Convolutional Two-Stream Network Fusion for Video Action Recognition

Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer 676 Dec 31, 2022
Image-popularity-score - A novel deep regression method for image scoring.

Image-popularity-score - A novel deep regression method for image scoring.

Shoaib ahmed 1 Dec 26, 2021
Fast and simple implementation of RL algorithms, designed to run fully on GPU.

RSL RL Fast and simple implementation of RL algorithms, designed to run fully on GPU. This code is an evolution of rl-pytorch provided with NVIDIA's I

Robotic Systems Lab - Legged Robotics at ETH Zürich 68 Dec 29, 2022
Part-Aware Data Augmentation for 3D Object Detection in Point Cloud

Part-Aware Data Augmentation for 3D Object Detection in Point Cloud This repository contains a reference implementation of our Part-Aware Data Augment

Jaeseok Choi 62 Jan 03, 2023
All-in-one Docker container that allows a user to explore Nautobot in a lab environment.

Nautobot Lab This container is not for production use! Nautobot Lab is an all-in-one Docker container that allows a user to quickly get an instance of

Nautobot 29 Sep 16, 2022
A pytorch implementation of Paper "Improved Training of Wasserstein GANs"

WGAN-GP An pytorch implementation of Paper "Improved Training of Wasserstein GANs". Prerequisites Python, NumPy, SciPy, Matplotlib A recent NVIDIA GPU

Marvin Cao 1.4k Dec 14, 2022
Apply a perspective transformation to a raster image inside Inkscape (no need to use an external software such as GIMP or Krita).

Raster Perspective Apply a perspective transformation to bitmap image using the selected path as envelope, without the need to use an external softwar

s.ouchene 19 Dec 22, 2022
TrackFormer: Multi-Object Tracking with Transformers

TrackFormer: Multi-Object Tracking with Transformers This repository provides the official implementation of the TrackFormer: Multi-Object Tracking wi

Tim Meinhardt 321 Dec 29, 2022
Social Distancing Detector

Computer vision has opened up a lot of opportunities to explore into AI domain that were earlier highly limited. Here is an application of haarcascade classifier and OpenCV to develop a social distan

Ashish Pandey 2 Jul 18, 2022
LQM - Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstract Object detection aims to locate and classify object instances in ima

IM Lab., POSTECH 0 Sep 28, 2022
A new data augmentation method for extreme lighting conditions.

Random Shadows and Highlights This repo has the source code for the paper: Random Shadows and Highlights: A new data augmentation method for extreme l

Osama Mazhar 35 Nov 26, 2022
Interactive Terraform visualization. State and configuration explorer.

Rover - Terraform Visualizer Rover is a Terraform visualizer. In order to do this, Rover: generates a plan file and parses the configuration in the ro

Tu Nguyen 2.3k Jan 07, 2023
STBP is a way to train SNN with datasets by Backward propagation.

Spiking neural network (SNN), compared with depth neural network (DNN), has faster processing speed, lower energy consumption and more biological interpretability, which is expected to approach Stron

Ling Zhang 18 Dec 09, 2022