Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Overview

🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Clone the repo

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
pip3 install -r requirements.txt 

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts of r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot that is fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit,r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting off with the main post followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment/reply as their prompt. In total, the sub-reddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (or class).

  • Organic -

    Collected from comments of 108 subreddits the GPT2 bots have been fine-tuned upon. We randomly collected about 2000 comments between the dates of June 2019 - Jan 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the 2 folders organic and synthetic, containing the comments from individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv

Note -
For the below TL;DR run you also need to download dataset.json and dataset.pkl files which contain pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
  ├── splits (Folder already present)
    ├── 6 (Folder already present)
      └── 108_800_100_200_dataset.json (File already present)
  ├── dataset.json (to be added via drive link)
  └── dataset.pkl (to be added via drive link)
└── authors.csv (File already present)

108_800_100_200_dataset.json is a custom dataset which contains the comment ID's, the labels and their separation into train, test and validation splits.
Upon running the models, the comments for each split are fetched from the dataset.json using the comment ID's in the 108_800_100_200_dataset.json file .

Running the code

TL;DR

You can skip the pre-processing and the Create Splits if you want to run the code on some custom datasets available in the dataset/synthetic....splits folder. Make sure to follow the instructions mentioned in the Note of the Dataset section for the setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the data present in the folder create-splits. Select the type of data (organic/synthetic) you want to pre-process using the parameter synthetic in the file. By deafult the parameter is set for synthetic data i.e True. This would create a pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.

Create Train, Test and Validation Splits

We create splits of train, test and validation data. The parameters such as min length of sentences (default 6), lowercase sentences, size of train (max and default 800/class), validation (max and default 100/class) and test (max and default 200/class),number of classes (max and default 108) can be set internally in the create_splits.py in the splits folder under the commented PARAMETERS Section.

cd create-splits.py
python3 create_splits.py

This creates a folder in the folder dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation and test datasets are all stored in the same file with the filename [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json like 108_800_100_200_dataset.json.

Running the model

Now fix the same parameters in the seq_classification.py file. To train and test the best model (Fine-tuned GPT2/ RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated which will contain the results of each epoch.

Note -
For the other models - pretrained and writeprints, first generate the embeddings using the files in the folders models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then, use the script in the models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email to nirav17072[at]iiitd[dot]ac[dot]in .

⚖️ License

MIT

Owner
LCS2-IIITDelhi
Laboratory for Computation Social Systems (LCS2) is a research group led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar at IIIT-D
LCS2-IIITDelhi
ReCoin - Restoring our environment and businesses in parallel

Shashank Ojha, Sabrina Button, Abdellah Ghassel, Joshua Gonzales "Reduce Reuse R

sabrina button 1 Mar 14, 2022
Practical Natural Language Processing Tools for Humans is build on the top of Senna Natural Language Processing (NLP)

Practical Natural Language Processing Tools for Humans is build on the top of Senna Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), name entity recognition (

jawahar 20 Apr 30, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
Automated question generation and question answering from Turkish texts using text-to-text transformers

Turkish Question Generation Offical source code for "Automated question generation & question answering from Turkish texts using text-to-text transfor

Open Business Software Solutions 29 Dec 14, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
A Fast Command Analyser based on Dict and Pydantic

Alconna Alconna 隶属于ArcletProject, 在Cesloi内有内置 Alconna 是 Cesloi-CommandAnalysis 的高级版,支持解析消息链 一般情况下请当作简易的消息链解析器/命令解析器 文档 暂时的文档 Example from arclet.alcon

19 Jan 03, 2023
☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Accuracy of BBC Weather forecasts for Honolulu This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, t

Max Halford 12 Oct 15, 2022
A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

Emily's Symbol Dictionary Design This dictionary was created with the following goals in mind: Have a consistent method to type (pretty much) every sy

Emily 68 Jan 07, 2023
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE Baseline Korean(한국어) KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark. See our paper fo

74 Dec 13, 2022
PG-19 Language Modelling Benchmark

PG-19 Language Modelling Benchmark This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Proje

DeepMind 161 Oct 30, 2022
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
🦆 Contextually-keyed word vectors

sense2vec: Contextually-keyed word vectors sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detaile

Explosion 1.5k Dec 25, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Fixes mojibake and other glitches in Unicode text, after the fact.

ftfy: fixes text for you print(fix_encoding("(ง'⌣')ง")) (ง'⌣')ง Full documentation: https://ftfy.readthedocs.org Testimonials “My life is li

Luminoso Technologies, Inc. 3.4k Dec 29, 2022
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022
Source code of paper "BP-Transformer: Modelling Long-Range Context via Binary Partitioning"

BP-Transformer This repo contains the code for our paper BP-Transformer: Modeling Long-Range Context via Binary Partition Zihao Ye, Qipeng Guo, Quan G

Zihao Ye 119 Nov 14, 2022
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
Every Google, Azure & IBM text to speech voice for free

TTS-Grabber Quick thing i made about a year ago to download any text with any tts voice, over 630 voices to choose from currently. It will split the i

16 Dec 07, 2022