Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreference information, and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreference information
example data and data schema for displaying the extracted quote information in a search tool
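
To give a feel for the pattern-matching approach in the first item, here is a minimal sketch in Python. It is not the code from regex_pipeline/: the pattern, the list of reporting verbs, the function name `extract_quotes` and the output keys are all assumptions made for illustration. It only handles the simple case of a quoted span followed by a reporting verb and a capitalised speaker name.

```python
import re

# Hypothetical pattern (not taken from regex_pipeline/):
# a quoted span, then a reporting verb, then a capitalised name.
QUOTE_PATTERN = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r"(?P<verb>said|says|told|added)\s+"
    r"(?P<speaker>[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)"
)


def extract_quotes(text):
    """Return one dict per matched quote with its verb and speaker."""
    results = []
    for match in QUOTE_PATTERN.finditer(text):
        results.append(
            {
                # Drop surrounding whitespace and a trailing comma inside the quotes.
                "quote": match.group("quote").strip().rstrip(","),
                "verb": match.group("verb"),
                "speaker": match.group("speaker"),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        '"We are ready," said Jane Doe. The weather was fine. '
        "“It will take time”, added John Smith yesterday."
    )
    for item in extract_quotes(sample):
        print(item)
```

A pattern like this misses many real cases (speaker before the quote, quotes split across sentences, pronouns instead of names), which is the kind of gap a trained model and a coreferencing step are meant to cover.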

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rule-based coreferencing tool
schema/ – data output schema and example data

Each folder contains its own README file with instructions to set up and run that piece of work.

Owner

Journalism AI collab 2021
