A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore uses the assumption that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.


Install

pip install crosslingual-coreference
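
The language argument in the examples below expects an installed spaCy model; if "en_core_web_sm" is not available yet, download it first:

python -m spacy download en_core_web_sm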

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.
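
A hedged non-English sketch: it assumes the Dutch nl_core_news_sm spaCy model has been downloaded, and any other spaCy model for the target language should work the same way.

from crosslingual_coreference import Predictor

# Hypothetical Dutch example; pass any installed spaCy model for the
# target language as `language`.
predictor = Predictor(
    language="nl_core_news_sm", device=-1, model_name="info_xlm"
)

text = "Vergeet Momofuku Ando niet! Hij creëerde instantnoedels in Osaka."
print(predictor.predict(text)["resolved_text"])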

Chunking/batching to resolve out-of-memory (OOM) errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
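
A minimal usage sketch on top of this configuration: long inputs go through predict() exactly as before, and the chunking happens internally (long_text is just a stand-in for any document that would otherwise trigger an OOM error).

long_text = " ".join(
    ["Momofuku Ando created instant noodles in Osaka. He is famous for it."] * 500
)

# With chunk_size/chunk_overlap set, the text is resolved piece by piece
# instead of in one pass, keeping peak memory bounded.
print(predictor.predict(long_text)["resolved_text"][:200])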

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
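
The raw cluster output is easiest to read when mapped back onto tokens. A small sketch, assuming each inner pair is an inclusive [start, end] token span (consistent with the indices and the resolved text above):

for cluster in doc._.coref_clusters:
    # doc[start : end + 1] because the span ends appear to be inclusive
    mentions = [doc[start : end + 1].text for start, end in cluster]
    print(mentions)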

More Examples

Comments
  • Which language model is used for minilm

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the source code below,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the language model used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same one that I can see on https://huggingface.co/models, like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main, or is it some other Hugging Face model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spaCy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) $ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    I am using the minilm model with language 'en_core_web_sm'. While comparing prediction times for predictor.predict(text), the first call always takes a bit longer than the following ones. Suppose that after creating a predictor object, I call predict as follows:

    predictor.predict(text) ---> first call
    predictor.predict(text) ---> second call
    predictor.predict(text) ---> third call

    The time taken for the first call is noticeably higher (~0.2 s) than the following prediction calls (~0.05 s). Could you please help me understand why this initial call takes more time?
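
    A hedged workaround, assuming the overhead comes from one-off lazy initialization rather than the prediction itself: absorb it with a throwaway warm-up call right after constructing the predictor.

    from crosslingual_coreference import Predictor

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")

    # Hedged sketch: pay the one-off first-call cost up front, assuming it
    # stems from lazy initialization/caching.
    predictor.predict("Warm-up sentence about Momofuku Ando. He invented noodles.")

    # Subsequent calls should now run at steady-state latency.
    result = predictor.predict("Anna lost her keys. She found them later.")
    print(result["resolved_text"])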

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    after installing the library I saw that google-api-core-2.10.1, google-auth-2.12.0, google-cloud-core-2.3.2 and google-cloud-storage-1.44.0 had been installed as well. In fact, these packages can be found in the poetry.lock file.

    Is there a reason (I don't get) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13, spaCy 3.1.0, en_core_web_sm 3.1.0, crosslingual_coreference 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads, without producing the reconstituted text. It could potentially be a separate function whose output also acts as input to replace_corefs.
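
    In the meantime, an approximation can be built on the existing doc._.coref_clusters output. A hedged sketch, assuming the clusters are lists of inclusive [start, end] token spans (as in the pipeline example above) and taking the first mention of each cluster as a stand-in for its head:

    import spacy
    import crosslingual_coreference

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": -1})

    doc = nlp("Do not forget about Momofuku Ando! He created instant noodles in Osaka.")

    heads = []
    for cluster in doc._.coref_clusters:
        start, end = cluster[0]  # first mention as a stand-in for the head
        heads.append(doc[start : end + 1].text)
    print(heads)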

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this URL still valid, and if not, what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX enhanced transformer embeddings

    Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases(0.2.9)
Owner
Pandora Intelligence
Pandora Intelligence is an independent intelligence company, specialized in security risks.