Chinese Generative Pre-trained Model

Last update: Jan 03, 2023

Overview

T5 PEGASUS

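A Chinese generative pre-trained model that uses mT5 as its base architecture and initial weights, and is pre-trained in a PEGASUS-like manner.

For details, see: https://kexue.fm/archives/8209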

Tokenizer

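We replaced the tokenizer of T5 PEGASUS with BERT's tokenizer, which is friendlier to Chinese. We also rebuilt the vocabulary so that the characters and words in it are more complete. The current vocab.txt contains 50,000 tokens in total and genuinely covers the commonly used Chinese characters and words.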
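To illustrate why a vocabulary that mixes words and single characters helps with Chinese, here is a toy sketch of greedy longest-match lookup against such a vocabulary. The function name `greedy_tokenize`, the toy vocabulary, and the `[UNK]` fallback are assumptions made for illustration; this is not the released vocab.txt or the project's actual tokenization code.

```python
def greedy_tokenize(text, vocab, unk="[UNK]"):
    """Segment text by always taking the longest vocabulary entry that matches."""
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches, not even the single character.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical toy vocabulary holding both words and single characters.
toy_vocab = {"中文", "生成", "模型", "预训练", "中", "文", "生", "成", "模", "型"}
print(greedy_tokenize("中文预训练生成模型", toy_vocab))
```

With word entries present, the sentence is covered by four tokens instead of nine single characters, which shortens the sequences the model has to read and generate.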

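Pre-training Task

The pre-training task imitates the summarization-style pre-training of PEGASUS. Concretely, given a document with n sentences, we select roughly n/4 of them (not necessarily contiguous) so that the longest common subsequence between the text formed by concatenating these n/4 sentences and the text formed by concatenating the remaining 3n/4 sentences is as long as possible. We then treat the concatenated 3n/4 sentences as the source text and the concatenated n/4 sentences as the summary, which yields a pseudo-summary data pair of the form "(source, summary)".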
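The construction above can be sketched in a few lines. The version below is a minimal illustration that assumes a greedy search: it adds one sentence at a time, each time picking the sentence that maximizes the longest common subsequence (LCS) between the selected text and the remaining text. The greedy strategy, the function names `lcs_length` and `pseudo_summary`, and the `ratio` parameter are assumptions for illustration; the description above does not specify how the selection is searched.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pseudo_summary(sentences, ratio=0.25):
    """Greedily split sentences into a (source, summary) pseudo-summary pair."""
    n = len(sentences)
    if n < 2:
        raise ValueError("need at least two sentences")
    # Pick about n * ratio sentences, but always leave at least one in the source.
    k = min(n - 1, max(1, round(n * ratio)))
    selected = []
    for _ in range(k):
        best_idx, best_score = None, -1
        for i in range(n):
            if i in selected:
                continue
            trial = sorted(selected + [i])
            summary = "".join(sentences[t] for t in trial)
            source = "".join(s for t, s in enumerate(sentences) if t not in trial)
            score = lcs_length(summary, source)
            if score > best_score:  # ties keep the earliest sentence
                best_idx, best_score = i, score
        selected.append(best_idx)
    selected.sort()  # keep the original sentence order in the summary
    source = "".join(s for i, s in enumerate(sentences) if i not in selected)
    summary = "".join(sentences[i] for i in selected)
    return source, summary


doc = ["今天天气很好。", "我们去公园散步。", "公园里有很多人。", "天气好的时候公园人很多。"]
source, summary = pseudo_summary(doc)
print("source :", source)
print("summary:", summary)
```

The LCS here is computed over characters, and each greedy step costs one LCS evaluation per remaining sentence, so this sketch is only practical for short documents.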

Model Download

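The currently open-sourced T5 PEGASUS is the base version, with 275 million parameters in total. It was trained with a maximum sequence length of 512, a batch_size of 96, and a learning rate of 10^-4, for 1 million steps on six 3090 GPUs, which took about 13 days. The training data is more than 30 GB of carefully processed general-domain corpus. Training accuracy is about 47% and training loss is about 2.97. The model is written, trained, and tested with bert4keras.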
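As a rough sanity check on these figures, the following sketch derives the number of sequences seen and the average step rate from the stated hyperparameters. The helper name `training_budget` is an assumption for illustration, and because "about 13 days" is approximate and sequences may be shorter than 512 tokens, the token count is only an upper bound.

```python
def training_budget(steps, batch_size, max_len, days):
    """Back-of-the-envelope totals implied by the training configuration."""
    sequences = steps * batch_size
    return {
        "sequences": sequences,                     # training examples consumed
        "max_tokens": sequences * max_len,          # upper bound if every sequence is full length
        "steps_per_second": steps / (days * 24 * 3600),
    }


budget = training_budget(steps=1_000_000, batch_size=96, max_len=512, days=13)
for name, value in budget.items():
    print(name, value)
```

Under these assumptions the run consumes 96 million sequences, at a little under one optimizer step per second on average.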

Runtime environment: tensorflow 1.15 + keras 2.3.1 + bert4keras 0.10.0

Link: https://pan.baidu.com/s/1lQ9Dt9wZDO3IgiCL9tP-Ug Extraction code: 3sfn

Selected Evaluations

Summarization results:

Few-shot learning:

How to Cite

BibTeX:

@techreport{zhuiyit5pegasus,
  title={T5 PEGASUS - ZhuiyiAI},
  author={Jianlin Su},
  year={2021},
  url="https://github.com/ZhuiyiTechnology/t5-pegasus",
}

Contact Us

Email: [email protected]
Zhuiyi Technology: https://zhuiyi.ai
