NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we use topic modeling and clustering to determine what each article is about, first based on its content and then based only on its title.

The process is as follows: Load the data. Keep only what we need from the data. Clean the text (e.g., remove stopwords).

Build the bag of words for all documents. Build the bag of words for each document.
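The cleaning and bag-of-words steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the `STOPWORDS` set, the `clean` helper, and the two sample documents are hypothetical placeholders, and the project itself may rely on nltk or gensim for these steps instead.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only; nltk's English
# stopword corpus would be a more realistic choice.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "for"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# Placeholder documents, not taken from the dataset.
docs = [
    "The senate votes on the budget",
    "Budget talks stall in the senate",
]

cleaned = [clean(d) for d in docs]

# Bag of words for each document: token -> count within that document.
doc_bows = [Counter(tokens) for tokens in cleaned]

# Bag of words for all documents: token -> count across the whole corpus.
corpus_bow = Counter()
for bow in doc_bows:
    corpus_bow.update(bow)
```

Keeping a per-document count alongside the corpus-wide count mirrors the two bag-of-words steps listed above.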

Vectorize the data. Run the LDA model. Run the model on all the data and save the output to a dataframe.

Run the clustering algorithm. Save the data to CSV. Make the charts.
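A rough sketch of the vectorize, LDA, and clustering steps, assuming scikit-learn is used for all three. The proposal does not name the clustering algorithm, the number of topics, or the features being clustered, so `KMeans`, `n_topics = 2`, and clustering on the per-document topic distributions are all assumptions made for illustration. The sample texts are placeholders.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder texts standing in for the cleaned 'content' column.
texts = [
    "senate votes budget bill",
    "budget talks stall senate",
    "team wins championship game",
    "coach praises team after game",
]

# Vectorize: one row per document, one column per vocabulary term.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# Fit LDA and get the per-document topic distribution.
n_topics = 2  # assumed value, not from the proposal
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)

# Save the model output to a dataframe.
topic_cols = [f"topic_{i}" for i in range(n_topics)]
results = pd.DataFrame(doc_topics, columns=topic_cols)
results["dominant_topic"] = doc_topics.argmax(axis=1)

# Cluster the documents on their topic distributions (assumed design choice).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
results["cluster"] = kmeans.fit_predict(doc_topics)

# With no path, to_csv returns the CSV as a string; pass a file path to
# write it to disk instead.
csv_text = results.to_csv(index=False)
```

The same pipeline could be run a second time on the 'title' column to compare title-only topics with content-based ones.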

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.
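Keeping only the two needed features might look like the sketch below, assuming the dataset is a CSV read with pandas. The in-memory CSV is a placeholder with made-up values, used only so the example runs on its own; dropping rows with missing text is an assumption, not something the proposal states.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the raw CSV and its 12 columns.
# All values are placeholders, not real rows from the dataset.
raw_csv = io.StringIO(
    "id,title,author,date,content,year,month,publication,category,digital,section,url\n"
    "1,Example title,Jane Doe,2019-01-01,Example body text.,2019,1,"
    "Example Post,news,1,world,http://example.com/1\n"
)

# usecols loads only the listed columns, so the other ten are never read in.
df = pd.read_csv(raw_csv, usecols=["title", "content"])

# Assumed extra step: drop rows where either text field is missing.
df = df.dropna(subset=["title", "content"]).reset_index(drop=True)
```

For the real file, the `io.StringIO` object would be replaced by the path to the downloaded CSV.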

The 'title' is the headline of the news item, article, or essay. The 'content' is the body text of the news item, article, or essay itself.

TOOLS

pandas, NumPy, scikit-learn, Matplotlib, seaborn, NLTK, gensim
