NLP_0-project

Group project for MFIN7036. Our goal is to predict firm profitability with text-based competition measures¹. We are a "democratic" and collaborative group of five; our names are listed below according to our initial division of work 😄.

Here is the outline of our project:

Data collection.

@LeiyuanHuo, jyang130, FanFanShark, xdc1999, gaojiamin1116

Based on the file data-WRDS-list.csv, write a web scraper that downloads every 10-K (HTML format) these companies filed with the SEC from 2010 to 2022 from Historical EDGAR documents, and rename each file data-10K-COMPNAME-Year.html.
Parse the HTML files to extract the Business and MD&A sections.
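
A minimal sketch of the parsing and renaming steps, assuming a filing has already been downloaded as an HTML string. The helper names (`html_to_text`, `extract_section`, `output_name`), the heading patterns, and the sample filing are illustrative assumptions, not the project's actual code. Choosing the longest span between two headings is one heuristic for skipping the table of contents; real 10-Ks vary a lot and will likely need more robust patterns.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs so item headings are easy to match.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def output_name(comp_name, year):
    # Naming pattern from the outline: data-10K-COMPNAME-Year.html
    safe = re.sub(r"[^A-Za-z0-9]+", "", comp_name)
    return f"data-10K-{safe}-{year}.html"


def extract_section(text, start_pat, end_pat):
    """Return the longest span that starts at start_pat and stops before end_pat.

    Assumption: the table-of-contents entry is much shorter than the real section.
    """
    best = ""
    for start in re.finditer(start_pat, text, flags=re.IGNORECASE):
        end = re.search(end_pat, text[start.end():], flags=re.IGNORECASE)
        if end is None:
            continue
        candidate = text[start.start(): start.end() + end.start()].strip()
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical heading patterns (start, end) for the two sections of interest.
BUSINESS = (r"item\s*1\.?\s*business", r"item\s*1a\.?")
MDA = (r"item\s*7\.?\s*management", r"item\s*7a\.?|item\s*8\.?")

# Made-up miniature filing, for illustration only.
sample = """<html><body>
<p>Item 1. Business 3</p><p>Item 1A. Risk Factors 9</p>
<h2>Item 1. Business</h2><p>We sell widgets and compete with Acme Corp.</p>
<h2>Item 1A. Risk Factors</h2><p>Competition is intense.</p>
</body></html>"""

business_text = extract_section(html_to_text(sample), *BUSINESS)
print(output_name("Acme Corp.", 2015))
print(business_text)
```

The download step itself is left out here because it depends on how data-WRDS-list.csv identifies each company.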

Text Processing: feature extraction²

Part-of-speech (POS) tagging (one of our main methods) to extract product names and descriptions. Store these for each company.
Named entity recognition (NER) (also a main method) to extract the names of mentioned competitors. Store these for each company.
Product texts: BoW and tf-idf for each company's product(s), which should give us a term-product matrix.
Competitor texts: BoW, since we care about how often each firm is mentioned.
‼️ We also need to factor sector and firm size/market power into the competitor texts and re-count.
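
The BoW and tf-idf steps above can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the tokenizer is a plain regex, the tf-idf variant is term frequency times log(N / document frequency), and the firm names, texts, and competitor list are made up. In the outline the competitor names would come from NER; here they are passed in as a hypothetical list, and POS/NER tagging is not shown because it requires a tagging library and model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lower-case alphabetic tokens only; a real pipeline would also drop stop words.
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    return Counter(tokenize(text))


def tf_idf(docs):
    """docs: {firm: product text}. Returns {firm: {term: weight}}."""
    counts = {firm: bag_of_words(text) for firm, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many firms' texts each term appears.
    doc_freq = Counter(term for bow in counts.values() for term in bow)
    weights = {}
    for firm, bow in counts.items():
        total = sum(bow.values())
        weights[firm] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in bow.items()
        }
    return weights


def count_mentions(text, competitor_names):
    """Count whole-phrase, case-insensitive mentions of each competitor name."""
    lowered = text.lower()
    return {
        name: len(re.findall(r"\b" + re.escape(name.lower()) + r"\b", lowered))
        for name in competitor_names
    }


# Made-up product descriptions, for illustration only.
docs = {
    "FirmA": "cloud storage and cloud backup software",
    "FirmB": "organic snack bars and snack mixes",
}
weights = tf_idf(docs)

mentions = count_mentions(
    "We compete with Acme Corp and Globex. Acme Corp is larger.",
    ["Acme Corp", "Globex", "Initech"],
)
print(weights["FirmA"])
print(mentions)
```

Terms that appear in every firm's text (such as "and" here) get a weight of zero, which is the reason for using tf-idf on product texts, while raw counts are kept for competitor mentions.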

Text Processing: feature transformation and representation²

Term-product matrix: calculate pairwise cosine similarity scores between products; use a score threshold to group similar products.
Term-product matrix: alternatively, apply a clustering method (e.g., KMeans) directly to the product vectors.
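
A small sketch of the first option, pairwise cosine similarity followed by threshold grouping. It assumes product vectors are stored as sparse `{term: weight}` dictionaries and reads "cluster by threshold" as "link any two products whose similarity reaches the threshold, then take the connected groups"; that reading, the function names, and the toy vectors are assumptions. The KMeans alternative is not shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_groups(vectors, threshold):
    """Group products whose pairwise similarity is >= threshold (union-find)."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Toy term-weight vectors, for illustration only.
vectors = {
    "cloud_backup": {"cloud": 2, "backup": 1},
    "cloud_storage": {"cloud": 2, "storage": 1},
    "snack_bar": {"snack": 2, "bar": 1},
}
groups = threshold_groups(vectors, threshold=0.5)
print(groups)
```

Because groups are formed through chains of links, two products can land in the same group without being directly similar; a stricter rule (every pair above the threshold) would give smaller groups.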

Econometric Analysis and Hypothesis Testing²

Multivariate regression: the dependent variable (DV) is profitability (e.g., sales, revenue, Tobin's q); the independent variables (IVs) are our competition measures (one from the count of similar products, one from mentions as a competitor); relevant control variables are also included.
Cross-sectional portfolios: our competition measures are cross-sectional (computed once per year), so we can form long-short portfolios on each measure and examine the effect on stock returns.
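
The portfolio idea can be sketched as a single-year sort. This is only an illustration under assumptions: the field names (`competition`, `next_ret`), the equal weighting, the quantile cut-off, the direction of the trade (long low-competition firms, short high-competition firms), and the numbers are all hypothetical and are not results from the project.

```python
from statistics import mean


def long_short_return(firms, measure, ret, quantile=0.2):
    """Equal-weighted return of low-measure firms minus high-measure firms.

    firms: list of dicts for ONE year, each holding the competition measure
    and the subsequent return under the given keys.
    """
    ranked = sorted(firms, key=lambda f: f[measure])
    k = max(1, int(len(ranked) * quantile))  # firms per leg
    low_leg = ranked[:k]    # least competition (long leg, by assumption)
    high_leg = ranked[-k:]  # most competition (short leg, by assumption)
    return mean(f[ret] for f in low_leg) - mean(f[ret] for f in high_leg)


# Made-up cross-section for one year, for illustration only.
firms = [
    {"firm": "A", "competition": 1, "next_ret": 0.05},
    {"firm": "B", "competition": 2, "next_ret": 0.03},
    {"firm": "C", "competition": 3, "next_ret": 0.02},
    {"firm": "D", "competition": 4, "next_ret": 0.01},
    {"firm": "E", "competition": 5, "next_ret": -0.01},
]
spread = long_short_return(firms, "competition", "next_ret", quantile=0.4)
print(spread)
```

Repeating the sort for each year from 2010 to 2022 would give a time series of spreads for each of the two competition measures.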

¹ Two papers inspired this project: Eisdorfer, A., Froot, K., Ozik, G., & Sadka, R. (2021). Competition Links and Stock Returns. The Review of Financial Studies, 2021-12-20. Hoberg, G., & Phillips, G. (2016). Text-Based Network Industries and Endogenous Product Differentiation. The Journal of Political Economy, 124(5), 1423-1465.
² The text processing steps are based on the MFIN7036 Lecture_Notes and a review paper: Marty, T., Vanstone, B., & Hahn, T. (2020). News media analytics in finance: A survey. Accounting and Finance (Parkville), 60(2), 1385-1434.
