Contains the code and data for our #ICSE2022 paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences"

Last update: Oct 31, 2022

Overview

CodeFill

This repository contains the code for our paper "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences" (DOI: 10.1145/3510003.3510172), authored by Maliheh Izadi, Roberta Gismondi, and Georgios Gousios and accepted for publication at #ICSE2022.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.

In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
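To give a feel for the paired inputs described in the abstract (token names alongside their token types), here is a minimal sketch. It is not the preprocessing pipeline used in the paper: it relies on Python's standard `tokenize` module and uses lexical token types as a stand-in for AST token types, and the helper name `names_and_types` is hypothetical.

```python
import io
import tokenize

# Layout-only tokens that carry no name/type information of interest here.
SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}

def names_and_types(source: str):
    """Return two parallel lists: token strings and lexical token-type names.

    Assumption: lexical token types approximate the AST token types
    mentioned in the abstract; the paper's actual types may differ.
    """
    names, types = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        names.append(tok.string)
        types.append(tokenize.tok_name[tok.type])
    return names, types

names, types = names_and_types("total = price * 2\n")
print(names)  # ['total', '=', 'price', '*', '2']
print(types)  # ['NAME', 'OP', 'NAME', 'OP', 'NUMBER']
```

The two lists are aligned position by position, so a model could consume them as two parallel sequences.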

Data

Our datasets are available on the Hugging Face Hub.

Owner

Software Analytics Lab
