Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Last update: Nov 15, 2022

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting the Python150k and JavaScript150k datasets (splitting by repository, removing duplicates, and using the redistributable version of Py150k).
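To illustrate what splitting by repository with duplicate removal means, here is a minimal sketch. It is not the code from data_utils. The function names, the split fractions, and the assumed path layout `data/<owner>/<repo>/...` are all hypothetical, and exact-content hashing is only a stand-in for whatever duplicate detection the actual scripts use.

```python
import hashlib
import random
from collections import defaultdict


def repo_of(path):
    """Return '<owner>/<repo>' for a path assumed to look like 'data/<owner>/<repo>/...'."""
    parts = path.split("/")
    return "/".join(parts[1:3])


def split_by_repository(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (path, source_text) pairs into train / val / test lists of paths.

    All files of one repository end up in the same split, so a model is
    never evaluated on a project it has seen during training.
    """
    # Step 1: drop exact duplicates by hashing file contents (first copy wins).
    seen = set()
    by_repo = defaultdict(list)
    for path, text in files:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        by_repo[repo_of(path)].append(path)

    # Step 2: shuffle repositories (not individual files) and cut into splits.
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    n_val = int(len(repos) * val_frac)
    n_test = int(len(repos) * test_frac)
    groups = {
        "val": repos[:n_val],
        "test": repos[n_val:n_val + n_test],
        "train": repos[n_val + n_test:],
    }
    return {name: [p for r in rs for p in by_repo[r]] for name, rs in groups.items()}
```

Shuffling at the repository level rather than the file level is what keeps files from the same project out of both the training and evaluation sets.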

Repository structure

data_utils: scripts for downloading the Python150k and JavaScript150k datasets and obtaining new train / val / test splits (splitting by repository, removing duplicates, and using the redistributable version of Py150k)
vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See README in each directory for details.

Run

The code was tested on a system with Linux 3.10.0. Experiments were run using a Tesla V100 GPU. Required libraries are listed in requirments.txt in VM_FN and CC directories. The implementation is based on PyTorch>=1.5.
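A quick way to check the PyTorch>=1.5 requirement before running anything is sketched below. The helper names `parse_version` and `torch_is_supported` are hypothetical and not part of this repository; the only PyTorch attribute used is `torch.__version__`.

```python
import re

MIN_TORCH = (1, 5)  # minimum (major, minor) version stated above


def parse_version(version):
    """Return (major, minor) from a version string such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_supported(version, minimum=MIN_TORCH):
    """True if the given version string is at least the minimum version."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "ok" if torch_is_supported(torch.__version__) else "too old, need >= 1.5"
        print(f"PyTorch {torch.__version__}: {status}")
```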

Running experiments:

Download and resplit data, see data_utils for details;
Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code},
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}
}


Owner

Bayesian Methods Research Group
