VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Last update: Oct 24, 2022

New: Paper accepted at ICSE 2022. A preprint is available on arXiv (2112.02650).

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach for learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on the IdBench benchmark.

Step 0: Install

pip install -e .

Step 1: Load a Pre-trained VarCLR Model

from varclr.models import Encoder
model = Encoder.from_pretrained("varclr-codebert")

Step 2: VarCLR Variable Embeddings

Get the embedding of a single variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get the embeddings of a list of variables (supports batching)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
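
As a rough illustration of how two such embedding vectors can be compared, the sketch below computes a cosine similarity in plain Python. This is an assumption for illustration only: this README does not state how `model.score` is computed, and the `cosine_similarity` helper and the 3-dimensional toy vectors are hypothetical stand-ins for real 768-dimensional VarCLR embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (real VarCLR vectors have 768 dimensions).
emb_a = [1.0, 2.0, 2.0]
emb_b = [2.0, 4.0, 4.0]   # same direction as emb_a
emb_c = [2.0, -1.0, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # 1.0
print(cosine_similarity(emb_a, emb_c))  # 0.0
```

Vectors pointing in the same direction score 1.0 regardless of their length, which is why cosine similarity is a common choice for comparing learned embeddings.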

Step 3: Get VarCLR Similarity Scores

Get similarity scores of N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
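
One possible use of the pairwise matrix is a nearest-neighbor lookup. The sketch below takes the `cross_score` values printed above as a plain nested list and finds, for each variable, the most similar other variable. The `most_similar` helper is hypothetical and not part of VarCLR; it also assumes the scores are already ordinary Python floats.

```python
def most_similar(names, scores):
    """For each name, return the other name with the highest score.

    scores[i][j] is the similarity between names[i] and names[j].
    Requires at least two names.
    """
    result = {}
    for i, name in enumerate(names):
        # Skip the diagonal so a variable is never matched with itself.
        candidates = [(scores[i][j], names[j]) for j in range(len(names)) if j != i]
        best_score, best_name = max(candidates)
        result[name] = (best_name, best_score)
    return result


variable_list = ["squareslab", "strudel", "neulab"]
# Values copied from the cross_score output shown above.
scores = [
    [1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
    [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
    [0.7207341194152832, 0.549992561340332, 1.000000238418579],
]

for name, (neighbor, score) in most_similar(variable_list, scores).items():
    print(f"{name} -> {neighbor} ({score:.3f})")
# squareslab -> neulab (0.721)
# strudel -> neulab (0.550)
# neulab -> squareslab (0.721)
```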

Step 4: Reproduce IdBench Benchmark Results

Load the IdBench benchmark

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
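
For readers unfamiliar with the two reported metrics, the following is a minimal pure-Python sketch of Pearson and Spearman correlation. It is not the benchmark's implementation, which may compute these values differently (for example with a statistics library); the helper names and the five rating pairs are hypothetical, and only the dictionary keys mirror the output shown above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and model scores for five identifier pairs.
human = [0.1, 0.4, 0.5, 0.8, 0.9]
model_scores = [0.2, 0.3, 0.7, 0.6, 0.95]
print({"spearmanr": spearman(human, model_scores),
       "pearsonr": pearson(human, model_scores)})
```

Spearman correlation depends only on the ordering of the scores, so it rewards a model that ranks identifier pairs the way human raters do even when the raw score scales differ.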

Results on IdBench benchmarks

Similarity

Method	Small	Medium	Large
FT-SG	0.30	0.29	0.28
LV	0.32	0.30	0.30
FT-cbow	0.35	0.38	0.38
VarCLR-Avg	0.47	0.45	0.44
VarCLR-LSTM	0.50	0.49	0.49
VarCLR-CodeBERT	0.53	0.53	0.51

Combined-IdBench	0.48	0.59	0.57
Combined-VarCLR	0.66	0.65	0.62

Relatedness

Method	Small	Medium	Large
LV	0.48	0.47	0.48
FT-SG	0.70	0.71	0.68
FT-cbow	0.72	0.74	0.73
VarCLR-Avg	0.67	0.66	0.66
VarCLR-LSTM	0.71	0.70	0.69
VarCLR-CodeBERT	0.79	0.79	0.80

Combined-IdBench	0.71	0.78	0.79
Combined-VarCLR	0.79	0.81	0.85

Pre-train your own VarCLR models

Coming soon.

Cite

If you find VarCLR useful in your research, please cite our ICSE 2022 paper:

@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
