Learned model to estimate number of distinct values (NDV) of a population using a small sample.

Last update: Nov 21, 2022

Overview

Learned NDV estimator

Learned model to estimate number of distinct values (NDV) of a population using a small sample. The model approximates the maximum likelihood estimation of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper Learning to be a Statistician: Learned Estimator for Number of Distinct Values for more details.

How to use

Install the package

pip install estndv
Import and create an instance

   from estndv import ndvEstimator
   estimator = ndvEstimator()

Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate population ndv by:

ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)
If you have the sample profile e.g. f=[2,1,1], you can estimate population NDV by:

ndv = estimator.profile_predict(f=[2,1,1], N=100000)
If you have multiple samples/profiles from multiple populations, you can estimate population NDV for all of them in a batch by method estimator.sample_predict_batch() or estimator.profile_predict_batch().

How to train the ndv estimator

You can directly use our package on PyPI for your datasets, as the pre-trained model is agnostic to any workloads. However, if you want to train the model from scratch anyway, do the following:

Go to the model_training folder cd model_training
Install requirements

pip install requirements.txt
Generate training data. (This uses a lot of memory.)

python training_data_generation.py
Train model

python model_training.py
Save trained pytorch model parameters to numpy, this generates a file model_paras.npy

python torch2npy.py
Test with your model parameters by specifying a path to your model_paras.npy

estimator = ndvEstimator(para_path=your path to model_paras.npy)

Citation

If you use our work or found it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272–284},
   numpages = {13}
}

Learned model to estimate number of distinct values (NDV) of a population using a small sample.

Related tags

Overview

Learned NDV estimator

How to use

How to train the ndv estimator

Citation

Owner

Disease Informed Neural Networks (DINNs) — neural networks capable of learning how diseases spread, forecasting their progression, and finding their unique parameters (e.g. death rate).

This's an implementation of deepmind Visual Interaction Networks paper using pytorch

Recreate CenternetV2 based on MMDET.

App customer segmentation cohort rfm clustering

Pytorch implementation of the AAAI 2022 paper "Cross-Domain Empirical Risk Minimization for Unbiased Long-tailed Classification"

A Model for Natural Language Attack on Text Classification and Inference

A Streamlit demo demonstrating the Deep Dream technique. Adapted from the TensorFlow Deep Dream tutorial.

A python module for scientific analysis of 3D objects based on VTK and Numpy

My personal Home Assistant configuration.

Supervised Sliding Window Smoothing Loss Function Based on MS-TCN for Video Segmentation

Get a Grip! - A robotic system for remote clinical environments.

Privacy-Preserving Portrait Matting [ACM MM-21]

For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training.

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

A computational block to solve entity alignment over textual attributes in a knowledge graph creation pipeline.

PyTorch implementation of paper: HPNet: Deep Primitive Segmentation Using Hybrid Representations.

Improving the robustness and performance of biomedical NLP models through adversarial training

Official Implementation of SWAGAN: A Style-based Wavelet-driven Generative Model

Opinionated code formatter, just like Python's black code formatter but for Beancount

Fit Fast, Explain Fast