Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

Overview

Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

The performances of tree ensembles and neural networks on structured data are evaluated. In addition, the effectiveness of combining neural network and decision trees (such as random trees, histogram based gradient boosting, and xgboost) is investigated. Covariant shift, Random forest's inability to extrapolate, and data leakage are investigated.

A simple 2-layer Neural network outperformed xgboost, followed by random forests. The worst performance based on RMSE was obtained from the histogram based gradient boosting regressor.

Overall, the best rmse (0.220194)--about 4.04% improvement over the kaggle's leaderboard first place score -- was obtained by taking the average of the predictions by the neural network and xgboost regressor.

Key takeaways:

  1. Always start with a baseline

  2. Random forests are generally bad at extrapolating, hence, if there is a shift in the domain between the training input and the validation (or test) inputs, then the random forest model will perform rather poorly on the validation set(or test set).

rf_failure

The red portion of the plot above shows the extrapolation problem. The random forest was trained on the first 70% of the data and used to make predictions on thr full data including the last 30%. It fails because there is an obvious linear trend it was unable to properly capture. Moreover, the predictions by random forests are confined within the range of the training input labels, since random forests make predictions by taking the average of previously observed data. Hence, when the input for prediction is

  1. To improve the performance of random forests, you could attempt to find the columns or features on which the training and validation sets differ the most. You may drop the ones that least impacts the accuracy of the model. To achieve this, I trained a random forest that can tell if a given input is from a training set or validation set. This helped me determine if a validation set has the same or similar distribution as the training set. Lastly, I computed the feature importances. The feature importances for this model revealed the degree of dissimilarity of the features between the training and validation sets. The features with high feature importances are the most dissimilar between the sets. salesID and machineID were significantly different between the sets but impacts RMSE the least, hence they were dropped. Other common approaches taken to improve performance include: finding and removing the redundant features by making similarity plot (shown below), choosing more recent data for both the training and the validation sets.

similarity plot

  1. For forecasting tasks (time dependent targets), the validation set should not be arbitrarily chosen i.e train_test_split may not be your best option for splitting the data. Since you are looking to make predictions on future sales, your validation set should contain more recent data, so that if your model is able to do well on the validation set, then, you can be more confident about its predictions in the future.

  2. Data leakage should be investigated. Signs of data leakage include:

    • Unrealistically high level of performance on the test set
    • Apparently meaningless feature(s) scoring very high on feature importance
    • Partial dependence plots that do not make sense.

popularitypartial_dependence

Observations extracted from the notebook*

Towards the end of the productsize plot, we see an interesting trend. The auction price is at its lowest in the end. This group represent the missing values in our product size. Missing values constitute the greatest percentage in our ProductSize. However, recall that productsize is our third most important feature. So, how is it possible that a feature that is missing so often could be so important to the prediction? The answer may be tied to data leakage. We can theorize that the auctions with missing product size information were not really successful since they were sold at very low prices, as a resutlt, the size information were either removed or intentionally omitted. It is also possible that most of these data were collected after sales were made, and for the sales that were not great, the product size were simply left blank. The intention is completely debatable, it might be intended to provide clue as to the nature of the sale, however, such information can harm our model or even render it completely useless. Clearly, our model could be misled into thinking that missing product size is an indication of low price and as such will always predict a low price whenever the product size attribute is missing. A model afflicted with data leakage will not perform well in production.

  1. An histogram based gradient boosting regressor may not be the best for forecasting on time dependent data. It showed the least peroformance with an RMSE of 0.239826

  2. A simple Neural network can show superior performance on structured data. A 2-layer neural network in which the categorical variables (i.e features with cardinality < 1000) were handled using embeddings showed a 1.93% improvement in RMSE compared to the best random forest model. It also outperformed the xgboost regressor even after the hyperparameters were tuned.

  3. There is some benefit to be derived by using an ensemble of models. In this project, each time, the neural network was combined with any of the trees, a superior performance always ensues. The best performance was obtained from the combination of neural network and the xgboost model.

Owner
Mustapha Unubi Momoh
Python Developer| Data scientist
Mustapha Unubi Momoh
hipCaffe: the HIP port of Caffe

Caffe Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Cent

ROCm Software Platform 126 Dec 05, 2022
A parametric soroban written with CADQuery.

A parametric soroban written in CADQuery The purpose of this project is to demonstrate how "code CAD" can be intuitive to learn. See soroban.py for a

Lee 4 Aug 13, 2022
MaskTrackRCNN for video instance segmentation based on mmdetection

MaskTrackRCNN for video instance segmentation Introduction This repo serves as the official code release of the MaskTrackRCNN model for video instance

411 Jan 05, 2023
Robustness between the worst and average case

Robustness between the worst and average case A repository that implements intermediate robustness training and evaluation from the NeurIPS 2021 paper

CMU Locus Lab 16 Dec 02, 2022
Keras implementation of AdaBound

AdaBound for Keras Keras port of AdaBound Optimizer for PyTorch, from the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate. Usage A

Somshubra Majumdar 132 Sep 23, 2022
Malware Env for OpenAI Gym

Malware Env for OpenAI Gym Citing If you use this code in a publication please cite the following paper: Hyrum S. Anderson, Anant Kharkar, Bobby Fila

ENDGAME 563 Dec 29, 2022
DanceTrack: Multiple Object Tracking in Uniform Appearance and Diverse Motion

DanceTrack DanceTrack is a benchmark for tracking multiple objects in uniform appearance and diverse motion. DanceTrack provides box and identity anno

260 Dec 28, 2022
CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UC2 UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu,

Mingyang Zhou 28 Dec 30, 2022
Unofficial implementation of "Coordinate Attention for Efficient Mobile Network Design"

Unofficial implementation of "Coordinate Attention for Efficient Mobile Network Design". CoordAttention tensorflow slim

Billy 9 Aug 22, 2022
tmm_fast is a lightweight package to speed up optical planar multilayer thin-film device computation.

tmm_fast tmm_fast or transfer-matrix-method_fast is a lightweight package to speed up optical planar multilayer thin-film device computation. It is es

26 Dec 11, 2022
Mscp jamf - Build compliance in jamf

mscp_jamf Build compliance in Jamf. This will build the following xml pieces to

Bob Gendler 3 Jul 25, 2022
A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains (IJCV submission)

wsss-analysis The code of: A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains, arXiv pre-print 2019 paper.

Lyndon Chan 48 Dec 18, 2022
Code Release for Learning to Adapt to Evolving Domains

EAML Code release for "Learning to Adapt to Evolving Domains" (NeurIPS 2020) Prerequisites PyTorch = 0.4.0 (with suitable CUDA and CuDNN version) tor

23 Dec 07, 2022
Acoustic mosquito detection code with Bayesian Neural Networks

HumBugDB Acoustic mosquito detection with Bayesian Neural Networks. Extract audio or features from our large-scale dataset on Zenodo. This repository

31 Nov 28, 2022
Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

NeuralTextures This is repository with inference code for paper "StylePeople: A Generative Model of Fullbody Human Avatars" (CVPR21). This code is for

Visual Understanding Lab @ Samsung AI Center Moscow 18 Oct 06, 2022
PyTorch implementation of MSBG hearing loss model and MBSTOI intelligibility metric

PyTorch implementation of MSBG hearing loss model and MBSTOI intelligibility metric This repository contains the implementation of MSBG hearing loss m

BUT <a href=[email protected]"> 9 Nov 08, 2022
SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

SmallInitEmb LayerNorm(SmallInit(Embedding)) in a Transformer I find that when t

PENG Bo 11 Dec 25, 2022
Lucid library adapted for PyTorch

Lucent PyTorch + Lucid = Lucent The wonderful Lucid library adapted for the wonderful PyTorch! Lucent is not affiliated with Lucid or OpenAI's Clarity

Lim Swee Kiat 520 Dec 26, 2022
MNIST, but with Bezier curves instead of pixels

bezier-mnist This is a work-in-progress vector version of the MNIST dataset. Samples Here are some samples from the training set. Note that, while the

Alex Nichol 15 Jan 16, 2022
Deep learning models for classification of 15 common weeds in the southern U.S. cotton production systems.

CottonWeeds Deep learning models for classification of 15 common weeds in the southern U.S. cotton production systems. requirements pytorch torchsumma

Dong Chen 8 Jun 07, 2022