Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

Related tags

Data AnalysisPyUpBit
Overview

Contributors Forks Stargazers Issues LinkedIn


PyUpBit

CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing
Paper

Table of Contents
  1. About The Project
  2. Usage
  3. Contact
  4. Acknowledgements

About The Project

Bitmaps are common data structures used in database implemen- tations due to having fast read performance. Often they are used in applications in need of common equality and selective range queries. Essentially, they store a bit-vector for each value in the domain of each attribute to keep track of large scale data files. How- ever, the main drawbacks associated with bitmap indexes are its encoding and decoding performances of bit-vectors. Currently the state of art update-optimized bitmap index, update conscious bitmaps, are able to support extremely efficient deletes and have improved update speeds by treating updates as delete then insert. Update conscious bitmaps make use of an additional bit-vector, called the existence bit-vector, to keep track of whether or not a value has been updated. By initializing all values of the existence bit-vector to 1, the data for each attribute associated with each row in the existence bit-vector is validated and presented. If a value needs to be deleted, the corresponding row in the existence bit-vector gets changed to 0, invalidating any data associated with that row. This new method in turn allows for very efficient deletes. To add on, updates are then performed as a delete operation, then an insert operation in to the end of the bit-vector. However, update conscious bitmaps do not scale well with more data. As more and more data gets updated and inserted, the run time increases significantly as well. Because update queries are out-of- place and increase size of vectors, read queries become increasingly expensive and time consuming. Furthermore, as the number of updates and deletes increases, the bit-vector becomes less and less compressible. This brings us to updateable Bitmaps (UpBit). According to the paper, UpBit: Scalable In-Memory Updatable Bitmap Indexing, re- searchers Manos Athanassoulis, Zheng Yan, and Stratos Idreos developed a new bitmap structure that improved the write per- formance of bitmaps without sacrificing read performance. The main differentiating point of UpBit is its use of an update bit vector for every value in the domain of an attribute that keeps track of updated values. This allows for faster write performance without sacrificing read performance. Based on this paper, we implemented UpBit and compared it to our implementation of update conscious bitmaps to compare and test the performances of both methods.

Usage

We used PyCharm to conduct our tests, /ucb, /upbit for algorithms, /tests for running testing scripts, and rest of the files for compression for memory usage improvement as well as creating and visualizing data.

Contact

Daniel Park - @h1yung - [email protected]

Acknowledgements

  • Original Paper
  • Winston Chen
  • Gregory Chininis
  • Daniel Hooks
  • Michael Lee
Owner
Hyeong Kyun (Daniel) Park
I like coding
Hyeong Kyun (Daniel) Park
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Full automated data pipeline using docker images

Create postgres tables from CSV files This first section is only relate to creating tables from CSV files using postgres container alone. Just one of

1 Nov 21, 2021
BioMASS - A Python Framework for Modeling and Analysis of Signaling Systems

Mathematical modeling is a powerful method for the analysis of complex biological systems. Although there are many researches devoted on produ

BioMASS 22 Dec 27, 2022
songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Songplays User activity datamart The following document describes the model used to build the songplays datamart table and the respective ETL process.

Leandro Kellermann de Oliveira 1 Jul 13, 2021
Average time per match by division

HW_02 Unzip matches.rar to access .json files for matches. Get an API key to access their data at: https://developer.riotgames.com/ Average time per m

11 Jan 07, 2022
ped-crash-techvol: Texas Ped Crash Tech Volume Pack

ped-crash-techvol: Texas Ped Crash Tech Volume Pack In conjunction with the Final Report "Identifying Risk Factors that Lead to Increase in Fatal Pede

Network Modeling Center; Center for Transportation Research; The University of Texas at Austin 2 Sep 28, 2022
VevestaX is an open source Python package for ML Engineers and Data Scientists.

VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.

Vevesta 24 Dec 14, 2022
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022
Includes all files needed to satisfy hw02 requirements

HW 02 Data Sets Mean Scale Score for Asian and Hispanic Students, Grades 3 - 8 This dataset provides insights into the New York City education system

7 Oct 28, 2021
A collection of robust and fast processing tools for parsing and analyzing web archive data.

ChatNoir Resiliparse A collection of robust and fast processing tools for parsing and analyzing web archive data. Resiliparse is part of the ChatNoir

ChatNoir 24 Nov 29, 2022
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 07, 2022
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
This project is the implementation template for HW 0 and HW 1 for both the programming and non-programming tracks

This project is the implementation template for HW 0 and HW 1 for both the programming and non-programming tracks

Donald F. Ferguson 4 Mar 06, 2022
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
A program that uses an API and a AI model to get info of sotcks

Stock-Market-AI-Analysis I dont mind anyone using this code but please give me credit A program that uses an API and a AI model to get info of stocks

1 Dec 17, 2021
Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Amber Electric Usage Summary This is a command line tool that produces a summary CSV report of an Amber Electric customer's energy consumption and cos

Graham Lea 12 May 26, 2022
Tools for working with MARC data in Catalogue Bridge.

catbridge_tools Tools for working with MARC data in Catalogue Bridge. Borrows heavily from PyMarc

1 Nov 11, 2021
Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation

Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation Overview Consider the scenario in which advertisement

Manuel Bressan 2 Nov 18, 2021
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022