Instant search for and access to many datasets in Pyspark.

Last update: Dec 16, 2022

Overview

SparkDataset

Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure).

Drop a star if you like the project. 😃 Motivates 💪 me to keep working on such projects

What?

The idea is simple. Many datasets are available, but they are scattered across the web. Is there a quick way (in Pyspark) to access them instantly, without the hassle of searching, downloading, and reading them in? SparkDataset tries to address that question :)

Usage:

Start by importing data():

from sparkdataset import data

To load a dataset:

titanic = data('titanic')

To display the documentation of a dataset:

data('titanic', show_doc=True)

To see the available datasets:

data()

To search for datasets by a term:

data('ab')

Did you mean:
crabs, abbey, Vocab

That's it.

Go to this notebook for a demonstration of the functionality

Why?

In R, there is a very easy and immediate way to access multiple statistical datasets with almost no effort. All it takes is one line: > data(dataset_name). This makes life easier for quick prototyping and testing. Well, I am jealous that Pyspark does not have similar functionality. Thus, the aim of sparkdataset is to fill that gap.

Currently, sparkdataset has about 757 (mostly numerical) datasets, which are based on RDatasets. In the future, I plan to scale it to include a larger set of datasets. For example:

include textual data for NLP-related tasks, and
allow adding a new dataset to the in-module repository.

Installation:

$ pip install sparkdataset

Uninstall:

$ pip uninstall sparkdataset
$ rm -rf $HOME/.sparkdataset

Changelog

1.0.0

Added search dataset by name similarity.
Example:

>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt

Added support for Windows.
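
The name-similarity search described in this changelog entry can be approximated with Python's standard library. The sketch below is only an illustration, not sparkdataset's actual implementation: the suggest() helper, the DATASET_NAMES list, and the cutoff value are all hypothetical, and difflib's matching may rank or select names differently from the package.

```python
from difflib import get_close_matches

# Hypothetical catalogue: a handful of names that appear in this README.
# The real package ships several hundred datasets.
DATASET_NAMES = [
    "titanic", "crabs", "abbey", "Vocab",
    "Wheat", "heart", "Heating", "Yeast",
]


def suggest(term, names=DATASET_NAMES, limit=5, cutoff=0.5):
    """Return up to `limit` names that resemble `term`, ignoring case."""
    # Map lowercase names back to their original spelling.
    by_lower = {}
    for name in names:
        by_lower.setdefault(name.lower(), name)
    hits = get_close_matches(term.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


if __name__ == "__main__":
    print("Did you mean:")
    print(", ".join(suggest("heat")))
```

Raising the assumed cutoff makes the suggestions stricter; lowering it returns looser matches.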

Dependency:

pandas
pyspark :: 3.1.2

Miscellaneous:

Tested on OSX and Linux (Debian).
Supports Python 3 (3.8.8 and above).

TODO:

add textual datasets (e.g. NLTK stuff).
add samples generators.

Thanks to:

RDatasets: R's datasets collection.

Releases(1.0.0)

1.0.0(Nov 1, 2021)

Provides instant 🚀 access to many popular datasets 📑 right from Pyspark 🔥 (in dataframe structure).

Owner

Souvik Pratiher
