Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies (unrar, from the non-free repository):

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub at https://huggingface.co/settings/token, then set the following environment variables in the .env file in the repository root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
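
Before running the scripts, it can help to confirm that all four variables are actually filled in. The following is a minimal sketch, not part of this repo: it assumes the .env file uses plain KEY=VALUE lines, and the helper names parse_env_file and missing_vars are hypothetical.

```python
from pathlib import Path

# Variable names taken from the .env template in this README.
REQUIRED_VARS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: no multi-line values or shell-style expansion are used.
    """
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_vars(parse_env_file(env_path))
        print("Missing or empty:", ", ".join(missing) if missing else "none")
    else:
        print("No .env file found in the current directory")
```

Run it from the repository root; any name it lists still needs a value in .env.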

Create metadata

To create the dataset metadata (in the file dataset_infos.json), run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad
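
To check the result, the generated file can be inspected with a few lines of Python. This is only a sketch: it assumes dataset_infos.json holds a JSON object at the top level (commonly one entry per dataset configuration), and the function name summarize_dataset_infos is hypothetical, not something this repo provides.

```python
import json
from pathlib import Path


def summarize_dataset_infos(path="dataset_infos.json"):
    """Map each top-level key of the metadata file to its sorted field names.

    Assumption: the file contains a JSON object whose values are themselves
    objects; entries that are not objects are reported with no fields.
    """
    with Path(path).open(encoding="utf-8") as f:
        infos = json.load(f)
    if not isinstance(infos, dict):
        raise ValueError("expected a JSON object at the top level")
    return {
        name: sorted(info) if isinstance(info, dict) else []
        for name, info in infos.items()
    }


if __name__ == "__main__":
    if Path("dataset_infos.json").exists():
        for name, fields in summarize_dataset_infos().items():
            print(name, "->", ", ".join(fields))
    else:
        print("dataset_infos.json not found in the current directory")
```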

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1
<dir_path_to_save_aggregated_dataset>: path of the directory in which to save the aggregated dataset
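
As an illustration of the ratios file described above, the sketch below validates a dict of ratios and writes it as JSON. It assumes the bounds 0 and 1 are inclusive. The second dataset name and the helpers validate_ratios and write_ratios are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that ratios is a dict of dataset name -> number in [0, 1].

    Assumption: the bounds are inclusive; the README only says "between 0 and 1".
    """
    if not isinstance(ratios, dict):
        raise ValueError("ratios must be a dict")
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise ValueError(f"ratio for {name!r} is not a number")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} is outside [0, 1]: {ratio}")
    return ratios


def write_ratios(ratios, path):
    """Validate the ratios and write them to path as a JSON object."""
    text = json.dumps(validate_ratios(ratios), indent=2)
    Path(path).write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    example = {
        # Dataset name taken from the example earlier in this README.
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        # Hypothetical second dataset, for illustration only.
        "example-org/example_dataset": 0.25,
    }
    print(json.dumps(validate_ratios(example), indent=2))
```

A file written with write_ratios can then be passed to aggregate_datasets.py through --dataset_ratios_path.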

Owner

BigScience Workshop
