A Lightweight Experiment & Resource Monitoring Tool 📺

"Did I already run this experiment before? How many resources are currently available on my cluster?" If these are common questions you encounter during your daily life as a researcher, then mle-monitor is made for you. It provides a lightweight API for tracking your experiments using a pickle protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). Furthermore, it comes with built-in resource monitoring on Slurm/Grid Engine clusters and local machines/servers.

mle-monitor provides three core functionalities:

  • MLEProtocol: A composable protocol database API for ML experiments.
  • MLEResource: A tool for obtaining server/cluster usage statistics.
  • MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

To get started I recommend checking out the colab notebook and an example workflow.

MLEProtocol: Keeping Track of Your Experiments 📝

from mle_monitor import MLEProtocol

# Load protocol database or create new one -> print summary
protocol_db = MLEProtocol("mle_protocol.db", verbose=False)
protocol_db.summary(tail=10, verbose=True)

# Draft data to store in protocol & add it to the protocol
meta_data = {
    "purpose": "Grid search",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "experiment_type": "hyperparameter-search",  # Type of experiment
    "experiment_dir": "experiments/logs",  # Experiment directory
    "num_total_jobs": 10,  # Number of total jobs to run
    # ... (further optional keys; see the list of keys below)
}
new_experiment_id = protocol_db.add(meta_data)

# ... train your 10 (pseudo) networks/complete respective jobs
for i in range(10):
    protocol_db.update_progress_bar(new_experiment_id)

# Wrap up an experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)

The meta data can contain the following keys:

Key                  Description                              Default
purpose              Purpose of experiment                    'None provided'
project_name         Project name of experiment               'default'
exec_resource        Resource jobs are run on                 'local'
experiment_dir       Experiment log storage directory         'experiments'
experiment_type      Type of experiment to run                'single'
base_fname           Main code script to execute              'main.py'
config_fname         Config file path of experiment           'base_config.yaml'
num_seeds            Number of evaluation seeds               1
num_total_jobs       Number of total jobs to run              1
num_job_batches      Number of sequential job batches         1
num_jobs_per_batch   Number of jobs in single batch           1
time_per_job         Expected duration: days-hours-minutes    '00:01:00'
num_cpus             Number of CPUs used in job               1
num_gpus             Number of GPUs used in job               0
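To see which values an experiment record ends up with, it can help to merge your own entries with the documented defaults before calling protocol_db.add(). The following is a minimal sketch: the defaults are copied from the list above, while DEFAULT_META_DATA and fill_meta_data are hypothetical names that are not part of mle-monitor.

```python
# Defaults copied from the key list above.
# DEFAULT_META_DATA and fill_meta_data are hypothetical helpers for
# illustration; they are not part of the mle-monitor API.
DEFAULT_META_DATA = {
    "purpose": "None provided",
    "project_name": "default",
    "exec_resource": "local",
    "experiment_dir": "experiments",
    "experiment_type": "single",
    "base_fname": "main.py",
    "config_fname": "base_config.yaml",
    "num_seeds": 1,
    "num_total_jobs": 1,
    "num_job_batches": 1,
    "num_jobs_per_batch": 1,
    "time_per_job": "00:01:00",
    "num_cpus": 1,
    "num_gpus": 0,
}


def fill_meta_data(user_meta_data):
    """Return a new dict: documented defaults overridden by user values."""
    return {**DEFAULT_META_DATA, **user_meta_data}


meta_data = fill_meta_data({"purpose": "Grid search", "num_total_jobs": 10})
print(meta_data["purpose"], meta_data["num_total_jobs"], meta_data["num_gpus"])
```

The merged dictionary can then be passed to protocol_db.add(meta_data) as in the first example.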

Additionally, you can synchronize the protocol with a Google Cloud Storage (GCS) bucket by providing cloud_settings. In this case, the results stored in experiment_dir are also uploaded to the GCS bucket when you call protocol_db.complete().

# Define GCS settings - requires 'GOOGLE_APPLICATION_CREDENTIALS' env var.
cloud_settings = {
    "project_name": "mle-toolbox",  # GCP project name
    "bucket_name": "mle-protocol",  # GCS bucket name
    "use_protocol_sync": True,  # Whether to sync the protocol to GCS
    "use_results_storage": True,  # Whether to sync experiment_dir to GCS
}
protocol_db = MLEProtocol("mle_protocol.db", cloud_settings, verbose=True)
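Because the GCS sync relies on the GOOGLE_APPLICATION_CREDENTIALS environment variable, one option is to build cloud_settings only when that variable is set. This is a sketch under that assumption; build_cloud_settings is a hypothetical helper, not an mle-monitor function, and it only checks that the variable is present, not that the credentials are valid.

```python
import os


def build_cloud_settings(project_name, bucket_name, environ=os.environ):
    """Return a cloud_settings dict, or None if no credentials are configured.

    Hypothetical helper for illustration; not part of mle-monitor.
    """
    if not environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return None
    return {
        "project_name": project_name,  # GCP project name
        "bucket_name": bucket_name,  # GCS bucket name
        "use_protocol_sync": True,  # Whether to sync the protocol to GCS
        "use_results_storage": True,  # Whether to sync experiment_dir to GCS
    }


cloud_settings = build_cloud_settings("mle-toolbox", "mle-protocol")
print("GCS sync configured:", cloud_settings is not None)
```

If the helper returns None, you can fall back to the local-only constructor call from the first example.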

The MLEResource: Keeping Track of Your Resources 📉

On Your Local Machine

from mle_monitor import MLEResource

# Instantiate local resource and get usage data
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()

On a Slurm Cluster

resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)

On a Grid Engine Cluster

resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)
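The three setups above differ only in resource_name and in the key used inside monitor_config. If the same script has to run on several systems, a small lookup can keep that choice in one place. This is a sketch: resource_kwargs and the short scheduler labels ("local", "slurm", "sge") are hypothetical, while the resource names and config keys are the ones shown above.

```python
def resource_kwargs(scheduler, names=()):
    """Build keyword arguments for MLEResource for a given scheduler label.

    Hypothetical helper for illustration; the labels "local", "slurm" and
    "sge" are made up here, the returned values mirror the examples above.
    """
    if scheduler == "local":
        return {"resource_name": "local"}
    if scheduler == "slurm":
        return {
            "resource_name": "slurm-cluster",
            "monitor_config": {"partitions": list(names)},
        }
    if scheduler == "sge":
        return {
            "resource_name": "sge-cluster",
            "monitor_config": {"queues": list(names)},
        }
    raise ValueError(f"Unknown scheduler label: {scheduler!r}")


kwargs = resource_kwargs("slurm", ["<partition-1>", "<partition-2>"])
print(kwargs)
```

With mle-monitor installed, the result can be unpacked as MLEResource(**kwargs).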

The MLEDashboard: Dashboard Visualization 🎞️

from mle_monitor import MLEDashboard

# Instantiate dashboard with the protocol (protocol_db) and resource created above
dashboard = MLEDashboard(protocol_db, resource)

# Get a static snapshot of the protocol & resource utilisation printed in console
dashboard.snapshot()

# Run monitoring in while loop - dashboard
dashboard.live()
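If a continuously refreshing view is more than you need, a bounded number of snapshots at a fixed interval is one alternative, and it also keeps the number of scheduler queries predictable. The sketch below is not how dashboard.live() works internally; snapshot_every is a hypothetical helper that accepts any zero-argument callable, and a stub stands in for the dashboard here so the block runs without a cluster.

```python
import time


def snapshot_every(snapshot_fn, interval_s, num_snapshots, sleep=time.sleep):
    """Call snapshot_fn num_snapshots times, pausing interval_s in between.

    Hypothetical helper for illustration; not part of mle-monitor.
    """
    for i in range(num_snapshots):
        snapshot_fn()
        # No pause after the final snapshot.
        if i < num_snapshots - 1:
            sleep(interval_s)


# Stub standing in for dashboard.snapshot
calls = []
snapshot_every(lambda: calls.append("snapshot"), interval_s=0, num_snapshots=3)
print(len(calls), "snapshots taken")
```

With a real dashboard this could be called as snapshot_every(dashboard.snapshot, interval_s=60, num_snapshots=5).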

Installation

A PyPI installation is available via:

pip install mle-monitor

Alternatively, you can clone this repository and afterwards 'manually' install it:

git clone https://github.com/mle-infrastructure/mle-monitor.git
cd mle-monitor
pip install -e .

Development & Milestones for Next Release

You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗 .

Comments
  • Is the dashboard polling squeue?

    Hey, Thanks for publishing the library, the dashboard looks great!

    However, I was a bit concerned to see you are using squeue since the official documentation says

    "Executing squeue sends a remote procedure call to slurmctld. If enough calls from squeue or other Slurm client commands that send remote procedure calls to the slurmctld daemon come in at once, it can result in a degradation of performance of the slurmctld daemon, possibly resulting in a denial of service.

    Do not run squeue or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather."

    Do you poll squeue or is there some other, smarter management of it that I missed?

    Thanks, Eliahu

    opened by eliahuhorwitz 0
Releases(v0.0.1)
  • v0.0.1(Dec 9, 2021)

    Basic API for MLEProtocol, MLEResource & MLEDashboard.