Simple Similarities Service

Last update: Dec 25, 2022

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retrieval scenarios by providing a convenient wrapper around encoding strategies as well as nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.
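
To make the split between an encoding strategy and a nearest neighbor approach concrete, here is a minimal standard-library sketch. It is not simsity's implementation: the `encode`, `euclidean`, and `nearest` helpers are hypothetical names, the encoder is a toy bag-of-words counter, and the lookup is brute force rather than approximate.

```python
import math
from collections import Counter


def encode(text):
    """Toy encoder (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def nearest(query, corpus, n_neighbors=2):
    """Brute-force lookup: the n closest texts together with their distances."""
    q = encode(query)
    scored = [(text, euclidean(q, encode(text))) for text in corpus]
    return sorted(scored, key=lambda pair: pair[1])[:n_neighbors]


corpus = [
    "give me directions to the airport",
    "what is my account balance",
    "directions to the nearest gas station",
]
results = nearest("give me directions", corpus, n_neighbors=2)
```

Swapping `encode` for a richer encoder or `nearest` for an approximate index changes the quality and speed of the lookup, but not the overall shape of the flow.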

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service_clinc.query("give me directions", n_neighbors=20)

# Save the entire system
service_clinc.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
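
For reference, the POST request described above could be built with the standard library roughly as follows. This is a sketch under assumptions: the helper names `build_query_request` and `send_query` are made up for illustration, the server is assumed to be reachable on `localhost`, and the response is assumed to be JSON.

```python
import json
import urllib.request


def build_query_request(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /query endpoint."""
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(text, n_neighbors=20, base_url="http://localhost:8080"):
    """Send the request; assumes the server answers with a JSON body."""
    request = build_query_request(text, n_neighbors, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; send_query() does.
request = build_query_request("hello there", n_neighbors=20)
```

Calling `send_query("hello there")` only works while the service from the quickstart is running.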

Comments

Add support for pretrained encoders and transformed data

First of all, this project looks great! I've taken an initial stab at #12 and also tried to add support for querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to run encoder.transform again. In this case you want to index the transformed data and query it directly.

This is just a first crack so happy to incorporate any feedback you might have!
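
The idea of indexing already-transformed data and querying it directly can be sketched without any encoder at all. The `VectorIndex` class and its `query_vector` method below are hypothetical names for illustration, not the API added in the pull request, and the lookup is brute force.

```python
import math


class VectorIndex:
    """Brute-force index over vectors that were computed elsewhere."""

    def __init__(self, vectors, items):
        if len(vectors) != len(items):
            raise ValueError("vectors and items must have the same length")
        self.vectors = [tuple(v) for v in vectors]
        self.items = list(items)

    def query_vector(self, vector, n_neighbors=1):
        """Query with an already-transformed vector; no encoder is involved."""
        scored = [
            (item, math.dist(vector, stored))
            for item, stored in zip(self.items, self.vectors)
        ]
        return sorted(scored, key=lambda pair: pair[1])[:n_neighbors]


# Pretend these 2-D points came from an earlier UMAP run (made-up values).
embedding = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
labels = ["doc-a", "doc-b", "doc-c"]

index = VectorIndex(embedding, labels)
hits = index.query_vector((0.9, 1.2), n_neighbors=2)
```

The query point has to live in the same space as the stored vectors, which is exactly why rerunning the encoder on them would be wrong.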

opened by gclen 10

embetter: better embeddings

This is conceptual work in progress. The maintainer is actively researching this; please do not work on it.

Problem Statement

When you submit "where is my phoone" and retrieve similar items, you may get results like:

where is my phone

where is my credit card

Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So, to put it more generally:

The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.
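
The brittleness to spelling errors can be illustrated with a small sketch. It assumes two toy tokenizers (whole words versus character trigrams) and Jaccard overlap as the similarity; none of this is what simsity or any particular encoder does internally, and it does not address the task-specific part of the problem.

```python
def word_tokens(text):
    """Whole-word tokens: a single typo yields a completely different token."""
    return set(text.lower().split())


def char_trigrams(text):
    """Overlapping character trigrams: a typo only changes a few of them."""
    padded = f" {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


# "phoone" and "phone" share nothing at the word level ...
word_score = jaccard(word_tokens("phoone"), word_tokens("phone"))
# ... but most of their character trigrams overlap.
trigram_score = jaccard(char_trigrams("phoone"), char_trigrams("phone"))
```

At the word level the misspelled token contributes no signal at all, so the match is carried entirely by "where is my", which "where is my credit card" shares just as well.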

Similar Issue

Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
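
One way to picture the deduplication concern is a per-field weighted match. The weights and records below are invented for illustration; the point of the issue is precisely that such weights cannot be known without labels.

```python
def weighted_match(a, b, weights):
    """Weighted share of fields on which two records agree exactly."""
    total = sum(weights.values())
    agreed = sum(w for field, w in weights.items() if a[field] == b[field])
    return agreed / total


# Hypothetical weights; in practice they would have to be learned from labels.
weights = {"zipcode": 2.0, "city": 0.5, "first_name": 3.0, "last_name": 3.0}

record = {"zipcode": "1011", "city": "Amsterdam", "first_name": "Ann", "last_name": "Smit"}
same_city = {"zipcode": "1099", "city": "Amsterdam", "first_name": "Bob", "last_name": "Jansen"}
same_name = {"zipcode": "3011", "city": "Rotterdam", "first_name": "Ann", "last_name": "Smit"}

city_score = weighted_match(record, same_city, weights)
name_score = weighted_match(record, same_name, weights)
```

A general-purpose encoding has no such weights, so it cannot know that a shared city should count for less than a shared name.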

opened by koaning 3

Add `Identity` as default encoder for Service.

As mentioned in https://github.com/koaning/simsity/pull/13:

I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

It's a fair remark. I think preventing a transform() call is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

Made this issue to track progress and to discuss the approach.
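
A pass-through transformer of this kind is small. The sketch below follows the scikit-learn fit/transform convention by duck typing and is only an illustration of the idea, not the source of simsity's own `Identity` class.

```python
class Identity:
    """Pass-through transformer with a scikit-learn style fit/transform interface."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Hand the data back unchanged.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Already-transformed data (made-up values) passes through untouched.
precomputed = [[0.1, 0.2], [0.3, 0.4]]
encoded = Identity().fit_transform(precomputed)
```

Because the object still exposes `fit` and `transform`, it can stand in wherever an encoder is expected, which is what makes it attractive as a default.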

opened by koaning 2

Codecalm tutorial on simsity

Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

opened by FrancyJGLisboa 2

Update indexer

Hi! Are there any plans to add support for updating the indexer, i.e. adding new documents without retraining the entire pipeline? It would be a very useful feature.

from simsity.service import Service

service = Service(
    indexer=indexer,
    encoder=encoder
)

service.train_from_dataf(df, features=["text"])

# ...

service.update(new_docs, features=["text"])  # <- this is the requested feature

opened by nthomsencph 1
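
What such an update could look like is sketched below with a brute-force index and a stateless encoder, so that adding documents never requires refitting. `UpdatableIndex` and `hashed_trigrams` are hypothetical names; whether a particular approximate nearest neighbor backend supports incremental inserts is a separate question that this sketch does not answer.

```python
import math
import zlib


def hashed_trigrams(text, dims=64):
    """Stateless encoder: hash character trigrams into a fixed-size count vector."""
    vector = [0.0] * dims
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dims
        vector[bucket] += 1.0
    return vector


class UpdatableIndex:
    """Brute-force index that accepts new documents after the initial build."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.docs = []
        self.vectors = []

    def update(self, new_docs):
        # Encode only the new documents; existing vectors stay untouched.
        for doc in new_docs:
            self.docs.append(doc)
            self.vectors.append(self.encoder(doc))

    def query(self, text, n_neighbors=1):
        q = self.encoder(text)
        scored = [(doc, math.dist(q, v)) for doc, v in zip(self.docs, self.vectors)]
        return sorted(scored, key=lambda pair: pair[1])[:n_neighbors]


index = UpdatableIndex(hashed_trigrams)
index.update(["where is my phone", "what is the weather today"])
index.update(["give me directions"])  # added later, nothing is retrained
best = index.query("give me directions")[0]
```

With a fitted encoder such as a count vectorizer, the same pattern would call only `transform` on the new documents, at the cost of ignoring words the encoder has never seen.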

New API

I think the original design was flawed and this project should stick to the scikit-learn API more.

from sklearn.pipeline import make_pipeline, make_union

from simsity.preprocessing import Grab
from simsity.service import Service
from simsity.indexer import (AnnoyIndexer, PyNNDescentIndexer, NMSlibIndexer,
                             PineconeIndexer, QdrantIndexer, WeaviateIndexer)


encoder = make_pipeline(
    make_union(
        make_pipeline(Grab("text"), SentenceEncoder()),
        make_pipeline(Grab("title"), SentenceEncoder())
    )
)

service = Service(encoder, indexer, batch_size=50)
service.index(X)
items, dists = service.query(X, n=10)

opened by koaning 0

Education Day Goals

[x] add typing + type checker
[x] add tests for the minhash tools
[ ] collect more useful datasets
[x] automate the benchmarking
[x] write getting started guides
[ ] record a quick demo for colleagues
[ ] add github actions stash

opened by koaning 0

added-components

Adding the MinHash components. This is also an amazing opportunity to:

[ ] add types and a type checker
[ ] add some standard tests for indexers
[ ] add a script to run some benchmarks on the clinc dataset

opened by koaning 0

Releases (0.1.1)

0.1.1 (Nov 4, 2021)

Thanks to @gclen you can now re-use scikit-learn pipelines without refitting them internally.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
