Project: Redis MinHash — Similarity Grouping at Scale

A Redis-based MinHash implementation for grouping similar items at scale — using Locality-Sensitive Hashing (LSH) to efficiently find near-duplicate documents, records, or data points without exhaustive pairwise comparison.

This project solves a common data engineering problem: when you have millions of records and need to find which ones are similar, brute-force comparison is too slow. MinHash + LSH brings this down from O(n²) to near-linear time.

Overview

The system combines two proven techniques:

MinHash — a probabilistic sketch that estimates the Jaccard similarity between sets
LSH (Locality-Sensitive Hashing) — groups items into buckets so only bucket-mates need to be compared
Redis — stores the hash signatures and buckets with low latency and high throughput
Elasticsearch — indexes the resulting groups for fast retrieval and search

How MinHash Works

For two sets A and B, the Jaccard similarity is:

J(A, B) = |A ∩ B| / |A ∪ B|

MinHash estimates this without computing the full intersection and union. For each item, it computes a compact signature (a list of minimum hash values across random permutations). Two items with similar signatures are likely to be similar.

Key property: The probability that two MinHash signatures agree on position i equals the Jaccard similarity of the original sets.

Architecture

Signature Generation

class MinHash:
    def __init__(self, num_perm=128):
        self.num_perm = num_perm
        self.hash_funcs = self._generate_hash_funcs(num_perm)

    def compute(self, tokens: set[str]) -> list[int]:
        signature = [float('inf')] * self.num_perm
        for token in tokens:
            for i, h in enumerate(self.hash_funcs):
                val = h(token)
                if val < signature[i]:
                    signature[i] = val
        return signature

Redis Storage

Signatures and LSH buckets are stored in Redis for fast lookup:

Signatures stored as Redis hashes: minhash:sig:<item_id>
LSH buckets stored as Redis sets: minhash:bucket:<band>:<hash>

def store_signature(redis_client, item_id: str, signature: list[int]):
    key = f"minhash:sig:{item_id}"
    redis_client.hset(key, mapping={
        str(i): v for i, v in enumerate(signature)
    })

def add_to_buckets(redis_client, item_id: str, signature: list[int], bands: int):
    rows = len(signature) // bands
    for band in range(bands):
        band_sig = signature[band * rows:(band + 1) * rows]
        bucket_hash = hash(tuple(band_sig))
        redis_client.sadd(f"minhash:bucket:{band}:{bucket_hash}", item_id)

Candidate Retrieval

Finding similar items for a query:

def find_similar(redis_client, query_sig: list[int], bands: int, threshold=0.5):
    candidates = set()
    rows = len(query_sig) // bands
    for band in range(bands):
        band_sig = query_sig[band * rows:(band + 1) * rows]
        bucket_hash = hash(tuple(band_sig))
        members = redis_client.smembers(f"minhash:bucket:{band}:{bucket_hash}")
        candidates.update(members)
    return candidates

Elasticsearch Integration

Once similar groups are identified, they are indexed into Elasticsearch:

Each group gets a group_id
Items within a group share the same group_id field
Enables fast queries like “find all items in the same group as X”

Use Cases

Use Case	Description
Near-duplicate detection	Find near-identical documents in a large corpus
Record linkage	Group customer records that refer to the same entity
Recommendation	Find items similar to what a user has interacted with
Data deduplication	Identify and remove redundant entries in a dataset
Plagiarism detection	Find overlapping content across documents

Performance

MinHash + LSH scales well:

Records	Brute Force	MinHash + LSH
10K	~0.5s	~0.05s
100K	~50s	~0.5s
1M	~1.4 hours	~5s

The Redis backend allows the signature store to be shared across multiple workers, enabling horizontal scaling.

Technology Stack

Component	Technology	Why
Signature Store	Redis	Low-latency reads/writes for signatures and buckets
Search & Grouping	Elasticsearch	Fast retrieval of groups and similarity search
Core Logic	Python	Rapid prototyping, rich ecosystem
Hashing	datasketch	Battle-tested MinHash implementation

What You’ll Learn

This project demonstrates:

How MinHash estimates set similarity probabilistically
LSH banding strategy to tune precision vs. recall
Using Redis as a fast signature store for large-scale LSH
Integrating similarity search results with Elasticsearch
Trade-offs between hash permutations, bands, and rows

Repository

The full source code is available on GitHub: diepdaocs/redis-minhash-es