Content Similarity Checker
Compare documents and text for similarity using multiple algorithms.
Features Cosine Similarity: TF-IDF based comparison Jaccard Similarity: Set-based comparison Levenshtein Distance: Edit distance for short texts Batch Comparison: Compare multiple documents Similarity Matrix: Pairwise comparison of all documents Reports: Detailed similarity reports Quick Start from similarity_checker import SimilarityChecker
checker = SimilarityChecker()
Compare two texts
score = checker.compare( "The quick brown fox jumps over the lazy dog", "A fast brown fox leaps over a sleepy dog" ) print(f"Similarity: {score:.2%}")
Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
CLI Usage
Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"
Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt
Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv
Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard
Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7
JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
API Reference SimilarityChecker Class class SimilarityChecker: def init(self, method: str = "cosine")
# Text comparison
def compare(self, text1: str, text2: str) -> float
def compare_files(self, file1: str, file2: str) -> float
# Multiple algorithms
def compare_all_methods(self, text1: str, text2: str) -> dict
# Batch comparison
def compare_to_corpus(self, text: str, corpus: list) -> list
def similarity_matrix(self, documents: list) -> pd.DataFrame
def find_duplicates(self, documents: list, threshold: float = 0.8) -> list
# Folder operations
def compare_folder(self, folder: str, threshold: float = None) -> dict
def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list
# Report
def generate_report(self, output: str) -> str
Similarity Methods Cosine Similarity (Default)
Best for comparing documents of different lengths:
checker = SimilarityChecker(method="cosine") score = checker.compare(text1, text2)
Returns: 0.0 to 1.0
Jaccard Similarity
Good for comparing sets of words/tokens:
checker = SimilarityChecker(method="jaccard") score = checker.compare(text1, text2)
Returns: 0.0 to 1.0
Levenshtein (Edit Distance)
Best for short texts, typo detection:
checker = SimilarityChecker(method="levenshtein") score = checker.compare(text1, text2)
Returns: 0.0 to 1.0 (normalized)
TF-IDF + Cosine
Advanced: considers term importance:
checker = SimilarityChecker(method="tfidf") score = checker.compare(text1, text2)
Batch Comparison Compare to Corpus checker = SimilarityChecker()
target = "Machine learning is a subset of artificial intelligence." corpus = [ "AI includes machine learning and deep learning.", "Python is a programming language.", "Neural networks power deep learning systems." ]
results = checker.compare_to_corpus(target, corpus)
Returns:
[ {"index": 0, "similarity": 0.65, "text": "AI includes..."}, {"index": 2, "similarity": 0.42, "text": "Neural networks..."}, {"index": 1, "similarity": 0.12, "text": "Python is..."} ]
Similarity Matrix documents = [ "Document one content...", "Document two content...", "Document three content..." ]
matrix = checker.similarity_matrix(documents)
Returns DataFrame:
doc_0 doc_1 doc_2
doc_0 1.000 0.750 0.320
doc_1 0.750 1.000 0.410
doc_2 0.320 0.410 1.000
Find Duplicates documents = [...] # List of texts
duplicates = checker.find_duplicates(documents, threshold=0.85)
Returns:
[ {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92}, {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88} ]
Compare All Methods
Get similarity scores from all algorithms:
checker = SimilarityChecker() results = checker.compare_all_methods(text1, text2)
Returns:
{ "cosine": 0.82, "jaccard": 0.65, "levenshtein": 0.71, "tfidf": 0.78, "average": 0.74 }
Folder Operations Compare All Files in Folder checker = SimilarityChecker() results = checker.compare_folder("./documents/")
Returns:
{
"files": ["doc1.txt", "doc2.txt", "doc3.txt"],
"comparisons": 3,
"similar_pairs": [
{"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
],
"matrix":
Find Most Similar to Query query = "Your search text here..." results = checker.find_most_similar(query, "./documents/", top_n=5)
Returns:
[ {"file": "doc3.txt", "similarity": 0.89}, {"file": "doc1.txt", "similarity": 0.72}, ... ]
Output Format Comparison Result result = checker.compare_with_details(text1, text2)
Returns:
{ "similarity": 0.82, "method": "cosine", "text1_length": 150, "text2_length": 180, "common_words": 25, "unique_words_text1": 10, "unique_words_text2": 15, "interpretation": "High similarity - likely related content" }
Example Workflows Plagiarism Check checker = SimilarityChecker()
submission = open("student_paper.txt").read() results = checker.compare_folder("./source_materials/")
suspicious = [p for p in results["similar_pairs"] if p["similarity"] > 0.6]
if suspicious: print(f"Warning: Found {len(suspicious)} potentially similar sources") for p in suspicious: print(f" {p['file1']} matches {p['file2']}: {p['similarity']:.0%}")
Document Deduplication checker = SimilarityChecker()
Load all documents
docs = {} for file in Path("./articles/").glob("*.txt"): docs[file.name] = file.read_text()
Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
Content Matching checker = SimilarityChecker()
query = "Best practices for Python web development" results = checker.find_most_similar(query, "./blog_posts/", top_n=10)
print("Most relevant articles:") for r in results: print(f" {r['file']}: {r['similarity']:.0%} match")
Dependencies scikit-learn>=1.3.0 nltk>=3.8.0 numpy>=1.24.0 pandas>=2.0.0