·

How to find similar strings with python?

Published at 2024-06-28 00:09:18Viewed 281 times
Professional article
Please reprint with source link

Suppose that we have a string str together with a list of strings arr = [str1, str2, ... ]. How to find similar string of str in arr? To do this, we can calculate the similarity of str and elements in arr one by one. But how to calculate the degree of similarity between str and an element in arr? First, we need a measurement of the similarity of two strings. In fact, there are many methods that measure the similarity of strings. In this tutorial, we will use python to find similar strings of str.

In python, there are several packages that concern similarity of strings.

1. Difflib

The builtin module difflib is for comparing sequences. You can use the SequenceMatcher.ratio method to calculate the similarity. It returns a float in range [0, 1].

>>> from difflib import SequenceMatcher
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> SequenceMatcher(None, a, b).ratio()
0.3853211009174312
>>> SequenceMatcher(None, "abcd", "abcd").ratio()
1.0

The more similar two strings are, the closer to 1 the ratio is.

You can also use the get_close_matches method.

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

>>> get_close_matches('manitori', ['manifold', 'torus', 'tori', 'differential'])
['tori', 'manifold']

However, this method can't be used for long strings comparsion. To compare long strings, you may need to split the strings.

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> c='Mathjax loads slow. How to load local JS file with Nuxt?'
>>> d='How to render math in Vue or Nuxt?'
>>> e='Nginx is installed but command not found'
>>> f='Django ImageField max_length error when uploading image'
>>> get_close_matches(a, [b,c,d,e,f])
[]

The difflib module is pure python, not a python c library. Thus, it is not the fastest way.

2. Python-levenshtein

Python-levenshtein package is a Python C extension module. It is used for fast computation of Levenshtein distance. Levenshtein distance is a metric measuring the similarity of two strings. The bigger Levenshtein distance is, the more different two strings are.

To install this package, use pip:

pip install python-levenshtein

The following are the official examples of python-levenshtein:

>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0

3. Jellyfish

Jellyfish is a Python library that supports multiple methods for strings comparsion, including Levenshtein distance. Moreover, it supports phonetic encoding. To install jellyfish, use pip:

pip install jellyfish

Using jellyfish to compute Levenshtein distance is much faster than pure Python implementations. Here is an example:

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> import jellyfish
>>> jellyfish.levenshtein_distance(a,b)
35

Note that the Levenshtein distance is sensitive to case, so remember to use lower() if you care about the case:

>>> jellyfish.levenshtein_distance('Nuxt','nuxt')
1
>>> jellyfish.levenshtein_distance('nuxt','nuxt')
0

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36

Finally, it is difficult to see the difference between two strings by these numbers. Therefore, it is good to normalize the Levenshtein distance between 0 to 1:

def normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    return levenshtein / max(s1.length(), s2.length());
}

Here are some examples:

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
>>> 36/max(len(a),len(b))
0.5294117647058824

>>> b='Django Unique Together (with foreign keys)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
52
>>> 52/max(len(a),len(b))
0.7647058823529411

>>> b='Django unique_together with a specifc filter (like: somefield=somevalue)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
59
>>> 59/max(len(a),len(b))
0.8194444444444444

We can see that the closer to 1 the result is, the more different two strings are.

0 人喜欢

Comments

There is no comment, let's add the first one.

弦圈热门内容

Vertically aligning CSS :before and :after content

I am trying to centre the link with the image, but can't seem to move the content vertically in any way.<h4>More Information</h4> <a href="#" class="pdf">File Name</a>The icon is 22 x 22px.pdf { font-size: 12px; } .pdf:before { padding:0 5px 0 0; content: url(../img/icon/pdf_small.png); } .pdf:after { content: " ( .pdf )"; font-size: 10px; } .pdf:hover:after { color: #000; }

Get connected with us on social networks! Twitter

©2024 Guangzhou Sinephony Technology Co., Ltd All Rights Reserved