edit-distance’s documentation

It is suggested to use either the function edit_distance() or the SequenceMatcher class.

Functions

edit_distance.edit_distance(seq1, seq2, action_function=<function lowest_cost_action>, test=<built-in function eq>)

Computes the edit distance between the two given sequences. This uses the relatively fast method that only constructs two columns of the 2d array for edits. This function actually uses four columns because we track the number of matches too.

edit_distance.edit_distance_backpointer(seq1, seq2, action_function=<function lowest_cost_action>, test=<built-in function eq>)

Similar to edit_distance() except that this function keeps backpointers during the search. This allows us to return the opcodes (i.e. the specific edits that were used to change from one string to another). This function contructs the full 2d array (actually it contructs three of them: one for distances, one for matches, and one for backpointers).

SequenceMatcher class

class edit_distance.SequenceMatcher(a=None, b=None, test=<built-in function eq>, action_function=<function lowest_cost_action>)

Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.

__init__(a=None, b=None, test=<built-in function eq>, action_function=<function lowest_cost_action>)

Initialize the object with sequences a and b. Optionally, one can specify a test function that is used to compare sequence elements. This defaults to the built in eq operator (i.e. operator.eq()).

__weakref__

list of weak references to the object (if defined)

distance()

Returns the edit distance of the two loaded sequences. This should be a little faster than getting the same information from get_opcodes().

find_longest_match(alo, ahi, blo, bhi)

Not implemented!

get_grouped_opcodes(n=None)

Not implemented!

get_matching_blocks()

Similar to get_opcodes(), but returns only the opcodes that are equal and returns them in a somewhat different format (i.e. (i, j, n) ).

get_opcodes()

Returns a list of opcodes. Opcodes are the same as defined by difflib.

matches()

Returns the number of matches in the alignment of the two sequences. This should be a little faster than getting the same information from get_opcodes().

quick_ratio()

Same as ratio().

ratio()

Ratio of matches to the average sequence length.

real_quick_ratio()

Same as ratio().

set_seq1(a)

Specify a new sequence for sequence 1, resetting cached values.

set_seq2(b)

Specify a new sequence for sequence 2, resetting cached values.

set_seqs(a, b)

Specify two alternative sequences – reset any cached values.

Match functions

These functions can be used to toggle whether we’re minimizing edits or maximizing matches.

edit_distance.lowest_cost_action(ic, dc, sc, im, dm, sm, cost)

Given the following values, choose the action (insertion, deletion, or substitution), that results in the lowest cost (ties are broken using the ‘match’ score). This is used within the dynamic programming algorithm.

  • ic - insertion cost
  • dc - deletion cost
  • sc - substitution cost
  • im - insertion match (score)
  • dm - deletion match (score)
  • sm - substitution match (score)
edit_distance.highest_match_action(ic, dc, sc, im, dm, sm, cost)

Given the following values, choose the action (insertion, deletion, or substitution), that results in the highest match score (ties are broken using the distance values). This is used within the dynamic programming algorithm.

  • ic - insertion cost
  • dc - deletion cost
  • sc - substitution cost
  • im - insertion match (score)
  • dm - deletion match (score)
  • sm - substitution match (score)

Indices and tables