Module diff

Produce special diffs between strings and other interesting objects.


Author: Chusslove Illich (Часлав Илић) <caslav.ilic@gmx.net>

License: GPLv3

Functions
[(string, element)...] or ([(string, element)...], float)
tdiff(seq_old, seq_new, reductf=None, diffr=False)
Create tagged difference of two sequences.
[(string, element)...] or ([(string, element)...], float)
itdiff(seq_old, seq_new, reductf=None, cutoff=0.6, diffr=False)
Create interleaved tagged difference of two sequences.
[(string, string)...] or ([(string, string)...], float)
word_diff(text_old, text_new, markup=False, format=None, diffr=False)
Create word-level difference between old and new text.
string/ColorString/None or (string/ColorString/None, float)
word_ediff(text_old, text_new, markup=False, format=None, colorize=False, diffr=False)
Create word-level embedded difference between old and new texts.
string or None
word_ediff_to_old(dtext)
Recover old version (-) from text with embedded differences.
string or None
word_ediff_to_new(dtext)
Recover new version (+) from text with embedded differences.
string or list or None
word_ediff_to_rem(dtext, sep=' ')
Recover removed segments (-) from text with embedded differences.
string or list or None
word_ediff_to_add(dtext, sep=' ')
Recover added segments (+) from text with embedded differences.
[[(string, string)...]...] or ([([(string, string)...], float)...], float)
line_diff(lines_old, lines_new, markup=False, format=None, diffr=False)
Create word-level difference between old and new lines of text.
[string...] or ([(string, float)...], float)
line_ediff(lines_old, lines_new, markup=False, format=None, colorize=False, diffr=False)
Create word-level embedded difference between old and new lines of text.
list of strings
line_ediff_to_old(dlines)
Recover old version (-) from lines of text with embedded differences.
list of strings
line_ediff_to_new(dlines)
Recover new version (+) from lines of text with embedded differences.
list of index tuples
adapt_spans(otext, ftext, spans, merge=True)
Adapt matched spans in filtered text to original text.
[(string, int/None, [(string, string)...])...] or ([(string, int/None, [(string, string)...], float)...], float)
msg_diff(msg1, msg2, pfilter=None, addrem=None, diffr=False)
Create word-level difference between extraction-invariant parts of messages.
type(emsg or msg2 or msg1 or None) or (type(~), float)
msg_ediff(msg1, msg2, pfilter=None, addrem=None, emsg=None, ecat=None, eokpos=None, enoctxt=None, emptydc=False, colorize=False, diffr=False)
Create word-level embedded difference between extraction-invariant parts of messages.
type of first non-None of rmsg, emsg, or None
msg_ediff_to_new(emsg, rmsg=None)
Resolve message with embedded difference to the newer message.
type of first non-None of rmsg, emsg, or None
msg_ediff_to_old(emsg, rmsg=None)
Resolve message with embedded difference to the older message.
float
editprob(oldtext, newtext)
Compute the probability that a human would rather edit the old text to obtain the new text than write it from scratch.
None or float or (float, int, int)
descprob(descpath, ancpath, cutoff=None, getcsz=False)
Compute the probability that one PO file is a descendant of another.
Variables
  __package__ = 'pology'
  x = 'msgid_plural'
Function Details

tdiff(seq_old, seq_new, reductf=None, diffr=False)

 

Create tagged difference of two sequences.

Difference is presented as a list of tuples, with each tuple composed of a difference tag and a sequence element. Difference tag is string "+", "-", or " ", for elements which belong to the new, the old, or to both sequences, respectively.

The list is ordered such that collecting all elements not tagged as old will reconstruct the new sequence, and collecting all not tagged as new will reconstruct the old sequence.

If requested by the diffr parameter, also reported is the difference ratio, a heuristic measure of difference between the two sequences. 0.0 means no difference, and 1.0 that the sequences are completely different.

Examples:

   >>> s1 = "A type of foo".split()
   >>> s2 = "A kind of foo".split()
   >>> tdiff(s1, s2)
   [(' ', 'A'), ('-', 'type'), ('+', 'kind'), (' ', 'of'), (' ', 'foo')]
   >>> tdiff(s1, s2, diffr=True)
   ([(' ', 'A'), ('-', 'type'), ('+', 'kind'), (' ', 'of'), (' ', 'foo')],
   0.25)

To be able to diff them, sequence elements only need to be hashable. However, for compound elements it may be better to diff them only by some subset of data, e.g. by one of their string attributes. Parameter reductf can be used to specify a reduction function, which will be called on each element to produce its diffing representative.

Parameters:
  • seq_old (sequence with hashable elements) - sequence to diff from
  • seq_new (sequence with hashable elements) - sequence to diff to
  • reductf ((sequence element) -> diffing representative) - function to produce diffing representatives
  • diffr (bool) - whether to report difference ratio
Returns: [(string, element)...] or ([(string, element)...], float)
difference list and possibly difference ratio
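
The tagging and reconstruction rules described above can be illustrated with a minimal sketch built on the standard difflib module. This is an assumption made for illustration, not necessarily how tdiff is implemented; the names tagged_diff and reconstruct are hypothetical, and the difference ratio is left out.

```python
from difflib import SequenceMatcher


def tagged_diff(seq_old, seq_new, reductf=None):
    """Hypothetical stand-in for tdiff, built on difflib (an assumption)."""
    # Diff by representatives if a reduction function is given.
    key = reductf if reductf is not None else (lambda el: el)
    rep_old = [key(el) for el in seq_old]
    rep_new = [key(el) for el in seq_new]
    matcher = SequenceMatcher(None, rep_old, rep_new, autojunk=False)
    result = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Elements with equal representatives are taken from the new sequence.
            result.extend((" ", el) for el in seq_new[j1:j2])
        else:
            # "replace", "delete" and "insert" all reduce to removals, then additions.
            result.extend(("-", el) for el in seq_old[i1:i2])
            result.extend(("+", el) for el in seq_new[j1:j2])
    return result


def reconstruct(diff, drop_tag):
    """Collect all elements except those tagged with drop_tag."""
    return [el for tag, el in diff if tag != drop_tag]
```

Dropping "-" from the result yields the new sequence and dropping "+" yields the old one. When reductf is used, this sketch keeps the new sequence's element for each unchanged position, so the old sequence is then recovered only up to its representatives.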

itdiff(seq_old, seq_new, reductf=None, cutoff=0.6, diffr=False)

 

Create interleaved tagged difference of two sequences.

Similar to tdiff, except that blocks of added/removed elements are further heuristically interleaved by similarity, such that each removed element may be followed by a similar added element, if such has been determined. This is useful e.g. to be able to afterwards make inner difference of each two paired similar elements (e.g. word diff within line diff).

Example:

   >>> s1 = "Two blue airplanes".split()
   >>> s2 = "Two bluish ships".split()
   >>> tdiff(s1, s2)
   [(' ', 'Two'), ('-', 'blue'), ('-', 'airplanes'), ('+', 'bluish'),
    ('+', 'ships')]
   >>> itdiff(s1, s2)
   [(' ', 'Two'), ('-', 'blue'), ('+', 'bluish'), ('-', 'airplanes'),
    ('+', 'ships')]

To be able to interleave blocks, each element must in turn be a sequence in its own right. This means that the function supplied by reductf, otherwise of the same semantics as in tdiff, must here also produce a sequence as the diffing representative (e.g. a string).

Parameter cutoff states the minimal similarity between two elements needed for them to be considered similar at all.

Parameters:
  • seq_old (sequence with hashable elements) - sequence to diff from
  • seq_new (sequence with hashable elements) - sequence to diff to
  • reductf ((sequence element) -> representative sequence) - function to produce diffing representatives
  • cutoff (float [0, 1]) - minimal similarity to consider elements similar
  • diffr (bool) - whether to report difference ratio
Returns: [(string, element)...] or ([(string, element)...], float)
interleaved difference list and possibly difference ratio
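
The interleaving step can be sketched as a post-processing pass over a plain tagged difference. The greedy first-match pairing below and the use of difflib.SequenceMatcher.ratio() as the similarity measure are assumptions for illustration; the heuristic actually used by itdiff is not specified here. The names pair_block and interleave_tagged are hypothetical.

```python
from difflib import SequenceMatcher


def pair_block(removed, added, cutoff=0.6):
    """Interleave one block of removed and added elements by similarity."""
    out = []
    j = 0  # next unconsumed position in added
    for rem in removed:
        # Find the first remaining added element similar enough to rem.
        best = None
        for k in range(j, len(added)):
            if SequenceMatcher(None, rem, added[k]).ratio() >= cutoff:
                best = k
                break
        if best is not None:
            # Emit skipped added elements first, so the new order is preserved.
            out.extend(("+", el) for el in added[j:best])
            out.append(("-", rem))
            out.append(("+", added[best]))
            j = best + 1
        else:
            out.append(("-", rem))
    out.extend(("+", el) for el in added[j:])
    return out


def interleave_tagged(diff, cutoff=0.6):
    """Interleave every run of '-'/'+' elements in a tagged difference."""
    out, removed, added = [], [], []
    for tag, el in diff:
        if tag == "-":
            removed.append(el)
        elif tag == "+":
            added.append(el)
        else:
            out.extend(pair_block(removed, added, cutoff))
            removed, added = [], []
            out.append((tag, el))
    out.extend(pair_block(removed, added, cutoff))
    return out
```

Because pairing only ever moves forward through the added elements, both the removed and the added elements keep their original relative order, so the reconstruction property of the tagged difference still holds.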

word_diff(text_old, text_new, markup=False, format=None, diffr=False)

 

Create word-level difference between old and new text.

The difference is computed by looking at texts as collections of words and intersegments. Difference is presented as a list of tuples, with each tuple composed of a difference tag and a text segment. Difference tag is string "+", "-", or " ", for text segments which are new, old, or present in both texts, respectively. If one of the texts is None, as opposed to empty string, a tilde is appended to the base difference tag.

The list is ordered such that joining all text segments not marked as old will reconstruct the new text, and joining all not marked as new will reconstruct the old text.

If requested by the diffr parameter, also reported is the difference ratio, a heuristic measure of difference between two texts. 0.0 means no difference, and 1.0 that the texts are completely different.

Differencing may take into account when the texts are expected to have XML-like markup, or when they are of certain format defined by Gettext.

Examples:

   >>> s1 = "A new type of foo."
   >>> s2 = "A new kind of foo."
   >>> word_diff(s1, s2)
   [(' ', 'A new '), ('+', 'kind'), ('-', 'type'), (' ', ' of foo.')]
   >>> word_diff(s1, s2, diffr=True)
   ([(' ', 'A new '), ('+', 'kind'), ('-', 'type'), (' ', ' of foo.')],
   0.36363636363636365)
   >>> word_diff(s1, None, diffr=True)
   ([('-~', 'A new type of foo.')], 1.0)
   >>> word_diff(None, s2, diffr=True)
   ([('+~', 'A new kind of foo.')], 1.0)
Parameters:
  • text_old (string or None) - the old text
  • text_new (string or None) - the new text
  • markup (bool) - whether <...> markup can be expected in the texts
  • format (string) - Gettext format flag (e.g. "c-format", etc.)
  • diffr (bool) - whether to report difference ratio
Returns: [(string, string)...] or ([(string, string)...], float)
difference list and possibly difference ratio

word_ediff(text_old, text_new, markup=False, format=None, colorize=False, diffr=False)

 

Create word-level embedded difference between old and new texts.

Same as word_diff, but the difference is returned as text in which the new segments are wrapped as {+...+}, and the old segments as {-...-}. If a difference wrapper is already contained in the text, it will be escaped by inserting a tilde, e.g. "{+...+}" -> "{~+...+~}". If even an escaped wrapper is contained in the text, another tilde is inserted, and so on.

If one of the texts is None, then the whole other text is wrapped as the suitable difference, and a tilde is added to its end to indicate that the other text was None. If neither of the texts is None, but after differencing a tilde appears at the end of the embedded difference, it is escaped by another tilde. If both texts are None, None is returned as the difference.

The colorize parameter can be used to additionally highlight embedded difference by using color markup provided by ColorString. If colorizing is enabled, the return value is a ColorString.

See word_diff for description of other parameters.

Parameters:
  • colorize (bool) - whether to colorize differences
Returns: string/ColorString/None or (string/ColorString/None, float)
string with embedded differences and possibly difference ratio

See Also: word_diff
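
The tilde escaping of wrappers already present in the text can be sketched with two regular expression substitutions. This is only an illustration of the rule stated above ("{+...+}" -> "{~+...+~}", with one more tilde per level); the function names are hypothetical and the real implementation may differ, in particular in how it treats unbalanced wrapper fragments.

```python
import re


def escape_wrappers(text):
    """Add one tilde to every opening and closing difference wrapper."""
    text = re.sub(r"\{(~*[+-])", r"{~\1", text)   # "{+" -> "{~+", "{~+" -> "{~~+"
    text = re.sub(r"([+-]~*)\}", r"\1~}", text)   # "+}" -> "+~}", "+~}" -> "+~~}"
    return text


def unescape_wrappers(text):
    """Remove one tilde from every escaped wrapper; leave bare wrappers alone."""
    text = re.sub(r"\{~(~*[+-])", r"{\1", text)
    text = re.sub(r"([+-]~*)~\}", r"\1}", text)
    return text
```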

word_ediff_to_old(dtext)

 

Recover old version (-) from text with embedded differences.

In case there was no old text, None is returned.

Parameters:
  • dtext (string) - text with embedded differences
Returns: string or None
old version of the text

See Also: word_ediff

word_ediff_to_new(dtext)

 

Recover new version (+) from text with embedded differences.

In case there was no new text, None is returned.

Parameters:
  • dtext (string) - text with embedded differences
Returns: string or None
new version of the text

See Also: word_ediff

word_ediff_to_rem(dtext, sep=' ')

 

Recover removed segments (-) from text with embedded differences.

If separator is not None, the joined string of selected segments is returned. Otherwise, the list of selected segments is returned. In either case, if there was no old text, None is returned.

Parameters:
  • dtext (string) - text with embedded differences
  • sep (string or None) - separator with which to join selected segments
Returns: string or list or None
text with only the removed segments

See Also: word_ediff

word_ediff_to_add(dtext, sep=' ')

 

Recover added segments (+) from text with embedded differences.

If separator is not None, the joined string of selected segments is returned. Otherwise, the list of selected segments is returned. In either case, if there was no new text, None is returned.

Parameters:
  • dtext (string) - text with embedded differences
  • sep (string or None) - separator with which to join selected segments
Returns: string or list or None
text with only the added segments

See Also: word_ediff
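
The embedding format and the four recovery functions above can be sketched as follows. This is a simplified illustration under stated assumptions: it ignores tilde escaping and the None cases, and all names (embed, to_old, to_new, removed_segments, added_segments) are hypothetical rather than part of the module.

```python
import re

_ADDED = re.compile(r"\{\+(.*?)\+\}", re.S)
_REMOVED = re.compile(r"\{-(.*?)-\}", re.S)


def embed(diff):
    """Turn a word_diff-style list into text with {+...+} and {-...-} wrappers."""
    wrappers = {"+": "{+%s+}", "-": "{-%s-}", " ": "%s"}
    return "".join(wrappers[tag] % seg for tag, seg in diff)


def to_old(dtext):
    """Drop added segments and unwrap removed ones."""
    return _REMOVED.sub(r"\1", _ADDED.sub("", dtext))


def to_new(dtext):
    """Drop removed segments and unwrap added ones."""
    return _ADDED.sub(r"\1", _REMOVED.sub("", dtext))


def removed_segments(dtext, sep=" "):
    """Return removed segments joined by sep, or as a list if sep is None."""
    segs = _REMOVED.findall(dtext)
    return segs if sep is None else sep.join(segs)


def added_segments(dtext, sep=" "):
    """Return added segments joined by sep, or as a list if sep is None."""
    segs = _ADDED.findall(dtext)
    return segs if sep is None else sep.join(segs)
```

Applied to the difference list from the word_diff example, embed gives "A new {+kind+}{-type-} of foo.", from which to_old and to_new recover the two original sentences.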

line_diff(lines_old, lines_new, markup=False, format=None, diffr=False)

 

Create word-level difference between old and new lines of text.

First computes a line-level difference, and then, for each set of differing lines, a word-level difference using word_diff. Difference is presented as a list of tuples of word diffs and ratios as constructed by word_diff. See word_diff for description of keyword parameters. The difference ratio is computed as the line-length weighted average of word difference ratios per line.

Parameters:
  • lines_old (string) - old lines of text
  • lines_new (string) - new lines of text
Returns: [[(string, string)...]...] or ([([(string, string)...], float)...], float)
difference list and possibly difference ratios
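
The line-length weighted average mentioned above amounts to a short computation. The sketch below assumes the weight of each line is a character count; which length is actually used (old line, new line, or a combination) is not stated here, and the name weighted_ratio is hypothetical.

```python
def weighted_ratio(line_ratios):
    """Average per-line ratios, weighting each by its line length.

    line_ratios is a list of (line_length, ratio) pairs.
    """
    total = sum(length for length, _ in line_ratios)
    if total == 0:
        return 0.0  # no text at all: treat as no difference (an assumption)
    return sum(length * ratio for length, ratio in line_ratios) / total
```

For example, a 10-character line with ratio 0.0 and a 30-character line with ratio 0.5 give (10 * 0.0 + 30 * 0.5) / 40 = 0.375.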

line_ediff(lines_old, lines_new, markup=False, format=None, colorize=False, diffr=False)

 

Create word-level embedded difference between old and new lines of text.

Same as line_diff, but the difference is returned as list of tuples of line of text (in which the new segments are wrapped as {+...+}, and the old segments as {-...-}) and difference ratio for the line. See word_diff and word_ediff for description of keyword parameters.

Returns: [string...] or ([(string, float)...], float)
lines with embedded differences and possibly difference ratios

See Also: line_diff

line_ediff_to_old(dlines)

 

Recover old version (-) from lines of text with embedded differences.

Parameters:
  • dlines (list of strings) - lines of text with embedded differences
Returns: list of strings
old version of the lines

See Also: line_ediff

line_ediff_to_new(dlines)

 

Recover new version (+) from lines of text with embedded differences.

Parameters:
  • dlines (list of strings) - lines of text with embedded differences
Returns: list of strings
new version of the lines

See Also: line_ediff

adapt_spans(otext, ftext, spans, merge=True)

 

Adapt matched spans in filtered text to original text.

Sometimes text gets filtered before being matched, and when a match is found in the filtered text, it needs to be reported relative to the original text. This function will heuristically adapt matched spans relative to the filtered text back to the original text.

Spans are given as list of index tuples [(start1, end1), ...] where start and end index have standard Python semantics (may be negative too). If merge is True, any spans that overlap or abut after adaptation will be merged into a single span, ordered by increasing start index, and empty spans removed; otherwise each adapted span will strictly correspond to the input span at that position.

Span tuples may have more elements past the start and end indices. They will be ignored, but preserved; if merging is in effect, extra elements will be preserved only for the frontmost of the overlapping spans (which one is undefined if several qualify).

If an input span is invalid in any way, it is carried over verbatim into result.

Parameters:
  • otext (string) - original text
  • ftext (string) - filtered text
  • spans (list of index tuples) - matched spans
  • merge (bool) - whether to merge overlapping spans
Returns: list of index tuples
adapted spans
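
The merging behavior described for merge=True can be sketched in isolation. This illustration covers only the merge step, not the adaptation heuristic itself, and assumes the spans already use non-negative indices; the name merge_spans is hypothetical.

```python
def merge_spans(spans):
    """Merge overlapping or abutting (start, end, ...) spans; drop empty ones."""
    merged = []
    for span in sorted(spans, key=lambda s: (s[0], s[1])):
        start, end = span[0], span[1]
        if start >= end:
            continue  # empty span
        if merged and start <= merged[-1][1]:
            prev = merged[-1]
            # Extend the previous span, keeping the extra elements of the frontmost one.
            merged[-1] = (prev[0], max(prev[1], end)) + tuple(prev[2:])
        else:
            merged.append(tuple(span))
    return merged
```

The result is ordered by increasing start index, and any extra tuple elements survive only for the first span of each merged group.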

msg_diff(msg1, msg2, pfilter=None, addrem=None, diffr=False)

 

Create word-level difference between extraction-invariant parts of messages.

For which parts of a message are considered extraction-invariant, see description of inv instance variable of message objects.

There are two return modes, depending on the value of diffr parameter.

If diffr is False, the difference is returned as list of 3-tuples of differences by message part: (part name, part item, word difference). The part name can be used to fetch the part value from the message, using get() method of message objects. The part item is None for singular message parts (e.g. msgid), and index for list parts (e.g. msgstr). See word_diff for the format of word-level difference.

If diffr is True, then each part difference has a fourth element, the difference ratio; see word_diff for its semantics. Additionally, the total difference ratio is computed, based on partial ones (also counting the zero difference of parts which were equal). The return value is now a 2-tuple of list of part differences (as 4-tuples) and the total difference ratio.

Either of the messages can be given as None. In case only one of the messages is None, the difference of the msgid field will show that this field does not exist in the nonexistent message (according to the format of nonexistent counterparts in word_diff). If both messages are None, the difference is an empty list, as the messages are the same, even if nonexistent.

Every msgstr field can be passed through a filter before differencing, using the pfilter parameter.

Instead of constructing the full difference, using the addrem parameter only equal, added, or removed segments can be reported. The value of this parameter is a string, such that the first character selects the type of partial difference: one of ('=', 'e') for equal, ('+', 'a') for added, and ('-', 'r') for removed segments, and the rest of the string is used as separator to join the selected segments (if the separator is empty, space is used instead).

Parameters:
  • msg1 (Message_base or None) - the message from which to make the difference
  • msg2 (Message_base or None) - the message to which to make the difference
  • pfilter (callable) - filter to be applied to translation prior to differencing
  • addrem (string) - report equal, added or removed segments instead of full difference, joined by what follows the selection character
  • diffr (bool) - whether to report difference ratio
Returns: [(string, int/None, [(string, string)...])...] or ([(string, int/None, [(string, string)...], float)...], float)
difference list
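
The interpretation of the addrem string can be sketched as below. The parsing follows the description above; how msg_diff actually represents the selected segments in its result is not specified here, so select_segments, which works on a word_diff-style list, is an assumption, and both function names are hypothetical.

```python
def parse_addrem(addrem):
    """Split an addrem string into (mode, separator)."""
    modes = {
        "=": "equal", "e": "equal",
        "+": "added", "a": "added",
        "-": "removed", "r": "removed",
    }
    mode = modes.get(addrem[:1])
    if mode is None:
        raise ValueError("unknown addrem selector: %r" % addrem[:1])
    sep = addrem[1:] or " "  # empty separator falls back to a space
    return mode, sep


def select_segments(diff, addrem):
    """Join only the equal, added, or removed segments of a word-level diff."""
    mode, sep = parse_addrem(addrem)
    tag = {"equal": " ", "added": "+", "removed": "-"}[mode]
    return sep.join(seg for t, seg in diff if t == tag)
```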

msg_ediff(msg1, msg2, pfilter=None, addrem=None, emsg=None, ecat=None, eokpos=None, enoctxt=None, emptydc=False, colorize=False, diffr=False)

 

Create word-level embedded difference between extraction-invariant parts of messages.

Like msg_diff, but instead of a difference list the result is a message with embedded differences, of the kind produced by word_ediff. See msg_diff for description of the pfilter and addrem parameters, and word_ediff for the format of embedded differences. Additionally, if pfilter is given, msgstr fields will be diffed both with and without the filter, and if the two diffs are not equal, both embeddings will be presented in the field, suitably visually separated.

By default, a new message with embedded difference will be constructed, of the type of first non-None of msg2 and msg1. Alternatively, the difference can be embedded into the message supplied by emsg parameter.

If resulting messages with embedded differences are to be inserted into a catalog, that catalog can be given by the ecat parameter. Then, if the key of the resulting message would conflict with one of those already in the catalog, its context will be appropriately padded to avoid the conflict. This is done by adding a pipe character and an unspecified number of alphanumerics (generally junk-looking) to the end of the msgctxt. In case the conflict with a particular message in the catalog is acceptable (e.g. when the resulting message is to be inserted in its place), the position of this message can be given by the eokpos parameter. In case a certain value of msgctxt should be padded regardless of whether there is a conflict or not, this value can be given by the enoctxt parameter.

An additional automatic comment starting with ediff: may be added to the message, possibly followed by some indicators necessary to complete the difference specification. These include:

  • state <STATE_DIFF> ...: changes in message state, like obsolete and fuzzy; e.g. state {+obsolete+} means that the message has been obsoleted from msg1 to msg2, while state {-obsolete-} means that it has been revived.
  • ctxtpad <STRING>: padding alphanumerics added to the msgctxt field to avoid key collision with one of the messages from ecat.
  • infsep <BLOCK> <LENGTH>: if pfilter was used, this indicator states the building block and length in blocks of in-field separators.

By default the difference comment is not added if there are no indicators, but it may be forced by setting emptydc parameter to True.

Embedded differences can be additionally colorized (e.g. for terminal) by setting colorize parameter to True.

If diffr is True, aside from the message with embedded differences, the total difference ratio is returned (see msg_diff). If pfilter is given, the ratio refers to difference under filter.

Parameters:
  • msg1 (Message_base or None) - the message from which to make the difference
  • msg2 (Message_base or None) - the message to which to make the difference
  • pfilter (callable) - filter to be applied to translation prior to differencing
  • addrem (string) - report equal, added or removed segments instead of full difference, joined by what follows the selection character
  • emsg (Message_base) - message to embed the difference into
  • ecat (Catalog) - catalog of messages to avoid key conflict with
  • eokpos (int) - position into ecat where key conflict is ignored
  • enoctxt (string) - msgctxt string that should be padded unconditionally
  • emptydc (bool) - whether to add difference comment even if empty
  • colorize (bool) - whether to colorize the difference
  • diffr (bool) - whether to report difference ratio
Returns: type(emsg or msg2 or msg1 or None) or (type(~), float)
message with embedded differences (or None) and possibly difference ratio

msg_ediff_to_new(emsg, rmsg=None)

 

Resolve message with embedded difference to the newer message.

Message cannot be properly resolved if addrem parameter to msg_ediff was used on embedding. If this function is called on such a message, the result is undefined.

By default a new message object is created, but using the rmsg parameter, an existing message can be given to be filled with all the resolved parts (keeping its own, ignored parts). This message can be the emsg itself.

If the resolved message evaluates to no message, the function returns None, and rmsg is not touched if it was given.

Any states indicated as added by the difference comment are ignored in favor of the actual states of embedded difference message. The two sets should normally be equal, but if they are not, the actual state in effect overrides the indicated added state.

Parameters:
  • emsg (Message_base or None) - resolvable message with embedded differences
  • rmsg (Message_base) - message to fill in the resolved parts
Returns: type of first non-None of rmsg, emsg, or None
resolved message (or None)

msg_ediff_to_old(emsg, rmsg=None)

 

Resolve message with embedded difference to the older message.

Like msg_ediff_to_new, only constructing the opposite message (except that states indicated as removed by difference comment are never ignored, i.e. they always override actual states). See msg_ediff_to_new for parameters and return values.

editprob(oldtext, newtext)

 

Compute the probability that a human would rather edit the old text to obtain the new text than write it from scratch.

Classical algorithms to compute similarity ratio between two texts sometimes produce high ratios for texts which a human would be unlikely to consider similar enough to make one text by editing the other, and vice versa. This function uses some heuristics to derive the probability that one text was really edited by a human into the other.

Not commutative in general.

If one of the texts is given as None, the result is 0.0; if both are None, the result is 1.0.

Parameters:
  • oldtext (string) - candidate for initial text
  • newtext (string) - current text
Returns: float
the probability of editing the old into the new text [0, 1]

descprob(descpath, ancpath, cutoff=None, getcsz=False)

 

Compute the probability that one PO file is a descendant of another.

Sometimes PO files are renamed, split into two, or joined into one, possibly with small changes in messages between the old and the new set. This function uses some heuristics to derive the probability that the PO file given by ancpath is an ancestor of the PO file given by descpath. If the probability cannot be determined (for whatever reason, e.g. if the file contains syntax errors), None is returned.

By default, only equality versus non-equality of messages is taken into consideration. If cutoff is set to a number 0.0-1.0, then fuzzy matching is performed, and partial similarities greater than the cutoff are counted into the final probability. However, this reduces performance by orders of magnitude (more so the lower the cutoff; 0.7-0.8 may be a reasonable tradeoff).

Parameters:
  • descpath (string) - path to possible descendant PO file
  • ancpath (string) - path to possible ancestor PO file
  • cutoff (float) - the cutoff for fuzzy matching
  • getcsz (bool) - also report the referent character sizes of the first and second file
Returns: None or float or (float, int, int)
the probability of ancestry [0, 1], the referent character sizes if requested