Chapter 4. Diffing and Patching

Line-level diffing of plain text files assumes that the file is chunked into lines as largest well-defined units, that each line has a significant standalone meaning, and that the ordering of lines is not arbitrary. For example, this is typical of programming language code.

Superficially, PO files could also be considered "a programming language of translation", and amenable to the same line-level treatment on diffing. However, some of the outlined assumptions, which make line-level diffing viable, are violated in the PO format. Firstly, the minimal unit of a PO file is one message, whereas one line has little semantic value. Secondly, the ordering of messages can be arbitrary in principle (e.g. dependent on the order of extraction from program code files), such that two line-wise very different PO files are actually equivalent from the translator's viewpoint. And thirdly, a good number of lines in the PO file are auxiliary, neither original text nor translation, generated either automatically or by the programmer (e.g. source references, extracted comments), all of which are outside the translator's scope for modifications.

Due to these difficulties, line-level diffing of PO files is commonly used only for review, and even that with some preparations. Because there are myriad line-wise different but semantically equivalent representations of a PO file, it is almost useless to send line-level diffs as patches. Translators are instead told to always send full PO files to the reviewer or the committer, no matter how small the modifications. Then, the reviewer merges the received PO file (new version), and possibly the original (old version), with the current PO template, without wrapping of message strings (msgid, msgstr, etc.). This "normalizes" the old and the new file with respect to all semantically non-significant elements, and only then can line-level diffing be performed. Additionally, since a long non-wrapped line of text may differ in only a few words, a dedicated diff viewer which can highlight word-level differences should be used. Ordinary diff syntax highlighting (e.g. in the shell, or in a general text editor) would waste the reviewer's time in trying to spot those few changed words.

Even with preparations and a dedicated diff viewer at hand, there is at least one significant case which is still not reasonably covered: when a fuzzy message with previous strings (i.e. when the PO file was merged with the --previous option of msgmerge) has been updated and unfuzzied. For example:

old
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
diff
  #: main.c:110
- #, fuzzy
- #| msgid "The Record of The Witch River"
  msgid "Records of The Witch River"
- msgstr "Beleška o Veštičjoj reci"
+ msgstr "Beleške o Veštičjoj reci"

The line-level diff viewer will know to show a word-level diff for the modified translation, but it cannot know that it should also show a word-level diff between the removed previous and the current msgid strings, so that the reviewer can see what has changed in the original text (i.e. why the message had become fuzzy), and based on that judge whether the translation was properly adapted.

A dedicated PO editor may be able to show the truly proper, message-level difference.[10] Even then, however, it remains necessary to send around full PO files, and possibly to normalize them to a lesser extent before comparing. Additionally, the diff format becomes tied to the given PO editor, instead of being self-contained and processable by various tools (as line-level diffs are).

This chapter therefore introduces the format and semantics for self-contained, message-level diffing of PO files -- the embedded diff -- and presents the Pology tools which implement it.

4.1. The Embedded Diff Format

The difference between two PO messages should primarily, though not exclusively, consist of differences between their string parts (msgid, msgstr, etc.). To be well observable, differences between strings should be as localized as possible -- think of a long paragraph in which only the spelling of a word or some punctuation was changed. Finally, the format of the complete PO message diff should be intuitively comprehensible to translators who are used to the PO format itself, and to some extent compatible with existing PO processing tools.

These considerations lead to making the diff of two PO messages a PO message itself. In other words, the diff gets embedded into the regular parts of a PO message. An embedded diff (ediff for short) message should be at least syntactically valid, if not semantically (it should not cause a simple msgfmt run to fail, though msgfmt --check could). For ediffs to be exchangeable as patches for PO files, the embedding should be resolvable into the old and the new messages from which the diff was created.

In this way, if ediff messages are packed into a PO file (an ediff PO), existing PO tools can be used to review and modify the diff. For example, highlighting in a text editor will need only minimal upgrades to show the embedded differences (more on that below), and otherwise it will already highlight ediff message parts as usual.

To fully define the ediff format, the following questions should be answered:

  • How to represent embedded differences in strings?

  • Which parts of the PO message should be diffed?

  • How to pair for diffing messages from two PO files?

  • How to present collection of diffed messages?

4.1.1. Embedding Differences into Strings

Once the word-level difference between the old and the new string has been computed, it should be somehow embedded into the new string (or, equivalently, the old string). This can be done by wrapping removed and added text segments with {-...-} and {+...+}, respectively:

old
"The Record of The Witch River"
new
"Records of The Witch River"
diff
"{-The Record-}{+Records+} of The Witch River"

It may happen that an opening or closing wrapper sequence occurs as a literal part of diffed strings[11], so some method of escaping is necessary. This is done by inserting a ~ (tilde) in the middle of the literal sequence:

old
"Foo {+ bar"
new
"Foo {+ qwyx"
diff
"Foo {~+ {-bar-}{+qwyx+}"

If strings instead contain the literal sequence {~+, then another tilde is inserted, and so on. In this way, an ediff can be unambiguously resolved to the old and new versions of the string. Escaping by inserting tildes also makes it easier to write a syntax highlighting definition for an editor, as the wrapper pattern is automatically broken by the tilde.
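The wrapping and tilde escaping described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not Pology's actual code: the function names escape_wrappers and embed_segments, and the representation of a computed difference as (tag, text) pairs, are hypothetical.

```python
import re

def escape_wrappers(text):
    """Break literal wrapper sequences by inserting one more tilde."""
    # "{+" becomes "{~+", "{~-" becomes "{~~-", and so on.
    text = re.sub(r"\{(~*)([+-])", r"{~\1\2", text)
    # "+}" becomes "+~}", "-~}" becomes "-~~}", and so on.
    return re.sub(r"([+-])(~*)\}", r"\1~\2}", text)

def embed_segments(segments):
    """Join (tag, text) pairs into an ediff string.

    tag is "=" for equal text, "-" for removed and "+" for added text
    (an assumed representation, chosen for this example only).
    """
    parts = []
    for tag, text in segments:
        text = escape_wrappers(text)
        parts.append(text if tag == "=" else "{%s%s%s}" % (tag, text, tag))
    return "".join(parts)

print(embed_segments([("=", "Foo {+ "), ("-", "bar"), ("+", "qwyx")]))
# prints: Foo {~+ {-bar-}{+qwyx+}
```

Escaping is applied to each segment before wrapping, so that the wrappers added by the diff itself are never touched.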

It may happen that a given string is not merely empty in the old or new PO message, but that it does not exist at all (e.g. msgctxt). For this reason it is possible to make an ediff between an existing and a non-existing string as well, in which case a tilde is appended to the very end of the ediff:

old
new
"a-context-note"
diff
"{+a-context-note+}~"

Here too escaping is provided, by inserting further tildes if the ediff between two existing strings would result in a trailing tilde (if the old string is "~" and the new "foo~", the ediff is "{+foo+}~~").
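To illustrate that an ediff string can be resolved back into its old and new versions, here is a rough Python sketch of the inverse operation. It is not taken from Pology; the name resolve_ediff is hypothetical, None is used here to stand for a non-existing string, and the way the non-existing side is detected (by looking at the kind of the first wrapper) is an assumption of this example.

```python
import re

_SEGMENT = re.compile(r"\{-(.*?)-\}|\{\+(.*?)\+\}", re.S)

def _unescape(text):
    # Remove one tilde from each escaped literal wrapper sequence.
    text = re.sub(r"\{~(~*)([+-])", r"{\1\2", text)
    return re.sub(r"([+-])~(~*)\}", r"\1\2}", text)

def resolve_ediff(ediff):
    """Return (old, new); None stands for a string that does not exist."""
    missing = False
    if ediff.endswith("~"):
        ediff = ediff[:-1]
        # If a tilde still trails, the removed one was only an escape.
        missing = not ediff.endswith("~")
    old, new, pos = [], [], 0
    for match in _SEGMENT.finditer(ediff):
        equal = _unescape(ediff[pos:match.start()])
        old.append(equal)
        new.append(equal)
        if match.group(1) is not None:
            old.append(_unescape(match.group(1)))   # removed text
        else:
            new.append(_unescape(match.group(2)))   # added text
        pos = match.end()
    tail = _unescape(ediff[pos:])
    old_text = "".join(old) + tail
    new_text = "".join(new) + tail
    if missing:
        # Assumption: a fully added string means the old one did not exist.
        if ediff.startswith("{+"):
            old_text = None
        else:
            new_text = None
    return old_text, new_text

print(resolve_ediff("{+a-context-note+}~"))  # (None, 'a-context-note')
print(resolve_ediff("{+foo+}~~"))            # ('~', 'foo~')
```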

It is not necessary to prescribe the exact algorithm for computing the difference between two strings. In fact, the diffing tool may allow the translator to select between several diffing algorithms, depending on personal taste and situation. For example, the default algorithm of Pology's poediff does the following: words are diffed as atomic sequences, all non-word segments (punctuation, markup tags, etc.) are diffed character by character, and equal non-word segments in between two different words (e.g. whitespace) are included into the difference segment. Hence the above ediff

"{-The Record-}{+Records+} of The Witch River"

instead of the smaller

"{-The -}Record{+s+} of The Witch River"

as the former is (tentatively) easier to comprehend.
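A word-level diff of this kind can be approximated with Python's standard difflib module. The following is a simplified sketch, not poediff's algorithm: it treats words as atomic tokens and every other character as its own token, but it does not merge equal whitespace between two changed words and it omits the tilde escaping. The names tokenize and word_ediff are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    # Words are atomic; every non-word character is its own token.
    return re.findall(r"\w+|\W", text)

def word_ediff(old, new):
    """Embed a token-level difference of two strings into one string."""
    a, b = tokenize(old), tokenize(new)
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i1 < i2:  # tokens present only in the old string
            out.append("{-" + "".join(a[i1:i2]) + "-}")
        if j1 < j2:  # tokens present only in the new string
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)

print(word_ediff("The Record of The Witch River",
                 "Records of The Witch River"))
# prints: {-The Record-}{+Records+} of The Witch River
```

Because "Record" and "Records" are different atomic tokens, the sketch never produces the character-level form with "Record{+s+}".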

Since every difference segment in the ediff message is represented in the described way, it is sufficient to upgrade the PO syntax highlighting of an editor[12] to indiscriminately highlight {-...-} and {+...+} segments everywhere in the message.

4.1.2. Message Parts Included in Diffing

A PO message consists of several types of parts: strings, comments, flags, source references, etc. It would not be very constructive to diff all of them; for example, while msgstr strings should clearly be included into diffing, source references most probably should not. To avoid pondering over the advantages and disadvantages of including each and every message part, there already exists a well-defined splitting of message parts into two groups, one of which will be taken into diffing, and the other not. These two groups are:

  • Extraction-invariant parts are those which do not depend on placement (or even presence) of the message in the source file. These are msgid string, msgstr strings, manual comments, etc.

  • Extraction-prescribed parts are those which cannot exist independently of the source file from which the message is extracted, such as format flags or extracted comments.

Only extraction-invariant parts will be diffed. The working definition of which parts belong to this group is provided by what remains in obsolete messages in PO files:

  • current original text: msgctxt, msgid, and msgid_plural strings

  • previous original text: #| msgctxt, #| msgid, and #| msgid_plural comments

  • translation text: msgstr strings

  • translator comments

  • fuzzy state (whether the fuzzy flag is present)

  • obsolete state (whether the message is obsolete)
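The grouping above can be pictured as a simple projection of a message onto its extraction-invariant parts. The sketch below assumes a message is represented as a plain Python dictionary; the field names and the function invariant_view are hypothetical and chosen only for illustration.

```python
# Assumed field names for the extraction-invariant parts listed above.
INVARIANT_PARTS = (
    "msgctxt", "msgid", "msgid_plural",
    "prev_msgctxt", "prev_msgid", "prev_msgid_plural",
    "msgstr", "translator_comments", "fuzzy", "obsolete",
)

def invariant_view(message):
    """Keep only the parts that take part in diffing."""
    return {name: message.get(name) for name in INVARIANT_PARTS}

first = {"msgid": "Records of The Witch River",
         "msgstr": "Beleške o Veštičjoj reci",
         "source_refs": ["main.c:89"]}
second = dict(first, source_refs=["main.c:110"])

# Messages that differ only in source references compare as equal.
print(invariant_view(first) == invariant_view(second))  # True
```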

Strings and translator comments are presented in the ediff message as embedded word-level differences, as described earlier. Changes in state, fuzzy and obsolete, are represented differently. A special "extracted" comment is added to the ediff message, starting with #. ediff: and listing any extra information needed to describe the ediff, including the state changes. Here is an example of two messages and the ediff they would produce[13]:

old
#, fuzzy
#~| msgid "Accurate subpolar weather cycles"
#~ msgid "Accurate subpolar climate cycles"
#~ msgstr "Tačni ciklusi subpolarnog vremena"
new
#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate tropical weather cycles"
msgctxt "some-superfluous-context"
msgid "Accurate tropical climate cycles"
msgstr "Tačni ciklusi tropskog vremena"
diff
#. ediff: state {-obsolete-}
#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate {-subpolar-}{+tropical+} weather cycles"
msgctxt "{+some-superfluous-context+}~"
msgid "Accurate {-subpolar-}{+tropical+} climate cycles"
msgstr "Tačni ciklusi {-subpolarnog-}{+tropskog+} vremena"

The first thing to note is that the ediff message contains not only the extraction-invariant parts, but also verbatim copies of extraction-prescribed parts from the new message. Effectively, the ediff is embedded into the copy of the new message. Extraction-prescribed parts are kept rather than discarded in order to provide more context when reviewing the diff. Here, for example, the extracted comment states that the text is a checkbox label, which may be important for the style of translation.

The other important element is the #. ediff: dummy extracted comment, which here indicates that the obsolete state has been "removed", i.e. the message was unobsoleted between the old and the new version of the PO file. Aside from state changes, a few other indicators may be present in this comment, and they will be mentioned later on. The ediff comment is present only when necessary, i.e. if there are any indicators to show.

If diffing of two messages were always conducted part for part, for all message parts which are taken into diffing, then in some cases the resulting ediff would not be very useful. Consider how the first example in this chapter, the line-level diff of a fuzzy and translated message, would look as an ediff if diffed part for part:

old
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
diff
#. ediff: state {-fuzzy-}
#: main.c:110
#| msgid "{-The Record of The Witch River-}~"
msgid "Records of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

This ediff suffers from the same problem as the line-level diff: instead of showing the difference from previous to current msgid string, the current msgid is left untouched, while the previous msgid is simply shown to have been removed.

Therefore, instead of diffing directly part for part, a special transformation takes place when exactly one of the two diffed messages is fuzzy and contains previous original strings. This splits into two directions: from fuzzy to non-fuzzy, and from non-fuzzy to fuzzy.

Diffing from a fuzzy to a non-fuzzy message is the more usual of the two directions. It typically appears when the translation has been updated after merging with the template. In this case, the old and the new message are shuffled prior to diffing in the following way (*-rest denotes all diffed parts that are neither original text nor fuzzy state):

old
fuzzy                   -->     fuzzy
old-previous-strings    -->     old-previous-strings
old-current-strings     -->     old-previous-strings
old-rest                -->     old-rest
new
-                       -->     -
-                       -->     old-current-strings
new-current-strings     -->     new-current-strings
new-rest                -->     new-rest

When these shuffled messages are diffed, the resulting ediff message's current strings will show the important difference, that between the previous original text of the old (fuzzy) message and the current original text of the new (non-fuzzy) message. Ediff message's previous strings will show the less important difference between the old message's previous and current strings, but only if it is not the same as the difference between current strings. This may sound confusing, but the actual ediff produced in this way is quite intuitive:

old
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
diff
#. ediff: state {-fuzzy-}
#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

From this the reviewer can see that the message was unfuzzied, the change in the original text that caused the message to become fuzzy, and what was changed in the translation to unfuzzy it. The old version of the text (in removed and equal segments) is that from the message before it got fuzzied, and the new version (in added and equal segments) is that from the message after it was unfuzzied.

The other special direction, from a non-fuzzy to a fuzzy message, should be less frequent. It appears, for example, when the diff is taken from the old, completely translated PO file, to the new PO file which has been merged with the latest template. In this case, the shuffling is as follows:

old
-                       -->     -
-                       -->     new-previous-strings
old-current-strings     -->     old-current-strings
old-rest                -->     old-rest
new
fuzzy                   -->     fuzzy
new-previous-strings    -->     new-current-strings
new-current-strings     -->     new-current-strings
new-rest                -->     new-rest

The difference in the ediff message's current strings will again be the most important one, while the difference in previous strings is the less important one and is shown only if not equal to the difference in current strings. Here is what this results in when applied one step earlier, just after merging with the template:

old
#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
diff
#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "Beleška o Veštičjoj reci"

The reviewer can see that the message became fuzzy, and the change in the original text that caused that.
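Both shuffling directions can be expressed compactly in code. The following Python sketch assumes messages are dictionaries with the hypothetical fields fuzzy, prev_msgid and msgid; only msgid is handled, with msgctxt and msgid_plural presumably treated analogously. It is an illustration of the two tables above, not Pology's implementation.

```python
def shuffle_for_diff(old, new):
    """Return shuffled copies of old and new, ready for part-for-part diffing."""
    old, new = dict(old), dict(new)  # leave the inputs untouched
    old_special = old["fuzzy"] and old["prev_msgid"] is not None
    new_special = new["fuzzy"] and new["prev_msgid"] is not None
    if old_special and not new_special:
        # Fuzzy to non-fuzzy.
        new["prev_msgid"] = old["msgid"]
        old["msgid"] = old["prev_msgid"]
    elif new_special and not old_special:
        # Non-fuzzy to fuzzy.
        old["prev_msgid"] = new["prev_msgid"]
        new["prev_msgid"] = new["msgid"]
    else:
        return old, new
    # Drop the previous strings if they would show the same difference
    # as the current strings.
    if (old["prev_msgid"], new["prev_msgid"]) == (old["msgid"], new["msgid"]):
        old["prev_msgid"] = new["prev_msgid"] = None
    return old, new

fuzzy_msg = {"fuzzy": True,
             "prev_msgid": "The Record of The Witch River",
             "msgid": "Records of The Witch River"}
updated_msg = {"fuzzy": False, "prev_msgid": None,
               "msgid": "Records of The Witch River"}
shuffled_old, shuffled_new = shuffle_for_diff(fuzzy_msg, updated_msg)
print(shuffled_old["msgid"])  # The Record of The Witch River
print(shuffled_new["msgid"])  # Records of The Witch River
```

After shuffling, diffing the current msgid strings yields the difference that caused the message to become fuzzy, as in the examples above.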

The diffing tool may add custom additional information at the end of any strings in the ediff message (msgid, msgstr, etc.), separated with a newline, a repeated block of one or more characters, and a newline. When this is done, the #. ediff: comment will have the infsep indicator, which states the character block used and the number of repetitions in the separator:

#. ediff: state {+fuzzy+}, infsep +- 20
#: main.c:110
#, fuzzy
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr ""
"Beleška o Veštičjoj reci\n"
"+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-\n"
"some-additional-information"

Of course, the diffing tool should compute the appropriate separator such that it does not conflict with a part of the text in one of the strings. What could this additional information be? For example, it could be a filtered version of the text, to ease some special review type.
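One simple way to obtain a non-conflicting separator is to lengthen it until it no longer occurs in any of the strings. The Python sketch below shows this idea; the function names and the strategy of only increasing the repetition count are assumptions of this example, and the actual tool may choose differently.

```python
def choose_infsep(strings, block="+-", reps=20):
    """Lengthen the separator until none of the strings contains it."""
    while any(block * reps in text for text in strings):
        reps += 1
    return block, reps

def append_info(text, info, block, reps):
    """Append additional information after a separator line."""
    return text + "\n" + block * reps + "\n" + info

def split_info(text, block, reps):
    """Split a string back into the ediff text and the additional information."""
    head, sep, info = text.partition("\n" + block * reps + "\n")
    return head, (info if sep else None)

block, reps = choose_infsep(["Beleška o Veštičjoj reci"])
msgstr = append_info("Beleška o Veštičjoj reci",
                     "some-additional-information", block, reps)
print("#. ediff: state {+fuzzy+}, infsep %s %d" % (block, reps))
print(split_info(msgstr, block, reps))
```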

4.1.3. Pairing Messages From Two PO Files

So far it has been described how to make an embedded diff out of two messages, once it has been decided that those messages should be diffed. However, the translator is not expected to decide which messages to diff, but which PO files to diff. The diffing tool should then automatically pair for diffing the messages from the two PO files, and this section describes the several pairing criteria.

Most obviously, messages should be paired by key, which can be called primary pairing. The PO message key is the unique combination of msgctxt and msgid strings. In the most usual case -- reviewing an ediff from an incomplete PO file with fuzzy and untranslated messages, to an updated PO file with those messages translated -- pairing by key will be fully sufficient, as both PO files will contain exactly the same set of messages. These two messages will be paired by key:

old
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

But what should happen if some messages are left unpaired after pairing by key? Consider the earlier example where the diff was taken from the older fully translated to the newer merged PO file:

old
#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

The keys, here just current msgid strings, of the two messages do not match, so they cannot be paired by key. Yet it would be ungainly to represent the old message as fully removed, and the new message as fully added, in the resulting ediff:

diff
#: main.c:89
msgid "{-The Record of The Witch River-}~"
msgstr "{-Beleška o Veštičjoj reci-}~"

#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
#| msgid "{+The Record of The Witch River+}~"
msgid "{+Records of The Witch River+}~"
msgstr "{+Beleška o Veštičjoj reci+}~"

(That the message has been fully added or removed can be seen by the trailing tilde in the msgid string, which indicates that the old or new msgid does not exist at all, and so neither does the message with it.)

Instead, messages left unpaired by key should be tested for pairing by pivoting around previous strings (secondary pairing). The two messages above will thus be paired due to the fact that the current msgid of the old message is equal to the previous msgid of the new message, and will produce a single ediff message as shown earlier.
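Primary and secondary pairing could be sketched as follows. The sketch assumes messages are dictionaries with the hypothetical fields msgctxt, msgid, prev_msgctxt and prev_msgid, and it implements only the pivoting case described above (the current key of the old message equal to the previous key of the new message). It is an illustration, not Pology's code.

```python
def key(msg):
    return (msg.get("msgctxt"), msg["msgid"])

def prev_key(msg):
    if msg.get("prev_msgid") is None:
        return None
    return (msg.get("prev_msgctxt"), msg["prev_msgid"])

def pair_messages(old_msgs, new_msgs):
    """Return (old, new) pairs; None marks a fully added or removed message."""
    new_by_key = {key(m): m for m in new_msgs}
    pairs, paired_new, left_old = [], set(), []
    # Primary pairing: equal keys.
    for old in old_msgs:
        new = new_by_key.get(key(old))
        if new is None:
            left_old.append(old)
        else:
            pairs.append((old, new))
            paired_new.add(key(new))
    # Secondary pairing: old current key equals new previous key.
    new_by_prev = {prev_key(m): m for m in new_msgs
                   if key(m) not in paired_new and prev_key(m) is not None}
    for old in left_old:
        new = new_by_prev.pop(key(old), None)
        if new is not None:
            paired_new.add(key(new))
        pairs.append((old, new))
    # Whatever is left in the new file was fully added.
    pairs.extend((None, m) for m in new_msgs if key(m) not in paired_new)
    return pairs

old_msg = {"msgid": "The Record of The Witch River"}
new_msg = {"msgid": "Records of The Witch River",
           "prev_msgid": "The Record of The Witch River"}
print(pair_messages([old_msg], [new_msg]) == [(old_msg, new_msg)])  # True
```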

Finally, consider the third related combination, when the old PO file has not yet been merged with the template, while the new PO file has both been merged and its translation updated:

old
#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"
new
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

Once again it would be a waste to present the old message as fully removed and the new message as fully added in the resulting ediff. When a message is left unpaired after both pairing by key and pairing by pivoting, then the two PO files can be merged in the background -- as if the new is the template for the old, and vice versa -- and then tested for chained pairing by pivoting and by key with the merged PO file as intermediary. This pairing by merging (tertiary pairing) will then produce another natural ediff:

diff
#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"

It can be left to the diffing tool to decide which pairing methods beyond the primary pairing, by key, to use. There should not be much reason not to perform secondary pairing, by pivoting, as well. If tertiary pairing, by merging, is done, the user should be allowed to disable it, as it can sometimes produce strange results (subject to the fuzzy matching algorithm).

4.1.4. Collecting Diffed Messages

For the ediff of two PO files to also be a syntactically valid PO file, constructed ediff messages should be preceded by a PO header in output. At first glance, this PO header could be itself the ediff of headers of the PO files which were diffed. However, there are several issues with this approach:

  • The reviewer of the ediff PO file would not be informed at once if there was any difference between the headers. Headers tend to be long, and a small change in one of header fields may go visually unnoticed.

  • Depending on the amount of changes between the two headers, the resulting ediff message of the header could be too badly formed to represent the header as such. For example, if some header fields in msgstr were added or removed, embedded difference wrappers would invalidate the MIME-header format of msgstr, which could confuse PO processing tools.

  • How would the diff of two collections of PO files (e.g. directories) be packed into a single ediff PO? To pack diffs of several file pairs into one diff file is an expected feature of diffing tools.

To avert these difficulties, the following is done instead. First, a minimal valid header is constructed for the ediff PO file, independently of the headers in diffed PO files. The precise content can be left to the diffing tool, with Pology's poediff producing something like:

# +- ediff -+
msgid ""
msgstr ""
"Project-Id-Version: ediff\n"
"PO-Revision-Date: 2009-02-08 01:20+0100\n"
"Last-Translator: J. Random Translator\n"
"Language-Team: Differs\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Ediff-Header-Context: ~\n"

The PO-Revision-Date header field is naturally set to the date when the ediff was made. Values for the Last-Translator and Language-Team fields can be somehow pulled from the environment (poediff will fetch them from Pology user configuration, or set some dummy values). Encoding of the ediff PO can be chosen at will, so long as all constructed ediff messages can be encoded with it (poediff will always use UTF-8). The purpose of the final, X-Ediff-Header-Context field will be explained shortly.

It is the next entry in the ediff PO file that will actually be the ediff of headers of the two diffed PO files. Headers are diffed just like any other message, but the resulting ediff is given a few additional decorations:

# =========================================================
# Translation of The Witch River into Serbian.
# Koja Kojic <koja.kojic@nedohodnik.net>, 2008.
# {+Era Eric <era.eric@ledopad.net>, 2008.+}~
msgctxt "~"
msgid ""
"- l10n-wr/sr/wriver-main.po\n"
"+ l10n-wr/sr-mod/wriver-main.po\n"
msgstr ""
"Project-Id-Version: wriver 0.1\n"
"POT-Creation-Date: 2008-09-22 09:17+0200\n"
"PO-Revision-Date: 2008-09-{-25 20:44-}{+28 21:49+}+0100\n"
"Last-Translator: {-Koja Kojic <koja.kojic@nedohodnik-}"
"{+Era Eric <era.eric@ledopad+}.net>\n"
"Language-Team: Serbian\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

Observe the usual ediff segments: a translator comment naming the new translator who updated the PO file has been added, and the PO-Revision-Date and Last-Translator header fields contain ediffs reflecting the update. These are the only actual differences between the two headers. More interesting are the additional decorations:

  • The very first translator comment (here a long line of equality signs) can be anything, and serves as a strong visual indicator of the header ediff. This is especially convenient when the ediff PO file contains diffs of several pairs of PO files.

  • That this particular message is a header ediff, is indicated by the msgctxt string set to a special value, here a single tilde. This value is given up front by the X-Ediff-Header-Context of the ediff PO header. It should be computed during diffing such that it does not conflict with msgctxt of one of the message ediffs (e.g. it may simply be a sufficiently long sequence of tildes).

  • The msgid string of the header ediff contains newline-separated paths of the diffed PO files. More precisely, the two lines of the msgid string are in the form [+-] file-path[ <<< comment]\n. The trailing newline of the second file path is elided if the msgstr string does not end in newline, to prevent msgfmt from complaining. The file path is followed by the optional, <<<-separated comment. This comment can be used for any purpose; one such use is demonstrated in the section on poediff.
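A tool reading an ediff PO file can recover the file paths from the header ediff's msgid with a few lines of code. The following Python sketch assumes the line form stated above; the function name parse_header_paths is hypothetical.

```python
def parse_header_paths(msgid):
    """Map "-" and "+" to (path, comment); comment is None if absent."""
    paths = {}
    for line in msgid.splitlines():
        sign, rest = line[0], line[2:]
        path, sep, comment = rest.partition(" <<< ")
        paths[sign] = (path, comment if sep else None)
    return paths

msgid = ("- l10n-wr/sr/wriver-main.po\n"
         "+ l10n-wr/sr-mod/wriver-main.po\n")
print(parse_header_paths(msgid))
```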

Although there should always be some difference in the header when a PO file is properly updated, it may happen that there is none. In such a case, the header ediff message is still added, but it contains only the additional decorations: the visual separator comment, the special msgctxt, and the msgid with file paths. All other comments and msgstr are empty; the empty msgstr immediately shows that there is no difference between the headers. This "empty" header ediff is needed to provide the file paths of diffed PO files, and, if several pairs of PO files were diffed, to separate their diffs in the ediff PO file.

After the header ediff message, ordinary ediff messages follow. When all constructed ediff messages from the current pair of PO files are listed, the next pair starts with a new header ediff message, and so on.

Especially when diffing several pairs of PO files, it may happen that two ediff messages have the same key (msgid and msgctxt strings) and thus cannot both be added as such to the ediff PO file. When that happens, any ediff message added after the first one with the same key will have its msgctxt string padded by a few random alphanumerics, to make its key unique. This padding sequence will be recorded in the #. ediff: comment, as the ctxtpad field. For example:

# =========================================================
msgctxt "~"
msgid "...(first PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"

# =========================================================
msgctxt "~"
msgid "...(second PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}, ctxtpad q9ac3
msgctxt "|q9ac3~"
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"

The padding sequence is appended to the original msgctxt, separated by |. If there was no original msgctxt, the padding sequence is further extended by a tilde.
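The padding rule can be sketched in Python as follows. The function name unique_msgctxt, the pad length of five characters and the choice of lowercase letters and digits are assumptions made for this example; only the "|" separator and the trailing tilde for a missing msgctxt come from the description above.

```python
import random
import string

def unique_msgctxt(msgctxt, msgid, seen_keys, rng=random):
    """Return (msgctxt, pad); pad is None when no padding was needed."""
    if (msgctxt, msgid) not in seen_keys:
        return msgctxt, None
    alphabet = string.ascii_lowercase + string.digits
    while True:
        pad = "".join(rng.choice(alphabet) for _ in range(5))
        if msgctxt is None:
            padded = "|" + pad + "~"   # no original msgctxt
        else:
            padded = msgctxt + "|" + pad
        if (padded, msgid) not in seen_keys:
            return padded, pad

seen = {(None, "White{+ horizon+}")}
ctxt, pad = unique_msgctxt(None, "White{+ horizon+}", seen)
seen.add((ctxt, "White{+ horizon+}"))
print("#. ediff: state {-fuzzy-}, ctxtpad " + pad)
print('msgctxt "%s"' % ctxt)
```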

4.2. Producing Ediffs with poediff

The poediff script in Pology implements embedded diffing of PO files as defined in the previous section. To diff two PO files, running the usual:

$ poediff orig/foo.po mod/foo.po

will write out the ediff PO content to standard output, with some basic shell coloring of difference segments. The ediff can be written into a file (an ediff PO file) either with shell redirection, or with the -o/--output option. It is equally simple to diff directories:

$ poediff orig/ mod/

By default, given directories are recursively searched for PO files, and the PO files present in only one of the directories will also be included in the ediff.

4.2.1. Diffing With Underlying VCS

When PO files are handled by a version control system (VCS), poediff can be put into VCS mode using the -c/--vcs VCS option, where the value is the keyword of one of the version control systems supported by Pology. In VCS mode, instead of giving two paths to diff, any number of version-controlled paths (files or directories) are given. Without other options, all locally modified PO files in these paths are diffed against the last commit known to the local repository. For example, if a program is using a Subversion repository, then the PO files in its po/ directory can be diffed with:

$ poediff -c svn prog/po/

Specific revisions to diff can be given by the -r/--revision REV1[:REV2] option. REV1 and REV2 are not necessarily direct revision IDs, but any strings that the underlying VCS can convert into revision IDs. If REV2 is omitted, diffing is performed from REV1 to the current working copy.

When an ediff is made in VCS mode, msgid strings in header ediffs will state revision IDs, in <<<-separated comments next to file paths:

# =========================================================
# ...
msgctxt "~"
msgid ""
"- prog/po/sr.po <<< 20537\n"
"+ prog/po/sr.po"
msgstr "..."

4.2.2. Command Line Options

Options specific to poediff:

-b, --skip-obsolete

By default, obsolete messages are treated equally to non-obsolete, and can feature in the ediff output. This makes it possible to detect when a message has become obsolete, or has returned from obsolescence, and show this in the ediff. But sometimes including obsolete messages into diffing may not be desired, and then this option can be issued to ignore them.

-c VCS, --vcs=VCS

The keyword of the underlying version control system, to switch poediff into VCS mode. See Section 9.7.2, “Version Control Systems” for the list of supported version control systems (or issue --list-vcs option).

--list-options, --list-vcs

Simple listings of options and VCS keywords. Intended mainly for writing shell completion definitions.

-n, --no-merge

Disable pairing of messages by internal merging of diffed PO files. Merging is performed only if there were some messages left unpaired after pairing by key and by pivoting, so in the usual circumstances it is not done anyway. But when it is done, it may produce strange results, so this option can be used to prevent it.

-o FILE, --output=FILE

The ediff is by default written to the standard output, and this option can be used to send it to a file instead.

-p, --paired-only

When directories are diffed, by default the PO files present in only one of them will be included into the ediff, i.e. all their messages will be shown as added or removed. This option will limit diffing only to files present in both directories, in the sense of having the same relative paths (rather than e.g. same PO domain name).

-Q, --quick

Produces maximally stripped-down output, sometimes useful for quick visual observation of changes, but which cannot be used as a patch. Equivalent to -bns.

-r REV1[:REV2], --revision=REV1[:REV2]

When operating in VCS mode, the default is to make the diff from the last commit to the current working copy. This option can be used to diff between any two revisions. If the second revision is omitted, the diff is taken from first revision to current working copy.

-s, --strip-headers

Prevents diffing of PO headers, as well as inclusion of the top ediff header in the output. This reduces clutter when the intention is to see only changes in messages through many PO files, but the resulting ediff cannot be used as a patch.

-U, --update-effort

Instead of outputting the diff, the translation update effort is computed. It is expressed as the nominal number of newly translated words, from old to new paths. The procedure to compute this quantity is not straightforward, but the intention is that it roughly approximates the number of words (in the original text) as if messages were translated from scratch. Options -b and -n are ignored.

Options common with other Pology tools:

-R, --raw-colors; --coloring-type

See Section 9.6, “Output Coloring”.
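The pairing of files by relative path, as described for the -p/--paired-only option, can be illustrated with a short Python sketch. This is not poediff's own code: the function names, the restriction to files ending in .po, and the use of None to mark a missing side are all assumptions made for illustration.

```python
import os

def collect_po_paths(root):
    """Return the set of PO file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".po"):  # assumption: only *.po files count
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
    return found

def pair_po_files(old_dir, new_dir, paired_only=False):
    """Pair files by relative path; None marks the missing side."""
    old_paths = collect_po_paths(old_dir)
    new_paths = collect_po_paths(new_dir)
    # Files present in both directories are always diffed.
    pairs = [(p, p) for p in sorted(old_paths & new_paths)]
    if not paired_only:
        # Files on one side only: all messages appear removed or added.
        pairs += [(p, None) for p in sorted(old_paths - new_paths)]
        pairs += [(None, p) for p in sorted(new_paths - old_paths)]
    return pairs
```

With paired_only=True the result contains only files found under the same relative path in both directories, regardless of whether some other file shares the same PO domain name.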

4.2.3. User Configuration

poediff will consult the [user] section in user configuration to fill out some of the header of the ediff PO file. It also consults its own section, with the following fields available:

[poediff]/merge=[*yes|no]

Setting to no is counterpart to --no-merge command line option, i.e. this field can be used to permanently disable message pairing by merging.

4.3. Applying Ediffs as Patches with poepatch

Basic application of an ediff patch is much easier than that of a line-level patch, because there will be no conflicts if messages have different wrapping, ordering, or extraction-prescribed parts (source references, etc.). The patch is applied by resolving each ediff message from it into the originating old and new message, and if either the old or the new message exists (by key) in the target PO file and has equal extraction-invariant parts, then the message modification is applied, and otherwise rejected.

Applying the modification to the target message means overwriting its extraction-invariant parts with those from the new message from the ediff, and leaving other parts untouched. If the target message is already equal to the new message by extraction-invariant parts, then the patch is silently ignored. This means that if the same patch is applied twice to the target PO file, the second application makes no modifications. Likewise if, by chance, the modifications given by the patch were already independently performed by another translator (e.g. a few simple updates to unfuzzy messages).
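The decision between applying, ignoring and rejecting can be sketched in Python as follows. This is a simplified illustration, not poepatch's implementation: messages are plain dictionaries, the key is taken to be the (msgctxt, msgid) pair, and the extraction-invariant parts are reduced to msgctxt, msgid, msgstr and the fuzzy state, all of which are assumptions of the sketch.

```python
def key(msg):
    """Message key (assumed here to be msgctxt plus msgid)."""
    return (msg.get("msgctxt"), msg["msgid"])

def invariant(msg):
    """Extraction-invariant parts, reduced to a few fields for the sketch."""
    return (msg.get("msgctxt"), msg["msgid"], msg["msgstr"],
            bool(msg.get("fuzzy")))

def find(target, msg):
    """Look up a message in the target list by key."""
    for cur in target:
        if key(cur) == key(msg):
            return cur
    return None

def apply_ediff(target, old, new):
    """Return 'applied', 'ignored' or 'rejected' for one resolved ediff."""
    cur = find(target, new)
    if cur is not None and invariant(cur) == invariant(new):
        return "ignored"  # target already equals the new message
    cur = find(target, old)
    if cur is not None and invariant(cur) == invariant(old):
        # Overwrite invariant parts only; source references etc. stay.
        cur["msgid"] = new["msgid"]
        cur["msgstr"] = new["msgstr"]
        cur["fuzzy"] = bool(new.get("fuzzy"))
        if new.get("msgctxt") is None:
            cur.pop("msgctxt", None)
        else:
            cur["msgctxt"] = new["msgctxt"]
        return "applied"
    return "rejected"
```

Calling apply_ediff twice with the same old and new message returns "applied" the first time and "ignored" the second, which mirrors the repeated-application behavior described above.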

The command-line interface of Pology's poepatch is much like that of patch(1), sans the myriad of its more obscure options. There is the -p option to strip leading elements of file paths in the ediff, and the -d option to prepend to them a directory path where target PO files are to be looked up. If the ediff was produced in VCS mode, then it can be applied as a patch in any of the following ways:

$ cd repos/prog/po && poepatch <ediff.po
$ cd repos/ && poepatch -p0 <ediff.po
$ poepatch -d repos/app/po <ediff.po

Header modifications (coming from the header ediff message) are applied in a slightly relaxed fashion: some of the standard header fields are ignored when checking whether the patch is applicable. These are the fields which are known to be volatile as the PO file goes through different translators, and which do not influence the processing of the PO file (unlike, for example, the encoding or plural forms fields). The ignored fields are: POT-Creation-Date, PO-Revision-Date, Last-Translator, X-Generator. When the header modification is accepted, the ignored fields in the target header are overwritten with those from the patch (including being added or removed).
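The relaxed applicability check for headers can be sketched as below. The representation of a header as a dictionary of field names to values, and the function names, are assumptions made for illustration; only the list of ignored fields comes from the text above.

```python
IGNORED_FIELDS = {
    "POT-Creation-Date",
    "PO-Revision-Date",
    "Last-Translator",
    "X-Generator",
}

def significant(header):
    """Drop the volatile fields before comparing two headers."""
    return {name: value for name, value in header.items()
            if name not in IGNORED_FIELDS}

def header_patch_applicable(target_header, old_header):
    """True if the target header matches the patch's old header,
    disregarding the ignored fields (their presence included)."""
    return significant(target_header) == significant(old_header)
```

A target header that differs from the old header of the patch only in Last-Translator or X-Generator is thus still patchable, while a difference in, say, Content-Type is not tolerated.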

4.3.1. Handling Rejected Ediffs

All ediff messages which were rejected as patches will be written out to stdin.rej.po in the current working directory if the patch was read from standard input, or to FILE.rej.po if the patch file was given by -i FILE.po option.

The file with rejected ediff messages will again be an ediff PO file. It will have the header as before, except that its comment will mention that the file contains rejects of a patching operation. Afterwards, rejected ediff messages will follow. Every header ediff message will be present whether rejected or not, for the same purpose of separation and provision of file paths, but if it was not itself rejected as a patch, it will be stripped of comments and the msgstr string.

Furthermore, to every straight-out rejected ediff message an ediff-no-match flag will be added. This is done, naturally, because some ediff messages may not be rejected straight-out. Consider the following scenario. A PO file has been merged to produce the fuzzy message:

old
#: tools/power.c:348
msgid "Active sonar low frequency"
msgstr "Niska frekvencija aktivnog sonara"
new
#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency"
msgstr "Niska frekvencija aktivnog sonara"

The translator updates the PO file, which produces the usual ediff message when going from fuzzy to translated:

diff
#. ediff: state {-fuzzy-}
#: tools/power.c:361
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

However, before this patch could have been applied, the programmer adds a trailing colon to the same message, and the catalog is merged again to produce:

new-2
#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency:"
msgstr "Niska frekvencija aktivnog sonara"

The patch cannot be cleanly applied at this point, due to the extra colon added in the meantime to the msgid, so it has to be rejected. If nothing else is done, it would appear in the file of rejects as:

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-no-match
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

It is wasteful to reject such a near-matching patch without any indication that it could be easily adapted to the latest message in the target PO file. Therefore, when an ediff message is rejected, the following analysis is performed: by trying out message pairings as on diffing, could the old message from the patch be paired with a current message from the target PO, and that current message with the new message from the patch? Or, in other words, can an existing message in the target PO be "fitted in between" the old and new messages defined by the patch? If this is the case, instead of the original, two special ediff messages -- split rejects -- are constructed and written out: one from the old to the current message, and another from the current to the new message. They are flagged as ediff-to-cur and ediff-to-new, respectively:

#: tools/power.c:361
#, fuzzy, ediff-to-cur
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency{+:+}"
msgstr "Niska frekvencija aktivnog sonara"

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-to-new
#| msgid "Active sonar {-low-}{+high+} frequency{+:+}"
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"

There are several ways to interpret split rejects, depending on the circumstances. In this example, from the ediff-to-cur message the reviewer can see what has changed in the target message after the translator made the ediff. This can also be seen by comparing the differences embedded into the previous and current msgid strings in the ediff-to-new message. With a bit of editing, the reviewer can fold these two messages into an applicable patch:

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff
msgid "Active sonar {-low-}{+high+} frequency:"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara:"

Since the file of rejects is also an ediff PO, after edits such as this to make some patches applicable, it can be reapplied as a patch. When that is done, poepatch will silently ignore all ediff messages having ediff-no-match or ediff-to-new flags, as these have already been determined inapplicable. That is why in this example the reviewer has replaced the ediff-to-new flag with the plain ediff flag in the folded ediff message.
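When folding split rejects by hand, it can help to check which old and new strings the edited message resolves to. The following Python sketch does this for a single string with {-...-} and {+...+} segments. It is an illustration only: the function name is hypothetical, and the sketch does not handle the case where a wrapper sequence is itself part of the text.

```python
import re

REMOVED = re.compile(r"\{-(.*?)-\}", re.S)   # {-removed text-}
ADDED = re.compile(r"\{\+(.*?)\+\}", re.S)   # {+added text+}

def resolve_ediff(text):
    """Return the (old, new) strings encoded by an embedded difference."""
    # Old version: drop additions, unwrap removals.
    old = REMOVED.sub(r"\1", ADDED.sub("", text))
    # New version: drop removals, unwrap additions.
    new = ADDED.sub(r"\1", REMOVED.sub("", text))
    return old, new

old, new = resolve_ediff('Active sonar {-low-}{+high+} frequency:')
print(old)  # Active sonar low frequency:
print(new)  # Active sonar high frequency:
```

For the folded message above, the old msgid must coincide with the previous msgid of the fuzzy message currently in the target PO file, colon included, for the patch to become applicable.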

4.3.2. Embedding Patches

Depending on the kind of text which is being translated, and the distance between the source and target language grammar, orthography, and style, it may be difficult to review the ediff in isolation. In general, messages in an ediff PO file will lack positional context, which in the full PO file is provided by messages immediately preceding and following the observed message. For example, a long passage from documentation probably needs no positional context. But a short, newly added message such as "Crimson" could very well need one, if it has neither msgctxt nor an extracted comment describing it: is it really a color? what grammatical ending should it have (in a language which matches adjective to noun gender)? Several messages around it in the full PO file could easily show whether it is just another color in a row, and their grammatical endings (determined by a translator earlier).

Another difficulty is when an ediff message needs some editing before being applied. This may not be easy to do directly in the ediff PO file. Everything is fine so long as only the added text segments ({+...+}) are edited, but if the sentence needs to be restructured more thoroughly, the reviewer would have to make sure to put all additions into existing or new {+...+} segments, and to wrap all removals as {-...-} segments. If this is not carefully performed, the patch will not be applicable any more, as the old message resolved from it will no longer exactly match a message in the target PO file.

For these reasons, poepatch can apply the patch in such a way that the ediff is not resolved; instead, all its extraction-invariant fields are written into the message in the target PO file. In effect, the target PO file becomes an ediff PO by itself, but only in the messages which were actually patched. To mark these messages for lookup, the usual ediff flag is added to them. For example, if the message in the patch file was:

#: title.c:274
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"

then when the patch is successfully applied with embedding, the patched message in target PO file will look like this, among other messages:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

#: title.c:292
#, ediff
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"

#: title.c:328
msgid "Start the Expedition"
msgstr "Pođi u ekspediciju"

Other than the addition of the ediff flag, note that the patched message kept its own source reference, rather than being overwritten by that from the patch. Same holds for all extraction-prescribed parts.

The reviewer can now jump over ediff flags, always having the full positional context for each patched message, and being able to edit it to heart's content, with only minimal care not to invalidate the ediff format. Wrapped difference segments can be entirely removed, non-wrapped segments can be freely edited; the only thing that must not happen is that a wrapped segment loses its opening or closing sequence. But this does not mean that the reviewer has to remove difference segments, that is, to manually unembed patched messages. poepatch can do this automatically, when run on the embedded-patched PO file with the -u/--unembed option.

A patch is applied with embedding by issuing the -e/--embed option:

$ poepatch -e <ediff.po
patched (E): foo.po

where (E) in the output indicates that the embedding is engaged. After the patched PO file has been reviewed and patched messages possibly edited, all remaining embedded differences are removed, i.e. resolved to new versions, by running:

$ poepatch -u foo.po

More precisely, only those messages having the ediff flag are resolved, so the reviewer must not remove the flag from a message (unless manually unembedding the whole message).

What happens with rejected patches when embedding is engaged? They are also added into the target PO file, with heuristic positioning, and no separate file with rejects is created. Same as on plain patching, straight-out rejects will have the ediff-no-match flag, and split rejects ediff-to-cur or ediff-to-new. If these are not manually resolved during the review (ediff-no-match messages removed, ediff-to-* messages removed or folded), when poepatch is run to unembed the differences, it will remove all ediff-no-match and ediff-to-new messages, and resolve ediff-to-cur messages to current version.
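The unembedding rules above can be summarized in a Python sketch. It is only an approximation of what poepatch -u does: messages are plain dictionaries with a hypothetical flags list, only msgid and msgstr are resolved, state changes recorded in ediff comments are not modeled, and "current version" of an ediff-to-cur message is read here as the new side of its embedded difference.

```python
import re

def resolve_new(text):
    """Resolve an embedded difference to its new version."""
    text = re.sub(r"\{-.*?-\}", "", text, flags=re.S)
    return re.sub(r"\{\+(.*?)\+\}", r"\1", text, flags=re.S)

DROPPED = {"ediff-no-match", "ediff-to-new"}   # removed on unembedding
RESOLVED = {"ediff", "ediff-to-cur"}           # resolved, flag cleared

def unembed(messages):
    """Return the message list with embedded differences cleared."""
    result = []
    for msg in messages:
        flags = set(msg.get("flags", ()))
        if flags & DROPPED:
            continue  # unresolved rejects are discarded
        marker = flags & RESOLVED
        if marker:
            msg = dict(msg,
                       msgid=resolve_new(msg["msgid"]),
                       msgstr=resolve_new(msg["msgstr"]),
                       flags=sorted(flags - marker))
        result.append(msg)  # unflagged messages pass through untouched
    return result
```

Messages without any of the ediff flags are left as they are, which is why removing the flag from a patched message by hand leaves its difference segments in place.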

4.3.3. Command Line Options

Options specific to poepatch:

-a, --aggressive

After the messages from the patch and the target PO file have been paired, normally only those differences that have no conflicts (e.g. in translation) will be applied. This option can be issued to instead unconditionally overwrite all extraction-invariant parts of the message in the target PO file with those defined by the paired patch.

-d, --directory

The directory path to prepend to file paths read from the patch file, when trying to match the files on disk to patch.

-e, --embed

Apply patch with embedding.

-i FILE, --input=FILE

Read the patch from the given file instead of from standard input.

-n, --no-merge

When split rejects are computed, all methods for pairing messages are used, as on diffing. Pairing by merging can sometimes lead to the same strange results as on diffing, and this option disables it.

-p NUM, --strip=NUM

Strips the smallest prefix containing the given number of slashes from file paths read from the patch file, when trying to match the files on disk to patch. If this option is not given, only the base name of each read file path is taken as relative path to match on disk. (This is the same behavior as in patch(1).)

-u, --unembed

Clears all embedded differences in input PO files, after they have been patched with embedding.
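The path handling described for the -p and -d options can be sketched in Python. The function and parameter names are hypothetical, and the order of operations (strip first, then prepend the directory) is an assumption; the sketch also does not collapse runs of adjacent slashes.

```python
import posixpath

def target_path(patch_path, strip=None, directory=None):
    """Map a file path read from the patch to a path to look up on disk."""
    if strip is None:
        # No -p given: only the base name is used.
        rel = posixpath.basename(patch_path)
    else:
        # -p NUM: drop the smallest prefix containing NUM slashes.
        rel = patch_path.split("/", strip)[-1]
    if directory:
        # -d: prepend the lookup directory.
        rel = posixpath.join(directory, rel)
    return rel

print(target_path("app/po/foo.po"))                            # foo.po
print(target_path("app/po/foo.po", strip=0))                   # app/po/foo.po
print(target_path("app/po/foo.po", strip=1))                   # po/foo.po
print(target_path("app/po/foo.po", directory="repos/app/po"))  # repos/app/po/foo.po
```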

4.3.4. User Configuration

poepatch consults the following user configuration fields:

[poepatch]/merge=[*yes|no]

Setting to no is the counterpart to the --no-merge command line option, i.e. this field can be used to permanently disable pairing by merging when computing split rejects.



[10] For example Lokalize, when operating in merge mode.

[11] Although this should be quite rare. In the collection of PO files from several translation projects, with over 2 million words in total, there was not a single occurrence where one of the chosen wrapper sequences was part of the text.

[12] At the moment, the following text and PO editors are known to have highlighting for ediffs: Kate, Kwrite, Lokalize.

[13] Whether two messages such as these would get paired for diffing in the first place, will be discussed later on.