Chapter 7. Miscellaneous Tools

This chapter describes various smaller standalone tools in Pology, which do not introduce any major PO processing concepts nor can be grouped under a common topic.

7.1. Rewrapping PO Files with porewrap

The porewrap script does one simple thing: it rewraps message strings (msgid, msgstr, etc.) in PO files. Gettext's tools, e.g. msgcat, can be used for rewrapping as well, so why does porewrap exist? The lesser reason is convenience. An arbitrary number of PO file paths can be given to it as arguments, as well as directory paths which will be recursively searched for PO files. The more important reason is that Pology can also perform "fine" wrapping, as described in Section 9.8, “Line Wrapping in PO Messages”. Thus, running:

$ porewrap --no-wrap --fine-wrap somedir/

will rewrap all PO files found in somedir/ and below, such that basic wrapping (on column) is disabled (--no-wrap), while fine wrapping (on logical breaks) is enabled (--fine-wrap).

In addition to command line options, porewrap will also consult the PO file header and the user configuration for the wrapping mode. Command line options have the highest priority, followed by the PO header, and then the user configuration. For details on how to set the wrapping mode in PO headers, see the description of the X-Wrapping header field in Section 9.9, “Influential Header Fields”. If none of these sources specifies the wrapping mode, porewrap will apply basic wrapping.
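The priority order above can be pictured with a minimal Python sketch. It is not porewrap's actual code: the function name resolve_wrapping and the representation of a wrapping mode as a tuple of keywords are assumptions made for illustration, and the sketch treats the mode from each source as a whole rather than option by option.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation (not Pology's own): a wrapping mode is a
# tuple of keywords, e.g. ("basic",), ("fine",), ("basic", "fine"),
# or () for no wrapping at all. None means "this source says nothing".
WrapMode = Sequence[str]


def resolve_wrapping(cmdline: Optional[WrapMode],
                     header: Optional[WrapMode],
                     config: Optional[WrapMode]) -> Tuple[str, ...]:
    """Return the wrapping mode from the highest-priority source."""
    # Sources are checked in the order given in the text:
    # command line, then PO header, then user configuration.
    for source in (cmdline, header, config):
        if source is not None:
            return tuple(source)
    # No source specified anything: fall back to basic wrapping.
    return ("basic",)


if __name__ == "__main__":
    # Command line asks for fine wrapping only; the header is ignored.
    print(resolve_wrapping(("fine",), ("basic",), None))
    # Nothing specified anywhere: basic wrapping.
    print(resolve_wrapping(None, None, None))
```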

7.1.1. Command Line Options

Options specific to porewrap:

-v, --verbose

Since porewrap just opens and writes back all the PO files given to it, it normally does not report anything. With this option, it reports PO file paths as they are written out.

Options common with other Pology tools:

--wrap; --no-wrap; --fine-wrap; --no-fine-wrap; --wrap-column=COL

See Section 9.8.1, “Common Command Line Options for Wrapping”.

-F FILE, --files-from=FILE

See Section 9.5, “Reading Paths From a File”.

7.1.2. User Configuration

porewrap reads the wrapping mode fields as described in Section 9.8.2, “Common User Configuration Fields for Wrapping”, from its [porewrap] section.

7.2. Self-Merging PO Files with poselfmerge

Normally, PO files are periodically merged with the latest PO templates, to introduce changes from the source material while preserving as much of the existing translation as possible. poselfmerge, on the other hand, will merge the PO file with "itself". More precisely, it will derive a temporary template version of the PO file (by clearing it of translations and other details), and then merge the original PO file with the derived template, by calling msgmerge internally. This can have several uses:

  • The fuzzy matching algorithm of msgmerge is extremely fast and robust, but treats all messages the same and in isolation, without trying out more complicated (and necessarily much slower) heuristic criteria. This can cause the translator to spend more time updating a fuzzy message than it would take to translate it from scratch. poselfmerge can therefore be instructed to go over all fuzzy messages created by merging, and apply additional heuristics to determine whether to leave the message fuzzy or to clean it up and make it fully untranslated.

  • Sometimes the PO file can contain a number of quite similar longer messages (this is especially the case when translating in summit). A capable PO editor should automatically offer the previous translation on the next similar message (by using internal translation memory), and show what the small differences in the original text are, thus greatly speeding up the translation of that message. If, however, the PO editor is not that capable, or you use a plain text editor, while translating you can simply skip every long message that looks familiar, and afterwards run poselfmerge on the PO file to introduce fuzzy matches on those messages.

  • More generally, if your PO editor does not have (a good enough) translation memory feature, or you edit PO files with a plain text editor, you can instruct poselfmerge to use one or more PO compendia to provide additional exact and fuzzy matches. This is essentially the batch application of translation memory. Section 10.1, “Creating and Using PO Compendia” provides some hints on how to create and maintain PO compendia.

Arguments to poselfmerge are any number of PO file paths or directories to search for PO files, which will be modified in place:

$ poselfmerge foo.po bar.po somedir/

However, this run will do almost nothing (except possibly rewrap files), just as msgmerge would do nothing if the same template were used twice. Instead, all special processing must be requested by command line options, or activated through the user configuration to avoid issuing some options with the same values all the time.

7.2.1. Command Line Options

Options specific to poselfmerge:

-A RATIO, --min-adjsim-fuzzy=RATIO

The minimum required "adjusted similarity" between the old and the new original text in a fuzzy message, in order to accept it and not clean it to untranslated state. The similarity is expressed as a ratio in the range 0.0-1.0, with 0.0 meaning no similarity and 1.0 no difference. A practical range is 0.6-0.8. If this option is not issued, fuzzy messages are kept as they are (as if 0.0 were given).

The requirement for computation of adjusted similarity is that fuzzy messages contain previous strings, i.e. that the PO file was originally merged with --previous to msgmerge.
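The threshold decision can be illustrated with a short Python sketch. Pology computes its own adjusted similarity; the sketch below swaps in the standard library's difflib.SequenceMatcher ratio as a stand-in, so the numbers it produces are not those of poselfmerge. The function name keep_fuzzy is hypothetical.

```python
from difflib import SequenceMatcher


def keep_fuzzy(old_msgid: str, new_msgid: str, min_ratio: float) -> bool:
    """Decide whether a fuzzy message stays fuzzy or gets cleared.

    old_msgid is the previous original text (the "#| msgid" string),
    new_msgid is the current original text. The similarity measure is
    difflib's ratio, used here only as a stand-in for Pology's
    adjusted similarity.
    """
    ratio = SequenceMatcher(None, old_msgid, new_msgid).ratio()
    # Accept the fuzzy match only if it reaches the required minimum.
    return ratio >= min_ratio


if __name__ == "__main__":
    old = "Open the selected file in a new window"
    new = "Open the selected files in a new tab"
    for threshold in (0.0, 0.6, 0.8, 1.0):
        print(threshold, keep_fuzzy(old, new, threshold))
```

With a threshold of 0.0 every fuzzy message is kept, which matches the behavior described for the case when the option is not issued.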

-b, --rebase-fuzzies

Normally, when merging with a template, the untranslated and fuzzy messages already present in the PO file are not checked again for approximate matches. This is on the one hand a performance measure (why fuzzy match again something that was already matched before?), and on the other hand a safety measure (an old fuzzy match based on the PO file itself deserves more trust than e.g. a new match from an arbitrary compendium). With this option, prior to merging all untranslated messages are removed from the PO file, as well as all fuzzy messages which still have their base translated message in the PO file (judging by previous strings). This activates fuzzy matching on untranslated messages (e.g. if a new compendium is given, or for similar messages skipped during translation), and updates base translated messages on fuzzy messages.

-C POFILE, --compendium=POFILE

The PO file to use as compendium on merging, to produce more exact and fuzzy matches. This option can be repeated to add several compendia.

-v, --verbose

poselfmerge normally operates silently, and this option requests some progress information. This is quite useful when processing a large collection of PO files, because merging and post-merge processing can take a lot of time (especially when a compendium is used).

-W NUMBER, --min-words-exact=NUMBER

When an exact match for an untranslated message is produced from the compendium, it is not always safe to silently accept it, because the compendium may contain translations from contexts totally unrelated to the current PO file. The shorter the message, the higher the chance that the translation will not be suitable in the current context. This option provides the minimum number of words (in the original) to accept an exact match from the compendium, or else the message is made fuzzy. The reasonable value depends on the relation between the source and the target language, with 5 to 10 probably being on the safe side.

Note that afterwards you can see when an exact match has been demoted into a fuzzy one, by that message not having previous strings (#| msgid "...", etc.).

-x, --fuzzy-exact

This option is used to unconditionally demote exact matches from the compendium into fuzzy messages, i.e. regardless of the length of the text, which is what -W/--min-words-exact takes into account. This may be needed, for example, when there is a strict review procedure in place, and the compendium is built from unreviewed translations.
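How -W/--min-words-exact and -x/--fuzzy-exact interact can be sketched as follows. This is an illustration only: the function name accept_exact_match is hypothetical, and counting words by splitting on whitespace is a simplification that may differ from how poselfmerge counts words in the original text.

```python
def accept_exact_match(msgid: str,
                       min_words: int = 0,
                       fuzzy_exact: bool = False) -> bool:
    """Return True if an exact compendium match may stay translated.

    False means the message should be demoted to fuzzy.
    """
    if fuzzy_exact:
        # Counterpart of -x: every exact match is demoted.
        return False
    # Counterpart of -W: short originals are considered too risky.
    # Whitespace splitting is a simplifying assumption of this sketch.
    return len(msgid.split()) >= min_words


if __name__ == "__main__":
    print(accept_exact_match("Open", min_words=5))
    print(accept_exact_match("Save the current document to disk",
                             min_words=5))
    print(accept_exact_match("Save the current document to disk",
                             min_words=5, fuzzy_exact=True))
```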

Options common with other Pology tools:

--wrap; --no-wrap; --fine-wrap; --no-fine-wrap; --wrap-column=COL

See Section 9.8.1, “Common Command Line Options for Wrapping”.

-F FILE, --files-from=FILE

See Section 9.5, “Reading Paths From a File”.

7.2.2. User Configuration

It is likely that the translator will have certain personal preferences regarding the various match acceptance criteria provided by command line options. Instead of issuing those options all the time, the following user configuration fields may be set:

[poselfmerge]/fuzzy-exact=[yes|*no]

Counterpart to the -x/--fuzzy-exact option.

[poselfmerge]/min-adjsim-fuzzy

Counterpart to the -A/--min-adjsim-fuzzy option.

[poselfmerge]/min-words-exact

Counterpart to the -W/--min-words-exact option.

[poselfmerge]/rebase-fuzzies=[yes|*no]

Counterpart to the -b/--rebase-fuzzies option.

Of course, command line options can be issued to override the user configuration fields when necessary.

poselfmerge also reads the wrapping mode fields as described in Section 9.8.2, “Common User Configuration Fields for Wrapping”, from its [poselfmerge] section.
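As an illustration of how these fields fit together, the following Python sketch parses a sample [poselfmerge] section with the standard configparser module. The INI-style layout is an assumption inferred from the [section]/field notation used above, the sample values are arbitrary, and the fallback of 0 for min-words-exact is also an assumption (the text does not state its default). This is not how Pology itself reads its configuration.

```python
from configparser import ConfigParser

# Sample user configuration, using only the field names listed above.
# The values are arbitrary examples.
SAMPLE = """
[poselfmerge]
fuzzy-exact = no
min-adjsim-fuzzy = 0.7
min-words-exact = 5
rebase-fuzzies = yes
"""

config = ConfigParser()
config.read_string(SAMPLE)
section = config["poselfmerge"]

# Fallbacks mirror the defaults marked with * in the text where given.
fuzzy_exact = section.getboolean("fuzzy-exact", fallback=False)
min_adjsim = section.getfloat("min-adjsim-fuzzy", fallback=0.0)
min_words = section.getint("min-words-exact", fallback=0)  # assumed default
rebase = section.getboolean("rebase-fuzzies", fallback=False)

if __name__ == "__main__":
    print(fuzzy_exact, min_adjsim, min_words, rebase)
```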

7.3. Machine Translation with pomtrans

Machine translation is the process where a computer program is used to produce translation of more than a trivial piece of text, starting from single sentences, over paragraphs, to full documents. There are debates on how useful machine translation is right now and how much better it could become in the future, and there is a steady line of research in that direction. Limiting the discussion to widely available examples of machine translation software today, it is safe to say that, on the one hand, machine translation can preserve a lot of the meaning of the original and thus be very useful to the reader who needs to grasp the main points of the text, but on the other hand, it is not at all passable for producing translations of the quality expected of human translators who are native speakers of the target language.

As far as Pology is concerned, the question of machine translation reduces to this: would it increase the efficiency of translation if PO files were first machine-translated, and then manually corrected by a human translator? There is no general answer to this question, as it depends strongly on all elements in the chain: the quality of machine translation software, the source language, the target language, and the human translator. Be that as it may, Pology provides the pomtrans script, which can fill in untranslated messages in PO files by passing original text through various machine translation services.

pomtrans has two principal modes of operation. The more straightforward is the direct mode, where original texts are simply msgid strings in the given PO file. In this mode, PO files can be machine-translated with:

$ pomtrans transerv -t lang paths...

The first argument is the translation service keyword, chosen from those known to pomtrans. The -t option specifies the target language; it may not be necessary if the processed PO files have the Language: header field properly set. The source language is assumed to be English, but there is an option to specify another source language. Afterwards an arbitrary number of paths follow, which may be either single PO files or directories which will be recursively searched for PO files.

pomtrans will try to translate only untranslated messages, and not fuzzy messages. When it translates a message, by default it will make it fuzzy as well, meaning that a human should go through all machine-translated messages. These defaults are based on the perceived current quality of most machine translation services. There are several command line options to change this behavior.
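The default behavior described above can be sketched in Python. The Message class and the machine_translate function are hypothetical stand-ins, not Pology's API, and the translate callable stands for whichever translation service is used. The two keyword parameters correspond to the -n/--no-fuzzy-flag and -m/--flag-mtrans options described in Section 7.3.1.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class Message:
    """Hypothetical, minimal stand-in for a PO message."""
    msgid: str
    msgstr: str = ""
    flags: Set[str] = field(default_factory=set)


def machine_translate(messages: Iterable[Message],
                      translate: Callable[[str], str],
                      no_fuzzy_flag: bool = False,
                      flag_mtrans: bool = False) -> int:
    """Fill in untranslated messages; return how many were translated."""
    count = 0
    for msg in messages:
        # Skip translated messages and fuzzy messages alike.
        if msg.msgstr or "fuzzy" in msg.flags:
            continue
        msg.msgstr = translate(msg.msgid)
        if not no_fuzzy_flag:
            # Default: a human should review the machine translation.
            msg.flags.add("fuzzy")
        if flag_mtrans:
            msg.flags.add("mtrans")
        count += 1
    return count


if __name__ == "__main__":
    catalog = [
        Message("Open"),
        Message("Save", "Guardar"),
        Message("Close", "Cerrar", {"fuzzy"}),
    ]
    # str.upper plays the role of a translation service here.
    print(machine_translate(catalog, str.upper, flag_mtrans=True))
    print(catalog[0])
```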

The other mode of operation is the parallel mode. Here pomtrans takes the original text to be the translation into another language, i.e. msgstr strings from a PO file translated into another language. For example, if a PO file should be translated into Spanish (i.e. from English to Spanish), and that same PO file is available fully translated into French (i.e. from English to French), then pomtrans could be used to translate from French to Spanish. This is done in the following way:

$ pomtrans transerv -s lang1 -t lang2 -p search:replace paths...

As in direct mode, the first argument is the translation service. Then both the source (-s) and the target language (-t) are specified; again, if PO files have their Language: header fields set, these options are not necessary. The peculiarity here is the -p option, which specifies two strings, separated by a colon. These are used to construct paths to source language PO files, by replacing the first string in paths of target language PO files with the second string. For example, if the file tree is:

foo/
    po/
        alpha/
            alpha.pot
            fr.po
            es.po
        bravo/
            bravo.pot
            fr.po
            es.po

then the invocation could be:

$ cd .../foo/
$ pomtrans transerv -s fr -t es -p es.:fr. po/*/es.po

In case a PO file in target language does not have a counterpart in source language, it is simply skipped.
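The path construction performed by -p can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than documented behavior: only the first occurrence of the search string is replaced, and a target path that does not contain the search string is treated as having no counterpart.

```python
import os
from typing import Callable, Iterable, List, Tuple


def source_catalog_path(target_path: str, spec: str) -> str:
    """Build the source language path from a SEARCH:REPLACE spec."""
    search, replace = spec.split(":", 1)
    # Assumption: replace only the first occurrence of SEARCH.
    return target_path.replace(search, replace, 1)


def pair_catalogs(target_paths: Iterable[str],
                  spec: str,
                  exists: Callable[[str], bool] = os.path.isfile
                  ) -> List[Tuple[str, str]]:
    """Return (source, target) pairs, skipping targets without a source."""
    pairs = []
    for target in target_paths:
        source = source_catalog_path(target, spec)
        # Skip when nothing was replaced or the counterpart is missing.
        if source != target and exists(source):
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    print(source_catalog_path("po/alpha/es.po", "es.:fr."))
    # Pretend that only alpha has a French counterpart on disk.
    on_disk = {"po/alpha/fr.po"}
    print(pair_catalogs(["po/alpha/es.po", "po/bravo/es.po"],
                        "es.:fr.", exists=on_disk.__contains__))
```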

There is another variation of the parallel mode, where source language texts are drawn not from counterpart PO files, but from a single compendium PO file in the source language. This mode is engaged by giving the path to that compendium with the -c option, instead of the -p option for path replacement.

7.3.1. Command Line Options

Options specific to pomtrans:

-a CHARS, --accelerator=CHARS

Characters used as accelerator markers in user interface messages. They should be removed from the source language text before translation, in order not to confuse the translation service.[29]
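A deliberately simplified Python sketch of this step is given below. The function name remove_accelerators is hypothetical, and the sketch strips every occurrence of each marker character, so it would also remove a character that appears literally in the text; the actual handling in pomtrans may be more careful.

```python
def remove_accelerators(text: str, markers: str) -> str:
    """Strip accelerator marker characters before translation.

    markers is the CHARS value given to -a, e.g. "&" or "&_".
    Simplification: every occurrence of a marker is removed.
    """
    for marker in markers:
        text = text.replace(marker, "")
    return text


if __name__ == "__main__":
    print(remove_accelerators("&Open file", "&"))
    print(remove_accelerators("Save _As", "&_"))
```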

-c FILE, --parallel-compendium=FILE

The path to source language compendium, in parallel translation mode.

-l, --list-transervs

Lists known translation services (the keywords which can be the first argument to pomtrans).

-m, --flag-mtrans

Adds the mtrans flag to each machine-translated message. This may be useful to positively identify machine-translated messages in the resulting PO file, as otherwise they are simply fuzzy.

-M MODE, --translation-mode=MODE

Translation services need as input the mode in which to operate, usually the source and target language at minimum. By default the translation mode is constructed based on source and target languages, but this is sometimes not precise enough. This option can be used to issue a custom mode string for the chosen translation service, overriding the default construction. The format of the mode string depends on the translation service; check the documentation of the respective translation service for details.

-n, --no-fuzzy-flag

By default machine-translated messages are made fuzzy, which is prevented by this option. It goes without saying that this is dangerous at the current state of the art in machine translation, and should be used only in very specific scenarios (e.g. high quality machine translation between two dialects of the same language).

-p SEARCH:REPLACE, --parallel-catalogs=SEARCH:REPLACE

The string to search for in paths of target language PO files, and the string to replace it with to construct paths of source language PO files, in parallel translation mode.

-s LANG, --source-lang=LANG

The source language code, i.e. the language which is being translated from.

-t LANG, --target-lang=LANG

The target language code, i.e. the language which is being translated into.

-T PATH, --transerv-bin=PATH

If the selected translation service is (or can be) a program on the local computer, this option can be used to specify the path to its executable file, if it is not in the PATH.

7.3.2. Supported Machine Translation Services

Currently supported translation services are as follows (with keyword in parentheses):

Apertium (apertium)

Apertium is a free machine translation platform, developed by the TRANSDUCENS research group of the University of Alicante. There is a basic web service, but the software can also be installed locally, and that is how pomtrans uses it (some distributions provide packages).

Google Translate (google)

Google Translate is Google's proprietary web machine-translation service, which can be used free of charge. At the moment, pomtrans makes one query to it per message, which can take quite some time on long PO files.



[29] This also means that, at the moment, machine-translated text has no accelerator when the original text did have one. Some heuristics may be implemented in the future to add the accelerator to translated text as well.