Chapter 3. Sieving

A translator may want to apply batch-type operations to every message in a single PO file or in a collection of PO files, such as searching and replacing text, computing statistics, or validating. However, batch-processing tools for general plain text (grep, sed, awk, etc.) are not well suited to processing PO files. For example, when looking for a particular word, a generic search tool will not see it if it contains an accelerator marker; or, when looking for a two-word phrase, a generic tool will miss it if it is wrapped across lines. Therefore many tools tailored specifically for batch-processing messages in PO files have been developed, such as those bundled with Gettext (msggrep, msgfilter, msgattrib...), or from Translate Toolkit (pocount, pogrep, pofilter...).
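The two failure modes named above can be reproduced in a few lines of Python. This is only an illustrative sketch: the PO fragment, the helper names (naive_hits, msgids, aware_hits) and the tiny msgid extractor are made up for this example and are not part of Pology or Gettext.

```python
# Raw PO fragment: the first msgid is wrapped over two lines, and the
# second one carries an accelerator marker (&) inside the word "Quasar".
PO_TEXT = '''msgid ""
"Open the quasar "
"catalog now"
msgstr ""

msgid "Charybdis Q&uasar"
msgstr ""
'''

def naive_hits(raw_text, phrase):
    """Line-by-line, case-insensitive substring search, like a generic tool."""
    return [line for line in raw_text.splitlines()
            if phrase.lower() in line.lower()]

def msgids(raw_text):
    """Toy msgid extractor that joins wrapped string lines.

    Not a real PO parser: it ignores escapes, msgctxt, plurals, etc.
    """
    result, current = [], None
    for line in raw_text.splitlines():
        if line.startswith("msgid "):
            current = [line[len("msgid "):]]
        elif line.startswith('"') and current is not None:
            current.append(line)
        else:
            if current is not None:
                # Drop the surrounding quotes and glue the pieces together.
                result.append("".join(s[1:-1] for s in current))
            current = None
    if current is not None:
        result.append("".join(s[1:-1] for s in current))
    return result

def aware_hits(raw_text, phrase, accel="&"):
    """Search unwrapped msgids with the accelerator marker removed."""
    return [m for m in msgids(raw_text)
            if phrase.lower() in m.replace(accel, "").lower()]

print(naive_hits(PO_TEXT, "quasar catalog"))  # wrapped phrase is missed
print(aware_hits(PO_TEXT, "quasar"))          # both messages are found
```

The naive search sees "quasar" only in the line without the accelerator and never sees the wrapped phrase, while the message-aware search finds both.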

Pology also provides a per-message batch-processing tool, posieve. What was the need for it, given the myriad of other powerful tools already available? In accordance with the philosophy of Pology, posieve goes deeper than these other tools. posieve makes easy what is possible but awkward to do by combining generic command line tools. posieve is modular from the ground up, so that it is never a design problem to add new functionality to it, even when it is of narrow applicability. Users who know some Python can even write their own "plugins" for it. Several processing modules can be applied in a single run of posieve, possibly affecting each other, in ways not possible with generic shell piping and without requiring temporary intermediate files.

3.1. Basic Usage of posieve

The posieve script itself is actually a simple shell for applying various processing modules, called sieves, to every message in one or more PO files. Some sieves can also request to operate on the header of the PO file, which posieve will then feed to them. A sieve can both examine and modify messages; if any message is modified, by default the modified PO file will be written out in place. Naturally, posieve has a number of options, but more interestingly, each sieve can define some parameters which determine its behavior. Pology comes with many internal sieves, which do things from general to obscure (possibly language or project specific), and users can define their own sieves.

Here is how you would run the stats sieve to collect statistics on all PO files in the frobaz/ directory:

$ posieve stats frobaz/

While PO files in frobaz/ are being processed, you will see a progress bar with the current file and the number of files to process, and after some time the stats sieve will present its findings in a table.

The first non-option argument in the posieve command line is the sieve name, and then any number of directory or file paths can be specified. posieve will consider file path arguments to be PO files, and recursively search directory paths to collect all files ending with .po or .pot. If no paths are specified, PO files to process will be collected from the current working directory.
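The path handling described above can be pictured with the following sketch. It is not posieve's actual code: the function name collect_po_paths and the sorted traversal order are assumptions made for illustration, and the demonstration runs on a throwaway directory tree.

```python
import os
import tempfile

def collect_po_paths(paths):
    """Expand command-line path arguments into a list of files to sieve.

    File arguments are taken as PO files as they are; directories are
    searched recursively for names ending in .po or .pot. With no
    arguments, the current working directory is searched.
    """
    if not paths:
        paths = ["."]
    collected = []
    for path in paths:
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                dirs.sort()  # visit subdirectories in a stable order
                for name in sorted(files):
                    if name.endswith((".po", ".pot")):
                        collected.append(os.path.join(root, name))
        else:
            collected.append(path)
    return collected

# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("alfa.po", "notes.txt", os.path.join("sub", "bravo.pot")):
        with open(os.path.join(top, rel), "w") as handle:
            handle.write("")
    demo_found = [os.path.relpath(p, top) for p in collect_po_paths([top])]
    print(demo_found)  # notes.txt is not collected
```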

If the sieve modifies a message and the new PO file is written out in place of the old one, the user will be informed by an exclamation mark followed by the file path. An example of a sieve which modifies messages is the tag-untranslated sieve; it adds the untranslated flag to every untranslated message, so that you can look them up in a plain text editor (as opposed to a dedicated PO editor):

$ posieve tag-untranslated frobaz/
! frobaz/alfa.po
! frobaz/bravo.po
! frobaz/charlie.po
Tagged 42 untranslated messages.

posieve itself tracks message modifications and informs about modified PO files, whereas the final line in this example has been output by the tag-untranslated sieve. Sieves will frequently issue such final reports of their actions.

If a sieve defines some parameters to control its behavior, these can be issued using the -s option. This option takes the parameter specification as its argument, which is of the form name:value, or just name for switch-type parameters. More than one parameter can be issued by repeating the -s option. For example, the stats sieve can be instructed to take into account only messages with at most 5 words:

$ posieve stats -s maxwords:5 frobaz/

to show statistics in greater detail:

$ posieve stats -s detail frobaz/

or to ignore a certain accelerator marker and show bar-type statistics instead of tabular:

$ posieve stats -s accel:_ -s msgbar frobaz/
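The name:value and bare-name forms of the -s argument can be told apart with a single split on the first colon. The sketch below is a hypothetical parser written for this illustration (parse_sieve_params is not a Pology function); it keeps the result as a list of pairs so that repeated parameters are preserved.

```python
def parse_sieve_params(specs):
    """Turn ['maxwords:5', 'detail'] into a list of (name, value) pairs.

    Switch-type parameters (no colon) get the value True. Values are
    kept as strings; converting them is left to whoever consumes them.
    """
    params = []
    for spec in specs:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = spec.partition(":")
        params.append((name, value if sep else True))
    return params

print(parse_sieve_params(["accel:_", "msgbar"]))
print(parse_sieve_params(["maxwords:5", "detail"]))
```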

posieve lists and describes its options with the usual -h/--help option. Help for a sieve can be requested by issuing the -H/--help-sieves option while a sieve name is present in the command line. All available internal sieves with short descriptions are listed using -l/--list-sieves.

Some sieves are language-specific, which can be seen by their names being of the form langcode:name. These sieves are primarily intended for use on PO files translated to the indicated language, but depending on particularities, may be applicable to several more closely related languages. (A sieve which is doing language-specific things, but which is applicable to many languages, is more likely to be named as a general sieve.)

If shell completion is active, it can be used to complete sieve names and their parameters.

3.2. Sieve Chains

It is possible to issue several sieves at once, by passing a comma-separated list of sieve names to posieve in place of a single sieve name. This is called a sieve chain.

At minimum, chaining sieves is a performance-improving measure, since each PO file is opened (and possibly written out) only once, instead of on each sieve run. For example, you can in one run compute the statistics to see how many messages need to be updated and tag all untranslated messages:

$ posieve stats,tag-untranslated frobaz/
! frobaz/alfa.po
! frobaz/bravo.po
! frobaz/charlie.po
... (table with statistics) ...
Tagged 42 untranslated messages.

A message in the PO file is passed through each sieve in turn, in the order in which they are issued, before proceeding to the next message. If a sieve modifies the message, the next sieve in the chain will operate on that modified version of the message. This means that the ordering of sieves in the command line is significant in general, and that it is interchangeable only if the sieves in the chain are independent of each other (as in this example). Chain order also determines the order in which sieve reports are shown; if in this example the order had been tag-untranslated,stats, then first the tagged messages line would be written out, followed by the statistics table.
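The message-by-message flow through a chain, and the reason why order matters, can be modelled with a small toy. Everything in this sketch is hypothetical (the classes, the process/finalize method names, the dictionary-based messages); it only mimics the behavior described above and is not how Pology's sieves are actually written.

```python
class TagUntranslated:
    """Toy modifying sieve: flags messages with empty translation."""
    def __init__(self):
        self.ntagged = 0
    def process(self, msg):
        if not msg["msgstr"]:
            msg["flags"].add("untranslated")
            self.ntagged += 1
    def finalize(self):
        return "Tagged %d untranslated messages." % self.ntagged

class CountFlagged:
    """Toy examining sieve: counts messages already carrying the flag."""
    def __init__(self):
        self.nseen = 0
    def process(self, msg):
        if "untranslated" in msg["flags"]:
            self.nseen += 1
    def finalize(self):
        return "Saw %d flagged messages." % self.nseen

def run_chain(sieves, messages):
    for msg in messages:        # one message at a time...
        for sieve in sieves:    # ...through every sieve, in chain order
            sieve.process(msg)
    # Final reports come out in chain order as well.
    return [sieve.finalize() for sieve in sieves]

def make_messages():
    return [
        {"msgid": "Open", "msgstr": "", "flags": set()},
        {"msgid": "Close", "msgstr": "Zatvori", "flags": set()},
        {"msgid": "Quit", "msgstr": "", "flags": set()},
    ]

print(run_chain([TagUntranslated(), CountFlagged()], make_messages()))
print(run_chain([CountFlagged(), TagUntranslated()], make_messages()))
```

In the first run the counting sieve sees the flags added just before it; in the second it runs too early and sees none, although the same messages end up tagged.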

Other than for performance, sieve chains are useful when messages should be modified in a particular way before a sieve gets to operate on them. A good example is when statistics are to be computed on PO files which contain old embedded contexts; if nothing were done, the contexts would add to the word count of the original text. To avoid this, a context normalization sieve (which converts embedded contexts to msgctxt) can be chained with the statistics sieve, and posieve instructed not to write modifications to the PO file. If the embedded context is of the single-separator type, with separator character |, the sieve chain is:

$ posieve --no-sync normctxt-sep,stats -s sep:'|' frobaz/
Converted 21 separator-embedded contexts.
... (table with statistics) ...

The --no-sync option prevents writing modified messages in the PO file on disk. Note that | as parameter value is quoted, because it would be interpreted as a shell pipe otherwise.

Finally, some sieves can stop messages from being pushed further through the sieve chain, so they can be used as a prefilter to other sieves. The archetypal example of this is the find-messages sieve, which stops non-matched messages from further sieving. For example, to include in the statistics only the messages containing the word "quasar", this would be executed:

$ posieve find-messages,stats -s msgid:quasar -s nomsg
Found 12 messages satisfying the conditions.
... (table with statistics) ...

The msgid: parameter specifies the word (actually, a regular expression) to be looked up in the original text, while the nomsg parameter tells find-messages not to write out matched messages to standard output, which it would do by default. Note that no path was specified, meaning that all PO files in the current working directory and below will be sieved.
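The prefiltering effect can be sketched with a toy chain in which a sieve may ask for the current message to be dropped from further sieving. How a real sieve signals this to posieve is not covered here; in this sketch it is simply assumed that process() returns True to stop the message, and all class and function names are invented for the example.

```python
import re

class FindMessages:
    """Toy prefilter: lets through only messages whose msgid matches."""
    def __init__(self, msgid_rx):
        self.rx = re.compile(msgid_rx)
        self.nfound = 0
    def process(self, msg):
        if self.rx.search(msg["msgid"]):
            self.nfound += 1
            return False   # keep pushing the message down the chain
        return True        # stop: later sieves never see this message

class WordCount:
    """Toy statistics sieve: counts messages and words it gets to see."""
    def __init__(self):
        self.nmsgs = 0
        self.nwords = 0
    def process(self, msg):
        self.nmsgs += 1
        self.nwords += len(msg["msgid"].split())
        return False

def run_chain(sieves, messages):
    for msg in messages:
        for sieve in sieves:
            if sieve.process(msg):
                break      # message was stopped by this sieve

messages = [
    {"msgid": "Distant quasar found"},
    {"msgid": "Open file"},
    {"msgid": "List of quasar sources"},
]
finder, counter = FindMessages("quasar"), WordCount()
run_chain([finder, counter], messages)
print(finder.nfound, counter.nmsgs, counter.nwords)
```

Only the two matching messages reach the counting sieve, so its totals cover seven words instead of nine.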

Examples of sieve chaining so far should have raised the following question: when several sieves are issued, to which of them are the parameters specified by -s options passed? The answer is that a parameter is sent to all sieves which accept a parameter of that name. Continuing the previous example, if message texts can contain the accelerator marker &, this would be specified like this:

$ posieve find-messages,stats -s msgid:quasar -s nomsg -s accel:'&'

find-messages will accept accel in order to also match messages like "Charybdis Q&uasar", while stats will use it to properly split text into words for counting them.
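The routing rule, that a parameter goes to every sieve in the chain that accepts a parameter of that name, can be written down as a short sketch. The function route_params and the ACCEPTED table are hypothetical; the table lists only the parameter names that appear in the examples of this chapter, not the sieves' full parameter sets.

```python
# Illustrative subsets of accepted parameter names, taken from the
# examples in this chapter (assumption: not the complete lists).
ACCEPTED = {
    "find-messages": {"msgid", "nomsg", "accel"},
    "stats": {"accel", "maxwords", "detail", "msgbar"},
}

def route_params(chain, issued, accepted=ACCEPTED):
    """Give each (name, value) pair to all sieves in the chain accepting it."""
    routed = {sieve: [] for sieve in chain}
    for name, value in issued:
        for sieve in chain:
            if name in accepted[sieve]:
                routed[sieve].append((name, value))
    return routed

issued = [("msgid", "quasar"), ("nomsg", True), ("accel", "&")]
routed = route_params(["find-messages", "stats"], issued)
print(routed["find-messages"])  # msgid, nomsg and accel
print(routed["stats"])          # accel only
```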

3.3. Command Line Options

Options specific to posieve:

-a, --announce-entry

A sieve may be buggy and crash or keep posieve in an infinite loop on a particular PO entry (header or message). When this option is given, each PO entry will be announced before sieving it, so that you can see exactly where the problem occurs.

-b, --skip-obsolete

By default posieve will process all messages in the PO file, including the obsolete. Sometimes sieving obsolete messages is not desired, for example when running translation validation sieves. This option can then be used to skip obsolete messages.

-c, --msgfmt-check

For posieve to process the PO file, it is only necessary that the basic PO syntax is valid, i.e. that msgfmt can compile the file. msgfmt also offers a stricter validation mode; to have posieve run this stricter validation on the PO file, issue this option. Invalid files will be reported and will not be sieved.

--force-sync

When some messages in the PO file are modified, by default only those messages will be reformatted (e.g. strings wrapped as selected) when the PO file is written to disk. This makes posieve friendly to version control systems. Sometimes, however, you may want all messages to be reformatted, modified or not, and then you can issue this option.

-h, --help

General help on posieve.

-H, --help-sieves

-h/--help shows only description of posieve and its options, while this option shows the descriptions and available parameters of issued sieves. For example:

$ posieve find-messages,stats -H

would output help for find-messages and stats sieves.

--issued-params

List of all sieve parameters and their values that would be issued. Used to check the interplay of command line and configuration on sieve parameters.

-l, --list-sieves

List of all internal sieves, with short descriptions.

--list-options; --list-sieve-names; --list-sieve-params

Simple listings of global options, internal sieve names, and parameters of issued sieves. Intended mainly for writing shell completion definitions.

-m OUTFILE, --output-modified=OUTFILE

If some PO files were modified by sieving, you may want to follow up with a command to process only those files. posieve will by default output the paths of modified PO files, but also other information, which makes parsing this output for modified paths ungainly. Instead, this option can be used to specify a file to which the paths of all modified PO files will be written, one per line.

--no-skip

If a sieve reports an error, posieve normally skips the problematic message and continues sieving the rest of the PO file, if possible. When this is not desired, this option tells posieve to abort with an error message in such cases.

--no-sync

All messages modified by sieves are by default written back to disk, i.e. their PO files are modified. This option prevents modification of PO files. This comes in handy in two cases. One is when you want to check what effect a modifying sieve will have before actually accepting it (a "dry" run). The other case is when you use a modifying sieve as a filter for the next sieve in the chain, which only needs to examine messages.

-q, --quiet

posieve normally shows the progress of sieving, which can be cancelled by this option. (Sieves will still output their own lines.)

-s PARAM[:VALUE]

The central option of posieve, which is used to issue parameters to sieves.

-S PARAM

When a sieve parameter is issued through user configuration, this option can be used to cancel it for one particular run.

--version

Release and copyright information on posieve.

-v, --verbose

More verbose output, where posieve shows the sieving modes, lists files which are being sieved, etc.

Options common with other Pology tools:

-F FILE, --files-from=FILE

See Section 9.5, “Reading Paths From a File”.

-e REGEX, --exclude-name=REGEX; -E REGEX, --exclude-path=REGEX; -i REGEX, --include-name=REGEX; -I REGEX, --include-path=REGEX

See Section 9.4, “Path Inclusion and Exclusion”.

-R, --raw-colors; --coloring-type

See Section 9.6, “Output Coloring”.

3.4. User Configuration

The following configuration fields can be used to modify general behavior of posieve:

[posieve]/skip-on-error=[*yes|no]

Setting to no is counterpart to --no-skip command line option.

[posieve]/msgfmt-check=[yes|*no]

Setting to yes is counterpart to -c/--msgfmt-check command line option.

[posieve]/skip-obsolete=[yes|*no]

Setting to yes is counterpart to -b/--skip-obsolete command line option.

For configuration fields that have counterpart command line options, the command line option always takes precedence if issued.

Configuration can also be used to issue sieve parameters, by specifying [posieve]/param-name fields. For example, parameters transl (a switch) and accel (with value &) are issued to all sieves that accept them by writing:

[posieve]
param-transl = yes
param-accel = &

To issue parameters only to certain sieves, the parameter name can be followed by a sieve list of the form /sieve1,sieve2,...; to instead exclude certain sieves from receiving the parameter, prepend ~ to the sieve list. For example:

[posieve]
param-transl/find-messages = yes  # only for find-messages
param-accel/~stats = &            # not for stats

The same parameter can sometimes be repeated in the command line, when it is logically meaningful to provide several values of that type to a sieve. However, same-name fields cannot be used in configuration to supply several values, because they override each other. Instead, a dot and a unique string (within the sequence) can be appended to the parameter name to make it a unique configuration field:

[posieve]
param-accel.0 = &
param-accel.1 = _

Strings after the dot can be anything, but a sequence of numbers or letters in alphabetical order is the least confusing choice.
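The field-name conventions above (the param- prefix, the optional .suffix for repeated values, and the optional /sieve-list with ~ for exclusion) can be decoded as in the following sketch. The helper names are hypothetical, and the handling of a field that combines a suffix with a sieve list, in the order name.suffix/list, is an assumption made here, since the text above shows the two forms only separately.

```python
def parse_param_field(field):
    """Split 'param-NAME[.SUFFIX][/[~]SIEVE,...]' into its parts.

    Returns (name, sieves, negated): an empty sieve list means that the
    parameter is offered to all sieves; negated is True for a ~ list.
    """
    prefix = "param-"
    if not field.startswith(prefix):
        raise ValueError("not a sieve parameter field: %r" % field)
    rest, _, sieve_list = field[len(prefix):].partition("/")
    name = rest.partition(".")[0]   # drop the uniqueness suffix, if any
    negated = sieve_list.startswith("~")
    sieves = [s for s in sieve_list.lstrip("~").split(",") if s]
    return name, sieves, negated

def applies_to(sieve, sieves, negated):
    """Decide whether a parsed field targets the given sieve."""
    if not sieves:
        return True
    return (sieve in sieves) != negated

for field in ("param-transl", "param-transl/find-messages",
              "param-accel/~stats", "param-accel.1"):
    print(field, "->", parse_param_field(field))
```

Whether a targeted sieve actually receives the parameter still depends on it accepting a parameter of that name.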

Sieve parameters should be issued from the configuration only as a matter of convenience, when they are almost always used in sieve runs. But occasionally the parameter issued from the configuration is not appropriate for the given run. Instead of going to the configuration and commenting the parameter out temporarily, it can be cancelled in the command line using the -S option (note capital S) followed by the parameter name. You can use the --issued-params option to confirm which parameters will be issued after both the command line and the configuration have been taken into account.

3.5. Internal Sieves

This section describes the sieves which are contained in Pology distribution and provides instruction for their use.

Parameters which take a value (which are not switches) may or may not have a default value, and when they do, it will be given in square brackets ([...]) in the header.

3.5.1. apply-filter

apply-filter is used to pipe translation through one or several hooks (see Section 9.10, “Processing Hooks”). The hooks may modify the translation, validate it, or do something else. More precisely, the following hook types are applicable:

  • F1A, F3A, F3C, to modify the translation and write changes back to the PO file;

  • V1A, V3A, V3C, to validate the translation, with standard validation output (highlighted spans and problem messages);

  • S1A, S3A, S3C, for any side-effect processing on translation (but no modification).

Parameters:

filter:hookspec

The hook specification. Can be repeated to add several hooks, which are then applied in the order of specification.

showmsg

Report every modified message to standard output. (For validation hooks, message is automatically reported if not valid.)

3.5.2. apply-header-filter

apply-header-filter is the counterpart to apply-filter which operates on headers instead of messages. Here the applicable hook types are accordingly F4B, V4B, S4B.

Parameters:

filter:hookspec

The hook specification. Can be repeated to add several hooks, which are then applied in the order of specification.

3.5.3. bad-patterns

Sometimes it is possible to use simple pattern matching to discover things that should never appear in the text, such as common grammar or orthographical errors. bad-patterns can apply such patterns to translation, either as plain substring matching or regular expressions. Patterns can be given as parameters, or more conveniently, read from files.

Parameters:

pattern:string

The pattern to search for. Can be repeated to search for several patterns.

fromfile:path

Read patterns to search for from the file. Each line contains one pattern. If line starts with #, it is treated as comment. Empty lines are ignored. Trailing and leading whitespace is removed from patterns; if it is significant, it can be given inside [...] regex operator. This parameter can be repeated to read patterns from several files.

rxmatch

By default patterns are treated as plain substrings. This parameter requests to treat patterns as regular expressions.

casesens

By default patterns are case-insensitive. This parameter makes them case-sensitive.
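The pattern file format described for the fromfile parameter is simple enough to sketch in a few lines. The function read_patterns is hypothetical and follows the description literally: a line is a comment only if # is its very first character, which is an interpretation rather than a confirmed detail of the sieve.

```python
def read_patterns(lines):
    """Collect patterns from an iterable of lines (e.g. an open file)."""
    patterns = []
    for line in lines:
        if line.startswith("#"):
            continue                 # comment line
        pattern = line.strip()       # leading/trailing whitespace removed
        if pattern:                  # empty lines are ignored
            patterns.append(pattern)
    return patterns

sample = [
    "# common typos\n",
    "\n",
    "  teh \n",
    "[ ]the[ ]the[ ]\n",   # significant spaces protected by [...]
]
print(read_patterns(sample))
```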

Caution

This sieve is deprecated. Use check-rules instead, which applies Pology's validation rules.

3.5.4. check-docbook4

check-docbook4 checks PO files extracted from Docbook 4.x files. Docbook is an XML format, typically used for documenting software.

Parameters:

showmsg

Instead of just showing the message location and problem description, also show the complete message with problematic segments highlighted.

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

Currently performed checks:

  • Markup validity. Docbook is a complex XML format, and nothing short of full validation of XML files generated from translated PO files can show if the translation is technically valid. Therefore check-docbook4 checks only well-formedness, whether tags are defined by Docbook, and some nesting constraints, and that only on the level of a single message. But this is already enough to catch the great majority of usual translation errors.

    This check can be skipped on a message by adding to it the no-check-markup translator flag.

  • Message insertion placeholders. Some extractors of Docbook split out into standalone messages contextually separate units that are found in the middle of flowing paragraphs (e.g. footnotes). When that happens, a special placeholder is left in the originating message, so that the markup can be reconstructed when the translated Docbook file is built. Such placeholders must be carried into translation.

3.5.5. check-grammar

check-grammar checks translation with LanguageTool, an open source grammar and style checker (http://www.languagetool.org/). LanguageTool supports a number of languages to greater or smaller extent, which you can check on its web site.

LanguageTool can be run as standalone program or in client-server mode, and this sieve expects the latter. This means that LanguageTool has to be up and running before this sieve is run. Messages in which problems are discovered are reported to standard output.

Parameters:

lang:code

The language code for which to apply the rules. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.

host:hostname [localhost]

Name of the host where the LanguageTool server is running. The default value of localhost means that it is running on the same computer where the sieve is run.

port:number [8081]

TCP port of the host on which the LanguageTool server listens for queries.

3.5.6. check-kde4

check-kde4 checks PO files extracted from program code based on KDE4 library and its translation system. Note that this really means what it says; this sieve should not be used to check just any PO file which happens to be part of the KDE project (e.g. PO files covering .desktop files, pure Qt code, etc.).

Parameters:

strict

Partly due to historical reasons, and partly due to programmers being sloppy, the original text itself is sometimes not valid by some checks. By default, when the original is not valid, the translation is not expected to be valid either, i.e. it is not checked. This parameter requires that the translation is always checked, regardless of the validity of the original (problems can almost always be avoided in the translation).

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

Currently performed checks:

  • Markup validity. KDE4 messages can contain a mix of KUIT and Qt rich text markup. Although Qt rich text does not have to be well-formed in XML sense, this check expects well-formedness to be preserved in translation if the original is such (also see the strict parameter).

    This check can be skipped on a message by adding to it the no-check-markup translator flag.

3.5.7. check-rules

check-rules applies language- and project-dependent Pology validation rules to translation. See Section 8.5, “Validation Rules” for detailed discussion on writing and applying rules.

Parameters:

lang:code

The language code for which to apply the rules. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.

env:environment

The language environment for which to apply the rules (see Section 8.1, “The Notion of Language in Pology”). Several environments can be given as comma-separated list, in which case the later environment in the list takes precedence on conflicted rules. If not given, it may also be read from PO files (see X-Environment in Section 9.9, “Influential Header Fields”).

envonly

When language environment is given, only the rules explicitly belonging to it are applied, while general rules for the selected language are ignored.

rule:identifiers

Comma-separated list of rule identifiers, to apply only those rules. If a rule selected in this way is disabled in its definition, this enables it.

rulerx:regexes

Like rule, but the values are interpreted as regular expressions by which to match rule identifiers.

norule:identifiers

Inverse of the rule parameter: selected rules are not applied, and all others are applied.

norulerx:regexes

Inverse of the rulerx parameter: selected rules are not applied, and all others are applied.

stat

Rules can take time to apply to all sieved PO files, and this parameter requests to write out some statistics of rule application at the end of sieving.

accel:characters

Characters to consider as accelerator markers. If not given, they may be read from sieved PO files. Note that this parameter in itself does nothing: it only makes it possible for a particular rule or group of rules to remove the accelerator before matching.

markup:types

The type of text markup used in messages, by keyword. It can also be a comma-separated list of keywords. If not given, it may be read from sieved PO files. See description of X-Text-Markup in Section 9.9, “Influential Header Fields” for the list of markup keywords currently known to Pology. Similarly to accel parameter, this parameter only enables rules to remove the markup (or do something else) before matching.

xml:file

By default, messages failed by rules are reported to standard output, and this parameter requests that they be written into a custom (but simple) XML format. This also causes results to be cached: on subsequent runs of check-rules only modified PO files will be checked again, and results for non-modified files will be pulled from the cache. The cache can be found in $HOME/.pology-check_rules-cache/ directory.

rfile:file

By default internal Pology rules are applied, and this parameter can be used to apply external rules instead, defined in the given rule file.

rdir:directory

Like rfile, but external rules are read from a directory containing any number of rule files.

branch:branch

Apply rules only to messages from given branch (summit). Several branches may be given as comma-separated list.

showfmsg

Rules are sometimes applied to the filtered instead of the original message, and when such message is failed, it may not be obvious what triggered the rule. This parameter requests that the filtered message is written out too when the original message is reported.

nomsg

When a message is failed, by default it is output in full together with the problem description. This parameter requests that only the problem description is output.

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

mark

To each failed message a failed-rule flag is added, modifying the PO file. Modified files can then be opened in the editor, and failed messages looked up by this flag.

byrule

As usual for sieving, by default each failed message is output as soon as it is processed. This parameter makes the failed messages output ordered by rules instead, where rules are sorted alphabetically by their identifiers. Note that this will cause there to be no output until all messages have been sieved.

One or more rules can be disabled on a particular message in the PO file itself, by adding a special translator comment that starts with skip-rule: and continues with comma-separated list of rule identifiers:

# skip-rule: ruleid1, ruleid2, ...
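Extracting the rule identifiers from such a comment amounts to a prefix check and a comma split. The sketch below assumes that translator comments are available as plain strings with the leading # already removed; the function name skipped_rules is invented for this illustration.

```python
def skipped_rules(translator_comments):
    """Return the set of rule identifiers disabled by skip-rule: comments."""
    marker = "skip-rule:"
    identifiers = set()
    for comment in translator_comments:
        text = comment.strip()
        if text.startswith(marker):
            for ident in text[len(marker):].split(","):
                ident = ident.strip()
                if ident:
                    identifiers.add(ident)
    return identifiers

comments = ["checked with the author", "skip-rule: ruleid1, ruleid2"]
print(sorted(skipped_rules(comments)))
```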

3.5.8. check-spell

check-spell checks spelling of translation by splitting it into words and passing them through GNU Aspell (http://aspell.net/). This sieve is a more specific counterpart to check-spell-ec: it exposes some options specific to Aspell and requires no external Python modules, only the Aspell installation. Also read Section 8.2, “Spell Checking” for details on spell-checking in Pology.

check-spell behaves mostly the same as check-spell-ec, and accepts all the same parameters with same meanings; the exception is the provider parameter, which is not present here since Aspell is the fixed provider. Only the parameters specific to this sieve are described in the following:

enc:encoding

The encoding in which the text should be sent to Aspell.

var:variety

The variety of the Aspell dictionary, if any.

skip:regex

Words matched by this regular expression are not sent to spell-checker.

case

Matching patterns given as parameter values (e.g. with skip:) are by default case-insensitive, and this parameter switches them to case-sensitive.

xml:file

By default, messages with unknown words are reported to standard output, and this parameter requests that they be written into a custom (but simple) XML format.

Aspell can be configured for use in Pology through user configuration, so that it is not necessary to issue some parameters on every run. See Section 9.2.4, “The [aspell] section”.

3.5.9. check-spell-ec

check-spell-ec uses the Enchant library (http://www.abisource.com/projects/enchant/) through PyEnchant Python module (http://pyenchant.sourceforge.net) to provide uniform access to different spell-checkers, such as Aspell, Ispell, Hunspell, etc. Translation is first split into words, possibly eliminating markup and other literal content, and the words are then fed to spell-checker. Messages containing unknown words are reported to standard output, with list of replacement suggestions.

Parameters:

provider:keyword

The spell-checker that Enchant should use. The value is one of keywords defined by Enchant (e.g. aspell, myspell...), and can be seen by running enchant-lsmod command (only providers available on the system are shown). If not given either by this parameter or in user configuration, Enchant will try to select a provider on its own.

lang:code

The language code for which the spelling is checked. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.

env:environment

The language environment for which to include supplemental dictionaries (see Section 8.1, “The Notion of Language in Pology”). Several environments can be given as comma-separated list, in which case the union of their dictionaries is used. If not given, environments may be read from PO files (see X-Environment in Section 9.9, “Influential Header Fields”) or from user configuration.

accel:characters

Characters to consider as accelerator markers, to remove them before splitting text into words. If not given, they may be read from PO files (see X-Accelerator-Marker in Section 9.9, “Influential Header Fields”).

markup:types

The type of text markup used in messages, by keyword. It can also be a comma-separated list of keywords. If not given, it may be read from PO files (see X-Text-Markup in Section 9.9, “Influential Header Fields”; there the list of markup keywords currently known to Pology is given as well).

skip:regex

Words matched by this regular expression are not sent to spell-checker.

case

Matching patterns given as parameter values (e.g. with skip:) are by default case-insensitive, and this parameter switches them to case-sensitive.

filter:hookspec

The hook to modify the text before splitting into words and spell-checking them (see Section 9.10, “Processing Hooks”). The hook type must be F1A, F3A, or F3C. The parameter can be repeated to add several hooks, which are then applied in the order of specification.

suponly

By default, internal supplemental spelling dictionaries are added to the system dictionary of the selected spell-checker. This parameter can be issued to instead use only internal dictionaries and not the system dictionary.

list

By default, when an unknown word is found, the complete message is output, with the problematic word highlighted and possibly the replacement suggestions. With this parameter, only a plain sorted list of unknown words, one per line, is output at the end of sieving. This is useful when a lot of false positives are expected, to quickly add them to the supplemental dictionary.

lokalize

Open the PO file on messages containing unknown words in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

check-spell-ec may be told to skip checking specific messages and words, and it may use internal supplemental spelling dictionaries. See Section 8.2, “Spell Checking” for these and other details on spell-checking in Pology.

Enchant can be configured for use in Pology through user configuration, so that it is not necessary to issue some parameters on every run. See Section 9.2.3, “The [enchant] section”.

3.5.10. check-tp-kde

The KDE Translation Project contains a great number of PO files extracted from various types of sources. As a result, for each message there are things that the translation can, must, or must not contain for it to be technically valid. When run over PO files within the KDE TP, check-tp-kde will first try to determine the type of each message and then apply appropriate technical checks to it. Message type is determined based on file location, file header, message flags and contexts; even a particular message in a particular file may be checked for some very specific issue.

"Technical" issues are those which should be fixed regardless of the language and style of translation, because they can lead to loss of functionality, information or presentation to the user. For example, a technical issue would be badly paired XML tags in translation, when in the original they were well paired; a non-technical issue (and thus not checked) would be when the original ends with a certain punctuation, but translation does not -- whether such details are errors or not, depends on the target language and translation style.

For the sieve to function properly, it needs to detect the project subdirectory of each PO file up to the topmost division within the branch, e.g. messages/kdebase or docmessages/kdegames. This means that the local copy of the repository tree needs to follow the repository layout up to that point, e.g. kde-trunk-ui/kdebase and kde-trunk-doc/kdegames would not be valid local paths.

Parameters:

strict

Sometimes the original text itself may not be valid against a certain check. When this is the case, by default the translation is not expected to be valid either, and the check is skipped. Issuing this parameter will force all checks on translation, regardless of whether the original is valid or not. It may still be possible to avoid some checks on those messages that just cannot be repaired through translation, if those checks define their own mechanism of cancelation (like adding a special translator comment).

check:keywords

Comma-separated list of checks to apply, by keyword, instead of all. Available checks are listed below.

showmsg

By default, when the message does not pass a check, only its location and the problem are reported. This parameter requests that the complete message be reported, possibly with problematic segments of translation highlighted.

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

Currently available checks (keyword in parenthesis):

  • KDE4 markup checking (kde4markup).

  • Qt markup checking (qtmarkup).

  • Docbook markup checking (dbmarkup).

  • HTML markup checking (htmlmarkup).

  • No translation scripting in "dumb" messages (nots). Translations fetched at runtime by KDE4 translation system may use translation scripting. This check will make sure that scripting is not attempted for other types of messages (used by Qt-only code, for .desktop files, etc.).

  • Qt datetime format messages (qtdt). A message is considered to be in this format if it contains the string qtdt-format in its msgctxt string or among flags.

  • Validity of translator credits (trcredits). PO files may contain meta-messages to input translator credits, which should have both valid translations on their own and some congruence between them.

  • Query placeholders in Plasma runners (plrunq). Messages in Plasma runners may contain special query placeholder :q:, which should be present in translation too.

  • File-specific checking (catspec). Certain messages in certain PO files have special validity requirements, and this check activates all such file-specific checks.

All markup checks can be skipped on a message by adding the no-check-markup translator flag.

3.5.11. check-tp-wesnoth

PO files of The Battle of Wesnoth contain a mix of well-known and custom markup and format directives. check-tp-wesnoth heuristically determines the type of each message in a Wesnoth PO file and applies appropriate technical checks to it (where "technical" has the same meaning as in the check-tp-kde sieve).

Parameters:

check:keywords

Comma-separated list of checks to apply, by keyword, instead of all. Available checks are listed below.

showmsg

Instead of just showing the message location and problem description, also show the complete message, possibly with highlighted problematic segments.

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

Currently available checks (keyword in parenthesis):

  • Stray context separators in translation (ctxtsep). Wesnoth is still embedding disambiguating context into msgid, by putting it in front of the actual text and separated by ^. An unwary translator will sometimes mistake such context for part of the original text, and translate it too.

  • Congruence of WML interpolations (interp). WML interpolations look like "...side $side_number is..." and normally must match between the original and translation, or else the player would lose information. Only in very rare cases (e.g. some plurals and Markov chain generators) some interpolations may be missing in translation, and then they can be listed space-separated in a translator comment to silence the check:

    # ignore-interpolations: interp1 interp2 ...
    

    (the $ character is not necessary in the list).

  • WML markup checking (wml). If WML in translation is not valid, player may see some visual artifacts. Also, links in WML must match between original and translation, to avoid loss of information.

  • Pango markup checking (pango). Pango is used in some places for visual text markup instead of WML.

  • Congruence of leading and trailing space (space). For many languages, significant leading and trailing space from the original should be preserved. A heuristic is used to determine when leading or trailing space is significant. Only languages explicitly specified internally are checked for this.

  • Docbook validity (docbook). Docbook is actually not used as a source format anywhere in Wesnoth, but the Wesnoth manual is converted into Docbook specifically to facilitate translation (weird as it may sound).

  • Man page validity (man).
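
The interp check described above can be approximated in a few lines of Python. This is only a sketch, not the code used by check-tp-wesnoth: the function name missing_interpolations is hypothetical, the pattern assumes that an interpolation is a $ followed by a plain variable name (real WML interpolations can be more elaborate), and only interpolations missing from the translation are reported.

```python
import re

# Assumption: an interpolation is "$" followed by a plain variable name.
_INTERP_RX = re.compile(r"\$(\w+)")
# Matches the text of an "ignore-interpolations:" translator comment.
_IGNORE_RX = re.compile(r"^\s*ignore-interpolations:(.*)$")


def missing_interpolations(msgid, msgstr, comments=()):
    """Return sorted names interpolated in msgid but absent from msgstr.

    comments holds translator comment texts without the leading "#".
    """
    ignored = set()
    for comment in comments:
        m = _IGNORE_RX.match(comment)
        if m:
            # Names are listed space-separated, without the "$" character.
            ignored.update(m.group(1).split())
    wanted = set(_INTERP_RX.findall(msgid))
    found = set(_INTERP_RX.findall(msgstr))
    return sorted(wanted - found - ignored)
```

Calling it with a translation that drops $side_number returns that name, unless the name is listed in an ignore-interpolations comment.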

3.5.12. collect-pmap

Property maps (or pmaps for short) are one way in which arbitrary properties of language phrases can be defined for use in scripted translations, such as provided by Transcript, the translation scripting system in KDE 4.

A property map is a text file with a number of entries, each defining the properties of a certain phrase. A pmap entry starts with one or more keys and continues with arbitrary number of key-value properties. An example entry would be grammar declinations of a noun:

=/Athens/Atina/nom=Atina/gen=Atine/dat=Atini/acc=Atinu//

The first two characters define, in order, the key-value separator (here =) and the property separator (here /) for the current entry. The two separators can be any non-alphanumeric characters, and must be different. Then follows a number of entry keys, delimited by property separators, and then a number of key-value properties, each internally delimited by the key-value separator. The entry is terminated by a double property separator. Properties of an entry can be fetched in the translation scripting system by any of the entry keys; keys are case- and whitespace-insensitive.
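
The entry format can be illustrated with a short Python sketch. It is not the parser used by Pology or Transcript; the function name parse_pmap_entry is hypothetical, and normalizing keys by lowercasing and dropping whitespace is an assumption about how case- and whitespace-insensitivity might be realized.

```python
def parse_pmap_entry(entry):
    """Split a raw pmap entry into (keys, properties)."""
    kvsep, psep = entry[0], entry[1]
    if kvsep.isalnum() or psep.isalnum() or kvsep == psep:
        raise ValueError("separators must be distinct and non-alphanumeric")
    body = entry[2:]
    end = body.find(psep * 2)
    if end < 0:
        raise ValueError("entry is not terminated by a double separator")
    keys, props = [], {}
    for field in body[:end].split(psep):
        if kvsep in field:
            # A field with the key-value separator is a property.
            name, value = field.split(kvsep, 1)
            props[name] = value
        else:
            # Otherwise it is an entry key; normalize it (assumption).
            keys.append("".join(field.split()).lower())
    return keys, props
```

For the Athens entry above, this yields the keys athens and atina, and a dictionary with the four case forms.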

collect-pmap will parse pmap entries from manual comments in messages, collect them, and write out a property map file. It is not necessary to explicitly specify entry keys, since the contents of msgid and msgstr are automatically added as keys. Since each manual comment is one line, it is also allowed to drop the final double separator which would normally terminate the entry. The above example would thus look like this in a PO message:

# pmap: =/nom=Atina/gen=Atine/dat=Atini/acc=Atinu/
msgctxt "Greece/city"
msgid "Athens"
msgstr "Atina"

The manual comment starts with the pmap: keyword, which is followed by a normal pmap entry, except for missing keys (but additional keys can be specified when msgid and msgstr are not sufficient). It is also possible to split the entry into several comments, with the only condition that all share the same set of separators:

# pmap: =/nom=Atina/gen=Atine/
# pmap: =/dat=Atini/acc=Atinu/

After collecting pmap entries from all processed PO files, if two or more entries end up having the same keys, they are all removed from the collection and a warning is reported.

Pmap entries are collected only from translated, non-plural messages.
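
As a rough illustration of the collection step for a single message, the following Python sketch gathers properties from one or more pmap: comments and uses msgid and msgstr as the entry keys. The function name collect_pmap and its signature are hypothetical, extra keys are simply ignored here, and the real sieve performs more checks than shown.

```python
def collect_pmap(comments, msgid, msgstr, prefix="pmap:"):
    """Gather pmap properties from the manual comments of one message.

    comments holds comment texts without the leading "#".
    """
    props = {}
    seps = None
    for comment in comments:
        text = comment.strip()
        if not text.startswith(prefix):
            continue
        text = text[len(prefix):].strip()
        if seps is None:
            seps = text[:2]
        elif text[:2] != seps:
            raise ValueError("all comments must share the same separators")
        kvsep, psep = seps
        for field in text[2:].split(psep):
            if kvsep in field:
                name, value = field.split(kvsep, 1)
                props[name] = value
    # msgid and msgstr are added automatically as entry keys.
    return [msgid, msgstr], props
```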

Parameters:

outfile:file

File path into which the property map should be written. If not given, nothing is written out; this is useful for validating entries.

propcons:file

Path to the file which defines constraints on property keys and values, used to validate parsed entries (see Section 3.5.12.2, “Validating Entries”).

extrakeys

By default, it is actually not possible to add any additional entry keys besides the automatically added msgid and msgstr. This gives extra safety against errors, such as translator mistyping the key-value pair. If additional keys are actually needed, this parameter can be issued to accept them.

derivs:file

Path to the file which defines derivators for synder entries (see Section 3.5.12.1, “Derivating Entries”).

pmhead:string

The default pmap: entry prefix may not be the most convenient; for example, when the language of translation is not written with Latin script. This parameter makes it possible to use an arbitrary string for the entry prefix.

sdhead:string

Like pmhead, but for prefix to synder entries, instead of the default synder: (see Section 3.5.12.1, “Derivating Entries”).

3.5.12.1. Derivating Entries

There is another, more succinct way to define pmap entries in comments. Instead of writing out all key-value combinations, it is possible to generate them by using syntagma derivators (or synders for short). From the earlier example:

# pmap: =/nom=Atina/gen=Atine/dat=Atini/acc=Atinu/

it can be observed that each form has the same root, Atin, followed by the appropriate ending for that form type. This makes it convenient to reformulate it as a syntagma derivation:

# synder: Atin|a

Here |a is a derivator; all derivators are defined in a separate synder file (with .sd extension by convention) and made known to the sieve through the derivs parameter. The derivator in this example would be defined like this:

|a: nom=a, gen=e, dat=i, acc=u

First comes the derivator name, starting with | and ending with :, and then the comma-separated list of key-value pairs similar to those in the pmap entry, except that now only the endings for the given form are specified. Synders are actually a standalone subsystem of Pology, see Section 8.6, “Syntagma Derivation” for all details.
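
The idea behind this particular derivation can be sketched in Python as follows. The sketch covers only the simple case shown above (one root, one derivator, plain endings) and none of the additional synder syntax; the function names parse_derivator and expand_synder are hypothetical.

```python
def parse_derivator(line):
    """Parse '|a: nom=a, gen=e' into ('|a', {'nom': 'a', 'gen': 'e'})."""
    name, _, body = line.partition(":")
    endings = {}
    for pair in body.split(","):
        key, _, ending = pair.strip().partition("=")
        endings[key] = ending
    return name.strip(), endings


def expand_synder(entry, derivators):
    """Expand 'Atin|a' into full forms using the named derivator."""
    root, bar, tail = entry.partition("|")
    endings = derivators[bar + tail]
    # Each form is the root followed by the ending for that form.
    return {key: root + ending for key, ending in endings.items()}
```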

It is possible to mix pmap (# pmap: ...) and synder (# synder: ...) entries in translator comments. For example, synder entries may be used to cover majority of cases, which follow the general language rules, while pmap entries can be used for exceptions.

On the other hand, every pmap entry can be reformulated as a synder entry which does not refer to an external derivator:

# synder: nom=Atina, gen=Atine, dat=Atini, acc=Atinu

This begs the question of what is the need for pmap entries at all, if synder entries can be used in the same capacity and beyond? Pmap entries are still useful because synders have a lot of special syntax and rules to keep in mind (e.g. what if the phrase itself contains a comma?), while raw pmaps have none past what was described above.

3.5.12.2. Validating Entries

The propcons parameter can be used to specify a file which defines constraints on acceptable property keys, and on values by each key. Its format is the following:

# Full-line comment.
/key_regex_1/value_regex_1/flags # a trailing comment
/key_regex_2/value_regex_2/flags
:key_regex_3:value_regex_3:flags # different separator
# etc.

Regular expressions for keys and values are delimited by a separator defined by the first non-whitespace character in the line, which must also be non-alphanumeric. Before being compiled, regular expressions are automatically wrapped as ^(regex)$, so that an expression to require a certain prefix is given as prefix.* and a suffix as .*suffix. A property key must match one of the key regexes, or else it is considered invalid. The value of that property must then match the value regexes attached to all matched key regexes.

For example, a constraint file defining no constraints on either property keys or values is:

/.*/.*/

while a file explicitly listing all allowed property keys, and constraining values to some of them, would be:

/nom|gen|dat|acc/.*/
/gender/m|f|n/
/number/s|p/

The last separator in the constraint can be followed by a string of single-character flags. These flags are currently defined:

  • i: case-insensitive matching for the value.

  • I: case-insensitive matching for the key.

  • t: the value must both match the regular expression and be equal to msgstr. If i flag is added too, equality check is also case-insensitive.

  • r: regular expression for the key must match at least one key among all defined properties.

Constraint definition file must be encoded with UTF-8.
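
The matching rules of the constraint file can be sketched in Python like this. It is a simplified illustration with hypothetical function names: it handles comments, the per-line separator, the ^(regex)$ wrapping and the i and I flags, but it does not implement the t and r flags.

```python
import re


def parse_constraints(lines):
    """Compile constraint lines into (key_rx, value_rx) pairs."""
    constraints = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        sep = line[0]
        key_pat, value_pat, rest = line[1:].split(sep, 2)
        flags = rest.split("#", 1)[0].strip()  # drop a trailing comment
        key_rx = re.compile("^(%s)$" % key_pat, re.I if "I" in flags else 0)
        value_rx = re.compile("^(%s)$" % value_pat, re.I if "i" in flags else 0)
        constraints.append((key_rx, value_rx))
    return constraints


def is_valid(key, value, constraints):
    """A key must match some key regex; the value must match all attached."""
    matched = [vrx for krx, vrx in constraints if krx.match(key)]
    return bool(matched) and all(vrx.match(value) for vrx in matched)
```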

3.5.13. diff-previous

When PO files are merged with the --previous option to msgmerge, fuzzy messages will retain the previous version of original text (msgctxt, msgid and msgid_plural) under #| comments. Then diff-previous can be used to embed differences from previous to current original into previous original strings. For example, the message:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

will become after sieving:

#: main.c:110
#, fuzzy
#| msgid "{-The Record-}{+Records+} of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

Text editors may even provide highlighting for the wrapped difference segments (e.g. Kwrite/Kate).
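
To show how such wrapped segments can arise, here is a word-level sketch in Python based on the standard difflib module. It is not the differencing algorithm used by diff-previous, which may split and group segments differently; the function name embed_diff is hypothetical.

```python
from difflib import SequenceMatcher


def embed_diff(old, new):
    """Wrap removed words in {-...-} and added words in {+...+}."""
    old_words, new_words = old.split(), new.split()
    pieces = []
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = " ".join(old_words[i1:i2])
        added = " ".join(new_words[j1:j2])
        if tag == "equal":
            pieces.append(added)
        else:
            piece = ""
            if removed:
                piece += "{-" + removed + "-}"
            if added:
                piece += "{+" + added + "+}"
            pieces.append(piece)
    return " ".join(pieces)
```

Because the sketch works on whitespace-separated words, original spacing other than single spaces is not preserved.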

This sieve is very useful if your PO editor does not show differences in the original by itself. To be able to easily see exactly what was changed in the original is important both for efficiency and for quality. Think of a long paragraph in which only one word was changed: without a diff it will take you time to reread it, and you may even miss that changed word.

Parameters:

strip

Instead of embedding diffs, remove them from messages, recovering the original form of previous strings. This is useful if you did not update all fuzzy messages but you anyway want to send the PO file away (commit it to the repository, etc.).

branch:branch

Embed diffs only into messages from given branch (summit). Several branches may be given as comma-separated list.

3.5.14. empty-fuzzies

For every fuzzy message, empty-fuzzies removes the translation and fuzzy data (the fuzzy flag, previous strings). Translator comments are kept by default, but they can be removed as well. Obsolete fuzzy messages are completely removed.

Parameters:

rmcomments

Also remove translator comments from fuzzy messages.

noprev

Empty only those fuzzy messages which do not have previous strings (i.e. when the PO file was merged without --previous option to msgmerge).

3.5.15. equip-header-tp-kde

equip-header-tp-kde applies the kde%header/equip-header hook to headers of PO files within the KDE Translation Project.

There are no parameters.

3.5.16. fancy-quote

Ordinary ASCII quotes are easy to type on most keyboard layouts, and these quotes are frequently encountered in non-typeset English texts, rather than proper English quotes. These proper quotes are sometimes called "fancy" quotes. When translating from English, translators can thus be easily moved to use ASCII quotes themselves, instead of the fancy quotes appropriate for their language. To somewhat correct this, fancy-quote can be used to replace ASCII quotes in the translation with selected pairs of fancy quotes.

ASCII quotes that are part of text markup (e.g. attribute values in XML-like tags) must not be replaced, and this sieve will use heuristics to determine such places. In fact, it will replace quotes rather conservatively. Nevertheless, unless some sort of automatic validation is available, converted text should be manually inspected for correctness.

Parameters:

single:quotes

Opening and closing quote to replace ASCII single quotes (i.e. quotes is a two-character string). If not given, single quotes are not replaced (but see the longsingle parameter).

double:quotes

Opening and closing quote to replace ASCII double quotes. If not given, double quotes are not replaced (but see the longdouble parameter).

longsingle:open,close

Alternative to single, if opening and closing quotes are not single characters. The value is the opening quote string and the closing quote string, separated by comma.

longdouble:open,close

Alternative to double, if opening and closing quotes are not single characters.
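
A very conservative replacement of double quotes could look like the Python sketch below. It is not the heuristic used by fancy-quote: it merely leaves anything inside XML-like tags alone and replaces only quote pairs that lie entirely within one stretch of text between tags. The function name fancy_double_quotes is hypothetical.

```python
import re

# Splitting on a capturing group keeps the tags in the result list.
_TAG_RX = re.compile(r"(<[^>]*>)")


def fancy_double_quotes(text, quotes):
    """Replace paired ASCII double quotes outside of tags.

    quotes is a two-character string: opening and closing quote.
    """
    opening, closing = quotes
    parts = _TAG_RX.split(text)
    for i in range(0, len(parts), 2):  # even indices are outside tags
        parts[i] = re.sub(
            r'"([^"]*)"',
            lambda m: opening + m.group(1) + closing,
            parts[i],
        )
    return "".join(parts)
```

Quotes that open before a tag and close after it are left untouched by this sketch, so the result should still be inspected manually.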

3.5.17. find-messages

find-messages is the search and replace workhorse of Pology. It applies one or several conditions to different parts of the PO message, with selectable boolean linking between them. If the message is matched as whole, it is reported and possibly some replacements are done. Messages are by default reported to standard output, with full location reference (PO file path, line and entry number), but can also be opened directly in one of supported PO editors (see Section 9.7.1, “PO Editors”).

When used in a sieve chain, find-messages will stop further sieving of messages which did not satisfy the conditions. This makes it useful as a filter for selecting subsets of messages on which other sieves should operate.

There are three logical groups of parameters: matching parameters, replacement parameters, and general parameters. Matching and replacement parameters have certain relationships between themselves, while general parameters have mutually independent effects (i.e. as usual for sieve parameters).

3.5.17.1. Matching Parameters

Matching parameters specify patterns for matching by parts of the message, or represent binary conditions (whether the message is translated, etc.). For example:

$ posieve find-messages -s msgid:'foo bar'

will report all messages which contain the phrase "foo bar" in their msgid (or msgid_plural) string. When several matching parameters are given, by default the message is matched if all patterns match; that is, boolean linking of conditions is AND. This:

$ posieve find-messages -s msgid:'foo bar' -s transl

will report all messages that contain "foo bar" in original and are translated. Boolean linking can be switched to OR by issuing the or parameter. To find all messages that contain the word "tooltip" in either context or comments:

$ posieve find-messages -s msgctxt:tooltip -s comment:tooltip -s or

(Actually, the effect of or is somewhat more specific, see its description below.) String matching is by default case insensitive, which can be changed globally by issuing the case parameter.

Every matching parameter has a negative counterpart, named by prepending n to the original parameter, which matches when the original parameter does not. Running:

$ posieve find-messages -s msgid:'hello' -s nmsgstr:'zdravo'

would find all messages that contain "hello" in the original and do not contain "zdravo" in the translation (a typical usage pattern in quick terminology checks).

To find all messages not matching a set of conditions, in principle it would be possible to negate the whole condition set by switching between positive/negative parameters and AND/OR-linking, but this can be cumbersome. Instead, the invert parameter can be issued to report messages that are not matched by the condition set.

Sometimes neither simple AND nor simple OR boolean linking is sufficient to form the search. Therefore the fexpr parameter is provided, which can be used to specify a search expression with explicit boolean operators and parentheses for controlling the evaluation order. With fexpr, the previous example could be reformulated as:

$ posieve find-messages -s fexpr:'msgid/hello/ and not msgstr/zdravo/'

For details, see the description of fexpr below.

Currently defined matching parameters:

(n)msgctxt:regex

Regular expression to match the msgctxt string.

(n)msgid:regex

Regular expression to match the msgid and msgid_plural strings. The condition is satisfied as whole if either of these strings matches.

(n)msgstr:regex

Regular expression to match msgstr strings. The condition is satisfied as whole if any of the msgstr strings matches.

(n)comment:regex

Regular expression to match extracted and translator comments and source reference comments. The condition is satisfied as whole if any of these comments matches.

(n)flag:regex

Regular expression to match flags. This matches each flag in turn, and not the flag comment as a monolithic string. The condition is satisfied as whole if any flag matches.

(n)transl

The message must be translated.

(n)obsol

The message must be obsolete.

(n)active

The message must be active, i.e. translated and not obsolete.

(n)plural

The message must be a plural message.

(n)maxchar:number

Original and translation can have at most this many characters. The condition is satisfied as whole if all these strings satisfy it.

(n)lspan:start:end

The referent line number of the message (the line in which its msgid string starts) must fall within given range. The starting number is included in the range, the ending number is not.

(n)espan:start:end

Like lspan, but instead of line numbers it applies to entry numbers. These are the numbers that dedicated PO editors usually report in their user interfaces.

(n)branch:branch

The message must belong to this branch (summit). Several branches may be given as comma-separated list.

(n)fexpr:expression

Boolean expression with explicit boolean operators and parentheses for priority, constructed out of any of the other matching parameters. If a match parameter needs a value (like a regular expression), in the expression it is given as match/value/, where any nonalphanumeric character can be used consistently instead of / (in case the value itself contains /). For example, the expression:

fexpr:'(msgctxt/foo/ or comment/foo/) and msgid/bar/'

is satisfied if either the context or comments contain "foo", and the original text contains "bar".

If matching is influenced by a general parameter (e.g. case sensitivity), in the expression it may be able to take overriding modifiers in form of single characters after the value, i.e. match/value/modifiers. Assuming that case parameter has not been issued, the expression:

fexpr:'msgid/quuk/ and msgstr/Qaak/c'

will be satisfied if the original text contains "quuk" in any casing, and translation contains exactly "Qaak". Currently available modifiers are:

  • c: matching is case-sensitive.

  • i: matching is case-insensitive. May be needed when string matching is globally case-sensitive due to case being issued.
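
How a single match/value/modifiers term might be taken apart and evaluated is sketched below in Python. The sketch handles one term only, not a full expression with operators and parentheses, and it is not the parser used by find-messages; the function name match_term and the dictionary representation of a message are assumptions made for illustration.

```python
import re

# name, one non-alphanumeric delimiter, value, same delimiter, modifiers
_TERM_RX = re.compile(r"(\w+)(\W)(.*)\2(\w*)")


def match_term(term, message, case_sensitive=False):
    """Evaluate a term like 'msgstr/Qaak/c' against a dict of strings."""
    m = _TERM_RX.fullmatch(term)
    if not m:
        raise ValueError("malformed term: %r" % term)
    field, _, pattern, modifiers = m.groups()
    # Modifiers override the global case sensitivity for this term only.
    if "c" in modifiers:
        case_sensitive = True
    elif "i" in modifiers:
        case_sensitive = False
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, message.get(field, ""), flags) is not None
```

Terms evaluated this way can then be combined with ordinary Python and, or and not to mimic a complete expression.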

3.5.17.2. Replacement Parameters

Replacement is done in pair with matching the appropriate string in the message. For example, to replace each appearance of "foobar" with "fumbar" in translation, this would be run:

$ posieve find-messages -s msgstr:foobar -s replace:fumbar

The replace parameter works in pair with msgstr, i.e. replace cannot be issued without issuing msgstr as well. There are two possible problems with replacement as straightforward as this. The first is that if "foobar" was a whole word (or start of a word), and this word in the text started with upper-case letter, the replacement would make it lower-case. This can be avoided by executing replacement twice with case sensitivity:

$ posieve find-messages -s msgstr:foobar -s replace:fumbar -scase
$ posieve find-messages -s msgstr:Foobar -s replace:Fumbar -scase

The other problem is if the word is split by an accelerator marker, for example:

msgstr "... f_oobar ..."

The search may still find the word (see the accel parameter below), but direct replacement would cause the loss of accelerator marker, and therefore it is not done.[8] To see such cases, you should monitor the output of find-messages (always a good idea when doing batch replacement), where matched and replaced parts of the text will be highlighted.
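
The following Python sketch illustrates why the search can succeed while direct replacement is unsafe: the pattern is applied to a copy of the text with accelerator markers removed, so the match positions no longer correspond to the original string. The function name find_ignoring_accel is hypothetical and the code is only an approximation of what find-messages does.

```python
import re


def find_ignoring_accel(pattern, text, accel="_"):
    """Search text with accelerator markers removed; return the match."""
    plain = text
    for marker in accel:
        plain = plain.replace(marker, "")
    # The returned match refers to positions in the stripped copy,
    # not in the original text that still holds the marker.
    return re.search(pattern, plain, re.IGNORECASE)
```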

As usual for replacement based on regular expression, the replacement string may contain \number references to groups defined in the matching pattern. For example, the previous example of case-aware replacement could be more efficiently and more elegantly performed with:

$ posieve find-messages -s msgstr:'(f)oobar' -s replace:'\1umbar'

(Though this is possible only if the original and the replacement start with the same letter.)
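
The same group-reference mechanism is available in Python's re module, which makes it easy to try out a pattern before running it over PO files. The function name below is hypothetical; the matching is made case-insensitive to mirror the default behavior of find-messages.

```python
import re


def replace_keep_first(text):
    """Replace 'foobar' with 'fumbar', keeping the case of the first letter."""
    # \1 re-inserts whatever the group (f) actually matched: "f" or "F".
    return re.sub(r"(f)oobar", r"\1umbar", text, flags=re.IGNORECASE)
```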

Currently defined replacement parameters:

replace:string

The string to replace the match by msgstr parameter. Can contain regular expression group references.

3.5.17.3. General Parameters

Parameters influencing general behavior of find-messages are as follows:

or

Boolean OR instead of AND linking of conditions, but only for string matchers: msgctxt, msgid, msgstr, comment. This restriction may seem odd, but it is what is mostly needed in practice. For example, the set of conditions:

-s msgctxt:tooltip -s comment:tooltip -s transl -s or

would match all translated messages which have "tooltip" in context or in comments, and not messages which are either translated or have "tooltip" in context or in comments. For full control over the expression, use the fexpr parameter.

invert

Inverts the selection: messages satisfying the condition set are not selected.

accel:characters

Characters to consider as accelerator markers, to remove before applying matching patterns. If not given, they may be read from PO files (see X-Accelerator-Marker in Section 9.9, “Influential Header Fields”).

case

Matching patterns for strings and comments are by default case-insensitive, and this parameter switches them to case-sensitive.

mark

To each selected message a match flag is added, modifying the PO file. Modified files can then be opened in the editor, and selected messages looked up by this flag. This is typically done when something should be modified in selected messages, but doing that automatically (using replace parameter) is not possible or safe enough. Also useful here is the option -m/--output-modified of posieve, to write out the paths of modified PO files into a separate file, which can then be fed to the editor.

filter:hookspec

The hook to modify the translation before applying the msgstr matcher to it. The hook type must be F1A. The parameter can be repeated to add several hooks.

nomsg

Do not report selected messages, either to standard output or to PO editors. Useful when find-messages is a pre-filter in the sieve chain.

lokalize

Open the PO file on selected messages in Lokalize (unless nomsg is in effect). Lokalize must be already running with the project that contains the PO file opened.
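
The restricted effect of the or parameter, as described above, can be modeled with a small Python sketch: string matchers are linked with OR among themselves, and the result is still AND-linked with other conditions such as transl. The function name is_selected and the dictionary representation of a message are assumptions for illustration only.

```python
import re

STRING_MATCHERS = ("msgctxt", "msgid", "msgstr", "comment")


def is_selected(message, patterns, must_be_translated=False, use_or=False):
    """Apply string patterns (AND or OR linked) plus a translated check."""
    hits = [
        re.search(rx, message.get(field, ""), re.IGNORECASE) is not None
        for field, rx in patterns.items()
        if field in STRING_MATCHERS
    ]
    link = any if use_or else all
    strings_ok = link(hits) if hits else True
    # The binary condition is always AND-linked with the string result.
    binary_ok = message.get("translated", False) or not must_be_translated
    return strings_ok and binary_ok
```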

3.5.18. generate-xml

generate-xml creates a partial XML representation of a group of PO files.

The output XML format is as follows. Each PO file in the group is represented by a <po> element, which contains a list of <msg> elements, one for each message. The <msg> element contains the usual parts of a PO message:

  • <line>: referent line number of the message

  • <refentry>: referent entry number of the message

  • <status>: current status of the message (obsolete, translated, untranslated, fuzzy)

  • <msgid>: the original text

  • <msgstr>: the translation

  • <msgctxt>: disambiguating context

If the PO message contains plural forms, they will be represented with <plural> subelements of <msgstr>.
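
For orientation, the structure described above can be produced with Python's standard xml.etree.ElementTree module as in the sketch below. The function name po_to_xml and the dictionary representation of messages are assumptions; attributes and other details of the actual generate-xml output are not reproduced here.

```python
import xml.etree.ElementTree as ET


def po_to_xml(messages):
    """Build a <po> element with one <msg> child per message dict."""
    po = ET.Element("po")
    for message in messages:
        msg = ET.SubElement(po, "msg")
        for name in ("line", "refentry", "status", "msgid"):
            ET.SubElement(msg, name).text = str(message.get(name, ""))
        msgstr = ET.SubElement(msg, "msgstr")
        forms = message["msgstr"]  # list with one or more translations
        if len(forms) > 1:
            # Plural forms become <plural> subelements of <msgstr>.
            for form in forms:
                ET.SubElement(msgstr, "plural").text = form
        else:
            msgstr.text = forms[0]
        ET.SubElement(msg, "msgctxt").text = str(message.get("msgctxt", ""))
    return ET.tostring(po, encoding="unicode")
```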

Parameters:

xml:file

By default the XML content is written to standard output, and this parameter can be used to send it to a file instead.

translatedOnly

Only translated messages are exported to XML (i.e. fuzzy, untranslated and obsolete are ignored).

3.5.19. merge-corr-tree

When doing corrections on a copy of PO files tree, it is not possible to easily merge back just the updated translations, because word wrapping in PO file can be different, generating much more difference than it should.

Additionally, tools like pogrep from Translate Toolkit will create new partial tree as output, containing matched messages only. merge-corr-tree will help you to merge changes made in that partial tree back into the main tree.

The main PO files tree is the input, and the pathdelta parameter is used to provide the path difference to where the partial correction tree is located.

Parameters:

pathdelta:search:replace

Specifies that the partial tree is located at path obtained when search is replaced with replace in the input path.

3.5.20. normalize-header

normalize-header applies the normalize/canonical-header hook to PO file headers.

There are no parameters.

3.5.21. normctxt-delim

In older PO files, disambiguating contexts may be embedded into msgid strings, as the initial part of the string delimited from the actual text with predefined substrings, here called the "head" and the "tail". For example, in:

msgid ""
"_:this-is-context\n"
"This is original text"
msgstr "This is translated text"

the head is the underscore-colon sequence (_:), and the tail the newline (\n). normctxt-delim will convert embedded contexts of the delimiter-type to proper msgctxt strings.
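
The conversion amounts to cutting the msgid string at the head and the tail. A minimal Python sketch, with a hypothetical function name and without any of the handling of plural or fuzzy messages that the sieve may perform, could be:

```python
def split_delimited_context(msgid, head="_:", tail="\n"):
    """Return (msgctxt, msgid); msgctxt is None if no context is embedded."""
    if not msgid.startswith(head):
        return None, msgid
    end = msgid.find(tail, len(head))
    if end < 0:
        return None, msgid
    # Context lies between head and tail, the text follows the tail.
    return msgid[len(head):end], msgid[end + len(tail):]
```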

Parameters:

head:string

The head of the delimiter-type embedded context.

tail:string

The tail of the delimiter-type embedded context.

3.5.22. normctxt-sep

In older PO files, disambiguating contexts may be embedded into msgid strings, as the initial part of the string separated from the actual text by a predefined substring. For example, in:

msgid "this-is-context|This is original text"
msgstr "This is translated text"

the separator string is the pipe character (|). normctxt-sep will convert embedded contexts of the separator-type to proper msgctxt strings.
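
In Python terms, the conversion is a split at the first occurrence of the separator. The sketch below uses a hypothetical function name and ignores any special cases that the sieve itself may handle:

```python
def split_separated_context(msgid, sep="|"):
    """Return (msgctxt, msgid); msgctxt is None if sep does not occur."""
    ctxt, found, text = msgid.partition(sep)
    if not found:
        return None, msgid
    return ctxt, text
```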

Parameters:

sep:string

The string that separates the context and the text in separator-type embedded context.

3.5.23. remove-fuzzy-comments

Being translator's input, translator comments are copied verbatim to fuzzy messages created on merging with template. Depending on the purpose of translator comments (e.g. see Section 9.11, “Skipping and Selecting Checks” for some special types), it may be better to automatically remove some of them from fuzzy messages (and then possibly add them back manually when updating the translation). If run without any parameters, remove-fuzzy-comments will do nothing, so one or more parameters need to be given to actually remove any comment.

Parameters:

all

Simply all translator comments in fuzzy messages are removed.

nopipe

Translator comments containing translator flags (see Section 9.11, “Skipping and Selecting Checks”) are removed.

pattern:regex

Translator comment must match the given regular expression to be removed.

exclude:regex

Translator comment is removed if it does not match the given regular expression.

case

Matching patterns are by default case-insensitive, and this parameter switches to case-sensitivity.

When several removal criteria are specified, first those other than pattern and exclude are applied in unspecified order, then the pattern match, and finally the exclude match.

3.5.24. remove-obsolete

remove-obsolete simply removes all obsolete messages, whether fuzzy or translated, from the PO file.

There are no parameters.

3.5.25. remove-previous

remove-previous removes previous strings, i.e. #| ... comments, from messages.

Parameters:

all

Previous strings are by default removed only from non-fuzzy messages. This parameter specifies to remove previous strings from all messages, including fuzzy.

3.5.26. resolve-aggregates

In its default mode of operation, msgcat(1) produces an aggregate message when in different catalogs it encounters a message with the same key but different translation or translator or extracted comments. A general aggregate message looks like this:

# #-#-#-#-#  po-file-name-1 (project-version-id-1)  #-#-#-#-#
# manual-comments-1
# #-#-#-#-#  po-file-name-2 (project-version-id-2)  #-#-#-#-#
# manual-comments-2
# ...
# #-#-#-#-#  po-file-name-n (project-version-id-n)  #-#-#-#-#
# manual-comments-n
#. #-#-#-#-#  po-file-name-1 (project-version-id-1)  #-#-#-#-#
#. automatic-comments-1
#. #-#-#-#-#  po-file-name-2 (project-version-id-2)  #-#-#-#-#
#. automatic-comments-2
#. ...
#. #-#-#-#-#  po-file-name-n (project-version-id-n)  #-#-#-#-#
#. automatic-comments-n
#: source-refs-1 source-refs-2 ... source-refs-n
#, fuzzy, other-flags
msgctxt "context"
msgid "original-text"
msgstr ""
"#-#-#-#-#  po-file-name-1 (project-version-id-1)  #-#-#-#-#\n"
"translated-text-1\n"
"#-#-#-#-#  po-file-name-2 (project-version-id-2)  #-#-#-#-#\n"
"translated-text-2\n"
"..."
"#-#-#-#-#  po-file-name-n (project-version-id-n)  #-#-#-#-#\n"
"translated-text-n"

Each message part is aggregated only if it differs in at least one message in the group. For example, extracted comments may be aggregated while translations are not.

resolve-aggregates is used to resolve aggregate messages of this kind into normal messages, by picking one variant from each aggregated part.

Parameters:

first

By default, the picked variant is the one with the most occurrences, or the first of several with the same number of occurrences. If this parameter is issued, the first variant is picked unconditionally.

unfuzzy

Aggregated messages are always made fuzzy, leaving no way to determine if and which of the original messages were fuzzy. Therefore, by default, the resolved message is left fuzzy too. If, however, it is known beforehand that none of the original messages were fuzzy, resolved messages can be unfuzzied by issuing this parameter.

keepsrc

Since there is no information based on which the aggregated source references can be split into originating groups, they are entirely removed unless this parameter is issued.
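The default picking rule can be illustrated with a small Python sketch. It is not the sieve's actual code: the function names are hypothetical, and the header pattern is simply read off the aggregate layout shown above.

```python
import re
from collections import Counter

# Header line that precedes each variant in an aggregated part
# (pattern inferred from the aggregate message layout; an assumption).
HEADER_RX = re.compile(r"^#-#-#-#-#\s+.*?\s+#-#-#-#-#$")


def split_variants(aggregated):
    """Return the list of variants found in an aggregated text."""
    variants, current = [], None
    for line in aggregated.split("\n"):
        if HEADER_RX.match(line):
            if current is not None:
                variants.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        variants.append("\n".join(current))
    return variants


def pick_variant(aggregated, first=False):
    """Pick the most frequent variant, or the first one if first=True."""
    variants = split_variants(aggregated)
    if not variants:
        return aggregated  # not an aggregate, leave untouched
    if first:
        return variants[0]
    counts = Counter(variants)
    best = max(counts.values())
    # Ties go to the variant that appears first.
    return next(v for v in variants if counts[v] == best)
```

With three variants "foo", "bar", "bar", pick_variant returns "bar" by default and "foo" when first=True.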

3.5.27. resolve-alternatives

resolve-alternatives resolves alternatives directives found in the translation into one of the alternatives.

An alternative directive is a substring of the form ~@/.../.../..., for example:

msgstr "I see a ~@/pink/white/ elephant."

~@ is the directive head, which is followed by a character that defines the delimiter of alternatives (can be arbitrary), and then by the alternatives themselves. The number of alternatives per directive is not defined by the directive itself, but is provided as a sieve parameter (i.e. all alternatives directives must have the same number of alternatives).

Parameters:

alt:N,Mt

Specifies how to resolve alternatives. N is the index (starting from 1) of the alternative to take from each directive, and M is the number of alternatives per directive. Example: alt:1,2t.

If an alternatives directive is invalid (e.g. too few alternatives), it is reported to standard output. If at least one alternatives directive in the text is not valid, the text is not modified.
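The following Python sketch illustrates the directive syntax and the all-or-nothing behavior described above. The function name and return convention are hypothetical and do not reflect the sieve's real implementation.

```python
HEAD = "~@"


def resolve_alternatives(text, select, total):
    """Resolve every ~@ directive to its select-th alternative (1-based).

    Returns (resolved_text, ok). If any directive is malformed, the
    original text is returned unchanged and ok is False.
    """
    out = []
    pos = 0
    while True:
        start = text.find(HEAD, pos)
        if start < 0:
            out.append(text[pos:])
            return "".join(out), True
        out.append(text[pos:start])
        delim_pos = start + len(HEAD)
        if delim_pos >= len(text):
            return text, False  # directive head with nothing after it
        delim = text[delim_pos]  # the delimiter is whatever follows the head
        alts = []
        cur = delim_pos + 1
        for _ in range(total):
            end = text.find(delim, cur)
            if end < 0:
                return text, False  # too few alternatives
            alts.append(text[cur:end])
            cur = end + 1
        out.append(alts[select - 1])
        pos = cur
```

For the message above, select=1 and total=2 (as in alt:1,2t) give "I see a pink elephant.", while total=3 leaves the text unmodified because the directive has too few alternatives.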

3.5.28. resolve-entities

XML entities are substrings of the form &entityname;, typically encountered in XML-like text markups, but elsewhere too. They are resolved into underlying, human-readable values at build time (when translated text documents are created) or at run time (in translated user interfaces). Sometimes it may be better to have them resolved already in the PO file itself, and that is what resolve-entities does.

Parameters:

entdef:file

Path to the file which contains entity definitions. It can be repeated to add several files.

Entity definition files are plain text files of the following format:

<!-- This is a comment. -->
<!ENTITY name1 'value1'>
<!ENTITY name2 'value2'>
<!ENTITY name3 'value3'>
...

ignore:entitynames

Entities which should be ignored during resolution. Standard XML entities (&lt;, &gt;, &apos;, &quot;, &amp;) are ignored by default.
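A minimal Python sketch of the idea follows: it reads definitions in the format shown above and replaces &name; references. The regular expressions and function names are assumptions made for illustration, not the sieve's actual parser.

```python
import re

COMMENT_RX = re.compile(r"<!--.*?-->", re.S)
ENTITY_DEF_RX = re.compile(r"""<!ENTITY\s+(\S+)\s+(['"])(.*?)\2\s*>""", re.S)
ENTITY_REF_RX = re.compile(r"&([\w.:-]+);")

# Standard XML entities, ignored by default as described above.
STANDARD = frozenset(["lt", "gt", "apos", "quot", "amp"])


def read_entities(definitions):
    """Parse <!ENTITY name 'value'> lines into a name -> value dictionary."""
    definitions = COMMENT_RX.sub("", definitions)
    return {m.group(1): m.group(3) for m in ENTITY_DEF_RX.finditer(definitions)}


def resolve_entities(text, entities, ignored=STANDARD):
    """Replace known &name; references; leave ignored and unknown ones."""
    def repl(match):
        name = match.group(1)
        if name in ignored or name not in entities:
            return match.group(0)
        return entities[name]
    return ENTITY_REF_RX.sub(repl, text)
```

Unknown entities are left in place here; how the real sieve reports them is not covered by this sketch.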

3.5.29. set-header

Sometimes a PO header field or comment needs to be updated in many PO files at once, and set-header serves that purpose.

Parameters for setting and removing header fields:

field:name:value

Set the field with given name to given value. This parameter can be repeated to set several fields in one run.

By default, field will actually set the field only if it is already present in the header. To add the field if not present, the create parameter must be issued as well. If the field is being added, parameters after and before can be used to specify where to insert it, or else the new field is appended at the end of the header. If the field is present but not positioned according to after and before, the reorder parameter can be issued to move the field within the header.

create

The field should be added if it is not present in the header.

after

When a field is added, it should be inserted after this field.

before

When a field is added, it should be inserted before this field.

reorder

If the field is present, but it is in the wrong place according to after and before, this parameter will cause it to be reinserted in proper place.

remove:field

Remove the field with this name. If there are several fields of that name, all are removed.

removerx:regex

Remove all fields matched by the given regular expression.

Parameters for setting and removing header comments:

title:value

Set the title comment to the given value. It can be repeated, since the title can be composed of multiple comment lines.

rmtitle

Remove title comments.

copyright:value

Set the copyright comment to the given value.

rmcopyright

Remove the copyright comment.

license:value

Set the license comment to the given value.

rmlicense

Remove the license comment.

author:value

Set the author comment to the given value. It can be repeated, since there may be more authors (i.e. translators).

rmauthor

Remove author comments.

comment:value

Set the free comment to the given value. It can be repeated, since there can be any number of free comment lines.

rmcomment

Remove free comments.

rmallcomm

Remove all header comments.

Note that all existing comments of the given type are removed before setting the new ones, i.e. the new comments are not appended to the existing ones. For example, if a single author parameter is issued, with a translator name and email address as value, this one translator will replace all existing translators in the header comments.

Comment values are checked for some minimal consistency, e.g. author comments must contain email addresses, licence comments the word "licence", etc.

Value strings (both of fields and comments) may contain %-directives, which are expanded to catalog-dependent substrings prior to setting the value. Currently available directives are:

  • %poname: PO domain name (equal to file name without .po extension)

If a literal % character is needed (e.g. when setting the Plural-Forms field), it can be escaped by doubling it, %%. The directive can also be given inside braces, as %{...}, when it would be ambiguous otherwise.
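The expansion rules can be sketched in Python as below. The function name is hypothetical, and raising an error on an unknown directive is an assumption of this sketch rather than documented behavior.

```python
import os
import re

# Matches %%, %{name}, or %name.
DIRECTIVE_RX = re.compile(r"%(%|\{(\w+)\}|(\w+))")


def expand_directives(value, po_path):
    """Expand %-directives in a header value for the given PO file."""
    poname = os.path.basename(po_path)
    if poname.endswith(".po"):
        poname = poname[:-len(".po")]
    known = {"poname": poname}

    def repl(match):
        if match.group(1) == "%":
            return "%"  # escaped literal percent
        name = match.group(2) or match.group(3)
        if name not in known:
            # Assumption: unknown directives are treated as errors.
            raise ValueError("unknown directive: %" + name)
        return known[name]

    return DIRECTIVE_RX.sub(repl, value)
```

The braces matter when a word character follows the directive: "%{poname}s" expands for frobaz.po to "frobazs", whereas "%ponames" would be read as a directive named ponames.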

3.5.30. stats

stats collects statistics on PO files, such as message and word counts, and more. Statistics can be presented in several ways and on several levels.

Parameters:

accel:characters

Characters to consider as accelerator markers, to remove them when splitting text to count words. If not given, they may be read from PO files (see X-Accelerator-Marker in Section 9.9, “Influential Header Fields”), or else some usual accelerator marker characters are removed.

detail

In table views, by default only message, word, and character counts are given. This parameter requests additional derived data, such as expansion factors (ratio of words in translation to words in original), number of words per message, etc.

incomplete

When run over a collection of PO files, all non-fully translated PO files are listed separately, with very brief statistics of incompleteness.

incompfile:file

Write a file with paths of all non-fully translated PO files, one per line. This file can then be fed with -f/--from-files back to posieve or another script, to process only incomplete PO files.

templates:search:replace

If there exist both a directory with translated PO files and a directory with POT (template) files, and not every POT file has a corresponding PO file, this parameter can be used to count POT files without a PO counterpart as fully untranslated in statistics. The parameter value consists of two strings separated by a colon: the first string is searched for in directory paths of processed PO files, and replaced with the second string to construct the corresponding directory paths of POT files. For example:

$ cd $MYTRANSLATIONS
$ ls
my_lang  templates
$ posieve stats -s templates:my_lang:templates my_lang/

minwords:number

Only messages with at least this many words (in any of original or translation strings) are counted into statistics.

maxwords:number

Only messages with at most this many words (in any of original or translation strings) are counted into statistics.

lspan:start:end

Only messages with referent line numbers (line number of msgid) in this range are counted into statistics. The starting line is included in the range, the ending line is not. If start is omitted (e.g. lspan::500) it is assumed 0, and if end is omitted (e.g. lspan:300 or lspan:300:) it is assumed the total number of lines.

espan:start:end

Only messages with entry numbers (as reported by PO editors) in this range are counted into statistics. Same boundary inclusion and omission rules as for lspan apply; e.g. espan:4:8 means to count messages with entry numbers 4, 5, 6, and 7.

branch:branch

Only messages from given branch are counted into statistics (summit). Several branches may be given as comma-separated list.

bydir

Statistics are broken down by directory; that is, a report is displayed for each group of PO files in the same directory (and not below it). This is more often used with bar displays than with tabular displays.

byfile

Statistics are broken down by file; that is, a report is displayed for each PO file. Usually used with bar displays.

msgbar

Instead of a table with detailed statistics, only message counts are shown, accompanied with a text-art bar. Mostly useful in combination with bydir and byfile.

wbar

Like msgbar, but showing word counts instead of message counts.

absolute

Bar displays (on msgbar and wbar) are normally relative, meaning that when byfile or bydir is in effect, each bar is of the same length. This parameter makes bars scaled to sizes of PO files or directories. For example, if msgbar and byfile are issued, then the bar of a PO file with twice as many messages as another PO file will be twice as long.

ondiff

Fuzzy messages are often very easy to correct (e.g. a typo fixed), which may make their word count misleading when estimating translation effort. This can be amended by issuing this parameter, to split word and character counts of fuzzy messages into translated and untranslated counts. The split is based on the difference ratio between current and previous original text, and a threshold. If the difference ratio is larger than the threshold, everything is counted as untranslated. The fuzzy count is left at zero. If previous original text is missing, the correction is not made, and counts are assigned to fuzzy as usual.

mincomp:fraction

Only those PO files which have translation completeness (measured by the ratio of translated to all messages, excluding obsolete) equal to or higher than the given fraction are included into statistics. This is especially useful when for each new template an empty PO file is automatically produced (instead of translators having to start work from a template), to include into statistics only those files which have actually seen some translation (using a small non-zero number for the fraction, e.g. mincomp:1e-6).

The hook to modify the translation before splitting it to count words and characters (see Section 9.10, “Processing Hooks”). The hook type must be F1A. The parameter can be repeated to add several hooks, which are then applied in the order of specification.
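The boundary rules of the lspan and espan parameters described above (start included, end excluded, either one optional) amount to a half-open range. A small Python sketch, with hypothetical function names, where spec is the parameter value without the lspan: or espan: prefix and total is the total number of lines or entries:

```python
def parse_span(spec, total):
    """Parse 'start:end', 'start', 'start:' or ':end' into (start, end)."""
    parts = spec.split(":")
    if len(parts) > 2:
        raise ValueError("too many fields in span: " + spec)
    start_s = parts[0]
    end_s = parts[1] if len(parts) == 2 else ""
    start = int(start_s) if start_s else 0   # omitted start is assumed 0
    end = int(end_s) if end_s else total     # omitted end is assumed total
    return start, end


def in_span(number, span):
    """True if number falls in the half-open range [start, end)."""
    start, end = span
    return start <= number < end
```

For example, the value 4:8 selects numbers 4, 5, 6, and 7, matching the espan:4:8 example.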

3.5.30.1. Handling Embedded Contexts

Some older PO files will have disambiguating contexts embedded into the msgid string, instead of using the newer standard msgctxt string. There are several customary ways in which this is done, but in general it depends on the translation environment where such PO files are used.

Embedded contexts will skew the statistics. Pology contains several sieves for converting embedded contexts into msgctxt contexts, named normctxt-*. When statistics on such PO files are computed, a sieve chain should be used in which the stats sieve is preceded by the context conversion sieve. For example, if the embedded context starts the msgid and ends with |, statistics should be computed with:

$ posieve --no-sync normctxt-sep,stats -s sep:'|' ...

Note that normctxt-* sieves, since they modify messages, would by default cause PO files to be modified on disk. Option --no-sync is therefore issued to prevent modifications to sieved files.

3.5.30.2. Output Legend

The default output from stats is a table where rows present statistics for a category of messages, and columns the particular categories of data:

$ posieve stats frobaz/
-              msg  msg/tot  w-or  w/tot-or  w-tr  ch-or  ch-tr
translated     ...    ...    ...     ...     ...    ...    ...
fuzzy          ...    ...    ...     ...     ...    ...    ...
untranslated   ...    ...    ...     ...     ...    ...    ...
total          ...    ...    ...     ...     ...    ...    ...
obsolete       ...    ...    ...     ...     ...    ...    ...

The total row is the sum of translated, fuzzy, and untranslated rows, whereas obsolete row is excluded. The columns are as follows:

  • msg: number of messages

  • msg/tot: percentage of messages relative to total

  • w-or: number of words in the original

  • w/tot-or: percentage of words in the original relative to total

  • w-tr: number of words in the translation

  • ch-or: number of characters in original

  • ch-tr: number of characters in the translation

The output with detail parameter in effect is the same as default, with several columns of derived data appended to the table:

  • w-ef: word expansion factor (increase in words from the original to the translation)

  • ch-ef: character expansion factor (increase in characters from the original to the translation)

  • w/msg-or: average number of words per message in the original

  • w/msg-tr: average number of words per message in the translation

  • ch/w-or: average number of characters per word in the original

  • ch/w-tr: average number of characters per word in the translation

If any of the sieve parameters that restrict or modify counting (such as ondiff, lspan, etc.) have been issued, this is indicated in the output by a modifiers: ... line:

$ posieve stats -s maxwords:5 -s ondiff frobaz/
(...the statistics table...)
modifiers: at most 5 words and scaled fuzzy counts

When the incomplete parameter is given, the statistics table is followed by a table of non-fully translated PO files, with counts of fuzzy and untranslated messages and words:

$ posieve stats -s incomplete frobaz/
(...the overall statistics table...)
catalog              msg/f   msg/u   msg/f+u   w/f   w/u   w/f+u
frobaz/foxtrot.po        0      11        11     0   123     123
frobaz/november.po      19      14        33    85    47     132
frobaz/sierra.po        22       0        22   231     0     231

In the column names, msg/* and w/* stand for messages and words; */f, */u, and */f+u stand for fuzzy, untranslated, and the two summed.

When parameters msgbar or wbar are in effect, statistics is presented in the form of a text-art bar, giving visual relation between numbers of translated, fuzzy, and untranslated messages or words:

$ posieve stats -s wbar frobaz/
4572/1829/2533 w-or |¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤×××××××××············|

A typical condensed overview of translation state is obtained by:

$ posieve stats -s byfile -s msgbar frobaz/
frobaz/foxtrot.po   34/ -/11 msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤·····|
frobaz/november.po  58/19/14 msgs |¤¤¤¤¤¤¤¤¤¤¤×××××····|
frobaz/sierra.po    65/22/ - msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤¤××××××|
(overall)          147/41/25 msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤××××···|

Note that while message counts are the classic choice for bar overviews (msgbar), you are probably better off looking at word counts (wbar) instead, because word counts represent more closely the amount of work needed to complete the translation. Rounding of fractions for bars is such that as long as there is at least one fuzzy or untranslated message (or word), the bar will show one incomplete cell.
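One rounding scheme with this property is to round the fuzzy and untranslated shares up to whole cells and give the remaining cells to the translated share. The Python sketch below does that; the function name, the width of 20 cells, and the scheme itself are assumptions for illustration, not necessarily what stats does internally.

```python
def render_bar(translated, fuzzy, untranslated, width=20):
    """Draw a text-art bar; incomplete categories never round down to zero."""
    total = translated + fuzzy + untranslated
    if total == 0:
        return "|" + " " * width + "|"

    def cells_up(count):
        # Ceiling of count * width / total, using integer arithmetic.
        return -(-count * width // total)

    n_fuzzy = cells_up(fuzzy)
    n_untr = cells_up(untranslated)
    # Rounding both up can overshoot the width by one cell; trim the larger.
    overflow = n_fuzzy + n_untr - width
    if overflow > 0:
        if n_untr >= n_fuzzy:
            n_untr -= overflow
        else:
            n_fuzzy -= overflow
    n_tr = max(width - n_fuzzy - n_untr, 0)
    return "|" + "¤" * n_tr + "×" * n_fuzzy + "·" * n_untr + "|"
```

With this scheme, a catalog with 999 translated messages and a single untranslated one still shows one untranslated cell.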

3.5.30.3. Notes on Counting

Word and character counts for a message string are obtained by processing it in the following order:

  • Accelerator markers are removed.

  • Text markup is eliminated (e.g. XML-like tags).

  • Other special substrings, such as format directives, are also eliminated (e.g. %s in messages with c-format flag).

  • Text is split into words by taking all contiguous sequences of "word characters", which include letters, numbers, and underscore.

  • All words not starting with a letter are eliminated.

  • Words that remain are counted into statistics. Whitespace is not included in character count.

In plural messages, counts for the original are the average of msgid and msgid_plural strings, and likewise the average of all msgstr strings for the translation. In this way, the comparative statistics between the original and the translation is not skewed for languages that have more or less than two plural forms.
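The counting steps and the plural averaging can be approximated with the Python sketch below. The regular expressions are simplified, the default accelerator marker is an arbitrary choice, and taking the character count as the summed length of the counted words is an assumption; the real stats sieve is more thorough.

```python
import re

TAG_RX = re.compile(r"<[^>]*>")
# Simplified printf-style directive (%s, %d, %5.2f, %1$s, ...); an assumption.
CFORMAT_RX = re.compile(r"%(\d+\$)?[-+#0]*\d*(\.\d+)?[a-zA-Z]")
WORD_RX = re.compile(r"\w+")  # letters, digits, and underscore


def count_words_chars(text, accel="&", c_format=False):
    """Return (word count, character count) for one message string."""
    for marker in accel:                  # 1. remove accelerator markers
        text = text.replace(marker, "")
    text = TAG_RX.sub(" ", text)          # 2. eliminate XML-like tags
    if c_format:                          # 3. eliminate format directives
        text = CFORMAT_RX.sub(" ", text)
    # 4-5. split into words, drop those not starting with a letter
    words = [w for w in WORD_RX.findall(text) if w[0].isalpha()]
    # 6. count the words and their characters (no whitespace)
    return len(words), sum(len(w) for w in words)


def count_plural(strings, **kwargs):
    """Average the counts over all given strings of a plural message."""
    counts = [count_words_chars(s, **kwargs) for s in strings]
    n = len(counts)
    return sum(c[0] for c in counts) / n, sum(c[1] for c in counts) / n
```

For a plural message, count_plural would be called once with the msgid and msgid_plural strings and once with all msgstr strings.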

3.5.31. tag-untranslated

Some translators like to edit PO files with a plain text editor, which may provide no special support for editing PO files, other than perhaps PO syntax highlighting. In this scenario, tag-untranslated can be used to equip untranslated messages with an untranslated flag, so that they can be easily looked up in the editor.

Since untranslated is not one of the defined PO flags, it will be lost if the PO file is merged with the template. This is intentional: the only purpose of this flag is to facilitate immediate editing of the PO file, and you may forget to remove some of them while editing. There is no reason for untranslated flags to persist in that case. Also, if the flag is not removed after the message has been translated, a subsequent run of this sieve will remove the flag.

Parameters:

strip

Instead of being added, untranslated flags are stripped. This is useful when you had no time to translate all messages but you want to send the PO file away.

wfuzzy

untranslated flags are added to fuzzy messages as well. This can be useful to be able to jump in the text editor through all incomplete messages by just searching for , untranslated[9], or when the set of messages to be updated has been limited somehow (e.g. by the branch parameter).

branch:branch

Tag only untranslated messages from given branch (summit). Several branches may be given as comma-separated list.

3.5.32. unfuzzy-context-only

Sometimes a message is made fuzzy during merging only due to a change in the msgctxt string, or its addition or removal. Some translators and languages may be less dependent on contexts than others, or they may be in a hurry prior to the release of the translation, and then unfuzzy-context-only can be used to unfuzzy those messages in which only the context was modified. This state can be detected by comparing the current and the previous strings in the fuzzy message, i.e. the PO file must have been merged with the --previous option to msgmerge.

Parameters:

noreview

By default, unfuzzied messages will also be given a translator comment with unreviewed-context string, so that you may find and review these messages at a later time. This parameter will prevent the addition of such comment, but it is usually safer to review automatically unfuzzied messages when you find the time.

eqmsgid

Sometimes a lot of messages in the code may be semi-automatically equipped with contexts (e.g. to group items by a common property), and then it may be necessary to review only those messages which got split into two or more messages due to newly added contexts. This parameter may be issued to specifically report all translated messages which have their msgid string equal to an unfuzzied message, including unfuzzied messages themselves. Depending on exactly what kind of contexts have been added, the noreview parameter may be useful here as well.

lokalize

Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.

3.5.33. unfuzzy-ctxmark-only

unfuzzy-ctxmark-only has a similar but narrower effect compared to the unfuzzy-context-only sieve. It unfuzzies a message only if the only change that caused fuzziness is in a specific part of the msgctxt string, the UI context marker.

UI context markers are an element of KUIT markup (KDE user interface text), which states more formally the user interface context in which the text given by the PO message is used. This may be important for translation, since style guidelines will typically somewhat depend on where in the UI the text is seen. For example, there may be two messages in the code which have exactly the same text in English, but one is used as a menu item, and the other as a dialog title; with KUIT, they would be marked as:

msgctxt "@action:inmenu File"
msgid "Export as HTML"
msgstr ""

msgctxt "@title:window"
msgid "Export as HTML"
msgstr ""

The UI context marker here is the leading part of msgctxt, starting with @... and ending with first whitespace. unfuzzy-ctxmark-only will unfuzzy the message if only this marker has changed (or was added or removed), but not if the change was in the rest of the context (after the first whitespace).
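The marker test can be sketched in Python as follows. The function names are hypothetical; the sketch only compares the context and msgid of the previous and current message, and leaves out plural strings and other details the real sieve would have to handle.

```python
def split_ctxmark(msgctxt):
    """Split msgctxt into (marker, rest); marker is "" if there is none."""
    if msgctxt is None:
        return "", ""
    if not msgctxt.startswith("@"):
        return "", msgctxt
    parts = msgctxt.split(None, 1)  # split at the first whitespace
    marker = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return marker, rest


def only_ctxmark_changed(old_ctxt, new_ctxt, old_msgid, new_msgid):
    """True if the marker differs while the rest of the message is equal."""
    if old_msgid != new_msgid:
        return False
    old_marker, old_rest = split_ctxmark(old_ctxt)
    new_marker, new_rest = split_ctxmark(new_ctxt)
    return old_rest == new_rest and old_marker != new_marker
```

A change from "@action:inmenu File" to "@action File" passes the test, while a change to "@action:inmenu Edit" does not.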

Parameters:

noreview

See the same-name parameter of unfuzzy-context-only. Using it here is probably somewhat safer, but in general this depends on translation style guidelines.

3.5.34. unfuzzy-inplace-only

Some text markups may have a "permissible" or "sloppy" mode, where some tags do not have to be explicitly terminated. The typical example is HTML, where <br>, <hr>, etc. do not have to be written as <br/>. (This is unlike XHTML, which is an XML instance and therefore strict in this respect.) When this permissible markup was used in the code, a programmer revisiting that code at a later time may consider it poor style, and go about fixing it. This may cause some messages in the PO file to become fuzzy. unfuzzy-inplace-only will recognize some of these situations in a fuzzy message (by comparing the current and previous strings), automatically modify the translation accordingly, and unfuzzy the message.

There are no parameters.

3.5.35. unfuzzy-qtclass-only

PO messages obtained by conversion from Qt Linguist translation files can contain in the msgctxt an automatically extracted C++ class name, referring to the class where the message is located in the code. In the following two example messages, the C++ class name is the text before the | character:

#: ui/configdialog.cpp:50
msgctxt "Sonnet::ConfigDialog|"
msgid "Spell Checking Configuration"
msgstr ""

#: core/loader.cpp:206
#, qt-format
msgctxt "Sonnet::Loader|%1 = language name, %2 = country name"
msgid "%1 (%2)"
msgstr ""

If the programmer later changes a class name in the code, all messages inside that class will become fuzzy. The unfuzzy-qtclass-only sieve can be used to unfuzzy such messages, by verifying that the only difference between the old and the new message is in the part of msgctxt before the | character. For this to work, the PO file must have been merged with --previous option to msgmerge.
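The comparison can be sketched in Python as below. The function names are hypothetical, and the sketch only compares context and msgid, whereas the real sieve works on the current and previous strings of a fuzzy message.

```python
def strip_qtclass(msgctxt):
    """Return the part of msgctxt after the first | (or msgctxt if none)."""
    if msgctxt is None:
        return None
    head, sep, tail = msgctxt.partition("|")
    return tail if sep else msgctxt


def only_qtclass_changed(old_ctxt, new_ctxt, old_msgid, new_msgid):
    """True if the messages differ only in the class-name part of msgctxt."""
    if old_msgid != new_msgid or old_ctxt == new_ctxt:
        return False
    return strip_qtclass(old_ctxt) == strip_qtclass(new_ctxt)
```

Renaming Sonnet::ConfigDialog to another class name in the first example message would pass this test, since the text after | (here empty) stays the same.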

There are no parameters.

3.5.36. update-header

When translation on a PO file starts for the first time, or when a previously translated PO file is being updated after merging, update-header can be used to automatically set and update PO header fields to proper values. The revision date is taken as current, while other pieces of information are read from the user configuration (see Section 9.2, “User Configuration”). Note that this sieve is normally only of use when you are translating with a plain text editor, while dedicated PO editors should do this automatically when the PO file is saved after editing.

Parameters:

proj:projectid

The ID of the project to which the PO files to be updated belong. This ID is used to construct the name of the configuration section as [project-projectid], which contains the project data fields. Also used are the fields from the [user] section, whenever they are not overridden in the project's section. See Section 9.2.2, “The [user] section” and Section 9.2.5, “Per-project sections ([project-*])”.

init

By default, the sieve tries to detect whether the header has been initialized before, because what should be changed in the header differs somewhat between initialization and update. This parameter can be issued to unconditionally treat the header as not initialized, i.e. overwrite any existing content.

onmod

The header should be updated only if the PO file was otherwise modified. This parameter makes sense only in a sieve chain, when this sieve is preceded by a potentially modifying sieve.

An example of a user configuration appropriate for this sieve would be:

[user]
name = Chusslove Illich
original-name = Часлав Илић
email = caslav.ilic@gmx.net
po-editor = Kate

[project-kde]
language = sr
language-team = Serbian
team-email = kde-i18n-sr@kde.org
plural-forms = nplurals=4; plural=n==1 ? 3 : n%%10==1 && \
               n%%100!=11 ? 0 : n%%10>=2 && n%%10<=4 && \
               (n%%100<10 || n%%100>=20) ? 1 : 2;

Note that percent characters in the plural-forms field are escaped by doubling, because single % in configuration has special meaning. Also note splitting into several lines by trailing \ (only for better looks, since configuration lines can be arbitrarily long).
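The effect of the doubled percent characters can be seen with Python's standard configparser module, whose default interpolation treats a single % specially and turns %% into a literal %. That Pology's configuration follows the same interpolation rules is an assumption of this sketch. The value is kept on one line here, because trailing-backslash continuation is not a feature of the standard module.

```python
import configparser

# Single-line version of the plural-forms field from the example above.
CONFIG_TEXT = """\
[project-kde]
language = sr
plural-forms = nplurals=4; plural=n==1 ? 3 : n%%10==1 && n%%100!=11 ? 0 : n%%10>=2 && n%%10<=4 && (n%%100<10 || n%%100>=20) ? 1 : 2;
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Reading the value applies interpolation, which collapses %% into %.
plural_forms = parser.get("project-kde", "plural-forms")
print(plural_forms)
```

The printed value contains single % characters, which is the form expected in the Plural-Forms header field.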

3.5.37. fr:setUbsp

In the French language, some punctuation characters are separated by an unbreakable space from the preceding word. This is unlike in English, so unwary French translators sometimes forget to add the required unbreakable space after or before such punctuation when translating from English. fr:setUbsp will heuristically detect such places and insert an unbreakable space.

There are no parameters.

3.5.38. ru:fill-doc-date-kde

Each translation file for a docbook in KDE has a string for the documentation's last update date in the format 'yyyy-mm-dd'. This sieve automatically translates those strings into Russian. The sieve uses the date command to change the date formatting, but Russian names of months are hardcoded, so you do not need to set up a Russian locale to use the sieve.

There are no parameters.

3.6. Using External Sieves

Each internal sieve is a single Python file in sieve/ subdirectory (and in lang/langcode/sieve/ for language-specific sieves). The Python file is named like the sieve, only with hyphens replaced with underscores and with .py extension. posieve therefore knows how to find which file to execute when an internal sieve name is given as its first argument.
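The naming rule can be sketched in Python as follows. The function name and the root argument are hypothetical, and the langcode:name form for language-specific sieves is inferred from sieve names such as fr:setUbsp rather than taken from posieve's source.

```python
import posixpath


def internal_sieve_path(sieve_name, root="."):
    """Map an internal sieve name to the Python file that implements it."""
    if ":" in sieve_name:
        # Language-specific sieve, e.g. fr:setUbsp
        langcode, name = sieve_name.split(":", 1)
        directory = posixpath.join(root, "lang", langcode, "sieve")
    else:
        name = sieve_name
        directory = posixpath.join(root, "sieve")
    # Hyphens become underscores, and the .py extension is added.
    return posixpath.join(directory, name.replace("-", "_") + ".py")
```

For example, remove-obsolete maps to sieve/remove_obsolete.py, and ru:fill-doc-date-kde to lang/ru/sieve/fill_doc_date_kde.py, both relative to the given root.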

However, instead of an internal sieve name, the first argument to posieve can also be an explicit path (relative or absolute) to a Python file which implements a sieve. Explicit paths can also be part of a sieve chain, mixed with internal sieve names. This is all there is to running external sieves; see Section 11.3, “Writing Sieves” for instructions on how to write one.



[8] Some heuristics for reinsertion of the accelerator marker may be implemented in the future.

[9] Alternatively, if the editor provides regular expressions for searches, you can search for , fuzzy|, untranslated.