Pology was designed with strong language-specific support in mind, and this chapter describes the currently available features for validation and derivation of translation, both as a whole and in its various parts.
A versatile translation-supporting tool has to have some language-specific functionality. But, it is difficult to agree on what is a language and what is a dialect, what is standard and what is jargon, what is derived from what, how any of these are named, and there are many witty remarks about existing classifications. Therefore, Pology takes a rather simple and non-formal approach to the definition of "language", but such that should provide good technical leverage for constructing language-specific functionality.
There are two levels of language-specificity in Pology.
The first level is simply the "language". In a linguistic sense this can be a language proper (whatever that means), a dialect, a variant written in a different script, etc. Each language in this sense is assigned a code in Pology, when the first elements of support for that language are introduced. By convention this code should be an ISO 639 code (either two- or three-letter) if applicable, but in principle it can be anything. Another convenient source of language codes is the GNU C library. For example, Portuguese as spoken in Portugal would have the code pt (ISO 639), while Portuguese spoken in Brazil would be pt_BR (GNU C library).
The second level of language-specificity is the "environment". In linguistic terms this covers the distinct but minor variations in vocabulary, style, tone, or orthography, which are specific to certain groups of people within a single language community. Within Pology, this level is used to support variations between specific translation environments, such as long-standing translation projects and their teams. Although translating into the same language, translation teams will almost inevitably have some differences in terminology, style guidelines, etc. Environments also have codes assigned.
In every application in Pology, the language and its environments have a hierarchical relation. In general, language-specific elements defined outside of a specific environment ("environment-agnostic" elements) are a sort of relaxed least common denominator, and specific environments add their own elements to that. Relaxed means that environment-agnostic elements can sometimes include that which holds for most but not all environments, while each environment can override what it needs to. This prevents the environment-agnostic language support from becoming too limited just to cater for peculiarities in certain environments.
When processing PO files, it is necessary to somehow convey to Pology tools to which language and environment the PO files belong. The most effective way of doing this is by adding the necessary information to PO headers. All Pology tools that deal with language-specific elements will check the header of the PO file they process for the language and environment. Some Pology tools will also consult the user configuration (typically with lower priority than PO headers) or provide appropriate command line options (typically giving them higher priority). See Section 9.9, “Influential Header Fields” and Section 9.2, “User Configuration” for details.
The following languages currently have some level of support in Pology, some of them with specific environments defined as well (the assigned code is given for each language):

Language | Code
---|---
Catalan | ca
French | fr
Galician | gl
Japanese | ja
Low Saxon | nds
Norwegian Nynorsk | nn
Romanian | ro
Russian | ru
Serbian | sr
Spanish | es
Pology can employ various well-known spell-checkers to check the translation in PO files. Currently there is standalone support for Aspell, and unified support for many spell-checkers (including Aspell) through Enchant, the spell-checking wrapper library (more precisely, through Python bindings for Enchant).
Spell-checking of one PO file or a collection of PO files can be performed directly, by sieving them through one of the check-spell (Aspell) or check-spell-ec (Enchant) sieves. The sieve will report each unknown word, possibly with a list of suggestions, and the location of the message (file and line/entry numbers). It can also be requested to show the full message, with unknown words in the translation highlighted.
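For example, both sieves are invoked in the same way (assuming that the language is recorded in the PO headers, as described above; otherwise it can be supplied through user configuration or sieve parameters):

$ posieve check-spell PATHS...
$ posieve check-spell-ec PATHS...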
Also provided are several spell-checking hooks, which can be used as building blocks in custom translation validation chains. For example, a spell-checking hook can be used to define the spell-checking rule within Pology's validation rules collection for a given language.
Pology collects internal language-specific word lists as supplements to system spelling dictionaries. One use of internal dictionaries is to record those words which are omitted in the system spelling dictionaries, but are actually proper words in the given language. Such words should be added into internal dictionaries only as an immediate fix for false spelling warnings, with an eye towards integrating them into the upstream spelling dictionaries of respective spell-checkers.
More importantly, internal dictionaries serve to collect words specific to a given environment, i.e. the words which are deemed too specific to be part of the upstream, general spelling dictionaries for the language. For example, this can be technical jargon, with newly coined terms which are yet to be more widely accepted. Another example could be translation of fiction, in books or computer games, where it is commonplace to make up words for fictional objects, animals, places, etc. which are not even intended to be more widely used.
In the Pology source tree, internal spelling dictionaries by language are located in lang/<langcode>/spell/ directories. This directory can contain an arbitrary number of dictionary files, which are all automatically picked up by Pology when spell-checking for that language is done. Dictionary files directly in this directory are environment-agnostic, and should contain only the words which are standard (or standard derivations) in the language, but happen to be missing from the system spelling dictionary. Subdirectories represent specific environments; they are named with the environment code, and can also contain any number of dictionaries. An example of an internal dictionary tree with environments:

lang/
  sr/
    spell/
      colors.aspell
      fruit.aspell
      ...
      science.aspell
      kde/
        general.aspell
      wesnoth/
        general.aspell
        propernames.aspell
When one of Pology's spell-checking routes is applied for a given language without further qualifiers, only the environment-agnostic dictionaries of that language are automatically included. It must be explicitly requested to additionally include dictionaries from one of the environments (e.g. by the env: parameter to the check-spell sieve).

Dictionary files are in the Aspell word list format (regardless of the spell-checker actually used), and must have the .aspell extension. This is a simple plain text format, listing one word per line. Only the first line, the header, is special; it states the language code, the number of words in the list, and the encoding. For example:

personal_ws-1.1 fr 1234 UTF-8
apricot
banana
cherry
...
Actually the only significant element of the header is the encoding. Language code and number of words can be arbitrary, as Pology will not use them.
Pology provides the normalize-aspell-word-list command, which sorts word list files alphabetically (and corrects the word count in the header, although that is not important), so that you do not have to manually insert new words in the proper order. The script is simply run with an arbitrary number of word list files as arguments, and modifies them in place. In case of duplicate words, it will report and eliminate the duplicates. In case of words with invalid characters (e.g. a space), the script will output a warning but will not remove them; automatic removal of invalid words can be requested with the -r/--remove-invalid option.
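For example, to normalize all environment-agnostic Serbian word lists from the tree shown earlier, and then once more with invalid entries dropped:

$ normalize-aspell-word-list lang/sr/spell/*.aspell
$ normalize-aspell-word-list -r lang/sr/spell/*.aspell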
Sometimes a message, or a few words in it, should not be spell-checked. This can be, for example, when the message is dense computer input (like a command line synopsis), or when a word is part of a literal phrase (such as an email address). It may be possible to filter the text to remove some of the non-checkable words prior to spell-checking (especially when spell-checking is done as a validation rule), but not all such words can be automatically detected. For example, especially problematic are onomatopoeic constructs ("Aaargh! Who released the beast?!").
For this reason it is possible to manually skip spell-checking on a message, or on certain words within a message, by adding a special translator comment. The whole message is skipped by adding the no-check-spell translator flag to it:
# |, no-check-spell
Words within the message are skipped by listing them in a well-spelled: translator comment, comma- or space-separated:
# well-spelled: Aaarg, gaaah, khh
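Put together, a message with an onomatopoeic translation might be equipped like this (message details and translation invented for illustration):

# well-spelled: Aaarg, khh
#: beast.cpp:712
msgid "Aaargh! Who released the beast?!"
msgstr "Aaarg! Khh... ko je pustio zver?!"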
Which of these two levels of skipping to use depends on the nature of the text. For example, if most of the text is composed of proper words, and there are only a few which should not be checked, it is probably better to list those words explicitly instead of skipping the whole message.
With Pology you can use LanguageTool, a free grammar and style checker, to check translation in PO files. At the moment LanguageTool is applicable only through the check-grammar sieve, so look up the details in its documentation.
In program documentation, but also in help texts in running programs, labels from the user interface are frequently mentioned. Here are two such messages, one a UI tooltip, the other a Docbook paragraph:

#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr ""

#: index.docbook:157
msgid ""
"<guimenuitem>Selected files only</guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
In the usual translation process, an embedded UI label is manually translated just like the surrounding text. You could directly translate the label, hoping that the original UI message was translated in the same way, but this will frequently not be the case (especially for longer labels). To be thorough, you could look up the UI message in its PO file, or run the program, to see how it was actually translated. There are two problems with being thorough in this way: it takes time to look up original UI messages, and worse, translation of a UI message might change in the future (e.g. after a review) and leave the referencing message out of date.
An obvious solution to these problems, in principle, would be to leave embedded UI labels untranslated but properly marked (such as with <gui*> tags in Docbook), and have an automatic system fetch their translations from the original UI messages and insert them into referencing messages. However, there could be many implementation variations of this approach (such as at which stage of the translation chain the automatic insertion happens), with some significant details to get right.
At present, Pology approaches automatic insertion of UI labels in a generalized way, which does not mandate any particular organization of PO files or translation workflow. It defines a syntax for wrapping and disambiguating UI references, for linking referencing and originating PO files, and provides a series of hooks to resolve and validate UI references. A UI reference resolving hook will simply replace a properly equipped non-translated UI label with its translation. This implies that PO files which are delivered must not be the same PO files which are directly translated, because resolving UI references in directly translated PO files would preclude their automatic update in the future[30]. It is upon the translator or the translation team to establish the separation between delivered and translated PO files. One way is by translating in summit (see Chapter 5, Summitting Translation Branches), which by definition provides the desired separation, and setting UI reference resolving hooks as filters on scatter.
If UI references are inserted into the text informally (even if relying on certain orthographic or typographic conventions), then they must be manually wrapped in the translation using an explicit UI reference directive. For example:

#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr "Pritisnite dugme „~%/Get New Comics/“ da instalirate stripove."
Explicit UI reference directives are of the format head/reference-text/. The directive head is ~% in this example, which is the default, but another head may be specified as a parameter to UI resolving hooks. The delimiting slashes in the UI reference directive can be replaced with any other character, used consistently (e.g. if the UI text itself contains a slash). Note that the directive head must be fixed for a collection of PO files (though more than one head can be defined), while the delimiting character can be freely chosen from one directive to another.
The other type are implicit UI references, which require no special directive; they are possible when UI text is indicated in the text through formal markup. This is the case, for example, in PO files coming from Docbook documentation:
#: index.docbook:157
msgid ""
"<guimenuitem>Selected files only</guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
"<guimenuitem>Selected files only</guimenuitem> raspakuje samo "
"datoteke koje su izabrane."
Here the translation contains nothing special, save for the fact that the UI reference is not translated. UI resolving hooks can be given a list of tags to be considered as UI references, and for some common formats (such as Docbook) there are predefined specialized hooks which already list all UI tags.
If the message of the UI text is unique by its msgid string in the originating PO file, then it can be wrapped simply as in the previous examples. This means that even if it has a msgctxt string, the reference will still be resolved. But if there are several UI messages with the same msgid (implying different msgctxt), then the msgctxt string has to be manually added to the reference. This is done by putting the context into the prefix of the reference, separated by the pipe | character. For example, if the PO file has these two messages:

msgctxt "@title:menu"
msgid "Columns"
msgstr "Kolone"

msgctxt "@action:inmenu View Mode"
msgid "Columns"
msgstr "kolone"
then the correct one can be selected in an implicit UI reference like this:
msgid "...<guibutton>Columns</guibutton>..."
msgstr "...<guibutton>@title:menu|Columns</guibutton>..."

In the very unlikely case of the | character being part of the context string itself, the ¦ character ("broken bar") can be used as the context separator instead.
If the UI reference equipped with context does not resolve to a message through direct match on context, the given context string will next be tried as a regular expression match on msgctxt strings of the messages with matching msgid (matching will be case-insensitive). If this results in exactly one matched message, the reference is resolved. This matching sequence allows simplification and robustness in case of longer contexts, which would look ungainly in the UI reference and may slightly change over time.
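Because of this fallback, a shortened context is often sufficient; for instance, the menu-title message from the example above could also be selected with a shortened context like this (hypothetical shortening):

msgstr "...<guibutton>@title|Columns</guibutton>..."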
If two UI messages have equal msgid but are not part of the same PO file, that is not a conflict, because one of those PO files has the priority (see Section 8.4.3, “Linking to Originating PO Files”).
If, of two UI messages with equal msgid, one has msgctxt and the other does not, the message without context can be selected by adding the context separator in front of the text with nothing before it (i.e. as if the context were "empty").
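For instance, in a hypothetical variation of the example above where one of the two Columns messages had no msgctxt at all, that contextless message would be selected with:

msgstr "...<guibutton>|Columns</guibutton>..."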
Sometimes, though rarely, it happens that the referenced UI text is not statically complete, that is, it contains a format directive which is resolved at runtime. In such cases, the reference must be written exactly as an existing msgid, and the arguments are substituted with a special syntax. If the UI message is:

msgid "Configure %1..."
msgstr "Podesi %1..."
then it can be used in an implicit UI reference like this:
msgid "...<guimenuitem>Configure Foobar...</guimenuitem>..."
msgstr "...<guimenuitem>Configure %1...^%1:Foobar</guimenuitem>..."

Substitution arguments follow after the text, separated with the ^ character. Each argument specifies the format directive it replaces and the argument text, separated by :. In the unlikely case that ^ is part of the msgid itself, the ª character ("feminine ordinal indicator") can be used instead as the argument separator.
If there are several format directives in the UI reference, they are by default considered "named". This means that all same format directives will be replaced by the same argument. This is the right thing to do for some formats, e.g. python-format or kde-format messages, but not for all formats. In c-format, if there are two %s in the text, to replace just one of them with the current argument, the format directive attached to the argument must be preceded with !:

#, c-format
msgid "...<guilabel>This Foo or that Bar</guilabel>..."
msgstr "...<guilabel>This %s or that %s.^!%s:foo^!%s:bar</guilabel>..."
In general, but especially with implicit references, the text wrapped as a reference may actually contain several references in the form of a UI path ("...go to Foo->Bar->Baz, and click on..."). To handle such cases, when it is not possible or not convenient to wrap each element of the UI path separately, UI reference resolving hooks can be given one or more UI path separators (e.g. ->) on which to split the path and resolve the element references on their own.
Sometimes the UI reference in the original text is not valid, i.e. such a message no longer exists in the program. This can happen due to a slight punctuation mismatch, small style changes, etc., such that you can easily locate the correct UI message and use its msgid as the reference. However, if the UI reference is not valid because the documentation is outdated, there is no correct UI message to use in translation. This should most certainly be reported to the authors, but until they fix it, it presents a problem for immediate resolution of UI references. For this reason, a UI reference can be temporarily translated in place, by preceding it with twin context separators:

msgid "...An Outdated Label..."
msgstr "...||Zastarela etiketa..."
This will resolve into the verbatim text of the reference (i.e. context separators will simply be removed), without the hook complaining about an unresolvable reference.
The text of the UI message may contain some characters and substrings which should not be carried over into the text which references the message, or should be modified. To cater for this, UI PO files are normalized after being opened and before UI references are looked up in them. In fact, UI references are written precisely in this normalized form, rather than using the true original msgid from the UI PO file. This is both for convenience and for necessity.
One typical thing to handle in normalization is the accelerator marker. UI reference resolving hooks eliminate accelerator markers automatically, but for that they need to know what the accelerator marker character is. To find this out, hooks will read the X-Accelerator-Marker header field.
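For example, the header of a UI PO file which uses & as the accelerator marker (the marker character here is only illustrative) would contain:

msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Accelerator-Marker: &\n"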
Another problem is when UI messages contain subsections which would invalidate the target format which is being translated in the referencing PO file, e.g. malformed XML in Docbook catalogs. For example, a literal & must be represented as &amp; in Docbook markup, thus this UI message:

msgid "Scaled & Cropped"
msgstr ""
would be referenced as:
msgid "...<guimenuitem>Scaled &amp; Cropped</guimenuitem>..."
msgstr "...<guimenuitem>Scaled &amp; Cropped</guimenuitem>..."
Resolving hooks have parameters for specifying the type of escaping needed by the target format.
Normalization may flatten several different messages from the UI PO file into one. An example of this is when msgid fields are equal except for the accelerator marker. If this happens and the normalized translations are not equal for all flattened messages, a special "tail" is added to their contexts, consisting of a tilde and several alphanumeric characters. The first run of the resolving (or validation) hook will report ambiguities of this kind, as well as the assigned contexts, so that the proper context can be copied and pasted over into the UI reference. The alphanumeric context tail is computed from the non-normalized msgid alone, so it will not change if, for example, messages in the UI PO file get reordered.
In general, the UI message may not be present in the same PO file in which it is referenced in another message. This is always the case for documentation PO files. Therefore UI reference resolving hooks need to know two things: the list of all UI PO files (those from which UI references may be drawn), and, for each PO file which contains UI references, the list of PO files from which it may draw UI references.
The list of UI PO files can be given to resolving hooks explicitly, as a list of PO file paths (or directory paths to search for PO files). This can, however, be inconvenient, as it implies either that the resolution script must be invoked in a specific directory (if paths are relative), or that UI PO files must reside in a fixed directory on the system where the resolution script is run (if paths are absolute). Therefore there is another way of specifying paths to UI PO files, through an environment variable which contains a colon-separated list of directory paths. Both the explicit list of paths and the environment variable which contains the paths can be given as parameters to hooks.
By default, for a given PO file, UI references are looked for only in the UI PO file of the same name, assuming that it is found among the UI PO files. This may be sufficient, for example, for UI references in tooltips, but it is frequently not sufficient for documentation PO files, which may have names different from the corresponding UI PO file names. Therefore a PO file can be manually linked to the UI PO files from which it draws UI references, through a special header field, X-Associated-UI-Catalogs. This field specifies only the PO domain names, as a space- or comma-separated list:

msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Associated-UI-Catalogs: foobar libfoobar libqwyx\n"
The order of domain names in the list is important: if the referenced UI message exists in more than one linked PO file, the translation is taken from the one which appears earlier in the list. Knowing PO domain names, resolving hooks can look up the exact file paths in the supplied list of paths.
When a UI reference cannot be resolved, for whatever reason -- it does not exist, there is a context conflict, the message is not translated, etc. -- resolving hooks will output warnings and fall back to the original text.
For each resolving hook there exists the counterpart validation hook. Validation hooks may be used in a "dry run" before starting to build PO files for delivery, or they may be built into a general translation validation framework (such as Pology's validation rules).
There are a great many possible mistakes to be made when translating. Some of these mistakes can only be observed and corrected by a human reviewer[31], and review is indeed an important part of the translation workflow. However, many mistakes, especially those more technical in nature, can be fully or partially detected by automatic means.
A number of tools are available to perform various checks on translation in PO files. The basic one is Gettext's msgfmt command, which, when run with the -c/--check option, will detect many "hard" technical problems. These are the kind of problems which may cause the program that uses the translation to crash, or that may cause loss of information to the program user. Another is Translate Toolkit's pofilter command, which applies heuristic checks to detect common (and not so common) stylistic and semantic slips in translation. Dedicated PO editors may also provide some checks of their own, or make use of external batch tools.
One commonality of existing validation tools is that they aim for generality, that is, they try to apply a fixed battery of checks to all languages and environments (although some differentiation by translation project may be present, such as in pofilter). Another commonality, unavoidable in heuristic approaches, is wrong detection of valid translation as invalid, the so-called "false positives". These two elements produce a combined negative effect: since the number and specificity of checks is not that great compared to what a dedicated translator could come up with for a given language and environment, and since many reported errors are false positives without a possibility for cancelation, the motivation to apply automatic checks sharply decreases; the more so the greater the amount of translation.
Pology therefore provides a system for users to assemble collections of validation rules adapted to their language and environment, with multi-level facilities for applying or skipping rules in certain contexts, pre-filtering of text before applying rules, and post-filtering and opening problematic messages in PO editors. Rules can be written and tuned in the course of translation, and false positives can be systematically canceled, such that over time the collection of rules becomes both highly specific and highly accurate. Since Pology supports language and environment variations from the ground up, such rule collections can be committed to Pology source distribution, so that anyone may use them when applicable.
Validation rules are primarily based on pattern matching with regular expressions, but they can in principle contain any Python code through Pology's hook system. For example, since there are spell-checking hooks provided, spell-checking can be easily made into one validation rule. One could even aim to integrate every available check into the validation rule system, such that it becomes the single and uniform source of all automatic checks in the translation workflow.
The primary tool in Pology for applying validation rules is the check-rules sieve. This section describes how to write rules, how to organize rule collections, and, importantly, how to handle false positives.
There are many nuances to the validation rule system in Pology, so it is best to start off with an example-based exposition of the main elements. Subsequent sections will then look into each element in detail.
Rules are defined in rule files, with flat structure and minimalistic syntax, since the idea is to write the rules during the translation (or the translation review). Here is one rule file with two rules:
# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
id="gram-contr"
hint="Do not use contractions."

{elevator}i
id="term-elevator"
hint="Translate 'elevator' as 'lift'."
valid msgstr="lift"
A rule file should begin with a comment telling something about the rules defined in the file. Then the rules follow, normally separated by one or more blank lines. Each rule starts with a trigger pattern, of which there are several types. The trigger pattern can sometimes be everything there is to the rule, but it is usually followed by a number of subdirectives.
The first rule above starts with a regular expression pattern on the translation, which is denoted by the [...] syntax. The regular expression matches English contractions, case-insensitively as indicated by the trailing i flag. The trigger pattern is followed by the id subdirective, which specifies an identifier for the rule (here gram-contr is short for "grammar, contractions"). The identifier does not have to be present, and does not even have to be unique if present (uses of rule identifiers will be explained later). If the rule matches a message, the message is reported to the user as problematic, along with a note provided in the hint subdirective.
The second rule starts with a regular expression pattern on the original (rather than the translation), for which the {...} syntax is reserved. Then the id and hint subdirectives follow, as in the first rule. But unlike the first rule, up to this point the second rule would be somewhat strange: report a problem whenever the word "elevator" is found in the original text? That is where the final valid subdirective comes in, by specifying a condition on the translation (msgstr=) which cancels the trigger pattern. So this rule effectively states "report every message which has the word 'elevator' in the original, but not the word 'lift' in the translation", making it a terminology assertion rule.
If the given example rule file is saved as personal.rules, it can be applied to a collection of PO files by the check-rules sieve in the following way:

$ posieve check-rules -s rfile:pathto/personal.rules PATHS...

The path to the rule file to apply is given by the rfile: sieve parameter. All messages which are "failed" by rules will be output to the terminal, with the spans of text that triggered the rule highlighted and the note attached to the rule displayed after the message. Additionally, one of the parameters for automatically opening messages in a PO editor can be issued, to make correcting problems (or canceling false positives) that much more comfortable.
The rfile: sieve parameter can be repeated to add several rule files. If all rule files are put into one directory (and its subdirectories), a single rdir: parameter can be used to specify the path to that directory, and all files with the .rules extension will be recursively collected from it and applied. Finally, if rule files are put into Pology's rule directory for the given language, at lang/<langcode>/rules/, then check-rules will automatically pick them up when neither the rfile: nor rdir: parameters are issued. This is a simple way to test the rules if the intention is to include them into the Pology distribution.
Instead of applying all defined rules, the parameters rule:, rulerx:, norule:, norulerx: of check-rules can be used to select specific rules to apply or to not apply, by their identifiers. To apply only the no-contractions rule:

$ posieve check-rules -s rfile:pathto/personal.rules -s rule:gram-contr PATHS...

and to apply all but terminology rules, assuming that their identifiers start with term-:

$ posieve check-rules -s rfile:pathto/personal.rules -s norulerx:term-.* PATHS...
When the rule trigger pattern is a regular expression, it can always be made more or less specific. The previous example of matching English contractions could be generalized like this:
[\w+'t\b]i
This regular expression will match one or more word characters (\w+) followed by 't ('t) positioned at a word boundary (\b). More general patterns increase the likelihood of false positives, but this is not really a problem, since tweaking the rules in the course of translation is expected. It is a bigger problem if the pattern is made too specific at first, such that it misses out some cases. It is therefore recommended to start with "greedy" patterns, and then to constrain them as false positives are observed.
However, tweaking trigger patterns can only go so far.[32] The workhorse of rule flexibility is instead the mentioned valid subdirective. Within a single valid subdirective there may be several tests, and many types of tests are provided. The trigger will be canceled if all the tests in the valid subdirective are satisfied (boolean AND linking). There may be several valid subdirectives, each with its own battery of tests, and then the trigger is canceled if any of the valid subdirectives is satisfied (boolean OR linking). For example, to disallow a certain word in translation unless it is used in a few specific constructs, the following set of valid subdirectives can be used:
[foo]i
id="style-nofoo"
hint="The word 'foo' is allowed only in '*goo foo' and 'foo bar*' constructs."
valid after="goo "
valid before=" bar"

The first valid subdirective cancels the rule if the trigger pattern matched just after a "goo " segment, and the second if it matched just before a " bar" segment. Another example would be a terminology assertion rule where a certain translation is expected in general, but another translation is also allowed in a specific PO file:

{foobar}i
id="term-foobar"
hint="Translate 'foobar' as 'froobaz' (somewhere 'groobaz' allowed too)."
valid msgstr="froobaz"
valid msgstr="groobaz" cat="gfoo"

Here the second valid subdirective uses the cat= test to specify the other possible translation in the specific PO file. Tests can be negated by prepending ! to them, so to require that the specific PO file has only the other translation:

valid msgstr="froobaz" !cat="gfoo"
valid msgstr="groobaz" cat="gfoo"
When a regular expression is not sufficient as the rule trigger, a validation hook can be used instead (one of V* hook types). See Section 9.10, “Processing Hooks” for general discussion on hooks in Pology. For example, since there are spell-checking hooks already available, the complete rule for spell-checking could be:
*hook name="spell/check-spell-sp" on="msgstr"
id="spelling"
hint="Misspelled words detected."

The name= field specifies the hook, and the on= field the parts of the message it should operate on. The parts given by the on= field must be appropriate for the hook type; since spell/check-spell-sp is a V3A hook, it can operate on any string in the message, including the translation as requested here. Validation hooks can provide some notes of their own (here a list of replacement suggestions for a misspelled word), which will be shown next to the note given by the rule's hint= subdirective.
The examples so far all suffer from one basic problem: the trigger pattern will fail to match a word which has an accelerator marker inside it.[33] This is actually an instance of a broader problem, that some rules should operate on a somewhat modified, filtered text, instead of on the original text. This is why the rule system in Pology also provides extensive filtering capabilities. If the accelerator marker is _ (the underscore), here is how it could be removed before applying the rules:

# Personal rules of Horatio the Indefatigable.

addFilterRegex match="_" repl="" on="pmsgid,pmsgstr"

# Rules follow...

The addFilterRegex directive sets a regular expression filter that will be applied to messages before applying any of the rules that follow. The match= field provides the pattern, repl= what to replace it with, and on= which parts of the message to filter.
The accelerator marker filter from the previous example is quite crude. It fixes the accelerator marker character, and it will simply remove all of them from the text. Filters too can be hooks instead of regular expressions, and in this case it is better to use the dedicated accelerator marker removal hook:
# Personal rules of Horatio the Indefatigable.

addFilterHook name="remove/remove-accel-msg" on="msg"

# Rules follow...

The remove/remove-accel-msg hook is an F4A hook, and therefore the on= field specifies the whole message as the target of filtering. This hook will use information from PO file headers and respect command line overrides to determine the accelerator marker character, and then remove markers only from valid accelerator positions.
Filters do not have to be given as global directives, influencing all the rules below them; they can also be defined for a single rule, using one of the rule subdirectives. The other way around, global filters can have a handle assigned (using the handle= field), and then this handle can be used to remove the filter on a specific rule.
The last important concept in Pology's validation rule system are rule environments. The examples so far defined rules for a given language, which means that they in principle apply to any PO file of that language. This is generally insufficient (e.g. due to terminology differences between translation projects), so rules too can be made to support Pology's language and environment hierarchy. Going back to the initial rule file example, let us assume that "elevator" should always become "lift", but that English contractions are unacceptable only in more formal translations. Then the rule file could be modified to:

# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
environment formal
...

{elevator}i
...
The first rule now has the environment subdirective, which sets this rule's environment to formal. If check-rules is now run as before, only the second rule will be applied, as it is environment-agnostic. To apply the first rule as well, the formal environment must be requested through the env: sieve parameter:

$ posieve check-rules -s rfile:pathto/personal.rules -s env:formal PATHS...
Another way to request the environment is to specify it inside the PO file itself, through the X-Environment: header field. This is generally preferable, because it reduces the amount of command line arguments (which may sometimes be accidentally omitted), other parts of Pology can also make use of the environment information in the PO header, and, most importantly, it makes it possible for PO files processed in a single run to belong to different environments.
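For example, to mark a PO file as belonging to the formal environment of the running example, its header would include:

msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Environment: formal\n"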
If all the rules which belong to the formal environment are grouped at the end of the rule file, then the global environment directive can be used to set the environment for all of them, instead of the subdirective on each of them:

# Personal rules of Horatio the Indefatigable.

{elevator}i
...

environment formal

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
...

A more usual application of the global environment directive is to split environment-specific rules into a separate file, and then put the environment directive at the top. Most flexibly, valid subdirectives provide the env= test, so that the rule trigger can be canceled in a condition including the environment. In the running example, this could be used as:

# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
...
valid !env="formal"

{elevator}i
...

It depends on the particular organization of rule files, and on the types of rules, which method of environment-sensitivity should be used. Filters too are sensitive to environments, either conforming to global environment directives the same as rules, or using their own env= fields.
When requesting environments in validation runs (through the env: sieve parameter or the X-Environment: header field), more than one environment can be specified. Then the rules from all those environments, plus the environment-agnostic rules, will be applied. Here comes another function of rule identifiers (provided with the id= rule subdirective): if two rules in different environments have the same identifier, then the rule from the more specific environment overrides the rule from the less specific environment. The more specific environment is normally taken to be the one encountered later in the requested environment list.
Rule files are kept simple, to facilitate easy editing without verbose syntax getting in the way. A rule file has the following layout:
# Title of the rule collection.
# Author name.
# License.

# Directives affecting all the rules.
global-directive
...
global-directive

# Rule 1.
trigger-pattern
subdirective-1
...
subdirective-n

# Rule 2.
trigger-pattern
subdirective-1
...
subdirective-n

...

# Rule N.
trigger-pattern
subdirective-1
...
subdirective-n
The rather formal top comment (license, etc.) is required for rule files inside the Pology distribution. In most contexts rule files are expected to have the .rules extension, so it is best to always use it (it is mandatory for internal rule files). Rule files must be UTF-8 encoded.
The rule trigger is most often a regular expression pattern, given within curly or square brackets, {...} or [...], to match the original or the translation part of the message, respectively. The closing bracket may be followed by single-character matching modifiers, as follows:

i: case-insensitive matching for all patterns in the rule, including but not limited to the trigger pattern. Default matching is case-sensitive.
Bracketed patterns are the shorthand notation, which is sufficient most of the time. There is also the more verbose notation *message-part/regex/modifiers, where instead of / any other non-letter character can be used consistently as the separator. The verbose notation is needed when some part of the message other than the original or the translation should be matched, or when brackets would cause balancing issues (e.g. when a closing curly bracket without the opening bracket is part of the match for the original text). For all messages, message-part can be one of the following keywords:

msgid: match on the original
msgstr: match on the translation
msgctxt: match on the disambiguating context

For example, {foobar}i is equivalent to *msgid/foobar/i.
For plural messages, msgid/.../ (and correspondingly {...}) tries to match either the msgid or the msgid_plural string, whereas msgstr/.../ (and [...]) tries to match any msgstr string. If only one particular of these strings should be matched, the following keywords can be used as well:

msgid_singular: match only the msgid string
msgid_plural: match only the msgid_plural string
msgstr_N: match only the msgstr string with index N
When regular expressions on message strings are not sufficient as rule triggers, a hook can be used instead. Hooks are described in Section 9.10, “Processing Hooks”. Since hooks are Python functions, in principle any kind of test can be performed by them. A rule with the hook trigger is defined as follows:
*hook name="hookspec" on="part" casesens="[yes|no]"
# Rule subdirectives follow...

The name= field provides the hook specification. Only V* type (validation) hooks can be used in this context. The on= field defines on which part of the message the hook will operate, and needs to conform to the hook type. The following message parts can be specified, with associated hook types:
msg: the hook applies to the complete message; for type V4A hooks.

msgid: the hook applies to the original text (msgid, msgid_plural), but considering other parts of the message; for type V3A and V3B hooks.

msgstr: the hook applies to the translation text (all msgstr strings), but considering other parts of the message; for type V3A and V3C hooks.

pmsgid: the hook applies to the original text, without considering the rest of the message; for type V1A hooks.

pmsgstr: the hook applies to the translation, without considering the rest of the message; for type V1A hooks.

The casesens= field in the trigger hook specification controls whether the patterns in the rest of the rule (primarily in valid subdirectives) are case-sensitive or not. This field can be omitted, and then patterns are case-sensitive.
If the rule trigger pattern matches (or the trigger hook reports some problems), the message is by default considered "failed" by the rule. The message may still be passed by subdirectives that follow, which test whether some additional conditions hold.
There are several types of rule subdirectives. The main subdirective is valid, which provides additional tests to pass a message failed by the trigger pattern. The tests are given by a list of name="pattern" entries. For a valid subdirective to pass the message, all its tests must hold, and if any of the valid subdirectives passes the message, then the rule as a whole passes it. Effectively, this means a boolean AND relationship within a subdirective, and OR across subdirectives.
The following tests are currently available in valid subdirectives:
msgid="REGEX": The original text (msgid or msgid_plural string) must match the regular expression.

msgstr="REGEX": The translation (any msgstr string) must match the regular expression.

ctx="REGEX": The disambiguating context (msgctxt string) must match the regular expression.

srcref="REGEX": The file path of one of the source references (in the #: ... comment) must match the regular expression.

comment="REGEX": One of the extracted or translator comments (#. ... or # ...) must match the regular expression.

span="REGEX": The text segment matched by the trigger pattern must match this regular expression as well.

before="REGEX": The text segment matched by the trigger pattern must be placed exactly before one of the text segments matched by this regular expression.

after="REGEX": The text segment matched by the trigger pattern must be placed exactly after one of the text segments matched by this regular expression.

cat="DOMAIN1,DOMAIN2,...": The PO domain name (i.e. the MO file name without the .mo extension) must be contained in the given comma-separated list of domain names.

catrx="REGEX": The PO domain name must match the regular expression.

env="ENV1,ENV2,...": The operating environment must be contained in the given comma-separated list of environment keywords.

head="/FIELD-REGEX/VALUE-REGEX": The PO file header must contain the field and value combination, each specified by a regular expression pattern. Instead of /, any other character may be used consistently as the delimiter for the field regular expression.
Each test can be negated by prefixing it with !. For example, !cat="foo,bar" will match if the PO domain name is neither foo nor bar. Tests are "short-circuiting", so it is good for performance to put simple direct matching tests (e.g. cat=, env=) before more expensive regular expression tests (msgid=, msgstr=, etc.).
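To illustrate negation and test ordering in practice, here is a hypothetical terminology rule (the catalog name and translations are invented) which allows a different translation of "wizard" only in one game catalog, with the cheap cat= test placed before the msgstr= test in each subdirective:

{\bwizard\b}i
id="term-wizard"
hint="Translate 'wizard' as 'assistant' (as 'sorcerer' only in the game)."
valid !cat="fooquest" msgstr="assistant"
valid cat="fooquest" msgstr="sorcerer"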
Subdirectives other than valid set states and properties of the rule. Property directives are written simply as property="value". These include:

hint="TEXT": A note to show to the user when the rule fails a message.

id="IDENT": An "almost unique" identifier for the rule (see Section 8.5.6, “Effect of Rule Environments”).
State directives are given by the directive name, possibly followed by keyword parameters: directive arg1 .... These can be:

validGroup GROUPNAME: Includes a previously defined standalone group of valid subdirectives.

environment ENVNAME: Sets the environment in which the rule is applied.

disabled: Disables the rule, so that it is no longer applied to messages. A disabled rule can still be applied by explicit request (e.g. using the rule: parameter of the check-rules sieve).

manual: Makes it necessary to manually apply the rule to a message, by using one of the special translator comments (e.g. apply-rule:).

addFilterRegex, addFilterHook, removeFilter: A group of subdirectives to define filters which are applied to messages before the rule is applied to them. See Section 8.5.7, “Filtering Messages”.
Global directives are typically placed at the beginning of a rule file, before any rules. They define common elements for all rules to use, or set state for all rules below them. A global directive can also be placed in the middle of the rule file, between two rules, when it will affect all the rules that follow it, but not those that precede it. The following global directives are defined:
validGroup: Defines common groups of valid subdirectives, which can be included by any rule using the validGroup subdirective:

# Global validity group.
validGroup passIfQuoted
valid after="“" before="”"
valid after="‘" before="’"

....

# Rule X.
{...}
validGroup passIfQuoted
valid ...
...

# Rule Y.
{...}
validGroup passIfQuoted
valid ...
...

environment: Sets a specific environment for the rules that follow, unless overridden with the namesake rule subdirective:

# Global environment.
environment FOO

...

# Rule X, belongs to FOO.
{...}
...

# Rule Y, overrides to BAR.
{...}
environment BAR
...
See Section 8.5.6, “Effect of Rule Environments” for details on use of environments.
include: Used to include files into rule files:
include file="foo.something"
If the file to include is specified by relative path, it is taken as relative to the file which includes it.
The intent behind the include directive is not to include one rule file into another (files with the .rules extension), because normally all rule files in a directory are automatically included by the rule applicator (e.g. the check-rules sieve). Instead, included files should have an extension different from .rules, and contain a number of directives needed in several rule files; for example, a set of filters.

addFilterRegex, addFilterHook, removeFilter: A group of directives to define filters which are applied to messages before the rules are applied. See Section 8.5.7, “Filtering Messages”.
When there are no environment directives in a rule file, either global or as rule subdirectives, all rules in that rule file are considered "environment-agnostic". When applying a rule set (e.g. with the check-rules sieve), the applicator may be put into one or more operating environments, either by specifying them as arguments (e.g. on the command line) or in PO file headers. If one or more operating environments are given and a rule is environment-agnostic, it will be applied to the message irrespective of the operating environments. However, if there are some environment directives in the rule file, some rules will be environment-specific. An environment-specific rule will be applied only if its environment matches one of the set operating environments.
Rule environments are used to control the application of rules between different translation environments (projects, teams, people). Some rules may be common to all environments, some may be somewhat common, and some not common at all. Common rules would then be made environment-agnostic (i.e. not covered by any environment directive), while entirely non-common rules would be provided in separate rule files per environment, with one global environment directive in each.
How to handle "somewhat" common rules depends on circumstances. They could simply be defined as environment-specific, just like non-common rules, but this may reduce the amount of common rules too much for the sake of peculiar environments. Another way would be to define them as environment-agnostic, and then override them in certain environments. This is done by giving the environment-specific rule the same identifier (id subdirective) as that of the environment-agnostic rule. It may also happen that the bulk of the rule is environment-agnostic, except for a few tests in valid subdirectives which are not. In this case, env= and !env= tests can be used to differentiate between environments.
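A hypothetical sketch of such an override by identifier, building on the earlier terminology example; the environment-agnostic rule requires one term, and the rule with the same identifier in the formal environment replaces that requirement:

# Environment-agnostic rule.
{elevator}i
id="term-elevator"
hint="Translate 'elevator' as 'lift'."
valid msgstr="lift"

# Override for the 'formal' environment.
{elevator}i
id="term-elevator"
environment formal
hint="Translate 'elevator' as 'lifting platform'."
valid msgstr="lifting platform"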
It is frequently advantageous to apply a set of rules not to the message as it is, but to a suitably filtered variant. For example, if rules are used for terminology checks, it would be good to remove any markup from the text; otherwise, an <email> tag in the original could be understood as a real word, and a warning issued for a missing expected counterpart in the translation.
Filter sets are created using addFilter* directives, global or within rules:

# Remove XML-like tags.
addFilterRegex match="<.*?>" on="pmsgid,pmsgstr"

# Remove long command-line options.
addFilterRegex match="--[\w-]+" on="pmsgid,pmsgstr"

# Rule A will act on a message filtered by previous two directives.
{...}
...

# Remove function calls like foo(x, y).
addFilterRegex match="\w+\(.*?\)" on="pmsgid,pmsgstr"

# Rule B will act on a message filtered by previous three directives.
{...}
...
Filters are added cumulatively to the filter set, and the current set affects all the rules below it.[34] If an addFilter* directive appears within a rule, it adds a filter only to the filter set of that rule:

# Rule C, with an additional filter just for itself.
{...}
addFilterRegex match="grep\(1\)" on="pmsgstr"
...

# Rule D, sees only previous global filter additions.
{...}
...

These examples illustrate the use of the addFilterRegex directive, which is described in more detail below, as well as the other addFilter* directives.
All addFilter* directives have the on= field. It specifies the message part on which the filter should operate, similarly to the on= field in hook rule triggers. Unlike in triggers, in filters it is possible to state several parts to filter, as a comma-separated list. The following message parts are exposed for filtering:

msg: filter the "complete" message. What this means exactly depends on the particular filter directive.

msgid: filter the original text (msgid, msgid_plural), but possibly taking into account other parts of the message.

msgstr: filter the translation (all msgstr strings), but possibly taking into account other parts of the message.

pmsgid: filter the original text.

pmsgstr: filter the translation.

pattern: a quasi-part, to filter not the message, but all matching patterns (regular expressions, substring tests, equality tests) in the rules themselves.
Not all filter directives can filter on all of these parts. Admissible parts are listed with each filter directive.
To remove a filter from the current filter set, addFilter* directives can define a filter handle, which can then be given to a removeFilter directive:

addFilterRegex match="<.*?>" on="pmsgid,pmsgstr" handle="tags"

# Rule A, "tags" filter applies to it.
{...}
...

# Rule B, removes "tags" filter only for itself.
{...}
removeFilter handle="tags"
...

# Rule C, "tags" filter applies to it again.
{...}
...

removeFilter handle="tags"

# Rule D, "tags" filter does not apply to it and any following rule.
{...}
...
Several filters may share the same handle, in which case the removeFilter directive removes all of them from the current filter set. One filter can have more than one handle, given as a comma-separated list in the handle= field, and then it can be removed from the filter set by any of those handles. Likewise, the handle= field in a removeFilter directive can state several handles by which to remove filters. removeFilter as a rule subdirective influences the complete rule, regardless of its position among other subdirectives.
The clearFilters directive is used to completely clear the filter set. It has no fields. Like removeFilter, it can be issued either globally or as a rule subdirective.
A filter may be added or removed only in certain environments, specified by the env= field in addFilter* and removeFilter directives.
Currently the following directives for adding filters are available:
addFilterRegex: Parts of the text to remove are determined by a regular expression match. The pattern is given by the match= field. If, instead of simple removal of the matched segment, a replacement is wanted, the repl= field is used to specify the replacement string (it can include backreferences to regex groups in the pattern):

# Replace in translation the %<number> format directives with a tilde.
addFilterRegex match="%\d+" repl="~" on="pmsgstr"

Case-sensitivity of matching can be changed by adding the casesens=[yes|no] field; the default is case-sensitive matching.

Applicable (on= field) to pmsgid, pmsgstr, and pattern.
addFilterHook: Text is processed with a filtering hook (F* hook types). The hook specification is given by the name= field. For example, to remove accelerator markers from UI messages in a smart way, while checking various sources for the exact accelerator marker character (command line, PO file header), this filter can be set:
addFilterHook name="remove/remove-accel-msg" on="msg"
Applicable (on= field) to msg (for F4A hooks), msgid (F3A, F3B), msgstr (F3A, F3C), pmsgid (F1A), pmsgstr (F1A), and pattern (F1A).
Filtering may be run-time expensive, and it normally is in practical uses. Therefore the rule applicator will try to create and apply as few unique filter sets as possible, by considering their signatures -- a hash of ordering, type, and fields in the filter set for the given rule. Each message will be filtered only as many times as there are different filter sets, rather than once for every rule. The appropriate filtered version of the message will be given to each rule according to its filter set.
This means that you should be careful when adding and removing filters, in order to have as few filter sets as really necessary. For example, you may know that filters P and Q can be applied in any order, and in one rule file specify P followed by Q, but in another rule file Q followed by P. However, the rule applicator must assume that the order of filters is significant, so it will create two filter sets, PQ and QP, and spend twice as much time in filtering.
For big filter sets which are needed in several rule files, it is best to split them out into a separate file and use the include global directive to include them at the beginning of rule files.
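For example (the file name and handle are invented), a shared filter file and its inclusion at the top of a rule file could look like this:

# File: common.filters
addFilterHook name="remove/remove-accel-msg" on="msg"
addFilterRegex match="<.*?>" on="pmsgid,pmsgstr" handle="tags"

# At the top of a rule file:
include file="common.filters"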
In all the examples so far, ASCII double quotes were used as value delimiters ("..."
). However, just as in the verbose notation for trigger patterns (*msgid/.../
, etc.), all quoted values can in fact consistently use any other non-alphanumeric character (e.g. single quote, slash, etc.). On the other hand, literal quotes inside a value can be escaped by prefixing them with \
(backslash). Values which are regular expressions are sent to the regular expression engine without resolving any escapes other than for the quote character itself.
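For example, to match a double-quoted segment, the pattern value can either use a different delimiter or escape the inner quotes; the two directives below are equivalent (illustrative pattern and handle):

addFilterRegex match='"[^"]*"' on="pmsgstr" handle="quoted"
addFilterRegex match="\"[^\"]*\"" on="pmsgstr" handle="quoted"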
The general statement terminator in a rule file is the newline, but if a line would be too long, it can be continued into the next line by putting \
(backslash) in the last column.
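For example, a longer directive can be wrapped like this (illustrative pattern and handle):

addFilterRegex match="<[a-z]+>|</[a-z]+>|&[a-z]+;" \
    on="pmsgid,pmsgstr" handle="markup"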
As explained earlier, it is very important to have a thorough system for handling false positives in validation rules. There are several levels on which false positives can be canceled, described in the following from the nearest to the furthest from the rule definition itself. Some guidelines on when to use which level will also be provided, but keep in mind that this is far from a well-examined topic.
The disable
subdirective can be added to the rule to disable its application. This may seem a quaint method of "handling false positives", but it is not outright ridiculous, because a disabled rule can still be applied by directly requesting it (e.g. through the rule: parameter of check-rules). This is useful for rules which produce too many false positives to be applied as part of a rule set, but which are still better than ad-hoc searches. In other words, such rules can be understood as codified special searches, which you would run only when you have enough time to wade through all the false positives in search of the few real problems.
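A sketch of such a codified search, kept disabled in the rule set but runnable on demand (the trigger, identifier, and hint are invented for illustration):

{\bfroobaz\b}i
id="search-froobaz"
hint="review usage of froobaz manually"
disable

It could then be run explicitly with something like posieve check-rules -s rule:search-froobaz.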
The first real way of canceling false positives is by making the regular expression pattern for the rule trigger less greedy. For example, the trigger pattern for the terminology rule on "tool" could be written at first as:
{\btool}i
This will match any word that starts with tool
, due to the \b word boundary token at the pattern start. The word boundary is not repeated at the end, with the intention of also catching the plural form of the word, "tools". But this pattern will also match the word "toolbar", which may have its own rule. The pattern can then be restricted to really match only "tool" and "tools", in several ways, for example:
{\btools?\b}i
Now the word boundary is placed at the end as well, but also the optional letter 's' is inserted (?
means "zero or one appearance of the preceding element"). Another way would be to write out both forms in full:
{\b(tool|tools)\b}i
The brackets are needed because the OR-operator |
has lower priority than word boundary \b
, so without brackets the meaning would be "word which starts with 'tool' or ends with 'tools'".
Python's regular expressions, used in rule patterns, offer many powerful special features, but these are frequently better left unused in rules. For example, the trigger for the terminology rule on "line" (of text) could be written at first as:
{\blines?\b}i
But this would also catch the phrase "command line", which, as a standalone concept, may have its own rule. To avoid this match, a proficient user of regular expressions might think of adding a negative lookbehind to the trigger pattern:
{(?<!command )\blines?\b}i
However, it is much less cryptic and more extensible to add a valid
subdirective instead:
{\blines?\b}i
valid after="command "
This cancels the rule if the word "line" was matched just after the word "command", while clearly showing the special-case context.
valid
subdirectives are particularly useful for wider rule cancelations, such as by PO domain (catalog) name. For example, the word "wizard" could be translated differently when denoting a step-by-step dialog in a utilitarian program and a learned magic-wielding character in a computer game. Then the cat=
test can be used to allow the other term in the game's PO file:
{\bwizard}i
valid msgstr="term-for-step-by-step-dialog"
valid cat="foodungeon" msgstr="term-for-magician"
This requires specifying the domain names of all games with wizard characters to which the rule set is applied, which may not be that comfortable. Another way could be to introduce the fantasy
environment and use the env=
test:
{\bwizard}i
valid msgstr="term-for-step-by-step-dialog"
valid env="fantasy" msgstr="term-for-magician"
and to add the fantasy
environment into the header of the PO file that needs it.
Sometimes there is just a single strange message that falsely triggers the rule, such that there is nothing to generalize about the false positive. You could still cancel this false positive in the rule definition itself, by adding a valid
directive with the cat=
test for the PO domain name and msgid=
test to single out the troublesome message:
{\bfroobaz}i
id="term-froobaz"
valid msgstr="..."
valid cat="foo" msgid="the amount of froobaz-HX which led to"
However, rules are supposed to be at least somewhat general, and singling out a particular message in the rule is as excessive non-generality as it gets. It is also a maintenance problem: the message may disappear in the future, leaving cruft in the rule file, or it may change slightly, but enough for the msgid=
test not to match it any more.
A much better way of skipping a rule on a particular message is by adding a special translator comment to that message, in the PO file:
# skip-rule: term-froobaz
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
The comment starts with skip-rule:
, and is followed by a comma-separated list of rules to skip, by their identifiers (defined by id=
in the rule).
The other way around, a rule can be set for manual application only, by adding the manual
subdirective to it. Then the apply-rule:
translator comment must be added to apply that rule to a particular message:
# apply-rule: term-froobaz
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
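For completeness, a manual-only rule matching the comment above might be defined like this (the trigger and hint are invented for illustration):

{\bfroobaz}i
id="term-froobaz"
hint="froobaz should be translated consistently"
manual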
There is a pattern where an automatic rule and a manual rule are somehow closely related, so that on a particular message the automatic one should be skipped and the manual one applied. To make this pattern obvious and avoid adding two translator comments (both skip-rule:
and apply-rule:
), a single switch-rule:
comment can be added instead:
# switch-rule: term-froobaz > term-froobaz-chem
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
The rule before >
is skipped, and the rule after >
is applied. Several rules can be stated as a comma-separated list, on both sides of >
.
There is a catch to the translator comment approach, though. When the message becomes fuzzy, it depends on the new text whether the rule application comment should be kept or removed. This means that on fuzzy messages translators have to observe and adapt translator comments just as they adapt the msgstr
strings. Unfortunately, some translators do not pay sufficient attention to translator comments, which is further exacerbated by some PO editors not presenting translator comments conspicuously enough (or not even allowing them to be edited). However, from the point of view of the PO translation workflow, not giving full attention to translator comments is plainly an error: unwary translators should be told better, and deficient PO editors should be upgraded.[35]
Sometimes it is possible to do better than plainly skipping a rule on a message. Consider the following message:
#: dialogs/ScriptManager.cpp:498
msgid "Please refer to the console debug output for more information."
msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
An observant translator could conclude that "console" is not the best choice of term in the original text, that "shell" (or "terminal") would be more accurate, and translate the message as if the more accurate term were used in the original. However, this could cause the terminology rule for "console" (in its accurate meaning) to complain about the proper term missing in translation. Adding a skip-rule: term-console
comment would indeed cancel this false positive, but what about the terminology rule on "shell"? There is nothing in the original text to trigger it and check for the proper term in translation.
This example is an instance of the general case where the translator would formulate the original text somewhat differently, and make the translation based on that reformulation. Or, when the mere style of the original causes a rule to be falsely triggered, while a differently worded original would be just fine. In such cases, instead of adding a comment to crudely skip a rule, the translator can add a comment to rewrite the original text before rules are applied to it:
# rewrite-msgid: /console/shell/
#: dialogs/ScriptManager.cpp:498
msgid "Please refer to the console debug output for more information."
msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
The rewrite directive comment starts with rewrite-msgid:
and is followed by the search regular expression and the replacement string, delimited with /
or another non-alphanumeric character. With this rewrite, the wrong terminology rule, for "console", will not be triggered, while the correct rule, for "shell", will be.
At the moment, unlike skip-rule:
, rewrite-msgid:
is not an integral part of the rule system. It is instead implemented as a filtering hook. So to use it, add this filter into rule files (or into the filter set file included by rule files):
addFilterHook name="remove/rewrite-msgid" on="msg"
Sometimes it is not quite clear whether to skip a rule or rewrite the original, that is, whether to use skip-rule:
or rewrite-msgid:
comment. A guideline could be as follows. If the concept covered by the falsely triggered rule is present but somewhat camouflaged in the original, or one concept is switched for another (such as "console" with "shell" in the example above), then rewrite-msgid:
should be used to "normalize" the original text. If the original text has nothing to do with the concept covered by the triggered rule, then skip-rule:
should be used. An example of the latter would be such a message from a game:
# skip-rule: term-shell
#: src/tanks_options.cpp:249
msgid "Fire shells upward"
Here the word "shell" denotes a cannon shell, which has nothing to do with the term-shell rule for the operating system shell, and the rule is therefore skipped.
Consider a message extracted from a .desktop file, representing the name of a GUI utility:
#. field: Name
#: data/froobaz.desktop:5
msgid "Froobaz Image Examiner"
msgstr ""
Program names from .desktop files can be read and presented to the user by any other program. For example, when an image is right-clicked in a file browser, the browser could offer to open the file with the utility named by this message. In the PO file of that file browser, the message for the menu item could be:
#. TRANSLATORS: %s is a program name, to open a file with.
#: src/contextmenu.c:5
msgid "Open with %s"
msgstr ""
In languages featuring noun inflection, it is likely that the program name in this message should be in a grammatical case different from the nominative (basic) case. This means that simply inserting the name read from the .desktop file into the directly translated text will produce a grammatically incorrect phrase. The translator may try to adapt the message to the nominative form of the name (by shuffling words, adding "helper" words, adding punctuation), but this will produce a stylistically suboptimal phrase. That is, style will be sacrificed for grammar. In order not to have to make such compromises, now and in the future certain translation scripting systems may be available atop the PO format[36], which would, in this example, enable the translator to specify which non-nominative form of the program name to fetch and insert.
Whatever the shape the translation scripting system takes, different forms of phrases have to be derived somehow for use by that system. Given the nuances of spoken languages, fully automatic derivation is probably not going to be possible[37]. Pology therefore provides the syntagma[38] derivator system (synder for short), which allows manual derivation of phrase forms and properties with minimal verbosity, using macro expansion based on partial regularities in the grammar.
Syntagma derivations can be written and maintained in a standalone plain text file, although currently Pology provides no end-user functionality to convert such files (i.e. derive all forms defined by them) to formats which a target translation system could consume. Instead, one can make use of the Synder
class from the pology.synder module to construct custom converters. Of course, in the future, such converters may become part of Pology. There are already syntax highlighting definitions for the synder file format, for some text editors, in the syntax/ directory of the Pology distribution.
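As an illustration only, a custom converter could be structured roughly as below; the Synder method names import_file() and get2() are assumptions about the pology.synder interface made for this sketch, not a documented recipe:

from pology.synder import Synder

# Rough sketch of a converter from synder files to a simple text listing.
# The method names import_file() and get2() are assumed here for
# illustration; consult the pology.synder module for the real interface.
def convert(synder_paths, derivation_keys, property_keys):
    sd = Synder()
    for path in synder_paths:
        sd.import_file(path)  # assumed: load one synder file
    for dkey in derivation_keys:
        # assumed: query one property value by derivation and property key
        forms = [(pkey, sd.get2(dkey, pkey)) for pkey in property_keys]
        # write the forms out in whatever format the target
        # translation system expects; here they are just printed
        print(dkey, dict(forms))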
What is provided right now in terms of end-user functionality is the collect-pmap sieve. It enables translators to write syntagma derivations in translator comments in PO messages, and then extract them (deriving all forms) into a file in the appropriate format for the target translation system. The example message above from the .desktop file could be equipped with a synder entry like this:
# synder: Frubaz|ov ispitiv|ač slika
#. field: Name
#: data/froobaz.desktop:5
msgid "Froobaz Image Examiner"
msgstr "Frubazov ispitivač slika"
The translator comment starts with the keyword synder:
, and is followed by the synder entry which defines all the needed forms of the translated name. Note that the synder entry is quite compact, exactly two characters longer than the pure translated name, and yet it defines over a dozen forms and some properties (gender, number) of the name.
The rest of this section describes the syntax of synder entries, and the layout and organization of synder files. As an example application, we consider a dictionary of proper names, where for each name in the source language we want to define the basic name and some of its forms and properties in the target language.
For the name in source language Venus and in target language Venera, we could write the following simplest derivation, which defines only the basic form in the target language:
Venus: =Venera
Venus
is the key syntagma or the derivation key, and it is separated by the colon (:
) from the properties of the syntagma. Properties are written as key=value pairs, and separated by commas; in =Venera, the property key is the empty string, and the property value is Venera.
We would now like to define some grammar cases in the target language. Venera is the nominative (basic) case, so instead of the empty string we set nom
as its property key. Other cases that we want to define are genitive (gen
) Venere, dative (dat
) Veneri, and accusative (acc
) Veneru. Then we can write:
Venus: nom=Venera, gen=Venere, dat=Veneri, acc=Veneru
Up to this point, everything is written out manually; there are no "macro derivations" to speak of. But observe the difference between the grammar cases of Venera -- only the final letter changes. Therefore, we first write the following base derivation for this system of case endings alone, called declension-1
:
|declension-1: nom=a, gen=e, dat=i, acc=u
A base derivation is normally also hidden, by prepending |
(pipe) to its key syntagma. We make it hidden because it should be used only in other derivations, and does not represent a proper entry in our dictionary example. In the processing stage, derivations with hidden key syntagmas will not be offered on queries into the dictionary. We can now use this base derivation to shorten the derivation for Venus:
Venus: Vener|declension-1
Here Vener
is the root, and |declension-1
is the expansion, which references the previously defined base derivation. The final forms are derived by inserting the property values found in the expansion (a
from nom=a
, e
from gen=e
, etc.) at the position where the expansion occurs, for each of the property keys found in the expansion, thus obtaining the desired properties (nom=Venera
, gen=Venere
, etc.) for the current derivation.
Note that declension-1
may be too verbose a name for the base derivation. If the declension type can be identified by the stem of the nominative case (here a
), to have much more natural derivations we could write:
|a: nom=a, gen=e, dat=i, acc=u
Venus: Vener|a
Now the derivation looks just like the nominative case alone, only having the root and the stem separated by |
.
The big gain of this transformation becomes apparent, of course, when there are many syntagmas having the same declension type. Other such source-target pairs could be Earth and Zemlja, Europe and Evropa, Rhea and Reja, so we can write:
|a: nom=a, gen=e, dat=i, acc=u
Venus: Vener|a
Earth: Zemlj|a
Europe: Evrop|a
Rhea: Rej|a
From this it can also be seen that derivations are terminated by newline. If necessary, a single derivation can be split into several lines by putting a \
character (backslash) at the end of each line but the last.
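For example, the earlier derivation of Venus could be wrapped like this:

Venus: nom=Venera, gen=Venere, \
    dat=Veneri, acc=Veneru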
Expansions are implicitly terminated by a whitespace or a comma, or by another expansion. If these characters are part of the expansion itself (i.e. of the key syntagma of the derivation that the expansion refers to), or the text continues right after the expansion without a whitespace, curly brackets can be used to explicitly delimit the expansion:
Alpha Centauri: Alf|{a}-Kentaur
Any character which is special in the current context may be escaped with a backslash. Only the second colon here is the separator:
Destination\: Void: Odredišt|{e}: ništavilo
because the first colon is escaped, and the third colon is not in the context where colon is a special character.
A single derivation may state more than one key syntagma, comma-separated. For example, if the syntagma in source language has several spellings:
Iapetus, Japetus: Japet|
The key syntagma can also be an empty string. This is useful for base derivations when the stem-naming is used and the stem happens to be null -- such as in the previous example. The derivation to which this empty expansion refers would be:
|: nom=, gen=a, dat=u, acc=
Same-valued properties do not have to be repeated, but instead several property keys can be linked to one value, separated with &
(ampersand). In the previous base derivation, nom=
and acc=
properties could be unified in this way, resulting in:
|: nom&acc=, gen=a, dat=u
Synder files may contain comments, starting with #
and continuing to the end of line:
# A comment.
Venus: Vener|a # another comment
A single derivation may contain more than one expansion. There are two distinct types of multiple expansion, outer and inner.
Outer multiple expansion is used when it is advantageous to split derivations by grammar classes. The examples so far were only deriving grammar cases of nouns, but we may also want to define possessive adjectives per noun. For Venera, the possessive adjective in the nominative case is Venerin. Using the stem-naming of base derivations, we could write:
|a: … # as above
|in: … # possessive adjective
Venus: Vener|a, Vener|in
Expansions are resolved from left to right, with the expected effect of derived properties accumulating along the way. The only question is what happens if two expansions produce properties with the same keys but different values. In this case, the value produced by the last (rightmost) expansion overrides the previous values.
Inner multiple expansion is used on multi-word syntagmas, when more than one word needs expansion. For example, the source syntagma Orion Nebula has the target pair Orionova maglina, in which the first word is a possessive adjective, and the second word a noun. The derivation for this is:
|a: … # as above
|ova>: … # possessive adjective as noun, > is not special here
Orion Nebula: Orion|ova> maglin|a
Inner expansions are resolved from left to right, such that everything to the right of the expansion currently being resolved is treated as literal text. If all expansions define the same properties by key, then the total derivation will have all those properties, with values derived as expected. However, if there is some difference in property sets, then the total derivation will get their intersection, i.e. only those properties found in all expansions.
Both outer and inner expansion may be used in a single derivation.
An expansion can be made not to include all the properties defined in the referred-to derivation, but only a subset of them. It can also be made to modify the property keys from the referred-to derivation.
Recall the example of Orion Nebula and Orionova maglina. Here the possessive adjective Orionova has to be matched in both case and gender to the noun maglina, which is of feminine gender. Earlier we defined a special adjective-as-noun derivation |ova>, specialized for feminine gender nouns, but now we want to make use of the full possessive adjective derivation, which is not specialized to any gender. Let the property keys of this derivation be of the form nommas
(nominative masculine), genmas
(genitive masculine), …, nomfem
(nominative feminine), genfem
(genitive feminine), …. If we use the stem of the nominative masculine form, Orionov, to name the possessive adjective base derivation, we get:
|ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
Orion Nebula: Orion|ov~...fem maglin|a
|ov~...fem
is a masked expansion. It specifies to include only those properties with keys starting with any three characters and ending in fem
, as well as to drop fem
(being a constant) from the resulting property keys. This precisely selects only the feminine forms of the possessive adjective and transforms their keys into the noun keys needed to match those of the |a
expansion.
We could also use this same masked expansion as the middle step, to produce the feminine-specialized adjective-as-noun base derivation:
|ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
|ova>: |ov~...fem
Orion Nebula: Orion|ova> maglin|a
A special case of masked expansion is when there are no variable characters in the mask (no dots). In the pair Constellation of Cassiopeia and Sazvežđe Kasiopeje, the of Cassiopeia part is translated as a single word in the genitive case, Kasiopeje, avoiding the need for a preposition. If standalone Cassiopeia has its own derivation, then we can use it like this:
Cassiopeia: Kasiopej|a
Constellation of Cassiopeia: Sazvežđ|e |Cassiopeia~gen
|e
is the usual nominative-stem expansion. The |Cassiopeia~gen
expansion produces only the genitive form of Cassiopeia, but with the empty property key. If this expansion were treated as a normal inner expansion, it would cancel all properties produced by the |e
expansion, since none of them has an empty key. Instead, when an expansion produces a single property with empty key, its value is treated as literal text and concatenated to all property values produced up to that point. Just as if we had written:
Constellation of Cassiopeia: Sazvežđ|e Kasiopeje
Sometimes the default modification of property keys, removal of all fixed characters in the mask, is not what we want. This should be a rare case, but if it happens, the mask can also be given a key extender. For example, if we wanted to select only the feminine forms of the |ov
expansion, but preserve the fem
ending of the resulting keys, we would write:
Foobar: Fubar|ov~...fem%*fem
The key extender in this expansion is %*fem
. For each resulting property, the final key is constructed by substituting every *
with the key resulting from the ~...fem
mask. Thus, the fem
ending is re-added to every key, as desired.
Expanded values can have their capitalization changed. By prepending ^
(circumflex) or `
(backtick) to the syntagma key of the expansion, the first letter in fetched values is uppercased or lowercased, respectively. We could derive the pair Distant Sun and Udaljeno sunce by using the pair Sun and Sunce (note the case difference in Sunce/sunce) like this:
Sun: Sunc|e # this defines uppercase first letter
Distant Sun: Dalek|o> |`Sun # this needs lowercase first letter
Property keys may be given several endings, to make these properties behave differently from what was described so far. These endings are not treated as part of the property key itself, so they should not be given when querying derivations by syntagma and property key.
Cutting properties are used to avoid the normal value concatenation on expansion. For example, if we want to define the gender of nouns through base expansions, we could come up with:
|a: nom=a, gen=e, dat=i, acc=u, gender=fem
Venus: Vener|a
However, this will cause the gender
property in expansion to become Venerafem
. For the gender
property to be taken verbatim, without concatenating segments from the calling derivation, we make it a cutting property by appending !
(exclamation mark) to its key:
|a: nom=a, gen=e, dat=i, acc=u, gender!=fem
Now when the dictionary is queried for the Venus
syntagma and gender
property, we will get the expected fem
value.
Cutting properties also behave differently in multiple inner expansions. Instead of being canceled when not all inner expansions define it, simply the rightmost value is taken -- just like in outer expansions.
Terminal properties are those hidden with respect to expansion, i.e. they are not taken into the calling derivation. A property is made terminal by appending .
(dot) to its key. For example, if some derivations have the short description property desc
, we typically do not want it to propagate into calling derivations which happen not to override it by outer expansion:
Mars: Mars|, desc.=planet
Red Mars: Crven|i> |Mars # a novel
Canceling properties cause a previously defined property with the same key to be removed from the collection of properties. A canceling property is indicated by ending its key with ^ (circumflex). The value of a canceling property has no meaning, and can be anything. Canceling is useful in expansions and alternative derivations (more on that later), where some properties introduced by an expansion or an alternative fallback should be removed from the final collection of properties.
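For example, reusing the |a base derivation from above (which carries the gender property), a derivation could drop that property like this (a minimal sketch; the value x after the canceling key is arbitrary):

Foobar: Fubar|a, gender^=x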
Key syntagmas and property values can be equipped with arbitrary simple tags, which start with the tag name in the form ~tag and extend to the next tag or the end of the syntagma. For example, when deriving people's names, we may want to tag their first and last names, using the tags ~fn and ~ln respectively:
~fn Isaac ~ln Newton: ~fn Isak| ~ln Njutn|
In default queries to the dictionary, tags are simply ignored, syntagmas and property values are reported as if there were no tags. However, custom derivators (based on the Synder
class from pology.synder) can define transformation functions, to which tagged text segments will be passed, so that they can treat them specially when producing the final text.
A tag is implicitly terminated by whitespace or a comma (or a colon in key syntagmas), but if none of these characters can be put right after the tag, the tag name can be explicitly delimited with curly brackets, as ~{tag}.
Sometimes there may be several alternative derivations to the given syntagma. The default derivation (in some suitable sense) is written as explained so far, and alternative derivations are written under named environments.
For example, if deriving a transcribed person's name, there may be several versions of the transcription. Isaac Newton, as the name of the Renaissance scientist, may be normally used in its traditional transcription Isak Njutn, while a contemporary person of that name would be transcribed in the modern way, as Ajzak Njuton. Then, in the entry of Newton the scientist, we could also mention what the modern transcription would be, under the environment modern
:
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
Alternative derivations are put on their own lines after the default derivation, and instead of the key syntagma, they begin with the environment name. The environment name starts with @
and ends with a colon, and then the usual derivation follows. It is conventional, but not mandatory, to add some indent to the environment name. There can be any number of non-default environments.
The immediate question that arises is how are expansions treated in non-default environments. In the previous example, what does |
expansion resolve to in modern
environment? This depends on how the synder file is processed. By default, it is required that derivations referenced by expansions have matching environments. If |
were defined as:
|: nom=, gen=a, dat=u, acc=
then the expansion of Isaac Newton in modern
environment would fail. Instead, it would be necessary to define the base derivations as:
|: nom=, gen=a, dat=u, acc=
    @modern: nom=, gen=a, dat=u, acc=
However, this may not be a very useful requirement. As can be seen in this example already, in many cases base derivations are likely to be the same for all environments, so they would be needlessly duplicated. It is therefore possible to define an environment fallback chain in processing, such that when a derivation in a certain environment is requested but not available, environments in the fallback chain are tried in order. In this example, if the chain were given as ("modern", "")
(the empty string is the name of default environment), then we could write:
|: nom=, gen=a, dat=u, acc=
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
Charles Messier: Šarl| Mesje|
When derivation of Isaac Newton in modern
environment is requested, the default expansion for |
will be used, and the derivation will succeed. Derivation of Charles Messier in modern
environment will succeed too, because the environment fallback chain is applied throughout; if Charles Messier had a different modern transcription, we would have provided it explicitly.
ASCII whitespace in derivations, namely the space, tab and newline, is not preserved as-is, but by default it is simplified in final property values. The simplification consists of removing all leading and trailing ASCII whitespace, and replacing all inner sequences of ASCII whitespace with a single space. Thus, these two derivations are equivalent:
Venus: nom=Venera
Venus : nom = Venera
but these two are not:
Venus: Vener|a
Venus: Vener |a
because the two spaces between the root Vener
and expansion |a
become inner spaces in resulting values, so they get converted into a single space.
Non-ASCII whitespace, on the other hand, is preserved as-is. This means that significant whitespace, like non-breaking space, zero width space, word joiners, etc. can be used normally.
It is possible to have different treatment of whitespace, through an optional parameter to the derivator object (Synder
class). This parameter is a transformation function to which text segments with raw whitespace are passed, so that anything can be done with them.
Due to the simplification of whitespace, indentation of key syntagmas and environment names is not significant, but it is nevertheless enforced to be consistent. This will not be accepted as valid syntax:
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
  George Washington: Džordž| Vašington| # inconsistent indent
      @modern: Džordž| Vošington| # inconsistent indent
Consistent indenting is enforced both for stylistic reasons when several people are working on the same synder file, and to discourage indentation styles unfriendly to version control systems, such as:
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
George Washington: Džordž| Vašington|
         @modern: Džordž| Vošington| # inconsistent indent
Unfriendliness to version control comes from the need to reindent lines which are otherwise unchanged, merely in order to keep them aligned to lines which were actually changed.
Within a single synder file, each derivation must have at least one unique key syntagma, because key syntagmas are used as keys in dictionary lookups. These two derivations are in conflict:
Mars: Mars| # the planet
Mars: mars| # the chocolate bar
There are several possibilities to resolve key conflicts. The simplest possibility is to use keyword-like key syntagmas, if key syntagmas themselves do not need to be human readable:
marsplanet: Mars|
marsbar: mars|
If key syntagmas have to be human readable, then one option is to extend them in human readable way as well:
Mars (planet): Mars|
Mars (chocolate bar): mars|
This method too is not acceptable if key syntagmas are intended to be of equal weight to derived syntagmas, like in a dictionary application. In that case, the solution is to add a hidden keyword-like syntagma to both derivations:
Mars, |marsplanet: Mars|
Mars, |marsbar: mars|
Processing will now silently eliminate Mars
as the key to either derivation, because it is conflicted, and leave only marsplanet
as key for the first and marsbar
as the key for the second derivation. These remaining keys must also be used in expansions, to reference the appropriate derivation. However, when querying the dictionary for key syntagmas by the key marsplanet
, only Mars will be returned, because marsplanet
is hidden; likewise for marsbar
.
Ordering of derivations is not important. The following order is valid, although the expansion |Venus~gen
is seen before the derivation of Venus:
Merchants of Venus: Trgovc|i> s |Venus~gen
Venus: Vener|a
This enables derivations to be ordered naturally, e.g. alphabetically, instead of the order being imposed by dependencies.
It is possible to include one synder file into another. A typical use case would be to split out base derivations into a separate file, and include it into other synder files. If basic derivations are defined in base.sd
:
|: nom=, gen=a, dat=u, acc=, gender!=mas
|a: nom=a, gen=e, dat=i, acc=u, gender!=fem
…
then the file solarsys.sd
, placed in the same directory, can include base.sd
and use its derivations in expansions like this:
>base.sd
Mercury: Merkur|
Venus: Vener|a
Earth: Zemlj|a
…
>
is the inclusion directive, followed by the absolute or relative path to the file to be included. If the path is relative, it is considered relative to the including file, and not to some externally defined set of inclusion paths.
If the including and the included file contain a derivation with the same key syntagma, these two derivations are not in conflict. On expansion, first the derivations from the current file are checked, and if the referenced derivation is not there, then the included files are checked in reverse order of inclusion. In this way, it is possible to override some of the base derivations in one or a few including files.
Inclusions are "shallow": only the derivations in the included file itself are visible (available for use in expansions) in the including file. In other words, if file A includes file B, and file B includes file C, then derivations from C are not automatically visible in A; to use them, A must explicitly include C.
Shallow inclusion and ordering-independent resolution of expansions, taken together, enable mutual inclusions: A can include B, while B can include A. This is an important capability when building derivations of taxonomies. While derivation of X naturally belongs to file A and of Y to file B, X may nevertheless be used in expansion in another derivation in B, and Y in another derivation in A.
To make derivations from several synder files available for queries, these files are imported into the derivator object one by one. Derivations from imported files (but not from the files included by them, per the shallow inclusion principle) all share a single namespace. This means that key syntagmas across imported files can conflict, and such conflicts must be resolved by one of the outlined methods.
The design rationale for the inclusion mechanism was that in each collection of derivations, each visible derivation, one which is available to queries by the user of the collection, must be accessible by at least one unique key, which does not depend on the underlying file hierarchy.
There are three levels of errors which may happen in syntagma derivations.
The first level are syntax errors, such as a synder entry missing the colon which separates the key syntagma from the rest of the entry, an unclosed curly bracket in an expansion, etc. These errors are reported as soon as the synder file is imported into the derivator object or included by another synder file.
The second level of errors are expansion errors, such as an expansion referencing an undefined derivation, or an expansion mask discarding all properties. These errors are reported lazily, when the problematic derivation is actually looked up for the first time.
The third level is occupied by semantic errors, for example when we want every derivation to have a certain property, or the gender
property to have only values mas
, fem
, and neu
, etc. and a derivation violates some of these requirements. At the moment, there is no prepared way to catch semantic errors.
In the future, a mechanism (perhaps in the form of file-level directives) may be introduced to immediately report reference errors on request, and to constrain property keys and property values in order to avoid semantic errors. Until then, the way to validate a collection of derivations is to write a piece of Python code which imports all files into a derivator object, iterates through derivations (this alone will catch expansion errors), and checks for semantic errors.
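Such a validation script might look roughly like the following sketch; the Synder method names used here (import_file(), dkeys(), get2()) are assumptions about the interface made for illustration, and the semantic checks shown are only examples:

from pology.synder import Synder

VALID_GENDERS = ("mas", "fem", "neu")

def validate(synder_paths):
    sd = Synder()
    for path in synder_paths:
        sd.import_file(path)  # assumed import method; syntax errors surface here
    problems = []
    for dkey in sd.dkeys():  # assumed iteration; looking up derivations
                             # catches expansion errors
        gender = sd.get2(dkey, "gender")  # assumed query method
        if gender not in VALID_GENDERS:
            problems.append((dkey, "bad or missing gender: %r" % gender))
    return problems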
[30] Another advantage is that the original text too will sometimes contain out-of-date UI references, which this process will automatically discover, enabling the translation to be more up-to-date than the original. Of course, reporting the problem to the authors would be desirable, or even necessary when the related feature no longer exists.
[31] Taking into account the current level of artificial intelligence development, which, granted, may become more sophisticated in the future.
[32] And cause regular expressions to become horribly complicated.
[33] Why not remove accelerator markers automatically before applying rules? Because some rules might be exactly about accelerator markers, e.g. if it should not be put next to certain letters.
[34] These filtering examples are only for illustrative purposes, as there are more precise methods to remove markup, or literals such as command line options.
[35] Until that is sufficiently satisfied, one simple safety measure is to remove rule application comments from fuzzy messages just after the PO file is merged with template. This will sometimes cause false positive to reappear, but, after all, this is only a tertiary element in the translation workflow (after translation and review).
[36] As of this writing, one currently operative translation scripting system is KDE's Transcript. Another one being developed, albeit not with the PO format as its base, is Mozilla's L20n.
[37] An exception would be constructed languages with regular grammar, such as Esperanto.
[38] A combination of words having a certain meaning, possibly greater than the sum of meanings of each word.