Abstract
Pology is a Python library and a collection of scripts for in-depth processing of PO files. The library is designed for easy and robust writing of custom scripts in field environments. The scripts perform a wide variety of tasks, from precision operations on individual messages in PO files, to cross-file operations on large collections of PO files. Special support is provided for common elements in the PO translation workflow, such as translation editors and version control systems, as well as for language-specific and project-specific needs.
This document is the reference source of information on Pology. It describes the functionality available to end users, but also gives an overview of programming with Pology, since Pology is easily extensible and users are encouraged to introduce their own processing elements.
Many people like to use computer programs in their native language. On average, the working language of the developers of a computer program and the native language of its users are different. This means that programs need to be translated. For this to be possible, the program first has to be written in such a way that it can fetch and display translations in the language set by the user. Then, there has to exist a method to collect discrete pieces of text (such as button labels, menu items, messages in dialogs...) from the program. Collected pieces of text are written into one or more files of a certain format, which the translators can work on. Finally, translated files may need to be converted into a form that the program can interpret, to show the translation to the user. There are many different translation systems, which support one or more of these elements of the translation process.
In the realm of free software, one particular translation system has become ubiquitous: the GNU Gettext. It covers all the elements of the translation process. It provides a way for programmers to write translatable programs, a way for text to be extracted and collected for translation, a file format on which the translators work, and a set of tools for processing translation files. Beyond the technical aspects, Gettext has evolved a set of conventions, workflows and communication patterns -- a translation culture of sorts.
The most salient element of Gettext from the translators' perspective is the translation file format it defines, the PO[1] format. Along with other parts of Gettext, the PO format has developed over the years into the technically most capable translation file format available today. Its features enable both high quality and high efficiency of translation, and yet they can all "fit into one person's head". A chapter of this manual provides a tour of the PO format from the translator's perspective.
Aside from the tools provided by GNU Gettext itself, many other tools for processing PO files have been written. These consist of translation editors (or "PO editors"), which provide translators with more power in editing PO files, and batch-processing tools, for purposes more specific than those covered by Gettext (e.g. conversion from and to other file formats). Pology is one of these specific batch-processing tools.
Pology consists of a Python library, with much translation-related functionality beyond basic manipulation of PO file objects, and a collection of scripts based on this library. Both the library and the scripts have this basic trait: they tend to go in depth. Pology is designed to apply precision tasks to standalone PO files, to process collections of PO files in sophisticated ways, and while doing this to cooperate well with other tools commonly used to handle PO files (such as PO editors and version control systems). On the programming side Pology strives for simplicity and robustness, so that users who know (some) Python can easily add functionality for their custom needs. To achieve this, Pology fully relies on the conventions of the PO format, without trying to abstract the translation file format.
As one measure of attention to detail, Pology has sections of language-specific and project-specific functionality, and even combinations of those. Users are encouraged to contribute their custom solutions to the main distribution, if these solutions could possibly serve the needs of others.
In short, Pology is a study of PO.
Naturally, the easiest way is to install Pology packages for your operating system distribution, if they exist. Otherwise you must obtain Pology as source code, but you will still be able to prepare it for use quite simply.
You can either download a release tarball from [[insert link here]], or fetch the latest development version from the version control repository. To do the latter, execute[2]:
$ cd PARENTDIR
$ svn checkout svn://anonsvn.kde.org/home/kde/trunk/l10n-support/pology
This will create the pology subdirectory inside PARENTDIR, and download the full Pology source into it. When you want to update to the latest version later on, you do not have to download everything again; instead you can execute svn update in Pology's root directory:
$ cd POLOGYDIR
$ svn update
This will fetch only the modifications since the checkout (or the previous update) and apply them to the existing source tree.
To prepare Pology for use, you can either properly install it or use it directly from the source directory. To install it, you first run CMake in a separate build directory to configure the build, and then make and make install to build and install:
$ cd POLOGYDIR
$ mkdir build && cd build
$ cmake ..
$ make && make install
CMake will warn you of missing requirements, and give some hints on how to customize the build (e.g. the installation prefix). If cmake is run like this, without any arguments, Pology will be installed into a standard system location and should be ready to use. If you install it into a custom location (e.g. inside your home directory), then you may need to set some environment variables (see below).
If you want to run Pology from its source directory, it is sufficient to set two environment variables:
$ export PATH=POLOGYDIR/bin:$PATH
$ export PYTHONPATH=POLOGYDIR:$PYTHONPATH
You can put these commands in the shell startup script (~/.bashrc for the Bash shell), so that the paths are already set whenever you start a shell. Setting PATH makes Pology's scripts ready for execution, and PYTHONPATH makes its Python library available for use in custom Python scripts. You should also build some documentation:
$ POLOGYDIR/user/local.sh build # user manual
$ POLOGYDIR/api/local.sh build # API documentation
$ POLOGYDIR/lang/LANG/doc/local.sh build # language-specific, if any
This will make HTML pages appear in POLOGYDIR/doc-html/. To have Pology scripts output translated messages, if there exists a translation into your language, you can execute:

$ POLOGYDIR/po/pology/local.sh build [LANG]

This will put compiled PO files into POLOGYDIR/mo/, from where they will be automatically picked up by scripts running from the source directory.
Pology provides shell completion for some of the included scripts, which you can activate by sourcing the corresponding completion definition file. If you have installed Pology:
$ . INSTALLDIR/share/pology/completion/bash/pology

and if running Pology from the source directory:

$ . POLOGYDIR/completion/bash/pology
The following lists the dependencies of Pology, noting whether they are required or optional, and what they are used for.
Required external Python packages:
None.
Required general software:
CMake >= 2.8.3. The build system used for Pology.
Gettext >= 0.17. Some Pology scripts use Gettext tools internally, and the library module pology.gtxtools wraps some of the Gettext tools for use inside Python scripts. Also needed to build Pology user interface and documentation translations.
Python >= 2.5.
Optional external Python packages:
python-dbus >= 0.81. Used for communication with various external applications (e.g. with the Lokalize PO editor).
python-enchant >= 1.5.2. Frontend to various spell-checkers, used by most of Pology's spell checking functionality.
python-pygments >= 1.6. Syntax highlighting for PO and other code snippets in Pology documentation.
Optional general software:
Apertium >= 0.2. A free/open-source machine translation platform, used by the pomtrans script.
Docbook XSL >= 1.75.2. XSL transformations for converting Docbook into various end-user formats, used for building Pology documentation.
Epydoc >= 3.0. Python docstring to HTML documentation generator. Needed to build the API documentation of the Pology Python library.
LanguageTool >= 1.0. Open source language checker, used by the check-grammar sieve.
Libxml2. XML processing library. Some of the command line tools that come with it (xmllint, xsltproc) are needed to build Pology documentation.
Version control systems. Used by various Pology scripts that process PO files on the collection level, when the PO files are under version control. Currently supported: Git >= 1.6, Subversion >= 1.4.
[1] "PO" is an acronym for "portable object". This phrase is a quite generic term from the depths of computer science, opaque for practicing translators. Texts on software translation therefore always write simply "PO format", "PO files", etc.
[2] svn is the primary command of the Subversion version control system. Subversion is almost certainly ready for installation from your operating system's package repositories.
There is no formal specification of the PO format; instead, the relevant parts of the Gettext manual serve as its working definition. Although the PO format has been documented both in the Gettext manual and elsewhere, in varying degrees of detail, it will be presented here as well. This is in order to thoroughly explain how the format's elements influence translation practice, and to make sure that the terms used in the rest of this manual are understood in their precise meaning.
Before going into the format description, it is useful to give an overview of usage contexts for the PO format and of the basic principles behind it.
There are three distinct contexts in which PO files are used:
Native dynamic translations. Many programs use the PO format as the native format for their user interface text. These include the KDE and Gnome desktop environments, GNU tools, etc. Translated PO files are compiled into binary MO files (which is done by the msgfmt command from Gettext) and installed in a proper location. Then the program fetches translations from them at runtime, which is what makes this "dynamic" translation.
Intermediate dynamic translations. Some software keeps user interface text in their own custom format. This is the case, for example, with Mozilla and OpenOffice programs. Such custom format files are first converted into PO files, translated, and then converted back into the original format, for runtime consumption by these programs.
Intermediate static translations. Static text data, such as software documentation, is converted from its source format into the PO format, translated, and then converted back into the original format. An example of such a documentation format is Docbook. Out of the translated files in the original format, the final documents for user consumption are created, such as PDF files or HTML pages.
This variety of usage should be kept in mind, as while the PO format is one, the text exposed for translation in PO files will have embedded elements which are tightly related to the source of what is translated. For example, user interface text will frequently contain format directives, while documentation text may be written with HTML-like markup. This means that the translator should be aware, in general, of what kind of source is being translated through a particular PO file.
The development of the PO format has been driven solely by the needs of its users, as with time these needs became well formulated and generalizable. Thanks to this, features of the PO format other than the very basic can be gradually introduced as necessary, and stay out of the way when they are not. The format is quite compact, human-readable and editable without special-purpose tools (though, of course, these come in handy). These aspects benefit the learning curve, everyday usage, and instructional texts such as this one.
Although translators will frequently prefer to work on PO files using dedicated PO editors, which purport to hide "technical details" such as the underlying file format, they should nevertheless understand the PO format well. This is because the PO format is more than a simple container of the text to be translated; it reflects important concepts in the translation workflow. To put it more concretely, the translator should find out how a given dedicated PO editor exposes the bits of information from the PO file in its interface, and whether it truly exposes all of them.
The PO format is a plain text format, written in files with the .po extension. A PO file contains a number of messages, partly independent text segments to be translated, which have been grouped into one file according to some logical division of what is being translated. For example, a standalone program will frequently have all its user interface messages in one PO file, and all documentation messages in another; or, the user interface may be split into several PO files by major program modules, and the documentation split by chapters, etc. PO files are also called message catalogs.
Here is an excerpt from the middle of a PO file, showing three simple messages, which are untranslated:
#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr ""

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr ""

#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr ""
Each message contains the keyword msgid, which is followed by the original string (usually in English for software), wrapped in double quotes. The keyword msgstr denotes the string which is to become the translation, also double-quoted. After you go through the PO file and add translations, these messages would read:
#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr "Gasne magline"

#: finddialog.cpp:40
msgid "Planetary Nebulae"
msgstr "Planetarne magline"
Based on this example, translating a PO file looks rather simple, and for the most part it is. There exist, however, a number of details which you have to take into account from time to time, in order to produce translations of high quality. The rest of this chapter deals with such details.
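To make the mechanics concrete, here is a minimal sketch of reading the original/translation pairs out of simple messages like those above. The helper and its regular expression are our own invention, for illustration only; real PO files also have wrapped strings, comments, contexts and plural forms, so a proper parser is needed in practice.

```python
import re

# Illustrative only: extracts msgid/msgstr pairs from simple, unwrapped
# PO messages; it handles neither wrapped strings, nor msgctxt, nor
# plural forms, nor escape sequences.
def parse_simple_po(text):
    return re.findall(r'msgid "(.*)"\nmsgstr "(.*)"', text)

sample = '''#: finddialog.cpp:38
msgid "Globular Clusters"
msgstr "Globularna jata"

#: finddialog.cpp:39
msgid "Gaseous Nebulae"
msgstr "Gasne magline"
'''

for original, translation in parse_simple_po(sample):
    print(original, "->", translation)
# → Globular Clusters -> Globularna jata
# → Gaseous Nebulae -> Gasne magline
```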
As is usual with text formats, immediately something must be said about the text encoding of a PO file. While you could use encodings other than UTF-8 if no non-ASCII letters are used in the original text, you really should use UTF-8. The encoding is specified within the PO file itself, and by default it is UTF-8; if you want to use another encoding, you must specify it in the PO header (described later).
Leaving some messages in the PO file untranslated is technically not a problem. For every untranslated message, programs will typically show the original text to the user, so that not all information is lost. Format converters (such as used in intermediate static translations) may do the same, or decline to create the target file unless the PO file is translated fully or over a prescribed threshold. Of course, you should strive to have the PO files under your maintenance completely translated, in order for the users not to be faced with mixed original and translated text.
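The fallback behavior for untranslated messages can be observed directly with Python's standard gettext module, which is how Python programs fetch translations from compiled MO files at runtime. The domain name and directory below are made up; since no MO file is found there and fallback is enabled, the lookup returns the original string unchanged, just as a program would show it to the user.

```python
import gettext

# With fallback=True, a missing MO file yields NullTranslations,
# so lookups simply return the original (untranslated) text.
t = gettext.translation("hypothetical-app", localedir="/nonexistent",
                        languages=["sr"], fallback=True)
print(t.gettext("No star named Vega found."))
# → No star named Vega found.
```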
Each message in the previous example also contains the source reference comment, which is the line starting with #: above the msgid "..." line. It tells from which source file of the program code (or source document of any kind), and from which line in that source file, the message has been extracted into the PO file. This piece of data may look strange at first--of what use is it to translators, to merit inclusion in the PO file? Since the PO format has been developed in the context of free software, the source reference enables you to actually look up the message in the source file, when you need more context to translate it. This does not require you to be a programmer, as source code is frequently readable enough to infer the message context without actually understanding the code.
For example, in the translation the text in title position may need to have a certain grammatical or orthographic form, and it may not be apparent from the PO file alone whether the message:
#: addcatdialog.cpp:45
msgid "Import Catalog"
msgstr ""
is used in title position. By following the source reference, you find this statement in the source file addcatdialog.cpp, line 45:
setCaption( i18n( "Import Catalog" ) );
The setCaption(...) bit makes it highly likely that the message is indeed used in a title position. Some dedicated PO editors provide ways to quickly and comfortably look up source references, just by pressing a keyboard shortcut, which makes this approach to context determination that much easier.
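As a small illustration of how a tool (or your own script) can exploit source references, the following sketch splits a #: comment into (file, line) pairs that could be handed to a text editor for lookup. The helper is hypothetical, not part of the Pology API.

```python
# Split a source reference comment like
# "#: colorscheme.cpp:79 skycomponents/equator.cpp:31"
# into (file, line) pairs.
def parse_source_refs(comment):
    refs = []
    for token in comment.lstrip("#: ").split():
        path, _, line = token.rpartition(":")  # split at the LAST colon
        refs.append((path, int(line)))
    return refs

print(parse_source_refs("#: colorscheme.cpp:79 skycomponents/equator.cpp:31"))
# → [('colorscheme.cpp', 79), ('skycomponents/equator.cpp', 31)]
```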
When a message is long or contains some logical line-breaks, its original and translation strings may be wrapped in the PO file (with wrapping boundary usually at column 80), such as this:
#: indimenu.cpp:96
msgid ""
"No INDI devices currently running. To run devices, please select devices "
"from the Device Manager in the devices menu."
msgstr ""
This wrapping is entirely invisible to the consumer of the PO file. PO processing tools introduce wrapping mostly as a convenience to translators who like to work on PO files with plain text editors. This means that you are free to wrap the translation (the msgstr string) in the same way, differently, or not to wrap it at all. Just do not forget to enclose each wrapped line in double quotes, the same as is done for msgid. For example, this translation of the previous message:
#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr ""
"Nema INDI uređaja (...)"
"(...) u meniju uređaja."
is equivalent to this one:
#: indimenu.cpp:96
msgid ""
"No INDI devices (...)"
"(...) in the devices menu."
msgstr "Nema INDI uređaja (...) u meniju uređaja."
Dedicated PO editors may not even show wrapping to the translator, or may wrap lines on their own, independently of the underlying PO file. Curiously enough, most PO editors seem to follow the original wrapping, at least by default. At any rate, if you would like to have all strings non-wrapped (including msgid) or vice versa, there are command line tools to achieve this.
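The equivalence of wrapped and unwrapped forms comes from simple concatenation: a PO reader joins the adjacent quoted pieces into one string. This can be demonstrated in a few lines of Python (illustrative only):

```python
# The quoted pieces of a wrapped PO string, as a PO reader sees them;
# the empty first piece after the bare msgid "" contributes nothing.
wrapped = [
    "",
    "No INDI devices currently running. To run devices, please select devices ",
    "from the Device Manager in the devices menu.",
]

unwrapped = ("No INDI devices currently running. To run devices, please "
             "select devices from the Device Manager in the devices menu.")

# Joining the wrapped pieces recovers the unwrapped string exactly.
assert "".join(wrapped) == unwrapped
print("wrapped and unwrapped forms are equivalent")
```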
A message in the PO file is uniquely identified by its msgid string (this is not entirely true, as will be explained shortly, but consider it approximately true for the moment). This means that, as the source which is translated evolves in time, a message may change some of its elements or its position within the PO file, but as long as it has the same msgid string, it is the same message. Those other, non-identifying elements include the translation (the msgstr string), source reference comments, etc. Position means either the line number in the PO file, or the position relative to other messages.
The first consequence of this fact is that the only reliable way to report a message to someone is to state its msgid string, in full or in sufficient part, even if the other person has access to the PO file where the message is found.[3] Newcomer translators are sometimes not briefed about this, and at first report the line number of the message, or its ordinal number in the range of all messages, without giving the msgid. Line numbers cannot work because of, for example, the arbitrary line wrapping described previously. Ordinal numbers do not work because your PO file may be slightly older or newer than that of the other person, and the ordinals may have changed in the meantime.
The second consequence is that there cannot be two messages with the same msgid in the same PO file (again not exactly true, see later). If the same text has been used two or more times in the source, then in the PO file it will appear as a single message, with its source reference comment (#:) listing all appearances. For example, the source reference of this message:
#: colorscheme.cpp:79 skycomponents/equator.cpp:31
msgid "Equator"
msgstr ""
shows that it is used in two places in the program's source code. This feature of the PO format prevents needless duplication of work, by ensuring that any duplicated text in the source is translated only once. This efficiency optimization can sometimes be a double-edged sword, but there is an elegant solution for the problem that can arise, as you will see shortly.
The third consequence, though more of a remark for clarity, is this: you should never modify the msgid string. Not only would doing so have no purpose, but if the msgid is modified, the consumer of the translated PO file will not see the message as translated, since it fetches messages by matching their msgid strings.
Depending on the language of translation, sometimes it may be hard to translate a message properly by considering it in isolation, without any additional context. Naive translation may break style guidelines, or worse, misinterpret the meaning of the original text. To avoid this, there are several ways in which you can infer the context in which the message is used.
One way you have seen already: looking into the source file of the message, as pointed to by the source reference comment. But this approach can be tedious: not only may the source code look menacing to a translator, but, while it is readily available for free software, it is usually not very comfortable to keep all that source code around just for the sake of context checking. This is a well understood difficulty, so additional context indicators have been devised.
One simple way to keep track of the context is, when translating a given message, to keep in sight several messages that precede and follow it. As a trivial example, the following four messages:
#: locationdialog.cpp:228
msgid "Really override original data for this city?"
msgstr ""

#: locationdialog.cpp:229
msgid "Override Existing Data?"
msgstr ""

#: locationdialog.cpp:229
msgid "Override Data"
msgstr ""

#: locationdialog.cpp:229
msgid "Do Not Override"
msgstr ""
are rather obviously a question in some kind of a message dialog, the title of that dialog, and the two answer buttons, so that you know exactly how the messages are related. Aside from the pure meaning, conclusions such as this may be further supported by the style conventions of original text (for English, title word case for dialog titles, but also for push buttons), and the source reference comments (here they reveal that all four messages are in two adjacent lines of the same source file). With time you will start to pick up patterns of this kind which are typical for the source which you translate, and be more confident in your estimates.
Up to this point, all the context gathering rested on the shoulders of the translator. However, when authors of the original text, for example programmers, are themselves sufficiently aware of the translation issues, they can explicitly provide some context for translators. This is particularly warranted when a message is quite strange, when it puts technical limitations on the translation, when it is used in an unexpected way, and so on.
One place where explicit context provided by the authors can be found in a message is within extracted comments, which start with #. (hash and period). For example, the message:
#. TRANSLATORS: A test phrase with all letters of the English alphabet.
#. Replace it with a sample text in your language, such that it is
#. representative of language's writing system.
#: kdeui/fonts/kfontchooser.cpp:382
msgid "The Quick Brown Fox Jumps Over The Lazy Dog"
msgstr ""
has an extracted comment which tells you to avoid translating the English phrase for what it is, but to instead construct a phrase with the described property in your language.
This kind of context usually begins with an agreed-upon keyword, which in the above case is TRANSLATORS:. This keyword is recommended by Gettext, but in principle it depends on the source environment; it could be, for example, i18n: (short for "internationalization").
Extracted comments can sometimes be added not by a human author, but by a tool used to create or process PO files. For example, when markup-text documents are translated, such as HTML, or Docbook for documentation, the extracted comment frequently states the tag which wraps the text in the original document:
#. Tag: title
#: skycoords.docbook:73
msgid "The Horizontal Coordinate System"
msgstr ""
In this example, the #. Tag: title comment informs you that the message is a title, so that you can adjust the translation accordingly.
Another frequent example where processing tools provide extracted comments is when the PO file is created in a slightly roundabout way, such that source references do not really point to the source file, but to a temporary source file which existed only during the creation of the PO file. To make this less misleading, the extracted comment may state the true source:
#. i18n: file: tools/observinglist.ui:263
#. i18n: ectx: property (toolTip), widget (KPushButton, ScopeButton)
#: rc.cpp:5865
msgid "Point telescope at highlighted object"
msgstr ""
Here rc.cpp:5865 is the reference to the temporary source file, whereas the true source file is given as file: tools/observinglist.ui:263. (The other automatically extracted comment, ectx: ..., may look a bit cryptic, but you can still easily conclude from it that this message is a tooltip for a push button.)
Consider the following two messages from a program's user interface:
#. TRANSLATORS: First letter in 'Scope'
#: tools/observinglist.cpp:700
msgid "S"
msgstr ""

#. TRANSLATORS: South
#: skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""
At first sight, you could think that it was nice of the programmer to add the explicit context (the #. TRANSLATORS: ... lines), informing you that the "S" of the first message is short for "Scope", and the "S" of the second short for "South", so that translators know they should use the letters corresponding to these words in their languages. But can you spot the problem?
The problem is that these messages cannot be part of a valid PO file, since, as was mentioned earlier, all messages must have unique msgid strings. Instead, in a real PO file, these two messages would be collapsed into one:
#. TRANSLATORS: First letter in 'Scope'
#. TRANSLATORS: South
#: tools/observinglist.cpp:700 skycomponents/horizoncomponent.cpp:429
msgid "S"
msgstr ""
Both contexts are still present, and translators are still well informed, but it is now required that the words for "Scope" and "South" also begin with the same letter in the target language--an extremely unlikely proposition.
In situations such as this, the programmer can equip messages with a different type of context, the disambiguating context. Such a context is no longer presented as an extracted comment, but through another keyword string, msgctxt:
#: tools/observinglist.cpp:700
msgctxt "First letter in 'Scope'"
msgid "S"
msgstr ""

#: skycomponents/horizoncomponent.cpp:429
msgctxt "South"
msgid "S"
msgstr ""
This is now a valid PO file, and you can translate each "S" on its own.
This updates the earlier approximation that messages must be unique by their msgid strings to the real requirement: messages must be unique by the combination of their msgctxt and msgid strings. If the msgctxt string is missing, as it usually is, you can think of it as being present but null-valued.[4]
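The identification rule can be modeled as a dictionary keyed by the (msgctxt, msgid) pair, with a missing context represented by a null value. This is a conceptual sketch, not Pology's actual data structure, and the one-letter translations of "S" are invented for illustration:

```python
# Messages keyed by the (msgctxt, msgid) pair; None stands for "no msgctxt".
catalog = {
    ("First letter in 'Scope'", "S"): "D",   # hypothetical translation
    ("South", "S"): "J",                     # hypothetical translation
    (None, "Equator"): "Ekvator",
}

# The two "S" messages coexist because their contexts differ.
print(len(catalog))             # → 3
print(catalog[("South", "S")])  # → J
```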
A rather frequent example of the need for disambiguating contexts is when the original text is a single adjective in English, used at several places in the source:
#: utils/kateautoindent.cpp:78 utils/katestyletreewidget.cpp:132
msgid "Normal"
msgstr ""
In many languages the adjective form must match the gender of the noun to which it refers, so if the "Normal" above refers both to indentation mode and text style, it is almost certainly necessary to provide disambiguating contexts:
#: utils/katestyletreewidget.cpp:132
msgctxt "Text style"
msgid "Normal"
msgstr "običan"

#: utils/kateautoindent.cpp:78
msgctxt "Autoindent mode"
msgid "Normal"
msgstr "obično"
You can imagine that programmers in general cannot know when a certain phrase, the same in English when used in two contexts, needs different translations in some other language. This means that you, the translator, should inform them to add a disambiguating context when you determine that you need one.[5]
At the moment of this writing, the msgctxt string is one of the younger additions to the PO format. But the need for disambiguating contexts was observed much earlier, and different translation environments have historically used different custom solutions to provide them. Such older PO files can still be encountered, so it is useful to present a few examples of custom disambiguating contexts. Before msgctxt was introduced, messages indeed had to be unique by msgid alone, so the disambiguating context had to be a part of the msgid, embedded with some special syntax. Here is how the first message from the previous example would look in a PO file coming from a KDE program of circa 2006:
#: utils/katestyletreewidget.cpp:132
msgid ""
"_: Text style\n"
"Normal"
msgstr "običan"
The disambiguating context has been embedded at the beginning of the msgid, surrounded by _: ...\n. In a contemporary Gnome program, the same message would look something like this:
#: utils/gatestyletreewidget.c:132
msgid "Text style|Normal"
msgstr "običan"
Here the context is again at the beginning of the msgid, but it is separated from the text only by the pipe character (|).
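The two historical conventions just shown can be recognized mechanically. The following sketch is our own helper, not taken from any particular tool, and it ignores escaping corner cases; it simply strips an embedded context from a msgid string:

```python
# Split an old-style embedded context out of a msgid string:
# KDE used a "_: context\n" prefix, Gnome a "context|text" separator.
def split_embedded_context(msgid):
    if msgid.startswith("_: ") and "\n" in msgid:
        ctxt, text = msgid[3:].split("\n", 1)
        return ctxt, text
    if "|" in msgid:
        ctxt, text = msgid.split("|", 1)
        return ctxt, text
    return None, msgid  # no embedded context recognized

print(split_embedded_context("_: Text style\nNormal"))  # → ('Text style', 'Normal')
print(split_embedded_context("Text style|Normal"))      # → ('Text style', 'Normal')
```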
Sometimes you will need to translate a message without explicit context in a non-obvious way, after you have determined that such a translation is needed by looking into the source or by seeing the message in the user interface at runtime. This may present a difficulty when the message is revisited, for example, by a proofreader in the review process, or by another translator if the message gets modified later on. This other person may conclude that the translation is wrong and "fix" it, or at the very least waste time asking around why it was translated in that way.
Conversely, sometimes you may be unsure whether your translation is exactly correct, for example whether you have correctly guessed the context, or whether you have used the correct terminology. In that case you can, of course, consult with fellow translators, but this would break you out of the "flow" state while working. It is better if such communication is delayed until the translation of the PO file is otherwise complete.
For these situations, you can write down your own inferred context, doubts, or notes, in another type of comment, the translator comment. These comments start simply with # (hash and space), followed by any text whatsoever. As with other comments, there may be any number of them. A hypothetical example:
# Wikipedia says that ‘etrurski’ is our name for this script.
#: viewpart/UnicodeBlocks.h:151
msgid "Old Italic"
msgstr "etrurski"
In reality, a translator comment such as the one above would probably be written in the language of translation, as there is no reason for it to be in English. This is not to say that translator comments should never be in English; there may be situations when that could be advantageous.
It is particularly important to know that translator comments are the only type of comment that all well-behaved PO processing tools are guaranteed to preserve in the same way as the translation. For example, if you were to write something into an extracted comment (#.), it would very soon disappear in one of the standard maintenance procedures. So make sure you add any personal remarks into translator comments, and nowhere else.
Message text sometimes contains substrings which are not visible to the user of the program or to the reader of the manual, but are used by the program or the rendering engine to construct the final visible text. Translators should reproduce such substrings in the translation as well, most of the time exactly as they are in the original, but sometimes also with some modifications.
For better or worse, constructive substrings tend to be tightly linked to the source environment of the text, for example the particular programming language in which the program is written, or the particular markup language for static content like documentation. To produce high quality translations, you will benefit from a basic understanding of the constructive substrings possible in the source environment, of their function and behavior. The prerequisite for this, as mentioned in the opening of this chapter, is that you are aware of what the source of the text in the PO file is.
When a file manager shows a message like "Really delete file tmp10.txt?" or "Open with Froobaz", the "tmp10.txt" and "Froobaz" parts had to be added to the rest of the text at runtime. In such cases, the original text as seen by the translator will contain format directives, substrings which the program will replace with dynamically determined arguments to complete the message to be shown to the user.
For example, in a PO file coming from a KDE program, there will be messages like this one:
#: skycomponents/constellationlines.cpp:106
#, kde-format
msgid "No star named %1 found."
msgstr "Nema zvezde po imenu %1."
The format directive in this message is %1, and it will be substituted at runtime with the text provided by the user as the name to search for. If several arguments need to be substituted in the text, there can be more format directives with increasing numbers: %1, %2, %3...
A new type of comment has appeared as well, the flags comment. This comment begins with #, (hash and comma), followed by a comma-separated list of keywords, the flags, which clarify the state or the type of the message. In this example the flag is kde-format, indicating that format directives in the message are of the KDE type.
Format directives differ across source environments, but they are usually easy to recognize. The previous message, if it were found in a Gnome program, would look like this:
#: skycomponents/constellationlines.c:106
#, c-format
msgid "No star named %s found."
msgstr "Nema zvezde po imenu %s."
The format directive changed to %s, and the format flag to c-format. This is the format used by most programs written in C, and by many written in C++. In C format, the %s directive is for substituting string arguments, and another frequent directive is %d, for integer numbers; but there are many more.
For one more example, to illustrate the diversity of format directives: if the program had been written in Python, the message could look like this:
#: skycomponents/constellationlines.cpp:106
#, python-format
msgid "No star named %(starname)s found."
msgstr "Nema zvezde po imenu %(starname)s."
Here the format directive is %(starname)s, which indicates the argument type similarly to C format (%s), but also gives its name in parentheses. Hence the python-format flag. This name must not be changed in translation, as otherwise the program will not be able to match the directive and make the substitution. This would probably make the program crash when it tries to display the message.
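The danger of renaming a named directive can be seen directly in Python, since python-format messages use Python's own printf-style interpolation. A minimal sketch (the strings are taken from the example above; the argument value "Vega" is made up for illustration):

```python
# Python substitutes %(name)s directives by key lookup in a mapping,
# so the order of directives does not matter, but the names do.
msgid = "No star named %(starname)s found."
msgstr = "Nema zvezde po imenu %(starname)s."

args = {"starname": "Vega"}
print(msgid % args)   # the original, completed with the argument
print(msgstr % args)  # the translation works because the name is unchanged

# If the translator had altered the directive name, substitution fails:
broken = "Nema zvezde po imenu %(zvezda)s."
try:
    broken % args
except KeyError as err:
    print("missing argument:", err)
```

This is only an illustration of the mechanism; the actual substitution is of course performed inside the program at runtime, not by the translator.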
You only need to make sure that each directive from the original string is found in the translation; only very rarely do you need to modify the directives themselves. Format flags, such as kde-format, c-format, etc., are there not only as information for translators, but are also used by tools for validating PO files. For example, if you forget or mistype a format directive in the translation, such tools will report it. Dedicated PO editors may warn you on the spot, or when saving the PO file. This provides you with a "safety net", so long as you remember to perform the checks after completing the translation (if the PO editor does not do it automatically).
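The core of such a check can be sketched in a few lines of Python: collect the format directives on both sides of the message and compare the sets. This is a simplified, hypothetical checker for KDE-style numbered directives, not the code of any real validation tool:

```python
import re

KDE_DIRECTIVE = re.compile(r"%\d+")  # matches %1, %2, ...

def check_kde_format(msgid, msgstr):
    """Return the directives present in msgid but missing from msgstr."""
    return sorted(set(KDE_DIRECTIVE.findall(msgid))
                  - set(KDE_DIRECTIVE.findall(msgstr)))

# A mistyped directive in the translation is caught:
print(check_kde_format("No star named %1 found.",
                       "Nema zvezde po imenu %l."))  # ['%1']
```

A real checker would also look at directives present only in the translation, and handle each format type (c-format, python-format, ...) with its own pattern.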
One situation that may require modification of directives is when there are several of them, and they need to be ordered differently in the translation:
#: kxsldbgpart/libxsldbg/xsldbg.cpp:256
#, kde-format
msgid "%1 took %2 ms to complete."
msgstr "Trebalo je %2 ms da se %1 završi."
With KDE format directives, which are numbered, reordering is as simple as above. Similarly for the Python format, where directives are named. But for formats where directives are neither numbered nor named by default, like in C format (where they only state argument type), you can sometimes modify directives to the desired effect:
#: gxsldbgpart/libxsldbg/xsldbg.c:256
#, c-format
msgid "%s took %d ms to complete."
msgstr "Trebalo je %2$d ms da se %1$s završi."
If the directives are numbered or named, and there is more than one same-number or same-name directive, usually any of the duplicates can be dropped in the translation. This may be useful in a longer text, for example when in the translation a pronoun can be safely used instead of repeating the argument:
#: hypothetical.cpp:100
#, kde-format
msgid "%1 is the blah, blah, blah. With %1 you can blah, blah."
msgstr "%1 je bla, bla, bla. Pomoću njega možete bla, bla."
Here "njega" is a pronoun used instead of repeating %1. Conversely, it is possible to repeat the directive where the original text used a pronoun, if that better fits the translation.
Sometimes, instead of using a format directive, the programmer may try to concatenate the full text out of separate messages:
#: hypothetical.cpp:100
msgid "No star named "
msgstr ""

#: hypothetical.cpp:100
msgid " found."
msgstr ""
Here the program will fetch the first message, append the argument to it, and then append the second message. This kind of programming is considered one of the basic errors when making a translatable program, because it forces translators to "piece together the puzzle", which may not even be possible in every language. It is thankfully rare today, but when it does happen, while you can try to work around it, it is better to contact the authors to have the source code fixed.
Programs sometimes show parts of the text in non-plain text: certain words may be italic or bold, titles in larger font size, list items with graphical bullets, etc. This is frequent, for example, in tooltips and message boxes. Yet richer typographic elements of this kind are usually found in documentation and other static content, which may need to be suitable both for reading on screen and printing on paper. In such messages, the original text will contain markup, where words, phrases, and whole paragraphs are wrapped with special tags.
The following messages show typical examples of markup in program user interface:
#: rc.cpp:1632 rc.cpp:3283
msgid "<b>Name:</b>"
msgstr ""

#: kgeography.cpp:375
#, kde-format
msgid "<qt>Current map:<br/><b>%1</b></qt>"
msgstr ""

#: rc.cpp:2537 rc.cpp:4188
msgid ""
"<b>Tip</b><br/>Some non-Meade telescopes support a subset of the LX200 "
"command set. Select <tt>LX200 Basic</tt> to control such devices."
msgstr ""
The markup in these messages is XML-like: tags of the form <tag>...</tag> are wrapped around the visible text segments to specify their visual formatting. For example, <b>...</b> tells that the text inside should be shown in boldface, <tt>...</tt> that a monospace font should be used, and a lone <br/> introduces a line break. A reader knowing some HTML will instantly recognize these tags.
Another frequent XML-like markup is used in documentation PO files, which in many environments (like KDE or Gnome) are mostly written in the Docbook XML format:
#. Tag: title
#: blackbody.docbook:13
msgid "<title>Blackbody Radiation</title>"
msgstr ""

#. Tag: para
#: geocoords.docbook:28
msgid ""
"The Equator is obviously an important part of this coordinate system; "
"it represents the <emphasis>zeropoint</emphasis> of the latitude angle, "
"and the halfway point between the poles. The Equator is the "
"<firstterm>Fundamental Plane</firstterm> of the geographic coordinate "
"system. <link linkend='ai-skycoords'>All Spherical</link> Coordinate "
"Systems define such a Fundamental Plane."
msgstr ""
The Docbook tags are named somewhat differently from the HTML-like tags in the previous example. They describe the meaning of the text that they wrap, rather than its visual appearance (the so-called semantic markup). For the translator it is all much the same, except that knowing the meaning of text parts may be beneficial for context. Docbook tags will also sometimes carry one or a few attributes following the opening tag name, such as <link linkend=...> in the second message above (HTML tags may have these too).
When translating markup text, you should, in general, reproduce the same set of tags in the translation, assigning them to the appropriate translated segments. Under no circumstances may the tags themselves be translated (e.g. <title> or <emphasis>), since they are processed by the computer to produce the final formatted text. As for tag attributes (linkend='ai-skycoords' in the example above), attribute names are also never translated, but on rare occasions their values in quotes may be (usually when a value is clearly human-readable text).
However, this is not to say that you should never modify markup. Especially with HTML-like tags, it is not so rare for the markup in the original text to be sloppy (e.g. missing closing tags), and you are free to correct it in the translation. Another example comes from CJK languages[6], where bold text is hard to read at normal font sizes, so CJK translators tend to remove <b> tags in favor of quotes. In general, the more familiar you are with the particular markup, the more you can consider doing something other than directly copying it from the original text.
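The rule of reproducing the same set of tags can itself be checked mechanically. A rough sketch in Python (a hypothetical helper, not part of any real tool; a real checker would also verify nesting and attribute values):

```python
import re

# Matches opening, closing and empty tags alike: <b>, </b>, <br/>, <link ...>
TAG = re.compile(r"</?[a-zA-Z][^>]*>")

def same_tag_set(msgid, msgstr):
    """True if the translation contains exactly the same tags as the original."""
    return sorted(TAG.findall(msgid)) == sorted(TAG.findall(msgstr))

print(same_tag_set("<b>Name:</b>", "<b>Ime:</b>"))  # True
print(same_tag_set("<b>Name:</b>", "Ime:"))         # False: tags dropped
```

Note that by design this sketch flags the legitimate modifications discussed above (such as CJK translators dropping <b> tags), which is why real validation tools let such warnings be reviewed rather than treated as hard errors.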
Sometimes there are parts in the original text that may look somewhat like XML-like markup, but are actually not. For example:
#: utils/katecmds.cpp:180
#, kde-format
msgid "Missing argument. Usage: %1 <value>"
msgstr ""
The <value> here is not markup, and is shown verbatim to the user. It is a placeholder, an indicator to the user that a real argument should be put in its place. For this reason, in many languages placeholders are translated, and there is no technical problem with that. You should only exercise caution not to mistake a tag for a placeholder. After a little experience with the particular markup, the difference usually becomes obvious.
There are also non-XML-like markups that tend to come up in translation. One is the wiki markup:
#: .txt:191
msgid "=== Overlay Images ==="
msgstr ""

#: poformat.txt:193
msgid ""
"A special kind of localized image is an ''overlay image'', one which "
"does not simply replace the original, but is combined with it [...]"
msgstr ""
Here ===...=== is the approximate equivalent of <h2>...</h2> in HTML, while ''...'' is the counterpart of <i>...</i>. Another markup type is troff, the source language of man pages:
# type: Plain text
#: ../../doc/man/wesnoth.6:55
msgid ""
"compresses a savefile (B<infile>) that is in text WML format into "
"binary WML format (B<outfile>)."
msgstr ""
where B<...> is the equivalent of <b>...</b> in HTML.
When you are faced with a new kind of markup, which you have never translated before, you should at least skim through a tutorial or two about it. This will enable you both to recognize it in the original text, and to modify it in translation if necessary.
There are a few special characters which cannot appear verbatim in the msgid or msgstr strings. Most obviously, think of the plain double quote ("): since it is used to delimit strings, a raw double quote inside the text would terminate the string prematurely and invalidate the message syntax. Such characters are therefore written as escape sequences, a combination of the backslash (\) and another character, which is interpreted as the appropriate real character when the message is shown to the user. The plain double quote is written as \":
#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "\"Lice\" na Marsu"
Another frequent escaped character is the newline, represented as \n:
#: kstarsinit.cpp:699
msgid ""
"The initial position is below the horizon.\n"
"Would you like to reset to the default position?"
msgstr ""
"Početni položaj je ispod horizonta.\n"
"Želite li da vratite na podrazumevani?"
Tools that write out PO files usually wrap the text at newlines unconditionally, regardless of the specified wrap column and even when wrapping has been turned off. This is done to increase readability for the translator editing the PO file. If the text is not composed in markup (e.g. not Docbook), newlines are significant to the program's user too, so you should carry them over into the translation. In general, unless you are confident that you can manipulate newlines in a certain way, you should follow the lead of the msgid.
Two more escape sequences, usually of much lower frequency than the double quote and the newline, are the tabulator \t and the backslash itself \\ (because a single backslash always starts an escape sequence). While other escape sequences are possible, they are extremely rare.
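How a PO-reading tool resolves these escape sequences back into real characters can be sketched with a small Python function. This is a simplified illustration covering only the four sequences named above; real parsers, such as the one in GNU Gettext, handle more cases:

```python
# The common PO escapes: \" -> ", \n -> newline, \t -> tab, \\ -> backslash
PO_ESCAPES = {'"': '"', 'n': '\n', 't': '\t', '\\': '\\'}

def po_unescape(s):
    """Resolve the common PO escape sequences in a string field."""
    out, i = [], 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s) and s[i + 1] in PO_ESCAPES:
            out.append(PO_ESCAPES[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

print(po_unescape(r'The \"face\" on Mars'))  # The "face" on Mars
```

The reverse operation, escaping, is applied when a tool writes the PO file back out.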
Returning to double quotes, keep in mind that while the English original usually uses plain ASCII quotes, translators tend to use "fancy" quotes according to the orthography of their language:
#: kstars_i18n.cpp:3591
msgid "The \"face\" on Mars"
msgstr "„Lice“ na Marsu"
This holds both for double and single quotes. Do check whether particular quote pairs are prescribed by the orthography of your language, and use them if they are.
In user interfaces, short texts on widgets used to perform an action or open a dialog frequently have one letter in them underlined. This indicates that when the user presses the Alt key (on an IBM PC type keyboard) and the underlined letter together, the corresponding action will be triggered. Such letters are called accelerators, and in message strings they are usually specified by preceding them with a special character, the accelerator marker:
#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "Zadaj fokus &ručno..."
Here the accelerator marker is the ampersand (&). Thus, the accelerator in this message will be the letter 'm' in the original text, and the letter 'r' in the translation. Accelerator markers differ across environments: the ampersand is typical of KDE and Qt programs, in Gnome programs it is the underscore (_), in OpenOffice the tilde (~), etc.
It may be difficult to choose accelerators in the translation (i.e. where to put the accelerator marker), because you can easily get into situations where two items in the same interface context (e.g. within one menu) end up with the same accelerator. This will not cause anything too bad: the program may automatically reassign conflicting accelerators, or the user may have to press Alt and the letter several times to cycle through all such items. Nevertheless, it is good to avoid conflicting accelerators, but there is no definite way to do that; you can only try to track the message context in the PO file, and check the running program. Nor is this only a problem of translation, as not so rarely the original itself introduces conflicting accelerators.
CJK languages use input methods different from alphabetical keyboard layouts, so instead of assigning an ideogram as the accelerator, they add a single Latin letter for that purpose alone:
#: kstarsinit.cpp:163
msgid "Set Focus &Manually..."
msgstr "フォーカスを手動でセット(&M)..."
This letter is usually picked to be the same as in the original text, so that accelerator conflicts are avoided in the translation to the same extent that the programmers managed to avoid them in the original.
The accelerator does not have to be positioned at the start of a word; it can be put next to any letter or number. A reasonable order of choices would be: at the start of the most significant word in the message by default; then, if that conflicts with another message, at the start of another word; and if it still conflicts, inside one of the words.
The accelerator marker is usually chosen as one of the rarely used characters in normal text, but it may still appear in contexts in which it does not mark an accelerator. For example:
#: kspopupmenu.cpp:203
msgid "Center && Track"
msgstr ""

#. Tag: phrase
#: config.docbook:137
msgid "<phrase>Configure &kstars; Window</phrase>"
msgstr ""
In the first message, the accelerator marker has been used to escape itself, to produce a verbatim ampersand in the output (similarly to escape sequences, where a double backslash represents a verbatim backslash). In the second message, the ampersand is used to insert the XML entity &kstars;. Only by context can it be concluded that the character is not used as an accelerator marker, but after gaining a little experience, the distinction will almost always be obvious to you.
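Part of this reasoning can be sketched in Python. The hypothetical helper below returns the accelerated character, treating a doubled marker as an escaped literal; it deliberately does not attempt the harder, context-dependent cases such as XML entities:

```python
def find_accelerator(text, marker='&'):
    """Return the accelerated character, or None if there is none."""
    i = 0
    while i < len(text) - 1:
        if text[i] == marker:
            if text[i + 1] == marker:   # doubled marker: a literal character
                i += 2
                continue
            if text[i + 1].isalnum():   # marker precedes the accelerator
                return text[i + 1]
        i += 1
    return None

print(find_accelerator("Set Focus &Manually..."))  # M
print(find_accelerator("Center && Track"))         # None
```

For Gnome or OpenOffice messages the same sketch would be called with marker='_' or marker='~'.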
Programs frequently need to report to the user the number of objects in a given context: "10 files found", "Do you really want to delete 5 messages?" etc. Of course, in English such messages should also have singular counterparts, like "1 file found", "...delete 1 message?". This means that two separate English texts are needed in the PO file, one for the singular and another for the plural case. You could assume that these would then be two messages, like in this hypothetical example:
#: hypothetical.cpp:100
#, kde-format
msgid "Time: %1 second"
msgstr ""

#: hypothetical.cpp:101
#, kde-format
msgid "Time: %1 seconds"
msgstr ""
Here the program would use the first message when the number of objects is 1, and the second message for any other number.
However, while this also works for some languages other than English (e.g. Spanish, German, French), it does not work for all languages. The reason is that, while English needs one text for unity and another text for any other number, in many languages it is more complicated than that. For example, in some languages the singular form is used for all numbers ending with the digit 1, so it would be wrong to use the singular form only for exactly 1. Furthermore, in some languages more than two texts are needed, for example three: one for all numbers ending in 1, the second for all numbers ending in 2, 3, 4, and the third for all other numbers.
To handle this diversity of plural forms, the PO format implements plural messages. The example above in reality looks like this:
#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] ""
msgstr[1] ""
The English singular form is given by the msgid string, and the plural form by the msgid_plural string. There are now several msgstr strings, with zero-based indices in square brackets, so that you can write as many translations as there are plural forms in your language. By default two msgstr strings will be given, but you may insert a line with a third one (index 2), and so on. For example, Spanish has the same plural forms as English, and a translation into it looks like this:
#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Tiempo: %1 segundo"
msgstr[1] "Tiempo: %1 segundos"
while the Polish translation, which needs three plural forms, is:
#: mainwindow.cpp:127
#, kde-format
msgid "Time: %1 second"
msgid_plural "Time: %1 seconds"
msgstr[0] "Czas: %1 sekunda"
msgstr[1] "Czas: %1 sekundy"
msgstr[2] "Czas: %1 sekund"
But how will the program know which plural form corresponds to which numbers? The specification for this is written within the PO file itself, in the file header (PO headers will be explained later). The specification consists of the number of plural forms which every plural message in the given PO file should have, and a computable logical expression which, for any given number, computes the index of the required plural form. This expression is quite cryptic to the untrained eye, but you do not really have to understand how it works. Since it is constant for a given language, you can just copy it from any other translated PO file with plural forms, and by observing the plural messages in that other file, you will clearly see which form (by index of msgstr) is used in which situation. Bearing this in mind, just to complete the examples, here is the plural specification for Spanish:
nplurals=2; plural=n != 1;
and for the more complicated Polish plural:
nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);
The nplurals variable tells how many forms there are, and plural is the expression which computes the index of the msgstr string for a given number n. (If the syntax of the expression looks familiar, that is because you know some of the C programming language.)
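To see which msgstr index this selects for which numbers, the Polish expression above can be transcribed into Python. This is a direct transcription of the header expression for illustration, not code from any actual tool:

```python
def polish_plural_index(n):
    """Index of the msgstr string for number n, per the Polish header:
    plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)"""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2

forms = ["Czas: %d sekunda", "Czas: %d sekundy", "Czas: %d sekund"]
for n in (1, 2, 5, 12, 22, 101):
    print(forms[polish_plural_index(n)] % n)
```

Running this shows, for instance, that 2 and 22 take the second form (sekundy), while 5, 12 and 101 take the third (sekund), exactly the distinctions that a two-form language cannot express.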
Sometimes you will come upon a message, or a pair of messages, just like the hypothetical example above: having a number in it, but not presented as a plural message, when you clearly see it should be. In most programming environments today, this simply means that the programmer forgot to use a plural message. Since this is considered a bug, you should inform the authors so that they can replace the ordinary message with a plural message. In some environments, however, programs are not capable of using plurals, mostly when the PO format is used as an intermediate (e.g. for OpenOffice programs). If that is the case, you can only try to translate the message in the "least bad" way.
Quite frequently the English singular form will omit the number; that is, only the plural form will contain the format directive for the number:
#: modes/typesdialog.cpp:425
#, kde-format
msgid "Are you sure you want to delete this type?"
msgid_plural "Are you sure you want to delete these %1 types?"
msgstr[0] ""
msgstr[1] ""
Whether it is allowed to omit the number like this depends on the programming environment. For example, in KDE programs (kde-format flag) this is always possible, as it is in Gnome programs (c-format), but not in pure Qt programs (qt-format). If number omission is supported, in the translation you can either omit or retain the number in the singular, according to what is better for your language, and regardless of whether or not the number was omitted in the original. More precisely, you can omit the number in any plural form that is used for exactly one number. Conversely, if all forms are used for more than one number (e.g. the singular form is used for all numbers ending in digit 1), you cannot omit the number at all.
On rare occasions the plural message will have no number in the original, in either singular or plural. This happens when the programmer merely wanted to choose between the forms for "one" and "several", like this:
#: kgpg.cpp:498
msgid "Decryption of this file failed:"
msgid_plural "Decryption of these files failed:"
msgstr[0] ""
msgstr[1] ""
In such cases, in the translation you should just use the same plural text for all forms except the one used for unity (if there is any such form).
At one point you will have translated the complete PO file, every message in it, and sent it back to the source where it is used. As time passes, the original text at the source is going to change. Programs will get bugs fixed and new features implemented, which will require both new strings in the user interface, and modifications to some of the existing. Documentation will get new chapters, old chapters expanded, old paragraphs modified to better style. At some point, you will want to update your translation so that the source is again fully translated into your language.
This is done in the following way. On one side, there is your last translated version of the PO file. On the other, there is the latest pristine PO, with untranslated messages corresponding to the current state of the source. Pristine PO files are called templates, and have the .pot extension instead of .po. The translated PO file and the template are then merged in a special way, producing a new, partially translated PO for you to work on. The technicalities of merging are not so important at first, as in any established translation project you can just fetch the latest merged PO files. More important is what you can expect to see in a merged PO file.
In general, merged PO files contain four categories of messages. First are those messages which were present in the PO file when you last worked on it, in the sense of having unchanged msgctxt and msgid strings since then. As expected, their translations (msgstr strings) are as you made them, so there is nothing new for you to do about these messages. The second category are entirely new messages, added to the source in the meantime, which you should now translate. New messages are not added in an arbitrary way, for example simply appended to the end of the PO file. Instead, they are interspersed with translated messages, following the order of appearance of messages in the current source. This allows you to continue to infer contexts from preceding and following messages, the same as when you were translating the PO from scratch. For example:
#: fitshistogram.cpp:347
msgid "Auto Scale"
msgstr ""

#: fitshistogram.cpp:350
msgid "Linear Scale"
msgstr "linearna skala"

#: fitshistogram.cpp:353
msgid "Logarithmic Scale"
msgstr "logaritamska skala"
The first message is a new one, untranslated, and the other two messages are old, translated earlier. From the old messages you can see that the new message is a new choice of scale (possibly for a diagram axis), and not, say, a command or option to change the size of something (as in "scale automatically").
The most interesting is the third category of messages in a merged PO file. These are the old messages which were somewhat modified in the meantime, i.e. one or both of their msgctxt and msgid strings have changed. Or, it can also be a new message which happens to be very similar to one of the old messages. There is actually no way to tell the two apart; it is only by similarity to one of the old messages that a modified or new message falls into this category. Either way, such a message is called fuzzy, and looks like this:
#: src/somwidget_impl.cpp:120
#, fuzzy
#| msgid "Elements with boiling point around this temperature:"
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"
The fuzzy flag indicates that the message is fuzzy. The comment starting with #| is called the previous-string comment. It contains the previous value of the msgid string, for which the translation in msgstr was made. That translation is, however, not valid for the current (non-commented) msgid string. By comparing the previous and current msgid, you can see that the word "boiling" was replaced with "melting", and you can adjust the translation accordingly. Once you have done that, to unfuzzy the message you should remove the fuzzy flag and the previous-string comments (#|), so that the final updated message is:
#: src/somwidget_impl.cpp:120
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom topljenja u blizini ove temperature:"
Previous-string comments are still a somewhat fresh addition to the PO format, which means that in some translation environments you will not have them in merged POs. The fuzzy message is then presented only with the fuzzy flag:
#: src/somwidget_impl.cpp:120
#, fuzzy
msgid "Elements with melting point around this temperature:"
msgstr "Elementi s tačkom ključanja u blizini ove temperature:"
It may seem that this is no great loss: so long as you are visually comparing texts, instead of comparing the previous (here missing) and current msgid, you might as well compare the current msgid and the old translation in msgstr, and adjust the translation based on that. However, there are two disadvantages to this. Less importantly, it may not always be easy to spot a difference by comparing the new original and the old translation. For example, only a typo or some punctuation may have been fixed in the original, leaving you to wonder if you are missing something. More importantly, a dedicated PO editor can use the previous and current msgid to highlight the differences between them, which makes it that much easier to see what has changed. Even if you are working with an ordinary text editor, there are command-line tools which can embed differences into the previous msgid, again making them easier to spot. And the bigger the message, the more important automatic highlighting becomes; think of a long paragraph where only one word has been changed. For these reasons, if the merged PO files you work on do not have previous-string comments, do inquire with the authors whether they can enable them (they may simply not know about this possibility, as it is not the default behavior on merging).
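With GNU Gettext, previous-string comments are produced by passing the --previous option to msgmerge, which is exactly the non-default behavior mentioned above. A sketch of assembling that command from Python (the file names are hypothetical, and msgmerge itself must of course be installed to actually run it):

```python
def merge_with_previous(po_path, pot_path):
    """Build the msgmerge command that merges a translated PO file with
    its template in place, keeping previous strings as #| comments."""
    return ["msgmerge", "--update", "--previous", po_path, pot_path]

# Hypothetical file names; run e.g. with subprocess.run(cmd, check=True)
cmd = merge_with_previous("sr.po", "template.pot")
print(" ".join(cmd))  # msgmerge --update --previous sr.po template.pot
```

In most translation projects this invocation sits in a maintenance script, so as a translator you would only ask for --previous to be added there.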
Other than msgid, the msgctxt string can also have a corresponding previous-string comment. Regardless of whether one or both of the msgctxt and msgid strings have changed, both will be given in previous-string comments:
#: kstarsinit.cpp:451
#, fuzzy
#| msgctxt "Constellation Line"
#| msgid "Constell. Line"
msgctxt "Toggle Constellation Lines in the display"
msgid "Const. Lines"
msgstr "Linija sazvežđa"
In particular, a message will be made fuzzy if it previously had no msgctxt and got one after merging, or had one and lost it. In the first case, the previous-string comments will contain only the msgid, although it may be the same as the current one; by this you will know that the change was only the addition of context. In the second case, the previous-string comments will contain both the msgctxt and the msgid strings, while there will be no current msgctxt. Here are two examples:
#: kstarsinit.cpp:444
#, fuzzy
#| msgid "Solar System"
msgctxt "Toggle Solar System objects in the display"
msgid "Solar System"
msgstr "Sunčev sistem"

#: finddialog.cpp:102
#, fuzzy
#| msgctxt "object name (optional)"
#| msgid "Andromeda Galaxy"
msgid "Andromeda Galaxy"
msgstr "Andromeda, galaksija"
It is important for a message to become fuzzy when only the disambiguating context is added or removed, because this has been done precisely to shed some light on the original text, which may require modifying the translation.
Fuzzy messages are a special category only from the translator's viewpoint. Consumers of PO files (programs, etc.) will treat them as ordinary untranslated messages, i.e. they will use the original instead of the old translation. This is necessary, as there is no telling how inappropriate the old translation may be for the current original. The algorithm that produces fuzzy messages will sometimes turn up rather strange pairings, which may not look similar at all to you or to the user.
It is important to keep in mind that fuzzy messages are treated as untranslated. Fresh translators will sometimes manually add the fuzzy flag to a message to mark that they are not entirely sure the translation is proper, not knowing that this completely excludes the translation from use. Thus, you should manually add the fuzzy flag only when you are so unsure of the meaning of the message that you explicitly want to prevent the translation from being used; this is fairly rarely needed. Instead, when you just want to mark the message so that you or someone else can check it later, write your doubts in a translator comment.
The last, fourth category are obsolete messages, messages which are no longer present in the source. All obsolete messages are grouped at the end of the merged PO file, and fully commented out with the #~ comment:
#~ msgid "Set the telescope longitude and latitude."
#~ msgstr "Postavi geo. dužinu i širinu teleskopa."
Obsolete messages have no extracted comments or source references, as they are no longer present in the source. Translator comments and flags are retained, as they don't depend on the presence in the source.
It could be said that obsolete messages are in fact no messages at all, given that they do not exist from the point of consumers of the PO file, and there is nothing for translators to do with them. PO tools in general will ignore them, except to preserve them when the PO file is modified. Dedicated PO editors will invariably not show obsolete messages to the translator, and may provide an option to automatically remove them from the file on saving.
What is then the purpose of obsolete messages? It frequently happens that a section of the source content, e.g. the code around a certain feature of a program, is temporarily removed. Authors sometimes want to improve a section of the text separately, outside of the main content which is being translated, and sometimes a section is even briefly omitted by mistake when there are moves and renames in the source. When this happens, the affected messages will become obsolete in the merged PO; but, when the missing section is put back into the source, the merging algorithm will take obsolete messages into account, and promote them to real messages (either translated or fuzzy) where possible. Thus, some previous translation work may be saved.
What you should do with obsolete messages depends on the tools with which you work on PO files. For example, if you and the other translators working on the given PO all use dedicated PO editors with internal storage of all previously encountered translations, the translation memory[7], there is less need for keeping obsolete messages around, as the editor will be able to fill in new messages from the memory; but there are some difficulties, such as the need for translators to share the same memory. In practice, many translators choose to keep obsolete messages around for some time, and periodically (e.g. months apart) remove them from PO files. This way, accidental removals of source content which are quickly corrected do not bother them, while the accretion of far too much obsolete material is avoided.
In light of translation maintenance through the process of merging with templates, you can think of starting to work on a never-before translated PO file as just the "initial merging": you take the template, rename it to something with the .po
extension, and work from there. What you rename it to depends on the environment, but it is usually one of two things: either the same name as that of the template but with the .po
extension (as in KDE), or your language code with the .po
extension (as in Gnome). This basically depends on the organization of the particular translation project.
On the other hand, sometimes for each template in the project an empty PO for your language will have been created and put in a proper place in the source tree, so that you can just start translating it when you get to it.
At any rate, when you start working on a PO file from scratch, the first thing you should do is fill out its header.
The very first message in each PO file is not a real message, but the header, which provides administrative and technical pieces of information about the PO file. Here is one pristine header, before any translation on the PO file has been done:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR This_file_is_part_of_KDE
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-03 10:09+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <kde-i18n-doc@kde.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=INTEGER; plural=EXPRESSION;\n"
The header consists of introductory comments, followed by the empty msgid, and by the msgstr which contains the header fields. The header comments, similar to those of normal messages, are not entirely free-form, but have some structure to them. The msgstr is divided by newlines (\n) into fields of name: value form (the name of the piece of information and the information itself). Although the header is pristine, some of the environment-dependent values are typically already supplied, e.g. wherever KDE is mentioned in this example. The fuzzy flag indicates that the PO file has not been translated earlier. All-uppercase text segments are placeholders which you should replace with real values.
The header updated to reflect the translation state could look like this:
# Translation of kstars.po into Spanish.
# This file is distributed under the same license as the kdeedu package.
# Pablo de Vicente <pablo@foo.com>, 2005, 2006, 2007, 2008.
# Eloy Cuadra <eloy@bar.net>, 2007, 2008.
msgid ""
msgstr ""
"Project-Id-Version: kstars\n"
"Report-Msgid-Bugs-To: http://bugs.kde.org\n"
"POT-Creation-Date: 2008-09-01 09:37+0200\n"
"PO-Revision-Date: 2008-07-22 18:13+0200\n"
"Last-Translator: Eloy Cuadra <eloy@bar.net>\n"
"Language-Team: Spanish <kde-l10n-es@kde.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"
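Since the header fields are just name: value pairs separated by newlines in the msgstr, they are easy to pick apart programmatically. Here is a minimal Python sketch of doing so (a hypothetical helper, not part of Pology):

```python
def parse_header_fields(msgstr):
    # Split a PO header msgstr into a name -> value mapping.
    fields = {}
    for line in msgstr.split("\n"):
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields

header = ("Project-Id-Version: kstars\n"
          "PO-Revision-Date: 2008-07-22 18:13+0200\n"
          "Plural-Forms: nplurals=2; plural=n != 1;\n")
fields = parse_header_fields(header)
print(fields["Project-Id-Version"])  # kstars
```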
Although this particular header has been slightly abridged for clarity, it probably still looks menacing, with a lot of data. Are you supposed to get all of that correct manually? Not really. If you are using a dedicated PO editor, it will have a comfortable configuration dialog where you can enter data about yourself, your language, and so on, and whenever you save a PO file, the editor will automatically fill out the header. If you are using a plain text editor, there are command line tools to similarly fill out the header automatically. But even with such aids, it is useful to give a few general directions about header comments and fields.
The first comment line usually has the title role, saying something about what is translated and into which language. The second comment tells something about licensing. The following comments each list a translator who at one time worked on this particular PO file, with name, email address, and years of contribution. After that, any freeform comments may be added. The fuzzy
flag is removed once the work on the PO file is started.
The Project-Id-Version
header field states the name and possibly version of what is translated, Report-Msgid-Bugs-To
gives address to write to when you discover problems in original text, POT-Creation-Date
the time when the PO template was created, PO-Revision-Date
the time when the PO file was last edited by a translator, Last-Translator
the name and address of last translator who worked on the file, and Language-Team
the name and address of the translation team (if any) which the last translator is part of. The fields MIME-Version
, Content-Type
, and Content-Transfer-Encoding
, are pretty much always and for any language as given above, so they are not interesting (though you could change encoding to something else than UTF-8, in this day and age really think thrice before doing that). The final field, Plural-Forms
, is where you write the plural specification for your language (as explained in the section on plural forms).
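To see how a Plural-Forms expression behaves, the Spanish rule from the example header (nplurals=2; plural=n != 1) can be transcribed into Python (illustration only; Gettext evaluates the C expression itself):

```python
def plural_index(n):
    # The C expression "n != 1" evaluates to 0 or 1,
    # selecting msgstr[0] or msgstr[1] respectively.
    return int(n != 1)

for n in (0, 1, 2, 5):
    print(n, "->", "msgstr[%d]" % plural_index(n))
```

Only n equal to 1 selects the singular form msgstr[0]; every other count, including 0, selects the plural form msgstr[1].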
Of the presented comments and fields, almost all of them are set when the PO file is translated for the first time. When you come back to a certain PO to update the translation, if no one else worked on that PO in the meantime, you should only update the PO-Revision-Date
field. If someone has worked on it, you will also have to put your data in Last-Translator
field. If you get to work on a PO file for the first time after someone else has already worked on it, you should add yourself in the translator list in comments. If you are using a dedicated PO editor, it will perform all these updates for you whenever you save the file.
Note that everything in the header is supposed to be in English, to be understandable to people who do not speak your language. Aside from comments in English, this also means that the name of the language and the language team should be in English, and your own name and names of other translators in their romanized equivalents. This is because, for example, people speaking other languages may need to contact you or your team about any technical problems in the translation (e.g. program maintainers). Keep this in mind also when you are setting up your data in a dedicated PO editor.
Other than the standard header fields, you may encounter some custom fields, whose names begin with X-
. These fields are added by various PO processing tools. One typical custom field is X-Generator
, where the dedicated PO editor which you use will write its name and version. Another custom field sometimes seen is X-Accelerator-Marker
, which states the character used as the accelerator marker (recognized by some tools e.g. for searching through PO files, when otherwise the accelerator marker could "mask" a word by being in the middle of it). Different translation environments may add various environment-specific fields for their internal use.
When you translate PO files using a plain text editor, all the message elements will be displayed in it as we have seen in the examples so far. You can edit them at will, including invalidating the syntax if you are not careful. Most capable text editors nowadays have syntax highlighting for the PO format, albeit with different levels of specificity. If you are working with a plain text editor, you should definitely use a command line tool to check the basic correctness of the PO file. msgfmt from the Gettext package is one such tool (use it with the -c
option).
Dedicated PO editors will provide you with much more automation, but each will have its own ways of presenting and means of editing different elements of a message. As this text has tried to convince you, every element of the PO message is potentially important, so you should take time to find out how and where the given PO editor shows them. Some editors may not show all elements of the messages, which in the opinion of the author of this text reflects poorly on them. At the extreme end, immediately discard an editor which shows you only the original text (the msgid
string), regardless of any other qualities it may have (this is typical of translation editors not developed around the PO format, but later upgraded to "support" it).
Here is the summary of PO message elements, as a checklist of what to look for in a PO editor:
msgid
string (original text)
msgstr
string (translated text)
msgctxt
string (disambiguating context)
extracted comments (context in comment)
source references (source file and line of the message)
flags (fuzzy
, *-format
, etc.)
fuzzy state (although among flags, requires special attention)
previous strings (previous msgctxt
and msgid
strings in fuzzy messages)
translator comments (added by translators, therefore they should be editable as well)
positional context (good view of preceding and following messages)
There are a number of dedicated PO editors available. They all have the same good basic support for the PO format, but each has some specialities and quirks that reflect the background of their authors. Namely, dedicated PO editors are normally written and maintained by people who are themselves engaged in certain translation projects. You should therefore try out the available editors and choose the one which is best suited to you, and possibly to the translation project within which you translate. Here is the list of some dedicated PO editors:
PO editor developed within the Gnome translation project.
Computer-aided translation tool developed within the KDE translation project.
Cross-platform, lightweight PO editor.
Translation editor designed to be visually compact and easy to use, yet powerful.
Some plain text editors can operate in modes, where additional editing commands become available when a file of a certain type is opened. Such a mode for PO files is available for the following text editors: Emacs, Gedit, Vim.
[3] You may want to point to a message when consulting with fellow translators, or when reporting a typo or another problem in the original text to the authors.
[4] If the msgctxt
is present but empty, i.e. msgctxt ""
, this is actually different than the msgctxt
not being present at all. Hence the term "null-valued" as opposed to simply "empty".
[5] Programmers of free software are frequently aware of this latent necessity, and readily reachable, so you should be able to make the request with little communication overhead.
[6] CJK is the usual acronym for the ideographic East Asian languages: Chinese, Japanese, and Korean.
[7] Translation memory is an extremely important topic on its own when the translation is not done using the PO format. With PO files and the concept of merging with templates, translation memories are not of such great importance, but can come in handy.
A translator may want to apply batch-type operations to every message in a single PO file or in a collection of PO files, such as searching and replacing text, computing statistics, or validating. However, batch-processing tools for general plain text (grep, sed, awk, etc.) are not very well suited to processing PO files. For example, when looking for a particular word, a generic search tool will not see it if it contains an accelerator marker; or, when looking for a two-word phrase, a generic tool will miss it if the phrase is wrapped across lines. Therefore many tools tailored specifically for batch-processing messages in PO files have been developed, such as those bundled with Gettext (msggrep, msgfilter, msgattrib...), or those from the Translate Toolkit (pocount, pogrep, pofilter...).
Pology also provides a per-message batch-processing tool, the posieve. What was the need for it, given the myriad of other previously available and powerful tools? In accordance with the philosophy of Pology, posieve goes deeper than these other tools. posieve makes easy that which is possible but awkward to do by combining generic command line tools. posieve is modular from the ground up, such that it is never a design problem to add new functionality to it, even when it is of narrow applicability. Users who know some Python can even write their own "plugins" for it. Several processing modules can be applied in a single run of posieve, possibly affecting each other, in ways not possible by generic shell piping and without requiring temporary intermediate files.
The posieve script itself is actually a simple shell for applying various processing modules, called sieves, to every message in one or more PO files. Some sieves can also request to operate on the header of the PO file, which posieve will then feed to them. A sieve can both examine and modify messages; if any message is modified, by default the modified PO file will be written out in place. Naturally, posieve has a number of options, but more interestingly, each sieve can define some parameters which determine its behavior. Pology comes with many internal sieves, which do things from general to obscure (possibly language or project specific), and users can define their own sieves.
Here is how you would run the stats sieve to collect statistics on all PO files in frobaz/
directory:
$ posieve stats frobaz/
While PO files in frobaz/
are being processed, you will see a progress bar with the current file and the number of files to process, and after some time the stats sieve will present its findings in a table.
The first non-option argument in the posieve command line is the sieve name, and then any number of directory or file paths can be specified. posieve will consider file path arguments to be PO files, and recursively search directory paths to collect all files ending with .po
or .pot
. If no paths are specified, PO files to process will be collected from the current working directory.
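The path collection rule can be sketched in Python (a hypothetical helper for illustration, not Pology's actual code):

```python
import os

def collect_po_paths(args, exts=(".po", ".pot")):
    # File arguments are taken as PO files; directory arguments
    # are searched recursively for files ending in .po or .pot.
    paths = []
    for arg in args:
        if os.path.isdir(arg):
            for root, _dirs, files in os.walk(arg):
                paths.extend(os.path.join(root, f)
                             for f in files if f.endswith(exts))
        else:
            paths.append(arg)
    return sorted(paths)
```

For example, `collect_po_paths(["frobaz/"])` would return all PO files and templates found anywhere under frobaz/.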
If the sieve modifies a message and the new PO file is written out in place of the old, the user will be informed by an exclamation mark followed by the file path. An example of a sieve which modifies messages is the tag-untranslated sieve; it adds the untranslated
flag to every untranslated message, so that you can look them up in a plain text editor (as opposed to dedicated PO editor):
$ posieve tag-untranslated frobaz/
! frobaz/alfa.po
! frobaz/bravo.po
! frobaz/charlie.po
Tagged 42 untranslated messages.
posieve itself tracks message modifications and informs about modified PO files, whereas the final line in this example has been output by the tag-untranslated sieve. Sieves will frequently issue such final reports of their actions.
If a sieve defines some parameters to control its behavior, these can be issued using the -s option. This option takes the parameter specification as the argument, which is of the form name:value, or just name for switch-type parameters. More than one parameter can be issued by repeating the -s option. For example, the stats sieve can be instructed to take into account only messages with at most 5 words:
$ posieve stats -s maxwords:5 frobaz/
to show statistics in greater detail:
$ posieve stats -s detail frobaz/
or to ignore a certain accelerator marker and show bar-type statistics instead of tabular:
$ posieve stats -s accel:_ -s msgbar frobaz/
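The name:value parameter syntax can be sketched as follows (a hypothetical parser for illustration, not posieve's actual code):

```python
def parse_sieve_param(spec):
    # "name:value" gives a valued parameter; a bare "name"
    # is a switch-type parameter, represented here as True.
    name, sep, value = spec.partition(":")
    if not sep:
        return name, True
    return name, value

print(parse_sieve_param("maxwords:5"))  # ('maxwords', '5')
print(parse_sieve_param("detail"))      # ('detail', True)
```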
posieve lists and shows descriptions of its options by the usual -h
/--help
option. Help for a sieve can be requested by issuing the -H
/--help-sieves
while a sieve name is present in the command line. All available internal sieves with short descriptions are listed using -l
/--list-sieves
.
Some sieves are language-specific, which can be seen from their names being of the form langcode:name. These sieves are primarily intended for use on PO files translated into the indicated language, but depending on particularities, they may be applicable to several closely related languages. (A sieve which does language-specific things, but is applicable to many languages, is more likely to be named as a general sieve.)
If shell completion is active, it can be used to complete sieve names and their parameters.
It is possible to issue several sieves at once, by passing a comma-separated list of sieve names to posieve in place of single sieve name. This is called a sieve chain.
At minimum, chaining sieves is a performance-improving measure, since each PO file is opened (and possibly written out) only once, instead of once per sieve run. For example, in a single run you can compute the statistics to see how many messages need to be updated, and tag all untranslated messages:
$ posieve stats,tag-untranslated frobaz/
! frobaz/alfa.po
! frobaz/bravo.po
! frobaz/charlie.po
... (table with statistics) ...
Tagged 42 untranslated messages.
A message in the PO file is passed through each sieve in turn, in the order in which they are issued, before proceeding to the next message. If a sieve modifies the message, the next sieve in the chain will operate on that modified version of the message. This means that the ordering of sieves in the command line is significant in general; it is interchangeable only if the sieves in the chain are independent of each other (as in this example). The chain order also determines the order in which sieve reports are shown; if in this example the order had been tag-untranslated,stats
, then first the tagged messages line would be written out, followed by the statistics table.
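Conceptually, a sieve chain passes each message through the sieves in order, so a later sieve sees any modifications made by an earlier one. A toy Python model (the message dicts and sieve callables are invented for illustration; this is not Pology's API):

```python
# Toy model: messages as dicts, sieves as callables applied in order.
def run_chain(sieves, messages):
    for msg in messages:
        for sieve in sieves:
            msg = sieve(msg)  # the next sieve sees any modifications

def tag_untranslated(msg):
    # Tag messages with empty translation, like the tag-untranslated sieve.
    if not msg["msgstr"]:
        msg.setdefault("flags", []).append("untranslated")
    return msg

stats = {"untranslated": 0}

def count_tagged(msg):
    # Counts tags; sees them only because tag_untranslated ran first.
    if "untranslated" in msg.get("flags", []):
        stats["untranslated"] += 1
    return msg

messages = [{"msgid": "Sun", "msgstr": "Sol"},
            {"msgid": "Moon", "msgstr": ""}]
run_chain([tag_untranslated, count_tagged], messages)
print(stats)  # {'untranslated': 1}
```

Reversing the chain order here would make count_tagged run before any tags exist, illustrating why ordering is significant for dependent sieves.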
Other than for performance, sieve chains are useful when messages should be modified in a particular way before another sieve gets to operate on them. A good example is computing statistics on PO files which contain old embedded contexts, where, if nothing were done, the contexts would add to the word count of the original text. To avoid this, a context normalization sieve (which converts embedded contexts to msgctxt
) can be chained with the stats sieve, and posieve instructed not to write modifications to the PO file. If the embedded context is of the single-separator type, with separator character |
, the sieve chain is:
$ posieve --no-sync normctxt-sep,stats -s sep:'|' frobaz/
Converted 21 separator-embedded contexts.
... (table with statistics) ...
The --no-sync
option prevents writing modified messages in the PO file on disk. Note that |
as parameter value is quoted, because it would be interpreted as a shell pipe otherwise.
Finally, some sieves can stop messages from being pushed further through the sieve chain, so they can be used as prefilters for other sieves. The archetypal example of this is find-messages, which stops non-matched messages from further sieving. For example, to include in the statistics only the messages containing the word "quasar", this would be executed:
$ posieve find-messages,stats -s msgid:quasar -s nomsg
Found 12 messages satisfying the conditions.
... (table with statistics) ...
The msgid:
parameter specifies the word (actually, a regular expression) to be looked up in the original text, while nomsg
parameter tells find-messages not to write matched messages to standard output, which it would do by default. Note that no path was specified, meaning that all PO files in the current working directory and below will be sieved.
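The prefiltering behavior can be modeled by letting a sieve drop a message from the rest of the chain (a conceptual sketch only, not posieve's actual mechanism):

```python
# A sieve returning None drops the message from further sieving.
def run_chain(sieves, messages):
    for msg in messages:
        for sieve in sieves:
            msg = sieve(msg)
            if msg is None:
                break  # stop pushing this message through the chain

matched = []

def find_quasar(msg):
    # Prefilter: pass through only messages mentioning "quasar".
    return msg if "quasar" in msg["msgid"].lower() else None

def collect(msg):
    # Stands in for a downstream sieve such as stats.
    matched.append(msg)
    return msg

messages = [{"msgid": "Charybdis Quasar"}, {"msgid": "Telescope"}]
run_chain([find_quasar, collect], messages)
print(len(matched))  # 1
```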
Examples of sieve chaining so far should have raised the following question: when several sieves are issued, to which of them are the parameters specified by -s
options passed? The answer is that a parameter is sent to all sieves which accept parameter of that name. Continuing the previous example, if message texts can contain accelerator marker &
, this would be specified like this:
$ posieve find-messages,stats -s msgid:quasar -s nomsg -s accel:'&'
find-messages will accept accel
in order to also match messages like "Charybdis Q&uasar"
, while stats will use it to properly split text into words for counting them.
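The effect of the accel parameter on matching can be illustrated with a small sketch (hypothetical helpers; the actual sieves implement this internally):

```python
import re

def strip_accel(text, accel="&"):
    # Remove the accelerator marker before matching words.
    return text.replace(accel, "")

def contains_word(text, word, accel="&"):
    # Word lookup that is not fooled by a marker inside the word.
    clean = strip_accel(text, accel)
    return re.search(r"\b%s\b" % re.escape(word),
                     clean, re.IGNORECASE) is not None

print(contains_word("Charybdis Q&uasar", "quasar"))  # True
print("quasar" in "Charybdis Q&uasar".lower())       # False: naive search misses it
```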
Options specific to posieve:
-a
, --announce-entry
A sieve may be buggy and crash, or keep posieve in an infinite loop, on a particular PO entry (header or message). When this option is given, each PO entry will be announced before it is sieved, so that you can see exactly where the problem occurs.
-b
, --skip-obsolete
By default posieve will process all messages in the PO file, including obsolete ones. Sometimes sieving obsolete messages is not desired, for example when running translation validation sieves. This option can then be used to skip obsolete messages.
-c
, --msgfmt-check
For posieve to process a PO file, it is only necessary that the basic PO syntax is valid, i.e. that msgfmt can compile the file. msgfmt also offers a stricter validation mode; to have posieve run this stricter validation on PO files, issue this option. Invalid files will be reported and will not be sieved.
--force-sync
When some messages in the PO file are modified, by default only those messages will be reformatted (e.g. strings wrapped as selected) when the PO file is written back to disk. This makes posieve friendly to version control systems. Sometimes, however, you may want all messages to be reformatted, modified or not, in which case you can issue this option.
-h
, --help
General help on posieve.
-H
, --help-sieves
-h
/--help
shows only the description of posieve and its options, while this option shows the descriptions and available parameters of the issued sieves. For example:
$ posieve find-messages,stats -H
would output help for find-messages and stats sieves.
--issued-params
List of all sieve parameters and their values that would be issued. Used to check the interplay of command line and configuration on sieve parameters.
-l
, --list-sieves
List of all internal sieves, with short descriptions.
--list-options
; --list-sieve-names
; --list-sieve-params
Simple listings of global options, internal sieve names, and parameters of issued sieves. Intended mainly for writing shell completion definitions.
-m OUTFILE
, --output-modified=OUTFILE
If some PO files were modified by sieving, you may want to follow up with a command to process only those files. posieve will by default output the paths of modified PO files, but mixed with other information, which makes parsing this output for modified paths ungainly. Instead, this option can be used to specify a file to which the paths of all modified PO files will be written, one per line.
--no-skip
If a sieve reports an error, posieve normally skips the problematic message and continues sieving the rest of the PO file, if possible. This is sometimes not desired, in which case this option tells posieve to abort with an error message instead.
--no-sync
All messages modified by sieves are by default written back to disk, i.e. their PO files are modified. This option prevents modification of PO files. This comes in handy in two cases. One is when you want to check what effect a modifying sieve would have before actually accepting it (a "dry" run). The other is when you use a modifying sieve as a filter for the next sieve in the chain, which only needs to examine messages.
-q
, --quiet
posieve normally shows the progress of sieving, which this option suppresses. (Sieves will still output their own lines.)
-s PARAM
[:VALUE
]
The central option of posieve, which is used to issue parameters to sieves.
-S PARAM
When a sieve parameter is issued through user configuration, this option can be used to cancel it for one particular run.
--version
Release and copyright information on posieve.
-v
, --verbose
More verbose output, where posieve shows the sieving modes, lists files which are being sieved, etc.
Options common with other Pology tools:
-F FILE
, --files-from=FILE
-e REGEX
, --exclude-name=REGEX
; -E REGEX
, --exclude-path=REGEX
; -i REGEX
, --include-name=REGEX
; -I REGEX
, --include-path=REGEX
-R
, --raw-colors
; --coloring-type
The following configuration fields can be used to modify general behavior of posieve:
[posieve]/skip-on-error=[*yes|no]
Setting to no
is counterpart to --no-skip
command line option.
[posieve]/msgfmt-check=[yes|*no]
Setting to yes
is counterpart to -c
/--msgfmt-check
command line option.
[posieve]/skip-obsolete=[yes|*no]
Setting to yes
is counterpart to -b
/--skip-obsolete
command line option.
For configuration fields that have counterpart command line options, the command line option always takes precedence if issued.
Configuration can also be used to issue sieve parameters, by specifying [posieve]/param-name fields. For example, the parameters transl (a switch) and accel (with value &) are issued to all sieves that accept them by writing:
[posieve]
param-transl = yes
param-accel = &
To issue parameters only to certain sieves, the parameter name can be followed by a sieve list of the form /sieve1,sieve2,...; to prevent the parameter from being issued to certain sieves, prepend ~ to the sieve list. For example:
[posieve]
param-transl/find-messages = yes # only for find-messages
param-accel/~stats = & # not for stats
The same parameter can sometimes be repeated in the command line, when it is logically meaningful to provide several values of that type to a sieve. However, same-name fields cannot be used in the configuration to supply several values, because they override each other. Instead, a dot and a string unique within the sequence can be appended to the parameter name to make it a unique configuration field:
[posieve]
param-accel.0 = &
param-accel.1 = _
Strings after the dot can be anything, but a sequence of numbers or letters in alphabetical order is the least confusing choice.
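The dotted-suffix convention effectively turns several configuration fields into one multi-valued parameter. A sketch of how such fields could be collapsed (hypothetical illustration, not Pology's actual configuration parser):

```python
def collect_params(fields):
    # Collapse fields like "param-accel.0", "param-accel.1"
    # into one multi-valued parameter keyed by the base name.
    params = {}
    for name, value in fields.items():
        if not name.startswith("param-"):
            continue
        base = name[len("param-"):].split(".", 1)[0]
        params.setdefault(base, []).append(value)
    return params

print(collect_params({"param-accel.0": "&", "param-accel.1": "_"}))
# {'accel': ['&', '_']}
```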
Sieve parameters should be issued from the configuration only as a matter of convenience, when they are almost always used in sieve runs. But occasionally a parameter issued from the configuration is not appropriate for the given run. Instead of going to the configuration and commenting the parameter out temporarily, it can be cancelled in the command line using the -S
option (note capital S) followed by the parameter name. You can use --issued-params
option to confirm which parameters will be issued after both the command line and the configuration have been taken into account.
This section describes the sieves contained in the Pology distribution and provides instructions for their use.
Parameters which take a value (which are not switches) may or may not have a default value, and when they do, it will be given in square brackets ([...]
) in the header.
apply-filter is used to pipe translation through one or several hooks (see Section 9.10, “Processing Hooks”). The hooks may modify the translation, validate it, or do something else. More precisely, the following hook types are applicable:
F1A, F3A, F3C, to modify the translation and write changes back to the PO file;
V1A, V3A, V3C, to validate the translation, with standard validation output (highlighted spans and problem messages);
S1A, S3A, S3C, for any side-effect processing on translation (but no modification).
Parameters:
filter:hookspec
The hook specification. Can be repeated to add several hooks, which are then applied in the order of specification.
showmsg
Report every modified message to standard output. (For validation hooks, message is automatically reported if not valid.)
apply-header-filter is the counterpart to apply-filter to operate on headers instead of messages. Here the applicable hook types are accordingly F4B, V4B, S4B.
Parameters:
filter:hookspec
The hook specification. Can be repeated to add several hooks, which are then applied in the order of specification.
Sometimes it is possible to use simple pattern matching to discover things that should never appear in the text, such as common grammar or orthographical errors. bad-patterns can apply such patterns to translation, either as plain substring matching or regular expressions. Patterns can be given as parameters, or more conveniently, read from files.
Parameters:
pattern:string
The pattern to search for. Can be repeated to search for several patterns.
fromfile:path
Read patterns to search for from a file. Each line contains one pattern. If a line starts with #
, it is treated as a comment. Empty lines are ignored. Trailing and leading whitespace is removed from patterns; if it is significant, it can be given inside the [...]
regex operator. This parameter can be repeated to read patterns from several files.
rxmatch
By default patterns are treated as plain substrings. This parameter requests to treat patterns as regular expressions.
casesens
By default patterns are case-sensitive. This parameter makes them case-insensitive.
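The pattern file format described for the fromfile parameter can be sketched as (hypothetical reader for illustration, not the sieve's actual code):

```python
def read_patterns(lines):
    # One pattern per line; '#' lines are comments, empty lines
    # are ignored, surrounding whitespace is stripped.
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns

print(read_patterns(["# common typos", "teh", "  recieve ", ""]))
# ['teh', 'recieve']
```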
This sieve is deprecated. Use check-rules instead, which applies Pology's validation rules.
check-docbook4 checks PO files extracted from Docbook 4.x files. Docbook is an XML format, typically used for documenting software.
Parameters:
showmsg
Instead of just showing the message location and problem description, also show the complete message with problematic segments highlighted.
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
Currently performed checks:
Markup validity. Docbook is a complex XML format, and nothing short of full validation of the XML files generated from translated PO files can show whether the translation is technically valid. Therefore check-docbook4 checks only well-formedness, whether tags are defined by Docbook, and some nesting constraints, and only at the level of a single message. But this is already enough to catch the great majority of usual translation errors.
This check can be skipped on a message by adding to it the no-check-markup
translator flag.
Message insertion placeholders. Some Docbook extractors split contextually separate units found in the middle of flowing paragraphs (e.g. footnotes) into standalone messages. When that happens, a special placeholder is left in the originating message, so that the markup can be reconstructed when the translated Docbook file is built. Such placeholders must be carried over into the translation.
check-grammar checks translation with LanguageTool, an open source grammar and style checker (http://www.languagetool.org/). LanguageTool supports a number of languages to a greater or lesser extent, which you can check on its web site.
LanguageTool can be run as a standalone program or in client-server mode, and this sieve expects the latter. This means that LanguageTool has to be up and running before this sieve is run. Messages in which problems are discovered are reported to standard output.
Parameters:
lang:code
The language code for which to apply the rules. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.
host:hostname
[localhost] Name of the host where the LanguageTool server is running. The default value of localhost means that it is running on the same computer where the sieve is run.
port:number
[8081] TCP port of the host on which the LanguageTool server listens for queries.
check-kde4 checks PO files extracted from program code based on the KDE4 library and its translation system. Note that this really means what it says: this sieve should not be used to check just any PO file which happens to be part of the KDE project (e.g. PO files covering .desktop files, pure Qt code, etc.).
Parameters:
strict
Partly due to historical reasons, and partly due to programmers being sloppy, the original text itself is sometimes not valid by some checks. By default, when the original is not valid, the translation is not expected to be valid either, i.e. it is not checked. This parameter requires that the translation is always checked, regardless of the validity of the original (problems can almost always be avoided in the translation).
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
Currently performed checks:
Markup validity. KDE4 messages can contain a mix of KUIT and Qt rich text markup. Although Qt rich text does not have to be well-formed in the XML sense, this check expects well-formedness to be preserved in translation if the original is such (also see the strict parameter).
This check can be skipped on a message by adding to it the no-check-markup translator flag.
check-rules applies language- and project-dependent Pology validation rules to translation. See Section 8.5, “Validation Rules” for detailed discussion on writing and applying rules.
Parameters:
lang:code
The language code for which to apply the rules. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.
env:environment
The language environment for which to apply the rules (see Section 8.1, “The Notion of Language in Pology”). Several environments can be given as a comma-separated list, in which case the later environment in the list takes precedence on conflicting rules. If not given, it may also be read from PO files (see X-Environment in Section 9.9, “Influential Header Fields”).
envonly
When language environment is given, only the rules explicitly belonging to it are applied, while general rules for the selected language are ignored.
rule:identifiers
Comma-separated list of rule identifiers, to apply only those rules. If a rule selected in this way is disabled in its definition, this enables it.
rulerx:regexes
Like rule, but the values are interpreted as regular expressions by which to match rule identifiers.
norule:identifiers
Inverse of the rule parameter: selected rules are not applied, and all others are applied.
norulerx:regexes
Inverse of the rulerx parameter: selected rules are not applied, and all others are applied.
stat
Rules can take time to apply to all sieved PO files, and this parameter requests that some statistics of rule application be written out at the end of sieving.
accel:characters
Characters to consider as accelerator markers. If not given, they may be read from sieved PO files. Note that this parameter in itself does nothing: it only makes it possible for a particular rule or group of rules to remove the accelerator before matching.
markup:types
The type of text markup used in messages, by keyword. It can also be a comma-separated list of keywords. If not given, it may be read from sieved PO files. See the description of X-Text-Markup in Section 9.9, “Influential Header Fields” for the list of markup keywords currently known to Pology. Similarly to the accel parameter, this parameter only enables rules to remove the markup (or do something else) before matching.
xml:file
By default, messages failed by rules are reported to standard output, and this parameter requests that they be written into a custom (but simple) XML format instead. This also causes results to be cached: on subsequent runs of check-rules only modified PO files will be checked again, and results for non-modified files will be pulled from the cache. The cache can be found in the $HOME/.pology-check_rules-cache/ directory.
rfile:file
By default, internal Pology rules are applied, and this parameter can be used to apply external rules instead, defined in the given rule file.
rdir:directory
Like rfile, but external rules are read from a directory containing any number of rule files.
branch:branch
Apply rules only to messages from given branch (summit). Several branches may be given as comma-separated list.
showfmsg
Rules are sometimes applied to the filtered instead of the original message, and when such message is failed, it may not be obvious what triggered the rule. This parameter requests that the filtered message is written out too when the original message is reported.
nomsg
When a message is failed, by default it is output in full together with the problem description. This parameter requests that only the problem description is output.
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
mark
To each failed message a failed-rule flag is added, modifying the PO file. Modified files can then be opened in the editor, and failed messages looked up by this flag.
byrule
As usual for sieving, by default each failed message is output as soon as it is processed. This parameter makes the failed messages output ordered by rules instead, where rules are sorted alphabetically by their identifiers. Note that this will cause there to be no output until all messages have been sieved.
One or more rules can be disabled on a particular message in the PO file itself, by adding a special translator comment that starts with skip-rule: and continues with a comma-separated list of rule identifiers:
# skip-rule: ruleid1, ruleid2, ...
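As an illustration only (this is not Pology's actual implementation, and the function name is invented), such a comment could be parsed with a few lines of Python:

```python
import re

def parse_skip_rule(comment):
    # Return the rule identifiers listed in a 'skip-rule:' translator
    # comment, or an empty list for any other comment.
    m = re.match(r"\s*skip-rule:\s*(.*)", comment)
    if not m:
        return []
    return [ident.strip() for ident in m.group(1).split(",") if ident.strip()]

print(parse_skip_rule("skip-rule: ruleid1, ruleid2"))
# ['ruleid1', 'ruleid2']
```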
check-spell checks spelling of translation by splitting it into words and passing them through GNU Aspell (http://aspell.net/). This sieve is a more specific counterpart to check-spell-ec: it exposes some options specific to Aspell, and requires no external Python modules, only the Aspell installation itself. Also read Section 8.2, “Spell Checking” for details on spell-checking in Pology.
check-spell behaves mostly the same as check-spell-ec, and accepts all the same parameters with the same meanings; the exception is the provider parameter, which is not present here since Aspell is the fixed provider. Only the parameters specific to this sieve are described in the following:
enc:encoding
The encoding in which the text should be sent to Aspell.
var:variety
The variety of the Aspell dictionary, if any.
skip:regex
Words matched by this regular expression are not sent to spell-checker.
case
Matching patterns given as parameter values (e.g. with skip:) are by default case-insensitive, and this parameter switches them to case-sensitive.
xml:file
By default, messages with unknown words are reported to standard output, and this parameter requests that they be written into a custom (but simple) XML format.
Aspell can be configured for use in Pology through user configuration, so that it is not necessary to issue some parameters on every run. See Section 9.2.4, “The [aspell] section”.
check-spell-ec uses the Enchant library (http://www.abisource.com/projects/enchant/) through the PyEnchant Python module (http://pyenchant.sourceforge.net) to provide uniform access to different spell-checkers, such as Aspell, Ispell, Hunspell, etc. Translation is first split into words, possibly eliminating markup and other literal content, and the words are then fed to the spell-checker. Messages containing unknown words are reported to standard output, with a list of replacement suggestions.
Parameters:
provider:keyword
The spell-checker that Enchant should use. The value is one of the keywords defined by Enchant (e.g. aspell, myspell...), and can be seen by running the enchant-lsmod command (only providers available on the system are shown). If not given either by this parameter or in user configuration, Enchant will try to select a provider on its own.
lang:code
The language code for which the spelling is checked. If not given, it will be read from each PO file in turn, and if not found there either, an error will be signaled.
env:environment
The language environment for which to include supplemental dictionaries (see Section 8.1, “The Notion of Language in Pology”). Several environments can be given as a comma-separated list, in which case the union of their dictionaries is used. If not given, environments may be read from PO files (see X-Environment in Section 9.9, “Influential Header Fields”) or from user configuration.
accel:characters
Characters to consider as accelerator markers, to remove them before splitting text into words. If not given, they may be read from PO files (see X-Accelerator-Marker in Section 9.9, “Influential Header Fields”).
markup:types
The type of text markup used in messages, by keyword. It can also be a comma-separated list of keywords. If not given, it may be read from PO files (see X-Text-Markup in Section 9.9, “Influential Header Fields”; the list of markup keywords currently known to Pology is given there as well).
skip:regex
Words matched by this regular expression are not sent to spell-checker.
case
Matching patterns given as parameter values (e.g. with skip:) are by default case-insensitive, and this parameter switches them to case-sensitive.
filter:hookspec
The hook to modify the text before splitting into words and spell-checking them (see Section 9.10, “Processing Hooks”). The hook type must be F1A, F3A, or F3C. The parameter can be repeated to add several hooks, which are then applied in the order of specification.
suponly
By default, internal supplemental spelling dictionaries are added to the system dictionary of the selected spell-checker. This parameter can be issued to instead use only internal dictionaries and not the system dictionary.
list
By default, when an unknown word is found, the complete message is output, with the problematic word highlighted and possibly the replacement suggestions. With this parameter, only a plain sorted list of unknown words, one per line, is output at the end of sieving. This is useful when a lot of false positives are expected, to quickly add them to the supplemental dictionary.
lokalize
Open the PO file on messages containing unknown words in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
check-spell-ec may be told to skip checking specific messages and words, and it may use internal supplemental spelling dictionaries. See Section 8.2, “Spell Checking” for these and other details on spell-checking in Pology.
Enchant can be configured for use in Pology through user configuration, so that it is not necessary to issue some parameters on every run. See Section 9.2.3, “The [enchant] section”.
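The word-splitting and skip-filtering step that the spell-checking sieves perform can be illustrated with a simplified sketch. This is purely illustrative: the function name is invented, the real implementation also strips markup and accelerator markers first, and it actually feeds the surviving words to a spell-checker.

```python
import re

def words_to_check(text, skip_regex=None):
    # Split text into candidate words; letters only, so "foobar2000"
    # yields just "foobar".
    words = re.findall(r"[^\W\d_]+", text, re.UNICODE)
    if skip_regex:
        # Mirror the default case-insensitive matching of skip:
        skip = re.compile(skip_regex, re.IGNORECASE)
        words = [w for w in words if not skip.search(w)]
    return words

print(words_to_check("Open the foobar2000 config file", r"foobar\w*"))
# ['Open', 'the', 'config', 'file']
```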
The KDE Translation Project contains a great number of PO files extracted from various types of sources. This means that for each message, there are things that the translation can, must, or must not contain for it to be technically valid. When run over PO files within the KDE TP, check-tp-kde will first try to determine the type of each message and then apply the appropriate technical checks to it. The message type is determined based on file location, file header, message flags and contexts; even a particular message in a particular file may be checked for some very specific issue.
"Technical" issues are those which should be fixed regardless of the language and style of translation, because they can lead to loss of functionality, information or presentation to the user. For example, a technical issue would be badly paired XML tags in translation, when in the original they were well paired; a non-technical issue (and thus not checked) would be when the original ends with a certain punctuation, but translation does not -- whether such details are errors or not, depends on the target language and translation style.
For the sieve to function properly, it needs to detect the project subdirectory of each PO file up to the topmost division within the branch, e.g. messages/kdebase or docmessages/kdegames. This means that the local copy of the repository tree needs to follow the repository layout up to that point; e.g. kde-trunk-ui/kdebase and kde-trunk-doc/kdegames would not be valid local paths.
Parameters:
strict
Sometimes the original text itself may not be valid against a certain check. When this is the case, by default the translation is not expected to be valid either, and the check is skipped. Issuing this parameter will force all checks on translation, regardless of whether the original is valid or not. It may still be possible to avoid some checks on those messages that just cannot be repaired through translation, if those checks define their own mechanism of cancellation (like adding a special translator comment).
check:keywords
Comma-separated list of checks to apply, by keyword, instead of all. Available checks are listed below.
showmsg
By default, when the message does not pass a check, only its location and the problem are reported. This parameter requests that the message be reported in full, possibly with problematic segments of the translation highlighted.
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
Currently available checks (keyword in parenthesis):
KDE4 markup checking (kde4markup).
Qt markup checking (qtmarkup).
Docbook markup checking (dbmarkup).
HTML markup checking (htmlmarkup).
No translation scripting in "dumb" messages (nots). Translations fetched at runtime by the KDE4 translation system may use translation scripting. This check will make sure that scripting is not attempted for other types of messages (used by Qt-only code, for .desktop files, etc.).
Qt datetime format messages (qtdt). A message is considered to be in this format if it contains the string qtdt-format in its msgctxt string or among flags.
Validity of translator credits (trcredits). PO files may contain meta-messages to input translator credits, which should have both valid translations on their own and some congruence between them.
Query placeholders in Plasma runners (plrunq). Messages in Plasma runners may contain the special query placeholder :q:, which should be present in translation too.
File-specific checking (catspec). Certain messages in certain PO files have special validity requirements, and this check activates all such file-specific checks.
All markup checks can be skipped on a message by adding the no-check-markup translator flag.
PO files of The Battle of Wesnoth contain a mix of well-known and custom markup and format directives. check-tp-wesnoth heuristically determines the type of each message in a Wesnoth PO file and applies appropriate technical checks to it (where "technical" has the same meaning as in the check-tp-kde sieve).
Parameters:
check:keywords
Comma-separated list of checks to apply, by keyword, instead of all. Available checks are listed below.
showmsg
Instead of just showing the message location and problem description, also show the complete message, possibly with highlighted problematic segments.
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
Currently available checks (keyword in parenthesis):
Stray context separators in translation (ctxtsep). Wesnoth still embeds disambiguating context into msgid, by putting it in front of the actual text, separated by ^. An unwary translator will sometimes mistake such context for part of the original text, and translate it too.
Congruence of WML interpolations (interp). WML interpolations look like "...side $side_number is..." and normally must match between the original and translation, or else the player would lose information. Only in very rare cases (e.g. some plurals and Markov chain generators) may some interpolations be missing in translation, and then they can be listed space-separated in a translator comment to silence the check:
# ignore-interpolations: interp1 interp2 ...
(the $ character is not necessary in the list).
WML markup checking (wml). If the WML in translation is not valid, the player may see visual artifacts. Also, links in WML must match between original and translation, to avoid loss of information.
Pango markup checking (pango). Pango is used in some places for visual text markup instead of WML.
Congruence of leading and trailing space (space). For many languages, significant leading and trailing space from the original should be preserved. A heuristic is used to determine when leading or trailing space is significant. Only languages explicitly specified internally are checked for this.
Docbook validity (docbook). Docbook is actually not used as a source format anywhere in Wesnoth, but the Wesnoth manual is converted into Docbook specifically to facilitate translation (weird as it may sound).
Man page validity (man).
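For illustration, the core of a check like interp could be sketched as follows. This is a simplified model, not Wesnoth's or Pology's actual code, and the function name is invented:

```python
import re

WML_INTERP = re.compile(r"\$(\w+)")

def missing_interpolations(msgid, msgstr, ignored=()):
    # Interpolations present in the original but absent from the
    # translation, minus those listed in an ignore-interpolations
    # comment (given here without the '$' character).
    orig = set(WML_INTERP.findall(msgid))
    trans = set(WML_INTERP.findall(msgstr))
    return sorted(orig - trans - set(ignored))

print(missing_interpolations("...side $side_number is...", "...strana je..."))
# ['side_number']
```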
Property maps (or pmaps for short) are one way in which arbitrary properties of language phrases can be defined for use in scripted translations, such as provided by Transcript, the translation scripting system in KDE 4.
A property map is a text file with a number of entries, each defining the properties of a certain phrase. A pmap entry starts with one or more keys and continues with an arbitrary number of key-value properties. An example entry would be the grammar declensions of a noun:
=/Athens/Atina/nom=Atina/gen=Atine/dat=Atini/acc=Atinu//
The first two characters define, in order, the key-value separator (here =) and the property separator (here /) for the current entry. The two separators can be any non-alphanumeric characters, and must be different. Then follows a number of entry keys delimited by property separators, and then a number of key-value properties, each internally delimited by the key-value separator. The entry is terminated by a double property separator. Properties of an entry can be fetched in the translation scripting system by any of the entry keys; keys are case- and whitespace-insensitive.
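A minimal sketch of how such an entry could be parsed (illustrative only, with no error handling; the function name is invented and Pology's own parser is more thorough):

```python
def parse_pmap_entry(entry):
    # The first two characters are the key-value separator and the
    # property separator; the entry ends with a double property
    # separator.
    kvsep, propsep = entry[0], entry[1]
    body = entry[2:]
    if body.endswith(propsep * 2):
        body = body[:-2]
    fields = body.split(propsep)
    keys = [f for f in fields if kvsep not in f]
    props = dict(f.split(kvsep, 1) for f in fields if kvsep in f)
    return keys, props

keys, props = parse_pmap_entry("=/Athens/Atina/nom=Atina/gen=Atine/dat=Atini/acc=Atinu//")
print(keys)          # ['Athens', 'Atina']
print(props["gen"])  # Atine
```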
collect-pmap will parse pmap entries from manual comments in messages, collect them, and write out a property map file. It is not necessary to explicitly specify entry keys, since the contents of msgid and msgstr are automatically added as keys. Since each manual comment is one line, it is also allowed to drop the final double separator which would normally terminate the entry. The above example would thus look like this in a PO message:
# pmap: =/nom=Atina/gen=Atine/dat=Atini/acc=Atinu/
msgctxt "Greece/city"
msgid "Athens"
msgstr "Atina"
The manual comment starts with the pmap: keyword, which is followed by a normal pmap entry, except for the missing keys (though additional keys can be specified when msgid and msgstr are not sufficient). It is also possible to split the entry into several comments, with the only condition that all share the same set of separators:
# pmap: =/nom=Atina/gen=Atine/
# pmap: =/dat=Atini/acc=Atinu/
After collecting pmap entries from all processed PO files, if two or more entries end up having the same keys, they are all removed from the collection and a warning is reported.
Pmap entries are collected only from translated, non-plural messages.
Parameters:
outfile:file
File path into which the property map should be written. If not given, nothing is written out; this is useful for validating entries.
propcons:file
Path to the file which defines constraints on property keys and values, used to validate parsed entries (see Section 3.5.12.2, “Validating Entries”).
extrakeys
By default, it is not possible to add any additional entry keys besides the automatically added msgid and msgstr. This gives extra safety against errors, such as the translator mistyping the key-value pair. If additional keys are actually needed, this parameter can be issued to accept them.
derivs:file
Path to the file which defines derivators for synder entries (see Section 3.5.12.1, “Derivating Entries”).
pmhead:string
The default pmap: entry prefix may not be the most convenient; for example, when the language of translation is not written in the Latin script. This parameter makes it possible to use an arbitrary string as the entry prefix.
sdhead:string
Like pmhead, but for the prefix of synder entries, instead of the default synder: (see Section 3.5.12.1, “Derivating Entries”).
There is another, more succinct way to define pmap entries in comments. Instead of writing out all key-value combinations, it is possible to generate them by using syntagma derivators (or synders for short). From the earlier example:
# pmap: =/nom=Atina/gen=Atine/dat=Atini/acc=Atinu/
it can be observed that each form has the same root, Atin, followed by the appropriate ending for that form type. This makes it convenient to reformulate it as a syntagma derivation:
# synder: Atin|a
Here |a is a derivator; all derivators are defined in a separate synder file (with the .sd extension by convention) and made known to the sieve through the derivs parameter. The derivator in this example would be defined like this:
|a: nom=a, gen=e, dat=i, acc=u
First comes the derivator name, starting with | and ending with :, and then a comma-separated list of key-value pairs similar to those in a pmap entry, except that now only the endings for the given form are specified. Synders are actually a standalone subsystem of Pology; see Section 8.6, “Syntagma Derivation” for all details.
It is possible to mix pmap (# pmap: ...) and synder (# synder: ...) entries in translator comments. For example, synder entries may be used to cover the majority of cases, which follow the general language rules, while pmap entries can be used for exceptions.
On the other hand, every pmap entry can be reformulated as a synder entry which does not refer to an external derivator:
# synder: nom=Atina, gen=Atine, dat=Atini, acc=Atinu
This raises the question of why pmap entries are needed at all, if synder entries can be used in the same capacity and beyond. Pmap entries are still useful because synders have a lot of special syntax and rules to keep in mind (e.g. what if the phrase itself contains a comma?), while raw pmaps have none beyond what was described above.
The propcons parameter can be used to specify a file which defines constraints on acceptable property keys, and on the values of each key. Its format is the following:
# Full-line comment.
/key_regex_1/value_regex_1/flags # a trailing comment
/key_regex_2/value_regex_2/flags
:key_regex_3:value_regex_3:flags # different separator
# etc.
Regular expressions for keys and values are delimited by a separator defined by the first non-whitespace character in the line, which must also be non-alphanumeric. Before being compiled, regular expressions are automatically wrapped as ^(regex)$, so that an expression to require a certain prefix is given as prefix.* and a suffix as .*suffix. A property key must match one of the key regexes, or else it is considered invalid. The value of that property must then match the value regexes attached to all matched key regexes.
For example, a constraint file defining no constraints on either property keys or values is:
/.*/.*/
while a file explicitly listing all allowed property keys, and constraining values to some of them, would be:
/nom|gen|dat|acc/.*/
/gender/m|f|n/
/number/s|p/
The last separator in the constraint can be followed by a string of single-character flags. These flags are currently defined:
i: case-insensitive matching for the value.
I: case-insensitive matching for the key.
t: the value must both match the regular expression and be equal to msgstr. If the i flag is added too, the equality check is also case-insensitive.
r: the regular expression for the key must match at least one key among all defined properties.
The constraint definition file must be encoded in UTF-8.
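The matching semantics described above can be sketched as follows. This is an illustrative model only (the t and r flags are not covered, and the function name is invented):

```python
import re

def check_property(key, value, constraints):
    # constraints: list of (key_regex, value_regex, flags) triples.
    # Regexes are wrapped as ^(...)$; a key must match at least one
    # key regex, and the value must match the value regexes of all
    # constraints whose key regex matched.
    matched = False
    for key_rx, val_rx, flags in constraints:
        kflags = re.IGNORECASE if "I" in flags else 0
        vflags = re.IGNORECASE if "i" in flags else 0
        if re.match("^(%s)$" % key_rx, key, kflags):
            matched = True
            if not re.match("^(%s)$" % val_rx, value, vflags):
                return False
    return matched

constraints = [("nom|gen|dat|acc", ".*", ""), ("gender", "m|f|n", "")]
print(check_property("gen", "Atine", constraints))  # True
print(check_property("gender", "x", constraints))   # False
```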
When PO files are merged with the --previous option to msgmerge, fuzzy messages will retain the previous version of the original text (msgctxt, msgid and msgid_plural) under #| comments. Then diff-previous can be used to embed the differences from the previous to the current original into the previous original strings. For example, the message:
#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
will become after sieving:
#: main.c:110
#, fuzzy
#| msgid "{-The Record-}{+Records+} of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
Text editors may even provide highlighting for the wrapped difference segments (e.g. Kwrite/Kate).
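A rough sketch of word-level diff embedding in this style, using Python's standard difflib rather than Pology's own diffing code (the function name is invented):

```python
import difflib

def embed_word_diff(prev, curr):
    # Word-level diff in the {-removed-}{+added+} style.
    aw, bw = prev.split(), curr.split()
    out = []
    for op, a1, a2, b1, b2 in difflib.SequenceMatcher(a=aw, b=bw).get_opcodes():
        if op == "equal":
            out.append(" ".join(aw[a1:a2]))
        else:
            seg = ""
            if op in ("delete", "replace"):
                seg += "{-%s-}" % " ".join(aw[a1:a2])
            if op in ("insert", "replace"):
                seg += "{+%s+}" % " ".join(bw[b1:b2])
            out.append(seg)
    return " ".join(out)

print(embed_word_diff("The Record of The Witch River",
                      "Records of The Witch River"))
# {-The Record-}{+Records+} of The Witch River
```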
This sieve is very useful if your PO editor does not show differences in the original by itself. To be able to easily see exactly what was changed in the original is important both for efficiency and for quality. Think of a long paragraph in which only one word was changed: without a diff it will take you time to reread it, and you may even miss that changed word.
Parameters:
strip
Instead of embedding diffs, remove them from messages, recovering the original form of previous strings. This is useful if you did not update all fuzzy messages but you anyway want to send the PO file away (commit it to the repository, etc.).
branch:branch
Embed diffs only into messages from given branch (summit). Several branches may be given as comma-separated list.
For every fuzzy message, empty-fuzzies removes the translation and the fuzzy data (the fuzzy flag, previous strings). Translator comments are kept by default, but they can be removed as well. Obsolete fuzzy messages are removed completely.
Parameters:
rmcomments
Also remove translator comments from fuzzy messages.
noprev
Empty only those fuzzy messages which do not have previous strings (i.e. when the PO file was merged without the --previous option to msgmerge).
equip-header-tp-kde applies the kde%header/equip-header hook to headers of PO files within the KDE Translation Project.
There are no parameters.
Ordinary ASCII quotes are easy to type on most keyboard layouts, and so they are frequently found in non-typeset English texts in place of proper typographical ("fancy") quotes. When translating from English, translators can thus easily be led to use ASCII quotes themselves, instead of the fancy quotes appropriate for their language. To correct this somewhat, fancy-quote can be used to replace ASCII quotes in the translation with selected pairs of fancy quotes.
ASCII quotes that are part of text markup (e.g. attribute values in XML-like tags) must not be replaced, and this sieve will use heuristics to determine such places. In fact, it will replace quotes rather conservatively. Nevertheless, unless some sort of automatic validation is available, converted text should be manually inspected for correctness.
Parameters:
single:quotes
Opening and closing quote to replace ASCII single quotes with (i.e. quotes is a two-character string). If not given, single quotes are not replaced (but see the longsingle parameter).
double:quotes
Opening and closing quote to replace ASCII double quotes with. If not given, double quotes are not replaced (but see the longdouble parameter).
longsingle:open,close
Alternative to single, if the opening and closing quotes are not single characters. The values are the opening quote string and the closing quote string, separated by a comma.
longdouble:open,close
Alternative to double, if the opening and closing quotes are not single characters.
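For illustration, a deliberately naive replacement in this spirit could look like this. Unlike the actual sieve, this sketch has no awareness of markup and simply pairs quotes left to right; the function name is invented:

```python
import re

def fancy_quotes(text, opening="\u201c", closing="\u201d"):
    # Replace paired ASCII double quotes with fancy quotes.
    return re.sub(r'"([^"]*)"',
                  lambda m: opening + m.group(1) + closing, text)

print(fancy_quotes('Select "Open File" from the menu'))
# Select “Open File” from the menu
```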
find-messages is the search and replace workhorse of Pology. It applies one or several conditions to different parts of the PO message, with selectable boolean linking between them. If the message is matched as whole, it is reported and possibly some replacements are done. Messages are by default reported to standard output, with full location reference (PO file path, line and entry number), but can also be opened directly in one of supported PO editors (see Section 9.7.1, “PO Editors”).
When used in a sieve chain, find-messages will stop further sieving of messages which did not satisfy the conditions. This makes it useful as a filter for selecting subsets of messages on which other sieves should operate.
There are three logical groups of parameters: matching parameters, replacement parameters, and general parameters. Matching and replacement parameters have certain relationships between themselves, while general parameters have mutually independent effects (i.e. as usual for sieve parameters).
Matching parameters specify patterns for matching by parts of the message, or represent binary conditions (whether the message is translated, etc.). For example:
$ posieve find-messages -s msgid:'foo bar'
will report all messages which contain the phrase "foo bar" in their msgid (or msgid_plural) string. When several matching parameters are given, by default the message is matched if all patterns match; that is, the boolean linking of conditions is AND. This:
$ posieve find-messages -s msgid:'foo bar' -s transl
will report all messages that contain "foo bar" in the original and are translated. Boolean linking can be switched to OR by issuing the or parameter. To find all messages that contain the word "tooltip" in either context or comments:
$ posieve find-messages -s msgctxt:tooltip -s comment:tooltip -s or
(Actually, the effect of or is somewhat more specific; see its description below.) String matching is by default case-insensitive, which can be changed globally by issuing the case parameter.
Every matching parameter has a negative counterpart, named by prepending n to the original parameter, which matches when the original parameter does not. Running:
$ posieve find-messages -s msgid:'hello' -s nmsgstr:'zdravo'
would find all messages that contain "hello" in the original and do not contain "zdravo" in the translation (a typical usage pattern in quick terminology checks).
To find all messages not matching a set of conditions, in principle it would be possible to negate the whole condition set by switching between positive/negative parameters and AND/OR-linking, but this can be cumbersome. Instead, the invert parameter can be issued to report messages that are not matched by the condition set.
Sometimes neither simple AND nor simple OR boolean linking is sufficient to form the search. Therefore the fexpr parameter is provided, which can be used to specify a search expression with explicit boolean operators and parentheses for controlling the evaluation order. With fexpr, the previous example could be reformulated as:
$ posieve find-messages -s fexpr:'msgid/hello/ and not msgstr/zdravo/'
For details, see the description of fexpr below.
Currently defined matching parameters:
(n)msgctxt:regex
Regular expression to match the msgctxt string.
(n)msgid:regex
Regular expression to match the msgid and msgid_plural strings. The condition is satisfied as a whole if either of these strings matches.
(n)msgstr:regex
Regular expression to match msgstr strings. The condition is satisfied as a whole if any of the msgstr strings matches.
(n)comment:regex
Regular expression to match extracted comments, translator comments, and source reference comments. The condition is satisfied as a whole if any of these comments matches.
(n)flag:regex
Regular expression to match flags. Each flag is matched in turn, rather than the flag comment as a monolithic string. The condition is satisfied as a whole if any flag matches.
(n)transl
The message must be translated.
(n)obsol
The message must be obsolete.
(n)active
The message must be active, i.e. translated and not obsolete.
(n)plural
The message must be a plural message.
(n)maxchar:number
Original and translation can have at most this many characters. The condition is satisfied as a whole if all these strings satisfy it.
(n)lspan:start
:end
The referent line number of the message (the line in which its msgid
string starts) must fall within given range. The starting number is included in the range, the ending number is not.
(n)espan:start
:end
Like lspan
, but instead of line numbers it applies to entry numbers. These are the numbers that dedicated PO editors usually report in their user interfaces.
(n)branch:branch
The message must belong to this branch (summit). Several branches may be given as comma-separated list.
(n)fexpr:expression
Boolean expression with explicit boolean operators and parentheses for priority, constructed out of any of the other matching parameters. If a match parameter needs a value (like a regular expression), in the expression it is given as match/value/, where any non-alphanumeric character can be used consistently instead of / (in case the value itself contains /). For example, the expression:
fexpr:'(msgctxt/foo/ or comment/foo/) and msgid/bar/'
is satisfied if either the context or comments contain "foo", and the original text contains "bar".
If matching is influenced by a general parameter (e.g. case sensitivity), in the expression it may be able to take overriding modifiers, in the form of single characters after the value, i.e. match/value/modifiers. Assuming that the case parameter has not been issued, the expression:
fexpr:'msgid/quuk/ and msgstr/Qaak/c'
will be satisfied if the original text contains "quuk" in any casing, and translation contains exactly "Qaak". Currently available modifiers are:
c
: matching is case-sensitive.
i
: matching is case-insensitive. May be needed when string matching is globally case-sensitive due to case
being issued.
Replacement is done in pair with matching the appropriate string in the message. For example, to replace each appearance of "foobar" with "fumbar" in translation, this would be run:
$ posieve find-messages -s msgstr:foobar -s replace:fumbar
The replace
parameter works in pair with msgstr
, i.e. replace
cannot be issued without issuing msgstr
as well. There are two possible problems with replacement as straightforward as this. The first is that if "foobar" was a whole word (or start of a word), and this word in the text started with upper-case letter, the replacement would make it lower-case. This can be avoided by executing replacement twice with case sensitivity:
$ posieve find-messages -s msgstr:foobar -s replace:fumbar -scase
$ posieve find-messages -s msgstr:Foobar -s replace:Fumbar -scase
The other problem is if the word is split by an accelerator marker, for example:
msgstr "... f_oobar ..."
The search may still find the word (see the accel
parameter below), but direct replacement would cause the loss of accelerator marker, and therefore it is not done.[8] To see such cases, you should monitor the output of find-messages (always a good idea when doing batch replacement), where matched and replaced parts of the text will be highlighted.
As usual for replacement based on regular expressions, the replacement string may contain \number references to groups defined in the matching pattern. For example, the previous example of case-aware replacement could be performed more efficiently and more elegantly with:
$ posieve find-messages -s msgstr:'(f)oobar' -s replace:'\1umbar'
(Though this is possible only if the original and the replacement start with the same letter.)
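The group references follow ordinary regular-expression substitution rules, so the case-aware replacement above can be sketched with Python's standard re module (illustrative only, not Pology's implementation):

```python
import re

# Capture the first letter in a group and reuse it in the replacement
# via a \1 reference, so "foobar" -> "fumbar" and "Foobar" -> "Fumbar"
# in a single pass, preserving the original casing.
text = "A foobar and a Foobar."
result = re.sub(r"([fF])oobar", r"\1umbar", text)
print(result)  # -> A fumbar and a Fumbar.
```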
Currently defined replacement parameters:
replace:string
The string to replace the match by msgstr
parameter. Can contain regular expression group references.
Parameters influencing general behavior of find-messages are as follows:
or
Boolean OR instead of AND linking of conditions, but only for string matchers: msgctxt
, msgid
, msgstr
, comment
. This restriction may seem odd, but it is what is mostly needed in practice. For example, the set of conditions:
-s msgctxt:tooltip -s comment:tooltip -s transl -s or
would match all translated messages which have "tooltip" in context or in comments, and not messages which are either translated or have "tooltip" in context or in comments. For full control over the expression, use the fexpr
parameter.
invert
Inverts the selection: messages satisfying the condition set are not selected.
accel:characters
Characters to consider as accelerator markers, to remove before applying matching patterns. If not given, they may be read from PO files (see X-Accelerator-Marker
in Section 9.9, “Influential Header Fields”).
case
Matching patterns for strings and comments are by default case-insensitive, and this parameter switches them to case-sensitive.
mark
To each selected message a match
flag is added, modifying the PO file. Modified files can then be opened in the editor, and selected messages looked up by this flag. This is typically done when something should be modified in selected messages, but doing that automatically (using replace
parameter) is not possible or safe enough. Also useful here is the option -m
/--output-modified
of posieve, to write out the paths of modified PO files into a separate file, which can then be fed to the editor.
filter:hookspec
The hook to modify the translation before applying the msgstr
matcher to it. The hook type must be F1A. The parameter can be repeated to add several hooks.
nomsg
Do not report selected messages, either to standard output or to PO editors. Useful when find-messages is a pre-filter in the sieve chain.
lokalize
Open the PO file on selected messages in Lokalize (unless nomsg
is in effect). Lokalize must be already running with the project that contains the PO file opened.
generate-xml creates a partial XML representation of a group of PO files.
The output XML format is as follows. Each PO file in the group is represented by a <po>
element, which contains a list of <msg>
elements, one for each message. The <msg>
element contains the usual parts of a PO message:
<line>
: referent line number of the message
<refentry>
: referent entry number of the message
<status>
: current status of the message (obsolete, translated, untranslated, fuzzy)
<msgid>
: the original text
<msgstr>
: the translation
<msgctxt>
: disambiguating context
If the PO message contains plural forms, they will be represented with <plural>
subelements of <msgstr>
.
Parameters:
xml:file
By default the XML content is written to standard output; this parameter can be used to send it to a file instead.
translatedOnly
Only translated messages are exported to XML (i.e. fuzzy, untranslated and obsolete are ignored).
When doing corrections on a copy of a PO files tree, it is not easy to merge just the updated translations back, because word wrapping in the PO files can differ, producing a much larger difference than warranted.
Additionally, tools like pogrep from the Translate Toolkit create a new partial tree as output, containing only the matched messages. merge-corr-tree helps you merge changes made in that partial tree back into the main tree.
The main PO files tree is the input, and the pathdelta
parameter is used to provide the path difference to where the partial correction tree is located.
Parameters:
pathdelta:search
:replace
Specifies that the partial tree is located at the path obtained when search is replaced with replace in the input path.
normalize-header applies the normalize/canonical-header
hook to PO file headers.
There are no parameters.
In older PO files, disambiguating contexts may be embedded into msgid
strings, as the initial part of the string delimited from the actual text with predefined substrings, here called the "head" and the "tail". For example, in:
msgid ""
"_:this-is-context\n"
"This is original text"
msgstr "This is translated text"
the head is the underscore-colon sequence (_:
), and the tail the newline (\n
). normctxt-delim will convert embedded contexts of the delimiter-type to proper msgctxt
strings.
Parameters:
head:string
The head of the delimiter-type embedded context.
tail:string
The tail of the delimiter-type embedded context.
In older PO files, disambiguating contexts may be embedded into msgid
strings, as the initial part of the string separated from the actual text by a predefined substring. For example, in:
msgid "this-is-context|This is original text"
msgstr "This is translated text"
the separator string is the pipe character (|
). normctxt-sep will convert embedded contexts of the separator-type to proper msgctxt
strings.
Parameters:
sep:string
The string that separates the context and the text in separator-type embedded context.
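The conversion itself amounts to splitting on the first occurrence of the separator. A minimal sketch of that splitting step (a hypothetical helper, not normctxt-sep's actual code):

```python
def split_embedded_context(msgid, sep="|"):
    # Split off a separator-type embedded context from the actual text.
    # Returns (msgctxt, msgid); msgctxt is None when no separator is
    # present, in which case the message is left untouched.
    if sep in msgid:
        ctxt, text = msgid.split(sep, 1)
        return ctxt, text
    return None, msgid

print(split_embedded_context("this-is-context|This is original text"))
# -> ('this-is-context', 'This is original text')
```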
Being the translator's input, translator comments are copied verbatim to fuzzy messages created on merging with the template. Depending on the purpose of the translator comments (e.g. see Section 9.11, “Skipping and Selecting Checks” for some special types), it may be better to automatically remove some of them from fuzzy messages (and then possibly add them back manually when updating the translation). If run without any parameters, remove-fuzzy-comments will do nothing, so one or more parameters need to be given to actually remove any comments.
Parameters:
all
Simply all translator comments in fuzzy messages are removed.
nopipe
Translator comments containing translator flags (see Section 9.11, “Skipping and Selecting Checks”) are removed.
pattern:regex
Translator comment must match the given regular expression to be removed.
exclude:regex
Translator comment is removed if it does not match the given regular expression.
case
Matching patterns are by default case-insensitive, and this parameter switches to case-sensitivity.
When several removal criteria are specified, first those other than pattern
and exclude
are applied in unspecified order, then the pattern
match, and finally the exclude
match.
remove-obsolete simply removes all obsolete messages, whether fuzzy or translated, from the PO file.
There are no parameters.
remove-previous removes previous strings, i.e. #| ...
comments, from messages.
Parameters:
all
Previous strings are by default removed only from non-fuzzy messages. This parameter specifies to remove previous strings from all messages, including fuzzy.
In its default mode of operation, msgcat(1) produces an aggregate message when it encounters, in different catalogs, a message with the same key but different translation, translator comments, or extracted comments. A general aggregate message looks like this:
# #-#-#-#-# po-file-name-1 (project-version-id-1) #-#-#-#-#
# manual-comments-1
# #-#-#-#-# po-file-name-2 (project-version-id-2) #-#-#-#-#
# manual-comments-2
# ...
# #-#-#-#-# po-file-name-n (project-version-id-n) #-#-#-#-#
# manual-comments-n
#. #-#-#-#-# po-file-name-1 (project-version-id-1) #-#-#-#-#
#. automatic-comments-1
#. #-#-#-#-# po-file-name-2 (project-version-id-2) #-#-#-#-#
#. automatic-comments-2
#. ...
#. #-#-#-#-# po-file-name-n (project-version-id-n) #-#-#-#-#
#. automatic-comments-n
#: source-refs-1 source-refs-2 ... source-refs-n
#, fuzzy, other-flags
msgctxt "context"
msgid "original-text"
msgstr ""
"#-#-#-#-# po-file-name-1 (project-version-id-1) #-#-#-#-#\n"
"translated-text-1\n"
"#-#-#-#-# po-file-name-2 (project-version-id-2) #-#-#-#-#\n"
"translated-text-2\n"
"..."
"#-#-#-#-# po-file-name-n (project-version-id-n) #-#-#-#-#\n"
"translated-text-n"
Each message part is aggregated only if it differs in at least one message in the group. For example, extracted comments may be aggregated while translations are not.
resolve-aggregates is used to resolve aggregate messages of this kind into normal messages, by picking one variant from each aggregated part.
Parameters:
first
By default, the picked variant is the one with the most occurrences, or the first among those with the same number of occurrences. If this parameter is issued, the first variant is picked unconditionally.
unfuzzy
Aggregated messages are always made fuzzy, leaving no way to determine if and which of the original messages were fuzzy. Therefore, by default, the resolved message is left fuzzy too. If, however, it is known beforehand that none of the original messages were fuzzy, resolved messages can be unfuzzied by issuing this parameter.
keepsrc
Since there is no information based on which the aggregated source references can be split into originating groups, they are entirely removed unless this parameter is issued.
resolve-alternatives resolves alternatives directives found in the translation into one of the alternatives.
An alternative directive is a substring of the form ~@/.../.../...
, for example:
msgstr "I see a ~@/pink/white/ elephant."
~@
is the directive head, which is followed by a character that defines the delimiter of alternatives (which can be arbitrary), and then by the alternatives themselves. The number of alternatives per directive is not defined by the directive itself, but is provided as a sieve parameter (i.e. all alternative directives must have the same number of alternatives).
Parameters:
alt:N,Mt
Specifies how to resolve alternatives. N is the index (starting from 1) of the alternative to take from each directive, and M is the number of alternatives per directive. Example: alt:1,2t.
If an alternatives directive is invalid (e.g. too few alternatives), it is reported to standard output. If at least one alternatives directive in the text is not valid, the text is not modified.
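The parsing and picking logic described above can be sketched as follows. This is an illustrative re-implementation, not Pology's code; the directive head, the arbitrary delimiter character, and the invalid-directive behavior follow the description:

```python
def resolve_alternatives(text, index, total, head="~@"):
    # Resolve every head-delimited alternatives directive in text to its
    # index-th alternative (1-based), expecting exactly `total`
    # alternatives per directive. Returns None on any invalid directive,
    # mirroring the rule that the text is then not modified.
    out = []
    pos = 0
    while True:
        i = text.find(head, pos)
        if i < 0:
            out.append(text[pos:])
            break
        out.append(text[pos:i])
        j = i + len(head)
        if j >= len(text):
            return None  # directive head with no delimiter character
        delim = text[j]  # first character after the head sets the delimiter
        parts = []
        k = j + 1
        for _ in range(total):
            end = text.find(delim, k)
            if end < 0:
                return None  # too few alternatives
            parts.append(text[k:end])
            k = end + 1
        out.append(parts[index - 1])
        pos = k
    return "".join(out)

print(resolve_alternatives("I see a ~@/pink/white/ elephant.", 1, 2))
# -> I see a pink elephant.
```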
XML entities are substrings of the form &entityname;, typically encountered in XML-like text markups, but elsewhere too. They are resolved into underlying, human-readable values at build time (when translated text documents are created) or at run time (in translated user interfaces). Sometimes it may be better to have them resolved already in the PO file itself, and that is what resolve-entities does.
Parameters:
entdef:file
Path to the file which contains entity definitions. It can be repeated to add several files.
Entity definition files are plain text files of the following format:
<!-- This is a comment. -->
<!ENTITY name1 'value1'>
<!ENTITY name2 'value2'>
<!ENTITY name3 'value3'>
...
ignore:entitynames
Entities which should be ignored during resolution. The standard XML entities (&lt;, &gt;, &apos;, &quot;, &amp;) are ignored by default.
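Both the definition-file format and the resolution rule can be sketched with the standard re module (an illustrative approximation, not Pology's parser; the function and constant names are made up for this example):

```python
import re

# Names of the standard XML entities, ignored by default.
DEFAULT_IGNORED = {"lt", "gt", "apos", "quot", "amp"}

def parse_entity_definitions(content):
    # Pick out <!ENTITY name 'value'> declarations; comment lines simply
    # do not match the pattern and are skipped.
    return dict(re.findall(r"<!ENTITY\s+(\S+)\s+'([^']*)'>", content))

def resolve_entities(text, entities, ignored=DEFAULT_IGNORED):
    def sub(m):
        name = m.group(1)
        if name in ignored or name not in entities:
            return m.group(0)  # leave ignored/unknown entities as-is
        return entities[name]
    return re.sub(r"&([A-Za-z0-9._-]+);", sub, text)

defs = parse_entity_definitions("<!ENTITY appname 'Frobaz'>")
print(resolve_entities("Welcome to &appname;, &lt;tags&gt; kept.", defs))
# -> Welcome to Frobaz, &lt;tags&gt; kept.
```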
Sometimes a PO header field or comment needs to be updated in many PO files at once, and set-header serves that purpose.
Parameters for setting and removing header fields:
field:name
:value
Set the field with given name to given value. This parameter can be repeated to set several fields in one run.
By default, field
will actually set the field only if it is already present in the header. To add the field if not present, the create
parameter must be issued as well. If the field is being added, parameters after
and before
can be used to specify where to insert it, or else the new field is appended at the end of the header. If the field is present but not positioned according to after
and before
, the reorder
parameter can be issued to move the field within the header.
create
The field should be added if it is not present in the header.
after
When a field is added, it should be inserted after this field.
before
When a field is added, it should be inserted before this field.
reorder
If the field is present, but it is in the wrong place according to after
and before
, this parameter will cause it to be reinserted in proper place.
remove:field
Remove the field with this name. If there are several fields of that name, all are removed.
removerx:regex
Remove all fields matched by the given regular expression.
Parameters for setting and removing header comments:
title:value
Set the title comment to the given value. It can be repeated, since the title can be composed of multiple comment lines.
rmtitle
Remove title comments.
copyright:value
Set the copyright comment to the given value.
rmcopyright
Remove the copyright comment.
license:value
Set the license comment to the given value.
rmlicense
Remove the license comment.
author:value
Set the author comment to the given value. It can be repeated, since there may be more authors (i.e. translators).
rmauthor
Remove author comments.
comment:value
Set the free comment to the given value. It can be repeated, since there can be any number of free comment lines.
rmcomment
Remove free comments.
rmallcomm
Remove all header comments.
Note that all existing comments of a given type are removed before setting the new ones, i.e. the new comments are not appended to the existing ones. For example, if a single author
parameter is issued, with a translator name and email address as value, this one translator will replace all existing translators in the header comments.
Comment values are checked for some minimal consistency, e.g. author comments must contain email addresses, license comments the word "license", etc.
Value strings (both of fields and comments) may contain %-directives, which are expanded to catalog-dependent substrings prior to setting the value. Currently available directives are:
%poname
: PO domain name (equal to file name without .po
extension)
If a literal % character is needed (e.g. when setting the Plural-Forms
field), it can be escaped by doubling it, %%
. The directive can also be given inside braces, as %{...}
when it would be ambiguous otherwise.
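The expansion rules (%name directives, %{name} braces for disambiguation, %% as a literal percent) can be sketched as follows; this is an assumed stdlib re-implementation for illustration, not set-header's code:

```python
import re

def expand_directives(value, mapping):
    # Expand %-directives in a header value: %% becomes a literal %,
    # %name and %{name} are looked up in mapping; unknown directives
    # are left untouched.
    def sub(m):
        if m.group(0) == "%%":
            return "%"
        name = m.group(1) or m.group(2)
        return mapping.get(name, m.group(0))
    return re.sub(r"%%|%\{(\w+)\}|%(\w+)", sub, value)

print(expand_directives("Report-To: %poname", {"poname": "frobaz"}))
# -> Report-To: frobaz
print(expand_directives("nplurals=2; 100%%", {}))
# -> nplurals=2; 100%
```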
stats collects statistics on PO files, such as message and word counts, and more. Statistics can be presented in several ways and on several levels.
Parameters:
accel:characters
Characters to consider as accelerator markers, to be removed when splitting text to count words. If not given, they may be read from PO files (see X-Accelerator-Marker
in Section 9.9, “Influential Header Fields”), or else some usual accelerator marker characters are removed.
detail
In table views, by default only message, word, and character counts are given. This parameter requests additional derived data, such as expansion factors (ratio of words in translation to words in original), number of words per message, etc.
incomplete
When run over a collection of PO files, all non-fully translated PO files are listed separately, with very brief statistics of incompleteness.
incompfile:file
Write a file with paths of all non-fully translated PO files, one per line. This file can then be fed with -f
/--from-files
back to posieve or another script, to process only incomplete PO files.
templates:search
:replace
If there exists both a directory with translated PO files and one with POT (template) files, and not every POT file has a corresponding PO file, this parameter can be used to count POT files without a PO counterpart as fully untranslated in the statistics. The value of this parameter is two strings separated by a colon: the first string is searched for in directory paths of processed PO files, and replaced with the second string to construct the corresponding directory paths of POT files. For example:
$ cd $MYTRANSLATIONS
$ ls
my_lang templates
$ posieve stats -s templates:my_lang:templates my_lang/
minwords:number
Only messages with at least this many words (in any of original or translation strings) are counted into statistics.
maxwords:number
Only messages with at most this many words (in any of original or translation strings) are counted into statistics.
lspan:start
:end
Only messages with referent line numbers (line number of msgid
) in this range are counted into statistics. The starting line is included in the range, the ending line is not. If start is omitted (e.g. lspan::500
) it is assumed to be 0, and if end is omitted (e.g. lspan:300
or lspan:300:
) it is assumed to be the total number of lines.
espan:start
:end
Only messages with entry numbers (as reported by PO editors) in this range are counted into statistics. Same boundary inclusion and omission rules as for lspan
apply; e.g. espan:4:8
means to count messages with entry numbers 4, 5, 6, and 7.
branch:branch
Only messages from given branch are counted into statistics (summit). Several branches may be given as comma-separated list.
bydir
Statistics are broken down by directory: a report is displayed for each group of PO files in the same directory (and not below it). More often used with bar displays than with tabular displays.
byfile
Statistics are broken down by file: a report is displayed for each PO file. Usually used with bar displays.
msgbar
Instead of a table with detailed statistics, only message counts are shown, accompanied with a text-art bar. Mostly useful in combination with bydir
and byfile
.
wbar
Like msgbar
, but with word instead of message counts.
absolute
Bar displays (on msgbar
and wbar
) are normally relative, meaning that when byfile
or bydir
is in effect, each bar is of the same length. This parameter makes bars scale with the sizes of PO files or directories. For example, if msgbar
and byfile
are issued, then the bar of a PO file with twice as many messages as another PO file will be twice as long.
ondiff
Fuzzy messages are often very easy to correct (e.g. a typo fixed), which may make their word count misleading when estimating translation effort. This can be amended by issuing this parameter, to split word and character counts of fuzzy messages into translated and untranslated counts. The split is based on the difference ratio between current and previous original text, and a threshold. If the difference ratio is larger than the threshold, everything is counted as untranslated. The fuzzy count is left at zero. If previous original text is missing, the correction is not made, and counts are assigned to fuzzy as usual.
mincomp:fraction
Only those PO files which have translation completeness (measured by the ratio of translated to all messages, excluding obsolete) equal to or higher than the given fraction are included into statistics. This is especially useful when for each new template an empty PO file is automatically produced (instead of translators having to start work from a template), to include into statistics only those files which have actually seen some translation (using a small non-zero number for the fraction, e.g. mincomp:1e-6
).
filter:hookspec
The hook to modify the translation before splitting it to count words and characters (see Section 9.10, “Processing Hooks”). The hook type must be F1A. The parameter can be repeated to add several hooks, which are then applied in the order of specification.
Some older PO files will have disambiguating contexts embedded into the msgid
string, instead of using the newer standard msgctxt
string. There are several customary ways in which this is done, but in general it depends on the translation environment where such PO files are used.
Embedded contexts will skew the statistics. Pology contains several sieves for converting embedded contexts into msgctxt
contexts, named normctxt-*. When statistics on such PO files is computed, a sieve chain should be used in which the stats sieve is preceded by the context conversion sieve. For example, if the embedded context starts the msgid
and ends with |
, statistics should be computed with:
$ posieve --no-sync normctxt-sep,stats -s sep:'|' ...
Note that normctxt-* sieves, since they modify messages, would by default cause PO files to be modified on disk. Option --no-sync
is therefore issued to prevent modifications to sieved files.
The default output from stats is a table where rows present statistics for a category of messages, and columns the particular categories of data:
$ posieve stats frobaz/
- msg msg/tot w-or w/tot-or w-tr ch-or ch-tr
translated ... ... ... ... ... ... ...
fuzzy ... ... ... ... ... ... ...
untranslated ... ... ... ... ... ... ...
total ... ... ... ... ... ... ...
obsolete ... ... ... ... ... ... ...
The total
row is the sum of translated
, fuzzy
, and untranslated
rows, whereas the obsolete
row is excluded. The columns are as follows:
msg
: number of messages
msg/tot
: percentage of messages relative to total
w-or
: number of words in the original
w/tot-or
: percentage of words in the original relative to total
w-tr
: number of words in the translation
ch-or
: number of characters in the original
ch-tr
: number of characters in the translation
The output with detail
parameter in effect is the same as default, with several columns of derived data appended to the table:
w-ef
: word expansion factor (increase in words from the original to the translation)
ch-ef
: character expansion factor (increase in characters from the original to the translation)
w/msg-or
: average number of words per message in the original
w/msg-tr
: average number of words per message in the translation
ch/w-or
: average number of characters per word in the original
ch/w-tr
: average number of characters per word in the translation
If any of the sieve parameters that restrict or modify counting (such as ondiff
, lspan
, etc.) have been issued, this is indicated in the output by a modifiers: ...
line:
$ posieve stats -s maxwords:5 -s ondiff frobaz/
(...the statistics table...)
modifiers: at most 5 words and scaled fuzzy counts
When the incomplete
parameter is given, the statistics table is followed by a table of non-fully translated PO files, with counts of fuzzy and untranslated messages and words:
$ posieve stats -s incomplete frobaz/
(...the overall statistics table...)
catalog              msg/f msg/u msg/f+u   w/f  w/u w/f+u
frobaz/foxtrot.po        0    11      11     0  123   123
frobaz/november.po      19    14      33    85   47   132
frobaz/sierra.po        22     0      22   231    0   231
In the column names, msg/*
and w/*
stand for messages and words; */f
, */u
, and */f+u
stand for fuzzy, untranslated, and the two summed.
When parameters msgbar
or wbar
are in effect, statistics is presented in the form of a text-art bar, giving visual relation between numbers of translated, fuzzy, and untranslated messages or words:
$ posieve stats -s wbar frobaz/
4572/1829/2533 w-or |¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤×××××××××············|
A typical condensed overview of translation state is obtained by:
$ posieve stats -s byfile -s msgbar frobaz/
frobaz/foxtrot.po    34/ -/11 msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤·····|
frobaz/november.po   58/19/14 msgs |¤¤¤¤¤¤¤¤¤¤¤×××××····|
frobaz/sierra.po     65/22/ - msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤¤××××××|
(overall)           147/41/25 msgs |¤¤¤¤¤¤¤¤¤¤¤¤¤××××···|
Note that while message counts are the classic choice for bar overviews (msgbar
), you are probably better off looking at word counts (wbar
) instead, because word counts represent more closely the amount of work needed to complete the translation. Rounding of fractions for bars is such that as long as there is at least one fuzzy or untranslated message (or word), the bar will show one incomplete cell.
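The rounding rule for incomplete cells can be sketched as follows. This is an assumed allocation scheme illustrating the described behavior, not the actual stats implementation: any nonzero fuzzy or untranslated count is rounded up to at least one cell.

```python
def bar_cells(translated, fuzzy, untranslated, width=20):
    # Allocate bar cells so that any nonzero fuzzy/untranslated count
    # shows at least one cell, even when its proportional share rounds
    # down to zero.
    total = translated + fuzzy + untranslated
    if total == 0:
        return (0, 0, 0)
    def cells(n):
        if n == 0:
            return 0
        return max(1, round(width * n / total))
    f, u = cells(fuzzy), cells(untranslated)
    t = max(0, width - f - u)  # translated cells fill the remainder
    return (t, f, u)

print(bar_cells(147, 41, 25))  # -> (14, 4, 2)
print(bar_cells(999, 0, 1))    # one incomplete cell survives rounding
```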
Word and character counts for a message string are obtained by processing it in the following order:
Accelerator markers are removed.
Text markup is eliminated (e.g. XML-like tags).
Other special substrings, such as format directives, are also eliminated (e.g. %s
in messages with c-format
flag).
Text is split into words by taking all contiguous sequences of "word characters", which include letters, numbers, and underscore.
All words not starting with a letter are eliminated.
Words that remain are counted into statistics. Whitespace is not included in character count.
In plural messages, counts for the original are the average of msgid
and msgid_plural
strings, and likewise the average of all msgstr
strings for the translation. In this way, the comparative statistics between the original and the translation is not skewed for languages that have more or less than two plural forms.
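The counting steps above can be roughly approximated with the standard library. This is a deliberately simplified sketch: the real elimination of markup and format directives depends on message flags and configuration, and the real character count covers all non-whitespace text, not only word characters.

```python
import re

def count_words_chars(text, accels="&_~"):
    # 1. Remove accelerator markers.
    for a in accels:
        text = text.replace(a, "")
    # 2. Eliminate XML-like tags.
    text = re.sub(r"<[^>]*>", "", text)
    # 3. Eliminate format directives (e.g. %s in c-format messages).
    text = re.sub(r"%[a-z0-9]+", "", text)
    # 4. Split into contiguous runs of word characters, and
    # 5. keep only words starting with a letter.
    words = [w for w in re.findall(r"\w+", text) if w[0].isalpha()]
    # 6. Whitespace is not counted into characters.
    chars = sum(len(w) for w in words)
    return len(words), chars

print(count_words_chars("Open <b>%s</b> fi_les"))  # -> (2, 9)
```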
Some translators like to edit PO files with a plain text editor, which may provide no special support for editing PO files, other than perhaps PO syntax highlighting. In this scenario, tag-untranslated can be used to equip untranslated messages with untranslated
flag, so that they can be easily looked up in the editor.
Since untranslated
is not one of defined PO flags, it will be lost if the PO file is merged with the template. This is intentional: the only purpose of this flag is to facilitate immediate editing of the PO file, and you may miss to remove some of them while editing. There is no reason for untranslated
flags to persist in that case. Also, if the flag is not removed after the message has been translated, a subsequent run of this sieve will remove the flag.
Parameters:
strip
Instead of being added, untranslated
flags are stripped. This is useful when you had no time to translate all messages but you want to send the PO file away.
wfuzzy
untranslated
flags are added to fuzzy messages as well. This can be useful for jumping through all incomplete messages in the text editor by looking up untranslated[9], or when the set of messages to be updated has been limited somehow (e.g. by the branch
parameter).
branch:branch
Tag only untranslated messages from given branch (summit). Several branches may be given as comma-separated list.
Sometimes the message is made fuzzy during merging only due to change in the msgctxt
string, or its addition or removal. Some translators and languages may be less dependent on contexts than the other, or they may be in a hurry prior to the release of the translation, and then unfuzzy-context-only can be used to unfuzzy these messages in which only the context was modified. This state can be detected by comparing the current and the previous strings in the fuzzy message, i.e. the PO file must have been merged with --previous
option to msgmerge.
Parameters:
noreview
By default, unfuzzied messages will also be given a translator comment with unreviewed-context
string, so that you may find and review these messages at a later time. This parameter will prevent the addition of such comment, but it is usually safer to review automatically unfuzzied messages when you find the time.
eqmsgid
Sometimes a lot of messages in the code may be semi-automatically equipped with contexts (e.g. to group items by a common property), and then it may be necessary to review only those messages which got split into two or more messages due to newly added contexts. This parameter may be issued to specifically report all translated messages which have their msgid
string equal to an unfuzzied message, including unfuzzied messages themselves. Depending on exactly what kind of contexts have been added, the noreview
parameter may be useful here as well.
lokalize
Open the PO file on reported messages in Lokalize. Lokalize must be already running with the project that contains the PO file opened.
unfuzzy-ctxmark-only has a similar but narrower effect than the unfuzzy-context-only sieve. It unfuzzies a message only if the only change that caused fuzziness is in a specific part of the msgctxt
string, the UI context marker.
UI context markers are an element of KUIT markup (KDE user interface text), which states more formally the user interface context in which the text of the PO message is used. This may be important for translation, since style guidelines typically depend somewhat on where in the UI the text is seen. For example, there may be two messages in the code which have exactly the same text in English, but one is used as a menu item, and the other as a dialog title; with KUIT, they would be marked as:
msgctxt "@action:inmenu File"
msgid "Export as HTML"
msgstr ""

msgctxt "@title:window"
msgid "Export as HTML"
msgstr ""
The UI context marker here is the leading part of msgctxt
, starting with @...
and ending with first whitespace. unfuzzy-ctxmark-only will unfuzzy the message if only this marker has changed (or was added or removed), but not if the change was in the rest of the context (after the first whitespace).
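The marker-extraction rule just described can be sketched as follows (hypothetical helper names, for illustration only):

```python
def ui_context_marker(msgctxt):
    # Split msgctxt into (marker, rest): the marker is a leading @...
    # token running up to the first whitespace; everything after it is
    # the rest of the context.
    if msgctxt.startswith("@"):
        parts = msgctxt.split(None, 1)
        return parts[0], (parts[1] if len(parts) > 1 else "")
    return "", msgctxt

def only_marker_changed(old_ctxt, new_ctxt):
    # True if old and new contexts differ at most in the UI marker,
    # i.e. the rest of the context is identical.
    return ui_context_marker(old_ctxt)[1] == ui_context_marker(new_ctxt)[1]

print(only_marker_changed("@action:inmenu File", "@title:window File"))
# -> True
print(only_marker_changed("@action:inmenu File", "@action:inmenu Edit"))
# -> False
```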
Parameters:
noreview
See the same-name parameter of unfuzzy-context-only. Using it here is probably somewhat safer, but in general this depends on translation style guidelines.
Some text markups may have a "permissible" or "sloppy" mode, where some tags do not have to be explicitly terminated. The typical example is HTML, where <br>, <hr>, etc. do not have to be written as <br/>. (This is unlike XHTML, which is an XML instance and therefore strict in this respect.) When this permissible markup was used in the code, a programmer revisiting that code at a later time may consider it poor style, and go about fixing it. This may cause some messages in the PO file to become fuzzy. unfuzzy-inplace-only will recognize some of these situations in a fuzzy message (by comparing the current and previous strings), automatically modify the translation accordingly, and unfuzzy the message.
There are no parameters.
PO messages obtained by conversion from Qt Linguist translation files can contain in the msgctxt string an automatically extracted C++ class name, referring to the class where the message is located in the code. In the following two example messages, the C++ class name is the text before the | character:
#: ui/configdialog.cpp:50
msgctxt "Sonnet::ConfigDialog|"
msgid "Spell Checking Configuration"
msgstr ""

#: core/loader.cpp:206
#, qt-format
msgctxt "Sonnet::Loader|%1 = language name, %2 = country name"
msgid "%1 (%2)"
msgstr ""
If the programmer later changes a class name in the code, all messages inside that class will become fuzzy. The unfuzzy-qtclass-only sieve can be used to unfuzzy such messages, by verifying that the only difference between the old and the new message is in the part of msgctxt before the | character. For this to work, the PO file must have been merged with the --previous option to msgmerge.
There are no parameters.
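The class-name comparison can be sketched as follows; qtclass_only_change is a hypothetical helper illustrating the check, not Pology's API:

```python
def qtclass_only_change(old_ctxt, new_ctxt):
    """Return True if the only difference between two msgctxt strings
    is in the Qt class name before the '|' character. Hypothetical
    helper illustrating the check, not Pology's implementation."""
    old_cls, old_sep, old_rest = old_ctxt.partition("|")
    new_cls, new_sep, new_rest = new_ctxt.partition("|")
    # Both contexts must carry a class part, the disambiguation part
    # after '|' must be unchanged, and the class name must differ.
    return (old_sep == new_sep == "|"
            and old_rest == new_rest
            and old_cls != new_cls)
```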
When translation on a PO file starts for the first time, or when a previously translated PO file is being updated after merging, update-header can be used to automatically set and update PO header fields to proper values. The revision date is taken as current, while other pieces of information are read from the user configuration (see Section 9.2, “User Configuration”). Note that this sieve is normally only of use when you are translating with a plain text editor, while dedicated PO editors should do this automatically when the PO file is saved after editing.
Parameters:
proj:projectid
The ID of the project to which the PO files to be updated belong. This ID is used to construct the name of the configuration section, [project-projectid], which contains the project data fields. Also used are the fields from the [user] section, whenever they are not overridden in the project's section. See Section 9.2.2, “The [user] section” and Section 9.2.5, “Per-project sections ([project-*])”.
init
By default, the sieve tries to detect whether the header has been initialized before, because what should be changed in the header differs somewhat between initialization and update. This parameter can be issued to unconditionally treat the header as not initialized, i.e. to overwrite any existing content.
onmod
The header should be updated only if the PO file was otherwise modified. This parameter makes sense only in a sieve chain, when this sieve is preceded by a potentially modifying sieve.
An example of a user configuration appropriate for this sieve would be:
[user]
name = Chusslove Illich
original-name = Часлав Илић
email = caslav.ilic@gmx.net
po-editor = Kate

[project-kde]
language = sr
language-team = Serbian
team-email = kde-i18n-sr@kde.org
plural-forms = nplurals=4; plural=n==1 ? 3 : n%%10==1 && \
    n%%100!=11 ? 0 : n%%10>=2 && n%%10<=4 && \
    (n%%100<10 || n%%100>=20) ? 1 : 2;
Note that percent characters in the plural-forms field are escaped by doubling, because a single % in the configuration has special meaning. Also note the splitting into several lines by a trailing \ (only for better looks, since configuration lines can be arbitrarily long).
In the French language, some punctuation characters are separated by an unbreakable space from the preceding word. This is unlike in English, so unwary French translators sometimes forget to add the required unbreakable space before or after such punctuation when translating from English. fr:setUbsp will heuristically detect such places and insert an unbreakable space.
There are no parameters.
Each translation file for a DocBook document in KDE has a string for the documentation's last update date, in the format 'yyyy-mm-dd'. This sieve automatically translates those strings into Russian. The sieve uses the date command to change the date formatting, but Russian month names are hardcoded, so that you do not need to set up a Russian locale to use the sieve.
There are no parameters.
Each internal sieve is a single Python file in the sieve/ subdirectory (or in lang/langcode/sieve/ for language-specific sieves). The Python file is named like the sieve, only with hyphens replaced by underscores and with the .py extension. posieve therefore knows which file to execute when an internal sieve name is given as its first argument.
However, instead of an internal sieve name, the first argument to posieve can also be an explicit path (relative or absolute) to a Python file which implements a sieve. Explicit paths can also be part of a sieve chain, mixed with internal sieve names. This is all there is to running external sieves; see Section 11.3, “Writing Sieves” for instructions on how to write one.
Line-level diffing of plain text files assumes that the file is chunked into lines as largest well-defined units, that each line has a significant standalone meaning, and that the ordering of lines is not arbitrary. For example, this is typical of programming language code.
Superficially, PO files could also be considered "a programming language of translation", and amenable to the same line-level treatment on diffing. However, some of the outlined assumptions, which make line-level diffing viable, are violated in the PO format. Firstly, the minimal unit of a PO file is one message, whereas one line has little semantic value. Secondly, the ordering of messages can in principle be arbitrary (e.g. dependent on the order of extraction from program code files), such that two line-wise very different PO files are actually equivalent from the translator's viewpoint. And thirdly, a good number of lines in the PO file are auxiliary, neither original text nor translation, generated either automatically or by the programmer (e.g. source references, extracted comments), all of which are outside the translator's scope for modifications.
Due to these difficulties, the common way to use line-level diffing with PO files is only for review, and even that with some preparations. Due to the myriad line-wise different but semantically equivalent representations of a PO file, it is almost useless to send line-level diffs as patches. Translators are instead told to always send full PO files to the reviewer or the committer, no matter the amount of modifications. Then, the reviewer merges the received PO file (new version), and possibly the original (old version), with the current PO template, without wrapping of message strings (msgid, msgstr, etc.). This "normalizes" the old and the new file with respect to all semantically non-significant elements, and only then can line-level diffing be performed. Additionally, since a long non-wrapped line of text may differ in only a few words, a dedicated diff viewer which can highlight word-level differences should be used. Ordinary diff syntax highlighting (e.g. in the shell, or in a general text editor) would waste the reviewer's time in trying to spot those few changed words.
Even with preparations and a dedicated diff viewer at hand, there is at least one significant case which is still not reasonably covered: when a fuzzy message with previous strings (i.e. when the PO file was merged with the --previous option to msgmerge) has been updated and unfuzzied. For example:
old:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

diff:

#: main.c:110
- #, fuzzy
- #| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
- msgstr "Beleška o Veštičjoj reci"
+ msgstr "Beleške o Veštičjoj reci"
The line-level diff viewer will know to show a word-level diff for the modified translation, but it cannot know that it should also show a word-level diff between the removed previous and the current msgid strings, so that the reviewer can see what has changed in the original text (i.e. why the message had become fuzzy), and based on that judge whether the translation was properly adapted.
A dedicated PO editor may be able to show the truly proper, message-level difference.[10] Even then, however, it remains necessary to send around full PO files, and possibly to normalize them to a lesser extent before comparing. Additionally, the diff format becomes tied to the given PO editor, instead of being self-contained and processable by various tools (such as line-level diffs are).
This chapter therefore introduces the format and semantics for self-contained, message-level diffing of PO files -- the embedded diff -- and presents the Pology tools which implement it.
The difference between two PO messages should primarily, though not exclusively, consist of differences between their string parts (msgid, msgstr, etc.). To be well observable, differences between strings should be as localized as possible -- think of a long paragraph in which only the spelling of a word or some punctuation was changed. Finally, the format of the complete PO message diff should be intuitively comprehensible to translators who are used to the PO format itself, and to some extent compatible with existing PO processing tools.
These considerations lead to making the diff of two PO messages be a PO message itself. In other words, the diff gets embedded into the regular parts of a PO message. An embedded diff (ediff for short) message should be at least syntactically valid, if not semantically (it should not cause a simple msgfmt run to fail, though msgfmt --check could). For it to be possible to exchange ediffs as patches for PO files, the embedding should be resolvable into the old and the new messages from which the diff was created.
In this way, if ediff messages are packed into a PO file (an ediff PO), existing PO tools can be used to review and modify the diff. For example, highlighting in a text editor will need only minimal upgrades to show the embedded differences (more on that below), and otherwise it will already highlight ediff message parts as usual.
To fully define the ediff format, the following questions should be answered:
How to represent embedded differences in strings?
Which parts of the PO message should be diffed?
How to pair for diffing messages from two PO files?
How to present collection of diffed messages?
Once the word-level difference between the old and the new string has been computed, it should somehow be embedded into the new string (or, equivalently, the old string). This can be done by wrapping removed and added text segments with {-...-} and {+...+}, respectively:
old:

"The Record of The Witch River"

new:

"Records of The Witch River"

diff:

"{-The Record-}{+Records+} of The Witch River"
It may happen that an opening or closing wrapper sequence occurs as a literal part of the diffed strings[11], so some method of escaping is necessary. This is done by inserting a ~ (tilde) in the middle of the literal sequence:
old:

"Foo {+ bar"

new:

"Foo {+ qwyx"

diff:

"Foo {~+ {-bar-}{+qwyx+}"
If strings instead contain the literal sequence {~+, then another tilde is inserted, and so on. In this way, an ediff can be unambiguously resolved into the old and new versions of the string. Escaping by inserting tildes also makes it easier to write a syntax highlighting definition for an editor, as the wrapper pattern is automatically broken by the tilde.
It may happen that a given string is not merely empty in the old or new PO message, but that it does not exist at all (e.g. msgctxt). For this reason it is possible to make an ediff between an existing and a non-existing string as well, in which case a tilde is appended to the very end of the ediff:
old:

(string does not exist)

new:

"a-context-note"

diff:

"{+a-context-note+}~"
Here too escaping is provided, by inserting further tildes if the ediff between two existing strings would result in a trailing tilde (if the old string is "~" and the new "foo~", the ediff is "{+foo+}~~").
It is not necessary to prescribe the exact algorithm for computing the difference between two strings. In fact, the diffing tool may allow the translator to select between several diffing algorithms, depending on personal taste and situation. For example, the default algorithm of Pology's poediff does the following: words are diffed as atomic sequences, all non-word segments (punctuation, markup tags, etc.) are diffed character by character, and equal non-word segments in between two different words (e.g. whitespace) are included into the difference segment. Hence the above ediff
"{-The Record-}{+Records+} of The Witch River"
instead of the smaller
"{-The -}Record{+s+} of The Witch River"
as the former is (tentatively) easier to comprehend.
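A simplified version of such word-level diffing can be built on Python's standard difflib, with words as atomic tokens; this sketch happens to reproduce the ediff shown above, but it is not poediff's actual algorithm (which additionally refines non-word segments):

```python
import difflib
import re

def word_ediff(old, new):
    """Embed a word-level diff of 'old' into 'new', wrapping removed
    segments in {-...-} and added segments in {+...+}. Simplified
    sketch of the behavior described above."""
    tok_old = re.findall(r"\w+|\W", old)  # words atomic, rest char-wise
    tok_new = re.findall(r"\w+|\W", new)
    out = []
    sm = difflib.SequenceMatcher(None, tok_old, tok_new)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("delete", "replace"):
            out.append("{-" + "".join(tok_old[i1:i2]) + "-}")
        if op in ("insert", "replace"):
            out.append("{+" + "".join(tok_new[j1:j2]) + "+}")
        if op == "equal":
            out.append("".join(tok_new[j1:j2]))
    return "".join(out)
```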
Since every difference segment in the ediff message is represented in the described way, it is sufficient to upgrade the PO syntax highlighting of an editor[12] to indiscriminately highlight {-...-} and {+...+} segments everywhere in the message.
A PO message consists of several types of parts: strings, comments, flags, source references, etc. It would not be very constructive to diff all of them; for example, while msgstr strings should clearly be included into diffing, source references most probably should not. To avoid pondering over the advantages and disadvantages of including each and every message part, there already exists a well-defined splitting of message parts into two groups, one of which will be taken into diffing, and the other not. These two groups are:
Extraction-invariant parts are those which do not depend on placement (or even presence) of the message in the source file. These are the msgid string, msgstr strings, manual comments, etc.
Extraction-prescribed parts are those which cannot exist independently of the source file from which the message is extracted, such as format flags or extracted comments.
Only extraction-invariant parts will be diffed. The working definition of which parts belong to this group is provided by what remains in obsolete messages in PO files:
current original text: msgctxt, msgid, and msgid_plural strings
previous original text: #| msgctxt, #| msgid, and #| msgid_plural comments
translation text: msgstr strings
translator comments
fuzzy state (whether the fuzzy flag is present)
obsolete state (whether the message is obsolete)
Strings and translator comments are presented in the ediff message as embedded word-level differences, as described earlier. Changes in state, fuzzy and obsolete, are represented differently. A special "extracted" comment is added to the ediff message, starting with #. ediff: and listing any extra information needed to describe the ediff, including the state changes. Here is an example of two messages and the ediff they would produce[13]:
old:

#, fuzzy
#~| msgid "Accurate subpolar weather cycles"
#~ msgid "Accurate subpolar climate cycles"
#~ msgstr "Tačni ciklusi subpolarnog vremena"

new:

#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate tropical weather cycles"
msgctxt "some-superfluous-context"
msgid "Accurate tropical climate cycles"
msgstr "Tačni ciklusi tropskog vremena"

diff:

#. ediff: state {-obsolete-}
#. ui: property (text), widget (QCheckBox, accCyclesTrop)
#: config.ui:180
#, fuzzy
#| msgid "Accurate {-subpolar-}{+tropical+} weather cycles"
msgctxt "{+some-superfluous-context+}~"
msgid "Accurate {-subpolar-}{+tropical+} climate cycles"
msgstr "Tačni ciklusi {-subpolarnog-}{+tropskog+} vremena"
The first thing to note is that the ediff message contains not only the extraction-invariant parts, but also verbatim copies of extraction-prescribed parts from the new message. Effectively, the ediff is embedded into the copy of the new message. Extraction-prescribed parts are not simply discarded in order to provide more context when reviewing the diff. Here, for example, the extracted comment states that the text is a checkbox label, which may be important for the style of translation.
The other important element is the #. ediff: dummy extracted comment, which here indicates that the obsolete state has been "removed", i.e. the message was unobsoleted between the old and the new version of the PO file. Aside from state changes, a few other indicators may be present in this comment; they will be mentioned later on. The ediff comment is present only when necessary, if there are any indicators to show.
If diffing of two messages were always conducted part for part, for all message parts which are taken into diffing, then in some cases the resulting ediff would not be very useful. Consider how the first example in this chapter, the line-level diff of a fuzzy and a translated message, would look as an ediff if diffed part for part:
old:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

diff:

#. ediff: state {-fuzzy-}
#: main.c:110
#| msgid "{-The Record of The Witch River-}~"
msgid "Records of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"
This ediff suffers from the same problem as the line-level diff: instead of showing the difference from the previous to the current msgid string, the current msgid is left untouched, while the previous msgid is simply shown to have been removed.
Therefore, instead of diffing directly part for part, a special transformation takes place when exactly one of the two diffed messages is fuzzy and contains previous original strings. This splits into two directions: from fuzzy to non-fuzzy, and from non-fuzzy to fuzzy.
Diffing from a fuzzy to a non-fuzzy message is the more usual of the two directions. It typically appears when the translation has been updated after merging with the template. In this case, the old and the new message are shuffled prior to diffing in the following way (*-rest denotes all diffed parts that are neither original text nor fuzzy state):
old:

fuzzy                --> fuzzy
old-previous-strings --> old-previous-strings
old-current-strings  --> old-previous-strings
old-rest             --> old-rest

new:

-                    --> -
-                    --> old-current-strings
new-current-strings  --> new-current-strings
new-rest             --> new-rest
When these shuffled messages are diffed, the resulting ediff message's current strings will show the important difference, that between the previous original text of the old (fuzzy) message and the current original text of the new (non-fuzzy) message. The ediff message's previous strings will show the less important difference, between the old message's previous and current strings, but only if it is not the same as the difference between current strings. This may sound confusing, but the actual ediff produced in this way is quite intuitive:
old:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

diff:

#. ediff: state {-fuzzy-}
#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"
From this the reviewer can see that the message was unfuzzied, the change in the original text that caused the message to become fuzzy, and what was changed in the translation to unfuzzy it. The old version of the text (in removed and equal segments) is that from the message before it got fuzzied, and the new version (in added and equal segments) is that from the message after it was unfuzzied.
The other special direction, from a non-fuzzy to a fuzzy message, should be less frequent. It appears, for example, when the diff is taken from the old, completely translated PO file, to the new PO file which has been merged with the latest template. In this case, the shuffling is as follows:
old:

-                    --> -
-                    --> new-previous-strings
old-current-strings  --> old-current-strings
old-rest             --> old-rest

new:

fuzzy                --> fuzzy
new-previous-strings --> new-current-strings
new-current-strings  --> new-current-strings
new-rest             --> new-rest
The difference in the ediff message's current strings will again be the most important one, and that in the previous strings the less important one, shown only if not equal to the difference in current strings. Here is what this will result in when applied one step earlier, just after merging with the template:
old:

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

diff:

#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "Beleška o Veštičjoj reci"
The reviewer can see that the message became fuzzy, and the change in the original text that caused that.
The diffing tool may add custom additional information at the end of any string in the ediff message (msgid, msgstr, etc.), separated by a newline, a repeated block of one or more characters, and a newline. When this is done, the #. ediff: comment will have the infsep indicator, which states the character block used and the number of repetitions in the separator:
#. ediff: state {+fuzzy+}, infsep +- 20
#: main.c:110
#, fuzzy
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr ""
"Beleška o Veštičjoj reci\n"
"+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-\n"
"some-additional-information"
Of course, the diffing tool should compute an appropriate separator, such that it does not conflict with a part of the text in any of the strings. What could this additional information be? For example, it could be a filtered version of the text, to ease some special type of review.
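Computing a non-conflicting separator is straightforward: keep lengthening the repeated block until it occurs in none of the strings. A minimal sketch, with a hypothetical helper (the block and minimum repetition count are assumptions of the sketch):

```python
def make_infsep(strings, block="+-", minrep=20):
    """Find a separator, as repetitions of 'block', which occurs in
    none of the given strings. Hypothetical helper illustrating the
    'infsep' rule described above."""
    rep = minrep
    while any(block * rep in s for s in strings):
        rep += 1
    return block * rep
```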
By now it has been described how to make an embedded diff out of two messages, once it has been decided that those messages should be diffed. However, the translator is not expected to decide which messages to diff, but which PO files to diff. The diffing tool should then automatically pair the messages from the two PO files for diffing, and this section describes the several pairing criteria.
Most obviously, messages should be paired by key, which can be called primary pairing. The PO message key is the unique combination of the msgctxt and msgid strings. In the most usual case -- reviewing an ediff from an incomplete PO file, with fuzzy and untranslated messages, to an updated PO file with those messages translated -- pairing by key will be fully sufficient, as both PO files will contain exactly the same set of messages. These two messages will be paired by key:
old:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
But what should happen if some messages are left unpaired after pairing by key? Consider the earlier example where the diff was taken from the older fully translated to the newer merged PO file:
old:

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
#, fuzzy
#| msgid "The Record of The Witch River"
msgid "Records of The Witch River"
msgstr "Beleška o Veštičjoj reci"
The keys, here just the current msgid strings, of the two messages do not match, so they cannot be paired by key. Yet it would be ungainly to represent the old message as fully removed, and the new message as fully added, in the resulting ediff:
diff:

#: main.c:89
msgid "{-The Record of The Witch River-}~"
msgstr "{-Beleška o Veštičjoj reci-}~"

#. ediff: state {+fuzzy+}
#: main.c:110
#, fuzzy
#| msgid "{+The Record of The Witch River+}~"
msgid "{+Records of The Witch River+}~"
msgstr "{+Beleška o Veštičjoj reci+}~"
(That the message has been fully added or removed can be seen by the trailing tilde in the msgid string, which indicates that the old or new msgid does not exist at all, and so neither does the message containing it.)
Instead, messages left unpaired by key should be tested for pairing by pivoting around previous strings (secondary pairing). The two messages above will thus be paired due to the fact that the current msgid of the old message is equal to the previous msgid of the new message, and they will produce a single ediff message as shown earlier.
Finally, consider the third related combination, when the old PO file has not yet been merged with the template, while the new PO file has both been merged and its translation updated:
old:

#: main.c:89
msgid "The Record of The Witch River"
msgstr "Beleška o Veštičjoj reci"

new:

#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"
Once again it would be a waste to present the old message as fully removed and the new message as fully added in the resulting ediff. When a message is left unpaired after both pairing by key and pairing by pivoting, then the two PO files can be merged in the background -- as if the new is the template for the old, and vice versa -- and then tested for chained pairing by pivoting and by key with the merged PO file as intermediary. This pairing by merging (tertiary pairing) will then produce another natural ediff:
diff:

#: main.c:110
msgid "{-The Record-}{+Records+} of The Witch River"
msgstr "{-Beleška-}{+Beleške+} o Veštičjoj reci"
It can be left to the diffing tool to decide which pairing methods beyond the primary pairing, by key, to use. There should not be much reason not to perform secondary pairing, by pivoting, as well. If tertiary pairing, by merging, is done, the user should be allowed to disable it, as it can sometimes produce strange results (subject to the fuzzy matching algorithm).
For the ediff of two PO files to also be a syntactically valid PO file, constructed ediff messages should be preceded by a PO header in output. At first glance, this PO header could be itself the ediff of headers of the PO files which were diffed. However, there are several issues with this approach:
The reviewer of the ediff PO file would not be informed at once if there was any difference between the headers. Headers tend to be long, and a small change in one of header fields may go visually unnoticed.
Depending on the amount of changes between the two headers, the resulting ediff message of the header could be too badly formed to represent the header as such. For example, if some header fields in msgstr were added or removed, embedded difference wrappers would invalidate the MIME-header format of msgstr, which could confuse PO processing tools.
How would the diff of two collections of PO files (e.g. directories) be packed into a single ediff PO? To pack diffs of several file pairs into one diff file is an expected feature of diffing tools.
To avert these difficulties, the following is done instead. First, a minimal valid header is constructed for the ediff PO file, independently of the headers in diffed PO files. The precise content can be left to the diffing tool, with Pology's poediff producing something like:
# +- ediff -+
msgid ""
msgstr ""
"Project-Id-Version: ediff\n"
"PO-Revision-Date: 2009-02-08 01:20+0100\n"
"Last-Translator: J. Random Translator\n"
"Language-Team: Differs\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Ediff-Header-Context: ~\n"
The PO-Revision-Date header field is naturally set to the date when the ediff was made. Values for the Last-Translator and Language-Team fields can be somehow pulled from the environment (poediff will fetch them from the Pology user configuration, or set some dummy values). The encoding of the ediff PO can be chosen at will, so long as all constructed ediff messages can be encoded with it (poediff will always use UTF-8). The purpose of the final field, X-Ediff-Header-Context, will be explained shortly.
It is the first next entry in the ediff PO file that will actually be the ediff of headers of the two diffed PO files. Headers are diffed just like any other message, but the resulting ediff is given a few additional decorations:
# =========================================================
# Translation of The Witch River into Serbian.
# Koja Kojic <koja.kojic@nedohodnik.net>, 2008.
# {+Era Eric <era.eric@ledopad.net>, 2008.+}~
msgctxt "~"
msgid ""
"- l10n-wr/sr/wriver-main.po\n"
"+ l10n-wr/sr-mod/wriver-main.po\n"
msgstr ""
"Project-Id-Version: wriver 0.1\n"
"POT-Creation-Date: 2008-09-22 09:17+0200\n"
"PO-Revision-Date: 2008-09-{-25 20:44-}{+28 21:49+}+0100\n"
"Last-Translator: {-Koja Kojic <koja.kojic@nedohodnik-}"
"{+Era Eric <era.eric@ledopad+}.net>\n"
"Language-Team: Serbian\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
Observe the usual ediff segments: a translator comment has been added, for the new translator who updated the PO file, and the PO-Revision-Date and Last-Translator header fields contain ediffs reflecting the update. These are the only actual differences between the two headers. More interesting are the additional decorations:
The very first translator comment (here a long line of equality signs) can be anything, and serves as a strong visual indicator of the header ediff. This is especially convenient when the ediff PO file contains diffs of several pairs of PO files.
That this particular message is a header ediff is indicated by the msgctxt string set to a special value, here a single tilde. This value is given up front by the X-Ediff-Header-Context field of the ediff PO header. It should be computed during diffing such that it does not conflict with the msgctxt of any of the message ediffs (e.g. it may simply be a sufficiently long sequence of tildes).
The msgid string of the header ediff contains newline-separated paths of the diffed PO files. More precisely, the two lines of the msgid string are in the form [+-] file-path [ <<< comment ]\n. The trailing newline of the second file path is elided if the msgstr string does not end in newline, to prevent msgfmt from complaining. The file path is followed by the optional, <<<-separated comment. This comment can be used for any purpose, one of which will be demonstrated with poediff.
Although when a PO file is properly updated there should always be some difference in the header, it may happen that there is none. In that case, the header ediff message is still added, but it contains only the additional decorations: the visual separator comment, the special msgctxt, and the msgid with file paths. All other comments and the msgstr string are empty; the empty msgstr immediately shows that there is no difference between the headers. This "empty" header ediff is needed to provide the file paths of the diffed PO files, and, if several pairs of PO files were diffed, to separate their diffs in the ediff PO file.
After the header ediff message, ordinary ediff messages follow. When all constructed ediff messages from the current pair of PO files are listed, the next pair starts with a new header ediff message, and so on.
Especially when diffing several pairs of PO files, it may happen that two ediff messages have the same keys (msgid and msgctxt strings) and thus cannot both be added as such to the ediff PO file. When that happens, the ediff message which was added after the first with the same key will have its msgctxt string padded with a few random alphanumerics, to make its key unique. This padding sequence will be recorded in the #. ediff: comment, as the ctxtpad field. For example:
# =========================================================
msgctxt "~"
msgid "...(first PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"

# =========================================================
msgctxt "~"
msgid "...(second PO header ediff)..."
msgstr "..."

#. ediff: state {-fuzzy-}, ctxtpad q9ac3
msgctxt "|q9ac3~"
msgid "White{+ horizon+}"
msgstr "Belo{+ obzorje+}"
The padding sequence is appended to the original msgctxt, separated by |. If there was no original msgctxt, the padding sequence is further extended by a tilde.
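As an illustration, the padding and unpadding convention just described could be sketched in Python. The function names are hypothetical, for illustration only; this is not Pology's actual API:

```python
import random
import string

def pad_msgctxt(msgctxt, npad=5):
    # Construct a padded msgctxt for an ediff message whose key would
    # otherwise collide: append '|' plus a random alphanumeric sequence,
    # and add a trailing tilde when there was no original msgctxt.
    # Returns the padded string and the pad (as recorded in the
    # ctxtpad field of the '#. ediff:' comment).
    pad = "".join(random.choice(string.ascii_lowercase + string.digits)
                  for _ in range(npad))
    if msgctxt is None:
        return "|" + pad + "~", pad
    return msgctxt + "|" + pad, pad

def unpad_msgctxt(padded, pad):
    # Recover the original msgctxt given the recorded pad;
    # None means the message had no msgctxt at all.
    if padded.endswith("|" + pad + "~"):
        return None
    return padded[:-(len(pad) + 1)]
```

For example, for a message without a msgctxt and a pad of q9ac3, this yields exactly the "|q9ac3~" form shown above.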
The poediff script in Pology implements embedded diffing of PO files as defined in the previous section. To diff two PO files, running the usual:
$ poediff orig/foo.po mod/foo.po
will write out the ediff PO content to standard output, with some basic shell coloring of difference segments. The ediff can be written into a file (an ediff PO file) either with shell redirection, or with the -o/--output option. It is equally simple to diff directories:
$ poediff orig/ mod/
By default, the given directories are recursively searched for PO files, and PO files present in only one of the directories will also be included in the ediff.
When PO files are handled by a version control system (VCS), poediff can be put into VCS mode using the -c/--vcs option, where the value is the keyword of one of the version control systems supported by Pology. In VCS mode, instead of giving two paths to diff, any number of version-controlled paths (files or directories) are given. Without further options, all locally modified PO files in these paths are diffed against the last commit known to the local repository. For example, if a program is using a Subversion repository, then the PO files in its po/ directory can be diffed with:
$ poediff -c svn prog/po/
Specific revisions to diff can be given with the -r/--revision option, in the form REV1[:REV2]. REV1 and REV2 are not necessarily direct revision IDs, but any strings that the underlying VCS can convert into revision IDs. If REV2 is omitted, diffing is performed from REV1 to the current working copy.
When an ediff is made in VCS mode, msgid strings in header ediffs will state revision IDs, in <<<-separated comments next to the file paths:
# =========================================================
# ...
msgctxt "~"
msgid ""
"- prog/po/sr.po <<< 20537\n"
"+ prog/po/sr.po"
msgstr "..."
Options specific to poediff:
-b
, --skip-obsolete
By default, obsolete messages are treated equally to non-obsolete ones, and can feature in the ediff output. This makes it possible to detect when a message has become obsolete, or has returned from obsolescence, and to show this in the ediff. But sometimes including obsolete messages in diffing may not be desired, and then this option can be issued to ignore them.
-c VCS
, --vcs=VCS
The keyword of the underlying version control system, to switch poediff into VCS mode. See Section 9.7.2, “Version Control Systems” for the list of supported version control systems (or issue --list-vcs
option).
--list-options
, --list-vcs
Simple listings of options and VCS keywords. Intended mainly for writing shell completion definitions.
-n
, --no-merge
Disable pairing of messages by internal merging of diffed PO files. Merging is performed only if some messages were left unpaired after pairing by key and by pivoting, so in usual circumstances it is not done anyway. But when it is done, it may produce strange results, so this option can be used to prevent it.
-o FILE
, --output=FILE
The ediff is by default written to the standard output, and this option can be used to send it to a file instead.
-p
, --paired-only
When directories are diffed, by default the PO files present in only one of them will be included into the ediff, i.e. all their messages will be shown as added or removed. This option will limit diffing only to files present in both directories, in the sense of having the same relative paths (rather than e.g. same PO domain name).
-Q
, --quick
Produce maximally stripped-down output, sometimes useful for quick visual observation of changes, but which cannot be used as a patch. Equivalent to -bns
.
-r REV1
[:REV2
]
, --revision=REV1
[:REV2
]
When operating in VCS mode, the default is to make the diff from the last commit to the current working copy. This option can be used to diff between any two revisions. If the second revision is omitted, the diff is taken from the first revision to the current working copy.
-s
, --strip-headers
Prevents diffing of PO headers, as well as inclusion of the top ediff header in the output. This reduces clutter when the intention is to see only the changes in messages throughout many PO files, but the resulting ediff cannot be used as a patch.
-U
, --update-effort
Instead of outputting the diff, the translation update effort is computed. It is expressed as the nominal number of newly translated words, from the old to the new paths. The procedure to compute this quantity is not straightforward, but the intention is that it roughly approximates the number of words (in the original text) as if the messages had been translated from scratch. Options -b
and -n
are ignored.
Options common with other Pology tools:
-R
, --raw-colors
; --coloring-type
poediff will consult the [user]
section in the user configuration to fill out some of the header of the ediff PO file. It also consults its own section, with the following fields available:
[poediff]/merge=[*yes|no]
Setting to no is the counterpart of the --no-merge command-line option, i.e. this field can be used to permanently disable message pairing by merging.
Basic application of an ediff patch is much easier than that of a line-level patch, because there will be no conflicts if messages have different wrapping, ordering, or extraction-prescribed parts (source references, etc.). The patch is applied by resolving each of its ediff messages into the originating old and new messages; if either the old or the new message exists (by key) in the target PO file and has equal extraction-invariant parts, the message modification is applied, and otherwise it is rejected.
Applying the modification to the target message means overwriting its extraction-invariant parts with those from the new message from the ediff, and leaving other parts untouched. If the target message is already equal to the new message by extraction-invariant parts, then the patch is silently ignored. This means that if the same patch is applied twice to the target PO file, the second application makes no modifications. Likewise if, by chance, the modifications given by the patch were already independently performed by another translator (e.g. a few simple updates to unfuzzy messages).
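The core step of this procedure is resolving an embedded-diff string into its old and new versions. It can be sketched as follows; this is a simplified illustration assuming the default {-...-}/{+...+} wrapper sequences, not Pology's actual implementation:

```python
import re

# Matches one wrapped segment: {-removed-} or {+added+}.
_SEG_RX = re.compile(r"\{([-+])(.*?)[-+]\}", re.S)

def resolve_ediff(text):
    # Split an embedded-diff string into its old and new versions:
    # plain text goes to both, {-...-} only to the old, {+...+} only
    # to the new.
    old_parts, new_parts = [], []
    pos = 0
    for m in _SEG_RX.finditer(text):
        plain = text[pos:m.start()]
        old_parts.append(plain)
        new_parts.append(plain)
        (old_parts if m.group(1) == "-" else new_parts).append(m.group(2))
        pos = m.end()
    old_parts.append(text[pos:])
    new_parts.append(text[pos:])
    return "".join(old_parts), "".join(new_parts)
```

For instance, resolving "Active sonar {-low-}{+high+} frequency" gives the old string "Active sonar low frequency" and the new string "Active sonar high frequency".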
The command-line interface of Pology's poepatch is much like that of patch(1), sans the myriad of its more obscure options. There is the -p option to strip leading elements of file paths in the ediff, and the -d option to prepend to them a directory path where the target PO files are to be looked up. If the ediff was produced in VCS mode, then it can be applied as a patch in any of the following ways:
$ cd repos/prog/po && poepatch <ediff.po
$ cd repos/ && poepatch -p0 <ediff.po
$ poepatch -d repos/prog/po <ediff.po
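The path matching behind these invocations, -p stripping and -d prepending, might be sketched like this (a hypothetical helper, not taken from poepatch itself):

```python
import os

def target_path(patch_path, strip=None, directory=None):
    # Compute the on-disk path of a target PO file from a file path
    # read from the ediff: with strip unset keep only the base name,
    # otherwise remove the smallest prefix containing `strip` slashes
    # (as with patch(1) -p); then prepend `directory` if given (-d).
    if strip is None:
        rel = os.path.basename(patch_path)
    else:
        rel = "/".join(patch_path.split("/")[strip:])
    if directory:
        rel = os.path.join(directory, rel)
    return rel
```

Thus a path prog/po/sr.po in the ediff matches sr.po in the current directory by default, the full path with -p0, and repos/prog/po/sr.po with -d repos/prog/po.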
Header modifications (coming from the header ediff message) are applied in a slightly relaxed fashion: some of the standard header fields are ignored when checking whether the patch is applicable. These are the fields which are known to be volatile as the PO file passes through different translators, and which do not influence the processing of the PO file (such as the encoding or plural forms). The ignored fields are: POT-Creation-Date
, PO-Revision-Date
, Last-Translator
, X-Generator
. When the header modification is accepted, the ignored fields in the target header are overwritten with those from the patch (including being added or removed).
All ediff messages which were rejected as patches will be written out to stdin.rej.po in the current working directory if the patch was read from standard input, or to FILE.rej.po if the patch file FILE.po was given by the -i option.
The file with rejected ediff messages will again be an ediff PO file. It will have the same header as before, except that its comment will mention that the file contains rejects of a patching operation. Afterwards, the rejected ediff messages will follow. Every header ediff message will be present whether rejected or not, for the same purpose of separation and provision of file paths, but if it was not itself rejected as a patch, it will be stripped of comments and the msgstr
string.
Furthermore, to every straight-out rejected ediff message an ediff-no-match
flag will be added. This is done, naturally, because some ediff messages may not be rejected straight-out. Consider the following scenario. A PO file has been merged to produce the fuzzy message:
old:
#: tools/power.c:348
msgid "Active sonar low frequency"
msgstr "Niska frekvencija aktivnog sonara"
new:
#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency"
msgstr "Niska frekvencija aktivnog sonara"
The translator updates the PO file, which produces the usual ediff message when going from fuzzy to translated:
diff:
#. ediff: state {-fuzzy-}
#: tools/power.c:361
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"
However, before this patch could have been applied, the programmer adds a trailing colon to the same message, and the catalog is merged again to produce:
new-2:
#: tools/power.c:361
#, fuzzy
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency:"
msgstr "Niska frekvencija aktivnog sonara"
The patch cannot be cleanly applied at this point, due to the extra colon added in the meantime to the msgid
, so it has to be rejected. If nothing else is done, it would appear in the file of rejects as:
#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-no-match
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"
It is wasteful to reject such a near-matching patch without any indication that it could easily be adapted to the latest message in the target PO file. Therefore, when an ediff message is rejected, the following analysis is performed: by trying out message pairings as on diffing, could the old message from the patch be paired with a current message from the target PO file, and that current message with the new message from the patch? In other words, can an existing message in the target PO file be "fitted in between" the old and new messages defined by the patch? If this is the case, instead of the original, two special ediff messages -- split rejects -- are constructed and written out: one from the old to the current message, and another from the current to the new message. They are flagged as ediff-to-cur
and ediff-to-new
, respectively:
#: tools/power.c:361
#, fuzzy, ediff-to-cur
#| msgid "Active sonar low frequency"
msgid "Active sonar high frequency{+:+}"
msgstr "Niska frekvencija aktivnog sonara"

#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff-to-new
#| msgid "Active sonar {-low-}{+high+} frequency{+:+}"
msgid "Active sonar {-low-}{+high+} frequency"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara"
There are more ways to interpret split rejects, depending on the circumstances. In this example, from the ediff-to-cur
message the reviewer can see what changed in the target message after the translator made the ediff. This can also be seen by comparing the differences embedded into the previous and current msgid
strings in the ediff-to-new
message. With a bit of editing, the reviewer can fold these two messages into an applicable patch:
#. ediff: state {-fuzzy-}
#: tools/power.c:361
#, ediff
msgid "Active sonar {-low-}{+high+} frequency:"
msgstr "{-Niska-}{+Visoka+} frekvencija aktivnog sonara:"
Since the file of rejects is also an ediff PO file, after edits such as this to make some patches applicable, it can be reapplied as a patch. When that is done, poepatch will silently ignore all ediff messages having the ediff-no-match
or ediff-to-new
flags, as these have already been determined inapplicable. That is why in this example the reviewer has replaced the ediff-to-new
flag with the plain ediff
in the folded ediff message.
Depending on the kind of text being translated, and on the distance between the source and target languages in grammar, orthography, and style, it may be difficult to review the ediff in isolation. In general, messages in an ediff PO file will lack positional context, which in the full PO file is provided by the messages immediately preceding and following the observed message. For example, a long passage from documentation probably needs no positional context. But a short, newly added message such as "Crimson" could very well need one, if it has neither a msgctxt
nor an extracted comment describing it: is it really a color? what grammatical ending should it have (in a language which matches adjective to noun gender)? Several messages around it in the full PO file could easily show whether it is just another color in a row, and their grammatical endings (determined by a translator earlier).
Another difficulty is when an ediff message needs some editing before being applied. This may not be easy to do directly in the ediff PO file. Everything is fine so long as only the added text segments ({+...+}
) are edited, but if the sentence needs to be restructured more thoroughly, the reviewer would have to make sure to put all additions into existing or new {+...+}
segments, and to wrap all removals as {-...-}
segments. If this is not carefully performed, the patch will not be applicable any more, as the old message resolved from it will no longer exactly match a message in the target PO file.
For these reasons, poepatch can apply the patch in such a way as not to resolve the ediff, but to set all of its extraction-invariant fields on the message in the target PO file. In effect, the target PO file itself becomes an ediff PO file, but only in the messages which were actually patched. To mark these messages for lookup, the usual ediff
flag is added to them. For example, if the message in the patch file was:
#: title.c:274
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"
then when the patch is successfully applied with embedding, the patched message in target PO file will look like this, among other messages:
#: main.c:110
msgid "Records of The Witch River"
msgstr "Beleške o Veštičjoj reci"

#: title.c:292
#, ediff
msgid "Tutorial"
msgstr "{-Tutorijal-}{+Podučavanje+}"

#: title.c:328
msgid "Start the Expedition"
msgstr "Pođi u ekspediciju"
Other than the addition of the ediff
flag, note that the patched message kept its own source reference, rather than having it overwritten by the one from the patch. The same holds for all extraction-prescribed parts.
The reviewer can now jump over ediff
flags, always having the full positional context for each patched message, and being able to edit it to heart's content, with only minimal care not to invalidate the ediff format. Wrapped difference segments can be removed entirely, and non-wrapped segments can be freely edited; it should only not happen that a wrapped segment loses its opening or closing sequence. But this does not mean that the reviewer has to remove difference segments, that is, to manually unembed patched messages. poepatch can do this automatically, when run on the embedded-patched PO file with the -u
/--unembed
option.
A patch is applied with embedding by issuing the -e
/--embed
option:
$ poepatch -e <ediff.po
patched (E): foo.po
where (E)
in the output indicates that embedding is engaged. After the patched PO file has been reviewed and patched messages possibly edited, all remaining embedded differences are removed, i.e. resolved to the new versions, by running:
$ poepatch -u foo.po
More precisely, only those messages having the ediff
flag are resolved, therefore the reviewer must not remove them (unless manually unembedding the whole message).
What happens with rejected patches when embedding is engaged? They are also added into the target PO file, with heuristic positioning, and no separate file with rejects is created. As with plain patching, straight-out rejects will have the ediff-no-match
flag, and split rejects ediff-to-cur
or ediff-to-new
. If these are not manually resolved during the review (ediff-no-match
messages removed, ediff-to-*
messages removed or folded), when poepatch is run to unembed the differences, it will remove all ediff-no-match
and ediff-to-new
messages, and resolve ediff-to-cur
messages to current version.
Options specific to poepatch:
-a
, --aggressive
After the messages from the patch and the target PO file have been paired, normally only those differences that have no conflicts (e.g. in translation) will be applied. This option can be issued to instead unconditionally overwrite all extraction-invariant parts of the message in the target PO file with those defined by the paired patch.
-d
, --directory
The directory path to prepend to file paths read from the patch file, when trying to match the files on disk to patch.
-e
, --embed
Apply patch with embedding.
-i FILE
, --input=FILE
Read the patch from the given file instead of from standard input.
-n
, --no-merge
When split rejects are computed, all the methods for pairing messages are used, as on diffing. Pairing by merging can sometimes lead to the same strange results as on diffing, and this option disables it.
-p NUM
, --strip=NUM
Strip the smallest prefix containing the given number of slashes from file paths read from the patch file, when trying to match the files on disk to patch. If this option is not given, only the base name of each read file path is taken as the relative path to match on disk. (This is the same behavior as in patch(1).)
-u
, --unembed
Clears all embedded differences in input PO files, after they have been patched with embedding.
poepatch consults the following user configuration fields:
[poepatch]/merge=[*yes|no]
Setting to no is the counterpart of the --no-merge command-line option, i.e. this field can be used to permanently disable pairing by merging when computing split rejects.
[10] For example Lokalize, when operating in merge mode.
[11] Although this should be quite rare. In a collection of PO files from several translation projects, with over 2 million words in total, there was not a single occurrence where one of the chosen wrapper sequences was part of the text.
[12] At the moment, the following text and PO editors are known to have highlighting for ediffs: Kate, Kwrite, Lokalize.
[13] Whether two messages such as these would get paired for diffing in the first place, will be discussed later on.
Computer programs (though not only them) are sometimes concurrently developed and released from several branches. For example, there may be one "stable" branch, which sees only small fixes and from which periodical releases are made, and another, "development" branch, which undergoes larger changes and may or may not be periodically released as well; at one point, the development branch will become the new stable branch, and the old stable branch will be abandoned. There may also be more than two branches which see active work, such as "development", "stable", and "old stable".
From the programmers' point of view, working in branches can be very convenient. They can freely experiment with new features in the development branch, without having to worry that they will mess something up in the stable branch, from which periodical releases are made. In the stable branch they may fix some bugs discovered between releases, or carry over some important and well-tested features from the development branch. For users who want to be on the cutting edge, they may provide experimental releases from the development branch.
For translators, however, having to deal with different branches of the same collection of PO files is rarely a convenience. It is text to be translated like any other, only duplicated across two or more file hierarchies. This means that translators additionally have to think about how to make sure that new and modified translations made in one branch appear in the other branches too. It gets particularly ugly if there are mismatches between the PO file collections in different branches, like when a PO file is renamed, split into two or more PO files, or merged into another PO file.[14] Sometimes this branch juggling is not necessary; in a strict two-branch setting, translators may choose to work only on the stable branch, and switch to the next stable branch when it gets created (or switch to the development branch shortly before it becomes stable). Even so, branch switching may not go very smoothly in the presence of mismatches between the PO file collections.
Instead, the most convenient arrangement for translators is to work on a single "supercollection" of PO files, from which new and modified translations are periodically and automatically sent to the appropriate PO files in the branches. Such a supercollection can be created and maintained by Pology's posummit script. In terms of this script, the supercollection is called the summit, the operation of creating and updating it is called gathering, and the operation of filling out branch PO files is called scattering.
What do summit PO files look like? When all branches contain the same PO file, the counterpart summit PO file is simply the union of all messages from the branch PO files. A message in the summit PO file differs from the branch messages only by having the special #. +> ...
comment, which lists the branches that contain this message. If there were two branches, named by the devel
and stable
keywords, an excerpt from a summit PO file could be:
#. +> devel
#: kdeui/jobs/kwidgetjobtracker.cpp:469
msgctxt "the destination URL of a job"
msgid "Destination:"
msgstr ""

#. +> stable
#: kdeui/jobs/kwidgetjobtracker.cpp:469
msgid "Destination:"
msgstr ""

#. +> devel stable
#: kdeui/jobs/kwidgetjobtracker.cpp:517
msgid "Keep this window open after transfer is complete"
msgstr ""
The first message above exists only in the development branch, the second only in the stable branch, and the third in both branches. The source reference always refers to the source file in the first listed branch. Any other extracted comments (#.
) are also taken from the first listed branch.
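Reading the branch list of a summit message from its extracted comments can be sketched as follows. Comments are assumed to be given without the leading "#. " marker, and the function is illustrative, not part of posummit:

```python
def summit_branches(extracted_comments):
    # Return the list of branch identifiers from the '+>' extracted
    # comment of a summit message; empty list if there is none.
    for cmt in extracted_comments:
        if cmt.startswith("+>"):
            return cmt[2:].split()
    return []
```

For the three messages in the excerpt above, this would yield ["devel"], ["stable"], and ["devel", "stable"], respectively.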
Note that the first two messages differ only by context. The context was added in the development branch, but not in the stable branch, probably in order not to break the message freeze. However, due to the special ordering of messages in summit PO files, these two messages appear together, allowing the translator to immediately make the correction in the stable branch too, if the new context in the development branch shows it to be necessary.
When a PO file from one branch has a different name in another branch, or several PO files from one branch are represented with a single PO file in another branch, the summit can still handle it gracefully, by manually mapping branch PO files to summit PO files. One branch PO file can be mapped to one or more summit PO files, and several branch PO files can be mapped to one summit PO file. Usually, but not necessarily, one branch (e.g. the development branch) is taken as reference for the summit file organization, and stray PO files from other branches are mapped accordingly.
If a team of translators works in the summit, it is sufficient that one team member (and possibly another one as backup) manages the summit. After the initial setup, this team member should periodically run posummit to update summit and branch PO files. All other team members can simply translate the summit PO files, oblivious of any summit operations behind the scenes. It is also possible that team members perform summit operations on their own, on a subset of PO files that they are about to work on. It is up to the team to agree upon the most convenient workflow.
There are two major parts to setting up the summit: linking the locations and organization of PO files in the branches to those of the summit, and deciding which summit mode will be used.
Great flexibility is possible in linking branches to the summit, but at the expense of possibly heavy configuring. To make it simpler, currently there are two types of branch organization which can be handled automatically, just by specifying a few paths and options. In the by-language branch organization, PO files in branches are grouped by language and their file names reflect their domain names:
devel/                  # development branch
    aa/                 # language A
        alpha.po
        bravo.po
        charlie.po
        ...
    bb/                 # language B
        alpha.po
        bravo.po
        charlie.po
        ...
    ...
    templates/          # templates
        alpha.pot
        bravo.pot
        charlie.pot
        ...
stable/                 # stable branch
    aa/
        ...
    bb/
        ...
    templates/
        ...
...
The other organization that can be automatically handled is by-domain:
devel/                  # development branch
    alpha/              # domain alpha
        aa.po           # language A
        bb.po           # language B
        ...
        alpha.pot       # template
    bravo/
        aa.po
        bb.po
        ...
        bravo.pot
    charlie/
        aa.po
        bb.po
        ...
        charlie.pot
    ...
stable/                 # stable branch
    alpha/
        ...
    bravo/
        ...
    charlie/
        ...
...
In both organizations, there can be any number of subdirectories in the middle, between the branch top directory and the directories where the PO files are. For example, in the by-language organization there could be some categorization:
path/to/devel/
    aa/
        utilities/
            alpha.po
            bravo.po
            ...
        games/
            charlie.po
            ...
    bb/
        ...
while in the by-domain organization the domain directories could be within their respective sources[15]:
devel/
    appfoo/
        src/
        doc/
        po/
            foo/
                aa.po
                bb.po
                ...
                foo.pot
            libfoo/
                aa.po
                bb.po
                ...
                libfoo.pot
            ...
    appbar/
        ...
There are three possible summit modes: direct summit, summit over dynamic templates, and summit over static templates. In the direct summit, only branch PO files are processed: new and modified messages are gathered from them, and summit translations are scattered to them. In summit over dynamic templates, messages from branch PO files are gathered only once, at the creation of the summit; after that, it is the branch templates (POT files) that are gathered into summit templates, and the summit PO files are then merged with them. The summit templates are not actually seen: they are gathered internally when the merging command is issued, and removed after merging is done. Summit over static templates is quite the same, except that the summit templates are explicitly gathered and kept around, and merging is done separately.
What is the reason for having three summit modes to choose from? The direct summit mode is there because it is the easiest to explain and understand, and it does not require that branches contain templates. It is, however, not recommended, for two reasons. Firstly, someone may mistakenly translate directly in a branch[16], and those translations may be silently gathered into the summit. This is bad for quality control (review, etc.), as it is expected that the summit is the sole source of translations. Secondly, you may want to perform some automatic modifications on translations when scattering, but not get those modifications back into the summit on gathering, which would happen with a direct summit. These issues are avoided by using summit over dynamic templates, though then the branches must provide templates. Finally, summit over static templates makes sense when several language teams share the summit setup: since gathering is the most complicated operation and sometimes requires manual intervention, it can be done once (by one person) on the summit templates, while the language teams can then merge and scatter their summits in a fully automatic fashion.
There is one important design decision which holds for all summit modes: all summit PO files must be unique by domain name (i.e. base file name without extension), even if they are in different subdirectories within the summit top directory. This in turn means that in the automatically supported branch organizations (by-domain and by-language) PO domains should be unique as well.[17] This was done for two reasons. Less importantly, it is convenient to be able to identify a summit PO file simply by its domain name rather than its full path (especially in some posummit invocations). More importantly, uniqueness of domain names allows PO files to be located in different subdirectories in different branches. This happens, for example, in large projects in which code moves between modules. If the branches do not satisfy this property, i.e. they contain same-named PO domains with totally different content, it is necessary to define a path transformation (see Section 5.1.4, “Transforming Branch Paths”) which will produce unique domain names with respect to the summit.
The following sections describe how to set up each of the modes, in each of the outlined branch organizations. They should be read in turn up to the mode that you want to use, because they build upon each other.
Let us assume that branches are organized by-language, that the branch top directories are in the same parent directory, and that you want the summit top directory to be at the level of the branches' parent directory. That is:
branches/
    devel-aa/
        alpha.po
        bravo.po
        ...
    stable-aa/
        alpha.po
        bravo.po
        ...
summit-aa/
    alpha.po
    bravo.po
    ...
    summit-config
aa
is the language code, which can be added for clarity, but is not necessary. It could also be a subdirectory, as in branches/devel/aa
and summit/aa
. At start you have the branches/
directory ready; now you create the summit-aa/
directory, and within it the summit configuration file summit-config
with the following content:
S.lang = "aa"
S.summit = dict(
    topdir=S.relpath("."),
)
S.branches = [
    dict(id="devel", topdir=S.relpath("../branches/devel-aa")),
    dict(id="stable", topdir=S.relpath("../branches/stable-aa")),
    # ...and any other branches.
]
S.mappings = [
]
This is all that is necessary to set up a direct summit. The configuration file must be named exactly summit-config
, because posummit will look for a file named like that through parent directories and automatically pick it up. As you may have recognized, summit-config
is actually a Python source file; posummit will insert the special S
object when evaluating summit-config
, and it is through this object that summit options are set. S.lang
states the language code of the summit. S.summit
is a Python dictionary that holds options for the summit PO files (here only its location, through topdir=
key), while S.branches
is a list of dictionaries, each specifying options per branch (here the branch identifier by id=
key and top directory). The S.relpath
function is used to make file and directory paths relative to summit-config
itself. S.mappings
is a list of PO file mappings, for cases of splitting, merging, and renaming between branches. In this example S.mappings
is set to empty only to point out its importance, but it does not need to be present if there are no mappings.
If branches are organized by-domain, the summit tree will still look the same, with PO files named by domain rather than by language:
branches/
    devel/
        alpha/
            aa.po
            bb.po
            ...
        bravo/
            aa.po
            bb.po
            ...
        ...
    stable/
        alpha/
            aa.po
            bb.po
            ...
        bravo/
            aa.po
            bb.po
            ...
        ...
summit-aa/
    alpha.po
    bravo.po
    ...
    summit-config
The only difference in the summit configuration is the addition of by_lang=
keys into the branch dictionaries:
S.branches = [
    dict(id="devel", topdir=S.relpath("../branches/devel"),
         by_lang=S.lang),
    dict(id="stable", topdir=S.relpath("../branches/stable"),
         by_lang=S.lang),
]
Presence of the by_lang=
key signals that the branch is organized by-domain (i.e. PO files named by language), and its value is the language code within the branch. Normally it is set to the previously defined S.lang
, but it can also be something else in case different codes are used between the branches or the branches and the summit.
When the configuration file has been written, the summit can be gathered for the first time (i.e. summit PO files created):
$ cd .../summit-aa/
$ posummit gather --create
The path of each created summit PO file will be written out, along with paths of branch PO files from which messages were gathered into the summit file. After the run is finished, the summit is ready for use.
While this was sufficient to set up a summit, there is a myriad of options available for specialized purposes, which will be presented throughout this chapter. Also, given that the summit configuration file is Python code, you can add into it any scripting that you wish. Some summit options (defined through the S
object) even take Python functions as values.
Again consider by-language organization of branches, similar to the direct summit example above, except that now template directories too must be present in branches:
branches/
    devel/
        aa/
            alpha.po
            bravo.po
            ...
        templates/
            alpha.pot
            bravo.pot
            ...
    stable/
        aa/
            alpha.po
            bravo.po
            ...
        templates/
            alpha.pot
            bravo.pot
            ...
summit-aa/
    alpha.po
    bravo.po
    ...
    summit-config
Here the language PO files and templates are put into subdirectories within the branch directory only for convenience; this is not mandatory. For example, language files could reside in branches/devel-aa and templates in branches/devel-templates, with no path connection required between the two. This is because the template path per branch is explicitly given in summit-config, which would look like this:
S.lang = "aa"
S.over_templates = True
S.summit = dict(
    topdir=S.relpath("."),
)
S.branches = [
    dict(id="devel",
         topdir=S.relpath("../branches/devel/aa"),
         topdir_templates=S.relpath("../branches/devel/templates")),
    dict(id="stable",
         topdir=S.relpath("../branches/stable/aa"),
         topdir_templates=S.relpath("../branches/stable/templates")),
]
S.mappings = [
]
Compared to the configuration of a direct summit, two things are added here. The S.over_templates option is set to True, to indicate that summit over templates is used. The path to templates is set with the topdir_templates= key for each branch.
In by-domain branch organization, the directory tree looks just the same as for direct summit, except that each domain directory also contains the templates:
branches/
    devel/
        alpha/
            aa.po
            bb.po
            ...
            alpha.pot
        bravo/
            aa.po
            bb.po
            ...
            bravo.pot
        ...
    stable/
        alpha/
            aa.po
            bb.po
            ...
            alpha.pot
        bravo/
            aa.po
            bb.po
            ...
            bravo.pot
        ...
summit-aa/
    alpha.po
    bravo.po
    ...
    summit-config
Summit configuration is modified in the same way as it was for the direct summit, by adding the by_lang= key to branch specifications:
S.branches = [
    dict(id="devel",
         topdir=S.relpath("../branches/devel/aa"),
         topdir_templates=S.relpath("../branches/devel/templates"),
         by_lang=S.lang),
    dict(id="stable",
         topdir=S.relpath("../branches/stable/aa"),
         topdir_templates=S.relpath("../branches/stable/templates"),
         by_lang=S.lang),
]
Initial gathering of the summit is done slightly differently compared to the direct summit:
$ cd .../summit-aa/
$ posummit gather --create --force
The --force option must be used here because, unlike in a direct summit, explicit gathering is not regularly done in summit over dynamic templates.
As mentioned earlier, summit over static templates can be used when several language teams want to share the summit setup, for greater efficiency. The branch directory tree looks exactly the same as in summit over dynamic templates (with several languages present), but the summit tree is somewhat different:
branches/   # as before, either by-language or by-domain
summit/
    summit-config-shared
    aa/
        alpha.po
        bravo.po
        ...
    bb/
        alpha.po
        bravo.po
        ...
    templates/
        alpha.pot
        bravo.pot
        ...
First of all, there is now the summit/ directory, which contains subdirectories by language (the language summits) and one subdirectory for summit templates (the template summit). Then, there is no summit-config file any more, but summit-config-shared; the name can actually be anything, so long as it is not exactly summit-config. This is to prevent posummit from automatically picking it up, as the configuration is no longer tied to a single language summit. Instead, the path to the configuration file and the language code are explicitly given as arguments to posummit.
The configuration file for by-language branches looks like this:
S.over_templates = True
S.summit = dict(
    topdir=S.relpath("%s" % S.lang),
    topdir_templates=S.relpath("templates"),
)
S.branches = [
    dict(id="devel",
         topdir=S.relpath("../branches/devel/%s" % S.lang),
         topdir_templates=S.relpath("../branches/devel/templates")),
    dict(id="stable",
         topdir=S.relpath("../branches/stable/%s" % S.lang),
         topdir_templates=S.relpath("../branches/stable/templates")),
]
S.mappings = [
]
Compared to summit over dynamic templates, here S.lang is no longer hardcoded in the configuration file, but set at each run of posummit through the command line. This means that paths of language directories too have to be dynamically adapted based on S.lang, hence the string interpolations "...%s..." % S.lang.
For by-domain branches, again by_lang= keys are simply added to the branches:
S.branches = [
    dict(id="devel",
         topdir=S.relpath("../branches/devel/%s" % S.lang),
         topdir_templates=S.relpath("../branches/devel/templates"),
         by_lang=S.lang),
    dict(id="stable",
         topdir=S.relpath("../branches/stable/%s" % S.lang),
         topdir_templates=S.relpath("../branches/stable/templates"),
         by_lang=S.lang),
]
In summit over static templates mode, initial gathering is first done for summit templates, like this:
$ cd .../summit/
$ posummit summit-config-shared templates gather --create
The first two arguments are now the path to the configuration file and the language code, where templates is the dummy language code for templates[18]. After this is finished, language summits can be gathered:
$ posummit summit-config-shared aa gather --create --force
$ posummit summit-config-shared bb gather --create --force
$ ...
Note that --force was not needed when gathering templates, because in this mode the template summit is periodically gathered, while language summits are not.
When branches contain only PO files which are used natively, by programs fetching translations at runtime, then all branch PO files will be unique by their domain name (as mandated by the Gettext runtime system). It will not happen that two branch subdirectories contain a PO file with the same name. This fits perfectly with the summit requirement that all summit PO files be unique by domain names.
However, if PO files are used as an intermediate to other formats, branches may contain same-name PO files, in different subdirectories, which otherwise have nothing in common. For example, each subdirectory may contain a PO file named index.po, help.po, etc. If this were left unattended, all the same-name PO files would be collapsed into a single summit PO file, which makes no sense given that they have (almost) no common messages. For this reason, it is possible to define transformations which modify absolute branch paths during processing, such that branch PO files are seen with unique names.
Consider the following example of two branches for language aa (i.e. by-language organization), with PO files non-unique by domain name:
branches/
    devel-aa/
        chapter1/
            intro.po
            glossary.po
            ...
        chapter2/
            intro.po
            glossary.po
            ...
        ...
    stable-aa/
        chapter1/
            intro.po
            glossary.po
            ...
        chapter2/
            intro.po
            glossary.po
            ...
        ...
These branches cover some sort of a book, where each chapter has some standard elements, and thus some same-name PO files with totally different content in each chapter's subdirectory. To have unique domain names in the summit, you might decide upon a flat file tree with the chapter name as a prefix:
summit-aa/
    chapter1-intro.po
    chapter1-glossary.po
    ...
    chapter2-intro.po
    chapter2-glossary.po
    ...
    summit-config
To achieve this, you must first write two Python functions (remember that the summit configuration file is a normal Python source file), one to split branch paths and another to join them, and add them to branch specifications in S.branches.
The function to split branch paths takes a single argument, the branch PO file path relative to the branch top directory, and returns the summit PO domain name and the summit subdirectory. For the example above, the splitting function would look like this:
def split_branch_path (subpath):
    import os
    filename = os.path.basename(subpath)     # get PO file name
    domain0 = filename[:filename.rfind(".")] # strip .po extension
    subdir0 = os.path.dirname(subpath)       # get branch subdirectory
    domain = subdir0 + "-" + domain0         # set final domain name
    subdir = ""                              # set summit subdirectory
    return domain, subdir
Note that the branch subdirectory was used only to construct the summit domain name, while the summit subdirectory is an empty string, because the summit file tree should be flat.
The function to join branch paths takes three arguments. The first two are the summit PO domain name and the summit subdirectory. The third argument is the value of the by_lang= key for the given branch. The return value is the branch PO file path relative to the branch top directory. It would look like this:
def join_branch_path (domain, subdir, bylang):
    import os
    subdir0, domain0 = domain.split("-", 1)   # get branch subdirectory
                                              # and branch domain name
                                              # from summit domain name
    filename = domain0 + ".po"                # branch PO file name
    subpath = os.path.join(subdir0, filename) # branch relative path
    return subpath
Here the subdir argument (summit subdirectory) is not used because it is always empty due to the flat summit file tree, and bylang is not used because it is None due to by-language branch organization.
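As a sanity check, the two functions should invert each other. Here is a self-contained round trip over one of the example branch paths (the function bodies are repeated from above so that the snippet runs on its own):

```python
import os

def split_branch_path (subpath):
    filename = os.path.basename(subpath)     # get PO file name
    domain0 = filename[:filename.rfind(".")] # strip .po extension
    subdir0 = os.path.dirname(subpath)       # get branch subdirectory
    return subdir0 + "-" + domain0, ""       # summit domain, flat subdir

def join_branch_path (domain, subdir, bylang):
    subdir0, domain0 = domain.split("-", 1)  # recover subdirectory and domain
    return os.path.join(subdir0, domain0 + ".po")

domain, subdir = split_branch_path("chapter1/intro.po")
print(domain)                                  # chapter1-intro
print(join_branch_path(domain, subdir, None))  # chapter1/intro.po
```

Note that the split("-", 1) in the joining function assumes that chapter subdirectory names contain no hyphen themselves; a more robust naming scheme would pick a separator that cannot appear in subdirectory names.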
The definitions of the splitting and joining functions are written into the summit-config file somewhere before the S.branches branch specification, and added to each branch through the transform_path= key:
S.branches = [
    dict(id="devel",
         topdir=S.relpath("../branches/devel-aa"),
         transform_path=(split_branch_path, join_branch_path)),
    dict(id="stable",
         topdir=S.relpath("../branches/stable-aa"),
         transform_path=(split_branch_path, join_branch_path)),
]
This means that it is possible, if necessary, to define different splitting and joining functions per branch.
From time to time, summit PO files need to be updated to reflect changes in branch PO files, and scattered so that branch PO files get new translations from the summit. How summit PO files are updated, by whom and to what extent, depends on the summit mode and the organization of the translation team. The same holds for when and by whom the scattering is done.
The usual maintenance procedure would be for one designated person (e.g. the team coordinator) to update all summit PO files and to scatter new translations to branch PO files, at certain periods of time agreed upon in the translation team.
If there are no mismatches between the branch and summit PO files, the summit update procedure is fully automatic. How the summit is updated depends on the summit mode. In direct summit, the update is performed by gathering:
$ cd $SUMMITDIR
$ posummit gather
In summit over dynamic templates, merging is performed instead:
$ cd $SUMMITDIR
$ posummit merge
Finally, in summit over static templates, first the template summit is gathered, and then language summits are merged:
$ posummit $SOMEWHERE/summit-config-shared templates gather
$ posummit $SOMEWHERE/summit-config-shared aa merge
$ posummit $SOMEWHERE/summit-config-shared bb merge
...
Note that unlike when setting up the summit, no --create or --force options are used. Without them, posummit will warn about any new mismatches between branches and the summit and abort the operation, leaving the user to examine the situation and take corrective measures. Section 5.2.3, “Handling Mismatches Between Branches and Summit” discusses this in detail.
Scattering to branches is always fully automatic. For direct summit and summit over dynamic templates it is performed with:
$ cd $SUMMITDIR
$ posummit scatter
For summit over static templates, scattering is done for each language summit:
$ posummit $SOMEWHERE/summit-config-shared aa scatter
$ posummit $SOMEWHERE/summit-config-shared bb scatter
...
If summit update (merge, gather, or both, depending on the summit mode) is scheduled to run automatically, the maintainer should make sure to be notified when posummit aborts, so that mismatches can be promptly handled.
The obvious advantage of this maintenance method is that other team members do not need to know anything about the workings of the summit. They only fetch updated summit PO files, translate them, and submit them back. The disadvantage is that a summit update may interfere with a particular translator who happened to be working on a PO file which just got updated in the repository, causing merge conflicts when he attempts to submit that PO file.
In this maintenance mode, each team member performs summit operations on exactly the PO files that he wants to work on. This has the advantage over centralized maintenance that translators do not interfere in each other's work, as summit PO files get updated only at the request of the translator working on them. Additionally, it may provide faster gather(-merge)-scatter turnaround time. The disadvantage is that now all team members have to know how the summit is maintained, so this method is likely applicable only to strongly technical teams.
Distributed maintenance is in general the same as centralized, except that now all posummit command lines take extra arguments, namely the selection of PO files to operate on -- so-called operation targets. Operation targets can be given in two ways. One is directly, by file or directory paths. For example, in summit over dynamic templates mode, when working on the foobaz.po file, the translator would use the following summit commands to merge it and scatter it to the branches:
$ cd $SUMMITDIR
$ posummit merge foosuite/foobaz.po
$ # ...update the translation...
$ posummit scatter foosuite/foobaz.po
To update all files in the foosuite/ subdirectory at once, the translator can execute instead:
$ cd $SUMMITDIR
$ posummit merge foosuite/
$ posummit scatter foosuite/
It is also possible to single out a particular branch for scattering, by giving the path to the PO file in that branch instead of in the summit. To scatter foobaz.po only to the devel branch, in by-language branch organization the translator would use:
$ posummit scatter $SOMEWHERE/devel/aa/foosuite/foobaz.po
and in by-domain branch organization:
$ posummit scatter $SOMEWHERE/devel/foosuite/foobaz/po/foobaz/aa.po
Note that the current working directory still has to be within the summit directory, so that posummit can find the summit configuration file. (This requirement is not present for summit over static templates, as there the path to the configuration file is given on the command line.)
The other kind of operation target consists of PO domain names and subdirectory names alone. In this formulation, the first example above could be replaced with:
$ posummit merge foobaz
$ posummit scatter foobaz
Since all summit PO file names are unique, this is sufficient information for posummit to know what it should operate on. To limit the operation to a certain branch, the branch name is added in front of the domain name, separated by a colon. To scatter foobaz.po to the devel branch:
$ posummit scatter devel:foobaz
and to scatter the complete foosuite/ subdirectory to the same branch:
$ posummit scatter devel:foosuite/
Note that the trailing slash is significant here, since otherwise the argument would be interpreted as a single PO file (posummit would exit with an error, reporting that such a file does not exist). The summit itself also has a "branch name" assigned for use in operation targets of this kind, and that is +.
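To summarize the target syntax, here is a small illustrative sketch of how such a specification decomposes. This is not Pology's actual parsing code, and parse_target is a hypothetical name; it only restates the rules above:

```python
def parse_target (target):
    """Decompose an operation target of the branch:name kind.

    Returns (branch, name, is_subdir): branch is None when no branch
    was specified ("+" would select the summit itself), and a trailing
    slash marks the name as a subdirectory instead of a PO domain.
    """
    branch = None
    if ":" in target:
        branch, target = target.split(":", 1)
    is_subdir = target.endswith("/")
    return branch, target.rstrip("/"), is_subdir

print(parse_target("foobaz"))           # (None, 'foobaz', False)
print(parse_target("devel:foobaz"))     # ('devel', 'foobaz', False)
print(parse_target("devel:foosuite/"))  # ('devel', 'foosuite', True)
```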
When merging (or gathering in direct summit mode) is attempted, posummit may abort with a report of mismatches between branches and the summit. The translator must then make the adjustments (Section 5.2.3, “Handling Mismatches Between Branches and Summit” describes how, case by case), or report the situation to someone else to handle.
After selected summit and branch PO files have been updated, the translator can commit them. Alternatively, a half-distributed workflow could be used, where translators only update and commit summit PO files, while scattering to branches is centralized, and automatically performed at a given period. This makes sense because the scattering in no way interferes with translators' workflow and never needs any manual intervention.
When something changes in the PO file tree in one of the branches, posummit will by default abort gathering (or merging in summit over dynamic templates), and present a list of its findings. At this point posummit could be made to continue by issuing the --create option, but then it will resolve mismatches in a simplistic way, which will be wrong in many cases. Instead, you should examine what has happened in the branches, possibly perform some manual operations on summit PO files, possibly add some branch-to-summit mappings, and rerun posummit after the necessary adjustments have been made.
Typical mismatches and their remedies are as follows:
In a translation project with modules represented by subdirectories, it may happen that a program or a library is moved from one module to another, with its PO files following the move. If this happened in all branches, posummit will report that the summit PO file should be moved as well; it can be rerun with --create to do the move itself, or you can make the move manually. If the move happened in only one of the branches, posummit will not complain at all; more precisely, if at least one branch PO file is in the same relative subdirectory as the summit PO file, it is not considered a mismatch.
Another, less obvious case of moving may arise when two same-named branch PO files appear in different subdirectories of the same branch. posummit will by default simply gather them into a single summit PO file, without reporting anything. However, it may be that one of the two subdirectories is of higher priority for translation. Then it would be better if the summit PO file were located in that subdirectory, with posummit reporting if that is not the case, or making the move itself under --create. Subdirectory precedence can be specified through the S.subdir_precedence field, which is simply a list of subdirectories:
S.subdir_precedence = [
    "library",
    "application",
    "plugins/base",
    ...
]
Earlier subdirectories in the list have higher precedence. If a subdirectory is below one of the listed subdirectories, it has the same precedence as its listed top directory. If a subdirectory is neither listed nor below any of the listed ones, its precedence is lower than that of all the listed ones.
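These rules can be restated as a small ranking function, where a lower rank means higher precedence. This is an illustration of the semantics only, not Pology's implementation, and subdir_rank is a hypothetical name:

```python
def subdir_rank (subdir, precedence):
    """Rank a subdirectory against a precedence list; lower rank wins.

    A subdirectory below a listed one inherits that entry's rank;
    a subdirectory neither listed nor below a listed one ranks
    after all listed entries.
    """
    for rank, top in enumerate(precedence):
        if subdir == top or subdir.startswith(top + "/"):
            return rank
    return len(precedence)

precedence = ["library", "application", "plugins/base"]
print(subdir_rank("library", precedence))             # 0
print(subdir_rank("plugins/base/extra", precedence))  # 2
print(subdir_rank("docs", precedence))                # 3
```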
When a piece of software appears in the project (created or imported), its PO files appear with it. These PO files are "totally" new, in the sense that they are not derived from any existing PO file. In this case, posummit will report that new branch PO files have no corresponding summit PO files, and the expected paths of the missing summit PO files. After having checked that the branch PO files are indeed totally new, you can rerun posummit with --create, or manually copy the branch PO files to the expected summit paths (they will be equipped with summit-specific information when posummit rolls over them).
A piece of software may be removed from the project (not maintained any more, moved to another project), which will cause its PO files to disappear. posummit will then report that some summit PO files have no corresponding branch PO files. You should check that the branch PO files have indeed simply been removed, and then rerun posummit with --create, or manually remove the summit PO files.
When, for example, a program changes its name, normally its PO file will be renamed as well. What happens in this case is that posummit reports two problems: a branch PO file without a corresponding summit PO file (new name), and a summit PO file without any corresponding branch PO files (old name). When you realize that the cause of these paired reports is indeed a renaming (they could also be an unrelated addition and removal), you must rename the summit PO file manually. Note that if you had instead issued the --create option, the existing summit PO file would have been removed and an empty one with the new name created -- definitely not what was desired.
A more complicated case of renaming is when the name is changed in only one branch. posummit then reports only the branch PO file with the new name as having no summit PO file, since the existing summit PO file matches the non-renamed branch PO files. In this case, the usual correction is to rename the summit PO file to the new name and map the old names from other branches to the new name. If foobaz.po was renamed to fooqwyx.po in the devel branch, but kept its name in stable, then the mapping in the summit configuration file would be:
S.mappings = [
    ...
    ("stable", "foobaz", "fooqwyx"),
    ...
]
Each mapping entry is a sequence of strings in parentheses. The first string is the branch name, the second string is the domain name of the PO file in that branch, and the third string is the domain name of the PO file in the summit. When you add this mapping (and rename summit foobaz.po to fooqwyx.po), you can rerun posummit.
If the summit is over static templates, i.e. there are separate template and language summits, then renamings should be done in all of them.
If a single PO file becomes very big, it may be split into several smaller files by categories of messages (e.g. UI and help texts). A program may also be modularized, where the factored-out modules take over part of the messages from the main PO file into their own PO files. Either way, posummit will again simply report that some new branch PO files have appeared, and possibly that some have disappeared, and you recognize that the cause of this is a splitting. Splitting typically happens in the newest branch, but not in older branches. You should then make the same split in summit PO files and map the monolithic PO file from older branches to the newly split summit files. For example, if foobaz.po in the devel branch got split into foobaz.po (of reduced size), libfoobaz.po, and foobaz_help.po, the mapping for the old monolithic PO file in the stable branch would be:
S.mappings = [
    ...
    ("stable", "foobaz", "foobaz", "libfoobaz", "foobaz_help"),
    ...
]
The first string in the mapping is the branch name, the second string is the PO domain name in that branch, and all following strings are the new summit PO domain names which contain parts of the original messages. The order of summit PO domains is somewhat important: if a message exists only in the monolithic PO file in the stable branch and not in the split PO files in the devel branch, and summit heuristics detect no appropriate insertion point in one of the summit PO files, that message will be added to the end of the first summit PO file listed.
"Making the same split in the summit" deserves some special attention. For the template summit (which exists in summit over static templates), this simply means adding any new files and removing old ones (posummit will do that itself if run with --create). But for language summits, you should manually copy the original summit PO file to each new name in turn, and then perform gather (direct summit) or merge (summit over templates). In this way no translated messages will be lost in the split.[19]
Sometimes formerly independent pieces of software are joined into a single package, for more effective maintenance and release. This can happen, for example, when selected plugins are taken into the host program distribution as "core plugins". Their separate PO files may then be merged into a single new PO file, or into an existing PO file. As in the opposite case of splitting, posummit will simply report that some summit PO files no longer have branch counterparts, and possibly that a new branch PO file has appeared. This usually happens in the newest branch first, while older branches retain the separation. The same merging should then be done in the summit too, and mappings added for each of the old separate PO files in other branches. If foobaz_info.po, foobaz_backup.po, and foobaz_filters.po have been merged into the existing foobaz.po in the devel branch, the following mappings for the stable branch should be added:
S.mappings = [
    ...
    ("stable", "foobaz_info", "foobaz"),
    ("stable", "foobaz_backup", "foobaz"),
    ("stable", "foobaz_filters", "foobaz"),
    ...
]
As for making the same merge in the summit, for the template summit (in summit over static templates) you should manually remove the old separate files and possibly add the new monolithic one, or run posummit with --create. In language summits, in order to retain all existing translations, you should manually concatenate the separate files into one (using Gettext's msgcat) and then perform gather (direct summit) or merge (summit over templates).
In summit over templates modes (dynamic or static), the normal way for a language summit PO file to appear is by starting from a clean template, with the corresponding branch PO file then created on scatter. However, when a program previously developed elsewhere is imported into the project, its PO files are imported too. This leads to the situation where there are translated branch PO files with no corresponding language summit PO files. It is corrected by forced gathering of the "injected" branch PO file. If the injected file is alien.po, in summit over dynamic templates you would execute:
$ cd $SUMMITDIR
$ posummit gather --create --force alien
and in summit over static templates:
$ posummit $SOMEWHERE/summit-config-shared aa gather --create --force alien
$ posummit $SOMEWHERE/summit-config-shared bb gather --create --force alien
$ ...
The --force option is necessary because, in summit over templates modes, language summit PO files are normally gathered just once, when the summit is created, and later only merged.
An important thing to note about mismatches is that the reports produced by posummit may be misleading, especially in more complicated situations (splitting, merging). This means that you must carefully examine what has actually happened, based not only on the branch file trees themselves, but also by keeping an eye on channels (e.g. mailing lists) where information for translators is most likely to appear.
There is also the possibility to map a whole branch subdirectory to another directory in the summit. Since summit PO files are unique by domain name, the only effect of subdirectory mapping is to prevent posummit from reporting that files should be moved to another subdirectory, and to have it report the proper expected summit paths when new branch catalogs are discovered. For example, if the PO files from the subdirectory foosuite/ in the devel branch and from the subdirectory foopack/ in the stable branch should both be collected in the summit subdirectory foo/, the subdirectory mapping would be:
S.subdir_mappings = [
    ...
    ("devel", "foosuite", "foo"),
    ("stable", "foopack", "foo"),
    ...
]
Subdirectory mappings should rarely be needed compared to file mappings. A tentative example could be when two closely related software forks are translated within the same project, and they have many PO files in their own subdirectories.
At some moment translation branches will be "shifted": for example, devel will become the new stable, stable may become oldstable (if three branches are maintained), etc. When that happens, mappings should be shifted too. A typical case would be two branches, devel and stable, with some mappings only for stable; then, when the shift comes, all existing mappings would simply be removed.
As the number of mappings grows, or if branch path transformation is employed, it may not be readily clear which summit PO files are related to which branch PO files. A translator may need this information to know exactly which summit PO files to work on in order to have some set of branch files fully translated. For this reason, posummit provides the operation mode deps, in which any number of operation targets are given on the command line, and the dependency chains are reported for those targets.
If you recall the example mapping due to merging, you can check the dependency chain for the file foobaz_info.po in the stable branch by executing one of:
$ cd $SUMMITDIR
$ posummit deps $STABLEDIR/foobaz_info.po
$ posummit deps stable:foobaz_info
in direct summit or summit over dynamic templates, or
$ posummit $SOMEWHERE/summit-config-shared aa deps $STABLEDIR/foobaz_info.po
$ posummit $SOMEWHERE/summit-config-shared aa deps stable:foobaz_info
in summit over static templates. The output would look like this:
: summit-dir/foobaz.po
  devel-dir/foobaz.po
  stable-dir/foobaz_info.po
  stable-dir/foobaz_backup.po
  stable-dir/foobaz_filters.po
You can see that the complete dependency chain to which foobaz_info.po from stable belongs has been written out. The first path in the chain is always the summit PO file, followed by all mapped PO files from each branch in turn.
If the file for which the dependency is requested is mapped to more than one summit PO file, then the dependency chains for each of them are displayed. In the example of mapping due to splitting, if you request the dependency for the monolithic foobaz.po from the stable branch, you get three dependency chains:
: summit-dir/foobaz.po
  devel-dir/foobaz.po
  stable-dir/foobaz.po
: summit-dir/libfoobaz.po
  devel-dir/libfoobaz.po
  stable-dir/foobaz.po
: summit-dir/foobaz_help.po
  devel-dir/foobaz_help.po
  stable-dir/foobaz.po
Other than the main configuration fields for setting the summit type, summit and branch locations, and mappings, there are many optional configuration fields. They can be used to make the translation workflow yet more efficient, by relieving translators from taking care of various details.
Summit operations (gather, merge, scatter) are characterized by PO files and messages flowing between the summit and branches. It is then natural to think of adding some filtering into these flows. For example, on scatter, one could make small orthographic adjustments to the translation, or automatically insert translated UI references.[20]
Filtering is implemented through the ability to insert Pology hooks (see Section 9.10, “Processing Hooks”) into various stages of summit operations; a particular stage will admit only certain types of hooks. To fetch and insert translated UI references on scatter, the resolve-ui hook can be added like this:
from pology.uiref import resolve_ui
S.hook_on_scatter_msgstr = [
    (resolve_ui(uicpathenv="UI_PO_DIR"),),
]
S.hook_on_scatter_msgstr is a list of hooks which are applied to the translation (msgstr fields) before it is written into branch PO files on scatter. Each element of this list is a tuple of one to three elements. The first element in the tuple is the hook function, here resolve_ui[21]. resolve_ui is an F3C hook, which is the type of hook expected in the S.hook_on_scatter_msgstr list.
The second and third elements in the hook tuple are, respectively, selectors by branch and by file. These are used when the hook is not meant to be applied to all branches and all PO files. A selector can be either a regular expression string, which is matched against the branch name or PO domain name (a positive match means to apply the hook), or a function (a return value evaluating as true means to apply the hook). If it is a function, the branch selector gets the branch name as the input argument, and the file selector gets the summit PO domain name and summit subdirectory. For example, to apply the specialized resolve_ui_docbook4 hook only to the foobaz-manual.po file, and plain resolve_ui to all other files, the hook list would be:
from pology.uiref import resolve_ui, resolve_ui_docbook4
S.hook_on_scatter_msgstr = [
    (resolve_ui_docbook4(uicpathenv="UI_PO_DIR"), "", "-manual$"),
    (resolve_ui(uicpathenv="UI_PO_DIR"), "", "(?<!-manual)$"),
]
The branch selector here is an empty string, which means that both hooks apply to all branches (since an empty regular expression matches any string). The resolve_ui_docbook4 hook has the "-manual$" regular expression as the file selector, which means that it should be applied to all PO domain names ending in -manual. The resolve_ui hook has been given the opposite regular expression, "(?<!-manual)$", which matches any PO domain name not ending in -manual.[22] Regular expressions can quickly become unreadable, so here is how the same selection could be achieved with selector functions:
from pology.uiref import resolve_ui, resolve_ui_docbook4

def is_manual (domain, subdir):
    return domain.endswith("-manual")

def is_not_manual (domain, subdir):
    return not is_manual(domain, subdir)

S.hook_on_scatter_msgstr = [
    (resolve_ui_docbook4(uicpathenv="UI_PO_DIR"), "", is_manual),
    (resolve_ui(uicpathenv="UI_PO_DIR"), "", is_not_manual),
]
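Either style selects the same files. The following stand-alone sketch checks that the regular expression and function selectors agree, assuming the regular expressions are applied as an unanchored search and using a couple of hypothetical domain names:

```python
import re

def is_manual (domain, subdir):
    return domain.endswith("-manual")

for domain in ("foobaz-manual", "foobaz"):
    by_regex = bool(re.search("-manual$", domain))
    by_neg_regex = bool(re.search("(?<!-manual)$", domain))
    by_function = is_manual(domain, "")
    # The two regular expressions are complementary, and the
    # plain one agrees with the selector function.
    print(domain, by_regex, by_neg_regex, by_function)
```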
When there is more than one hook in the list, they are applied in the order in which they are listed.
This is all there is to say about hook application in general. What follows is a list of all presently defined hook insertion lists, with admissible hook types given in parentheses. Usually paired F* and S* hook types are possible, such that F* hooks are primarily used for modification, while S* hooks could be employed for validation (e.g. writing out warnings).
S.hook_on_scatter_msgstr
(F3A, F3C, S3A, S3C)Applied to the branch translation (msgstr
fields) on scatter, before it is written into the branch PO file.
S.hook_on_scatter_msg
(F4A, S4A)Applied to branch message on scatter, before it is written into the branch PO file. These hooks can modify any part of the message, like comments, or even the msgid
field.
S.hook_on_scatter_cat
(F5A, S5A)Applied to the branch PO file while still in internal parsed state on scatter, after S.hook_on_scatter_msgstr
had been applied to all messages.
S.hook_on_scatter_file
(F6A, S6A)Applied to the branch PO file as raw file on disk on scatter, after S.hook_on_scatter_cat
had been applied. If one of the hooks reports a non-zero value, the rest of the hooks in the list are not applied to that file.
S.hook_on_scatter_branch
Applied to the complete branch on scatter, after all other hooks on scatter had been applied. Functions used here are not part of the formal hook system. They take a single argument, the branch name, and return a number. If the return value is not zero, the rest of the hooks are skipped on that branch.
S.hook_on_gather_file_branch
(F6A, S6A)Applied to the branch PO file as raw file on disk on gather, before S.hook_on_gather_cat_branch
is applied. The branch PO file will not be modified for real, but only its temporary copy.
S.hook_on_gather_cat_branch
(F5A, S5A)Applied to the branch PO file while still in internal parsed state on gather, before S.hook_on_gather_msg_branch
is applied to all messages.
S.hook_on_gather_msg_branch
(F4A, S4A)Applied to the branch message on gather, before it is used to gather the corresponding summit message.
S.hook_on_gather_msg
(F4A, S4A)Applied to the summit message on gather, after it had been gathered from the corresponding branch messages, but before it is written into the summit PO file.
S.hook_on_gather_cat
(F5A, S5A)Applied to the summit PO file while still in internal parsed state on gather, after S.hook_on_gather_msg
had been applied to all messages.
S.hook_on_gather_file
(F6A, S6A)Applied to the summit PO file as raw file on disk on gather, after S.hook_on_gather_cat
had been applied.
S.hook_on_merge_head
(F4B, S4B)Applied to summit PO header on merge, after the summit PO file has been merged.
S.hook_on_merge_msg
(F4A, S4A)Applied to summit message on merge, after S.hook_on_merge_head
had been applied.
S.hook_on_merge_cat
(F5A, S5A)Applied to the summit PO file while still in internal parsed state on merge, after S.hook_on_merge_msg
had been applied to all messages.
S.hook_on_merge_file
(F6A, S6A)Applied to the summit PO file as raw file on disk on merge, after S.hook_on_merge_cat
had been applied.
You may notice that some logically possible hook insertion lists are missing (e.g. S.hook_on_merge_msgstr
). This is because they are implemented on demand, as the need is observed in practice, and not before the fact.
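As a sketch of how an S-type validation hook from the lists above might look, here is a hypothetical check that could be inserted into S.hook_on_scatter_msgstr. It assumes only that the message object exposes an msgid attribute (as used in examples elsewhere in this chapter), and that an S-hook reports problems through its return value; the exact S3C calling and reporting conventions should be checked against the Pology hook documentation.

```python
def warn_trailing_space (msgstr, msg, cat):
    # Report, but do not fix, trailing whitespace added in translation.
    # Returning the number of problems found is an assumed convention.
    nproblems = 0
    if msgstr.endswith(" ") and not msg.msgid.endswith(" "):
        print("%s: trailing space added in translation of '%s'"
              % (cat, msg.msgid))
        nproblems += 1
    return nproblems
```

It would then be inserted like any other hook, as the single-element tuple (warn_trailing_space,) in the S.hook_on_scatter_msgstr list.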
Here is another example of hook interplay. Branch PO files may still rely on embedding the context into the msgid
field:
msgid "create new document|New"
msgstr ""
but you would nevertheless like to have proper msgctxt
contexts in the summit:
msgctxt "create new document"
msgid "New"
msgstr ""
You can achieve this by writing two small F4A hooks, and inserting them at proper points:
def context_from_embedded (msg, cat):
    if "|" in msg.msgid:
        msg.msgctxt, msg.msgid = msg.msgid.split("|", 1)

def context_to_embedded (msg, cat):
    if msg.msgctxt is not None:
        msg.msgid = "%s|%s" % (msg.msgctxt, msg.msgid)
        msg.msgctxt = None

S.hook_on_gather_msg_branch = [
    (context_from_embedded,),
]
S.hook_on_scatter_msg = [
    (context_to_embedded,),
]
In this way, branch messages will be converted to proper contexts just before they are gathered into the summit, and proper contexts will be converted back into the embedded form when the messages are scattered to branches.
Most likely, branch and summit directories will be kept under some sort of version control. This means that when posummit has finished running, any files that it had added, moved or removed, would have to be manually "reported" to the version control system (VCS). To avoid this, you can set in the summit configuration which VCS is used, among those supported by Pology, and posummit will issue proper VCS commands when changing the file tree. Then, after the posummit run, you can simply issue the VCS commit command on appropriate paths.
Since different VCS may be used for the summit and the branches, it is possible to set them separately. For example, if branches are in a Subversion repository and the summit in a Git repository, the summit configuration would contain:
S.summit_version_control = "git"
S.branches_version_control = "svn"
If the same VCS is used for branches and the summit (whether or not they are in the same repository), a single configuration field suffices:
S.version_control = "git"
If you would like posummit to execute VCS commands only in the summit and not in branches, then you would set only the S.summit_version_control
field.
While wrapping of text fields in PO files (msgid
, msgstr
, etc.) makes no technical difference, it may be convenient for editing to have them wrapped in a particular way. Since posummit is modifying PO files both in the summit and branches anyway, it might as well be told what kind of wrapping to use.
For example, a reasonable wrapping setup could be:
S.summit_wrap = False
S.summit_fine_wrap = True
S.branches_wrap = True
S.branches_fine_wrap = False
S.*_wrap
fields activate or deactivate basic (column-based) wrapping, while S.*_fine_wrap
fields do the same for logical breaks. So in this example, summit messages are wrapped only on logical breaks (may be good for editing), while branch messages are wrapped only on columns (may reduce size of VCS deltas).
If not set, the default is basic wrapping without fine wrapping, for both branches and the summit.
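Stated explicitly, this default would correspond to the following configuration:

```python
S.summit_wrap = True
S.summit_fine_wrap = False
S.branches_wrap = True
S.branches_fine_wrap = False
```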
In direct summit, summit PO files spring into existence by gathering branch PO files. However, in summit over static templates, by default translators have to start a new PO file by copying over the summit template and initializing it. While dedicated PO editors can do this automatically, all translators in the team have to configure their PO editor correctly (language code, plural forms...), and they have to have templates at hand. Furthermore, any statistics processing on the translation project as a whole has to specifically consider templates as empty PO files.
Instead of this, it is possible to tell posummit to automatically initialize summit PO files from summit templates -- to "vivify" them -- when the language summit is merged. There is a summit configuration switch to enable vivification, as well as several fields to specify the information needed to initialize a PO file. Here is an example:
S.vivify_on_merge = True
S.vivify_w_translator = "Simulacrum"
S.vivify_w_langteam = "Nevernissian <l10n@neverwhere.org>"
S.vivify_w_language = "nn"
S.vivify_w_plurals = "nplurals=7; plural=(n==1 ? ...)"
Setting S.vivify_on_merge
to True
engages vivification. The S.vivify_w_translator
field specifies the value of Last-Translator:
header field in the vivified PO file; it can be something catchy rather than a real translator's name, so that it is possible to see later which summit PO files have not yet been worked on. S.vivify_w_langteam
is the contents of the Language-Team:
header field (team's name and email address), S.vivify_w_language
of Language:
(language code), and S.vivify_w_plurals
of Plural-Forms:
.
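With the example configuration above, the header of a vivified summit PO file would contain, among other fields, roughly the following (a sketch; the remaining header fields come from the summit template, and Language-Team: is the standard Gettext header field name):

```
msgid ""
msgstr ""
"Last-Translator: Simulacrum\n"
"Language-Team: Nevernissian <l10n@neverwhere.org>\n"
"Language: nn\n"
"Plural-Forms: nplurals=7; plural=(n==1 ? ...)\n"
```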
In summit over dynamic templates, vivification is unconditionally active, whether S.vivify_on_merge
is set or not. This is because synchronization of branches and the summit is checked by comparing template trees, and summit PO files are the only indicator of "virtual" presence of summit templates (while in summit over static templates, the summit template tree is physically present). Without vivification, it would also be very hard for project-wide statistics to take templates into account as empty summit PO files.
By default it is assumed that branch PO files are merged with branch templates using a separate mechanism, which was already in place when the summit was introduced into the workflow. In summit over templates modes, if branch merging is performed asynchronously to summit merging, on scatter it may happen that some messages recently added to a branch PO file are not yet present in the corresponding summit PO file. In that case, posummit will issue warnings about messages missing in the summit. This is normally not a problem, because merging asynchronicity will stop causing such differences once the pre-release message freeze in the sources sets in.
However, on the one hand, warnings about messages missing in the summit may be somewhat disconcerting, or aesthetically offensive in the otherwise clean scatter output. On the other hand, perhaps the existing mechanism of merging in branches is not too clean, and it would be nice to replace it with something more thorough. Therefore, in summit over templates modes, it is possible to configure the summit such that on merge, posummit merges not only the summit PO files, but also all branch PO files. This is achieved simply by adding the merge=
key to each branch that should be merged:
S.branches = [
    dict(id="devel", ..., merge=True),
    dict(id="stable", ..., merge=True),
]
When merging in branches is activated, it is still possible to merge only the summit, or any single branch. This is done by giving an operation target on merge, either the path to the branch top directory or the branch name. For example, in summit over dynamic templates:
$ cd $SUMMITDIR
$ posummit merge $DEVELDIR/  # merge only the devel branch
$ posummit merge devel:      # same
$ posummit merge .           # merge only the summit
$ posummit merge +:          # same
PO headers are treated somewhat differently from PO messages in summit operations:
On gather, almost all of the standard header fields of the primary branch PO file are copied into the summit PO file. The primary branch PO file is defined as the first branch PO file (in case of several branch files being mapped onto the same summit PO file) from the first branch (as listed in the branch specification in summit configuration). The only exception is POT-Creation-Date:
, which is set to the time of gathering, if there were any other modifications to the summit PO file. Header comments are not copied over, except when the summit PO file is being automatically created for the first time.
On merge, the summit PO file is merged with the summit PO template using msgmerge, so its header propagation rules apply. For example, no header comments will be touched, POT-Creation-Date:
will be copied over from templates but Last-Translator:
will not be touched, etc. This also means that, by default, any non-standard fields in the template (e.g. those starting with X-*
) will be silently dropped.
On scatter, almost the complete header is copied over from the primary summit PO file into the branch PO file. The primary summit PO file is defined as the first mapped summit PO file, in cases when the single branch PO file has been mapped to several summit PO files. The exception are Report-Msgid-Bugs-To:
and POT-Creation-Date:
, which are preserved as they are in the branch PO file. Also, PO-Revision-Date:
is set to that of the primary summit PO file only if there were any other modifications to the branch PO file (because it may happen that all updates to the summit PO file since the last scatter were for messages from other branches).
It is possible to influence this default header propagation. In particular, non-standard header fields may be added into branch and summit PO files and templates by different tools, and it may be significant to preserve and propagate these fields in some contexts. The following summit configuration fields can be used for that purpose:
S.header_propagate_fields
field can be set to a list of non-standard header field names which should be propagated in gather and merge operations, from branch into summit PO files. For example, to propagate fields named X-Accelerator-Marker:
and X-Sources-Root-URL:
, the following can be added to summit configuration:
S.header_propagate_fields = [
    "X-Accelerator-Marker",
    "X-Sources-Root-URL",
]
Only the primary branch PO file is considered for determining the presence and getting the values of these header fields.
Instead of simply overwriting on scatter most of the branch PO header fields with summit PO header fields, some additional branch fields may be preserved by setting S.header_skip_fields_on_scatter
to the list of header field names to preserve. For example, to preserve X-Scheduled-Release-Date:
field in branch PO files:
S.header_skip_fields_on_scatter = [
    "X-Scheduled-Release-Date",
]
Chapter 6, Ascribing Modifications and Reviews describes a translation review system available in Pology, in which every PO message has its modification and review history kept up to some depth in the past. Based on that history, it is possible to select which messages from working PO files (those under ascription) can be passed into release PO files, provided that these two file trees exist. Summit and branches can be viewed exactly as an instance of such separation, where the summit is the working tree, and each branch a release tree.
In this context, only the summit tree should be kept under ascription. Filtering for release is then, naturally, performed on scatter: to each summit PO message a sequence of one or more ascription selectors is applied, and if the message is selected by the selector sequence, it is passed into the branch PO file. Several selector sequences may be defined, for use in various release situations, through S.ascription_filters
configuration field.
For example, to have a single filtering sequence which simply lets through all messages not modified after the last review, the following should be added to summit configuration:
S.ascription_filters = [
    ("regular", ["nmodar"]),
]
Each filtering sequence is represented by a two-element tuple. The first element is the name of the filtering sequence, here regular
. You can set the name to anything you like; when there is only one filtering sequence, the name is actually nowhere used later. The second element of the tuple is a list of ascription selectors, which are given just as the values to -s
options in poascribe command line. Here only one selector is issued, nmodar
, which is the negation of the modified-after-review selector. This yields the desired filter to pass all messages "not modified after the last review".
A more involved example would be that of having one filter for regular scatter, and another "emergency" filter, which relaxes the strictness a bit in case there was no time to properly review all new translations. This emergency filter may let through unreviewed messages if modified by a select few persons, who are known to produce translations of sufficient quality in the first attempt. If these persons are, for example, alice
and bob
(by their ascription user names), then the two-filter setup could look like this:
S.ascription_filters = [
    ("regular", ["nmodar"]),
    ("emergency", ["nmodar:~alice,bob"]),
]
The regular
filter looks like in the previous example. The emergency
filter also uses just one nmodar
selector, but with additional argument to consider all users except for alice
and bob
. Due to the fact that it is listed first, the regular
filter is applied on scatter by default. Application of the emergency
filter is requested by issuing the -a
/--asc-filter
option with filter name as value:
$ cd $SUMMITDIR
$ posummit scatter -a emergency
When scattering is performed under the ascription filter, messages stopped by the filter will be counted and their number (if non-zero) reported per branch PO file.
Each branch entry in branch specification (S.branches
configuration field) can have some keys in addition to those described earlier.
It is possible to exclude some branch PO files from summit operations, or to include only certain branch PO files into summit operations. This is done by setting excludes=
and includes=
keys. The value is a list of tests on branch PO file absolute path: if any test matches, the file is matched on the whole (logical OR). Each test can be either a regular expression string, or a function taking the file path as argument and returning a truth value. If only excludes=
is set, then all files not matched are operated on, and if includes=
is set, only matched files are operated on. If both keys are set, then only files matched by includes=
and not matched by excludes=
are operated on.
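For example, a branch entry which operates only on PO files under po/ directories, while skipping a hypothetical disabled/ subtree, could combine a regular expression include with a function exclude (a sketch; the paths and names are illustrative):

```python
S.branches = [
    dict(id="devel", ...,
         # operate only on PO files under po/ directories...
         includes=[r"/po/"],
         # ...but skip anything in a disabled/ subtree
         excludes=[lambda path: "/disabled/" in path]),
]
```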
If branches are under version control and posummit is told to issue version control commands as appropriate (i.e. S.branches_version_control
configuration field is set), it is possible to exclude a specific branch from this, by setting its skip_version_control=
key to True
.
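For example, if one branch is a plain directory outside any repository (the branch name here is illustrative):

```python
S.branches = [
    dict(id="devel", ...),
    # This branch is not under version control:
    dict(id="external", ..., skip_version_control=True),
]
```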
As is usual, merging performed by posummit by default produces fuzzy messages; in summit PO files, as well as in branch PO files if merging in branches is enabled. It is possible to prevent fuzzy matching, by setting S.summit_fuzzy_merging
and S.branches_fuzzy_merging
configuration fields to False
. There should be little reason to disable fuzzy matching in summit PO files, but it may be convenient to do so in branch PO files, which are not directly translated. For example, the lack of fuzzy messages will lead to smaller version control deltas.
Fuzzy messages are by default produced by msgmerge alone. This can be more finely tuned by processing the PO file before and after it has been merged, as done by the poselfmerge command. The S.merge_min_adjsim_fuzzy
configuration field can be set to a number in range from 0 to 1, having the same effect on fuzzy matching as the -A
/--min-adjsim-fuzzy
option of poselfmerge. The S.merge_rebase_fuzzy
field can be set to True
, with the same meaning as the -b
/--rebase-fuzzies
option of poselfmerge.
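For example, both fields could be set as follows (the values are illustrative; see the poselfmerge documentation for the meaning of the corresponding options):

```python
# Reject fuzzy matches with adjusted similarity below 0.5 (like -A 0.5):
S.merge_min_adjsim_fuzzy = 0.5
# Rebase fuzzy messages as poselfmerge -b would:
S.merge_rebase_fuzzy = True
```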
Summit PO files may be merged by consulting a compendium, to produce additional exact and fuzzy matches. This possibility also draws on the functionality provided by poselfmerge. The S.compendium_on_merge
configuration field is used to set the path to a compendium[23], equivalently to the -C
/--compendium
option of poselfmerge. Since compendium matches are less likely to be appropriate than own matches, you may set the S.compendium_fuzzy_exact
field to True
, or the S.compendium_min_words_exact
field to a positive integer number, with the same effect as -x
/--fuzzy-exact
and -W
/--min-words-exact
options of poselfmerge, respectively.
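A compendium setup could then look like this (a sketch; the compendium path and the word count threshold are illustrative):

```python
# Path relative to the summit configuration file directory:
S.compendium_on_merge = S.relpath("compendium.po")
# Treat exact compendium matches as fuzzy (like -x/--fuzzy-exact)...
S.compendium_fuzzy_exact = True
# ...or accept them as exact only for messages of at least 5 words
# (like -W/--min-words-exact):
S.compendium_min_words_exact = 5
```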
Sometimes a summit PO file may be "pristine", meaning that all messages in it are empty, neither translated nor fuzzy. Pristine summit PO files may appear, for example, when vivification is active. A pristine summit PO file will by default cause a likewise empty branch PO file to appear on scatter. This may or may not be a problem in a given project. If it is a problem, it is possible to set the minimal translation completeness of a summit PO file at which the branch PO file will be created on scatter. For example:
S.scatter_min_completeness = 0.8
sets the minimum completeness to scatter at 80%. Completeness is taken to be the ratio of the number of translated to all messages in the file (excluding obsolete).
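The completeness ratio as described can be sketched in plain Python (an illustration of the computation, not Pology's own code):

```python
def completeness (num_translated, num_nonobsolete):
    # Ratio of translated messages to all non-obsolete messages.
    if num_nonobsolete == 0:
        return 1.0  # assumed convention for an empty file
    return float(num_translated) / num_nonobsolete

# With S.scatter_min_completeness = 0.8, a summit PO file with
# 40 of its 50 non-obsolete messages translated is just complete
# enough for its branch counterpart to be created on scatter.
```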
Translation completeness of a summit PO file may deteriorate over time, as it is periodically gathered or merged, and no one comes around to update the translation. At some point, the completeness may become too low to be considered useful, so that it is better to stop releasing remaining translations in that file until it is updated. The completeness at which this happens, at which the branch PO file is automatically cleared of all translations on scatter, can be set through S.scatter_acc_completeness
configuration field. The meaning of the value of this field is the same as for S.scatter_min_completeness
; in fact, one might ask why not simply use S.scatter_min_completeness
for this purpose as well. The reason is that sometimes a higher bar is put for the initial release, and having two separate configuration fields enables you to make this difference.
Although hopefully shadowed by the advantages, working in summit is not without some disadvantages. These should be weighed in when deciding on whether to try out the summit workflow.
In summit over template modes, any changes made manually in branch PO files will not propagate into the summit, and will soon be lost to scattering. This means that the whole translation team must work in the summit. It is not possible for some members to use the summit, and some not. In direct summit mode, modifying branches directly would be even messier, as some changes would find their way into the summit and some not, depending on which branch contains the change and the order of gather and scatter operations.
A summit PO file will necessarily have more messages than any of its branch counterparts. For example, in two successive development-stable branch cyclings within the KDE translation project (at the time about 1100 PO files with 750,000 words), summit PO files were on average 5% bigger (by number of words) than their branch counterparts. This percentage can be taken as the upper limit of possibly wasted translation effort due to messages in the development branch coming and going, given that as the next branch cycling approaches more and more messages become fixed and make it into the next stable branch.
A more pressing issue with the increased size of summit PO files is the following scenario: the next stable release is around the corner, and the translation team has no time to update summit PO files fully, but could update only stable messages in them. For example, there are 1000 untranslated and fuzzy messages in the summit, out of which only 50 are coming from the stable branch. A clever dedicated PO editor could allow jumping only through untranslated and fuzzy messages which additionally satisfy a general search criterion, in this case that a comment matches the \+>.*stable
regular expression (assuming the stable branch is named stable
in summit configuration). Lacking such a feature, with some external help it is enough if the editor can merely search through comments. First, Pology's posieve command can be used to equip all untranslated and fuzzy stable messages in summit PO files with an untranslated
flag (producing #, ..., untranslated
comment):
$ posieve tag-untranslated -sbranch:stable -swfuzzy paths...
Then, in the PO editor you can jump through incomplete stable messages by simply searching for this flag. While doing that, you are not obligated to manually remove the flag: it will either automatically disappear on the next merge, or you can remove all flags afterwards by running:
$ posieve tag-untranslated -sstrip paths...
There are some organizational issues with starting to use the summit and, if it turns out counterproductive, with stopping. Team members first have to be reminded not to send in or commit branch PO files, and then, if the summit is disbanded, to be reminded to go back to branch PO files. On the plus side, disbanding the summit is technically simple: removing its top directory and summit configuration file will do the job.
[14] One may think of relying upon the translation memory: translate only PO files from one branch, and batch-apply the translation memory to PO files of other branches, accepting only exact matches. This is dangerous, because short messages may need different translations in different PO files, resulting in hilarious mistranslations.
[15] Unfortunately, the following common organization cannot be automatically supported:
path/to/devel/
    appfoo/
        src/
        doc/
        po/
            aa.po
            bb.po
            ...
            # no template!
        ...
    appbar/
    ...
The problem is that there is no way to determine domain names from the file tree alone, and that different handling would be required for sources which actually have multiple PO domains.
[16] New translations do not have to appear in branches only by mistake. For example, some external sources, which have been translated elsewhere, may be integrated into the project.
[17] More precisely, if there are two same-name PO domains inside one branch, they will both be gathered into the same summit PO file. The assumption is that PO files with same domain names have mostly common messages.
[18] It can be changed by assigning another string to S.templates_lang
.
[19] One could also skip this and allow immediate loss of translations, and rely on the translation memory when later translating new PO files. But, especially in centralized summit maintenance, it is better to make things right early. Also, translation memory matches may not be as reliable, since they come not only from the original PO file, but from all PO files in the project.
[20] Another possibility are validation filters, which do not modify the text but report possible problems, though validation rules and the check-rules sieve are likely a better solution.
[21] resolve_ui
is not the hook function itself, but a hook factory. It is called with the argument uicpathenv="UI_PO_DIR"
to produce the actual hook function. See its documentation for details.
[22] This pattern makes use of a negative lookbehind token, a fairly advanced bit of regular expression syntax.
[23] Here you can also use the S.relpath()
function, to have the compendium path be relative to the directory of the summit configuration file.
It may not be obvious, especially to new translators, to what extent the translation needs to be reviewed. If the translator has exercised due diligence, how "wrong" can the translation be? Even if the translator has a good command of the source language -- typically English in the context of PO files -- the answer is "very wrong", all aspects considered.
Given the comparatively simple grammar of English, the meaning of a short English sentence (as typically encountered in program user interfaces) may vary greatly depending on the surrounding context. This context may not be obvious when the translator is going through isolated messages in the PO file, so the translator may commit the worst of errors from the reader's viewpoint: the senseless translation. An experienced reviewer will have developed a sense for troublesome contexts, and will have at their disposal several means to conclusively determine the context (including, for example, running the development version of the program).
Even if the context is correctly established, the translator may use "wrong" terminology, which is the next worst thing for the reader. A term used in translation does not need to be wrong by itself; in fact it may be exactly the correct term -- in another translation project. The reviewer will have more experience with the terminology of the present project, and be able to bring the translation in line with it.
Style in the technical sense is a consistent choice between several perfectly valid constructs in target language when applied to text in the given technical context. For example, how to translate menu titles and items, button labels, or tooltips in user interface. Choices may include noun or verb forms, particular grammar categories, tone of address, and so on. There may be a style guide to the project which details such choices, and the reviewer will know it well.
Style in the linguistic sense is especially applicable to longer texts, such as tooltips in user interfaces or paragraphs in documentation. A typical error of a new translator is to adhere closely to English style and grammar. This may produce a translation which is semantically and grammatically valid in the target language, but very much out of style. The reviewer then steps in to naturalize such constructs.
Finally, while the reviewer may be an experienced translator, that does not mean that his own translations need no review. Immersion in the source language, distraction, and fatigue will lead the reviewer into any of the above errors in translation, only with lesser frequency. This means that reviewers should also review each other's translations.
This calls for a systematic approach to review in translation workflow.
The classical review workflow, by stages, seems simple enough. A translator translates a new PO file or updates an existing translation, and declares it ready for review. A reviewer reviews it, and declares it ready to commit into the pool from which PO files are periodically released. A committer finally commits the file. The process is iterative: the reviewer may return the file to the translator, who later again declares it ready for review. There may be several stages of review (such as proof-reading, approving), each of which may return the translation to a previous stage, or forward it to some special stage. The process may also be implemented on the subfile level, where each PO message can go through the stages separately.
Regardless of the technical details, review workflows of this kind all have the following in common. Members of the translation team are assigned roles (such as translator, reviewer, committer) by which they step into the workflow. A single person can hold more than one role. Later review stages must wait for the earlier stages to complete, and the translation cannot be updated again before the current version clears the review pipeline (or the pipeline is aborted). Once the translation is committed, it becomes a part of simply "admitted" translations, with no further qualifiers.
The system of prescribed roles requires that team members assign the roles among themselves, stick to them, and shuffle them along the way. The prescribed review pipeline requires a tool to keep track of the current review stage of the translation. This makes the review workflow rigid, with probable bottlenecks. The distribution of roles may become unbalanced by people coming into and leaving the team, or the tracking tool may be prohibitive to some scenarios (e.g. a single translator making small adjustments in dozens of files across the project, but having to upload each manually through a web interface).
"Rigid" and "inefficient" are comparative qualifications, so what is it that review by stages can be compared to in this way?
Review by ascriptions is even simpler conceptually, and yet less rigid and more efficient than review by stages. It necessarily works on the PO message level, rather than the PO file level. Anyone can simply translate some PO messages and directly commit modified PO files, without any review, but ascribing modifications to own name. Anyone can review any PO message at any moment, commit modifications made during the review, and ascribe the review to own name (and possibly to a certain class -- review of context, of terminology, of style, etc.). When the time comes to release the translation, insufficiently reviewed messages are automatically omitted, by evaluating the ascription history of each message.
Based on the ascription history, the reviewer can select a subset of PO messages, and review only the difference between their historical and current versions. For example, Alice can select to review only messages modified since she or Bob had last reviewed them for style. She could see the difference from that last review to the current version, e.g. if in a whole paragraph only a single word was changed by Charlie when he reviewed the terminology. Ascription history also propagates through merging of PO files with templates, so the reviewer can compare the change in the original to the change in the translation since the last review and judge if one fits the other.
Since everyone just commits, translations can be efficiently kept in a version control repository, with the ascription system added on top. After having done some translating, the translator simply substitutes the commit command of the version control system (VCS) with the "ascribe modifications" command of the ascription system (AS), which calls the underlying VCS internally. After reviewing, the reviewer uses the "ascribe reviews" command of the AS to commit reviews to the ascription history (as well as any modifications made during the review). To select messages for review, the reviewer issues the "diff for review" command of the AS, with suitable parameters to narrow down the message set; selected messages are marked in-place in the PO files and equipped with embedded differences, and possibly directly opened in a PO editor.
When the translations are to be released, the release person issues "filter for release" command of the AS, which takes the working PO files and creates final PO files, in which the insufficiently reviewed messages are removed. Here "release time" can be understood figuratively: since filtering for release should be a fully automatic process, it can be performed at any interval of convenience.
What constitutes "sufficient review" can be defined in fine detail. It could be specified that messages modified by Alice need only a review for terminology, but not necessarily for style; Charlie may belong to a group which needs to be reviewed for style, but not necessarily for context; Bob's reviews for style may be nice to have, but never block the release if missing. These decisions do not preclude released messages from being reviewed later on the missing points, after higher-priority reviews have been completed. The definition of sufficiency may be changed at any point, e.g. as team members become more experienced and require less review, without interfering with direct translation and review activities.
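To make the idea concrete, the sufficiency rules described above could be sketched in Python. This is a hypothetical model: the policy table, function name, and review tags are all illustrative, not part of Pology.

```python
# Hypothetical sketch of a release-time review sufficiency check.
# Which review tags must be present on a message (after a given user's
# modification) for the message to be releasable.
REQUIRED_REVIEWS = {
    "alice": {"terminology"},          # style review not required
    "charlie": {"style"},              # context review not required
    None: {"terminology", "style"},    # default for everyone else
}

def sufficiently_reviewed(last_modifier, review_tags_since_mod):
    """Check whether reviews recorded after the last modification
    cover everything required for this modifier."""
    required = REQUIRED_REVIEWS.get(last_modifier, REQUIRED_REVIEWS[None])
    return required <= set(review_tags_since_mod)

print(sufficiently_reviewed("alice", ["terminology"]))   # → True
print(sufficiently_reviewed("charlie", []))              # → False
```

Changing the definition of sufficiency then amounts to editing the policy table, without touching translation or review activities.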
In summary, an AS preserves the operational efficiency of a VCS, while at the same time providing great flexibility of review. All team members can be given commit access; no web or email detours are needed. There are no prescribed roles, but a functional equivalent of role assignment happens at the last possible moment (release time), and can take into account both translators' and reviewers' abilities, and changing estimates of those over time. There is no staging between completing and committing the translation, which enables a translator to keep polishing the translation undisturbed until a reviewer comes around. There are no bottlenecks when performing small changes in many files, since a single AS command commits all changes just as a single VCS command would. On commit operations, the AS can also apply various checks (e.g. decline to commit syntactically invalid PO files) and modifications (e.g. update the translator's data in the PO header).
Pology provides an ascription system in the form of the poascribe command.
Let the organization of PO files for the language nn in the version control repository be as follows:

    l10n-nn/
        po/
            ui/
                alpha.po
                bravo.po
                ...
            doc/
                alpha.po
                bravo.po
                ...
            ...
Having PO files grouped by language can be taken as a hard prerequisite[24]. Also necessary is a single top subdirectory for the whole PO file tree (here po/
), rather than having several PO subdirectories directly in the language directory.
Setting up ascription is now simple. Create the ascription configuration file named exactly ascription-config
(poascribe expects this name), on the same level as the top PO directory:
    l10n-nn/
        ascription-config
        po/
            ui/
                ...
            doc/
                ...
and set in it a few global configuration fields, and data for each known translator:
    # ---------------------------
    # Global ascription settings.
    [global]
    # Roots of the original and ascription trees.
    catalog-root = po
    ascript-root = po-ascript
    # The underlying version control system.
    version-control = svn
    # Data for updating PO headers on request.
    language = nn
    language-team = Nevernissian
    team-email = l10n-nn@neverwhere.org
    # Default commit message.
    commit-message = Translation updates.

    # -----------------------
    # Registered translators.
    [user-alice]
    name = Alice Akmalryn
    original-name = Алиса Акмалрин
    email = alice.akmalryn@someplacenice.org

    [user-bob]
    name = Bob Bromkin
    original-name = Бобан Бромкин
    email = bob.byomkin@otherplacenice.org

    # ...and so on.
The configuration fields used in this example, and other possible configuration fields, are listed and described below.
Global settings in the configuration file:
catalog-root
The path to top PO subdirectory. This should be a relative path, and relative to the location of the configuration file.
ascript-root
Relative path to the top directory of the ascription file tree, which will be created and updated by poascribe.
version-control
The underlying version control system of the repository. The value is a keyword, see Section 9.7.2, “Version Control Systems” for a list of VCS supported by Pology.
language
, language-team
, team-email
, plural-header
These fields provide information about the language and the translation team, which poascribe uses to update header fields in modified PO files. language
is the language code, while language-team
is usually just the human-readable language name in English. plural-header
is the exact contents of Plural-Forms:
PO header field (if it contains a %
character, you need to escape it as %%
). For any of these fields that is not set, poascribe will remove the corresponding header field when updating the PO header.
title
The first comment line in the PO header, set when poascribe updates the header. It can contain the following placeholders for inserting file-dependent information: %basename
is the base PO file name (e.g. alpha.po
), %poname
the PO domain name (e.g. alpha
), %langname
the human-readable language name (supplied by the language-team
field), and %langcode
the language code (supplied by the language
field). Note that these placeholders must actually be written with a doubled percent sign (e.g. %%basename), to escape the special meaning of the single % character. If the title field is not set, poascribe will leave the title comment as it is in the PO file.
commit-message
The default commit message for the underlying VCS, when poascribe calls upon it to commit modified PO files. If this field is not set, an editor window will pop up to input the commit message, or the -m
/--message
option can be used to set the message through the command line. If the field is set, -m
can still be used to override the default commit message.
review-tags
The set of accepted review tags, given as whitespace-separated list of tags. If set, poascribe will abort when trying to use an unknown tag, otherwise it will accept any tag.
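As an illustration of the title placeholder expansion described above, here is a minimal Python sketch; it is not Pology's actual implementation, and the function name is made up.

```python
import os

def expand_title(template, po_path, langname, langcode):
    # Illustrative sketch of the placeholder expansion described above;
    # not Pology's actual code. Assumes the template has already been
    # read from the configuration (i.e. %% unescaped to single %).
    basename = os.path.basename(po_path)    # e.g. "alpha.po"
    poname, _ = os.path.splitext(basename)  # e.g. "alpha"
    for key, val in (("%basename", basename), ("%poname", poname),
                     ("%langname", langname), ("%langcode", langcode)):
        template = template.replace(key, val)
    return template

print(expand_title("Translation of %poname into %langname.",
                   "po/ui/alpha.po", "Nevernissian", "nn"))
# → Translation of alpha into Nevernissian.
```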
Each known translator is represented by a [user-name] configuration section, where name is the translator's user name in the ascription system. This user name has no direct relation to the underlying VCS account name (if the VCS uses them), but it makes sense for them to be equal. This also means that a translator does not even need to have a VCS account (repository commit access), though this is expected for the sake of efficiency. Translator configuration sections can contain the following fields:
name
Translator's name, in the form supposed to be readable in English. This means that if the name is not originally written in Latin script, some romanized form should be given.
original-name
Translator's name in its original form, if it differs from the romanized form given by the name
field.
email
The email address at which the translator may be contacted.
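Since ascription-config follows INI syntax, it can be read with standard tools. Here is a minimal Python sketch using configparser, with field names taken from the example above; this only illustrates the file format, not how poascribe itself loads the configuration.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""\
[global]
catalog-root = po
ascript-root = po-ascript
version-control = svn
language = nn

[user-alice]
name = Alice Akmalryn
email = alice.akmalryn@someplacenice.org
""")

# Translator sections are those named "user-<username>".
users = {sect[len("user-"):]: dict(config[sect])
         for sect in config.sections() if sect.startswith("user-")}

print(config["global"]["catalog-root"])   # → po
print(users["alice"]["name"])             # → Alice Akmalryn
```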
As soon as ascription-config
file is committed to the repository, the ascription system through poascribe is ready for use. The only expected regular modification to the configuration file is the addition of new translators. On the other hand, translators should never be removed, because even after they leave, their ascription records remain in the system.
The most common situation at the start of ascription workflow is that there already exists a considerable amount of translations, contributed by many different people over time. These existing translations should be ascribed as initial modifications -- but ascribed to whom? If it is not precisely known who translated what, the solution is to introduce a generic user in the configuration file, appropriately named "Unknown Hero" (or "Lost Translator", you can be inventive):
    [user-uhero]
    name = Unknown Hero
    original-name = Незнани јунак
You should then ascribe all existing translations as modified and reviewed by this dummy translator:
    $ cd $LANGDIR
    $ poascribe commit -u uhero --all-reviewed -C po/
The commit
argument is the ascription mode, and the -u
option provides the user name to which ascriptions are made. This is an important point: ascriptions are made to a user defined in ascription configuration, and have nothing to do with VCS itself. It is the --all-reviewed
option that declares all messages to be reviewed as well (this option is normally used only this once, and not in day to day operation). The -C
option prevents automatic VCS adding and committing, which is useful for this initial step.
When this command line is executed, a progress bar will appear and the following output will start to unfold:
    doc/alpha.po (43/43)
    doc/bravo.po (81/81)
    ...
    ui/alpha.po (582/582)
    ui/bravo.po (931/931)
    ...
    =====
    Ascription summary:
    -            modified  reviewed
    translated      11775     11775
    fuzzy            2943      2943
    obsolete/t        365       365
    obsolete/f         26        26
The numbers in parentheses indicate how many messages have been ascribed in the given PO file (modified/reviewed), and at the end the totals are given.
If, on the contrary, it is known who translated and reviewed what up to that point, ascription can be performed piece-wise with user names of real translators:
    $ cd $LANGDIR
    $ poascribe commit -u alice --all-reviewed -C po/ui/
    $ poascribe commit -u bob --all-reviewed -C po/doc/
    $ ...
After the initial ascription has been made, the ascription file tree will appear next to the original file tree. There will be one ascription PO file for each summit PO file, with the same name and relative location within the tree:
    l10n-nn/
        po/
            ui/
                alpha.po
                bravo.po
                ...
            ...
        po-ascript/
            ui/
                alpha.po
                bravo.po
                ...
            ...
Ascription PO files are used by poascribe to store the ascription history, rather than e.g. a database of some sort. This is a disadvantage in performance, but an advantage in simplicity and robustness. For example, ascription files are put under version control as well.
poascribe may also modify original PO files during this run, by removing any previous field comments (#| ...
) on translated messages. These comments are sometimes erroneously left in when the PO file is translated with an older or less capable PO editor, and leaving them would result in unnecessary additions to ascription PO files.
The newly created ascription tree, any modifications to the original tree, and the ascription configuration file, can now be committed as usual. With Subversion as the VCS:
    $ cd $LANGDIR
    $ svn add ascription-config po-ascript
    $ svn commit ascription-config po po-ascript -m "Initial ascription."
While this is generally a good idea anyway, with ascription in place translators must always update the complete language directory through the VCS, rather than just one particular PO file or subdirectory, so that the original and the ascription PO file trees are kept in sync.[25]
In order not to have to report their user name to poascribe all the time (by the -u
option), translators can set it in Pology user configuration, the [poascribe]
section:
[poascribe] user = alice
With this in place, translators can submit updated PO files simply by substituting VCS commit command with poascribe commit
(or shortened: co
or ci
). With Subversion, this would look like:
    $ cd $LANGDIR
    $ poascribe co po/ui/*alpha*.po
    po/ui/alpha.po (44)
    po/ui/libalpha.po (15)
    =====
    Ascription summary:
    -            modified
    translated        169
    >>>>> VCS is committing catalogs:
    Sending        ui/alpha.po
    Sending        ui/libalpha.po
    Sending        summit-ascript/messages/kdefoo/fooapp.po
    Sending        LANG/summit-ascript/messages/kdefoo/libfooapp.po
    Transmitting file data ....
    Committed revision 1267069.
    $
The lines after >>>>> VCS... are produced by the underlying VCS, which is Subversion in this example.[26]
As can be seen from the example output, poascribe will add ascription records into the ascription PO files corresponding to the original PO files, and commit them all. Like a VCS command, poascribe co can take any number of PO file or directory paths. For a directory path, only the files in it with the .po extension will be processed, and any others ignored. poascribe can be run from any working directory with appropriate paths as arguments, and it will always find the associated ascription configuration and files. If a default commit message has not been set in the ascription configuration, poascribe will ask for it; or it can be given on the command line through the -m
option.
With the ascription system in place, every regular translator should have commit access to the repository. But there may be some period of time before new translators are given commit access, revision control may be too technical for some, and even those who have access may be temporarily unable to commit for some reason.
These translators may send in their work by email or any other informal channel, to any member of the team who does have commit access. This team member can then commit the received files without any review, since review can be conducted at any later time. If Bob sends some files to Alice, she can commit them immediately by stating Bob's user name:
$ poascribe co -u bob files...
For this to work, the translator who sent in the files has to be defined in the ascription configuration. There are no hidden costs or security issues to this (as opposed to giving VCS commit access), so every new translator should be defined there before any work of that person is committed.
An ascription system opens up all sorts of possibilities for review patterns. Reviewers should keep in mind that for each message the full modification and review history is available, so that the translation team can think about how to make good use of it. What follows are some examples to illustrate the review functionality provided by poascribe.
At the very basic level (which is the only level in review by stages), messages can be classified as simply unreviewed or reviewed. Alice now wants to review all unreviewed messages in a subset of PO files, say the ui/
subdirectory. She issues the following command (di
is short for diff
):
$ poascribe di po/ui/ po/ui/alpha.po (2) po/ui/foxtrot.po (7) po/ui/november.po (12) ===== Diffed for review: 21 $
With this, all unreviewed messages in the listed PO files have been marked and diffed. If these PO files had already been reviewed before, some of the messages modified since then (those now marked for review) may have changed very little. For example, a few changed words in a paragraph-length message, or even just some punctuation. Therefore, for each message marked for review, Alice also wants to see the difference from the last review to the current version. Here are two messages with typical review elements added by poascribe di
:
    #. ~ascto: charlie:m
    #: gui/mainwindow.cc:372
    #, ediff
    msgid "GAME OVER. {-You won-}{+Tie+}!"
    msgstr "KRAJ IGRE. {-Pobeda-}{+Nerešeno+}!"

    #. ~ascto: bob:m charlie:m
    #: game-state.cpp:117
    #, ediff-total
    msgid "Click the pause button again to resume the game."
    msgstr "Kliknite ponovo na dugme pauze da nastavite igru."
In the first message, the first thing to note is the #. ~ascto:
comment. This comment succinctly lists who did what with the message since the last review; here charlie:m
means that Charlie is the one who modified it. Then there is the ediff
flag, which Alice can search for in the editor to jump through messages marked for review. Finally, the original and translation have been diffed; here they show that, since the last review, the message was fuzzied by changing "You won" to "Tie", and what Charlie did in translation to unfuzzy it. Even on a message as short as this, the difference tells something useful to Alice: the phrase "Game over" likely has a formulaic translation, and the fact that it is not part of the difference means that the earlier reviewer had made sure it is consistent, so Alice does not have to check that.
The #. ~ascto:
comment of the second message reveals that both Charlie and Bob had been modifying it. The ediff-total
flag instead of plain ediff
means that this message had no reviews until now, so there are no embedded differences in text fields.
Alice can now go through marked messages in listed PO files, review translations, and possibly make modifications. When making changes in a message with embedded differences, she can freely edit the text outside of difference segments and within {+...+}
segments (as these are the ones which belong to current version of the text). While reviewing, Alice does not remove any of the added message elements (except for an occasional difference segment, if she modifies a translation), as these elements are needed for a subsequent invocation of poascribe. If a message is particularly hard to translate and Alice wants to defer reviewing it for some later time, she can add to it the unreviewed
flag (or nrev
for short).
Once the review is complete, Alice simply commits the reviewed files:
    $ poascribe co po/ui/
    po/ui/alpha.po (0/2)
    po/ui/foxtrot.po (0/7)
    po/ui/november.po (3/12)
    =====
    Ascription summary:
    -            modified  reviewed
    translated          3        21
    >>>>> VCS is committing catalogs:
    Sending        po/ui/november.po
    Sending        po-ascript/ui/alpha.po
    Sending        po-ascript/ui/foxtrot.po
    Sending        po-ascript/ui/november.po
    Transmitting file data ....
    Committed revision 1284220.
    $
Three things have happened here. First, all review states (flags, embedded differences, etc.) have been removed, restoring diffed PO files to original state. Then, any modifications that Alice has made during review are ascribed to her (here 3 out of 21 messages). Finally, all marked messages are ascribed as reviewed by Alice (any with unreviewed
or nrev
flags would have been omitted here). When committing, the only original PO file that got committed is the one with modifications made during review, and all the ascription PO files were committed because of the reviews recorded in them.
When many PO files with few changes per file should be reviewed, it becomes burdensome to manually open each and every diffed file for review, and then to make sure that all are committed with poascribe co. To make this easier, -w toreview.out
option can be added to the poascribe di
command line, which requests that paths of all diffed PO files be written into toreview.out
file. This file can then be used to batch-open diffed PO files in an editor, as well as to commit them later by adding -f toreview.out
to poascribe co
. There is also -o
option, which tells poascribe to directly open PO files in one of the supported PO editors (see Section 9.7.1, “PO Editors”). Putting it together, to efficiently review a whole bunch of small changes throughout many PO files, with Lokalize as the PO editor, you can execute:
$ poascribe di paths... -w toreview.out -o lokalize
$ # ...only marked messages opened in Lokalize, review them...
$ poascribe co -f toreview.out
If for whatever reason you want to simply remove the review elements from messages without committing the PO files (effectively discarding the review), you can use the purge
mode (short pu
) of poascribe:
$ poascribe pu paths...
If -k
/--keep-flags
option is added to this command line, the flags which mark the messages as reviewed get preserved; more precisely, every ediff*
flag is replaced with reviewed
flag, and every unreviewed
flag is left in, so that subsequent invocation of poascribe co
can record reviews. You will want this limited purging if you have some automatic validation tools to run before committing, and these tools would be thrown off by review elements (most likely by embedded differences).
Invocations of poascribe di without any options, as in the previous section, are equivalent to this:
$ poascribe di -s modar paths...
The -s
option serves to issue a message selector. modar
is the default selector for the diff
operation mode, and stands for "MODified-After-Review": it selects the earliest historical modification of the message after its last review (or its earliest modification ever, if the message has not been reviewed yet), if there is any such modification. By selecting a historical modification of the message, the difference from it to the current version can be computed and embedded into the PO file, as seen in earlier examples.
There are various specialized selectors, and they fall into two groups: shallow selectors and history selectors. Shallow selectors look only into the current version of the message, and cannot select historical versions, which means that they cannot provide embedded differences. History selectors (modar
is of this type) can select messages from history and provide differences. Several selectors can be issued on the command line, and a message is selected only if all selectors select it (boolean AND-linking). Shallow selectors are thus normally used as pre-filters for history selectors. For example, to select messages modified after the last review, but only those found in the stable branch, the branch
and modar
selectors are chained like this:
$ poascribe di -s branch:stable -s modar paths...
It is important that the history selector is given last, because the last selector determines which historical message is selected for diffing. If the ordering had been reversed in this example, the same messages would get selected, but they would not have embedded differences, because branch
is a shallow selector.
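The AND-linking and the special role of the last history selector can be modeled abstractly. Everything in this Python sketch (function names, history representation) is hypothetical, and only illustrates the selection logic described above; it is not Pology's API.

```python
# Hypothetical model of AND-linked selector chaining. A shallow
# selector returns True/False; a history selector returns the index
# of a historical message, or None.

def chain_select(message, history, selectors):
    """A message is selected only if every selector matches. If the
    last matching result is a history index, it becomes the version
    to diff against; otherwise there is nothing to diff."""
    diff_base = None
    for sel in selectors:
        res = sel(message, history)
        if res is False or res is None:
            return None                    # rejected by one selector
        diff_base = None if res is True else res
    return ("selected", diff_base)

def in_branch(branch):
    # Shallow selector: looks only at the current message.
    return lambda msg, hist: msg["branch"] == branch

def modified_after_review(msg, hist):
    # History selector (history listed newest first): index of the
    # earliest modification made after the last review, if any.
    for i, rec in enumerate(hist):
        if rec == "review":
            return i - 1 if i > 0 else None
    return len(hist) - 1 if hist else None

hist = ["modification", "modification", "review"]
print(chain_select({"branch": "stable"}, hist,
                   [in_branch("stable"), modified_after_review]))
# → ('selected', 1)
```

With the order reversed, the last result would be the shallow selector's plain True, so the message would still be selected but with nothing to diff against, matching the behavior described above.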
Selectors can take parameters themselves, like branch:stable
in the previous example. Parameters are separated from the selector name by any non-alphanumeric character; this is colon by convention, but if a parameter contains a colon, something else, like slash, tilde, etc. can be used. Number of parameters can vary, and modar
in particular can take from none to three. If Alice wants to review only those messages modified by Charlie since the last review, she states this by first argument to modar
:
$ poascribe di -s modar:charlie paths...
If Alice does not give much credit to other reviewers, she can request selection of messages modified after her own last review with second parameter to modar
:
$ poascribe di -s modar::alice paths...
Here the first parameter ("modified by..."), which is not needed, must be explicitly skipped, before proceeding to the second parameter ("reviewed by..."). (The third optional parameter to modar
will be demonstrated later on.)
When a selector parameter is a user name, normally it can also be a comma-separated list of user names (modar:bob,charlie
) or prefixed with tilde to negate, i.e. to select all users other than those listed (modar:~alice
).
Any selector can be negated by prepending n
to its name. For example, the history selector modafter:date
selects the first modification after the given date; to select messages modified after the last review, but only if modified during June 2010:
$ poascribe di -s modafter:2010-06 -s nmodafter:2010-07 -s modar paths...
Negating a history selector produces a shallow selector: while modafter
is a history selector, nmodafter
is shallow. But the mutual ordering of the two in this example is not important, since the last selector in the chain is the usual modar
.
Selectors can be issued in other modes too. If the PO file is big, and Alice has reviewed messages up to and including the message with entry number 246 when she has to pause until another day, she can commit reviews only up to this entry by issuing the espan
selector:
$ poascribe co -s espan::246 paths...
The first parameter to espan
, here omitted, would be the entry number of the first message to select, in case messages should not be selected starting from the first in the file. There is also the counterpart lspan
selector, which works with referent line numbers (those of msgid
keywords) instead of entry numbers.
If you do not want to immediately diff for review, but to see first how many messages would be selected by the selector chain that you assembled, you can use the status
operation mode (st
for short) instead of diff
. It takes selectors in the same way as diff
, and shows counts of selected messages by category. You can also add the -b
option to have counts reported by PO file (where non-zero).
You may also want to observe the complete recorded ascription history of a message, all its modifications and reviews, with differences between each two modifications. For this you can use the history
operation mode (hi
for short), typically with one of l
or e
selectors to single out a particular message. The history will be written out to terminal, starting from the newest to the oldest version of the message, with highlighted embedded differences.
In the introduction of this chapter, several distinct things that can go wrong in translation were described. Not all reviewers may be able to check translation against all those problems. Here is a typical scenario of this kind:
Alice is computer-savvy and knows the translation project inside and out, which means that she can review well for context, terminology, and technical style. But, her language style leaves something to be desired, which shows in longer sentences and passages. Dan, on the other hand, is a very literary person, but not that much into the technical aspects. Dan's style reviews would thus be a perfect complement to Alice's general reviews.
poascribe can support this scenario in the following way. A review type tag lstyle
for language style is defined in the ascription configuration, using the review-tags
field:
[global] # ... review-tags = lstyle
With this addition to configuration, Alice can continue to review as she did before, without any changes in her workflow.
Dan selects messages for review similarly to Alice, but additionally giving the lstyle
tag as the third parameter of modar
, and indicating that reviews should be tagged as lstyle
using the -t
option:
$ poascribe di -s modar:::lstyle -t lstyle paths...
After finishing the review, Dan commits as usual:
$ poascribe co paths...
If Dan is always going to review the language style, in order not to have to issue the selector and the tag in command line all the time, he can make them default for the diff
mode in Pology user configuration:
[poascribe] user = dan selectors/diff = modar:::lstyle tags/diff = lstyle
With this, Dan can use plain poascribe di
just like Alice does.
The important point of review tags is that they make reviews of different types independent. For example, Dan may come around to review the language style of a given message after several modifications and general reviews have been ascribed to it -- modar:::lstyle
will simply ignore all reviews other than lstyle
reviews. This is going to be reflected in the ~ascto:
comment of diffed messages:
    #...
    #. ~ascto: charlie:m alice:r bob:m
    #...
    msgid "..."
    msgstr "..."
Here Alice has made one review between Charlie's and Bob's modifications, and that review, being general instead of lstyle
, did not cause modar
to stop at it. After Dan reviews this message for language style, Alice runs selection for review and gets this:
    #...
    #. ~ascto: bob:m dan:r(lstyle)
    #...
    msgid "..."
    msgstr "..."
Again, since lstyle
reviews do not mix with general reviews[27], Dan's review did not hide Bob's modification, which Alice has not checked yet.
After the ascription system is set up, there should be very little to do to maintain it. The details depend on the established translation workflow, and this section describes some of the procedures which may apply.
If PO files are periodically merged with templates in a centralized manner, by one designated person or repository automation, these modifications must also be ascribed. This is done as any other ascription, by substituting the VCS commit command with poascribe co
. For example:
$ svn commit $LANGDIR -m "Everything merged."
may be substituted with:
$ poascribe commit $LANGDIR -m "Everything merged."
Since the user is not explicitly given by the -u
option, this will ascribe modifications due to merging to the person set in Pology user configuration on the system where the command is executed. This is just fine. It is also possible to define a dummy user to which modifications due to merging are ascribed, though there is no known advantage to that at present.
Note that you can issue the -C
option to prevent poascribe from automatically committing merged files, in case there are some automatic post-merge operations that you would like to perform on the merged PO files beforehand. Afterward, a standalone VCS commit command can be issued, but do not forget to include the ascription file tree in it as well.
Sometimes PO files are "shuffled" in the repository: renamed, moved to another subdirectory, etc. Such shuffling should be exactly mirrored in the ascription tree:
If a PO file is moved or renamed, its counterpart ascription PO file should also be moved or renamed in the same way within the ascription tree.
If a PO file is split into two, then it depends on how you handle the splitting. A good way would be to copy the old PO file to two new names, and then merge them with new templates. In this way as much of existing translation as possible will be preserved. If this is done, then the ascription PO files should be copied to new names, but then there is nothing to merge them with. This is just right, since message ascription histories generally interleave across the split (but also see Section 6.5.4, “Trimming Ascription History”).
If two PO files are merged into one, you should probably handle that by using msgcat to properly concatenate them into the new PO file, and then merge the new PO file with its template. Then, the old ascription PO files should be concatenated with msgcat as well, and nothing more. But, make sure that you issue the --use-first option
to msgcat, for both concatenations. This is because when in the two concatenated PO files there are two messages with same msgctxt
+msgid
but different msgstr
, msgcat will by default make a free-form composition of msgstr
texts, for translator to manually disentangle later. This would ruin the ascription entry of such a message in the concatenated ascription PO file.
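The effect of --use-first can be illustrated with a simplified model, where each catalog is reduced to a mapping from msgctxt+msgid to msgstr. Real PO entries carry much more data, and the example translations here are made up.

```python
def concat_use_first(first, second):
    """Concatenate two catalogs, modeled as dicts keyed by
    (msgctxt, msgid), keeping the first catalog's msgstr on
    collisions -- the effect of msgcat --use-first. Without it,
    msgcat would compose both msgstr texts into a free-form mix."""
    merged = dict(second)
    merged.update(first)   # entries from the first catalog win
    return merged

# Hypothetical catalogs with a colliding message.
old_a = {(None, "Open"): "Otvori"}
old_b = {(None, "Open"): "Otvoriti", (None, "Close"): "Zatvori"}
print(concat_use_first(old_a, old_b))
# → {(None, 'Open'): 'Otvori', (None, 'Close'): 'Zatvori'}
```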
After the shuffling is performed in both file trees, poascribe co
is executed to smooth out and commit modifications.
At the moment of this writing, filtering for release has not been implemented yet in poascribe, but it is planned.
However, if you translate in summit, it is possible to configure the summit to skip insufficiently reviewed messages when scattering to branches. See Section 5.3.7, “Filtering by Ascription on Scatter” for details.
Overview of operation modes:
commit
, co
Commits modifications and reviews to PO files. Default selector: any
.
diff
, di
Adds embedded differences and other review elements to selected messages in PO files. Default selector: modar
.
history
, hi
Outputs to terminal the complete history of modifications and reviews for selected messages. Default selector: any
.
purge
, pu
Removes all review elements from PO files (unless -k
/--keep-flags
option is added, when only review flags are kept). Default selector: any
.
release
, re
Not implemented yet.
status
, st
Shows ascription counts per message category (total for all selected messages, and also per PO file if -b
/--show-by-file
option is added). Default selector: any
.
trim
, tr
Not implemented yet.
Options specific to poascribe:
-a SELECTOR[:ARGS]
, --select-ascription=SELECTOR[:ARGS]
By default, a historical message is selected for diffing with current message based on the last history selector given by -s
/--selector
option (if any). Instead, with this option you can explicitly set the selector for historical messages. It will be applied after the message has been selected by the primary selector chain. The option can be repeated, in which case a historical message is selected if all selectors match it.
-A RATIO
, --min-adjsim-diff=RATIO
The minimum adjusted similarity between the current and the historical message at which embedded differences will be shown.[28] This is a number in range from 0.0 (always show) to 1.0 (never show). If the difference is not shown due to this limit, the message will get the flag ediff-ignored
instead of the usual ediff
. A reasonable value may be 0.6 to 0.8.
-b
, --show-by-file
Some operation modes show a summary at the end of the run, which is based on all processed PO files taken together. With this option you can request some of the summary elements to be shown per processed file.
-C
, --no-vcs-commit
Issue this option if you want poascribe not to commit modifications to version control itself. This may be useful if you want to examine raw modifications it made, to perform some checks, etc, and commit manually later. But do not modify any messages in between, as that would defeat the purpose of ascription.
-d LEVEL
, --depth=LEVEL
Operation modes normally consider the ascription history of a message starting from the newest and going down to the earliest ascription. With this option you can set the depth to which history is examined, where 0 is the newest ascription only, 1 the newest and the first previous ascription, etc.
-D SPEC
, --diff-reduce-history=SPEC
Some special (possibly custom) selectors may need to examine only differences or commonalities between each two adjacent messages. In order not to have to build this functionality into each such selector, you can issue this option to preprocess ascription history such that each historical message is reduced based on the difference with the next earlier message. The message can be reduced to the parts equal, added or removed as compared to the earlier message. This is controlled by the SPEC value, which must start with one of the letters e (equal), a (added), or r (removed). This letter may be followed by an arbitrary sequence of characters, which will be used to separate the remaining parts of the text in the message; if there are no additional characters, space is used as the separator.
-F HOOKSPEC
, --filter=HOOKSPEC
Sometimes it may be necessary to apply selectors not to the ascription history as it is, but to a suitably filtered version of the history. This option can be used to set a Pology F1A hook as filter, see Section 9.10, “Processing Hooks” for details. It can be repeated to set several filters.
-G
, --show-filtered
When setting a filter on ascription history by the -F
/--filter
option in the diff
mode, it may be good to also see the difference in filtered messages, those on which the selectors were actually applied. By issuing this option, every message field with an embedded difference will also get a visually conspicuous separator, followed by the filtered version of the text with the difference as well. When you commit or purge the PO file diffed in this way, the separators and the filtered text are removed together with all other review elements.
-k
, --keep-flags
When the diffed PO file is purged of review elements, by default all review elements are removed, so that on subsequent commit only modifications would be ascribed, if there were any. Issuing this option on purge causes all review elements except for flags to be removed. More precisely, ediff*
flags are replaced with reviewed
, and unreviewed
flags are simply kept. This makes the subsequent commit also ascribe reviews. You need this if you want to apply some automatic checks to the PO file after the review and before the commit, where more intrusive review elements (like embedded differences) would interfere.
-m TEXT
, --message=TEXT
The text of the commit message. If a default commit message is set in the ascription configuration, this text overrides it. If a default commit message is not set and this option is not issued, an editor window is opened to enter the commit message.
-o EDITOR
, --open-in-editor=EDITOR
When diffing for review, instead of manually opening diffed PO files and searching for messages by flags, this option can be issued to have poascribe automatically open PO files in a PO editor (and possibly have the editor filter the message list to only selected messages). This works only with PO editors explicitly supported by Pology; the EDITOR
value is an editor keyword rather than an arbitrary editor command. See Section 9.7.1, “PO Editors” for the list of supported editors.
-L RATIO
, --max-fraction-select=RATIO
In diff
mode, this option sets the ratio of selected messages to total messages in a given PO file, above which no message in that file will be selected although the selector chain matched them. The value is the number between 0.0 and 1.0; for example, 0.2 means to accept selection if the number of selected messages is at most 20% of the total number of messages. This can be used to discern between reviewing updated PO files and newly translated PO files, as the latter take much more time to review and hence may be of lesser priority.
-s SELECTOR[:ARGS]
, --selector SELECTOR[:ARGS]
The option to set a selector, in various modes. Can be repeated to create selector chains, in which case a message must match all selectors to be selected. In diff
mode, if the last selector in the chain is not a history selector, selected messages will have no embedded differences (unless an ascription selector is explicitly given by the -a
/--select-ascription
option).
-t TAG
, --tag=TAG
The review tag, denoting the type of the review. If the review-tags field in ascription configuration is set, this must be one of the tags defined there (general review has an empty string as tag, which is the default). The tag is normally issued in diff mode: it will be appended to review flags on diffed messages (e.g. ediff/TAG), which will cause on commit that the review of this type is ascribed. In commit mode, this option has effect only if --all-reviewed
is issued as well, in which case this tag will override any from the PO file. Several tags may be given as comma-separated list.
-u NAME
, --user=NAME
The user, one of those defined in ascription configuration, to whom modifications and reviews are ascribed on commit.
-U
, --update-headers
If you work on PO files with a general text editor, you can issue this option on commit to have the header data in modified PO files automatically updated. The necessary information is fetched from the ascription configuration.
-v
, --verbose
More detailed information on progress of the operation.
-w FILE
, --write-modified=FILE
This option specifies the file into which to write the path of every PO file modified during the operation, one per line. This file can later be fed back to poascribe (and other Pology commands) with the -f
/--files-from
option.
-x FILE
, --externals=FILE
If you have written some custom selectors, with this option you specify the path to the file containing them. It can be repeated to load several files with custom selectors.
--all-reviewed
On commit, normally only messages having ediff*
or reviewed
flags will be ascribed as reviewed. If this option is used, instead all messages will be ascribed as reviewed (except for those having unreviewed
flag).
Options common with other Pology tools:
-f FILE
, --files-from=FILE
-e REGEX
, --exclude-name=REGEX
; -E REGEX
, --exclude-path=REGEX
; -i REGEX
, --include-name=REGEX
; -I REGEX
, --include-path=REGEX
The following configuration fields can be used to modify general behavior of poascribe:
[poascribe]/aselectors
The list of explicit selectors of historical messages, as if they were issued with multiple -a
/--select-ascription
options. The first character in the value must be non-alphanumeric (e.g. /
), and that character is then used to separate selector specifications; the value must also end with this character.
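To illustrate the separator convention described above, here is a minimal sketch in Python; parse_selector_list is a hypothetical helper written for this manual, not part of Pology:

```python
def parse_selector_list(value):
    # The first character of the value is the separator; it must be
    # non-alphanumeric, and the value must also end with it.
    sep = value[0]
    if sep.isalnum() or not value.endswith(sep):
        raise ValueError("malformed selector list: %r" % value)
    # Splitting on the separator and dropping the empty leading and
    # trailing fields yields the selector specifications.
    return [field for field in value.split(sep) if field]

print(parse_selector_list("/asc:joe/modar/"))  # ['asc:joe', 'modar']
```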
[poascribe]/diff-reduce-history
Counterpart to -D
/--diff-reduce-history
command line option.
[poascribe]/filters
Comma-separated list of history filters, as if they were issued with multiple -F
/--filter
options.
[poascribe]/max-fraction-select
Counterpart to -L
/--max-fraction-select
command line option.
[poascribe]/min-adjsim-diff
Counterpart to -A
/--min-adjsim-diff
command line option.
[poascribe]/po-editor
Counterpart to -o
/--open-in-editor
command line option.
[poascribe]/selectors
List of message selectors, as if they were issued with multiple -s
/--selector
options. The first character in the value must be non-alphanumeric (e.g. /
), and that character is then used to separate selector specifications; the value must also end with this character.
[poascribe]/tags
Counterpart to -t
/--tag
command line option.
[poascribe]/update-headers=[yes|*no]
Setting to yes
is counterpart to -U
/--update-headers
command line option.
[poascribe]/user
Counterpart to -u
/--user
command line option.
[poascribe]/vcs-commit=[*yes|no]
Setting to no
is counterpart to -C
/--no-vcs-commit
command line option.
poascribe provides a variety of internal selectors, and new selectors are added as general need for them is observed in practice. Selectors come in two types: history and shallow; the former also select a historical message from which to show the differences to the current message, while the latter do not. Arguments to selectors are appended to the selector name, consistently separated with any non-alphanumeric character, customarily colon (:
) when possible. If fewer arguments are given than the selector can take, all remaining arguments are set to empty (the selector may or may not accept this).
Available internal selectors are as follows:
any
(shallow) Selects any message.
active
(shallow) Selects active messages, i.e. those translated and not obsolete.
asc:USER
(history) Selects the latest historical message ascribed (modified or reviewed) by the given user, or by any user if the argument is empty. Multiple users can be given as a comma-separated list, and selection inverted by prepending ~
.
branch:NAME
(shallow) Selects messages belonging to the given branch (see Chapter 5, Summitting Translation Branches). Several branch names may be given, as comma-separated list.
current
(shallow) Selects current messages, i.e. those not obsolete.
e:ENTRYNUM
(shallow) Selects a message with given entry number in the PO file (first message has entry number 1, second 2, etc).
espan:START
:END
(shallow) Selects messages with entry numbers between given start and end, including both. If start is empty, 1 is assumed; if end is empty, the number of messages is assumed.
fexpr:EXPRESSION
(shallow) Selects messages matching a boolean search expression on message parts. It has the same syntax as the fexpr
parameter of the find-messages sieve.
hexpr:EXPRESSION
:USER
:DIFFSPEC
(history) Like fexpr
, but matches through historical messages starting from the latest ascription. If user argument is not empty, matches only messages ascribed to that user. Multiple users can be given as a comma-separated list, and selection inverted by prepending ~
. The last argument, if not empty, requests to reduce historical messages by incremental differences before matching them; see the --diff-reduce-history
option for the syntax and other details.
l:LINENUM
(shallow) Selects a message with given referent line number in the PO file. This is the line number of msgid
message field. ±1 offset is accepted.
lspan:START
:END
(shallow) Selects messages with referent line numbers between given start and end, including both. If start is empty, 1 is assumed; if end is empty, the total number of lines is assumed.
mod:USER
(history) Selects the latest historical message modified by the given user, or by any user if the argument is empty. Multiple users can be given as a comma-separated list, and selection inverted by prepending ~
.
modafter:TIMESTAMP
:USER
(history) Selects the earliest historical message modified at or after the given date and time. The full timestamp format is YEAR-MONTH-DAY HOUR:MINUTE:SECOND, but trailing elements can be omitted as logical; for example, 2010-10
would be interpreted as 2010-10-01 00:00:00
. If the user argument is not empty, only modifications by that user are considered; multiple users can be given as a comma-separated list, and selection inverted by prepending ~
.
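The tolerant timestamp parsing described above can be sketched as follows; this is an illustration of the documented behavior, not Pology's actual implementation:

```python
from datetime import datetime

# Formats from most to least complete; trailing elements of the
# timestamp may be omitted, and omitted fields default as logical.
_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
]

def parse_partial_timestamp(text):
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    raise ValueError("unrecognized timestamp: %r" % text)

print(parse_partial_timestamp("2010-10"))  # 2010-10-01 00:00:00
```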
modam:USER1
:USER2
(history) Selects the earliest historical message which introduced modifications after the last modification, or the very first historical message. This makes sense only if one or both of the user arguments are not empty. If the first user argument is not empty, only modifications by that user are considered for selection. If the second user argument is not empty, only the modifications by that user are considered as base. For both user arguments, multiple users can be given as a comma-separated list, and selection inverted by prepending ~
.
modar:MODUSER
:REVUSER
:TAG
(history) Selects the earliest historical message which introduced modifications after the last review, or the very first historical message if there was no review yet. If the first user argument is not empty, only modifications by that user are considered and reviews by that user are ignored. If the second user argument is not empty, only reviews by that user are considered and modifications by that user are ignored. For both user arguments, multiple users can be given as a comma-separated list, and selection inverted by prepending ~
. The last argument determines which review types (by review tag) to consider, where empty value means "general review"; multiple tags can be given as comma-separated list.
modarm:MODUSER
:REVUSER
:TAG
(history) Like modar
, but uses as base for selection the last review or the last modification. This generally makes sense only if some combination of user arguments is given too.
rev:USER
(history) Selects the latest historical message reviewed by the given user, or by any user if the argument is empty. Multiple users can be given as a comma-separated list, and selection inverted by prepending ~
.
revbm:REVUSER
:MODUSER
:TAG
(history) Selects the earliest historical message which has been reviewed just before a modification occurred. If the first user argument is not empty, only reviews by that user are considered and modifications by that user are ignored. If the second user argument is not empty, only modifications by that user are considered and reviews by that user are ignored. For both user arguments, multiple users can be given as a comma-separated list, and selection inverted by prepending ~
. The last argument determines which review types (by review tag) to consider, where empty value means "general review"; multiple tags can be given as comma-separated list.
tmodar:MODUSER
:REVUSER
:TAG
(history) Like modar
, but considers as modified only those historical messages with modifications in translation (msgstr
).
unasc
(shallow) Selects messages that are not yet ascribed, i.e. those which are modified but not yet committed.
Every selector automatically gets a negative counterpart, with the name prefixed by n*
. The negative selector is always shallow, regardless of the type of the original selector.
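The relation between a selector and its negative counterpart can be sketched like this; the dictionary-based message and the one-argument selector signature are simplifications for illustration only:

```python
def negate_selector(selector):
    # The negative counterpart selects exactly the messages the
    # original selector does not. It is always shallow, so only the
    # truth value of the original result is used and inverted.
    def negative(*args, **kwargs):
        return not selector(*args, **kwargs)
    return negative

# A toy shallow selector: matches untranslated messages.
def sel_untranslated(msg):
    return not msg["msgstr"]

sel_nuntranslated = negate_selector(sel_untranslated)
print(sel_nuntranslated({"msgstr": "hola"}))  # True
```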
Custom selectors can be written in a standalone Python source file, which is then fed to poascribe using the -x
/--externals
option. A file with several custom selectors should have this layout:
def selector_foo (args):
    ...

def selector_bar (args):
    ...

asc_selector_factories = {
    # key: (factory, is_history_selector)
    "foo": (selector_foo, False),
    "bar": (selector_bar, True),
}
selector_foo
and selector_bar
are factory functions for selectors foo
and bar
. After loading the file, poascribe will look for the asc_selector_factories
dictionary to see which selectors are defined and of what type they are. See Section 11.5, “Writing Ascription Selectors” for the instructions on writing selector factory functions.
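A hypothetical myselectors.py following this layout might look as below; the toy selector signature (a function taking a single message) is a simplification made for this sketch, see Section 11.5 for the real factory interface:

```python
# myselectors.py -- hypothetical file to pass via -x/--externals.

def selector_foo(args):
    # Factory: 'args' is the list of selector arguments; it returns
    # the actual selector function (a shallow one in this sketch).
    def selector(msg):
        return args[0] in msg["msgid"] if args else True
    return selector

def selector_bar(args):
    def selector(msg):
        return bool(msg["fuzzy"])
    return selector

# poascribe looks up this dictionary after loading the file; the
# boolean declares whether the selector is a history selector.
asc_selector_factories = {
    "foo": (selector_foo, False),
    "bar": (selector_bar, True),
}

factory, is_history = asc_selector_factories["foo"]
sel = factory(["hello"])
print(sel({"msgid": "hello world"}))  # True
```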
[24] Technically, PO files could also be grouped by PO domain:
po/
    ui/
        alpha/
            ...
                nn.po
                mm.po
                ...
but this would lead to a host of strange sharings of ascription settings and auxiliary file locations between different languages. In general, it is assumed that each translation team manages its own separate ascription.
[25] There should be no technical problem here, since VCS updates are inexpensive in terms of network traffic, but there may be a problem of changing one's habits.
[26] If the underlying VCS is a distributed one, such as Git, and a push to a designated central repository is expected afterward, it must be performed manually.
[27] General review too has a tag assigned, the empty string, in case the reviewer needs to explicitly issue it in some context.
[28] Unlike, for example, in fuzzy messages, the similarity between the current and the earlier message from the ascription history may be exactly zero, if the PO file has undergone several merges in between. For example, in a two-word message, the first merge could have replaced the first word, and the second merge the second word.
This chapter describes various smaller standalone tools in Pology, which do not introduce any major PO processing concepts nor can be grouped under a common topic.
The porewrap script does one simple thing: it rewraps message strings (msgid
, msgstr
, etc.) in PO files. Gettext's tools, e.g. msgcat, can be used for rewrapping as well, so what is the reason for porewrap's existence? The lesser reason is convenience. An arbitrary number of PO file paths can be given to it as arguments, as well as directory paths which will be recursively searched for PO files. The more important reason is that Pology can also perform "fine" wrapping, as described in Section 9.8, “Line Wrapping in PO Messages”. Thus, running:
$ porewrap --no-wrap --fine-wrap somedir/
will rewrap all PO files found in somedir/
and below, such that basic wrapping (on column) is disabled (--no-wrap
), while fine wrapping (on logical breaks) is enabled (--fine-wrap
).
Other than from command line options, porewrap will also consult the PO file header and the user configuration for the wrapping mode. Command line options have the highest priority, followed by the PO header, and the user configuration at the end. For details on how to set the wrapping mode in PO headers, see the description of the X-Wrapping
header field in Section 9.9, “Influential Header Fields”. If none of these sources specify the wrapping mode, porewrap will apply basic wrapping.
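The priority order can be sketched as a simple resolution function; this illustrates the documented precedence only, and is not porewrap's actual code:

```python
def resolve_wrapping(cmdline=None, po_header=None, user_config=None):
    # Highest priority first: command line, then the PO header's
    # X-Wrapping field, then user configuration; basic wrapping is
    # the fallback when no source specifies a mode.
    for source in (cmdline, po_header, user_config):
        if source is not None:
            return source
    return "basic"

print(resolve_wrapping(po_header="fine"))                  # fine
print(resolve_wrapping(cmdline="none", po_header="fine"))  # none
```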
Options specific to porewrap:
-v
, --verbose
Since porewrap just opens and writes back all the PO files given to it, it normally does not report anything. But this option can be issued for it to report PO file paths as they have been written out.
Options common with other Pology tools:
--wrap
; --no-wrap
; --fine-wrap
; --no-fine-wrap
; --wrap-column=COL
See Section 9.8.1, “Common Command Line Options for Wrapping”.
-f FILE
, --files-from=FILE
porewrap reads the wrapping mode fields as described in Section 9.8.2, “Common User Configuration Fields for Wrapping”, from its [porewrap]
section.
Normally, PO files are periodically merged with the latest PO templates, to introduce changes from the source material while preserving as much of the existing translation as possible. poselfmerge, on the other hand, will merge the PO file with "itself". More precisely, it will derive a temporary template version of the PO file (by cleaning it of translations and other details), and then merge the original PO file with the derived template, by calling msgmerge internally. This can have several uses:
The fuzzy matching algorithm of msgmerge is extremely fast and robust, but treats all messages the same and in isolation, without trying out more complicated (and necessarily much slower) heuristic criteria. This can cause the translator to spend more time updating a fuzzy message than it would take to translate it from scratch. poselfmerge can therefore be instructed to go over all fuzzy messages created by merging, and apply additional heuristics to determine whether to leave the message fuzzy or to clean it up and make it fully untranslated.
Sometimes the PO file can contain a number of quite similar longer messages (this is especially the case when translating in summit). A capable PO editor should automatically offer the previous translation on the next similar message (by using internal translation memory), and show what the small differences in the original text are, thus greatly speeding up the translation of that message. If, however, the PO editor is not that capable, or you use a plain text editor, you can simply skip every long message that looks familiar while translating, and afterwards run poselfmerge on the PO file to introduce fuzzy matches on those messages.
More generally, if your PO editor does not have (a good enough) translation memory feature, or you edit PO files with a plain text editor, you can instruct poselfmerge to use one or more PO compendia to provide additional exact and fuzzy matches. This is essentially the batch application of translation memory. Section 10.1, “Creating and Using PO Compendia” provides some hints on how to create and maintain PO compendia.
Arguments to poselfmerge are any number of PO file paths or directories to search for PO files, which will be modified in place:
$ poselfmerge foo.po bar.po somedir/
However, this run will do almost nothing (except possibly rewrap files), just as msgmerge would do nothing if the same template were used twice. Instead, all special processing must be requested by command line options, or activated through the user configuration to avoid issuing some options with the same values all the time.
Options specific to poselfmerge:
-A RATIO
, --min-adjsim-fuzzy=RATIO
The minimum required "adjusted similarity" between the old and the new original text in a fuzzy message, in order to accept it and not clean it to untranslated state. The similarity is expressed as a ratio in range 0.0-1.0, with 0.0 meaning no similarity and 1.0 no difference. A practical range is 0.6-0.8. If this option is not issued, fuzzy messages are kept as they are (as if 0.0 were given).
The requirement for computation of adjusted similarity is that fuzzy messages contain previous strings, i.e. that the PO file was originally merged with --previous
to msgmerge.
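As an illustration of the acceptance criterion, the following sketch uses Python's difflib ratio as a stand-in for Pology's adjusted similarity metric (which is computed differently); keep_fuzzy is a hypothetical helper:

```python
from difflib import SequenceMatcher

def similarity(old_text, new_text):
    # Stand-in ratio in [0.0, 1.0]; Pology's actual "adjusted
    # similarity" metric is more involved.
    return SequenceMatcher(None, old_text, new_text).ratio()

def keep_fuzzy(old_msgid, new_msgid, min_adjsim=0.7):
    # Keep the fuzzy translation only if the old and new originals
    # are similar enough; otherwise clean it to untranslated.
    return similarity(old_msgid, new_msgid) >= min_adjsim

print(keep_fuzzy("Save the file", "Save the files"))  # True
print(keep_fuzzy("Save the file", "Undo"))            # False
```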
-b
, --rebase-fuzzies
Normally, when merging with a template, the untranslated and fuzzy messages already present in the PO file are not checked again for approximate matches. This is on the one hand a performance measure (why fuzzy match again something that was already matched before?), and on the other hand a safety measure (higher trust in an old fuzzy match based on the PO file itself than e.g. a new match from an arbitrary compendium). By issuing this option, prior to merging all untranslated messages are removed from the PO file, as well as all fuzzy messages which still have their base translated message in the PO file (judging by previous strings). This activates fuzzy matching on untranslated messages (e.g. if a new compendium is given, or for similar messages skipped during translation), and updates base translated messages on fuzzy messages.
-C POFILE
, --compendium=POFILE
The PO file to use as compendium on merging, to produce more exact and fuzzy matches. This option can be repeated to add several compendia.
-v
, --verbose
poselfmerge normally operates silently, and this option requests some progress information. Quite useful if processing a large collection of PO files, because merging and post-merge processing can take a lot of time (especially in presence of compendium).
-W NUMBER
, --min-words-exact=NUMBER
When an exact match for an untranslated message is produced from the compendium, it is not always safe to silently accept it, because the compendium may contain translations from contexts totally unrelated to the current PO file. The shorter the message, the higher the chance that the translation will not be suitable in the current context. This option provides the minimum number of words (in the original) needed to accept an exact match from the compendium; otherwise the message is made fuzzy. A reasonable value depends on the relation between the source and the target language, with 5 to 10 probably being on the safe side.
Note that afterwards you can see when an exact match has been demoted into a fuzzy one, by that message not having previous strings (#| msgid "..."
, etc.).
-x
, --fuzzy-exact
This option is used to unconditionally demote exact matches from the compendium into fuzzy messages (i.e. regardless of the length of the text, as done by -W
/--min-words-exact
). This may be needed, for example, when there is a strict review procedure in place, and the compendium is built from unreviewed translations.
Options common with other Pology tools:
--wrap
; --no-wrap
; --fine-wrap
; --no-fine-wrap
; --wrap-column=COL
See Section 9.8.1, “Common Command Line Options for Wrapping”.
-f FILE
, --files-from=FILE
It is likely that the translator will have a certain personal preference of the various match acceptance criteria provided by command line options. Instead of issuing those options all the time, the following user configuration fields may be set:
[poselfmerge]/fuzzy-exact=[yes|*no]
Counterpart to the -x
/--fuzzy-exact
option.
[poselfmerge]/min-adjsim-fuzzy
Counterpart to the -A
/--min-adjsim-fuzzy
option.
[poselfmerge]/min-words-exact
Counterpart to the -W
/--min-words-exact
option.
[poselfmerge]/rebase-fuzzies=[yes|*no]
Counterpart to the -b
/--rebase-fuzzies
option.
Of course, command line options can be issued to override the user configuration fields when necessary.
poselfmerge also reads the wrapping mode fields as described in Section 9.8.2, “Common User Configuration Fields for Wrapping”, from its [poselfmerge]
section.
Machine translation is the process where a computer program is used to produce a translation of more than a trivial piece of text, starting from single sentences, over paragraphs, to full documents. There are debates on how useful machine translation is right now and how much better it could become in the future, and there is a steady line of research in that direction. Limiting ourselves to widely available examples of machine translation software today, it is safe to say that, on the one hand, machine translation can preserve a lot of the meaning of the original and thus be very useful to a reader who needs to grasp the main points of the text, but, on the other hand, it is not at all passable for producing translations of the quality expected of human translators who are native speakers of the target language.
As far as Pology is concerned, the question of machine translation reduces to this: would it increase the efficiency of translation if PO files were first machine-translated, and then manually corrected by a human translator? There is no general answer to this question, as it depends strongly on all elements in the chain: the quality of machine translation software, the source language, the target language, and the human translator. Be that as it may, Pology provides the pomtrans script, which can fill in untranslated messages in PO files by passing original text through various machine translation services.
pomtrans has two principal modes of operation. The more straightforward is the direct mode, where original texts are simply msgid
strings in the given PO file. In this mode, PO files can be machine-translated with:
$ pomtrans transerv -t lang paths...
The first argument is the translation service keyword, chosen from those known to pomtrans. The -t
option specifies the target language; it may not be necessary if processed PO files have the Language:
header field properly set. The source language is assumed to be English, but there is an option to specify another source language. Afterwards an arbitrary number of paths follow, which may be either single PO files or directories which will be recursively searched for PO files.
pomtrans will try to translate only untranslated messages, and not fuzzy messages. When it translates a message, by default it will make it fuzzy as well, meaning that a human should go through all machine-translated messages. These defaults are based on the perceived current quality of most machine translation services. There are several command line options to change this behavior.
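The default behavior described above (translate only untranslated, non-fuzzy messages, and mark the result fuzzy) can be sketched over a toy message list; the dictionary representation of messages and the translate callback are illustrative assumptions, not pomtrans internals:

```python
def machine_translate_po(messages, translate,
                         flag_mtrans=False, no_fuzzy_flag=False):
    # Fill in only untranslated, non-fuzzy messages; each machine
    # translation is marked fuzzy by default so that a human goes
    # through it afterwards.
    for msg in messages:
        if msg["msgstr"] or msg["fuzzy"]:
            continue
        msg["msgstr"] = translate(msg["msgid"])
        if not no_fuzzy_flag:
            msg["fuzzy"] = True
        if flag_mtrans:
            msg.setdefault("flags", []).append("mtrans")
    return messages

# Hypothetical stand-in "service" just for demonstration.
msgs = [{"msgid": "File", "msgstr": "", "fuzzy": False}]
machine_translate_po(msgs, lambda s: "[mt] " + s)
print(msgs[0]["msgstr"], msgs[0]["fuzzy"])  # [mt] File True
```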
The other mode of operation is the parallel mode. Here pomtrans takes the original text to be the translation into another language, i.e. msgstr
strings from a PO file translated into another language. For example, if a PO file should be translated into Spanish (i.e. from English to Spanish), and that same PO file is available fully translated into French (i.e. from English to French), then pomtrans could be used to translate from French to Spanish. This is done in the following way:
$ pomtrans transerv -s lang1 -t lang2 -p search:replace paths...
As in direct mode, the first argument is the translation service. Then both the source (-s
) and the target language (-t
) are specified; again, if PO files have their Language:
header fields set, these options are not necessary. The peculiar element here is the -p
option, which specifies two strings, separated by colon. These are used to construct paths to source language PO files, by replacing the first string in paths of target language PO files with the second string. For example, if the file tree is:
foo/
    po/
        alpha/
            alpha.pot
            fr.po
            es.po
        bravo/
            bravo.pot
            fr.po
            es.po
then the invocation could be:
$ cd .../foo/
$ pomtrans transerv -s fr -t es -p es.:fr. po/*/es.po
In case a PO file in target language does not have a counterpart in source language, it is simply skipped.
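The path construction performed by -p can be sketched as a simple string replacement; source_path is a hypothetical helper written for this illustration:

```python
def source_path(target_path, search, replace):
    # -p SEARCH:REPLACE builds the source language PO path by
    # replacing SEARCH with REPLACE in the target language path.
    return target_path.replace(search, replace)

print(source_path("po/alpha/es.po", "es.", "fr."))  # po/alpha/fr.po
```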
There is another variation of the parallel mode, where source language texts are drawn not from counterpart PO files, but from a single, compendium PO file in source language. This mode is engaged by giving the path to that compendium with the -c
option, instead of the -p
option for path replacement.
Options specific to pomtrans:
-a CHARS
, --accelerator=CHARS
Characters used as accelerator markers in user interface messages. They should be removed from the source language text before translation, in order not to confuse the translation service.[29]
-c FILE, --parallel-compendium=FILE
The path to the source language compendium, in parallel translation mode.

-l, --list-transervs
Lists known translation services (the keywords which can be the first argument to pomtrans).

-m, --flag-mtrans
Adds the mtrans flag to each machine-translated message. This may be useful to positively identify machine-translated messages in the resulting PO file, as otherwise they are simply fuzzy.

-M MODE, --translation-mode=MODE
Translation services need as input the mode in which to operate, usually the source and target language at minimum. By default the translation mode is constructed based on the source and target languages, but this is sometimes not precise enough. This option can be used to issue a custom mode string for the chosen translation service, overriding the default construction. The format of the mode string depends on the translation service; check the documentation of the respective translation service for details.

-n, --no-fuzzy-flag
By default machine-translated messages are made fuzzy, which is prevented by this option. It goes without saying that this is dangerous at the current state of the art in machine translation, and should be used only in very specific scenarios (e.g. high-quality machine translation between two dialects of the same language).

-p SEARCH:REPLACE, --parallel-catalogs=SEARCH:REPLACE
The string to search for in paths of target language PO files, and the string to replace it with to construct paths of source language PO files, in parallel translation mode.

-s LANG, --source-lang=LANG
The source language code, i.e. the language which is being translated from.

-t LANG, --target-lang=LANG
The target language code, i.e. the language which is being translated into.

-T PATH, --transerv-bin=PATH
If the selected translation service is (or can be) a program on the local computer, this option can be used to specify the path to its executable file, if it is not in the PATH.
Currently supported translation services are as follows (with keyword in parenthesis):
Apertium (apertium)
Apertium is a free machine translation platform, developed by the Transducens research group of the University of Alicante. There is a basic web service, but the software can be locally installed, and that is how pomtrans uses it (some distributions provide packages).

Google Translate (google)
Google Translate is Google's proprietary web machine translation service, which can be used free of charge. At the moment, pomtrans makes one query to it per message, which can take quite some time on long PO files.
[29] This also means that, at the moment, machine-translated text has no accelerator when the original text did have one. Some heuristics may be implemented in the future to add the accelerator to translated text as well.
Pology was designed with strong language-specific support in mind, and this chapter describes the currently available features for validation and derivation of the translation as a whole and of various bits within it.
A versatile translation-supporting tool has to have some language-specific functionality. But, it is difficult to agree on what is a language and what is a dialect, what is standard and what is jargon, what is derived from what, how any of these are named, and there are many witty remarks about existing classifications. Therefore, Pology takes a rather simple and non-formal approach to the definition of "language", but such that should provide good technical leverage for constructing language-specific functionality.
There are two levels of language-specificity in Pology.
The first level is simply the "language". In the linguistic sense this can be a language proper (whatever that means), a dialect, a variant written in a different script, etc. Each language in this sense is assigned a code in Pology, when the first elements of support for that language are introduced. By convention this code should be an ISO 639 code (either two- or three-letter) if applicable, but in principle it can be anything. Another convenient source of language codes is the GNU C library. For example, Portuguese as spoken in Portugal has the code pt (ISO 639), while Portuguese as spoken in Brazil has the code pt_BR (GNU C library).
The second level of language-specificity is the "environment". In linguistic terms this covers distinct but minor variations in vocabulary, style, tone, or orthography, which are specific to certain groups of people within a single language community. Within Pology, this level is used to support variations between specific translation environments, such as long-standing translation projects and their teams. Although translating into the same language, translation teams will almost inevitably have some differences in terminology, style guidelines, etc. Environments also have codes assigned.
In every application in Pology, the language and its environments have a hierarchical relation. In general, language-specific elements defined outside of a specific environment ("environment-agnostic" elements) are a sort of relaxed least common denominator, and specific environments add their own elements to that. "Relaxed" means that environment-agnostic elements can sometimes include that which holds for most but not all environments, while each environment can override what it needs to. This prevents the environment-agnostic language support from getting too limited just to cater for peculiarities in certain environments.
When processing PO files, it is necessary to somehow convey to Pology tools to which language and environment the PO files belong. The most effective way of doing this is by adding the necessary information to PO headers. All Pology tools that deal with language-specific elements will check the header of the PO file they process for the language and environment. Some Pology tools will also consult the user configuration (typically with lower priority than PO headers) or provide appropriate command line options (typically giving them higher priority). See Section 9.9, “Influential Header Fields” and Section 9.2, “User Configuration” for details.
The following languages, and some environments within them, currently have some level of support in Pology (assigned code in parenthesis):

Catalan (ca)
French (fr)
Galician (gl)
Japanese (ja)
Low Saxon (nds)
Norwegian Nynorsk (nn)
Romanian (ro)
Russian (ru)
Serbian (sr)
Spanish (es)
Pology can employ various well-known spell-checkers to check the translation in PO files. Currently there is standalone support for Aspell, and unified support for many spell-checkers (including Aspell) through Enchant, the spell-checking wrapper library (more precisely, through Python bindings for Enchant).
Spell-checking of one PO file or a collection of PO files can be performed directly by sieving them through one of check-spell (Aspell) or check-spell-ec sieves. The sieve will report each unknown word, possibly with a list of suggestions, and the location of the message (file and line/entry numbers). It can also be requested to show the full message, with unknown words in the translation highlighted.
Also provided are several spell-checking hooks, which can be used as building blocks in custom translation validation chains. For example, a spell-checking hook can be used to define the spell-checking rule within Pology's validation rules collection for a given language.
Pology collects internal language-specific word lists as supplements to system spelling dictionaries. One use of internal dictionaries is to record those words which are omitted in the system spelling dictionaries, but are actually proper words in the given language. Such words should be added into internal dictionaries only as an immediate fix for false spelling warnings, with an eye towards integrating them into the upstream spelling dictionaries of respective spell-checkers.
More importantly, internal dictionaries serve to collect words specific to a given environment, i.e. the words which are deemed too specific to be part of the upstream, general spelling dictionaries for the language. For example, this can be technical jargon, with newly coined terms which are yet to be more widely accepted. Another example could be translation of fiction, in books or computer games, where it is common-place to make up words for fictional objects, animals, places, etc. which are not even intended to be more widely used.
In the Pology source tree, internal spelling dictionaries are located per language, in lang/<language code>/spell/ directories. This directory can contain an arbitrary number of dictionary files, which are all automatically picked up by Pology when spell-checking for that language is done. Dictionary files directly in this directory are environment-agnostic, and should contain only the words which are standard (or standard derivations) in the language, but happen to be missing from the system spelling dictionary. Subdirectories represent specific environments; they are named by the environment code, and can also contain any number of dictionaries. An example of an internal dictionary tree with environments:

lang/
    sr/
        spell/
            colors.aspell
            fruit.aspell
            ...
            science.aspell
            kde/
                general.aspell
            wesnoth/
                general.aspell
                propernames.aspell
When one of Pology's spell-checking routes is applied for a given language without further qualifiers, only the environment-agnostic dictionaries of that language are automatically included. It must be explicitly requested to additionally include dictionaries from one of the environments (e.g. by the env: parameter to the check-spell sieve).
Dictionary files are in the Aspell word list format (regardless of the spell-checker actually used), and must have the .aspell extension. This is a simple plain text format, listing one word per line. Only the first line, the header, is special; it states the language code, the number of words in the list, and the encoding. For example:

personal_ws-1.1 fr 1234 UTF-8
apricot
banana
cherry
...
Actually the only significant element of the header is the encoding. Language code and number of words can be arbitrary, as Pology will not use them.
Pology provides the normalize-aspell-word-list command which sorts word list files alphabetically (and corrects the word count in the header, though it is not important), so that you do not have to manually insert new words in the proper order. The script is simply run with an arbitrary number of word list files as arguments, and modifies them in place. In case of duplicate words, it will report the duplicates and eliminate them. In case of words with invalid characters (e.g. a space), the script will output a warning, but it will not remove them; automatic removal of invalid words can be requested with the -r/--remove-invalid option.
Sometimes a message or a few words in it should not be spell-checked. This can be, for example, when the message is dense computer input (like a command line synopsis), or when a word is part of a literal phrase (such as an email address). It may be possible to filter the text to remove some of the non-checkable words prior to spell-checking (especially when spell-checking is done as a validation rule), but not all such words can be automatically detected. For example, especially problematic are onomatopoeic constructs ("Aaargh! Who released the beast?!").
For this reason it is possible to manually skip spell-checking on a message, or on certain words within a message, by adding a special translator comment. The whole message is skipped by adding the no-check-spell translator flag to it:

# |, no-check-spell
Words within the message are skipped by listing them in a well-spelled: translator comment, comma- or space-separated:
# well-spelled: Aaarg, gaaah, khh
Which of these two levels of skipping to use depends on the nature of the text. For example, if most of the text is composed of proper words, and there are only a few which should not be checked, it is probably better to list those words explicitly instead of skipping the whole message.
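The well-spelled: convention is simple enough to sketch. The following is an illustration of how such comments could be parsed, assuming comma- or space-separated words as described above; it is not Pology's actual parser, and the function name is made up.

```python
import re

def words_to_skip(comments):
    """Collect words listed in 'well-spelled:' translator comments,
    accepting both comma- and space-separated lists."""
    skip = set()
    for c in comments:
        m = re.match(r"\s*well-spelled:\s*(.*)", c)
        if m:
            skip.update(w for w in re.split(r"[,\s]+", m.group(1)) if w)
    return skip

print(words_to_skip(["well-spelled: Aaarg, gaaah, khh"]))
```

A spell-checker integration would then simply drop these words from the set of unknown words before reporting.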
With Pology you can use LanguageTool, a free grammar and style checker, to check translation in PO files. At the moment LanguageTool is applicable only through the check-grammar sieve, so look up the details in its documentation.
In program documentation, but also in help texts of running programs, labels from the user interface are frequently mentioned. Here are two such messages, one a UI tooltip, the other a Docbook paragraph:
#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr ""

#: index.docbook:157
msgid ""
"<guimenuitem>Selected files only</guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
In the usual translation process, an embedded UI label is manually translated just like the surrounding text. You could directly translate the label, hoping that the original UI message was translated in the same way, but this will frequently not be the case (especially for longer labels). To be thorough, you could look up the UI message in its PO file, or run the program, to see how it was actually translated. There are two problems with being thorough in this way: it takes time to look up original UI messages, and worse, translation of a UI message might change in the future (e.g. after a review) and leave the referencing message out of date.
An obvious solution to these problems, in principle, would be to leave embedded UI labels untranslated but properly marked (such as with <gui*> tags in Docbook), and have an automatic system fetch their translations from the original UI messages and insert them into referencing messages. However, there could be many implementation variations of this approach (such as at which stage of the translation chain the automatic insertion happens), with some significant details to get right.
At present, Pology approaches automatic insertion of UI labels in a generalized way, which does not mandate any particular organization of PO files or translation workflow. It defines a syntax for wrapping and disambiguating UI references, for linking referencing and originating PO files, and provides a series of hooks to resolve and validate UI references. A UI reference resolving hook will simply replace a properly equipped non-translated UI label with its translation. This implies that PO files which are delivered must not be the same PO files which are directly translated, because resolving UI references in directly translated PO files would preclude their automatic update in the future[30]. It is upon the translator or the translation team to establish the separation between delivered and translated PO files. One way is by translating in summit (see Chapter 5, Summitting Translation Branches), which by definition provides the desired separation, and setting UI reference resolving hooks as filters on scatter.
If UI references are inserted into the text informally (even if relying on certain orthographic or typographic conventions), then they must be manually wrapped in the translation using an explicit UI reference directive. For example:
#: comic.cpp:466
msgid "Press the \"Get New Comics...\" button to install comics."
msgstr "Pritisnite dugme „~%/Get New Comics/“ da instalirate stripove."
Explicit UI reference directives have the format head/reference-text/. The directive head is ~% in this example, which is the default, but another head may be specified as a parameter to UI resolving hooks. The delimiting slashes in the UI reference directive can be consistently replaced with any other character (e.g. if the UI text itself contains a slash). Note that the directive head must be fixed for a collection of PO files (though more than one head can be defined), while the delimiting character can be freely chosen from one directive to another.
The other type is the implicit UI reference, which requires no special directive and is possible when UI text is indicated in the text through formal markup. This is the case, for example, in PO files coming from Docbook documentation:
#: index.docbook:157
msgid ""
"<guimenuitem>Selected files only</guimenuitem> extracts only "
"the files which have been selected."
msgstr ""
"<guimenuitem>Selected files only</guimenuitem> raspakuje samo "
"datoteke koje su izabrane."
Here the translation contains nothing special, save for the fact that the UI reference is not translated. UI resolving hooks can be given a list of tags to be considered as UI references, and for some common formats (such as Docbook) there are predefined specialized hooks which already list all UI tags.
If the message of the UI text is unique by its msgid string in the originating PO file, then it can be wrapped simply as in the previous examples. This means that even if it has a msgctxt string, the reference will still be resolved. But if there are several UI messages with the same msgid (implying different msgctxt), then the msgctxt string has to be manually added to the reference. This is done by putting the context into the prefix of the reference, separated by the pipe (|) character. For example, if the PO file has these two messages:
msgctxt "@title:menu"
msgid "Columns"
msgstr "Kolone"

msgctxt "@action:inmenu View Mode"
msgid "Columns"
msgstr "kolone"
then the correct one can be selected in an implicit UI reference like this:
msgid "...<guibutton>Columns</guibutton>..."
msgstr "...<guibutton>@title:menu|Columns</guibutton>..."
In the very unlikely case of the | character being part of the context string itself, the ¦ character ("broken bar") can be used as the context separator instead.
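Splitting the optional context prefix from the reference text is a small operation; here is a sketch under the conventions just described (| as separator, ¦ as fallback, empty context allowed). The function name is made up for illustration.

```python
def split_context(reference):
    """Split an optional msgctxt prefix from a UI reference text.
    The '¦' fallback is checked first, so that '|' may freely occur
    in the context part when '¦' is the chosen separator."""
    for sep in ("¦", "|"):
        if sep in reference:
            ctxt, text = reference.split(sep, 1)
            return ctxt, text
    return None, reference

print(split_context("@title:menu|Columns"))  # ('@title:menu', 'Columns')
print(split_context("Columns"))              # (None, 'Columns')
```

Note that an empty string context ("|Columns") is distinct from no context at all, which matches the "empty context" selection described below.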
If the UI reference equipped with context does not resolve to a message through a direct match on context, the given context string will next be tried as a regular expression match on the msgctxt strings of the messages with matching msgid (matching will be case-insensitive). If this results in exactly one matched message, the reference is resolved. This matching sequence allows simplification and robustness in case of longer contexts, which would look ungainly in the UI reference and may slightly change over time.
If two UI messages have equal msgid but are not part of the same PO file, that is not a conflict, because one of those PO files has priority (see Section 8.4.3, “Linking to Originating PO Files”).
If of two UI messages with equal msgid one has msgctxt and the other does not, the message without context can be selected by adding the context separator in front of the text with nothing before it (i.e. as if the context were "empty").
Sometimes, though rarely, it happens that the referenced UI text is not statically complete, that is, it contains a format directive which is resolved at runtime. In such cases, the reference must be transformed to exactly match an existing msgid, and the arguments are substituted with a special syntax. If the UI message is:

msgid "Configure %1..."
msgstr "Podesi %1..."
then it can be used in an implicit UI reference like this:
msgid "...<guimenuitem>Configure Foobar...</guimenuitem>..."
msgstr "...<guimenuitem>Configure %1...^%1:Foobar</guimenuitem>..."
Substitution arguments follow after the text, separated by the ^ character. Each argument specifies the format directive it replaces and the argument text, separated by :. In the unlikely case that ^ is part of the msgid itself, the ª character ("feminine ordinal indicator") can be used instead as the argument separator.
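The "named" substitution behavior described next (all equal directives replaced by the same argument) can be sketched like this. The function is a hypothetical illustration of the syntax, not Pology's implementation, and it omits the !-prefixed positional case.

```python
def substitute_args(reference, sep="^"):
    """Apply the argument-substitution syntax: reference text, then
    sep-separated 'directive:argument' pairs; every occurrence of an
    equal directive is replaced by the same argument (named style)."""
    parts = reference.split(sep)
    text = parts[0]
    for arg in parts[1:]:
        directive, value = arg.split(":", 1)
        text = text.replace(directive, value)
    return text

print(substitute_args("Configure %1...^%1:Foobar"))  # 'Configure Foobar...'
```

Resolution would first look up the msgid "Configure %1..." in the UI PO file, take its translation, and then perform the same substitution on the translated text.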
If there are several format directives in the UI reference, they are by default considered "named". This means that all identical format directives will be replaced by the same argument. This is the right thing to do for some formats, e.g. python-format or kde-format messages, but not for all formats. In c-format, if there are two %s in the text, to replace just one of them with the current argument, the format directive attached to the argument must be preceded with !:
#, c-format
msgid "...<guilabel>This Foo or that Bar</guilabel>..."
msgstr "...<guilabel>This %s or that %s.^!%s:foo^!%s:bar</guilabel>..."
In general, but especially with implicit references, the text wrapped as a reference may actually contain several references in the form of a UI path ("...go to Foo->Bar->Baz, and click on..."). To handle such cases, when it is not possible or convenient to wrap each element of the UI path separately, UI reference resolving hooks can be given one or more UI path separators (e.g. ->) to split the path and resolve the element references on their own.
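Splitting on configured path separators is straightforward; a minimal sketch, with a made-up function name:

```python
def split_ui_path(reference, separators=("->",)):
    """Split a wrapped UI path like 'Foo->Bar->Baz' into element
    references, each of which would then be resolved on its own."""
    parts = [reference]
    for sep in separators:
        parts = [piece for part in parts for piece in part.split(sep)]
    return [p.strip() for p in parts]

print(split_ui_path("Foo->Bar->Baz"))  # ['Foo', 'Bar', 'Baz']
```

Each resulting element is then looked up as an independent UI reference, and the translated elements are rejoined with the same separator.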
Sometimes the UI reference in the original text is not valid, i.e. such a message no longer exists in the program. This can happen due to a slight punctuation mismatch, small style changes, etc., such that you can easily locate the correct UI message and use its msgid as the reference. However, if the UI reference is not valid because the documentation is outdated, there is no correct UI message to use in the translation. This should most certainly be reported to the authors, but until they fix it, it presents a problem for immediate resolution of UI references. For this reason, a UI reference can be temporarily translated in place, by preceding it with twin context separators:

msgid "...An Outdated Label..."
msgstr "...||Zastarela etiketa..."
This will resolve into the verbatim text of the reference (i.e. context separators will simply be removed), without the hook complaining about an unresolvable reference.
The text of the UI message may contain some characters and substrings which should not be carried over into the text which references the message, or should be modified. To cater for this, UI PO files are normalized after being opened and before UI references are looked up in them. In fact, UI references are written precisely in this normalized form, rather than using the true original msgid from the UI PO file. This is both for convenience and out of necessity.
One typical thing to handle in normalization is the accelerator marker. UI reference resolving hooks eliminate accelerator markers automatically, but for that they need to know what the accelerator marker character is. To find this out, hooks will read the X-Accelerator-Marker header field.
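As an illustration of this normalization step, here is a naive sketch of accelerator removal, assuming the marker character has been read from the X-Accelerator-Marker header field. The real normalization in Pology is more involved; the function name is hypothetical.

```python
import re

def remove_accelerator(text, marker="&"):
    """Remove the first accelerator marker that directly precedes
    a word character, leaving other occurrences of the marker alone."""
    return re.sub(re.escape(marker) + r"(\w)", r"\1", text, count=1)

print(remove_accelerator("&Open File"))        # 'Open File'
print(remove_accelerator("Scaled & Cropped"))  # unchanged: '&' not at a word
```

Requiring a following word character avoids mangling a literal standalone "&", as in "Scaled & Cropped".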
Another problem is when UI messages contain subsections which would invalidate the target format which is being translated in the referencing PO file, e.g. malformed XML in Docbook catalogs. For example, a literal & must be represented as &amp; in Docbook markup, thus this UI message:
msgid "Scaled & Cropped"
msgstr ""
would be referenced as:
msgid "...<guimenuitem>Scaled &amp; Cropped</guimenuitem>..."
msgstr "...<guimenuitem>Scaled &amp; Cropped</guimenuitem>..."
Resolving hooks have parameters for specifying the type of escaping needed by the target format.
Normalization may flatten several different messages from the UI PO file into one. An example of this is when msgid fields are equal except for the accelerator marker. If this happens and the normalized translations are not equal for all flattened messages, a special "tail" is added to their contexts, consisting of a tilde and several alphanumeric characters. The first run of the resolving (or validation) hook will report ambiguities of this kind, as well as the assigned contexts, so that the proper context can be copied and pasted into the UI reference. The alphanumeric context tail is computed from the non-normalized msgid alone, so it will not change if, for example, messages in the UI PO file get reordered.
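To make the stability property concrete, a context tail could be derived by hashing the non-normalized msgid. This is only a plausible sketch: the hash choice (MD5) and tail length here are assumptions, not Pology's actual derivation; what matters is that the tail depends on the msgid alone, so reordering messages cannot change it.

```python
import hashlib

def context_tail(msgid, length=4):
    """Hypothetical disambiguation tail: a tilde plus a few alphanumeric
    characters derived deterministically from the non-normalized msgid."""
    return "~" + hashlib.md5(msgid.encode("utf-8")).hexdigest()[:length]

# Two messages flattened to the same normalized msgid get distinct,
# stable tails, because their original msgids differ.
print(context_tail("&Open"), context_tail("O&pen"))
```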
In general, the UI message may not be present in the same PO file in which it is referenced in other messages. This is always the case for documentation PO files. Therefore UI reference resolving hooks need to know two things: the list of all UI PO files (those from which UI references may be drawn), and, for each PO file which contains UI references, the list of PO files from which it may draw them.
The list of UI PO files can be given to resolving hooks explicitly, as a list of PO file paths (or directory paths to search for PO files). This can, however, be inconvenient, as it implies either that the resolution script must be invoked in a specific directory (if paths are relative), or that UI PO files must reside in a fixed directory on the system where the resolution script is run (if paths are absolute). Therefore there is another way of specifying paths to UI PO files: through an environment variable which contains a colon-separated list of directory paths. Both the explicit list of paths and the environment variable which contains the paths can be given as parameters to hooks.
By default, for a given PO file, UI references are looked for only in the PO file of the same name, assuming that it is found among the UI PO files. This may be sufficient, for example, for UI references in tooltips, but it is frequently not sufficient for documentation PO files, which may have different names from the corresponding UI PO files. Therefore a PO file can be manually linked to the UI PO files from which it draws UI references, through a special header field, X-Associated-UI-Catalogs. This field specifies only the PO domain names, as a space- or comma-separated list:

msgid ""
msgstr ""
"Project-Id-Version: foobar\n"
"..."
"X-Associated-UI-Catalogs: foobar libfoobar libqwyx\n"
The order of domain names in the list is important: if the referenced UI message exists in more than one linked PO file, the translation is taken from the one which appears earlier in the list. Knowing PO domain names, resolving hooks can look up the exact file paths in the supplied list of paths.
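The priority lookup implied by the list order can be sketched as follows; the function name and the catalog representation (a domain-name-to-dictionary map) are assumptions for illustration, not Pology's data model.

```python
def lookup_by_priority(text, catalog_names, catalogs):
    """Take the translation from the earliest catalog in the
    X-Associated-UI-Catalogs list that contains the UI text.
    'catalogs' maps domain name -> {msgid: msgstr}."""
    for name in catalog_names:
        cat = catalogs.get(name, {})
        if text in cat:
            return cat[text]
    return None  # unresolved: hooks would warn and keep the original text

catalogs = {"foobar": {"Open": "Otvori"}, "libfoobar": {"Open": "Otvaranje"}}
print(lookup_by_priority("Open", ["foobar", "libfoobar", "libqwyx"], catalogs))
```

Here "Open" exists in both foobar and libfoobar, and the translation from foobar wins because it appears earlier in the list.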
When a UI reference cannot be resolved, for whatever reason -- it does not exist, there is a context conflict, the message is not translated, etc. -- resolving hooks will output warnings and fall back to the original text.
For each resolving hook there exists the counterpart validation hook. Validation hooks may be used in a "dry run" before starting to build PO files for delivery, or they may be built into a general translation validation framework (such as Pology's validation rules).
There are great many possible mistakes to be made when translating. Some of these mistakes can only be observed and corrected by a human reviewer[31], and review is indeed an important part of the translation workflow. However, many mistakes, especially those more technical in nature, can be fully or partially detected by automatic means.
A number of tools are available to perform various checks on translation in PO files. The basic one is Gettext's msgfmt command, which, when run with the -c/--check option, will detect many "hard" technical problems. These are the kind of problems which may cause the program that uses the translation to crash, or may cause loss of information to the program user. Another is Translate Toolkit's pofilter command, which applies heuristic checks to detect common (and not so common) stylistic and semantic slips in translation. Dedicated PO editors may also provide some checks of their own, or make use of external batch tools.
One commonality of existing validation tools is that they aim for generality, that is, they try to apply a fixed battery of checks to all languages and environments (although some differentiation by translation project may be present, such as in pofilter). Another commonality, unavoidable in heuristic approaches, is the wrong detection of valid translation as invalid, the so-called "false positives". These two elements have a combined negative effect: since the number and specificity of the checks is not that great compared to what a dedicated translator could come up with for a given language and environment, and since many reported errors are false positives with no possibility of cancellation, the motivation to apply automatic checks sharply decreases; the more so the greater the amount of translation.
Pology therefore provides a system for users to assemble collections of validation rules adapted to their language and environment, with multi-level facilities for applying or skipping rules in certain contexts, pre-filtering of text before applying rules, and post-filtering and opening problematic messages in PO editors. Rules can be written and tuned in the course of translation, and false positives can be systematically canceled, such that over time the collection of rules becomes both highly specific and highly accurate. Since Pology supports language and environment variations from the ground up, such rule collections can be committed to Pology source distribution, so that anyone may use them when applicable.
Validation rules are primarily based on pattern matching with regular expressions, but they can in principle contain any Python code through Pology's hook system. For example, since there are spell-checking hooks provided, spell-checking can be easily made into one validation rule. One could even aim to integrate every available check into the validation rule system, such that it becomes the single and uniform source of all automatic checks in the translation workflow.
The primary tool in Pology for applying validation rules is the check-rules sieve. This section describes how to write rules, how to organize rule collections, and, importantly, how to handle false positives.
There are many nuances to the validation rule system in Pology, so it is best to start off with an example-based exposition of the main elements. Subsequent sections will then look into each element in detail.
Rules are defined in rule files, with flat structure and minimalistic syntax, since the idea is to write the rules during the translation (or the translation review). Here is one rule file with two rules:
# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
id="gram-contr"
hint="Do not use contractions."

{elevator}i
id="term-elevator"
hint="Translate 'elevator' as 'lift'."
valid msgstr="lift"
A rule file should begin with a comment telling something about the rules defined in the file. Then the rules follow, normally separated by one or more blank lines. Each rule starts with a trigger pattern, of which there are several types. The trigger pattern can sometimes be everything there is to the rule, but it is usually followed by a number of subdirectives.
The first rule above starts with a regular expression pattern on the translation, which is denoted by the [...] syntax. The regular expression matches English contractions, case-insensitively as indicated by the trailing i flag. The trigger pattern is followed by the id subdirective, which specifies an identifier for the rule (here gram-contr is short for "grammar, contractions"). The identifier does not have to be present, and does not even have to be unique if present (uses of rule identifiers will be explained later). If the rule matches a message, the message is reported to the user as problematic, along with the note provided in the hint subdirective.
The second rule starts with a regular expression pattern on the original (rather than the translation), for which the {...} syntax is reserved. Then the id and hint subdirectives follow, as in the first rule. But unlike the first rule, up to this point the second rule would be somewhat strange: report a problem whenever the word "elevator" is found in the original text? That is where the final valid subdirective comes in, by specifying a condition on the translation (msgstr=) which cancels the trigger pattern. So this rule effectively states "report every message which has the word 'elevator' in the original, but not the word 'lift' in the translation", making it a terminology assertion rule.
If the given example rule file is saved as personal.rules, it can be applied to a collection of PO files by the check-rules sieve in the following way:

$ posieve check-rules -s rfile:pathto/personal.rules PATHS...
The path to the rule file to apply is given by the rfile: sieve parameter. All messages which are "failed" by rules will be output to the terminal, with the spans of text that triggered the rule highlighted, and the note attached to the rule displayed after the message. Additionally, one of the parameters for automatically opening messages in a PO editor can be issued, to make correcting problems (or canceling false positives) all the more comfortable.
The rfile: sieve parameter can be repeated to add several rule files. If all rule files are put into one directory (and its subdirectories), a single rdir: parameter can be used to specify the path to that directory, and all files with the .rules extension will be recursively collected from it and applied. Finally, if rule files are put into Pology's rule directory for the given language, at lang/<language code>/rules/, then check-rules will automatically pick them up when neither the rfile: nor the rdir: parameter is issued. This is a simple way to test the rules if the intention is to include them in the Pology distribution.
Instead of applying all defined rules, the parameters rule:, rulerx:, norule:, and norulerx: of check-rules can be used to select specific rules to apply or not apply, by their identifiers. To apply only the no-contractions rule:
$ posieve check-rules -s rfile:pathto/personal.rules -s rule:gram-contr PATHS...
and to apply all but the terminology rules, assuming that their identifiers start with term-:
$ posieve check-rules -s rfile:pathto/personal.rules -s norulerx:term-.* PATHS...
When the rule trigger pattern is a regular expression, it can always be made more or less specific. The previous example of matching English contractions could be generalized like this:
[\w+'t\b]i
This regular expression will match one or more word characters (\w+) followed by 't ('t) positioned at a word boundary (\b). More general patterns increase the likelihood of false positives, but this is not really a problem, since tweaking the rules in the course of translation is expected. It is a bigger problem if the pattern is made too specific at first, such that it misses some cases. It is therefore recommended to start with "greedy" patterns, and then constrain them as false positives are observed.
However, tweaking trigger patterns can only go so far.[32] The workhorse of rule flexibility is instead the mentioned valid
subdirective. Within a single valid
directive there may be several tests, and many types of tests are provided. The trigger will be canceled if all the tests in the valid
subdirective are satisfied (boolean AND linking). There may be several valid
subdirectives, each with its own battery of tests, and then the trigger is canceled if any of the valid
subdirectives is satisfied (boolean OR linking). For example, to disallow a certain word in translation unless it is used in a few specific constructs, the following set of valid
subdirectives can be used:
[foo]i
id="style-nofoo"
hint="The word 'foo' is allowed only in '*goo foo' and 'foo bar*' constructs."
valid after="goo "
valid before=" bar"
The first valid
subdirective cancels the rule if the trigger pattern matched just after a "goo " segment, and the second if it matched just before a " bar" segment. Another example would be a terminology assertion rule where a certain translation is expected in general, but another translation is also allowed in a specific PO file:
{foobar}i
id="term-foobar"
hint="Translate 'foobar' as 'froobaz' (somewhere 'groobaz' allowed too)."
valid msgstr="froobaz"
valid msgstr="groobaz" cat="gfoo"
Here the second valid
subdirective uses the cat=
test to specify the other possible translation in the specific PO file. Tests can be negated by prepending !
to them, so to require the specific PO file to have only the other translation:
valid msgstr="froobaz" !cat="gfoo"
valid msgstr="groobaz" cat="gfoo"
When a regular expression is not sufficient as the rule trigger, a validation hook can be used instead (one of V* hook types). See Section 9.10, “Processing Hooks” for general discussion on hooks in Pology. For example, since there are spell-checking hooks already available, the complete rule for spell-checking could be:
*hook name="spell/check-spell-sp" on="msgstr" id="spelling" hint="Misspelled words detected."
The name= field specifies the hook, and the on= field which parts of the message it should operate on. The parts given by the on= field must be appropriate for the hook type; since spell/check-spell-sp is a V3A hook, it can operate on any string in the message, including the translation as requested here. Validation hooks can provide some notes of their own (here, a list of replacement suggestions for a misspelled word), which will be shown next to the note given by the rule's hint= subdirective.
The examples so far all suffer from one basic problem: the trigger pattern will fail to match a word which has an accelerator marker inside it.[33] This is actually an instance of a broader problem: some rules should operate on a somewhat modified, filtered text, instead of on the original text. This is why the rule system in Pology also provides extensive filtering capabilities. If the accelerator marker is _
(the underscore), here is how it could be removed before applying the rules:
# Personal rules of Horatio the Indefatigable.
addFilterRegex match="_" repl="" on="pmsgid,pmsgstr"
# Rules follow...
The addFilterRegex
directive sets a regular expression filter that will be applied to messages before any of the rules that follow. The match= field provides the pattern, the repl= field what to replace it with, and the on= field which parts of the message to filter.
The accelerator marker filter from the previous example is quite crude: it hardcodes the accelerator marker character, and it simply removes every occurrence of that character from the text. Filters too can be hooks instead of regular expressions, and in this case it is better to use the dedicated accelerator marker removal hook:
# Personal rules of Horatio the Indefatigable.
addFilterHook name="remove/remove-accel-msg" on="msg"
# Rules follow...
The remove/remove-accel-msg
hook is an F4A hook, and therefore the on=
field specifies the whole message as the target of filtering. This hook will use information from PO file headers and respect command line overrides to determine the accelerator marker character, and then remove it only from valid accelerator positions.
Filters do not have to be given as global directives, influencing all the rules below them; they can also be defined for a single rule, using one of the rule subdirectives. The other way around, a global filter can have a handle assigned (using the handle=
field), and then this handle can be used to remove the filter on a specific rule.
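For example, assuming the underscore is the accelerator marker (as in the filtering example above), a handle makes it possible to exempt a single rule from a global filter. The rule, its identifier, and the handle name here are purely illustrative:

```
# Global filter, tagged with a handle.
addFilterRegex match="_" repl="" on="pmsgstr" handle="accel"

# This rule looks for stray accelerator markers,
# so it must see the unfiltered text.
[_.*_]
id="style-dblaccel"
hint="More than one accelerator marker in the text."
removeFilter handle="accel"
```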
The last important concept in Pology's validation rule system is that of rule environments. The examples so far defined rules for a given language, which means that they in principle apply to any PO file of that language. This is generally insufficient (e.g. due to terminology differences between translation projects), so rules too can be made to support Pology's language and environment hierarchy. Going back to the initial rule file example, let us assume that "elevator" should always become "lift", but that English contractions are unacceptable only in more formal translations. Then the rule file could be modified to:
# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
environment formal
...

{elevator}i
...
The first rule now has the environment
subdirective, which sets this rule's environment to formal
. If check-rules is now run as before, only the second rule will be applied, as it is environment-agnostic. To apply the first rule as well, the formal
environment must be requested through the env:
sieve parameter:
$ posieve check-rules -s rfile:pathto/personal.rules -s env:formal PATHS...
Another way to request the environment is to specify it inside the PO file itself, through the X-Environment: header field. This is generally preferable: it reduces the number of command line arguments (which may sometimes be accidentally omitted), other parts of Pology can also make use of the environment information in the PO header, and, most importantly, it makes it possible for the PO files processed in a single run to belong to different environments.
If all the rules which belong to the formal environment are grouped at the end of the rule file, then the global environment
directive can be used to set the environment for all of them, instead of the subdirective on each of them:
# Personal rules of Horatio the Indefatigable.

{elevator}i
...

environment formal

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
...
A more usual application of the global environment
directive is to split environment-specific rules into a separate file, and then put the environment
directive at the top. Most flexibly, valid
subdirectives provide the env=
test, so that the rule trigger can be canceled in a condition including the environment. In the running example, this could be used as:
# Personal rules of Horatio the Indefatigable.

[don't|can't|isn't|aren't|won't|shouldn't|wouldn't]i
...
valid !env="formal"

{elevator}i
...
Which method of environment sensitivity to use depends on the particular organization of the rule files, and on the types of rules. Filters too are sensitive to environments, either conforming to global environment directives in the same way as rules, or using their own env= fields.
When requesting environments in validation runs (through env:
sieve parameter or X-Environment:
header field), more than one environment can be specified. Then the rules from all those environments, plus the environment-agnostic rules, will be applied. Here comes another function of rule identifiers (provided with the id=
rule subdirective): if two rules in different environments have same identifier, then the rule from the more specific environment overrides the rule from the less specific environment. The more specific environment is normally taken to be the one encountered later in the requested environment list.
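As an illustration of such overriding, consider this sketch (the rule content and the environment name are hypothetical): both rules carry the same identifier, so when the second environment is requested, the latter rule replaces the former.

```
# Environment-agnostic terminology rule.
{\belevator\b}i
id="term-elevator"
valid msgstr="lift"

# Override for a hypothetical 'us' environment; same identifier,
# so it replaces the rule above when 'us' is in effect.
{\belevator\b}i
id="term-elevator"
environment us
valid msgstr="elevator"
```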
Rule files are kept simple, to facilitate easy editing without verbose syntax getting in the way. A rule file has the following layout:
# Title of the rule collection.
# Author name.
# License.

# Directives affecting all the rules.
global-directive
...
global-directive

# Rule 1.
trigger-pattern
subdirective-1
...
subdirective-n

# Rule 2.
trigger-pattern
subdirective-1
...
subdirective-n

...

# Rule N.
trigger-pattern
subdirective-1
...
subdirective-n
The rather formal top comment (license, etc.) is required for rule files inside the Pology distribution. In most contexts rule files are expected to have the .rules extension, so it is best to always use it (it is mandatory for internal rule files). Rule files must be UTF-8 encoded.
The rule trigger is most often a regular expression pattern, given within curly or square brackets, {...}
or [...]
, to match the original or the translation part of the message, respectively. The closing bracket may be followed by single-character matching modifiers, as follows:
i
: case-sensitive matching for all patterns in the rule, including but not limited to the trigger pattern. Default matching is case-insensitive.
Bracketed patterns are the shorthand notation, which is sufficient most of the time. There is also the more verbose notation *message-part/regex/modifiers, where instead of / any other non-letter character can be used consistently as the separator. The verbose notation is needed when some part of the message other than the original or the translation should be matched, or when brackets would cause balancing issues (e.g. when a closing curly bracket without the opening bracket is part of the match for the original text). For all messages, message-part can be one of the following keywords:
msgid: match on the original text
msgstr: match on the translation
msgctxt: match on the disambiguating context
For example, {foobar}i is equivalent to *msgid/foobar/i.
For plural messages, msgid/.../ (and equivalently {...}) tries to match either the msgid or the msgid_plural string, whereas msgstr/.../ (and [...]) tries to match any msgstr string. If only one particular of these strings should be matched, the following keywords can be used as well:
msgid_singular: match only the msgid string
msgid_plural: match only the msgid_plural string
msgstr_N: match only the msgstr string with index N
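For instance, rules that should trigger only on one particular string of a plural message could use the verbose notation like this (the patterns themselves are illustrative):

```
# Trigger only on the plural form of the original text.
*msgid_plural/%d files/
...

# Trigger only on the first translation string (index 0).
*msgstr_0/file/
...
```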
When regular expressions on message strings are not sufficient as rule triggers, a hook can be used instead. Hooks are described in Section 9.10, “Processing Hooks”. Since hooks are Python functions, in principle any kind of test can be performed by them. A rule with the hook trigger is defined as follows:
*hook name="hookspec" on="part" casesens="[yes|no]"
# Rule subdirectives follow...
The name=
field provides the hook specification. Only V* type (validation) hooks can be used in this context. The on=
field defines on which part of the message the hook will operate, and needs to conform to the hook type. The following message parts can be specified, with associated hook types:
msg: the hook applies to the complete message; for type V4A hooks.
msgid: the hook applies to the original text (msgid, msgid_plural), but considering other parts of the message; for type V3A and V3B hooks.
msgstr: the hook applies to the translation text (all msgstr strings), but considering other parts of the message; for type V3A and V3C hooks.
pmsgid: the hook applies to the original text alone, without considering the rest of the message; for type V1A hooks.
pmsgstr: the hook applies to the translation alone, without considering the rest of the message; for type V1A hooks.
The casesens=
field in trigger hook specification controls whether the patterns in the rest of the rule (primarily in valid
subdirectives) are case-sensitive or not. This field can be omitted, and then patterns are case-sensitive.
If the rule trigger pattern matches (or the trigger hook reports some problems), the message is by default considered "failed" by the rule. The message may still be passed by the subdirectives that follow, which test whether some additional conditions hold.
There are several types of rule subdirectives. The main subdirective is valid, which provides additional tests to pass a message failed by the trigger pattern. The tests are given as a list of name="pattern" entries. For a valid subdirective to pass the message, all its tests must hold, and if any of the valid subdirectives passes the message, then the rule as a whole passes it. Effectively, this means a boolean AND relationship within a subdirective, and OR across subdirectives.
The following tests are currently available in valid
subdirectives:
msgid="REGEX"
The original text (msgid or msgid_plural string) must match the regular expression.

msgstr="REGEX"
The translation (any msgstr string) must match the regular expression.

ctx="REGEX"
The disambiguating context (msgctxt string) must match the regular expression.

srcref="REGEX"
The file path of one of the source references (in the #: ... comment) must match the regular expression.

comment="REGEX"
One of the extracted or translator comments (#. ... or # ...) must match the regular expression.

span="REGEX"
The text segment matched by the trigger pattern must match this regular expression as well.

before="REGEX"
The text segment matched by the trigger pattern must be placed exactly before one of the text segments matched by this regular expression.

after="REGEX"
The text segment matched by the trigger pattern must be placed exactly after one of the text segments matched by this regular expression.

cat="DOMAIN1,DOMAIN2,..."
The PO domain name (i.e. the MO file name without the .mo extension) must be contained in the given comma-separated list of domain names.

catrx="REGEX"
The PO domain name must match the regular expression.

env="ENV1,ENV2,..."
The operating environment must be contained in the given comma-separated list of environment keywords.

head="/FIELD-REGEX/VALUE-REGEX"
The PO file header must contain the field and value combination, each specified by a regular expression pattern. Instead of /, any other character may be used consistently as the delimiter for the field regular expression.
Each test can be negated by prefixing it with !. For example, !cat="foo,bar" will match if the PO domain name is neither foo nor bar. Tests are "short-circuiting", so it is good for performance to put simple direct matching tests (e.g. cat=, env=) before the more expensive regular expression tests (msgid=, msgstr=, etc.).
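For example, this ordering within a valid subdirective lets the cheap environment test reject most cases before the regular expression test is even attempted (the environment and term names are hypothetical):

```
{\bfroobaz\b}i
id="term-froobaz"
# The cheap direct env= test short-circuits before the more
# expensive msgstr regular expression test is attempted.
valid env="fancy" msgstr="groobaz"
valid msgstr="froobaz"
```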
Subdirectives other than valid set states and properties of the rule. Property subdirectives are written simply as property="value". These include:

hint="TEXT"
A note to show to the user when the rule fails a message.

id="IDENT"
An "almost unique" identifier for the rule (see Section 8.5.6, “Effect of Rule Environments”).
State subdirectives are given by the directive name, possibly followed by keyword parameters: directive arg1 .... These can be:
validGroup GROUPNAME
Includes a previously defined standalone group of valid
subdirectives.
environment ENVNAME
Sets the environment in which the rule is applied.
disabled
Disables the rule, so that it is no longer applied to messages. A disabled rule can still be applied by explicit request (e.g. using the rule: parameter of the check-rules sieve).
manual
Makes it necessary to manually apply the rule to a message, by using one of the special translator comments (e.g. apply-rule:).
addFilterRegex
, addFilterHook
, removeFilter
A group of subdirectives to define filters which are applied to messages before the rule is applied to them. See Section 8.5.7, “Filtering Messages”.
Global directives are typically placed at the beginning of a rule file, before any rules. They define common elements for all rules to use, or set state for all rules below them. A global directive can also be placed in the middle of the rule file, between two rules, in which case it affects all the rules that follow it, but not those that precede it. The following global directives are defined:
validGroup
Defines common groups of valid
subdirectives, which can be included by any rule using the validGroup
subdirective:
# Global validity group.
validGroup passIfQuoted
valid after="“" before="”"
valid after="‘" before="’"
...

# Rule X.
{...}
validGroup passIfQuoted
valid ...
...

# Rule Y.
{...}
validGroup passIfQuoted
valid ...
...
environment
Sets a specific environment for the rules that follow, unless overridden with the namesake rule subdirective:
# Global environment.
environment FOO
...

# Rule X, belongs to FOO.
{...}
...

# Rule Y, overrides to BAR.
{...}
environment BAR
...
See Section 8.5.6, “Effect of Rule Environments” for details on use of environments.
include
Used to include files into rule files:
include file="foo.something"
If the file to include is specified by relative path, it is taken as relative to the file which includes it.
The intent behind the include directive is not to include one rule file into another (files with the .rules extension), because normally all rule files in a directory are automatically included by the rule applicator (e.g. the check-rules sieve). Instead, included files should have an extension different from .rules, and contain a number of directives needed in several rule files; for example, a set of filters.
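A sketch of this arrangement, with hypothetical file names:

```
# File: common.filters (note: not a .rules extension)
addFilterRegex match="<.*?>" on="pmsgid,pmsgstr"
addFilterRegex match="&\w+;" on="pmsgid,pmsgstr"
```

```
# File: terminology.rules
include file="common.filters"
# Rules follow...
```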
addFilterRegex
, addFilterHook
, removeFilter
A group of directives to define filters which are applied to messages before the rules are applied. See Section 8.5.7, “Filtering Messages”.
When there are no environment
directives in a rule file, either global or as rule subdirectives, then all rules in that rule file are considered as being "environment-agnostic". When applying a rule set (e.g. with the check-rules sieve), the applicator may be put into one or more operating environments, either by specifying them as arguments (e.g. in command line) or in PO file headers. If one or more operating environments are given and the rule is environment-agnostic, it will be applied to the message irrespective of the operating environments. However, if there were some environment
directives in the rule file, some rules will be environment-specific. An environment-specific rule will be applied only if its environment matches one of the set operating environments.
Rule environments are used to control the application of rules across different translation environments (projects, teams, people). Some rules may be common to all environments, some may be somewhat common, and some not common at all. Common rules would then be made environment-agnostic (i.e. not covered by any environment directive), while entirely non-common rules would be provided in separate rule files per environment, with one global environment directive in each.
How to handle "somewhat common" rules depends on circumstances. They could simply be defined as environment-specific, just like non-common rules, but this may reduce the body of common rules too much for the sake of a few peculiar environments. Another way would be to define them as environment-agnostic, and then override them in certain environments. This is done by giving the environment-specific rule the same identifier (the id subdirective) as that of the environment-agnostic rule. It may also happen that the bulk of the rule is environment-agnostic, except for a few tests in valid
subdirective) as that of the environment-agnostic rule. It may also happen that the bulk of the rule is environment-agnostic, except for a few tests in valid
subdirectives which are not. In this case, env=
and !env=
tests can be used to differentiate between environments.
It is frequently advantageous to apply a set of rules not on the message as it is, but on a suitably filtered variant. For example, if rules are used for terminology checks, it would be good to remove any markup from the text; otherwise, an <email>
tag in the original could be understood as a real word, and a warning issued for missing the expected counterpart in the translation.
Filter sets are created using addFilter*
directives, global or within rules:
# Remove XML-like tags.
addFilterRegex match="<.*?>" on="pmsgid,pmsgstr"

# Remove long command-line options.
addFilterRegex match="--[\w-]+" on="pmsgid,pmsgstr"

# Rule A will act on a message filtered by the previous two directives.
{...}
...

# Remove function calls like foo(x, y).
addFilterRegex match="\w+\(.*?\)" on="pmsgid,pmsgstr"

# Rule B will act on a message filtered by the previous three directives.
{...}
...
Filters are added cumulatively to the filter set, and the current set affects all the rules below it.[34] If an addFilter* directive appears within a rule, it adds the filter only to the filter set of that rule:
# Rule C, with an additional filter just for itself.
{...}
addFilterRegex match="grep\(1\)" on="pmsgstr"
...

# Rule D, sees only the previous global filter additions.
{...}
...
These examples illustrate use of the addFilterRegex
directive, which is described in more detail below, as well as other addFilter*
directives.
All addFilter* directives have the on= field. It specifies the message part on which the filter should operate, similarly to the on= field in hook rule triggers. Unlike in triggers, in filters it is possible to state several parts to filter, as a comma-separated list. The following message parts are exposed for filtering:
msg: filter the "complete" message. What this means exactly depends on the particular filter directive.
msgid: filter the original text (msgid, msgid_plural), possibly taking into account other parts of the message.
msgstr: filter the translation (all msgstr strings), possibly taking into account other parts of the message.
pmsgid: filter the original text alone.
pmsgstr: filter the translation alone.
pattern: a quasi-part, to filter not the message but all matching patterns (regular expressions, substring tests, equality tests) in the rules themselves.
Not all filter directives can filter on all of these parts. Admissible parts are listed with each filter directive.
To remove a filter from the current filter set, addFilter*
directives can define the filter handle, which can then be given to a removeFilter
directive:
addFilterRegex match="<.*?>" on="pmsgid,pmsgstr" handle="tags"

# Rule A, the "tags" filter applies to it.
{...}
...

# Rule B, removes the "tags" filter only for itself.
{...}
removeFilter handle="tags"
...

# Rule C, the "tags" filter applies to it again.
{...}
...

removeFilter handle="tags"

# Rule D, the "tags" filter does not apply to it or any following rule.
{...}
...
Several filters may share the same handle, in which case the removeFilter directive removes all of them from the current filter set. One filter can have more than one handle, given as a comma-separated list in the handle= field, and then it can be removed from the filter set by any of those handles. Likewise, the handle= field in a removeFilter directive can state several handles by which to remove filters. removeFilter as a rule subdirective influences the complete rule, regardless of its position among the other subdirectives.
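The interplay of multiple handles might look as follows (the filters and handle names are illustrative):

```
addFilterRegex match="<.*?>" on="pmsgstr" handle="tags,markup"
addFilterRegex match="&\w+;" on="pmsgstr" handle="markup"

# Removes both filters above, since they share the "markup" handle.
removeFilter handle="markup"
```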
The clearFilters directive is used to completely clear the filter set. It has no fields. Like removeFilter, it can be issued either globally or as a rule subdirective.
A filter may be added or removed only in certain environments, specified by the env=
field in addFilter*
and removeFilter
directives.
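For example, a filter could be set to operate only in one environment (the environment name here is hypothetical):

```
# Strip XML-like tags, but only when checking in the 'informal' environment.
addFilterRegex match="<.*?>" on="pmsgstr" env="informal"
```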
Currently the following directives for adding filters are available:
addFilterRegex
Parts of the text to remove are determined by a regular expression match. The pattern is given by the match=
field. If, instead of simple removal of the matched segment, a replacement is wanted, the repl=
field is used to specify the replacement string (it can include backreferences to regex groups in the pattern):
# Replace in translation the %<number> format directives with a tilde.
addFilterRegex match="%\d+" repl="~" on="pmsgstr"
Case-sensitivity of matching can be changed by adding the casesens=[yes|no]
field; default is case-sensitive matching.
Applicable (on=
field) to pmsgid
, pmsgstr
, and pattern
.
addFilterHook
Text is processed with a filtering hook (F* hook types). The hook specification is given by the name=
field. For example, to remove accelerator markers from UI messages in a smart way, while checking various sources for the exact accelerator marker character (command line, PO file header), this filter can be set:
addFilterHook name="remove/remove-accel-msg" on="msg"
Applicable (on=
field) to msg
(for F4A hooks), msgid
(F3A, F3B), msgstr
(F3A, F3C), pmsgid
(F1A), pmsgstr
(F1A), and pattern
(F1A).
Filtering may be run-time expensive, and in practical use it normally is. Therefore the rule applicator will try to create and apply as few unique filter sets as possible, by considering their signatures -- a hash of the ordering, types, and fields of the filters in the set for a given rule. Each message will be filtered only as many times as there are different filter sets, rather than once for every rule. The appropriate filtered version of the message will then be given to each rule according to its filter set.
This means that you should be careful when adding and removing filters, in order to create no more filter sets than really necessary. For example, you may know that filters P and Q can be applied in any order, and in one rule file specify P followed by Q, but in another rule file Q followed by P. However, the rule applicator must assume that the order of filters is significant, so it will create two filter sets, PQ and QP, and spend twice as much time filtering.
For big filter sets which are needed in several rule files, it is best to split them out into a separate file and use the include global directive to pull them in at the beginning of each rule file.
In all the examples so far, ASCII double quotes were used as value delimiters ("..."). However, just as in the verbose notation for trigger patterns (*msgid/.../, etc.), all quoted values can in fact consistently use any other non-alphanumeric character as the delimiter (e.g. single quote, slash, etc.). Alternatively, literal quotes inside a value can be escaped by prefixing them with \ (backslash). Values which are regular expressions are sent to the regular expression engine without resolving any escapes other than for the quote character itself.
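A small sketch of these quoting possibilities (the tests themselves are hypothetical):

```
# Single quotes as delimiters, so the inner double quotes need no escaping.
valid msgstr='say "hello"'
# Equivalently, escaped double quotes inside a double-quoted value.
valid msgstr="say \"hello\""
```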
The general statement terminator in a rule file is the newline, but if a line would be too long, it can be continued into the next line by putting \
(backslash) in the last column.
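For example, a valid subdirective with several tests might be wrapped like this (the catalog and term names are hypothetical):

```
{\bfroobaz\b}i
valid cat="gfoo" \
      msgstr="groobaz"
```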
As explained earlier, it is very important to have a thorough system of handling false positives in validation rules. There are several levels at which false positives can be canceled; they will be described in the following, going from the nearest to the furthest from the rule definition itself. Some guidelines on when to use which level will also be provided, but keep in mind that this is far from a well-examined topic.
The disabled
subdirective can be added to the rule to disable its application. This may seem a quaint method of "handling false positives", but it is not outright ridiculous, because a disabled rule can still be applied by directly requesting it (e.g. the rule: parameter of check-rules). This is useful for rules which produce too many false positives to be applied as part of a rule set, but which are still better than ad-hoc searches. In other words, such rules can be understood as codified special searches, which you would run only when you have enough time to wade through all the false positives in search of the few real problems.
The first real way of canceling false positives is by making the regular expression pattern for the rule trigger less greedy. For example, the trigger pattern for the terminology rule on "tool" could be written at first as:
{\btool}i
This will match any word that starts with tool, due to the \b word-boundary token at the start of the pattern. The word boundary is not repeated at the end, with the intention of also catching the plural form, "tools". But this pattern will also match the word "toolbar", which may have its own rule. The pattern can then be restricted to really match only "tool" and "tools", in several ways, for example:
{\btools?\b}i
Now the word boundary is placed at the end as well, and the optional letter 's' is inserted (? means "zero or one occurrence of the preceding element"). Another way would be to write out both forms in full:
{\b(tool|tools)\b}i
The parentheses are needed because the OR operator | has lower priority than the word boundary \b, so without them the meaning would be "a word which starts with 'tool' or ends with 'tools'".
Python regular expressions, which are used in rule patterns, have rich special features, but these are frequently better not used in rules. For example, the trigger for the terminology rule on "line" (of text) could at first be written as:
{\blines?\b}i
But this would also catch the phrase "command line", which, as a standalone concept, may have its own rule. To avoid this match, a proficient user of regular expressions might think of adding a negative lookbehind to the trigger pattern:
{(?<!command )\blines?\b}i
However, it is much less cryptic and more extensible to add a valid
subdirective instead:
{\blines?\b}i valid after="command "
This cancels the rule if the word "line" was matched just after the word "command", while clearly showing the special-case context.
valid
subdirectives are particularly useful for wider rule cancelations, such as by PO domain (catalog) name. For example, the word "wizard" could be translated differently when denoting a step-by-step dialog in a utilitarian program and a learned magic wielding character in a computer game. Then the cat=
test can be used to allow the other term in the game's PO file:
{\bwizard}i
valid msgstr="term-for-step-by-step-dialog"
valid cat="foodungeon" msgstr="term-for-magician"
This requires specifying the domain names of all games with wizard characters to which the rule set is applied, which may not be that comfortable. Another way could be to introduce the fantasy
environment and use the env=
test:
{\bwizard}i
valid msgstr="term-for-step-by-step-dialog"
valid env="fantasy" msgstr="term-for-magician"
and to add the fantasy
environment into the header of the PO file that needs it.
Sometimes there is just a single strange message that falsely triggers the rule, such that there is nothing to generalize about the false positive. You could still cancel this false positive in the rule definition itself, by adding a valid
directive with the cat=
test for the PO domain name and msgid=
test to single out the troublesome message:
{\bfroobaz}i
id="term-froobaz"
valid msgstr="..."
valid cat="foo" msgid="the amount of froobaz-HX which led to"
However, rules are supposed to be at least somewhat general, and singling out a particular message in a rule is as excessive a non-generality as it gets. It is also a maintenance problem: the message may disappear in the future, leaving cruft in the rule file, or it may change slightly, but enough for the msgid=
test not to match it any more.
A much better way of skipping a rule on a particular message is by adding a special translator comment to that message, in the PO file:
# skip-rule: term-froobaz
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
The comment starts with skip-rule:
, and is followed by a comma-separated list of rules to skip, by their identifiers (defined by id=
in the rule).
The other way around, a rule can be set for manual application only, by adding the manual
subdirective to it. Then the apply-rule:
translator comment must be added to apply that rule to a particular message:
# apply-rule: term-froobaz
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
There is a pattern where an automatic rule and a manual rule are somehow closely related, so that on a particular message the automatic one should be skipped and the manual one applied. To make this pattern obvious and avoid adding two translator comments (both skip-rule:
and apply-rule:
), a single switch-rule:
comment can be added instead:
# switch-rule: term-froobaz > term-froobaz-chem
msgid "...the amount of froobaz-HX which led to..."
msgstr "..."
The rule before > is skipped, and the rule after > is applied. Several rules can be stated as a comma-separated list, on both sides of >.
There is a catch to the translator comment approach, though. When the message becomes fuzzy, it depends on the new text whether the rule application comment should be kept or removed. This means that on fuzzy messages translators have to observe and adapt translator comments just as they adapt the msgstr strings. Unfortunately, some translators do not pay sufficient attention to translator comments, which is further exacerbated by some PO editors not presenting translator comments conspicuously enough (or not even allowing them to be edited). However, from the point of view of PO translation workflow, not giving full attention to translator comments is plainly an error: unwary translators should be told better, and deficient PO editors should be upgraded.[35]
Sometimes it is possible to do better than plainly skipping a rule on a message. Consider the following message:
#: dialogs/ScriptManager.cpp:498
msgid "Please refer to the console debug output for more information."
msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
An observant translator could conclude that "console" is not the best choice of term in the original text, that "shell" (or "terminal") would be more accurate, and translate the message as if the more accurate term was used in the original. However, this could cause the terminology rule for "console" (in its accurate meaning) to complain about the proper term missing in translation. Adding a skip-rule: term-console comment would indeed cancel this false positive, but what about the terminology rule on "shell"? There is nothing in the original text to trigger it and check for the proper term in translation.
This example is an instance of the general case where the translator would formulate the original text somewhat differently, and make the translation based on that reformulation. The same holds when the mere style of the original causes a rule to be falsely triggered, while a differently worded original would be just fine. In such cases, instead of adding a comment to crudely skip a rule, the translator can add a comment to rewrite the original text before rules are applied to it:
# rewrite-msgid: /console/shell/
#: dialogs/ScriptManager.cpp:498
msgid "Please refer to the console debug output for more information."
msgstr "Pogledajte ispravljački izlaz u školjci za više podataka."
The rewrite directive comment starts with rewrite-msgid: and is followed by the search regular expression and the replacement string, delimited with / or another non-alphanumeric character. With this rewrite, the wrong terminology rule, for "console", will not be triggered, while the correct rule, for "shell", will be.
At the moment, unlike skip-rule:, rewrite-msgid: is not an integral part of the rule system. It is instead implemented as a filtering hook. So to use it, this filter must be added into rule files (or into the filter set file included by rule files):
addFilterHook name="remove/rewrite-msgid" on="msg"
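To make the behavior of the directive concrete, here is a hedged sketch of how such a rewrite could be parsed and applied to a message's original text. The function name and data shapes are made up for illustration; this is not Pology's actual hook implementation.

```python
import re

def apply_msgid_rewrites(translator_comments, msgid):
    # Hypothetical helper: scan translator comments for rewrite-msgid:
    # directives and apply each one to the original text. The search
    # pattern and replacement are delimited by the first non-alphanumeric
    # character after the keyword (here '/').
    for comment in translator_comments:
        text = comment.strip()
        if not text.startswith("rewrite-msgid:"):
            continue
        spec = text[len("rewrite-msgid:"):].strip()
        delim = spec[0]
        _, search, replacement = spec.split(delim)[:3]
        msgid = re.sub(search, replacement, msgid)
    return msgid

apply_msgid_rewrites(
    ["rewrite-msgid: /console/shell/"],
    "Please refer to the console debug output for more information.")
# -> "Please refer to the shell debug output for more information."
```

With the comment from the example above, the rule system would then see "shell" instead of "console" in the original text, so the terminology rule for "shell" gets its chance to check the translation.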
Sometimes it is not quite clear whether to skip a rule or rewrite the original, that is, whether to use a skip-rule: or a rewrite-msgid: comment. A guideline could be as follows. If the concept covered by the falsely triggered rule is present but somewhat camouflaged in the original, or one concept is switched for another (such as "console" with "shell" in the example above), then rewrite-msgid: should be used to "normalize" the original text. If the original text has nothing to do with the concept covered by the triggered rule, then skip-rule: should be used. An example of the latter would be this message from a game:
# skip-rule: term-shell
#: src/tanks_options.cpp:249
msgid "Fire shells upward"
Here the word "shell" denotes a cannon shell, which has nothing to do with the term-shell rule for the operating system shell, and the rule is therefore skipped.
Consider a message extracted from a .desktop file, representing the name of a GUI utility:
#. field: Name
#: data/froobaz.desktop:5
msgid "Froobaz Image Examiner"
msgstr ""
Program names from .desktop files can be read and presented to the user by any other program. For example, when an image is right-clicked in a file browser, it could offer to open the file with the utility named with this message. In the PO file of that file browser, the message for the menu item could be:
#. TRANSLATORS: %s is a program name, to open a file with.
#: src/contextmenu.c:5
msgid "Open with %s"
msgstr ""
In languages featuring noun inflection, it is likely that the program name in this message should be in a grammatical case other than the nominative (basic) case. This means that simply inserting the name read from the .desktop file into the directly translated text will produce a grammatically incorrect phrase. The translator may try to adapt the message to the nominative form of the name (by shuffling words, adding "helper" words, adding punctuation), but this will produce a stylistically suboptimal phrase. That is, style will be sacrificed for grammar. In order not to have to make such compromises, now or in the future, certain translation scripting systems may be available atop the PO format[36], which would, in this example, enable the translator to specify which non-nominative form of the program name to fetch and insert.
Whatever the shape the translation scripting system takes, different forms of phrases have to be derived somehow for use by that system. Given the nuances of spoken languages, fully automatic derivation is probably not going to be possible[37]. Pology therefore provides the syntagma[38] derivator system (synder for short), which allows manual derivation of phrase forms and properties with minimal verbosity, using macro expansion based on partial regularities in the grammar.
Syntagma derivations can be written and maintained in a standalone plain text file, although currently Pology provides no end-user functionality to convert such files (i.e. derive all forms defined by them) into formats which a target translation system could consume. Instead, one can make use of the Synder class from the pology.synder module to construct custom converters. Of course, in the future, such converters may become part of Pology. There are already syntax highlighting definitions for the synder file format, for some text editors, in the syntax/ directory of the Pology distribution.
What is provided right now in terms of end-user functionality is the collect-pmap sieve. It enables translators to write syntagma derivations in translator comments in PO messages, and then extract them (deriving all forms) into a file in the appropriate format for the target translation system. The example message above from the .desktop file could be equipped with a synder entry like this:
# synder: Frubaz|ov ispitiv|ač slika
#. field: Name
#: data/froobaz.desktop:5
msgid "Froobaz Image Examiner"
msgstr "Frubazov ispitivač slika"
The translator comment starts with the keyword synder:, and is followed by the synder entry which defines all the needed forms of the translated name. As can be seen, the synder entry is quite compact, exactly two characters longer than the pure translated name, and yet it defines over a dozen forms and some properties (gender, number) of the name.
The rest of this section describes the syntax of synder entries, and the layout and organization of synder files. As an example application, we consider a dictionary of proper names, where for each name in the source language we want to define the basic name and some of its forms and properties in the target language.
For the name in source language Venus and in target language Venera, we could write the following simplest derivation, which defines only the basic form in the target language:
Venus: =Venera
Venus is the key syntagma or the derivation key, and it is separated by the colon (:) from the properties of the syntagma. Properties are written as key=value pairs and separated by commas; in =Venera, the property key is the empty string, and the property value is Venera.
We would now like to define some grammar cases in the target language. Venera is the nominative (basic) case, so instead of the empty string we set nom as its property key. Other cases that we want to define are genitive (gen) Venere, dative (dat) Veneri, and accusative (acc) Veneru. Then we can write:
Venus: nom=Venera, gen=Venere, dat=Veneri, acc=Veneru
By this point, everything is written out manually, and there are no "macro derivations" to speak of. But observe the difference between different grammar cases of Venera -- only the final letter changes. Therefore, we first write the following base derivation for this system of case endings alone, called declension-1:
|declension-1: nom=a, gen=e, dat=i, acc=u
A base derivation is normally also hidden, by prepending | (pipe) to its key syntagma. We make it hidden because it should be used only in other derivations, and does not represent a proper entry in our dictionary example. In the processing stage, derivations with hidden key syntagmas will not be offered on queries into the dictionary. We can now use this base derivation to shorten the derivation for Venus:
Venus: Vener|declension-1
Here Vener is the root, and |declension-1 is the expansion, which references the previously defined base derivation. The final forms are derived by inserting the property values found in the expansion (a from nom=a, e from gen=e, etc.) at the position where the expansion occurs, for each of the property keys found in the expansion, thus obtaining the desired properties (nom=Venera, gen=Venere, etc.) for the current derivation.
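The core of this mechanism can be sketched in a few lines of Python. This is an illustration of the idea only, not Pology's code:

```python
def expand(root, base_properties):
    # For each property in the base derivation, insert its value at the
    # position of the expansion, i.e. right after the root.
    return {key: root + ending for key, ending in base_properties.items()}

declension_1 = {"nom": "a", "gen": "e", "dat": "i", "acc": "u"}
venus = expand("Vener", declension_1)
# venus now maps "nom" to "Venera", "gen" to "Venere", and so on.
```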
Note that declension-1 may be a too verbose name for the base derivation. If the declension type can be identified by the stem of the nominative case (here a), we could write much more natural derivations:
|a: nom=a, gen=e, dat=i, acc=u
Venus: Vener|a
Now the derivation looks just like the nominative case alone, only having the root and the stem separated by |.
The big gain of this transformation is, of course, when there are many syntagmas having the same declension type. Other such source-target pairs could be Earth and Zemlja, Europe and Evropa, Rhea and Reja, so we can write:
|a: nom=a, gen=e, dat=i, acc=u
Venus: Vener|a
Earth: Zemlj|a
Europe: Evrop|a
Rhea: Rej|a
From this it can also be seen that derivations are terminated by newline. If necessary, a single derivation can be split into several lines by putting a \ character (backslash) at the end of each line but the last.
Expansions are implicitly terminated by a whitespace or a comma, or by another expansion. If these characters are part of the expansion itself (i.e. of the key syntagma of the derivation that the expansion refers to), or the text continues right after the expansion without a whitespace, curly brackets can be used to explicitly delimit the expansion:
Alpha Centauri: Alf|{a}-Kentaur
Any character which is special in the current context may be escaped with a backslash. Only the second colon here is the separator:
Destination\: Void: Odredišt|{e}: ništavilo
because the first colon is escaped, and the third colon is not in the context where colon is a special character.
A single derivation may state more than one key syntagma, comma-separated. For example, if the syntagma in source language has several spellings:
Iapetus, Japetus: Japet|
The key syntagma can also be an empty string. This is useful for base derivations when stem-naming is used and the stem happens to be null, such as in the previous example. The derivation to which this empty expansion refers would be:
|: nom=, gen=a, dat=u, acc=
Same-valued properties do not have to be repeated; instead, several property keys can be linked to one value, separated with & (ampersand). In the previous base derivation, the nom= and acc= properties could be unified in this way, resulting in:
|: nom&acc=, gen=a, dat=u
Synder files may contain comments, which start with # and continue to the end of the line:
# A comment.
Venus: Vener|a # another comment
A single derivation may contain more than one expansion. There are two distinct types of multiple expansion, outer and inner.
Outer multiple expansion is used when it is advantageous to split derivations by grammar classes. The examples so far were only deriving grammar cases of nouns, but we may also want to define possessive adjectives per noun. For Venera, the possessive adjective in nominative case is Venerin. Using the stem-naming of base derivations, we could write:
|a: … # as above
|in: … # possessive adjective
Venus: Vener|a, Vener|in
Expansions are resolved from left to right, with the expected effect of derived properties accumulating along the way. The only question is what happens if two expansions produce properties with same keys but different values. In this case, the value produced by the last (rightmost) expansion overrides previous values.
Inner multiple expansion is used on multi-word syntagmas, when more than one word needs expansion. For example, the source syntagma Orion Nebula has the target pair Orionova maglina, in which the first word is a possessive adjective, and the second word a noun. The derivation for this is:
|a: … # as above
|ova>: … # possessive adjective as noun, > is not special here
Orion Nebula: Orion|ova> maglin|a
Inner expansions are resolved from left to right, such that everything to the right of the expansion currently being resolved is treated as literal text. If all expansions define the same properties by key, then the total derivation will have all those properties, with values derived as expected. However, if there is some difference in property sets, then the total derivation will get their intersection, i.e. only those properties found in all expansions.
Both outer and inner expansion may be used in a single derivation.
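The two resolution orders can be sketched as follows, in a simplified model where each expansion is represented by its preceding literal text plus the base derivation's properties. This is an illustration, not Pology's implementation:

```python
def outer_expand(parts):
    # Outer multiple expansion: resolve left to right; properties
    # accumulate, and on a key clash the rightmost value overrides.
    properties = {}
    for root, base in parts:
        properties.update({key: root + value for key, value in base.items()})
    return properties

def inner_expand(parts):
    # Inner multiple expansion: concatenate per key across all parts;
    # only keys defined by every expansion survive (the intersection).
    keys = set(parts[0][1])
    for _, base in parts[1:]:
        keys &= set(base)
    return {key: "".join(root + base[key] for root, base in parts)
            for key in keys}

a_decl = {"nom": "a", "gen": "e"}
ova_decl = {"nom": "ova", "gen": "ove"}
inner_expand([("Orion", ova_decl), (" maglin", a_decl)])
# -> {"nom": "Orionova maglina", "gen": "Orionove magline"}
```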
An expansion can be made to include not all the properties defined in the referred-to derivation, but only a subset of them. It can also be made to modify the property keys from the referred-to derivation.
Recall the example of Orion Nebula and Orionova maglina. Here the possessive adjective Orionova has to be matched in both case and gender to the noun maglina, which is of feminine gender. Earlier we defined a special adjective-as-noun derivation |ova>, specialized for feminine gender nouns, but now we want to make use of the full possessive adjective derivation, which is not specialized to any gender. Let the property keys of this derivation be of the form nommas (nominative masculine), genmas (genitive masculine), …, nomfem (nominative feminine), genfem (genitive feminine), …. If we use the stem of the nominative masculine form, Orionov, to name the possessive adjective base derivation, we get:
|ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
Orion Nebula: Orion|ov~...fem maglin|a
|ov~...fem is a masked expansion. It specifies to include only those properties with keys starting with any three characters and ending in fem, as well as to drop fem (being a constant) from the resulting property keys. This precisely selects only the feminine forms of the possessive adjective and transforms their keys into the noun keys needed to match those of the |a expansion.
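Key masking can be modeled with a small regular expression: each dot in the mask matches (and keeps) any single character, while the fixed characters must match literally and are dropped from the resulting keys. An illustrative sketch, not Pology's code:

```python
import re

def masked_expand(base_properties, mask):
    # Build a regex from the mask: '.' captures any one character,
    # everything else must match literally.
    pattern = re.compile(
        "^" + "".join("(.)" if c == "." else re.escape(c) for c in mask) + "$")
    selected = {}
    for key, value in base_properties.items():
        match = pattern.match(key)
        if match:
            # The new key keeps only the variable (captured) characters.
            selected["".join(match.groups())] = value
    return selected

ov_decl = {"nommas": "ov", "genmas": "ovog", "nomfem": "ova", "genfem": "ove"}
masked_expand(ov_decl, "...fem")
# -> {"nom": "ova", "gen": "ove"}
```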
We could also use this same masked expansion as the middle step, to produce the feminine-specialized adjective-as-noun base derivation:
|ov: nommas=…, genmas=…, …, nomfem=…, genfem=…, …
|ova>: |ov~...fem
Orion Nebula: Orion|ova> maglin|a
A special case of masked expansion is when there are no variable characters in the mask (no dots). In the pair Constellation of Cassiopeia and Sazvežđe Kasiopeje, the of Cassiopeia part is translated as a single word in genitive case, Kasiopeje, avoiding the need for a preposition. If standalone Cassiopeia has its own derivation, then we can use it like this:
Cassiopeia: Kasiopej|a
Constellation of Cassiopeia: Sazvežđ|e |Cassiopeia~gen
|e is the usual nominative-stem expansion. The |Cassiopeia~gen expansion produces only the genitive form of Cassiopeia, but with the empty property key. If this expansion were treated as a normal inner expansion, it would cancel all properties produced by the |e expansion, since none of them has an empty key. Instead, when an expansion produces a single property with an empty key, its value is treated as literal text and concatenated to all property values produced up to that point. Just as if we had written:
Constellation of Cassiopeia: Sazvežđ|e Kasiopeje
Sometimes the default modification of property keys, the removal of all fixed characters in the mask, is not what we want. This should be a rare case, but if it happens, the mask can also be given a key extender. For example, if we wanted to select only the feminine forms of the |ov expansion, but preserve the fem ending of the resulting keys, we would write:
Foobar: Fubar|ov~...fem%*fem
The key extender in this expansion is %*fem. For each resulting property, the final key is constructed by substituting every * with the key resulting from the ~...fem mask. Thus, the fem ending is re-added to every key, as desired.
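The effect of a key extender can be sketched on its own: given the properties already selected by a mask, the extender only reshapes their keys. The helper name is made up for illustration:

```python
def apply_key_extender(masked_properties, extender):
    # For each masked key, the final key is the extender with every '*'
    # replaced by that key; e.g. extender "*fem" turns "nom" into "nomfem".
    return {extender.replace("*", key): value
            for key, value in masked_properties.items()}

apply_key_extender({"nom": "ova", "gen": "ove"}, "*fem")
# -> {"nomfem": "ova", "genfem": "ove"}
```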
Expanded values can have their capitalization changed. By prepending ^ (circumflex) or ` (backtick) to the syntagma key of the expansion, the first letter in fetched values is uppercased or lowercased, respectively. We could derive the pair Distant Sun and Udaljeno sunce by using the pair Sun and Sunce (note the case difference in Sunce/sunce) like this:
Sun: Sunc|e # this defines uppercase first letter
Distant Sun: Dalek|o> |`Sun # this needs lowercase first letter
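The recasing itself amounts to changing only the first letter of the fetched value. A minimal sketch of the marker semantics, not of Pology's parsing:

```python
def recase_first(value, marker):
    # '^' uppercases and '`' lowercases the first letter of the value
    # fetched through the expansion; any other marker leaves it as-is.
    if marker == "^":
        return value[:1].upper() + value[1:]
    if marker == "`":
        return value[:1].lower() + value[1:]
    return value

recase_first("Sunce", "`")  # -> "sunce", as needed for "Udaljeno sunce"
```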
Property keys may be given several endings, to make these properties behave differently from what was described so far. These endings are not treated as part of the property key itself, so they should not be given when querying derivations by syntagma and property key.
Cutting properties are used to avoid the normal value concatenation on expansion. For example, if we want to define the gender of nouns through base expansions, we could come up with:
|a: nom=a, gen=e, dat=i, acc=u, gender=fem
Venus: Vener|a
However, this will cause the gender property in the expansion to become Venerafem. For the gender property to be taken verbatim, without concatenating segments from the calling derivation, we make it a cutting property by appending ! (exclamation mark) to its key:
|a: nom=a, gen=e, dat=i, acc=u, gender!=fem
Now when the dictionary is queried for the Venus syntagma and the gender property, we will get the expected fem value.
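The difference between ordinary and cutting properties can be sketched like this (a hypothetical helper, not Pology's code):

```python
def expand_properties(root, base_properties):
    # Ordinary properties get the root concatenated in front of their
    # values; cutting properties (key ending in '!') are taken verbatim,
    # with the '!' stripped from the resulting key.
    derived = {}
    for key, value in base_properties.items():
        if key.endswith("!"):
            derived[key[:-1]] = value
        else:
            derived[key] = root + value
    return derived

expand_properties("Vener", {"nom": "a", "gender!": "fem"})
# -> {"nom": "Venera", "gender": "fem"}
```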
Cutting properties also behave differently in multiple inner expansions. Instead of being canceled when not all inner expansions define them, simply the rightmost value is taken -- just like in outer expansions.
Terminal properties are those hidden with respect to expansion, i.e. they are not taken into the calling derivation. A property is made terminal by appending . (dot) to its key. For example, if some derivations have the short description property desc, we typically do not want it to propagate into calling derivations which happen not to override it by outer expansion:
Mars: Mars|, desc.=planet
Red Mars: Crven|i> |Mars # a novel
Canceling properties will cause a previously defined property with the same key to be removed from the collection of properties. A canceling property is indicated by ending its key with ^ (circumflex). The value of a canceling property has no meaning, and can be anything. Canceling is useful in expansions and alternative derivations (more on those later), where some properties introduced by an expansion or alternative fallback should be removed from the final collection of properties.
Key syntagmas and property values can be equipped with arbitrary simple tags, which start with ~ followed by the tag name, and extend to the next tag or the end of the syntagma. For example, when deriving people's names, we may want to tag their first and last names, using the tags ~fn and ~ln respectively:
~fn Isaac ~ln Newton: ~fn Isak| ~ln Njutn|
In default queries to the dictionary, tags are simply ignored, and syntagmas and property values are reported as if there were no tags. However, custom derivators (based on the Synder class from pology.synder) can define transformation functions, to which tagged text segments will be passed, so that they can treat them specially when producing the final text.
A tag is implicitly terminated by whitespace or a comma (or a colon in key syntagmas), but if none of these characters can be put after the tag, the tag name can be explicitly delimited with curly brackets, as ~{tag}.
Sometimes there may be several alternative derivations to the given syntagma. The default derivation (in some suitable sense) is written as explained so far, and alternative derivations are written under named environments.
For example, if deriving a transcribed person's name, there may be several versions of the transcription. Isaac Newton, as the name of the Renaissance scientist, may normally be used in its traditional transcription Isak Njutn, while a contemporary person of that name would be transcribed in the modern way, as Ajzak Njuton. Then, in the entry of Newton the scientist, we could also mention what the modern transcription would be, under the environment modern:
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
Alternative derivations are put on their own lines after the default derivation, and instead of the key syntagma, they begin with the environment name. The environment name starts with @ and ends with a colon, and then the usual derivation follows. It is conventional, but not mandatory, to add some indent to the environment name. There can be any number of non-default environments.
The immediate question that arises is how expansions are treated in non-default environments. In the previous example, what does the | expansion resolve to in the modern environment? This depends on how the synder file is processed. By default, it is required that derivations referenced by expansions have matching environments. If | were defined as:
|: nom=, gen=a, dat=u, acc=
then the expansion of Isaac Newton in the modern environment would fail. Instead, it would be necessary to define the base derivation as:
|: nom=, gen=a, dat=u, acc=
    @modern: nom=, gen=a, dat=u, acc=
However, this may not be a very useful requirement. As can be seen in this example already, in many cases base derivations are likely to be the same for all environments, so they would be needlessly duplicated. It is therefore possible to define an environment fallback chain in processing, such that when a derivation in a certain environment is requested but not available, environments in the fallback chain are tried in order. In this example, if the chain were given as ("modern", "") (the empty string is the name of the default environment), then we could write:
|: nom=, gen=a, dat=u, acc=
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
Charles Messier: Šarl| Mesje|
When derivation of Isaac Newton in the modern environment is requested, the default expansion for | will be used, and the derivation will succeed. Derivation of Charles Messier in the modern environment will succeed too, because the environment fallback chain is applied throughout; if Charles Messier had a different modern transcription, we would have explicitly provided it.
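A lookup with an environment fallback chain can be sketched as follows, using an illustrative data model rather than the actual Synder API:

```python
def lookup(derivations, key_syntagma, fallback_chain):
    # derivations maps key syntagma -> {environment name: derivation};
    # '' names the default environment. The first environment in the
    # chain that defines a derivation wins.
    environments = derivations.get(key_syntagma, {})
    for environment in fallback_chain:
        if environment in environments:
            return environments[environment]
    raise KeyError(key_syntagma)

derivations = {
    "Isaac Newton": {"": "Isak| Njutn|", "modern": "Ajzak| Njuton|"},
    "Charles Messier": {"": "Šarl| Mesje|"},
}
lookup(derivations, "Charles Messier", ("modern", ""))
# falls back to the default environment: "Šarl| Mesje|"
```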
ASCII whitespace in derivations, namely the space, tab and newline, is not preserved as-is, but by default it is simplified in final property values. The simplification consists of removing all leading and trailing ASCII whitespace, and replacing all inner sequences of ASCII whitespace with a single space. Thus, these two derivations are equivalent:
Venus: nom=Venera
Venus : nom = Venera
but these two are not:
Venus: Vener|a
Venus: Vener  |a
because the two spaces between the root Vener and the expansion |a become inner spaces in the resulting values, and so they get converted into a single space.
Non-ASCII whitespace, on the other hand, is preserved as-is. This means that significant whitespace, like non-breaking space, zero width space, word joiners, etc. can be used normally.
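The default simplification can be expressed compactly. A sketch of the described behavior, not Pology's code:

```python
import re

def simplify(text):
    # Collapse runs of ASCII whitespace (space, tab, newline) into one
    # space and strip leading/trailing ASCII whitespace; non-ASCII
    # whitespace such as the non-breaking space is left untouched.
    return re.sub("[ \t\n]+", " ", text).strip(" ")

simplify("  Vener \t |a\n")  # -> "Vener |a"
simplify("Zemlja\u00a0I")    # unchanged: the NBSP is preserved
```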
It is possible to treat whitespace differently, through an optional parameter to the derivator object (the Synder class). This parameter is a transformation function to which text segments with raw whitespace are passed, so that anything can be done with them.
Due to simplification of whitespace, indentation of key syntagmas and environment names is not significant, but it is nevertheless enforced to be consistent. This will not be accepted as valid syntax:
Isaac Newton: Isak| Njutn|
    @modern: Ajzak| Njuton|
  George Washington: Džordž| Vašington| # inconsistent indent
  @modern: Džordž| Vošington| # inconsistent indent
Consistent indenting is enforced both for stylistic reasons when several people are working on the same synder file, and to discourage indentation styles unfriendly to version control systems, such as:
Isaac Newton: Isak| Njutn|
     @modern: Ajzak| Njuton|
George Washington: Džordž| Vašington|
          @modern: Džordž| Vošington| # inconsistent indent
Unfriendliness to version control comes from the need to reindent lines which are otherwise unchanged, merely in order to keep them aligned to lines which were actually changed.
Within a single synder file, each derivation must have at least one unique key syntagma, because key syntagmas are used as keys in dictionary lookups. These two derivations are in conflict:
Mars: Mars| # the planet
Mars: mars| # the chocolate bar
There are several possibilities to resolve key conflicts. The simplest possibility is to use keyword-like key syntagmas, if key syntagmas themselves do not need to be human readable:
marsplanet: Mars|
marsbar: mars|
If key syntagmas have to be human readable, then one option is to extend them in human readable way as well:
Mars (planet): Mars|
Mars (chocolate bar): mars|
This method too is not acceptable if key syntagmas are intended to be of equal weight to derived syntagmas, like in a dictionary application. In that case, the solution is to add a hidden keyword-like syntagma to both derivations:
Mars, |marsplanet: Mars|
Mars, |marsbar: mars|
Processing will now silently eliminate Mars as the key to either derivation, because it is conflicted, and leave only marsplanet as the key for the first and marsbar as the key for the second derivation. These remaining keys must also be used in expansions, to reference the appropriate derivation. However, when querying the dictionary for key syntagmas by the key marsplanet, only Mars will be returned, because marsplanet is hidden; likewise for marsbar.
Ordering of derivations is not important. The following order is valid, although the expansion |Venus~gen is seen before the derivation of Venus:
Merchants of Venus: Trgovc|i> s |Venus~gen
Venus: Vener|a
This enables derivations to be ordered naturally, e.g. alphabetically, instead of the order being imposed by dependencies.
It is possible to include one synder file into another. A typical use case would be to split out base derivations into a separate file, and include it into other synder files. If basic derivations are defined in base.sd:
|: nom=, gen=a, dat=u, acc=, gender!=mas
|a: nom=a, gen=e, dat=i, acc=u, gender!=fem
…
then the file solarsys.sd, placed in the same directory, can include base.sd and use its derivations in expansions like this:
>base.sd
Mercury: Merkur|
Venus: Vener|a
Earth: Zemlj|a
…
> is the inclusion directive, followed by the absolute or relative path to the file to be included. If the path is relative, it is considered relative to the including file, and not to some externally defined set of inclusion paths.
If the including and the included file contain a derivation with the same key syntagmas, these two derivations are not in conflict. On expansion, first the derivations from the current file are checked, and if the referenced derivation is not there, then the included files are checked in reverse of the inclusion order. In this way, it is possible to override some of the base derivations in one or a few including files.
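The lookup order on expansion can be sketched as follows, in a simplified model where each file is just a mapping of derivation keys (not Pology's code):

```python
def resolve_reference(key, current_file, included_files):
    # The current file is checked first; then the included files, in
    # reverse of the inclusion order, so that later includes override
    # earlier ones (and the including file overrides them all).
    if key in current_file:
        return current_file[key]
    for definitions in reversed(included_files):
        if key in definitions:
            return definitions[key]
    raise KeyError(key)

base = {"|a": "nom=a, gen=e, dat=i, acc=u"}
overrides = {"|a": "nom=a, gen=e, dat=i, acc=u, gender!=fem"}
resolve_reference("|a", {}, [base, overrides])
# the later-included file wins: "nom=a, gen=e, dat=i, acc=u, gender!=fem"
```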
Inclusions are "shallow": only the derivations in the included file itself are visible (available for use in expansions) in the including file. In other words, if file A includes file B, and file B includes file C, then derivations from C are not automatically visible in A; to use them, A must explicitly include C.
Shallow inclusion and ordering-independent resolution of expansions, taken together, enable mutual inclusions: A can include B, while B can include A. This is an important capability when building derivations of taxonomies. While derivation of X naturally belongs to file A and of Y to file B, X may nevertheless be used in expansion in another derivation in B, and Y in another derivation in A.
To make derivations from several synder files available for queries, these files are imported into the derivator object one by one. Derivations from imported files (but not from files included by them, according to the shallow inclusion principle) all share a single namespace. This means that key syntagmas across imported files can conflict, and such conflicts must be resolved by one of the outlined methods.
The design rationale for the inclusion mechanism was that in each collection of derivations, each visible derivation (one which is available to queries by the user of the collection) must be accessible by at least one unique key, which does not depend on the underlying file hierarchy.
There are three levels of errors which may happen in syntagma derivations.
The first level are syntax errors, such as synder entry missing a colon which separates the key syntagma from the rest of the entry, unclosed curly bracket in expansion, etc. These errors are reported as soon as the synder file is imported into the derivator object or included by another synder file.
The second level of errors are expansion errors, such as an expansion referencing an undefined derivation, or an expansion mask discarding all properties. These errors are reported lazily, when the problematic derivation is actually looked up for the first time.
The third level is occupied by semantic errors: for example, we may require every derivation to have a certain property, or the gender property to have only the values mas, fem, and neu, and a derivation violates some of these requirements. At the moment, there is no prepared way to catch semantic errors.
In the future, a mechanism (in the form of file-level directives, perhaps) may be introduced to immediately report reference errors on request, and to constrain property keys and property values to avoid semantic errors. Until then, the way to validate a collection of derivations is to write a piece of Python code which imports all files into a derivator object, iterates through derivations (this alone will catch expansion errors), and checks for semantic errors.
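Such a validation pass might look as follows. For illustration it operates on a plain mapping of derived properties; real code would obtain the derivations through a Synder object, whose exact API is not shown here:

```python
def check_semantics(derived):
    # derived maps each key syntagma to its {property key: value} set.
    # Example constraints: every derivation must define 'nom', and
    # 'gender' may only take the values mas, fem or neu.
    problems = []
    for key_syntagma, properties in derived.items():
        if "nom" not in properties:
            problems.append((key_syntagma, "missing nominative form"))
        gender = properties.get("gender")
        if gender is not None and gender not in ("mas", "fem", "neu"):
            problems.append((key_syntagma, "invalid gender: %s" % gender))
    return problems
```

A derivation such as {"gen": "a", "gender": "xyz"} would then be reported twice, once for each violated constraint.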
[30] Another advantage is that the original text too will sometimes contain out-of-date UI references, which this process will automatically discover and enable the translation to be more up-to-date than the original. Of course, reporting the problem to the authors would be desirable, or even necessary when the related feature no longer exists.
[31] Taking into account the current level of artificial intelligence development, which, granted, may become more sophisticated in the future.
[32] And cause regular expressions to become horribly complicated.
[33] Why not remove accelerator markers automatically before applying rules? Because some rules might be exactly about accelerator markers, e.g. if it should not be put next to certain letters.
[34] These filtering examples are only for illustrative purposes, as there are more precise methods to remove markup, or literals such as command line options.
[35] Until that is sufficiently satisfied, one simple safety measure is to remove rule application comments from fuzzy messages just after the PO file is merged with the template. This will sometimes cause a false positive to reappear, but, after all, this is only a tertiary element in the translation workflow (after translation and review).
[36] As of this writing, one currently operative translation scripting system is KDE's Transcript. Another one being developed, albeit not with the PO format as its base, is Mozilla's L20n.
[37] An exception would be constructed languages with regular grammar, such as Esperanto.
[38] A combination of words having a certain meaning, possibly greater than the sum of meanings of each word.
Different parts of Pology provide common functionality, such as thematic groups of options to scripts, file selection patterns, reliance on PO metadata, etc. This chapter describes such common functionality.
Shell completion means that, just as for command names, it is possible to contextually complete command parameters by pressing the Tab key. This allows you to type command lines efficiently, as well as to quickly remind yourself of options and option parameters without resorting to documentation or browsing the file system.
For example, pressing Tab just after the posieve command will complete sieve names, and Tab after the -s
option will complete sieve parameters based on sieves that precede it in the command line. This:
$ posieve s<TAB>
will show all sieves beginning with s
, and complete the sieve name once a sufficient number of characters has been entered to uniquely determine it, while this:
$ posieve stats -s m<TAB>
will show all parameters to stats beginning with m
, and complete one of them after a few more characters are typed in.
Various parts of Pology can be configured through the configuration file .pologyrc
in the root of the user's home directory (~/.pologyrc
for short). The configuration file is not created automatically; you have to create it yourself when you first want to configure something. It must be UTF-8 encoded.
The configuration file is in the INI format: it is composed of sections beginning with a [section] line, and fields of the form field = value within a section. Comments can be written after a # character at the beginning of the line. Here is an example of a ~/.pologyrc file:
[global]

[user]
name = Chusslove Illich
original-name = Часлав Илић
email = caslav.ilic@gmx.net
po-editor = Kate

[enchant]
# Autodetection sufficient.

[posieve]
msgfmt-check = yes
param-ondiff/stats = yes

# Project setups follow.
[project-kde]
language = sr
language-team = Serbian
team-email = kde-i18n-sr@kde.org
plural-forms = nplurals=4; plural=n==1 ? ...
This configuration contains five sections: [global]
, [user]
, [enchant]
, [posieve]
, and [project-kde]
. The [global]
section sets options that have an effect throughout Pology, and here it is empty. The [user]
section provides some information on the person who uses Pology. The [enchant]
section configures the Enchant spell checker wrapper, used by Pology for spell checking. The [posieve]
section configures the behavior of the posieve script. The [project-kde]
section provides information on a project that the user contributes translation to.
Some details about the configuration file syntax are as follows. Leading and trailing whitespace in section and field names and values is not significant, e.g. foo=bar is the same as foo = bar. The percent (%) character is used to expand the value of another field, for example:
rootdir = /path/to/somewhere
datadir = %(rootdir)s/data
where %(...)s is Python's string interpolation syntax. Importantly, when you need a literal % character within a value (such as in the plural-forms field in the previous example), you must double it, as %%. Switch-type fields (msgfmt-check in the previous example) can take any of the following values for the two states: 0, no, false, or off; and 1, yes, true, or on (case is not important).
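Since this is the same INI dialect that Python's standard configparser module implements, the interpolation and %%-escaping rules described above can be tried out directly. The section and field names in this sketch are purely illustrative, not fields that Pology itself reads:

```python
import configparser

# A pologyrc-style snippet exercising %(...)s interpolation and the
# %% escape for a literal percent character.
raw = """
[paths]
rootdir = /path/to/somewhere
datadir = %(rootdir)s/data

[project-kde]
# A literal % must be doubled as %%.
plural-forms = nplurals=4; plural=n%%10==1 ? 0 : 1;
"""

cfg = configparser.ConfigParser()
cfg.read_string(raw)

print(cfg["paths"]["datadir"])             # /path/to/somewhere/data
print(cfg["project-kde"]["plural-forms"])  # nplurals=4; plural=n%10==1 ? 0 : 1;
```

Note how %(rootdir)s was expanded within the same section, and %% collapsed to a single literal %.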
Sections in the configuration can be of one of four general types:
General sections, which provide information used by various parts of Pology as they need them. The [global]
and [user]
sections from the previous example are general sections.
External tool sections, which are used to configure external libraries and programs used within Pology. The [enchant]
section from the previous example is of this type.
Internal tool sections, which configure the behavior of Pology's own scripts. This is the [posieve]
section from the previous example.
Project sections, which provide information related to particular translation projects that the user is contributing to. Names of these sections always start with project-
, such as [project-kde]
from the previous example.
Internal tool sections are documented together with the respective tools, while sections of other types are described in the following.
When mentioning configuration fields in their documentation and elsewhere, they are referred to as [section]/field. If there is only a fixed number of possible values for a field, this is denoted as [section]/field=[VALUE1|VALUE2|VALUE3|...]; if one of the values is the default, it is prefixed with a star (*).
The [global]
section contains options which can have effect on various otherwise unrelated parts of Pology.
Known configuration fields are as follows:
[global]/show-backtrace=[yes|*no]
When one of Pology commands stops execution with an error, by default only the error message is shown. However, for reporting problems and debugging, it is much better to get a backtrace instead. Backtraces can be activated by this option.
Whenever you want to report a problem where a Pology command aborts with an error, make sure to activate this option and submit the full backtrace.
Many parts of Pology can take advantage of information about you and the tools you use. This information is given in the [user]
section. For example, when initializing a PO file from a template, your name and email address in the PO header can be filled in, or a PO file can be opened in a translation editor that you use (if it is supported).
Known configuration fields are as follows:
[user]/name
Your name if it is written in Latin script, or the romanized equivalent of your name. The intention is that it is readable (or semi-readable) to people from various places in the world, who would use it to contact you if necessary.
[user]/original-name
This is your name in your native language and script, whatever it may be. If it would be the same as the name in the [user]/name
field, setting this field is not necessary.
[user]/email
Your email address.
[user]/language
The language code of the language you translate into. If by any chance you translate into several languages, this field can be overridden in per-project configuration sections.
[user]/encoding
The encoding of the PO files you work on. Nowadays this should really, really be UTF-8. If it is not UTF-8 for everything that you work on, you can override it in per-project configuration sections.
[user]/plural-forms
The value for the Plural-Forms
PO header field used for your language. If it differs between projects, you can override the value set here in per-project configuration sections.
[user]/po-editor
The human-readable name of the editor with which you translate (it does not have to be a dedicated PO editor). This is used in contexts where your editor preference is announced, such as through the X-Generator
PO header field.
[user]/po-editor-id=[lokalize]
The keyword under which the PO editor that you use is known to Pology. For the moment, only Lokalize is supported. This is used when a Pology tool is told to open PO files on the messages it matched.
This section configures Enchant, a wrapper library for spell checking, which is used for Pology's spell checking functionality. Through Enchant it is possible to use various spell checkers, such as Aspell, Ispell, Hunspell, etc. in a uniform way.
Known configuration fields are as follows:
[enchant]/provider=[aspell|ispell|myspell|...]
The keyword denoting the spell checker that Enchant should use. It can also be a comma-separated list of several keywords, in which case Enchant will use the first available spell checker in the list. You can find the up-to-date list of all known provider keywords in the enchant(1) man page, and run the enchant-lsmod command to see exactly which of those are recognized as available on the system.
[enchant]/language
The spell checking dictionary that should be used, by language code. This value is used only if the language is not specified in any other way, such as in the PO header or through command line.
[enchant]/environment
The sub-language environment for spell checking. This is related to Pology's internal spelling dictionary supplements, see the section on spell checking. This value is used only if the environment is not specified in any other way, such as in the PO header or through command line.
At first Pology used Aspell for spell checking, before Enchant was introduced. Direct support for Aspell was nevertheless kept, due to some specifics that the Enchant wrapper does not support yet. (Which means that you should prefer Enchant if it satisfies your needs.)
Known configuration fields are as follows:
[aspell]/language
See [enchant]/language
.
[aspell]/encoding
Encoding for the text sent to Aspell.
[aspell]/variety
The sub-language variety of the Aspell spelling dictionary.
[aspell]/environment
See [enchant]/environment
.
[aspell]/supplements-only=[yes|*no]
Whether to ignore the system spelling dictionary and use only Pology's internal dictionary supplements.
[aspell]/simple-split=[yes|*no]
By default, Pology splits the text into words in a clever fashion (eliminating text markup, format directives, etc.) before sending them to the spell checker. Sometimes this leads to bad results, in which case this field can be set to yes
to split text simply on whitespace (possibly, in the given context, in combination with a pre-filtering hook on the text).
You will easily find yourself in a situation where you need to translate and maintain translated material within different projects, each with its own set of rules and conventions. Pology is designed to support switching between projects extensively, and one element of that support is the per-project configuration section.
A project configuration section has the name [project-PKEY], where PKEY is the project keyword. You can choose the project keyword freely, but it should contain only ASCII letters, digits, underscores and hyphens. Project configuration fields frequently have fallbacks to fields in other configuration sections: when a project field is not set, the corresponding field in that other (more general) section is used instead. In the following, this is the case whenever you are instructed to see a field in another section.
Per-project configuration fields are as follows:
[project-*]/name
See [user]/name
.
[project-*]/original-name
See [user]/original-name
.
[project-*]/email
See [user]/email
.
[project-*]/language
See [user]/language
.
[project-*]/language-team
This is the name of the team which translates this project into the given language. Since there is usually only one translation team per language in a project, the value of this field is just the human-readable name of the language (as opposed to the language code) in English.
[project-*]/team-email
The email address for communication with the translation team as a whole (usually the team's mailing list).
[project-*]/encoding
See [user]/encoding
.
[project-*]/plural-forms
See [user]/plural-forms
.
There are a great many places in Pology where you can supply a matching pattern, to select or deselect something. This could be a PO file by its path, a PO message by its msgid, etc. Almost always and by default, this matching pattern will be a regular expression (or regex for short). Regular expressions are a powerful pattern matching language, a fascinating topic in their own right, and they will serve you well in just about any context of searching on computers. The plain text editor that you use probably offers regular expressions in its search dialog, so does your office text processor, and so on.
Actually, the only point of this brief section is to impress the importance and usefulness of regular expressions upon you, in case you have not used them yet. The Internet is full of tutorials on regular expressions, so there is no point in linking any particular one here.
It should be mentioned that different regular expression engines have somewhat different syntax and expressiveness. Pology uses regular expressions from the Python Standard Library, described here: http://docs.python.org/library/re.html (keep in mind that this page is a reference, and not a tutorial, so you should look elsewhere to learn the basics of regular expressions).
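For a quick taste of the syntax, here is Python's re module in action on some made-up PO file paths:

```python
import re

# Alternation (|) matches either branch; ^ anchors at the start of the
# string, and \. matches a literal dot.
pattern = re.compile(r"^(xray|zulu)/.*\.po$")

paths = ["xray/alpha.po", "yankee/charlie.po", "zulu/echo.po"]
matched = [p for p in paths if pattern.search(p)]
print(matched)  # ['xray/alpha.po', 'zulu/echo.po']
```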
Pology scripts that can recursively search directory paths for PO files will usually provide several options by which certain files can be included or excluded from processing. The first pair of these options include or exclude files by path:
-E REGEX
, --exclude-path=REGEX
Every file whose path matches the supplied pattern is excluded from processing. This option can be repeated, in which case a file is excluded only if its path matches every pattern. When you instead want to exclude by any one of several patterns matching the path, you can connect those patterns with the regular expression |-operator in a single option. This allows you to build up complex exclusion conditions if necessary.
-I REGEX
, --include-path=REGEX
Only those files whose paths match the supplied pattern are included in processing. If the option is repeated, a file is included only if its path matches every pattern.
PO files, especially those used at runtime (as opposed to those used for static translation), are frequently sufficiently identified by their domain name alone. The domain name is the base name of the installed MO file without the extension, e.g. for /usr/share/locale/sr/LC_MESSAGES/foobar.mo
the domain name is foobar
. If, in a given translation project, PO files for a given language are all collected under one top directory of that language, their base names are also formed of domain names.[39] When this is the case, it may be more convenient or safer to match PO files by their domain names instead of by paths, which is done with these options:
-e REGEX
, --exclude-name=REGEX
Counterpart to -E
/--exclude-path
which matches by domain name.
-i REGEX
, --include-name=REGEX
Counterpart to -I
/--include-path
which matches by domain name.
All inclusion and exclusion options can be freely mixed and repeated, with consequent resolution. A file is processed if it matches all inclusion patterns (if any are given) and does not match at least one exclusion pattern (if any are given). Put the other way around, a file is not processed if it does not match at least one inclusion pattern (if any are given) or if it matches all exclusion patterns (if any are given).
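The resolution rules above can be written down as a few lines of Python. This is an illustrative model of the described semantics, not Pology's actual code:

```python
import re

def is_processed(path, include_rxs, exclude_rxs):
    # A file is processed if it matches all inclusion patterns (if any
    # are given) and does not match at least one exclusion pattern
    # (if any are given).
    if include_rxs and not all(re.search(rx, path) for rx in include_rxs):
        return False
    if exclude_rxs and all(re.search(rx, path) for rx in exclude_rxs):
        return False
    return True

print(is_processed("xray/alpha.po", [], ["alpha"]))            # False
print(is_processed("xray/bravo.po", [], ["alpha"]))            # True
# Repeated exclusions: a file is skipped only if it matches every one.
print(is_processed("xray/alpha.po", [], ["alpha", "yankee"]))  # True
```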
Sometimes it is convenient to make a temporary or semi-permanent grouping of files, such that the file group can be referenced through a single argument instead of repeating all the files every time. This is particularly useful when shell piping is not applicable or not comfortable enough. The classic and simple way to group files is by having a file-list file, which contains one file path per line, and which a command can read to collect the files to process.
Many Pology scripts can write and read file-list files. Having scripts write such files automatically is simple enough, just check given script's documentation to see if it has this capability (e.g. the -m
option to posieve). More interesting are the special features that you can use when writing a file-list file manually. You would do this for standing categories which are periodically updated, such as a list of PO files ready for release.
For completeness, here is first an example of a basic file-list file:
xray/alpha.po
xray/bravo.po
yankee/charlie.po
yankee/delta.po
As is usual for path arguments to Pology scripts, you can specify both file and directory paths, and directory paths will be searched recursively for PO files (or whatever the file type that the script is processing):
xray/
yankee/
zulu/echo.po
zulu/foxtrot.po
You can add comments by starting a line with the hash character (#), and leave empty lines:
# Translations ready for release.

# Full modules.
xray/
yankee/

# Specific files.
zulu/echo.po
zulu/foxtrot.po
The inclusion-exclusion functionality equivalent to the inclusion-exclusion command line options is provided through inclusion-exclusion directives. They are specified by starting the line with a colon (:), followed by a directive type token, followed by a regular expression. The directives are:
:/-REGEX to exclude files by path,
:/+REGEX to include files by path,
:-REGEX to exclude files by base name without extension, and
:+REGEX to include files by base name without extension.
For example, if a whole module should be processed but for one PO file in it, it is easier to list the whole module and exclude that one file, as compared to listing all other files:
# Modules.
xray/
yankee/

# Exclude november.po (in whichever module it is).
:-november
Ordering and position of include-exclude directives is not significant, as they are all applied to all collected files. The semantics of application of multiple directives is the same as that of counterpart command line options.
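To make the directive semantics concrete, here is a small illustrative resolver for file-list entries. It models only plain path entries, comments, and the :-NAME directive, and does not expand directories; Pology's own parser is more complete:

```python
import os
import re

def resolve_file_list(lines):
    # Separate plain path entries from :-NAME exclusion directives,
    # skipping comments and empty lines.
    paths, name_excludes = [], []
    for line in (l.strip() for l in lines):
        if not line or line.startswith("#"):
            continue
        if line.startswith(":-"):
            name_excludes.append(line[2:])
        else:
            paths.append(line)

    def base_name(path):
        # Base name without extension, e.g. "xray/november.po" -> "november".
        return os.path.splitext(os.path.basename(path))[0]

    # As with the command line options, a file is dropped only if its
    # base name matches every exclusion pattern.
    return [p for p in paths
            if not (name_excludes
                    and all(re.search(rx, base_name(p))
                            for rx in name_excludes))]

entries = [
    "# Modules.",
    "xray/alpha.po",
    "xray/november.po",
    ":-november",
]
print(resolve_file_list(entries))  # ['xray/alpha.po']
```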
File-list files are normally fed to Pology scripts with the following option:
-f FILE
, --files-from=FILE
Read files to process from a file which contains one path per line, or special entries as described above. This option can be repeated to read several file lists. Additional paths to process can still be given as command line arguments. Any inclusion-exclusion options will be applied to the files read from the file as well (in addition to the file's internal inclusion-exclusion directives, if any).
In some contexts, Pology scripts color the terminal output for better visual separation and highlighting of important parts of the text. Examples include warning and error messages, data presented as tables and bars, and, importantly, matched segments of the text in search and validation operations.
Output coloring is turned on by default, but sensitive to output destination: the text is colored if the output is to the terminal (using terminal escape sequences), but not if it is piped to a file. Pology scripts provide the following options by which you can influence this behavior:
-R
, --raw-colors
Disables output destination sensitivity, such that the text is always colored. This is useful when the output is piped to another command which can understand the terminal escape sequences by which colors are produced, such as less(1). A typical example would be piping search results from the find-messages sieve to be able to scroll them back and forth:
$ posieve find-messages ... -R | less -R
The -R option of less tells it to interpret escape sequences as colors, rather than showing them as literal text.
--coloring-type=[none|term*|html]
Instead of coloring for the terminal, with this option you can choose another coloring type. none
disables coloring, term
is the default, while html
will produce HTML-tagged text ready for embedding into a web page (e.g. inside a <pre> element). For example, with a little bit of additional scripting, you could use the stats sieve and html
coloring to periodically update a web page with translation statistics.
One of the general aims of Pology is to fit well with other tools typically found in translation workflows based on PO. Although examples of this can be seen throughout the manual, this section gives the overview of integration by the particular supported tool.
When Pology is used to validate the translation, be it through informal but precise searches or through formal validation rules, those translations found to be invalid must be modified (or possibly a special translator comment added to the message to silence a false positive). Pology normally reports the PO file path and the location of the message within the file, so that you can get to it in your preferred PO editor. For greater efficiency, however, Pology can directly open the PO files on problematic messages in some PO editors. Currently these are:
Many sieves, notably find-messages, check-rules, or check-spell, provide the parameter lokalize
to open PO files on reported messages in Lokalize. This means that when run over a collection of PO files, each PO file with at least one reported message will be loaded into one of Lokalize tabs, and only the reported messages will be shown for editing under each tab. A slight catch is that Lokalize must be manually started before a sieve is run, and the Lokalize project which contains all the sieved PO files must be opened; otherwise, simply nothing will happen.
From the viewpoint of translators, PO files are frequently (though not always) handled in the same way as program code, through version control systems (VCS). Pology defines an abstraction of version control functionality, which enables its tools to transparently cooperate with several VCS. Usually it is necessary to tell a Pology tool which VCS is used, which is done by specifying one of VCS keywords. Currently supported VCS and their keywords are:
Git: git
Subversion: svn
, subversion
none (when specifying a VCS is required, but none is actually used): none
, noop
VCS integration is available in the following places:
Producing embedded diffs with poediff (see Chapter 4, Diffing and Patching). Option -c
/--vcs
can be used to switch poediff into VCS mode, such that it diffs given paths between repository head and working copy, or between given revisions.
Translating in summit (see Chapter 5, Summitting Translation Branches). posummit will automatically add files to and remove them from version control, as well as to and from disk, so that the modified repository tree can be directly committed after a summit maintenance operation has completed its run.
Review ascription (see Chapter 6, Ascribing Modifications and Reviews). VCS support is a central part of poascribe, so it will automatically add, remove and commit files to version control as particular ascription operations require.
Another interesting aspect of VCS support is that, when writing modified PO files to disk, by default Pology will reformat them (almost) only as much as necessary. For example, if only one msgstr
string in the whole PO file has changed, and wrapping is active, only this string and nothing else will be rewrapped when the file is written out. This makes VCS revision deltas smaller and more informative.
While line wrapping of message strings is irrelevant to the programs that fetch translations from them, it may be significant to the translator, especially when editing the PO file with a plain text editor. Well-wrapped strings make it easier for the translator to follow the text structure, especially in longer messages.
Most Gettext tools (msgmerge, msgcat, msgfilter, etc.) provide options to wrap or not to wrap strings, where wrapping is done on the given column and escaped newlines (\n
). Pology can produce this type of wrapping ("basic" wrapping) as well, but it can also wrap on expected visual line breaks in known text markup, e.g. <p>
and <br>
in HTML ("fine" wrapping). Compare this message in basic wrapping alone:
msgid ""
"<p>These settings control the storage of the corrected images. "
"There are four modes to choose from:</p><p><ul><li><b>Subfolder:</"
"b> The corrected images will be saved in a subfolder under the "
"current album path.</li><li><b>Prefix:</b> A custom prefix will be "
"added to the corrected image.</li><li><b>Suffix:</b> A custom "
"suffix will be added to the corrected image.</li><li><b>Overwrite:</"
"b> All original images will be replaced.</li></ul></p><p>Each of "
"the four modes allows you to add an optional keyword to the image "
"metadata.</p>"
msgstr ""
and in basic and fine wrapping together:
msgid ""
"<p>These settings control the storage of the corrected images. "
"There are four modes to choose from:</p>"
"<p>"
"<ul>"
"<li><b>Subfolder:</b> The corrected images will be saved in a "
"subfolder under the current album path.</li>"
"<li><b>Prefix:</b> A custom prefix will be added to the corrected "
"image.</li>"
"<li><b>Suffix:</b> A custom suffix will be added to the corrected "
"image.</li>"
"<li><b>Overwrite:</b> All original images will be replaced.</li>"
"</ul>"
"</p>"
"<p>Each of the four modes allows you to add an optional keyword "
"to the image metadata.</p>"
msgstr ""
If you are editing the PO file with a dedicated PO editor, it may itself provide finely tuned wrapping and ignore the wrapping in the PO file, in which case Pology's wrapping facilities are superfluous to you[40]. But a PO editor may also present strings wrapped just as they are in the PO file (and most do!), when Pology's fine wrapping is just as useful as in combination with a plain text editor.
At least for alphabetic languages, the most convenient wrapping may be fine wrapping alone (no basic wrapping), while turning on the editor's dynamic (visual) line wrapping. This both makes the text structure easy to follow, and allows editing the translation by logical units (paragraphs, list items) without manually adjusting column breaks or putting up with ugly overlength or mid-broken lines. However, for ideographic languages, the editor's dynamic line wrapping may produce bad results, and there basic wrapping might be necessary. In fact, for the moment, for ideographic languages it may be better to skip Pology's wrapping entirely and stick with Gettext's wrapping, since the wrapping algorithm in Gettext is more sophisticated and directly supports ideographic writing systems.
If no wrapping mode is specified when the given PO file is written out, Pology will apply basic wrapping, just as Gettext tools do. There are three general sources from which Pology tools may try to determine the wrapping mode for the given PO file, in decreasing priority: from the command line options, from the PO file's header, and from the user configuration. A tool may or may not provide command line options and configuration fields for wrapping, but PO file headers are always consulted (since this is in Pology's core PO file handling facilities). See the description of the X-Wrapping
header field for how to set the wrapping mode in the PO header, and the set-header
sieve for how to set this field in many PO files at once.
Pology tools in which the wrapping mode can be set from the command line provide the following options:
--wrap
Perform basic wrapping, on certain column.
--no-wrap
Do not perform basic wrapping.
--fine-wrap
Perform fine wrapping, on various expected visual breaks introduced by text markup in rendered text.
--no-fine-wrap
Do not perform fine wrapping.
--wrap-column=COL
The column at which the text should be wrapped. The wrapped line in the PO file will never be longer than this many columns, including the outer quotes. If not given, the default is 79.
Both positive and negative wrapping options are provided in order to be able to override the wrapping mode defined by the user configuration or the PO header. As in Gettext tools, strings are always wrapped on \n regardless of the wrapping mode.
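The idea behind combining fine and basic wrapping can be sketched in a few lines of Python. This is an illustrative toy, not Pology's wrapping algorithm; real fine wrapping handles many more markup types and the PO string syntax itself:

```python
import re
import textwrap

def fine_wrap(text, column=30):
    # First break on expected visual line breaks in the markup, keeping
    # the break-introducing tag attached to the preceding piece, then
    # apply basic column wrapping to each piece.
    pieces = [p for p in re.split(r"(?<=</p>)|(?<=<br/>)", text) if p]
    lines = []
    for piece in pieces:
        lines.extend(textwrap.wrap(piece, width=column))
    return lines

text = "<p>First paragraph.</p><p>Second, somewhat longer paragraph.</p>"
for line in fine_wrap(text):
    print(line)
```

The first paragraph ends its line at </p> even though more would fit on the column, which is exactly the visual-break behavior shown in the second msgid example above.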
The following configuration fields will be read by the tools which consult the user configuration for wrapping mode, in their respective configuration sections:
[section]/wrap=[*yes|no]
Whether to perform basic wrapping, counterpart to --wrap
and --no-wrap
command line options.
[section]/fine-wrap=[yes|*no]
Whether to perform fine wrapping, counterpart to --fine-wrap
and --no-fine-wrap
command line options.
The PO header is a natural place to provide information which holds for the PO file as a whole. Pology scripts, sieves, and hooks can take into account a number of header fields, when available, to automatically determine some aspects of processing. The fields considered are as follows:
Language
This field contains the language code of the translation, which Pology will take into account in all contexts where language-dependent processing is done (such as when spell-checking). You can also specify the language into which you translate in user configuration, and sometimes in the command line. The language stated by the PO header will override the user configuration, but it will be in turn overridden by the command line. See also Section 8.1, “The Notion of Language in Pology”.
X-Accelerator-Marker
Accelerator markers are a frequent obstacle in text processing, such as searching or spell-checking, because they can split words apart. This field can be used to specify which character is used as accelerator marker throughout the file, if any. If there are several possible characters, they can be given as comma-separated list[41]. While it is usually possible to specify the accelerator marker through the command line, the header field is much more convenient and flexible: there is no need to remember to add the command line option at every run, and different PO files can have different accelerator markers. However, if command line option is issued, it will override the header field.
There is a difference between this field not existing in the header, and existing but with an empty value (i.e. "X-Accelerator-Marker: \n"
). If the field does not exist, some processing elements will go into the "greedy" mode, where they use a list of known frequent accelerator markers (e.g. to remove them from the text). If the field is set to empty value, these processing elements will take it that there are no accelerator markers in text.
X-Associated-UI-Catalogs
This field lists the PO domains which are the source of user interface references (button labels, menu items, etc.) throughout the text in the current PO file. This makes it possible to automatically fetch and insert UI translations, rather than having to look them up manually and maintain them against changes; see Section 8.4, “Automatic Insertion of UI Labels” for details. Several PO domains can be given as a space- or comma-separated list. If a UI message is found in more than one listed PO domain, the earlier one in the list takes precedence.
X-Environment
The language environment to which the translation belongs; see Section 8.1, “The Notion of Language in Pology” for details. It can be a single keyword, or a comma-separated list of keywords. If several environments are given, the later in the list (which is usually the more specific) takes precedence.
X-Text-Markup
When the text contains markup, it may be useful to remove it such that only the plain text remains. This is the case, for example, when computing word counts or applying terminology validation rules. Another use case would be the validation of markup itself (whether a tag is properly closed, whether a tag exists, etc.) This header field specifies the markup type found in the text, as a keyword, so that Pology can determine how to process it. Several markup types can be given as comma-separated list.
Pology currently recognizes the following markup types:
docbook4
-- Docbook 4.x markup, in documentation POs
html
-- HTML 4.01
kde4
-- markup in KDE4 UI POs, a mix of Qt rich-text and KUIT
kuit
-- UI semantic markup in KDE 4
qtrich
-- Qt rich-text, (almost) a subset of HTML
xmlents
-- only XML-like entities, no other formal markup
X-Wrapping
This header field can be set to tell Pology how to wrap strings in the current PO file, for example, when posieve modifies a message and writes the modified PO file, or when rewrapping is done explicitly by porewrap. The value is a comma-separated list of wrapping modes, chosen from:
basic
-- wrapping on certain column
fine
-- wrapping on logical breaks (such as <p>
or <br/>
tags)
Wrapping on escaped newline \n
is always performed, regardless of the wrapping mode. If the field value is empty, no other wrapping is done. If more than one wrapping mode is given (e.g. "X-Wrapping: basic, fine\n"
), it is specifically defined how the modes are combined, so their ordering is not important. As usual, if wrapping is specified by a command line option, that will override the header field.
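The interpretation of such a field value can be sketched as follows (an illustrative model, not Pology's internal code):

```python
def wrapping_modes(header):
    # Interpret an X-Wrapping value: an absent field means "fall back
    # to other sources"; an empty value means only \n-wrapping is done;
    # otherwise the value is a comma-separated list of mode keywords.
    value = header.get("X-Wrapping")
    if value is None:
        return None
    return {mode.strip() for mode in value.split(",") if mode.strip()}

print(sorted(wrapping_modes({"X-Wrapping": "basic, fine"})))  # ['basic', 'fine']
print(wrapping_modes({"X-Wrapping": ""}))                     # set()
print(wrapping_modes({}))                                     # None
```

Returning a set reflects that the combination of modes is defined independently of their ordering in the field.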
All of the listed header fields may be set manually, when you get to work on the particular PO file. But frequently it is possible to set them automatically, or at least automatically for the first time with later manual corrections where needed. For this you may use the set-header sieve. If PO files are periodically merged by the translation project automation (rather than each translator merging on his own only the PO files which he is about to update), the natural moment to run set-header is just after the merging. If translation is done in summit, you can specify in summit configuration to set header fields on merging.
Pology enables the user to insert special processing elements, called hooks, at many places in the processing chain. Hooks are Python functions with certain prescribed input, output, and behavior. Depending on the exact combination of these three ingredients, there are various hook types. Finally, some hooks can be adapted to a given context through their hook factories. Pology defines many hooks internally, and users can add their own external hooks.
Usage of hooks is best illustrated through examples. Suppose that you want to use the find-messages sieve to look for a certain word, but the text contains XML-like tags of the form <tagname>...</tagname> which happen to be throwing off your search. Suppose that there exists a hook called remove-xml-tags, in the Pology library module remove, which takes a piece of text as input and returns that piece of text cleared of any XML-like tags. Then you could insert this hook into the search to clear the tags before matching the text, by using the filter: parameter to find-messages:
$ posieve find-messages -s filter:'remove/remove-xml-tags' ...
Here remove/remove-xml-tags
is the hook specification, and this is its usual, simplest form: the module name, followed by a slash, followed by the hook name. (Sometimes it can be only the module name, when the hook function within that module has the same name as the module, but this is rare.) The hook specification was enclosed in single quotes, for the shell to see it as a single string; this was not necessary here, but it is a good habit to keep up when adding hooks through the command line, because hook specifications can get quite involved.
Suppose now that there is a single hook that can remove any kind of markup from the text (not only XML-like tags) called remove-markup
, but that it has to be told which markup to remove, by giving it one of the markup type keywords known to Pology. Continuing the previous example, this could be done like this:
$ posieve find-messages -s filter:'remove/remove-markup~"docbook4"' ...
Now the hook specification is remove/remove-markup~"docbook4"
. Note that the outer single quotes in the command line are necessary, as otherwise the shell would strip the internal double quotes, which are here an integral part of the hook specification. remove-markup
is actually a hook factory, which produces a hook based on the parameters given after the tilde (~
) character. Here "docbook4"
is that parameter; why must it be quoted? Because the part after the tilde is passed as an argument list to a Python function, and "docbook4"
must be of string type, which is in Python denoted by quotes. For a hook factory foo/bar
which would take a string and a number, the hook specification would be foo/bar~"qwyx",5
. Sometimes a hook factory has default values for some or all of its arguments; in the latter case, if the defaults are sufficient, the part after the tilde in the hook specification can be left empty (e.g. foo/bar~
).
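To make the factory mechanism concrete, here is a minimal Python sketch of the shape of a hook factory; the function name and the crude tag-stripping logic are illustrative only, not Pology's actual remove/remove-markup implementation:

```python
import re

def remove_markup(mtype="docbook4"):
    """Illustrative hook factory: the part after the tilde in the hook
    specification becomes the argument list of this call."""
    tag_rx = re.compile(r"<[^>]+>")  # crude; ignores mtype specifics
    def hook(text):
        # The produced function is an F1A-style hook:
        # one string in, the modified string out.
        return tag_rx.sub("", text)
    return hook

hook = remove_markup("docbook4")
clean = hook("Open the <guilabel>Files</guilabel> dialog")
```

With default argument values, as in this sketch, a specification with an empty part after the tilde (remove/remove-markup~) would also be valid.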
Hooks can be language- and project-dependent. Suppose that in your language the letters are sometimes accented, but the accents should be ignored on spell-checking. Then Pology may contain a hook which strips accents from text in your language. If your language code is ll
, and the hook is remove-accents
in (language-specific) module remove
, you could check spelling while ignoring accents using the check-spell-ec sieve:
$ posieve check-spell-ec -s filter:'ll:remove/remove-accents' ...
The hook specification now also contains the language code separated by colon, as ll:...
. If the hook is project-specific instead, it is prefixed with pp%...
, where pp
is the project identifier and percent sign the separator. If the hook is both language- and project-specific, then the specification is ll:pp%...
or pp%ll:...
.
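The overall structure of hook specifications can be summarized by a small parsing sketch; this is a hypothetical helper, not Pology's actual parser, and it handles only the ll:pp% ordering of the two prefixes:

```python
import re

# [lang:][proj%]module[/hook][~args] -- only the ll:pp% ordering
# of language and project prefixes is handled in this sketch.
SPEC_RX = re.compile(
    r"^(?:(?P<lang>[a-z]{2,3}):)?"
    r"(?:(?P<proj>\w+)%)?"
    r"(?P<module>[\w-]+)"
    r"(?:/(?P<hook>[\w-]+))?"
    r"(?:~(?P<args>.*))?$")

def parse_hook_spec(spec):
    m = SPEC_RX.match(spec)
    return m.groupdict() if m else None

d = parse_hook_spec("ll:remove/remove-accents")
# d["lang"], d["module"], d["hook"] -> "ll", "remove", "remove-accents"
```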
In places where a hook can be inserted, it is convenient to succinctly state which types of hooks are acceptable. Hook types are therefore coded with letter-number-letter combinations. The first letter can be F, V, or S, standing for filtering, validation, or side-effect hook, in that order. Filtering hooks modify their input, validation hooks report problems in input in a way understood by their clients, while side-effect hooks can do anything except modifying the input. The number after the first letter describes the composition of input, which can be pure text, PO message, PO header, etc. and their combinations. The final letter indicates the semantics of the input, like whether the input text is supposed to be the original (msgid
) or the translation (msgstr
) or can be any of them.
The following hook types are currently defined (the hook type is followed by the expected input in parentheses):
F1A (text)
-- Modifies the input text.
V1A (text)
-- Validates the input text.
S1A (text)
-- Side-effects based on the input text.
F3A (text, msg, cat)
-- Modifies the input text, which is one of the strings in the given PO message, which belongs to the given PO file. The difference between F1A and F3A hooks is that an F1A hook can process text based only on the text itself, while an F3A hook can process text by taking into account the information elsewhere in the PO message (e.g. in comments) and the PO file (e.g. in the header). This holds for all *1* and *3* hook types.
V3A (text, msg, cat)
-- Validates the input text, which is one of the strings in the given PO message, which belongs to the given PO file.
S3A (text, msg, cat)
-- Side-effects based on the input text, which is one of the strings in the given PO message, which belongs to the given PO file.
F3B (text, msg, cat)
-- Modifies the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file. The difference between F3A and F3B hooks is that the input text of an F3B hook is expected to be precisely the original string in the message, while giving anything else will lead to undefined results. This holds for all *3A, *3B, *3C hook types.
V3B (text, msg, cat)
-- Validates the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file.
S3B (text, msg, cat)
-- Side-effects based on the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file.
F3C (text, msg, cat)
-- Modifies the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.
V3C (text, msg, cat)
-- Validates the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.
S3C (text, msg, cat)
-- Side-effects based on the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.
F4A (msg, cat)
-- Modifies the input PO message, which belongs to the given PO file. The difference between F4A and F3A hooks is that an F3A hook can modify only the given string in the message, while an F4A hook can modify any number of strings, comments, etc. in the message. This holds for all *3* and *4* hook types.
V4A (msg, cat)
-- Validates the input PO message, which belongs to the given PO file.
S4A (msg, cat)
-- Side-effects based on the input PO message, which belongs to the given PO file.
F4B (hdr, cat)
-- Modifies the input PO header, which belongs to the given PO file.
V4B (hdr, cat)
-- Validates the input PO header, which belongs to the given PO file.
S4B (hdr, cat)
-- Side-effects based on the input PO header, which belongs to the given PO file.
F5A (cat)
-- Modifies the input PO file. As opposed to F1* and F3* hooks, which can modify only elements within PO messages, F5* hooks can also add, remove, and change positions of messages within the PO file.
V5A (cat)
-- Validates the input PO file. As opposed to V1* and V3* hooks, which report only problems confined to PO messages, V5* hooks can also report problems due to relations between several PO messages, each of which is valid in itself.
S5A (cat)
-- Side-effects based on the input PO file.
F6A (filepath)
-- Modifies the input file, whether in PO or another format, on the level of pure text lines. This is unlike F5A hooks, which operate on the level of entries in the PO file; F6A hooks are also typically limited to certain types of files, perhaps even only PO files. This holds for all *6* hook types.
V6A (filepath)
-- Validates the input file.
S6A (filepath)
-- Side-effects based on the input file.
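The three first-letter categories can be sketched in Python as follows, for hooks operating on pure text. These are illustrative functions, not actual Pology hooks; the exact return convention of validation hooks is defined by Pology's API, and a plain list of problem descriptions is assumed here:

```python
def f1a_strip(text):
    # F-type (filtering): returns the modified input text.
    return text.strip()

def v1a_double_space(text):
    # V-type (validation): reports problems, leaves the input untouched.
    problems = []
    if "  " in text:
        problems.append("double space in text")
    return problems

word_total = {"words": 0}

def s1a_tally(text):
    # S-type (side-effect): affects external state (here a word counter)
    # without modifying the input or reporting problems.
    word_total["words"] += len(text.split())
```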
Pology does not establish a strict separation between users and programmers, but presents a continuum between pure use and pure programming, so that users can engage according to their needs and abilities. Hooks, in particular, occupy the middle of this range. On the one hand, they can be used even from the command line; on the other, they are actually Python functions, and hook specifications (in command line and elsewhere) sometimes require Python argument lists (the part after the tilde). This makes it hard both to list all available hooks[42], and to decide where and how to document them, in the user manual or in the library programming interface (API) documentation. Therefore, the following approach is taken. Here, in the user manual, only functions written specifically to be used as hooks are listed (sometimes grouped by similarity), with their types and short descriptions, together with a link to the complete hook description in the API documentation.[43]
bpatterns/bad-patterns
(S3A), bpatterns/bad-patterns-msg
(S4A), bpatterns/bad-patterns-msg-sp
(V4A)Detects unwanted patterns in text, by regular expression matching. Patterns can be specified either as direct arguments, or listed in a file given as an argument.
This hook is deprecated. Use validation rules instead, which are a much richer method of defining and checking for problems.
gtxtools/msgfilter
(F6A)Pipes the PO file through Gettext's msgfilter(1). The filter argument and options to msgfilter can be specified as parameters to hook factory. (May be used to wrap the PO file canonically, as Pology does not produce exactly the same wrapping as Gettext tools.)
gtxtools/msgfmt
(S6A)Pipes the PO file through Gettext's msgfmt(1), discarding output and reporting any errors as warnings. Useful for hard check of the PO file syntax, and extended checks performed when msgfmt is run with --check
option.
markup/check-xml
(S3C), markup/check-xml-sp
(V3C)Checks whether general XML markup in translation is well-formed, and possibly also whether entities are defined. Checks can be performed either only when the original text itself is valid or unconditionally.
markup/check-docbook4
(S3C), markup/check-docbook4-sp
(V3C), markup/check-docbook4-msg
(V4A), markup/check-html
(S3C), markup/check-html-sp
(V3C), markup/check-qtrich
(S3C), markup/check-qtrich-sp
(V3C), markup/check-kde4
(S3C), markup/check-kde4-sp
(V3C), markup/check-pango
(S3C), markup/check-pango-sp
(V3C)Specializations of markup/check-xml
hook for various XML formats. Aside from well-formedness, these hooks can also check whether used tags really exist in the format, whether tags are properly nested, etc. (Full conformance to DTD or schema cannot be checked due to chunking into messages.)
markup/check-xmlents
(S3C), markup/check-xmlents-sp
(V3C)Checks whether XML-like entities (&foo;
) are defined. This can be used when the markup is not truly XML-like but uses XML-like entities, or simply to have separate checking of tagging (by markup/check-xml-*
hooks) and entities for convenience.
noop/text
(F1A), noop/textm
(F3A), noop/msg
(F4A), noop/hdr
(F4B), noop/cat
(F5A), noop/path
(F6A)Filtering hooks that do nothing ("no-operation"). These are useful in contexts where a filtering hook is required, but the input should not actually be modified.
normalize/demangle-srcrefs
(F4A)In some message extraction scenarios, the source references end up pointing to dummy files which existed only during the extraction, but true source references can still be reconstructed (based on dummy file names or extracted comments). This hook will reconstruct true source references and replace dummy references with them.
normalize/uniq-source
(F4A)Sometimes source references in PO message end up doubled (e.g. one prefixed with ./
and the other not) due to peculiarities of the extraction process. This hook will make source references unique.
normalize/uniq-auto-comment
(F4A)When extracted comments are automatically added to messages by the extraction tool, if the message is repeated in several source files it may end up containing multiple equal extracted comments. This hook can be used to make extracted comments unique (either all or those matching some criteria).
normalize/canonical-header
(F4B)Rearranges content of the PO header into canonical form. For example, translator comments will be sorted according to years of contribution, any repeated translator comments will be merged, etc.
remove/remove-accel-text
(F3A), remove/remove-accel-text-greedy
(F3A), remove/remove-accel-msg
(F4A), remove/remove-accel-msg-greedy
(F4A)Removes the accelerator marker from one or all strings in the message. These hooks check whether the PO file specifies the accelerator marker; if not, the non-greedy variants will do nothing, while the greedy variants will remove everything that is frequently used as an accelerator marker.
remove/remove-markup-text
(F3A), remove/remove-markup-msg
(F4A)Converts markup (e.g. XML tags) in one or all strings in the message to plain text. The PO file will be asked for the expected markup types in text; if no markup type is specified, these hooks will do nothing.
remove/remove-fmtdirs-text
(F3A), remove/remove-fmtdirs-text-tick
(F3A), remove/remove-fmtdirs-msg
(F4A), remove/remove-fmtdirs-msg-tick
(F4A)Removes format directives in one or all strings in the message, or replaces them with a fixed placeholder. The type of format directives is determined by *-format
message flags.
remove/remove-literals-text
(F3A), remove/remove-literals-text-tick
(F3A), remove/remove-literals-msg
(F4A), remove/remove-literals-msg-tick
(F4A)Removes "literal" segments from one or all strings in the message, or replaces them with a fixed placeholder. Literal segments are those which are used as computer input somewhere along the line, such as URLs, email addresses, command line options, etc. and therefore generally do not conform to human language rules. The translator can also explicitly declare literal segments, by adding a special translator comment.
remove/remove-marlits-text
(F3A), remove/remove-marlits-msg
(F4A)remove/remove-literals-*
hooks can positively determine only certain types of literals based on the text alone. If the text contains semantic markup, such as Docbook, literal segments can also be determined based on tags, and these hooks will remove both such tags and their text. The markup type will be taken from the PO file. (When these hooks are used, remove/remove-literals-*
is not needed.)
remove/rewrite-msgid
(F4A)Checks are sometimes defined such that something is first looked up in the original text, and if it is found, something is expected in the translation. No matter how well written these checks are, the original text will sometimes be a bit out of the ordinary, and the check will fail the translation although everything is fine. This can usually be corrected by the translator manually adding a directive, in a special translator comment, to "rewrite" the problematic part of the original before the check is applied.
remove/rewrite-inverse
(F4A)The original text in the message needs to be modified for the same reasons as described in remove/rewrite-msgid
, but it is actually easiest to replace the original text entirely with the original text from another message sharing the same translation (i.e. by "inverse" pairing of messages over translation).
remove/remove-paired-ents
(F4A), remove/remove-paired-ents-tick
(F4A)Removes all XML-like entities (&foo;
) from the original text, and all XML-like entities from the translation that were encountered in the original. This may be useful prior to markup validity checks, when the list of defined entities cannot be provided.
spell/check-spell
(S3A), spell/check-spell-sp
(V3A)Spell-checking hooks, as one element of Pology's spell-checking functionality.
uiref/resolve-ui
(F3C), uiref/resolve-ui-docbook4
(F3C), uiref/resolve-ui-kde4
(F3C)When translating program documentation, using these hooks it is possible to leave UI references (button labels, menu items, etc.) untranslated and let them be automatically inserted into translation later on. The basic hook requires UI references to be manually wrapped in translation in order to be detected, while specialized versions will also use semantic markup for detection (e.g. <guilabel>
element in Docbook).
uiref/check-ui
(V3C), uiref/check-ui-docbook4
(V3C), uiref/check-ui-kde4
(V3C)While uiref/resolve-ui
hooks will complain when they cannot find a translation for a UI reference; when checking the overall validity of translation, it is more convenient to use specialized check-only hooks, which will not modify the PO file on successfully resolved UI references.
ja:katakana
(F1A)Removes everything but Katakana words from Japanese text, and separates retained words with spaces. (Used as filter prior to spell-checking words in Katakana.)
nn:exclusion/inofficial-forms
(V3C)Checks if there are any unofficial word forms in the Norwegian Nynorsk translation.
sr:accents/resolve-agraphs
(F1A)Converts "accent graphs" to proper accented letters in Serbian Cyrillic text (e.g. ^а
becomes а̂
).
sr:accents/remove-accents
(F1A)Replaces accented letters in Serbian Cyrillic text with their non-accented counterparts. (Useful as filter prior to spell-checking.)
sr:charsets/limit-to-isocyr
(F1A), sr:charsets/limit-to-isolat
(F1A)In situations where it is necessary to use an 8-bit encoding instead of Unicode for Serbian text, these hooks can be used to constrain characters in text to only those representable by the target 8-bit encoding.
sr:checks/naked-latin
(V3C), sr:checks/naked-latin-origui
(V3C), sr:checks/naked-latin-se
(S3C), sr:checks/naked-latin-origui-se
(S3C)In translations into Serbian using Cyrillic script, ordinary segments in Latin script may indicate error or omission in translation. These hooks will look for such stray Latin segments, while ignoring recognizable literal segments such as URLs, commands, options, etc.
sr:nobr/to-nobr-hyphens
(F1A)The ordinary hyphen (-) is normally treated as a character on which the text can be split into the next line. In Serbian texts, hyphens are sometimes used to attach case endings to nouns (especially acronyms), which should not be split into the next line. This hook guesses such positions and replaces the ordinary hyphen with a no-break hyphen.
sr:reduce/words-ec
(F1A), sr:reduce/words-ec-lw
(F1A), sr:reduce/words-ic
(F1A), sr:reduce/words-ic-lw
(F1A), sr:reduce/words-ic-lw-dlc
(F1A)Various reductions of Serbian text to a subset of words of certain type, possibly rearranged in a particular way.
sr:trapres/froments
(F3C), sr:trapres/froments-t1
(F3C), sr:trapres/froments-t1db
(F3C)Hooks which resolve grammatical inserts in form of XML entities in Serbian text, based on the "trapnakron" contained within Pology. See the documentation in Serbian section for details.
sr:uiref/mod_entities
(F1A)When UI references are automatically resolved in documentation, and the UI texts may contain grammatical inserts in form of XML entities, these inserts may need to be slightly modified to keep the documentation structure valid.
sr:wconv/ctol
(F1A), sr:wconv/cltoa
(F1A), and many more. Hooks for various transliterations and hybridizations of Serbian text, by script (Cyrillic, Latin) and dialect (Ekavian, Ijekavian). See the documentation in Serbian section for details.
kde%header/equip-header
(F4B)Adds assorted header fields to PO files within the KDE Translation Project, with values based on their name and position in the repository tree, so that Pology and other tools are better informed how to process them.
[Not implemented yet.]
See Section 11.4, “Writing Hooks” for instructions on how to write and contribute hooks.
With all the different heuristic checks and rules that Pology can apply, false positives -- messages proclaimed invalid when they are actually valid -- are inevitable. False positives are very inconvenient in a serious automatic quality control effort. They make it harder for translators to spot real problems, which in turn demotivates them from applying automatic checks at all. If there are one or a few dedicated persons in the translation team who tweak and apply automatic checks, they will be particularly hard-hit by this negative feedback. False positives can reduce automatic quality control from a strong normative element in the workflow to a merely advisory "run-if-you-have-the-time" extra.
For this reason, most checks in Pology provide a way to be disabled on certain messages, files, or the whole processing batch, such that it is possible to methodically cancel false positives. Conversely, it is usually possible to run one or a few checks on their own, so that they are easier to define and debug. Each checking tool and element documents such functionality; in the following, only some general patterns are described.
The simplest method to disable or enable some checks is "dynamic": for a single validation run, through an option to the tool being run. For example, the check-rules sieve provides several parameters to select and deselect the validation rules to be applied. The important point here is that checks in Pology usually have some sort of a unique identifier, a keyword, by which they can be referred to.
"Static" methods to disable or enable checks are those where the instruction is written down somewhere, in a specific format, and automatically taken into account by the validation tool in subsequent runs. There may be several static methods to disable a certain check, differing in their reach: a group of PO files, single PO file, single message, or even a part of the text in the message. Within one PO file, the following methods are common:
The PO header is a natural place to disable or enable checks for the complete PO file, by adding a custom X-
header field.
On the single message level, the only place where it is possible to add a manual processing instruction is a translator comment. This is because if it were put anywhere else (e.g. as an extracted comment or a flag), it would be removed on subsequent merging with the template. These instructions are usually kept simple, like this:
# some-instruction: arguments
#: ...
msgid "..."
msgstr "..."
Instructions are always composed of two or more words, separated by hyphens, ended by a colon, and followed by an arbitrary argument string (e.g. a list of identifiers of checks to skip on this message). This makes it sufficiently unlikely that another, free-form translator comment will be accidentally interpreted as a known instruction.[44]
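Such instruction comments can be recognized with a simple pattern; the following Python sketch is a hypothetical helper (the instruction name skip-check is made up for illustration), not part of Pology's API:

```python
import re

# Two or more hyphen-separated words, a colon, then an arbitrary
# argument string, e.g. "# skip-check: id1,id2" (made-up instruction).
INSTR_RX = re.compile(r"^#\s*([a-z]+(?:-[a-z]+)+):\s*(.*)$")

def parse_instruction(comment):
    m = INSTR_RX.match(comment)
    return (m.group(1), m.group(2)) if m else None
```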
A special type of translator comment with processing instructions is a comment of the following form:
# |, flag1, flag2, ...
This is a "translator flag" comment, which is used to set processing instructions too simple to occupy one whole comment line (e.g. those of the switch type, never needing arguments). It starts with |, and continues with a comma-separated list of flag-like keywords.
[39] The other frequently encountered file organization is when there is one directory per PO domain, and that directory contains PO files for all languages, named as LANG.po.
[40] But if several people are working on a collection of PO files, it is nevertheless good to agree on fixed wrapping. This is both friendly to those exposed to original wrapping, and to version control systems.
[41] This does mean that the case when the comma itself is the accelerator marker is not covered, but this case is beyond unlikely.
[42] For example, any Python function in Pology that takes one string and returns the modified version of that string can be considered an F1A hook!
[43] In the API documentation, the very first line of the function description will show if the function is a direct hook or a hook factory, the function header will list the inputs for a direct hook (which conform to the declared hook type) or the factory parameters for a hook factory, and the rest of the description will explain the operation of the hook and the meaning of factory parameters.
[44] Especially considering that free-form translator comments are more usually written in the language of the translation.
While each particular PO processing tool from Pology and other packages may be documented in itself, it may not always be obvious how to use these tools together. This chapter presents some scenarios where combined tool usage may increase the quality and efficiency of daily work on translation.
A PO compendium is simply a PO file which aggregates messages from many other normal PO files, usually all same-language PO files in a given translation project. It may aggregate only the messages currently present in project PO files, or additionally messages that were present once and are no longer. As such, the compendium can be regarded as an instance of a translation memory. This section explains how to create, update, and apply such a translation memory.
Imagine that the translator wants to start translating a PO file that was so far never translated, but which has content similar to some other, translated PO files. Perhaps it was even derived from those other PO files, by merging, splitting, etc. This means that many messages in the present PO file may have been translated already in some other PO file, or at least that very similar translated messages exist in other PO files. Since the translation memory (TM) contains all known translated messages, it can be used to automatically produce translated and fuzzy messages in the present PO file, significantly reducing translation effort. Matching against the TM can be performed either as the translator goes from message to message in the editor (if the editor has a TM feature), or at once for all messages (by a specialized command) before starting to go through messages in the editor.
In most non-PO based translation workflows, translation memories are crucial for efficiency. This is because most non-PO formats have no concept of merging with templates. Each new revision of the source material results in (an equivalent of) entirely empty translation files, and it is translator's duty to somehow bring old translations into the new context. A carefully maintained TM, with a corresponding matching tool, is the foremost way to do this.
In a PO-based translation workflow, merging with templates already provides most of what TM is essential for. In effect, the old PO file that is being merged can be considered as a TM for the new PO file that will become based on the new template. Even when PO files are renamed, merged, or split, if that is properly done, no translations will be lost. A TM for PO files is therefore useful mostly to smooth out glitches in translation maintenance procedures (e.g. a PO file improperly split).[45] Nevertheless, having a well maintained TM in the form of PO compendium cannot hurt, while providing for the (hopefully) rare situations where TM matching is actually needed.
Many dedicated PO editors will automatically maintain an internal TM, usually in a database format, into which they will scoop messages from all PO files that were opened in them. However, in a team environment, these internal TMs are inferior to a PO compendium. For one, different translators will have different TMs; a translator may start to work on a file for which there are TM matches in another translator's internal TM. Internal TMs may be volatile, for example corrupted due to an editor bug, or perish during system maintenance. There is no control over which messages are scooped by the editor, and how they are treated (e.g. which message parts are being ignored).
On the other hand, a PO compendium can be maintained in a central place, and, being a PO file in itself, kept in version control just like all other PO files. In this way, all translators have fast access to a unified TM, which is secured from accidental corruption. Tight control over which messages are collected and how they are collected may be asserted, in the script which is written to update the compendium. This script can be made to run periodically, and to automatically commit the updated compendium to the version control repository.
As a first attempt, the PO compendium can be created simply by concatenating all PO files in the project into one called compendium.po
, using msgcat. If PO files are organized by language (all PO files of a given language kept in directory of that language), then the concatenation command would be:
$ cd $LANGDIR
$ find -iname \*.po | xargs msgcat -o compendium.po
Unfortunately, a compendium created in this way has a number of drawbacks:
Aside from translated messages, the compendium will also contain untranslated and fuzzy messages. While untranslated messages are obviously dead weight, a case could be made for taking in fuzzy messages. But in light of the suggested usage of the compendium in the following section, fuzzy messages too should be ignored.
Messages in the compendium will contain all parts as normal messages do. Some of these parts (such as source references) are unnecessary, since they will be ignored when applying the compendium later. Other than increasing the size of the compendium, another problem with these parts is that changes in them will cause unnecessary version control differences, so they should be stripped from the compendium.
Messages will be ordered as they are seen in concatenated PO files. The ordering of messages in the compendium is also of no importance for application. But, any changes in message ordering between two compendium updates will cause unnecessary version control differences, so it is best to sort messages by their keys (msgid
and msgctxt
fields).
When two or more PO files contain the same message by key (msgid
and msgctxt
) but with different translations (due to context), such as:
msgid "Open File"
msgstr "Otvori datoteku"

msgid "Open File"
msgstr "Otvaranje datoteke"
msgcat will aggregate translations (and translator comments if any) in the compendium message, and make it fuzzy:
#, fuzzy
msgid "Open File"
msgstr ""
"#-#-#-#-# alpha.po (alpha-1.2.9) #-#-#-#-#\n"
"Otvori datoteku\n"
"#-#-#-#-# bravo.po (bravo-0.8.12) #-#-#-#-#\n"
"Otvaranje datoteke"
Since the context should be double-checked anyway when applying the compendium later (especially for short messages), it is better to instead pick one of the translations and have a normal translated compendium message. If each translation appears only once, then it does not matter which is picked; but if one translation appears 10 times and the other once, clearly the former should be picked. That is, the most frequent translation should be picked.
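The resolution to the most frequent translation amounts to a frequency count; here is a sketch of the idea in Python (an illustrative helper, not the actual resolution code):

```python
from collections import Counter

def pick_most_frequent(variants):
    # variants: all translations collected for one message key.
    return Counter(variants).most_common(1)[0][0]

best = pick_most_frequent(["Otvori datoteku"] * 10 + ["Otvaranje datoteke"])
```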
The PO header is treated in the same way as messages by msgcat: since all headers have equal msgid
field (empty), their msgstr
fields will be aggregated. This too is just dead weight since the header is not used in applications of the compendium. Instead, a brief and informative header should be explicitly set (mentioning that this is a compendium PO file, for which project and language, etc).
In some translation projects, PO files frequently contain meta-messages, such as those where translators can add their names and contact addresses. These messages have the same key (msgid
) in all PO files, but should in general be translated differently, the more so the more people there are in the translation team. So it may be better to omit such messages from the compendium.
It must be noted that none of these problems are an actual deficiency of msgcat itself. Since its function is general concatenation of PO files, it cannot make any of the assumptions necessary for the present application. Instead, msgcat should be used as a part of a wider script, in which the necessary additional processing happens, tailored to the particular translation project and translation team.
Let us assume the following layout of the top directory for the translation project foo
and translation team (language) nn
:
foo-nn/
    ui/
        alpha.po
        bravo.po
        ...
    doc/
        alpha.po
        bravo.po
        ...
    update-compendium-foo-nn.sh
    compendium-foo-nn.po
update-compendium-foo-nn.sh
will be the script to create or update the compendium, compendium-foo-nn.po
the compendium itself. It helps clarity to add the project name and language into names of these two files, because both are tailored to that project and that language. Taking into account the aforementioned drawbacks of a simple compendium made by msgcat and the suggested resolutions, update-compendium-foo-nn.sh
could look like this[46]:
#!/bin/sh
#
# Create the PO compendium of Foo in Nevernissian language.
#
# Usage:
#   update-compendium-foo-nn.sh [trim]
#
# The script can be called from anywhere, because PO paths are
# hardcoded within the script relative to its own location.
# If the 'trim' argument is not given (i.e. script is called
# without arguments), messages in the old compendium that are
# no longer found in project PO files are preserved in
# the new compendium; if 'trim' is given, they are removed.

# Directory where this script resides.
cmddir=`dirname $0`

# Paths of directories containing PO files, space-separated.
# (Make sure the compendium itself is not in here!)
podirs="$cmddir/ui $cmddir/doc"

# Path to the compendium.
comppo="$cmddir/compendium-foo-nn.po"

trim=$1

# If there is already a compendium, preserve it for later.
test -f $comppo && mv $comppo $comppo.old

# Collect PO files from given paths into a file.
find $podirs -iname \*.po | sort > polist

# Pre-process PO files in the project, creating temporary
# PO files named *.po.tmpcomp:
# - remove fuzzy and untranslated messages
# - declare obsolete messages non-obsolete
# - remove extracted comments, source references, flags
for pofile in `cat polist`; do
    msgattrib $pofile \
        --translated --no-fuzzy --clear-obsolete --force-po \
    | grep -v '^#[:.,]' > $pofile.tmpcomp
done

# Update file list to contain temporary PO files.
sed -i "s/$/.tmpcomp/" polist

# Reduce headers of temporary PO files to necessary minimum,
# proper header for the compendium will be added later.
posieve -q set-header -f polist \
    -srmallcomm \
    -sremoverx:'^(?!MIME-Version$|Content-Type$|Content-Transfer-Encoding$)'

# Create raw compendium from temporary PO files:
# - aggregate translations for repeated messages
# - sort messages by key
msgcat --sort-output --force-po -f polist -o $comppo

# Clean up temporary PO files and file list.
cat polist | xargs rm
rm polist

# Resolve aggregated messages to most frequent variant.
# It is safe to unfuzzy resolved messages, since at this point
# it is assured that only translated messages have been aggregated.
posieve -q resolve-aggregates $comppo -sunfuzzy

# Remove meta-messages which are found in many PO files but
# should in general be differently translated in each.
msggrep -v $comppo -o $comppo \
    -JFe 'NAME OF TRANSLATORS' \
    -JFe 'EMAIL OF TRANSLATORS' \
    -JFe 'ROLES_OF_TRANSLATORS' \
    -JFe 'CREDIT_FOR_TRANSLATORS'

# Set the compendium header.
# Use current date as revision date.
dtnow=`date '+%Y-%m-%d %H:%M%z'`
posieve -q set-header $comppo -screate \
    -stitle:'Compendium of Foo translation into Nevernissian.' \
    -sfield:'Project-Id-Version:compendium-foo-nn' \
    -sfield:"PO-Revision-Date:$dtnow" \
    -sfield:'Last-Translator:Simulacrum' \
    -sfield:'Language-Team:Nevernissian <l10n-nn@neverwhere.org>' \
    -sfield:'Language:nn' \
    -sfield:'Plural-Forms:nplurals=9; plural=n==1 ? ...'

# If the old compendium was preserved, add it to the new compendium
# in order to retain messages no longer found in the project
# (unless trimming was requested).
if test -f $comppo.old && test x"$trim" != xtrim; then
    msgcat --use-first --sort-output $comppo $comppo.old -o $comppo
    # ...old compendium must be the second argument, in order
    # not to override possibly updated translations of
    # existing messages in the project.
fi

# Test if new compendium is different from the old, with
# the exception of creation time. If they are the same,
# discard the new compendium.
if test -f $comppo.old; then
    for cpfile in $comppo $comppo.old; do
        grep -v '^"PO-Revision-Date:.*\\n"$' $cpfile > $cpfile.nrd
    done
    if cmp -s $comppo.nrd $comppo.old.nrd; then
        mv $comppo.old $comppo
    else
        rm $comppo.old
    fi
    rm $comppo.nrd $comppo.old.nrd
fi

# Canonically wrap the compendium.
msgcat $comppo -o $comppo

# All done.
This script should be called periodically to update the compendium, and the updated file committed, such that all translators automatically get it when they update their local repository copies. If after some (long) time the compendium becomes too big due to accumulation of old messages, running the script once with the trim
argument will cause all old messages to be dropped.
Translators who use a dedicated PO editor with internal TM should configure the editor to read the compendium into the internal TM. This may be done, for example, by including the compendium PO file (or the directory in which it resides) into editor's translation project paths. If the compendium is kept under version control, the editor should automatically update its internal TM from the compendium whenever the repository is updated and the editor started again. In this way, editor's internal TM becomes transient in nature, there being no problem if it gets corrupted or deleted.
When working on a particular PO file in a properly configured PO editor, as the translator jumps from one incomplete (untranslated or fuzzy) message to the next, whenever the message is similar to one or a few messages in the compendium (i.e. in the internal TM), the editor will offer those similar messages in some way. Ideally, for each similar message the editor should show not only the possible translation, but also the difference between the two original texts (that of the current message and that of the TM match). This allows the translator to quickly see how the offered translation should be adapted to fit the current original.
Dedicated PO editors may also offer batch application of the TM: when the PO file is opened, the translator executes a command which fills in all untranslated messages with matches from the TM, making some translated (on exact matches) and some fuzzy (on partial matches). However, simpleminded batch application of the TM should be considered dangerous. For one, an exact match on the original text does not guarantee that the offered translation fits the new context; especially short messages frequently need different translations. But the translator will simply jump over each batch-translated message and fail to see this. The other problem comes up if the material in the compendium is not sufficiently reviewed, in which case every match from the TM, even on long messages, should be at least casually reviewed by the translator. Thus, if there is no way to configure batch application to be less indiscriminate, it is best to avoid it altogether, or else the quality of translation may suffer.
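Why short messages are the weak spot of batch matching can be seen with a toy TM matcher, sketched here with Python's difflib (hypothetical data; real editors and tools use their own matching, this is only an illustration):

```python
import difflib

# Toy translation memory: original -> translation.
tm = {
    "Open File": "Otvori datoteku",
    "Open Folder": "Otvori fasciklu",
    "Close": "Zatvori",
}

def tm_offers(msgid, tm, cutoff=0.6):
    """Offer TM entries whose originals are similar to msgid,
    most similar first, together with the similarity ratio."""
    matches = difflib.get_close_matches(msgid, tm.keys(), n=3, cutoff=cutoff)
    return [(m, tm[m],
             round(difflib.SequenceMatcher(None, msgid, m).ratio(), 2))
            for m in matches]

offers = tm_offers("Open Files", tm)
# The top offer is a near-exact match on a short message; whether its
# translation actually fits the new context still needs a human check.
```

Note how a one-character difference on a short original yields a very high ratio; on long paragraph messages such high ratios are far stronger evidence that the translation can be reused.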
Translators who use a general text editor to work on PO files can still make use of the compendium. One option could be merging the PO file with its template in presence of the compendium, just before starting to work on it:
$ msgmerge alpha.po alpha.pot -C compendium.po --update --previous
The -C
option to msgmerge specifies the compendium from which to draw exact and partial matches, when there is no match in the PO file itself. This option can be repeated to add several compendia. The --update
option is to modify the PO file in place, rather than writing the merged PO file to standard output. The --previous
option is to get previous fields (#| ...
comments) on fuzzy messages. Unfortunately, this method is a command line version of the batch application of the TM in a dedicated PO editor, and suffers from the same problem of indiscriminate exact matches that the translator will later fail to check. Therefore it should not be used (at least not for general translation).
Fortunately, Pology provides the poselfmerge command, a wrapper around msgmerge with several options to mitigate the indiscriminateness of batch TM application. To avoid silent exact matches on short messages, the -W
/--min-words-exact
option can be used to set the minimum length of a message in words at which an exact match is accepted; shorter exact matches are made fuzzy instead. If every exact match should be checked by the translator, no matter the length of the message, there is the -x
/--fuzzy-exact
option to make all exact matches fuzzy.[47] These options have counterpart fields in the Pology user configuration, so that the translator does not have to remember to use them on every run. Note also that, as the name suggests, poselfmerge merges the PO file with itself, so the PO template is not used at all. See Section 7.2, “Self-Merging PO Files with poselfmerge” for details.
Dedicated PO editors provide not only direct editing enhancements (no dealing with PO format syntax, jumping through incomplete messages, automatic removal of fuzzy elements, etc), but also translation-oriented features like spell checking, translation memory collection and application, glossary suggestions, and, going beyond standalone PO files, translation project overview and statistics. Why would someone, in spite of this, prefer to work on PO files with a general text editor? There are various reasons. Some people do not like how elements of currently translated PO message are scattered all over the window (as is typical of many PO editors), out of eye focus, and some elements even not shown. Other people like to have modularity in the translation workflow, rather than relying on the PO editor for everything and accepting its limitations. Some people are simply well accustomed to their text editor and do not want a higher level editor "abstracting" the PO format for them.
When translating PO files with a general text editor, you will have to use some command line tools to achieve reasonable efficiency and quality.
Starting from the text editor itself, it should have several general text-editing features. Capable editors all have these features, but they should nevertheless be mentioned, so that you can look for them.
The most important feature is probably syntax highlighting, where special parts of the text are displayed in different color, weight, or slant. In a PO file, message field keywords (msgid
, msgstr
) should stand out from the text itself, text in comments should look different from the text in fields, internal text elements (e.g. markup tags) should be highlighted, etc. In this way you can quickly focus on what you should be editing, and on the surrounding context of the text. Syntax highlighting was originally introduced for various programming language source files, but has since spread to other types of structured text files; established editors should have syntax highlighting for PO files as well.
Capable editors usually provide special methods of navigating through the file, beyond simply scrolling up and down line by line or page by page. One particularly useful method is line bookmarking. Suppose that, while in the middle of editing a given line, you have to search through the PO file for something (e.g. how a certain phrase was translated earlier): you can then bookmark the line, search as much as you like, and return to the same line by jumping to the bookmark. Otherwise you would have to remember which line (by number) it was to jump back to it, or search for the text that you remember from that line.[48]
It will usually be possible to start the editor with one or more file paths as command-line arguments, to open those files at once. This is useful when a selection of PO files in need of some editing is determined by an external command, which writes out their paths. These paths can then be fed directly to the editor, rather than having to open them manually one by one (and possibly missing some) through editor's file dialog.
Having good statistics on a single or a group of PO files is necessary for estimating the translation effort, for example how much time should be allotted for updating the existing translation for impending next release of the source material. Pology's workhorse for computing statistics is the stats sieve of posieve.
Assume the following arrangement of PO files for language nn
and their templates:
l10n-nn/
    ui/
        alpha.po
        bravo.po
        ...
    doc/
        alpha.po
        bravo.po
        ...
l10n-templates/
    ui/
        alpha.pot
        bravo.pot
        ...
    doc/
        alpha.pot
        bravo.pot
        ...
If the current working directory is l10n-nn/
, to compute statistics on a single PO file, posieve can be executed like this:
$ posieve stats ui/alpha.po
This will display a table with message counts, word counts and character counts, as well as ratios to the total, per category of messages (translated, fuzzy, untranslated, obsolete). To have the same output for all PO files in the ui/
directory taken together, or in the whole project, respectively:
$ posieve stats ui/
$ posieve stats
Note that word count is a much better base for estimating the translation effort than message count.
When statistics are computed for several PO files (or a directory, or several directories full of PO files), it is frequently necessary to get statistics per file (or per directory). This is done by adding the byfile
or bydir
sieve parameter:
$ posieve stats -s byfile ui/
However, this will output one full table for each file, which may be a bit too much data to grasp. Instead, you can request a bar display, where each file is represented by a single-line bar. The bar shows either the number of messages or the number of words per category, depending on whether msgbar
or wbar
was issued. To get word bars per file in ui/
directory, you can execute:
$ posieve stats -s byfile -s wbar ui/
Fuzzy messages introduce some uncertainty in effort estimation. If the statistics show 50 fuzzy messages with 700 words, you cannot tell whether the changes in those messages are small (e.g. cleaned up style, punctuation), so that the translation can be quickly updated, or substantial (entirely new messages with only passing similarity to earlier ones), requiring heavy editing. For this reason the stats sieve provides the ondiff
parameter: for each fuzzy message the difference from the previous message is computed, and based on it a part of the word count is assigned to the translated category and the rest to the untranslated category (leaving nominally zero words in the fuzzy category). The result is that, for example, a PO file with many messages fuzzy due to punctuation changes will show in statistics as almost completely translated by word count.
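The idea behind this apportioning can be sketched in Python with difflib (a rough illustration of the principle, not the sieve's exact formula):

```python
import difflib

def split_fuzzy_words(previous, current):
    """Apportion the word count of a fuzzy message between the
    'translated' and 'untranslated' categories, based on how similar
    the current original text is to the previous one."""
    ratio = difflib.SequenceMatcher(None, previous, current).ratio()
    nwords = len(current.split())
    translated = int(round(ratio * nwords))
    return translated, nwords - translated

# A punctuation-only change: almost all words count as translated.
t, u = split_fuzzy_words("Blah blah blah.", "Blah blah blah!")  # (3, 0)
```

A fuzzy message caused by a trivial change thus contributes almost nothing to the untranslated word count, while a heavily rewritten message contributes most of its words.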
If the translation project is organized such that new empty PO files are not automatically derived from new PO templates, then, when running statistics over the language PO files only, templates without a counterpart PO file will not be counted at all, although they represent fully untranslated material. To have such templates counted, the two-argument templates
parameter can be issued: the first argument is a path segment of the language directory, and the second argument is what to replace it with to obtain the corresponding template directory path. With the translation project set up as above, this is how you would compute statistics on the ui/
directory while taking templates into account:
$ posieve stats -s templates:l10n-nn:l10n-templates ui/
The path replacement is always done on absolute paths, so in this example it is not a problem that the relative paths (ui/alpha.po
...) do not contain original and replacement segments.
The translation project may not be organized such that each language has its own top directory. Instead, language PO files may be grouped by application and PO domain, and named by language code:
project/
    alpha/
        po/
            aa.po
            bb.po
            ...
    bravo/
        po/
            aa.po
            bb.po
            ...
    ...
In this setup the stats sieve can still be run on directory paths as arguments, in order to get statistics on all PO files of a given language, by using the -I
/--include-path
option of posieve to single out the desired language. For example, to get statistics on all PO files of the nn
language in a single table:
$ posieve stats project/ -I 'nn.po'
or by file in form of message bars:
$ posieve stats -s byfile -s msgbar project/ -I 'nn.po'
The value of the -I
option is in fact a regular expression, and the option can be repeated, which allows fine-tuning the file selection when necessary.
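The effect of such path filtering can be sketched as follows (a simplified stand-in for posieve's option handling, matching each pattern anywhere in the path):

```python
import re

def include_paths(paths, patterns):
    """Keep only paths matched by at least one regular expression,
    searched anywhere in the path."""
    regexes = [re.compile(p) for p in patterns]
    return [p for p in paths if any(r.search(p) for r in regexes)]

paths = ["project/alpha/po/nn.po", "project/alpha/po/aa.po",
         "project/bravo/po/nn.po"]
selected = include_paths(paths, [r"nn\.po"])
# -> only the nn.po files, from both alpha/ and bravo/
```

Note that in a regular expression the dot matches any character, so escaping it (`nn\.po`) is slightly more precise than the literal `nn.po` used on the command line above.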
As for other statistics tools, Gettext's msgfmt with --statistics
option could be considered one (though it shows only counts of translated, fuzzy, and untranslated messages), and especially the pocount command from Translate Toolkit.
When a single PO file is to be translated from scratch, it is easy to just open it in the text editor and start translating messages one by one. More common, however, is translation maintenance, in which you need to go through a batch of freshly merged PO files and update the new untranslated and fuzzy messages. The problem is then twofold: how to efficiently determine which files need updating, and how to efficiently go through the messages that need updating within a file.
To see which PO files need to be updated, you can simply run the stats sieve with byfile
and msgbar
/wbar
parameters (and possibly ondiff
), as explained in the previous section. After that you would have to manually observe incomplete files and open them in the editor one by one, which is tedious and prone to oversight. Instead, you can also add the incompfile
parameter to stats, which will write paths of all incomplete PO files into a file. If PO files are organized as in the previous example, and you want to update translations in ui/
subdirectory, you would run:
$ posieve stats -s byfile -s wbar -s incompfile:toupdate.out ui/
Now toupdate.out
will contain the paths of incomplete files. If the editor can be started from the command line with a number of file path arguments, you can directly feed it toupdate.out
, e.g. by adding `cat toupdate.out`
to the editor command.
If the translation project is organized such that each new template results in a new empty PO file, you may wish to update only those PO files which were worked on before, i.e. those not entirely empty. For this you can add the mincomp
parameter, which sets the minimal completeness (the ratio of translated to total messages) at which to take a PO file into consideration, with a very small value:
$ posieve stats -s mincomp:1e-6 -s incompfile:toupdate.out ui/
1e-6
is short for 0.000001
, which means to take into consideration only those PO files in which more than one message in a million is translated. Since no PO file has a million messages, this effectively includes every PO file with at least one translated message in it.
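The selection logic can be sketched as follows (a hypothetical helper; the stats sieve computes completeness internally):

```python
def incomplete_files(stats, mincomp=1e-6):
    """Select PO files that were worked on before (completeness above
    mincomp) but are not yet fully translated. `stats` maps a file
    path to (translated, total) message counts."""
    selected = []
    for path, (translated, total) in sorted(stats.items()):
        completeness = translated / total if total else 0.0
        if mincomp < completeness < 1.0:
            selected.append(path)
    return selected

stats = {
    "ui/alpha.po": (120, 150),   # incomplete, worked on: selected
    "ui/bravo.po": (0, 80),      # entirely empty: skipped
    "ui/charlie.po": (60, 60),   # complete: skipped
}
to_update = incomplete_files(stats)  # ["ui/alpha.po"]
```

The tiny threshold thus cleanly separates "never touched" files (completeness exactly zero) from everything else, without having to special-case them.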
Once the incomplete PO files are open in the editor, to be able to jump through incomplete messages, you need to somehow use editor's search function. For fuzzy messages it is easy, you can just search for the , fuzzy
string. Untranslated messages, on the other hand, are more problematic. You may think of searching for msgstr ""
, but this would also find long wrapped messages:
msgid ""
"Blah blah blah [...]"
"blah blah."
msgstr ""
"Bla bla bla [...]"
"bla bla."
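A structure-aware check is needed to make the distinction that a plain-text search cannot. The following sketch (a hypothetical helper handling only simple singular messages, not plurals or obsolete entries) joins the quoted chunks of each field before testing for emptiness:

```python
import re

def untranslated_msgids(po_text):
    """Find msgids whose msgstr is genuinely empty. A plain text
    search for 'msgstr ""' cannot do this reliably, since it also
    hits wrapped multi-line translations."""

    def joined(field_match):
        # Concatenate the quoted chunks of a (possibly wrapped) field.
        return "".join(re.findall(r'"([^"\n]*)"', field_match.group(1)))

    result = []
    for block in re.split(r"\n\s*\n", po_text):  # blank-line separated
        mid = re.search(r'msgid((?:\s*"[^"\n]*")+)', block)
        mstr = re.search(r'msgstr((?:\s*"[^"\n]*")+)', block)
        if mid and mstr and joined(mstr) == "":
            result.append(joined(mid))
    return result

po_text = '''\
msgid "Open File"
msgstr ""

msgid ""
"Blah blah blah "
"blah blah."
msgstr ""
"Bla bla bla "
"bla bla."
'''
```

Here `untranslated_msgids(po_text)` returns only `["Open File"]`: the wrapped second message is correctly recognized as translated, even though its msgstr line looks empty in isolation.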
To make untranslated messages stand out unambiguously, there is the tag-untranslated sieve. It simply adds the untranslated
flag to all untranslated messages (but not to fuzzy unless explicitly requested), so that you can search for , untranslated
in the editor. It is most convenient to run tag-untranslated on the toupdate.out
file produced by stats using the -f
/--from-files
option:
$ posieve tag-untranslated -f toupdate.out
Fuzzy messages may be such only due to small changes in the original text, for example a single word changed in a paragraph-length message. This is not so easy to see by manually comparing the original and the translation. However, since fuzzy messages should have the previous original text in comments (if merged with --previous
option of msgmerge), it is possible to automatically embed differences into those comments with the diff-previous sieve; see its documentation for an example. You should run this sieve on toupdate.out
as well:
$ posieve diff-previous -f toupdate.out
Your editor may even highlight the difference segments added to the previous original text, making them stand out quite clearly.
Since normally you want both to mark untranslated messages and to add differences to fuzzy messages before going through PO files, you can run the two sieves at once:
$ posieve tag-untranslated,diff-previous -f toupdate.out
As you go through incomplete messages and update the translation, you should remove any fuzzy
or untranslated
flags, and previous fields in #| ...
comments, so that in the end you can commit (upload, send) clean updated PO files. But sometimes it will happen that you realize that you do not have enough time to update everything, and you want to commit what you have completed by that moment. The problem is that there will still be some untranslated
flags and embedded differences remaining throughout the files, and leftover embedded differences would e.g. interfere with subsequent merging. To automatically remove these remaining elements, you simply run the two sieves with the strip
parameter:
$ posieve tag-untranslated,diff-previous -s strip -f toupdate.out
When you update a PO file, for the sake of clarity and copyright you should also update its header with your personal data (the author comment, the Last-Translator:
field, etc.). You could do this manually, but it is much simpler to set your data once in the Pology user configuration and run the update-header sieve over all updated files[49]:
$ posieve update-header -f toupdate.out
Summit and ascription workflows, described in Chapter 5, Summitting Translation Branches and Chapter 6, Ascribing Modifications and Reviews, fit excellently together. Ascription enables review-based release control on summit scatter (Section 5.3.7, “Filtering by Ascription on Scatter” shows how to do it), while the summit removes the need for separate ascription file trees per branch (and the associated effort at branch cycling). All the information needed to set up a summit with ascription is given in the chapters mentioned; the only thing left for this section is to show the order of actions and the resulting file structure, as implied by the technical requirements.
The first thing to set up is the summit. From the viewpoint of ascription, it is not important which summit mode is used; indeed, while the direct summit is still not advised, putting ascription on top would alleviate some of its disadvantages. In the following the summit over dynamic templates is assumed, because it is a bit less involved than the summit over static templates, but nevertheless demonstrates all important points.
After configuring and initializing the summit over dynamic templates, let the summit top directory only (that is, omitting branches) look like this:
l10n-nn/
    summit/
        foo-module/
            alpha.po
            bravo.po
            ...
        bar-module/
            kilo.po
            lima.po
            ...
        ...
    summit-config
PO files in the summit are shown split into several submodules for generality. Unlike in the chapter on summit, the summit directory is placed here within a parent language directory, and the summit configuration file summit-config
in the parent directory instead of the summit directory. This is in order to have a clearer structure when the ascription is added.
The ascription is set up after the summit, such that it takes only the summit directory into account, having nothing to do with branches. After the ascription is configured and initialized, the summit with ascription tree should look like this:
l10n-nn/
    summit/
        foo-module/
            alpha.po
            bravo.po
            ...
        bar-module/
            kilo.po
            lima.po
            ...
        ...
    summit-ascript/
        foo-module/
            alpha.po
            bravo.po
            ...
        bar-module/
            kilo.po
            lima.po
            ...
        ...
    ascription-config
    summit-config
Here the ascription tree root is set to summit-ascript/
in the ascription configuration file ascription-config
. With this, setting up the summit with ascription workflow is completed.
In some circumstances you may want to have several separate summits with unified ascription. This may be the case, for example, when the translation project is such that the user interface and documentation PO files are put into separate file trees in branches, and most paired UI-documentation PO files have the same names.[50]
The parent language directory in this scenario, with summits and ascription set up, could look like this:
l10n-nn/
    summit/
        ui/
            foo-module/
                alpha.po
                bravo.po
                ...
            bar-module/
                kilo.po
                lima.po
                ...
            ...
            summit-config
        doc/
            foo-module/
                alpha.po
                bravo.po
                ...
            bar-module/
                kilo.po
                lima.po
                ...
            summit-config
    summit-ascript/
        ui/
            foo-module/
                alpha.po
                bravo.po
                ...
            bar-module/
                kilo.po
                lima.po
                ...
            ...
        doc/
            foo-module/
                alpha.po
                bravo.po
                ...
            bar-module/
                kilo.po
                lima.po
                ...
            ...
    ascription-config
Note here the location of summit-config
files: each is within its own summit directory, which are summit/ui/
and summit/doc/
. On the other hand, there is a single ascription-config
file, which covers all summits. This means that summit operations (merging, scattering) must be performed from within their respective summit directories (since posummit searches through the parent directories for the first summit-config
file), while ascription operations can be performed from anywhere.
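The upward search for the nearest configuration file can be sketched like this (an illustration of the lookup order only, not posummit's actual code):

```python
import os

def find_summit_config(start_dir, name="summit-config"):
    """Walk up from start_dir toward the filesystem root and return
    the path of the first configuration file found, or None."""
    d = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:  # reached the filesystem root
            return None
        d = parent
```

Started anywhere inside summit/ui/, such a lookup finds summit/ui/summit-config before it could ever reach a configuration file placed higher up, which is exactly why each summit directory can carry its own configuration.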
Having unified ascription is especially convenient in centralized summit maintenance, since translators and reviewers are concerned only with ascription (running poascribe to commit, select for review, etc.) regardless of how many summits there are.
[45] To be sure, some short messages can be quite similar in many unrelated PO files. But having TM matches only on such messages will result in very small time savings, if measurable at all.
[46] At one point, the script creates a temporary PO file for each original PO file, and then calls msgcat on these temporary files to create the first, raw compendium. These temporary files have fuzzy and untranslated messages removed, and some other adjustments made, before concatenation. One could think that all these adjustments could instead be done on the raw compendium. The problem is that then there would be no unambiguous way to tell which fuzzy messages in the raw compendium were fuzzy to begin with, and which were made fuzzy by msgcat due to aggregation of translations. With fuzzy messages removed prior to concatenation, in the de-aggregation by frequency that follows it is known that exactly the messages with fuzzy flags are the aggregated ones.
[47] The translator can still see when the match was exact, because normal fuzzy messages will have previous fields and fuzzied exact matches will not.
[48] One trick is also to hit undo once, which will normally skip to the line in which the last modification was made, and then hit redo to recover the modification.
[49] If you use ascription, you should instead tell poascribe to update headers for you when committing. This is done by adding update-headers = yes
to [poascribe]
section in user configuration.
[50] On the other hand, you may still have a unified summit, by defining a path transformation in summit configuration to disambiguate UI and documentation PO files sharing the same domain name.
You may find it odd that the user manual contains a section on programming, as that is normally material for a separate, programmer-oriented document. On the other hand, while reading the "pure user" sections of this manual, you may have noticed that in Pology the distinction between a user and a programmer is blurrier than one would expect of a translation-related tool. Indeed, before getting into writing standalone Python programs which use the Pology library, there are many places in Pology itself where you can plug in some Python code to adapt the behavior to your language and translation environment. This section exists to support and stimulate such interaction with Pology.
The Pology library is quite simple conceptually and organizationally. It consists of a small core abstraction of the PO format, and a lot of mutually unrelated functionality that may come in handy in particular translation processing scenarios. Everything is covered by the Pology API documentation, but since API documentation tends to be non-linear and full of details obstructing the bigger picture, the following subsections are there to provide synthesis and rationale of salient points.
The PO format abstraction in Pology is a quite direct and fine-grained reflection of PO format elements and conventions. This was a design goal from the start; no attempt was made at a more general abstraction which would tentatively support various translation file formats.
There is, however, one glaring but intentional omission: multi-domain PO files (those which contain domain "..."
directives) are not supported. We had never observed a multi-domain PO file in the wild, nor thought of a significant advantage it could have today over multiple single-domain PO files. Supporting multi-domain PO files would mean not only always needing two nested loops to iterate through messages in a PO file, but it would also interfere with higher levels in Pology which assume equivalence between PO files and domains. Pology will simply report an error when trying to read a multi-domain PO file.
Because the PO abstraction is intended to be robust against programming errors when quickly writing custom scripts, and frugal with file modifications, by default some of the abstracted objects are "monitored". This means that they are checked for expected data types and have modification counters. The main monitored objects are PO files, PO headers, and PO messages, but also those of their attributes which are not plain data types (strings or numbers). For the moment, these secondary monitored types include Monlist
(the monitored counterpart to built-in list), Monset
(counterpart to set), and Monpair
(like two-element tuple). Monitored types do not in general provide the full scope of functionality of their built-in counterparts, so sometimes it may be easier (and faster) to work with built-in types and convert them to monitored at the moment of adding to PO objects.
To take a Monlist
instance as an example, here is how it behaves on its own:
>>> from pology.monitored import Monlist
>>> l = Monlist([u"a", u"b", u"c"])
>>> l.modcount
0
>>> l.append(10)
>>> l
Monlist([u"a", u"b", u"c", 10])
>>> l.modcount
1
>>>
Appending an element has caused the modification counter to increase, but, as expected, it was possible to add an integer in spite of previous elements being strings. However, if the monitored list comes from a PO message:
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgstr
Monlist([])
>>> msg.msgstr.append(10)
Traceback (most recent call last):
...
pology.PologyError: Expected <type 'unicode'> for sequence element, got <type 'int'>.
>>> msg.msgstr.append(u"bar")
>>> msg.msgstr.modcount
1
>>> msg.modcount
1
The Message
class has type constraints added to its attributes, and therefore addition of an integer to the .msgstr
list was rejected: only unicode values are allowed. This is particularly important because the basic string type in Python is the raw byte array str[51], so the constraint automatically prevents carelessness with encodings. Once a proper string was added to the .msgstr
list, its modification counter increased, and so did the modification counter of the parent object.
A few more notes on modification counters. Consider this example:
>>> msg = Message()
>>> msg.msgstr = Monlist(u"foo")
>>> msg.msgstr.modcount
0
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
>>> msg.msgstr[0] = u"foo"
>>> msg.msgstr.modcount
0
>>> msg.msgstr = Monlist(u"foo")
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
Monlist(u"foo")
itself is a fresh list with modification counter at 0, so after it was assigned to msg.msgstr
, its modification counter is still 0. However, every attribute of a parent monitored object also has the associated attribute modification counter, denoted with trailing _modcount
; therefore msg.msgstr_modcount
did increase on assignment, and so did the parent msg.modcount
. Modification tracking actually checks for equality of values, so when same-valued objects are repeatedly assigned (starting from msg.msgstr[0] = u"foo"
above), modification counters do not increase.
Compound monitored objects may also have the attributes themselves constrained, to prevent typos and other brain glitches from causing mysterious wrong behavior when processing PO files. For example:
>>> msg = Message()
>>> msg.msgtsr = Monlist(u"foo")
Traceback (most recent call last):
...
pology.PologyError: Attribute 'msgtsr' is not among specified.
>>>
You may conclude that modification tracking and type and attribute constraining slow down processing, and you would be right. Since PO messages are by far the most processed objects, a non-monitored counterpart to Message is provided as well, for occasions when the code only reads PO files, or has been sufficiently tested, and speed is important. See Section 11.1.2, “Message” for details.
PO messages are by default represented with the Message class. It is monitored for modifications, and constrained on attributes and attribute types. It provides direct attribute access to the parts of a PO message:
>>> from pology.monitored import Monpair
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgid = u"Foo %s"
>>> msg.msgstr.append(u"Bar %s")
>>> msg.flag.add(u"c-format")
>>> msg.fuzzy = True
>>> print msg.to_string(),
#, fuzzy, c-format
msgid "Foo %s"
msgstr "Bar %s"
>>>
Attribute access provides the least hassle, while being guarded by monitoring, and makes the semantics of particular message parts clear. For example, the .flag attribute is a set, to indicate that the order of flags should be of no importance to either a human translator or a PO processor, and the .msgstr attribute is always a list, to prevent the programmer from overlooking plural messages. While the fuzzy state is formally indicated by a flag, it is considered special enough to have a separate attribute.
Some message parts may or may not be present in a message. When they are not present, the corresponding attributes are either empty if sequences (e.g. the .manual_comment list for translator comments), or set to None if strings[52] (e.g. .msgctxt).
There are also several derived, read-only attributes for special purposes. For example, if in some context the messages are to be tracked in a dictionary by their keys, there is the .key attribute, which is an undefined but unique combination of the .msgctxt and .msgid attributes. Or, there is the .active attribute, which is True if the message is neither fuzzy nor obsolete, i.e. its translation (if there is one) would be used by the consumer of the PO file that the message is part of.
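The two derived attributes can be illustrated with a toy class (hypothetical code mirroring the attribute names above, not the real pology.message.Message):

```python
# Minimal sketch: .key must merely be unique per (msgctxt, msgid) pair,
# and .active means "neither fuzzy nor obsolete".
class TinyMessage:

    def __init__(self, msgctxt=None, msgid="", fuzzy=False, obsolete=False):
        self.msgctxt = msgctxt
        self.msgid = msgid
        self.fuzzy = fuzzy
        self.obsolete = obsolete

    @property
    def key(self):
        # A plain tuple is one valid "undefined but unique" combination.
        return (self.msgctxt, self.msgid)

    @property
    def active(self):
        return not self.fuzzy and not self.obsolete

m1 = TinyMessage(msgid="Foo")
m2 = TinyMessage(msgctxt="menu", msgid="Foo", fuzzy=True)
```

Note how the context disambiguates otherwise identical msgid values, which is exactly why the key must combine both fields.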
Message has a number of methods for frequent operations that need to read or modify more than one attribute. For example, to thoroughly unfuzzy a message, it is not sufficient to just remove its fuzzy flag (by setting .fuzzy to False or removing u"fuzzy" from the .flag set): previous field comments (#| ...) should be removed as well, and this is what the .unfuzzy() method does:
>>> print msg.to_string(),
#| msgid "Foubar"
#, fuzzy
msgid "Foobar"
msgstr "Fubar"
>>> msg.unfuzzy()
>>> print msg.to_string(),
msgid "Foobar"
msgstr "Fubar"
Other methods include those to copy over a subset of parts from another message, to revert the message to pristine untranslated state, and so on.
There exists a non-monitored counterpart to Message, the MessageUnsafe class. Its attributes are of built-in types (e.g. .msgstr is a plain list), and there is no type or attribute checking. By using MessageUnsafe, a speedup of 50% to 100% has been observed in practical applications, so it makes a good trade-off when you know what you are doing (e.g. you are certain that no modifications will be made). A PO file is opened with non-monitored messages by passing the monitored=False argument to the Catalog constructor.
Read-only code should work with Message and MessageUnsafe objects without any type-based specialization. Code that writes may need some care to achieve the same, for example:
def translate_moo_as_mu (msg):
    if msg.msgid == u"Moo!":       # works for both
        msg.msgstr = [u"Mu!"]      # raises exception if Message
        msg.msgstr[:] = [u"Mu!"]   # works for both
        msg.msgstr[0] = u"Mu!"     # works for both (when not empty)
If you need to create an empty message of the same type as another message, or make a same-type copy of the message, you can use the type built-in:

newmsg1 = type(msg)()     # create empty
newmsg2 = type(msg)(msg)  # copy
Message and MessageUnsafe share the virtual base class Message_base, so you can use isinstance(obj, Message_base) to check if an object is a PO message of either type.
The PO header could be treated as just another message, but that would be both inconvenient for operating on it and disruptive in iteration over a catalog. Instead, the Header class is introduced. Similarly to Message, it provides both direct attribute access to parts of the header (like the .field list of name-value pairs) and methods for usual manipulations which would otherwise need a sequence of basic data manipulations (like .set_field() to either modify an existing header field or add a new one with the given value).
In particular, header comments are represented by a number of attributes (.title, .author, etc.), some of which are strings and some lists, depending on semantics. Unfortunately, the PO format does not define this separation formally, so when the PO file is parsed, comments are split heuristically (.title will be the first comment line, .author will get every line which looks like it has an email address and a year in it, etc.).
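The described heuristic could be sketched like this (an assumed approximation for illustration, not Pology's actual parsing code):

```python
import re

def split_header_comments(lines):
    # First comment line becomes the title; lines that look like
    # "Name <email>, 2009." are classified as author lines.
    title = lines[0] if lines else None
    authors = [ln for ln in lines[1:]
               if re.search(r"<\S+@\S+>", ln)
               and re.search(r"\b(19|20)\d\d\b", ln)]
    return title, authors

title, authors = split_header_comments([
    "Translation of foo.po into Norwegian.",
    "Ola Nordmann <ola@example.org>, 2009.",
    "This file is distributed under the same license as the foo package.",
])
```

Heuristics like this can misclassify unusual comments, which is precisely why the manual warns that the split is not formally defined by the PO format.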
Header is a monitored class just like Message, but unlike Message it has no non-monitored counterpart. This is because in practice header operations make up a small part of total processing, so there is no real advantage in having non-monitored headers.
PO files are read and written through Catalog objects. A small script to open a PO file on disk (given as the first argument), find all messages that contain a certain substring in the original text (given as the second argument), and write those messages to standard output, would look like this:
import sys
from pology.catalog import Catalog
from pology.msgreport import report_msg_content

popath = sys.argv[1]
substr = sys.argv[2]
cat = Catalog(popath)
for msg in cat:
    if substr in msg.msgid:
        report_msg_content(msg, cat)
Note the minimalistic code, both in raw length and access interface. Instead of using something like print msg.to_string() to output the message, already in this example we introduce the msgreport module, which contains various functions for reporting on PO messages;[53] report_msg_content() will first output the PO file name and the location of the message (line and entry number) within the file, and then the message content itself, with some highlighting (for field keywords, fuzzy state, etc.) if the output destination permits it. Since no modifications are done to messages, this example would be just as safe but run significantly faster if the PO file were opened in non-monitored mode. This is done by adding the monitored=False argument to the Catalog constructor:
cat = Catalog(popath, monitored=False)
and no other modification is required.
When some messages are modified in a catalog created by opening a PO file on disk, the modifications will not be written back to disk until the .sync() method is called -- not even if the program exits. If the catalog is monitored and there were no modifications to it up to the moment .sync() is called, the file on disk will not be touched, and .sync() will return False (it returns True if the file is written).[54] In a scenario where a bunch of PO files are processed, this allows you to report only those which were actually modified. Take as an example a simplistic[55] script to search and replace in translations:
import sys
from pology.catalog import Catalog
from pology.fsops import collect_catalogs
from pology.report import report

searchstr = sys.argv[1]
replacestr = sys.argv[2]
popaths = sys.argv[3:]
popaths = collect_catalogs(popaths)
for popath in popaths:
    cat = Catalog(popath)
    for msg in cat:
        for i, text in enumerate(msg.msgstr):
            msg.msgstr[i] = text.replace(searchstr, replacestr)
    if cat.sync():
        report("%s (%d)" % (cat.filename, cat.modcount))
This script takes the search and replace strings as the first two arguments, followed by any number of PO paths. The paths do not have to be only file paths, but can also be directory paths, in which case the collect_catalogs() function from the fsops module will recursively collect any PO files in them. After the search and replace iteration through a catalog is done (msgstr being properly handled on plain and plural messages alike), its .sync() method is called, and if it reports that the file was modified, the file's path and the number of modified texts are output. The latter is obtained simply as the modification counter state of the catalog, since the counter was bumped up by one for each text that actually got modified. Note the use of the .filename attribute for illustration, although in this particular case the path was already available in the popath variable.
Syncing to disk is an atomic operation. This means that if you or something else aborts the program in the middle of execution, none of the processed PO files will become corrupted; they will either be in their original state, or in the expected modified state.
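One common way to obtain such atomicity (a general technique shown here under assumption, not Pology's actual implementation) is to write the new content to a temporary file in the same directory and then rename it over the original, since the rename replaces the target in a single step:

```python
import os
import tempfile

def atomic_write(path, content):
    # Create the temporary file next to the target, so the final
    # os.replace() stays within one filesystem.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmppath, path)  # atomic replacement of the target
    except BaseException:
        os.unlink(tmppath)         # leave the original file untouched
        raise

tmpdir = tempfile.mkdtemp()
po_path = os.path.join(tmpdir, "demo.po")
atomic_write(po_path, 'msgid ""\nmsgstr ""\n')
```

If the process dies before the rename, the original file is still intact; if it dies after, the new file is already complete.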
As can be seen, at its base the Catalog class is an iterable container of messages. However, the precise nature of this container is less obvious. To the consumer (a program or converter) the PO file is a dictionary of messages by keys (msgctxt and msgid fields); there can be no two messages with the same key, and the order of messages is of no importance. For the human translator, however, the order of messages in the PO file is of great importance, because it is one of the context indicators. Message keys are parts of the messages themselves, which means that a message is both its own dictionary key and the value. Taking these constraints together, in Pology the PO file is treated as an ordered set, and the Catalog class interface is made to reflect this.
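A toy model of such an ordered set (an illustration of the container semantics, not the real Catalog internals) keeps messages in order but lets a message with an already-present key overwrite the existing entry instead of duplicating it:

```python
class TinyCatalog:

    def __init__(self):
        self._msgs = []     # preserves translator-relevant order
        self._index = {}    # key -> position, for constant-time lookup

    def add_last(self, key, msg):
        if key in self._index:
            # Same key: overwrite in place, keeping the original position.
            self._msgs[self._index[key]] = msg
        else:
            # New key: append to the end.
            self._index[key] = len(self._msgs)
            self._msgs.append(msg)

    def __contains__(self, key):
        return key in self._index

    def __iter__(self):
        return iter(self._msgs)

cat = TinyCatalog()
cat.add_last((None, "Foo"), "translation 1")
cat.add_last((None, "Bar"), "translation 2")
cat.add_last((None, "Foo"), "translation 3")  # overwrites, keeps position
```

The auxiliary key-to-position index is also what makes mid-sequence insertion and removal costly, as discussed below for .add() and .remove().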
The ordered set nature of catalogs comes into play when the composition of messages, rather than just the messages themselves, is modified. For example, to remove all obsolete messages from the catalog, the .remove() method could be used:
for msg in list(cat):
    if msg.obsolete:
        cat.remove(msg)
cat.sync()
Note that the message sequence was first copied into a list, since the removal would otherwise clobber the iteration. Unfortunately, this code will be very slow (linear time wrt. catalog size), since when a message is removed, the internal indexing has to be updated to maintain both the order and quick lookups. Instead, the better way to remove messages is the .remove_on_sync() method, which marks the message for removal on syncing. This runs fast (constant time wrt. catalog size) and requires no copying into a list prior to iteration:
for msg in cat:
    if msg.obsolete:
        cat.remove_on_sync(msg)
cat.sync()
A message is added to the catalog using the .add() method. If .add() is given only the message itself, it will overwrite the message with the same key if one exists, or else insert the message according to source references, or append it to the end. If .add() is also given the insertion position, it will insert the message at that position only if a message with the same key does not already exist in the catalog; if it does, .add() will ignore the given position and overwrite the existing message. When the message is inserted, .add() suffers the same performance problem as .remove(): it runs in linear time. However, the common case where an empty catalog is created and messages are added one by one to the end can run in constant time, and this is what the .add_last() method does.[56]
The basic way to check whether a message with the same key exists in the catalog is to use the in operator. Since the catalog is ordered, if the position of the message is wanted, the .find() method can be used instead. Both of these are fast, running in constant time. There is a series of .select_*() methods for looking up messages by criteria other than the key; these run in linear time and return lists of messages, since the result may no longer be unique.
Since it is ordered, the catalog can be indexed, either by a position or by a message (whose key is used for lookup). To replace a message in the catalog with a message which has the same key but is otherwise different, you can either first fetch its position and then use it as the index, or use the message itself as the index:

# Indexing by position.
pos = cat.find(msg)
cat[pos] = msg

# Indexing by message key.
cat[msg] = msg
This leads to the following question: what happens if you modify the key of a message (its .msgctxt or .msgid attributes) in the catalog? In that case the internal index goes out of sync, rather than being automatically updated. This is a necessary performance measure. If you need to change message keys, while doing so you should treat the catalog as a pure list, using only iteration and positional indexing. Afterwards you should either call .sync() if you are done with the catalog, or .sync_map() to only update the indexing (and remove messages marked with .remove_on_sync()) without writing out the PO file.
The Catalog class provides a number of convenience methods which report things about the catalog based on the header information, without the need to manually examine the header. These include the number of plural forms, the msgstr index for the given plural number, as well as information important in some Pology contexts, like language code, accelerator markers, markup types, etc. Each of these methods has a counterpart which sets the appropriate value, but this value is not written to disk when the catalog is synced. This is because there is frequently more than one way in which the value can be determined from the header, so it is ambiguous how to write it out. Instead, these methods are used to set or override values provided by the catalog (e.g. based on command line options) for the duration of processing only.
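For illustration, the number of plural forms can be read out of the standard Plural-Forms header field roughly like this (a sketch under assumption; the real Catalog method also handles missing or malformed headers and other fallbacks):

```python
import re

def nplurals_from_header(plural_forms_value):
    # The conventional field looks like: "nplurals=2; plural=(n != 1);"
    m = re.search(r"nplurals\s*=\s*(\d+)", plural_forms_value)
    return int(m.group(1)) if m else 1

two = nplurals_from_header("nplurals=2; plural=(n != 1);")
four = nplurals_from_header("nplurals=4; plural=(n%100==1 ? 0 : ...);")
```

The companion problem, choosing the msgstr index for a given n, requires evaluating the plural expression itself, which is why a dedicated method is worth having.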
To create an empty catalog if it does not exist on disk, the create=True argument can be added to the constructor. If the catalog does exist, it will be opened as usual; if it did not exist, the new PO file will be written to disk on sync. To unconditionally create an empty catalog, whether the PO file exists at the given path or not, the truncate=True parameter should be added as well. In this case, if the PO file did exist, it will be overwritten with the new content only when the catalog is synced. The catalog can also be created with an empty string for the path, in which case it is guaranteed to be empty even without setting truncate=True. If a catalog with an empty path should later be synced (as opposed to being transient during processing), its .filename attribute can simply be assigned a valid path before calling .sync().
In summary, it can be said that the Catalog class is biased, in terms of performance and ease of use, towards processing existing PO files rather than creating PO files from scratch, and towards processing existing messages in a PO file rather than shuffling them around.
This section describes the style and conventions that the code which is intended to be included in Pology distribution should adhere to. The general coding style is expected to follow the Python style guide described in PEP 8.
Lines should be up to 80 characters long. Class names should be written in camel case, and all other names in lower case with underscores:
class SomeThingy (object):
    ...

    def some_method (self, ...):
        ...
        longer_variable = ...


def some_function (...):
    ...
Long expressions with operators should be wrapped in parentheses and before the binary operator, with the first line indented to the level of the other operand:
some_quantity = (  a_number_of_thingies * quantity_of_that_per_unit
                 + the_base_offset)
In particular, long conditions in if and while statements should be written like this:
if (    something and something_else
    and yet_something and somewhere_in_between
    and who_knows_what_else
):
    do_something_appropriate()
All messages, warnings, and errors should be issued through the report and msgreport modules. There should be no print statements or raw writes to sys.stdout/sys.stderr.
For the code in the Pology library, it is always preferable to raise an exception instead of aborting execution. On the other hand, it is fine to add optional parameters by which the client can select whether the function should abort rather than raise an exception. All topical problems should raise pology.PologyError or a subclass of it, and built-in exceptions should be used only for simple general problems (e.g. IndexError for indexing past the end of something).
All user-visible text, be it reports, warnings, or errors (including exception messages), should be wrapped for internationalization through Gettext. The top-level pology module provides several wrappers for Gettext functions, which have the following special traits: context is mandatory on every wrapped text, all format directives must be named, and arguments are specified as keyword-value pairs just after the text argument (unless deferred translation is used). Some examples:
# Simple message with context marker.
_("@info",
  "Trying to sync unnamed catalog.")

# Simple message with extended context.
_("@info command description",
  "Keep track of who, when, and how, has translated, modified, "
  "or reviewed messages in a collection of PO files.")

# Another context marker and extended context.
_("@title:column words per message in original",
  "w/msg-or")

# Parameter substitution.
_("@info",
  "Review tag '%(tag)s' not defined in '%(file)s'.",
  tag=rev_tag, file=config_path)

# Plural message.
n_("@item:inlist",
   "written %(num)d word",
   "written %(num)d words",
   num=nwords)

# Deferred translation, when arguments are known later.
tmsg = t_("@info:progress",
          "Examining state: %(file)s")
...
msg = tmsg.with_args(file=some_path).to_string()
Every context starts with the "context marker" of the form @keyword, drawn from a predefined set (see the article on i18n semantics at KDE Techbase); it is most often @info in Pology code. The context marker may be, and should be, followed by a free-form extended context whenever it can help the translator to understand how and where the message is used. It is usual to have the context, text, and arguments on different lines, though not necessary if they are short enough to fit on one line.
Pology defines a lightweight XML markup for coloring text, in the colors module. In fact, Gettext wrappers do not return ordinary strings, but ColorString objects, and functions from the report and msgreport modules know how to convert them to raw strings for the given output destination (file, terminal, web page...). Therefore you can use colors in any wrapped string:
_("@info:progress",
  "<green>History follows:</green>")

_("@info",
  "<bold>Context:</bold> %(snippet)s",
  snippet=some_text)
Coloring should be used sparingly, only when it will help to cue user's eyes to significant elements of the output.
There are two consequences of having text markup available throughout. The first is that every message must be well-formed XML, which means that it must contain no unbalanced tags, and that literal < characters must be escaped (and then also > for good style):
_("@item automatic name for anonymous input stream",
  "&lt;stream-%(num)s&gt;", num=strno)
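The escaping requirement itself can be met with the standard library (a general illustration of XML escaping; Pology's colors module does its own handling of the markup):

```python
from xml.sax.saxutils import escape

# escape() replaces the XML-special characters &, < and >.
raw = "literal <stream> markers & such"
safe = escape(raw)
```

Forgetting this step on user-supplied text is the typical way a message stops being well-formed XML.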
The other consequence is that ColorString instances must be joined and interpolated with dedicated functions; see the cjoin() and cinterp() functions in the colors module.
Unless the text of the message is specifically intended to be a title or an insert (i.e. the @title or @item context markers), it should be a proper sentence, starting with a capital letter and ending with a dot.
Pology sieves are filtering-like processing elements applied by the posieve script to collections of PO files. A sieve can examine as well as modify the PO entries passed through it. Each sieve is written in a separate file. If the sieve file is put into the sieve/ directory of the Pology distribution (or installation), the sieve can be referenced on the posieve command line by shorthand notation; otherwise the path to the sieve file is given. The former is called an internal sieve, and the latter an external sieve, but the sieve file layout and the sieve definition are the same in both cases.
In the following, posieve will be referred to as "the client". This is because tools other than posieve may start to use sieves in the future, and this section also describes what such clients should adhere to when using sieves.
The sieve file must define the Sieve class, with some mandatory and some optional interface methods and instance variables. There are no restrictions on what you can put into the sieve file besides this class; only keep in mind that posieve will load the sieve file as a Python module, exactly once during a single run.
Here is a simple sieve (also the complete sieve file) which just counts the number of translated messages:
from pology.report import report

class Sieve (object):

    def __init__ (self, params):
        self.ntranslated = 0

    def process (self, msg, cat):
        if msg.translated:
            self.ntranslated += 1

    def finalize (self):
        report("Total translated: %d" % self.ntranslated)
The constructor takes as argument an object specifying any sieve parameters (more on that soon). The process method gets called for each message in each PO file processed by the client, and must take as parameters the message (an instance of Message_base) and the catalog which contains it (a Catalog). The client calls the finalize method after no more messages will be fed to the sieve, but this method does not need to be defined (the client should check if it exists before placing the call).
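A minimal client's calling sequence could look like this (hypothetical client code for illustration; posieve itself is more elaborate). Note the hasattr() check before the optional finalize() call:

```python
class CountSieve:
    # A trivial sieve that counts every message it sees.
    def __init__(self):
        self.count = 0

    def process(self, msg, cat):
        self.count += 1

def run_sieve(sieve, catalogs):
    for cat in catalogs:
        for msg in cat:
            sieve.process(msg, cat)
    if hasattr(sieve, "finalize"):  # finalize is optional
        sieve.finalize()

sieve = CountSieve()
# Catalogs are just message containers here; lists stand in for them.
run_sieve(sieve, [["msg1", "msg2"], ["msg3"]])
```

CountSieve defines no finalize(), so the hasattr() guard simply skips the call.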
Another optional method is process_header, which the client calls on the PO header:
def process_header (self, hdr, cat):
    # ...
hdr is an instance of Header, and cat is the containing catalog. The client will check for the presence of this method, and if it is defined, it will call it prior to any process call on the messages from the given catalog. In other words, the client is not allowed to switch catalogs between two calls to process without calling process_header in between.
There is also the optional process_header_last method, for which everything holds just like for process_header, except that, when present, the client must call it after all consecutive process calls on messages from the same catalog:
def process_header_last (self, hdr, cat):
    # ...
Sieve methods should not abort program execution in case of errors; instead they should throw an exception. In particular, if the process method throws SieveMessageError, it means that the sieve can still process other messages in the same catalog; if it throws SieveCatalogError, then any following messages from the same catalog must be skipped, but other catalogs may be processed. Similarly, if process_header throws SieveCatalogError, other catalogs may still be processed. Any other type of exception tells the client that the sieve should no longer be used.
The process and process_header methods should either return None or an integer exit code. A return value which is neither None nor 0 indicates that while the evaluation was successful (no exception was thrown), the processed entry (message or header) should not be passed further along the sieve chain.
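A client might interpret these return values along the following lines (an assumed client loop for illustration, not posieve's actual code):

```python
def feed_through_chain(msg, cat, sieves):
    # Stop feeding the entry onward on a non-None, non-zero return.
    for sieve in sieves:
        ret = sieve.process(msg, cat)
        if ret is not None and ret != 0:
            break  # entry dropped from the rest of the chain

class Tagger:
    # Hypothetical sieve: tags messages, and signals that messages whose
    # first text is "skip-me" should go no further down the chain.
    def process(self, msg, cat):
        msg.append("tagged")
        return 1 if msg[0] == "skip-me" else None

class Counter:
    def __init__(self):
        self.seen = 0
    def process(self, msg, cat):
        self.seen += 1

counter = Counter()
feed_through_chain(["skip-me"], None, [Tagger(), counter])
feed_through_chain(["keep-me"], None, [Tagger(), counter])
```

Only the second message reaches the Counter sieve, because the first one was cut off by the non-zero return.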
The params parameter of the sieve constructor is an object with data attributes as parameters which may influence the sieve operation. The sieve file can define the setup_sieve function, which the client will call with a SubcmdView object as the single argument, to fill in the sieve description and define all mandatory and optional parameters. For example, if the sieve takes an optional parameter named checklevel, which controls the level (an integer) at which to perform some checks, here is what setup_sieve could look like:
def setup_sieve (p):

    p.set_desc("An example sieve.")
    p.add_param("checklevel", int, defval=0,
                desc="Validity checking level.")


class Sieve (object):

    def __init__ (self, params):
        if params.checklevel >= 1:
            ...  # setup some level 1 validity checks
        if params.checklevel >= 2:
            ...  # setup some level 2 validity checks
        ...
See the add_param method for details on defining sieve parameters.
The client is not obliged to call setup_sieve, but it must make sure that the object it sends to the sieve as params has all the instance variables corresponding to the defined parameters.
There are two boolean instance variables that the sieve may define, and which the client may check for to decide on the regime in which the catalogs are opened and closed:
class Sieve (object):

    def __init__ (self, params):
        # These are the defaults:
        self.caller_sync = True
        self.caller_monitored = True
        ...
The variables are:

caller_sync instructs the client whether catalogs processed by the sieve should be synced to disk at the end. If the sieve does not define this variable, the client should assume True and sync catalogs. This variable is typically set to False in sieves which do not modify anything, because syncing catalogs takes time.

caller_monitored tells the client whether it should open catalogs in monitored mode. If this variable is not set, the client should assume it to be True. This is another way of reducing processing time, for sieves which do not modify PO entries.
Usually a modifying sieve will set neither of these variables, i.e. catalogs will be monitored and synced by default, while a checker sieve will set both to False. For a modifying sieve that unconditionally modifies all entries sent to it, only caller_monitored may be set to False, with caller_sync left undefined (i.e. True).
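On the client side, the two variables are naturally read with getattr() and the documented defaults (an illustrative snippet, not posieve code):

```python
class CheckerSieve:
    # A pure checker: requests neither syncing nor monitoring.
    caller_sync = False
    caller_monitored = False

class ModifyingSieve:
    # Defines neither variable, so both default to True.
    pass

def open_regime(sieve):
    # Missing attributes fall back to the defaults the manual specifies.
    return (getattr(sieve, "caller_sync", True),
            getattr(sieve, "caller_monitored", True))
```

With several sieves in a chain, the client would combine these requests, e.g. monitoring the catalogs if any sieve in the chain asks for it.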
If a sieve requests no monitoring or no syncing, the client is not obliged to satisfy these requests. On the other hand, if a sieve does request monitoring or syncing (either explicitly or by not defining the corresponding variables), the client must provide catalogs in that regime. This is because there may be several sieves operating at the same time (a sieve chain), and monitoring and syncing is usually necessary for proper operation of those sieves that request it.
Since monitored catalogs have modification counters, the sieve may use them within its process* methods to find out whether any modification really took place. The proper way to do this is to record the counter at the start and check for an increase at the end:
def process (self, msg, cat):

    startcount = msg.modcount

    # ...
    # ... do some stuff
    # ...

    if msg.modcount > startcount:
        self.nmodified += 1
The wrong way to do it would be to merely check whether msg.modcount > 0, because several modifying sieves may be operating at the same time, each increasing the counters.
If the sieve wants to remove a message from the catalog, whenever possible it should use the catalog's remove_on_sync method instead of remove, to defer the actual removal to sync time. This is because remove will probably ruin the client's iteration over the catalog, so if it must be used, the sieve documentation should state so clearly. remove also has linear execution time, while remove_on_sync runs in constant time.
If the sieve is to become part of the Pology distribution, it should be properly documented. This means a fully equipped setup_sieve function in the sieve file, and a piece of user manual documentation. The Sieve class itself should not be documented in general. Only when the process* methods return an exit code should this be stated in their own comments (and in the user manual).
Hooks are functions with specified sets of input parameters, return values, processing intent, and behavioral constraints. They can be used as modification and validation plugins in many processing contexts in Pology. There are three broad categories of hooks: filtering, validation and side-effect hooks.
Filtering hooks modify some of their inputs. Modifications are done in-place whenever the input is mutable (like a PO message), otherwise the modified input is provided in a return value (like a PO message text field).
Validation hooks perform certain checks on their inputs, and return a list of annotated spans or annotated parts, which record all the encountered errors:
Annotated spans are reported when the object of validation is a piece of text. Each span is a tuple of the start and end index of the problematic segment in the text, and a note which explains the problem. The return value of a text-validation hook will thus be a list:

[(start1, end1, "note1"),
 (start2, end2, "note2"),
 ...]

The note can also be None, if there is nothing to say about the problem.
Annotated parts are reported for an object which has more than one distinct piece of text, such as a PO message. Each annotated part is a tuple stating the name of the problematic part of the object (e.g. "msgid", "msgstr"), the item index for array-like parts (e.g. for msgstr), and the list of problems in the appropriate form (for a PO message this is a list of annotated spans). The return value of a PO message-validation hook will look like this:
[("part1", item1, [(start11, end11, "note11"), ...]),
 ("part2", item2, [(start21, end21, "note21"), ...]),
 ...]
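As a small concrete example of the span format, here is a text-validation hook (in the V1A style described below) that flags runs of consecutive spaces; the check itself is hypothetical, chosen only to show the return value shape:

```python
import re

def check_double_space(text):
    # Return annotated spans: (start, end, note) per problem found.
    spans = []
    for m in re.finditer(r"  +", text):
        spans.append((m.start(), m.end(), "multiple consecutive spaces"))
    return spans

spans = check_double_space("One  two three   four")
```

An empty list means the text passed validation, which lets clients treat the result uniformly.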
Side-effect hooks neither modify their inputs nor report validation information, but can be used for whatever purpose is independent of the processing chain into which the hook is inserted. For example, a validation hook can also be implemented like this, when it is enough that it reports problems to standard output, or when the hook client does not know how to use structured validation data (annotated spans or parts). The return value of a side-effect hook is the number of errors encountered internally by the hook (an integer). Clients may use this number to decide upon further behavior. For example, if a side-effect hook modified a temporary copy of a file, the client may decide to abandon the result and use the original file if there were some errors.
In this section a number of hook types are described and assigned a formal type keyword, so that they can be conveniently referred to elsewhere in Pology documentation.
Each type keyword has the form <letter1><number><letter2>, e.g. F1A. The first letter represents the hook category: F for filtering hooks, V for validation hooks, and S for side-effect hooks. The number enumerates the input signature by parameter types, and the final letter distinguishes the semantics of input parameters for equal input signatures. As a handy mnemonic, each type is also given an informal signature in the form (param1, param2, ...) -> result; in these, spans stands for annotated spans, parts for annotated parts, and numerr for the number of errors.
Hooks on pure text:

F1A ((text) -> text): filters the text
V1A ((text) -> spans): validates the text
S1A ((text) -> numerr): side-effects on the text
Hooks on text fields in a PO message in a catalog:

F3A ((text, msg, cat) -> text): filters any text field
V3A ((text, msg, cat) -> spans): validates any text field
S3A ((text, msg, cat) -> numerr): side-effects on any text field
F3B ((msgid, msg, cat) -> msgid): filters an original text field; original fields are either msgid or msgid_plural
V3B ((msgid, msg, cat) -> spans): validates an original text field
S3B ((msgid, msg, cat) -> numerr): side-effects on an original text field
F3C ((msgstr, msg, cat) -> msgstr): filters a translation text field; translation fields are the msgstr array
V3C ((msgstr, msg, cat) -> spans): validates a translation text field
S3C ((msgstr, msg, cat) -> numerr): side-effects on a translation text field
The *3B and *3C hook series are introduced next to *3A for cases when it does not make sense for the text field to be anything other than one of the original or translation fields, respectively. For example, to process the translation, sometimes the original (obtained through the msg parameter) must be consulted. If a *3B or *3C hook is applied to an inappropriate text field, the results are undefined.
Hooks on PO entries in a catalog:
F4A ((msg, cat) -> numerr): filters a message, modifying it
V4A ((msg, cat) -> parts): validates a message
S4A ((msg, cat) -> numerr): side-effects on a message (no modification)
F4B ((hdr, cat) -> numerr): filters a header, modifying it
V4B ((hdr, cat) -> parts): validates a header
S4B ((hdr, cat) -> numerr): side-effects on a header (no modification)
Hooks on PO catalogs:
F5A ((cat) -> numerr): filters a catalog, modifying it in any way
S5A ((cat) -> numerr): side-effects on a catalog (no modification)
Hooks on file paths:
F6A ((filepath) -> numerr): filters a file, modifying it in any way
S6A ((filepath) -> numerr): side-effects on a file (no modification)
The *2* hook series (with signatures (text, msg) -> ...) has been skipped, because so far no need for it has been observed next to the *3* hooks.
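To make the taxonomy concrete, here is a minimal standalone sketch of one hook of each category on pure text (F1A, V1A, S1A). The whitespace checks and the exact shape of the span tuples are invented for the example; they are not actual Pology hooks.

```python
import re

# F1A ((text) -> text): filter the text, here by collapsing
# runs of whitespace into single spaces.
def collapse_whitespace (text):
    return re.sub(r"\s+", " ", text)

# V1A ((text) -> spans): validate the text, reporting problematic
# segments as (start, end, note) span tuples.
def check_double_space (text):
    spans = []
    for m in re.finditer(r"  +", text):
        spans.append((m.start(), m.end(), "multiple consecutive spaces"))
    return spans

# S1A ((text) -> numerr): side-effect only, here just counting
# the problems without modifying anything.
def warn_double_space (text):
    nproblems = len(re.findall(r"  +", text))
    return nproblems
```

For example, collapse_whitespace("a  b") returns "a b", while check_double_space("a  b") reports the span (1, 3) with a note.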
Since hooks have fixed input signatures by type, the way to customize the behavior of a given hook is to produce its function by another function. The hook-producing function is called a hook factory. It works by preparing anything needed for the hook, then defining the hook proper and returning it, thereby creating a lexical closure around it:

    def hook_factory (param1, param2, ...):

        # Use param1, param2, ... to prepare for hook definition.

        def hook (...):
            # Perhaps use param1, param2, ... in the hook definition too.
            ...

        return hook
In fact, most internal Pology hooks are defined by factories.
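As a concrete illustration of the pattern, here is a hypothetical factory producing an F1A hook which replaces words according to a mapping supplied as the factory parameter (the factory name and behavior are invented for the example):

```python
import re

def replace_factory (mapping):
    # Preparation step: compile a single regular expression matching
    # any key of the mapping, longest keys first so that longer
    # matches take precedence.
    keys = sorted(mapping, key=len, reverse=True)
    rx = re.compile("|".join(re.escape(k) for k in keys))

    # The F1A hook proper; the closure captures mapping and rx.
    def hook (text):
        return rx.sub(lambda m: mapping[m.group(0)], text)

    return hook
```

A produced hook is then used like any other F1A hook, e.g. replace_factory({"colour": "color"}) returns a function which turns "colour scheme" into "color scheme".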
General hooks should be defined in top level modules, language-dependent hooks in lang.<code>.<module>, project-dependent hooks in proj.<name>.<module>, and hooks that are both language- and project-dependent in lang.<code>.proj.<name>.<module>. Hooks placed like this can be fetched by getfunc.get_hook_ireq in various non-code contexts, in particular from Pology utilities which allow users to insert hooks into processing through command line options or configurations. If the complete module is dedicated to a single hook, the hook function (or factory) should be named the same as the module, so that users can select it by giving only the hook module name.
Annotated parts for PO messages returned by hooks are a reduced but valid instance of the highlight specifications used by reporting functions, e.g. msgreport.report_msg_content. Annotated parts do not have the optional fourth element of a tuple in the highlight specification, which is used to provide the filtered text against which spans were constructed, instead of the original text. If a validation hook constructs the list of problematic spans against the filtered text, just before returning it can apply diff.adapt_spans to reconstruct the spans against the original text.
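The exact signature of diff.adapt_spans is given in its API documentation; the following self-contained sketch merely illustrates the idea of mapping spans from filtered back to original text, with a hand-rolled position map standing in for it. The accelerator-marker filtering and the span format details are assumptions of the example:

```python
import re

# A V1A-style hook which filters out accelerator markers ("&")
# before checking, then maps the reported spans back onto the
# original, unfiltered text.
def check_double_space_skipping_accel (text):
    # Build the filtered text, recording for each filtered position
    # the corresponding position in the original text (this position
    # map is a stand-in for what diff.adapt_spans computes).
    posmap = []
    filtered = []
    for i, c in enumerate(text):
        if c != "&":
            posmap.append(i)
            filtered.append(c)
    posmap.append(len(text))  # one-past-end position
    ftext = "".join(filtered)

    spans = []
    for m in re.finditer(r"  +", ftext):
        # Translate span boundaries from filtered to original text.
        spans.append((posmap[m.start()], posmap[m.end()], "multiple spaces"))
    return spans
```

For "&Open  file" the double space sits at positions 4-6 of the filtered text, but the hook reports it at positions 5-7 of the original.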
The documentation of a hook function should state the hook type within the short description, in square brackets at the end, as [type ... hook]. Input parameters should be named as in the informal signatures in the taxonomy above, and should not be omitted in @param: Epydoc entries; the return value should be given under @return:, also using one of the listed return names, in order to complete the hook signature.
The documentation of a hook factory should have [hook factory] at the end of the short description. It should normally list all the input parameters, while the return value should be given as @return: type ... hook, and the hook signature as the @rtype: Epydoc field.
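Put together, the documentation of a hypothetical V1A hook following these conventions might look as below (the hook itself and its 80-character rule are invented for the example):

```python
def check_length (text):
    """
    Check that the text does not exceed 80 characters [type V1A hook].

    @param text: the text to check
    @type text: string

    @return: annotated spans
    @rtype: list of tuples
    """
    if len(text) > 80:
        return [(80, len(text), "text longer than 80 characters")]
    return []
```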
Ascription selectors are functions used by poascribe in the translation review workflow as described in Chapter 6, Ascribing Modifications and Reviews. This section describes how you can write your own ascription selector, which you can then put to use by following the instructions in Section 6.8.1, “Custom Review Selectors”.
In terms of code, an ascription selector is a function factory, which constructs the actual selector function based on the supplied selector arguments. It has the following form:
    # Selector factory.
    def selector_foo (args):

        # Validate input arguments.
        if (...):
            raise PologyError(...)

        # Prepare selector definition.
        ...

        # The selector function itself.
        def selector (msg, cat, ahist, aconf):

            # Prepare selection process.
            ...

            # Iterate through ascription history looking for something.
            for i, asc in enumerate(ahist):
                ...

            # Return False or True if a shallow selector,
            # and 0 or 1-based history index if a history selector.
            return ...

        return selector
It is customary to name the selector function selector_something, where something will also be used as the selector name (on the command line, etc.). The input args parameter is always a list of strings. It should first be validated, insofar as that is possible without having in hand the particular message, catalog, ascription history, or ascription configuration. Whatever does not depend on any of these can also be precomputed for later use in the selector function.
The selector function takes as arguments the message (an instance of Message_base), the catalog (Catalog) it comes from, the ascription history (a list of AscPoint objects), and the ascription configuration (AscConfig). For the most part, AscPoint and AscConfig are simple attribute objects; check their API documentation for the list and description of attributes. Some of the attributes of AscPoint objects that you will usually inspect are .msg (the historical version of the message), .user (the user to whom the ascription was made), and .type (the type of the ascription, one of the AscPoint.ATYPE_* constants). The ascription history is sorted from the latest to the earliest ascription. If the .user of the first entry in the history is None, the current version of the message has not been ascribed yet (e.g. if its translation has been modified compared to the latest ascribed version). If you are writing a shallow selector, it should return True to select the message, or False otherwise. In a history selector, the return value should be the 1-based index of the entry in the ascription history which caused the message to be selected, or 0 if the message was not selected.[57]
The entry index returned by history selectors is used to compute the embedded difference from a historical to the current version of the message, e.g. on poascribe diff. Note that poascribe will actually take as the base for differencing the first non-fuzzy historical message after the indexed one, because it is assumed that the historical message which triggered the selection already contains some changes to be inspected. (When this behavior is not sufficient, poascribe allows the user to specify a second history selector, which directly selects the historical message to base the difference on.)
Most of the time the selector will operate on messages covered by a single ascription configuration, which means that the ascription configuration argument sent to it will always be the same. On the other hand, the resolution of some of the arguments to the selector factory will depend only on the ascription configuration (e.g. a list of users). In this scenario, it would be a waste of performance if such arguments were resolved anew in each call to the selector. You could instead write a small caching (memoizing) resolver function, which, when called for the second and subsequent times with the same configuration object, returns the previously resolved argument value from the cache. A few such caching resolvers for common arguments are provided in the ascript module, as functions named cached_*() (e.g. cached_users()).
[51] In Python 2, to be precise, on which Pology is based; in Python 3 there are only Unicode strings.
[52] The canonical way to check whether a message is a plural message is msg.msgid_plural is not None.
[53] There is also the report module for reporting general strings. In fact, all code in the Pology distribution is expected to use functions from these modules for writing to output streams, and there should not be a print in sight.
[54] This holds only for catalogs created with monitoring, i.e. without the monitored=False constructor argument. For non-monitored catalogs, .sync() will always touch the file and report True.
[55] As opposed to the find-messages sieve.
[56] In fact, .add_last() does a bit more: if both non-obsolete and obsolete messages are added in mixed order, in the catalog they will be separated such that all non-obsolete messages come before all obsolete ones, while otherwise maintaining the order of addition.
[57] In this way a history selector can automatically behave as a shallow selector as well, because simply testing the return value for falsity shows whether the message has been selected or not.