Chapter 9. Common Functionality

Different parts of Pology provide common functionality, such as thematic groups of options to scripts, file selection patterns, reliance on PO metadata, etc. This chapter describes such common functionality.

9.1. Shell Completion

Shell completion means that, just as with command names, command parameters can be completed contextually by pressing the Tab key. This lets you type command lines efficiently, and quickly remind yourself of options and option parameters without resorting to documentation or browsing the file system.

For example, pressing Tab just after the posieve command will complete sieve names, and Tab after the -s option will complete sieve parameters based on sieves that precede it in the command line. This:

$ posieve s<TAB>

will show all sieves beginning with s, and complete the sieve name once a sufficient number of characters has been entered to uniquely determine it, while this:

$ posieve stats -s m<TAB>

will show all parameters to stats beginning with m, and complete one of them after a few more characters are typed in.

9.2. User Configuration

Various parts of Pology can be configured through the configuration file .pologyrc in the root of the user's home directory (~/.pologyrc for short). The configuration file does not have to exist, so you have to create it when you want to configure something for the first time. It must be UTF-8 encoded.

The configuration file is in the INI format, which is composed of sections beginning with a [section] line, and fields of the form field = value within a section. Comments can be written after a # character at the beginning of a line. Here is an example of a ~/.pologyrc file:

[global]

[user]
name = Chusslove Illich
original-name = Часлав Илић
email = caslav.ilic@gmx.net
po-editor = Kate

[enchant]
# Autodetection sufficient.

[posieve]
msgfmt-check = yes
param-ondiff/stats = yes

# Project setups follow.

[project-kde]
language = sr
language-team = Serbian
team-email = kde-i18n-sr@kde.org
plural-forms = nplurals=4; plural=n==1 ? ...

This configuration contains five sections: [global], [user], [enchant], [posieve], and [project-kde]. The [global] section sets options that have an effect throughout Pology, and here it is empty. The [user] section provides some information on the person who uses Pology. The [enchant] section configures the Enchant spell checker wrapper, used by Pology for spell checking. The [posieve] section configures the behavior of the posieve script. The [project-kde] section provides information on a project that the user contributes translations to.

Some details about the configuration file syntax are as follows. Leading and trailing whitespace in section and field names and values is not significant, e.g. foo=bar is the same as foo = bar. The percent (%) character is used to expand the value of another field, for example:

rootdir = /path/to/somewhere
datadir = %(rootdir)s/data

where %(...)s is Python's string interpolation syntax. Importantly, when you need a literal % character within a value (such as in the plural-forms field in the previous example), you must double it, %%. Switch-type fields (msgfmt-check in the previous example) can take any of the following values for the two states: 0, no, false, or off; and 1, yes, true, or on (case is not important).
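These syntax rules match the behavior of the configparser module in Python's standard library, so they can be tried out in isolation. The following is a minimal sketch, not Pology's own configuration reader: the [paths] and [project-demo] sections and the note field are hypothetical names chosen only for illustration.

```python
import configparser

# Hypothetical configuration text, written only to exercise the syntax rules.
SAMPLE = """
[paths]
rootdir = /path/to/somewhere
datadir = %(rootdir)s/data

[posieve]
msgfmt-check = Yes

[project-demo]
# A literal percent sign has to be doubled.
note = 100%% done
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# %(rootdir)s is expanded from the field of the same section.
datadir = config.get("paths", "datadir")

# Switch-type values are recognized regardless of case.
msgfmt_check = config.getboolean("posieve", "msgfmt-check")

# The doubled %% comes back as a single percent sign.
note = config.get("project-demo", "note")

print(datadir, msgfmt_check, note)
```

Forgetting to double a literal % makes configparser raise an interpolation error when the field is read, which is a quick way to spot the mistake.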

Sections in the configuration can be of one of four general types:

  • General sections, which provide information used by various parts of Pology as they need them. The [global] and [user] sections from the previous example are general sections.

  • External tool sections, which are used to configure external libraries and programs used within Pology. The [enchant] section from the previous example is of this type.

  • Internal tool sections, which configure the behavior of Pology's own scripts. This is the [posieve] section from the previous example.

  • Project sections, which provide information related to particular translation projects that the user is contributing to. Names of these sections always start with project-, such as [project-kde] from the previous example.

Internal tool sections are documented together with the respective tools, while sections of other types are described in the following.

When mentioning configuration fields in their documentation and elsewhere, they are referred to as [section]/field. If there is only a fixed number of possible values to a field, this is denoted as [section]/field=[VALUE1|VALUE2|VALUE3|...]; if one of the values is the default, it is prefixed with a star (*).

9.2.1. The [global] section

The [global] section contains options which can have effect on various otherwise unrelated parts of Pology.

Known configuration fields are as follows:

[global]/show-backtrace=[yes|*no]

When one of Pology's commands stops execution with an error, by default only the error message is shown. However, for reporting problems and debugging, it is much better to get a backtrace instead. Backtraces can be activated by this option.

Whenever you want to report a problem where a Pology command aborts with an error, make sure to activate this option and submit the full backtrace.

9.2.2. The [user] section

Many parts of Pology can take advantage of information about you and the tools you use. This information is given in the [user] section. For example, when initializing a PO file from a template, your name and email address can be filled in the PO header, or a PO file can be opened in the translation editor that you use (if it is supported).

Known configuration fields are as follows:

[user]/name

Your name if it is written in Latin script, or the romanized equivalent of your name. The intention is that it is readable (or semi-readable) to people from various places in the world, who would use it to contact you if necessary.

[user]/original-name

This is your name in your native language and script, whatever it may be. If it would be the same as the name in the [user]/name field, setting this field is not necessary.

[user]/email

Your email address.

[user]/language

The language code of the language you translate into. If by any chance you translate into several languages, this field can be overridden in per-project configuration sections.

[user]/encoding

The encoding of the PO files you work on. Nowadays this should really, really be UTF-8. If it is not UTF-8 for everything that you work on, you can override it in per-project configuration sections.

[user]/plural-forms

The value for the Plural-Forms PO header field used for your language. If it differs between projects, you can override the value set here in per-project configuration sections.

[user]/po-editor

The human-readable name of the editor with which you translate (it does not have to be a dedicated PO editor). This is used in contexts where your editor preference is announced, such as through the X-Generator PO header field.

[user]/po-editor-id=[lokalize]

The keyword under which the PO editor that you use is known to Pology. For the moment, only Lokalize is supported. This is used when a Pology tool is told to open PO files on the messages it matched.

9.2.3. The [enchant] section

This section configures Enchant, a wrapper library for spell checking, which is used for Pology's spell checking functionality. Through Enchant it is possible to use various spell checkers, such as Aspell, Ispell, Hunspell, etc. in a uniform way.

Known configuration fields are as follows:

[enchant]/provider=[aspell|ispell|myspell|...]

The keyword denoting the spell checker that Enchant should use. It can also be a comma-separated list of several keywords, in which case Enchant will use the first available spell checker in the list. You can find the up-to-date list of all known provider keywords in the enchant(1) man page, and run the enchant-lsmod command to see exactly which of those are recognized as available on the system.

[enchant]/language

The spell checking dictionary that should be used, by language code. This value is used only if the language is not specified in any other way, such as in the PO header or through the command line.

[enchant]/environment

The sub-language environment for spell checking. This is related to Pology's internal spelling dictionary supplements; see the section on spell checking. This value is used only if the environment is not specified in any other way, such as in the PO header or through the command line.

9.2.4. The [aspell] section

At first Pology used Aspell for spell checking, before Enchant was introduced. Direct support for Aspell was nevertheless kept, due to some specifics that the Enchant wrapper does not support yet. (This means that you should prefer Enchant if it satisfies your needs.)

Known configuration fields are as follows:

[aspell]/language

See [enchant]/language.

[aspell]/encoding

Encoding for the text sent to Aspell.

[aspell]/variety

The sub-language variety of the Aspell spelling dictionary.

[aspell]/environment

See [enchant]/environment.

[aspell]/supplements-only=[yes|*no]

Whether to ignore the system spelling dictionary and use only Pology's internal dictionary supplements.

[aspell]/simple-split=[yes|*no]

By default, Pology splits the text into words in a clever fashion (eliminating text markup, format directives, etc.) before sending them to the spell checker. Sometimes this leads to bad results, and then this field can be set to yes to split text simply on whitespace (possibly, in the given context, in combination with a pre-filtering hook on the text).

9.2.5. Per-project sections ([project-*])

You will easily find yourself needing to translate and maintain translated material within different projects, each with its own set of rules and conventions. Pology is designed to support project switching extensively, and one element of that is per-project configuration sections.

A project configuration section has the name [project-PKEY], where PKEY is the project keyword. You can choose the project keyword freely, but it should contain only ASCII letters, digits, underscores and hyphens. Project configuration fields frequently have fallbacks to fields in other configuration sections. This means that when the project field is not set, its corresponding field in that other (more general) section gets used instead. In the following, this is the case whenever you are instructed to see a field in another section.

Per-project configuration fields are as follows:

[project-*]/name

See [user]/name.

[project-*]/original-name

See [user]/original-name.

[project-*]/email

See [user]/email.

[project-*]/language

See [user]/language.

[project-*]/language-team

This is the name of the team which translates this project into the given language. Since usually there is only one translation team per language in a project, the value of this field is just the human-readable name of the language (as opposed to the language code) in English.

[project-*]/team-email

The email address for communication with the translation team as a whole (usually the team's mailing list).

[project-*]/encoding

See [user]/encoding.

[project-*]/plural-forms

See [user]/plural-forms.
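The fallback from a project section to the [user] section can be pictured with a small lookup helper. This is a minimal sketch built on Python's standard configparser; the function name project_field and the sample values are assumptions made for illustration, not part of Pology's interface.

```python
import configparser


def project_field(config, pkey, field, fallback_section="user"):
    """Return a field from [project-PKEY], else from the fallback section.

    Returns None when the field is set in neither section.
    """
    section = "project-" + pkey
    if config.has_option(section, field):
        return config.get(section, field)
    if config.has_option(fallback_section, field):
        return config.get(fallback_section, field)
    return None


# Hypothetical configuration used only for this illustration.
config = configparser.ConfigParser()
config.read_string("""
[user]
name = Jane Translator
language = sr
encoding = UTF-8

[project-kde]
language = sr@latin
""")

print(project_field(config, "kde", "language"))    # set in the project section
print(project_field(config, "kde", "encoding"))    # falls back to [user]
print(project_field(config, "kde", "team-email"))  # set nowhere
```

Fields without a counterpart in [user], such as language-team and team-email, simply have nothing to fall back on in this sketch.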

9.3. Regular Expressions

There are a great many places in Pology where you can supply a matching pattern, to select or deselect something. This could be a PO file by its path, a PO message by its msgid, etc. Almost always and by default, this matching pattern will be a regular expression (or regex for short). Regular expressions are a powerful pattern matching language, a fascinating topic in their own right, and they will serve you well in just about any context of searching on computers. The plain text editor that you use probably offers regular expressions in its search dialog, so does your office text processor, and so on.

Actually, the only point of this brief section is to impress the importance and usefulness of regular expressions upon you, in case you have not used them yet. The Internet is full of tutorials on regular expressions, so there is no point in linking to any particular one here.

It should be mentioned that different regular expression engines have somewhat different syntax and expressiveness. Pology uses regular expressions from the Python Standard Library, described here: http://docs.python.org/library/re.html (keep in mind that this page is a reference, and not a tutorial, so you should look elsewhere to learn the basics of regular expressions).

9.4. Path Inclusion and Exclusion

Pology scripts that can recursively search directory paths for PO files will usually provide several options by which certain files can be included or excluded from processing. The first pair of these options include or exclude files by path:

-E REGEX, --exclude-path=REGEX

Every file whose path matches the supplied pattern is excluded from processing. This option can be repeated, in which case a file is excluded only if its path matches every pattern. When you want to exclude a file if any of several patterns matches its path, you can connect those patterns with the regular expression | operator in a single option. This allows you to build up complex exclusion conditions if necessary.

-I REGEX, --include-path=REGEX

Only those files which have the path matching the supplied pattern are included into processing. If the option is repeated, a file is included only if its path matches every pattern.

PO files, especially those used at runtime (as opposed to those used for static translation), are frequently sufficiently identified by their domain name. The domain name is the base name of the installed MO file without the extension, e.g. for /usr/share/locale/sr/LC_MESSAGES/foobar.mo the domain name is foobar. If, in a given translation project, PO files for a given language are all collected under one top directory of that language, their base names are also formed of domain names.[39] When this is the case, it may be more convenient or safer to match PO files by their domain names instead of paths, which is done by the options:

-e REGEX, --exclude-name=REGEX

Counterpart to -E/--exclude-path which matches by domain name.

-i REGEX, --include-name=REGEX

Counterpart to -I/--include-path which matches by domain name.

All inclusion and exclusion options can be freely mixed and repeated, and are resolved consistently. A file is processed if it matches all inclusion patterns (if any is given) and does not match at least one exclusion pattern (if any is given). The other way around, a file is not processed if it does not match at least one inclusion pattern (if any is given) or it matches all exclusion patterns (if any is given).
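The resolution rule can be written down compactly. The following is a minimal sketch of the rule as stated above, not Pology's implementation: the function name is_processed and the example path are hypothetical, and the use of re.search (matching anywhere in the string) is an assumption.

```python
import os
import re


def is_processed(path, include_path=(), exclude_path=(),
                 include_name=(), exclude_name=()):
    """Decide whether a file is processed under the given patterns."""
    # The domain name is the base name of the file without the extension.
    name = os.path.splitext(os.path.basename(path))[0]

    inclusions = [re.search(rx, path) for rx in include_path]
    inclusions += [re.search(rx, name) for rx in include_name]
    exclusions = [re.search(rx, path) for rx in exclude_path]
    exclusions += [re.search(rx, name) for rx in exclude_name]

    # Not processed if at least one inclusion pattern fails to match...
    if inclusions and not all(inclusions):
        return False
    # ...or if every exclusion pattern matches.
    if exclusions and all(exclusions):
        return False
    return True


path = "sr/kdelibs/kio4.po"  # hypothetical path
print(is_processed(path, include_name=["^kio"]))
print(is_processed(path, exclude_path=["kdelibs", "extragear"]))
print(is_processed(path, exclude_path=["kdelibs|extragear"]))
```

The last two calls show why alternatives belong in a single option joined with |: given as two separate exclusion options, both patterns would have to match before the file is excluded.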

9.5. Reading Paths From a File

Sometimes it is convenient to make a temporary or semi-permanent grouping of files, such that the file group can be referenced through a single argument instead of repeating all the files all the time. This is particularly useful when shell piping is not applicable or not comfortable enough. The classic and simple way to group files is by having a file-list file, which contains one file path per line, which a shell command can read to collect files to process.

Many Pology scripts can write and read file-list files. Having scripts write such files automatically is simple enough: just check the given script's documentation to see if it has this capability (e.g. the -m option to posieve). More interesting are the special features that you can use when writing a file-list file manually. You would do this for standing categories which are periodically updated, such as a list of PO files ready for release.

For completeness, here is first an example of a basic file-list file:

xray/alpha.po
xray/bravo.po
yankee/charlie.po
yankee/delta.po

As is usual for path arguments to Pology scripts, you can specify both file and directory paths, and directory paths will be searched recursively for PO files (or whatever the file type that the script is processing):

xray/
yankee/
zulu/echo.po
zulu/foxtrot.po

You can add comments by starting the line with hash (#), and have empty lines:

# Translations ready for release.

# Full modules.
xray/
yankee/

# Specific files.
zulu/echo.po
zulu/foxtrot.po

The inclusion-exclusion functionality equivalent to inclusion-exclusion command line options is provided through inclusion-exclusion directives. They are specified by starting the line with a colon (:), followed by a directive type token, followed by a regular expression. The directives are:

  • :/-REGEX to exclude files by path,

  • :/+REGEX to include files by path,

  • :-REGEX to exclude files by base name without extension, and

  • :+REGEX to include files by base name without extension.

For example, if a whole module should be processed but for one PO file in it, it is easier to list the whole module and exclude that one file, as compared to listing all other files:

# Modules.
xray/
yankee/
# Exclude november.po (in whichever module it is).
:-november

Ordering and position of include-exclude directives is not significant, as they are all applied to all collected files. The semantics of application of multiple directives is the same as that of counterpart command line options.

File-list files are normally fed to Pology scripts with the following option:

-f FILE, --files-from=FILE

Read files to process from a file which contains one path per line, or special entries as described above. This option can be repeated to read several file lists. Additional paths to process can still be given as command line arguments. Any inclusion-exclusion options will be applied to the files read from the file as well (in addition to the file's internal inclusion-exclusion directives, if any).
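To make the file-list format concrete, here is a minimal sketch of a reader for it. It is not Pology's own parser: the function name parse_file_list and the dictionary keys are hypothetical, and stripping surrounding whitespace from each line is an assumption.

```python
# Directive prefixes mapped to hypothetical pattern categories.
DIRECTIVES = {
    ":/-": "exclude_path",
    ":/+": "include_path",
    ":-": "exclude_name",
    ":+": "include_name",
}


def parse_file_list(text):
    """Split file-list text into paths and inclusion-exclusion patterns."""
    paths = []
    patterns = {kind: [] for kind in DIRECTIVES.values()}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip empty lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith(":"):
            for prefix, kind in DIRECTIVES.items():
                if line.startswith(prefix):
                    patterns[kind].append(line[len(prefix):])
                    break
            else:
                raise ValueError("unknown directive: " + line)
        else:
            paths.append(line)
    return paths, patterns


paths, patterns = parse_file_list("""\
# Modules.
xray/
yankee/
# Exclude november.po (in whichever module it is).
:-november
""")
print(paths)
print(patterns["exclude_name"])
```

Because directives are collected separately from paths, their position in the file does not matter, in line with the description above.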

9.6. Output Coloring

In some contexts, Pology scripts color the terminal output for better visual separation and highlighting of important parts of the text. Examples include warning and error messages, data presented as tables and bars, and, importantly, matched segments of the text in search and validation operations.

Output coloring is turned on by default, but sensitive to output destination: the text is colored if the output goes to the terminal (using terminal escape sequences), but not if it is redirected to a file or piped to another command. Pology scripts provide the following options by which you can influence this behavior:

-R, --raw-colors

Disables output destination sensitivity, such that the text is always colored. This is useful when the output is piped to another command which can understand the terminal escape sequences by which colors are produced, such as less(1). A typical example would be piping search results from the find-messages sieve to be able to scroll them back and forth:

$ posieve find-messages ... -R | less -R

The -R of less tells it to interpret escape sequences as colors, rather than showing them as literal text.

--coloring-type=[none|*term|html]

Instead of coloring for the terminal, with this option you can choose another coloring type. none disables coloring, term is the default, while html will produce HTML-tagged text ready for embedding into a web page (e.g. inside a <pre> element). For example, with a little bit of additional scripting, you could use the stats sieve and html coloring to periodically update a web page with translation statistics.
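The destination sensitivity of terminal coloring comes down to asking the output stream whether it is a terminal. The following is a minimal sketch of that decision for the term coloring type only; the function name use_terminal_colors is hypothetical, and how Pology actually makes this decision is not shown here.

```python
import io
import sys


def use_terminal_colors(stream, raw_colors=False, coloring_type="term"):
    """Decide whether to emit terminal escape sequences on the stream."""
    if coloring_type != "term":
        # Other coloring types are outside the scope of this sketch.
        return False
    # Raw colors switch off the destination check entirely.
    return raw_colors or stream.isatty()


# An in-memory stream stands in for output redirected to a file or pipe.
redirected = io.StringIO()
plain = use_terminal_colors(redirected)
forced = use_terminal_colors(redirected, raw_colors=True)
disabled = use_terminal_colors(redirected, raw_colors=True, coloring_type="none")

print(plain, forced, disabled)
print(use_terminal_colors(sys.stdout))  # depends on where stdout goes
```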

9.7. Integration with Other Tools

One of the general aims of Pology is to fit well with other tools typically found in translation workflows based on PO. Although examples of this can be seen throughout the manual, this section gives an overview of integration by the particular supported tool.

9.7.1. PO Editors

When Pology is used to validate the translation, be it through informal but precise searches or formal validation rules, those translations found to be invalid must be modified (or possibly a special translator comment added to the message to silence a false positive). Pology will normally always report the PO file path and the location of the message within the file, so that you can get to it in your preferred PO editor. For greater efficiency, however, Pology can directly open the PO files on problematic messages in some PO editors. Currently these are:

Lokalize

Many sieves, notably find-messages, check-rules, or check-spell, provide the parameter lokalize to open PO files on reported messages in Lokalize. This means that when run over a collection of PO files, each PO file with at least one reported message will be loaded into one of Lokalize tabs, and only the reported messages will be shown for editing under each tab. A slight catch is that Lokalize must be manually started before a sieve is run, and the Lokalize project which contains all the sieved PO files must be opened; otherwise, simply nothing will happen.

9.7.2. Version Control Systems

From the viewpoint of translators, PO files are frequently (though not always) handled in the same way as program code, through version control systems (VCS). Pology defines an abstraction of version control functionality, which enables its tools to transparently cooperate with several VCS. Usually it is necessary to tell a Pology tool which VCS is used, which is done by specifying one of the VCS keywords. Currently supported VCS and their keywords are:

  • Git: git

  • Subversion: svn, subversion

  • none (when specifying a VCS is required, but none is actually used): none, noop

VCS integration is available in the following places:

  • Producing embedded diffs with poediff (see Chapter 4, Diffing and Patching). Option -c/--vcs can be used to switch poediff into VCS mode, such that it diffs given paths between repository head and working copy, or between given revisions.

  • Translating in summit (see Chapter 5, Summitting Translation Branches). posummit will automatically add or remove files from version control as well as to and from disk, so that the modified repository tree can be directly committed after a summit maintenance operation has completed its run.

  • Review ascription (see Chapter 6, Ascribing Modifications and Reviews). VCS support is a central part of poascribe, so it will automatically add, remove and commit files to version control as particular ascription operations require.

Another interesting aspect of VCS support is that, when writing modified PO files to disk, by default Pology will reformat them (almost) only as much as necessary. For example, if only one msgstr string in the whole PO file has changed, and wrapping is active, only this string and nothing else will be rewrapped when the file is written out. This makes VCS revision deltas smaller and more informative.

9.8. Line Wrapping in PO Messages

While line wrapping of message strings is irrelevant to programs that fetch translations from them, it may be significant to the translator, especially when editing the PO file with a plain text editor. Well-wrapped strings make it easier for the translator to follow the text structure, especially in longer messages.

Most Gettext tools (msgmerge, msgcat, msgfilter, etc.) provide options to wrap or not to wrap strings, where wrapping is done on the given column and escaped newlines (\n). Pology can produce this type of wrapping ("basic" wrapping) as well, but it can also wrap on expected visual line breaks in known text markup, e.g. <p> and <br> in HTML ("fine" wrapping). Compare this message in basic wrapping alone:

msgid ""
"<p>These settings control the storage of the corrected images. "
"There are four modes to choose from:</p><p><ul><li><b>Subfolder:</"
"b> The corrected images will be saved in a subfolder under the "
"current album path.</li><li><b>Prefix:</b> A custom prefix will be "
"added to the corrected image.</li><li><b>Suffix:</b> A custom "
"suffix will be added to the corrected image.</li><li><b>Overwrite:</"
"b> All original images will be replaced.</li></ul></p><p>Each of "
"the four modes allows you to add an optional keyword to the image "
"metadata.</p>"
msgstr ""

and in basic and fine wrapping together:

msgid ""
"<p>These settings control the storage of the corrected images. "
"There are four modes to choose from:</p>"
"<p>"
"<ul>"
"<li><b>Subfolder:</b> The corrected images will be saved in a "
"subfolder under the current album path.</li>"
"<li><b>Prefix:</b> A custom prefix will be added to the corrected "
"image.</li>"
"<li><b>Suffix:</b> A custom suffix will be added to the corrected "
"image.</li>"
"<li><b>Overwrite:</b> All original images will be replaced.</li>"
"</ul>"
"</p>"
"<p>Each of the four modes allows you to add an optional keyword "
"to the image metadata.</p>"
msgstr ""

If you are editing the PO file with a dedicated PO editor, it may itself provide finely tuned wrapping and ignore the wrapping in the PO file, in which case Pology's wrapping facilities are superfluous to you[40]. But a PO editor may also present strings wrapped just as they are in the PO file (and most do!), in which case Pology's fine wrapping is just as useful as in combination with a plain text editor.

At least for alphabetic languages, the most convenient wrapping may be fine wrapping alone (no basic wrapping), while turning on the editor's dynamic (visual) line wrapping. This both makes the text structure easy to follow, and allows editing the translation by logical units (paragraphs, list items) without manually adjusting column breaks or putting up with ugly overlength or mid-broken lines. However, for ideographic languages, the editor's dynamic line wrapping may produce bad results, and there basic wrapping might be necessary. In fact, for the moment, for ideographic languages it may be better to skip Pology's wrapping entirely and stick with Gettext's wrapping, since the wrapping algorithm in Gettext is more sophisticated and directly supports ideographic writing systems.

If no wrapping mode is specified when the given PO file is written out, Pology will apply basic wrapping, just as Gettext tools do. There are three general sources from which Pology tools may try to determine the wrapping mode for the given PO file, in decreasing priority: from the command line options, from the PO file's header, and from the user configuration. A tool may or may not provide command line options and configuration fields for wrapping, but PO file headers are always consulted (since this is in Pology's core PO file handling facilities). See the description of the X-Wrapping header field for how to set the wrapping mode in the PO header, and the set-header sieve for how to set this field in many PO files at once.
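The order in which the three sources are consulted can be sketched as follows. This is a simplified illustration, not Pology's code: the function name resolve_wrapping is hypothetical, and each source is treated as a complete setting (None when it says nothing, a possibly empty tuple of modes otherwise), whereas the actual command line options switch basic and fine wrapping individually.

```python
# Basic wrapping is applied when no source specifies a wrapping mode.
DEFAULT_WRAPPING = ("basic",)


def resolve_wrapping(cmdline=None, header=None, config=None):
    """Pick the wrapping modes from the highest-priority source that is set."""
    for modes in (cmdline, header, config):
        if modes is not None:
            return tuple(modes)
    return DEFAULT_WRAPPING


print(resolve_wrapping())
print(resolve_wrapping(header=("fine",), config=("basic", "fine")))
print(resolve_wrapping(cmdline=(), header=("basic",)))
```

Note the difference between None and an empty tuple: a source that explicitly requests no wrapping still takes part in the resolution and overrides the lower-priority sources.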

9.8.1. Common Command Line Options for Wrapping

Pology tools in which the wrapping mode can be set from the command line will provide the following options:

--wrap

Perform basic wrapping, on a certain column.

--no-wrap

Do not perform basic wrapping.

--fine-wrap

Perform fine wrapping, on various expected visual breaks introduced by text markup in rendered text.

--no-fine-wrap

Do not perform fine wrapping.

--wrap-column=COL

The column at which the text should be wrapped. The wrapped line in the PO file will never be longer than this many columns, including the outer quotes. If not given, the default is 79.

Both positive and negative wrapping options are provided in order to be able to override the wrapping mode defined by the user configuration or the PO header. As in Gettext tools, strings are always wrapped on \n regardless of the wrapping mode.

9.8.2. Common User Configuration Fields for Wrapping

The following configuration fields will be read by the tools which consult the user configuration for wrapping mode, in their respective configuration sections:

[section]/wrap=[*yes|no]

Whether to perform basic wrapping, counterpart to --wrap and --no-wrap command line options.

[section]/fine-wrap=[yes|*no]

Whether to perform fine wrapping, counterpart to --fine-wrap and --no-fine-wrap command line options.

9.9. Influential Header Fields

The PO header is a natural place to provide the information which holds for the PO file as a whole. Pology scripts, sieves, and hooks can take into account a number of header fields, when available, to automatically determine some aspects of processing. The fields considered are as follows:

Language

This field contains the language code of the translation, which Pology will take into account in all contexts where language-dependent processing is done (such as when spell-checking). You can also specify the language into which you translate in user configuration, and sometimes in the command line. The language stated by the PO header will override the user configuration, but it will be in turn overridden by the command line. See also Section 8.1, “The Notion of Language in Pology”.

X-Accelerator-Marker

Accelerator markers are a frequent obstacle in text processing, such as searching or spell-checking, because they can split words apart. This field can be used to specify which character is used as the accelerator marker throughout the file, if any. If there are several possible characters, they can be given as a comma-separated list[41]. While it is usually possible to specify the accelerator marker through the command line, the header field is much more convenient and flexible: there is no need to remember to add the command line option at every run, and different PO files can have different accelerator markers. However, if the command line option is given, it will override the header field.

There is a difference between this field not existing in the header, and existing but with an empty value (i.e. "X-Accelerator-Marker: \n"). If the field does not exist, some processing elements will go into the "greedy" mode, where they use a list of known frequent accelerator markers (e.g. to remove them from the text). If the field is set to empty value, these processing elements will take it that there are no accelerator markers in text.

X-Associated-UI-Catalogs

This field lists the PO domains which are the source of user interface references (button labels, menu items, etc.) throughout the text in the current PO file. This makes it possible to automatically fetch and insert UI translations, rather than having to look them up manually and maintain them against changes; see Section 8.4, “Automatic Insertion of UI Labels” for details. Several PO domains can be given as a space- or comma-separated list. If the UI message is found in more than one listed PO domain, the earlier in the list takes precedence.

X-Environment

The language environment to which the translation belongs; see Section 8.1, “The Notion of Language in Pology” for details. It can be a single keyword, or a comma-separated list of keywords. If several environments are given, the later in the list (which is usually the more specific) takes precedence.

X-Text-Markup

When the text contains markup, it may be useful to remove it such that only the plain text remains. This is the case, for example, when computing word counts or applying terminology validation rules. Another use case would be the validation of markup itself (whether a tag is properly closed, whether a tag exists, etc.). This header field specifies the markup type found in the text, as a keyword, so that Pology can determine how to process it. Several markup types can be given as a comma-separated list.

Pology currently recognizes the following markup types:

  • docbook4 -- Docbook 4.x markup, in documentation POs

  • html -- HTML 4.01

  • kde4 -- markup in KDE4 UI POs, a mix of Qt rich-text and KUIT

  • kuit -- UI semantic markup in KDE 4

  • qtrich -- Qt rich-text, (almost) a subset of HTML

  • xmlents -- only XML-like entities, no other formal markup

X-Wrapping

This header field can be set to tell Pology how to wrap strings in the current PO file, for example, when posieve modifies a message and writes the modified PO file, or when rewrapping is done explicitly by porewrap. The value is a comma-separated list of wrapping modes, chosen from:

  • basic -- wrapping on a certain column

  • fine -- wrapping on logical breaks (such as <p> or <br/> tags)

Wrapping on escaped newline \n is always performed, regardless of the wrapping mode. If the field value is empty, no other wrapping is done. If more than one wrapping mode is given (e.g. "X-Wrapping: basic, fine\n"), the way the modes are combined is specifically defined, so the ordering is not important. As usual, if wrapping is specified by a command line option, that will override the header field.
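The way the field value is interpreted can be sketched in Python as follows. This is only an illustration of the rules just described, not Pology's own code: the function name and the cmdline_value parameter are made up for the example, and a command line override is assumed to be given in the same comma-separated form as the header field.

```python
KNOWN_MODES = {"basic", "fine"}

def wrapping_modes(header_value, cmdline_value=None):
    """Return the set of wrapping modes in effect.

    header_value: text of the X-Wrapping field, e.g. "basic, fine".
    cmdline_value: hypothetical command line override in the same form,
    or None when no wrapping option was given.
    """
    # A command line option, when present, overrides the header field.
    value = header_value if cmdline_value is None else cmdline_value
    modes = {m.strip() for m in value.split(",") if m.strip()}
    unknown = modes - KNOWN_MODES
    if unknown:
        raise ValueError("unknown wrapping modes: " + ", ".join(sorted(unknown)))
    # A set is returned because the ordering of modes is not important.
    # An empty set means that only wrapping on \n is performed.
    return modes
```

With this sketch, "basic, fine" and "fine,basic" give the same result, and an empty value gives an empty set.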

All of the listed header fields may be set manually, when you get to work on the particular PO file. But frequently it is possible to set them automatically, or at least automatically for the first time with later manual corrections where needed. For this you may use the set-header sieve. If PO files are periodically merged by the translation project automation (rather than each translator merging on his own only the PO files which he is about to update), the natural moment to run set-header is just after the merging. If translation is done in summit, you can specify in summit configuration to set header fields on merging.

9.10. Processing Hooks

Pology enables the user to insert special processing elements, called hooks, at many places in the processing chain. Hooks are Python functions with certain prescribed input, output, and behavior. Depending on the exact combination of these three ingredients, there are various hook types. Finally, some hooks can be adapted to a given context through their hook factories. Pology defines many hooks internally, and users can add their own external hooks.

Usage of hooks is best illustrated through examples. Suppose that you want to use the find-messages sieve to look for a certain word, but the text contains XML-like tags of the form <tagname>...</tagname> which happen to be throwing off your search. Suppose that there exists a hook called remove-xml-tags, in the Pology library module remove, which takes a piece of text as input and returns that piece of text cleared of any XML-like tags. Then you could insert this hook into the search to clear the tags before matching the text, by using the filter: parameter to find-messages:

$ posieve find-messages -s filter:'remove/remove-xml-tags' ...

Here remove/remove-xml-tags is the hook specification, and this is its usual simplest form: the module name, followed by a slash, followed by the hook name. (Sometimes it can be only the module name, when the hook function within that module has the same name as the module, but this is rare.) The hook specification was enclosed in single quotes, for the shell to see it as a single string; this was not necessary here, but it is a good habit to keep up when adding hooks through the command line, because hook specifications can get quite involved.

Suppose now that there is a single hook that can remove any kind of markup from the text (not only XML-like tags) called remove-markup, but that it has to be told which markup to remove, by giving it one of the markup type keywords known to Pology. Continuing the previous example, this could be done like this:

$ posieve find-messages -s filter:'remove/remove-markup~"docbook4"' ...

Now the hook specification is remove/remove-markup~"docbook4". Note that the outer single quotes in the command line are necessary, as otherwise the shell would strip the internal double quotes, which are here an integral part of the hook specification. remove-markup is actually a hook factory, which produces a hook based on the parameters given after the tilde (~) character. Here "docbook4" is that parameter; why must it be quoted? Because the part after the tilde is passed as an argument list to a Python function, and "docbook4" must be of string type, which is in Python denoted by quotes. For a hook factory foo/bar which would take a string and a number, the hook specification would be foo/bar~"qwyx",5. Sometimes a hook factory has default values for some or all of its arguments; in the latter case, if the defaults are sufficient, the part after the tilde in the hook specification can be left empty (e.g. foo/bar~).
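In Python terms, a hook factory is simply a function which returns another function. The following sketch shows what the supposed remove-markup factory could look like; like the hook in the example above, it is hypothetical, and it handles markup by crudely stripping anything that looks like a tag. It assumes that the hyphenated hook name corresponds to a Python function named with underscores.

```python
import re

# Anything that looks like an opening or closing tag.
_TAG_RX = re.compile(r"</?[A-Za-z][^<>]*>")

def remove_markup(mtype):
    """Hook factory: build a text filtering hook for one markup type."""
    # Hypothetical table of per-markup strippers; a real implementation
    # would treat each markup type according to its own rules.
    strippers = {
        "docbook4": lambda text: _TAG_RX.sub("", text),
        "html": lambda text: _TAG_RX.sub("", text),
    }
    if mtype not in strippers:
        raise ValueError("unknown markup type: %r" % mtype)
    strip = strippers[mtype]

    def hook(text):
        # The hook proper: takes text, returns modified text.
        return strip(text)

    return hook

# The part after the tilde becomes the argument list of the factory.
hook = remove_markup("docbook4")
print(hook("Press <guibutton>OK</guibutton> to continue."))
# prints: Press OK to continue.
```

The value "docbook4" is fixed once, when the hook is created, so the client that later calls the hook only needs to pass the text.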

Hooks can be language- and project-dependent. Suppose that in your language the letters are sometimes accented, but the accents should be ignored on spell-checking. Then Pology may contain a hook which strips accents from text in your language. If your language code is ll, and the hook is remove-accents in (language-specific) module remove, you could check spelling while ignoring accents using the check-spell-ec sieve:

$ posieve check-spell-ec -s filter:'ll:remove/remove-accents' ...

The hook specification now also contains the language code separated by colon, as ll:.... If the hook is project-specific instead, it is prefixed with pp%..., where pp is the project identifier and percent sign the separator. If the hook is both language- and project-specific, then the specification is ll:pp%... or pp%ll:....
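The parts of a hook specification described so far can be taken apart as in the following Python sketch. It is an illustration only, and Pology's own parser may behave differently: the function names are made up, it is assumed that the colon and the percent sign never occur in module or hook names, and factory arguments are assumed to be plain positional Python literals.

```python
import ast

def parse_hook_spec(spec):
    """Split a hook specification into lang, proj, module, name, args.

    args is None for a direct hook, and the raw text after the tilde
    (possibly empty) for a hook factory.
    """
    # Factory arguments: everything after the first tilde.
    head, tilde, args = spec.partition("~")
    args = args if tilde else None

    lang = proj = None
    # Language (ll:) and project (pp%) prefixes may come in either order.
    while True:
        positions = [(head.find(sep), sep) for sep in ":%" if sep in head]
        if not positions:
            break
        pos, sep = min(positions)
        prefix, head = head[:pos], head[pos + 1:]
        if sep == ":":
            lang = prefix
        else:
            proj = prefix

    # Only the module name given: the hook has the same name as the module.
    module, slash, name = head.partition("/")
    if not slash:
        name = module
    return {"lang": lang, "proj": proj, "module": module,
            "name": name, "args": args}

def factory_args(args):
    """Turn the text after the tilde into a tuple of Python values."""
    if not args or not args.strip():
        return ()
    # Wrap into a tuple literal; the trailing comma covers one argument.
    return ast.literal_eval("(%s,)" % args)

print(parse_hook_spec('ll:pp%remove/remove-accents'))
print(factory_args('"qwyx",5'))
```

For example, both ll:pp%remove/remove-accents and pp%ll:remove/remove-accents yield language ll, project pp, module remove and hook name remove-accents.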

9.10.1. Hook Types

In places where a hook can be inserted, it is convenient to succinctly state which types of hooks are acceptable. Hook types are therefore coded with letter-number-letter combinations. The first letter can be F, V, or S, standing for filtering, validation, or side-effect hook, in that order. Filtering hooks modify their input, validation hooks report problems in input in a way understood by their clients, while side-effect hooks can do anything except modifying the input. The number after the first letter describes the composition of input, which can be pure text, PO message, PO header, etc. and their combinations. The final letter indicates the semantics of the input, like whether the input text is supposed to be the original (msgid) or the translation (msgstr) or can be any of them.

The following hook types are currently defined (the hook type is followed by the expected input in parentheses):

F1A (text)

Modifies the input text.

V1A (text)

Validates the input text.

S1A (text)

Side-effects based on the input text.

F3A (text, message, file)

Modifies the input text, which is one of the strings in the given PO message, which belongs to the given PO file. The difference between F1A and F3A hooks is that an F1A hook can process text based only on the text itself, while an F3A hook can process text by taking into account the information elsewhere in the PO message (e.g. in comments) and the PO file (e.g. in header). This holds for all *1* and *3* hook types.

V3A (text, message, file)

Validates the input text, which is one of the strings in the given PO message, which belongs to the given PO file.

S3A (text, message, file)

Side-effects based on the input text, which is one of the strings in the given PO message, which belongs to the given PO file.

F3B (original, message, file)

Modifies the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file. The difference between F3A and F3B hooks is that the input text of an F3B hook is expected to be precisely the original string in the message, while giving anything else will lead to undefined results. This holds for all *3A, *3B, *3C hook types.

V3B (original, message, file)

Validates the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file.

S3B (original, message, file)

Side-effects based on the input text, which is the msgid (or msgid_plural) string in the given PO message, which belongs to the given PO file.

F3C (translation, message, file)

Modifies the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.

V3C (translation, message, file)

Validates the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.

S3C (translation, message, file)

Side-effects based on the input text, which is one of the msgstr strings in the given PO message, which belongs to the given PO file.

F4A (message, file)

Modifies the input PO message, which belongs to the given PO file. The difference between F4A and F3A hooks is that an F3A hook can modify only the given string in the message, while an F4A hook can modify any number of strings, comments, etc. in the message. This holds for all *3* and *4* hook types.

V4A (message, file)

Validates the input PO message, which belongs to the given PO file.

S4A (message, file)

Side-effects based on the input PO message, which belongs to the given PO file.

F4B (header, file)

Modifies the input PO header, which belongs to the given PO file.

V4B (header, file)

Validates the input PO header, which belongs to the given PO file.

S4B (header, file)

Side-effects based on the input PO header, which belongs to the given PO file.

F5A (file)

Modifies the input PO file. As opposed to F1* and F3* hooks, which can modify only elements within PO messages, F5* hooks can also add, remove, and change positions of messages within the PO file.

V5A (file)

Validates the input PO file. As opposed to V1* and V3* hooks, which report only problems confined to PO messages, V5* hooks can also report problems due to relation between several PO messages each of which is valid in itself.

S5A (file)

Side-effects based on the input PO file.

F6A (any file)

Modifies the input file, whether in PO or another format, on the level of pure text lines. This is unlike F5A hooks which operate on the level of entries in the PO file; F6A hooks are also typically limited to certain types of files, perhaps even only PO files. This holds for all *6* hook types.

V6A (raw file)

Validates the input file.

S6A (raw file)

Side-effects based on the input file.
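The hook types listed above can be summarized in a small Python sketch, which decodes a type code and shows the difference between a 1-type and a 3-type filtering hook. All names here are made up for the illustration; in particular, the Cat class and its accel attribute are simplified stand-ins, not Pology's actual PO file objects.

```python
KINDS = {"F": "filtering", "V": "validation", "S": "side-effect"}

# Expected inputs by number-letter part of the type code.
INPUTS = {
    "1A": ("text",),
    "3A": ("text", "message", "file"),
    "3B": ("original", "message", "file"),
    "3C": ("translation", "message", "file"),
    "4A": ("message", "file"),
    "4B": ("header", "file"),
    "5A": ("file",),
    "6A": ("raw file",),  # listed as "any file" for F6A
}

def decode_hook_type(code):
    """Return (kind, inputs) for a hook type code such as 'F3C'."""
    kind, inputs = KINDS.get(code[:1]), INPUTS.get(code[1:])
    if kind is None or inputs is None:
        raise ValueError("unknown hook type: %r" % code)
    return kind, inputs

def squeeze_spaces(text):
    """F1A-style hook: works from the text alone."""
    return " ".join(text.split())

class Cat:
    """Stand-in for a PO file; only carries an accelerator marker."""
    def __init__(self, accel):
        self.accel = accel

def strip_accel(text, msg, cat):
    """F3A-style hook: consults the PO file to decide what to remove."""
    return text.replace(cat.accel, "") if cat.accel else text

print(decode_hook_type("F3C"))
print(squeeze_spaces("Open   file "))
print(strip_accel("&Open", None, Cat("&")))
```

The F1A-style function needs nothing but the text, while the F3A-style function has the same text as its first input but also receives the message and the file, from which it here reads the accelerator marker.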

9.10.2. List of Internal Hooks

Pology does not establish strict separation between users and programmers, but presents a continuum between pure use and pure programming, so that users can engage according to their needs and abilities. Hooks, in particular, occupy the middle of this range. On the one hand, they can be used even from the command line; on the other hand, they are actually Python functions, and hook specifications (in the command line and elsewhere) sometimes require Python argument lists (the part after the tilde). This makes it hard both to list all available hooks[42], and to decide where and how to document them, in the user manual or in the library programming interface (API) documentation. Therefore, the following approach is taken. Here, in the user manual, only functions written specifically to be used as hooks will be listed (sometimes grouped by similarity), with their types and short descriptions. To each of these, a link to the complete hook description in the API documentation will be added.[43]

9.10.2.1. General Hooks

bpatterns/bad-patterns (S3A), bpatterns/bad-patterns-msg (S4A), bpatterns/bad-patterns-msg-sp (V4A)

Detects unwanted patterns in text, by regular expression matching. Patterns can be specified either as direct arguments, or listed in a file given as an argument.

Caution

This hook is deprecated. Use validation rules instead, which are a much richer method of defining and checking for problems.

gtxtools/msgfilter (F6A)

Pipes the PO file through Gettext's msgfilter(1). The filter argument and options to msgfilter can be specified as parameters to hook factory. (May be used to wrap the PO file canonically, as Pology does not produce exactly the same wrapping as Gettext tools.)

gtxtools/msgfmt (S6A)

Pipes the PO file through Gettext's msgfmt(1), discarding output and reporting any errors as warnings. Useful as a hard check of the PO file syntax, and for the extended checks performed when msgfmt is run with the --check option.

markup/check-xml (S3C), markup/check-xml-sp (V3C)

Checks whether general XML markup in translation is well-formed, and possibly also whether entities are defined. Checks can be performed either only when the original text itself is valid or unconditionally.

markup/check-docbook4 (S3C), markup/check-docbook4-sp (V3C), markup/check-docbook4-msg (V4A), markup/check-html (S3C), markup/check-html-sp (V3C), markup/check-qtrich (S3C), markup/check-qtrich-sp (V3C), markup/check-kde4 (S3C), markup/check-kde4-sp (V3C), markup/check-pango (S3C), markup/check-pango-sp (V3C)

Specializations of markup/check-xml hook for various XML formats. Aside from well-formedness, these hooks can also check whether used tags really exist in the format, whether tags are properly nested, etc. (Full conformance to DTD or schema cannot be checked due to chunking into messages.)

markup/check-xmlents (S3C), markup/check-xmlents-sp (V3C)

Checks whether XML-like entities (&foo;) are defined. This can be used when the markup is not truly XML-like but it uses XML-like entities, or simply to have separate checking of tagging (by markup/check-xml-* hooks) and entities for convenience.

noop/text (F1A), noop/textm (F3A), noop/msg (F4A), noop/hdr (F4B), noop/cat (F5A), noop/path (F6A)

Filtering hooks that do nothing ("no-operation"). These are useful in contexts where a filtering hook is required, but input should not be really modified.

normalize/demangle-srcrefs (F4A)

In some message extraction scenarios, the source references end up pointing to dummy files which existed only during the extraction, but true source references can still be reconstructed (based on dummy file names or extracted comments). This hook will reconstruct true source references and replace dummy references with them.

normalize/uniq-source (F4A)

Sometimes source references in a PO message end up doubled (e.g. one prefixed with ./ and the other not) due to peculiarities of the extraction process. This hook will make source references unique.

normalize/uniq-auto-comment (F4A)

When extracted comments are automatically added to messages by the extraction tool, if the message is repeated in several source files it may end up containing multiple equal extracted comments. This hook can be used to make extracted comments unique (either all or those matching some criteria).

normalize/canonical-header (F4B)

Rearranges content of the PO header into canonical form. For example, translator comments will be sorted according to years of contribution, any repeated translator comments will be merged, etc.

remove/remove-accel-text (F3A), remove/remove-accel-text-greedy (F3A), remove/remove-accel-msg (F4A), remove/remove-accel-msg-greedy ()

Removes the accelerator marker from one or all strings in the message. These hooks check whether the PO file specifies the accelerator marker; if it does not, the non-greedy variants do nothing, while the greedy variants remove everything that is frequently used as an accelerator marker.

remove/remove-markup-text (F3A), remove/remove-markup-msg (F4A)

Converts markup (e.g. XML tags) in one or all strings in the message to plain text. The PO file will be asked for the expected markup types in text; if no markup type is specified, these hooks will do nothing.

remove/remove-fmtdirs-text (F3A), remove/remove-fmtdirs-text-tick (F3A), remove/remove-fmtdirs-msg (F4A), remove/remove-fmtdirs-msg-tick (F4A)

Removes format directives in one or all strings in the message, or replaces them with a fixed placeholder. The type of format directives is determined by *-format message flags.

remove/remove-literals-text (F3A), remove/remove-literals-text-tick (F3A), remove/remove-literals-msg (F4A), remove/remove-literals-msg-tick (F4A)

Removes "literal" segments from one or all strings in the message, or replaces them with a fixed placeholder. Literal segments are those which are used as computer input somewhere along the line, such as URLs, email addresses, command line options, etc. and therefore generally do not conform to human language rules. The translator can also explicitly declare literal segments, by adding a special translator comment.

remove/remove-marlits-text (F3A), remove/remove-marlits-msg (F4A)

remove/remove-literals-* hooks can positively determine only certain types of literals based on the text alone. If the text contains semantic markup, such as Docbook, literal segments can also be determined based on tags, and these hooks will remove both such tags and their text. The markup type will be taken from the PO file. (When these hooks are used, remove/remove-literals-* is not needed.)

remove/rewrite-msgid (F4A)

Checks are sometimes defined such that something is first looked up in the original text, and if it is found, something is expected in the translation. No matter how well written these checks are, the original text will sometimes be a bit out of the ordinary, and the check will fail the translation although everything is fine. This can usually be corrected by the translator manually adding a directive, in a special translator comment, to "rewrite" the problematic part of the original before the check is applied.

remove/rewrite-inverse (F4A)

The original text in the message needs to be modified for the same reasons as described in remove/rewrite-msgid, but it is actually easiest to replace the original text entirely with the original text from another message sharing the same translation (i.e. by "inverse" pairing of messages over translation).

remove/remove-paired-ents (F4A), remove/remove-paired-ents-tick (F4A)

Removes all XML-like entities (&foo;) from the original text, and all XML-like entities from the translation that were encountered in the original. This may be useful prior to markup validity checks, when the list of defined entities cannot be provided.

spell/check-spell (S3A), spell/check-spell-sp (V3A)

Spell-checking hooks, as one element of Pology's spell-checking functionality.

uiref/resolve-ui (F3C), uiref/resolve-ui-docbook4 (F3C), uiref/resolve-ui-kde4 (F3C)

When translating program documentation, using these hooks it is possible to leave UI references (button labels, menu items, etc.) untranslated and let them be automatically inserted into translation later on. The basic hook requires UI references to be manually wrapped in translation in order to be detected, while specialized versions will also use semantic markup for detection (e.g. <guilabel> element in Docbook).

uiref/check-ui (V3C), uiref/check-ui-docbook4 (V3C), uiref/check-ui-kde4 (V3C)

While uiref/resolve-ui hooks will complain when they cannot find a translation for a UI reference, when checking the overall validity of translation it is more convenient to use specialized check-only hooks which will not modify the PO file on successfully resolved UI references.
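To give a feel for what the hooks listed above do internally, here is a Python sketch of the idea behind remove/remove-paired-ents. It is not the code of that hook: it works on two plain strings rather than on a PO message, and the regular expression for XML-like entities is an assumption made for this example.

```python
import re

# XML-like entity such as &foo; (assumed set of allowed name characters).
_ENTITY_RX = re.compile(r"&[\w.:-]+;")

def remove_paired_ents(msgid, msgstr):
    """Return (msgid, msgstr) with paired entities removed.

    All entities are removed from the original; from the translation,
    only those entities which were also encountered in the original.
    """
    seen = set(_ENTITY_RX.findall(msgid))
    new_msgid = _ENTITY_RX.sub("", msgid)
    new_msgstr = _ENTITY_RX.sub(
        lambda m: "" if m.group(0) in seen else m.group(0), msgstr)
    return new_msgid, new_msgstr

print(remove_paired_ents("Start &kappname; now", "Pokreni &kappname; &sada;"))
```

An entity which appears only in the translation (here &sada;) is kept, so that a subsequent markup check can still report it if it is undefined.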

9.10.2.2. Language-Specific Hooks

ja:katakana (F1A)

Removes everything but Katakana words from Japanese text, and separates retained words with spaces. (Used as filter prior to spell-checking words in Katakana.)

nn:exclusion/inofficial-forms (V3C)

Checks if there are any inofficial word forms in Norwegian Nynorsk translation.

sr:accents/resolve-agraphs (F1A)

Converts "accent graphs" to proper accented letters in Serbian Cyrillic text (e.g. becomes а̂).

sr:accents/remove-accents (F1A)

Replaces accented letters in Serbian Cyrillic text with their non-accented counterparts. (Useful as filter prior to spell-checking.)

sr:charsets/limit-to-isocyr (F1A), sr:charsets/limit-to-isolat (F1A)

In situations where it is necessary to use an 8-bit encoding instead of Unicode for Serbian text, these hooks can be used to constrain characters in text to only those representable by the target 8-bit encoding.

sr:checks/naked-latin (V3C), sr:checks/naked-latin-origui (V3C), sr:checks/naked-latin-se (S3C), sr:checks/naked-latin-origui-se (S3C)

In translations into Serbian using Cyrillic script, ordinary segments in Latin script may indicate error or omission in translation. These hooks will look for such stray Latin segments, while ignoring recognizable literal segments such as URLs, commands, options, etc.

sr:nobr/to-nobr-hyphens (F1A)

The ordinary hyphen (-) is normally treated as a character on which the text can be split into the next line. In Serbian texts, hyphens are sometimes used to attach case endings to nouns (especially acronyms), which should not be split into the next line. This hook guesses such positions and replaces the ordinary hyphen with a no-break hyphen.

sr:reduce/words-ec (F1A), sr:reduce/words-ec-lw (F1A), sr:reduce/words-ic (F1A), sr:reduce/words-ic-lw (F1A), sr:reduce/words-ic-lw-dlc (F1A)

Various reductions of Serbian text to a subset of words of certain type, possibly rearranged in a particular way.

sr:trapres/froments (F3C), sr:trapres/froments-t1 (F3C), sr:trapres/froments-t1db (F3C)

Hooks which resolve grammatical inserts in the form of XML entities in Serbian text, based on the "trapnakron" contained within Pology. See the documentation in the Serbian section for details.

sr:uiref/mod_entities (F1A)

When UI references are automatically resolved in documentation, and the UI texts may contain grammatical inserts in form of XML entities, these inserts may need to be slightly modified to keep the documentation structure valid.

sr:wconv/ctol (F1A), sr:wconv/cltoa (F1A), and many more

Hooks for various transliterations and hybridizations of Serbian text, by script (Cyrillic, Latin) and dialect (Ekavian, Ijekavian). See the documentation in the Serbian section for details.

9.10.2.3. Project-Specific Hooks

kde%header/equip-header (F4B)

Adds assorted header fields to PO files within the KDE Translation Project, with values based on their name and position in the repository tree, so that Pology and other tools are better informed how to process them.

9.10.3. Using External Hooks

[Not implemented yet.]

See Section 11.4, “Writing Hooks” for instructions on how to write and contribute hooks.

9.11. Skipping and Selecting Checks

With all the different heuristic checks and rules that Pology can apply, false positives -- messages proclaimed invalid when they are actually valid -- are inevitable. False positives are very inconvenient in a serious automatic quality control effort. They make it harder for translators to spot real problems, which in turn demotivates them from applying automatic checks at all. If there are one or a few dedicated people in the translation team who tweak and apply automatic checks, they would be particularly hard-hit by this negative feedback. False positives can reduce automatic quality control from a strong normative element in the workflow to a merely advisory "run-if-you-have-the-time" extra.

For this reason, most checks in Pology provide a way to be disabled on certain messages, files, or the whole processing batch, so that it is possible to methodically cancel false positives. Conversely, it is usually possible to run one or a few checks on their own, which makes them easier to define and debug. Each checking tool and element documents such functionality, and in the following only some general patterns are described.

The simplest method to disable or enable some checks is "dynamically", for a single validation run, through an option to the tool which is being run. For example, the check-rules sieve provides several parameters to select and deselect validation rules which are to be applied. The important point here is that checks in Pology usually have some sort of a unique identifier, a keyword, by which they can be referred to.

"Static" methods to disable or enable checks are those where the instruction is written down somewhere, in a specific format, and automatically taken into account by the validation tool in subsequent runs. There may be several static methods to disable a certain check, differing in their reach: a group of PO files, single PO file, single message, or even a part of the text in the message. Within one PO file, the following methods are common:

  • The PO header is a natural place to disable or enable checks for the complete PO file, by adding a custom X- header field.

  • On the single message level, the only place where it is possible to add a manual processing instruction is a translator comment. This is because if it were put anywhere else (e.g. as an extracted comment or a flag), it would be removed on subsequent merging with the template. These instructions are usually kept simple, like this:

    # some-instruction: arguments
    #: ...
    msgid "..."
    msgstr "..."
    

    Instructions are always composed of two or more words, separated by hyphens, ended by colon, and followed by an arbitrary argument string (e.g. a list of identifiers of checks to skip on this message). This makes it sufficiently unlikely that another, free-form translator comment will be accidentally interpreted as a known instruction.[44]

  • A special type of translator comment with processing instructions is a comment of the following form:

    # |, flag1, flag2, ...
    

    This is a "translator flag" comment, which is used to set processing instructions too simple to occupy one whole comment line (e.g. those of the switch type, never needing arguments). It starts with |,, and continues with comma-separated list of flag-like keywords.
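The two comment forms described above can be recognized as in the following Python sketch. It is a simplified illustration rather than Pology's own parsing: the function name is made up, a "word" is assumed to consist of ASCII letters and digits, and comment lines are assumed to be passed in with the leading "# " already removed.

```python
import re

# Two or more hyphen-separated words, a colon, then free-form arguments.
_INSTR_RX = re.compile(r"^([A-Za-z0-9]+(?:-[A-Za-z0-9]+)+):\s*(.*)$")

def parse_translator_comments(comments):
    """Collect instructions and flags from translator comment lines.

    Returns (instructions, flags): a dict mapping instruction name to
    its argument string, and a set of translator flags.
    """
    instructions, flags = {}, set()
    for line in comments:
        line = line.strip()
        if line.startswith("|,"):
            # Translator flag comment: comma-separated keywords.
            flags.update(f.strip() for f in line[2:].split(",") if f.strip())
            continue
        m = _INSTR_RX.match(line)
        if m:
            instructions[m.group(1)] = m.group(2).strip()
        # Anything else is an ordinary free-form comment and is ignored.
    return instructions, flags

print(parse_translator_comments([
    "some-instruction: check1, check2",
    "|, flag1, flag2",
    "Note: just a remark by the translator",
]))
```

The last comment in the example is not taken as an instruction, because "Note" is a single word without hyphens; this is the property that keeps free-form comments from being misread.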



[39] The other frequently encountered file organization is when there is one directory per PO domain, and that directory contains PO files for all languages, named as LANG.po.

[40] But if several people are working on a collection of PO files, it is nevertheless good to agree on fixed wrapping. This is both friendly to those exposed to original wrapping, and to version control systems.

[41] This does mean that the case when the comma itself is the accelerator marker is not covered, but this case is beyond unlikely.

[42] For example, any Python function in Pology that takes one string and returns the modified version of that string can be considered an F1A hook!

[43] In the API documentation, the very first line of the function description will show if the function is a direct hook or a hook factory, the function header will list the inputs for a direct hook (which conform to the declared hook type) or the factory parameters for a hook factory, and the rest of the description will explain the operation of the hook and the meaning of factory parameters.

[44] Especially considering that free-form translator comments are more usually written in the language of the translation.