Chapter 11. Programming with Pology

You may find it odd that the user manual contains a section on programming, since that is normally the subject of a separate, programmer-oriented document. On the other hand, while reading the "pure user" sections of this manual, you may have noticed that in Pology the distinction between a user and a programmer is blurrier than one would expect of a translation-related tool. Indeed, before getting into writing standalone Python programs which use the Pology library, there are many places in Pology itself where you can plug in some Python code to adapt the behavior to your language and translation environment. This section exists to support and stimulate such interaction with Pology.

The Pology library is quite simple conceptually and organizationally. It consists of a small core abstraction of the PO format, and a lot of mutually unrelated functionality that may come in handy in particular translation processing scenarios. Everything is covered by the Pology API documentation, but since API documentation tends to be non-linear and full of details obstructing the bigger picture, the following subsections are there to provide synthesis and rationale of salient points.

11.1. PO Format Abstraction

The PO format abstraction in Pology is a quite direct and fine-grained reflection of PO format elements and conventions. This was a design goal from the start; no attempt was made at a more general abstraction, one which would also try to support various other translation file formats.

There is, however, one glaring but intentional omission: multi-domain PO files (those which contain domain "..." directives) are not supported. We have never observed a multi-domain PO file in the wild, nor can we think of a significant advantage it would have today over multiple single-domain PO files. Supporting multi-domain PO files would not only mean always needing two nested loops to iterate through the messages in a PO file, but it would also interfere with higher levels in Pology which assume equivalence between PO files and domains. Pology will simply report an error when trying to read a multi-domain PO file.

11.1.1. Monitored Objects

Because the PO abstraction is intended to be robust against programming errors when quickly writing custom scripts, and frugal with file modifications, by default some of the abstracted objects are "monitored". This means that they are checked for expected data types and have modification counters. The main monitored objects are PO files, PO headers, and PO messages, but also those of their attributes which are not plain data types (strings or numbers). For the moment, these secondary monitored types include Monlist (the monitored counterpart to the built-in list), Monset (counterpart to set), and Monpair (like a two-element tuple). Monitored types do not in general provide the full scope of functionality of their built-in counterparts, so it may sometimes be easier (and faster) to work with built-in types and convert them to monitored types at the moment of adding to PO objects.

To take a Monlist instance as an example, here is how it behaves on its own:

>>> from pology.monitored import Monlist
>>> l = Monlist([u"a", u"b", u"c"])
>>> l.modcount
0
>>> l.append(10)
>>> l
Monlist([u"a", u"b", u"c", 10])
>>> l.modcount
1

Appending an element has caused the modification counter to increase, but, as expected, it was possible to add an integer in spite of previous elements being strings. However, if the monitored list comes from a PO message:

>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgstr
Monlist([])
>>> msg.msgstr.append(10)
Traceback (most recent call last):
pology.PologyError: Expected <type 'unicode'> for sequence element, got <type 'int'>.
>>> msg.msgstr.append(u"bar")
>>> msg.msgstr.modcount
1
>>> msg.modcount
1

The Message class has type constraints added to its attributes, and therefore addition of an integer to the .msgstr list was rejected: only unicode values are allowed. This is particularly important due to the basic string type in Python being the raw byte array str[51], to automatically prevent carelessness with encodings. Once a proper string was added to .msgstr list, its modification counter increased, but also the modification counter of the parent object.
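The interplay of type checking and counter propagation can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not Pology's actual implementation; the class and attribute names are made up for the example:

```python
class MiniMonlist(object):
    """Toy monitored list: type-checked elements, counter propagation."""

    def __init__(self, elements=(), eltype=str, parent=None):
        self._elements = list(elements)
        self._eltype = eltype
        self._parent = parent
        self.modcount = 0

    def append(self, value):
        # Reject elements of unexpected type, as the real Monlist does
        # when attached to a Message with type constraints.
        if not isinstance(value, self._eltype):
            raise TypeError("expected %s for sequence element, got %s"
                            % (self._eltype.__name__, type(value).__name__))
        self._elements.append(value)
        self.modcount += 1
        # A modification also bumps the parent object's counter.
        if self._parent is not None:
            self._parent.modcount += 1


class MiniMessage(object):
    """Toy parent object holding one monitored attribute."""

    def __init__(self):
        self.modcount = 0
        self.msgstr = MiniMonlist(eltype=str, parent=self)
```

With this sketch, appending a string bumps both the list's and the parent's counter, while appending an integer raises an exception, mirroring the session above.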

A few more notes on modification counters. Consider this example:

>>> msg = Message()
>>> msg.msgstr = Monlist([u"foo"])
>>> msg.msgstr.modcount
0
>>> msg.msgstr_modcount
1
>>> msg.modcount
1
>>> msg.msgstr[0] = u"foo"
>>> msg.msgstr.modcount
0
>>> msg.msgstr = Monlist([u"foo"])
>>> msg.msgstr_modcount
1
>>> msg.modcount
1

Monlist([u"foo"]) itself is a fresh list with its modification counter at 0, so after it is assigned to msg.msgstr, its modification counter is still 0. However, every attribute of a parent monitored object also has an associated attribute modification counter, denoted with a trailing _modcount; therefore msg.msgstr_modcount did increase on assignment, and so did the parent msg.modcount. Modification tracking actually checks for equality of values, so when same-valued objects are repeatedly assigned (starting from msg.msgstr[0] = u"foo" above), modification counters do not increase.
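The equality check can be modeled like this; again a plain illustrative sketch, not the actual monitored machinery:

```python
class MonAttr(object):
    """Toy monitored attribute with an equality-checked counter."""

    def __init__(self, value):
        self._value = value
        self.modcount = 0

    def set(self, value):
        # Only an actual change of value counts as a modification;
        # assigning an equal value leaves the counter untouched.
        if value != self._value:
            self._value = value
            self.modcount += 1

    def get(self):
        return self._value
```

Assigning ["foo"] to an attribute that already holds ["foo"] leaves the counter at its previous value; assigning ["bar"] bumps it by one.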

Compound monitored objects may also have the attributes themselves constrained, to prevent typos and other brain glitches from causing mysterious wrong behavior when processing PO files. For example:

>>> msg = Message()
>>> msg.msgtsr = Monlist([u"foo"])
Traceback (most recent call last):
pology.PologyError: Attribute 'msgtsr' is not among specified.

You may conclude that modification tracking and type and attribute constraining would slow down processing, and you would be right. Since PO messages are by far the most processed objects, a non-monitored counterpart to Message is provided as well, for occasions where the code is only reading PO files, or has been sufficiently tested, and speed is of importance. See Section 11.1.2, “Message” for details.

11.1.2. Message

PO messages are by default represented with the Message class. It is monitored for modifications, and constrained on attributes and attribute types. It provides direct attribute access to parts of a PO message:

>>> from pology.monitored import Monpair
>>> from pology.message import Message
>>> msg = Message()
>>> msg.msgid = u"Foo %s"
>>> msg.msgstr.append(u"Bar %s")
>>> msg.flag.add(u"c-format")
>>> msg.fuzzy = True
>>> print msg.to_string(),
#, fuzzy, c-format
msgid "Foo %s"
msgstr "Bar %s"


Attribute access provides the least hassle, while being guarded by monitoring, and makes clear the semantics of particular message parts. For example, the .flag attribute is a set, to indicate that the order of flags should be of no importance to either a human translator or a PO processor, and the .msgstr attribute is always a list in order to prevent the programmer from not taking into account plural messages. While the fuzzy state is formally indicated by a flag, it is considered special enough to have a separate attribute.

Some message parts may or may not be present in a message, and when they are not present, the corresponding attributes are either empty if sequences (e.g. .manual_comment list for translator comments), or set to None if strings[52] (e.g. .msgctxt).

There are also several derived, read-only attributes for special purposes. For example, if in some context the messages are to be tracked in a dictionary by their keys, there is the .key attribute available, which is an undefined but unique combination of .msgctxt and .msgid attributes. Or, there is the .active attribute which is True if the message is neither fuzzy nor obsolete, i.e. its translation (if there is one) would be used by the consumer of the PO file that the message is part of.
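For illustration, here is how keyed lookup and the .active logic might be modeled with a stand-in class. The real .key is an unspecified combination of .msgctxt and .msgid; a (msgctxt, msgid) tuple merely captures the idea, and MiniMsg is not the Pology Message class:

```python
class MiniMsg(object):
    """Stand-in message with derived .key and .active attributes."""

    def __init__(self, msgid, msgctxt=None, fuzzy=False, obsolete=False):
        self.msgid = msgid
        self.msgctxt = msgctxt
        self.fuzzy = fuzzy
        self.obsolete = obsolete

    @property
    def key(self):
        # Unique per (context, original text) pair, as in a PO file.
        return (self.msgctxt, self.msgid)

    @property
    def active(self):
        # Active means neither fuzzy nor obsolete.
        return not self.fuzzy and not self.obsolete


# Two messages with the same msgid but different contexts coexist.
msgs = [MiniMsg(u"Open", msgctxt=u"verb"), MiniMsg(u"Open", fuzzy=True)]
by_key = dict((m.key, m) for m in msgs)
```

The dictionary then holds both messages, distinguishable by context, and .active singles out the one whose translation would actually be used.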

Message has a number of methods for frequent operations that need to read or modify more than one attribute. For example, to thoroughly unfuzzy a message, it is not sufficient to just remove its fuzzy flag (by setting .fuzzy to False or removing u"fuzzy" from .flag set), but previous field comments (#| ...) should be removed as well, and this is what .unfuzzy() method does:

>>> print msg.to_string(),
#| msgid "Foubar"
#, fuzzy
msgid "Foobar"
msgstr "Fubar"

>>> msg.unfuzzy()
>>> print msg.to_string(),
msgid "Foobar"
msgstr "Fubar"

Other methods include those to copy over a subset of parts from another message, to revert the message to pristine untranslated state, and so on.

There exists a non-monitored counterpart to Message, the MessageUnsafe class. Its attributes are of built-in types, e.g. .msgstr is a plain list, and there is no type or attribute checking. By using MessageUnsafe, a speedup of 50% to 100% has been observed in practical applications, so it makes for a good trade-off when you know what you are doing (e.g. you are certain that no modifications will be made). A PO file is opened with non-monitored messages by passing the monitored=False argument to the Catalog constructor.

Read-only code should work with Message and MessageUnsafe objects without any type-based specialization. Code that writes may need some care to achieve the same, for example:

def translate_moo_as_mu (msg):

    if msg.msgid == u"Moo!":  # works for both
        msg.msgstr = [u"Mu!"]  # raises exception if Message
        msg.msgstr[:] = [u"Mu!"]  # works for both
        msg.msgstr[0] = u"Mu!"  # works for both (when not empty)

If you need to create an empty message of the same type as another message, or make a same-type copy of the message, you can use type built-in:

newmsg1 = type(msg)()  # create empty
newmsg2 = type(msg)(msg)  # copy

Message and MessageUnsafe share the virtual base class Message_base, so you can use isinstance(obj, Message_base) to check if an object is a PO message of either type.

11.1.3. Header

The PO header could be treated as just another message, but that would both be inconvenient for operating on it, and disruptive in iteration over a catalog. Instead the Header class is introduced. Similar to Message, it provides both direct attribute access to parts of the header (like the .field list of name-value pairs), and methods for usual manipulations which would need a sequence of basic data manipulations (like .set_field() to either modify an existing or add a new header field with the given value).

In particular, header comments are represented by a number of attributes (.title, .author, etc.), some of which are strings and some lists, depending on semantics. Unfortunately, the PO format does not define this separation formally, so when the PO file is parsed, comments are split heuristically (.title will be the first comment line, .author will get every line which looks like it has an email address and a year in it, etc.).

Header is a monitored class just like Message, but unlike Message it has no non-monitored counterpart. This is because in practice header operations make up a small part of total processing, so there is no real advantage in having non-monitored headers.

11.1.4. Catalog

PO files are read and written through Catalog objects. A small script to open a PO file on disk (given as the first argument), find all messages that contain a certain substring in the original text (given as the second argument), and write those messages to standard output, would look like this:

import sys
from pology.catalog import Catalog
from pology.msgreport import report_msg_content

popath = sys.argv[1]
substr = sys.argv[2]

cat = Catalog(popath)
for msg in cat:
    if substr in msg.msgid:
        report_msg_content(msg, cat)

Note the minimalistic code, both by raw length and access interface. Instead of using something like print msg.to_string() to output the message, already in this example we introduce the msgreport module, which contains various functions for reporting on PO messages;[53] report_msg_content() will first output the PO file name and location of the message (line and entry number) within the file, and then the message content itself, with some highlighting (for field keywords, fuzzy state, etc.) if the output destination permits it. Since no modifications are done to messages, this example would be just as safe but run significantly faster if the PO file were opened in non-monitored mode. This is done by adding the monitored=False argument to Catalog constructor:

cat = Catalog(popath, monitored=False)

and no other modification is required.

When some messages are modified in a catalog created by opening a PO file on disk, the modifications will not be written back to disk until the .sync() method is called -- not even if the program exits. If the catalog is monitored and there were no modifications to it up to the moment .sync() is called, the file on disk will not be touched, and .sync() will return False (it returns True if the file is written).[54] In a scenario where a bunch of PO files are processed, this allows you to report only those which were actually modified. Take as an example a simplistic[55] script to search and replace in translation:

import sys
from pology.catalog import Catalog
from pology.fsops import collect_catalogs
from pology.report import report

searchstr = sys.argv[1]
replacestr = sys.argv[2]
popaths = sys.argv[3:]

popaths = collect_catalogs(popaths)
for popath in popaths:
    cat = Catalog(popath)
    for msg in cat:
        for i, text in enumerate(msg.msgstr):
            msg.msgstr[i] = text.replace(searchstr, replacestr)
    if cat.sync():
        report("%s (%d)" % (cat.filename, cat.modcount))

This script takes the search and replace strings as the first two arguments, followed by any number of PO paths. The paths do not have to be only file paths, but can also be directory paths, in which case the collect_catalogs() function from fsops module will recursively collect any PO files in them. After the search and replace iteration through a catalog is done (msgstr being properly handled on plain and plural messages alike), its .sync() method is called, and if it reports that the file was modified, the file's path and number of modified texts is output. The latter is obtained simply as the modification counter state of the catalog, since it was bumped up by one on each text that actually got modified. Note the use of .filename attribute for illustration, although in this particular case we had the path available in popath variable.

Syncing to disk is an atomic operation. This means that if you or something else aborts the program in the middle of execution, none of the processed PO files will become corrupted; they will either be in their original state, or in the expected modified state.
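Atomicity of this kind is commonly achieved by writing the new content to a temporary file in the same directory and then renaming it over the original, since renaming within one file system is atomic on POSIX. Whether .sync() does exactly this is an implementation detail; the general pattern can be sketched as:

```python
import os
import tempfile

def atomic_write(path, content):
    """Write content to path so that an aborted run leaves either the
    old file or the complete new file, never a half-written one."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory, so that the
    # final rename does not cross file system boundaries.
    fd, tmppath = tempfile.mkstemp(dir=dirpath)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.rename(tmppath, path)  # atomic replacement on POSIX
    except Exception:
        os.remove(tmppath)
        raise
```

A crash between write() and rename() leaves the original file untouched; a crash after rename() leaves the complete new file.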

As can be seen, at its base the Catalog class is an iterable container of messages. However, the precise nature of this container is less obvious. To the consumer (a program or converter) the PO file is a dictionary of messages by keys (msgctxt and msgid fields); there can be no two messages with the same key, and the order of messages is of no importance. For the human translator, however, the order of messages in the PO file is of great importance, because it is one of context indicators. Message keys are parts of the messages themselves, which means that a message is both its own dictionary key and the value. Taking these constraints together, in Pology the PO file is treated as an ordered set, and the Catalog class interface is made to reflect this.
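The ordered-set behavior can be modeled with a list that preserves order plus a key index that gives constant-time lookup. This is a bare sketch with made-up names; the real Catalog class does far more:

```python
class KeyedMsg(object):
    """Minimal stand-in message: the key doubles as the value."""

    def __init__(self, msgctxt, msgid):
        self.key = (msgctxt, msgid)


class MiniCatalog(object):
    """Toy ordered set of messages."""

    def __init__(self):
        self._msgs = []    # preserves PO file order
        self._index = {}   # key -> position, for O(1) lookup

    def add_last(self, msg):
        # Same key: overwrite in place; new key: append at the end.
        if msg.key in self._index:
            self._msgs[self._index[msg.key]] = msg
        else:
            self._index[msg.key] = len(self._msgs)
            self._msgs.append(msg)

    def find(self, msg):
        return self._index.get(msg.key, -1)

    def __contains__(self, msg):
        return msg.key in self._index

    def __iter__(self):
        return iter(self._msgs)
```

Adding a message with an existing key replaces the old one rather than creating a duplicate, exactly the set-like constraint described above, while iteration still visits messages in their original order.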

The ordered set nature of catalogs comes into play when the composition of messages, rather than just the messages themselves, is modified. For example, to remove all obsolete messages from the catalog, the .remove() method could be used:

for msg in list(cat):
    if msg.obsolete:
        cat.remove(msg)

Note that the message sequence was first copied into a list, since the removal would otherwise clobber the iteration. Unfortunately, this code will be very slow (linear time wrt. catalog size), since when a message is removed, the internal indexing has to be updated to maintain both the order and quick lookups. Instead, the better way to remove messages is the .remove_on_sync() method, which marks the message for removal on syncing. This runs fast (constant time wrt. catalog size) and requires no copying into a list prior to iteration:

for msg in cat:
    if msg.obsolete:
        cat.remove_on_sync(msg)

A message is added to the catalog using the .add() method. If .add() is given only the message itself, it will overwrite the message with the same key if there is one such, or else insert it according to source references, or append it to the end. If .add() is also given the insertion position, it will insert the message at that position only if the message with the same key does not exist in the catalog; if it does, it will ignore the given position and overwrite the existing message. When the message is inserted, .add() suffers the same performance problem as .remove(): it runs in linear time. However, the common case when an empty catalog is created and messages added one by one to the end can run in constant time, and this is what .add_last() method does.[56]

The basic way to check whether a message with the same key exists in the catalog is to use the in operator. Since the catalog is ordered, if the position of the message is wanted, the .find() method can be used instead. Both of these are fast, running in constant time. There is a series of .select_*() methods for looking up messages by something other than the key; these run in linear time, and return lists of messages since the result may no longer be unique.

Since it is ordered, the catalog can be indexed, and that either by a position or by a message (whose key is used for lookup). To replace a message in the catalog with a message which has the same key but is otherwise different, you can either first fetch its position and then use it as the index, or use the message itself as the index:

# Indexing by position.
pos = cat.find(msg)
cat[pos] = msg

# Indexing by message key.
cat[msg] = msg

This leads to the following question: what happens if you modify the key of a message (its .msgctxt or .msgid attributes) in the catalog? In that case the internal index goes out of sync, rather than being automatically updated. This is a necessary performance measure. If you need to change message keys, while doing so you should treat the catalog as a pure list, using only iteration and positional indexing. Afterwards you should either call .sync() if you are done with the catalog, or .sync_map() to only update the indexing (and remove messages marked with .remove_on_sync()) without writing out the PO file.
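A small stand-alone sketch, with made-up names, shows why a changed key desynchronizes the index until it is rebuilt:

```python
class StaleKeyMsg(object):
    """Stand-in message whose msgid serves as its key."""

    def __init__(self, msgid):
        self.msgid = msgid


msgs = [StaleKeyMsg(u"Foo")]
# Build a key-to-position index, as a catalog does internally.
index = dict((m.msgid, i) for i, m in enumerate(msgs))

msgs[0].msgid = u"Bar"   # key changed behind the index's back

# The index is now stale: it still maps the old key, not the new one.
stale_old = u"Foo" in index      # True
stale_new = u"Bar" in index      # False

# The equivalent of .sync_map(): rebuild the index from scratch.
index = dict((m.msgid, i) for i, m in enumerate(msgs))
```

After the rebuild, lookups by the new key work again; until then, only positional access and plain iteration are safe.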

The Catalog class provides a number of convenience methods which report things about the catalog based on the header information, rather than having to manually examine the header. These include the number of plural forms, the msgstr index for the given plural number, as well as information important in some Pology contexts, like language code, accelerator markers, markup types, etc. Each of these methods has a counterpart which sets the appropriate value, but this value is not written to disk when the catalog is synced. This is because frequently there are more ways in which the value can be determined from the header, so it is ambiguous how to write it out. Instead, these methods are used to set or override values provided by the catalog (e.g. based on command line options) for the duration of processing only.

To create an empty catalog if it does not exist on disk, the create=True argument can be added to the constructor. If the catalog does exist, it will be opened as usual; if it did not exist, the new PO file will be written to disk on sync. To unconditionally create an empty catalog, whether the PO file exists or not at the given path, the truncate=True parameter should be added as well. In this case, if the PO file did exist, it will be overwritten with the new content only when the catalog is synced. The catalog can also be created with an empty string for path, in which case it is guaranteed to be empty even without setting truncate=True. If a catalog with empty path should later be synced (as opposed to being transient during processing), its .filename attribute can simply be assigned a valid path before calling .sync().

In summary, it can be said that the Catalog class is biased, in terms of performance and ease of use, towards processing existing PO files rather than creating PO files from scratch, and towards processing existing messages in the PO file rather than shuffling them around.

11.2. Coding Conventions

This section describes the style and conventions that the code which is intended to be included in Pology distribution should adhere to. The general coding style is expected to follow the Python style guide described in PEP 8.

Lines should be up to 80 characters long. Class names should be written in camel case, and all other names in lower case with underscores:

class SomeThingy (object):

    def some_method (self, ...):

        longer_variable = ...

def some_function (...):

Long expressions with operators should be wrapped in parentheses and broken before a binary operator, with the continuation line indented to the level of the other operand:

some_quantity = (  a_number_of_thingies * quantity_of_that_per_unit
                  + the_base_offset)

In particular, long conditions in if and while statements should be written like this:

if (    something and something_else and yet_something
    and somewhere_in_between and who_knows_what_else
):
    # ...

All messages, warnings, and errors should be issued through the report and msgreport modules. There should be no print statements or raw writes to sys.stdout/sys.stderr.

For the code in Pology library, it is always preferable to raise an exception instead of aborting execution. On the other hand, it is fine to add optional parameters by which the client can select if the function should abort rather than raise an exception. All topical problems should raise pology.PologyError or a subclass of it, and built-in exceptions only for simple general problems (e.g. IndexError for indexing past the end of something).
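The convention can be sketched as follows. PologyError here is a stub standing in for the real class from the top pology module, and the subclass and function names are made up for the example:

```python
import sys

class PologyError(Exception):
    """Stub for pology.PologyError, the base of all topical exceptions."""
    pass

class EntrySyntaxError(PologyError):
    """Hypothetical topical subclass for a malformed PO entry."""
    pass

def parse_msgid_line(line, abort=False):
    """Parse a 'msgid ...' line; raise by default, abort on request."""
    if not line.startswith("msgid "):
        if abort:
            # The caller explicitly asked for program abortion.
            sys.stderr.write("error: not a msgid line\n")
            sys.exit(1)
        # Library default: raise a topical exception instead.
        raise EntrySyntaxError("not a msgid line: %r" % line)
    return line[len("msgid "):].strip()
```

Client code that prefers hard failure passes abort=True; everyone else gets a catchable PologyError subclass.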

11.2.1. User-Visible Text and Internationalization

All user-visible text, be it reports, warnings, errors (including exception messages) should be wrapped for internationalization through Gettext. The top pology module provides several wrappers for Gettext functions, which have the following special traits: context is mandatory on every wrapped text, all format directives must be named, and arguments are specified as keyword-value pairs just after the text argument (unless deferred translation is used). Some examples:

# Simple message with context marker.
_("@info",
  "Trying to sync unnamed catalog.")

# Simple message with extended context.
_("@info command description",
  "Keep track of who, when, and how, has translated, modified, "
  "or reviewed messages in a collection of PO files.")

# Another context marker and extended context.
_("@title:column words per message in original",

# Parameter substitution.
_("@info",
  "Review tag '%(tag)s' not defined in '%(file)s'.",
  tag=rev_tag, file=config_path)

# Plural message.
n_("@info",
   "written %(num)d word", "written %(num)d words",
   num=nwords)

# Deferred translation, when arguments are known later.
tmsg = t_("@info:progress",
          "Examining state: %(file)s")
msg = tmsg.with_args(file=some_path).to_string()

Every context starts with the "context marker" in form of @keyword, drawn from a predefined set (see the article on i18n semantics at KDE Techbase); it is most often @info in Pology code. The context marker may be, and should be, followed by a free-form extended context whenever it can help the translator understand how and where the message is used. It is usual to have the context, text and arguments on different lines, though not necessary if they are short enough to fit one line.

Pology defines lightweight XML markup for coloring text in the colors module. In fact, Gettext wrappers do not return ordinary strings, but ColorString objects, and functions from the report and msgreport modules know how to convert them to raw strings for a given output destination (file, terminal, web page...). Therefore you can use colors in any wrapped string:

_("@info",
  "<green>History follows:</green>")

_("@info",
  "<bold>Context:</bold> %(snippet)s",
  snippet=some_text)

Coloring should be used sparingly, only when it will help to cue user's eyes to significant elements of the output.

There are two consequences of having text markup available throughout. The first is that every message must be well-formed XML, which means that it must contain no unbalanced tags, and that literal < characters must be escaped (and then also > for good style):

_("@item automatic name for anonymous input stream",
  "&lt;stream&gt;")

The other consequence is that ColorString instances must be joined and interpolated with dedicated functions; see cjoin() and cinterp() functions in colors module.

Unless the text of the message is specifically intended to be a title or an insert (i.e. @title or @item context markers), it should be a proper sentence, starting with a capital letter and ending with a dot.

11.3. Writing Sieves

Pology sieves are filtering-like processing elements applied by the posieve script to collections of PO files. A sieve can examine as well as modify the PO entries passed through it. Each sieve is written in a separate file. If the sieve file is put into the sieve/ directory of the Pology distribution (or installation), the sieve can be referenced on the posieve command line by shorthand notation; otherwise the path to the sieve file is given. The former is called an internal sieve, and the latter an external sieve, but the sieve file layout and the sieve definition are the same in both cases.

In the following, posieve will be referred to as "the client". This is because tools other than posieve may start to use sieves in the future, and it will also be described what these clients should adhere to when using sieves.

11.3.1. Sieve Layout

The sieve file must define the Sieve class, with some mandatory and some optional interface methods and instance variables. There are no restrictions on what you can put into the sieve file besides this class; only keep in mind that posieve will load the sieve file as a Python module, exactly once during a single run.

Here is a simple sieve (also the complete sieve file) which just counts the number of translated messages:

from pology.report import report

class Sieve (object):

    def __init__ (self, params):

        self.ntranslated = 0

    def process (self, msg, cat):

        if msg.translated:
            self.ntranslated += 1

    def finalize (self):

        report("Total translated: %d" % self.ntranslated)

The constructor takes as argument an object specifying any sieve parameters (more on that soon). The process method gets called for each message in each PO file processed by the client, and must take as parameters the message (instance of Message_base) and the catalog which contains it (Catalog). The client calls the finalize method after no more messages will be fed to the sieve, but this method need not be defined (the client should check whether it exists before placing the call).

Another optional method is process_header, which the client calls on the PO header:

def process_header (self, hdr, cat):
    # ...

hdr is an instance of Header, and cat is the containing catalog. The client will check for the presence of this method, and if it is defined, it will call it prior to any process call on the messages from the given catalog. In other words, the client is not allowed to switch catalogs between two calls to process without calling process_header in between.

There is also the optional process_header_last method, for which everything holds just like for process_header, except that, when present, the client must call it after all consecutive process calls on messages from the same catalog:

def process_header_last (self, hdr, cat):
    # ...

Sieve methods should not abort program execution in case of errors; instead they should raise an exception. In particular, if the process method raises SieveMessageError, it means that the sieve can still process other messages in the same catalog; if it raises SieveCatalogError, then any following messages from the same catalog must be skipped, but other catalogs may be processed. Similarly, if process_header raises SieveCatalogError, other catalogs may still be processed. Any other type of exception tells the client that the sieve should no longer be used.
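A client's dispatch loop honoring this protocol might be sketched like this, with stub exception classes standing in for the real ones from Pology:

```python
class SieveMessageError(Exception):
    """Stub: only the current message must be skipped."""
    pass

class SieveCatalogError(Exception):
    """Stub: the rest of the current catalog must be skipped."""
    pass

def run_sieve(sieve, catalogs):
    """Feed catalogs through one sieve, observing the error protocol.
    Each catalog is an iterable of messages with a .header attribute."""
    for cat in catalogs:
        try:
            if hasattr(sieve, "process_header"):
                sieve.process_header(cat.header, cat)
            for msg in cat:
                try:
                    sieve.process(msg, cat)
                except SieveMessageError:
                    pass            # skip this message, stay in catalog
        except SieveCatalogError:
            continue                # abandon this catalog, try the next
        if hasattr(sieve, "process_header_last"):
            sieve.process_header_last(cat.header, cat)
    if hasattr(sieve, "finalize"):
        sieve.finalize()
```

Any exception other than the two sieve errors propagates out of run_sieve, which is exactly the signal that the sieve should no longer be used.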

The process and process_header methods should either return None or an integer exit code. A return value which is neither None nor 0 indicates that while the evaluation was successful (no exception was raised), the processed entry (message or header) should not be passed further along the sieve chain.

11.3.2. Sieve Parameter Handling

The params parameter of the sieve constructor is an object whose data attributes are the parameters which may influence the sieve operation. The sieve file can define the setup_sieve function, which the client will call with a SubcmdView object as the single argument, to fill in the sieve description and define all mandatory and optional parameters. For example, if the sieve takes an optional parameter named checklevel, which controls the level (an integer) at which to perform some checks, here is what setup_sieve could look like:

def setup_sieve (p):

    p.set_desc("An example sieve.")
    p.add_param("checklevel", int, defval=0,
                desc="Validity checking level.")

class Sieve (object):

    def __init__ (self, params):

        if params.checklevel >= 1:
            # ...setup some level 1 validity checks...
        if params.checklevel >= 2:
            # ...setup some level 2 validity checks...


See the add_param method for details on defining sieve parameters.

The client is not obliged to call setup_sieve, but it must make sure that the object it sends to the sieve as params has all the instance variables corresponding to the defined parameters.
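Any object whose attributes match the defined parameters will do. For example, a client could build the params object like this (a sketch, with a made-up parameter and class names):

```python
class Params(object):
    """Minimal params carrier: attributes set from keyword arguments."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


class ExampleSieve(object):
    """Toy sieve reading one parameter from the params object."""

    def __init__(self, params):
        self.checklevel = params.checklevel


sieve = ExampleSieve(Params(checklevel=2))
```

posieve itself constructs the equivalent object from command-line input, after consulting the parameter definitions made in setup_sieve.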

11.3.3. Catalog Regime Indicators

There are two boolean instance variables that the sieve may define, and which the client may check for to decide on the regime in which the catalogs are opened and closed:

class Sieve (object):

    def __init__ (self, params):

        # These are the defaults:
        self.caller_sync = True
        self.caller_monitored = True


The variables are:

  • caller_sync instructs the client whether catalogs processed by the sieve should be synced to disk at the end. If the sieve does not define this variable, the client should assume True and sync catalogs. This variable is typically set to False in sieves which do not modify anything, because syncing catalogs takes time.

  • caller_monitored tells the client whether it should open catalogs in monitored mode. If this variable is not set, the client should assume it to be True. This is another way of reducing processing time for sieves which do not modify PO entries.

Usually a modifying sieve will set neither of these variables, i.e. catalogs will be monitored and synced by default, while a checker sieve will set both to False. For a modifying sieve that unconditionally modifies all entries sent to it, only caller_monitored may be set to False and caller_sync left undefined (i.e. True).

If a sieve requests no monitoring or no syncing, the client is not obliged to satisfy these requests. On the other hand, if a sieve does request monitoring or syncing (either explicitly or by not defining the corresponding variables), the client must provide catalogs in that regime. This is because there may be several sieves operating at the same time (a sieve chain), and monitoring and syncing is usually necessary for proper operation of those sieves that request it.
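Reading the indicators with the required defaults can be as simple as this sketch (function and class names made up for the example):

```python
def catalog_regime(sieve):
    """Return (sync, monitored) as requested by the sieve, with the
    defaults mandated for clients when the variables are undefined."""
    sync = getattr(sieve, "caller_sync", True)
    monitored = getattr(sieve, "caller_monitored", True)
    return sync, monitored


class CheckerSieve(object):
    """A non-modifying sieve: requests neither syncing nor monitoring."""

    def __init__(self):
        self.caller_sync = False
        self.caller_monitored = False


class ModifyingSieve(object):
    """A modifying sieve: sets neither variable, so defaults apply."""
    pass
```

With a sieve chain, a client would combine the requests of all sieves, granting monitoring and syncing if any one sieve requires them.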

11.3.4. Further Notes on Sieves

Since monitored catalogs have modification counters, the sieve may use them within its process* methods to find out if any modification really took place. The proper way to do this is to record the counter at start, and check for increase at end:

def process (self, msg, cat):

    startcount = msg.modcount

    # ...
    # ... do some stuff
    # ...

    if msg.modcount > startcount:
        self.nmodified += 1

The wrong way to do it would be to merely check if msg.modcount > 0, because several modifying sieves may be operating at the same time, each increasing the counters.

If the sieve wants to remove a message from the catalog, it should if at all possible use the catalog's remove_on_sync method instead of remove, deferring actual removal to sync time. This is because remove will likely invalidate the client's iteration over the catalog; if it must be used nevertheless, the sieve documentation should state so clearly. remove also runs in linear time, while remove_on_sync runs in constant time.

If the sieve is to become part of the Pology distribution, it should be properly documented. This means a fully equipped setup_sieve function in the sieve file, and a piece of user manual documentation. The Sieve class itself should in general not be documented; only when process* methods return an exit code should this be stated, in their own comments (and in the user manual).

11.4. Writing Hooks

Hooks are functions with specified sets of input parameters, return values, processing intent, and behavioral constraints. They can be used as modification and validation plugins in many processing contexts in Pology. There are three broad categories of hooks: filtering, validation and side-effect hooks.

Filtering hooks modify some of their inputs. Modifications are done in-place whenever the input is mutable (like a PO message), otherwise the modified input is provided in a return value (like a PO message text field).

Validation hooks perform certain checks on their inputs, and return a list of annotated spans or annotated parts, which record all the encountered errors:

  • Annotated spans are reported when the object of validation is a piece of text. Each span is a tuple of start and end index of the problematic segment in the text, and a note which explains the problem. The return value of a text-validation hook will thus be a list:

    [(start1, end1, "note1"), (start2, end2, "note2"), ...]

    The note can also be None, if there is nothing to say about the problem.

  • Annotated parts are reported for an object which has more than one distinct piece of text, such as a PO message. Each annotated part is a tuple stating the name of the problematic part of the object (e.g. "msgid", "msgstr"), the item index for array-like parts (e.g. for msgstr), and the list of problems in appropriate form (for a PO message this is a list of annotated spans). The return value of a PO message-validation hook will look like this:

    [("part1", item1, [(start11, end11, "note11"), ...]),
     ("part2", item2, [(start21, end21, "note21"), ...]),
     ...]

Side-effect hooks neither modify their inputs nor report validation information, but can be used for any purpose independent of the processing chain into which the hook is inserted. For example, a validation hook can also be implemented this way, when it is enough that it reports problems to standard output, or when the hook client does not know how to use structured validation data (annotated spans or parts). The return value of a side-effect hook is the number of errors encountered internally by the hook (an integer). Clients may use this number to decide upon further behavior. For example, if a side-effect hook modified a temporary copy of a file, the client may decide to abandon the result and use the original file if there were some errors.
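As an illustration of the annotated-spans form, here is a sketch of a text-validation hook; the doubled-space check itself is invented for the example:

```python
def check_double_space (text):
    """Report spans of doubled spaces in the text."""

    spans = []
    pos = 0
    while True:
        pos = text.find("  ", pos)
        if pos < 0:
            break
        # Extend the span over the whole run of spaces.
        end = pos
        while end < len(text) and text[end] == " ":
            end += 1
        spans.append((pos, end, "doubled space"))
        pos = end
    return spans
```

On a problem-free text this hook returns an empty list, which is how clients recognize that validation passed.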

11.4.1. Hook Taxonomy

In this section a number of hook types are described and assigned a formal type keyword, so that they can be conveniently referred to elsewhere in Pology documentation.

Each type keyword has the form <letter1><number><letter2>, e.g. F1A. The first letter represents the hook category: F for filtering hooks, V for validation hooks, and S for side-effect hooks. The number enumerates the input signature by parameter types, and the final letter denotes the difference in semantics of input parameters for equal input signatures. As a handy mnemonic, each type is also given an informal signature in the form of (param1, param2, ...) -> result; in them, spans stand for annotated spans, parts for annotated parts, and numerr for number of errors.

Hooks on pure text:

  • F1A ((text) -> text): filters the text

  • V1A ((text) -> spans): validates the text

  • S1A ((text) -> numerr): side-effects on text

Hooks on text fields in a PO message in a catalog:

  • F3A ((text, msg, cat) -> text): filters any text field

  • V3A ((text, msg, cat) -> spans): validates any text field

  • S3A ((text, msg, cat) -> numerr): side-effects on any text field

  • F3B ((msgid, msg, cat) -> msgid): filters an original text field; original fields are either msgid or msgid_plural

  • V3B ((msgid, msg, cat) -> spans): validates an original text field

  • S3B ((msgid, msg, cat) -> numerr): side-effects on an original text field

  • F3C ((msgstr, msg, cat) -> msgstr): filters a translation text field; translation fields are the msgstr array

  • V3C ((msgstr, msg, cat) -> spans): validates a translation text field

  • S3C ((msgstr, msg, cat) -> numerr): side-effects on a translation text field

The *3B and *3C hook series are introduced alongside *3A for cases when it makes sense for the text field to be only one of the original or translation fields, respectively. For example, to process the translation it is sometimes necessary to consult the original (obtained through the msg parameter). If a *3B or *3C hook is applied to an inappropriate text field, the results are undefined.
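For illustration, a hypothetical *3C-type filtering hook could consult the original text like this (the accelerator-marker logic is invented for the example; msg is only assumed to provide the msgid attribute, as Pology message objects do):

```python
def remove_stray_accel (msgstr, msg, cat):
    """If the original has no accelerator marker, strip stray '&'
    characters from the translation (invented example check)."""

    if "&" not in msg.msgid:
        # Original has no accelerator, so the translation
        # should not have one either.
        return msgstr.replace("&", "")
    return msgstr
```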

Hooks on PO entries in a catalog:

  • F4A ((msg, cat) -> numerr): filters a message, modifying it

  • V4A ((msg, cat) -> parts): validates a message

  • S4A ((msg, cat) -> numerr): side-effects on a message (no modification)

  • F4B ((hdr, cat) -> numerr): filters a header, modifying it

  • V4B ((hdr, cat) -> parts): validates a header

  • S4B ((hdr, cat) -> numerr): side-effects on a header (no modification)

Hooks on PO catalogs:

  • F5A ((cat) -> numerr): filters a catalog, modifying it in any way

  • S5A ((cat) -> numerr): side-effects on a catalog (no modification)

Hooks on file paths:

  • F6A ((filepath) -> numerr): filters a file, modifying it in any way

  • S6A ((filepath) -> numerr): side-effects on a file, no modification

The *2* hook series (with signatures (text, msg) -> ...) has been skipped, because so far no need for it has been observed alongside the *3* hooks.

11.4.2. Hook Factories

Since hooks have fixed input signatures by type, the way to customize the behavior of a given hook is to produce its function by another function. The hook-producing function is called a hook factory. It works by preparing anything needed for the hook, then defining the hook proper and returning it, thereby creating a lexical closure around it:

def hook_factory (param1, param2, ...):

    # Use param1, param2, ... to prepare for hook definition.

    def hook (...):

        # Perhaps use param1, param2, ... in the hook definition too.

    return hook

In fact, most internal Pology hooks are defined by factories.
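For example, a factory producing a text-filtering hook of the F1A type, parametrized by a list of replacement pairs, could look like this (the replace_pairs name is invented for illustration):

```python
def replace_pairs (pairs):
    """Produce a hook which applies given (old, new) replacements
    to the text [hook factory]."""

    # Prepare anything expensive here; in this simple case,
    # just fix the pairs for the closure.
    pairs = list(pairs)

    def hook (text):  # type F1A hook: (text) -> text
        for old, new in pairs:
            text = text.replace(old, new)
        return text

    return hook
```

A client would then construct and apply the hook like any other F1A hook, e.g. hook = replace_pairs([("...", "…")]) followed by hook(sometext).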

11.4.3. Further Notes on Hooks

General hooks should be defined in top level modules, language-dependent hooks in lang.code.module, project-dependent hooks in proj.name.module, and hooks that are both language- and project-dependent in lang.code.proj.name.module. Hooks placed like this can be fetched by getfunc.get_hook_ireq in various non-code contexts, in particular from Pology utilities which allow users to insert hooks into processing through command line options or configurations. If the complete module is dedicated to a single hook, the hook function (or factory) should be named the same as the module, so that users can select it by giving only the hook module name.

Annotated parts for PO messages returned by hooks are a reduced but valid instance of the highlight specifications used by reporting functions, e.g. msgreport.report_msg_content. Annotated parts lack the optional fourth element of a tuple in the highlight specification, which is used to provide the filtered text against which the spans were constructed, instead of the original text. If a validation hook constructs the list of problematic spans against a filtered text, it can apply diff.adapt_spans just before returning, to reconstruct the spans against the original text.

The documentation of a hook function should state the hook type within the short description, in square brackets at its end, as [type ... hook]. Input parameters should be named like in the informal signatures in the taxonomy above, and should not be omitted from @param: Epydoc entries; the return value should be given under @return:, also using one of the listed return names, in order to complete the hook signature.

The documentation of a hook factory should have [hook factory] at the end of the short description. It should normally list all the input parameters, while the return value should be given as @return: type ... hook, and the hook signature as the @rtype: Epydoc field.

11.5. Writing Ascription Selectors

Ascription selectors are functions used by poascribe in the translation review workflow as described in Chapter 6, Ascribing Modifications and Reviews. This section describes how you can write your own ascription selector, which you can then put to use by following the instructions in Section 6.8.1, “Custom Review Selectors”.

In terms of code, an ascription selector is a function factory, which constructs the actual selector function based on the supplied selector arguments. It has the following form:

# Selector factory.
def selector_foo (args):

    # Validate input arguments.
    if (...):
        raise PologyError(...)

    # Prepare selector definition.

    # The selector function itself.
    def selector (msg, cat, ahist, aconf):

        # Prepare selection process.

        # Iterate through ascription history looking for something.
        for i, asc in enumerate(ahist):
            # ... examine asc, possibly breaking out when found ...
            pass

        # Return False or True if a shallow selector,
        # and 0 or 1-based history index if a history selector.
        return ...

    return selector

It is customary to name the selector function selector_something, where something will also be used as the selector name (on the command line, etc.). The input args parameter is always a list of strings. It should first be validated, as far as that is possible without having at hand the particular message, catalog, ascription history, or ascription configuration. Whatever does not depend on any of these can also be precomputed for later use in the selector function.

The selector function takes as arguments the message (an instance of Message_base), the catalog (Catalog) it comes from, the ascription history (list of AscPoint objects), and the ascription configuration (AscConfig). For the most part, AscPoint and AscConfig are simple attribute objects; check their API documentation for the list and description of attributes. Some of the attributes of AscPoint objects that you will usually inspect are .msg (the historical version of the message), .user (the user to whom the ascription was made), or .type (the type of the ascription, one of AscPoint.ATYPE_* constants). The ascription history is sorted from the latest to the earliest ascription. If the .user of the first entry in the history is None, that means that the current version of the message has not been ascribed yet (e.g. if its translation has been modified compared to the latest ascribed version). If you are writing a shallow selector, it should return True to select the message, or False otherwise. In a history selector, the return value should be a 1-based index of an entry in the ascription history which caused the message to be selected, or 0 if the message was not selected.[57]

The entry index returned by history selectors is used to compute embedded difference from a historical to the current version of the message, e.g. on poascribe diff. Note that poascribe will actually take as base for differencing the first non-fuzzy historical message after the indexed one, because it is assumed that already the historical message which triggered the selection contains some changes to be inspected. (When this behavior is not sufficient, poascribe offers the user to specify a second history selector, which directly selects the historical message to base the difference on.)

Most of the time the selector will operate on messages covered by a single ascription configuration, which means that the ascription configuration argument sent to it will always be the same. On the other hand, the resolution of some of the arguments to the selector factory may depend only on the ascription configuration (e.g. a list of users). In this scenario, it would be a waste of performance if such arguments were resolved anew on each call to the selector. You can instead write a small caching (memoizing) resolver function which, when called the second and subsequent times with the same configuration object, returns the previously resolved argument value from the cache. A few such caching resolvers for common arguments are provided in the ascript module, as functions named cached_*() (e.g. cached_users()).
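Tying these pieces together, here is a sketch of a complete (if simplistic) history selector factory, which selects a message if it was ascribed to one of the users given as selector arguments. The selector name and its selection criterion are invented for illustration, only the .user attribute follows the AscPoint description above, and ValueError stands in for PologyError to keep the sketch self-contained:

```python
def selector_byuser (args):
    """Select messages ascribed to one of the given users
    (invented example selector)."""

    # Validate arguments without needing messages or catalogs.
    if not args:
        raise ValueError("At least one user must be given.")
    users = set(args)

    def selector (msg, cat, ahist, aconf):
        # The history runs from the latest to the earliest ascription;
        # a leading entry with .user None means "not yet ascribed".
        for i, asc in enumerate(ahist):
            if asc.user in users:
                return i + 1  # 1-based index of the matched entry
        return 0  # message not selected

    return selector
```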

[51] In Python 2 to be precise, on which Pology is based, while in Python 3 there are only Unicode strings.

[52] The canonical way to check if message is a plural message is msg.msgid_plural is not None.

[53] There is also the report module for reporting general strings. In fact, all code in the Pology distribution is expected to use functions from these modules for writing to output streams; there should not be a print in sight.

[54] This holds only for catalogs created with monitoring, i.e. without the monitored=False constructor argument. For non-monitored catalogs, .sync() will always touch the file and report True.

[55] As opposed to the find-messages sieve.

[56] In fact, .add_last() does a bit more: if both non-obsolete and obsolete messages are added in mixed order, in the catalog they will be separated such that all non-obsolete come before all obsolete, but otherwise maintaining the order of addition.

[57] In this way the history selector can automatically behave as shallow selector as well, because simply testing for falsity on the return value will show whether the message has been selected or not.