Module normalize

Various normalizations for strings and PO elements.


Author: Chusslove Illich (Часлав Илић) <caslav.ilic@gmx.net>

License: GPLv3

Functions
string
simplify(s)
Simplify ASCII whitespace in the string.
string
usimplify(s)
Simplify whitespace in the string.
string
shrink(s)
Remove all whitespace from the string.
string
tighten(s)
Remove all whitespace and lowercase the string.
string
identify(s)
Construct a uniform-case ASCII-identifier out of the string.
string
xentitize(s)
Replace characters having default XML entities with the entities.
string
noinvisible(s)
Remove all invisible characters from the string.
(cat, msg) -> numerr
demangle_srcrefs(collsrcs=None, collsrcmap=None, truesrcheads=None, compexts=None)
Resolve source references in message created by intermediate extraction [hook factory].
 
uniq_source(msg, cat)
Make message source references unique [type F4A hook].
(cat, msg) -> numerr
uniq_auto_comment(onlyheads=None)
Remove non-unique automatic comment lines in message [hook factory].
int
canonical_header(hdr, cat)
Check and rearrange content of a PO header into canonical form [type F4B hook].
Variables
  __package__ = 'pology'
Function Details

simplify(s)

 

Simplify ASCII whitespace in the string.

All leading and trailing ASCII whitespace is removed, and every inner sequence of ASCII whitespace is replaced with a single space.

Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
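
The behaviour described above can be illustrated with a minimal sketch. This is not the module's own code: the name simplify_sketch is hypothetical, and the set of ASCII whitespace characters (space, tab, newline, carriage return, vertical tab, form feed) is an assumption.

```python
import re

# Assumed set of ASCII whitespace characters.
_ASCII_WS = " \t\n\r\x0b\x0c"
_ASCII_WS_RX = re.compile(r"[ \t\n\r\x0b\x0c]+")

def simplify_sketch(s):
    """Hypothetical stand-in for simplify(), for illustration only."""
    # Remove leading and trailing ASCII whitespace, then collapse
    # every inner run of ASCII whitespace into one space.
    return _ASCII_WS_RX.sub(" ", s.strip(_ASCII_WS))

print(simplify_sketch("  foo \t\n bar  "))
```

Under these assumptions a non-ASCII space such as U+00A0 is left untouched, which is the difference from usimplify.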

usimplify(s)

 

Simplify whitespace in the string.

Like simplify, but takes into account all whitespace defined by Unicode.

Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
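
A possible Unicode-aware counterpart, again only a sketch under assumptions: usimplify_sketch is a hypothetical name, and the code relies on Python 3's str.split(), whose notion of whitespace may differ in detail from the one used by the module.

```python
def usimplify_sketch(s):
    """Hypothetical stand-in for usimplify(), for illustration only."""
    # str.split() without arguments splits on runs of Unicode
    # whitespace and drops leading and trailing whitespace.
    return " ".join(s.split())

print(usimplify_sketch(" a\u00a0\u2003b \n"))
```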

shrink(s)

 

Remove all whitespace from the string.

Parameters:
  • s (string) - string to normalize
Returns: string
normalized string

tighten(s)

 

Remove all whitespace and lowercase the string.

Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
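
For comparison, shrink and tighten could be sketched as follows. The names shrink_sketch and tighten_sketch are hypothetical, and treating "all whitespace" as Unicode whitespace is an assumption not confirmed by the text above.

```python
def shrink_sketch(s):
    """Hypothetical stand-in for shrink(): drop all whitespace."""
    # Splitting on whitespace and joining with nothing removes it all.
    return "".join(s.split())

def tighten_sketch(s):
    """Hypothetical stand-in for tighten(): drop whitespace, lowercase."""
    return shrink_sketch(s).lower()

print(shrink_sketch(" Open  File "))
print(tighten_sketch(" Open  File "))
```

Such aggressively normalized forms are typically useful as comparison keys rather than as text shown to users.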

identify(s)

 

Construct a uniform-case ASCII-identifier out of the string.

ASCII-identifier is constructed in the following order:

  • string is decomposed into Unicode NFKD
  • string is lowercased
  • every character that is neither an ASCII alphanumeric nor the underscore is removed
  • if the string starts with a digit, underscore is prepended
Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
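
The four steps listed above can be sketched with the standard library. The name identify_sketch is hypothetical and the code is an illustration of the described order, not the module's implementation.

```python
import re
import unicodedata

def identify_sketch(s):
    """Hypothetical stand-in for identify(), for illustration only."""
    # 1. Decompose into Unicode NFKD (splits accents from base letters).
    s = unicodedata.normalize("NFKD", s)
    # 2. Lowercase.
    s = s.lower()
    # 3. Remove everything except ASCII alphanumerics and underscore.
    s = re.sub(r"[^A-Za-z0-9_]", "", s)
    # 4. Prepend an underscore if the result starts with a digit.
    if s[:1].isdigit():
        s = "_" + s
    return s

print(identify_sketch("Čaj 2 Go!"))
print(identify_sketch("3D View"))
```

Decomposing first is what lets an accented letter survive as its base letter: the combining mark becomes a separate character and is removed in step 3.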

xentitize(s)

 

Replace characters having default XML entities with the entities.

The replacements are:

  • &amp; for ampersand
  • &lt; and &gt; for less-than and greater-than signs
  • &apos; and &quot; for ASCII single and double quotes
Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
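
The replacements above can be sketched as plain string substitutions. The name xentitize_sketch is hypothetical. The sketch assumes that every ampersand is escaped, including one that already begins an entity; whether the module does the same is not stated here.

```python
def xentitize_sketch(s):
    """Hypothetical stand-in for xentitize(), for illustration only."""
    # The ampersand must be replaced first, otherwise the ampersands
    # introduced by the other entities would be escaped again.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    s = s.replace("'", "&apos;")
    s = s.replace('"', "&quot;")
    return s

print(xentitize_sketch('<a href="x">Tom & Jerry\'s</a>'))
```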

noinvisible(s)

 

Remove all invisible characters from the string.

Invisible characters are those which have zero width, i.e. do not have any visual representation in the text (when the text is rendered proportionally). See http://www.unicode.org/faq/unsup_char.html for the list of these characters as defined by Unicode.

Parameters:
  • s (string) - string to normalize
Returns: string
normalized string
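
A minimal sketch of such filtering follows. The name noinvisible_sketch is hypothetical, and the character set is a small illustrative subset chosen as an assumption; the actual list used by the module may be longer.

```python
# Assumed, incomplete set of zero-width characters.
_INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}

def noinvisible_sketch(s):
    """Hypothetical stand-in for noinvisible(), for illustration only."""
    # Keep only the characters that are not in the assumed set.
    return "".join(c for c in s if c not in _INVISIBLE)

print(noinvisible_sketch("zero\u200bwidth\ufeff"))
```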

demangle_srcrefs(collsrcs=None, collsrcmap=None, truesrcheads=None, compexts=None)

 

Resolve source references in message created by intermediate extraction [hook factory].

Sometimes the messages from a source file in a format not known to xgettext(1) are first extracted by a preextraction tool into a format that xgettext does know, and then by xgettext into a PO template. This is intermediate extraction, and the files that xgettext operates on are intermediate files.

When intermediate extraction is performed, the source references in the resulting PO template are "mangled": they point to the intermediate files rather than to the true source files. This hook factory produces a function that resolves intermediate source references into true ones ("demangles" them) where possible.

One mode of intermediate extraction is to extract multiple sources into a collective intermediate file. This file may have a standardized name throughout a collection of catalogs, or its name may be specific to each catalog. For demangling to be possible in this case, the preextraction tool has to provide true source references in the extracted comments (#.) of the messages. When that is the case, parameter collsrcs specifies the sequence of names of generally known intermediate files, parameter collsrcmap specifies those specific to a catalog (as a dictionary of catalog name to sequence of intermediate file names), and parameter truesrcheads specifies the sequence of initial strings in extracted comments which are followed by the true source reference. (If truesrcheads is None or empty, this mode of demangling is disabled.)

For example, collective-intermediate extraction:

   #. file: apples.clt:156
   #: resources.cpp:328
   msgid "Granny Smith"
   msgstr ""

   #. file: peaches.clt:49
   #: resources.cpp:2672
   msgid "Redhaven"
   msgstr ""

is demangled by setting collsrcs=["resources.cpp"] and truesrcheads=["file:"].
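
The idea behind this mode can be sketched on plain data, without the Pology message classes. Everything in the block is hypothetical: the function name, the representation of source references as (path, line) tuples, and the choice to substitute the true references for each collective one. It only illustrates the lookup, not the hook's real behaviour.

```python
def demangle_collective_sketch(auto_comments, sources, collsrcs, truesrcheads):
    """Illustrative only: replace references to collective files."""
    # Collect true references from comments such as "file: apples.clt:156".
    true_refs = []
    for cmnt in auto_comments:
        for head in truesrcheads:
            if cmnt.startswith(head):
                path, _, lno = cmnt[len(head):].strip().rpartition(":")
                if path and lno.isdigit():
                    true_refs.append((path, int(lno)))
                break
    if not true_refs:
        return list(sources)  # nothing to demangle with
    demangled = []
    for path, lno in sources:
        if path in collsrcs:
            # Substitute the true references, avoiding duplicates.
            for ref in true_refs:
                if ref not in demangled:
                    demangled.append(ref)
        else:
            demangled.append((path, lno))
    return demangled

print(demangle_collective_sketch(
    ["file: apples.clt:156"], [("resources.cpp", 328)],
    collsrcs=["resources.cpp"], truesrcheads=["file:"]))
```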

Another mode of intermediate extraction is for each source file to be extracted into a single paired intermediate file, which is named the same as the true source plus an additional extension. In this mode, parameter compexts specifies the list of known composite extensions (including the leading dot), which are demangled by stripping the final extension from the path.

For example, paired-intermediate extraction:

   #: apples.clt.h:156
   msgid "Granny Smith"
   msgstr ""

   #: peaches.clt.h:49
   msgid "Redhaven"
   msgstr ""

is demangled by setting compexts=[".clt.h"].
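
Stripping the final extension can be sketched in a few lines. The function name and the (path, line) tuple representation are hypothetical and serve only to illustrate the rule.

```python
import os.path

def demangle_paired_sketch(sources, compexts):
    """Illustrative only: strip the last extension of composite ones."""
    demangled = []
    for path, lno in sources:
        if any(path.endswith(ext) for ext in compexts):
            # "apples.clt.h" becomes "apples.clt".
            path = os.path.splitext(path)[0]
        demangled.append((path, lno))
    return demangled

print(demangle_paired_sketch(
    [("apples.clt.h", 156), ("main.cpp", 12)], compexts=[".clt.h"]))
```

References whose path does not end in a known composite extension pass through unchanged.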

Parameters:
  • collsrcs (<string*>) - general intermediate file names
  • collsrcmap ({string: <string*>*}) - catalog-specific intermediate file names
  • truesrcheads (<string*>) - prefixes to true file references in comments
  • compexts (<string*>) - composite intermediate file extensions
Returns: (cat, msg) -> numerr
type F4A hook

uniq_source(msg, cat)

 

Make message source references unique [type F4A hook].

Sometimes the source references of a message can be non-unique due to peculiarities of extraction or later processing. This hook makes them unique while preserving their order.
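
Order-preserving removal of duplicates can be sketched as below. The function name and the (path, line) tuples are hypothetical; the hook itself operates on a message object.

```python
def uniq_refs_sketch(refs):
    """Illustrative only: keep the first occurrence of each reference."""
    seen = set()
    unique = []
    for ref in refs:
        if ref not in seen:
            seen.add(ref)
            unique.append(ref)
    return unique

print(uniq_refs_sketch([("a.cpp", 1), ("b.cpp", 2), ("a.cpp", 1)]))
```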

uniq_auto_comment(onlyheads=None)

 

Remove non-unique automatic comment lines in message [hook factory].

Sometimes the message extraction tool adds automatic comments to provide more context for the message (for example, the XML tag path to the current message). If the message is found more than once in the same context, such comment lines get repeated. This hook can be used to make automatic comment lines unique: either all of them, or only those with certain prefixes given by the onlyheads parameter.

Parameters:
  • onlyheads (<string*>) - prefixes of comment lines which should be made unique
Returns: (cat, msg) -> numerr
type F4A hook
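
The effect of onlyheads can be sketched on a plain list of comment lines. The function name is hypothetical and the code is an illustration under the assumption that lines not matching any prefix are left as they are.

```python
def uniq_comments_sketch(lines, onlyheads=None):
    """Illustrative only: drop repeated automatic comment lines."""
    seen = set()
    unique = []
    for line in lines:
        # With onlyheads given, only matching lines are deduplicated.
        eligible = (not onlyheads
                    or any(line.startswith(h) for h in onlyheads))
        if eligible and line in seen:
            continue
        seen.add(line)
        unique.append(line)
    return unique

lines = ["ctx: a/b", "note", "ctx: a/b", "note"]
print(uniq_comments_sketch(lines))
print(uniq_comments_sketch(lines, onlyheads=["ctx:"]))
```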

canonical_header(hdr, cat)

 

Check and rearrange content of a PO header into canonical form [type F4B hook].

Returns: int
number of errors