Translations spellchecking
Introduction
The spellchecker is a set of scripts which manipulate po files and
produce statistics usually displayed on a web page. Spotting typos is
the most obvious feature, but translators can use it to improve the
overall quality of their translations.
Translations are often spread over a bunch of po files translated by
different people with different styles even if belonging to the same
language-team; the spellchecker gathers all of its output in a per
language
rather than per po file
basis, so that each translator
using it, will check not only his/her own translations.
Understanding the output
The statistics page (&url-spellchecker-status;) contains the following
columns for every language supported:
"Language" - extended English name of the language and corresponding
ISO code
Unknown words
- is the number of unique words not found in the
dictionary of a particular language. -
, which means that the spellchecker
is not configured to check translations for this language (usually
because an aspell dictionary for the language is not available)
Messages
- contains all the strings for lang extracted from the po
files: you can use any spellchecker against this file to look for
typos. This file is very useful also when improving consistency: when
translations are spread over many po files, it is difficult to make sure
that certain rules are applied to all files, so here you can check for
example that "CD-ROM" is written always the same way.
List of unknown words
- is a sorted list of words not found in the
aspell dictionary along with the number of occurrences.
suspect variables
- po files often contain variables ${LIKE_THIS}; a
script checks that the name of the variable in the msgid is matched in
the msgstr. It is also checked that the number of variables in the
msgstr is the same in the msgid.
custom wordlist
- translators can write a list of custom words not
found in the aspell dictionary for their language. More details about
wordlists can be found here (FIXME).
all files
- is a tarball containing all the above files. Translators
can download it and work off-line.
Sometimes columns are followed by a "(*)": it means that the
associated file has changed since the last time the spellchecker was
run. By clicking on the "(*)" it is possible to see what changed.
Spotting typos
First of all you should look for errors: typically typos or wrong
variables. The list of unknown words contains words not in the aspell
dictionary and out of their context
; once one has been found, you
need to locate the error in the po file.
Since Messages
contains all the (translated) strings for all the po files,
you can look for the wrong word there; the structure of the text file
is the following:
*** path1/lang.po
- "strings for this file"
*** path2/lang.po
- "strings for this file"
...
*** pathN/lang.po
- "strings for this file"
If the word is easy to find you won't need any help from me to locate
it and you just need to see which po file it belongs to. Emacs has a
very handy function "forward-paragraph" (ctrl-up on my system) that
takes you directly to the line containing the path. Otherwise you can
search backward for "*** " to find it.
Things can be more complicated if the unknown word is very short and
so it is matched too many times inside the file because it is a sub-word
of
longer words; in that case you should use regexps to restrict your
search. Here follow a few practical examples that can help you; refer
to some documentation about regexp to find out more.
If you use emacs you can match an exact word by using
isearch-forward-regexp and "\<word\>".
From the shell you can run 'grep -nw "word" lang_all.txt'
(the n
parameter will tell grep to print the line number where the match
occurs).
Sometimes you may have to look for something like "a word composed
by two letters: the first is "p" the second is not a printable char";
in that case "word" in the previous examples can be replaced with
"p."
Suspect variables
Variables in po files are of the form ${VARIABLE}; the spellchecker
is able to catch the following common mistakes:
mispelled variable names
wrong braces
missing $
character
different number of variables in the msgid and the msgstr
The variable name should not be translated in the msgstr and braces
should not be changed. The variable name is replaced with something
more useful at run-time; if the translator manipulates it, the results
will be rather ugly. When the original msgid contains multiple occurrences
of a particular variable, sometimes translators translate the string
as to use the variable just once; the script will report such strings
as wrong and it is up to the translator to check if it is a mistake or
not (this is the main reason for calling them suspect variables
).