UAM Text Tools v0.9.3

Table of Contents


Next: , Up: (dir)

UTT - UAM Text Tools

This manual is for UAM Text Tools (version 0.90, October, 2008)

Copyright © 2005, 2011 Justyna Walkowska, Tomasz Obrębski, Michał Stolarski, and Marcin Walas

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled GNU Free Documentation License.


Next: , Previous: Top, Up: Top

1 General information

UAM Text Tools (UTT) is a package of language processing tools developed at Adam Mickiewicz University. Its functionality includes:

The toolkit is intended for processing raw (unannotated), unrestricted text for any conceivable purpose.

The system is organized as a collection of command-line programs, each performing one operation, e.g. tokenization, lemmatization, or spelling correction. The components are independent of one another; the unifying element is the uniform I/O file format.

The components may be combined in various ways to provide different text processing services. New components supplied by the user may also be easily incorporated into the system, provided that they respect the I/O file format conventions.

UTT component programs do not depend on any specific tagset or morphological description format.

UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use.

List of contributors:


Next: , Previous: General information, Up: Top

2 UTT file format

A UTT file contains annotation of a text. It consists of a sequence of segments. Each segment explicitly refers to a continuous piece of the text and provides some information on it.

2.1 Segment format

A segment occupies one line of a UTT file and consists of space-separated fields:

     
     
[start [length]] type form [annotation1 [annotation2 ...]]
     
     
start
Non-negative integer value indicating the position in the source text where the segment starts.
length
Non-negative integer value indicating the length of the segment.
type
A sequence of characters without spaces or digits (which could lead to the type being misinterpreted as a start or length field). type reflects the main classification of segments - into words, numbers, punctuation marks, meta-text markers. See tok output for a description of automatically recognized type markers.
form
This field contains the textual form of the segment or the special symbol * indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).

The characters or character sequences that have special meaning in the form field are enumerated below.

Characters with special meaning:

Escape sequences:


annotation1
annotation2
...
Annotation fields have the following format:

longname : value

or

shortname value

where longname is a string of alphanumeric characters (isalnum() test), shortname - a single non-alphanumeric character (ispunct() test), and value is an arbitrary string of non-blank characters.
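For instance, the lem and cor fields used throughout this manual follow the long-name format. A hypothetical short-named field (the punctuation character ! below is chosen only for illustration) would prepend its single character directly to the value:

     0014 08 W progrumy cor:programy lem:program,N
     0014 08 W progrumy !programy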

Only two fields are mandatory: type and form. All other fields may be absent. When only one number precedes the type field, it is interpreted as the start position.

If the length field is omitted, the length of the segment is the length of the form field, except when the value of the form field is * – in this case, the length is assumed to be 0.

If the start field is also absent, the segment is assumed to directly follow the preceding one.

Segments of length 0 may be used to mark file positions with some information. See e.g. BOS and EOS (beginning/end of sentence) markers in the example below.

Example:

annotated text: ‘Piszemy dobre progrumy. Warszawiacy też.’

     0000 00 BOS *
     0000 07 W Piszemy lem:pisać,V
     0007 01 S _
     0008 05 W dobre lem:dobry,ADJ
     0013 01 S _
     0014 08 W progrumy cor:programy lem:program,N
     0022 01 P .
     0023 00 EOS *
     0023 01 S _
     0024 00 BOS *
     0024 11 W Warszawiacy lem:Warszawiak,N
     0035 01 S _
     0036 03 W też
     0039 01 P .
     0040 00 EOS *
     
     0000 BOS *
     0000 W Piszemy lem:pisać,V
     0007 S _
     0008 W dobre lem:dobry,ADJ
     0013 S _
     0014 W progrumy cor:programy lem:program,N
     0022 P .
     0023 EOS *

Position information may be provided for only some types of segments:

     0000 BOS *
     W Piszemy lem:pisać,V
     S _
     W dobre lem:dobry,ADJ
     S _
     W progrumy cor:programy lem:program,N
     P .
     EOS *
     S _
     0024 BOS *
     W Warszawiacy lem:Warszawiak,N
     S _
     W też
     P .
     EOS *

Position/length information may be provided only when necessary:

     0000 04 N *
     0000 N 12
     P .
     N 5
     S _
     W km

2.2 UTT File

A UTT file consists of a sequence of segments. The same text position may be covered by multiple segments. As a consequence, ambiguous text segmentation and ambiguous annotation may be represented.

There are two structural requirements a valid UTT-formatted file has to meet:

A valid annotation for the text fragment

     12.5 km

may be

     0000 02 N 12
     0000 04 N 12.5
     0002 01 P .
     0003 01 N 5
     0004 01 S _
     0005 02 W km

but not

     0000 02 N 12
     0000 04 N 12.5
     0004 01 S _
     0005 02 W km

because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position n=0001, which is covered by the second segment, and no segment starts at position n+1=0002.

2.3 Flattened UTT file

A UTT file format has two variants: regular and flattened. The regular format was described above. In the flattened format some of the end-of-line characters are replaced with form-feed characters.

The flattened format is used mainly to represent whole sentences as single lines of the file (all intrasentential end-of-line characters are replaced with form-feed characters).

This technical trick makes it possible to perform certain text processing operations on entire sentences with tools such as grep (see the grp component) or sed (see the mar component).

The conversion between the two formats is performed by the tools: fla and unfla.
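For example, a minimal sketch of sentence-level filtering on a flattened file (the file name text.txt and the searched string are illustrative only):

     cat text.txt | tok | sen | fla | grep dobre | unfla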

2.4 Character encoding

The UTT component programs accept only single-byte character encodings, such as ISO-8859, ANSI, or DOS code pages.
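If your text is in a multi-byte encoding such as UTF-8, it can be converted to a single-byte encoding beforehand, e.g. with iconv (a sketch; the file names are illustrative):

     iconv -f UTF-8 -t ISO-8859-2 text.utf8.txt > text.txt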


Next: , Previous: UTT file format, Up: Top

3 Configuration files

Values for all command line options accepted by a component may be set in configuration files. The default locations of the configuration files for a component named program are

     	/usr/local/etc/utt/program.conf

for the system-wide configuration file and

     	~/.utt/program.conf

for the user configuration file.

For each option, the value is set according to the following priority:

Parameter values are specified in the following format:

parametername=value

where parametername is the short or long name of an option accepted by the program, or

parametername

if the option does not need arguments.

You can introduce comments to configuration files using the # sign.

If a program accepts multiple occurrences of an option (e.g. lem's select option), you can specify them on separate lines of the program's configuration file.
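For illustration, a hypothetical ~/.utt/lem.conf might look like this (the option values below are assumptions chosen only as an example):

     # use the system dictionary and annotate only word segments
     dictionary=/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin
     process=W
     one-field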

Tip: If you have two (or more) frequently used sets of options for the same program (e.g. lem with the PMDBF dictionary and lem with a user dictionary), a good solution is to create two soft links to lem, called e.g. lemg and lemu, and to specify their configurations in the files lemg.conf and lemu.conf respectively. A sketch of this setup is shown below.
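Assuming lem is installed in /usr/local/bin (an assumption for illustration), the links could be created as follows:

     ln -s /usr/local/bin/lem ~/bin/lemg
     ln -s /usr/local/bin/lem ~/bin/lemu
     # options for each variant then go into
     #   ~/.utt/lemg.conf  and  ~/.utt/lemu.conf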


Next: , Previous: Configuration files, Up: Top

4 UTT components

UTT components are of three types:

Sources: programs which read non-UTT data (e.g. raw text) and produce output in UTT format

Filters: programs which read and produce UTT-formatted data

Sinks: programs which read UTT data and produce output in another format


Next: , Up: UTT components

4.1 tok - a tokenizer

Authors: Tomasz Obrębski
Component category: source
Input format: raw text file
Output format: UTT regular
Required annotation: -


Next: , Up: tok

4.1.1 Description

tok is a simple program which reads a text file and identifies tokens on the basis of their orthographic form. The type of the token is printed as the type field.


Next: , Previous: tok description, Up: tok

4.1.2 Input

Raw text.


Next: , Previous: tok input, Up: tok

4.1.3 Output

A UTT file with four fields: start, length, type, and form. Five types of tokens are distinguished in the type field:


Next: , Previous: tok output, Up: tok

4.1.4 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.


Previous: tok command line options, Up: tok

4.1.5 Example

Input:

     Piszemy dobre programy.

Output:

     0000 07 W Piszemy
     0007 01 S _
     0008 05 W dobre
     0013 01 S _
     0014 08 W programy
     0022 01 P .
     0023 01 S \n


Next: , Previous: tok, Up: UTT components

4.2 lem - morphological analyzer

Authors: Tomasz Obrębski, Michał Stolarski
Component category: filter
Input format: UTT regular
Output format: UTT regular
Required annotation: tok


Next: , Up: lem

4.2.1 Description

lem performs morphological analysis of a simple orthographic word, returning all its possible morphological annotations, disregarding the context.


Next: , Previous: lem description, Up: lem

4.2.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−input-field=fieldname, −I fieldname
The field containing the input to the program. The default is the form field. The fields position, length, type, and form are referred to as 1, 2, 3, 4, respectively.
−−output-field=fieldname, −O fieldname
The name of the field added by the program. The default is the name of the program.
−−dictionary=filename, −d filename
Dictionary file name.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−select=fieldname, −s fieldname
Select for processing only segments in which the field named fieldname is present. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−unselect=fieldname, −S fieldname
Select for processing only segments in which the field fieldname is absent. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−one-line
This option makes the program print ambiguous annotation on one output line by generating multiple annotation fields. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.
−−one-field, −1
This option makes the program print ambiguous annotation in one annotation field. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.

This option is useful when working with kot or con.


Next: , Previous: lem command line options, Up: lem

4.2.3 Input

lem reads a UTT file and processes the value of the form field (the input field may be changed with the --input-field option).


Next: , Previous: lem input, Up: lem

4.2.4 Output

lem adds a new annotation field, whose default name is lem. In case of ambiguity, either the segment is duplicated (the default), multiple lem fields are added (--one-line), or the ambiguous annotation is produced as the value of a single lem field (--one-field, -1).


Next: , Previous: lem output, Up: lem

4.2.5 Example

Input:

     0000 07 W Piszemy
     0007 01 S _
     0008 05 W dobre
     0013 01 S _
     0014 08 W programy
     0022 01 P .
     0023 01 S \n

Output (default):

     0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
     0007 01 S _
     0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
     0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
     0013 01 S _
     0014 08 W programy lem:program,N/GiNpCa
     0014 08 W programy lem:program,N/GiNpCn
     0014 08 W programy lem:program,N/GiNpCv
     0022 01 P .
     0023 01 S \n

Output (--one-line option):

     0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
     0007 01 S _
     0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
     0013 01 S _
     0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
     0022 01 P .
     0023 01 S \n

Output (--one-field option):

     0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
     0007 01 S _
     0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
     0013 01 S _
     0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
     0022 01 P .
     0023 01 S \n


Next: , Previous: lem example, Up: lem

4.2.6 Dictionaries

lem requires a dictionary. The dictionary may be provided in one of two formats: in text (source) format or in binary (fsa) format.

Text format

Dictionary entries have the following structure:

     <form>;<lemma>,<descr>[;<lemma>,<descr>]

lemma may be given explicitly or in the cut-add format:

     [<cut1><add1>-]<cut2><add2>

meaning: replace the prefix of length <cut1> with the string <add1>, and replace the suffix of length <cut2> with the string <add2>. For example, 3t transforms ‘kocie’ into ‘kot’, and 3-4ały transforms ‘najbielsi’ into ‘biały’.

Each dictionary entry must be written in one line and must not contain blank characters.

Examples:

     kot;0,N/GaNsCn
     kota;1,N/GaNsCg;1,N/GaNsCa
     kotu;1,N/GaNsCd
     kotem;2,N/GaNsCi
     kocie;3t,N/GaNsCl;3t,N/GaNsCv
     najbielsi;3-4ały,ADJ/DsNpCnGp
     najbielsze;3-5ały,ADJ/DsNpCnGaifn
     najlepsi;dobry,ADJ/DsNpCnGp
     najlepsze;dobry,ADJ/DsNpCnGaifn

The mandatory file name extension for a text dictionary is dic. For large dictionaries it is preferable, however, to compile them into binary (fsa) format.

Binary format

The mandatory file name extension for a binary dictionary is bin. To compile a text dictionary into binary format, write:

     compdic <dictionaryname>.dic <dictionaryname>.bin

Polex/PMDBF dictionary

A large-coverage morphological dictionary of Polish, Polex/PMDBF, is included in the distribution as the default dictionary for lem. By default it is located in:

$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin

for a local installation, or in

/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin

for a system-wide installation.


Previous: lem dictionaries, Up: lem

4.2.7 Hints

Combining data from multiple dictionaries
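A simple way to combine dictionaries (taken from the usage examples later in this manual) is to run lem twice, letting the second pass with a user dictionary process only the segments that the first pass left without a lem annotation:

     cat text | tok | lem | lem -S lem -d user-dictionary.dic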


Next: , Previous: lem, Up: UTT components

4.3 gue - morphological guesser

Authors: Michał Stolarski, Tomasz Obrębski
Component category: filter


Next: , Up: gue

4.3.1 Description

gue guesses morphological descriptions for the form contained in the form field.


Next: , Previous: gue description, Up: gue

4.3.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−input-field=fieldname, −I fieldname
The field containing the input to the program. The default is the form field. The fields position, length, type, and form are referred to as 1, 2, 3, 4, respectively.
−−output-field=fieldname, −O fieldname
The name of the field added by the program. The default is the name of the program.
−−dictionary=filename, −d filename
Dictionary file name.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−select=fieldname, −s fieldname
Select for processing only segments in which the field named fieldname is present. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−unselect=fieldname, −S fieldname
Select for processing only segments in which the field fieldname is absent. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−one-line
This option makes the program print ambiguous annotation on one output line by generating multiple annotation fields. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.
−−one-field, −1
This option makes the program print ambiguous annotation in one annotation field. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.

This option is useful when working with kot or con.

−−delta=n
Stop displaying answers after a drop in weight, that is, when the weight difference between two subsequent results is greater than the delta value (default=`0.2').
−−cut-off=n
Do not display answers with a weight lower than the cut-off value (default=`200').
−−guess_count=n, −n n
Guess up to n descriptions (default=`0', which means 'display all results').


Next: , Previous: gue command line options, Up: gue

4.3.3 Example

     command: gue -n 2
     
     input:
     0000 07 W smerfny
     
     output:
     0000 07 W smerfny gue:,ADJ/CaDpGiNs
     0000 07 W smerfny gue:,ADJ/CnvDpGaipNs


Previous: gue example, Up: gue

4.3.4 Dictionaries

gue requires a dictionary. For now, the dictionary must be provided in binary (fsa) format. The fsa format is created by compiling text-format dictionaries.

Text format

Dictionary entries have the following structure:

     prefix*suffix;lemma,description:weight

lemma must be given in the cut-add format:

     [<cut1><add1>-]<cut2><add2>

(no spaces in between): replace the prefix of length cut1 with the string add1, and replace the suffix of length cut2 with the string add2.

Example: 3-4ały transforms najbielsi into biały.

description contains the part of speech and morphosyntactic information (See PMDBF dictionary.).

weight is an integer value between 1 and 999 indicating the likelihood of the guess.
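A hypothetical entry (the suffix, description, and weight below are purely illustrative) could look like this, meaning that a word ending in -fny may be an adjective whose lemma equals the word form (the cut-add code 0 cuts nothing and adds nothing):

     *fny;0,ADJ/CaDpGiNs:300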


Next: , Previous: gue, Up: UTT components

4.4 cor - spelling corrector

Authors: Tomasz Obrębski, Michał Stolarski
Component category: filter
Input format: UTT regular
Output format: UTT regular
Required annotation: tok


Next: , Up: cor

4.4.1 Description

The spelling corrector applies Kemal Oflazer's dynamic programming algorithm (Oflazer 1996) to the FSA representation of the set of word forms of the Polex/PMDBF dictionary. Given an incorrect word form, it returns all word forms present in the dictionary whose edit distance is smaller than the threshold given as a parameter.


Next: , Previous: cor description, Up: cor

4.4.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−input-field=fieldname, −I fieldname
The field containing the input to the program. The default is the form field. The fields position, length, type, and form are referred to as 1, 2, 3, 4, respectively.
−−output-field=fieldname, −O fieldname
The name of the field added by the program. The default is the name of the program.
−−dictionary=filename, −d filename
Dictionary file name.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−select=fieldname, −s fieldname
Select for processing only segments in which the field named fieldname is present. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−unselect=fieldname, −S fieldname
Select for processing only segments in which the field fieldname is absent. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−one-line
This option makes the program print ambiguous annotation on one output line by generating multiple annotation fields. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.
−−one-field, −1
This option makes the program print ambiguous annotation in one annotation field. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.

This option is useful when working with kot or con.

−−distance=int, −n int
Maximum edit distance (default='1').
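For example, a sketch of a correction pass (based on the usage examples later in this manual; the edit distance 2 is chosen only for illustration) that corrects word segments left unannotated by lem and then lemmatizes the proposed corrections:

     cat text | tok | lem | cor -p W -S lem -n 2 | lem -I cor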


Previous: cor command line options, Up: cor

4.4.3 Dictionaries

cor requires a dictionary. The dictionary has to be provided in binary (fsa) format. The fsa format is created by compiling text-format dictionaries.

Text format

The cor dictionary is a list of words:

     odlot
     odlotowy
     odludek

Binary format

The mandatory file name extension for a binary dictionary is bin. To compile a text dictionary into binary format, write:

     compdic <dictionaryname>.dic <dictionaryname>.bin


Next: , Previous: cor, Up: UTT components

4.5 kor - configurable spelling corrector

Authors: Paweł Werenski, Tomasz Obrębski, Michał Stolarski
Component category: filter
Input format: UTT regular
Output format: UTT regular
Required annotation: tok


Next: , Up: kor

4.5.1 Description

The spelling corrector applies Paweł Werenski's dynamic programming algorithm to the FSA representation of the set of word forms of the Polex/PMDBF dictionary. The algorithm is an extension of K. Oflazer's algorithm used by cor. In the extended version it is possible to assign weights to individual edit operations.

Given an incorrect word form, it returns all word forms present in the dictionary whose edit distance is smaller than the threshold given as a parameter.


Next: , Previous: kor description, Up: kor

4.5.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−input-field=fieldname, −I fieldname
The field containing the input to the program. The default is the form field. The fields position, length, type, and form are referred to as 1, 2, 3, 4, respectively.
−−output-field=fieldname, −O fieldname
The name of the field added by the program. The default is the name of the program.
−−dictionary=filename, −d filename
Dictionary file name.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−select=fieldname, −s fieldname
Select for processing only segments in which the field named fieldname is present. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−unselect=fieldname, −S fieldname
Select for processing only segments in which the field fieldname is absent. Multiple occurrences of this option are allowed and are interpreted as a conjunction of conditions. If this option is absent, all segments are processed.
−−one-line
This option makes the program print ambiguous annotation on one output line by generating multiple annotation fields. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.
−−one-field, −1
This option makes the program print ambiguous annotation in one annotation field. By default, when ambiguous annotation is produced for a segment, the segment is duplicated and each of the annotations is added to a separate copy of the segment.

This option is useful when working with kot or con.

−−distance=int, −n int
Maximum edit distance (default='1').
−−weights=filename, −w filename
Edit operations' weights file.


Next: , Previous: kor command line options, Up: kor

4.5.3 Weights definition file

Example:

     
     %stdcor 1
     %xchg   1
     ż  rz 0.5
     ch h  0.5
     u  ó  0.5
     

The default weight is set to 1 (%stdcor 1), the weight of the exchange operation is set to 1 (%xchg 1), and the three principal orthographic errors are assigned the weight 0.5.

The edit operation weight declaration, such as

     ż  rz 0.5

works both ways, i.e. ż->rz and rz->ż.

The default weights definition file for kor is:

     $HOME/.local/share/utt/weights.kor

or, if the above mentioned file is absent:

     /usr/local/share/utt/weights.kor
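A sketch of invoking kor with a custom weights file (the file name my-weights.kor is an assumption for illustration):

     cat text | tok | lem | kor -p W -S lem -w my-weights.kor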


Previous: kor weights definition file, Up: kor

4.5.4 Dictionaries

see cor


Next: , Previous: kor, Up: UTT components

4.6 sen - a sentensizer

Authors: Tomasz Obrębski
Component category: filter
Input format: UTT regular
Output format: UTT regular
Required annotation: tok


Next: , Up: sen

4.6.1 Description

sen detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the type field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.


Previous: sen description, Up: sen

4.6.2 Example

     command: sen
     
     input:
     0000 05 W Cześć
     0005 01 P !
     0006 01 S _
     0007 02 W To
     0009 01 S _
     0010 02 W ja
     0012 01 P .
     0013 01 S \n
     
     output:
     0000 00 BOS *
     0000 05 W Cześć
     0005 01 P !
     0006 00 EOS *
     0006 00 BOS *
     0006 01 S _
     0007 02 W To
     0009 01 S _
     0010 02 W ja
     0012 01 P .
     0013 01 S \n
     0014 00 EOS *


Next: , Previous: sen, Up: UTT components

4.7 ser - pattern search tool

Authors: Tomasz Obrębski
Component category: filter
Input format: UTT regular
Output format: UTT regular
Required annotation: tok, lem --one-field


Next: , Up: ser

4.7.1 Description

ser looks for patterns in UTT-formatted texts.


Next: , Previous: ser description, Up: ser

4.7.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−pattern=pattern, −e pattern
The search pattern.
−−morph=field
The name of the annotation field containing the morphological description (default lem).
−−flex
Only print the generated flex source code.
−−macro=filename
Read macro definitions from the file filename rather than from the default location. This option makes it possible to redefine the set of terms.
−−define=filename
Append macro definitions from the file filename. This option makes it possible to extend the set of terms.


Next: , Previous: ser command line options, Up: ser

4.7.3 Pattern

The ser pattern is a regular expression over terms corresponding to text segments or segment sequences. Predefined terms are:

seg(t,f,a)
a segment of type t, containing form f and annotation a
form(f)
a segment containing form f
field(f)
a segment containing annotation field f
space(f)
a space segment of form f
word(f)
a word segment of form f
punct(f)
a punct segment of form f
number(f)
a number segment of form f
lexeme(f)
a word segment with lemma f
cat(c)
a word segment of category c

All arguments are optional. If an argument is omitted, an arbitrary string of non-blank characters is assumed as the argument value. Term arguments may be arbitrary character-level regular expressions. The following special symbols can be used:

[...] a character class
[^...] a negated character class
| alternative
* repetition, including zero times
+ repetition, at least one time
? optionality
{m,n} repetition from m to n times
{m,} repetition m or more times
{m} repetition m times
\ddd the character with octal value ddd
\xhh the character with hexadecimal value hh
( ) parentheses, used to override precedence


. a non-blank character
\w a letter
\W a non-blank character other than a letter
\d a digit
\D a non-blank character other than a digit
\s a space or tab character
\S a non-blank character (the same as .)
\l a lowercase letter
\L an uppercase letter

The following characters:

       [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ 

must be escaped with a backslash, i.e. written as:

      \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ 
Note: The special symbols are borrowed from Perl, with minor modifications introduced for convenience. The meaning of certain special characters/sequences differs slightly from their usual interpretation. In particular, the meaning of the . special character is modified because of the special function of spaces in UTT files (they are field separators). Use \s to match a space explicitly.

In the argument of the cat term a special operator <...> may be used. A category specification enclosed in angle brackets matches all category descriptions which are consistent (non-contradictory) with the specification. For example <N> matches all noun descriptions, and <ADJ/Can> matches all adjectives in the accusative or nominative case.


Examples of one-segment patterns:

seg any segment
word any word-form
word(pomocy) the word-form ‘pomocy’
word(naj.+) a word-form beginning with ‘naj’
word(\L\l+) a capitalized word-form
punct a punctuation character
space(.*\\n.*) a space segment containing a newline character
lexeme(pomoc) any form of the lexeme 'pomoc'
cat(N/.*) a word whose category starts with N/
cat(<N/Ca>) a word whose category matches N/Ca


Examples of multi-segment patterns:

(word(\L) punct(\.) space?)+ word(\L\l+)
a sequence of initials followed by a surname
punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an occurrence of a relative pronoun
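For instance, a complete pipeline using a one-segment pattern might look as follows (a sketch; ser requires lem output in the --one-field form):

     cat text.txt | tok | lem -1 | ser -e 'lexeme(pomoc)'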


Next: , Previous: ser pattern, Up: ser

4.7.4 How ser works


Next: , Previous: ser how ser works, Up: ser

4.7.5 Customization

New terms may be added as m4 macro definitions and loaded with the --macro or --define options described above. For example, a macro defining a hypothetical verbseq term might look like this:

     define(`verbseq', `(cat(<V>) (space cat(<V>)))')

the term cat() may not be used as a ... of


Next: , Previous: ser customization, Up: ser

4.7.6 Limitations

Do not use more than 3 attributes in <>.


Previous: ser limitations, Up: ser

4.7.7 Requirements

In order to run ser, the following programs must be installed in the system:


Next: , Previous: ser, Up: UTT components

4.8 grp - pattern search tool

Authors: Tomasz Obrębski
Component category: filter
Input format: UTT flattened
Output format: UTT flattened
Required annotation: tok, sen, lem --one-field


Next: , Up: grp

4.8.1 Description

grp selects sentences containing an expression matching a pattern. The pattern format is exactly the same as that accepted by ser.

grp is intended mainly for speeding up the corpus search process. It is extremely fast (processing speed is usually higher than the speed of reading the corpus file from disk).


Next: , Previous: grp description, Up: grp

4.8.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−process=type, −p type
Process segments with the specified value in the type field. Multiple occurrences of this option are allowed and are interpreted as a disjunction. If this option is absent, all segments are processed.
−−interactive, −i
This option toggles interactive mode, which is by default off. In the interactive mode the program does not buffer the output.
−−pattern=pattern, −e pattern
The search pattern.
−−morph=field
The name of the annotation field containing the morphological description (default lem).
−−command
Only print the generated flex source code.
−−macro=filename
Read macro definitions from the file filename rather than from the default location. This option makes it possible to redefine the set of terms.
−−define=filename
Append macro definitions from the file filename. This option makes it possible to extend the set of terms.


Next: , Previous: grp command line options, Up: grp

4.8.3 Pattern

(see ser)


Previous: grp pattern, Up: grp

4.8.4 Hints

The corpus search speed may be increased by combining grp with the lzop compression tool (grp usually processes data faster than it can be read from disk, especially on slow laptop drives).

     cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
     lzop -cd corpus.grp.lzo | grp -e EXPR | unfla | ser -e EXPR


Next: , Previous: grp, Up: UTT components

4.9 mar

Authors: Marcin Walas, Tomasz Obrębski
Input format: UTT flattened
Output format: UTT flattened
Required annotation: tok, sen, lem -1

4.9.1 Description

mar is a Perl script which matches a given pattern against UTT-formatted text and tags the matching parts with any number of user-defined tags.

4.9.2 Command line options

−−help, −h
Print help.
−−version, −V
Print version information.
−−pattern=pattern, −e pattern
The search pattern.
−−action=action, −a action [p] [s] [P]
Perform only the indicated actions, where:

p preprocess
s search
P postprocess
default: psP

−−command
Print the generated sed command, then exit.

4.9.3 Tokens in pattern

The mar pattern is based on ser patterns (see ser pattern). A mar pattern is a ser pattern in which you can add any number of matching tags, which will be printed in exactly the place where they were placed in the pattern. A valid token starts with @ followed by any number of alphanumeric characters. For example, valid match tokens are: @STARTMATCH, @ENDMATCH.

Matching tokens can be placed between, before, or after any of the ser pattern terms. They do not have to be paired. There can be any number of them in the pattern (zero or more). They do not have to be unique, and they can be placed one after another. For example:

@BOM lexeme(pomoc) place tag BOM before any form of the lexeme 'pomoc'
@MATCH lexeme(pomoc) @MATCH place tag MATCH before and after any form of the lexeme 'pomoc'
cat(<ADJ>) @MATCH lexeme(pomoc) @MATCH place tag MATCH before and after any form of the lexeme 'pomoc' which is followed by an adjective
cat(<ADJ>) @TAG @BOM lexeme(pomoc) @EOM place tags TAG and BOM before any form of the lexeme 'pomoc' which is followed by an adjective, and tag EOM after it

(see mar's help 'mar -h' for some more information)
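One of the usage examples later in this manual marks all forms of 'który' occurring after a comma with the tags RELB and RELE:

     cat text | tok | lem -1 | mar -e 'punct(,) space? @RELB lexeme(który) @RELE'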

4.9.4 How mar works

mar translates the given ser pattern into a regular expression using the m4 macro processor. The expression is then turned into a sed command script, which is executed.

You can see translated sed script by using the −−command option.

4.9.5 Limitations

The complexity of the computations performed by mar increases linearly with the number of placed tokens, so it is highly recommended not to place too many tokens.

4.9.6 Requirements

In order to run mar, the following programs must be installed in the system:


Next: , Previous: mar, Up: UTT components

4.10 kot - untokenizer

Authors: Tomasz Obrębski
Component category: filter
Input format: UTT regular
Output format: text
Required annotation: tok


Next: , Up: kot

4.10.1 Description

kot transforms a UTT formatted file back into raw text format.


Next: , Previous: kot description, Up: kot

4.10.2 Command line options

−−help, −h
Print help.

−−gap-fill=string, −g string
Print string between non-adjacent segments of the input file.
−−spaces, −r
Retain the special characters _, \t, \n, \r, \f unexpanded in the output.


Previous: kot command line options, Up: kot

4.10.3 Usage examples

     cat legia.txt | tok | kot
     cat legia.txt | tok | lem -1 | kot
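The --gap-fill option is useful when the input contains only selected segments, e.g. fragments extracted by ser; a sketch (the gap string and pattern below are illustrative only):

     cat legia.txt | tok | lem -1 | ser -e 'lexeme(pomoc)' | kot -g ' <gap> '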


Previous: kot, Up: UTT components

4.11 con - concordance table generator

Authors: Justyna Walkowska
Component category: sink
Input format: UTT regular
Output format: text
Required annotation: ser or mar


Next: , Up: con

4.11.1 Description

con generates a concordance table based on a pattern given to ser.


Next: , Previous: con description, Up: con

4.11.2 Command line options

−−help, −h
Print help.
−−left −l
Left context info (default='30c'). Example:

     -l=5c: left context is 5 characters
     -l=5w: left context is 5 words
     -l=5s: left context is 5 non-empty input lines
     -l='\s*\S+\sr\S+BOS': left context starts with the given regex

−−right −r
Right context info (default='30c').
−−trim −t
Clear incomplete words from output.
−−white −w
DO NOT change all white characters into spaces.
−−column −c
Left column minimal width in characters (default = 0).
−−ignore −i
Ignore segment inconsistency in the input.
−−bom
Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
−−eom
End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
−−bod
Selected segment beginning display string (default='[').
−−eod
Selected segment end display string (default=']').


Next: , Previous: con command line options, Up: con

4.11.3 Usage example

     cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con


Previous: con usage example, Up: con

4.11.4 Hints

con is a rather slow program. Do not pass large amounts of redundant text through this program. con works fine in the following sequence:

     ... | grp -e EXPR | ser -e EXPR | con


Next: , Previous: UTT components, Up: Top

5 Auxiliary tools


Next: , Up: Auxiliary tools

5.1 compdic - the dictionary compiler

Authors: Michał Stolarski, Tomasz Obrębski
Component category: additional tool

compdic compiles dictionaries in text format (.dic extension) into binary (FSA) format (.bin extension).

The automaton representation of a dictionary is built using automata tools from the OpenFst package.

In order for the compdic program to work, you have to install the above-mentioned package on your system.

Usage:

             compdic <dictionaryname>.dic <dictionaryname>.bin

The file <dictionaryname>.bin will be generated.


Next: , Previous: compdic, Up: Auxiliary tools

5.2 fla - the UTT file flattener

Authors: Tomasz Obrębski
Input format: UTT regular
Output format: UTT flattened
Required annotation: sen


Up: fla

5.2.1 Description

fla “flattens” a UTT file by merging the segments belonging to one sentence into one line. Technically, end-of-line characters ('\n', ASCII code 10) are replaced with form-feed characters ('\f', ASCII code 12). The flattening makes it possible to process UTT files sentence by sentence with tools such as grep or sed (used in grp and mar).

Flattened files should have the suffix .fla, e.g. thetext.utt.fla.

Flattened files are still human-readable.

Usage:

             fla [<bosregex>]

The optional argument is a regular expression describing segments which should be treated as sentence beginnings (the test is: the segment contains a fragment matching <bosregex>). By default, segments containing a BOS field are looked for.
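For example, a flattened version of a file annotated with the standard pipeline can be produced as follows (a sketch based on the usage examples later in this manual; the default BOS-based segmentation is used):

     cat text.txt | tok | sen | lem -1 | fla > text.utt.fla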


Previous: fla, Up: Auxiliary tools

5.3 unfla - the UTT file unflattener

Authors: Tomasz Obrębski
Input format: UTT flattened
Output format: UTT regular
Required annotation: -


Up: unfla

5.3.1 Description

unfla transforms a flattened UTT file, produced by fla, into the regular format by restoring end-of-line characters.


Next: , Previous: Auxiliary tools, Up: Top

6 Usage examples

Simple pipelines
  1. tokenization
              cat text | tok > output
    
  2. morphological annotation (with dictionary)

    simple dictionary based lemmatization

              cat text | tok | lem > output
    
  3. morphological annotation (with dictionary and guessing)
    1. perform dictionary-based lemmatization
    2. guess descriptions for words which have no annotation
              cat text | tok | lem | gue -S lem > output2
    
  4. morphological annotation (complex pipeline)

    1. perform dictionary-based lemmatization
    2. try to correct words with no annotation
    3. perform dictionary-based lemmatization of the corrected words
    4. guess descriptions for words which still have no annotation

              cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
    
  5. morphological annotation (multiple dictionaries) (1)
              cat text | tok | lem | lem -S lem -d user-dictionary.dic
    
  6. morphological annotation (multiple dictionaries) (2)
              cat text | tok | lem -d user-dictionary.bin | lem -S lem
    
  7. spelling correction
              cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
    
  8. expression extraction

    Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.

              cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
    
  9. marking text elements

    Mark all forms of 'który' which occur after a comma with tags (0-length segments) RELB and RELE

    cat text | tok | lem -1 | mar -e 'punct(,) space? @RELB lexeme(który) @RELE'

  10. a word in context

    Extraction of text fragments containing a form of the lexeme 'rozmowa' in the context of 5 preceding and 5 succeeding corpus segments.

              cat text | tok | lem -1 | ser -e 'seg{5} lexeme(rozmowa) seg{5}' -m | kot > output
    
  11. generation of concordance table (1)
              cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
    

    10"

  12. generation of concordance table (2)

    The same as above but much faster

              cat text | tok | lem -1 | fla | \
              grp -e 'cat(<V>) space lexeme(rozmowa)' | \
              ser -e 'cat(<V>) space lexeme(rozmowa)' | \
              con
    

    2"

  13. generation of concordance table (3)

    Usually, one searches the same corpus repeatedly. In such a case it is advisable to first transform the corpus data into the format required by grp, and then use the preprocessed data.

    As grp (grep) processes data faster than it is read from the disk drive, the search time may be further shortened by using file compression. We suggest using the lzop compressor/decompressor.

  14. the fastest way to search a large corpus

    step 1: corpus preprocessing

              cat corpus | tok | sen | lem -1 \
              | fla | lzop -7 > corpus.grp.lzo
    

    step 2: search

              lzop -cd corpus.grp.lzo | grp -a gP -e 'cat(<V>) space
              lexeme(rozmowa)' | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
    


Next: , Previous: Usage examples, Up: Top

7 PMDBF dictionary

UTT components come with lexical data derived from the Polish Morphological Database (PMDB).


Next: , Up: PMDBF dictionary

7.1 Files


Next: , Previous: PMDBF files, Up: PMDBF dictionary

7.2 Tag structure

     pos   = [[:upper:]]+
     attr  = [[:upper:]]+
     val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
     descr = pos ( / ( attr val+ )+ )?
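For example, the description N/GaNsCn used in the lem dictionary examples decomposes as follows:

     N/GaNsCn  =  pos N (noun) + Ga (Gender: masculine-animal)
                  + Ns (Number: singular) + Cn (Case: nominative)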


Next: , Previous: PMDBF tag structure, Up: PMDBF dictionary

7.3 Parts of speech

N noun
NPRO nominal-pronoun
NV deverbal-noun
V verb
BYC byc
VNI non-inflected-verb
ADJ adjective
ADJPAP adjectival-passive-participle
ADJPRP adjectival-present-participle
ADJPP adjectival-past-participle
ADJPRO adjectival-pronoun
ADJNUM adjectival-numeral
ADV adverb
ADVANP adverbial-anterior-participle
ADVPRP adverbial-present-participle
ADVPRO adverbial-pronoun
ADVNUM adverbial-numeral
P preposition
PPRO prep-noun-pronoun
CONJ conjunction
EXCL exclamation
APP call
ONO onomatopoeia
PART particle
NUMCRD cardinal-numeral
NUMCOL collective-numeral
NUMPAR partitive-numeral
NUMORD ordinal-numeral


Previous: PMDBF parts of speech, Up: PMDBF dictionary

7.4 Morphosyntactic attributes

A Aspect
p perfect
i imperfect.

V Verb-Form
b infinitive,
p personal,
i impersonal.

M Mood
d declarative,
c conditional,
i imperative.

T Tense
a past,
r present,
f future.

P Person
1 1,
2 2,
3 3.

D Degree
p positive,
c comparative,
s superlative.

N Number
s singular,
p plural.

C Case
n nominative,
g genitive,
d dative,
a accusative,
i instrumental,
l locative,
v vocative.

G Gender
p masculine-personal,
a masculine-animal,
i masculine-inanimate,
f feminine,
n neuter.


Next: , Previous: PMDBF dictionary, Up: Top

8 GNU Free Documentation License

Version 1.2, November 2002
     Copyright © 2000,2001,2002 Free Software Foundation, Inc.
     51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA
     
     Everyone is permitted to copy and distribute verbatim copies
     of this license document, but changing it is not allowed.
  1. PREAMBLE

    The purpose of this License is to make a manual, textbook, or other functional and useful document free in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.

    This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.

    We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.

  2. APPLICABILITY AND DEFINITIONS

    This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.

    A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.

    A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.

    The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.

    The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.

    A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.

    Examples of suitable formats for Transparent copies include plain ascii without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.

    The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.

    A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition.

    The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.

  3. VERBATIM COPYING

    You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.

    You may also lend copies, under the same conditions stated above, and you may publicly display copies.

  4. COPYING IN QUANTITY

    If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.

    If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.

    If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.

    It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.

  5. MODIFICATIONS

    You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:

    1. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.
    2. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.
    3. State on the Title page the name of the publisher of the Modified Version, as the publisher.
    4. Preserve all the copyright notices of the Document.
    5. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
    6. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.
    7. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.
    8. Include an unaltered copy of this License.
    9. Preserve the section Entitled “History”, Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled “History” in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.
    10. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the “History” section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.
    11. For any section Entitled “Acknowledgements” or “Dedications”, Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
    12. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.
    13. Delete any section Entitled “Endorsements”. Such a section may not be included in the Modified Version.
    14. Do not retitle any existing section to be Entitled “Endorsements” or to conflict in title with any Invariant Section.
    15. Preserve any Warranty Disclaimers.

    If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.

    You may add a section Entitled “Endorsements”, provided it contains nothing but endorsements of your Modified Version by various parties—for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.

    You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.

    The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.

  5. COMBINING DOCUMENTS

    You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.

    The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.

    In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements.”

  6. COLLECTIONS OF DOCUMENTS

    You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.

    You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.

  7. AGGREGATION WITH INDEPENDENT WORKS

    A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.

    If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.

  8. TRANSLATION

    Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.

    If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.

  9. TERMINATION

    You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

  10. FUTURE REVISIONS OF THIS LICENSE

    The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.

    Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.

ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:

       Copyright (C)  year  your name.
       Permission is granted to copy, distribute and/or modify this document
       under the terms of the GNU Free Documentation License, Version 1.2
       or any later version published by the Free Software Foundation;
       with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
       Texts.  A copy of the license is included in the section entitled ``GNU
       Free Documentation License''.

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:

         with the Invariant Sections being list their titles, with
         the Front-Cover Texts being list, and with the Back-Cover Texts
         being list.

If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.

If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.


Next: , Previous: GNU Free Documentation License, Up: Top

9 Reporting bugs

Report bugs to <obrebski@amu.edu.pl>.


Previous: Reporting bugs, Up: Top

10 Author