UAM Text Tools
UAM Text Tools (UTT) is a package of language processing tools
developed at Adam Mickiewicz University. Its functionality includes:
- dictionary-based morphological analysis
- heuristic morphological analysis of unknown words
- spelling correction
- pattern search
- sentence splitting
- generation of concordance tables
The toolkit is intended for processing raw (unannotated),
unrestricted text for any purpose.
The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, spelling
correction. The components are independent of one another; what
unifies them is a common i/o file format.
The components may be combined in various ways to provide various text
processing services. New components supplied by the user can also be
easily incorporated into the system, provided that they respect the i/o
file format conventions.
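The idea of independent components joined only by a shared format can be sketched in a few lines of Python. This is an illustration of the architecture, not UTT code: the tab-separated record layout (start offset, length, type, form, annotations) and the function names tokenize, annotate_lowercase, mark_long_words and pipeline are all assumptions made for the example and are not taken from the UTT distribution.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical shared record format (an assumption, NOT UTT's actual format):
# one segment per line, tab-separated fields:
#   start offset, length, type (W = word, P = punctuation), form, annotations...
Component = Callable[[Iterable[str]], Iterator[str]]


def tokenize(text: str) -> Iterator[str]:
    """Turn raw text into one record per word or punctuation mark."""
    for match in re.finditer(r"\w+|[^\w\s]", text):
        form = match.group()
        kind = "W" if re.match(r"\w", form) else "P"
        yield "\t".join([str(match.start()), str(len(form)), kind, form])


def annotate_lowercase(records: Iterable[str]) -> Iterator[str]:
    """Toy stand-in for a lemmatizer: append a lowercased form to word records."""
    for record in records:
        fields = record.split("\t")
        if fields[2] == "W":
            fields.append("lem:" + fields[3].lower())
        yield "\t".join(fields)


def mark_long_words(records: Iterable[str], min_length: int = 6) -> Iterator[str]:
    """Example of a user-supplied component: it only appends a field."""
    for record in records:
        fields = record.split("\t")
        if fields[2] == "W" and int(fields[1]) >= min_length:
            fields.append("long")
        yield "\t".join(fields)


def pipeline(records: Iterable[str], *components: Component) -> Iterable[str]:
    """Chain components; each one reads and writes the same record format."""
    for component in components:
        records = component(records)
    return records


if __name__ == "__main__":
    for line in pipeline(tokenize("Ala ma kota."), annotate_lowercase, mark_long_words):
        print(line)
```

Because every stage reads and writes the same record layout, stages can be reordered, dropped, or replaced without touching the others. Each stage only appends fields and passes through anything it does not recognize, which is what lets a user-supplied component slot into the chain.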
UTT component programs do not depend on any specific tagset or
morphological description format.
Authors and contact
- Pawel Konieczka
- Tomasz Obrębski, e-mail: firstname.lastname@example.org
- Michał Stolarski
- Marcin Walas
- Justyna Walkowska
- Paweł Wereński
UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The dictionary files accompanying the UTT package are subject to separate licenses.
- UTT User Manual, draft, html
Tomasz Obrębski, Processing text corpora with grep, in: Proceedings
of New Trends in Intelligent Information Processing and Web Mining
- 2006, Springer Verlag, Advances in Soft Computing series, 2006, pp.
Tomasz Obrębski, Michał Stolarski, UAM Text Tools - A Flexible NLP
Proceedings of LREC 2006, pp. 2259-2262
Tomasz Obrębski, Michał Stolarski, UAM Text Tools - A text processing toolkit for
Polish. Proceedings of 2nd Language & Technology Conference,
Poznań, Poland, 2005, pp. 301-304
- binary distribution
- sources: (to appear)
- Polex/PMDBF dictionaries for lem, cor, kor, and gue
- the Creative Commons by-nc-sa License
- GNU General Public License
The dictionary is generated from the morphological dictionary Polex, cf.
Z. Vetulani, B. Walczak, T. Obrębski, G. Vetulani, Unambiguous
coding of the inflection of Polish nouns and its application in
electronic dictionaries - format POLEX, Wydawnictwo Naukowe UAM,