Saphre is a collection of suffix array programs for Java that have been adapted for a large alphabet, so that they can be used for natural language, where it is often desirable to treat each word as an alphabet character.
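To make the idea of "each word as an alphabet character" concrete, here is a minimal sketch, not Saphre's own code. It maps every distinct word to an integer id and sorts the suffixes of the resulting integer sequence. The class and method names (WordSuffixArray, encode, build) are invented for illustration, and the naive comparison sort stands in for the faster construction algorithms a real implementation would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the Saphre API.
final class WordSuffixArray {

    /** Maps each distinct word to an integer id, assigned in order of first appearance. */
    static int[] encode(String[] words, Map<String, Integer> vocabulary) {
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = vocabulary.get(words[i]);
            if (id == null) {
                id = vocabulary.size();
                vocabulary.put(words[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    /** Returns the start positions of all suffixes of text, in sorted order (naive sort). */
    static int[] build(int[] text) {
        List<Integer> suffixes = new ArrayList<>();
        for (int i = 0; i < text.length; i++) {
            suffixes.add(i);
        }
        suffixes.sort((a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[text.length];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = suffixes.get(i);
        }
        return sa;
    }

    /** Compares two suffixes word id by word id; a suffix that is a prefix of the other sorts first. */
    static int compareSuffixes(int[] text, int a, int b) {
        while (a < text.length && b < text.length) {
            if (text[a] != text[b]) {
                return Integer.compare(text[a], text[b]);
            }
            a++;
            b++;
        }
        // Whichever suffix ran off the end is the shorter one and sorts first.
        return Integer.compare(b, a);
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new HashMap<>();
        int[] text = encode("to be or not to be".split(" "), vocabulary);
        // Ids follow first appearance: to=0, be=1, or=2, not=3.
        System.out.println(Arrays.toString(build(text))); // prints [4, 0, 5, 1, 2, 3]
    }
}
```

Because the alphabet is the vocabulary rather than a fixed set of characters, ids are compared as integers. In this sketch the sort order therefore follows first appearance, not alphabetical order.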
The program comes with a modular tokenizer and normalizer, and a modular approach to interval tree traversal, allowing phrasal statistics to be calculated based on term frequency, document frequency (or document distribution) and other measures.
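As a rough sketch of how term frequency and document frequency fall out of a suffix array: all occurrences of a phrase form one contiguous interval of the sorted suffixes, so the interval's size is the term frequency, and the number of distinct documents inside it is the document frequency. The code below is an illustration under stated assumptions, not Saphre's implementation: the class PhraseStats, its methods, and the DOC_BOUNDARY token are all hypothetical, the suffix array is built by naive sorting, and the interval is found by binary search rather than by interval tree traversal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of the Saphre API.
final class PhraseStats {
    // Assumed never to occur as a real word; keeps phrases from matching across documents.
    static final String DOC_BOUNDARY = "<DOC_BOUNDARY>";

    private final String[] tokens; // all documents concatenated, one boundary token after each
    private final int[] docOf;     // docOf[i] = index of the document that token i belongs to
    private final Integer[] sa;    // suffix start positions in sorted order

    PhraseStats(String[][] docs) {
        int n = 0;
        for (String[] doc : docs) n += doc.length + 1;
        tokens = new String[n];
        docOf = new int[n];
        int pos = 0;
        for (int d = 0; d < docs.length; d++) {
            for (String word : docs[d]) { tokens[pos] = word; docOf[pos++] = d; }
            tokens[pos] = DOC_BOUNDARY; docOf[pos++] = d;
        }
        sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> compareSuffixes(a, b)); // naive construction
    }

    private int compareSuffixes(int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) return c;
            a++; b++;
        }
        return Integer.compare(b, a); // the shorter suffix sorts first
    }

    /** Negative if the suffix sorts before the phrase, 0 if it starts with the phrase, positive otherwise. */
    private int compareToPhrase(int start, String[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (start + k >= tokens.length) return -1;
            int c = tokens[start + k].compareTo(phrase[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    /** Binary search for the start (upper = false) or the end (upper = true) of the phrase's interval. */
    private int bound(String[] phrase, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToPhrase(sa[mid], phrase);
            if (c < 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Number of occurrences of the phrase in the whole collection. */
    int termFrequency(String[] phrase) {
        return bound(phrase, true) - bound(phrase, false);
    }

    /** Number of distinct documents that contain the phrase. */
    int documentFrequency(String[] phrase) {
        int from = bound(phrase, false), to = bound(phrase, true);
        Set<Integer> docs = new HashSet<>();
        for (int i = from; i < to; i++) docs.add(docOf[sa[i]]);
        return docs.size();
    }
}
```

With three small documents, "new york city", "new york times" and "new york loves new york", the phrase "new york" would have a term frequency of 4 and a document frequency of 3.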
The package includes implementations of recently developed algorithms, such as the Abouelhoda et al. and Kim et al. versions of extended suffix arrays (with most of the power of suffix trees). An interesting feature is the suffix array-based implementation of the Aho-Corasick algorithm. With this automaton, one can alternate in a powerful way between bottom-up phrase discovery and top-down phrase searching. The "examples" directory illustrates this, and is intended to show a large number of examples of how to use all this stuff without getting lost in the implementation details. Contributions to the examples directory would be greatly appreciated.
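For readers unfamiliar with Aho-Corasick, the sketch below shows the top-down searching side over word tokens: a set of known phrases is located in a text in a single pass. Note that this is the classic trie-with-failure-links formulation, not the suffix array-based variant that Saphre implements, and the class WordAhoCorasick and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration class; classic trie-based Aho-Corasick, not Saphre's implementation.
final class WordAhoCorasick {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        final List<String[]> outputs = new ArrayList<>(); // phrases that end at this state
        Node fail;                                        // longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    WordAhoCorasick(List<String[]> phrases) {
        // Build the trie, one edge per word.
        for (String[] phrase : phrases) {
            Node node = root;
            for (String word : phrase) {
                node = node.next.computeIfAbsent(word, w -> new Node());
            }
            node.outputs.add(phrase);
        }
        // Set failure links breadth-first, so shallower states are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<String, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit phrases that end at the failure state.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every match as "startIndex:phrase", in order of the position where the match ends. */
    List<String> findAll(String[] text) {
        List<String> hits = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length; i++) {
            while (state != root && !state.next.containsKey(text[i])) {
                state = state.fail;
            }
            Node following = state.next.get(text[i]);
            state = (following != null) ? following : root;
            for (String[] phrase : state.outputs) {
                hits.add((i - phrase.length + 1) + ":" + String.join(" ", phrase));
            }
        }
        return hits;
    }
}
```

Given the phrases "new york", "york times" and "times", searching the text "the new york times building" would report the overlapping matches 1:new york, 2:york times and 3:times.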
The current version of Saphre is ks-0.1.10 (May 13, 2010) and can be downloaded here.
For now, documentation is limited to comments in the code (both Javadoc and implementation comments). To get started, it is recommended that the user look at the Collector interface and the various implementations of this interface in the examples directory. These implementations show how information about phrases can be collected without too much worry about the suffix array implementation. Motivations for the project are described in this presentation.
The main goal is to discover the phrases in a text which are useful or interesting for specific purposes such as lexicography, text categorization, translation, etc. Saphre has been used for such purposes by a number of students in International Studies in Computational Linguistics at the University of Tübingen.
Saphre has also been used in the LTfLL eLearning project as an alternative approach to characterizing learner texts. Latent Semantic Analysis (LSA) has often been used for this purpose, but LSA tends to pick out the concepts that are expressed in a text, whereas phrase extraction can be used to pick out the ways in which these concepts are expressed textually. When learners use non-standard terminology for expressing concepts, it is a sign that they have not yet adapted to the Speech Genre (Bakhtin) of the relevant Community of Practice (Lave and Wenger). In the project, phrases are used both to estimate the level of the learner and to provide useful feedback to the learner who wishes to adapt to the linguistic norms of the Community of Practice. In social models of learning (Vygotsky), better linguistic skills lead to better communication, which in turn leads to improved learning.