Saphre: Suffix Arrays for Phrase Extraction

Overview

Saphre is a collection of suffix array programs for Java, adapted for large alphabets so that they can be used on natural language text, where it is often desirable to treat each word as a single character of the alphabet.
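
To illustrate what treating each word as an alphabet character means, here is a minimal sketch that maps tokens to integer ids and builds a suffix array over the resulting sequence. It is not taken from Saphre: the class name WordSuffixArray and its methods are hypothetical, and the naive comparison sort stands in for the more efficient construction algorithms that a real implementation would use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of the Saphre API.
class WordSuffixArray {

    /** Replaces each token by an integer id, assigned in order of first occurrence. */
    static int[] encode(String[] tokens, Map<String, Integer> vocab) {
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Integer id = vocab.get(tokens[i]);
            if (id == null) {
                id = vocab.size();
                vocab.put(tokens[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    /** Compares the suffixes starting at a and b; a suffix that is a prefix of the other sorts first. */
    static int compareSuffixes(int[] text, int a, int b) {
        while (a < text.length && b < text.length) {
            if (text[a] != text[b]) {
                return Integer.compare(text[a], text[b]);
            }
            a++;
            b++;
        }
        return Integer.compare(text.length - a, text.length - b);
    }

    /** Naive construction: sort all suffix start positions by the suffixes they denote. */
    static int[] build(int[] text) {
        Integer[] order = new Integer[text.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = order[i];
        }
        return sa;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        int[] sa = build(encode(tokens, new HashMap<>()));
        // Print each suffix in sorted order, as a sequence of words.
        for (int start : sa) {
            System.out.println(start + ": "
                    + String.join(" ", Arrays.copyOfRange(tokens, start, tokens.length)));
        }
    }
}
```

Note that the suffixes are ordered by token id (order of first occurrence) rather than alphabetically; any consistent ordering of the word alphabet will group identical phrases together.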

The program comes with a modular tokenizer and normalizer, and a modular approach to interval tree traversal, allowing phrasal statistics to be calculated based on term frequency, document frequency (or document distribution) and other measures.
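
The following sketch shows how term frequency and document frequency of a phrase can be read off a word-level suffix array. It is a simplification and not Saphre's method: it looks up one phrase at a time by binary search instead of traversing the interval tree, and the class PhraseStats, its methods, and the "<eod:N>" document separator are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of the Saphre API.
class PhraseStats {
    private final String[] text;  // all documents concatenated, token by token
    private final int[] docOf;    // index of the document that owns each position
    private final Integer[] sa;   // suffix start positions in sorted order

    PhraseStats(String[][] docs) {
        List<String> words = new ArrayList<>();
        List<Integer> owners = new ArrayList<>();
        for (int d = 0; d < docs.length; d++) {
            for (String w : docs[d]) {
                words.add(w);
                owners.add(d);
            }
            // Assumption: no real token looks like this separator, so no
            // phrase match can run across a document boundary.
            words.add("<eod:" + d + ">");
            owners.add(d);
        }
        text = words.toArray(new String[0]);
        docOf = new int[text.length];
        sa = new Integer[text.length];
        for (int i = 0; i < text.length; i++) {
            docOf[i] = owners.get(i);
            sa[i] = i;
        }
        Arrays.sort(sa, (a, b) -> compareSuffixes(a, b));
    }

    private int compareSuffixes(int a, int b) {
        while (a < text.length && b < text.length) {
            int c = text[a].compareTo(text[b]);
            if (c != 0) return c;
            a++;
            b++;
        }
        return Integer.compare(text.length - a, text.length - b);
    }

    /** Negative if the suffix at pos sorts before the phrase, 0 if it starts with it, positive otherwise. */
    private int comparePrefix(int pos, String[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (pos + k >= text.length) return -1;
            int c = text[pos + k].compareTo(phrase[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    /** First suffix array index whose suffix is >= the phrase (or > it, if upper is true). */
    private int bound(String[] phrase, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePrefix(sa[mid], phrase);
            if (c < 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Number of occurrences of the phrase in the whole collection. */
    int termFrequency(String[] phrase) {
        return bound(phrase, true) - bound(phrase, false);
    }

    /** Number of distinct documents that contain the phrase. */
    int documentFrequency(String[] phrase) {
        Set<Integer> docs = new HashSet<>();
        int end = bound(phrase, true);
        for (int i = bound(phrase, false); i < end; i++) {
            docs.add(docOf[sa[i]]);
        }
        return docs.size();
    }
}
```

All occurrences of a phrase occupy one contiguous interval of the suffix array, so term frequency is the width of that interval and document frequency is the number of distinct documents among its positions. For the three documents "the cat sat", "the cat ran the cat" and "a dog", the phrase "the cat" has term frequency 3 and document frequency 2.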

The package includes implementations of recently developed algorithms, such as the Abouelhoda et al. and Kim et al. versions of extended suffix arrays (which offer most of the power of suffix trees). An interesting feature is the suffix array-based implementation of the Aho-Corasick algorithm. With this automaton, one can alternate in a powerful way between bottom-up phrase discovery and top-down phrase searching. The "examples" directory illustrates this; it is intended to provide a large number of examples showing how to use all this without getting lost in the implementation details. Contributions to the examples directory would be greatly appreciated.

Status

The current version of Saphre is ks-0.1.10 (May 13, 2010) and can be downloaded here.

Documentation

For now, documentation is limited to comments in the code (both Javadoc and implementation comments). To get started, it is recommended that the user look at the Collector interface and the various implementations of this interface in the examples directory. These implementations show how information about phrases can be collected without worrying too much about the suffix array implementation. Motivations for the project are described in this presentation.

Projects

Use the system to pick out “interesting” phrases, according to some measure of interestingness.
Use phrases for text categorization with a bag-of-phrases model.
Improve text preprocessing. Problematic texts such as Usenet forums are very costly to preprocess. Can this be made more efficient? What other preprocessing would be useful? Morphological analysis perhaps?
Can Saphre be integrated with Foma in any interesting way?
How can Saphre be scaled up to larger corpora?

Application

The main goal is to discover the phrases in a text which are useful or interesting for specific purposes such as lexicography, text categorization, translation, etc. Saphre has been used for such purposes by a number of students in International Studies in Computational Linguistics at the University of Tübingen.

Saphre has also been used in the LTfLL eLearning project as an alternative approach to characterizing learner texts. Latent Semantic Analysis has often been used for this purpose, but LSA tends to pick out the concepts that are expressed in a text, whereas phrase extraction can be used to pick out the ways in which these concepts are expressed textually. When learners use non-standard terminology for expressing concepts, it is a sign that they have not yet adapted to the Speech Genre (Bakhtin) of the relevant Community of Practice (Lave and Wenger). In the project, phrases are used both to estimate the level of the learner and to provide useful feedback to the learner who wishes to adapt to the linguistic norms of the Community of Practice. In social models of learning (Vygotsky), better linguistic skills lead to better communication, which in turn leads to improved learning.