CRAN Task View: Natural Language Processing
Maintainer: Ingo Feinerer and Fridolin Wild
Contact: Fridolin.Wild at wu-wien.ac.at
This CRAN Task View contains a list of packages useful for
natural language processing.
Phonetics and Speech Processing:
emu is a collection of tools for the creation, manipulation, and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.
Lexical Databases:
wordnet provides an R interface to WordNet, a large
lexical database of English.
Keyword Extraction and General String Manipulation:
R's base package already provides a rich set of character manipulation
routines. See help.search(keyword = "character", package = "base")
for more information on these capabilities.
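As a rough illustration of these base capabilities, the following sketch applies a few common character functions to a single string. It uses only functions from the base package; the sample sentence and the variable names are made up for this example.

```r
# Base R only; the sample sentence is invented for illustration.
x <- "Natural language processing with R"

n_chars  <- nchar(x)                             # number of characters
upper    <- toupper(x)                           # case conversion
first7   <- substr(x, 1, 7)                      # substring by position
words    <- strsplit(x, " ", fixed = TRUE)[[1]]  # split on a literal delimiter
replaced <- sub("R$", "base R", x)               # regular-expression replacement
joined   <- paste(rev(words), collapse = " ")    # reassemble in reverse order
```

Each of these functions is vectorised over character vectors, so the same calls work unchanged on a whole vector of documents.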
RKEA provides an R interface to
KEA (Version 5.0). KEA (for
Keyphrase Extraction Algorithm) extracts keyphrases from
text documents. It can be used either for free indexing or for indexing
with a controlled vocabulary.
gsubfn can be used for certain parsing tasks such as
extracting words from strings by content rather than by delimiters.
A demo shipped with the package shows an example of this in a natural language
processing context.
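The difference between splitting by delimiter and extracting by content can be sketched with base R alone. This is not the package's own API, only an analogue built on regmatches() and gregexpr(); the sample sentence and the word pattern are assumptions chosen for illustration.

```r
# Sample text, invented for illustration.
x <- "Well, that's one way (of 3) to split text; isn't it?"

# Delimiter-based: split on spaces. Punctuation stays attached to the pieces.
by_delim <- strsplit(x, " ", fixed = TRUE)[[1]]

# Content-based: describe what a word looks like (letters, optionally
# followed by an apostrophe and more letters) and extract every match.
word_pattern <- "[[:alpha:]]+('[[:alpha:]]+)?"
by_content <- regmatches(x, gregexpr(word_pattern, x))[[1]]
```

Here by_delim still contains pieces such as "Well," and "(of", while by_content holds only the words themselves, with contractions like "isn't" kept intact.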
Natural Language Processing:
openNLP provides an R interface to OpenNLP, a
collection of natural language processing tools including a
sentence detector, tokenizer, POS tagger, shallow and full
syntactic parser, and named-entity detector, using the Maxent
Java package for training and using maximum entropy models.
openNLPmodels.en ships trained models for English and
openNLPmodels.es for Spanish, to be used with openNLP.
RWeka is an interface to Weka,
which is a collection of machine learning algorithms for data
mining tasks written in Java. Especially useful in the context
of natural language processing is its functionality for
tokenization and stemming.
Snowball provides the Snowball stemmers, which contain the Porter
stemmer and several other stemmers for different
languages. See the Snowball webpage for details.
Rstem provides an R interface to a C version of Porter's word
stemming algorithm.
kernlab allows the creation of and computation with string kernels, such as full string,
spectrum, or bounded range string kernels. It can directly use
the document format used by tm as input.
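To make the idea of a string kernel concrete, here is a minimal base R sketch of a spectrum kernel: two strings are compared through the inner product of their counts of contiguous substrings of length k. The function names kmers and spectrum_kernel are hypothetical and do not belong to any package; a production implementation would be considerably more efficient.

```r
# All contiguous substrings of length k (the "k-spectrum") of a string.
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  substring(s, 1:(n - k + 1), k:n)
}

# Spectrum kernel: inner product of the k-mer count vectors of x and y.
# With normalized = TRUE the value is scaled so that K(x, x) equals 1.
spectrum_kernel <- function(x, y, k = 2, normalized = FALSE) {
  raw <- function(a, b) {
    ca <- table(kmers(a, k))
    cb <- table(kmers(b, k))
    if (length(ca) == 0 || length(cb) == 0) return(0)
    shared <- intersect(names(ca), names(cb))
    sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
  }
  value <- raw(x, y)
  if (normalized) value <- value / sqrt(raw(x, x) * raw(y, y))
  value
}

# "abc" and "abd" share exactly one 2-mer ("ab").
k_abc_abd <- spectrum_kernel("abc", "abd", k = 2)
```

Evaluating the function on every pair of documents yields a kernel matrix that kernel-based learners can work with in place of an explicit feature representation.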
tm provides a comprehensive text mining framework for R. The
Journal of Statistical Software article "Text Mining
Infrastructure in R"
gives a detailed overview and presents
techniques for count-based analysis methods, text clustering,
text classification and string kernels.
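Count-based analysis typically starts from a document-term matrix, in which each row is a document and each column counts one term. The following base R sketch builds such a matrix by hand; it does not use any text mining package, and the three toy documents and the simple tokenization rule are assumptions made for illustration.

```r
# Three toy documents, invented for illustration.
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

# Lower-case and split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: the sorted set of distinct terms across all documents.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document (one column per document),
# then transpose so that documents are rows and terms are columns.
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
colnames(dtm) <- vocab

# Corpus-wide term frequencies.
term_freq <- colSums(dtm)
```

Clustering and classification methods can then operate on the rows of dtm, usually after some weighting of the raw counts.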
lsa provides routines for performing a latent semantic analysis with R.
The basic idea of latent semantic analysis (LSA) is
that texts have a higher-order (latent semantic) structure which,
however, is obscured by word usage (e.g. through the use of synonyms
or polysemy). By using conceptual indices that are derived statistically
via a truncated singular value decomposition (a two-mode factor analysis)
over a given document-term matrix, this variability problem can be overcome.
The article "Investigating Unstructured Texts with Latent Semantic Analysis"
gives a detailed overview and demonstrates the use of the package
with examples from the area of technology-enhanced learning.
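The truncated singular value decomposition at the heart of LSA can be sketched with base R's svd(). This is only an illustration under stated assumptions, not the package's own routines: the document-term counts are hypothetical, the number of retained dimensions k is picked by hand, and cosine is a helper defined here.

```r
# Hypothetical document-term counts. d1 uses "car", d2 uses "automobile";
# they overlap only in "engine". d3 uses all three terms.
dtm <- matrix(c(2, 0, 1, 0, 0,
                0, 2, 1, 0, 0,
                1, 1, 1, 0, 0,
                0, 0, 0, 3, 1,
                0, 0, 0, 1, 3),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("d", 1:5),
                              c("car", "automobile", "engine", "flower", "petal")))

k <- 2                       # number of latent dimensions to keep (assumed)
s <- svd(dtm)                # dtm = U diag(d) V'
u_k <- s$u[, 1:k, drop = FALSE]
v_k <- s$v[, 1:k, drop = FALSE]
d_k <- diag(s$d[1:k], nrow = k)

# Document coordinates in the k-dimensional latent space.
doc_coords <- u_k %*% d_k
rownames(doc_coords) <- rownames(dtm)

# Rank-k approximation of the original matrix.
dtm_k <- u_k %*% d_k %*% t(v_k)
dimnames(dtm_k) <- dimnames(dtm)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
raw_sim    <- cosine(dtm["d1", ], dtm["d2", ])                # on raw counts
latent_sim <- cosine(doc_coords["d1", ], doc_coords["d2", ])  # in latent space
```

On the raw counts d1 and d2 look dissimilar (cosine 0.2) because they use different words for the same concept. In the two-dimensional latent space they fall on the same direction (cosine close to 1), which is the synonym effect described above.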
corpora offers utility functions for the statistical analysis of corpus frequency data.
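A typical question about corpus frequency data is whether a word is used at different rates in two corpora. The sketch below uses only base R's stats functions rather than any dedicated package; all counts and corpus sizes are hypothetical.

```r
# Hypothetical counts: occurrences of one target word in two corpora,
# and the total number of tokens in each corpus.
k <- c(A = 120, B = 80)
n <- c(A = 1e5, B = 1e5)

# Relative frequency, expressed per million tokens.
per_million <- k / n * 1e6

# 2 x 2 contingency table: target word versus all other tokens.
tab <- rbind(target = k, other = n - k)

# Chi-squared test of independence (with Yates' continuity correction,
# which chisq.test applies to 2 x 2 tables by default).
test <- chisq.test(tab)
p_value <- test$p.value
```

For very small counts an exact test such as fisher.test(tab) is usually the safer choice.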
languageR provides data sets and functions exemplifying statistical methods, and some
facilitatory utility functions, used in the book by R. H. Baayen: "Analyzing Linguistic Data: a Practical
Introduction to Statistics Using R", Cambridge University Press, 2008.
zipfR offers some statistical models for word frequency distributions. The
utilities include functions for loading, manipulating and visualizing word frequency data and
vocabulary growth curves. The package also implements several statistical models for the
distribution of word frequencies in a population. (The name of this package derives from the
most famous word frequency distribution, Zipf's law.)
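The quantities mentioned here can be sketched in base R without any package. The example below computes a rank-frequency table and a vocabulary growth curve for a toy token sequence, and estimates the Zipf exponent as the slope of a log-log regression. The token sequence and the helper zipf_slope are hypothetical, and the least-squares fit is a simplification, not one of the population models referred to above.

```r
# Toy token sequence, invented for illustration.
tokens <- strsplit("the cat and the dog and the bird saw the cat",
                   " ", fixed = TRUE)[[1]]

# Rank-frequency table: word types ordered by decreasing frequency.
freq <- sort(table(tokens), decreasing = TRUE)
rank_freq <- data.frame(rank = seq_along(freq),
                        word = names(freq),
                        freq = as.integer(freq))

# Vocabulary growth curve: number of distinct types seen after each token.
vgc <- cumsum(!duplicated(tokens))

# Zipf's law says frequency is roughly proportional to 1 / rank^a.
# Estimate -a as the slope of log(frequency) against log(rank).
zipf_slope <- function(freq) {
  f <- sort(as.numeric(freq), decreasing = TRUE)
  r <- seq_along(f)
  unname(coef(lm(log(f) ~ log(r)))[2])
}

# Frequencies that follow Zipf's law exactly (a = 1) give a slope of -1.
ideal_slope <- zipf_slope(1000 / (1:50))
```

Real texts deviate from the ideal line, especially among the rarest words, so a regression slope like this is only a first diagnostic.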