Open Source Packages and Toolkits

A Chinese Word Segmenter

An open source Chinese word segmentation tool written in Java and built on top of ICTCLAS.
It is reported to be more accurate than ICTCLAS and many similar packages,
and it also supports user-defined dictionaries.

A simple Python segmenter based on maximum matching

It is just 34 lines of code.
http://www.isnowfy.com/python-chinese-segmentation/
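
To make the idea concrete, here is a minimal sketch of forward maximum matching. It is not the code from the linked post; the function name `max_match` and the toy dictionary are made up for illustration. At each position it takes the longest dictionary word that matches and falls back to a single character when nothing matches.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching (illustrative sketch only)."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, not taken from any real lexicon.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", toy_dict))  # ['研究生', '命', '起源']
```

The example output also shows the known weakness of greedy matching: the intended reading is 研究 / 生命 / 起源, but the longest-first rule commits to 研究生 too early. Variants such as MMSEG add disambiguation rules to reduce this kind of error.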

Open source Chinese word segmentation projects:

SCWS: http://t.cn/hda5lb
ICTCLAS: http://t.cn/hgTZs3
HTTPCWS: http://t.cn/zjNwvvv
Paoding (庖丁解牛) segmenter: http://t.cn/hCZC2z
CC-CEDICT: http://t.cn/zjNZsss

MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm

written in C

word2vec

Tool for computing continuous distributed representations of words.

FudanNLP

A Chinese NLP toolkit written in Java.

Features:

Information Retrieval: Text Classification, News Clustering
Chinese Processing: Word Segmentation, POS Tagging, Entity Recognition, Keyword Extraction, Dependency Parsing, Time Phrase Recognition
Structural Learning: Online Learning, Hierarchical Classification, Clustering, Reasoning

A series of Text Processing tools

FlexCRFs: Flexible Conditional Random Fields
CRFTagger: CRF English POS Tagger
CRFChunker: CRF English Phrase Chunker
JTextPro: A Java-based Text Processing Toolkit
JWebPro: A Java-based Web Processing Toolkit
JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool

JCharset

language-detection

This is a language detection library implemented in plain Java.

http://code.google.com/p/language-detection/

Generate language profiles from Wikipedia abstract xml
Detect language of a text using naive Bayesian filter
Over 99% precision for 53 languages
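
The two steps above (build a profile per language, then classify with naive Bayes) can be sketched in a few lines. This is a toy illustration, not the library's implementation or API: the function names, the character-bigram features, the add-one smoothing, and the two tiny training sets are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-grams of the lowercased text, padded with spaces."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(samples, n=2):
    """Count n-grams over all sample sentences of one language."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts


def detect(text, profiles, n=2):
    """Return the language with the highest naive Bayes log-likelihood."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Uniform prior; add-one smoothing with one extra bucket for unseen n-grams.
        score = sum(
            math.log((counts[gram] + 1) / (total + len(vocab) + 1))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical, tiny training data; real profiles come from large corpora.
profiles = {
    "en": build_profile(["the quick brown fox jumps over the lazy dog",
                         "this is a simple sentence"]),
    "de": build_profile(["der schnelle braune fuchs springt über den faulen hund",
                         "das ist ein einfacher satz"]),
}
print(detect("the dog is lazy", profiles))    # en
print(detect("der hund ist faul", profiles))  # de
```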

LingPipe

LingPipe is a toolkit for processing text using computational linguistics. It is used for tasks such as:

Find the names of people, organizations or locations in news
Automatically classify Twitter search results into categories
Suggest correct spellings of queries

It also includes basic implementations of many NLP models and techniques, such as HMM, CRF, LM, Chunking, SVD, POS tagging, Clustering, Classification (Naive Bayes, Logistic Regression, …), EM, and plenty more.

The most impressive thing about LingPipe, I think, is its complete documentation. It is particularly good for those who want not only to use it but also to learn the implementation details. There is also a free book about LingPipe and a Java-based text processing book. You can find them on LingPipe's website.

See more: http://alias-i.com/lingpipe/demos/tutorial/read-me.html

OpenNLP

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.

The goal of the OpenNLP project will be to create a mature toolkit for the abovementioned tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.

RankLib

Overview
RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been implemented:

MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6]

RankNet [1]

RankBoost [2]

AdaRank [3]

Coordinate Ascent [4]

LambdaMART [5]

ListNet [7]

Random Forests [8]

With appropriate parameters for Random Forests, RankLib can also bag several MART/LambdaMART rankers.

It also implements many retrieval metrics and provides many ways to carry out evaluation.
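
As an illustration of what such a retrieval metric computes, here is a small sketch of NDCG@k, a metric commonly used to evaluate learning-to-rank models. It is not RankLib's code. The function names are made up, and the gain formula `2^rel - 1` with a `log2` discount is one common variant that is assumed here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (ranks start at 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the documents in the order the ranker returned them.
print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0, already in ideal order
print(ndcg_at_k([0, 1], k=2))        # below 1.0, the relevant document is ranked second
```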

Apache Tika

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. It saves a lot of time on this kind of work.

Supported Document Formats

JNotify

JNotify is a Java library that allows Java applications to listen for file system events, such as:

File created
File modified
File renamed
File deleted

Colt

Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java.

Feature	Description
Templated Lists and Maps	Dynamically resizing lists holding objects or primitive data types such as `int`, `double`, etc. Operations on primitive arrays, algorithms on Colt lists and JAL algorithms can freely be mixed at zero copy overhead. Automatically growing and shrinking maps holding objects or primitive data types such as `int`, `double`, etc. Space-efficient, high-performance BitVectors and BitMatrices.
Templated Multi-dimensional Matrices	Dense and sparse fixed-size (non-resizable) 1-, 2-, 3- and d-dimensional matrices holding objects or primitive data types such as `int`, `double`, etc. Also known as multi-dimensional arrays or Data Cubes.
Linear Algebra	Standard matrix operations and decompositions: LU, QR, Cholesky, Eigenvalue, Singular value.
Histogramming	Compact, extensible, modular and performant histogramming functionality. AIDA offers the histogramming features of HTL and HBOOK.
Mathematics	Tools for basic and advanced mathematics: Arithmetics and Algebra, Polynomials and Chebyshev series, Bessel and Airy functions, Constants and Units, Trigonometric functions, etc.
Statistics	Tools for basic and advanced statistics: Estimators, Gamma functions, Beta functions, Probabilities, Special integrals, etc.
Random Numbers and Random Sampling	Strong yet quick. Partly a port of CLHEP.
util.concurrent	Efficient utility classes commonly encountered in parallel and concurrent programming.

jforests

jforests is a Java library that implements many tree-based learning algorithms.

jforests can be used for regression, classification and ranking problems. The following tutorial shows how jforests can be used for learning a ranking model using the LambdaMART algorithm.

Qt Jambi

Qt is the de facto standard C++ framework for high performance cross-platform software development. Qt Jambi is the Qt library made available to Java. It is an open source technology aimed at all desktop programmers wanting to write rich GUI clients using the Java language, while at the same time taking advantage of Qt’s power and efficiency.

The technology provides new possibilities for both Java and C++ programmers: It enables Java developers to take advantage of Qt’s features from within Java Standard Edition 5.0 and Java Enterprise Edition 5.0 as well as later versions. In addition, Qt Jambi also enables C++ programmers to easily integrate their Qt code with Java by providing the Qt Jambi generator.

For a more comprehensive description of what Qt Jambi provides, see the project website.

The project's new website was released on 10.03.2012 after far too many delays. The old website can still be seen at http://old.qt-jambi.org.

Lupyne is:

a high-level Pythonic search engine library, built on PyLucene
a RESTful JSON search server, built on CherryPy
a simple Python client for interacting with the server

NLTK

Pattern

PyWordNet

This is the old version of PyWordNet. PyWordNet was contributed to the NLTK project in 2006. Refer to NLTK for a more recent Python/WordNet implementation that has been updated to WordNet 2.1 and extended with some of the WordNet similarity scoring algorithms.
