wordbreaker
command reference¶
wordbreaker
is one of the helper tools within the Manticore package. It
is used to split compound words, as usual in URLs, into its component
words. For example, this tool can split “lordoftherings” into its four
component words, or “http://manofsteel.warnerbros.com” into “man of
steel warner bros”. This helps searching, without requiring prefixes or
infixes: searching for “sphinx” wouldn’t match “sphinxsearch” but if you
break the compound word and index the separate components, you’ll get a
match without the costs of prefix and infix larger index files.
Examples of its usage are:
echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel
The input stream will be separated in words using the -dict
dictionary file. In no dictionary specified, wordbreaker looks in the working folder for a wordbreaker-dict.txt file. (The dictionary should match the language of the
compound word.) The split
command breaks words from the standard
input, and outputs the result in the standard output. There are also
test
and bench
commands that let you test the splitting quality
and benchmark the splitting functionality.
Wordbreaker Wordbreaker needs a dictionary to recognize individual
substrings within a string. To differentiate between different guesses,
it uses the relative frequency of each word in the dictionary: higher
frequency means higher split probability. You can generate such a file
using the indexer
tool, as in
indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf
which will write the 100,000 most frequent words, along with their counts, from myindex into dict.txt. The output file is a text file, so you can edit it by hand, if need be, to add or remove words.