Index configuration options

agent_blackhole

Remote blackhole agent declaration in the distributed index. Multi-value, optional, default is empty.

agent_blackhole lets you fire-and-forget queries to remote agents. That is useful for debugging (or just testing) production clusters: you can set up a separate debugging/testing searchd instance, and forward requests to this instance from your production master (aggregator) instance without interfering with production work. Master searchd will attempt to connect to and query the blackhole agent normally, but it will neither wait for nor process any responses. Also, all network errors on blackhole agents will be ignored. The value format is completely identical to that of the regular agent directive.

Example:

agent_blackhole = testbox:9312:testindex1,testindex2

agent_connect_timeout

Remote agent connection timeout, in milliseconds. Optional, default is 1000 (ie. 1 second).

When connecting to remote agents, searchd will wait at most this much time for connect() call to complete successfully. If the timeout is reached but connect() does not complete, and retries are enabled, retry will be initiated.

Example:

agent_connect_timeout = 300

agent_persistent

Persistently connected remote agent declaration. Multi-value, optional, default is empty.

agent_persistent directive syntax matches that of the agent directive. The only difference is that the master will not open a new connection to the agent for every query and then close it. Rather, it will keep a connection open and attempt to reuse it for subsequent queries. The maximum number of such persistent connections per agent host is limited by the persistent_connections_limit option of the searchd section.

Note that you have to set persistent_connections_limit to something greater than 0 if you want to use persistent agent connections. When persistent_connections_limit is not defined, it defaults to zero persistent connections, and ‘agent_persistent’ acts exactly like a plain ‘agent’.
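
For instance, a minimal sketch pairing the two directives (the distributed index name and the limit value are illustrative, not recommendations):

searchd
{
    # cap persistent connections per agent host
    persistent_connections_limit = 30
}

index mydist
{
    type             = distributed
    agent_persistent = remotebox:9312:index2
}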

Persistent master-agent connections reduce TCP port pressure and save on connection handshakes. As of this writing, they are supported only in workers=threads and workers=threadpool modes. In other modes, simple non-persistent connections (i.e., one connection per operation) will be used, and a warning will show up in the console.

Example:

agent_persistent = remotebox:9312:index2

agent_query_timeout

Remote agent query timeout, in milliseconds. Optional, default is 3000 (ie. 3 seconds).

After connecting, searchd will wait at most this much time for remote queries to complete. This timeout is fully separate from the connection timeout; so the maximum possible delay caused by a remote agent equals the sum of agent_connect_timeout and agent_query_timeout. Queries will not be retried if this timeout is reached; a warning will be produced instead.
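
For example, with the following settings (the values are illustrative), the worst-case delay contributed by a single remote agent per attempt is 300 + 10000 = 10300 ms:

agent_connect_timeout = 300    # wait up to 0.3 sec for connect()
agent_query_timeout   = 10000  # wait up to 10 sec for the query to complete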

Example:

agent_query_timeout = 10000 # our query can be long, allow up to 10 sec

agent_retry_count

Integer, specifies how many times Manticore will try to connect to and query remote agents in the distributed index before reporting a fatal query error. It works the same as agent_retry_count in the searchd section, but defines the value for a specific index. See also the mirror_retry_count option.
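
For instance (the value is purely illustrative):

agent_retry_count = 3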

agent

Remote agent declaration in the distributed index. Multi-value, optional, default is empty.

agent directive declares remote agents that are searched every time when the enclosing distributed index is searched. The agents are, essentially, pointers to networked indexes. The value specifies address, and also can additionally specify multiple alternatives (agent mirrors) for either the address only, or the address and index list:

agent = address1 [ | address2 [...] ][:index-list]
agent = address1[:index-list [ | address2[:index-list [...] ] ] ]

In both cases the address specification must be one of the following:

address = hostname[:port] # eg. server2:9312
address = /absolute/unix/socket/path # eg. /var/run/sphinx2.sock

Where hostname is the remote host name, port is the remote TCP port number, index-list is a comma-separated list of index names, and square brackets [] designate an optional clause.

When the index name is omitted, it is assumed to be the same index as the one where this line is defined. I.e. when defining agents for distributed index ‘mycoolindex’ you can just specify the address, and it is assumed to call the ‘mycoolindex’ index on the agent’s endpoints.

When the port number is omitted, it is assumed to be the default SphinxAPI IANA port (9312). However, if a port number is specified but invalid (say, port 70000), such an agent will be skipped.

In other words, you can point every single agent to one or more remote indexes, residing on one or more networked servers. There are absolutely no restrictions on the pointers. To point out a couple important things, the host can be localhost, and the remote index can be a distributed index in turn, all that is legal. That enables a bunch of very different usage modes:

  • sharding over multiple agent servers, and creating an arbitrary cluster topology;
  • sharding over multiple agent servers, mirrored for HA/LB (High Availability and Load Balancing) purposes;
  • sharding within localhost, to utilize multiple cores (however, it is simpler just to use multiple local indexes and dist_threads directive instead);

All agents are searched in parallel. An index list is passed verbatim to the remote agent. How exactly that list is searched within the agent (ie. sequentially or in parallel too) depends solely on the agent configuration (ie. dist_threads directive). Master has no remote control over that.

The value can additionally enumerate per agent options such as:

agent = address1:index-list[[ha_strategy=value] | [conn=value] | [blackhole=value]]

Example:

# config on box1
# sharding an index over 3 servers
agent = box2:9312:chunk2
agent = box3:9312:chunk3

# config on box2
# sharding an index over 3 servers
agent = box1:9312:chunk1
agent = box3:9312:chunk3

# config on box3
# sharding an index over 3 servers
agent = box1:9312:chunk1
agent = box2:9312:chunk2

# per agent options
agent = box1:9312:chunk1[ha_strategy=nodeads]
agent = box2:9312:chunk2[conn=pconn]
agent = test:9312:any[blackhole=1]
agent = test:9312|box2:9312|box3:9312:any2[retry_count=2]

Agent mirrors

The syntax lets you define so-called agent mirrors that can be used interchangeably when processing a search query. Master server keeps track of mirror status (alive or dead) and response times, and does automatic failover and load balancing based on that. For example, this line:

agent = box1:9312|box2:9312|box3:9312:chunk2

declares that box1:9312, box2:9312, and box3:9312 all have an index called chunk2, and can be used as interchangeable mirrors. If any one of those servers goes down, the queries will be distributed between the other two. When it comes back up, master will detect that and begin routing queries to all three boxes again.

Another way to define the mirrors is to explicitly specify the index list for every mirror:

agent = box1:9312:box1chunk2|box2:9312:box2chunk2

This works essentially the same as the previous example, but different index names will be used when querying different servers: box1chunk2 when querying box1:9312, and box2chunk2 when querying box2:9312.

By default, all queries are routed to the best of the mirrors. The best one is picked based on the recent statistics, as controlled by the ha_period_karma config directive. Master stores a number of metrics (total query count, error count, response time, etc) recently observed for every agent. It groups those by time spans, and karma is that time span length. The best agent mirror is then determined dynamically based on the last 2 such time spans. The specific algorithm used to pick a mirror can be configured with the ha_strategy directive.

The karma period is in seconds and defaults to 60 seconds. Master stores up to 15 karma spans with per-agent statistics for instrumentation purposes (see SHOW AGENT STATUS statement). However, only the last 2 spans out of those are ever used for HA/LB logic.

When there are no queries, master sends a regular ping command every ha_ping_interval milliseconds in order to collect some statistics and at least check whether the remote host is still alive. ha_ping_interval defaults to 1000 msec. Setting it to 0 disables pings, and statistics will only be accumulated based on actual queries.
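
As a sketch, both of these knobs live in the searchd section; the values below simply restate the defaults:

searchd
{
    ha_period_karma  = 60     # karma span length, in seconds
    ha_ping_interval = 1000   # ping idle mirrors every 1000 msec; 0 disables pings
}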

Example:

# sharding index over 4 servers total
# in just 2 chunks but with 2 failover mirrors for each chunk
# box1, box2 carry chunk1 as local
# box3, box4 carry chunk2 as local

# config on box1, box2
agent = box3:9312|box4:9312:chunk2

# config on box3, box4
agent = box1:9312|box2:9312:chunk1

bigram_freq_words

A list of keywords considered “frequent” when indexing bigrams. Optional, default is empty.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

Some of the bigram indexing modes (see bigram_index) require a list of frequent keywords to be defined. These are not to be confused with stopwords! Stopwords are completely eliminated during both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.

bigram_freq_words lets you define a list of such keywords.

Example:

bigram_freq_words = the, a, you, i

bigram_index

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

  • all, index every single word pair. (NB: probably totally not worth it even on a moderately sized index, but added anyway for the sake of completeness.)
  • first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing “alone in the dark” text will result in “in the” and “the dark” pairs being stored as bigrams, because they begin with a frequent keyword (either “in” or “the” respectively), but “alone in” would not be indexed, because “in” is a second word in that pair.
  • both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing “alone in the dark” would only store “in the” (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.
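
Since the first_freq and both_freq modes rely on the frequent keywords list, the two directives are normally configured together; a minimal sketch, reusing the frequent words from the example above:

bigram_index      = both_freq
bigram_freq_words = the, in, i, a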

Example:

bigram_index = both_freq

blend_chars

Blended characters list. Optional, default is empty.

Blended characters are indexed both as separators and valid characters. For instance, assume that & is configured as blended and AT&T occurs in an indexed document. Three different keywords will get indexed, namely “at&t”, treating blended characters as valid, plus “at” and “t”, treating them as separators.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, so regular keywords will be indexed just as if there was no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if “AT&T company” occurs at the very beginning of the text field, “at” will be given position 1, “t” position 2, “company” position 3, and “AT&T” will also be given position 1 (“blending” with the opening regular keyword). Thus, querying for either AT&T or just AT will match that document, and querying for “AT T” as a phrase will also match it. Last but not least, a phrase query for “AT&T company” will also match it, despite the position gap introduced by the intervening “t” token.

Blended characters can overlap with special characters used in query syntax (think of T-Mobile or @twitter). Where possible, query parser will automatically handle blended character as blended. For instance, “hello @twitter” within quotes (a phrase operator) would handle @-sign as blended, because @-syntax for field operator is not allowed within phrases. Otherwise, the character would be handled as an operator. So you might want to escape the keywords.

Blended characters can be remapped, so that multiple different blended characters could be normalized into just one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

Example:

blend_chars = +, &, U+23
blend_chars = +, &->+

blend_mode

Blended tokens indexing mode. Optional, default is trim_none.

By default, tokens that mix blended and non-blended characters get indexed in their entirety. For instance, when both the at-sign and the exclamation mark are in blend_chars, “@dude!” will result in two tokens being indexed: “@dude!” (with all the blended characters) and “dude” (without any). Therefore an “@dude” query will not match it.

blend_mode directive adds flexibility to this indexing behavior. It takes a comma-separated list of options.

blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | skip_pure

Options specify token indexing variants. If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with whitespace) are always indexed.

  • trim_none - Index the entire token.
  • trim_head - Trim heading blended characters, and index the resulting token.
  • trim_tail - Trim trailing blended characters, and index the resulting token.
  • trim_both - Trim both heading and trailing blended characters, and index the resulting token.
  • skip_pure - Do not index the token if it’s purely blended, that is, consists of blended characters only.

Returning to the “@dude!” example above, setting blend_mode = trim_head, trim_tail will result in two tokens being indexed, “@dude” and “dude!”. In this particular example, trim_both would have no effect, because trimming both blended characters results in “dude”, which is already indexed as a regular keyword. Indexing “@U.S.A.” with trim_both (and assuming that dot is blended too) would result in “U.S.A” being indexed. Last but not least, skip_pure enables you to fully ignore sequences of blended characters only. For example, “one @@@ two” would be indexed exactly as “one two”, and match that as a phrase. That is not the case by default, because a fully blended token gets indexed and offsets the second keyword position.

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

Example:

blend_mode = trim_tail, skip_pure

charset_table

Accepted characters table, with case folding rules. Optional, default value covers Latin and Cyrillic characters.

charset_table is the main workhorse of Manticore tokenizing process, ie. the process of extracting keywords from document text or query text. It controls what characters are accepted as valid and what are not, and how the accepted characters should be transformed (eg. should the case be removed or not).

You can think of charset_table as of a big table that has a mapping for each and every of 100K+ characters in Unicode. By default, every character maps to 0, which means that it does not occur within keywords and should be treated as a separator. Once mentioned in the table, character is mapped to some other character (most frequently, either to itself or to a lowercase letter), and is treated as a valid keyword part.

The expected value format is a comma-separated list of mappings. The two simplest mappings respectively declare a character as valid, and map a single character to another single character. But specifying the whole table in such form would result in bloated and barely manageable specifications. So there are several syntax shortcuts that let you map ranges of characters at once. The complete list is as follows:

  • A->a - Single char mapping, declares source char ‘A’ as allowed to occur within keywords and maps it to destination char ‘a’ (but does not declare ‘a’ as allowed).
  • A..Z->a..z - Range mapping, declares all chars in source range as allowed and maps them to the destination range. Does not declare destination range as allowed. Also checks ranges’ lengths (the lengths must be equal).
  • a - Stray char mapping, declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
  • a..z - Stray range mapping, declares all characters in range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
  • A..Z/2 - Checkerboard range map. Maps every pair of chars to the second char. More formally, declares odd characters in range as allowed and maps them to the even ones; also declares even characters as allowed and maps them to themselves. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, …, Y->Z, Z->Z. This mapping shortcut is helpful for a number of Unicode blocks where uppercase and lowercase letters go in such interleaved order instead of contiguous chunks.

Control characters with codes from 0 to 31 are always treated as separators. Characters with codes 32 to 127, ie. 7-bit ASCII characters, can be used in the mappings as is. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+xxx form, where ‘xxx’ is hexadecimal codepoint number. This form can also be used for 7-bit ASCII characters to encode special ones: eg. use U+20 to encode space, U+2E to encode dot, U+2C to encode comma.

The aliases “english” and “russian” can be used in the charset_table value in place of the corresponding character mappings.

Example:

# default are English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

dict

The keywords dictionary type. Known values are ‘crc’ and ‘keywords’. Optional, default is ‘keywords’.

Keywords dictionary mode (dict=keywords) greatly reduces the indexing impact and enables substring searches on huge collections. This mode is supported both for disk and RT indexes.

CRC dictionaries never store the original keyword text in the index. Instead, keywords are replaced with their control sum value (calculated using FNV64) both when searching and indexing, and that value is used internally in the index.

That approach has two drawbacks. First, there is a chance of control sum collision between several pairs of different keywords, growing quadratically with the number of unique keywords in the index. However, it is not a big concern, as the chance of a single FNV64 collision in a dictionary of 1 billion entries is approximately 1:16, or 6.25 percent, and most dictionaries will be much more compact than a billion keywords, as a typical spoken human language has in the region of 1 to 10 million word forms. Second, and more importantly, substring searches are not directly possible with control sums. Manticore alleviates that by pre-indexing all the possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). That actually has an added benefit of matching substrings in the quickest way possible. But at the same time pre-indexing all substrings grows the index size a lot (factors of 3-10x and even more would not be unusual) and impacts the indexing time respectively, rendering substring searches on big indexes rather impractical.

Keywords dictionary fixes both these drawbacks. It stores the keywords in the index and performs search-time wildcard expansion. For example, a search for a ‘test*’ prefix could internally expand to a ‘test|tests|testing’ query based on the dictionary contents. That expansion is fully transparent to the application, except that the separate per-keyword statistics for all the actually matched keywords would now also be reported.

For substring (infix) search, extended wildcards may be used. Special symbols like ‘?’ and ‘%’ are supported along with substring (infix) search (e.g. “t?st”, “run%”, “abc*”). Note, however, these wildcards work only with dict=keywords, and not elsewhere.
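
To actually enable such wildcard and substring searches, dict=keywords is paired with a prefix or infix length setting; a minimal sketch (the length value is illustrative):

dict          = keywords
min_infix_len = 2   # enable infix (substring) wildcards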

Indexing with keywords dictionary should be 1.1x to 1.3x slower compared to regular, non-substring indexing - but several times faster compared to substring indexing (either prefix or infix). Index size should be only slightly bigger than that of the regular non-substring index, with a 1 to 10 percent total difference. Regular keyword searching time should be very close or identical across all three discussed index kinds (CRC non-substring, CRC substring, keywords). Substring searching time can vary greatly depending on how many actual keywords match the given substring (in other words, into how many keywords the search term expands). The maximum number of keywords matched is restricted by the expansion_limit directive.

Essentially, keywords and CRC dictionaries represent two different trade-offs for substring searching. You can choose to either sacrifice indexing time and index size in favor of top-speed worst-case searches (CRC dictionary), or only slightly impact indexing time but sacrifice worst-case searching time when the prefix expands into very many keywords (keywords dictionary).

Example:

dict = keywords

docinfo

Document attribute values (docinfo) storage mode. Optional, default is ‘extern’. Known values are ‘none’, ‘extern’ and ‘inline’.

Docinfo storage mode defines how exactly docinfo will be physically stored on disk and in RAM. “none” means that there will be no docinfo at all (ie. no attributes). Normally you do not need to set “none” explicitly, because Manticore will automatically select “none” when there are no attributes configured. “inline” means that the docinfo will be stored in the .spd file, along with the document ID lists. “extern” means that the docinfo will be stored separately (externally) from document ID lists, in a special .spa file.

Basically, externally stored docinfo must be kept in RAM when querying, for performance reasons. So in some cases “inline” might be the only option. However, such cases are infrequent, and docinfo defaults to “extern”. Refer to Attributes for an in-depth discussion and RAM usage estimates.

Example:

docinfo = inline

embedded_limit

Embedded exceptions, wordforms, or stopwords file size limit. Optional, default is 16K.

Indexer can either save the file name, or embed the file contents directly into the index. Files sized under embedded_limit get stored into the index. For bigger files, only the file names are stored. This also simplifies moving index files to a different machine; you may get by just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the index depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta index. So there needs to be a size threshold, and embedded_limit is that threshold.

Example:

embedded_limit = 32K

exceptions

Tokenizing exceptions file. Optional, default is empty.

Exceptions let you map one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping, but have a number of important differences.

Small enough files are stored in the index header, see embedded_limit for details.

Short summary of the differences is as follows:

  • exceptions are case sensitive, wordforms are not;
  • exceptions can use special characters that are not in charset_table, wordforms fully obey charset_table;
  • exceptions can underperform on huge dictionaries, wordforms handle millions of entries well.

The expected file format is also plain text, with one line per exception, and the line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive: they will not be processed by charset_table rules. Thus, with the example exceptions file above, “at&t” text will be tokenized as two keywords “at” and “t”, because of lowercase letters. On the other hand, “AT&T” will match exactly and produce single “AT&T” keyword.

Note that this map-to keyword is a) always interpreted as a single word, and b) is both case and space sensitive! In our sample, “ms windows” query will not match the document with “MS Windows” text. The query will be interpreted as a query for two keywords, “ms” and “windows”. And what “MS Windows” gets mapped to is a single keyword “ms windows”, with a space in the middle. On the other hand, “standartenfuhrer” will retrieve documents with “Standarten Fuhrer” or “Standarten Fuehrer” contents (capitalized exactly like this), or any capitalization variant of the keyword itself, eg. “staNdarTenfUhreR”. (It won’t catch “standarten fuhrer”, however: this text does not match any of the listed exceptions because of case sensitivity, and gets indexed as two separate keywords.)

Whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-from list will match any other amount of whitespace in the indexed document or query. For instance, the “AT & T” map-from tokens will match “AT & T” text, whatever the amount of space in both the map-from part and the indexed text. Such text will therefore be indexed as the special “AT&T” keyword, thanks to the very first entry from the sample.

Exceptions also let you capture special characters (that are exceptions from the general charset_table rules; hence the name). Assume that you generally do not want to treat ‘+’ as a valid character, but still want to be able to search for some exceptions from this rule such as ‘C++’. The sample above will do just that, totally independent of what characters are in the table and what are not.

Exceptions are applied to raw incoming document and query data during indexing and searching respectively. Therefore, to pick up changes in the file it’s required to reindex and restart searchd.

Example:

exceptions = /usr/local/sphinx/data/exceptions.txt

expand_keywords

Expand keywords with exact forms and/or stars when possible. The value can additionally enumerate options such as exact and star. Optional, default is 0 (do not expand keywords).

Queries against indexes with expand_keywords feature enabled are internally expanded as follows. If the index was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of keyword itself and a respective prefix or infix (keyword with stars). If the index was built with both stemming and index_exact_words enabled, exact form is also added. Here’s an example that shows how internal expansion works when all of the above (infixes, stemming, and exact words) are combined:

running -> ( running | *running* | =running )

(as expand_keywords = 1 or expand_keywords = star,exact). The expansion can also be limited to the exact form only, even when infixes are enabled for the index:

running -> ( running | =running )

(as expand_keywords = exact)

Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed or infix matches.

Note that the existing query syntax does not let you emulate this kind of expansion, because internal expansion works on the keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax).

This directive does not affect indexer in any way, it only affects searchd.
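
A hedged sketch of an index configuration in which the full expansion shown above (star plus exact form) can actually take effect; the English stemmer is just an illustrative morphology choice:

min_infix_len     = 2         # prefix/infix indexing, enables *running*
index_exact_words = 1         # store raw forms, enables =running
morphology        = stem_en   # stemming, so exact and stemmed forms differ
expand_keywords   = 1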

Example:

expand_keywords = 1

global_idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-index cluster, per-keyword frequencies are quite likely to differ across different indexes. That means that when the ranking function uses TF-IDF based values, such as the BM25 family of factors, the results might be ranked slightly differently depending on which cluster node they reside on.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use a .idf extension. When the IDF file is specified for a given index and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection document counts from the global_idf file, rather than just the local index. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple indexes. Only a single copy of an IDF file will be loaded by searchd, even when many indexes refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using the indextool utility, by dumping dictionaries using the --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all .idf files across the cluster using --mergeidf. Refer to the indextool command reference for more information.
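
A sketch of that workflow for a hypothetical two-node cluster with placeholder index and file names (the exact indextool invocation may differ between versions, so treat this as an outline rather than the canonical command set):

indextool --config sphinx.conf --dumpdict index1 --stats > dict1.txt
indextool --buildidf dict1.txt --out index1.idf
indextool --mergeidf index1.idf index2.idf --out global.idf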

Example:

global_idf = /usr/local/sphinx/var/global.idf

ha_strategy

Agent mirror selection strategy, for load balancing. Optional, default is random.

The strategy used for mirror selection, or in other words, choosing a specific agent mirror in a distributed index. Essentially, this directive controls how exactly master does the load balancing between the configured mirror agent nodes. The following strategies are implemented:

Simple random balancing

ha_strategy = random

The default balancing mode. Simple linear random distribution among the mirrors. That is, an equal selection probability is assigned to every mirror. Kind of similar to round-robin (RR), but unlike RR, it does not impose a strict selection order.

Adaptive randomized balancing

The default simple random strategy does not take mirror status, error rate, and, most importantly, actual response latencies into account. So to accommodate heterogeneous clusters and/or temporary spikes in agent node load, we have a group of balancing strategies that dynamically adjust the probabilities based on the actual query latencies observed by the master.

The adaptive strategies based on latency-weighted probabilities basically work as follows:

  • latency stats are accumulated, in blocks of ha_period_karma seconds;
  • once per karma period, latency-weighted probabilities get recomputed;
  • once per request (including ping requests), “dead or alive” flag is adjusted.

Currently, we begin with equal probabilities (or percentages, for brevity), and on every step, scale them by the inverse of the latencies observed during the last “karma” period, and then renormalize them. For example, if during the first 60 seconds after the master startup 4 mirrors had latencies of 10, 5, 30, and 3 msec/query respectively, the first adjustment step would go as follows:

  • initial percentages: 0.25, 0.25, 0.25, 0.25;
  • observed latencies: 10 ms, 5 ms, 30 ms, 3 ms;
  • inverse latencies: 0.1, 0.2, 0.0333, 0.333;
  • scaled percentages: 0.025, 0.05, 0.008333, 0.0833;
  • renormalized percentages: 0.15, 0.30, 0.05, 0.50.

Meaning that the 1st mirror would have a 15% chance of being chosen during the next karma period, the 2nd one a 30% chance, the 3rd one (slowest at 30 ms) only a 5% chance, and the 4th and the fastest one (at 3 ms) a 50% chance. Then, after that period, the second adjustment step would update those chances again, and so on.

The rationale here is, once the observed latencies stabilize, the latency weighted probabilities stabilize as well. So all these adjustment iterations are supposed to converge at a point where the average latencies are (roughly) equal over all mirrors.

ha_strategy = nodeads

Latency-weighted probabilities, but dead mirrors are excluded from the selection. “Dead” mirror is defined as a mirror that resulted in multiple hard errors (eg. network failure, or no answer, etc) in a row.

ha_strategy = noerrors

Latency-weighted probabilities, but mirrors with worse errors/success ratio are excluded from the selection.

Round-robin balancing

ha_strategy = roundrobin

Simple round-robin selection, that is, selecting the 1st mirror in the list, then the 2nd one, then the 3rd one, etc, and then repeating the process once the last mirror in the list is reached. Unlike with the randomized strategies, RR imposes a strict querying order (1, 2, 3, .., N-1, N, 1, 2, 3, … and so on) and guarantees that no two subsequent queries will be sent to the same mirror.

hitless_words

Hitless words list. Optional, allowed values are ‘all’, or a list file name.

By default, a Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (aka hitlist). Hitlists enable phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that can not be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

Hitless index will generally use less space than the respective regular index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support. When searching, positional queries (eg. phrase queries) will be automatically converted to respective non-positional (document-level) or combined queries. For instance, if keywords “hello” and “world” are hitless, a “hello world” phrase query will be converted to a (hello & world) bag-of-words query, matching all documents that contain both keywords but not necessarily the exact phrase. And if, in addition, keywords “simon” and “says” are not hitless, “simon says hello world” will be converted to a (“simon says” & hello & world) query, matching all documents that contain “hello” and “world” anywhere in the document, and also “simon says” as an exact phrase.
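
Besides the catch-all ‘all’ value shown in the example below, a per-keyword list can be supplied via a plain text file; the path here is purely hypothetical:

hitless_words = /usr/local/sphinx/data/hitless.txt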

Example:

hitless_words = all

html_index_attrs

A list of markup attributes to index when stripping HTML. Optional, default is empty (do not index markup attributes).

Specifies HTML markup attributes whose contents should be retained and indexed even though other HTML markup is stripped. The format is per-tag enumeration of indexable attributes, as shown in the example below.

Example:

html_index_attrs = img=alt,title; a=title;

html_remove_elements

A list of HTML elements for which to strip contents along with the elements themselves. Optional, default is empty string (do not strip contents of any elements).

This feature lets you strip element contents, ie. everything that is between the opening and the closing tags. It is useful to remove embedded scripts, CSS, etc. The short tag form for empty elements (ie. <br />) is properly supported; ie. the text that follows such a tag will not be removed.

The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case insensitive.

Example:

html_remove_elements = style, script

html_strip

Whether to strip HTML markup from incoming full-text data. Optional, default is 0. Known values are 0 (disable stripping) and 1 (enable stripping).

Both HTML tags and entities are considered markup and get processed.

HTML tags are removed, their contents (i.e., everything between <P> and </P>) are left intact by default. You can choose to keep and index attributes of the tags (e.g., HREF attribute in an A tag, or ALT in an IMG one). Several well-known inline tags are completely removed, all other tags are treated as block level and replaced with whitespace. For example, ‘te<B>st</B>’ text will be indexed as a single keyword ‘test’, however, ‘te<P>st</P>’ will be indexed as two keywords ‘te’ and ‘st’. Known inline tags are as follows: A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.

HTML entities get decoded and replaced with corresponding UTF-8 characters. Stripper supports both numeric forms (such as &#239;) and text forms (such as &oacute; or &nbsp;). All entities as specified by HTML4 standard are supported.

Stripping should work with properly formed HTML and XHTML, but, just as most browsers, may produce unexpected results on malformed input (such as HTML with stray <’s or unclosed >’s).

Only the tags themselves, and also HTML comments, are stripped. To strip the contents of the tags too (eg. to strip embedded scripts), see html_remove_elements option. There are no restrictions on tag names; ie. everything that looks like a valid tag start, or end, or a comment will be stripped.

Example:

html_strip = 1

ignore_chars

Ignored characters list. Optional, default is empty.

Useful in the cases when some characters, such as the soft hyphenation mark (U+00AD), should not just be treated as separators but rather fully ignored. For example, if ‘-’ is simply not in the charset_table, “abc-def” text will be indexed as the “abc” and “def” keywords. On the contrary, if ‘-’ is added to the ignore_chars list, the same text will be indexed as a single “abcdef” keyword.

The syntax is the same as for charset_table, but it’s only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

Example:

ignore_chars = U+AD

index_exact_words

Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index).

When enabled, index_exact_words forces indexer to put the raw keywords in the index along with the stemmed versions. That, in turn, enables exact form operator in the query language to work. This impacts the index size and the indexing time. However, searching performance is not impacted at all.

Example:

index_exact_words = 1

index_field_lengths

Enables computing and storing of field lengths (both per-document and average per-index values) into the index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1, indexer will 1) create a respective length attribute for every full-text field, sharing the same name but with a __len suffix; 2) compute a field length (counted in keywords) for every document and store it into the respective attribute; 3) compute the per-index averages. The length attributes have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and they are generally accessible.

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. (We later realized that it should have been called BM15 from the start.) Also we added support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

Example:

index_field_lengths = 1

index_sp

Whether to detect and index sentence and paragraph boundaries. Optional, default is 0 (do not detect and index).

This directive enables sentence and paragraph boundary indexing. It’s required for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, so you only need to set index_sp = 1 to enable it. Paragraph detection is however based on HTML markup, and happens in the HTML stripper. So to index paragraph locations you also need to enable the stripper by specifying html_strip = 1. Both types of boundaries are detected based on a few built-in rules enumerated just below.

Sentence boundary detection rules are as follows.

  • Question and exclamation signs (? and !) are always a sentence boundary.
  • Trailing dot (.) is a sentence boundary, except:
    • When followed by a letter. That’s considered a part of an abbreviation (as in “S.T.A.L.K.E.R” or “Goldman Sachs S.p.A.”).
    • When followed by a comma. That’s considered an abbreviation followed by a comma (as in “Telecom Italia S.p.A., founded in 1994”).
    • When followed by a space and a small letter. That’s considered an abbreviation within a sentence (as in “News Corp. announced in February”).
    • When preceded by a space and a capital letter, and followed by a space. That’s considered a middle initial (as in “John D. Doe”).

Paragraph boundaries are inserted at every block-level HTML tag. Namely, those are (as taken from HTML 4 standard) ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.
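
A minimal sketch that enables both sentence and paragraph boundary indexing (paragraph detection requires the HTML stripper, as noted above):

html_strip = 1
index_sp   = 1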

Example:

index_sp = 1

index_token_filter

Index-time token filter for index. Optional, default is empty.

The index-time token filter is created by indexer when indexing source data into an index, or by an RT index when processing INSERT or REPLACE statements, and lets you implement a custom tokenizer that produces tokens according to custom rules. The plugin is defined as library name:plugin name:optional string of settings.

Example:

index_token_filter = my_lib.so:custom_blend:chars=@#&

index_zones

A list of in-field HTML/XML zones to index. Optional, default is empty (do not index zones).

Zones can be formally defined as follows. Everything between an opening and a matching closing tag is called a span, and the aggregate of all spans sharing the same tag name is called a zone. For instance, everything between the occurrences of <H1> and </H1> in the document field belongs to the H1 zone.

Zone indexing, enabled by index_zones directive, is an optional extension of the HTML stripper. So it will also require that the stripper is enabled (with html_strip = 1). The value of the index_zones should be a comma-separated list of those tag names and wildcards (ending with a star) that should be indexed as zones.

Zones can nest and overlap arbitrarily. The only requirement is that every opening tag has a matching closing tag. You can also have an arbitrary number of both zones (as in unique zone names, such as H1) and spans (all the occurrences of those H1 tags) in a document. Once indexed, zones can then be used for matching with the ZONE operator, see Extended query syntax.
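
A minimal sketch pairing the zone list with the HTML stripper it depends on (the zone names reuse the example below):

html_strip  = 1
index_zones = h*, th, title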

Example:

index_zones = h*, th, title

infix_fields

The list of full-text fields to limit infix indexing to. Applies to dict=crc only. Optional, default is empty (index all fields in infix mode).

Similar to prefix_fields, but lets you limit infix-indexing to given fields.

Example:

infix_fields = url, domain

inplace_docinfo_gap

In-place inversion fine-tuning option. Controls preallocated docinfo gap size. Optional, default is 0.

This directive does not affect searchd in any way, it only affects indexer.

Example:

inplace_docinfo_gap = 1M

inplace_enable

Whether to enable in-place index inversion. Optional, default is 0 (use separate temporary files).

inplace_enable greatly reduces indexing disk footprint, at a cost of slightly slower indexing (it uses around 2x less disk, but yields around 90-95% the original performance).

Indexing involves two major phases. The first phase collects, processes, and partially sorts documents by keyword, and writes the intermediate result to temporary files (.tmp*). The second phase fully sorts the documents, and creates the final index files. Thus, rebuilding a production index on the fly involves around 3x peak disk footprint: a 1st copy for the intermediate temporary files, a 2nd copy for the newly constructed index, and a 3rd copy for the old index that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final index.) That might be too much disk footprint for big data collections, and inplace_enable lets you reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them on completion. However, this might require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive does not affect searchd in any way, it only affects indexer.

Example:

inplace_enable = 1

inplace_hit_gap

In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive does not affect searchd in any way, it only affects indexer.

Example:

inplace_hit_gap = 1M

inplace_reloc_factor

In-place inversion fine-tuning option. Controls the relocation buffer size within the indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

Example:

inplace_reloc_factor = 0.1

inplace_write_factor

In-place inversion fine-tuning option. Controls the in-place write buffer size within the indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

Example:

inplace_write_factor = 0.1

local

Local index declaration in the distributed index. Multi-value, optional, default is empty.

This setting is used to declare the local indexes that will be searched when the given distributed index is searched. Multiple local indexes can be declared per distributed index. Any local index can also be mentioned several times in different distributed indexes.

Note that by default all local indexes will be searched sequentially, utilizing only 1 CPU or core. To parallelize processing of the local parts in the distributed index, you should use dist_threads.

Before dist_threads, there also was a legacy solution to configure searchd to query itself instead of using local indexes (refer to agent for the details). However, that creates redundant CPU and network load, and dist_threads is now strongly suggested instead.
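
A minimal sketch, assuming dist_threads is set in the searchd section (the index name and thread count are illustrative) so that the two local parts below can be searched in parallel:

searchd
{
    dist_threads = 4
}

index dist1
{
    type  = distributed
    local = chunk1
    local = chunk2
}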

Example:

local = chunk1
local = chunk2

max_substring_len

Maximum substring (either prefix or infix) length to index. Optional, default is 0 (do not limit indexed substrings). Applies to dict=crc only.

By default, substring (either prefix or infix) indexing in the dict=crc mode will index all the possible substrings as separate keywords. That might result in an overly large index. So the max_substring_len directive lets you limit the impact of substring indexing by skipping too-long substrings (which, chances are, will never get searched for anyway).

For example, a test index of 10,000 blog posts takes this much disk space depending on the settings:

  • 6.4 MB baseline (no substrings)
  • 24.3 MB (3.8x) with min_prefix_len = 3
  • 22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
  • 19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
  • 94.3 MB (14.7x) with min_infix_len = 3
  • 84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
  • 70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6

So in this test, limiting the max substring length saved us roughly 9 to 25% on the index size, depending on the limit.

There is no performance impact associated with substring length when using dict=keywords mode, so this directive is not applicable and intentionally forbidden in that case. If required, you can still limit the length of a substring that you search for in the application code.
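
As a sketch, one of the combinations from the test above could be configured like this (dict = crc is spelled out since the directive applies to that mode only):

dict              = crc
min_prefix_len    = 3
max_substring_len = 8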

Example:

max_substring_len = 12

min_infix_len

Minimum infix length to index and search. Optional, default is 0 (do not index infixes), and the minimum allowed non-zero value is 2.

The infix length setting enables wildcard searches with term patterns like ‘start*’, ‘*end’, ‘*middle*’, and so on. It also lets you disable wildcards that are too short, if those are too expensive to search for.

Perfect word matches can be differentiated from infix matches, and ranked higher, by utilizing all of the following options: a) dict=keywords (on by default), b) index_exact_words=1 (off by default), and c) expand_keywords=1 (also off by default). Note that either with the legacy dict=crc mode (which you should ditch anyway!), or with any of the above options disabled, there is no data to differentiate between the infixes and full words, and thus perfect word matches can’t be ranked higher.

However, query time might vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like ‘in’ or ‘ti’ just might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches you would set min_infix_len to 2; and to limit the impact from wildcard searches with too short wildcards, you might set it higher.

Infixes must be at least 2 characters long; wildcards like ‘*a*’ are not allowed for performance reasons. (While in theory it is possible to scan the entire dictionary, identify keywords matching on just a single character, expand ‘*a*’ to an OR operator over 100,000+ keywords, and evaluate that expanded query, in practice this will very definitely kill your server.)

When the minimum infix length is set to a positive number, the minimum prefix length is considered to be 1.

Example:

min_infix_len = 3

min_prefix_len

Minimum word prefix length to index. Optional, default is 0 (do not index prefixes).

Prefix indexing lets you implement wildcard searching with ‘wordstart*’ wildcards. When the minimum prefix length is set to a positive number, indexer will index all possible keyword prefixes (ie. word beginnings) in addition to the keywords themselves. Prefixes shorter than the minimum allowed length will not be indexed.

For instance, indexing the keyword “example” with min_prefix_len=3 will result in indexing the “exa”, “exam”, “examp”, and “exampl” prefixes along with the word itself. Searches against such an index for “exam” will match documents that contain the word “example”, even if they do not contain “exam” as a standalone word. However, indexing prefixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.

Perfect word matches can be differentiated from prefix matches, and ranked higher, by utilizing all of the following options: a) dict=keywords (on by default), b) index_exact_words=1 (off by default), and c) expand_keywords=1 (also off by default). Note that either with the legacy dict=crc mode (which you should ditch anyway!), or with any of the above options disabled, there is no data to differentiate between the prefixes and full words, and thus perfect word matches can’t be ranked higher.

When the minimum infix length is set to a positive number, the minimum prefix length is considered to be 1.

Example:

min_prefix_len = 3

min_stemming_len

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running the “gps” keyword through the Porter stemmer for English results in “gp”, which is not really the intent. The min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too-short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For finer-grained control, refer to the wordforms feature.

Example:

min_stemming_len = 4

min_word_len

Minimum indexed word length. Optional, default is 1 (index everything).

Only those words that are not shorter than this minimum will be indexed. For instance, if min_word_len is 4, then ‘the’ won’t be indexed, but ‘they’ will be.

Example:

min_word_len = 4

mirror_retry_count

Same as agent_retry_count. If both values are provided, mirror_retry_count takes precedence, and a warning about it is issued.
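
An illustrative setting (the value simply mirrors the retry_count=2 per-agent option shown earlier):

mirror_retry_count = 2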

mlock

Memory locking for cached data. Optional, default is 0 (do not call mlock()).

For search performance, searchd preloads a copy of .spa and .spi files in RAM, and keeps that copy in RAM at all times. But if there are no searches on the index for some time, there are no accesses to that cached copy, and OS might decide to swap it out to disk. First queries to such “cooled down” index will cause swap-in and their latency will suffer.

Setting mlock option to 1 makes Manticore lock physical RAM used for that cached data using mlock(2) system call, and that prevents swapping (see man 2 mlock for details). mlock(2) is a privileged call, so it will require searchd to be either run from root account, or be granted enough privileges otherwise. If mlock() fails, a warning is emitted, but index continues working.

Example:

mlock = 1

morphology

A list of morphology preprocessors (stemmers or lemmatizers) to apply. Optional, default is empty (do not apply any preprocessor).

Morphology preprocessors can be applied to the words being indexed to replace different forms of the same word with the base, normalized form. For instance, English stemmer will normalize both “dogs” and “dog” to “dog”, making search results for both searches the same.

There are 3 different morphology preprocessors that Manticore implements: lemmatizers, stemmers, and phonetic algorithms.

  • Lemmatizer reduces a keyword form to a so-called lemma, a proper normal form, or in other words, a valid natural language root word. For example, “running” could be reduced to “run”, the infinitive verb form, and “octopi” would be reduced to “octopus”, the singular noun form. Note that sometimes a word form can have multiple corresponding root words. For instance, by looking at “dove” it is not possible to tell whether this is a past tense of “dive” the verb as in “He dove into a pool.”, or “dove” the noun as in “White dove flew over the cuckoo’s nest.” In this case lemmatizer can generate all the possible root forms.
  • Stemmer reduces a keyword form to a so-called stem by removing and/or replacing certain well-known suffixes. The resulting stem is however not guaranteed to be a valid word on its own. For instance, with the Porter English stemmer “running” would still reduce to “run”, which is fine, but “business” would reduce to “busi”, which is not a word, and “octopi” would not reduce at all. Stemmers are essentially (much) simpler but still pretty good replacements for full-blown lemmatizers.
  • Phonetic algorithms replace the words with specially crafted phonetic codes that are equal even when the original words are different but phonetically close.

The morphology processors that come with our own built-in Manticore implementations are:

  • English, Russian, and German lemmatizers;
  • English, Russian, Arabic, and Czech stemmers;
  • SoundEx and MetaPhone phonetic algorithms.

You can also link with libstemmer library for even more stemmers (see details below). With libstemmer, Manticore also supports morphological processing for more than 15 other languages. Binary packages should come prebuilt with libstemmer support, too.

Lemmatizers require a dictionary that needs to be additionally downloaded from the Manticore website. That dictionary needs to be installed in a directory specified by lemmatizer_base directive. Also, there is a lemmatizer_cache directive that lets you speed up lemmatizing (and therefore indexing) by spending more RAM for, basically, an uncompressed cache of a dictionary.

Chinese segmentation using the Rosette Linguistics Platform is also available. It is a much more precise, but slower, way (compared to n-grams) to segment Chinese documents. charset_table must contain all Chinese characters except Chinese punctuation marks, because incoming documents are first processed by the sphinx tokenizer and then the result is processed by RLP. Manticore performs per-token language detection on the incoming documents. If a token's language is identified as Chinese, it will only be processed by the RLP, even if multiple morphology processors are specified. Otherwise, it will be processed by all the morphology processors specified in the “morphology” option. The Rosette Linguistics Platform must be installed and configured, and sphinx must be built with the --with-rlp switch. See also the rlp_root, rlp_environment and rlp_context options.

A batched version of RLP segmentation is also available (rlp_chinese_batched). It provides the same functionality as the basic rlp_chinese segmentation, but enables batching documents before processing them by the RLP. Processing several documents at once can result in a substantial indexing speedup if the documents are small (for example, less than 1k). See also the rlp_max_batch_size and rlp_max_batch_docs options.

Additional stemmers provided by the Snowball project’s libstemmer library can be enabled at compile time using the --with-libstemmer configure option. The built-in English and Russian stemmers should be faster than their libstemmer counterparts, but can produce slightly different results, because they are based on an older version of the algorithms.

The Soundex implementation matches that of MySQL. The Metaphone implementation is based on the Double Metaphone algorithm and indexes the primary code.

The built-in values that are available for use in the morphology option are as follows:

  • none - do not perform any morphology processing;
  • lemmatize_ru - apply Russian lemmatizer and pick a single root form;
  • lemmatize_en - apply English lemmatizer and pick a single root form;
  • lemmatize_de - apply German lemmatizer and pick a single root form;
  • lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms;
  • lemmatize_en_all - apply English lemmatizer and index all possible root forms;
  • lemmatize_de_all - apply German lemmatizer and index all possible root forms;
  • stem_en - apply Porter’s English stemmer;
  • stem_ru - apply Porter’s Russian stemmer;
  • stem_enru - apply Porter’s English and Russian stemmers;
  • stem_cz - apply Czech stemmer;
  • stem_ar - apply Arabic stemmer;
  • soundex - replace keywords with their SOUNDEX code;
  • metaphone - replace keywords with their METAPHONE code;
  • rlp_chinese - apply Chinese text segmentation using Rosette Linguistics Platform;
  • rlp_chinese_batched - apply Chinese text segmentation using Rosette Linguistics Platform with document batching.

Additional values provided by libstemmer are in ‘libstemmer_XXX’ format, where XXX is the libstemmer algorithm codename (refer to libstemmer_c/libstemmer/modules.txt for a complete list).

Several stemmers can be specified (comma-separated). They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers actually modifies the word. Also, when the wordforms feature is enabled, the word will be looked up in the word forms dictionary first; if there is a matching entry in the dictionary, stemmers will not be applied at all. In other words, wordforms can be used to implement stemming exceptions.

Example:

morphology = stem_en, libstemmer_sv
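
To check what a given morphology setup actually produces, you can run keywords through the index tokenizer with the SphinxQL CALL KEYWORDS statement, which reports both the tokenized and the normalized form of each keyword. A minimal sketch, assuming a hypothetical index named products configured with the morphology above:

CALL KEYWORDS('running businesses', 'products');
# returns one row per keyword, showing its tokenized and normalized (stemmed) form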

morphology_skip_fields

A list of fields to which morphology preprocessors do not apply. Optional, default is empty (apply preprocessors to all fields).

Used at indexing time: only the exact forms of the words get stored for the listed fields.

Example:

morphology_skip_fields = tags, name

ngram_chars

N-gram characters list. Optional, default is empty.

To be used in conjunction with ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.

Example:

ngram_chars = U+3000..U+2FA1F

ngram_len

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1 (other lengths to be implemented).

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there could be no clear separators between the words. Ideally, the texts would be filtered through a special program called segmenter that would insert separators in proper locations. However, segmenters are slow and error prone, and it’s common to index contiguous groups of N characters, or n-grams, instead.

When this feature is enabled, streams of CJK characters are indexed as N-grams. For example, if the incoming text is “ABCDEF” (where A to F represent some CJK characters) and length is 1, it will be indexed as if it was “A B C D E F”. (With length equal to 2, it would produce “AB BC CD DE EF”; but only 1 is supported at the moment.) Only those characters that are listed in the ngram_chars table will be split this way; other ones will not be affected.

Note that if the search query is segmented, ie. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like “BC” “DEF” (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in “B C” “D E F” query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.
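
As a hedged SphinxQL illustration of the application-side quoting described above (the index name cjk_index is hypothetical), the segmented query could be issued as:

SELECT id FROM cjk_index WHERE MATCH('"BC" "DEF"');
# each quoted group is split into 1-grams internally and matched as a phrase,
# so this also matches a document containing the unsegmented text ABCDEF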

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.

Example:

ngram_len = 1

ondisk_attrs

Allows for fine-grained control over how attributes are loaded into memory when using indexes with external storage. It is possible to keep attributes on disk: the daemon memory-maps them, and the OS loads small chunks of data on demand. This allows use of docinfo = extern instead of docinfo = inline, but still leaves plenty of free memory for cases when you have large collections of pooled attributes (string/JSON/MVA), or when you’re using many indexes per daemon and want them not to consume memory. It is not possible to update attributes left on disk when this option is enabled, and the constraint of 4 GB of entries per pool is still in effect.

Note that this option also affects RT indexes. When it is enabled, all attribute updates will be disabled, and all disk chunks of RT indexes will behave as described above. However, inserting documents into and deleting them from RT indexes is still possible with ondisk_attrs enabled.

Possible values:

  • 0 - disabled (the default): all attributes are loaded into memory (the normal behaviour of docinfo = extern);
  • 1 - all attributes stay on disk. The daemon loads no attribute files (spa, spm, sps). This is the most memory-conserving mode; however, it is also the slowest, as the whole document-id list and block index are not loaded into memory;
  • pool - only pooled attributes stay on disk. Pooled attributes are string, MVA, and JSON attributes (sps, spm files). Scalar attributes stored in docinfo (the spa file) are loaded as usual.

This option does not affect indexing in any way; changing it only requires a daemon restart.

Example:

ondisk_attrs = pool # keep pooled attributes on disk

overshort_step

Position increment on overshort (shorter than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

This directive does not affect searchd in any way; it only affects indexer.

Example:

overshort_step = 1

path

Index files path and file name (without extension). Mandatory.

Path specifies both directory and file name, but without extension. indexer will append different extensions to this path when generating final names for both permanent and temporary index files. Permanent data files have several different extensions starting with ‘.sp’; temporary files’ extensions start with ‘.tmp’. It’s safe to remove .tmp* files if indexer fails to remove them automatically.

For reference, different index files store the following data:

  • .spa stores document attributes (used in extern docinfo storage mode only);
  • .spd stores matching document ID lists for each word ID;
  • .sph stores index header information;
  • .spi stores word lists (word IDs and pointers to .spd file);
  • .spk stores kill-lists;
  • .spm stores MVA data;
  • .spp stores hit (aka posting, aka word occurrence) lists for each word ID;
  • .sps stores string attribute data;
  • .spe stores skip-lists to speed up doc-list filtering.

Example:

path = /var/data/test1

phrase_boundary

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table. Mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

Phrase boundary condition will be raised if and only if such a character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.

Example:

phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
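
With the boundary characters above and, say, phrase_boundary_step = 100 (see the next directive), phrase-level matching can be emulated with a proximity query. A hedged SphinxQL sketch, assuming a hypothetical index named docs:

SELECT id FROM docs WHERE MATCH('"department store"~10');
# the two keywords can only end up within 10 positions of each other if no phrase
# boundary (and therefore no +100 position increment) occurred between them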

phrase_boundary_step

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number. See phrase_boundary for details.

Example:

phrase_boundary_step = 100

prefix_fields

The list of full-text fields to limit prefix indexing to. Applies to dict=crc only. Optional, default is empty (index all fields in prefix mode).

Because prefix indexing impacts both indexing and searching performance, it might be desired to limit it to specific full-text fields only: for instance, to provide prefix searching through URLs, but not through page contents. prefix_fields specifies what fields will be prefix-indexed; all other fields will be indexed in normal mode. The value format is a comma-separated list of field names.

Example:

prefix_fields = url, domain

preopen

Whether to pre-open all index files, or open them per each query. Optional, default is 0 (do not preopen).

This option tells searchd that it should pre-open all index files on startup (or rotation) and keep them open while it runs. Currently, the default mode is not to pre-open the files (this may change in the future). Preopened indexes take a few (currently 2) file descriptors per index. However, they save on per-query open() calls; and they are also invulnerable to subtle race conditions that may happen during index rotation under high load. On the other hand, when serving many indexes (100s to 1000s), it still might be desired to open them on a per-query basis in order to save file descriptors.

This directive does not affect indexer in any way; it only affects searchd.

Example:

preopen = 1

regexp_filter

Regular expressions (regexps) to filter the fields and queries with. Optional, multi-value, default is an empty list of regexps.

In certain applications (like product search) there can be many different ways to call a model, or a product, or a property, and so on. For instance, ‘iphone 3gs’ and ‘iphone 3 gs’ (or even ‘iphone3 gs’) are very likely to mean the same product. Or, for a trickier example, ‘13-inch’, ‘13 inch’, ‘13"’, and ‘13in’ in laptop screen size descriptions all mean the same thing.

Regexps provide you with a mechanism to specify a number of rules specific to your application to handle such cases. In the first ‘iphone 3gs’ example, you could possibly get away with a wordforms file tailored to handle a handful of iPhone models. However, even in the comparatively simple second ‘13-inch’ example there are just way too many individual forms, and you are better off specifying rules that would normalize both ‘13-inch’ and ‘13in’ to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed. That happens at the earliest stage possible, before any other processing, even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

We use the RE2 engine to implement regexps. So when building from source, the RE2 library must be installed in the system and Manticore must be built with the --with-re2 switch. Binary packages should come with RE2 built in.

Example:

# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

rlp_context

RLP context configuration file. Mandatory if RLP is used.

Example:

rlp_context = /home/myuser/RLP/rlp-context.xml

rt_attr_bigint

BIGINT attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional. Declares a signed 64-bit attribute.

Example:

rt_attr_bigint = guid

rt_attr_bool

Boolean attribute declaration. Multi-value (there might be multiple attributes declared), optional. Declares a 1-bit unsigned integer attribute.

Example:

rt_attr_bool = available

rt_attr_float

Floating point attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional. Declares a single precision, 32-bit IEEE 754 format float attribute.

Example:

rt_attr_float = gpa

rt_attr_json

JSON attribute declaration. Multi-value (ie. there may be more than one such attribute declared), optional.

Refer to sql_attr_json for more details on the JSON attributes.

Example:

rt_attr_json = properties

rt_attr_multi_64

Multi-valued attribute (MVA) declaration. Declares the BIGINT (signed 64-bit) MVA attribute. Multi-value (ie. there may be more than one such attribute declared), optional. Applies to RT indexes only.

Example:

rt_attr_multi_64 = my_wide_tags

rt_attr_multi

Multi-valued attribute (MVA) declaration. Declares the UNSIGNED INTEGER (unsigned 32-bit) MVA attribute. Multi-value (ie. there may be more than one such attribute declared), optional. Applies to RT indexes only.

Example:

rt_attr_multi = my_tags

rt_attr_string

String attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Example:

rt_attr_string = author

rt_attr_timestamp

Timestamp attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Example:

rt_attr_timestamp = date_added

rt_attr_uint

Unsigned integer attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional. Declares an unsigned 32-bit attribute.

Example:

rt_attr_uint = gid

rt_field

Full-text field declaration. Multi-value, mandatory.

Full-text fields to be indexed are declared using the rt_field directive. The names must be unique. The order is preserved, so field values in INSERT statements without an explicit list of inserted columns will have to be in the same order as configured.

Example:

rt_field = author
rt_field = title
rt_field = content
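
A hedged SphinxQL sketch of an INSERT without an explicit column list, assuming a hypothetical RT index named rt1 declared with exactly the three fields above and no attributes; the values must follow the schema order, ie. the document ID first, then the fields as configured:

INSERT INTO rt1 VALUES (1, 'J. R. R. Tolkien', 'The Hobbit', 'In a hole in the ground there lived a hobbit.');
# 1 is the document ID; the remaining values map to author, title and content, in that order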

rt_mem_limit

RAM chunk size limit. Optional, default is 128M.

RT index keeps some data in memory (so-called RAM chunk) and also maintains a number of on-disk indexes (so-called disk chunks). This directive lets you control the RAM chunk size. Once there’s too much data to keep in RAM, RT index will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

The limit is pretty strict; the RT index should never allocate more memory than it is limited to. The memory is not preallocated either; hence, specifying a 512 MB limit and only inserting 3 MB of data should result in allocating 3 MB, not 512 MB.

Example:

rt_mem_limit = 512M

source

Adds document source to local index. Multi-value, mandatory.

Specifies the document source to get documents from when the current index is indexed. There must be at least one source. There may be multiple sources, without any restrictions on the source types: ie. you can pull part of the data from a MySQL server, part from PostgreSQL, and part from the filesystem using the xmlpipe2 wrapper.

However, there are some restrictions on the source data. First, document IDs must be globally unique across all sources. If that condition is not met, you might get unexpected search results. Second, source schemas must be the same in order to be stored within the same index.

No source ID is stored automatically. Therefore, in order to be able to tell what source the matched document came from, you will need to store some additional information yourself. Two typical approaches include:

  1. mangling document ID and encoding source ID in it:
source src1
{
    sql_query = SELECT id*10+1, ... FROM table1
    ...
}

source src2
{
    sql_query = SELECT id*10+2, ... FROM table2
    ...
}
  2. storing source ID simply as an attribute:
source src1
{
    sql_query = SELECT id, 1 AS source_id FROM table1
    sql_attr_uint = source_id
    ...
}

source src2
{
    sql_query = SELECT id, 2 AS source_id FROM table2
    sql_attr_uint = source_id
    ...
}

Example:

source = srcpart1
source = srcpart2
source = srcpart3

stopwords

Stopword files list (space separated). Optional, default is empty.

Stopwords are the words that will not be indexed. Typically you’d put the most frequent words in the stopwords list, because they do not add much value to search results but consume a lot of resources to process.

You can specify several file names, separated by spaces. All the files will be loaded. Stopwords file format is simple plain text. The encoding must be UTF-8. File data will be tokenized with respect to charset_table settings, so you can use the same separators as in the indexed data.

The stemmers will normally be applied when parsing the stopwords file. That might, however, lead to undesired results. You can turn that off with stopwords_unstemmed.

Small enough files are stored in the index header, see embedded_limit for details.

While stopwords are not indexed, they still do affect the keyword positions. For instance, assume that “the” is a stopword, that document 1 contains the line “in office”, and that document 2 contains “in the office”. Searching for “in office” as for exact phrase will only return the first document, as expected, even though “the” in the second one is stopped. That behavior can be tweaked through the stopword_step directive.
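
A hedged SphinxQL illustration of that behaviour (the index name docs is hypothetical): with “the” as a stopword and the default stopword_step = 1, the stopped word still occupies a position, so the phrase query below matches “in office” but not “in the office”.

SELECT id FROM docs WHERE MATCH('"in office"');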

Stopwords files can either be created manually, or semi-automatically. indexer provides a mode that creates a frequency dictionary of the index, sorted by the keyword frequency; see the --buildstops and --buildfreqs switches in the indexer command reference. Top keywords from that dictionary can usually be used as stopwords.
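
For example, an indexer invocation along these lines (the index and output file names are illustrative) would write the 1000 most frequent keywords, together with their occurrence counts, to word_freq.txt, which can then be trimmed down into a stopwords file:

indexer myindex --buildstops word_freq.txt 1000 --buildfreqs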

Example:

stopwords = /usr/local/sphinx/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt

stopword_step

Position increment on stopwords. Optional, allowed values are 0 and 1, default is 1.

This directive does not affect searchd in any way; it only affects indexer.

Example:

stopword_step = 1

stopwords_unstemmed

Whether to apply stopwords before or after stemming. Optional, default is 0 (apply stopword filter after stemming).

By default, stopwords are stemmed themselves, and applied to tokens after stemming (or any other morphology processing). In other words, by default, a token is stopped when stem(token) == stem(stopword). That can lead to unexpected results when a token gets (erroneously) stemmed to a stopped root. For example, ‘Andes’ gets stemmed to ‘and’ by our current stemmer implementation, so when ‘and’ is a stopword, ‘Andes’ is also stopped.

The stopwords_unstemmed directive fixes that issue. When it is enabled, stopwords are applied before stemming (and therefore to the original word forms), and the tokens are stopped when token == stopword.

Example:

stopwords_unstemmed = 1

type

Index type. Known values are plain, distributed, rt, template and percolate. Optional, default is ‘plain’ (plain local index).

Manticore supports several different types of indexes. Plain local indexes are stored and processed on the local machine. Distributed indexes involve not only local searching but querying remote searchd instances over the network as well (see Distributed searching). Real-time indexes (or RT indexes for short) are also stored and processed locally, but additionally allow for on-the-fly updates of the full-text index (see Real-time indexes). Note that attributes can be updated on-the-fly using either plain local indexes or RT ones. Template indexes are actually pseudo-indexes, because they do not store any data and therefore do not create any files on your hard drive. You can still use them for keyword and snippet generation, which may be useful in some cases, and also as templates to inherit real indexes from.

Index type setting lets you choose the needed type. By default, plain local index type will be assumed.

Example:

type = distributed

wordforms

Word forms dictionary. Optional, default is empty.

Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as “walks”, “walked”, “walking” to the normal form “walk”). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

Small enough files are stored in the index header, see embedded_limit for details.

Dictionaries are used to normalize incoming words both during indexing and searching. Therefore, to pick up changes in the wordforms file it is required to rotate the index.

Word forms support in Manticore is designed to support big dictionaries well. They moderately affect indexing speed: for instance, a dictionary with 1 million entries slows down indexing about 1.5 times. Searching speed is not affected at all. Additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across indexes: ie. if the very same 50 MB wordforms file is specified for 10 different indexes, additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by a “greater than” sign. Rules from the charset_table will be applied when the file is loaded. So, basically, it is as case sensitive as your other full-text indexed data, ie. typically case insensitive. Here’s a sample of the file contents:

walks > walk
walked > walk
walking > walk

There is a bundled spelldump utility that helps you create a dictionary file in the format Manticore can read, from source .dict and .aff dictionary files in ispell or MySpell format (as bundled with OpenOffice).

You can map several source words to a single destination word. Because the work happens on tokens, not the source text, differences in whitespace and markup are ignored.

You can use “=>” instead of “>”. Comments (starting with “#”) are also allowed. Finally, if a line starts with a tilde (“~”), the wordform will be applied after morphology, instead of before.

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
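
A hedged example of the tilde prefix, assuming an English stemmer or lemmatizer is enabled in morphology (so “running”, “runs” and “run” are all normalized to “run” before the lookup):

~run > walk # applied after morphology: any form that normalizes to 'run' becomes 'walk'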

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

Example:

wordforms = /usr/local/sphinx/data/wordforms.txt
wordforms = /usr/local/sphinx/data/alternateforms.txt
wordforms = /usr/local/sphinx/private/dict*.txt

You can specify several files, not only just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order. (If multi-byte codepages are used, and file names can include foreign characters, the resulting order may not be exactly alphabetic.) If the same wordform definition is found in several files, the last one is used, and it overrides previous definitions.