Introduction
InfiSearch is a client-side search solution made for static sites, including a search UI and library depending on a pre-built index generated by a CLI tool.
Features
- Relevant Search: spelling correction, automatic prefix search, boolean and phrase queries, BM25 scoring, proximity scoring, facet filters and more.
- Speedy: WebAssembly & WebWorker powered, enabling efficient, non-blocking query processing. Also includes persistent caching to minimize network requests, and a multi-threaded CLI indexer powered by Rust.
- Semi-Scalable, achieved by optionally splitting the index into tiny morsels, complete with incremental indexing.
- A customisable, accessible user interface
- Support for multiple file formats (.json, .csv, .pdf, .html) to satisfy more custom data requirements.
Search Features
A little more about some of InfiSearch's search features.
Blazing Fast
Powered by WebAssembly and WebWorkers, InfiSearch blazes through searches on tens of thousands of documents. Index downloads are persistently cached using the Cache API that backs service workers, but without the setup hassle of a service worker. Users will never download the same data twice.
Some efficient, high-return compression schemes are also employed, so you get all these features without much penalty. This documentation for example, which has all features enabled, generates a main index file of just 20KB, and a dictionary of 9KB.
Scalable
A monolithic index is built by default to reduce network latency, which suffices for 90% of use cases. But, you also have the option of splitting up the index so users retrieve only what's necessary, greatly improving client-side search scalability.
Ranking Model & Query Refinement
InfiSearch adopts industry standard scoring schemes. Queries are first ranked using the BM25 model, then a soft disjunctive maximum of the document's field scores is taken. By default, <title>, <h1>, <h2-6>, then other texts are indexed as four separate fields.
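As a rough illustration of the last scoring step, a soft disjunctive maximum over per-field scores could look like the sketch below. The function name, the `tieBreaker` parameter and its value are assumptions made for this example; the exact formula InfiSearch uses is not given here.

```javascript
// Illustrative only: one common "soft" dis-max formulation, in which the
// best-scoring field dominates and the remaining fields add a fraction.
// `tieBreaker` is a hypothetical parameter, not an InfiSearch option.
function softDisMax(fieldScores, tieBreaker = 0.5) {
  const scores = Object.values(fieldScores);
  if (scores.length === 0) return 0;
  const best = Math.max(...scores);
  const rest = scores.reduce((sum, s) => sum + s, 0) - best;
  return best + tieBreaker * rest;
}

// Made-up per-field BM25 scores for a single document.
console.log(softDisMax({ title: 4, h1: 0, heading: 1, body: 2 })); // 5.5
```

With a tie breaker of 0 this degenerates to a plain maximum; with 1 it becomes a plain sum.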
Query term proximity ranking is InfiSearch's highlight here, and is enabled by default. Results are scaled according to how close search expressions are to one another, greatly improving contextuality of searches.
InfiSearch also gives searchers a powerful boolean query syntax, made known to them through an advanced search tips icon. You also have the option of setting up custom facet filters such as multi-select checkboxes, numeric filters, and date time filters for ease of use.
How it Works:
InfiSearch depends on a static, pre-built index that is a collection of various files.
- The CLI indexer tool first generates:
  - Binary index chunk(s)
  - JSON field store(s) containing raw document texts
  - Supporting metadata, for example the search dictionary
- The search UI:
  - Figures out which index files are needed from the query
  - Retrieves the files from cache/memory/network requests
  - Obtains and ranks the result set
  - Lastly, retrieves field stores from cache/memory/network requests progressively to generate result previews
Getting Started
This page assumes the use case of a static site, that is:
- You have some HTML files you want to index.
- These HTML files are served by a static file server, and are linkable to.
- You have an <input> element for attaching a search dropdown.
Installing the indexer
There are a couple of options for installing the indexer:
- Install the global npm package with npm install -g @infisearch/cli.
- If you have the rust / cargo toolchains set up, run cargo install infisearch --vers 0.10.1.
- You can also grab the cli binaries here.
Running the indexer
Run the executable as such, replacing <source-folder-path>
with the relative or absolute folder path of your source html files, and <output-folder-path>
with your desired index output folder.
infisearch <source-folder-path> <output-folder-path>
If you are using the binaries, replace infisearch
with the appropriate executable name.
Other CLI Options
- -c <config-file-path>: You may also change the config file location (relative to the source-folder-path) using the -c <config-file-path> option.
- --preserve-output-folder: All existing contents in the output folder are removed before starting. Specify this option to avoid this.
Installing the search UI
Installation via CDN
<!-- Replace "v0.10.1" as appropriate -->
<!-- Search UI script -->
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii.bundle.js"></script>
<!-- Search UI css, this provides some basic styling for the search dropdown, and can be omitted if desired -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-light.css" />
⚠️ Ensure the linked versions match the indexer version used exactly.
Hosting the Files
If you wish to host the files, you can find them in the <output-folder-path>/assets
directory generated by the indexer. Using these guarantees that you will always be using the same indexer and search UI versions.
The folder contains:
- A pair of language-specific files that should be served from the same folder:
  - search-ui.*.bundle.js, the default is search-ui.ascii.bundle.js
  - An accompanying WebAssembly binary
- A stylesheet: search-ui-basic/light/dark.css
The same files are also in the release packages here, inside search.infi.zip
.
UI Initialisation
Once you have loaded the bundles, simply call the infisearch.init
function in your page.
This requires an input element with an id=infi-search
to be present in the page by default. The id
can be configured via uiOptions.input
.
infisearch.init({
searcherOptions: {
// Output folder url specified as the second parameter in the cli command
// Urls like '/output/' will work as well
url: 'http://<your-domain>/output/',
},
uiOptions: {
// Input / source folder url, specified as the first parameter in the cli command
sourceFilesUrl: 'http://<your-domain>/source/',
input: 'infi-search',
}
});
mdbook-infisearch
mdbook-infisearch
is a simple search plugin replacement for mdBook to use InfiSearch's search interface and library instead of elasticlunr.js.
What, why?
MdBook already has its own built-in search function utilising elasticlunr, which works well enough for most cases. This plugin was mainly created as:
- A proof-of-concept to integrate InfiSearch with other static site generators easily
- A personal means to set up document deployment workflows in CI scripts
You may nonetheless want to use this plugin if you need InfiSearch's extra features. Some examples:
- you require PDF file support, or JSON file support to link to out-of-domain pages.
- spelling correction, automatic prefix search, term proximity ranking, etc.
Styling
This plugin uses the css variables provided by the 5 main default themes in mdBook to style the search user interface. Switch the themes in this documentation to try out the different themes!
Note: The default InfiSearch theme is not included in the plugin. To see the default styling, head on over to the styling page or view the demo site.
Installation
Install the executable either using cargo install mdbook-infisearch
, or download and add the binaries to your PATH
manually.
Then, minimally add the first two configuration sections below to your book.toml
configuration file:
[output.html.search]
# disable the default mdBook search feature implementation
enable = false
[preprocessor.infisearch]
command = "mdbook-infisearch"
[output.infisearch] # this header should be added
# Plugin configuration options (optional)
# See search configuration page, or use the buttons below
mode = "target"
# Relative path to an InfiSearch indexer configuration file from the project directory.
#
# If you are creating this for the first time, let this point to a non-existent file
# and the config file will be created with InfiSearch's settings tailored for mdBook.
config = "infi_search.json"
Preview
Use the following (non-canonical, documentation specific) buttons to try out the different mode
parameters.
You can also try out the different themes on this documentation using mdBook's paintbrush icon!
Content Security Policy
WebAssembly CSP
InfiSearch runs using WebAssembly. If you are using a restrictive content security policy, WebAssembly as a whole unfortunately currently requires adding the script-src 'unsafe-eval';
directive.
This error will show up in Chrome, for example, as the following extremely detailed error message:
Uncaught (in promise) CompileError: WebAssembly.instantiateStreaming(): Refused to compile or instantiate WebAssembly module because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "…"
Support for a more specific script-src 'wasm-unsafe-eval';
directive has landed in Chrome, Edge and Firefox, but is still pending in Safari.
WebWorker CSP
InfiSearch also utilises a blob URL to load its WebWorker. This shouldn't pose as much of a security concern since blob URLs can only be created by scripts already executing within the browser.
To whitelist this, add the script-src blob:;
directive.
CDN CSP
Naturally, if you load InfiSearch assets from the CDN, you will also need to whitelist this in the script-src cdn.jsdelivr.net;
and style-src cdn.jsdelivr.net;
directives.
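To show how the three requirements above combine, here is a minimal sketch that assembles a single policy string. The function and option names are hypothetical, and the `'self'` sources are an assumption about a typical site rather than something InfiSearch requires. In a policy, each directive name is followed by a space-separated list of sources, and directives are separated by semicolons.

```javascript
// Hypothetical helper: builds a Content-Security-Policy value covering
// WebAssembly, the blob: WebWorker, and (optionally) the jsDelivr CDN.
function buildCsp({ useCdn = false, wasmUnsafeEval = true } = {}) {
  const scriptSrc = [
    "'self'", // assumption: your own scripts are same-origin
    wasmUnsafeEval ? "'wasm-unsafe-eval'" : "'unsafe-eval'",
    'blob:',
  ];
  const styleSrc = ["'self'"];
  if (useCdn) {
    scriptSrc.push('cdn.jsdelivr.net');
    styleSrc.push('cdn.jsdelivr.net');
  }
  return `script-src ${scriptSrc.join(' ')}; style-src ${styleSrc.join(' ')}`;
}

console.log(buildCsp({ useCdn: true }));
```

The resulting string can be sent as the value of a Content-Security-Policy response header or placed in a meta tag, depending on how your site is served.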
Search Configuration
All options here are provided through the infisearch.init
function exposed by the search bundle.
There are 2 categories of options: the first relates to the user interface, and the second to search functionality.
Search UI Options
Search UI options are organised under the uiOptions
key:
infisearch.init({
uiOptions: {
// ... options go here ...
}
})
Site URL
sourceFilesUrl
- Example: '/' or 'https://www.infi-search.com'
This option allows InfiSearch to construct a link to the page for search result previews. This is done by appending the relative file path of the indexed file.
Unless you are providing all links manually (see Linking to other pages), this URL must be provided.
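The link construction described above can be pictured with the following sketch. It is not InfiSearch's implementation: the function name is hypothetical and the slash normalisation is an assumption added so the example behaves sensibly for both sample values.

```javascript
// Hypothetical illustration: join sourceFilesUrl and an indexed file's
// relative path into the link used by a result preview.
function buildResultLink(sourceFilesUrl, relativeFilePath) {
  const base = sourceFilesUrl.endsWith('/') ? sourceFilesUrl : `${sourceFilesUrl}/`;
  return base + relativeFilePath.replace(/^\/+/, '');
}

console.log(buildResultLink('https://www.infi-search.com', 'docs/intro.html'));
console.log(buildResultLink('/', 'docs/intro.html'));
```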
Input Element
Option | Default Value | Description |
---|---|---|
input | 'infi-search' | id of the input element or a HTML element reference |
inputDebounce | 100 | debounce time of keystrokes to the input element |
preprocessQuery | (q) => q | any function for preprocessing the query. Can be used to add a field filter for example. |
The input
element is required in most cases. Its behaviour depends on the UI mode.
UI Mode
mode: 'auto'
The search UI provides four main behaviours.
Mode | Details |
---|---|
auto | This uses the fullscreen mode for a mobile device, and dropdown otherwise. This adjustment is rerun whenever the window is resized. |
dropdown | This wraps the provided input element in a wrapper container, then creates a dropdown next to it to house InfiSearch's UI. |
fullscreen | This creates a distinct modal (with its own search input, close button, etc.) and appends it to the page <body>. If the input element is specified, a click handler is attached to open this UI so that it functions as a button. For default keyboard accessibility, some minimal and overridable styling is also applied to this button. This UI can also be toggled programmatically, removing the need for the input. |
target | This option is most flexible, and is used by the mdBook plugin (this documentation). Search results are then output to a custom target element of choice. |
Use the following buttons to try out the different modes. The default in this documentation is target
.
UI Mode Specific Options
There are also several options specific to each mode. dropdown
and fullscreen
options are also applicable to the auto
mode.
Mode | Option | Default | Description |
---|---|---|---|
dropdown | dropdownAlignment | 'bottom-end' | 'bottom' or 'bottom-start' or 'bottom-end'. The alignment will be automatically flipped horizontally to ensure optimal placement. |
fullscreen | fsContainer | <body> | id of or an element reference to attach the modal to. |
fullscreen | fsScrollLock | true | Scroll locks the body element when the fullscreen UI is opened. |
target | target | undefined | id of or an element reference to attach the UI. |
General Options
Option | Default | Description |
---|---|---|
tip | true | Shows the advanced search tips icon on the bottom right. |
maxSubMatches | 2 | Maximum headings to show for a result preview. |
resultsPerPage | 10 | Number of results to load when "load more" is clicked. |
useBreadcrumb | false | Prefer using the file path as the result preview's title. This is formatted into a breadcrumb, transformed to Title Case. Example: 'documentation/userGuide/my_file.html' → Documentation » User Guide » My File . |
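The breadcrumb transformation mentioned for useBreadcrumb can be approximated as follows. This is a sketch written to reproduce the documented example only; the function name and the exact word-splitting rules (camelCase, underscores, hyphens) are assumptions and may differ from what the UI does.

```javascript
// Hypothetical re-implementation of the file path -> breadcrumb formatting.
function toBreadcrumb(relativePath) {
  return relativePath
    .split('/')
    .filter((segment) => segment.length > 0)
    .map((segment, i, all) => {
      // Drop the file extension from the last segment only.
      const base = i === all.length - 1 ? segment.replace(/\.[^.]+$/, '') : segment;
      return base
        .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // split camelCase
        .replace(/[_-]+/g, ' ')                 // underscores / hyphens to spaces
        .split(' ')
        .filter(Boolean)
        .map((word) => word[0].toUpperCase() + word.slice(1))
        .join(' ');
    })
    .join(' \u00bb ');
}

console.log(toBreadcrumb('documentation/userGuide/my_file.html'));
// Documentation » User Guide » My File
```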
Setting Up Enum Filters
Enum fields you index can be mapped into UI multi-select dropdowns. In this documentation for example, mdBook's section titles "User Guide", "Advanced" are mapped.
Setup bindings under uiOptions
like so:
multiSelectFilters: [
{
fieldName: 'partTitle', // name of field definition
displayName: 'Section', // UI header text
defaultOptName: 'None',
collapsed: true, // only the first header is initially expanded
},
]
Documents that do not have an enum value are assigned an internal default enum value. The option text of this enum value to show is specified by defaultOptName
.
Setting Up Numeric Filters and Sort Orders
Indexed numeric fields can be mapped into minimum-maximum filters of <input type="number|date|datetime-local" />
, or used to create custom sort orders.
Minimum-Maximum Filters
numericFilters: [
{
fieldName: 'pageViewsField',
displayName: 'Number of Views',
type: 'number', // or 'date', or 'datetime-local'
// Text above date, datetime-local filters and placeholder text for number filters
// Also announced to screenreaders
minLabel: 'Min',
maxLabel: 'Max',
}
]
Sorting by Numbers, Dates
sortFields: {
// Map of the name of your numeric field to names of UI options
price: {
asc: 'Price: Low to High',
desc: 'Price: High to Low',
},
},
Manually Showing / Hiding the Fullscreen UI
Call the showFullscreen()
and hideFullscreen()
functions returned by infisearch.init
to programmatically show/hide the fullscreen search UI.
// These methods can be used under mode="auto|fullscreen"
const { showFullscreen, hideFullscreen } = infisearch.init({ ... });
Client Side Routing
To override the link click handler, use the specially provided parameter onLinkClick
.
uiOptions: {
onLinkClick: function (ev) {
/*
By default, this function is a thunk.
Call ev.preventDefault() and client-side routing code here.
Access the anchor element using "this".
*/
}
}
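One possible shape for such a handler is sketched below. The `navigate` callback stands in for whatever your client-side router exposes and is purely hypothetical, as is the decision to leave modified clicks (for example Ctrl+click) to the browser. A regular function is used rather than an arrow function so that `this` refers to the anchor element.

```javascript
// Hypothetical factory producing an onLinkClick handler for a router.
function makeOnLinkClick(navigate) {
  return function onLinkClick(ev) {
    // Let the browser handle "open in new tab / window" gestures.
    if (ev.ctrlKey || ev.metaKey || ev.shiftKey) return;
    ev.preventDefault();
    navigate(this.getAttribute('href'));
  };
}

// Usage sketch (myRouter is an assumption, not part of InfiSearch):
// infisearch.init({ uiOptions: { onLinkClick: makeOnLinkClick((href) => myRouter.push(href)) } });
```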
Changing The Mobile Device Detection Method
If the client is a "mobile device", the fullscreen UI is used under mode='auto'
.
The check is done with a media query, which can be overwritten:
uiOptions: {
// Any function returning a boolean
isMobileDevice: () =>
window.matchMedia('only screen and (max-width: 768px)').matches,
}
Search Functionality Options
The options for search functionality are fairly brief:
infisearch.init({
searcherOptions: {
// URL of output directory generated by the CLI tool
url: 'http://192.168.10.132:3000/output/',
// ---------------------------------------------------------------
// Optional Parameters
maxAutoSuffixSearchTerms: 3,
maxSuffixSearchTerms: 5,
useQueryTermProximity: true,
// Maximum number of results (unlimited if null).
resultLimit: null,
// ------------------------------
// Caching Options
// Caches **all** texts of storage=[{ "type": "text" }] fields up front,
// to avoid network requests when generating result previews.
// Discussed in the "Larger Collections" chapter.
cacheAllFieldStores: undefined,
// Any index chunk larger than this number of bytes
// will be persistently cached once requested.
plLazyCacheThreshold: 0,
// ------------------------------
// ---------------------------------------------------------------
},
});
(Automatic) Suffix Search
maxAutoSuffixSearchTerms = 3
Stemming is turned off by default. This means a bigger dictionary (though usually not by much) and lower recall, but much more precise searches.
To keep recall up, an automatic wildcard suffix search is performed on the last query term of a free text query, and only if the query does not end with a whitespace (an indicator of whether the user has finished typing).
maxSuffixSearchTerms = 5
This controls the maximum number of terms to search for manual wildcard suffix searches.
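The rule for when the automatic suffix search applies can be sketched as below. The function name, the in-memory dictionary and the way candidates are picked when more than the limit match (simply the first ones in dictionary order) are all assumptions for illustration; only the trailing-whitespace rule and the limit come from the description above.

```javascript
// Hypothetical sketch: expand the last term of a free text query by prefix,
// unless the query ends with whitespace (the user has finished the word).
function autoSuffixCandidates(query, dictionary, maxTerms = 3) {
  if (query.length === 0 || /\s$/.test(query)) return [];
  const terms = query.trim().split(/\s+/);
  const lastTerm = terms[terms.length - 1];
  return dictionary.filter((term) => term.startsWith(lastTerm)).slice(0, maxTerms);
}

const dictionary = ['search', 'searcher', 'searching', 'seal', 'static'];
console.log(autoSuffixCandidates('static sea', dictionary));  // expands "sea"
console.log(autoSuffixCandidates('static sea ', dictionary)); // no expansion
```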
Term Proximity Ranking
useQueryTermProximity = true
If positions are indexed, document scores are also scaled by how close query expressions or terms are to each other. This boosts result relevance significantly.
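To give a feel for what "how close query terms are" can mean, the sketch below finds the smallest window of positions that contains every query term at least once, then turns it into a multiplier. Both the window measure and the scaling formula are assumptions chosen for illustration; they are not taken from InfiSearch's source.

```javascript
// positionLists[i] holds the sorted positions of query term i in one document.
// Returns the size of the smallest window covering all terms, or Infinity.
function smallestWindow(positionLists) {
  const tagged = [];
  positionLists.forEach((positions, term) => {
    for (const pos of positions) tagged.push({ pos, term });
  });
  tagged.sort((a, b) => a.pos - b.pos);

  const counts = new Array(positionLists.length).fill(0);
  let covered = 0;
  let left = 0;
  let best = Infinity;
  for (let right = 0; right < tagged.length; right++) {
    if (counts[tagged[right].term]++ === 0) covered++;
    while (covered === positionLists.length) {
      best = Math.min(best, tagged[right].pos - tagged[left].pos + 1);
      if (--counts[tagged[left].term] === 0) covered--;
      left++;
    }
  }
  return best;
}

// Hypothetical scaling: adjacent terms double the score, distant terms tend to 1.
function proximityScale(windowSize, numTerms) {
  return 1 + 1 / (windowSize - numTerms + 1);
}

console.log(smallestWindow([[1, 10], [3, 12], [11]])); // 3
```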
Caching Options (Advanced)
This is discussed more in the chapter on larger collections.
Styling
Themes
InfiSearch provides 3 built-in themes by default, which correspond to the 3 stylesheets in the releases.
These 3 stylesheets also expose a wide range of css variables which you can alter as needed.
Head on over to the demo site here to try them out!
Light
CDN link
<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-light.css" />
Basic
CDN link
<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-basic.css" />

Dark
CDN link
<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-dark.css" />
Styling the Fullscreen UI Input Button
InfiSearch is minimally invasive in styling your <input>
element (except for the one that comes with the fullscreen UI), leaving this to your site's design.
Notably however, for accessibility, some minimal styling is applied when using the fullscreen UI to convey the intention of a button (which opens the fullscreen UI). This is limited to:
- A background, box-shadow and color application on focus. These are applied with an !important modifier as they are key to conveying keyboard focus, but are also easily overridable with InfiSearch's css variables.
- A cursor: pointer application on hover.
Applying Input Button Styles under mode='auto'
If using the default UI mode of auto
, which switches between the dropdown and fullscreen UI dynamically, you can also set a different placeholder, and/or use the .infi-button-input
selector to apply your styles only if the fullscreen UI is used. For example,
.infi-button-input:focus:not(:hover) {
background: #6c757d !important;
}
Indexer Configuration
All indexer configurations are sourced from a json file. By default, the cli tool looks for infi_search.json
in the source folder (first argument specified in the command).
This can be changed using the -c <config-file-path>
option.
Indexer Field Configuration
Every document you index contains multiple fields. By default, InfiSearch comes baked in with the configurations needed for supporting static site search.
Default Field Configuration
It may be helpful to first understand the default fields as examples, and how they are used in the UI:
{
"fields_config": {
"fields": {
"title": { "weight": 2.0 },
"h1": { "weight": 2.0 },
"heading": { "weight": 1.5 },
"body": { "weight": 1.0 },
// The default weight is 0.0. These fields are stored, but not searchable.
"headingLink": {},
"link": {},
"_relative_fp": {} // An internal, reserved field (see "Reserved Fields")
}
}
}
Field | Source | UI Usage |
---|---|---|
h1, title | <h1> , <title> | Result preview's title. When unavailable, the _relative_fp field is displayed as a breadcrumb. |
heading | <h2-6> | Result preview sub match's heading. |
headingLink | <h2-6 id=".."> | Result preview sub match's #anchor link. |
body | <body> | Result preview sub match's main text. |
_relative_fp | Relative file path from the source indexer folder | Result preview's <a> link by concatenating sourceFilesUrl to _relative_fp |
link | User supplied override for linking to other pages | Result preview's <a> link. Convenience default field to support custom overrides for links easily (e.g. when indexing a JSON document). |
Adding Fields
You can add your own fields to index for free-text search, create categorical and/or numeric facet filters, and custom numeric sort orders.
The user interface only incorporates the default set of fields in result highlighting, however. If you need to incorporate additional fields, for example to display an icon beside each result, you can alter the HTML outputs, or use the search API.
Removing Default Fields
If you are using InfiSearch as a general-purpose client side search tool, you can assign a value of null
to remove default field definitions completely as a minor optimization.
Alternatively, merge_default_fields: false
removes all default field definitions.
{
"fields_config": {
"fields": {
"h1": null
},
"merge_default_fields": false
}
}
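The effect of the two removal mechanisms can be pictured with the sketch below. It only models the behaviour described in this section; the function name is hypothetical, and the assumption that a user definition replaces a default definition wholesale (rather than being merged key by key) is mine.

```javascript
// Default field definitions as listed under "Default Field Configuration".
const DEFAULT_FIELDS = {
  title: { weight: 2.0 },
  h1: { weight: 2.0 },
  heading: { weight: 1.5 },
  body: { weight: 1.0 },
  headingLink: {},
  link: {},
  _relative_fp: {},
};

// Hypothetical model of how user fields and defaults combine.
function resolveFields(userFields = {}, mergeDefaultFields = true) {
  const merged = mergeDefaultFields ? { ...DEFAULT_FIELDS, ...userFields } : { ...userFields };
  for (const [name, definition] of Object.entries(merged)) {
    if (definition === null) delete merged[name]; // null removes a definition
  }
  return merged;
}

console.log(Object.keys(resolveFields({ h1: null })));
console.log(Object.keys(resolveFields({ price: { weight: 0.0 } }, false)));
```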
Reserved Fields
Reserved fields are prefixed with an underscore _
, and are hardcoded into the indexer to perform special functions.
- _relative_fp: the relative path from your source folder to the file.
- _add_files: This field allows you to index multiple files as a single document, which can be useful for overriding or extending data. See this section under indexing for more details.
Field Scoring
{
"fields_config": {
"fields": {
"title": { "weight": 2.0 }
}
}
}
weight=0.0
This parameter is a boost / penalty multiplier applied to an individual field's score.
Specifying 0.0
will also result in the field not being indexed into InfiSearch's inverted index at all, meaning that searching for any terms in this field will not return any results. When used in combination with the storage
parameter, the use case is to create a field that is only stored for custom sort orders, facet filters, or UI purposes (e.g. the _relative_fp
field).
k=1.2
& b=0.75
These are scoring parameters that control the impact of term frequency and document lengths. The following article provides a good overview on how to configure these, if needed.
All default fields except titles and headings use the above default parameters.
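For readers unfamiliar with how weight, k and b interact, here is a sketch of a textbook BM25 term score with a field weight applied. The IDF variant, the function name and the statistics passed in are assumptions for illustration; InfiSearch's exact formula may differ.

```javascript
// Textbook BM25 for one term in one field, scaled by the field weight.
// k controls term-frequency saturation, b controls length normalisation.
function bm25TermScore(
  { tf, fieldLength, avgFieldLength, numDocs, docFreq },
  { k = 1.2, b = 0.75, weight = 1.0 } = {},
) {
  // Assumed IDF variant (always positive).
  const idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return weight * idf * (tf * (k + 1)) / (tf + k * lengthNorm);
}

// Made-up statistics: the term occurs 3 times in a 100-token body.
const stats = { tf: 3, fieldLength: 100, avgFieldLength: 120, numDocs: 1000, docFreq: 50 };
console.log(bm25TermScore(stats));                // body-like field, weight 1.0
console.log(bm25TermScore(stats, { weight: 2 })); // title-like field, weight 2.0
```

Setting b to 0 removes the length penalty entirely, which can suit short fields such as titles; raising k lets repeated terms keep adding to the score for longer.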
Field Storage
{
"fields_config": {
"fields": {
"title": {
"storage": [{ "type": "text" }] // defaults
}
}
}
}
As with most free-text search tools, InfiSearch relies on an inverted index mapping terms to source documents.
Once the result set is obtained, each result document's data could still be useful. For example, a document's original title is essential for generating a human-readable result preview.
InfiSearch provides 3 storage types:
1. text
In this format, the document's raw texts are stored in a JSON file as a series of [fieldName, fieldText]
pairs following the order they were seen.
This "positioned" model enables constructing the detailed result preview hierarchies you see in InfiSearch's UI currently: Title > Heading > Text under Heading
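To make the "positioned" model concrete, the sketch below groups an ordered list of [fieldName, fieldText] pairs into a Title > Heading > Text hierarchy. The sample pairs and the grouping function are illustrative assumptions, not the UI's actual code.

```javascript
// Hypothetical grouping of ordered field pairs into a preview hierarchy.
function groupFieldPairs(pairs) {
  const preview = { title: null, sections: [] };
  for (const [fieldName, fieldText] of pairs) {
    if (fieldName === 'title' || fieldName === 'h1') {
      if (preview.title === null) preview.title = fieldText;
    } else if (fieldName === 'heading') {
      preview.sections.push({ heading: fieldText, body: [] });
    } else if (fieldName === 'body') {
      // Text seen before any heading goes under a heading-less section.
      if (preview.sections.length === 0) preview.sections.push({ heading: null, body: [] });
      preview.sections[preview.sections.length - 1].body.push(fieldText);
    }
  }
  return preview;
}

// Made-up pairs, in the order they would have been seen in the document.
const pairs = [
  ['title', 'Getting Started'],
  ['body', 'This page assumes a static site.'],
  ['heading', 'Installing the indexer'],
  ['body', 'There are a couple of options.'],
];
console.log(JSON.stringify(groupFieldPairs(pairs), null, 2));
```

Because order is preserved, a match in a body text can be shown under the heading that preceded it.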
2. enum
This storage format stores a single value for each indexed document, and is useful for categorical data. Only the first such occurrence is stored if there are multiple. These values can be queried using the search API or used to create multi-select filters in the search UI.
In this documentation for example (and the mdBook plugin), there is a multi-select checkbox filter that can be used to filter each page by its mdBook section title. ("User Guide", "Advanced")
Notes:
- Documents without enum values are internally assigned a default enum value that can also be queried.
- While it is unlikely you will need more, there is a hard limit of 255 possible values for your entire document collection. Excess values are discarded, and the CLI tool will print a warning.
- You can also use InfiSearch's flexible boolean syntaxes to filter documents. Using this option however allows a simplifying assumption to store these values more compactly and enables creating UI multi-select filters easily.
3. i64
This format stores a single 64-bit signed integer value for each document. Only the first such occurrence is stored.
{
"fields_config": {
"fields": {
"price": {
"storage": [{
"type": "i64",
"default": 1,
"parse": "integer"
}]
}
}
}
}
3 parsing strategies are available currently:
- integer: a signed 64-bit integer
- round: a double-precision floating-point number rounded to the nearest integer
- datetime: any date time string. This string is parsed using a datetime_fmt format specifier as outlined in the Chrono crate's DateTime::parse_from_str method. The value is stored in seconds, as a UNIX timestamp relative to 1 Jan 1970 00:00 UTC.
{
"type": "i64",
"default": 1,
"parse": {
"method": "datetime",
"datetime_fmt": "%Y %b %d %H:%M %z",
// ----------------------
// Optional
// If your datetime_fmt has no timezone,
// specify it in seconds here, relative to UTC
"timezone": 0,
// If your datetime_fmt has no H, M, and timezone
// specify the default time of day in seconds here
"time": 0,
// ----------------------
}
}
i64
fields can be used for facet search for:
- Creating numeric or datetime min-max filters in the UI easily and/or filtering them in the Search API
- Sorting results by these fields in the UI or API
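Since datetime values are stored as UNIX timestamps in seconds, any bounds you compare them against need the same unit. The sketch below shows that conversion, then a min-max filter and sort over made-up documents. The helper names, the `published` field and the sample data are assumptions; this does not use the actual Search API.

```javascript
// Convert an ISO 8601 string to UNIX seconds (what a datetime i64 field stores).
function toUnixSeconds(isoString) {
  return Math.floor(Date.parse(isoString) / 1000);
}

// Hypothetical min-max filter plus sort over an i64-like numeric field.
// Note: JavaScript numbers are only exact up to 2^53 - 1, unlike a true i64.
function filterAndSort(docs, field, { min = -Infinity, max = Infinity, order = 'asc' } = {}) {
  const sign = order === 'desc' ? -1 : 1;
  return docs
    .filter((doc) => doc[field] >= min && doc[field] <= max)
    .sort((a, b) => sign * (a[field] - b[field]));
}

const docs = [
  { title: 'Old post', published: toUnixSeconds('2021-06-01T00:00:00Z') },
  { title: 'New post', published: toUnixSeconds('2023-01-15T08:30:00+08:00') },
  { title: 'Mid post', published: toUnixSeconds('2022-03-10T12:00:00Z') },
];
const since2022 = filterAndSort(docs, 'published', {
  min: toUnixSeconds('2022-01-01T00:00:00Z'),
  order: 'desc',
});
console.log(since2022.map((doc) => doc.title));
```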
Indexer Data Configuration
The configurations in this page specify how (mapping file data to fields) and which files to index.
InfiSearch's defaults are sufficient to index most HTML files, but if not, you can also configure how the content mapping is done. Support for other file formats (e.g. JSON, CSV, PDF) is also enabled here.
Mapping File Data to Fields
{
"indexing_config": {
"loaders": {
// Default: Only HTML files are indexed
"HtmlLoader": {}
}
}
}
The indexer is able to handle data from HTML, JSON, CSV, TXT, or PDF files. Support for each file type is provided by a file Loader abstraction.
You may configure loaders by including them under the loaders
, with any applicable options.
HTML Files: loaders.HtmlLoader
The HtmlLoader
is the only loader that is configured by default, which is as follows:
"loaders": {
"HtmlLoader": {
"exclude_selectors": [
// Selectors to exclude from indexing
"script,style,form,nav,[data-infisearch-ignore]"
],
"selectors": {
"title": {
"field_name": "title"
},
"h1": {
"field_name": "h1"
},
"h2,h3,h4,h5,h6": {
"attr_map": {
"id": "headingLink" // stores the id attribute under the headingLink field
},
"field_name": "heading"
},
"body": {
"field_name": "body"
},
"meta[name=\"description\"],meta[name=\"keywords\"]": {
"attr_map": {
"content": "body"
}
},
// A convenient means to override the link used in the result preview
// See "Linking to other pages" for more information
"span[data-infisearch-link]": {
"attr_map": {
"data-infisearch-link": "link"
}
}
},
"merge_default_selectors": true
}
}
The HTML loader indexes a document by:
- Traversing the document depth-first, in the order text naturally appears.
- Checking if any selectors specified as keys under HtmlLoader.selectors are satisfied for each element. If so, all descendants (elements, text) of the element are indexed under the newly specified field_name, if any.
  - This process repeats as the document is traversed: if a descendant matches another selector, the field mapping is overwritten for that descendant and its descendants.
  - The attr_map option allows indexing attributes of specific elements under fields as well.
  - All selectors are matched in arbitrary order by default. To specify an order, add a higher priority: n key to your selector definition, where n is any integer.
- To exclude elements from indexing, use the exclude_selectors option, or add the in-built data-infisearch-ignore attribute to your HTML.
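The traversal described above can be modelled on a toy tree as follows. This is a deliberately simplified sketch: nodes are plain objects, "selectors" match tag names only (no real CSS selector engine, no priorities), and all names are assumptions rather than InfiSearch internals.

```javascript
// Depth-first walk that emits [fieldName, text] pairs in document order.
// A matching selector overrides the inherited field for its whole subtree.
function collectFields(node, selectors, excludedTags = [], field = null, out = []) {
  if (typeof node === 'string') {
    const text = node.trim();
    if (field !== null && text !== '') out.push([field, text]);
    return out;
  }
  if (excludedTags.includes(node.tag)) return out; // exclude_selectors analogue
  const rule = selectors[node.tag];
  const nextField = rule && rule.field_name ? rule.field_name : field;
  if (rule && rule.attr_map) {
    for (const [attr, attrField] of Object.entries(rule.attr_map)) {
      if (node.attrs && node.attrs[attr] !== undefined) out.push([attrField, node.attrs[attr]]);
    }
  }
  for (const child of node.children || []) {
    collectFields(child, selectors, excludedTags, nextField, out);
  }
  return out;
}

const selectors = {
  h1: { field_name: 'h1' },
  h2: { field_name: 'heading', attr_map: { id: 'headingLink' } },
  body: { field_name: 'body' },
};
const page = {
  tag: 'body',
  children: [
    { tag: 'h1', children: ['Intro'] },
    { tag: 'p', children: ['Hello'] },
    { tag: 'nav', children: ['Skip me'] },
    { tag: 'h2', attrs: { id: 'setup' }, children: ['Setup'] },
    { tag: 'p', children: ['Run it'] },
  ],
};
console.log(collectFields(page, selectors, ['nav']));
```

The paragraph texts inherit the body field from their ancestor, while the headings override it for their own subtrees.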
If needed, you can also index HTML fragments that are incomplete documents (for example, documents which are missing the <head>
). To match the entire fragment, use the body
selector.
Lastly, if you need to remove a default selector, simply replace its definition with null
. For example, "h2,h3,h4,h5,h6": null
. Alternatively, specifying "merge_default_selectors": false
will remove all default selectors.
JSON Files: loaders.JsonLoader
"loaders": {
"JsonLoader": {
"field_map": {
"chapter_text": "body",
"book_link": "link",
"chapter_title": "title"
},
// Optional, order in which to index the keys of the json {} document
"field_order": [
"book_link",
"chapter_title",
"chapter_text"
]
}
}
JSON files can also be indexed. The field_map
contains a mapping of your JSON data key -> field name.
The field_order
array controls the order in which the data keys are indexed, which has a minor influence on query term proximity ranking.
The JSON file can be either:
- An object, with numbers, strings or null values
- An array of such objects
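As an illustration of what field_map and field_order express, the sketch below maps one JSON object to ordered [fieldName, text] pairs. The function name and the choice to skip null values are assumptions; the sample document is made up.

```javascript
// Hypothetical model of the JSON loader's mapping step.
function mapJsonDocument(doc, fieldMap, fieldOrder = Object.keys(fieldMap)) {
  const pairs = [];
  for (const key of fieldOrder) {
    const fieldName = fieldMap[key];
    const value = doc[key];
    // Unmapped keys and missing / null values are skipped in this sketch.
    if (fieldName === undefined || value === null || value === undefined) continue;
    pairs.push([fieldName, String(value)]);
  }
  return pairs;
}

const fieldMap = { chapter_text: 'body', book_link: 'link', chapter_title: 'title' };
const fieldOrder = ['book_link', 'chapter_title', 'chapter_text'];
const doc = {
  chapter_title: 'Chapter 1',
  chapter_text: 'It was a dark and stormy night.',
  book_link: 'https://example.com/book',
};
console.log(mapJsonDocument(doc, fieldMap, fieldOrder));
```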
CSV Files: loaders.CsvLoader
"loaders": {
"CsvLoader": {
// ---------------------
// Map data using CSV headers
"header_field_map": {},
"header_field_order": [], // Optional, order to index the columns
// ---------------------
// Or with header indices
"index_field_map": {
"0": "link",
"1": "title",
"2": "body",
"4": "heading"
},
"index_field_order": [1, 4, 2, 0], // Optional, order to index the columns
// ---------------------
// Options for csv parsing, from the Rust "csv" crate
"parse_options": {
"comment": null,
"delimiter": 44,
"double_quote": true,
"escape": null,
"has_headers": true,
"quote": 34
}
}
}
Field mappings for CSV files can be configured using one of the field_map
keys. The field_order
arrays control the order in which columns are indexed.
The parse_options
key specifies options for parsing the csv file.
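The numeric defaults in parse_options correspond to ASCII character codes: 44 is a comma and 34 is a double quote. A small helper like the hypothetical one below can produce the right numbers for other separators, for example a semicolon-delimited file. That the indexer accepts these values as single-byte codes is inferred from the defaults shown above.

```javascript
// Convert a single ASCII character to its byte value.
function toByte(character) {
  if (character.length !== 1 || character.charCodeAt(0) > 127) {
    throw new RangeError('expected a single ASCII character');
  }
  return character.charCodeAt(0);
}

// Example parse_options for a semicolon-separated file (an assumption-based sketch).
const parseOptions = {
  comment: null,
  delimiter: toByte(';'),
  double_quote: true,
  escape: null,
  has_headers: true,
  quote: toByte('"'),
};
console.log(JSON.stringify({ parse_options: parseOptions }, null, 2));
```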
PDF Files: loaders.PdfLoader
"loaders": {
"PdfLoader": {
"field": "body"
}
}
This loader indexes all content into a single field "body" by default.
The search result title would appear as <...PDF file path breadcrumb...> (PDF)
, and when clicked upon will open the PDF in the browser.
Text Files: loaders.TxtLoader
"loaders": {
"TxtLoader": {
"field": "field_name"
}
}
This loader simply reads .txt
files and indexes all their contents into a single field. This is not particularly useful without the _add_files
feature that allows indexing data from multiple files as one document.
File Exclusions
{
"indexing_config": {
"exclude": [
"infi_search.json"
],
"include": [],
"with_positions": true
}
}
File Exclusions: exclude = ["infi_search.json"]
Global file exclusions can be specified in this parameter, which is simply an array of file globs.
File Inclusions: include = []
Similarly, you can specify only specific files to index. This is an empty array by default, which indexes everything.
If a file matches both an exclude
and include
pattern, the exclude
pattern will have priority.
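The precedence rule can be summarised in code as below. The glob matcher here is a heavily simplified stand-in (only `*` and `**` are supported) and is not the implementation the CLI uses; whether patterns are matched against the relative file path is also an assumption of this sketch.

```javascript
// Simplified glob -> RegExp conversion, for illustration only.
function globToRegExp(glob) {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters except *
    .replace(/\*\*/g, '\u0000')             // placeholder for "**"
    .replace(/\*/g, '[^/]*')                // "*" stays within one path segment
    .replace(/\u0000/g, '.*');              // "**" may cross "/" boundaries
  return new RegExp(`^${source}$`);
}

// Exclusions win over inclusions; an empty include list means "everything".
function shouldIndex(relativePath, include = [], exclude = []) {
  const matchesAny = (globs) => globs.some((glob) => globToRegExp(glob).test(relativePath));
  if (matchesAny(exclude)) return false;
  return include.length === 0 || matchesAny(include);
}

console.log(shouldIndex('docs/intro.html', [], ['infi_search.json']));
console.log(shouldIndex('docs/intro.html', ['docs/**'], ['docs/intro.html']));
```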
Indexer Misc Configuration
Indexing Positions
{
"indexing_config": {
"with_positions": true
}
}
This option controls whether positions are stored. Features such as phrase queries that require positional information will not work if this is disabled. Turning this off for very large collections (roughly above 1GB) can increase the tool's scalability, at the cost of such features.
Indexer Thread Count
{
"indexing_config": {
"num_threads": max(min(physical cores, logical cores) - 1, 1)
}
}
Indexing Multiple Files Under One Document
InfiSearch regards each file as a single document by default. You can index multiple files into one document using the reserved field _add_files
. This is useful if you need to override or add data but can't modify the source document easily.
Overrides should be provided with JSON, CSV, or HTML files, as TXT and PDF files have no reliable way of supplying the _add_files
field. In addition, you will need to manually map the CSV data to the _add_files
field. This is automatically done for JSON and HTML files.
Example: Overriding a Document's Link With Another File
Suppose you have the following files:
folder
|-- main.html
|-- overrides.json
To index main.html
and override its link, you would have:
overrides.json
{
"link": "https://infi-search.com",
"_add_files": "./main.html"
}
Indexer Configuration
{
"indexing_config": {
"exclude": ["main.html"]
}
}
This prevents main.html from being indexed directly; it is indexed through overrides.json instead.
Larger Collections
⚠️ This section serves as a reference; prefer the preconfigured scaling presets if possible.
Field Configuration
{
"fields_config": {
"cache_all_field_stores": true,
"num_docs_per_store": 100000000
},
"indexing_config": {
"pl_limit": 4294967295,
"pl_cache_threshold": 0,
"num_pls_per_dir": 1000
}
}
Field Store Caching: cache_all_field_stores
All fields specified with storage=[{ "type": "text" }]
are cached up front when this is enabled.
This is the same option as the one under search functionality options, and has lower priority.
Field Store Granularity: num_docs_per_store
The num_docs_per_store
parameter controls how many documents' texts to store in one JSON file. Batching multiple files together increases file size but can lead to fewer files and better browser caching.
Index Shard Size: pl_limit
This is a threshold (in bytes) at which to "cut" index (pl meaning postings list) chunks. Increasing this produces fewer but bigger chunks (which take longer to retrieve).
Index Caching: pl_cache_threshold
Index chunks that exceed this size (in bytes) are cached by the search library on initialisation. It is used to configure InfiSearch for response time (over scalability) for typical use cases.
Language Configuration
There are 3 language modules available. To configure these, you will need to serve the appropriate language bundle in your HTML (or edit the CDN link accordingly), and edit the indexer configuration file.
{
"lang_config": {
// ... options go here ...
}
}
Ascii Tokenizer
The default tokenizer should work for any language that relies on ASCII characters, or their inflections (e.g. "á").
The text is first split into sentences, then on whitespace to obtain tokens. An asciiFoldingFilter is then applied to normalize diacritics, followed by punctuation and non-word-character boundary removal.
{
"lang": "ascii",
"options": {
"stop_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
"if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
"such", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
],
"ignore_stop_words": false,
// Hard limit = 250
"max_term_len": 80
}
}
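The tokenization steps described for the ascii tokenizer can be approximated in a few lines. This is a loose sketch and not the actual tokenizer, which ships as a WebAssembly module: asciiTokenize is a hypothetical name, the sentence terminators and the lowercasing step are assumptions, and stop words and max_term_len are left out.

```typescript
function asciiTokenize(text: string): string[] {
  const tokens: string[] = [];
  // 1. Split into sentences on common terminators (assumed set).
  for (const sentence of text.split(/[.!?]+/)) {
    // 2. Split each sentence on whitespace.
    for (const raw of sentence.split(/\s+/)) {
      const term = raw
        .normalize("NFD")                          // decompose accented letters
        .replace(/[\u0300-\u036f]/g, "")           // 3. drop combining marks (rough ASCII folding)
        .toLowerCase()                             // assumption: terms are case-insensitive
        .replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, "");  // 4. trim non-word characters at the boundaries
      if (term.length > 0) tokens.push(term);
    }
  }
  return tokens;
}
```

Note that only boundary characters are trimmed, so an apostrophe inside a word (as in "it's") is kept in this sketch.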
CDN Link
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii.bundle.js"></script>
Ascii Tokenizer with Stemmer
This is essentially the same as the ascii tokenizer, but adds a stemmer
option.
{
"lang": "ascii_stemmer",
"options": {
// ----------------------------------
// Ascii Tokenizer options also apply
// ...
// ----------------------------------
// Any of the languages here
// https://docs.rs/rust-stemmers/1.2.0/rust_stemmers/enum.Algorithm.html
// Languages other than "english" have not been extensively tested. Use with caution!
"stemmer": "english"
}
}
If you do not need stemming, use the ascii
tokenizer, which has a smaller wasm binary.
CDN Link
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii-stemmer.bundle.js"></script>
Chinese Tokenizer
This is a lightweight character-wise tokenizer, not based on word-based tokenizers like Jieba.
You are highly recommended to keep positions indexed and query term proximity ranking turned on when using this tokenizer, in order to boost the relevance of documents with multi-character queries.
{
"lang": "chinese",
"options": {
"stop_words": [],
"ignore_stop_words": false,
"max_term_len": 80
}
}
CDN Link
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.chinese.bundle.js"></script>
Stop Words
All tokenizers support keeping (default) or removing stop words using the ignore_stop_words
option.
Keeping them enables the following:
- Processing phrase queries such as "for tomorrow" accurately; stop words would otherwise be removed automatically from such queries.
- Boolean queries of stop words (e.g. if AND forecast AND sunny)
- More accurate ranking for free text queries, which use stop words in term proximity ranking
UI Translations
The UI's text can also be overwritten. Refer to this link for the default set of texts.
infisearch.init({
uiOptions: {
translations: { ... }
}
})
Option | Default | Description |
---|---|---|
resultsLabel | 'Site results' | Accessibility label for the listbox containing result previews. This is announced to screenreaders. |
fsButtonLabel | 'Search' | Accessibility label for the original input element that functions as a button when the fullscreen UI is in use. |
fsButtonPlaceholder | undefined | Placeholder override for the provided input that functions as a button when the fullscreen UI is in use. |
fsPlaceholder | 'Search this site' | Placeholder of the input element in the fullscreen UI. |
fsCloseText | 'Close' | Text for the Close button. |
filtersButton | 'Filters' | Text for the Filters button if any enum or numeric filters are configured. |
numResultsFound | ' results found' | The text following the number of results found. |
startSearching | 'Start Searching Above!' | Text shown when the input is empty. |
startingUp | '... Starting Up ...' | Text shown when InfiSearch is still not ready to perform any queries. Setup is extremely quick, so you will hopefully not see this text most of the time. |
navigation | 'Navigation' | Navigation controls text. |
sortBy | 'Sort by' | Header text for custom sort orders. |
tipHeader | 'š Advanced search tips' | Header of the tip popup. |
tip | 'Tip' | First column header of the tip popup. |
example | 'Example' | Second column header of the tip popup. |
tipRows.xx | (refer here) | Examples for usage of InfiSearch's advanced search syntax. |
error | 'Oops! Something went wrong... š' | Generic error text when something goes wrong |
Search API
You can also interface with InfiSearch through its API.
Setup
Under the global infisearch
variable, you can instantiate an instance of the Searcher
class.
const searcher = new infisearch.Searcher({
url: 'https://... the index output directory ...'
});
The constructor parameter uses the same options as infisearch.init; refer to this page for the other available options.
Initialising States
Setup is also async and mostly proceeds in the WebWorker. You can use the setupPromise
and isSetupDone
interfaces to optionally show UI initialising states.
searcher.setupPromise.then(() => {
  // isSetupDone is true once setupPromise has resolved
  console.assert(searcher.isSetupDone === true);
});
Retrieving Enum Values
If you have an enum field, you can retrieve all its possible values like so:
const enumValues: string[] = await searcher.getEnumValues('weather');
> console.log(enumValues)
['sunny', 'rainy', 'warm', 'cloudy']
Querying
Next, you can create a Query
object, which obtains and ranks the result set.
const query: Query = await searcher.runQuery('sunny weather');
The Query
object follows this interface.
interface Query {
  /**
   * Original query string.
   */
  readonly query: string;
  /**
   * Total number of results.
   */
  readonly resultsTotal: number;
  /**
   * Returns the next top N results.
   */
  readonly getNextN: (n: number) => Promise<Result[]>;
  /**
   * Freeing a query manually is required since its results live in the WebWorker.
   */
  readonly free: () => void;
}
Filtering and Sorting
Filter document results with enum fields or numeric fields by passing an additional parameter.
const query: Query = await searcher.runQuery('weather', {
enumFilters: {
// 'weather' is the enum field name
weather: [
null, // Use null to include documents that have no enum values
'sunny',
'warm',
]
},
i64Filters: {
// 'price' is the numeric field name
price: {
gte?: number | bigint,
lte?: number | bigint,
}
},
});
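The gte and lte entries above are type placeholders rather than concrete values. As a hedged example of filling them in, the sketch below builds a range for a date field. It assumes the field stores UNIX timestamps in whole seconds (as the datePosted sample value elsewhere on this page suggests); dateRangeFilter and I64Range are hypothetical names.

```typescript
interface I64Range {
  gte?: number;
  lte?: number;
}

// Converts an optional Date range into inclusive bounds in seconds since the UNIX epoch.
function dateRangeFilter(from?: Date, to?: Date): I64Range {
  const toSeconds = (d: Date) => Math.floor(d.getTime() / 1000);
  const range: I64Range = {};
  if (from) range.gte = toSeconds(from);
  if (to) range.lte = toSeconds(to);
  return range;
}
```

The returned object could then be passed as the value for a numeric field under i64Filters, for example { i64Filters: { datePosted: dateRangeFilter(from, to) } }, assuming a numeric field named datePosted is configured.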
Sort document results using numeric fields. Results are tie-broken by their relevance.
const query: Query = await searcher.runQuery('weather', {
sort: 'pageViews', // where 'pageViews' is the name of the field
sortAscending: false, // the default is to sort in descending order
});
Loading Document Texts
Running a query alone probably isn't very useful. You can get a Result
object using the getNextN
function.
const results: Result[] = await query.getNextN(10);
A Result
object stores the fields of the indexed document.
const fields = results[0].fields;
> console.log(fields)
{
texts: [
['_relative_fp', 'relative_file_path/of_the_file/from_the_folder/you_indexed'],
['title', 'README'],
['h1', 'README'],
['headingLink', 'description'],
['heading', 'Description'],
['body', 'InfiSearch is a client-side search solution made for static sites, .....'],
// ... more headingLink, heading, body fields ...
],
enums: {
weather: 'cloudy',
reporter: null,
},
numbers: {
datePosted: 1671336914n,
}
}
- texts: an array of [fieldName, fieldText] pairs stored in the order they were seen. This ordered model is more complex than a regular key-value store, but enables the detailed content hierarchy you see in InfiSearch's UI: Title > Heading > Text under heading
- enums: stores the enum values of the document. Documents missing enum values are assigned null.
- numbers: i64 fields returned as JavaScript BigInt values.
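Since getNextN returns results in batches, a small helper can page through a query up to some limit. The sketch below is illustrative only: QueryLike and ResultLike are local stand-ins for the documented interfaces, takeResults is a hypothetical helper, and it assumes getNextN resolves to an empty array once the results are exhausted.

```typescript
interface ResultLike {
  fields: unknown;
}

interface QueryLike {
  readonly getNextN: (n: number) => Promise<ResultLike[]>;
}

// Fetches up to `limit` results, `pageSize` at a time.
async function takeResults(query: QueryLike, pageSize: number, limit: number): Promise<ResultLike[]> {
  const collected: ResultLike[] = [];
  while (collected.length < limit) {
    const page = await query.getNextN(Math.min(pageSize, limit - collected.length));
    if (page.length === 0) break; // assumption: an empty page means no more results
    collected.push(...page);
  }
  return collected;
}
```

The helper deliberately does not call free(); the caller remains responsible for freeing the query once it is no longer needed.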
Memory Management
As InfiSearch uses a WebWorker to run things, you would also need to perform some memory management.
Once you are done with a Query
(e.g. if a new query was run), call free()
on the query
object.
query.free();
Search interfaces usually live for the entire lifetime of the application. If you need to do so however, you should also free the Searcher
instance:
searcher.free();
UI Convenience Methods
A Result
object also exposes 2 other convenience functions that may be useful to help deal with the positional format of the text
type field stores.
1. Retrieving Singular Fields as KV Stores
Certain fields will only occur once in every document (e.g. titles, <h1>
tags). To retrieve these easily, use the getKVFields
method:
const kvFields = result.getKVFields('link', '_relative_fp', 'title', 'h1');
> console.log(kvFields)
{
"_relative_fp": "...",
"title": "..."
// Missing fields will not be populated
}
Only the first [fieldName, fieldText]
pair for each field will be populated into the fields
object.
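The "first pair wins" behaviour can be pictured with a small reduction over the texts array. This is a sketch of the idea only, not the library's implementation; firstValues is a hypothetical name.

```typescript
type TextPair = [string, string]; // [fieldName, fieldText]

// Keeps the first occurrence of each requested field and ignores the rest.
function firstValues(texts: TextPair[], ...fieldNames: string[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, text] of texts) {
    if (fieldNames.indexOf(name) !== -1 && !(name in out)) {
      out[name] = text;
    }
  }
  return out;
}
```

Fields that never appear in texts are simply absent from the returned object, matching the "missing fields will not be populated" note above.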
Tip: Constructing a Document Link
If you haven't manually added links to your source documents, you can use the _relative_fp
field to construct one by concatenating it to a base URL. Any links added via the data-infisearch-link
attribute are also available under the link
field.
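A minimal sketch of this tip, assuming a plain string concatenation is sufficient for your site. buildLink and resolveLink are hypothetical helpers, and the base URL is an example value.

```typescript
// Joins a base URL and a relative file path with exactly one "/" between them.
function buildLink(baseUrl: string, relativeFp: string): string {
  const base = baseUrl.endsWith("/") ? baseUrl : `${baseUrl}/`;
  return base + relativeFp.replace(/^\/+/, "");
}

// Prefers an explicitly provided link, falling back to the relative file path.
function resolveLink(
  kv: { link?: string; _relative_fp?: string },
  baseUrl: string,
): string | undefined {
  if (kv.link) return kv.link;
  return kv._relative_fp === undefined ? undefined : buildLink(baseUrl, kv._relative_fp);
}
```

The kv argument is expected to have the shape returned by getKVFields('link', '_relative_fp').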
2. Associating Headings to other Content Fields and Highlighting
To establish the relationship between heading and headingLink pairs and the content fields following them, call linkHeadingsToContents.
// The parameter is a varargs of field names to consider as content fields
const headingsAndContents: Segment[] = result.linkHeadingsToContents('body');
This returns an array of Segment
objects, each of which represents a continuous chunk of heading or content text:
interface Segment {
/**
* 'content': content text without preceding heading
* 'heading': text from 'heading' fields
* 'heading-content': content text with a preceding heading
*/
type: 'content' | 'heading' | 'heading-content',
/**
* Only present if type = 'heading-content',
* and points to another Segment of type === 'heading'.
*/
heading?: Segment,
/**
* Only present if type = 'heading' | 'heading-content',
* and points to the heading's id, if any.
*/
headingLink?: string,
// Number of terms matched in this segment.
numTerms: number,
}
Sorting and Choosing Segments
This is fully up to your UI. For example, you can first prioritise segments with a greater numTerms
, then tie-break by the type
of the segment.
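One possible ordering is sketched below. The tie-break priority between segment types is an arbitrary choice made for illustration (TYPE_PRIORITY and rankSegments are hypothetical names); pick whatever suits your UI.

```typescript
type SegmentType = "content" | "heading" | "heading-content";

interface SegmentLike {
  type: SegmentType;
  numTerms: number;
}

// Lower number = shown first when numTerms is tied (assumed preference).
const TYPE_PRIORITY: Record<SegmentType, number> = {
  "heading-content": 0,
  content: 1,
  heading: 2,
};

// Returns a new array: most matched terms first, then by type priority.
function rankSegments<T extends SegmentLike>(segments: T[]): T[] {
  return [...segments].sort(
    (a, b) => b.numTerms - a.numTerms || TYPE_PRIORITY[a.type] - TYPE_PRIORITY[b.type],
  );
}
```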
Text Highlighting
interface Segment {
  // ... continuation ...
  // addEllipses defaults to true in both methods
  highlightHTML: (addEllipses?: boolean) => string,
  highlight: (addEllipses?: boolean) => (string | HTMLElement)[],
  text: string, // original string
  window: { pos: number, len: number }[], // Character position and length
}
There are 3 choices for text highlighting:
- highlightHTML() wraps matched terms with <mark> tags, truncates text, and adds trailing and leading ellipses. A single escaped HTML string is then returned for use.
- highlight() does the same but is slightly more efficient, returning a (string | HTMLElement)[] array. To use this array safely (strings are unescaped) and conveniently, use the .append(...seg.highlight()) DOM API.
  Example output:
  [
    <span class="infi-ellipses"> ... </span>,
    ' ... text before ... ',
    <mark class="infi-highlight">highlighted</mark>,
    ' ... text after ... ',
    <span class="infi-ellipses"> ... </span>,
  ]
- Lastly, you can perform text highlighting manually using the original text and the closest window of term matches.
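For the manual option, a minimal sketch might look like the following. It assumes pos and len are offsets into text that line up with JavaScript string indices, and that matches do not need truncation or ellipses; highlightManually and escapeHtml are hypothetical helpers.

```typescript
interface MatchWindow {
  pos: number;
  len: number;
}

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wraps each matched range in <mark>, escaping all text along the way.
function highlightManually(text: string, matches: MatchWindow[]): string {
  const sorted = [...matches].sort((a, b) => a.pos - b.pos);
  let html = "";
  let cursor = 0;
  for (const { pos, len } of sorted) {
    if (pos < cursor) continue; // skip ranges overlapping an earlier match
    html += escapeHtml(text.slice(cursor, pos));
    html += `<mark>${escapeHtml(text.slice(pos, pos + len))}</mark>`;
    cursor = pos + len;
  }
  return html + escapeHtml(text.slice(cursor));
}
```

The input would typically be a segment's text and window properties.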
Search Syntax
InfiSearch provides a few advanced search operators that can be used in the search API. These are also made known to the user using the help icon on the bottom right of the search UI.
Boolean Operators, Parentheses
AND, NOT, and inversion operators are supported.
OR is the default behaviour; documents are ranked according to the BM25 model.
Parentheses (...)
can be used to group expressions together.
weather +sunny - documents that may contain "weather" but must contain "sunny"
weather -sunny - documents containing "weather" and do not have "sunny"
~cloudy - all documents that do not contain "cloudy"
~(ipsum dolor) - all documents that do not contain "ipsum" and "dolor"
Phrase Queries
Phrase queries are also supported by enclosing the relevant terms in "..."
.
"sunny weather" - documents containing "sunny weather"
The with_positions
indexing option needs to be enabled for this to work (it is by default).
Field Search
Field queries are supported via the field_name: syntax:
title:sunny - documents containing "sunny" in the title
heading:(+sunny +cloudy) - documents with both "sunny" and "cloudy" in headings only
body:gloomy - documents with "gloomy" elsewhere
Wildcard Search
You can also perform suffix searches on any term using the *
character:
run* - searches for "run", "running"
In most instances, an automatic wildcard suffix search is also performed on the last query term that the user is still typing.
Escaping Search Operators
All search operators can also be escaped using \
:
\+sunny
\-sunny
\(sunny cloudy\)
\"cloudy weather\"
"phrase query with qu\"otes"
title\:lorem
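If you pass raw user input to the search API and want it treated literally, the escapes above could be applied programmatically. The sketch below covers only the operators shown in the examples; escapeQuery is a hypothetical helper, and treating + and - as operators only at the start of a term is an assumption inferred from those examples. The ~ and * operators are not handled.

```typescript
function escapeQuery(input: string): string {
  return input
    // Quotes, parentheses, and the field separator are escaped wherever they occur.
    .replace(/["():]/g, "\\$&")
    // "+" and "-" are escaped only at the start of a term (assumed prefix-only operators).
    .replace(/(^|\s)([+-])/g, "$1\\$2");
}
```

For instance, escapeQuery('title:lorem') yields title\:lorem, while a hyphen inside a word such as well-known is left untouched.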
Larger Collections
Three configuration presets are available for scaling this tool to larger collections. They are designed primarily for InfiSearch's main intended use case of supporting static site search.
Introduction
Each preset primarily makes a tradeoff between the document collection size it can support and the number of rounds of network requests (RTT
).
The default preset is small
, which generates a monolithic index and field store, much like other client side indexing tools.
Specify the preset
key in your configuration file to change this.
{
"preset": "small" | "medium" | "large"
}
Presets
small, medium and large correspond to 0, 1 and 2 rounds of network requests respectively; the table below describes each.
Preset | Description |
---|---|
small | Generates a monolithic index and field store. Identical to most other client side indexing tools. |
medium | Generates an almost-monolithic index but sharded field store. Only required field stores are retrieved for generating result previews. |
large | Generates both a sharded index and field store. Only index files that are required for the query are retrieved. Keeps stop words. This is the preset used in the demo here! |
In summary, scaling this tool for larger collections doesn't come free, and necessitates fragmenting the index and/or field stores, retrieving only what's needed. This means extra network requests, but to a reasonable degree.
This tool should be able to handle 800MB
(not counting things like HTML tags) collections with the full set of features enabled in the large
preset.
Other Options
There are a few other options especially worth highlighting that can help reduce the index size (and hence support larger collections) or modify caching strategies.
- In addition to upfront caching of index files with the pl_cache_threshold indexing parameter, InfiSearch also persistently caches any index shard that was requested before, but fell short of the pl_cache_threshold.
- Ignoring stop words is mostly only useful when using the small / medium presets, which generate a monolithic index. Doing so can reduce the overall index size, if you are willing to forgo the benefits of keeping them.
- Positions take up a considerable (~3/4) portion of the index size but produce useful information for proximity ranking, and enable phrase queries.
Modified Properties
Presets modify only the following properties:
- Search Configuration:
cacheAllFieldStores
- Indexing Configuration:
num_docs_per_store
,pl_limit
,pl_cache_threshold
Any of these values specified in the configuration file will override the preset's.
Linking to other pages
InfiSearch is convenient to get started with if the pages you link to are the same files you index, and these files are hosted at sourceFilesUrl
in the same way your source file folders are structured.
Linking to other pages instead is facilitated by the default link
field, which lets you override the link used in the result preview.
There is also a default data mapping for HTML files which the below section covers. If using JSON or CSV files, refer to the earlier section.
Indexing HTML Files
For HTML files, simply add this link with the data-infisearch-link
attribute.
<span data-infisearch-link="https://www.google.com"></span>
This data mapping configuration is already implemented by default, shown by the below snippet.
"loaders": {
"HtmlLoader": {
"selectors": {
"span[data-infisearch-link]": {
"attr_map": {
"data-infisearch-link": "link"
}
}
}
}
}
Filters
Multi-Select Filters
Multi-select filters, for example the ones you see in this documentation's search ("User Guide", "Advanced"), allow users to filter for results belonging to one or more categories.
For this guide, let's suppose we have a bunch of weather forecast articles and want to support filtering them by the weather (sunny, warm, cloudy).
First, set up a custom field inside the indexer configuration file.
"fields_config": {
"fields": {
"weatherField": {
"storage": [{ "type": "enum" }]
}
}
}
The "storage": [{ "type": "enum" }]
option tells InfiSearch to store the first seen value of this field for each document, but we'll need to tell InfiSearch where the data for this field comes from next.
Let's assume we're dealing with a bunch of HTML weather forecast articles, which use the HTMLLoader
. In particular, these HTML files store the weather inside a specific element with an id="weather"
.
"indexing_config": {
"loaders": {
"HtmlLoader": {
"selectors": {
// Match elements with an id of weather
"#weather": {
// And index its contents into our earlier defined field
// You can also use attributes, see the HTMLLoader documentation
"field_name": "weatherField"
}
}
}
}
}
Lastly, we need to tell InfiSearch's UI to set up a multi-select filter using this field. To do so, add the following to your init
call.
infisearch.init({
...
uiOptions: {
multiSelectFilters: [
{
fieldName: 'weatherField', // matching our earlier defined field
displayName: 'Weather',
defaultOptName: 'Probably Sunny!',
collapsed: true, // only the first header is initially expanded
},
// You can setup more filters as needed following the above procedures
]
}
})
The displayName
option tells the UI how to display the multi-select's header. We simply use a capitalised "Weather" in this case.
Some of the weather forecast articles indexed may also be missing the id="weather"
element, for example due to a bug in generating the article, and therefore lack an enum value. InfiSearch internally assigns such documents a default enum value. The defaultOptName
option specifies the name of this default enum value as seen in the UI.
Numeric Filters
You can also create minimum-maximum numeric filters with InfiSearch. These can be any of <input type="number|date|datetime-local" />
.
Continuing the same example as multi-select filters, let's suppose we also want to support filtering weather forecast articles by their number of page views. These page views are stored in the data-pageviews
attribute of the element with an id="weather"
.
First, we define a signed integer field.
"fields_config": {
"fields": {
"pageViewsField": {
"storage": [{
"type": "i64",
// Default number of page views if there is none
"default": 0,
// Parse the data seen as a signed integer
// Datetimes and floats are also supported, see the above linked documentation
"parse": { "method": "normal" }
}]
}
}
}
Next, we map the data from the data-pageviews
attribute into the above field.
"indexing_config": {
"loaders": {
"HtmlLoader": {
"selectors": {
// Match elements with an id of weather
"#weather": {
// And index its data-pageviews attribute into our earlier defined field
"attr_map": {
"data-pageviews": "pageViewsField"
}
}
}
}
}
}
Lastly, we tell InfiSearch's UI to set up a numeric filter using this field. To do so, add the following to your infisearch.init
call.
infisearch.init({
...
uiOptions: {
numericFilters: [
{
fieldName: 'pageViewsField',
displayName: 'Number of Views',
type: 'number', // 'date' and 'datetime-local' are also supported
minLabel: 'Min',
maxLabel: 'Max',
}
]
}
})
Sorting by Numbers & Dates
Results can also be sorted by numeric fields. Let's suppose we want to support sorting weather forecast articles by their date posted. The date is stored in an element with the data-date-posted
attribute.
First, define a numeric field that can store any signed 64-bit integer.
"fields_config": {
"fields": {
"datePostedField": {
"storage": [{
"type": "i64",
// Default UNIX timestamp.
// In this case, we use "0", which falls on Jan 1 1970 00:00 UTC.
"default": 0,
// Parse the data seen as a date.
// Integers, floats, and other datetime formats are also supported,
// see the above linked documentation.
"parse": {
"method": "datetime",
"datetime_fmt": "%Y %b %d %H:%M %z"
}
}]
}
}
}
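To make the datetime_fmt above concrete, the sketch below produces a string in that shape from a JavaScript Date. It assumes the specifiers carry their usual strftime meanings (%Y four-digit year, %b abbreviated English month name, %d zero-padded day, %H:%M 24-hour time, %z a numeric UTC offset such as +0000); formatDatePosted is a hypothetical helper for generating attribute values, not part of InfiSearch.

```typescript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Formats a Date in UTC as "%Y %b %d %H:%M %z", e.g. for a data-date-posted attribute.
function formatDatePosted(date: Date): string {
  const pad = (n: number) => (n < 10 ? `0${n}` : String(n));
  return (
    `${date.getUTCFullYear()} ${MONTHS[date.getUTCMonth()]} ${pad(date.getUTCDate())} ` +
    `${pad(date.getUTCHours())}:${pad(date.getUTCMinutes())} +0000`
  );
}
```

Under these assumptions, 18 December 2022 at 04:15 UTC would be written as "2022 Dec 18 04:15 +0000".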
Next, map the data from the data-date-posted
attribute into the above field.
"indexing_config": {
"loaders": {
"HtmlLoader": {
"selectors": {
// Match elements with the attribute
"[data-date-posted]": {
// And index the attribute into our earlier defined field
"attr_map": {
"data-date-posted": "datePostedField"
}
}
}
}
}
}
Lastly, configure InfiSearch's UI to set up the UI dropdown using this field.
infisearch.init({
...
uiOptions: {
sortFields: {
datePostedField: {
asc: 'Date: Oldest First',
desc: 'Date: Latest First',
},
},
}
})
Altering HTML Outputs
This page covers customising the result preview HTML output structure.
Some use cases for this include:
- The default HTML structure is not sufficient for your styling needs
- You want to override or insert additional content sourced from your own fields (e.g. an image)
- You want to change the default use case of linking to a web page entirely (e.g. use client side routing)
💡 If you only need to style the dropdown or search popup, you can include your own css file to do so and / or override the variables exposed by the default css bundle.
The only API option is similarly specified under the uiOptions
key of the root configuration object.
infisearch.init({
uiOptions: {
listItemRender: ...
}
});
Its interface is as follows:
type ListItemRender = (
  h: CreateElement,
  opts: Options, // what you passed to infisearch.init
  result: Result,
  query: Query,
) => Promise<HTMLElement>;
If you haven't, you should also read through the Search API documentation on the Result
and Query
parameters.
h
function
This is an optional helper function you may use to create elements.
The method signature is as such:
export type CreateElement = (
// Element name
name: string,
// Element attribute map
attrs: { [attrName: string]: string },
/*
Child elements (HTMLElement) OR text nodes (string)
String parameters are automatically escaped.
*/
...children: (string | HTMLElement)[]
) => HTMLElement;
Accessibility and User Interaction
To ensure that combobox controls work as expected, you should also ensure that the appropriate elements are labelled with role='option'
(and optionally role='group'
).
Elements with role='option'
will also have the .focus
class applied to them once they are visually focused. You can use this class to style the option.
Granularity
Currently, this API is moderately lengthy, performing things such as limiting the number of sub matches (heading-content pairs) per document, formatting the relative file path of documents into a breadcrumb form, etc.
There may be room for breaking this API down further; please raise a feature request if you have any suggestions!
Source Code
See the source to get a better idea of using this API.
Incremental Indexing
Incremental indexing is also supported by the indexer CLI tool.
Detecting deleted, changed, or added files is done by storing an internal file path -> last modified timestamp map.
To use it, simply pass the --incremental
or -i
option when running the indexer.
You will most likely not need to dabble with incremental indexing, unless your collection is extremely large (e.g. > 200MB).
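The change detection idea can be illustrated by comparing two path-to-timestamp maps. This is a conceptual sketch only, not the indexer's actual (Rust) implementation; Snapshot and diffSnapshots are hypothetical names.

```typescript
// file path -> last modified time (any comparable number, e.g. milliseconds)
type Snapshot = Map<string, number>;

function diffSnapshots(previous: Snapshot, current: Snapshot) {
  const added: string[] = [];
  const changed: string[] = [];
  const deleted: string[] = [];

  current.forEach((mtime, path) => {
    if (!previous.has(path)) added.push(path);                // new file
    else if (previous.get(path) !== mtime) changed.push(path); // timestamp differs
  });
  previous.forEach((_mtime, path) => {
    if (!current.has(path)) deleted.push(path);               // no longer present
  });

  return { added, changed, deleted };
}
```

Only the files in added and changed would need to be indexed again, while those in deleted would be invalidated.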
Content Based Hashing
The default change detection currently relies on the last modified time in file metadata. This may not always be guaranteed by the tools that generate the files InfiSearch indexes, or be an accurate reflection of whether a file's contents were updated.
If file metadata is unavailable for any given file, the file would always be re-indexed as well.
You may specify the --incremental-content-hash
option in such a case to opt into using a crc32 hash comparison for all files instead. This option should also be specified when running a full index and intending to run incremental indexing somewhere down the line.
It should only be marginally more expensive for the majority of cases, and may be the default option in the future.
Circumstances that Trigger a Full (Re)Index
Note also that the following circumstances will forcibly trigger a full reindex:
- If the output folder path does not contain any files indexed by InfiSearch
- It contains files indexed by a different version of InfiSearch
- The configuration file (infi_search.json) was changed in any way
- Usage of the --incremental-content-hash option changed
Caveats
There are some additional caveats to note when using this option. Whenever possible, try to run a full reindex of the documents, utilising incremental indexing only when indexing speed is of concern, for example when supporting an "incremental" build mode in static site generators.
Small Increase in File Size
As one of the core ideas of InfiSearch is to split up the index into many tiny parts, the incremental indexing feature works by "patching" only relevant index files containing terms seen during the current run. Deleted documents are handled using an invalidation bit vector. Hence, there might be a small increase in file size due to these unpruned files.
However, if these "irrelevant" files become relevant again in a future index run, they will be pruned.
Collection Statistics
Collection statistics used to rank documents will tend to drift off when deleting documents (which also entails updating documents). This is because such documents may contain terms that were not encountered during the current run of incremental indexing (from added / updated documents). Detecting such terms is difficult, as there is no guarantee the deleted documents are available anymore. The alternative would be to store such information in a non-inverted index, but that again takes up extra space =(.
As such, the information for these terms may not be "patched". You may notice some slight drifting in the relative ranking of documents returned after some number of incremental indexing runs, until said terms are encountered again in some other document.
File Bloat
When deleting documents or updating documents, old field stores are not removed. This may lead to file bloat after many incremental indexing runs.