Introduction

InfiSearch is a client-side search solution made for static sites, including a search UI and library depending on a pre-built index generated by a CLI tool.

Features

  • Relevant Search 🔍: spelling correction, automatic prefix search, boolean and phrase queries, BM25 scoring, proximity scoring, facet filters and more…

  • Speedy 🏇: WebAssembly & WebWorker powered, enabling efficient, non-blocking query processing. Also includes persistent caching to minimize network requests, and a multi-threaded CLI indexer powered by Rust.

  • Semi-Scalable, achieved by optionally splitting the index into tiny morsels, complete with incremental indexing.

  • A customisable, accessible user interface 🖥️

  • Support for multiple file formats (.json, .csv, .pdf, .html) to satisfy more custom data requirements.

Search Features

A little more about some of InfiSearch’s search features.

Blazing Fast

Powered by WebAssembly and WebWorkers, InfiSearch blazes through searches on tens of thousands of documents. Index downloads are persistently cached using the Cache API, the same API that backs service workers, but without the setup hassle of one. Users will never download the same data twice.

Some efficient, high-return compression schemes are also employed, so you get all these features without much penalty. This documentation for example, which has all features enabled, generates a main index file of just 20KB, and a dictionary of 9KB.

Scalable

A monolithic index is built by default to reduce network latency, which suffices for 90% of use cases. But, you also have the option of splitting up the index so users retrieve only what’s necessary, greatly improving client-side search scalability.

Ranking Model & Query Refinement

InfiSearch adopts industry standard scoring schemes. Queries are first ranked using the BM25 model, then a soft disjunctive maximum of the document’s field scores is taken. By default, <title>, <h1>, <h2-6>, then other texts are indexed as four separate fields.

Query term proximity ranking is InfiSearch’s highlight here, and is enabled by default. Results are scaled according to how close search expressions are to one another, greatly improving contextuality of searches.

InfiSearch also gives searchers a powerful boolean query syntax, made known to them through an advanced search tips icon. You also have the option of setting up custom facet filters such as multi-select checkboxes, numeric filters, and date time filters for ease of use.

How it Works:

InfiSearch depends on a static, pre-built index that is a collection of various files.

  1. The CLI indexer tool first generates:
    • Binary index chunk(s)
    • JSON field store(s) containing raw document texts
    • Supporting metadata, for example the search dictionary
  2. The search UI:
    1. Figures out which index files are needed from the query
    2. Retrieves the files from cache/memory/network requests
    3. Obtains and ranks the result set
    4. Lastly, retrieves field stores from cache/memory/network requests progressively to generate result previews

Getting Started

This page assumes the use case of a static site, that is:

  • You have some HTML files you want to index.

  • These HTML files are served in a static file server, and are linkable to.

  • You have an <input> element for attaching a search dropdown.

    For mobile devices
    A fullscreen modal will show when the input element is focused.

    This documentation uses an alternative user interface (try the search function!), which is covered later. To preview the defaults, head on over here.

Installing the indexer

There are a couple of options for installing the indexer:

  • Install the global npm package with npm install -g @infisearch/cli.
  • If you have the Rust / Cargo toolchain set up, run cargo install infisearch --vers 0.10.1.
  • You can also grab the cli binaries here.

Running the indexer

Run the executable as such, replacing <source-folder-path> with the relative or absolute folder path of your source html files, and <output-folder-path> with your desired index output folder.

infisearch <source-folder-path> <output-folder-path>

If you are using the binaries, replace infisearch with the appropriate executable name.

Other CLI Options

  • -c <config-file-path>: Changes the config file location (relative to the source-folder-path).
  • --preserve-output-folder: All existing contents in the output folder are removed before starting. Specify this option to avoid this.

Installing the search UI

Installation via CDN

<!-- Replace "v0.10.1" as appropriate -->

<!--  Search UI script -->
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii.bundle.js"></script>
<!-- Search UI css, this provides some basic styling for the search dropdown, and can be omitted if desired -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-light.css" />

⚠️ Ensure the linked versions match the indexer version used exactly.

Hosting the Files

If you wish to host the files yourself, you can find them in the <output-folder-path>/assets directory generated by the indexer. Using these guarantees that you will always be using the same indexer and search UI versions.

The folder contains:

  • A pair of language-specific files that should be served from the same folder:
    • search-ui.*.bundle.js, the default is search-ui.ascii.bundle.js
    • An accompanying WebAssembly binary
  • A stylesheet: search-ui-basic/light/dark.css

The same files are also in the release packages here, inside search.infi.zip.

UI Initialisation

Once you have loaded the bundles, simply call the infisearch.init function in your page.

By default, this requires an input element with id=infi-search to be present in the page. The id can be configured via uiOptions.input.

infisearch.init({
  searcherOptions: {
    // Output folder url specified as the second parameter in the cli command
    // Urls like '/output/' will work as well
    url: 'http://<your-domain>/output/',
  },
  uiOptions: {
    // Input / source folder url, specified as the first parameter in the cli command
    sourceFilesUrl: 'http://<your-domain>/source/',
    input: 'infi-search',
  }
});

mdbook-infisearch

mdbook-infisearch is a simple search plugin replacement for mdBook to use InfiSearch’s search interface and library instead of elasticlunr.js.

What, why?

mdBook already has its own built-in search function utilising elasticlunr, which works well enough for most cases. This plugin was mainly created as:

  1. A proof-of-concept to integrate InfiSearch with other static site generators easily
  2. A personal means to set up document deployment workflows in CI scripts

You may nonetheless want to use this plugin if you need InfiSearch’s extra features. Some examples:

  • you require PDF file support, or JSON file support to link to out-of-domain pages.
  • spelling correction, automatic prefix search, term proximity ranking, etc.

Styling

This plugin uses the css variables provided by the 5 main default themes in mdBook to style the search user interface. Switch the themes in this documentation to try out the different themes!

Note: The default InfiSearch theme is not included in the plugin. To see the default styling, head on over to the styling page or view the demo site.

Installation

Install the executable either using cargo install mdbook-infisearch, or download and add the binaries to your PATH manually.

Then, minimally add the first two configuration sections below to your book.toml configuration file:

[output.html.search]
# disable the default mdBook search feature implementation
enable = false

[preprocessor.infisearch]
command = "mdbook-infisearch"

[output.infisearch]  # this header should be added
# Plugin configuration options (optional)
# See search configuration page, or use the buttons below
mode = "target"

# Relative path to an InfiSearch indexer configuration file from the project directory.
#
# If you are creating this for the first time, let this point to a non-existent file
# and the config file will be created with InfiSearch's settings tailored for mdBook.
config = "infi_search.json"

Preview

Use the following (non-canonical, documentation specific) buttons to try out the different mode parameters.

You can also try out the different themes on this documentation using mdBook’s paintbrush icon!

Content Security Policy

WebAssembly CSP

InfiSearch runs using WebAssembly. If you are using a restrictive content security policy, WebAssembly as a whole unfortunately currently requires adding the script-src 'unsafe-eval'; directive.

Without it, an error will show up, for example in Chrome as the following extremely detailed error message:

Uncaught (in promise) CompileError: WebAssembly.instantiateStreaming(): Refused to compile or instantiate WebAssembly module because ‘unsafe-eval’ is not an allowed source of script in the following Content Security Policy directive: ‘…’

Support for a more specific script-src 'wasm-unsafe-eval'; directive has landed in Chrome, Edge and Firefox, but is still pending in Safari.

WebWorker CSP

InfiSearch also utilises a blob URL to load its WebWorker. This shouldn’t pose as much of a security concern since blob URLs can only be created by scripts already executing within the browser.

To whitelist this, add the script-src blob:; directive.

CDN CSP

Naturally, if you load InfiSearch assets from the CDN, you will also need to whitelist it in the script-src cdn.jsdelivr.net; and style-src cdn.jsdelivr.net; directives.
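
As a rough sketch of how the three requirements above combine, the following helper assembles a single Content-Security-Policy header value. The name buildCsp, its options, and the 'self' baseline source are assumptions made for illustration; in a real header each directive name is followed by a space and its sources, and directives are separated by semicolons.

```javascript
// Hypothetical helper: builds a CSP header value covering WebAssembly,
// the blob: WebWorker, and (optionally) the jsDelivr CDN.
// 'self' is an assumed baseline source; adjust it to your own policy.
function buildCsp({ useCdn = false, wasmSource = "'wasm-unsafe-eval'" } = {}) {
  const scriptSrc = ["'self'", wasmSource, 'blob:'];
  const styleSrc = ["'self'"];
  if (useCdn) {
    scriptSrc.push('cdn.jsdelivr.net');
    styleSrc.push('cdn.jsdelivr.net');
  }
  return `script-src ${scriptSrc.join(' ')}; style-src ${styleSrc.join(' ')}`;
}
```

Pass wasmSource: "'unsafe-eval'" if you need to support browsers that lack the more specific directive.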

Search Configuration

All options here are provided through the infisearch.init function exposed by the search bundle.

There are 2 categories of options: the first relates to the user interface, and the other to search functionality.

Search UI Options

Search UI options are organised under the uiOptions key:

infisearch.init({
    uiOptions: {
        // ... options go here ...
    }
})

Site URL

sourceFilesUrl

  • Example: '/' or 'https://www.infi-search.com'

This option allows InfiSearch to construct a link to the page for search result previews. This is done by appending the relative file path of the indexed file.

Unless you are providing all links manually (see Linking to other pages), this URL must be provided.

Input Element

| Option | Default Value | Description |
| --- | --- | --- |
| input | 'infi-search' | id of the input element, or an HTML element reference |
| inputDebounce | 100 | debounce time of keystrokes to the input element |
| preprocessQuery | (q) => q | any function for preprocessing the query. Can be used to add a field filter for example. |

The input element is required in most cases. Its behaviour depends on the UI mode.
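
A minimal sketch of a preprocessQuery function is shown below. It only normalises whitespace; the inputDebounce value of 200 is an arbitrary example. Trailing whitespace is deliberately kept (as a single space), since this documentation describes a trailing space as the signal that the user has finished typing the last term.

```javascript
// Strips leading whitespace and collapses repeated whitespace into one space.
// A trailing space, if present, survives as a single space.
function preprocessQuery(q) {
  return q.replace(/^\s+/, '').replace(/\s+/g, ' ');
}

// Example uiOptions fragment using the options from the table above.
const uiOptions = {
  input: 'infi-search',
  inputDebounce: 200, // example value only
  preprocessQuery,
};
```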

UI Mode

mode: 'auto'

The search UI provides 4 main behaviours.

| Mode | Details |
| --- | --- |
| auto | This uses the fullscreen mode for a mobile device, and dropdown otherwise. This adjustment is rerun whenever the window is resized. |
| dropdown | This wraps the provided input element in a wrapper container, then creates a dropdown next to it to house InfiSearch’s UI. |
| fullscreen | This creates a distinct modal (with its own search input, close button, etc.) and appends it to the page <body>. If the input element is specified, a click handler is attached to open this UI so that it functions as a button. For default keyboard accessibility, some minimal and overridable styling is also applied to this button. This UI can also be toggled programmatically, removing the need for the input. |
| target | This option is the most flexible, and is used by the mdBook plugin (this documentation). Search results are then output to a custom target element of choice. |

Use the following buttons to try out the different modes. The default in this documentation is target.

UI Mode Specific Options

There are also several options specific to each mode. dropdown and fullscreen options are also applicable to the auto mode.

| Mode | Option | Default | Description |
| --- | --- | --- | --- |
| dropdown | dropdownAlignment | 'bottom-end' | 'bottom' or 'bottom-start' or 'bottom-end'. The alignment will be automatically flipped horizontally to ensure optimal placement. |
| fullscreen | fsContainer | <body> | id of, or an element reference to, the element to attach the modal to. |
| fullscreen | fsScrollLock | true | Scroll locks the body element when the fullscreen UI is opened. |
| target | target | undefined | id of, or an element reference to, the element to attach the UI to. |

General Options

| Option | Default | Description |
| --- | --- | --- |
| tip | true | Shows the advanced search tips icon on the bottom right. |
| maxSubMatches | 2 | Maximum headings to show for a result preview. |
| resultsPerPage | 10 | Number of results to load when ‘load more’ is clicked. |
| useBreadcrumb | false | Prefer using the file path as the result preview’s title. This is formatted into a breadcrumb, transformed to Title Case. Example: 'documentation/userGuide/my_file.html' → Documentation » User Guide » My File. |
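
The breadcrumb formatting described for useBreadcrumb can be approximated with the sketch below. It is only an illustration of the transformation (split on path separators, break up camelCase and snake_case, then Title Case); toBreadcrumb is a hypothetical name and InfiSearch’s own rules may differ in edge cases.

```javascript
// Approximates the file path -> breadcrumb formatting described above.
function toBreadcrumb(relativePath) {
  return relativePath
    .replace(/\.[^./]+$/, '') // drop the file extension
    .split('/')
    .filter(Boolean)
    .map((segment) =>
      segment
        .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // userGuide -> user Guide
        .replace(/[_-]+/g, ' ')                 // my_file -> my file
        .split(' ')
        .filter(Boolean)
        .map((word) => word[0].toUpperCase() + word.slice(1))
        .join(' '))
    .join(' » ');
}
```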

Setting Up Enum Filters ∀

Enum fields you index can be mapped into UI multi-select dropdowns. In this documentation for example, mdBook’s section titles (“User Guide”, “Advanced”) are mapped.

Set up bindings under uiOptions like so:

multiSelectFilters: [
  {
    fieldName: 'partTitle',  // name of field definition
    displayName: 'Section',  // UI header text
    defaultOptName: 'None',
    collapsed: true,         // only the first header is initially expanded
  },
]

Documents that do not have an enum value are assigned an internal default enum value. The option text of this enum value to show is specified by defaultOptName.

Setting Up Numeric Filters and Sort Orders

Indexed numeric fields can be mapped into minimum-maximum filters of <input type="number|date|datetime-local" />, or used to create custom sort orders.

Minimum-Maximum Filters

numericFilters: [
  {
    fieldName: 'pageViewsField',
    displayName: 'Number of Views',
    type: 'number', // or 'date' or 'datetime-local'
    // Text above date, datetime-local filters and placeholder text for number filters
    // Also announced to screenreaders
    minLabel: 'Min',
    maxLabel: 'Max',
  }
]

Sorting by Numbers, Dates

sortFields: {
  // Map of the name of your numeric field to names of UI options
  price: {
    asc: 'Price: Low to High',
    desc: 'Price: High to Low',
  },
},

Manually Showing / Hiding the Fullscreen UI

Call the showFullscreen() and hideFullscreen() functions returned by infisearch.init to programmatically show/hide the fullscreen search UI.

// These methods can be used under mode="auto|fullscreen"
const { showFullscreen, hideFullscreen } = infisearch.init({ /* ... options ... */ });

Client Side Routing

To override the link click handler, use the specially provided parameter onLinkClick.

uiOptions: {
  onLinkClick: function (ev) {
    /*
     By default, this function is a thunk.
     Call ev.preventDefault() and client-side routing code here.
     Access the anchor element using "this".
    */
  }
}
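
As an illustration, the sketch below builds an onLinkClick handler that hands the clicked URL to a client-side router. makeOnLinkClick and navigate are hypothetical names; the history.pushState fallback is only an assumed stand-in for whatever routing call your framework provides.

```javascript
// `navigate` is a hypothetical function supplied by your client-side router.
function makeOnLinkClick(navigate) {
  // A regular function (not an arrow function), so `this` is the anchor element.
  return function onLinkClick(ev) {
    ev.preventDefault();                 // stop the full page navigation
    navigate(this.getAttribute('href')); // hand the URL to the router
  };
}

const uiOptions = {
  onLinkClick: makeOnLinkClick((href) => {
    // Assumed stand-in for a real router call.
    if (typeof history !== 'undefined') history.pushState({}, '', href);
  }),
};
```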

Changing The Mobile Device Detection Method

If the client is a “mobile device”, the fullscreen UI is used under mode='auto'. The check is done with a media query, which can be overridden:

uiOptions: {
  // Any function returning a boolean
  isMobileDevice: () =>
    window.matchMedia('only screen and (max-width: 768px)').matches,
}

Search Functionality Options

The options for the search functionality itself are rather brief:

infisearch.init({
    searcherOptions: {
        // URL of output directory generated by the CLI tool
        url: 'http://192.168.10.132:3000/output/',

        // ---------------------------------------------------------------
        // Optional Parameters
        maxAutoSuffixSearchTerms: 3,
        maxSuffixSearchTerms: 5,

        useQueryTermProximity: true,

        // Maximum number of results (unlimited if null).
        resultLimit: null,

        // ------------------------------
        // Caching Options

        // Caches **all** texts of storage=[{ "type": "text" }] fields up front,
        // to avoid network requests when generating result previews.
        // Discussed in the "Larger Collections" chapter.
        cacheAllFieldStores: undefined,

        // Any index chunk larger than this number of bytes
        // will be persistently cached once requested.
        plLazyCacheThreshold: 0,
        // ------------------------------

        // ---------------------------------------------------------------
    },
});

maxAutoSuffixSearchTerms = 3

Stemming is turned off by default. This means a bigger dictionary (though usually not by much) and lower recall, in exchange for much more precise searches.

To keep recall up, an automatic wildcard suffix search is performed on the last query term of a free text query, and only if the query does not end with a whitespace (an indicator of whether the user has finished typing).
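
The rule above can be sketched as a small function that returns the term eligible for the automatic suffix search, if any. This is a simplification for plain free-text queries (it ignores boolean and phrase syntax), and autoSuffixCandidate is a hypothetical name rather than part of InfiSearch’s API.

```javascript
// Returns the last term if it qualifies for an automatic suffix search,
// or null when the query is empty or ends with whitespace.
function autoSuffixCandidate(query) {
  if (query.length === 0 || /\s$/.test(query)) return null; // user finished typing
  const terms = query.split(/\s+/).filter(Boolean);
  return terms.length > 0 ? terms[terms.length - 1] : null;
}
```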

maxSuffixSearchTerms = 5

This controls the maximum number of terms to search for manual wildcard suffix searches.

Term Proximity Ranking

useQueryTermProximity = true

If positions are indexed, document scores are also scaled by how close query expressions or terms are to each other. This boosts result relevance significantly.

Caching Options (Advanced)

This is discussed more in the chapter on larger collections.

Styling

Themes

InfiSearch provides 3 built-in themes by default, which correspond to the 3 stylesheets in the releases.

These 3 stylesheets also expose a wide range of css variables which you can alter as needed.

Head on over to the demo site here to try them out!

Light

<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-light.css" />

Preview

Preview of light theme Preview of light theme (dropdown)

Basic

<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-basic.css" />
Preview of basic theme (dropdown)

Dark

<!-- Replace "v0.10.1" as appropriate -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui-dark.css" />

Preview

Preview of dark theme (dropdown)

Styling the Fullscreen UI Input Button

InfiSearch is minimally invasive in styling your <input> element (except for the one that comes with the fullscreen UI), leaving this to your site’s design.

Notably however, for accessibility, some minimal styling is applied when using the fullscreen UI to convey the intention of a button (which opens the fullscreen UI). This is limited to:

  • A background & box-shadow & color application on focus

    These are applied with a !important modifier as they are key to conveying keyboard focus, but are also overridable easily with InfiSearch’s css variables.

  • cursor: pointer application on hover

Applying Input Button Styles under mode='auto'

If using the default UI mode of auto, which switches between the dropdown and fullscreen UI dynamically, you can also set a different placeholder, and/or use the .infi-button-input selector to apply your styles only if the fullscreen UI is used. For example,

.infi-button-input:focus:not(:hover) {
    background: #6c757d !important;
}

Indexer Configuration

All indexer configurations are sourced from a json file. By default, the cli tool looks for infi_search.json in the source folder (first argument specified in the command).

This can be changed using the -c <config-file-path> option.

Indexer Field Configuration

Every document you index contains multiple fields. By default, InfiSearch comes baked in with the configurations needed for supporting static site search.

Default Field Configuration

It may be helpful to first understand the default fields as examples, and how they are used in the UI:

{
  "fields_config": {
    "fields": {
      "title":        { "weight": 2.0 },
      "h1":           { "weight": 2.0 },
      "heading":      { "weight": 1.5 },
      "body":         { "weight": 1.0 },
      // The default weight is 0.0. These fields are stored, but not searchable.
      "headingLink":  {},
      "link":         {},
      "_relative_fp": {} // An internal, reserved field (see "Reserved Fields")
    }
  }
}

| Field | Source | UI Usage |
| --- | --- | --- |
| h1, title | <h1>, <title> | Result preview’s title. When unavailable, the _relative_fp field is displayed as a breadcrumb. |
| heading | <h2-6> | Result preview sub match’s heading. |
| headingLink | <h2-6 id=".."> | Result preview sub match’s #anchor link. |
| body | <body> | Result preview sub match’s main text. |
| _relative_fp | Relative file path from the source indexer folder | Result preview’s <a> link by concatenating sourceFilesUrl to _relative_fp |
| link | User supplied override for linking to other pages | Result preview’s <a> link. Convenience default field to support custom overrides for links easily (e.g. when indexing a JSON document). |


Adding Fields

You can add your own fields to index for free-text search, create categorical and/or numeric facet filters, and custom numeric sort orders.

However, the user interface only incorporates the default set of fields in result highlighting. If you need to incorporate additional fields, for example to display an icon beside each result, you can alter the HTML outputs, or use the search API.

Removing Default Fields

If you are using InfiSearch as a general-purpose client side search tool, you can assign a value of null to remove default field definitions completely as a minor optimization. Alternatively, merge_default_fields: false removes all default field definitions.

{
  "fields_config": {
    "fields": {
      "h1": null
    },
    "merge_default_fields": false
  }
}

Reserved Fields

Reserved fields are prefixed with an underscore _, and are hardcoded into the indexer to perform special functions.

  • _relative_fp: the relative path from your source folder to the file.

  • _add_files: This field allows you to index multiple files as a single document, which can be useful for overriding or extending data. See this section under indexing for more details.

Field Scoring

{
  "fields_config": {
    "fields": {
      "title": { "weight": 2.0 }
    }
  }
}

weight=0.0

This parameter is a boost / penalty multiplier applied to an individual field’s score.

Specifying 0.0 also results in the field not being indexed into InfiSearch’s inverted index at all, meaning that searches will never match terms in this field. Combined with the storage parameter, this is useful for creating a field that is only stored for custom sort orders, facet filters, or UI purposes (e.g. the _relative_fp field).

k=1.2 & b=0.75

These are scoring parameters that control the impact of term frequency and document lengths. The following article provides a good overview on how to configure these, if needed.

All default fields except titles and headings use the above default parameters.
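
To give a feel for how weight, k and b interact, here is a sketch of the textbook BM25 contribution of a single term within a single field. This page does not confirm the exact variant InfiSearch implements, so treat bm25TermScore and its parameter names as illustrative assumptions; idf is supplied by the caller.

```javascript
// Textbook BM25 term score for one field, scaled by the field weight.
// k controls term frequency saturation, b controls length normalisation.
function bm25TermScore({ tf, idf, fieldLength, avgFieldLength, k = 1.2, b = 0.75, weight = 1.0 }) {
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return weight * idf * (tf * (k + 1)) / (tf + k * lengthNorm);
}
```

With an average-length field and tf = 1, the fraction evaluates to 1, so the result is simply weight × idf. Longer-than-average fields score lower, and setting b to 0 removes the length effect entirely.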

Field Storage

{
  "fields_config": {
    "fields": {
      "title": {
        "storage": [{ "type": "text" }]  // defaults
      }
    }
  }
}

As with most free-text search tools, InfiSearch relies on an inverted index mapping terms to source documents.

Once the result set is obtained, each result document’s data could still be useful. For example, a document’s original title is essential for generating a human-readable result preview.

InfiSearch provides 3 storage types:

1. text

In this format, the document’s raw texts are stored in a JSON file as a series of [fieldName, fieldText] pairs following the order they were seen.

This “positioned” model enables constructing detailed result preview hierarchies you see in InfiSearch’s UI currently: Title > Heading > Text under Heading

2. enum

This storage format stores a single value for each indexed document, and is useful for categorical data. Only the first such occurrence is stored if there are multiple. These values can be queried using the search API or used to create multi-select filters in the search UI.

In this documentation for example (and the mdBook plugin), there is a multi-select checkbox filter that can be used to filter each page by its mdBook section title (“User Guide”, “Advanced”).

Notes:

  • Documents without enum values are internally assigned a default enum value that can also be queried.
  • While it is unlikely you will need more, there is a hard limit of 255 possible values for your entire document collection. Excess values are discarded, and the CLI tool will print a warning.
  • You can also use InfiSearch’s flexible boolean syntaxes to filter documents. Using this option however allows a simplifying assumption to store these values more compactly and enables creating UI multi-select filters easily.
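
If you are unsure whether a field fits within the limit, a quick pre-flight check over your own data can help. The sketch below assumes docs is an array of plain objects you have loaded yourself; the helper names are hypothetical, and whether the internal default value counts towards the 255 limit is not stated here, so leave some headroom.

```javascript
const MAX_ENUM_VALUES = 255; // hard limit stated above

// Collects the distinct values of one field across a document collection.
function distinctEnumValues(docs, fieldName) {
  const values = new Set();
  for (const doc of docs) {
    const value = doc[fieldName];
    if (value !== undefined && value !== null) values.add(String(value));
  }
  return values;
}

// True when the field has more distinct values than an enum can hold.
function exceedsEnumLimit(docs, fieldName) {
  return distinctEnumValues(docs, fieldName).size > MAX_ENUM_VALUES;
}
```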

3. i64

This format stores a single 64-bit signed integer value for each document. Only the first such occurrence is stored.

{
  "fields_config": {
    "fields": {
      "price": {
        "storage": [{
          "type": "i64",
          "default": 1,
          "parse": "integer"
        }]
      }
    }
  }
}

3 parsing strategies are available currently:

  1. integer: a signed 64-bit integer
  2. round: a double-precision floating-point number, rounded to the nearest integer
  3. datetime: any date time string. This string is parsed using a datetime_fmt format specifier as outlined in the Chrono crate’s DateTime::parse_from_str method. The value is stored in seconds, as a UNIX timestamp relative to 1 Jan 1970 00:00 UTC.
    {
      "type": "i64",
      "default": 1,
      "parse": {
        "method": "datetime",
        "datetime_fmt": "%Y %b %d %H:%M %z",
    
        // ----------------------
        // Optional
    
        // If your datetime_fmt has no timezone,
        // specify it in seconds here, relative to UTC
        "timezone": 0,
    
        // If your datetime_fmt has no H, M, and timezone
        // specify the default time of day in seconds here
        "time": 0,
    
        // ----------------------
      }
    }
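
To illustrate the kind of value the datetime strategy stores, the sketch below converts an ISO 8601 string into UNIX seconds in JavaScript. It uses the built-in Date.parse rather than Chrono’s format specifiers, and toUnixSeconds is a hypothetical helper, for instance for working out values to compare against on the client side.

```javascript
// Converts an ISO 8601 date time string into whole seconds since
// 1 Jan 1970 00:00 UTC, the unit described above.
function toUnixSeconds(isoDateTime) {
  const ms = Date.parse(isoDateTime);
  if (Number.isNaN(ms)) throw new Error(`Unparseable date: ${isoDateTime}`);
  return Math.floor(ms / 1000);
}
```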
    

i64 fields can be used for facet search for:

  • Creating numeric or datetime min-max filters in the UI easily and/or filtering them in the Search API
  • Sorting results by these fields in the UI or API

Indexer Data Configuration

The configurations in this page specify how (mapping file data to fields) and which files to index.

InfiSearch’s defaults are sufficient to index most HTML files, but if not, you can also configure how the content mapping is done. Enabling support for other file formats (e.g. JSON, CSV, PDF) files is also done here.

Mapping File Data to Fields

{
  "indexing_config": {
    "loaders": {
      // Default: Only HTML files are indexed
      "HtmlLoader": {}
    }
  }
}

The indexer is able to handle data from HTML, JSON, CSV, TXT, or PDF files. Support for each file type is provided by a file Loader abstraction.

You may configure loaders by including them under the loaders key, with any applicable options.

HTML Files: loaders.HtmlLoader

The HtmlLoader is the only loader that is configured by default, as follows:

"loaders": {
  "HtmlLoader": {
    "exclude_selectors": [
      // Selectors to exclude from indexing
      "script,style,form,nav,[data-infisearch-ignore]"
    ],
    "selectors": {
      "title": {
        "field_name": "title"
      },

      "h1": {
        "field_name": "h1"
      },

      "h2,h3,h4,h5,h6": {
        "attr_map": {
          "id": "headingLink" // stores the id attribute under the headingLink field
        },
        "field_name": "heading"
      },

      "body": {
        "field_name": "body"
      },

      "meta[name=\"description\"],meta[name=\"keywords\"]": {
        "attr_map": {
          "content": "body"
        }
      },

      // A convenient means to override the link used in the result preview
      // See "Linking to other pages" for more information
      "span[data-infisearch-link]": {
        "attr_map": {
          "data-infisearch-link": "link"
        }
      }
    },
    "merge_default_selectors": true
  }
}

The HTML loader indexes a document by:

  1. Traversing the document depth-first, in the order text naturally appears.

  2. Checking if any selectors specified as keys under HtmlLoader.selectors are satisfied for each element. If so, all descendants (elements, text) of the element are indexed under the newly specified field_name, if any.

    • This process repeats as the document is traversed — if a descendant matched another different selector, the field mapping is overwritten for that descendant and its descendants.

    • The attr_map option allows indexing attributes of specific elements under fields as well.

    • All selectors are matched in arbitrary order by default. To specify an order, add a higher priority: n key to your selector definition, where n is any integer.

To exclude elements from indexing, use the exclude_selectors option, or add the in-built data-infisearch-ignore attribute to your HTML.

If needed, you can also index HTML fragments that are incomplete documents (for example, documents which are missing the <head>). To match the entire fragment, use the body selector.

Lastly, if you need to remove a default selector, simply replace its definition with null. For example, "h2,h3,h4,h5,h6": null. Alternatively, specifying "merge_default_selectors": false will remove all default selectors.
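
The traversal described above can be sketched over a simplified tree of plain objects. This is an illustration only, not the indexer’s implementation: nodes are assumed to look like { tag, text, children }, only plain tag-name selectors are handled, and attr_map and priority are not modelled. The helper names are hypothetical.

```javascript
// Expands keys such as "h2,h3" into a tag -> field_name lookup.
function expandSelectors(selectors) {
  const byTag = new Map();
  for (const [selectorList, def] of Object.entries(selectors)) {
    for (const tag of selectorList.split(',')) byTag.set(tag.trim(), def.field_name);
  }
  return byTag;
}

// Depth-first walk in document order, emitting [fieldName, text] pairs.
function collectFields(node, byTag, excluded, inheritedField = null, out = []) {
  if (excluded.has(node.tag)) return out;              // exclude_selectors
  const field = byTag.get(node.tag) ?? inheritedField; // a new match overrides the mapping
  if (node.text && field) out.push([field, node.text]);
  for (const child of node.children ?? []) {
    collectFields(child, byTag, excluded, field, out);
  }
  return out;
}
```

Descendants inherit the field of their nearest matching ancestor until another selector matches, which mirrors the override behaviour in step 2.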

JSON Files: loaders.JsonLoader

"loaders": {
  "JsonLoader": {
    "field_map": {
      "chapter_text": "body",
      "book_link": "link",
      "chapter_title": "title"
    },
    // Optional, order in which to index the keys of the json {} document
    "field_order": [
      "book_link",
      "chapter_title",
      "chapter_text"
    ]
  }
}

JSON files can also be indexed. The field_map contains a mapping of your JSON data key -> field name. The field_order array controls the order in which the data keys are indexed, which has a minor influence on query term proximity ranking.

The JSON file can be either:

  1. An object, with numbers, strings or null values
  2. An array of such objects
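
As a sketch of what field_map and field_order express, the function below turns one JSON object into ordered [fieldName, text] pairs. mapJsonDocument is a hypothetical helper; skipping unmapped keys and null values is an assumption made for the example, not documented loader behaviour.

```javascript
// Maps a JSON document's keys to field names, in the given order.
function mapJsonDocument(doc, fieldMap, fieldOrder = Object.keys(doc)) {
  const pairs = [];
  for (const key of fieldOrder) {
    const fieldName = fieldMap[key];
    const value = doc[key];
    // Assumption: unmapped keys and null values are simply skipped.
    if (fieldName === undefined || value === undefined || value === null) continue;
    pairs.push([fieldName, String(value)]);
  }
  return pairs;
}
```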

CSV Files: loaders.CsvLoader

"loaders": {
  "CsvLoader": {
    // ---------------------
    // Map data using CSV headers
    "header_field_map": {},
    "header_field_order": [],            // Optional, order to index the columns
    // ---------------------
    // Or with header indices
    "index_field_map": {
      "0": "link",
      "1": "title",
      "2": "body",
      "4": "heading"
    },
    "index_field_order": [1, 4, 2, 0],   // Optional, order to index the columns
    // ---------------------
    // Options for csv parsing, from the Rust "csv" crate
    "parse_options": {
      "comment": null,
      "delimiter": 44,
      "double_quote": true,
      "escape": null,
      "has_headers": true,
      "quote": 34
    }
  }
}

Field mappings for CSV files can be configured using one of the field_map keys. The field_order arrays control the order in which columns are indexed.

The parse_options key specifies options for parsing the csv file.
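
The numeric defaults above are byte values: 44 is the ASCII code for a comma and 34 for a double quote. A small helper such as the hypothetical toByte below can work out the value for another character, for example a semicolon delimiter.

```javascript
// Converts a single ASCII character into its byte value.
function toByte(char) {
  if (char.length !== 1 || char.charCodeAt(0) > 127) {
    throw new Error('Expected a single ASCII character');
  }
  return char.charCodeAt(0);
}

// Example parse_options for a semicolon-separated file.
const parseOptions = {
  delimiter: toByte(';'),
  quote: toByte('"'),
  has_headers: true,
};
```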

PDF Files: loaders.PdfLoader

"loaders": {
  "PdfLoader": {
    "field": "body"
  }
}

This loader indexes all content into a single field “body” by default.

The search result title would appear as <...PDF file path breadcrumb...> (PDF), and when clicked upon will open the PDF in the browser.

Text Files: loaders.TxtLoader

"loaders": {
  "TxtLoader": {
    "field": "field_name"
  }
}

This loader simply reads .txt files and indexes all their contents into a single field. This is not particularly useful without the _add_files feature that allows indexing data from multiple files as one document.

File Exclusions

{
  "indexing_config": {
    "exclude": [
      "infi_search.json"
    ],
    "include": [],

    "with_positions": true
  }
}

File Exclusions: exclude = ["infi_search.json"]

Global file exclusions can be specified in this parameter, which is simply an array of file globs.

File Inclusions: include = []

Similarly, you can specify only specific files to index. This is an empty array by default, which indexes everything.

If a file matches both an exclude and include pattern, the exclude pattern will have priority.
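The precedence rule above can be sketched as follows. This is an illustration only: `globToRegExp` is a simplified, hypothetical glob matcher written for this example, and InfiSearch's own glob handling (and the exact path each glob is matched against) may differ.

```typescript
// Simplified glob matcher for illustration only. Supports:
//   "*"  - any run of characters except "/"
//   "**" - any run of characters, including "/"
//   "?"  - a single character except "/"
function globToRegExp(glob: string): RegExp {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === '*') {
      if (glob.charAt(i + 1) === '*') {
        pattern += '.*';
        i++;
      } else {
        pattern += '[^/]*';
      }
    } else if (ch === '?') {
      pattern += '[^/]';
    } else {
      // Escape characters that are special in regular expressions
      pattern += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp('^' + pattern + '$');
}

function matchesAny(path: string, globs: string[]): boolean {
  return globs.some((g) => globToRegExp(g).test(path));
}

// An empty include list means "index everything".
// A matching exclude pattern always takes priority.
function shouldIndex(path: string, include: string[], exclude: string[]): boolean {
  if (matchesAny(path, exclude)) {
    return false;
  }
  return include.length === 0 || matchesAny(path, include);
}

console.log(shouldIndex('docs/a.html', ['docs/*.html'], ['docs/a.html']));
```

The call at the end matches both lists, so the exclude pattern wins and the file is skipped.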

Indexer Misc Configuration

Indexing Positions

{
  "indexing_config": {
    "with_positions": true
  }
}

This option controls whether positions are stored. Features that require positional information, such as phrase queries, will not work if this is disabled. Turning this off for very large collections (roughly > 1GB) can increase the tool’s scalability, at the cost of such features.

Indexer Thread Count

{
  "indexing_config": {
    "num_threads": max(min(physical cores, logical cores) - 1, 1)
  }
}
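The default expression above leaves one core free for other work, but never drops below a single thread. A small worked sketch (the function name is hypothetical):

```typescript
// Mirrors the default expression shown above:
// max(min(physical cores, logical cores) - 1, 1)
function defaultNumThreads(physicalCores: number, logicalCores: number): number {
  return Math.max(Math.min(physicalCores, logicalCores) - 1, 1);
}

console.log(defaultNumThreads(4, 8)); // 4 physical, 8 logical cores
console.log(defaultNumThreads(1, 2)); // single-core machine
```

On a machine with 4 physical and 8 logical cores this evaluates to 3, and on a single-core machine it stays at 1.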

Indexing Multiple Files Under One Document

InfiSearch regards each file as a single document by default. You can index multiple files into one document using the reserved field _add_files. This is useful if you need to override or add data but can’t modify the source document easily.

Overrides should be provided with JSON, CSV, or HTML files, as TXT and PDF files have no reliable way of supplying the _add_files field. In addition, you will need to manually map the CSV data to the _add_files field. This is automatically done for JSON and HTML files.

Suppose you have the following files:

folder
|-- main.html
|-- overrides.json

To index main.html and override its link, you would have:

overrides.json

{
  "link": "https://infi-search.com",
  "_add_files": "./main.html"
}

Indexer Configuration

{
  "indexing_config": {
    "exclude": ["main.html"]
  }
}

This prevents main.html from being indexed directly; it is indexed through overrides.json instead.

Larger Collections

⚠️ This section serves as a reference; prefer the preconfigured scaling presets if possible.

Field Configuration

{
  "fields_config": {
    "cache_all_field_stores": true,
    "num_docs_per_store": 100000000
  },
  "indexing_config": {
    "pl_limit": 4294967295,
    "pl_cache_threshold": 0,
    "num_pls_per_dir": 1000
  }
}

Field Store Caching: cache_all_field_stores

All fields specified with storage=[{ "type": "text" }] are cached up front when this is enabled. This is the same option as the one under search functionality options, and has lower priority.

Field Store Granularity: num_docs_per_store

The num_docs_per_store parameter controls how many documents’ texts to store in one JSON file. Batching multiple documents together increases file size but can lead to fewer files and better browser caching.

Index Shard Size: pl_limit

This is a threshold (in bytes) at which to “cut” index (pl meaning postings list) chunks. Increasing this produces fewer but bigger chunks (which take longer to retrieve).

Index Caching: pl_cache_threshold

Index chunks that exceed this size (in bytes) are cached by the search library on initialisation. It is used to configure InfiSearch for response time (over scalability) for typical use cases.
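To illustrate the threshold, here is a minimal sketch of the selection rule, assuming "exceed" means strictly greater than. The `IndexChunk` shape and the chunk names are hypothetical and do not reflect InfiSearch's actual file layout.

```typescript
// Hypothetical description of an index chunk, for illustration only
interface IndexChunk {
  name: string;
  sizeBytes: number;
}

// Chunks strictly larger than the threshold are cached on initialisation
function chunksToCacheUpfront(chunks: IndexChunk[], plCacheThreshold: number): IndexChunk[] {
  return chunks.filter((chunk) => chunk.sizeBytes > plCacheThreshold);
}

const exampleChunks: IndexChunk[] = [
  { name: 'chunk-a', sizeBytes: 512 },
  { name: 'chunk-b', sizeBytes: 20000 },
  { name: 'chunk-c', sizeBytes: 150000 },
];

// With the value of 0 shown above, every non-empty chunk is cached up front
console.log(chunksToCacheUpfront(exampleChunks, 0).map((c) => c.name));
// With a higher threshold, only the larger chunks are
console.log(chunksToCacheUpfront(exampleChunks, 10000).map((c) => c.name));
```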

Language Configuration

There are 3 language modules available. To configure these, you will need to serve the appropriate language bundle in your HTML (or edit the CDN link accordingly), and edit the indexer configuration file.

{
  "lang_config": {
    // ... options go here ...
  }
}

Ascii Tokenizer

The default tokenizer should work for any language that relies on ASCII characters, or their inflections (e.g. “á”).

The text is first split into sentences, then on whitespace to obtain tokens. An asciiFoldingFilter is then applied to normalize diacritics, followed by punctuation and non-word-character boundary removal.

{
  "lang": "ascii",
  "options": {
    "stop_words": [
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
      "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
      "such", "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    ],
    "ignore_stop_words": false,

    // Hard limit = 250
    "max_term_len": 80
  }
}
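The tokenization steps described above can be approximated with the sketch below. It is a rough illustration only: the real tokenizer is implemented in Rust, and details such as sentence detection, case handling, stop words and max_term_len are deliberately left out here.

```typescript
// Rough approximation of the described pipeline, for illustration only
function asciiTokenize(text: string): string[] {
  const tokens: string[] = [];

  // 1. Split into sentences (here: at ".", "!" or "?" followed by whitespace)
  const sentences = text.split(/[.!?]+\s+/);

  for (const sentence of sentences) {
    // 2. Split each sentence on whitespace
    for (const raw of sentence.split(/\s+/)) {
      // 3. Fold diacritics: decompose, then drop combining marks (á -> a)
      const folded = raw.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

      // 4. Strip punctuation and other non-word characters at the boundaries
      const term = folded.replace(/^\W+|\W+$/g, '');

      if (term.length > 0) {
        tokens.push(term);
      }
    }
  }

  return tokens;
}

console.log(asciiTokenize('Café olé! It rains.'));
```

For the sample input, the accented characters are folded to plain ASCII and the trailing punctuation is dropped from each token.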

CDN Link

<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii.bundle.js"></script>

Ascii Tokenizer with Stemmer

This is essentially the same as the ascii tokenizer, but adds a stemmer option.

{
  "lang": "ascii_stemmer",
  "options": {
    // ----------------------------------
    // Ascii Tokenizer options also apply
    // ...
    // ----------------------------------

    // Any of the languages here
    // https://docs.rs/rust-stemmers/1.2.0/rust_stemmers/enum.Algorithm.html
    // Languages other than "english" have not been extensively tested. Use with caution!
    "stemmer": "english"
  }
}

If you do not need stemming, use the ascii tokenizer, which has a smaller wasm binary.

CDN Link

<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.ascii-stemmer.bundle.js"></script>

Chinese Tokenizer

This is a lightweight character-wise tokenizer, not based on word-based tokenizers like Jieba.

It is highly recommended to keep positions indexed and query term proximity ranking turned on when using this tokenizer, in order to boost the relevance of documents with multi-character queries.

{
  "lang": "chinese",
  "options": {
    "stop_words": [],
    "ignore_stop_words": false,
    "max_term_len": 80
  }
}

CDN Link

<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/infisearch@v0.10.1/packages/search-ui/dist/search-ui.chinese.bundle.js"></script>

Stop Words

All tokenizers support keeping (default) or removing stop words using the ignore_stop_words option.

Keeping them enables the following:

  • Processing phrase queries such as "for tomorrow" accurately; stop words would otherwise be removed automatically from such queries.
  • Boolean queries of stop words (e.g. if AND forecast AND sunny)
  • More accurate ranking for free text queries, which uses stop words in term proximity ranking
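The effect on a phrase query can be sketched as follows. This is an illustration of the idea only, not the library's implementation; `STOP_WORDS` is a small subset of the default list shown earlier, and `applyStopWords` is a hypothetical helper.

```typescript
// Subset of the default stop word list, for illustration
const STOP_WORDS = ['a', 'an', 'and', 'for', 'if', 'the', 'to'];

// When stop words are ignored they are dropped from the query terms,
// otherwise the terms are kept as-is.
function applyStopWords(terms: string[], ignoreStopWords: boolean): string[] {
  if (!ignoreStopWords) {
    return terms.slice();
  }
  return terms.filter((term) => STOP_WORDS.indexOf(term.toLowerCase()) === -1);
}

// Phrase query "for tomorrow"
console.log(applyStopWords(['for', 'tomorrow'], false)); // both terms kept
console.log(applyStopWords(['for', 'tomorrow'], true));  // "for" is dropped
```

With stop words removed, the phrase degrades to a single-term query for "tomorrow", which is why keeping them gives more accurate phrase matching.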

UI Translations

The UI’s text can also be overwritten. Refer to this link for the default set of texts.

infisearch.init({
  uiOptions: {
    translations: { ... }
  }
})

| Option | Default | Description |
| --- | --- | --- |
| resultsLabel | 'Site results' | Accessibility label for the listbox containing result previews. This is announced to screenreaders. |
| fsButtonLabel | 'Search' | Accessibility label for the original input element that functions as a button when the fullscreen UI is in use. |
| fsButtonPlaceholder | undefined | Placeholder override for the provided input that functions as a button when the fullscreen UI is in use. |
| fsPlaceholder | 'Search this site' | Placeholder of the input element in the fullscreen UI. |
| fsCloseText | 'Close' | Text for the Close button. |
| filtersButton | 'Filters' | Text for the Filters button if any enum or numeric filters are configured. |
| numResultsFound | ' results found' | The text following the number of results found. |
| startSearching | 'Start Searching Above!' | Text shown when the input is empty. |
| startingUp | '... Starting Up ...' | Text shown when InfiSearch is still not ready to perform any queries. The setup occurs extremely quickly, so you will hopefully not see this text most of the time. |
| navigation | 'Navigation' | Navigation controls text. |
| sortBy | 'Sort by' | Header text for custom sort orders. |
| tipHeader | '🔎 Advanced search tips' | Header of the tip popup. |
| tip | 'Tip' | First column header of the tip popup. |
| example | 'Example' | Second column header of the tip popup. |
| tipRows.xx | (refer here) | Examples for usage of InfiSearch’s advanced search syntax. |
| error | 'Oops! Something went wrong... 🙁' | Generic error text when something goes wrong. |

Search API

You can also interface with InfiSearch through its API.

Setup

Under the global infisearch variable, you can instantiate an instance of the Searcher class.

const searcher = new infisearch.Searcher({
    url: 'https://... the index output directory ...'
});

The constructor parameter uses the same options as infisearch.init; refer to this page for the other available options.

Initialising States

Setup is also async and mostly proceeds in the WebWorker. You can use the setupPromise and isSetupDone interfaces to optionally show UI initialising states.

searcher.setupPromise.then(() => {
    console.assert(searcher.isSetupDone === true);
});

Retrieving Enum Values

If you have an enum field, you can retrieve all its possible values like so:

const enumValues: string[] = await searcher.getEnumValues('weather');

> console.log(enumValues)
['sunny', 'rainy', 'warm', 'cloudy']

Querying

Next, you can create a Query object, which obtains and ranks the result set.

const query: Query = await searcher.runQuery('sunny weather');

The Query object follows this interface.

interface Query {
    /**
     * Original query string.
     */
    readonly query: string;
    /**
     * Total number of results.
     */
    readonly resultsTotal: number;
    /**
     * Returns the next top N results.
     */
    readonly getNextN: (n: number) => Promise<Result[]>;
    /**
     * Freeing a query manually is required since its results live in the WebWorker.
     */
    readonly free: () => void;
}

Filtering and Sorting

Filter document results with enum fields or numeric fields by passing an additional parameter.

const query: Query = await searcher.runQuery('weather', {
  enumFilters: {
    // 'weather' is the enum field name
    weather: [
      null,    // Use null to include documents that have no enum values
      'sunny',
      'warm',
    ]
  },
  i64Filters: {
    // 'price' is the numeric field name
    price: {
      gte?: number | bigint,
      lte?: number | bigint,
    }
  },
});
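As a concrete instance of the filter shape above, the sketch below builds the options object for a price range and a set of weather values. `buildWeatherFilters` and the local interfaces are hypothetical helpers for this example; only the enumFilters, i64Filters, gte and lte keys come from the API shown above.

```typescript
interface I64Range {
  gte?: number;
  lte?: number;
}

interface FilterOptions {
  enumFilters: { [field: string]: (string | null)[] };
  i64Filters: { [field: string]: I64Range };
}

// Hypothetical helper: builds the second argument to runQuery
function buildWeatherFilters(
  weather: string[],
  includeMissing: boolean,
  minPrice?: number,
  maxPrice?: number,
): FilterOptions {
  // null includes documents that have no enum value for the field
  const values: (string | null)[] = includeMissing ? [null] : [];

  const range: I64Range = {};
  if (minPrice !== undefined) range.gte = minPrice;
  if (maxPrice !== undefined) range.lte = maxPrice;

  return {
    enumFilters: { weather: values.concat(weather) },
    i64Filters: { price: range },
  };
}

const filterOptions = buildWeatherFilters(['sunny', 'warm'], true, 10, 100);
console.log(JSON.stringify(filterOptions));

// Would then be passed as:
//   const query = await searcher.runQuery('weather', filterOptions);
```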

Sort document results using numeric fields. Results are tie-broken by their relevance.

const query: Query = await searcher.runQuery('weather', {
  sort: 'pageViews',     // where 'pageViews' is the name of the field
  sortAscending: false,  // the default is to sort in descending order
});

Loading Document Texts

Running a query alone probably isn’t very useful. You can get a Result object using the getNextN function.

const results: Result[] = await query.getNextN(10);

A Result object stores the fields of the indexed document.

const fields = results[0].fields;

> console.log(fields)
{
  texts: [
    ['_relative_fp', 'relative_file_path/of_the_file/from_the_folder/you_indexed'],
    ['title', 'README'],
    ['h1', 'README'],
    ['headingLink', 'description'],
    ['heading', 'Description'],
    ['body', 'InfiSearch is a client-side search solution made for static sites, .....'],
    // ... more headingLink, heading, body fields ...
  ],
  enums: {
    weather: 'cloudy',
    reporter: null,
  },
  numbers: {
    datePosted: 1671336914n,
  }
}

  • texts: an array of [fieldName, fieldText] pairs stored in the order they were seen.

    This ordered model is more complex than a regular key-value store, but enables the detailed content hierarchy you see in InfiSearch’s UI: Title > Heading > Text under heading

  • enums: stores the enum values of the document. Documents missing enum values are assigned null.

  • numbers: i64 fields returned as JavaScript BigInt values.

Memory Management

As InfiSearch uses a WebWorker to run things, you would also need to perform some memory management.

Once you are done with a Query (e.g. if a new query was run), call free() on the query object.

query.free();
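One way to make sure a query is always freed, even if fetching results throws, is a try/finally wrapper. The sketch below is an assumption about how you might structure your own code: `QueryLike` and `SearcherLike` are minimal local interfaces covering only the methods used here, and `topResults` is a hypothetical helper.

```typescript
// Minimal structural types covering only what this sketch uses
interface QueryLike<R> {
  getNextN(n: number): Promise<R[]>;
  free(): void;
}

interface SearcherLike<R> {
  runQuery(query: string): Promise<QueryLike<R>>;
}

// Runs a query, fetches the top n results,
// and always frees the query afterwards.
async function topResults<R>(
  searcher: SearcherLike<R>,
  queryString: string,
  n: number,
): Promise<R[]> {
  const query = await searcher.runQuery(queryString);
  try {
    return await query.getNextN(n);
  } finally {
    query.free();
  }
}

// Usage with a real Searcher instance would look like:
//   const results = await topResults(searcher, 'sunny weather', 10);
```

This pattern suits one-shot lookups. If you page through results with repeated getNextN calls, keep the query around and free it once the user moves on to a new query.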

Search interfaces usually live for the entire lifetime of the application. If you do need to dispose of one, however, you should also free the Searcher instance:

searcher.free();

UI Convenience Methods

A Result object also exposes 2 other convenience functions that may be useful to help deal with the positional format of the text type field stores.

1. Retrieving Singular Fields as KV Stores

Certain fields will only occur once in every document (e.g. titles, <h1> tags). To retrieve these easily, use the getKVFields method:

const kvFields = result.getKVFields('link', '_relative_fp', 'title', 'h1');

> console.log(kvFields)
{
  "_relative_fp": "...",
  "title": "..."
  // Missing fields will not be populated
}

Only the first [fieldName, fieldText] pair for each field will be populated into the fields object.
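The "first pair wins" behaviour can be pictured with the following sketch, which operates directly on the texts array. It is an illustrative re-implementation under that stated behaviour, not the library's code; `firstValues` is a hypothetical name.

```typescript
type TextPair = [string, string];

// Keeps only the first [fieldName, fieldText] pair seen for each requested field
function firstValues(texts: TextPair[], fieldNames: string[]): { [field: string]: string } {
  const out: { [field: string]: string } = {};
  for (const [name, text] of texts) {
    const wanted = fieldNames.indexOf(name) !== -1;
    const alreadySet = Object.prototype.hasOwnProperty.call(out, name);
    if (wanted && !alreadySet) {
      out[name] = text;
    }
  }
  return out;
}

const sampleTexts: TextPair[] = [
  ['title', 'README'],
  ['heading', 'Description'],
  ['body', 'InfiSearch is a client-side search solution'],
  ['heading', 'Features'],
];

// 'link' is absent from the document, so it is not populated
console.log(firstValues(sampleTexts, ['title', 'heading', 'link']));
```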

Tip: Constructing a Document Link

If you haven’t manually added links to your source documents, you can use the _relative_fp field to construct one by concatenating it to a base URL. Any links added via the data-infisearch-link attribute are also available under the link field.
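A minimal sketch of this tip, assuming you have already retrieved the link and _relative_fp fields (for example with getKVFields). The helper name `buildDocumentLink` and the example base URL are hypothetical.

```typescript
// Prefers an explicit link field, otherwise concatenates
// the relative file path to the base URL.
function buildDocumentLink(
  baseUrl: string,
  fields: { link?: string; _relative_fp?: string },
): string | undefined {
  if (fields.link) {
    return fields.link;
  }
  if (fields._relative_fp === undefined) {
    return undefined;
  }
  // Normalise slashes so exactly one separates the two parts
  const base = baseUrl.replace(/\/+$/, '');
  const path = fields._relative_fp.replace(/^\/+/, '');
  return base + '/' + path;
}

console.log(buildDocumentLink('https://example.com/docs/', { _relative_fp: 'guide/intro.html' }));
console.log(buildDocumentLink('https://example.com/docs/', { link: 'https://infi-search.com' }));
```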

2. Associating Headings to other Content Fields and Highlighting

To establish the relationship between heading and headingLink pairs and the content fields following them, call linkHeadingsToContents.

// The parameter is a varargs of field names to consider as content fields
const headingsAndContents: Segment[] = result.linkHeadingsToContents('body');

This returns an array of Segment objects, each of which represents a continuous chunk of heading or content text:

interface Segment {
  /**
   * 'content': content text without preceding heading
   * 'heading': text from 'heading' fields
   * 'heading-content': content text with a preceding heading
   */
  type: 'content' | 'heading' | 'heading-content',

  /**
   * Only present if type = 'heading-content',
   * and points to another Segment of type === 'heading'.
   */
  heading?: Segment,

  /**
   * Only present if type = 'heading' | 'heading-content',
   * and points to the heading's id, if any.
   */
  headingLink?: string,
  
  // Number of terms matched in this segment.
  numTerms: number,
}

Sorting and Choosing Segments

This is fully up to your UI. For example, you can first prioritise segments with a greater numTerms, then tie-break by the type of the segment.
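The ordering suggested above could be sketched like this. The tie-break order in `TYPE_RANK` is an assumption made for the example (prefer content that has a heading), and `rankSegments` is a hypothetical helper; adjust both to suit your UI.

```typescript
type SegmentType = 'content' | 'heading' | 'heading-content';

// Only the properties this sketch needs
interface SegmentLike {
  type: SegmentType;
  numTerms: number;
}

// Assumed preference order for tie-breaking (lower is better)
const TYPE_RANK: { [K in SegmentType]: number } = {
  'heading-content': 0,
  heading: 1,
  content: 2,
};

// Sorts by numTerms (descending), then by segment type, and keeps the top `limit`
function rankSegments<S extends SegmentLike>(segments: S[], limit: number): S[] {
  return segments
    .slice() // avoid mutating the caller's array
    .sort((a, b) => (b.numTerms - a.numTerms) || (TYPE_RANK[a.type] - TYPE_RANK[b.type]))
    .slice(0, limit);
}

const ranked = rankSegments(
  [
    { type: 'content', numTerms: 1 },
    { type: 'heading', numTerms: 2 },
    { type: 'heading-content', numTerms: 2 },
    { type: 'content', numTerms: 3 },
  ],
  3,
);
console.log(ranked.map((s) => s.type + ':' + s.numTerms));
```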

Text Highlighting

interface Segment {
  // ... continuation ...

  // addEllipses defaults to true
  highlightHTML: (addEllipses?: boolean) => string,
  highlight: (addEllipses?: boolean) => (string | HTMLElement)[],

  text: string,                             // original string
  window: { pos: number, len: number }[],   // Character position and length
}

There are 3 choices for text highlighting:

  1. highlightHTML() wraps matched terms in <mark> tags, truncates text, and adds trailing and leading ellipses. A single escaped HTML string is then returned for use.

  2. highlight() does the same but is slightly more efficient, returning a (string | HTMLElement)[] array. To use this array safely (strings are unescaped) and conveniently, use the .append(...seg.highlight()) DOM API.

    Example output:
    [
      <span class="infi-ellipses"> ... </span>,
      ' ... text before ... ',
      <mark class="infi-highlight">highlighted</mark>,
      ' ... text after ... ',
      <span class="infi-ellipses"> ... </span>,
    ]
    
  3. Lastly, you can perform text highlighting manually using the original text and the closest window of term matches.
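For the third option, a minimal sketch of manual highlighting is shown below. It assumes that pos and len are character offsets into text and that matches do not overlap; `highlightManually` and `escapeHtml` are hypothetical helpers, and no truncation or ellipses are added.

```typescript
interface MatchWindow {
  pos: number;
  len: number;
}

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wraps each matched range of `text` in <mark> tags, escaping everything else
function highlightManually(text: string, matches: MatchWindow[]): string {
  const sorted = matches.slice().sort((a, b) => a.pos - b.pos);
  let html = '';
  let cursor = 0;
  for (const m of sorted) {
    if (m.pos < cursor) {
      continue; // skip matches overlapping an earlier one
    }
    html += escapeHtml(text.slice(cursor, m.pos));
    html += '<mark>' + escapeHtml(text.slice(m.pos, m.pos + m.len)) + '</mark>';
    cursor = m.pos + m.len;
  }
  return html + escapeHtml(text.slice(cursor));
}

console.log(highlightManually('sunny & warm weather', [{ pos: 8, len: 4 }]));
```

With a real Segment, the call would be along the lines of highlightManually(segment.text, segment.window).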

Search Syntax

InfiSearch provides a few advanced search operators that can be used in the search API. These are also made known to the user using the help icon on the bottom right of the search UI.

Boolean Operators, Parentheses

AND, NOT, and inversion operators are supported. OR is the default behaviour; documents are ranked according to the BM25 model. Parentheses (...) can be used to group expressions together.

weather +sunny  - documents that may contain "weather" but must contain "sunny"
weather -sunny  - documents that contain "weather" but not "sunny"
~cloudy         - all documents that do not contain "cloudy"
~(ipsum dolor)  - all documents that contain neither "ipsum" nor "dolor"
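When building queries programmatically for the search API, the + and - operators above can be assembled with a small helper. `buildQuery` and `QueryParts` are hypothetical names for this sketch, which assumes each term is a single word that contains no operators itself.

```typescript
interface QueryParts {
  should?: string[];  // plain terms (default OR behaviour)
  must?: string[];    // prefixed with "+"
  mustNot?: string[]; // prefixed with "-"
}

function buildQuery(parts: QueryParts): string {
  const should = parts.should || [];
  const must = (parts.must || []).map((term) => '+' + term);
  const mustNot = (parts.mustNot || []).map((term) => '-' + term);
  return should.concat(must, mustNot).join(' ');
}

console.log(buildQuery({ should: ['weather'], must: ['sunny'], mustNot: ['rainy'] }));
```

The resulting string can be passed directly to searcher.runQuery.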

Phrase Queries

Phrase queries are also supported by enclosing the relevant terms in "...".

"sunny weather" - documents containing "sunny weather"

The with_positions indexing option needs to be enabled for this to work (it is by default).

Field queries are supported via the field_name: prefix:

title:sunny              - documents containing "sunny" in the title
heading:(+sunny +cloudy) - documents with both "sunny" and "cloudy" in headings only
body:gloomy              - documents with "gloomy" elsewhere

You can also perform prefix searches on any term by adding a * wildcard suffix:

run* - searches for "run", "running"

In most instances, an automatic wildcard suffix search is also performed on the last query term that the user is still typing.

Escaping Search Operators

All search operators can also be escaped using \:

\+sunny
\-sunny
\(sunny cloudy\)
\"cloudy weather\"
"phrase query with qu\"otes"
title\:lorem
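If you pass raw user input to the search API and want it treated as plain text, the escapes above can be applied programmatically. The sketch below escapes only the operators shown in the examples (+, -, parentheses, double quotes and the colon); `escapeSearchOperators` is a hypothetical helper and escapes every occurrence, which may be more aggressive than you need.

```typescript
// Prefixes each of the operators shown above with a backslash
function escapeSearchOperators(input: string): string {
  return input.replace(/[+\-()":]/g, '\\$&');
}

console.log(escapeSearchOperators('title:(sunny +cloudy)'));
console.log(escapeSearchOperators('"cloudy weather"'));
```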

Larger Collections

Three configuration presets are available for scaling this tool to larger collections. They are designed primarily for InfiSearch’s main intended use case of supporting static site search.

Introduction

Each preset primarily makes a tradeoff between the document collection size it can support and the number of rounds of network requests (RTT).

The default preset is small, which generates a monolithic index and field store, much like other client side indexing tools.

Specify the preset key in your configuration file to change this.

{
    "preset": "small" | "medium" | "large"
}

Presets

small, medium, and large correspond to 0, 1, and 2 rounds of network requests respectively. The table below describes each preset.

| Preset | Description |
| --- | --- |
| small | Generates a monolithic index and field store. Identical to most other client side indexing tools. |
| medium | Generates an almost-monolithic index but sharded field store. Only required field stores are retrieved for generating result previews. |
| large | Generates both a sharded index and field store. Only index files that are required for the query are retrieved. Keeps stop words. This is the preset used in the demo here! |

In summary, scaling this tool for larger collections doesn’t come freely, and necessitates fragmenting the index and/or field stores, retrieving only what’s needed. This means extra network requests, but to a reasonable degree.

This tool should be able to handle 800MB (not counting things like HTML tags) collections with the full set of features enabled in the large preset.

Other Options

There are a few other options especially worth highlighting that can help reduce the index size (and hence support larger collections) or modify caching strategies.

  • plLazyCacheThreshold

    In addition to upfront caching of index files with the pl_cache_threshold indexing parameter, InfiSearch also persistently caches any index shard that was requested before, but fell short of the pl_cache_threshold.

  • ignore_stop_words=false

    This option is mostly only useful when using the small / medium presets which generate a monolithic index. Ignoring stop words in this case can reduce the overall index size, if you are willing to forgo its benefits.

  • with_positions=true

    Positions take up a considerable (~3/4) portion of the index size but produce useful information for proximity ranking, and enable phrase queries.

Modified Properties

Presets modify only the following properties:

Any of these values specified in the configuration file will override the preset’s.

Linking to other pages

InfiSearch is convenient to get started with if the pages you link to are the same files you index, and these files are hosted at sourceFilesUrl in the same way your source file folders are structured.

Linking to other pages instead is facilitated by the default link field, which lets you override the link used in the result preview.

There is also a default data mapping for HTML files which the below section covers. If using JSON or CSV files, refer to the earlier section.

Indexing HTML Files

For HTML files, simply add this link with the data-infisearch-link attribute.

<span data-infisearch-link="https://www.google.com"></span>

This data mapping configuration is already implemented by default, shown by the below snippet.

"loaders": {
  "HtmlLoader": {
    "selectors": {
      "span[data-infisearch-link]": {
        "attr_map": {
          "data-infisearch-link": "link"
        }
      }
    }
  }
}

Filters

Multi-Select Filters

Multi-select filters, for example the ones you see in this documentation’s search (“User Guide”, “Advanced”), allow users to filter for results belonging to one or more categories.

For this guide, let’s suppose we have a bunch of weather forecast articles and want to support filtering them by the weather (sunny, warm, cloudy).

First, set up a custom field inside the indexer configuration file.

"fields_config": {
  "fields": {
    "weatherField": {
      "storage": [{ "type": "enum" }]
    }
  }
}

The "storage": [{ "type": "enum" }] option tells InfiSearch to store the first seen value of this field for each document, but we’ll need to tell InfiSearch where the data for this field comes from next.

Let’s assume we’re dealing with a bunch of HTML weather forecast articles, which use the HtmlLoader. In particular, these HTML files store the weather inside a specific element with an id="weather".

"indexing_config": {
  "loaders": {
    "HtmlLoader": {
      "selectors": {
        // Match elements with an id of weather
        "#weather": {
          // And index its contents into our earlier defined field
          // You can also use attributes, see the HTMLLoader documentation
          "field_name": "weatherField"
        }
      }
    }
  }
}

Lastly, we need to tell InfiSearch’s UI to set up a multi-select filter using this field. To do so, add the following to your init call.

infisearch.init({
  ...
  uiOptions: {
    multiSelectFilters: [
      {
        fieldName: 'weatherField', // matching our earlier defined field
        displayName: 'Weather',
        defaultOptName: 'Probably Sunny!',
        collapsed: true,  // only the first header is initially expanded
      },
      // You can setup more filters as needed following the above procedures
    ]
  }
})

The displayName option tells the UI how to display the multi-select’s header. We simply use an uppercased “Weather” in this case.

Some of the indexed weather forecast articles may also be missing the id="weather" element, for example due to a bug in generating the article, and therefore lack an enum value. InfiSearch internally assigns such documents a default enum value. The defaultOptName option specifies the name of this default enum value as seen in the UI.

Numeric Filters

You can also create minimum-maximum numeric filters with InfiSearch. These can be any of <input type="number|date|datetime-local" />.

[Figure: numeric filters example]

Continuing the same example as multi-select filters, let’s suppose we also want to support filtering weather forecast articles by their number of page views. These page views are stored in the data-pageviews attribute of the element with an id="weather".

First, we define a signed integer field.

"fields_config": {
  "fields": {
    "pageViewsField": {
      "storage": [{
        "type": "i64",
        // Default number of page views if there is none
        "default": 0,
        // Parse the data seen as a signed integer
        // Datetimes and floats are also supported, see the above linked documentation
        "parse": { "method": "normal" } 
      }]
    }
  }
}

Next, we map the data from the data-pageviews attribute into the above field.

"indexing_config": {
  "loaders": {
    "HtmlLoader": {
      "selectors": {
        // Match elements with an id of weather
        "#weather": {
          // And index its data-pageviews attribute into our earlier defined field
          "attr_map": {
            "data-pageviews": "pageViewsField"
          }
        }
      }
    }
  }
}

Lastly, we tell InfiSearch’s UI to set up a numeric filter using this field. To do so, add the following to your infisearch.init call.

infisearch.init({
  ...
  uiOptions: {
    numericFilters: [
      {
        fieldName: 'pageViewsField',
        displayName: 'Number of Views',
        type: 'number', // date, datetime-local is also supported
        minLabel: 'Min',
        maxLabel: 'Max',
      }
    ]
  }
})

Sorting by Numbers & Dates

Results can also be sorted by numeric fields. Let’s suppose we want to support sorting weather forecast articles by their date posted. The date is stored in an element with the data-date-posted attribute.

[Figure: sort options dropdown]

First, define a numeric field that can store signed 64-bit integers.

"fields_config": {
  "fields": {
    "datePostedField": {
      "storage": [{
        "type": "i64",

        // Default UNIX timestamp.
        // In this case, we use "0", which falls on Jan 1 1970 00:00 UTC.
        "default": 0,

        // Parse the data seen as a date.
        // Integers, floats, and other datetime formats are also supported,
        // see the above linked documentation.
        "parse": {
          "method": "datetime",
          "datetime_fmt": "%Y %b %d %H:%M %z"
        }
      }]
    }
  }
}
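To make the datetime_fmt above concrete, the sketch below formats a date into a matching attribute value. It assumes the format string follows the usual strftime conventions (%Y year, %b abbreviated month name, %d zero-padded day, %H:%M 24-hour time, %z UTC offset such as +0000); `formatDatePosted` is a hypothetical helper and always emits UTC.

```typescript
const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

function pad2(n: number): string {
  return n < 10 ? '0' + n : String(n);
}

// Formats a Date (in UTC) to match "%Y %b %d %H:%M %z"
function formatDatePosted(date: Date): string {
  return (
    date.getUTCFullYear() + ' ' +
    MONTHS[date.getUTCMonth()] + ' ' +
    pad2(date.getUTCDate()) + ' ' +
    pad2(date.getUTCHours()) + ':' + pad2(date.getUTCMinutes()) +
    ' +0000'
  );
}

// 18 December 2022, 04:15 UTC
console.log(formatDatePosted(new Date(Date.UTC(2022, 11, 18, 4, 15))));
```

The printed value is what you would place in the data-date-posted attribute of the source HTML.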

Next, map the data from the data-date-posted attribute into the above field.

"indexing_config": {
  "loaders": {
    "HtmlLoader": {
      "selectors": {
        // Match elements with the attribute
        "[data-date-posted]": {
          // And index the attribute into our earlier defined field
          "attr_map": {
            "data-date-posted": "datePostedField"
          }
        }
      }
    }
  }
}

Lastly, configure InfiSearch’s UI to set up the sort dropdown using this field.

infisearch.init({
  ...
  uiOptions: {
    sortFields: {
      datePostedField: {
        asc: 'Date: Oldest First',
        desc: 'Date: Latest First',
      },
    },
  }
})

Altering HTML Outputs

This page covers customising the result preview HTML output structure.

Some use cases for this include:

  • The default HTML structure is not sufficient for your styling needs
  • You want to override or insert additional content sourced from your own fields (e.g. an image)
  • You want to change the default use case of linking to a web page entirely (e.g. use client side routing)

💡 If you only need to style the dropdown or search popup, you can include your own css file to do so and / or override the variables exposed by the default css bundle.

The only API option is similarly specified under the uiOptions key of the root configuration object.

infisearch.init({
    uiOptions: {
        listItemRender: ...
    }
});

Its interface is as follows:

type ListItemRender = (
  h: CreateElement,
  opts: Options,  // what you passed to infisearch.init
  result: Result,
  query: Query,
) => Promise<HTMLElement>;

If you haven’t, you should also read through the Search API documentation on the Result and Query parameters.

h function

This is an optional helper function you may use to create elements.

The method signature is as such:

export type CreateElement = (
  // Element name
  name: string,

  // Element attribute map
  attrs: { [attrName: string]: string },

  /*
   Child elements (HTMLElement) OR text nodes (string)
   String parameters are automatically escaped.
  */
  ...children: (string | HTMLElement)[]
) => HTMLElement;

Accessibility and User Interaction

To ensure that combobox controls work as expected, you should also ensure that the appropriate elements are labelled with role='option' (and optionally role='group').

Elements with role='option' will also have the .focus class applied to them once they are visually focused. You can use this class to style the option.
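As a rough sketch of how the h helper and the role='option' requirement fit together, the function below builds a single result item. It is generic over the element type so it can be exercised without a DOM; the class names and the `renderListItem` helper are hypothetical, and the default renderer's actual markup is more involved.

```typescript
// Same shape as CreateElement above, but generic over the element type E
type CreateElementLike<E> = (
  name: string,
  attrs: { [attrName: string]: string },
  ...children: (string | E)[]
) => E;

// Builds one result item; string children are passed as text to `h`
function renderListItem<E>(
  h: CreateElementLike<E>,
  title: string,
  link: string,
  preview: string,
): E {
  return h(
    'a',
    // role='option' lets the combobox controls pick this element up
    { role: 'option', href: link, class: 'my-result' },
    h('div', { class: 'my-result-title' }, title),
    h('div', { class: 'my-result-preview' }, preview),
  );
}

// Inside a listItemRender implementation, this could be used along the lines of:
//   const { title, link } = result.getKVFields('title', 'link');
//   return renderListItem(h, title, link, previewText);
```

Styling for the visually focused option can then target the .focus class on the element carrying role='option'.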

Granularity

Currently, this API is moderately lengthy, performing things such as limiting the number of sub matches (heading-content pairs) per document, formatting the relative file path of documents into a breadcrumb form, etc.

There may be room for breaking this API down further; please raise a feature request if you have any suggestions!

Source Code

See the source to get a better idea of using this API.

Incremental Indexing

Incremental indexing is also supported by the indexer CLI tool.

Detecting deleted, changed, or added files is done by storing an internal file path –> last modified timestamp map.
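Conceptually, change detection with such a map amounts to diffing the previous run's map against the current one. The sketch below illustrates the idea only; the indexer's actual data format and logic are internal, and the names used here are hypothetical.

```typescript
// file path -> last modified timestamp
type TimestampMap = { [filePath: string]: number };

interface FileChanges {
  added: string[];
  changed: string[];
  deleted: string[];
}

function diffTimestamps(previous: TimestampMap, current: TimestampMap): FileChanges {
  const changes: FileChanges = { added: [], changed: [], deleted: [] };

  for (const path of Object.keys(current)) {
    if (!(path in previous)) {
      changes.added.push(path);
    } else if (previous[path] !== current[path]) {
      changes.changed.push(path);
    }
  }

  for (const path of Object.keys(previous)) {
    if (!(path in current)) {
      changes.deleted.push(path);
    }
  }

  return changes;
}

const fileChanges = diffTimestamps(
  { 'a.html': 100, 'b.html': 100, 'c.html': 100 },
  { 'a.html': 100, 'b.html': 250, 'd.html': 300 },
);
console.log(JSON.stringify(fileChanges));
```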

To use it, simply pass the --incremental or -i option when running the indexer.

You will most likely not need to dabble with incremental indexing, unless your collection is extremely large (e.g. > 200MB).

Content Based Hashing

The default change detection currently relies on the last modified time in file metadata. This may not always be guaranteed by the tools that generate the files InfiSearch indexes, or be an accurate reflection of whether a file’s contents were updated.

If file metadata is unavailable for any given file, the file would always be re-indexed as well.

You may specify the --incremental-content-hash option in such a case to opt into using a crc32 hash comparison for all files instead. This option should also be specified when running a full index and intending to run incremental indexing somewhere down the line.
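For reference, a content-hash comparison of this kind can be sketched with the standard CRC-32 algorithm (reflected polynomial 0xEDB88320). This is an illustration only: the indexer itself is written in Rust, and how it stores and compares hashes is not covered here. `asciiBytes` and `hasChanged` are hypothetical helpers.

```typescript
// Precompute the CRC-32 lookup table
const CRC_TABLE: number[] = [];
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE.push(c >>> 0);
}

// Computes the CRC-32 checksum of a byte sequence
function crc32(bytes: number[]): number {
  let crc = 0xFFFFFFFF;
  for (const b of bytes) {
    crc = (CRC_TABLE[(crc ^ b) & 0xFF] as number) ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Converts an ASCII string into its byte values (sufficient for this example)
function asciiBytes(s: string): number[] {
  return s.split('').map((ch) => ch.charCodeAt(0));
}

// A file counts as changed if it has no previous hash or the hash differs
function hasChanged(previousHash: number | undefined, contents: number[]): boolean {
  return previousHash !== crc32(contents);
}

const oldHash = crc32(asciiBytes('sunny weather'));
console.log(hasChanged(oldHash, asciiBytes('sunny weather')));  // same contents
console.log(hasChanged(oldHash, asciiBytes('cloudy weather'))); // edited contents
```

Unlike the last modified time, the hash only changes when the contents do, so a file that was regenerated with identical contents is not re-indexed.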

It should only be marginally more expensive for the majority of cases, and may be the default option in the future.

Circumstances that Trigger a Full (Re)Index

Note also that the following circumstances will forcibly trigger a full reindex:

  • The output folder path does not contain any files indexed by InfiSearch
  • The output folder contains files indexed by a different version of InfiSearch
  • The configuration file (infi_search.json) was changed in any way
  • Usage of the --incremental-content-hash option changed

Caveats

There are some additional caveats to note when using this option. Whenever possible, try to run a full reindex of the documents, utilising incremental indexing only when indexing speed is of concern – for example, supporting an “incremental” build mode in static site generators.

Small Increase in File Size

As one of the core ideas of InfiSearch is to split up the index into many tiny parts, the incremental indexing feature works by “patching” only relevant index files containing terms seen during the current run. Deleted documents are handled using an invalidation bit vector. Hence, there might be a small increase in file size due to these unpruned files.
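The general idea of an invalidation bit vector can be sketched as below: one bit per document id, set when the document is deleted, and checked when reading postings that were never rewritten. The class, its bit layout, and the example ids are assumptions for illustration and do not describe InfiSearch's on-disk format.

```typescript
// One bit per document id; a set bit means the document was deleted
class InvalidationVector {
  private bits: Uint8Array;

  constructor(numDocs: number) {
    this.bits = new Uint8Array(Math.ceil(numDocs / 8));
  }

  invalidate(docId: number): void {
    const byteIdx = docId >> 3;
    this.bits[byteIdx] = (this.bits[byteIdx] || 0) | (1 << (docId & 7));
  }

  isValid(docId: number): boolean {
    return ((this.bits[docId >> 3] || 0) & (1 << (docId & 7))) === 0;
  }
}

// Postings in unpruned index files may still mention deleted documents,
// so they are filtered against the vector when read.
const invalidation = new InvalidationVector(16);
invalidation.invalidate(3);
invalidation.invalidate(9);

const stalePostings = [1, 3, 8, 9, 12];
console.log(stalePostings.filter((docId) => invalidation.isValid(docId)));
```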

However, if these “irrelevant” files become relevant again in a future index run, they will be pruned.

Collection Statistics

Collection statistics used to rank documents will tend to drift off when deleting documents (which also entails updating documents). This is because such documents may contain terms that were not encountered during the current run of incremental indexing (from added / updated documents). Detecting such terms is difficult, as there is no guarantee the deleted documents are available anymore. The alternative would be to store such information in a non-inverted index, but that again takes up extra space =(.

As such, the information for these terms may not be “patched”. You may notice some slight drifting in the relative ranking of documents returned after some number of incremental indexing runs, until said terms are encountered again in some other document.

File Bloat

When deleting documents or updating documents, old field stores are not removed. This may lead to file bloat after many incremental indexing runs.