Indexer Data Configuration
The configuration options on this page specify how files are indexed (by mapping file data to fields) and which files are indexed.
InfiSearch’s defaults are sufficient to index most HTML files; where they are not, you can configure how content is mapped to fields. Support for other file formats (e.g. JSON, CSV, PDF) is also enabled here.
Mapping File Data to Fields
{
  "indexing_config": {
    "loaders": {
      // Default: Only HTML files are indexed
      "HtmlLoader": {}
    }
  }
}
The indexer is able to handle data from HTML, JSON, CSV, TXT, or PDF files. Support for each file type is provided by a file Loader abstraction.
You may configure loaders by including them under the loaders key, with any applicable options.
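As an illustration, a configuration that indexes JSON and PDF files alongside HTML might look like the sketch below. It lists HtmlLoader explicitly on the assumption that specifying loaders may replace the default set. The JSON key names are the hypothetical ones used in the JsonLoader example on this page.

```json
{
  "indexing_config": {
    "loaders": {
      // Listed explicitly (assumption: specifying "loaders" may replace the default set)
      "HtmlLoader": {},
      // Hypothetical data keys, matching the JsonLoader example on this page
      "JsonLoader": {
        "field_map": {
          "chapter_title": "title",
          "chapter_text": "body",
          "book_link": "link"
        }
      },
      "PdfLoader": {
        "field": "body"
      }
    }
  }
}
```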
HTML Files: loaders.HtmlLoader
The HtmlLoader is the only loader configured by default. Its default configuration is as follows:
"loaders": {
  "HtmlLoader": {
    "exclude_selectors": [
      // Selectors to exclude from indexing
      "script,style,form,nav,[data-infisearch-ignore]"
    ],
    "selectors": {
      "title": {
        "field_name": "title"
      },
      "h1": {
        "field_name": "h1"
      },
      "h2,h3,h4,h5,h6": {
        "attr_map": {
          "id": "headingLink" // stores the id attribute under the headingLink field
        },
        "field_name": "heading"
      },
      "body": {
        "field_name": "body"
      },
      "meta[name=\"description\"],meta[name=\"keywords\"]": {
        "attr_map": {
          "content": "body"
        }
      },
      // A convenient means to override the link used in the result preview
      // See "Linking to other pages" for more information
      "span[data-infisearch-link]": {
        "attr_map": {
          "data-infisearch-link": "link"
        }
      }
    },
    "merge_default_selectors": true
  }
}
The HTML loader indexes a document by:
- Traversing the document depth-first, in the order text naturally appears.
- Checking, for each element, whether it satisfies any of the selectors specified as keys under HtmlLoader.selectors. If so, all descendants (elements, text) of the element are indexed under the newly specified field_name, if any.
  - This process repeats as the document is traversed: if a descendant matches a different selector, the field mapping is overwritten for that descendant and its descendants.
  - The attr_map option allows indexing attributes of specific elements under fields as well.
  - All selectors are matched in arbitrary order by default. To specify an order, add a priority: n key with a higher value to your selector definition, where n is any integer.
- To exclude elements from indexing, use the exclude_selectors option, or add the in-built data-infisearch-ignore attribute to your HTML.
If needed, you can also index HTML fragments that are incomplete documents (for example, documents which are missing the <head>). To match the entire fragment, use the body selector.
Lastly, if you need to remove a default selector, simply replace its definition with null. For example, "h2,h3,h4,h5,h6": null. Alternatively, specifying "merge_default_selectors": false will remove all default selectors.
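A minimal sketch combining these options, assuming the remaining default selectors stay merged in: it removes the default heading selector and maps a custom selector to the heading field with an explicit priority. The .section-title class is hypothetical and the value 1 is arbitrary, since the text above only states that n may be any integer.

```json
"loaders": {
  "HtmlLoader": {
    "selectors": {
      // Remove the default heading selector
      "h2,h3,h4,h5,h6": null,
      // Hypothetical class name; matching elements are indexed under "heading"
      ".section-title": {
        "field_name": "heading",
        "priority": 1
      }
    },
    // Keep the other default selectors (title, h1, body, ...)
    "merge_default_selectors": true
  }
}
```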
JSON Files: loaders.JsonLoader
"loaders": {
  "JsonLoader": {
    "field_map": {
      "chapter_text": "body",
      "book_link": "link",
      "chapter_title": "title"
    },
    // Optional, order in which to index the keys of the json {} document
    "field_order": [
      "book_link",
      "chapter_title",
      "chapter_text"
    ]
  }
}
JSON files can also be indexed. The field_map contains a mapping of your JSON data key -> field name.
The field_order array controls the order in which the data keys are indexed, which has a minor influence on query term proximity ranking.
The JSON file can be either:
- An object, with numbers, strings or null values
- An array of such objects
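For instance, a data file compatible with the field_map above could look like the following sketch. The values are invented for illustration; only the key names come from the configuration example.

```json
[
  {
    "book_link": "/books/example-book/chapter-1",
    "chapter_title": "Chapter 1",
    "chapter_text": "Text of the first chapter."
  },
  {
    "book_link": "/books/example-book/chapter-2",
    "chapter_title": "Chapter 2",
    "chapter_text": "Text of the second chapter."
  }
]
```

With the field_order shown earlier, each object's book_link would be indexed first, then chapter_title, then chapter_text.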
CSV Files: loaders.CsvLoader
"loaders": {
  "CsvLoader": {
    // ---------------------
    // Map data using CSV headers
    "header_field_map": {},
    "header_field_order": [], // Optional, order to index the columns
    // ---------------------
    // Or with header indices
    "index_field_map": {
      "0": "link",
      "1": "title",
      "2": "body",
      "4": "heading"
    },
    "index_field_order": [1, 4, 2, 0], // Optional, order to index the columns
    // ---------------------
    // Options for csv parsing, from the Rust "csv" crate
    "parse_options": {
      "comment": null,
      "delimiter": 44,
      "double_quote": true,
      "escape": null,
      "has_headers": true,
      "quote": 34
    }
  }
}
Field mappings for CSV files can be configured using one of the two field_map keys (header_field_map or index_field_map). The corresponding field_order arrays control the order in which columns are indexed.
The parse_options key specifies options for parsing the CSV file.
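As a sketch of header-based mapping for a tab-separated file, under two assumptions: that header_field_map takes CSV header name -> field name (mirroring index_field_map), and that delimiter and quote are byte values, as the defaults 44 (ASCII comma) and 34 (ASCII double quote) suggest. The header names are hypothetical.

```json
"loaders": {
  "CsvLoader": {
    // Assumed shape: CSV header name -> field name (hypothetical header names)
    "header_field_map": {
      "url": "link",
      "name": "title",
      "description": "body"
    },
    "header_field_order": ["name", "description", "url"],
    "parse_options": {
      "comment": null,
      // 9 is the ASCII code for a tab (assumption, by analogy with 44 for a comma)
      "delimiter": 9,
      "double_quote": true,
      "escape": null,
      "has_headers": true,
      "quote": 34
    }
  }
}
```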
PDF Files: loaders.PdfLoader
"loaders": {
  "PdfLoader": {
    "field": "body"
  }
}
This loader indexes all content into a single field, “body” by default.
The search result title appears as <...PDF file path breadcrumb...> (PDF), and clicking it opens the PDF in the browser.
Text Files: loaders.TxtLoader
"loaders": {
  "TxtLoader": {
    "field": "field_name"
  }
}
This loader simply reads .txt files and indexes all their contents into a single field. This is not particularly useful without the _add_files feature, which allows indexing data from multiple files as one document.
File Exclusions
{
  "indexing_config": {
    "exclude": [
      "infi_search.json"
    ],
    "include": [],
    "with_positions": true
  }
}
File Exclusions: exclude = ["infi_search.json"]
Global file exclusions can be specified in this parameter, which is simply an array of file globs.
File Inclusions: include = []
Similarly, you can restrict indexing to specific files. This is an empty array by default, which indexes everything.
If a file matches both an exclude and an include pattern, the exclude pattern takes priority.
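A short sketch of the two options used together. The folder name and patterns are hypothetical, and the exact glob syntax supported (for example, whether ** matches nested directories) is an assumption. The default exclusion is repeated in case a custom exclude array replaces it.

```json
{
  "indexing_config": {
    // Keeps the default exclusion and adds a hypothetical drafts folder
    "exclude": [
      "infi_search.json",
      "drafts/**"
    ],
    // Hypothetical pattern: only HTML files are considered for indexing
    "include": [
      "**/*.html"
    ]
  }
}
```

Under these assumptions, a file such as drafts/notes.html matches both lists and is skipped, because exclude takes priority.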