Hit highlighting

How to get hit highlighting to work with the teaser field. Describe the limitations and the best strategies for making it work well for you (i.e. only on one field).

Strategies on how to index large datasets

Strategies for indexing really large datasets. The record so far is 1.3 million documents, but indexing at that size uses a lot of memory and easily runs out of it.

Check: fergiemcdowall/search-index#261 and fergiemcdowall/search-index-adder#2

Update query object to AND, OR and NOT

https://github.com/eklem/search-index-cookbook/blob/master/doc/reference/references.md#query-object-example

Here is how search-index does that now: https://github.com/fergiemcdowall/search-index/blob/master/doc/search.md#search

Cache search index file/data for index in the browser app

When search-index runs in the browser and the web server only serves static files, it would be good to show how to cache the usual assets (HTML, CSS, JS) as well as the index itself, or the index data as JSON.

This will ensure two things:

You won't need a lot of server CPU, since the app is running in the browser.
You won't need a lot of bandwidth, since the index file (which changes only now and then) is cached by equipment along the network path.
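
One way to picture the caching idea is a per-file-type Cache-Control policy on the static server. The sketch below is only an illustration: the file extensions, the max-age values, and the function name cacheControlFor are all assumptions, and the right lifetimes depend on how often the index is rebuilt.

```javascript
// Sketch only: choose a Cache-Control header value per static file type.
// Extensions and max-age values are assumptions for illustration.
function cacheControlFor(path) {
  if (/\.(css|js)$/.test(path)) {
    return 'public, max-age=86400'; // app code: cache for a day
  }
  if (/\.json$/.test(path)) {
    return 'public, max-age=3600'; // index data: changes now and then
  }
  return 'no-cache'; // HTML and anything else: revalidate with the server
}

console.log(cacheControlFor('/data/index.json'));
```

The "public" directive is what allows shared caches between the server and the browser to keep a copy, which is where the bandwidth saving comes from.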

Strategies for weighting fields on the indexing side and the query side

Different strategies for weighting document fields per batch, and how to weight some fields higher or lower for certain documents.

And the benefits of query side weighting:

No re-indexing
Seasonal changes (through a day, a week, a month, a year)

Feature tradeoffs for memory and CPU

Take a look at memory use while indexing. See what makes it differ:

Just look at top to get some ballpark numbers. memwatch-next might also help; worth checking.
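
As a complement to watching top, Node's built-in process.memoryUsage() can be sampled before and after an indexing step. This is a minimal sketch, not a replacement for memwatch-next; the helper names are made up here, and the indexing step itself is left as a placeholder comment.

```javascript
// Sketch: sample Node's own memory counters around an indexing run.
// memorySnapshotMb and memoryDelta are hypothetical helper names.
function memorySnapshotMb() {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  return { rssMb: toMb(rss), heapUsedMb: toMb(heapUsed) };
}

function memoryDelta(earlier, later) {
  return {
    rssMb: later.rssMb - earlier.rssMb,
    heapUsedMb: later.heapUsedMb - earlier.heapUsedMb
  };
}

const snapshotBefore = memorySnapshotMb();
// ... run one indexing batch here with the option under test ...
const snapshotAfter = memorySnapshotMb();
console.log(memoryDelta(snapshotBefore, snapshotAfter));
```

Running the same batch with one option changed at a time (for example with and without n-grams) gives comparable ballpark numbers.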

Getting the data: Crawling

How to get the data, and what you have to figure out when crawling for it:

How to figure out which URLs to actually crawl (site crawl or list pages)
When is a paginated list finished (tests to run)
New content
Updated content
What content to crawl on a page
Preparing for filtering on buckets or categories
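
One possible test for "is the paginated list finished" is to stop at the first list page that yields no URLs you have not already seen. The sketch below assumes the list pages have already been fetched and reduced to arrays of URLs; the function names and the sample URLs are hypothetical.

```javascript
// Sketch: walk list pages in order and stop when a page adds nothing new.
// `pages` stands in for fetched list pages, each reduced to its item URLs.
function newUrlsOnPage(seen, pageUrls) {
  return pageUrls.filter((url) => !seen.has(url));
}

function collectUntilFinished(pages) {
  const seen = new Set();
  for (const pageUrls of pages) {
    const fresh = newUrlsOnPage(seen, pageUrls);
    if (fresh.length === 0) {
      break; // empty page, or only repeats: treat the list as finished
    }
    fresh.forEach((url) => seen.add(url));
  }
  return [...seen];
}

console.log(collectUntilFinished([['/a', '/b'], ['/b', '/c'], ['/c']]));
```

The same check helps with new content on later crawls: anything that newUrlsOnPage returns against the set of already indexed URLs is a candidate for indexing.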

In-browser: When to use concurrentAdd versus appendOnly

Check out this issue on concurrentAdd.

Starting fresh, appendOnly is much quicker. But your data set must be free of duplicates, or you must tolerate duplicates in the search engine.

When you already have an index, you have to consider whether concurrentAdd is quicker than flush + appendOnly.

Indexing features - how to set up the config

List a full indexing config object with everything defined, even if default.

Adding a pipeline for Simplified Chinese and Japanese

Show how to use

TinySegmenter (Japanese)
ChineseTokenizer

to split up text into words when adding the text to search-index.

@fergiemcdowall: At some point I will need some help to understand processing in the pipeline

Use morph.io to scrape data

Show how to get these data over to a norch or search-index instance

How to build a great autocomplete / autosuggest?

Autosuggest using index.match, and how to set nGramLength when adding documents so the autosuggest has more words to offer.
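
To make the n-gram idea concrete, here is a generic sketch of word n-grams. It illustrates the concept only and is not how search-index implements nGramLength; the function names and sample tokens are made up for the example.

```javascript
// Sketch: build word n-grams (phrases of n consecutive tokens).
function wordNGrams(tokens, n) {
  const grams = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.push(tokens.slice(i, i + n).join(' '));
  }
  return grams;
}

// All n-grams for every length from minLength to maxLength inclusive.
function nGramsInRange(tokens, minLength, maxLength) {
  const grams = [];
  for (let n = minLength; n <= maxLength; n++) {
    grams.push(...wordNGrams(tokens, n));
  }
  return grams;
}

console.log(nGramsInRange(['norwegian', 'salmon', 'recipe'], 1, 2));
```

With lengths 1 to 2, a three-word title gives five suggestion candidates instead of three, at the cost of a larger index.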

Getting the data: Crawl this page

Use norch-zapier-bookmarklet

Reference to search-index version

The cookbook should state which search-index version it describes.

Add link to JSON validator / formatter

JSON validator / formatter to check your queries
https://jsonformatter.curiousconcept.com/

Explain why streaming interface equals many endpoints / functions

Since everything is a stream, each object in a stream should have the same structure and can hold only one type of information. You therefore need an endpoint / function for each type of info.

How to turn zero-result queries into something positive: Synonyms

Figure out which synonyms to add and how to do it with OR-search
(expand search with more OR-queries, but boost the original query)
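
The expansion step can be sketched independently of the query object. In the sketch below, the clause shape ({ terms, boost }), the synonym map, and the boost value are all assumptions for illustration; mapping the clauses onto search-index's actual query object is left to the reference documentation.

```javascript
// Sketch: expand a query into OR-clauses, one per synonym substitution,
// keeping the original words as the first, boosted clause.
const synonymMap = new Map([
  ['couch', ['sofa', 'settee']] // hypothetical synonym list
]);

function expandWithSynonyms(words, synonyms, originalBoost = 10) {
  const clauses = [{ terms: words.slice(), boost: originalBoost }];
  words.forEach((word, position) => {
    for (const alternative of synonyms.get(word) || []) {
      const variant = words.slice();
      variant[position] = alternative;
      clauses.push({ terms: variant, boost: 0 });
    }
  });
  return clauses;
}

console.log(expandWithSynonyms(['red', 'couch'], synonymMap));
```

Logging queries that return zero results is one way to find out which words deserve an entry in the synonym map.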

The main parts of a search engine

Getting the documents (crawling)
Document processing & enrichment
Index (add/remove + take query/return results)
Query pipeline
User queries

Install issues on Debian/Ubuntu

Sometimes leveldown or its sub-modules are not prebuilt. To get the installation of norch/search-index working anyway, you need to install build-essential:

https://nodejs.org/en/download/package-manager/#debian-and-ubuntu-based-linux-distributions

Pitfalls using and developing with search-index

Common pitfalls:

Use a modularized version of norch-vuejs-app to show off different features

Show different features and how to combine them with a modularized version of norch-vuejs-app

Keep the reference in one document

To avoid creating documentation that competes with search-index's own, keep the reference to one document, made up mostly of explanatory text rather than code and how-tos.

Get data

The data can be kept outside this repo, but the repo should have code to index some existing repos on GitHub. It needs one or more indexers of its own.

Use Zapier-recipe on GitHub events for search-index.
Food recipes
Reuters data set

query pipeline example

Show it with stopwords
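
A query pipeline can be pictured as a list of small steps applied in order to the query string. The sketch below is an assumption-laden illustration: the stopword list is a tiny hand-picked sample, and the function names are not part of search-index or norch.

```javascript
// Sketch: a query pipeline as an ordered list of steps.
const STOPWORDS = new Set(['a', 'an', 'and', 'the', 'of', 'to', 'in']); // sample list

function tokenize(queryString) {
  return queryString.toLowerCase().split(/\s+/).filter(Boolean);
}

function removeStopwords(tokens) {
  return tokens.filter((token) => !STOPWORDS.has(token));
}

// Feed the output of each step into the next one.
function runQueryPipeline(queryString, steps) {
  return steps.reduce((value, step) => step(value), queryString);
}

console.log(runQueryPipeline('The history of Norway', [tokenize, removeStopwords]));
```

Further steps, such as synonym expansion, slot into the same list without changing the others.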

List of third party modules to search-index / norch

List all third-party modules for search-index, with a short description of what each does.

Facets and filters - the Swiss army knife of search

Standard facets and filters:
Explain the benefits (less typing, more exploration of data in index)

How to use the cookbook interactively?

For best use of this cookbook:

npm install search-index-cookbook

There should be HTML files with browserified search-index JavaScript, ready to index different data sets. It should be easy to play around in the JavaScript console of the browser's developer tools. This goes for both the indexing side and the query side, and there should be example files for each recipe.

This should be possible, @fergiemcdowall ?

Fuzzy search with Levenshtein distance

https://en.wikipedia.org/wiki/Levenshtein_distance

Could use leven. First, when indexing, build a separate array of all words used (or see whether search-index can provide this). Then, when searching, find the words within a Levenshtein distance of 1 or 2 of each query word, and do an OR-search on the words you get back.
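
The two steps can be sketched without any dependency. The distance function below is a plain dynamic-programming Levenshtein written out for illustration (the leven module would do the same job); the vocabulary array and the name fuzzyCandidates are hypothetical.

```javascript
// Sketch: Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Words from the vocabulary close enough to the query word to OR together.
function fuzzyCandidates(word, vocabulary, maxDistance) {
  return vocabulary.filter((entry) => levenshtein(word, entry) <= maxDistance);
}

console.log(fuzzyCandidates('serch', ['search', 'index', 'perch', 'sea'], 1));
```

Comparing against the whole vocabulary on every query is linear in vocabulary size, so for a large index it may be worth restricting candidates first, for example to words of similar length.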

Update for new search-index API

The search API changed in 0.8.0 to allow for NOT and OR. Facets are now 'categories' and 'buckets'.

The examples here need to be updated.

opensearch based on matcher running in the browser

A matcher-only search-index running in the browser for opensearch. Not sure if it is possible to reach an in-browser index from the browser search box, but I think it's doable.

Investigate: Set up browser demo from search-index on a server w/ a domain and add an opensearch.xml

Base it on search-index and the ngraminator.

When to use buckets and when to use categories

Short answer: when the list of categories grows too long. If the list of available filters feels almost as long as, or as long as, the result list, you need to group them into buckets.

You filter to cut the result list down to a more manageable size, not to pinpoint a single result with one filter.
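
The difference in list length is easy to see on a small sample. This sketch counts the same field once per distinct value (category style) and once per range (bucket style); the documents, the field name, and the range boundaries are made up, and the helper names are not search-index functions.

```javascript
// Sketch: one filter per distinct value versus one filter per range.
function countByCategory(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return counts;
}

function countByBucket(docs, field, buckets) {
  return buckets.map(({ label, gte, lte }) => ({
    label,
    count: docs.filter((doc) => doc[field] >= gte && doc[field] <= lte).length
  }));
}

const sampleDocs = [
  { year: 1994 }, { year: 1998 }, { year: 2003 }, { year: 2003 }, { year: 2011 }
];
const decades = [
  { label: '1990s', gte: 1990, lte: 1999 },
  { label: '2000s', gte: 2000, lte: 2009 },
  { label: '2010s', gte: 2010, lte: 2019 }
];

console.log(countByCategory(sampleDocs, 'year').size); // distinct filter values
console.log(countByBucket(sampleDocs, 'year', decades));
```

Here five documents give four year filters but only three decade filters, and the gap widens quickly as the data grows.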

How does a/the index work?

Issue over at search-index: fergiemcdowall/search-index#238
Explain how the different parts of search-index work. Which part does the matcher use, and which does the searcher use?

Getting the data: JSON import (fetch)

Either data as JSON or export from another search-index

Put a "not compatible with latest norch/search-index"-flag on readme.md

So people won't use it yet.

Explain all features and if they reply with an object or stream of objects

Show which functions have a Norch equivalent

"phrased search"

How to let the user do a "phrased search".

This works out of the box, but none of the frontends we've created use it. It should be explained that all you need to do is NOT split the query string into separate words.
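
On the frontend, "not splitting" could look like the sketch below, which keeps anything inside double quotes as one term and splits the rest on whitespace. The quote convention and the function name toSearchTerms are assumptions; how the resulting terms are placed in the query object is not shown.

```javascript
// Sketch: keep "quoted phrases" whole, split everything else on whitespace.
function toSearchTerms(queryString) {
  const terms = [];
  const pattern = /"([^"]+)"|(\S+)/g;
  let match;
  while ((match = pattern.exec(queryString)) !== null) {
    // match[1] is a quoted phrase, match[2] a single unquoted word
    terms.push((match[1] || match[2]).toLowerCase());
  }
  return terms;
}

console.log(toSearchTerms('"New York" pizza'));
```

A query without quotes behaves exactly as before, so existing frontends would not change for users who never type a quote.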

General document/query processing step

How to add a document processing step to the stream.

change links to relative

Change links to relative, i.e. query object when in topics

Create your first search engine backend and frontend

On your own server/laptop:

Easiest steps to get something up and running:

dataset + config: JSON Gist
data in: search-index-indexer
norch: point to index (and accept calls from other IP)
norch-angular-app: config your frontend

Cloud

Heroku / norch.io

Show how you can sort search results by something other than tf-idf

Numerical and alphabetical sorting of search results, facets etc.

Use cases:

Numeric sorting could lead to geographic proximity sorting, and other fun stuff.
When the type and amount of facets are known, you could use alphabetical sort
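
As a rough illustration of both use cases, the sketch below sorts an array of result objects after they have come back from the index. The field names (title, distanceKm) and the sample hits are made up, and sorting on the client like this only reorders the results already fetched.

```javascript
// Sketch: re-sort returned results by a numeric or a text field.
// Both helpers copy the array so the original order is left untouched.
function sortByNumber(results, field, descending = false) {
  const sign = descending ? -1 : 1;
  return results.slice().sort((a, b) => sign * (a[field] - b[field]));
}

function sortByText(results, field) {
  return results
    .slice()
    .sort((a, b) => String(a[field]).localeCompare(String(b[field])));
}

const hits = [
  { title: 'Oslo', distanceKm: 12 },
  { title: 'Bergen', distanceKm: 3 },
  { title: 'Alta', distanceKm: 40 }
];

console.log(sortByNumber(hits, 'distanceKm').map((hit) => hit.title));
console.log(sortByText(hits, 'title').map((hit) => hit.title));
```

For geographic proximity, the numeric field would be a distance computed per query rather than a value stored in the document.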

How to use opensearch.xml to turn on search from the browser search box and enable autodetection of your search solution. Here are the most important docs from opensearch.org: http://www.opensearch.org/Specifications/OpenSearch/Extensions/Suggestions/1.1#Example_3

fielded search
nGramLength
searchable: false
batchSize
...
Document what the tradeoffs are

n-grams + matcher with new search index

fergiemcdowall/search-index#487

Update to search-index v.0.9.x

A lot of stuff has changed between search-index v.0.8.x and v.0.9.x. Need a full makeover.

eklem / search-index-cookbook

search-index-cookbook's Introduction

search-index and norch cookbook

NOT COMPATIBLE WITH LATEST SEARCH-INDEX !!!

Topics

Pitfalls

References

Get up and running with Node and NPM

TODO
