Light

zgornel / garamond.jl Goto Github PK

View Code? Open in Web Editor NEW

14.0 7.0 1.0 1.34 MB

A small, flexible neural and data search engine, written in Julia. Batteries not included.

License: MIT License

Julia 100.00%

search-engine search semantic information-retrieval search-in-text neural-search julia

garamond.jl's Introduction

Installation

Installation can be performed by:

first cloning with git clone https://github.com/zgornel/Garamond.jl
then running julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate()' from the project root directory

Binary executables of the search server and clients can be built by running ./make.jl from the build/ directory.

Usage

For information and examples over the usage of the search engine, visit the documentation.

License

This code has an MIT license.

References

Search engines on Wikipedia

Semantic search on Wikipedia

Word embeddings

Acknowledgements

This work could not have been possible without the great work of the people developing all the Julia packages and other technologies Garamond is based upon.

Citing

@misc{cofaru2019garamond, title={Garamond}, author={Corneliu, Cofaru and others}, year={2019}, publisher={GitHub}, howpublished={\url{https://github.com/zgornel/Garamond.jl}}, }

Reporting Bugs

Garamond is at the moment under heavy development and much of the API and features are subject to change ¯\(ツ)/¯. Please file an issue to report a bug or request a feature.

garamond.jl's People

Contributors

Stargazers

Watchers

Forkers

garamond.jl's Issues

Extend base input parser

A query language should be defined and developed. Should word at query term level, with boolean operators. Main operations should implement logical AND, OR (at this point implicitandNOT` (i.e. negation)

Environment, searcher config refactor

The environment configuration, searcher configuration and config parser need a refactor in order to support particular options for embedders, indexing structures etc.

Embedders pool

This brings the capability of using different embedders (or embedding libraries) for query and data . This can make possible multilingual retrieval and other more complex operations.

Noop index support

Gives the ability to create searchers with empty indexes (but working embedders). This allows to start an engine with data only in the db (indexing is skipped). Any non-db query is impossible.

Save/Load/Re-index search environment

A full search environment should be saved / loaded

Add paging (result offset) support

Requests should support a parameter specifying what page in the results one desires. This is equivalent to supporting a result offset when building the response.

Make the data format in the 'json-data' return option configurable

It may be useful to return as matches ids and scores instead of index and scores. This is achievable through the json-data option which uses a more complex format and returns the full metadata of the document.

Solution:

make the json-data option return data which is configurable; one can return either the full metadata or only some fields in it + data etc.

Registration mechanism instead of symlink based for custom loaders, parsers etc.

A new mechanism based on explicit registering of loaders, parsers etc. should exist. This allows for

removal of the custom folders + symlinks directly specifying a custom hook file that includes all custom code (loaders, etc.);The path to the custom hook file should be a gars option.
This gives the ability for parsers calling to call other parsers, and so on. (i.e. an auto or pre-parser #28 )

Mockup

gars starts with gars -d config.jl --hook ./custom_hooks.jl -p 9000 --log-level debug
hook file, custom_hooks.jl contains a loader and a parser

@register_loader fload "A nice loader name"
function fload(path) 
    #...
end

@register_parser fparse "An even nicer parser name"
fparse(path)
    # ...
end

Now, a preparser would parse a raw query, extract the parser from a "parser>query" pattern
and execute it (this option could be a default for missing input_parser in requests, with a fallback to noop if no parser can be selected.

Implicit parser specification

The ability to select parser or parsing modes from a given text - i.e. parser detection stage (with default must exist)
In the request, the preparser should have a similar name i.e. pre_parser
This parser should access all available parsers

Examples:

noop> text noop mode, text is the actual query sent to db, searchers
nlp> some thext that can be nlp'd NLP mode
imp> https://julialang.org/v2/img/logo.svg` image mode
db> col1:value AND col2:[min, max]` juliadb mode
index> "index_1":"something embeddable" specify index and some text that the embedder of a searcher will recognize (i.e. simple text, a link to an image etc)

Ideea:
whatever query is sent gets triaged by an initial pre-parserthat looks for the pattern r"[a-zA-Z]+>"
and if found, selects the parser. A default has to exist, most probably noop_parser

Removal of Corpus

Needs upstream modifications in StringAnalysis. Not entirely clear if beneficial

Consider using alternative to JuliaDB

JuliaDB is more-or-less in maintenance mode, and has received very little in the way of bugfixes and performance improvements. While it might be hasty for me to say "don't depend on JuliaDB", I think at a minimum it would be a good idea to consider allowing other alternatives into Garamond's core. We could maje JuliaDB optional, and (try to) adopt the official Tables.jl interface so that other table implementations could be used. I'm not sure how indexing would work here, but I feel like it might be worth investigating approaches that aren't inherently tied to JuliaDB/IndexedTables.

Speaking as a maintainer of Dagger.jl, making JuliaDB optional would also make it possible to use newer versions of Dagger (since JuliaDB only works on old versions of Dagger) for parallelism and accelerator offload, and could make it possible for Garamond to operate as a distributed cluster of engines.

Start the engine by directly loading a cached environment

gars should support loading a search environment from cache
i.e.
gars --env-cache ./cache.bin -p 9000 -i 8999 --log-level debug

Add containerization sample files

Sample framework:

Sample Dockerfile for one engine instance
docker-compose.yml
Notes, readme

Decide on strategy for non-embeddable items

Items/documents semantically embedded have a zero i.e.zeros(T, n) value which may in some circustances yield higher than normal scores.
Solution:

all documents must be embeddable
another default vector has to be provided through a config option i.e. fill(T(val), n) where val is read from the configuration (for example val=100 works well) for increasing the distances and lowering the relevance scores of non-embeddable items

Fix documentation building

Increase test coverage

Rework suggestion mechanism

Multi-threading support

Add multi-threading support where needed and bump julia dependency to v1.3+

Ability to drop data after embedding

Data indexing (also online form) should allow for dropping an arbitrary number if fields, potentially all; this would allow for the engine not to store any data except indexes in the db and the search to employ only the searchers.
A bit the reverse of the noop index where the search is done in the db only.

Create small test dataset

A small dataset that can be loaded to an IndexedTable or NDSparse object and searched into using various configuration parameters.

Complete documentation sections

architecture: principles, plugins, diagram
internal APIs, extending the engine
data configuration options in the current form

Investigate document term/part weights

tf-idf, bm25 etc. weighing of word embeddings when embedding
different document parts weights

Possible solution: new document object i.e. Weighted1GramDocument{T} with a weight associated for each 1-ngram: for tf-idf and such, the term weight in the document, for different document parts, the part weight (probably fixed apriori)
Cons: necesitates changes in the document embedding approach, forces the embedder to work with a document instead of a vector of sentences

Possible solution 2: Post-process embedded documents (for weighing of word embeddings, before creating the search model); return also sentence weights from parser (for weighting different document, the weights would be an input to the document embedder)
Cons: Parser modification

push!/pop!/etc for indexes, searchers and env

Implement push!, pop!, pushfirst!, popfirst! and delete_from_index!

juliadb container, index, searcher and env level
handle missing methods (may impact the way consistency is enforced i.e. the check wether to atempt pushing into different index types that may or may not support pushing)
ensure consistency (throw SearchEnvConsistencyException) if an op fails in either db or searchers; ~~try to recover by pop-ing already pushed items as well ?~~
test

Support sorting option in request

Default an empty list i.e. sort by the linear index key. Several columns from the data can be provided as well as a reverse sort flag. Sorting applies to data filtering, prior to the filtering operation.

Enforce a single float type for all indexes

A search environment should operate with a single <:AbstractFloat type, into which all vectors should be converted. This will ensure type stability across all operations.

Make naive index underlying data structure a vector of vectors

The underlying data structure for the naive index is currently a matrix. This should be transformed into a Vector{Vector{T} where T<:AbstractFloat} so that insertions and deletions from it become feasible.

Ability to start engine with no data

Loading will occur only through online loading;

only the naive index will be supported
ivfadc requires cluster prototypes

Add push!/pop!/deleteat! etc support at search environment

push!, pop!, deleteat! etc. operations should operate at environment level:

data container (IndexedTable/NDSparse)
applied to the indexes (if not applicable, an error/exeception should be thrown)

Improve, extend indexing features

formalize index API
add support for IVFADC
~~add support for index parameters in configuration~~ EDIT: covered by #22
cleanup in indexing structures (if necessary) i.e. indexes that do not support updates.
implement push!, pop!, pushfirst!, popfirst! and delete_from_index! operations operating at environment level: data container (IndexedTable/NDSparse) and indexes (if not applicable, an error/exeception should be thrown) EDIT: covered by #26

Add documentation and examples

Add documentation
Add examples based on sample datasets and embeddings

Create small embeddings libraries

Small embeddings libraries (related to the data contained in the test dataset #7) using EmbeddingsAnalysis.jl

Expand data loader support

Some out of the box data loaders should be available:

databases (ODBC?)
knowledge graphs (the most known)
streaming data (i.e. request-index-request looping on some pattern?)

representing parts of a document as indexable entities

i want to search inside files and retrieve relevant sections (together with the name of the file and other file metadata)

Improve `garw` UI

Should use some new Julia-based graphical framework and contain:

query box and results placeholder (in one page)
query configuration in another

Should either connect through websocket or http(REST)

Add exact match information in response

Exact matches should be specified in the response somewhere by adding a new field or modifying existing ones.
EDIT: This could be extended to result statistics. To be defined at a later stage

Prepare compilation support for PackageCompiler v1.0+

The API for PackageCompiler changed and build_executable does not exist anymore.
This implies:

modifications of build/make.jl
possibly new boilerplate for gars, garc, garw

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs