zgornel / garamond.jl

A small, flexible neural and data search engine, written in Julia. Batteries not included.

License: MIT License

search-engine search semantic information-retrieval search-in-text neural-search julia

garamond.jl's Introduction


Installation

Installation can be performed by:

  • first cloning with git clone https://github.com/zgornel/Garamond.jl
  • then running julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate()' from the project root directory

Binary executables of the search server and clients can be built by running ./make.jl from the build/ directory.

Usage

For information and usage examples for the search engine, visit the documentation.

License

This code has an MIT license.

References

Search engines on Wikipedia

Semantic search on Wikipedia

Word embeddings

Acknowledgements

This work could not have been possible without the great work of the people developing all the Julia packages and other technologies Garamond is based upon.

Citing

@misc{cofaru2019garamond,
  title={Garamond},
  author={Corneliu, Cofaru and others},
  year={2019},
  publisher={GitHub},
  howpublished={\url{https://github.com/zgornel/Garamond.jl}},
}

Reporting Bugs

Garamond is at the moment under heavy development, and much of the API and feature set is subject to change ¯\_(ツ)_/¯. Please file an issue to report a bug or request a feature.

garamond.jl's People

Contributors

sorinescu, zgornel


Forkers

oxoaresearch

garamond.jl's Issues

Extend base input parser

A query language should be defined and developed. It should work at query-term level, with boolean operators. The main operations should implement logical AND (at this point implicit), OR and NOT (i.e. negation).
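As a rough sketch of what term-level parsing with an implicit AND and a negation operator could look like (the `-term` syntax and all names below are illustrative assumptions, not part of the current Garamond API):

```julia
# Sketch: split a query into required (implicit AND) and negated (NOT) terms.
# The "-term" negation syntax is an assumption, not the engine's actual grammar.
function parse_boolean(query::AbstractString)
    required, negated = String[], String[]
    for tok in split(query)
        if startswith(tok, "-") && length(tok) > 1
            push!(negated, tok[2:end])      # NOT: drop the leading '-'
        else
            push!(required, String(tok))    # implicit AND between plain terms
        end
    end
    (required = required, negated = negated)
end
```

A retrieval step could then intersect postings for `required` terms and subtract those of `negated` ones.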

Environment, searcher config refactor

The environment configuration, searcher configuration and config parser need a refactor in order to support particular options for embedders, indexing structures etc.

Embedders pool

This brings the capability of using different embedders (or embedding libraries) for queries and data. This can make multilingual retrieval and other more complex operations possible.

Noop index support

Gives the ability to create searchers with empty indexes (but working embedders). This allows starting an engine with data only in the db (indexing is skipped). Any non-db query is impossible.

Add paging (result offset) support

Requests should support a parameter specifying which page of the results one desires. This is equivalent to supporting a result offset when building the response.
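The page-to-offset translation could be sketched as follows, assuming a 1-based page parameter (the names `paginate` and `page_size` are assumptions):

```julia
# Sketch: select one page of results; `page` is 1-based.
function paginate(results::AbstractVector, page::Integer, page_size::Integer)
    first_idx = (page - 1) * page_size + 1          # the result offset
    last_idx  = min(page * page_size, length(results))
    first_idx > last_idx ? empty(results) : results[first_idx:last_idx]
end
```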

Make the data format in the 'json-data' return option configurable

It may be useful to return ids and scores as matches instead of indexes and scores. This is achievable through the json-data option, which uses a more complex format and returns the full metadata of the document.

Solution:

  • make the data returned by the json-data option configurable; one can return either the full metadata or only some of its fields + data etc.
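A minimal sketch of configurable field selection for the returned match data (the Dict-based document representation and the function name are assumptions):

```julia
# Sketch: keep only the configured fields of a document's metadata.
select_fields(doc::AbstractDict, fields) =
    Dict(f => doc[f] for f in fields if haskey(doc, f))
```

With fields read from the engine configuration, the full-metadata behavior is just the special case where all fields are listed.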

Registration mechanism instead of symlink based for custom loaders, parsers etc.

A new mechanism based on explicit registration of loaders, parsers etc. should exist. This allows for:

  • removal of the custom folders + symlinks by directly specifying a custom hook file that includes all custom code (loaders, etc.); the path to the custom hook file should be a gars option.
  • the ability for parsers to call other parsers, and so on (i.e. an auto or pre-parser #28 ).

Mockup

  • gars starts with gars -d config.jl --hook ./custom_hooks.jl -p 9000 --log-level debug
  • the hook file, custom_hooks.jl, contains a loader and a parser:
@register_loader fload "A nice loader name"
function fload(path)
    # ...
end

@register_parser fparse "An even nicer parser name"
function fparse(path)
    # ...
end

Now, a pre-parser would parse a raw query, extract the parser from a "parser>query" pattern
and execute it (this option could be a default for a missing input_parser in requests, with a fallback to noop if no parser can be selected).

Implicit parser specification

The ability to select a parser or parsing mode from a given text, i.e. a parser detection stage (a default must exist).
In the request, the pre-parser should have a similar name, i.e. pre_parser.
This parser should have access to all available parsers.

Examples:

  • noop> text: noop mode; text is the actual query sent to the db and searchers
  • nlp> some text that can be nlp'd: NLP mode
  • imp> https://julialang.org/v2/img/logo.svg: image mode
  • db> col1:value AND col2:[min, max]: juliadb mode
  • index> "index_1":"something embeddable": specify an index and some text that the embedder of a searcher will recognize (i.e. simple text, a link to an image etc.)

Idea:
whatever query is sent gets triaged by an initial pre-parser that looks for the pattern r"[a-zA-Z]+>"
and, if found, selects the parser. A default has to exist, most probably noop_parser.

Removal of Corpus

Needs upstream modifications in StringAnalysis. It is not entirely clear whether this would be beneficial.

Consider using alternative to JuliaDB

JuliaDB is more-or-less in maintenance mode, and has received very little in the way of bugfixes and performance improvements. While it might be hasty for me to say "don't depend on JuliaDB", I think at a minimum it would be a good idea to consider allowing other alternatives into Garamond's core. We could make JuliaDB optional, and (try to) adopt the official Tables.jl interface so that other table implementations could be used. I'm not sure how indexing would work here, but I feel like it might be worth investigating approaches that aren't inherently tied to JuliaDB/IndexedTables.

Speaking as a maintainer of Dagger.jl, making JuliaDB optional would also make it possible to use newer versions of Dagger (since JuliaDB only works on old versions of Dagger) for parallelism and accelerator offload, and could make it possible for Garamond to operate as a distributed cluster of engines.

Decide on strategy for non-embeddable items

Items/documents that cannot be semantically embedded have a zero value, i.e. zeros(T, n), which may in some circumstances yield higher than normal scores.
Solution:

  • all documents must be embeddable
  • another default vector has to be provided through a config option, i.e. fill(T(val), n) where val is read from the configuration (for example, val=100 works well), to increase the distances and lower the relevance scores of non-embeddable items
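A minimal sketch of the second option, assuming non-embeddable items show up as all-zero vectors (the function name and the default val are assumptions):

```julia
# Sketch: replace an all-zero (non-embeddable) embedding with fill(T(val), n),
# pushing such items far away and lowering their relevance scores.
function default_if_zero(v::AbstractVector{T}; val = 100) where {T<:AbstractFloat}
    all(iszero, v) ? fill(T(val), length(v)) : v
end
```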

Ability to drop data after embedding

Data indexing (also in online form) should allow for dropping an arbitrary number of fields, potentially all; this would allow the engine to store no data except indexes in the db, and the search to employ only the searchers.
A bit the reverse of the noop index, where the search is done in the db only.

Create small test dataset

A small dataset that can be loaded into an IndexedTable or NDSparse object and searched using various configuration parameters.

Complete documentation sections

  • architecture: principles, plugins, diagram
  • internal APIs, extending the engine
  • data configuration options in the current form

Investigate document term/part weights

  • tf-idf, bm25 etc. weighting of word embeddings when embedding
  • different weights for different document parts

Possible solution: a new document object, i.e. Weighted1GramDocument{T}, with a weight associated with each 1-gram: for tf-idf and such, the term weight in the document; for different document parts, the part weight (probably fixed a priori).
Cons: necessitates changes in the document embedding approach; forces the embedder to work with a document instead of a vector of sentences.

Possible solution 2: post-process embedded documents (for weighting of word embeddings, before creating the search model); also return sentence weights from the parser (for weighting different document parts, the weights would be an input to the document embedder).
Cons: parser modification.

push!/pop!/etc for indexes, searchers and env

Implement push!, pop!, pushfirst!, popfirst! and delete_from_index!

  • juliadb container, index, searcher and env level
  • handle missing methods (may impact the way consistency is enforced, i.e. the check whether to attempt pushing into different index types that may or may not support pushing)
  • ensure consistency (throw SearchEnvConsistencyException) if an op fails in either db or searchers; try to recover by popping already-pushed items as well?
  • test
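The consistency requirement could be sketched as below, with plain vectors standing in for the db and index containers; the rollback strategy and all names are assumptions (the real code would throw SearchEnvConsistencyException rather than rethrow):

```julia
# Sketch: push a document into the db and its vector into every index;
# if any index push fails, pop the already-pushed item back out of the db.
function push_consistent!(db, indexes, doc, vec)
    push!(db, doc)
    try
        foreach(idx -> push!(idx, vec), indexes)
    catch e
        pop!(db)          # recover by popping the already-pushed item
        rethrow(e)
    end
    nothing
end
```

A fuller version would also roll back partial index pushes before rethrowing.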

Support sorting option in request

The default is an empty list, i.e. sort by the linear index key. Several columns from the data can be provided, as well as a reverse sort flag. Sorting applies to the data prior to the filtering operation.
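Multi-column sorting with a reverse flag could be sketched as follows, with NamedTuples standing in for table rows (the real data lives in IndexedTable/NDSparse; names are assumptions):

```julia
# Sketch: sort rows (NamedTuples) by several columns, with a reverse flag.
# An empty `cols` leaves the (stable) original order, i.e. the linear index key.
sort_rows(rows, cols; rev = false) =
    sort(rows; by = r -> Tuple(r[c] for c in cols), rev = rev)
```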

Enforce a single float type for all indexes

A search environment should operate with a single <:AbstractFloat type, into which all vectors should be converted. This will ensure type stability across all operations.
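The conversion could be sketched as (the function name is an assumption):

```julia
# Sketch: coerce every vector to a single float type T for type stability.
coerce_vectors(::Type{T}, vecs) where {T<:AbstractFloat} =
    [convert(Vector{T}, v) for v in vecs]
```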

Improve, extend indexing features

  • formalize index API
  • add support for IVFADC
  • add support for index parameters in configuration EDIT: covered by #22
  • cleanup in indexing structures (if necessary) i.e. indexes that do not support updates.
  • implement push!, pop!, pushfirst!, popfirst! and delete_from_index! operations at environment level: data container (IndexedTable/NDSparse) and indexes (if not applicable, an error/exception should be thrown) EDIT: covered by #26

Expand data loader support

Some out of the box data loaders should be available:

  • databases (ODBC?)
  • knowledge graphs (the most known)
  • streaming data (i.e. request-index-request looping on some pattern?)

Improve `garw` UI

Should use some new Julia-based graphical framework and contain:

  • query box and results placeholder (in one page)
  • query configuration in another

Should connect either through a websocket or through HTTP (REST).

Add exact match information in response

Exact matches should be specified somewhere in the response, by adding a new field or modifying existing ones.
EDIT: This could be extended to result statistics. To be defined at a later stage.
