
gajus / liqe


Lightweight and performant Lucene-like parser, serializer and search engine.

License: Other

Shell 0.08% TypeScript 91.27% Nearley 8.65%
lucene search filter parser serializer

liqe's Introduction

liqe


Lightweight and performant Lucene-like parser, serializer and search engine.

Motivation

I originally built Liqe to enable Roarr log filtering via the CLI. I have since been polishing this project as a hobby and intellectual exercise, and I have seen it adopted by various CLI and web applications that require advanced search. To my knowledge, it is currently the most complete Lucene-like syntax parser and serializer in JavaScript, as well as a compatible in-memory search engine.

Liqe use cases include:

  • parsing search queries
  • serializing parsed queries
  • searching JSON documents using the Liqe query language (LQL)

Note that the Liqe AST is treated as a public API, i.e., you can implement your own search mechanism on top of the Liqe query language (LQL).

Usage

import {
  filter,
  highlight,
  parse,
  test,
} from 'liqe';

const persons = [
  {
    height: 180,
    name: 'John Morton',
  },
  {
    height: 175,
    name: 'David Barker',
  },
  {
    height: 170,
    name: 'Thomas Castro',
  },
];

Filter a collection:

filter(parse('height:>170'), persons);
// [
//   {
//     height: 180,
//     name: 'John Morton',
//   },
//   {
//     height: 175,
//     name: 'David Barker',
//   },
// ]

Test a single object:

test(parse('name:John'), persons[0]);
// true
test(parse('name:David'), persons[0]);
// false

Highlight matching fields and substrings:

highlight(parse('name:john'), persons[0]);
// [
//   {
//     path: 'name',
//     query: /(John)/,
//   }
// ]
highlight(parse('height:180'), persons[0]);
// [
//   {
//     path: 'height',
//   }
// ]
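
The highlight entries above can be turned into marked-up text in a few lines. This is a minimal sketch, not part of Liqe: `getByPath` and `renderHighlight` are hypothetical helpers, the `Highlight` type only mirrors the fields shown in the example output, and the code assumes `query` contains exactly one capture group (as in `/(John)/`). It does no HTML escaping.

```typescript
// Shape of a highlight entry, mirroring the example output above.
type Highlight = {
  path: string;
  query?: RegExp;
};

// Hypothetical helper: resolves a dot-separated path such as "name.first".
const getByPath = (subject: unknown, path: string): unknown => {
  return path.split('.').reduce<unknown>((current, key) => {
    if (current !== null && typeof current === 'object') {
      return (current as Record<string, unknown>)[key];
    }

    return undefined;
  }, subject);
};

// Hypothetical helper: wraps the matched substring in <mark> tags, or the
// whole value when the entry carries no query (e.g. a number match).
const renderHighlight = (subject: unknown, highlight: Highlight): string => {
  const value = String(getByPath(subject, highlight.path));

  if (!highlight.query) {
    return '<mark>' + value + '</mark>';
  }

  return value.replace(highlight.query, '<mark>$1</mark>');
};
```

For example, calling `renderHighlight(persons[0], {path: 'name', query: /(John)/})` wraps only the matched substring, while an entry without `query` wraps the whole field value.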

Query Syntax

Liqe uses Liqe Query Language (LQL), which is heavily inspired by Lucene but extends it in various ways that allow a more powerful search experience.

Liqe syntax cheat sheet

# search for "foo" term anywhere in the document (case insensitive)
foo

# search for "foo" term anywhere in the document (case sensitive)
'foo'
"foo"

# search for "foo" term in `name` field
name:foo

# search for "foo" term in `full name` field
'full name':foo
"full name":foo

# search for "foo" term in `first` field, member of `name`, i.e.
# matches {name: {first: 'foo'}}
name.first:foo

# search using regex
name:/foo/
name:/foo/o

# search using wildcard
name:foo*bar
name:foo?bar

# boolean search
member:true
member:false

# null search
member:null

# search for height =, >, >=, <, <=
height:=100
height:>100
height:>=100
height:<100
height:<=100

# search for height in range (inclusive, exclusive)
height:[100 TO 200]
height:{100 TO 200}

# boolean operators
name:foo AND height:=100
name:foo OR name:bar

# unary operators
NOT foo
-foo
NOT foo:bar
-foo:bar
name:foo AND NOT (bio:bar OR bio:baz)

# implicit AND boolean operator
name:foo height:=100

# grouping
name:foo AND (bio:bar OR bio:baz)

Keyword matching

Search for word "foo" in any field (case insensitive).

foo

Search for word "foo" in the name field.

name:foo

Search for name field values matching /foo/i regex.

name:/foo/i

Search for name field values matching f*o wildcard pattern.

name:f*o

Search for name field values matching f?o wildcard pattern.

name:f?o

Search for phrase "foo bar" in the name field (case sensitive).

name:"foo bar"

Number matching

Search for value equal to 100 in the height field.

height:=100

Search for value greater than 100 in the height field.

height:>100

Search for value greater than or equal to 100 in the height field.

height:>=100

Range matching

Search for value greater than or equal to 100 and lower than or equal to 200 in the height field.

height:[100 TO 200]

Search for value greater than 100 and lower than 200 in the height field.

height:{100 TO 200}
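
The difference between the two bracket styles comes down to whether the bounds themselves match. A minimal sketch of those semantics, using a hypothetical `isInRange` helper that is not part of Liqe's API:

```typescript
// Hypothetical helper illustrating the two range forms:
//   inclusive = true  corresponds to height:[min TO max]
//   inclusive = false corresponds to height:{min TO max}
const isInRange = (
  value: number,
  min: number,
  max: number,
  inclusive: boolean,
): boolean => {
  return inclusive
    ? value >= min && value <= max
    : value > min && value < max;
};
```

With these definitions, a height of exactly 100 satisfies `[100 TO 200]` but not `{100 TO 200}`.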

Wildcard matching

Search for any word that starts with "foo" in the name field.

name:foo*

Search for any word that starts with "foo" and ends with "bar" in the name field.

name:foo*bar

Search for any word that starts with "foo" in the name field, followed by a single arbitrary character.

name:foo?

Search for any word that starts with "foo", followed by a single arbitrary character and immediately ends with "bar" in the name field.

name:foo?bar

Boolean operators

Search for phrase "foo bar" in the name field AND the phrase "quick fox" in the bio field.

name:"foo bar" AND bio:"quick fox"

Search for either the phrase "foo bar" in the name field AND the phrase "quick fox" in the bio field, or the word "fox" in the name field.

(name:"foo bar" AND bio:"quick fox") OR name:fox

Serializer

The serializer converts Liqe tokens back to the original search query.

import {
  parse,
  serialize,
} from 'liqe';

const tokens = parse('foo:bar');

// {
//   expression: {
//     location: {
//       start: 4,
//     },
//     quoted: false,
//     type: 'LiteralExpression',
//     value: 'bar',
//   },
//   field: {
//     location: {
//       start: 0,
//     },
//     name: 'foo',
//     path: ['foo'],
//     quoted: false,
//     type: 'Field',
//   },
//   location: {
//     start: 0,
//   },
//   operator: {
//     location: {
//       start: 3,
//     },
//     operator: ':',
//     type: 'ComparisonOperator',
//   },
//   type: 'Tag',
// }

serialize(tokens);
// 'foo:bar'

AST

import {
  type BooleanOperatorToken,
  type ComparisonOperatorToken,
  type EmptyExpression,
  type FieldToken,
  type ImplicitBooleanOperatorToken,
  type ImplicitFieldToken,
  type LiteralExpressionToken,
  type LogicalExpressionToken,
  type RangeExpressionToken,
  type RegexExpressionToken,
  type TagToken,
  type UnaryOperatorToken,
} from 'liqe';

There are 11 AST tokens that describe a parsed Liqe query.

If you are building a serializer, then you must implement all of them for the complete coverage of all possible query inputs. Refer to the built-in serializer for an example.
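
To give a feel for what a custom serializer involves, here is a minimal sketch that handles only a tag with a literal expression. The local types are simplified assumptions based on the AST shown in the Serializer section, not Liqe's exported types, and `serializeTag` is a hypothetical name. A complete serializer would need a branch for every token type.

```typescript
// Simplified, assumed shapes covering only a subset of the AST.
type LiteralExpression = {
  type: 'LiteralExpression';
  quoted: boolean;
  value: string;
};

type Field =
  | { type: 'Field'; name: string }
  | { type: 'ImplicitField' };

type Tag = {
  type: 'Tag';
  field: Field;
  operator?: { operator: string };
  expression: LiteralExpression;
};

// Hypothetical serializer for the subset above.
// Quoted values are always emitted with double quotes in this sketch.
const serializeTag = (tag: Tag): string => {
  const { expression, field } = tag;

  const value = expression.quoted
    ? JSON.stringify(expression.value)
    : expression.value;

  if (field.type === 'ImplicitField') {
    return value;
  }

  return field.name + (tag.operator?.operator ?? ':') + value;
};
```

Because this sketch always emits double quotes, a query that was originally written with single quotes would not round-trip character for character.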

Utilities

import {
  isSafeUnquotedExpression,
} from 'liqe';

/**
 * Determines if an expression requires quotes.
 * Use this if you need to programmatically manipulate the AST
 * before using a serializer to convert the query back to text.
 */
isSafeUnquotedExpression(expression: string): boolean;

Compatibility with Lucene

The following Lucene abilities are not supported:

Recipes

Handling syntax errors

In case of a syntax error, Liqe throws SyntaxError.

import {
  parse,
  SyntaxError,
} from 'liqe';

try {
  parse('foo bar');
} catch (error) {
  if (error instanceof SyntaxError) {
    console.error({
      // Syntax error at line 1 column 5
      message: error.message,
      // 4
      offset: error.offset,
      // 1
      line: error.line,
      // 5
      column: error.column,
    });
  } else {
    throw error;
  }
}
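
The `column` value is convenient for showing users where their query went wrong. A minimal sketch, assuming a single-line query and a 1-based column as in the example above; `pointAtColumn` is a hypothetical helper, not something Liqe exports:

```typescript
// Hypothetical helper: renders the query with a caret under the
// offending column. `column` is 1-based.
const pointAtColumn = (query: string, column: number): string => {
  const padding = ' '.repeat(Math.max(column - 1, 0));

  return query + '\n' + padding + '^';
};
```

Inside the `catch` block, `pointAtColumn(query, error.column)` would produce a two-line string that can be printed under the error message.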

Highlighting matches

Consider using the highlight-words package to highlight Liqe matches.

Development

Compiling Parser

If you are going to modify the parser, use npm run watch to run the compiler in watch mode.

Benchmarking Changes

Before making any changes, capture the current benchmark on your machine using npm run benchmark. Run the benchmark again after making your changes. Before committing, ensure that performance is not negatively impacted.

Tutorials

liqe's People

Contributors

gajus, martinma

liqe's Issues

Support implicit AND

The parser does not currently support implicit "AND", i.e. a space between two statements without an operator. In those cases, we want to treat the space as "AND".

These tests are marked as skipped:

Solving this issue requires editing src/grammar.ne to enable the skipped tests.

See https://github.com/gajus/liqe#development for additional details.

Support for currently unsupported Lucene features

Hi,

I was looking for a Lucene parser/builder and first found https://github.com/bripkens/lucene, but I quickly noticed it was basically dead and unmaintained. Its open issues highlight critical problems that are claimed to be solved in liqe. The big downside for me is that I need a parser/builder that supports all (or at least as many as possible) Lucene features, and I can't afford to go without fuzzy searches, proximity searches and boosting terms.

I would like to know if supporting those features is planned. I understand that liqe is not just a parser/builder and that implementing those features in the search engine is complicated; I assume that is why they are not supported at the moment.

If supporting those features is too complicated to implement in the search engine, is it possible to add partial support (so just for the parser and the builder)?

Add an index method

Input:

{
  height: 175,
  location: {
    city: 'London',
  },
  name: 'mike',
}

Output:

{
  '$liqe.index': [
    {
      path: 'height',
      value: 175,
    },
    {
      path: 'location.city',
      value: 'London',
    },
    {
      path: 'name',
      value: 'mike',
    }
  ],
  height: 175,
  location: {
    city: 'London',
  },
  name: 'mike',
}

This may help simplify filter / highlighter logic, though it may not be worth it because of the memory overhead.

In case of a query such as "London", we could just run through everything in $liqe.index.
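
The proposed transformation could look roughly like the following. This is a sketch of the idea only: `buildIndex` and `withIndex` are hypothetical names, nothing here exists in Liqe, and arrays are treated as leaf values, which is an assumption the issue does not address.

```typescript
type IndexEntry = {
  path: string;
  value: unknown;
};

// Hypothetical helper: flattens nested objects into path/value pairs,
// joining nested keys with a dot.
const buildIndex = (
  subject: Record<string, unknown>,
  prefix = '',
): IndexEntry[] => {
  const entries: IndexEntry[] = [];

  for (const [key, value] of Object.entries(subject)) {
    const path = prefix ? prefix + '.' + key : key;

    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      entries.push(...buildIndex(value as Record<string, unknown>, path));
    } else {
      entries.push({ path, value });
    }
  }

  return entries;
};

// Hypothetical helper: returns a copy of the document with the index attached.
const withIndex = <T extends Record<string, unknown>>(subject: T) => {
  return {
    '$liqe.index': buildIndex(subject),
    ...subject,
  };
};
```

A field-less query would then only need to iterate over the flat `$liqe.index` array instead of walking the document tree on every match.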

Incorrect definition of TagToken in types.ts

Problem

  • Parsing "a" generates an AST that is not consistent with the definition in types.ts
  • Based on types.ts, the "operator" property is NOT optional
  • Parsing "a" generates a TagToken without the "operator" property - example here

Proposed solution

  • Fix types.ts or parser

Versions

Generated AST

{
	"location": {
		"start": 0,
		"end": 1
	},
	"field": {
		"type": "ImplicitField"
	},
	"type": "Tag",
	"expression": {
		"location": {
			"start": 0,
			"end": 1
		},
		"type": "LiteralExpression",
		"quoted": false,
		"value": "a"
	}
}

Wildcard doesn't work for number strings

I am trying to use createdAt:2022-01-01* but get a syntax error.

createdAt is an ISO date string, e.g. 2021-12-25T00:00:00.000Z

I also tried

createdAt:2021* - syntax error
createdAt:"2021*" - no syntax error, but it doesn't return the right results
createdAt:"2021"* - syntax error

Other string fields work fine, e.g. name:item* returns matches for item 1, item 2 and item 3

Dates?

It looks like date strings are not supported by the >, >=, <, <= operators.

E.g. createdAt:>=2021-12-25T00:00:00.000Z causes a syntax error.

How easy, or difficult, would it be to add this in?

Wildcard star * doesn't work as expected

The current implementation for wildcard * does not comply with the Lucene Syntax "specification".

The docs state the following:

Implements the wildcard search query. Supported wildcards are *, which matches any
character sequence (including the empty one), and ?, which matches any single
character. '\' is the escape character.

Source: https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/java/org/apache/lucene/search/WildcardQuery.java
Source: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

However, liqe treats * as an arbitrary character sequence with a minimum length of 1, whereas the documentation allows an empty character sequence as well.

I know that liqe is just a "Lucene-like" parser, serializer and search engine. But I wonder if this design choice was on purpose or by accident?

I would like to adjust the search engine to comply with the standard in this subject.

What do you think about it?

I know that it would mean a breaking change. But you could bump the version number.

An alternative solution would be for the filter function to take a config param, so that the standard behavior can be activated with some sort of strict flag.
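
To make the two behaviors concrete, here is a sketch of a wildcard-to-regex conversion with a `strict` switch. It is not Liqe's implementation: `wildcardToRegExp` and the `strict` option are hypothetical, the pattern is assumed to match against the whole value, and the `\` escape character from the Lucene docs is not handled.

```typescript
// Escapes characters that have a special meaning in a regular expression.
const escapeRegExp = (text: string): string => {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
};

// Hypothetical helper:
//   strict = true  -> "*" matches any sequence, including the empty one (Lucene)
//   strict = false -> "*" matches at least one character (behavior described above)
const wildcardToRegExp = (
  pattern: string,
  options: { strict: boolean },
): RegExp => {
  const star = options.strict ? '.*' : '.+';

  const source = pattern
    .split(/([*?])/)
    .map((part) => {
      if (part === '*') {
        return star;
      }

      if (part === '?') {
        return '.';
      }

      return escapeRegExp(part);
    })
    .join('');

  return new RegExp('^' + source + '$');
};
```

With `strict: true`, the pattern `foo*` matches the value `foo`; with `strict: false` it does not, which is the difference this issue is about.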

Parse Error - model:AP7900

const { parse } = require('liqe');
const parsed = parse('model:AP7900');
console.log(parsed);

/app/node_modules/liqe/dist/src/parse.js:34
            throw new errors_1.SyntaxError("Syntax error at line ".concat(match.groups.line, " column ").concat(match.groups.column), error.offset, Number(match.groups.line), Number(match.groups.column));
            ^

SyntaxError: Syntax error at line 1 column 9
    at parse (/app/node_modules/liqe/dist/src/parse.js:34:19)
    at Object.<anonymous> (/app/tmp.js:3:16)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12)
    at internal/main/run_main_module.js:17:47 {
  offset: 8,
  line: 1,
  column: 9
}

Should word without wildcard partial match by default?

Hi, thanks for working on this!

Maybe I'm wrong about the Lucene specification, but I thought that words do not partial match by default. At the moment, the following returns true:

test(parse("name:Joh"), { name: "John Morton" })

Is this the intended behavior?

Error when parsing non-English text

When I put some unquoted Chinese characters in the search string, I get this error when parsing the text.

SyntaxError: Syntax error at line 1 column 12

The search text is 测试

Parsing of Field Grouping not working

Problem

  • Parsing of Field Grouping throws error
  • Error is thrown as well in case of +/- prefix and missing field name
  • All examples below parse correctly with Java Lucene parser
  • Syntax is in line with:
  • Fixing this would:
    • Increase compatibility with Lucene
    • Provide an alternative way to supply a list of values in the format a:(b OR c OR d) or a:(b c d) (the latter if it becomes possible to change the default operator to OR)

Example of failing searches

  • +(a b)
  • -(a b)
  • c:(a b)
  • +c:(a b)
  • -c:(a b)

Versions
3.5.0 - on https://npm.runkit.com/liqe
3.6.0 - with Angular 14.0.4

Demonstration

https://runkit.com/embed/msogu7iiwefg

Filter should also search keys

Note: This issue is describing an improvement to the search engine, not the parser/serializer.

  • foo should match {"fooBar":"baz"}
  • "foo" should not match {"fooBar":"bar"}

This complicates highlighting logic, though.

Do not include highlights from non-matching branches

test.skip(
  'does not include highlights from non-matching branches',
  testQuery,
  'name:foo AND NOT name:foo',
  {
    name: 'foo',
  },
  [],
);

liqe/test/liqe/highlight.ts

Lines 233 to 241 in 62bf2c0

This test should produce an empty result. However, it instead produces a name:foo match.

We need to exclude highlights from the entire group if the group is rejected.
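
One way to get this behavior is to have each node return its match status together with its highlights, and to drop the highlights whenever the enclosing group fails. The sketch below uses a deliberately simplified query tree with hypothetical node types, not Liqe's AST, and exact string comparison instead of real matching.

```typescript
// Simplified, hypothetical query tree.
type Query =
  | { type: 'Tag'; field: string; value: string }
  | { type: 'And'; left: Query; right: Query }
  | { type: 'Not'; operand: Query };

type Result = {
  matched: boolean;
  highlights: string[];
};

const evaluate = (query: Query, subject: Record<string, string>): Result => {
  switch (query.type) {
    case 'Tag':
      return subject[query.field] === query.value
        ? { matched: true, highlights: [query.field] }
        : { matched: false, highlights: [] };

    case 'And': {
      const left = evaluate(query.left, subject);
      const right = evaluate(query.right, subject);

      // If either side fails, the whole group is rejected and
      // highlights collected from the other side are discarded.
      return left.matched && right.matched
        ? { matched: true, highlights: [...left.highlights, ...right.highlights] }
        : { matched: false, highlights: [] };
    }

    case 'Not':
      // A negated branch never contributes highlights.
      return {
        matched: !evaluate(query.operand, subject).matched,
        highlights: [],
      };
  }
};
```

For the skipped test above, the left `name:foo` matches, the `NOT name:foo` side does not, so the `And` node returns no highlights at all.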

Docs: Add info on where it's supposed to be used

To help newcomers to this amazing library, add information on where it is supposed to be used. Please provide use cases, and state whether it can be used in Node.js or in the browser.
Clarify that data is stored in memory, so for very large arrays this could be a key consideration.

Decide how we handle "foo:"

foo:

On one hand, it is a syntax error since it is missing the expression.

On the other hand, we could provide much more user-friendly errors if we identified that the right-hand side is empty.

This would require introducing a new token that represents empty though.

The same "empty" concept could then be applied to an empty query, () and other places.

A few outstanding questions, ...

How do we handle an empty query?

I suppose it would just return an empty token.

{
  location: {...},
  type: 'EmptyExpression',
}

What would the location be, though, if the query contains multiple whitespace characters?

How do we handle foo:?

Same question as with the empty query.

How do we distinguish foo: bar?

At the moment, we allow space between field and expression, i.e. bar becomes a value of foo.

This change would require that we do not allow space between operator and value. I think that's a good thing either way (#19).

How do we handle () and ( )?

The challenge here is that every token has to have location with start,end.

Let's say that ( ) has EmptyExpression

{
  location: {start: 1, end: 4},
  type: 'EmptyExpression',
}

What would the location be in the earlier example of foo: bar?

support `foo:"bar` (open quotes)

This is needed if we don't want to throw a syntax error while the user is typing a search query that includes quotes. Without this, foo:"bar will immediately produce an error.

The challenge in implementing this is that

foo:"bar baz

produces ambiguous parsing.

We could limit the open expression to characters in unquoted_value range, but that would mostly defeat the purpose.

Boolean operators raise filter error: Expected left to be defined.

The README explains that boolean expressions like the following one can be used. However, the filter method raises the error "Expected left to be defined." when using this exact query (straight from the README):

(name:"foo bar" AND bio:"quick fox") OR name:fox

The parser works fine. The filter method seems to be flawed, from what I can see.

I also couldn't find any tests that cover this use case.

liqe version: 3.6.0
