Laravel Scout driver for SQL Database based Search Indexing

This package provides a generic Laravel Scout driver which performs full-text search on indexed model data using an SQL database as storage backend. Indexed data is stored in normalized form, allowing efficient and fuzzy search which does not require a full and/or exact match.

This driver is an alternative to teamtnt/laravel-scout-tntsearch-driver. The primary difference is that this driver provides fewer features (it lacks geo search, for example). In return, it works with all database systems supported by Laravel itself (which are basically all PDO drivers). The search algorithm is also slightly different.

All tests are run through GitHub Actions for PHP 8.0, 8.1 and 8.2 on the following database systems:

  • SQLite 3
  • MySQL 8.0
  • PostgreSQL 13.1
  • SQL Server 2017

The actual limitations regarding supported database systems are mostly related to the use of Common Table Expressions via staudenmeir/laravel-cte. Please make sure your database system is supported before using the package, or you might run into database errors.

Installation

You can install the package via composer:

composer require namoshek/laravel-scout-database

After installing the package, the configuration file as well as the migrations need to be published:

php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="config"
php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="migrations"

If you would like to use a table prefix other than scout_ for the tables created by this package, you should change the configuration as well as the published migrations accordingly. When you have done so, you can apply the database migrations:

php artisan migrate
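If you do change the prefix, the corresponding setting in config/scout-database.php might look like the following sketch. The table_prefix key name is an assumption; verify it against the published configuration file:

```php
// config/scout-database.php (excerpt)

return [
    // ...

    // Assumed key name; the published config file is authoritative.
    'table_prefix' => 'search_', // default: 'scout_'
];
```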

Upgrading from v0.x to v1.x

Migrations

With the new version, the database schema has changed and new migrations need to be published using:

php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="migrations"

The hint about custom table prefixes from the Installation section above applies to the newly published migrations as well.

Config

The configuration has been reduced slightly, and you may want to compare the new configuration file with the old one to remove obsolete settings. Skipping this step has no negative impact on the performance of the Scout driver, though.

Commands

The \Namoshek\Scout\Database\Commands\CleanWordsTable::class command has been removed; if you scheduled it previously, you should un-schedule it.

Noteworthy

Most protected fields and methods have been changed to private to simplify future development with regard to backwards-compatibility breaking changes. If you have not been actively overriding parts of the implementation, this does not affect you at all.

Configuration

In order to instruct Scout to use the driver provided by this package, you need to change the driver option in config/scout.php to database. If you did not change the Scout configuration file, you can also set the SCOUT_DRIVER environment variable to database instead.
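With an unmodified Scout configuration file, selecting the driver comes down to one environment variable. The config/scout.php excerpt below reflects Scout's stock configuration:

```php
// .env
// SCOUT_DRIVER=database

// config/scout.php (stock Scout configuration)
'driver' => env('SCOUT_DRIVER', 'algolia'),
```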

All available configuration options of the package itself can be found in config/scout-database.php. The options are described thoroughly in the file itself. By default, the package uses the UnicodeTokenizer and the PorterStemmer, which are suitable for the English language. The search adds a trailing wildcard to the last token, and not all search terms need to be found for a document to show up in the results (there must be at least one match, though).

You may also add a wildcard to every search token by enabling wildcard_all_tokens in the config file, although this is not recommended for performance reasons.
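As a rough orientation, the relevant part of config/scout-database.php could look like the sketch below. Only connection and wildcard_all_tokens are named in this document; the other key names, the env variable name, and the class namespaces are assumptions to be checked against the published file:

```php
// config/scout-database.php (illustrative sketch, not the authoritative file)

return [
    // The database connection used for the index tables.
    'connection' => env('SCOUT_DB_CONNECTION', 'mysql'), // env name assumed

    // Tokenizer and stemmer class namespaces are assumptions.
    'tokenizer' => \Namoshek\Scout\Database\Tokenizer\UnicodeTokenizer::class,
    'stemmer'   => \Namoshek\Scout\Database\Stemmer\PorterStemmer::class,

    // Adding a wildcard to every token is possible but discouraged
    // for performance reasons.
    'wildcard_all_tokens' => false,
];
```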

A basic installation most likely does not require you to change any of these settings. Just to be sure, you should have a look at the connection option, though. If you want to change it, do so before running the migrations, or the tables will be created using the wrong database connection.

Supported Tokenizers

Currently, only the UnicodeTokenizer is available. It splits strings at any character which is neither a letter nor a number according to the \p{L} and \p{N} regex character classes. This means that dots, colons, dashes, whitespace, etc. act as token separators.

If you have different requirements for a tokenizer, you can provide your own implementation via the configuration. Just make sure it implements the Tokenizer interface.
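A custom tokenizer might look like the sketch below. The namespace of the Tokenizer interface and its method signature are assumptions here; check the package source for the actual contract:

```php
<?php

namespace App\Search;

use Namoshek\Scout\Database\Contracts\Tokenizer; // assumed namespace

class SlugAwareTokenizer implements Tokenizer
{
    /**
     * Split the input into tokens, keeping letters, numbers and dashes
     * together so that slugs like "foo-bar" survive as a single token.
     * The signature is assumed; the real interface may differ.
     */
    public function tokenize(string $input): array
    {
        $tokens = preg_split('/[^\p{L}\p{N}-]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);

        return array_values(array_filter($tokens, fn (string $t): bool => $t !== '-'));
    }
}
```

The class would then be referenced in the tokenizer setting of config/scout-database.php.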

Supported Stemmers

Currently, all stemmers implemented by the wamania/php-stemmer package are available; a wrapper class has been added for each of them.

If you have different requirements for a stemmer, you can provide your own implementation via the configuration. Just make sure it implements the Stemmer interface.
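A custom stemmer can follow the same pattern. Again, the interface namespace and method signature are assumptions:

```php
<?php

namespace App\Search;

use Namoshek\Scout\Database\Contracts\Stemmer; // assumed namespace

class NullStemmer implements Stemmer
{
    /**
     * Return each word lower-cased but otherwise unchanged,
     * effectively disabling stemming. The signature is assumed.
     */
    public function stem(string $word): string
    {
        return mb_strtolower($word);
    }
}
```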

Usage

The package follows the available use cases described in the official Scout documentation. Please be aware of the listed limitations though.

How does it work?

The Indexing

The search driver internally uses a single table which contains terms and their associations to documents. When indexing documents (i.e. adding or updating models in the search index), the engine uses the configured tokenizer to split the input of each column into tokens. The tokenizer configured by default simply splits inputs into words consisting of any Unicode letter or number, which means that any other character like ,, ., -, _, !, ?, /, whitespace and all other special characters is considered a token separator and removed by the tokenizer. This way, such characters never end up in the search index itself.

After the inputs have been tokenized, each token (and at this point we actually expect our tokens to be words) is run through the configured stemmer to retrieve the stem (i.e. root word). Performing this action allows us to search for similar words later. The PorterStemmer for example will produce intellig as output for both intelligent as well as intelligence as input. How this helps when searching will be clear in a moment.
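The stemming step can be tried out directly with the underlying wamania/php-stemmer package. The Wamania\Snowball\Stemmer\English class path matches recent releases of that package and may differ in older ones:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Wamania\Snowball\Stemmer\English; // Porter-style English stemmer

$stemmer = new English();

// Both words reduce to the same stem, so indexing one and
// searching for the other still produces a match.
echo $stemmer->stem('intelligent') . PHP_EOL;  // intellig
echo $stemmer->stem('intelligence') . PHP_EOL; // intellig
```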

Finally, the results of this process are stored in the database. The index table is filled with the results of the stemming process and the associations to the indexed models (model type and identifier). On top of that, for each row in the index, the database also stores the number of occurrences of the term in the document. This information is used for scoring in the search part of the engine.

The Search

When executing a search query, the same tokenizing and stemming process as used for indexing is applied to the search query string. The result of this process is a list of stems (or root words) which are then used to perform the actual search. Depending on the configuration of the package, the search will return documents which contain at least one or all of the stems. This is done by calculating a score for each match in the index based on the inverse document frequency (i.e. the ratio between indexed documents and documents containing one of the searched words), the term frequency (i.e. the number of occurrences of a search term within a document) and the term deviation (which is only relevant for the wildcard search). Returned are documents ordered by their score in descending order, until the desired limit is reached.
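To make the scoring idea concrete, here is an illustrative scoring function in plain PHP. It mirrors the general shape described above (an inverse-document-frequency factor multiplied by a term weight built from term frequency and term deviation); it is not the package's exact formula:

```php
<?php

/**
 * Illustrative score for one matching term in one document.
 * Not the package's exact formula, just the general shape:
 * score = IDF * (term frequency weight + term deviation weight).
 */
function scoreTerm(
    int $documentsInIndex, // total number of indexed documents
    int $termOccurrences,  // total occurrences of the term in the index
    int $hitsInDocument,   // occurrences of the term in this document
    int $lengthDeviation   // length difference to the wildcard-matched term
): float {
    $idf       = 1 + log($documentsInIndex / ($termOccurrences + 1));
    $frequency = sqrt($hitsInDocument);
    $deviation = sqrt(1.0 / ($lengthDeviation + 1));

    return $idf * ($frequency + $deviation);
}

// A rare term with several hits in a document scores higher
// than a common term with a single hit.
var_dump(scoreTerm(1000, 5, 3, 0) > scoreTerm(1000, 500, 1, 0)); // bool(true)
```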

Extending the Search Index

It is possible to extend the search index table (scout_index) with custom columns. During indexing, these columns may be filled with custom content, and during searching, queries can be scoped to these columns (exact match). This feature is particularly useful in multi-tenancy applications where the search index is shared by multiple tenants.

Example Migration

In our example, we add a mandatory tenant_id column to the search index.

return new class extends Migration {
    public function up(): void
    {
        Schema::table('scout_index', function (Blueprint $table) {
            $table->uuid('tenant_id');
        });
    }

    public function down(): void
    {
        Schema::table('scout_index', function (Blueprint $table) {
            $table->dropColumn(['tenant_id']);
        });
    }
};

Indexing Example

The tenant_id is added during indexing for each model:

class User extends Model
{
    public function toSearchableArray(): array
    {
        return [
            'id' => $this->id,
            'name' => $this->name,
            'tenant_id' => new StandaloneField($this->tenant_id),
        ];
    }
}

Search Example

The tenant_id is filtered during search based on the $tenantId, which may for example be taken from the HTTP request:

User::search('Max Mustermann')
    ->where('tenant_id', $tenantId)
    ->get();

Limitations

Obviously, this package does not provide a search engine which (even remotely) matches the performance and quality of a professional search engine like Elasticsearch. This solution is meant for small to medium-sized projects in need of a rather simple-to-set-up solution.

Also worth noting, the following Scout features are currently not implemented:

  • Soft Deletes
  • Search custom index using User::search('Mustermann')->within('users_without_admins')
  • Search with custom order using User::search('Musterfrau')->orderBy('age', 'desc')
    • Implementing this feature would be difficult in combination with the scoring algorithm: only the result of the database query could be ordered, which could then lead to issues with pagination.

Known Issues

One issue with this search engine is that concurrent indexing of a single document by multiple queue workers can cause database deadlocks. To circumvent this issue, the number of attempts used for transactions is configurable. By default, each transaction is tried a maximum of three times if a deadlock (or any other error) occurs.
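The corresponding setting might look like this in config/scout-database.php. The transaction_attempts key name is an assumption; only the default of three attempts is documented here:

```php
// config/scout-database.php (excerpt)

return [
    // ...

    // Assumed key name; verify against the published config file.
    // Each transaction is retried on deadlocks up to this many times.
    'transaction_attempts' => 3,
];
```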

License

The MIT License (MIT). Please see License File for more information.

laravel-scout-database's People

Contributors

dependabot-preview[bot], dependabot[bot], marijoo, namoshek


laravel-scout-database's Issues

Question: Would another wildcard setting be an option?

First of all: great package for medium-sized projects needing a search index that goes beyond the default Scout database driver. Thanks!

I wonder what you think about another configurable option that would make the SearchConfiguration tweakable, similar to the lastTokenShouldUseWildcard() setting. I'm thinking about implementing something like this:

if ($this->searchConfiguration->allTokensShouldUseWildcard()) {
    $keywords = array_map(fn($token) => "%{$token}%", $keywords);
}

Do you think this is reasonable? Would you accept a PR for this?

The reason is that I would like to match the following words with the query entgelt, whereas currently only the last two are found:

  • Basisentgelt
  • Serviceentgelt
  • Entgelt A
  • Entgelt B

Index update failures due to deadlocks

When running multiple queue workers to update the index, or when updating the index synchronously from multiple requests concurrently, deadlocks may occur. When this happens, the MakeSearchable job will be retried until all retries failed or one attempt succeeds. Synchronous updates in a request may fail the entire request, depending on the exception handling.

To mitigate this issue, we could simply allow multiple attempts in our transaction handling by passing a second parameter to transaction($callback, $attempts):

$this->connection->transaction(function () use ($models, $preparedStems) {
    // Delete existing entries of the models.
    $this->deleteFromIndex($models);

    // Add the new data to the index.
    foreach ($preparedStems as $preparedStem) {
        // Saving the result to the index.
        $this->saveDataToIndex($preparedStem['model'], $preparedStem['stems']);
    }
});

Ideally, this parameter is configurable. 3 seems like a good default value.

SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax;

I installed this package and use it in the Orchid admin platform: https://orchid.software/ru/docs/global-search/

And I get this error when I use the search:

SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'documents_in_index as (select document_type, COUNT(DISTINCT(document_id)) as' at line 1 (SQL: with documents_in_index as (select document_type, COUNT(DISTINCT(document_id)) as cnt from scout_index where document_type = 'analytics_industry' group by document_type), document_index as (select id, document_id, term, length, num_hits from scout_index where document_type = 'analytics_industry'), matching_terms as ((select '0' as term, 0 as length from document_index where 0 = 1) union (select DISTINCT(term) as term, 1 as length from document_index where term like f%)), term_frequency as (select term, SUM(num_hits) as occurrences from document_index where term in (select term from matching_terms) group by term), matches_with_score as (select di.document_id, di.term, CASE WHEN mt.term IS NOT NULL THEN ((1 + LOG(1 * (CAST((SELECT cnt FROM documents_in_index) as float) / ((CASE WHEN tf.occurrences > 1 THEN tf.occurrences ELSE 1 END) + 1))))* ((1 * SQRT(CAST(di.num_hits as float)))+ (1 * SQRT(1.0 / (ABS(di.length - mt.length) + 1))))) ELSE 0 END as score from document_index as di inner join term_frequency as tf on tf.term = di.term left join matching_terms as mt on mt.term = di.term) select document_id from matches_with_score group by document_id order by SQRT(COUNT(DISTINCT(term))) * SUM(score) DESC, document_id ASC limit 15 offset 0)

What is the problem? How can I fix it? Did I forget something?
