jirutka / smlar

PostgreSQL extension for an effective similarity search || mirror of git://sigaev.ru/smlar.git || see https://www.pgcon.org/2012/schedule/track/Hacking/443.en.html

Home Page: http://sigaev.ru/git/gitweb.cgi?p=smlar.git

smlar's Introduction

float4 smlar(anyarray, anyarray)
	- computes the similarity of two arrays. The arrays must be of the same type.

float4 smlar(anyarray, anyarray, bool useIntersect)
	- computes the similarity of two arrays of composite types. The composite type looks like:
		CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
	  The useIntersect option tells smlar to use only the intersecting elements in the denominator.
	  See the examples in sql/composite_int4.sql or sql/composite_text.sql
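
A minimal sketch of the weighted form, assuming the extension is already installed in the current database. The type name weighted_int and its field names elem and weight are hypothetical; only the shape (an element followed by a FLOAT4 weight) comes from the description above. How the weights enter the result may depend on smlar.type, so no expected output is given.

```sql
-- Hypothetical composite type: element first, FLOAT4 weight second.
CREATE TYPE weighted_int AS (elem int4, weight float4);

-- Compare two arrays of weighted elements.
-- The third argument is useIntersect.
SELECT smlar(
    '{"(1,0.5)","(4,1.0)","(6,0.2)"}'::weighted_int[],
    '{"(5,0.3)","(4,1.0)","(6,0.2)"}'::weighted_int[],
    false
);
```

The files sql/composite_int4.sql and sql/composite_text.sql in the source tree remain the authoritative examples.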

float4 smlar( anyarray a, anyarray b, text formula );
	- computes the similarity of two arrays using the given formula. The arrays
	must be of the same type.
	Predefined variables in the formula:
	  N.i	- number of common elements in both arrays (intersection)
	  N.a   - number of unique elements in the first array
	  N.b   - number of unique elements in the second array
	Example:
	smlar('{1,4,6}'::int[], '{5,4,6}' )
	smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )
	With the default smlar.type (cosine), these calls are equivalent.
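
The calls can be run as ordinary queries. This is a sketch that assumes the extension is installed and smlar.type is left at its default. The third query uses a Jaccard-style formula, which is an illustration built from the predefined variables and does not appear in the original documentation.

```sql
-- Two-argument form: uses the formula selected by smlar.type.
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[]);

-- The same measure written out. Here N.i = 2, N.a = 3, N.b = 3,
-- so the expression evaluates to 2 / sqrt(3 * 3), about 0.667.
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[], 'N.i / sqrt(N.a * N.b)');

-- Assumed alternative: a Jaccard-style ratio from the same variables,
-- 2 / (3 + 3 - 2) = 0.5 for these arrays.
SELECT smlar('{1,4,6}'::int[], '{5,4,6}'::int[], 'N.i / (N.a + N.b - N.i)');
```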

anyarray % anyarray
	- returns true if the similarity of the arrays is greater than the limit

float4 show_smlar_limit()  - deprecated
	- shows the limit for % operation

float4 set_smlar_limit(float4) - deprecated
	- sets the limit for % operation

Instead of show_smlar_limit/set_smlar_limit, use the GUC variable
smlar.threshold (see below).
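
A short sketch of the % operator driven by smlar.threshold instead of the deprecated functions. It assumes the extension is installed; the value 0.6 is only an example.

```sql
-- Set the limit for the current session.
SET smlar.threshold = 0.6;

-- Inspect the current limit.
SHOW smlar.threshold;

-- True when the similarity of the two arrays exceeds the limit.
SELECT '{1,4,6}'::int[] % '{5,4,6}'::int[] AS is_similar;
```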


text[] tsvector2textarray(tsvector)
	- transforms tsvector type to text array

anyarray array_unique(anyarray)
	- sorts the array and removes duplicate elements

float4 inarray(anyarray, anyelement)
	- returns zero if the second argument is not present in the first one,
	  and 1.0 otherwise

float4 inarray(anyarray, anyelement, float4, float4)
	- returns the fourth argument if the second argument is not present in
	  the first one, and the third argument otherwise
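
The helper functions above can be tried directly. This sketch assumes the extension is installed. The literal values are arbitrary, and the use of to_tsvector with the 'english' configuration is only one way to obtain a tsvector.

```sql
-- Sort the array and drop duplicates.
SELECT array_unique('{3,1,3,2,1}'::int[]);

-- 1.0 if the element is present, 0 otherwise.
SELECT inarray('{1,4,6}'::int[], 4);

-- Custom return values: the third argument when the element is present,
-- the fourth when it is not. 5 is absent here, so the fourth applies.
SELECT inarray('{1,4,6}'::int[], 5, 2.0::float4, 0.5::float4);

-- Turn a tsvector into a text array.
SELECT tsvector2textarray(to_tsvector('english', 'similar arrays are similar'));
```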

GUC configuration variables:

smlar.threshold  FLOAT
	Arrays with similarity lower than the threshold are not considered
	similar by the % operation

smlar.persistent_cache BOOL
	The cache of global statistics is stored in transaction-independent memory

smlar.type  STRING
	Type of similarity formula: cosine(default), tfidf, overlap

smlar.stattable	STRING
	Name of the table that stores set-wide statistics. The table should be
	defined as
	CREATE TABLE table_name (
		value   data_type UNIQUE,
		ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
	);
	The row with a NULL value holds the total number of documents.
	See the examples in sql/*g.sql files
	Note: used only for smlar.type = 'tfidf'
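
One possible way to build such a statistics table is sketched here. The tables docs and docs_stat and their columns are hypothetical. The sketch also assumes that docs is non-empty (otherwise the CHECK constraint rejects the total row) and that both GUC variables can be set at session level.

```sql
-- Hypothetical source table: one row per document.
CREATE TABLE docs (id serial PRIMARY KEY, terms text[]);

-- Statistics table in the shape described above.
CREATE TABLE docs_stat (
    value text UNIQUE,
    ndoc  int4 NOT NULL CHECK (ndoc > 0)
);

-- Number of documents containing each term.
INSERT INTO docs_stat (value, ndoc)
SELECT term, count(DISTINCT d.id)::int4
FROM docs AS d, unnest(d.terms) AS term
GROUP BY term;

-- The NULL row holds the total number of documents.
INSERT INTO docs_stat (value, ndoc)
SELECT NULL::text, count(*)::int4 FROM docs;

-- Point smlar at the table and switch to TF-IDF.
SET smlar.type = 'tfidf';
SET smlar.stattable = 'docs_stat';
```

The statistics are not maintained automatically in this sketch; they would need to be refreshed when docs changes.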

smlar.tf_method STRING
	Calculation method for term frequency. Values:
		"n"     - simple counting of entries (default)
		"log"   - 1 + log(n)
		"const" - TF is equal to 1
	Note: used only for smlar.type = 'tfidf'

smlar.idf_plus_one BOOL
	If false (default), calculate idf as log(d/df),
	if true - as log(1+d/df)
	Note: used only for smlar.type = 'tfidf'
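
The TF-IDF options can be combined per session. This is a sketch that assumes the extension is installed and that these variables may be changed with SET; the chosen values are only an illustration.

```sql
-- Use TF-IDF similarity.
SET smlar.type = 'tfidf';

-- Term frequency as 1 + log(n) instead of a plain count.
SET smlar.tf_method = 'log';

-- Inverse document frequency as log(1 + d/df) instead of log(d/df).
SET smlar.idf_plus_one = true;

-- Check what is in effect.
SHOW smlar.tf_method;
SHOW smlar.idf_plus_one;
```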

The module provides several GUC variables (smlar.threshold and the others
listed above). It is highly recommended to set them in postgresql.conf:
custom_variable_classes = 'smlar'       # list of custom variable class names; only needed before PostgreSQL 9.2
smlar.threshold = 0.6  # or any other value > 0 and < 1
and other smlar.* variables

GiST/GIN support for the % and && operations:
  Array Type   |  GIN operator class  | GiST operator class  
---------------+----------------------+----------------------
 bit[]         | _bit_sml_ops         | 
 bytea[]       | _bytea_sml_ops       | _bytea_sml_ops
 char[]        | _char_sml_ops        | _char_sml_ops
 cidr[]        | _cidr_sml_ops        | _cidr_sml_ops
 date[]        | _date_sml_ops        | _date_sml_ops
 float4[]      | _float4_sml_ops      | _float4_sml_ops
 float8[]      | _float8_sml_ops      | _float8_sml_ops
 inet[]        | _inet_sml_ops        | _inet_sml_ops
 int2[]        | _int2_sml_ops        | _int2_sml_ops
 int4[]        | _int4_sml_ops        | _int4_sml_ops
 int8[]        | _int8_sml_ops        | _int8_sml_ops
 interval[]    | _interval_sml_ops    | _interval_sml_ops
 macaddr[]     | _macaddr_sml_ops     | _macaddr_sml_ops
 money[]       | _money_sml_ops       | 
 numeric[]     | _numeric_sml_ops     | _numeric_sml_ops
 oid[]         | _oid_sml_ops         | _oid_sml_ops
 text[]        | _text_sml_ops        | _text_sml_ops
 time[]        | _time_sml_ops        | _time_sml_ops
 timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops
 timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops
 timetz[]      | _timetz_sml_ops      | _timetz_sml_ops
 varbit[]      | _varbit_sml_ops      | 
 varchar[]     | _varchar_sml_ops     | _varchar_sml_ops
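
An operator class from the table is named when creating the index. The following sketch uses a hypothetical table items with an int4[] column; the table, column and index names are assumptions, while _int4_sml_ops comes from the table above.

```sql
-- Hypothetical table with an array column.
CREATE TABLE items (id serial PRIMARY KEY, tags int4[]);

-- GIN index using the smlar operator class for int4[].
CREATE INDEX items_tags_gin ON items USING gin (tags _int4_sml_ops);

-- Alternatively, a GiST index with the operator class of the same name.
CREATE INDEX items_tags_gist ON items USING gist (tags _int4_sml_ops);

-- Similarity search that such an index can support.
SET smlar.threshold = 0.6;
SELECT id, smlar(tags, '{1,4,6}'::int4[]) AS sml
FROM items
WHERE tags % '{1,4,6}'::int4[]
ORDER BY sml DESC
LIMIT 10;
```

In practice one of the two index types is usually enough; both are shown only to illustrate the syntax.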

smlar's People

Contributors

feodor
