patrickfrey / strusutilities

A set of command line programs to access the strus information retrieval engine

Home Page: http://www.project-strus.net

License: Mozilla Public License 2.0

CMake 8.92% C++ 90.78% xBase 0.13% Perl 0.17%

strusutilities's Issues

all tests fail in-source

TestAnalyzeSubSegmenter1
 1/20 Test  #1: TestAnalyzeSubSegmenter1 .........***Failed    0.06 sec
...
20/20 Test #20: TestUpdateCalcStats ..............***Failed    0.00 sec

for instance:

20: error executing test: error reading program file '/data/work/strus/strusUtilities/tests/scripts/testUpdateCalcStats1/RUN': No such file or directory

Grammar of Analyzer does not report all errors

A missing semicolon can have severe consequences. The instruction that follows is then treated as part of the selection expression and therefore ignored, but the syntax parser for the selection expression does not report any error (a textwolf problem).

analyzer programs can't have empty sections

[Aggregator]
#    doclen = count( ngram );

results in:

ERROR failed to load analyzer program document.ana: error in document analyzer program at line 32 column 1: feature type name (identifier) expected at start of a feature declaration

tests don't run on OSX

The following tests FAILED:
	  1 - TestAnalyzeSubSegmenter1 (Failed)
	  2 - TestAnalyzeSubSegmenter2 (Failed)
	  3 - TestAnalyzeSubSegmenter3 (Failed)
	  4 - TestAnalyzeBase1 (Failed)
	  5 - TestAnalyzeJson1 (Failed)
	  6 - TestAnalyzeJson2 (Failed)
	  7 - TestAnalyzeConfig1 (Failed)
	  8 - TestAnalyzeConfig2 (Failed)
	  9 - TestAnalyzeWithDocType1 (Failed)
	 10 - TestInsertWithDocType1 (Failed)
	 11 - TestInsertBase1 (Failed)
	 12 - TestSimpleQuery1 (Failed)
	 13 - TestSummarization1 (Failed)

12: Test command: /Users/administrator/strus/strusUtilities/tests/scripts/runTest.sh "testSimpleQuery1"
12: Test timeout computed to be: 9.99988e+06
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 20: /Users/administrator/strus/strusUtilities/src/strusCreate/strusCreate: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 44: /Users/administrator/strus/strusUtilities/src/strusInsert/strusInsert: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 26: /Users/administrator/strus/strusUtilities/src/strusDestroy/strusDestroy: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: DIFF AT 25

find . -name strusQuery -perm 0755 -type f
./src/strusQuery/Release/strusQuery

Windows and Xcode builds place the binaries not in the current directory but in a Release or Debug
subdirectory, depending on the CMake build configuration.

Registered segmenter cjson is suddenly unknown in text processor

Test project /usr/home/abaumann/strusUtilities
      Start  1: TestPatternMatch1
 1/22 Test  #1: TestPatternMatch1 ................***Failed    0.62 sec
      Start  2: TestPatternMatch2
 2/22 Test  #2: TestPatternMatch2 ................***Failed    0.53 sec
      Start  3: TestPatternMatch3
 3/22 Test  #3: TestPatternMatch3 ................***Failed    0.54 sec
      Start  4: TestPatternMatch4
 4/22 Test  #4: TestPatternMatch4 ................***Failed    0.53 sec
      Start  5: TestAnalyzeSubSegmenter1
 5/22 Test  #5: TestAnalyzeSubSegmenter1 .........***Failed    0.18 sec
      Start  6: TestAnalyzeSubSegmenter2
 6/22 Test  #6: TestAnalyzeSubSegmenter2 .........***Failed    0.18 sec
      Start  7: TestAnalyzeSubSegmenter3
 7/22 Test  #7: TestAnalyzeSubSegmenter3 .........***Failed    0.16 sec
      Start  8: TestAnalyzeBase1
 8/22 Test  #8: TestAnalyzeBase1 .................   Passed    0.47 sec
      Start  9: TestAnalyzeJson1
 9/22 Test  #9: TestAnalyzeJson1 .................***Failed    0.16 sec
      Start 10: TestAnalyzeJson2
10/22 Test #10: TestAnalyzeJson2 .................***Failed    0.16 sec
      Start 11: TestAnalyzeTsv1
11/22 Test #11: TestAnalyzeTsv1 ..................   Passed    0.17 sec
      Start 12: TestAnalyzeConfig1
12/22 Test #12: TestAnalyzeConfig1 ...............   Passed    0.17 sec
      Start 13: TestAnalyzeConfig2
13/22 Test #13: TestAnalyzeConfig2 ...............   Passed    0.17 sec
      Start 14: TestAnalyzeWithDocType1
14/22 Test #14: TestAnalyzeWithDocType1 ..........***Failed    0.26 sec
      Start 15: TestAnalyzeBindPos1
15/22 Test #15: TestAnalyzeBindPos1 ..............   Passed    0.46 sec
      Start 16: TestInsertWithDocType1
16/22 Test #16: TestInsertWithDocType1 ...........***Failed    2.69 sec
      Start 17: TestInsertBase1
17/22 Test #17: TestInsertBase1 ..................***Failed    2.35 sec
      Start 18: TestSimpleQuery1
18/22 Test #18: TestSimpleQuery1 .................***Failed    1.45 sec
      Start 19: TestQueryWithRestriction1
19/22 Test #19: TestQueryWithRestriction1 ........   Passed    1.06 sec
      Start 20: TestQueryWithFormula1
20/22 Test #20: TestQueryWithFormula1 ............   Passed    0.72 sec
      Start 21: TestSummarization1
21/22 Test #21: TestSummarization1 ...............***Failed    0.71 sec
      Start 22: TestUpdateCalcStats
22/22 Test #22: TestUpdateCalcStats ..............   Passed    2.31 sec

36% tests passed, 14 tests failed out of 22

tools don't show consistent usage of LevelDB parameters

shell> strusCreate -h:

shows:

-s|--storage <CONFIG>
    Define the storage configuration string as <CONFIG>
    <CONFIG> is a semicolon ';' separated list of assignments:
            path=<LevelDB storage path>;compression=<yes/no>
            acl=<yes/no, yes if users with different access rights exist>
            metadata=<comma separated list of meta data def>

shell> strusInspect -h

shows

-T|--trace <CONFIG>
    Print method call traces configured with <CONFIG>
    <CONFIG> is a semicolon ';' separated list of assignments:
            path=<LevelDB storage path>
            create=<yes/no, yes=do create if database does not exist yet>
            cache=<size of LRU cache for LevelDB>
            compression=<yes/no>
            max_open_files=<maximum number of open files for LevelDB>
            write_buffer_size=<Amount of data to build up in memory per file>
            block_size=<approximate size of user data packed per block>
            cachedterms=<file with list of terms to cache>

In strusCreate I would expect to find all details of the LevelDB parameters, as
I will most likely use them only there.

strusInspect shows me the LevelDB parameters under the tracing configuration, while
the tracing configuration itself is missing.

Built-In function description is outdated

http://www.project-strus.net/builtin_functions.htm

some tests failing on 32-bit Intel and ARMv6

[  178s] 10/13 Test #10: TestInsertWithDocType1 ...........***Failed    0.12 sec
[  178s]       Start 11: TestInsertBase1
[  178s] 11/13 Test #11: TestInsertBase1 ..................***Failed    0.19 sec
[  178s]       Start 12: TestSimpleQuery1

Why is there a metadata reader opened in strusInsert?

void InsertProcessor::run()
{
...
        std::auto_ptr<strus::MetaDataReaderInterface> metadata( 
            m_storage->createMetaDataReader());
        if (!metadata.get()) throw strus::runtime_error(_TXT("error creating meta data reader"));

Reading metadata while inserting documents does not seem to be something we need.

strusPattern: get all matches in a rule

How can I get all matches in a rule? It seems that the first match appearing in the document wins and
gets assigned to the rule variables; the others are lost. This is a problem with any and within, and
with sequence when the distance is big enough for multiple matches to fit.

sequence_imm and a hypothetical within_imm

As I understand it, sequence_imm is logically the same as sequence with a distance parameter of 1.
So either a similar variant for within is missing, or both can be omitted and handled by the internal
optimizer.

strange output

What does '1' at the end of each attribute mean?

[1] 1067 score 0.412662
        docid = '119119' 1
        author = 'Twain, Mark, 1835-1910' 1
        title = 'A Tramp Abroad' 1
        language = 'English' 1
        loc_categories = 'PS: Language and Literatures: American and Canadian literature' 1
        subject = 'Humorous storiesEurope -- FictionWalking -- FictionAmericans -- Europe -- Fiction' 1
        copyright = 'Not copyrighted in the United States.' 1
        release_date = '2004-06-19' 1
        txt_file = 'data/1/1/119/119_8.zip' 1

query.eva is:

SELECT selfeat;
EVAL bm25( k1=0.75, b=2.1, avgdoclen=1, .match=docfeat);

SUMMARIZE attribute( name=docid );
SUMMARIZE attribute( name=author );
SUMMARIZE attribute( name=title );
SUMMARIZE attribute( name=language );
SUMMARIZE attribute( name=loc_categories );
SUMMARIZE attribute( name=subject );
SUMMARIZE attribute( name=copyright );
SUMMARIZE attribute( name=release_date );
SUMMARIZE attribute( name=txt_file );

strusDumpStatistics just core dumps

strusDumpStatistics -s 'path=storage' file

just segfaults.

gdb shows:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000404a5d in main (argc=<optimized out>, argv=<optimized out>)
    at /home/abaumann/strusUtilities/src/strusDumpStatistics/strusDumpStatistics.cpp:219
219                     while (statsqueue->getNext( msg, msgsize))

msg is ""
msgsize = 140180131644049

Looking at the code, msgsize appears to be uninitialized.
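
A defensive version of such a loop could give both out-parameters a defined value and check the queue pointer before iterating. The sketch below assumes a simplified queue; `MessageQueue` and `drain` are hypothetical stand-ins that only mirror the shape of the `getNext( msg, msgsize)` call in the backtrace, not the actual strus interface.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a statistics message iterator; not the strus API.
class MessageQueue
{
public:
    explicit MessageQueue(std::vector<std::string> msgs)
        : m_msgs(std::move(msgs)), m_pos(0) {}

    // Returns false and leaves the out-parameters untouched when exhausted.
    bool getNext(const char*& msg, std::size_t& msgsize)
    {
        if (m_pos >= m_msgs.size()) return false;
        msg = m_msgs[m_pos].data();
        msgsize = m_msgs[m_pos].size();
        ++m_pos;
        return true;
    }
private:
    std::vector<std::string> m_msgs;
    std::size_t m_pos;
};

// Sums up the size of all messages; a real tool would write them to a file.
std::size_t drain(MessageQueue* queue)
{
    if (!queue) throw std::runtime_error("failed to create statistics queue");
    const char* msg = nullptr;   // defined value even if getNext never writes
    std::size_t msgsize = 0;
    std::size_t total = 0;
    while (queue->getNext(msg, msgsize))
    {
        total += msgsize;
    }
    return total;
}
```

Whether the crash in the report comes from the uninitialized size or from a null queue pointer cannot be told from the backtrace alone; the sketch guards against both.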

Option -S for strusUpdateStorageCalcStatistics and strusInspect

strusPatternMatcher: defining two matching strings in rules

I have the following:

P1 = /xxx/;
P2 = /yyy/;

Then two rules:

R1 = any( p1 = P1 "abc" );
R2 = any( p2 = P2 "cde" );

Each rule R1, R2 works individually; enabling both leads to

symbol defined twice "cde"

improve strusAlterMetaData program argument format

Passing commands as one string is not intuitive.
The format should be similar to strusInspect.

strusPattern: comments are possible everywhere but at the very beginning

strusPattern: matches not ordered logically

Matches in Rule:

R = sequence( first = WORD, second = WORD );

get returned as

second: ... first

The natural order would be the order of the sequence, not the reverse.

strusPatternMatcher plain text segmenter error

Matching one file I get an error message:

error thread 1 file 'test/test1.txt': error matching rules, only empty expressions allowed for
'plain' segmenter, got '//()' for 1

I'm calling strusPatternMatcher with -C text/plain -F -t 1 -k modstrus_analyzer_pattern -p somerules.rul filelist.test.

This happens no matter what the document or the rule file contain.

strusInsert tries to detect the document type also if an explicit segmenter is given

shell>strusInsert -V -g tsv -s 'path=storage' ~/phone.ana ~/phone.tsv
DEBUG: adding selector expression: 1, lineno
DEBUG: definition multimap contains: [lineno, 1], 
DEBUG: adding selector expression: 2, lastname
DEBUG: definition multimap contains: [lastname, 2], [lineno, 1], 
DEBUG: adding selector expression: 3, firstname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lineno, 1], 
DEBUG: adding selector expression: 4, lastname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lastname, 4], [lineno, 1], 
DEBUG: adding selector expression: 5, firstname
DEBUG: definition multimap contains: [firstname, 3], [firstname, 5], [lastname, 2], [lastname, 4], [lineno, 1], 
DEBUG: selector subsection: 16777217, 16777216, line
failed to detect document class of file '~/phone.tsv'

done

The TSV segmenter gets initialized correctly with the analyzer configuration.

strusAnalyze shows nothing unless at least SearchIndex or ForwardIndex is declared

The following strus.ana configuration prints no output:

  [Attribute]
        docid = orig content /docs/doc/docid/();

  #[ForwardIndex]
  #    word = orig split /docs/doc/content/();

  #[SearchIndex]
  #       word = lc split /docs/doc/content/();

  [Document]
        doc = /docs/doc;

As soon as section ForwardIndex or SearchIndex is enabled, I also see
attribute and metadata output, otherwise not.

XML test examples should pass xmllint

xmllint ./tests/scripts/testSummarization1/data/input.xml 
./tests/scripts/testSummarization1/data/input.xml:1: parser error : standalone accepts only 'yes' or 'no'
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : String not closed expecting " or '
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : parsing XML declaration: '?>' expected
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^

Documentation of the rule language in strusPatternMatcher

Maybe I'm just blind, but I don't see any documentation of the language in the *.rul file.

strusPattern: sometimes corrupt match output

WORD ^ 1 : /\b\p{L}+\b/;
NUM ^ 1 : /\b[0-9]{1,12}\b/;
ORG ^ 2 : /\b(XXXX)\b/;

11 [61] : 4 ORG XXXX1
12 [66] : 2 NUMBER 41

the text contained:

He worked in department XXXX 41 from May 12th to...

strusPattern: constants would be nice

WINDOW_SIZE := 1000;

R = within( A, B | WINDOW_SIZE );

strusPattern: missing an AND operator

something like A AND B must appear in the document, no matter at which position.
I can simulate that with a within( A, B | N) with a big N. :-)

Problems with detecting word boundaries (\b) and UTF-8

Some word boundaries are not detected correctly in UTF-8 input.

strusPattern: %LEXER CASELESS and rule tokens

%LEXER CASELESS;

W = /aBc/;

R1 = any( W "aBc" );
R2 = any( W "abc" );

R1 matches, R2 does not. Also, 'ABC' in the text matches neither R1 nor R2.
I suspect that CASELESS affects only the Hyperscan option for case-insensitive matching.
IMHO it should also affect the tokens.
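
The requested behaviour could be sketched as a symbol table that folds case on both definition and lookup when the caseless option is set. This is an illustration only: `SymbolTable` and `foldCase` are hypothetical names, the folding is ASCII-only, and nothing here reflects how the pattern matcher actually stores rule tokens.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// ASCII-only lowercasing; real UTF-8 input would need Unicode case folding.
std::string foldCase(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical symbol table for rule tokens such as "aBc".
class SymbolTable
{
public:
    explicit SymbolTable(bool caseless) : m_caseless(caseless) {}

    void define(const std::string& symbol, int id)
    {
        m_ids[key(symbol)] = id;
    }
    // Returns the symbol id, or 0 when the lexeme text matches no symbol.
    int lookup(const std::string& lexemeText) const
    {
        auto it = m_ids.find(key(lexemeText));
        return it == m_ids.end() ? 0 : it->second;
    }
private:
    // Apply the same normalization on both sides so that they always agree.
    std::string key(const std::string& s) const
    {
        return m_caseless ? foldCase(s) : s;
    }
    bool m_caseless;
    std::map<std::string, int> m_ids;
};
```

With `caseless` set, a token defined as "aBc" would be found for "abc" and "ABC" alike; without it, only the exact spelling matches.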

testPatternMatch1 .. testPatternMatch4 don't work in out-of-source-builds

Presumably they assume the test data is in the current build directory.

strusInspect

Missing a command to list all feature types in the index (something like attrnames for attributes and metatable for metadata)

Error messages in strusAnalyze/strusInsert lack a line number of the error

ERROR error in analyze document: error defining expression for 'textwolf' segmenter: error in selection expression 'author' at start of expression

unfriendly error message in strusInspect when getting unknown metadata

strusInspect -s 'path=storage/xxx' metadata THIS_META_FIELD_DOES_NOT_EXIST

results in:

EXCEPTION array bound read in function get
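
A friendlier message could be produced by validating the name before the element is accessed. The following sketch assumes the metadata element names are available as a name-to-index map; `metaDataHandle` and the map layout are hypothetical, since the report does not show the real metadata table.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical lookup: resolve a metadata element name to its column index
// and fail with a message that names the offending element.
std::size_t metaDataHandle(const std::map<std::string, std::size_t>& table,
                           const std::string& name)
{
    auto it = table.find(name);
    if (it == table.end())
    {
        throw std::runtime_error("unknown meta data element '" + name + "'");
    }
    return it->second;
}
```

Reporting the name the user typed makes a misspelled field obvious, which the generic array bound message does not.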

Standard document type detection fails on TSV files with big elements

Programs doing document type detection
(strusAnalyze, strusInsert, strusCheckInsert, strusGenerateKeyMap, strusSegment)
fail to detect TSV files if the first two lines of the file (header + first data line) do not fit into 4K.

The reason is that these programs use only the first 4K of the document to detect the document type.

Possible fix: Retry with a bigger size, if the document type detection fails. The standard document type detection must also be fixed. It currently returns "text/plain" in this case.
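
The retry idea could be sketched as follows. It assumes a detector that returns an empty string when it cannot decide (which, as noted above, the standard detection currently does not do); `Detector`, `detectWithRetry`, and the size limits are hypothetical and chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Assumed detector signature: returns a MIME type, or "" if undecided.
using Detector = std::function<std::string(const char* buf, std::size_t size)>;

// Runs the detector on a growing prefix of the content until it decides,
// the whole content has been seen, or the prefix reaches maxSize.
std::string detectWithRetry(const std::string& content, const Detector& detect,
                            std::size_t initialSize = 4096,
                            std::size_t maxSize = 1 << 20)
{
    std::size_t size = std::max<std::size_t>(initialSize, 1);
    for (;;)
    {
        std::size_t n = std::min(size, content.size());
        std::string type = detect(content.data(), n);
        if (!type.empty()) return type;
        // Give up once everything was inspected or the limit was reached.
        if (n == content.size() || size >= maxSize) return std::string();
        size = std::min(size * 2, maxSize);
    }
}
```

Doubling the prefix keeps the common case at a single 4K probe and bounds the extra work for documents with very long first lines.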

strusQuery segfault

#0  0x00007f56a85b0477 in buildQueryTree (fields=std::vector of length 2, capacity 2 = {...}, qry=..., 
    groups=std::vector of length 2, capacity 2 = {...})
    at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:158
#1  strus::QueryAnalyzerContext::analyze (this=0x23e9fb0)
    at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:382
#2  0x00007f56ac3c7ec4 in strus::QueryStruct::translate (this=this@entry=0x7ffc618fcf60, query=..., 
    queryproc=queryproc@entry=0x1a55bb0, errorhnd=errorhnd@entry=0x1a54460)
    at /home/abaumann/strus/strusUtilities/src/program/queryStruct.cpp:99
#3  0x00007f56ac3d45a3 in strus::loadQuery (query=..., analyzer=analyzer@entry=0x1a5ae80, 
    queryproc=queryproc@entry=0x1a55bb0, source="Mark Twain\n", qdescr=..., 
    errorhnd=errorhnd@entry=0x1a54460)
    at /home/abaumann/strus/strusUtilities/src/program/programLoader.cpp:1825
#4  0x000000000040781f in main (argc_=<optimized out>, argv_=<optimized out>)
    at /home/abaumann/strus/strusUtilities/src/strusQuery/strusQuery.cpp:401

query.ana:

word = lc word word;

indexed with:

[SearchIndex]
	word = lc word author;
	word = lc word word;

query.qln contains:

Mark Twain

TestAnalyzeBindPos1 fails

      Start 11: TestAnalyzeBindPos1
11/18 Test #11: TestAnalyzeBindPos1 ..............***Failed    2.44 sec

shell> ctest --verbose -R TestAnalyzeBindPos1
UpdateCTestConfiguration  from :/data/strusUtilities/DartConfiguration.tcl
UpdateCTestConfiguration  from :/data/strusUtilities/DartConfiguration.tcl
Test project /data/strusUtilities
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 11
    Start 11: TestAnalyzeBindPos1

11: Test command: /data/strusUtilities/tests/scripts/runTest.sh "testAnalyzeBindPos1"
11: Test timeout computed to be: 9.99988e+06
11: /data/strusUtilities/tests/scripts
11: ERROR error in analyze document: error adding search index feature: error defining feature: 'illegal definition of a feature that has a tokenizer processing the content concatenated with positions bound to other features'
11: /data/strusUtilities/tests/scripts
11: DIFF AT 12924
1/1 Test #11: TestAnalyzeBindPos1 ..............***Failed    1.71 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   1.85 sec

The following tests FAILED:
         11 - TestAnalyzeBindPos1 (Failed)
Errors while running CTest

test_results_armv6.zip

pattern matcher tests are calling shell scripts

This will be a problem in the future if somebody wants to port to Windows.

dump multiple elements in one strusInspect call

For instance:

/strus/bin/strusInspect -s 'path=/opt/eurospider/strus/storage/xxx' attribute docid guid | less

Would be handy. Current workaround: dump docid and guid separately and then do a join.

strusPatternMatcher: unclear pattern matching in rules

Assuming I have a regex with two catching sub-groups:

P : /([0-9]+)-(0-9)/;

R = any( p = P "14-22" );

returns the error message "symbol defined twice '14-22'".

So is the value given in the rule matched against the whole match, or are there N
strings, one for each subgroup? And how can the two be differentiated?
