patrickfrey / strusutilities

A set of command line programs to access the strus information retrieval engine

Home Page: http://www.project-strus.net

License: Mozilla Public License 2.0

CMake 8.92% C++ 90.78% xBase 0.13% Perl 0.17%

strusutilities's Issues

all tests fail in-source

TestAnalyzeSubSegmenter1
 1/20 Test  #1: TestAnalyzeSubSegmenter1 .........***Failed    0.06 sec
...
20/20 Test #20: TestUpdateCalcStats ..............***Failed    0.00 sec

for instance:

20: error executing test: error reading program file '/data/work/strus/strusUtilities/tests/scripts/testUpdateCalcStats1/RUN': No such file or directory

Grammar of Analyzer does not report all errors

A missing semicolon can have serious consequences: the following instruction is parsed as part of the selection expression and is therefore ignored, yet the parsing of the selection expression does not report any error (a textwolf problem).

analyzer programs can't have empty sections

[Aggregator]
#    doclen = count( ngram );

results in:

ERROR failed to load analyzer program document.ana: error in document analyzer program at line 32 column 1: feature type name (identifier) expected at start of a feature declaration

tests don't run on OSX

The following tests FAILED:
	  1 - TestAnalyzeSubSegmenter1 (Failed)
	  2 - TestAnalyzeSubSegmenter2 (Failed)
	  3 - TestAnalyzeSubSegmenter3 (Failed)
	  4 - TestAnalyzeBase1 (Failed)
	  5 - TestAnalyzeJson1 (Failed)
	  6 - TestAnalyzeJson2 (Failed)
	  7 - TestAnalyzeConfig1 (Failed)
	  8 - TestAnalyzeConfig2 (Failed)
	  9 - TestAnalyzeWithDocType1 (Failed)
	 10 - TestInsertWithDocType1 (Failed)
	 11 - TestInsertBase1 (Failed)
	 12 - TestSimpleQuery1 (Failed)
	 13 - TestSummarization1 (Failed)

12: Test command: /Users/administrator/strus/strusUtilities/tests/scripts/runTest.sh "testSimpleQuery1"
12: Test timeout computed to be: 9.99988e+06
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 20: /Users/administrator/strus/strusUtilities/src/strusCreate/strusCreate: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 44: /Users/administrator/strus/strusUtilities/src/strusInsert/strusInsert: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 26: /Users/administrator/strus/strusUtilities/src/strusDestroy/strusDestroy: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: DIFF AT 25

find . -name strusQuery -perm 0755 -type f
./src/strusQuery/Release/strusQuery

Windows and Xcode builds do not place binaries in the current directory, but in a Release or
Debug subdirectory, depending on the CMake build configuration.
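
One way a test helper could cope with both layouts is to probe the flat location first and then the configuration subdirectories. The following is only a sketch: the function names `candidateBinaryPaths` and `findBinary` are hypothetical and not part of strusUtilities, the directory names `Release` and `Debug` are taken from the report above, and a Windows build would additionally need the `.exe` suffix.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// List the places where a built program may end up below its source directory.
// (Hypothetical helper, for illustration only.)
std::vector<std::string> candidateBinaryPaths(const std::string& srcDir, const std::string& name)
{
    assert(!name.empty());
    std::vector<std::string> rt;
    rt.push_back(srcDir + "/" + name);          // single-configuration generators (e.g. Makefiles)
    rt.push_back(srcDir + "/Release/" + name);  // multi-configuration generators
    rt.push_back(srcDir + "/Debug/" + name);
    return rt;
}

// Return the first candidate that can be opened for reading, or an empty string.
std::string findBinary(const std::string& srcDir, const std::string& name)
{
    const std::vector<std::string> candidates = candidateBinaryPaths(srcDir, name);
    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        std::ifstream probe(candidates[i].c_str(), std::ios::binary);
        if (probe.good()) return candidates[i];
    }
    return std::string();
}
```

With the layout shown by the `find` output above, `findBinary("./src/strusQuery", "strusQuery")` would return the path inside `Release`.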

Registered segmenter cjson is suddenly unknown in text processor

Test project /usr/home/abaumann/strusUtilities
      Start  1: TestPatternMatch1
 1/22 Test  #1: TestPatternMatch1 ................***Failed    0.62 sec
      Start  2: TestPatternMatch2
 2/22 Test  #2: TestPatternMatch2 ................***Failed    0.53 sec
      Start  3: TestPatternMatch3
 3/22 Test  #3: TestPatternMatch3 ................***Failed    0.54 sec
      Start  4: TestPatternMatch4
 4/22 Test  #4: TestPatternMatch4 ................***Failed    0.53 sec
      Start  5: TestAnalyzeSubSegmenter1
 5/22 Test  #5: TestAnalyzeSubSegmenter1 .........***Failed    0.18 sec
      Start  6: TestAnalyzeSubSegmenter2
 6/22 Test  #6: TestAnalyzeSubSegmenter2 .........***Failed    0.18 sec
      Start  7: TestAnalyzeSubSegmenter3
 7/22 Test  #7: TestAnalyzeSubSegmenter3 .........***Failed    0.16 sec
      Start  8: TestAnalyzeBase1
 8/22 Test  #8: TestAnalyzeBase1 .................   Passed    0.47 sec
      Start  9: TestAnalyzeJson1
 9/22 Test  #9: TestAnalyzeJson1 .................***Failed    0.16 sec
      Start 10: TestAnalyzeJson2
10/22 Test #10: TestAnalyzeJson2 .................***Failed    0.16 sec
      Start 11: TestAnalyzeTsv1
11/22 Test #11: TestAnalyzeTsv1 ..................   Passed    0.17 sec
      Start 12: TestAnalyzeConfig1
12/22 Test #12: TestAnalyzeConfig1 ...............   Passed    0.17 sec
      Start 13: TestAnalyzeConfig2
13/22 Test #13: TestAnalyzeConfig2 ...............   Passed    0.17 sec
      Start 14: TestAnalyzeWithDocType1
14/22 Test #14: TestAnalyzeWithDocType1 ..........***Failed    0.26 sec
      Start 15: TestAnalyzeBindPos1
15/22 Test #15: TestAnalyzeBindPos1 ..............   Passed    0.46 sec
      Start 16: TestInsertWithDocType1
16/22 Test #16: TestInsertWithDocType1 ...........***Failed    2.69 sec
      Start 17: TestInsertBase1
17/22 Test #17: TestInsertBase1 ..................***Failed    2.35 sec
      Start 18: TestSimpleQuery1
18/22 Test #18: TestSimpleQuery1 .................***Failed    1.45 sec
      Start 19: TestQueryWithRestriction1
19/22 Test #19: TestQueryWithRestriction1 ........   Passed    1.06 sec
      Start 20: TestQueryWithFormula1
20/22 Test #20: TestQueryWithFormula1 ............   Passed    0.72 sec
      Start 21: TestSummarization1
21/22 Test #21: TestSummarization1 ...............***Failed    0.71 sec
      Start 22: TestUpdateCalcStats
22/22 Test #22: TestUpdateCalcStats ..............   Passed    2.31 sec

36% tests passed, 14 tests failed out of 22

tools don't show consistent usage of LevelDB parameters

shell> strusCreate -h:

shows:

-s|--storage <CONFIG>
    Define the storage configuration string as <CONFIG>
    <CONFIG> is a semicolon ';' separated list of assignments:
            path=<LevelDB storage path>;compression=<yes/no>
            acl=<yes/no, yes if users with different access rights exist>
            metadata=<comma separated list of meta data def>
shell> strusInspect -h

shows

-T|--trace <CONFIG>
    Print method call traces configured with <CONFIG>
    <CONFIG> is a semicolon ';' separated list of assignments:
            path=<LevelDB storage path>
            create=<yes/no, yes=do create if database does not exist yet>
            cache=<size of LRU cache for LevelDB>
            compression=<yes/no>
            max_open_files=<maximum number of open files for LevelDB>
            write_buffer_size=<Amount of data to build up in memory per file>
            block_size=<approximate size of user data packed per block>
            cachedterms=<file with list of terms to cache>

In strusCreate I would expect to find all details of the LevelDB parameters, as
I will most likely use them only there.

strusInspect shows the LevelDB parameters under the tracing configuration, while the
description of the tracing configuration itself is missing.

Why is there a metadata reader opened in strusInsert?

void InsertProcessor::run()
{
...
        std::auto_ptr<strus::MetaDataReaderInterface> metadata( 
            m_storage->createMetaDataReader());
        if (!metadata.get()) throw strus::runtime_error(_TXT("error creating meta data reader"));

Reading metadata while inserting documents does not seem to be something we need.

strusPattern: get all matches in a rule

How can I get all matches in a rule? It seems the first match appearing in the document wins and
gets assigned to the rule variables; the others are lost. This is a problem with any and within,
and with sequence when the distance is big enough that multiple matches can fit.

sequence_imm and a hypothetical within_imm

As I understand it, sequence_imm is logically the same as sequence with a distance parameter of 1.
So either a similar operator for within is missing, or both can be omitted and handled by the internal
optimizer.

strange output

What does the '1' at the end of each attribute line mean?

[1] 1067 score 0.412662
        docid = '119119' 1
        author = 'Twain, Mark, 1835-1910' 1
        title = 'A Tramp Abroad' 1
        language = 'English' 1
        loc_categories = 'PS: Language and Literatures: American and Canadian literature' 1
        subject = 'Humorous storiesEurope -- FictionWalking -- FictionAmericans -- Europe -- Fiction' 1
        copyright = 'Not copyrighted in the United States.' 1
        release_date = '2004-06-19' 1
        txt_file = 'data/1/1/119/119_8.zip' 1

query.eva is:

SELECT selfeat;
EVAL bm25( k1=0.75, b=2.1, avgdoclen=1, .match=docfeat);

SUMMARIZE attribute( name=docid );
SUMMARIZE attribute( name=author );
SUMMARIZE attribute( name=title );
SUMMARIZE attribute( name=language );
SUMMARIZE attribute( name=loc_categories );
SUMMARIZE attribute( name=subject );
SUMMARIZE attribute( name=copyright );
SUMMARIZE attribute( name=release_date );
SUMMARIZE attribute( name=txt_file );

strusDumpStatistics just core dumps

strusDumpStatistics -s 'path=storage' file

just segfaults.

gdb shows:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000404a5d in main (argc=<optimized out>, argv=<optimized out>)
    at /home/abaumann/strusUtilities/src/strusDumpStatistics/strusDumpStatistics.cpp:219
219                     while (statsqueue->getNext( msg, msgsize))
msg is ""
msgsize = 140180131644049

Looking at the code, msgsize appears to be uninitialized.
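
Whether the uninitialized variable is the actual cause of the crash is not established here, but giving out-parameters a defined value before such a loop is cheap. A minimal sketch follows; `FakeStatsQueue` is a stand-in written for this example and is not the strus statistics interface, only the call shape `getNext( msg, msgsize)` is taken from the backtrace above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a message queue (NOT the strus interface, assumption for illustration).
class FakeStatsQueue
{
public:
    explicit FakeStatsQueue(const std::vector<std::string>& blobs)
        : m_blobs(blobs), m_pos(0) {}

    // Returns false when exhausted and leaves the out-parameters untouched in that case.
    bool getNext(const char*& msg, std::size_t& msgsize)
    {
        if (m_pos >= m_blobs.size()) return false;
        msg = m_blobs[m_pos].data();
        msgsize = m_blobs[m_pos].size();
        ++m_pos;
        return true;
    }

private:
    std::vector<std::string> m_blobs;
    std::size_t m_pos;
};

// Sum up the sizes of all queued messages.
std::size_t totalBytes(FakeStatsQueue& queue)
{
    const char* msg = 0;      // defined values, even if getNext never writes them
    std::size_t msgsize = 0;
    std::size_t total = 0;
    while (queue.getNext(msg, msgsize))
    {
        assert(msg != 0);     // a successful call must have set the pointer
        total += msgsize;
    }
    return total;
}
```

With the variables initialized, a debugger would show `msgsize = 0` instead of a random value if the call fails before writing to it.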

strusPatternMatcher plain text segmenter error

Matching one file I get an error message:

error thread 1 file 'test/test1.txt': error matching rules, only empty expressions allowed for
'plain' segmenter, got '//()' for 1

I'm calling strusPatternMatcher with -C text/plain -F -t 1 -k modstrus_analyzer_pattern -p somerules.rul filelist.test.

This happens no matter what the document or the rule file contains.

strusInsert tries to detect the document type even if an explicit segmenter is given

shell>strusInsert -V -g tsv -s 'path=storage' ~/phone.ana ~/phone.tsv
DEBUG: adding selector expression: 1, lineno
DEBUG: definition multimap contains: [lineno, 1], 
DEBUG: adding selector expression: 2, lastname
DEBUG: definition multimap contains: [lastname, 2], [lineno, 1], 
DEBUG: adding selector expression: 3, firstname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lineno, 1], 
DEBUG: adding selector expression: 4, lastname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lastname, 4], [lineno, 1], 
DEBUG: adding selector expression: 5, firstname
DEBUG: definition multimap contains: [firstname, 3], [firstname, 5], [lastname, 2], [lastname, 4], [lineno, 1], 
DEBUG: selector subsection: 16777217, 16777216, line
failed to detect document class of file '~/phone.tsv'

done

The TSV segmenter gets initialized correctly with the analyzer configuration.

strusAnalyze shows nothing unless at least SearchIndex or ForwardIndex is declared

The following strus.ana configuration prints no output:

  [Attribute]
        docid = orig content /docs/doc/docid/();

  #[ForwardIndex]
  #    word = orig split /docs/doc/content/();

  #[SearchIndex]
  #       word = lc split /docs/doc/content/();

  [Document]
        doc = /docs/doc;

As soon as section ForwardIndex or SearchIndex is enabled, I also see
attribute and metadata output, otherwise not.

XML test examples should pass xmllint

xmllint ./tests/scripts/testSummarization1/data/input.xml 
./tests/scripts/testSummarization1/data/input.xml:1: parser error : standalone accepts only 'yes' or 'no'
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : String not closed expecting " or '
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : parsing XML declaration: '?>' expected
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
                                                 ^

strusPattern: sometimes corrupt match output

WORD ^ 1 : /\b\p{L}+\b/;
NUM ^ 1 : /\b[0-9]{1,12}\b/;
ORG ^ 2 : /\b(XXXX)\b/;

11 [61] : 4 ORG XXXX1
12 [66] : 2 NUMBER 41

the text contained:

He worked in department XXXX 41 from May 12th to...

strusPattern: %LEXER CASELESS and rule tokens

%LEXER CASELESS;

W = /aBc/;

R1 = any( W "aBc" );
R2 = any( W "abc" );

R1 matches, R2 does not. Also, 'ABC' in the text matches neither R1 nor R2.
I suspect that CASELESS only affects the Hyperscan option for case-insensitive matching.
IMHO it should also affect the tokens.
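
To make the expectation concrete: if the symbols used in rules were case-folded both when they are defined and when they are looked up, "aBc", "abc" and "ABC" would all resolve to the same symbol under CASELESS. The sketch below is an illustration of that idea only. The class `SymbolTable` is hypothetical, it says nothing about how strusPattern stores its symbols, and the folding is ASCII-only.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// ASCII-only lower-casing (assumption: no Unicode case folding needed here).
std::string foldCase(const std::string& s)
{
    std::string rt;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        rt.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(s[i]))));
    }
    return rt;
}

// Hypothetical symbol table that honours a caseless switch on define and on lookup.
class SymbolTable
{
public:
    explicit SymbolTable(bool caseless) : m_caseless(caseless), m_next(1) {}

    // Define a symbol and return its id (ids start at 1).
    unsigned int define(const std::string& sym)
    {
        assert(!sym.empty());
        const std::string key = m_caseless ? foldCase(sym) : sym;
        std::map<std::string, unsigned int>::const_iterator it = m_ids.find(key);
        if (it != m_ids.end()) return it->second;
        m_ids[key] = m_next;
        return m_next++;
    }

    // Return the id of a token value, or 0 if it is not a known symbol.
    unsigned int lookup(const std::string& token) const
    {
        const std::string key = m_caseless ? foldCase(token) : token;
        std::map<std::string, unsigned int>::const_iterator it = m_ids.find(key);
        return it == m_ids.end() ? 0 : it->second;
    }

private:
    bool m_caseless;
    unsigned int m_next;
    std::map<std::string, unsigned int> m_ids;
};
```

With `SymbolTable table(true)`, defining "aBc" once makes lookups of "abc" and "ABC" return the same id; with `false`, only the exact spelling is found.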

strusInspect

Missing a command to list all feature types in the index (something like attrnames for attributes and metatable for metadata)

Standard document type detection fails on TSV files with big elements

Programs doing document type detection (strusAnalyze, strusInsert, strusCheckInsert,
strusGenerateKeyMap, strusSegment) fail to detect TSV files if the first two lines of the file
(header + first data line) do not fit into 4K.

The reason is that these programs use only the first 4K of the document to detect the document type.

Possible fix: Retry with a bigger size, if the document type detection fails. The standard document type detection must also be fixed. It currently returns "text/plain" in this case.
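
The retry idea can be sketched as follows. This is an illustration under assumptions, not the strus detection code: `detectTsv` is a toy detector that needs two complete lines with the same non-zero number of tabs, and the names, the doubling strategy, and the returned MIME type string are all chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Toy detector (assumption): TSV if the first two complete lines of the prefix
// contain the same non-zero number of tabs. Returns "" if undecided.
std::string detectTsv(const std::string& prefix)
{
    const std::size_t eol1 = prefix.find('\n');
    if (eol1 == std::string::npos) return std::string();
    const std::size_t eol2 = prefix.find('\n', eol1 + 1);
    if (eol2 == std::string::npos) return std::string();
    const std::ptrdiff_t tabs1 = std::count(prefix.begin(), prefix.begin() + eol1, '\t');
    const std::ptrdiff_t tabs2 = std::count(prefix.begin() + eol1 + 1, prefix.begin() + eol2, '\t');
    if (tabs1 > 0 && tabs1 == tabs2) return std::string("text/tab-separated-values");
    return std::string();
}

// Try detection on a prefix of initialSize bytes and double the prefix on
// failure, until the whole content or maxSize bytes have been examined.
std::string detectWithRetry(const std::string& content, std::size_t initialSize, std::size_t maxSize)
{
    assert(initialSize > 0);
    std::size_t size = initialSize;
    for (;;)
    {
        const std::string type = detectTsv(content.substr(0, size));
        if (!type.empty()) return type;
        if (size >= content.size() || size >= maxSize) return std::string();
        size = std::min(size * 2, maxSize);
    }
}
```

An upper bound such as `maxSize` keeps the retry from reading arbitrarily large files into memory when the content is simply not TSV.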

strusQuery segfault

#0  0x00007f56a85b0477 in buildQueryTree (fields=std::vector of length 2, capacity 2 = {...}, qry=..., 
    groups=std::vector of length 2, capacity 2 = {...})
    at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:158
#1  strus::QueryAnalyzerContext::analyze (this=0x23e9fb0)
    at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:382
#2  0x00007f56ac3c7ec4 in strus::QueryStruct::translate (this=this@entry=0x7ffc618fcf60, query=..., 
    queryproc=queryproc@entry=0x1a55bb0, errorhnd=errorhnd@entry=0x1a54460)
    at /home/abaumann/strus/strusUtilities/src/program/queryStruct.cpp:99
#3  0x00007f56ac3d45a3 in strus::loadQuery (query=..., analyzer=analyzer@entry=0x1a5ae80, 
    queryproc=queryproc@entry=0x1a55bb0, source="Mark Twain\n", qdescr=..., 
    errorhnd=errorhnd@entry=0x1a54460)
    at /home/abaumann/strus/strusUtilities/src/program/programLoader.cpp:1825
#4  0x000000000040781f in main (argc_=<optimized out>, argv_=<optimized out>)
    at /home/abaumann/strus/strusUtilities/src/strusQuery/strusQuery.cpp:401

query.ana:

word = lc word word;

indexed with:

[SearchIndex]
	word = lc word author;
	word = lc word word;

query.qln contains:

Mark Twain

TestAnalyzeBindPos1 fails

      Start 11: TestAnalyzeBindPos1
11/18 Test #11: TestAnalyzeBindPos1 ..............***Failed    2.44 sec

shell> ctest --verbose -R TestAnalyzeBindPos1
UpdateCTestConfiguration  from :/data/strusUtilities/DartConfiguration.tcl
UpdateCTestConfiguration  from :/data/strusUtilities/DartConfiguration.tcl
Test project /data/strusUtilities
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 11
    Start 11: TestAnalyzeBindPos1

11: Test command: /data/strusUtilities/tests/scripts/runTest.sh "testAnalyzeBindPos1"
11: Test timeout computed to be: 9.99988e+06
11: /data/strusUtilities/tests/scripts
11: ERROR error in analyze document: error adding search index feature: error defining feature: 'illegal definition of a feature that has a tokenizer processing the content concatenated with positions bound to other features'
11: /data/strusUtilities/tests/scripts
11: DIFF AT 12924
1/1 Test #11: TestAnalyzeBindPos1 ..............***Failed    1.71 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   1.85 sec

The following tests FAILED:
         11 - TestAnalyzeBindPos1 (Failed)
Errors while running CTest

test_results_armv6.zip

dump multiple elements in one strusInspect call

For instance:

/strus/bin/strusInspect -s 'path=/opt/eurospider/strus/storage/xxx' attribute docid guid | less

Would be handy. Current workaround: dump docid and guid separately and then do a join.

strusPatternMatcher: unclear pattern matching in rules

Assuming I have a regex with two capturing sub-groups:

P : /([0-9]+)-(0-9)/;

R = any( p = P "14-22" );

returns the error message 'symbol defined twice '14-22''.

So is the value in the rule's matching pattern compared against the whole match, or are there N
strings, one for each subgroup? And how can the two be differentiated?
