patrickfrey / strusutilities
A set of command line programs to access the strus information retrieval engine
Home Page: http://www.project-strus.net
License: Mozilla Public License 2.0
TestAnalyzeSubSegmenter1
1/20 Test #1: TestAnalyzeSubSegmenter1 .........***Failed 0.06 sec
...
20/20 Test #20: TestUpdateCalcStats ..............***Failed 0.00 sec
for instance:
20: error executing test: error reading program file '/data/work/strus/strusUtilities/tests/scripts/testUpdateCalcStats1/RUN': No such file or directory
A missing semicolon can have severe consequences. The instruction that follows is taken as part of the selection expression and therefore ignored, but the syntax parser for the selection expression does not report any error (textwolf problem).
[Aggregator]
# doclen = count( ngram );
results in:
ERROR failed to load analyzer program document.ana: error in document analyzer program at line 32 column 1: feature type name (identifier) expected at start of a feature declaration
The following tests FAILED:
1 - TestAnalyzeSubSegmenter1 (Failed)
2 - TestAnalyzeSubSegmenter2 (Failed)
3 - TestAnalyzeSubSegmenter3 (Failed)
4 - TestAnalyzeBase1 (Failed)
5 - TestAnalyzeJson1 (Failed)
6 - TestAnalyzeJson2 (Failed)
7 - TestAnalyzeConfig1 (Failed)
8 - TestAnalyzeConfig2 (Failed)
9 - TestAnalyzeWithDocType1 (Failed)
10 - TestInsertWithDocType1 (Failed)
11 - TestInsertBase1 (Failed)
12 - TestSimpleQuery1 (Failed)
13 - TestSummarization1 (Failed)
12: Test command: /Users/administrator/strus/strusUtilities/tests/scripts/runTest.sh "testSimpleQuery1"
12: Test timeout computed to be: 9.99988e+06
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 20: /Users/administrator/strus/strusUtilities/src/strusCreate/strusCreate: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 44: /Users/administrator/strus/strusUtilities/src/strusInsert/strusInsert: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 53: /Users/administrator/strus/strusUtilities/src/strusQuery/strusQuery: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts/ENV: line 26: /Users/administrator/strus/strusUtilities/src/strusDestroy/strusDestroy: No such file or directory
12: /Users/administrator/strus/strusUtilities/tests/scripts
12: DIFF AT 25
find . -name strusQuery -perm 0755 -type f
./src/strusQuery/Release/strusQuery
Windows and XCode builds place the binaries not in the current directory but in Release or Debug, depending on the CMake build configuration.
Test project /usr/home/abaumann/strusUtilities
Start 1: TestPatternMatch1
1/22 Test #1: TestPatternMatch1 ................***Failed 0.62 sec
Start 2: TestPatternMatch2
2/22 Test #2: TestPatternMatch2 ................***Failed 0.53 sec
Start 3: TestPatternMatch3
3/22 Test #3: TestPatternMatch3 ................***Failed 0.54 sec
Start 4: TestPatternMatch4
4/22 Test #4: TestPatternMatch4 ................***Failed 0.53 sec
Start 5: TestAnalyzeSubSegmenter1
5/22 Test #5: TestAnalyzeSubSegmenter1 .........***Failed 0.18 sec
Start 6: TestAnalyzeSubSegmenter2
6/22 Test #6: TestAnalyzeSubSegmenter2 .........***Failed 0.18 sec
Start 7: TestAnalyzeSubSegmenter3
7/22 Test #7: TestAnalyzeSubSegmenter3 .........***Failed 0.16 sec
Start 8: TestAnalyzeBase1
8/22 Test #8: TestAnalyzeBase1 ................. Passed 0.47 sec
Start 9: TestAnalyzeJson1
9/22 Test #9: TestAnalyzeJson1 .................***Failed 0.16 sec
Start 10: TestAnalyzeJson2
10/22 Test #10: TestAnalyzeJson2 .................***Failed 0.16 sec
Start 11: TestAnalyzeTsv1
11/22 Test #11: TestAnalyzeTsv1 .................. Passed 0.17 sec
Start 12: TestAnalyzeConfig1
12/22 Test #12: TestAnalyzeConfig1 ............... Passed 0.17 sec
Start 13: TestAnalyzeConfig2
13/22 Test #13: TestAnalyzeConfig2 ............... Passed 0.17 sec
Start 14: TestAnalyzeWithDocType1
14/22 Test #14: TestAnalyzeWithDocType1 ..........***Failed 0.26 sec
Start 15: TestAnalyzeBindPos1
15/22 Test #15: TestAnalyzeBindPos1 .............. Passed 0.46 sec
Start 16: TestInsertWithDocType1
16/22 Test #16: TestInsertWithDocType1 ...........***Failed 2.69 sec
Start 17: TestInsertBase1
17/22 Test #17: TestInsertBase1 ..................***Failed 2.35 sec
Start 18: TestSimpleQuery1
18/22 Test #18: TestSimpleQuery1 .................***Failed 1.45 sec
Start 19: TestQueryWithRestriction1
19/22 Test #19: TestQueryWithRestriction1 ........ Passed 1.06 sec
Start 20: TestQueryWithFormula1
20/22 Test #20: TestQueryWithFormula1 ............ Passed 0.72 sec
Start 21: TestSummarization1
21/22 Test #21: TestSummarization1 ...............***Failed 0.71 sec
Start 22: TestUpdateCalcStats
22/22 Test #22: TestUpdateCalcStats .............. Passed 2.31 sec
36% tests passed, 14 tests failed out of 22
shell> strusCreate -h:
shows:
-s|--storage <CONFIG>
Define the storage configuration string as <CONFIG>
<CONFIG> is a semicolon ';' separated list of assignments:
path=<LevelDB storage path>;compression=<yes/no>
acl=<yes/no, yes if users with different access rights exist>
metadata=<comma separated list of meta data def>
shell> strusInspect -h
shows
-T|--trace <CONFIG>
Print method call traces configured with <CONFIG>
<CONFIG> is a semicolon ';' separated list of assignments:
path=<LevelDB storage path>
create=<yes/no, yes=do create if database does not exist yet>
cache=<size of LRU cache for LevelDB>
compression=<yes/no>
max_open_files=<maximum number of open files for LevelDB>
write_buffer_size=<Amount of data to build up in memory per file>
block_size=<approximate size of user data packed per block>
cachedterms=<file with list of terms to cache>
In strusCreate I would expect to find all details of the LevelDB parameters, as I will most likely use them only there.
strusInspect shows me the LevelDB parameters under the tracing configuration, while the tracing configuration itself is missing.
[ 178s] 10/13 Test #10: TestInsertWithDocType1 ...........***Failed 0.12 sec
[ 178s] Start 11: TestInsertBase1
[ 178s] 11/13 Test #11: TestInsertBase1 ..................***Failed 0.19 sec
[ 178s] Start 12: TestSimpleQuery1
See also:
https://build.opensuse.org/package/live_build_log/home:andreas_baumann/strusutilities/CentOS_6/i586
void InsertProcessor::run()
{
...
std::auto_ptr<strus::MetaDataReaderInterface> metadata(
m_storage->createMetaDataReader());
if (!metadata.get()) throw strus::runtime_error(_TXT("error creating meta data reader"));
It seems that reading metadata while inserting documents is something we should not need.
How can I get all matches in a rule? It seems that the first match appearing in the document wins and gets assigned to the rule variables; the others get lost. This is a problem with any and within, and with sequence when the distances are big enough for multiple matches to fit.
As I understand it, sequence_imm is logically the same as sequence with a distance parameter of 1.
So either a similar operator for within is missing, or both can be omitted and handled by the internal optimizer.
What does '1' at the end of each attribute mean?
[1] 1067 score 0.412662
docid = '119119' 1
author = 'Twain, Mark, 1835-1910' 1
title = 'A Tramp Abroad' 1
language = 'English' 1
loc_categories = 'PS: Language and Literatures: American and Canadian literature' 1
subject = 'Humorous storiesEurope -- FictionWalking -- FictionAmericans -- Europe -- Fiction' 1
copyright = 'Not copyrighted in the United States.' 1
release_date = '2004-06-19' 1
txt_file = 'data/1/1/119/119_8.zip' 1
query.eva is:
SELECT selfeat;
EVAL bm25( k1=0.75, b=2.1, avgdoclen=1, .match=docfeat);
SUMMARIZE attribute( name=docid );
SUMMARIZE attribute( name=author );
SUMMARIZE attribute( name=title );
SUMMARIZE attribute( name=language );
SUMMARIZE attribute( name=loc_categories );
SUMMARIZE attribute( name=subject );
SUMMARIZE attribute( name=copyright );
SUMMARIZE attribute( name=release_date );
SUMMARIZE attribute( name=txt_file );
strusDumpStatistics -s 'path=storage' file
just segfaults:
gdb shows:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000404a5d in main (argc=<optimized out>, argv=<optimized out>)
at /home/abaumann/strusUtilities/src/strusDumpStatistics/strusDumpStatistics.cpp:219
219 while (statsqueue->getNext( msg, msgsize))
msg is "" and msgsize = 140180131644049.
Looking at the code, msgsize seems to be uninitialized.
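A minimal defensive sketch of the loop shown in the backtrace. FakeStatsQueue and drainQueue are hypothetical stand-ins and not the strus interface, and whether uninitialized values are the actual cause of the crash is not confirmed here. The sketch only shows giving msg and msgsize a defined state and checking the queue pointer before the first getNext call:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the statistics queue (NOT the strus interface).
class FakeStatsQueue
{
public:
    explicit FakeStatsQueue( const std::vector<std::string>& msgs)
        :m_msgs(msgs),m_pos(0){}

    // Leaves both out-parameters untouched when the queue is exhausted.
    bool getNext( const char*& msg, std::size_t& msgsize)
    {
        if (m_pos >= m_msgs.size()) return false;
        msg = m_msgs[ m_pos].data();
        msgsize = m_msgs[ m_pos].size();
        ++m_pos;
        return true;
    }

private:
    std::vector<std::string> m_msgs;
    std::size_t m_pos;
};

// Concatenates all queued messages.
std::string drainQueue( FakeStatsQueue* queue)
{
    // Assumption: a missing queue is reported as an error, not dereferenced.
    if (!queue) throw std::runtime_error( "no statistics queue available");

    const char* msg = nullptr;   // defined state before the first call
    std::size_t msgsize = 0;
    std::string out;
    while (queue->getNext( msg, msgsize))
    {
        assert( msg != nullptr);
        out.append( msg, msgsize);
    }
    return out;
}
```

With the null check in place, a queue that could not be created would surface as an error message instead of a segmentation fault.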
I have the following:
P1 = /xxx/;
P2 = /yyy/;
Then two rules:
R1 = any( p1 = P1 "abc" );
R2 = any( p2 = P2 "cde" );
Each of the rules R1 and R2 works individually; enabling both leads to
symbol defined twice "cde"
Passing commands as one string is not intuitive.
It should work similarly to strusInspect.
Matches in Rule:
R = sequence( first = WORD, second = WORD );
get returned as
second: ... first
The natural order would be the order of the sequence, not the reverse.
Matching one file I get an error message:
error thread 1 file 'test/test1.txt': error matching rules, only empty expressions allowed for
'plain' segmenter, got '//()' for 1
I'm calling strusPatternMatchers with -C text/plain -F -t 1 -k modstrus_analyzer_pattern -p somerules.rul filelist.test.
This happens no matter what the document or the rule file contain.
shell>strusInsert -V -g tsv -s 'path=storage' ~/phone.ana ~/phone.tsv
DEBUG: adding selector expression: 1, lineno
DEBUG: definition multimap contains: [lineno, 1],
DEBUG: adding selector expression: 2, lastname
DEBUG: definition multimap contains: [lastname, 2], [lineno, 1],
DEBUG: adding selector expression: 3, firstname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lineno, 1],
DEBUG: adding selector expression: 4, lastname
DEBUG: definition multimap contains: [firstname, 3], [lastname, 2], [lastname, 4], [lineno, 1],
DEBUG: adding selector expression: 5, firstname
DEBUG: definition multimap contains: [firstname, 3], [firstname, 5], [lastname, 2], [lastname, 4], [lineno, 1],
DEBUG: selector subsection: 16777217, 16777216, line
failed to detect document class of file '~/phone.tsv'
done
The TSV segmenter gets initialized correctly with the analyzer configuration.
The following strus.ana configuration prints no output:
[Attribute]
docid = orig content /docs/doc/docid/();
#[ForwardIndex]
# word = orig split /docs/doc/content/();
#[SearchIndex]
# word = lc split /docs/doc/content/();
[Document]
doc = /docs/doc;
As soon as the section ForwardIndex or SearchIndex is enabled, I also see attribute and metadata output, otherwise not.
xmllint ./tests/scripts/testSummarization1/data/input.xml
./tests/scripts/testSummarization1/data/input.xml:1: parser error : standalone accepts only 'yes' or 'no'
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : String not closed expecting " or '
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
^
./tests/scripts/testSummarization1/data/input.xml:1: parser error : parsing XML declaration: '?>' expected
<?xml version="1.0" encoding="UTF-8" standalone="YES"?>
^
Maybe I'm just blind, but I don't see any documentation of the language in the *.rul file.
WORD ^ 1 : /\b\p{L}+\b/;
NUM ^ 1 : /\b[0-9]{1,12}\b/;
ORG ^ 2 : /\b(XXXX)\b/;
11 [61] : 4 ORG XXXX1
12 [66] : 2 NUMBER 41
the text contained:
He worked in department XXXX 41 from May 12th to...
WINDOW_SIZE := 1000;
R = within( A, B | WINDOW_SIZE );
Something like: A AND B must appear in the document, no matter at which position.
I can simulate that with a within( A, B | N ) with a big N. :-)
Some word boundaries are not detected correctly in UTF-8 input.
%LEXER CASELESS;
W = /aBc/;
R1 = any( W "aBc" );
R2 = any( W "abc" );
R1 matches, R2 does not. Also, 'ABC' in the text matches neither R1 nor R2.
I suspect that CASELESS only affects the Hyperscan option for case-insensitive matching.
IMHO it should also have an effect on the tokens.
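One way the requested behaviour could look, as a sketch under assumptions: SymbolTable and asciiLower are hypothetical names and not part of strus. The idea is to normalize the symbol both when it is defined in a rule and when a token is looked up, so that a CASELESS setting applies to tokens as well as to the lexer patterns. The folding here is ASCII-only for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// ASCII-only lowercasing; real input would need proper Unicode case folding.
std::string asciiLower( const std::string& src)
{
    std::string rt;
    rt.reserve( src.size());
    for (std::string::const_iterator si = src.begin(); si != src.end(); ++si)
    {
        rt.push_back( static_cast<char>( std::tolower( static_cast<unsigned char>( *si))));
    }
    return rt;
}

// Hypothetical symbol table (NOT the strus implementation).
class SymbolTable
{
public:
    explicit SymbolTable( bool caseless)
        :m_caseless(caseless){}

    // Registers a symbol and returns its id (ids start at 1).
    int define( const std::string& symbol)
    {
        std::string key = normalize( symbol);
        std::map<std::string,int>::const_iterator mi = m_map.find( key);
        if (mi != m_map.end()) return mi->second;
        int id = static_cast<int>( m_map.size()) + 1;
        m_map[ key] = id;
        return id;
    }

    // Returns the id of the symbol matching a token, or 0 if there is none.
    int lookup( const std::string& token) const
    {
        std::map<std::string,int>::const_iterator mi = m_map.find( normalize( token));
        return mi == m_map.end() ? 0 : mi->second;
    }

private:
    // Applies the case folding to definitions and lookups alike.
    std::string normalize( const std::string& value) const
    {
        return m_caseless ? asciiLower( value) : value;
    }

    bool m_caseless;
    std::map<std::string,int> m_map;
};
```

With caseless set, the symbols "aBc" and "abc" share one id and the token 'ABC' resolves to it; without it, only the exact spelling matches.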
They assume test data is in the current build dir presumably.
A command to list all feature types in the index is missing (something like attrnames for attributes and metatable for metadata).
ERROR error in analyze document: error defining expression for 'textwolf' segmenter: error in selection expression 'author' at start of expression
strusInspect -s 'path=storage/xxx' metadata THIS_META_FIELD_DOES_NOT_EXIST
results in:
EXCEPTION array bound read in function get
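A sketch of the kind of check that would turn this into a readable error, assuming the field name is resolved to an index before the value is read. MetaDataRecord and its methods are hypothetical and do not mirror the strus classes:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical metadata record (NOT the strus implementation).
class MetaDataRecord
{
public:
    void define( const std::string& name, double value)
    {
        m_names.push_back( name);
        m_values.push_back( value);
    }

    // Returns the index of the field, or -1 if the name is not defined.
    int handle( const std::string& name) const
    {
        for (std::size_t ii=0; ii<m_names.size(); ++ii)
        {
            if (m_names[ ii] == name) return static_cast<int>( ii);
        }
        return -1;
    }

    // Validates the name before indexing, so an unknown field
    // is reported by name instead of as an array bound read.
    double get( const std::string& name) const
    {
        int idx = handle( name);
        if (idx < 0) throw std::runtime_error( "unknown metadata field '" + name + "'");
        return m_values[ static_cast<std::size_t>( idx)];
    }

private:
    std::vector<std::string> m_names;
    std::vector<double> m_values;
};
```

The point of the design is only that the caller sees which field name was wrong.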
Programs doing document type detection (strusAnalyze, strusInsert, strusCheckInsert, strusGenerateKeyMap, strusSegment) fail to detect TSV files if the first two lines of the file (header + first data line) do not fit into 4K.
The reason is that these programs use only the first 4K of the document to detect the document type.
Possible fix: Retry with a bigger size if the document type detection fails. The standard document type detection must also be fixed. It currently returns "text/plain" in this case.
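The proposed retry could be sketched as follows. Everything here is an assumption made for illustration: looksLikeTsv is a deliberately naive detector, detectDocumentClass and its size limits are hypothetical, and returning an empty string for "not detected" is one possible alternative to falling back to "text/plain".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Naive hypothetical detector: requires a complete header line containing
// a tab and a complete first data line within the inspected chunk.
bool looksLikeTsv( const std::string& head)
{
    std::size_t firstEoln = head.find( '\n');
    if (firstEoln == std::string::npos) return false;
    std::size_t secondEoln = head.find( '\n', firstEoln + 1);
    if (secondEoln == std::string::npos) return false;
    return head.find( '\t') < firstEoln;
}

// Inspects a growing prefix of the content, starting with 4K and doubling
// the size until detection succeeds, the content ends or maxSize is reached.
std::string detectDocumentClass(
        const std::string& content,
        std::size_t initialSize = 4096,
        std::size_t maxSize = 1 << 20)
{
    assert( initialSize > 0);
    std::size_t size = initialSize;
    for (;;)
    {
        std::size_t len = std::min( size, content.size());
        if (looksLikeTsv( content.substr( 0, len)))
        {
            return "text/tab-separated-values";
        }
        if (len == content.size() || size >= maxSize) break;
        size *= 2;
    }
    return std::string(); // not detected, rather than claiming "text/plain"
}
```

A file whose header line alone is longer than 4K is rejected on the first pass and recognized on the second, once the prefix covers both lines.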
#0 0x00007f56a85b0477 in buildQueryTree (fields=std::vector of length 2, capacity 2 = {...}, qry=...,
groups=std::vector of length 2, capacity 2 = {...})
at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:158
#1 strus::QueryAnalyzerContext::analyze (this=0x23e9fb0)
at /home/abaumann/strus/strusAnalyzer/src/analyzer/queryAnalyzerContext.cpp:382
#2 0x00007f56ac3c7ec4 in strus::QueryStruct::translate (this=this@entry=0x7ffc618fcf60, query=...,
queryproc=queryproc@entry=0x1a55bb0, errorhnd=errorhnd@entry=0x1a54460)
at /home/abaumann/strus/strusUtilities/src/program/queryStruct.cpp:99
#3 0x00007f56ac3d45a3 in strus::loadQuery (query=..., analyzer=analyzer@entry=0x1a5ae80,
queryproc=queryproc@entry=0x1a55bb0, source="Mark Twain\n", qdescr=...,
errorhnd=errorhnd@entry=0x1a54460)
at /home/abaumann/strus/strusUtilities/src/program/programLoader.cpp:1825
#4 0x000000000040781f in main (argc_=<optimized out>, argv_=<optimized out>)
at /home/abaumann/strus/strusUtilities/src/strusQuery/strusQuery.cpp:401
query.ana:
word = lc word word;
indexed with:
[SearchIndex]
word = lc word author;
word = lc word word;
query.qln contains:
Mark Twain
Start 11: TestAnalyzeBindPos1
11/18 Test #11: TestAnalyzeBindPos1 ..............***Failed 2.44 sec
shell> ctest --verbose -R TestAnalyzeBindPos1
UpdateCTestConfiguration from :/data/strusUtilities/DartConfiguration.tcl
UpdateCTestConfiguration from :/data/strusUtilities/DartConfiguration.tcl
Test project /data/strusUtilities
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 11
Start 11: TestAnalyzeBindPos1
11: Test command: /data/strusUtilities/tests/scripts/runTest.sh "testAnalyzeBindPos1"
11: Test timeout computed to be: 9.99988e+06
11: /data/strusUtilities/tests/scripts
11: ERROR error in analyze document: error adding search index feature: error defining feature: 'illegal definition of a feature that has a tokenizer processing the content concatenated with positions bound to other features'
11: /data/strusUtilities/tests/scripts
11: DIFF AT 12924
1/1 Test #11: TestAnalyzeBindPos1 ..............***Failed 1.71 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 1.85 sec
The following tests FAILED:
11 - TestAnalyzeBindPos1 (Failed)
Errors while running CTest
This will become a problem if somebody wants to port to Windows.
For instance:
/strus/bin/strusInspect -s 'path=/opt/eurospider/strus/storage/xxx' attribute docid guid | less
Would be handy. Current workaround: dump docid and guid separately and then do a join.
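The workaround join can be sketched like this. The dump format is an assumption: each line is taken to be a document number, a space, and the attribute value, which may not match what strusInspect actually prints. parseDump and joinDumps are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string,std::string> ColumnMap;

// Parses lines of the assumed form "<docno> <value>" into a map.
ColumnMap parseDump( const std::string& dump)
{
    ColumnMap rt;
    std::istringstream in( dump);
    std::string line;
    while (std::getline( in, line))
    {
        std::size_t sep = line.find( ' ');
        if (sep == std::string::npos) continue; // skip malformed lines
        rt[ line.substr( 0, sep)] = line.substr( sep + 1);
    }
    return rt;
}

// Inner join of the two dumps on the document number.
std::string joinDumps( const std::string& docidDump, const std::string& guidDump)
{
    ColumnMap docids = parseDump( docidDump);
    ColumnMap guids = parseDump( guidDump);
    std::ostringstream out;
    for (ColumnMap::const_iterator di = docids.begin(); di != docids.end(); ++di)
    {
        ColumnMap::const_iterator gi = guids.find( di->first);
        if (gi == guids.end()) continue; // no guid for this document
        out << di->first << ' ' << di->second << ' ' << gi->second << '\n';
    }
    return out.str();
}
```

Documents that appear in only one of the two dumps are dropped, which matches an inner join; a command accepting several attribute names would make this step unnecessary.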
Assuming I have a regex with two capturing sub-groups:
P : /([0-9]+)-(0-9)/;
R = any( p = P "14-22" );
returns the error message 'symbol defined twice '14-22''.
So does the value of the matching pattern in the rule refer to the whole match, or are there N strings, one for each subgroup? And how do I differentiate between the two?