I'm using:
- ES 0.90.3
- Mapper-attachments plugin 1.8.0
- FSRiver plugin 0.3.0
I create an index:
curl -XPUT 'localhost:9200/documents' -d '{
"document" : {
"properties" : {
"file" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"author" : {
"type" : "string"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string",
"store" : "yes"
}
}
},
"name" : {
"type" : "string",
"analyzer" : "keyword"
},
"pathEncoded" : {
"type" : "string",
"analyzer" : "keyword"
},
"postDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"rootpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"virtualpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"filesize" : {
"type" : "long"
}
}
}
}'
I create a river:
curl -XPUT 'localhost:9200/_river/documentsriver/_meta' -d '{
"type": "fs",
"fs": {
"url": "c:\\tmp",
"update_rate": 10000,
"includes": [ "*.docx" , "*.xlsx", "*.pdf", "*.pptx" ]
},
"index": {
"index": "documents",
"type": "document",
"bulk_size": 50
}
}'
- I'm using Windows, so the url parameter points to "c:\tmp" (with the backslash escaped in the JSON).
- The update rate is deliberately short (10000, i.e. a scan every 10 seconds) because I'm testing.
- The index used is "documents", as created.
- The mapping type used in the index is "document" as created.
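As a rough sanity check on the update rate, here is a minimal sketch of how many scans to expect during a given wait. It assumes update_rate is in milliseconds, which matches the 10-second interval I see while testing. The helper name expected_runs is made up for illustration and is not part of FSRiver.

```shell
# Hypothetical helper: number of river scans to expect during a wait.
# Assumption: update_rate is expressed in milliseconds.
expected_runs() {
  wait_seconds=$1
  update_rate_ms=$2
  echo $(( wait_seconds * 1000 / update_rate_ms ))
}

# Ten minutes of waiting with update_rate 10000
expected_runs 600 10000
```

With the values from the river definition this prints 60, so ten minutes of waiting should cover about 60 scans.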
The c:\tmp folder contains 12 files (max size 4000KB, most of them around 100KB) which match the include patterns. When the river is added (ES was restarted beforehand, to be sure all plugins are recognized and loaded), my documents index is filled with 13 entries (1 folder, 12 files). So far so good.
Adding files does not change index
However, if I copy some more files matching the patterns into the folder, they are not indexed. After waiting ten minutes, during which the river should have run 60 times, there is no change in the documents index. That the river really keeps running can be verified with a simple REST call:
curl -XGET 'localhost:9200/_river/_search' -d '{}'
With the result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_meta",
"_score": 1,
"_source": {
"type": "fs",
"fs": {
"url": "c:\\tmp",
"update_rate": 10000,
"includes": [
"*.docx",
"*.xlsx",
"*.pdf",
"*.pptx"
]
},
"index": {
"index": "documents",
"type": "document",
"bulk_size": 50
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_fsstatus",
"_score": 1,
"_source": {
"fs": {
"status": "STARTED"
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_status",
"_score": 1,
"_source": {
"ok": true,
"node": {
"id": "j1ViClzcQzSg4rqpgTseBQ",
"name": "Node1",
"transport_address": "inet[/192.31.142.25:9300]"
}
}
},
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_lastupdated",
"_score": 1,
"_source": {
"fs": {
"feedname": "documentsriver",
"lastdate": "2013-09-18T13:33:03.092Z",
"docadded": 0,
"docdeleted": 0
}
}
}
]
}
}
The second hit (_fsstatus) shows that the river has the status "STARTED", and when I repeat this request, the "lastdate" value in the last hit (_lastupdated) changes every 10 seconds.
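To watch only the two numbers that matter here (the river's last run and the document count of the index), the requests can be narrowed down. This is only a sketch: it assumes the node is reachable at localhost:9200 as in the commands above, and river_doc_url and count_url are helper names invented for this post. The calls themselves are the ordinary get-document and count APIs.

```shell
# Assumption: node reachable at localhost:9200, as in the commands above.
ES="localhost:9200"

# URL of one of the river's bookkeeping documents
# (_meta, _fsstatus, _status or _lastupdated).
river_doc_url() {
  echo "$ES/_river/$1/$2"
}

# URL of the count API for one mapping type of an index.
count_url() {
  echo "$ES/$1/$2/_count"
}

# Usage against a running node:
#   curl -XGET "$(river_doc_url documentsriver _lastupdated)"
#   curl -XGET "$(count_url documents document)"
```

Repeating both requests every few seconds should show "lastdate" moving while the count stays where it is, which is the behaviour described here.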
Deleting files is recognized but does not change index
The same happens when most files are deleted and only one of the original files is left. The index still doesn't change: it still reports 13 documents instead of 2 (1 folder, 1 file). But now, searching the _river "index" shows this in the last hit:
....
{
"_index": "_river",
"_type": "documentsriver",
"_id": "_lastupdated",
"_score": 1,
"_source": {
"fs": {
"feedname": "documentsriver",
"lastdate": "2013-09-18T13:38:25.274Z",
"docadded": 0,
"docdeleted": 11
}
}
}
....
So the river says 11 documents were deleted, but the index does not change. Not even the version of the indexed documents changes. So from the index you cannot tell whether a file still exists.
In both cases above, the log files (log level DEBUG) give no feedback.
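Since the logs are silent, the river's own counters are the only feedback. A small sketch for pulling a single counter out of the _lastupdated document follows. The function name river_counter is made up, and the sed pattern assumes the counter is a plain non-negative integer, as in the output shown above.

```shell
# Hypothetical helper: print the numeric value of a counter field
# (e.g. docadded or docdeleted) from a river status document on stdin.
# Assumption: the value is a plain non-negative integer.
river_counter() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Usage against a running node (assumes localhost:9200 as above):
#   curl -s -XGET 'localhost:9200/_river/documentsriver/_lastupdated' | river_counter docdeleted
```

Fed with the _lastupdated document from the deletion case, this yields 11 for docdeleted and 0 for docadded, which can then be compared with the unchanged count of 13 in the documents index.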
Adding folders raises index error
Finally, the last case: a new subfolder is added to the folder C:\tmp. After adding the folder, the log files say:
[2013-09-18 15:45:27,942][WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Node1] [fs][documentsriver] Error while indexing content from c:\tmp
After removing the folder, the warnings disappear.