dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)

Home Page: https://fscrawler.readthedocs.io/

License: Apache License 2.0

Languages: Java 89.67%, Shell 0.96%, Batchfile 0.19%, HTML 4.54%, Rich Text Format 4.61%, Dockerfile 0.04%
Topics: java, elasticsearch, crawler, tika

fscrawler's Introduction

File System Crawler for Elasticsearch

Welcome to the FS Crawler for Elasticsearch

This crawler helps to index binary documents such as PDF, OpenOffice and MS Office files.

Main features:

  • Crawls your local file system (or a mounted drive), indexing new files, updating existing ones and removing old ones.
  • Crawls remote file systems over SSH/FTP.
  • REST interface that lets you "upload" your binary documents to Elasticsearch (see the sketch after this list).
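
A minimal sketch of the REST upload, assuming a local FS Crawler instance with its REST service enabled on the default address (see the documentation linked above for the exact endpoint and options):

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"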

Latest versions

Current "most stable" versions are:

Elasticsearch      FS Crawler
6.x, 7.x, 8.x      2.10-SNAPSHOT


Documentation

The guide has been moved to ReadTheDocs.


Contribute

Works on my machine - and yours! Spin up a pre-configured, standardized dev environment for this repository by clicking on the button below.

Open in Gitpod

License


Read more about the Apache2 License.

Thanks

Thanks to JetBrains for the IntelliJ IDEA License!

Thanks to SonarCloud for the free analysis!


fscrawler's People

Contributors

babadofar, barts2108, cadm-frank, chrissound, circuitguy, coder-sa, dadoonet, dependabot-preview[bot], dependabot[bot], eternallybaffled, fgaujous, github-actions[bot], helsonxiao, iadcode, ian-cameron, it20one, janhoy, kikkauz, kneubi, koopmac, logicer16, mergify[bot], muraken720, quix0r, rhaist, shadiakiki1986, shahariaazam, tommylike, xcorail, ywjung


fscrawler's Issues

failed to get _meta from [fs]

Hi,

How do I fix this issue? I am using fsriver 0.4.0 and Elasticsearch 1.0.1.

[2014-03-03 13:03:12,359][INFO ][gateway                  ] [Orphan] recovered [3] indices into cluster_state
[2014-03-03 13:03:12,544][INFO ][node                     ] [Orphan] started
[2014-03-03 13:03:12,556][INFO ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Orphan] [fs][ocrdocs] Starting fs river scanning
[2014-03-03 13:03:12,559][WARN ][river                    ] [Orphan] failed to get _meta from [fs]/[ocrdocs]
java.lang.NoSuchMethodError: org.elasticsearch.action.admin.cluster.state.ClusterStateRequestBuilder.setFilterIndices([Ljava/lang/String;)Lorg/elasticsearch/action/admin/cluster/state/ClusterStateRequestBuilder;
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.isMappingExist(FsRiver.java:300)
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.pushMapping(FsRiver.java:315)
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.start(FsRiver.java:229)
        at org.elasticsearch.river.RiversService.createRiver(RiversService.java:148)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
        at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:695)

Thanks a lot

Meta data

Indexed files don't contain any metadata, only the name and content. I need the author and keywords. Where is the issue?

I have version 0.0.3:
{
  "_index": "projects",
  "_type": "doc",
  "_id": "1fdc355779fb5bb6bab0d341207b1e96",
  "_score": 0.0560452,
  "_source": {
    "name": "test.xlsx",
    "postDate": 1286880677000,
    "pathEncoded": "3b248e2f08548742290f371652fb",
    "rootpath": "70e1733efe56e2a6edcb96632119b4b",
    "virtualpath": "s/data",
    "file": {
      "_name": "test.xlsx",
      "content": "UEsDB..."
    }
  }
}

Tests are non-deterministic

Depending on the environment, some tests take longer than the allowed timeout. I'm running on a 2011 MacBook Air and some tests fail, but if I re-run them they all pass.

What would you think of retrying failed tests?
Or do you have other ideas?

Store origin URL in documents

We would like to have the original URL (full path file://mydir/mydoc.txt) stored in a source_url field.

It would allow indexing only the content, without needing to store the _source document itself.
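
A sketch of what an indexed document could then look like (the source_url field name comes from this request; the surrounding structure is purely illustrative, borrowed from the examples elsewhere on this page):

{
  "name": "mydoc.txt",
  "source_url": "file://mydir/mydoc.txt",
  "file": {
    "_name": "mydoc.txt",
    "content": "..."
  }
}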

NoSuchMethodError when using FS river with ES 1.0.x

[2014-03-03 17:11:58,263][WARN ][river ] [Frank Payne] failed to get _meta from [fs]/[ram2docs]
java.lang.NoSuchMethodError: org.elasticsearch.action.admin.cluster.state.ClusterStateRequestBuilder.setFilterIndices([Ljava/lang/String;)Lorg/elasticsearch/action/admin/cluster/state/ClusterStateRequestBuilder;
at fr.pilato.elasticsearch.river.fs.river.FsRiver.isMappingExist(FsRiver.java:300)
at fr.pilato.elasticsearch.river.fs.river.FsRiver.pushMapping(FsRiver.java:315)
at fr.pilato.elasticsearch.river.fs.river.FsRiver.start(FsRiver.java:229)
at org.elasticsearch.river.RiversService.createRiver(RiversService.java:148)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Use BulkProcessor feature

A cool feature exists in Elasticsearch: the BulkProcessor.
We want to use it here instead of managing bulk requests by hand.

Problem on virtualpath on linux servers

On Linux, the virtualpath contains the last character of the fs river path.
Example:

curl -XPUT 'localhost:9201/_river/docs/_meta' -d '{
"type": "fs",
"fs": {
"name": "test-river",
"url": "/tmp/river",
"update_rate": 1000,
"includes": "*",
"excludes": "resume"
}
}'

The virtual path is "r/dir/" but should be "dir/".

Thank you

Files not correctly added/removed from index on Windows platform

Hello,

I don't understand why, when I delete files from the folder configured for the river, the documents are not deleted from the index.

The river seems correctly configured: each file with an update date later than the last synchronisation date is picked up and added or updated in the index.

Here is the river configuration:

{
  "type": "fs",
  "fs": {
    "url": "/json-river",
    "update_rate": 10000,
    "json_support": true,
    "filename_as_id": true
  },
  "index": {
    "index": "baseproduit",
    "type": "produit",
    "bulk_size": 50
  }
}

I added some logging in FsRiver and I can see that the collection "esFiles" is always empty:

line 520:
Collection esFiles = getFileDirectory(filepath);

So the method esDelete is never called in the loop:
for (String esfile : esFiles) {

FsRiver versions used: 0.4.0-SNAPSHOT and 0.3.0.
Should I configure something else to get it to work?

Thank you very much for your help.

Only some files are indexed

I have a folder with 200,000 files. This plugin indexed only 4420 documents (folders + files). I get this total from {"query":{"match_all":{}}}. Where is the issue?
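
For reference, a sketch of how that total can be obtained (the index name projects is only taken from the bulk errors quoted below; the count is reported in hits.total):

curl -XGET 'localhost:9200/projects/_search' -d '{"query":{"match_all":{}},"size":0}'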

Does this plugin skip files that are not supported by Tika, or files that raise errors in Tika?

I do get some errors, for example:

org.elasticsearch.index.mapper.MapperParsingException: Failed to extract [100000] characters of text for [thirdparty-licenses.xml]
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:311)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:507)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:449)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:486)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:430)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:318)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.tika.exception.TikaException: XML parse error
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:421)
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:309)
... 11 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 45; columnNumber: 97; Element type "module" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.seekCloseOfStartTag(XMLDocumentFragmentScannerImpl.java:1355)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:261)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2717)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
... 16 more
[2013-04-04 16:49:49,728][WARN ][river.fs ] [Physter] [fs][projects] Failed to execute failure in bulk execution:
[23]: index [projects], type [doc], id [f957de3fec56b2b2a4268db95f6161], message [MapperParsingException[Failed to extract [100000] characters of text for [jms-queue.xml.vm]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "role" must be followed by either attribute specifications, ">" or "/>".]; ]
[24]: index [projects], type [doc], id [f957de3fec56b2b2a4268db95f6161], message [MapperParsingException[Failed to extract [100000] characters of text for [jms-queue.xml.vm]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "role" must be followed by either attribute specifications, ">" or "/>".]; ]
[56]: index [projects], type [doc], id [26501b64741d8ce415c677feae4bf5], message [MapperParsingException[Failed to extract [100000] characters of text for [thirdparty-licenses.xml]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "module" must be followed by either attribute specifications, ">" or "/>".]; ]
[58]: index [projects], type [doc], id [26501b64741d8ce415c677feae4bf5], message [MapperParsingException[Failed to extract [100000] characters of text for [thirdparty-licenses.xml]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "module" must be followed by either attribute specifications, ">" or "/>".]; ]

How to set DEBUG log level?

My log file shows only this WARN message. Does fsriver log at a more detailed level?

[2013-04-12 15:56:04,709][WARN ][river.fs                 ] [Physter] [fs][projects] Error while indexing content from /projects

logging.yml:

rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG
  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN
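
One way to get more detail would be to raise the river's logger to DEBUG in logging.yml. This is only a sketch: the river.fs logger name is an assumption taken from the [river.fs] prefix in the warning above.

logger:
  # assumption: the FS river logs under river.fs, as suggested by the warning prefix above
  river.fs: DEBUG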

Replace mapper-attachment plugin by Tika

If we want finer control over the JSON documents we generate, we need to remove the attachment type (that is, the mapper-attachments plugin) and replace it with Tika.

It will allow supporting features like "store-origin": false, which won't require encoding the content in Base64 but will only generate JSON values for the extracted content.

We probably need to keep the original format of the generated JSON documents for backward compatibility.

Add a lightweight distribution of fsriver

We can provide a lightweight distribution of the river if the user has already added Tika and its dependencies as a plugin, for example with the mapper attachments plugin.

In that case, Tika will be in the node classloader and we don't need to provide it as part of the distribution.

It can be installed with:

bin/plugin -install fr.pilato.elasticsearch.river/fsriver/0.4.0-SNAPSHOT-light

Add option to store original file as binary

In addition to #39, we no longer store the original file by default, as was previously the case.

So we need to add a new option to allow it explicitly: "store_source": true.

It defaults to false (a sketch of enabling it follows).
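
A sketch of enabling it in a river definition (the store_source option name comes from this issue; the URL and other settings are illustrative and mirror the other examples on this page):

curl -XPUT 'localhost:9200/_river/mydocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp/docs",
    "update_rate": 3600000,
    "store_source": true
  }
}'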

add JSON-file support

It would be really cool if the fs river could monitor a directory with JSON files. If a new file is created, the JSON should be indexed. If a file is changed, it should be reindexed/updated. And if a file is deleted, the document should be removed from the index as well. A sketch of such a river definition is shown below.
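
A sketch of such a river definition, using the json_support option that appears in other issues on this page (the path, river name and update rate are illustrative):

curl -XPUT 'localhost:9200/_river/jsondocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/data/json",
    "update_rate": 60000,
    "includes": "*.json",
    "json_support": true
  }
}'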

Add SSH Support

You can now index files remotely using SSH:

  • FS URL: /tmp3
  • Server: mynode.mydomain.com
  • Username: username
  • Password: password
  • Protocol: ssh (default to local)
  • Update Rate: every hour (60 * 60 * 1000 = 3600000 ms)
  • Get only docs like *.doc, *.xls and *.pdf
curl -XPUT 'localhost:9200/_river/mysshriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp3",
    "server": "mynode.mydomain.com",
    "username": "username",
    "password": "password",
    "protocol": "ssh",
    "update_rate": 3600000,
    "includes": [ "*.doc" , "*.xls", "*.pdf" ]
  }
}'

How to do multiple records in .json

Hi, I'm trying to use your plugin but I can't seem to figure out how to have multiple records in a .json file in the folder.

If I have one .json file per item it works, but that is not usable for me.

However, I'm trying to have one .json file with hundreds of items rather than hundreds of .json files.

Thanks

Vinny

Indexing on Windows - Index is not modified on adding and deleting files - folders give index errors

I'm using:

  • ES9.0.3
  • Mapper-attachments plugin 1.8.0
  • FSRiver plugin 0.3.0

I create an index:

curl -XPUT 'localhost:9200/documents' -d '{
  "document" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "author" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          },
          "name" : {
            "type" : "string"
          },
          "date" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "keywords" : {
            "type" : "string"
          },
          "content_type" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      },
      "name" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "pathEncoded" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "postDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "rootpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "virtualpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "filesize" : {
        "type" : "long"
      }
    }
  }
}'

I create a river:

curl -XPUT 'localhost:9200/_river/documentsriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "c:\\tmp",
    "update_rate": 10000,
    "includes": [ "*.docx" , "*.xlsx", "*.pdf", "*.pptx" ]
  },
  "index": {
    "index": "documents",
    "type": "document",
    "bulk_size": 50
  }
}'
  • I'm using Windows, so the url parameter uses "c:\\tmp".
  • The update rate is set rather high (every 10 seconds) because of testing.
  • The index used is "documents", as created.
  • The mapping type used in the index is "document" as created.

The c:\tmp folder contains 12 files (max size 4000KB, most of them around 100KB) which match the include pattern. When the river is added (and ES has been restarted before, to be sure all plugins are recognized and loaded) my documents index is filled with 13 entries (1 folder, 12 files). So far so good.

Adding files does not change the index

However, if I copy more files matching the pattern into the folder, they are not indexed. After waiting ten minutes, during which the river should have run 60 times, there is no change in the documents index. That there have been 60 runs can be verified with a simple REST call:

curl -XGET 'localhost:9200/_river/_search' -d '{}'

With the result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_meta",
                "_score": 1,
                "_source": {
                    "type": "fs",
                    "fs": {
                        "url": "c:\\tmp",
                        "update_rate": 10000,
                        "includes": [
                            "*.docx",
                            "*.xlsx",
                            "*.pdf",
                            "*.pptx"
                        ]
                    },
                    "index": {
                        "index": "documents",
                        "type": "document",
                        "bulk_size": 50
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_fsstatus",
                "_score": 1,
                "_source": {
                    "fs": {
                        "status": "STARTED"
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_status",
                "_score": 1,
                "_source": {
                    "ok": true,
                    "node": {
                        "id": "j1ViClzcQzSg4rqpgTseBQ",
                        "name": "Node1",
                        "transport_address": "inet[/192.31.142.25:9300]"
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_lastupdated",
                "_score": 1,
                "_source": {
                    "fs": {
                        "feedname": "documentsriver",
                        "lastdate": "2013-09-18T13:33:03.092Z",
                        "docadded": 0,
                        "docdeleted": 0
                    }
                }
            }
        ]
    }
}

The results show that the river has the status 'STARTED', and when I repeat this request the "lastdate" element in the last JSON object changes every 10 seconds.

Deleting files is recognized but does not change the index

The same happens when most of the files are deleted and just one of the original set is left. In this case the index still doesn't change: it still says there are 13 documents instead of 2 (1 folder, 1 file). But now, searching the _river index, we see this in the last JSON object:

....
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_lastupdated",
                "_score": 1,
                "_source": {
                    "fs": {
                        "feedname": "documentsriver",
                        "lastdate": "2013-09-18T13:38:25.274Z",
                        "docadded": 0,
                        "docdeleted": 11
                    }
                }
            }
....

So the river says 11 documents were deleted, but the index does not change, not even the version of the indexed documents. So, from the index, you cannot verify whether a file still exists.

In both cases above, the logfiles (loglevel DEBUG) do not give any feedback.

Adding folders raises an index error

Then the last case: a new subfolder is added to the folder C:\tmp. After adding the folder, the logs say:

[2013-09-18 15:45:27,942][WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Node1] [fs][documentsriver] Error while indexing content from c:\tmp

After removing the folder, the warnings disappear.

StringIndexOutOfBoundsException; -1; computeVirtualPathName

The exception for g:\feg\ is java.lang.StringIndexOutOfBoundsException: String index out of range: -1

It looks like 'computeVirtualPathName' throws this Exception.

In realPath.substring(stats.getRootPath().length() - offset), getRootPath() seems to return 0.

"url": "g:\\xxx",

g:\xxx

g:/xxx (which would be valid for java.io.File) does not work either.

Maybe a configuration issue?

Regards,
Christian

FSRiver error when "_source" is disabled

When I use the following mapping:

curl -X PUT "localhost:9201/docs/doc/_mapping" -d '{
"doc" : {
"_source" : {"enabled" : false}
}
}'

I get this error in the fsriver:
"Error while indexing content from /data/river"

Thanks

indexing remote documents via ssh

First of all sorry for my English.

About the issue: I have 3 machines, two Elasticsearch nodes and one document repository.
For example, let the Elasticsearch nodes have the IPs 192.168.55.11 (node 1) and 192.168.55.12 (node 2), and let the document repository have the IP 192.168.55.13 with a folder in the root directory at "/files/".

I downloaded the zip of fsriver 0.3.0-SNAPSHOT and extracted it into the plugins directory on both nodes.
Here is the ls -l output on one of them:

elastic/plugins/fsriver# ls -l
-rw-r--r--  1 root  wheel   35107 19 Jul 17:09 fsriver-0.3.0-SNAPSHOT.jar
-rw-r--r--  1 root  wheel  229086  7 Mar 09:23 jsch-0.1.49.jar

After that I stop Elasticsearch via curl -XPOST '192.168.55.11:9200/_cluster/nodes/_shutdown'

and then start it again via ./bin/elasticsearch on both nodes.

In the logs I see that fsriver is OK. The plugin is loaded and everything looks fine:

[2013-07-20 17:15:39,558][INFO ][plugins] [Prime] loaded [mapper-attachments, river-fs, analysis-morphology], sites [head]

After that I create a test index from the document repository machine (192.168.55.13):

curl -XPUT '192.168.55.11:9200/sshdocs/' -d '{}'

I got an OK answer from Elasticsearch, and then:

curl -XPUT '192.168.55.11:9200/_river/sshdocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/files",
    "server": "192.168.55.13",
    "username": "USER",
    "password": "PASS",
    "protocol": "ssh",
    "update_rate": 60000,
    "includes": [ "*.doc" , "*.xls", "*.pdf" ]
  }
}'

Elasticsearch tells me:

{"ok":true,"_index":"_river","_type":"sshdocs","_id":"_meta","_version":1}

And then I look at the index statistics and see:

  • size: 495b (890b)
  • docs: 0 (0)

The index has zero documents.

The logs have the following WARN:

[WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Sectant] [fs][sshdocs] Error while indexing content from /files

PS:
I have Elasticsearch 0.90.2
OS: FreeBSD 9.1
openjdk-7.25.15

In the Elasticsearch configuration I have disabled the zen discovery multicast option.

Can we add a feature to see both file content and file paths in human-readable format (text) with just one river, for text files?

  1. When I create a river with the following settings (note that JSON support is explicitly specified):
    curl -XDELETE 127.0.0.1:9200/_river/foo
    curl -XDELETE 127.0.0.1:9200/foo
    curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
        "type": "fs",
        "fs": {
            "name": "Foo Data",
            "url": "/Users/slodha/foo/content",
            "update_rate": 60000,
            "includes": "*.json",
            "json_support" : true
        },
        "index": {
            "index": "foo",
            "type": "foo",
           "bulk_size": 50
        }
    }'

and search on it with this query:

{
     "query": {
         "query_string": {
             "default_field": "_all",
             "query": "slodha"
         }
     }
}

I get results like:

"hits": [
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "18156b6b5a6b3a8e1ec5984f185e18",
    "_score": 6.9584246,
    "_source": {
      "sunnyVal": "slodha"
    }
  },
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "d7b4df4222e0d075d74ffde8aaa04a56",
    "_score": 6.901722,
    "_source": {
      "fileNameTest": "slodha"
    }
  }
]

Problem: I never get which file it belonged to, which I would definitely need in order to eventually find it in the filesystem.

When I do this:

curl -XDELETE 127.0.0.1:9200/_river/foo
curl -XDELETE 127.0.0.1:9200/foo
curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
  "type": "fs",
  "fs": {
    "name": "Foo Data",
    "url": "/Users/slodha/foo/content",
    "update_rate": 60000
  },
  "index": {
    "index": "city",
    "type": "city",
    "bulk_size": 50
  }
}'

Notice that here I do not use any JSON restriction.
I get results like this:

"hits": [
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "18156b6b5a6b3a8e1ec5984f185e18",
    "_score": 2.0015228,
    "_source": {
      "name": "slodha_1.json",
      "postDate": 1363384941000,
      "pathEncoded": "44d22b925f562f4e8d1d847253493336",
      "rootpath": "948cd64d775db4119962b5a36dd530",
      "virtualpath": "t/sunnyTest",
      "file": {
        "_name": "slodha_1.json",
        "content": "ewoic3VubnlWYWwiIDogInNsb2RoYSIgCn0K"
      }
    }
  },
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "d7b4df4222e0d075d74ffde8aaa04a56",
    "_score": 1.7533717,
    "_source": {
      "name": "file1.json",
      "postDate": 1363388628000,
      "pathEncoded": "99d79f46f1ce275b6b9152a0de54d5",
      "rootpath": "948cd64d775db4119962b5a36dd530",
      "virtualpath": "t/sunnyTest/sunnyTest2",
      "file": {
        "_name": "file1.json",
        "content": "ewoiZmlsZU5hbWVUZXN0IiA6ICJzbG9kaGEiCn0K"
      }
    }
  }
]

Now here I get the exact file paths in _source, but this time the content is Base64-encoded and not readable.

There should be a way for me to see both the content and the file paths as human-readable text with just one river.

-Sunny

New json document mapping for docs

With issue #38 closed (replace the mapper-attachments plugin with Tika), we can now clean up the JSON structure we generate when indexing documents with FSRiver.

In version 0.3.0, the JSON mapping is the following:

{
  "doc" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "author" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          },
          "name" : {
            "type" : "string"
          },
          "date" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "keywords" : {
            "type" : "string"
          },
          "content_type" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      },
      "name" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "pathEncoded" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "postDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "rootpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "virtualpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "filesize" : {
        "type" : "long"
      }
    }
  }
}

We can see that we have different levels of metadata and some of them are redundant.
Also, we use the keyword analyzer instead of leaving fields unindexed or unanalyzed.

The new structure will be:

{
  "doc" : {
    "properties" : {
      "content" : {
        "type" : "string",
        "store" : "yes"
      },
      "meta" : {
        "properties" : {
          "author" : {
              "type" : "string",
              "store" : "yes"
          },
          "title" : {
              "type" : "string",
              "store" : "yes"
          },
          "date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "keywords" : {
              "type" : "string",
              "store" : "yes"
          }
        }
      },
      "file" : {
        "properties" : {
          "content_type" : {
              "type" : "string",
              "analyzer" : "simple",
              "store" : "yes"
          },
          "last_modified" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "indexing_date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "filesize" : {
              "type" : "long",
              "store" : "yes"
          },
          "indexed_chars" : {
              "type" : "long",
              "store" : "yes"
          },
          "filename" : {
              "type" : "string",
              "analyzer" : "simple",
              "store" : "yes"
          },
          "url" : {
              "type" : "string",
              "store" : "yes",
              "index" : "no"
          }
        }
      },
      "path" : {
        "properties" : {
          "encoded" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "virtual" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "root" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "real" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          }
        }
      }
    }
  }
}

Reformat code (use spaces instead of tabs)

It's not a debate, it's a question: do you use tabs or spaces? In this project there is a majority of tabs (at least in the files I checked), but sometimes there are spaces.

Thanks.

Add exhaustive list of all parameters

For now, I'm simply digging into the source code to find what I need, but it would be easier with some documentation. I'm pretty sure I'm still missing some things. It would also be great to know the default values.

If I have some time, I'll take a look at it and propose one.

Suspend or restart FSRiver

If you need to stop a river, you can call the _stop endpoint:

curl 'localhost:9200/_river/mydocs/_stop'

To restart the river from the previous point, just call the _start endpoint:

curl 'localhost:9200/_river/mydocs/_start'

Add SSH port setting

Hello, would it be possible to set a port different from the default one for the SSH connection?

{
    "password": "******",
    "protocol": "ssh",
    "port": "2222",
    "update_rate": 600000
}
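
A more complete sketch of where such a setting could sit, next to the other SSH options from the earlier SSH example (the port option is the one requested here; the other values are illustrative):

curl -XPUT 'localhost:9200/_river/mysshriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp3",
    "server": "mynode.mydomain.com",
    "username": "username",
    "password": "password",
    "protocol": "ssh",
    "port": 2222,
    "update_rate": 600000
  }
}'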

Thank you
