dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)

Home Page: https://fscrawler.readthedocs.io/

License: Apache License 2.0

Languages: Java 89.67%, Shell 0.96%, Batchfile 0.19%, HTML 4.54%, Rich Text Format 4.61%, Dockerfile 0.04%
Topics: java, elasticsearch, crawler, tika

fscrawler's Introduction

File System Crawler for Elasticsearch

Welcome to the FS Crawler for Elasticsearch

This crawler helps to index binary documents such as PDF, OpenOffice and MS Office files.

Main features:

  • Crawls your local file system (or a mounted drive), indexing new files, updating existing ones and removing old ones.
  • Crawls remote file systems over SSH/FTP.
  • REST interface that lets you "upload" your binary documents to Elasticsearch (see the sketch after this list).
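
A minimal sketch of the REST upload, assuming a local FS Crawler instance with its REST service enabled on the default address (see the documentation linked above for the exact endpoint and options):

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"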

Latest versions

Current "most stable" versions are:

Elasticsearch      FS Crawler
6.x, 7.x, 8.x      2.10-SNAPSHOT


Documentation

The guide has been moved to ReadTheDocs.


Contribute

Works on my machine - and yours! Spin up a pre-configured, standardized dev environment for this repository by clicking on the button below.

Open in Gitpod

License


Read more about the Apache2 License.

Thanks

Thanks to JetBrains for the IntelliJ IDEA License!

Thanks to SonarCloud for the free analysis!


fscrawler's People

Contributors

babadofar, barts2108, cadm-frank, chrissound, circuitguy, coder-sa, dadoonet, dependabot-preview[bot], dependabot[bot], eternallybaffled, fgaujous, github-actions[bot], helsonxiao, iadcode, ian-cameron, it20one, janhoy, kikkauz, kneubi, koopmac, logicer16, mergify[bot], muraken720, quix0r, rhaist, shadiakiki1986, shahariaazam, tommylike, xcorail, ywjung


fscrawler's Issues

failed to get _meta from [fs]

Hi,

How do I fix this issue? I am using fsriver 0.4.0 and Elasticsearch 1.0.1.

[2014-03-03 13:03:12,359][INFO ][gateway                  ] [Orphan] recovered [3] indices into cluster_state
[2014-03-03 13:03:12,544][INFO ][node                     ] [Orphan] started
[2014-03-03 13:03:12,556][INFO ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Orphan] [fs][ocrdocs] Starting fs river scanning
[2014-03-03 13:03:12,559][WARN ][river                    ] [Orphan] failed to get _meta from [fs]/[ocrdocs]
java.lang.NoSuchMethodError: org.elasticsearch.action.admin.cluster.state.ClusterStateRequestBuilder.setFilterIndices([Ljava/lang/String;)Lorg/elasticsearch/action/admin/cluster/state/ClusterStateRequestBuilder;
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.isMappingExist(FsRiver.java:300)
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.pushMapping(FsRiver.java:315)
        at fr.pilato.elasticsearch.river.fs.river.FsRiver.start(FsRiver.java:229)
        at org.elasticsearch.river.RiversService.createRiver(RiversService.java:148)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
        at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
        at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:695)

Thanks a lot

Meta data

Indexed files don't contain any metadata, only the name and content. I need the author and keywords. Where is the issue?

I have version 0.0.3:
{
  "_index": "projects",
  "_type": "doc",
  "_id": "1fdc355779fb5bb6bab0d341207b1e96",
  "_score": 0.0560452,
  "_source": {
    "name": "test.xlsx",
    "postDate": 1286880677000,
    "pathEncoded": "3b248e2f08548742290f371652fb",
    "rootpath": "70e1733efe56e2a6edcb96632119b4b",
    "virtualpath": "s/data",
    "file": {
      "_name": "test.xlsx",
      "content": "UEsDB..."
    }
  }
}

Tests are non-deterministic

Depending on the environment, some tests take longer than the allowed timeout. I'm running on a 2011 MacBook Air and some tests fail, but if I re-run them they all pass.

What would you think of retrying failed tests?
Or do you have other ideas?

Store origin URL in documents

We would like to have the original URL (full path file://mydir/mydoc.txt) stored in a source_url field.

It would allow indexing only the content, without needing to store the _source document itself.
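
A sketch of what an indexed document could then look like (the source_url field name comes from this request; the surrounding structure is purely illustrative, borrowed from the examples elsewhere on this page):

{
  "name": "mydoc.txt",
  "source_url": "file://mydir/mydoc.txt",
  "file": {
    "_name": "mydoc.txt",
    "content": "..."
  }
}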

NoSuchMethodError when using FS river with ES 1.0.x

[2014-03-03 17:11:58,263][WARN ][river ] [Frank Payne] failed to get _meta from [fs]/[ram2docs]
java.lang.NoSuchMethodError: org.elasticsearch.action.admin.cluster.state.ClusterStateRequestBuilder.setFilterIndices([Ljava/lang/String;)Lorg/elasticsearch/action/admin/cluster/state/ClusterStateRequestBuilder;
at fr.pilato.elasticsearch.river.fs.river.FsRiver.isMappingExist(FsRiver.java:300)
at fr.pilato.elasticsearch.river.fs.river.FsRiver.pushMapping(FsRiver.java:315)
at fr.pilato.elasticsearch.river.fs.river.FsRiver.start(FsRiver.java:229)
at org.elasticsearch.river.RiversService.createRiver(RiversService.java:148)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Use BulkProcessor feature

A cool feature exists in Elasticsearch: the BulkProcessor.
We want to use it here instead of managing bulk requests by hand.

Problem on virtualpath on linux servers

On Linux, the virtualpath contains the last character of the fs river path.
Example:

curl -XPUT 'localhost:9201/_river/docs/_meta' -d '{
"type": "fs",
"fs": {
"name": "test-river",
"url": "/tmp/river",
"update_rate": 1000,
"includes": "*",
"excludes": "resume"
}
}'

The virtual path is "r/dir/" but should be "dir/".

Thank you

Files not correctly added/removed from index on Windows platform

Hello,

I don't understand why, when I delete files from the folder configured for the river, the documents are not deleted from the index.

The river seems correctly configured: each file with an update date later than the last synchronisation date is picked up and added or updated in the index.

Here is the river configuration:

{
  "type": "fs",
  "fs": {
    "url": "/json-river",
    "update_rate": 10000,
    "json_support": true,
    "filename_as_id": true
  },
  "index": {
    "index": "baseproduit",
    "type": "produit",
    "bulk_size": 50
  }
}

I added some logging in FsRiver and I can see that the collection "esFiles" is always empty:

line 520:
Collection esFiles = getFileDirectory(filepath);

So the method esDelete is never called in the loop:
for (String esfile : esFiles) {

FsRiver versions used: 0.4.0-SNAPSHOT and 0.3.0.
Should I configure something else to get it to work?

Thank you very much for your help.

Only some files are indexed

I have a folder with 200,000 files. This plugin indexed only 4420 documents (folders + files). I get this total from {"query":{"match_all":{}}}. Where is the issue?
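
For reference, a sketch of how that total can be obtained (the index name projects is only taken from the bulk errors quoted below; the count is reported in hits.total):

curl -XGET 'localhost:9200/projects/_search' -d '{"query":{"match_all":{}},"size":0}'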

Does this plugin skip files that are not supported by Tika, or files that raise errors in Tika?

I do get some errors, for example:

org.elasticsearch.index.mapper.MapperParsingException: Failed to extract [100000] characters of text for [thirdparty-licenses.xml]
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:311)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:507)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:449)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:486)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:430)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:318)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.tika.exception.TikaException: XML parse error
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:421)
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:309)
... 11 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 45; columnNumber: 97; Element type "module" must be followed by either attribute specifications, ">" or "/>".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.seekCloseOfStartTag(XMLDocumentFragmentScannerImpl.java:1355)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:261)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2717)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
... 16 more
[2013-04-04 16:49:49,728][WARN ][river.fs ] [Physter] [fs][projects] Failed to execute failure in bulk execution:
[23]: index [projects], type [doc], id [f957de3fec56b2b2a4268db95f6161], message [MapperParsingException[Failed to extract [100000] characters of text for [jms-queue.xml.vm]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "role" must be followed by either attribute specifications, ">" or "/>".]; ]
[24]: index [projects], type [doc], id [f957de3fec56b2b2a4268db95f6161], message [MapperParsingException[Failed to extract [100000] characters of text for [jms-queue.xml.vm]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "role" must be followed by either attribute specifications, ">" or "/>".]; ]
[56]: index [projects], type [doc], id [26501b64741d8ce415c677feae4bf5], message [MapperParsingException[Failed to extract [100000] characters of text for [thirdparty-licenses.xml]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "module" must be followed by either attribute specifications, ">" or "/>".]; ]
[58]: index [projects], type [doc], id [26501b64741d8ce415c677feae4bf5], message [MapperParsingException[Failed to extract [100000] characters of text for [thirdparty-licenses.xml]]; nested: TikaException[XML parse error]; nested: SAXParseException[Element type "module" must be followed by either attribute specifications, ">" or "/>".]; ]

How to set DEBUG log level?

My log file shows only this WARN message. Does fsriver log at a more detailed level?

[2013-04-12 15:56:04,709][WARN ][river.fs                 ] [Physter] [fs][projects] Error while indexing content from /projects

logging.yml:

rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG
  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN
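
One way to get more detail would be to raise the river's logger to DEBUG in logging.yml. This is only a sketch: the river.fs logger name is an assumption taken from the [river.fs] prefix in the warning above.

logger:
  # assumption: the FS river logs under river.fs, as suggested by the warning prefix above
  river.fs: DEBUG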

Replace mapper-attachment plugin by Tika

If we want finer control over the JSON documents we generate, we need to remove the attachment type (that is, the mapper-attachments plugin) and replace it with Tika.

It will allow supporting features like "store-origin": false, which won't require encoding the content in Base64 but will only generate JSON values for the extracted content.

We probably need to keep the original format of the generated JSON documents for backward compatibility.

Add a lightweight distribution of fsriver

We can provide a lightweight distribution of the river if the user has already added Tika and its dependencies as a plugin, for example with the mapper attachments plugin.

In that case, Tika will be in the node classloader and we don't need to provide it as part of the distribution.

It can be installed with:

bin/plugin -install fr.pilato.elasticsearch.river/fsriver/0.4.0-SNAPSHOT-light

Add option to store original file as binary

In addition to #39, we no longer store the original file by default, as was previously the case.

So we need to add a new option to allow it explicitly: "store_source": true.

It defaults to false (a sketch of enabling it follows).
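
A sketch of enabling it in a river definition (the store_source option name comes from this issue; the URL and other settings are illustrative and mirror the other examples on this page):

curl -XPUT 'localhost:9200/_river/mydocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp/docs",
    "update_rate": 3600000,
    "store_source": true
  }
}'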

add JSON-file support

It would be really cool if the fs river could monitor a directory with JSON files. If a new file is created, the JSON should be indexed. If a file is changed, it should be reindexed/updated. And if a file is deleted, the document should be removed from the index as well. A sketch of such a river definition is shown below.
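
A sketch of such a river definition, using the json_support option that appears in other issues on this page (the path, river name and update rate are illustrative):

curl -XPUT 'localhost:9200/_river/jsondocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/data/json",
    "update_rate": 60000,
    "includes": "*.json",
    "json_support": true
  }
}'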

Add SSH Support

You can now index files remotely using SSH:

  • FS URL: /tmp3
  • Server: mynode.mydomain.com
  • Username: username
  • Password: password
  • Protocol: ssh (default to local)
  • Update Rate: every hour (60 * 60 * 1000 = 3600000 ms)
  • Get only docs like *.doc, *.xls and *.pdf
curl -XPUT 'localhost:9200/_river/mysshriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp3",
    "server": "mynode.mydomain.com",
    "username": "username",
    "password": "password",
    "protocol": "ssh",
    "update_rate": 3600000,
    "includes": [ "*.doc" , "*.xls", "*.pdf" ]
  }
}'

How to do multiple records in .json

Hi, I'm trying to use your plugin but I can't seem to figure out how to have multiple records in a .json file in the folder.

If I have one .json file per item it works, but that is not usable for me.

However, I'm trying to have one .json file with hundreds of items rather than hundreds of .json files.

Thanks

Vinny

Indexing on Windows - Index is not modified on adding and deleting files - folders give index errors

I'm using:

  • ES9.0.3
  • Mapper-attachments plugin 1.8.0
  • FSRiver plugin 0.3.0

I create an index:

curl -XPUT 'localhost:9200/documents' -d '{
  "document" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "author" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          },
          "name" : {
            "type" : "string"
          },
          "date" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "keywords" : {
            "type" : "string"
          },
          "content_type" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      },
      "name" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "pathEncoded" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "postDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "rootpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "virtualpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "filesize" : {
        "type" : "long"
      }
    }
  }
}'

I create a river:

curl -XPUT 'localhost:9200/_river/documentsriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "c:\\tmp",
    "update_rate": 10000,
    "includes": [ "*.docx" , "*.xlsx", "*.pdf", "*.pptx" ]
  },
  "index": {
    "index": "documents",
    "type": "document",
    "bulk_size": 50
  }
}'
  • I'm using Windows, so the url parameter uses "c:\\tmp".
  • The update rate is set rather high (every 10 seconds) because of testing.
  • The index used is "documents", as created.
  • The mapping type used in the index is "document" as created.

The c:\tmp folder contains 12 files (max size 4000KB, most of them around 100KB) which match the include pattern. When the river is added (and ES has been restarted before, to be sure all plugins are recognized and loaded) my documents index is filled with 13 entries (1 folder, 12 files). So far so good.

Adding files does not change the index

However, if I copy more files matching the pattern into the folder, they are not indexed. After waiting ten minutes, during which the river should have run 60 times, there is no change in the documents index. That there have been 60 runs can be verified with a simple REST call:

curl -XGET 'localhost:9200/_river/_search' -d '{}'

With the result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_meta",
                "_score": 1,
                "_source": {
                    "type": "fs",
                    "fs": {
                        "url": "c:\\tmp",
                        "update_rate": 10000,
                        "includes": [
                            "*.docx",
                            "*.xlsx",
                            "*.pdf",
                            "*.pptx"
                        ]
                    },
                    "index": {
                        "index": "documents",
                        "type": "document",
                        "bulk_size": 50
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_fsstatus",
                "_score": 1,
                "_source": {
                    "fs": {
                        "status": "STARTED"
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_status",
                "_score": 1,
                "_source": {
                    "ok": true,
                    "node": {
                        "id": "j1ViClzcQzSg4rqpgTseBQ",
                        "name": "Node1",
                        "transport_address": "inet[/192.31.142.25:9300]"
                    }
                }
            },
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_lastupdated",
                "_score": 1,
                "_source": {
                    "fs": {
                        "feedname": "documentsriver",
                        "lastdate": "2013-09-18T13:33:03.092Z",
                        "docadded": 0,
                        "docdeleted": 0
                    }
                }
            }
        ]
    }
}

The results show that the river has the status 'STARTED', and when I repeat this request the "lastdate" element in the last JSON object changes every 10 seconds.

Deleting files is recognized but does not change the index

The same happens when most of the files are deleted and just one of the original set is left. In this case the index still doesn't change: it still says there are 13 documents instead of 2 (1 folder, 1 file). But now, searching the _river index, we see this in the last JSON object:

....
            {
                "_index": "_river",
                "_type": "documentsriver",
                "_id": "_lastupdated",
                "_score": 1,
                "_source": {
                    "fs": {
                        "feedname": "documentsriver",
                        "lastdate": "2013-09-18T13:38:25.274Z",
                        "docadded": 0,
                        "docdeleted": 11
                    }
                }
            }
....

So the river says 11 documents were deleted, but the index does not change, not even the version of the indexed documents. So, from the index, you cannot verify whether a file still exists.

In both cases above, the logfiles (loglevel DEBUG) do not give any feedback.

Adding folders raises an index error

Then the last case: a new subfolder is added to the folder C:\tmp. After adding the folder, the logs say:

[2013-09-18 15:45:27,942][WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Node1] [fs][documentsriver] Error while indexing content from c:\tmp

After removing the folder, the warnings disappear.

StringIndexOutOfBoundsException; -1; computeVirtualPathName

The exception for g:\feg\ is java.lang.StringIndexOutOfBoundsException: String index out of range: -1

It looks like 'computeVirtualPathName' throws this Exception.

In realPath.substring(stats.getRootPath().length() - offset), getRootPath() seems to return 0.

"url": "g:\\xxx",

g:\xxx

g:/xxx (which would be valid for java.io.File) does not work either.

Maybe a configuration issue?

Regards,
Christian

FSRiver error when "_source" is disabled

When I use the following mapping:

curl -X PUT "localhost:9201/docs/doc/_mapping" -d '{
"doc" : {
"_source" : {"enabled" : false}
}
}'

I get this error in the fsriver:
"Error while indexing content from /data/river"

Thanks

indexing remote documents via ssh

First of all sorry for my English.

About the issue: I have 3 machines, two Elasticsearch nodes and one document repository.
For example, let the Elasticsearch nodes have the IPs 192.168.55.11 (node 1) and 192.168.55.12 (node 2), and let the document repository have the IP 192.168.55.13 with a folder in the root directory at "/files/".

I downloaded the zip of fsriver 0.3.0-SNAPSHOT and extracted it into the plugins directory on both nodes.
Here is the ls -l output on one of them:

elastic/plugins/fsriver# ls -l
-rw-r--r--  1 root  wheel   35107 19 Jul 17:09 fsriver-0.3.0-SNAPSHOT.jar
-rw-r--r--  1 root  wheel  229086  7 Mar 09:23 jsch-0.1.49.jar

After that I stop Elasticsearch via curl -XPOST '192.168.55.11:9200/_cluster/nodes/_shutdown'

and then start it again via ./bin/elasticsearch on both nodes.

In the logs I see that fsriver is OK. The plugin is loaded and everything looks fine:

[2013-07-20 17:15:39,558][INFO ][plugins] [Prime] loaded [mapper-attachments, river-fs, analysis-morphology], sites [head]

After that I create a test index from the document repository machine (192.168.55.13):

curl -XPUT '192.168.55.11:9200/sshdocs/' -d '{}'

I got an OK answer from Elasticsearch, and then:

curl -XPUT '192.168.55.11:9200/_river/sshdocs/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/files",
    "server": "192.168.55.13",
    "username": "USER",
    "password": "PASS",
    "protocol": "ssh",
    "update_rate": 60000,
    "includes": [ "*.doc" , "*.xls", "*.pdf" ]
  }
}'

Elasticsearch tells me:

{"ok":true,"_index":"_river","_type":"sshdocs","_id":"_meta","_version":1}

And then I look at the index statistics and see:

  • size: 495b (890b)
  • docs: 0 (0)

The index has zero documents.

The logs have the following WARN:

[WARN ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Sectant] [fs][sshdocs] Error while indexing content from /files

PS:
I have Elasticsearch 0.90.2
OS: FreeBSD 9.1
openjdk-7.25.15

In the Elasticsearch configuration I have disabled the zen discovery multicast option.

Can we add a feature to see both file content and file paths in human-readable format (text) with just one river, for text files?

  1. When I create a river with the following settings (note that JSON support is explicitly specified):
    curl -XDELETE 127.0.0.1:9200/_river/foo
    curl -XDELETE 127.0.0.1:9200/foo
    curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
        "type": "fs",
        "fs": {
            "name": "Foo Data",
            "url": "/Users/slodha/foo/content",
            "update_rate": 60000,
            "includes": "*.json",
            "json_support" : true
        },
        "index": {
            "index": "foo",
            "type": "foo",
           "bulk_size": 50
        }
    }'

and search on it with this query:

{
     "query": {
         "query_string": {
             "default_field": "_all",
             "query": "slodha"
         }
     }
}

I get results like:

"hits": [
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "18156b6b5a6b3a8e1ec5984f185e18",
    "_score": 6.9584246,
    "_source": {
      "sunnyVal": "slodha"
    }
  },
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "d7b4df4222e0d075d74ffde8aaa04a56",
    "_score": 6.901722,
    "_source": {
      "fileNameTest": "slodha"
    }
  }
]

Problem: I never get which file it belonged to, which I would definitely need in order to eventually find it in the filesystem.

When I do this:

curl -XDELETE 127.0.0.1:9200/_river/foo
curl -XDELETE 127.0.0.1:9200/foo
curl -XPUT 'localhost:9200/_river/foo/_meta' -d '{
  "type": "fs",
  "fs": {
    "name": "Foo Data",
    "url": "/Users/slodha/foo/content",
    "update_rate": 60000
  },
  "index": {
    "index": "city",
    "type": "city",
    "bulk_size": 50
  }
}'

Notice that here I do not use any JSON restriction.
I get results like this:

"hits": [
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "18156b6b5a6b3a8e1ec5984f185e18",
    "_score": 2.0015228,
    "_source": {
      "name": "slodha_1.json",
      "postDate": 1363384941000,
      "pathEncoded": "44d22b925f562f4e8d1d847253493336",
      "rootpath": "948cd64d775db4119962b5a36dd530",
      "virtualpath": "t/sunnyTest",
      "file": {
        "_name": "slodha_1.json",
        "content": "ewoic3VubnlWYWwiIDogInNsb2RoYSIgCn0K"
      }
    }
  },
  {
    "_index": "foo",
    "_type": "foo",
    "_id": "d7b4df4222e0d075d74ffde8aaa04a56",
    "_score": 1.7533717,
    "_source": {
      "name": "file1.json",
      "postDate": 1363388628000,
      "pathEncoded": "99d79f46f1ce275b6b9152a0de54d5",
      "rootpath": "948cd64d775db4119962b5a36dd530",
      "virtualpath": "t/sunnyTest/sunnyTest2",
      "file": {
        "_name": "file1.json",
        "content": "ewoiZmlsZU5hbWVUZXN0IiA6ICJzbG9kaGEiCn0K"
      }
    }
  }
]

Now here I get the exact file paths in _source, but this time the content is Base64-encoded and not readable.

There should be a way for me to see both the content and the file paths as human-readable text with just one river.

-Sunny

New json document mapping for docs

With issue #38 closed (replace the mapper-attachments plugin with Tika), we can now clean up the JSON structure we generate when indexing documents with FSRiver.

In version 0.3.0, the JSON mapping is the following:

{
  "doc" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "file" : {
            "type" : "string",
            "store" : "yes",
            "term_vector" : "with_positions_offsets"
          },
          "author" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string",
            "store" : "yes"
          },
          "name" : {
            "type" : "string"
          },
          "date" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "keywords" : {
            "type" : "string"
          },
          "content_type" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      },
      "name" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "pathEncoded" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "postDate" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      },
      "rootpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "virtualpath" : {
        "type" : "string",
        "analyzer" : "keyword"
      },
      "filesize" : {
        "type" : "long"
      }
    }
  }
}

We can see that we have different levels of metadata and some of them are redundant.
Also, we use the keyword analyzer instead of leaving fields unindexed or unanalyzed.

The new structure will be:

{
  "doc" : {
    "properties" : {
      "content" : {
        "type" : "string",
        "store" : "yes"
      },
      "meta" : {
        "properties" : {
          "author" : {
              "type" : "string",
              "store" : "yes"
          },
          "title" : {
              "type" : "string",
              "store" : "yes"
          },
          "date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "keywords" : {
              "type" : "string",
              "store" : "yes"
          }
        }
      },
      "file" : {
        "properties" : {
          "content_type" : {
              "type" : "string",
              "analyzer" : "simple",
              "store" : "yes"
          },
          "last_modified" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "indexing_date" : {
              "type" : "date",
              "format" : "dateOptionalTime",
              "store" : "yes"
          },
          "filesize" : {
              "type" : "long",
              "store" : "yes"
          },
          "indexed_chars" : {
              "type" : "long",
              "store" : "yes"
          },
          "filename" : {
              "type" : "string",
              "analyzer" : "simple",
              "store" : "yes"
          },
          "url" : {
              "type" : "string",
              "store" : "yes",
              "index" : "no"
          }
        }
      },
      "path" : {
        "properties" : {
          "encoded" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "virtual" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "root" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          },
          "real" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
          }
        }
      }
    }
  }
}

Reformat code (use spaces instead of tabs)

It's not a debate, it's a question: do you use tabs or spaces? In this project there is a majority of tabs (at least in the files I checked), but sometimes there are spaces.

Thanks.

Add exhaustive list of all parameters

For now, I'm simply digging into the source code to find what I need, but it would be easier with some documentation. I'm pretty sure I'm still missing some things. It would also be great to know the default values.

If I have some time, I'll take a look at it and propose one.

Suspend or restart FSRiver

If you need to stop a river, you can call the _stop endpoint:

curl 'localhost:9200/_river/mydocs/_stop'

To restart the river from the previous point, just call the _start endpoint:

curl 'localhost:9200/_river/mydocs/_start'

Add SSH port setting

Hello, would it be possible to set a port different from the default one for the SSH connection?

{
    "password": "******",
    "protocol": "ssh",
    "port": "2222",
    "update_rate": 600000
}
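
A more complete sketch of where such a setting could sit, next to the other SSH options from the earlier SSH example (the port option is the one requested here; the other values are illustrative):

curl -XPUT 'localhost:9200/_river/mysshriver/_meta' -d '{
  "type": "fs",
  "fs": {
    "url": "/tmp3",
    "server": "mynode.mydomain.com",
    "username": "username",
    "password": "password",
    "protocol": "ssh",
    "port": 2222,
    "update_rate": 600000
  }
}'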

Thank you
