
elasticsearch-imap's Introduction

elasticsearch-importer-imap Elasticsearch 2.x

Support Elasticsearch 5.0 readiness and keep the Elasticsearch IMAP importer free. Currently the IMAP importer only works with Elasticsearch 2, and it costs a lot of time and effort to update and maintain it for Elasticsearch 5. Donations welcome!

Pledgie:
Click here to lend your support to: Elasticsearch IMAP Importer and make a donation at pledgie.com !

Paypal:
Donate

Patreon:
https://patreon.com/salyh


Import e-mails from IMAP (and POP3) into Elasticsearch 2.x

E-Mail [email protected]

Twitter @hendrikdev22

This importer connects to IMAP4 or POP3 servers, polls your mail and indexes it. The e-mails are never modified or removed from the server. After the first initial full load, the importer tracks which mails are new or deleted and updates the index only for those mails.

Features:

  • Incremental indexing of e-mails from an IMAP or POP3 server
  • Indexing of attachments is supported
  • Support for UTF-7 encoded e-mails (through jutf7)
  • SSL, STARTTLS and SASL are supported (through the JavaMail API)
  • IMAP only: folders which should be indexed can be specified with a regex pattern
  • IMAP only: subfolders can also be indexed (full traversal of all folders)
  • No special server capabilities needed
  • Bulk indexing
  • Also works with Gmail, iCloud, Yahoo, etc.

The importer acts as a disconnected client: it polls the server, and for every indexing run a new server connection is opened and, after the work is done, closed again.

Installation

Prerequisites:
  • Java 7 or 8
  • Elasticsearch 2.x
  • At least one IMAP4 or POP3 server to connect to

Download the .zip or .tar.gz from https://github.com/salyh/elasticsearch-river-imap/releases/latest (version 0.8.6 or higher only) and unpack it somewhere.

Then run

  • bin/importer.sh [-e] <config-file>
    • -e: start an embedded Elasticsearch node (for testing only!)
    • config-file: path to the JSON configuration file (see the next section)
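
For example, assuming your configuration was saved as my-imap-config.json (the file name is just a placeholder):

  • bin/importer.sh -e my-imap-config.json (quick local test against an embedded Elasticsearch node)
  • bin/importer.sh my-imap-config.json (run against the Elasticsearch cluster configured in the file)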

Configuration

Put the following configuration in a file and store it somewhere with a .json extension:
{
   "mail.store.protocol":"imap",
   "mail.imap.host":"imap.server.com",
   "mail.imap.port":993,
   "mail.imap.ssl.enable":true,
   "mail.imap.connectionpoolsize":"3",
   "mail.debug":"false",
   "mail.imap.timeout":10000,
   "users":["[email protected]"],
   "passwords":["secret"],
   "schedule":null,
   "interval":"60s",
   "threads":5,
   "folderpattern":null,
   "bulk_size":100,
   "max_bulk_requests":"30",
   "bulk_flush_interval":"5s",
   "mail_index_name":"imapriverdata",
   "mail_type_name":"mail",
   "with_striptags_from_textcontent":true,
   "with_attachments":false,
   "with_text_content":true,
   "with_flag_sync":true,
   "keep_expunged_messages":false,
   "index_settings" : null,
   "type_mapping" : null,
   "user_source" : null,
   "ldap_url" : null,
   "ldap_user" : null,
   "ldap_password" : null,
   "ldap_base" : null,
   "ldap_name_field" : "uid",
   "ldap_password_field" : null,
   "ldap_refresh_interval" : null,
   "master_user" : null,
   "master_password" : null,

"client.transport.ignore_cluster_name": false, "client.transport.ping_timeout": "5s", "client.transport.nodes_sampler_interval": "5s", "client.transport.sniff": true, "cluster.name": "elasticsearch", "elasticsearch.hosts": "localhost:9300,127.0.0.1:9300"

}

  • mail.* - see the JavaMail documentation https://javamail.java.net/nonav/docs/api/ (default: none)
  • user - user name for server login (default: null) - deprecated, use users
  • password - password for server login (default: null) - deprecated, use passwords
  • users - array of user names for server login (default: null)
  • passwords - array of passwords for server login (default: null)
  • schedule - a cron expression like 0/3 0-59 0-23 ? * * (default: null)
  • interval - if no schedule is set then this will be the indexing interval (default: 60s)
  • threads - number of threads for parallel indexing (must be 1 or higher) (default: 5)
  • folderpattern - IMAP only: regular expression that selects which folders should be indexed (default: null)
  • bulk_size - the length of each bulk index request submitted (default: 100)
  • max_bulk_requests - the maximum number of concurrent bulk requests (default: 30)
  • bulk_flush_interval - the interval at which the bulk processor flushes outstanding documents (default: 5s)
  • mail_index_name - name of the index which holds the mails (default: imapriverdata)
  • mail_index_name_strategy - how the index name is composed for each user (default: all_in_one)
    • all_in_one - put all mails from all users into one index; the index name is mail_index_name
    • username - put the mails of each user into an index named after the user's username
    • username_crop - like username, but the username is cropped at the @ sign
    • prefixed_username - like username, but the index name is prefixed with mail_index_name
    • prefixed_username_crop - like prefixed_username, but the username is cropped at the @ sign
  • mail_type_name - name of the type (default: mail)
  • with_striptags_from_textcontent - if true then html/xml tags are stripped from text content (default: true)
  • with_attachments - if true then attachments will be indexed (default: false)
  • with_text_content - if true then the text content of the mail is indexed (default: true)
  • with_flag_sync - IMAP only: if true then message flag changes will be detected and indexed. May be slow for very large mailboxes. (default: true)
  • keep_expunged_messages - if true then messages which are expunged/deleted on the server will be kept in Elasticsearch. (default: false)
  • index_settings - optional settings for the Elasticsearch index
  • type_mapping - optional mapping for the Elasticsearch index type
  • headers_to_fields - array with e-mail header names to include as proper fields. To create a legal field name, the header name is prefixed with header_, lowercased and has all non-alphanumeric characters replaced with _. For example, an input of ["Message-ID"] will copy that header into a field with name header_message_id.
  • user_source - if "ldap" then query users and passwords from an LDAP directory; if null or missing, use users/passwords from the config file (default: null). See the combined configuration sketch below.
    • ldap_url - LDAP host and port (default: null), example: "ldap://ldaphostname:389"
    • ldap_user - LDAP user which is used to query IMAP users (default: null), example: "cn=Directory Manager"
    • ldap_password - LDAP password for ldap_user (default: null)
    • ldap_base - base DN to search for users (default: null), example: "ou=users,ou=accounts,dc=company,dc=org"
    • ldap_name_field - the field name (attribute) where the IMAP username is stored (default: "dn"), example: "uid"
    • ldap_password_field - the field name (attribute) of the IMAP password (default: "userPassword")
    • ldap_refresh_interval - refresh interval in minutes (default: "60"), set to "0" to disable automatic refreshing. If enabled, this will automatically refresh the users/passwords from LDAP every n minutes.
    • master_user - for Dovecot, a master user account can be supplied that can access all users' mailboxes, even if their passwords are encrypted (default: null)
    • master_password - master user password (default: null)
  • client.transport.* see https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html
  • cluster.name see https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/transport-client.html
  • elasticsearch.hosts Comma-separated list of Elasticsearch nodes/servers (mandatory)

Note: For POP3 only the "INBOX" folder is supported. This is a limitation of the POP3 protocol.
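
To make the less common options more concrete, here is a sketch of a configuration fragment that combines a cron schedule, a folder pattern, header extraction, a per-user index naming strategy and LDAP-based user lookup (so no users/passwords arrays are needed). It is an illustration only: all host names, patterns, DNs and credentials are placeholders, and only the option names are taken from the list above.

```json
{
   "mail.store.protocol":"imap",
   "mail.imap.host":"imap.example.com",
   "mail.imap.port":993,
   "mail.imap.ssl.enable":true,
   "schedule":"0 0/15 * ? * *",
   "folderpattern":"^(INBOX|Archive/.*)$",
   "headers_to_fields":["Message-ID", "List-Id"],
   "mail_index_name":"imapdata",
   "mail_index_name_strategy":"prefixed_username_crop",
   "mail_type_name":"mail",
   "user_source":"ldap",
   "ldap_url":"ldap://ldap.example.com:389",
   "ldap_user":"cn=Directory Manager",
   "ldap_password":"secret",
   "ldap_base":"ou=users,ou=accounts,dc=example,dc=org",
   "ldap_name_field":"uid",
   "ldap_refresh_interval":"60",
   "master_user":"dovecot-master",
   "master_password":"secret",
   "elasticsearch.hosts":"localhost:9300"
}
```

With this fragment the importer would poll every 15 minutes, index only INBOX and folders below Archive, copy the Message-ID and List-Id headers into header_message_id and header_list_id fields, and write the mails of a user such as jdoe@example.org into an index whose name is derived from imapdata plus the part of the username before the @ sign.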

Default Mapping Example

```json
{
  "mail" : {
    "properties" : {
      "attachmentCount" : { "type" : "long" },
      "bcc" : {
        "properties" : {
          "email" : { "type" : "string" },
          "personal" : { "type" : "string" }
        }
      },
      "cc" : {
        "properties" : {
          "email" : { "type" : "string" },
          "personal" : { "type" : "string" }
        }
      },
      "contentType" : { "type" : "string" },
      "flaghashcode" : { "type" : "integer" },
      "flags" : { "type" : "string" },
      "folderFullName" : { "type" : "string", "index" : "not_analyzed" },
      "folderUri" : { "type" : "string" },
      "from" : {
        "properties" : {
          "email" : { "type" : "string" },
          "personal" : { "type" : "string" }
        }
      },
      "headers" : {
        "properties" : {
          "name" : { "type" : "string" },
          "value" : { "type" : "string" }
        }
      },
      "mailboxType" : { "type" : "string" },
      "receivedDate" : { "type" : "date", "format" : "basic_date_time" },
      "sentDate" : { "type" : "date", "format" : "basic_date_time" },
      "size" : { "type" : "long" },
      "subject" : { "type" : "string" },
      "textContent" : { "type" : "string" },
      "to" : {
        "properties" : {
          "email" : { "type" : "string" },
          "personal" : { "type" : "string" }
        }
      },
      "uid" : { "type" : "long" }
    }
  }
}
```

For advanced mapping ideas look here:

Advanced Mapping Example (to be set manually using "type_mapping")

```json
{
  "mail" : {
    "properties" : {
      "textContent" : { "type" : "langdetect" },
      "email" : { "type" : "string", "index" : "not_analyzed" },
      "subject" : {
        "type" : "multi_field",
        "fields" : {
          "text" : { "type" : "string" },
          "raw" : { "type" : "string", "index" : "not_analyzed" }
        }
      },
      "personal" : {
        "type" : "multi_field",
        "fields" : {
          "title" : { "type" : "string" },
          "raw" : { "type" : "string", "index" : "not_analyzed" }
        }
      }
    }
  }
}
```

Content Example

```json
{
  "_index" : "imapriverdata",
  "_type" : "mail",
  "_id" : "50220::imap://test%[email protected]/import",
  "_score" : 1.0,
  "_source" : {
    "attachmentCount" : 0,
    "attachments" : null,
    "bcc" : null,
    "cc" : null,
    "contentType" : "text/plain; charset=ISO-8859-15",
    "flaghashcode" : 16,
    "flags" : [ "Recent" ],
    "folderFullName" : "test",
    "folderUri" : "imap://test%[email protected]/import",
    "from" : { "email" : "[email protected]", "personal" : null },
    "headers" : [
      { "name" : "Subject", "value" : "Suchagent Wohnung mieten in Berlin - 1 neues Objekt gefunden!" },
      { "name" : "Return-Path", "value" : "" },
      { "name" : "Content-Transfer-Encoding", "value" : "quoted-printable" },
      { "name" : "To", "value" : "[email protected]" },
      { "name" : "X-OfflineIMAP-1722382714-52656d6f7465-6165727a7465", "value" : "1248516496-0146849121575-v5.99.4" },
      { "name" : "Message-ID", "value" : "<[email protected]>" },
      { "name" : "Mime-Version", "value" : "1.0" },
      { "name" : "X-Gmail-Labels", "value" : "ablage,[email protected]" },
      { "name" : "X-GM-THRID", "value" : "1309162987234255956" },
      { "name" : "Delivered-To", "value" : "GMX delivery to [email protected]" },
      { "name" : "Reply-To", "value" : "[email protected]" },
      { "name" : "Date", "value" : "Fri, 18 Nov 2005 04:17:24 +0100 (MET)" },
      { "name" : "Auto-Submitted", "value" : "auto-generated" },
      { "name" : "Received", "value" : "(qmail invoked by alias); 18 Nov 2005 03:17:25 -0000" },
      { "name" : "Content-Type", "value" : "text/plain; charset=\"ISO-8859-15\"" },
      { "name" : "From", "value" : "[email protected]" }
    ],
    "mailboxType" : "IMAP",
    "popId" : null,
    "receivedDate" : 1132283845000,
    "sentDate" : 1132283844000,
    "size" : 3645,
    "subject" : "Suchagent Wohnung mieten in Berlin - 1 neues Objekt gefunden!",
    "textContent" : "Sehr geehrter Nutzer, ... JETZT AUCH IM FERNSEHEN: IMMOBILIENANGEBOTE FÜR HAMBURG UND UMGEBUNG!\r\n\tFinden Sie Ihre Wunschwohnung oder ...",
    "to" : [ { "email" : "[email protected]", "personal" : null } ],
    "uid" : 50220
  }
}
```

Indexing attachments

If you also want to index your mail attachments, look here:

  • #10 (comment)
  • #13
  • http://tinyurl.com/nbujv7h
  • https://github.com/salyh/elasticsearch-river-imap/blob/master/src/test/java/de/saly/elasticsearch/imap/AttachmentMapperTest.java
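
As a rough starting point, here is a sketch of a type_mapping that maps the content field of the attachments to the attachment type provided by the mapper-attachments plugin. It assumes that plugin is installed; the field names follow the default mapping shown above, everything else is an assumption and not a tested configuration.

```json
{
    "mail": {
        "properties": {
            "attachments": {
                "properties": {
                    "content": {
                        "type": "attachment"
                    }
                }
            }
        }
    }
}
```

Sub-field options such as title, author or content_type can be added as described in the mapper-attachments documentation; the attachment-related issues further below contain fuller examples.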

Contributors/Credits

  • Hans Jørgen Hoel (https://github.com/hansjorg)
  • Stefan Thies (https://github.com/megastef)
  • René Peinl (University Hof)

License

Copyright (C) 2014-2015 by Hendrik Saly (http://saly.de) and others.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


elasticsearch-imap's Issues

Installation as described in README not working - incorrect URL

Hi,

Installing the plugin does not work as described in the README:

bin/plugin -i river-imap -u http://dl.bintray.com/salyh/maven/de/saly/elasticsearch/plugin/elasticsearch-river-imap/0.0.7-b12/elasticsearch-river-imap-0.0.7-b12-plugin.zip

The only version available is 0.0.7-b11, which can be installed as such:

bin/plugin -i river-imap -u http://dl.bintray.com/salyh/maven/de/saly/elasticsearch/plugin/elasticsearch-river-imap/0.0.7-b11/elasticsearch-river-imap-0.0.7-b11-plugin.zip

Rgrds,
Eric

IMAP ES importer discards mails

In our project we are trying to use your "elasticsearch-importer-imap" (ES 1.x) as follows:
The IMAP users are read from LDAP
The importer logs into the individual accounts via a master user (Dovecot)
The number of threads and connections is already set to 1 to avoid concurrency problems
For some time now we have had the problem that the first run works and all mails are indexed as expected, but on subsequent runs errors keep occurring and the importer starts to remove mails from the index one by one until it is empty. The problems may only show up after some time, or once further mails have been sent or deleted.

The message from the mail server that triggers the errors when fetching mails is "A34 BAD Error in IMAP command UID FETCH: Invalid uidset".
I therefore suspect that mail IDs are sometimes assigned to the wrong users, which leads to fetch errors. Unfortunately I have not yet been able to find a bug in the source code that would lead to this state.

web ui to browse indexed emails?

I'm looking to see if there's any web ui component that could be used along with the imap plugin to browse and search indexed emails.

We have a lot of projects that send out an email and would like to send a copy of each email to our catch-all email account so that support and sales teams can be a little more efficient in troubleshooting issues.

any other thoughts or projects that you could recommend that do this type of thing?

thank you

pop3 connection fails

Hi,

I have a working imap connection but I'm required to switch to pop3. I've tried to use the following configuration:

"type":"imap",
"mail.store.protocol":"pop3",
"mail.imap.host":"cto-sdx01",
"mail.imap.port":110,
"mail.imap.ssl.enable":"false",
"mail.imap.connectionpoolsize":"3",
"mail.debug":"true",
"mail.imap.timeout":10000,
"user":"gkapitan",
"password":"xxxxxxx",
"schedule":null,
"interval":"60s",
"threads":5,
"folderpattern":null,
"bulk_size":100,
"max_bulk_requests":"30",
"bulk_flush_interval":"5s",
"mail_index_name":"imapriverdata",
"mail_type_name":"mail",
"with_striptags_from_textcontent":true,
"with_attachments":true,
"with_text_content":true,
"with_flag_sync":true,
"index_settings":null,
"type_mapping":null,
"headers_to_fields":"true"

I can simply telnet the port and log in via command line using parameters above, however, from elasticsearch I get:
[2014-07-25 15:17:19,420][ERROR][de.saly.elasticsearch.support.MailFlowJob] Error in mail flow job group1.org.elasticsearch.river.RiverName@1fa28fae--578680900: javax.mail.AuthenticationFailedException: failed to connect job
javax.mail.AuthenticationFailedException: failed to connect
at javax.mail.Service.connect(Service.java:401)
at javax.mail.Service.connect(Service.java:245)
at javax.mail.Service.connect(Service.java:265)
at de.saly.elasticsearch.mailsource.ParallelPollingPOPMailSource.fetch(ParallelPollingPOPMailSource.java:358)
at de.saly.elasticsearch.mailsource.ParallelPollingPOPMailSource.fetchAll(ParallelPollingPOPMailSource.java:115)
at de.saly.elasticsearch.support.MailFlowJob.execute(MailFlowJob.java:62)
at de.saly.elasticsearch.support.MailFlowJob.execute(MailFlowJob.java:86)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)

I don't see the plugin reaching out to the POP3 server for authentication.
I did try the following combinations for the configuration (protocol, port, ssl):
pop3,110, false
pop3, 995, true
pop3s, 995, true
pop3, 110, true

without success. Again, IMAP works. I'm missing something but I don't know what.
Plugin: elasticsearch-river-imap-0.0.7-b20
elasticsearch-1.2.1

Any help much appreciated,
Gabriel

Issue with accessing text content of attachments

Hi,

I'm having trouble accessing the Tika-extracted text content of attachments when downloaded with river-imap.

I have installed mapper-attachments and also the langdetect and icu_tokenizer plugins. I'm still on ElasticSearch 1.1.0 now and thus using river-imap 0.7b20 and mapper-attachments 2.0.0.

To test if the mapper-attachments plugin is working and I can extract text, I have created an index called test, created a mapping for a type called mapper. I have based this test on what is described here: https://gist.github.com/dadoonet/5310075

POST http://localhost:9200/test:

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "fulltext_analyzer_icu": {
                        "type": "custom",
                        "char_filter": [
                            "html_strip"
                        ],
                        "filter": [
                            "lowercase",
                            "type_as_payload"
                        ],
                        "tokenizer": "icu_tokenizer"
                    }
                }
            },
            "number_of_replicas": "0",
            "number_of_shards": "5"
        }
    }
}

POST http://localhost:9200/test/mapper/_mapping:

{
    "mapper" : {

      "dynamic" : "strict",
        "properties" : {
            "file" : {
                "type" : "attachment",
                "path" : "full",
                "fields" : {
                    "file" : {
                      "type" : "string",
                      "store" : true,
                      "term_vector" : "with_positions_offsets"
                    },
                  "name" : {"store" : "yes"},
                    "title" : {"store" : "yes"},
                    "date" : {"store" : "yes"},
                    "author" : {"analyzer" : "fulltext_analyzer_icu"},
                    "keywords" : {"store" : "yes"},
                    "content_type" : {"store" : "yes"},
                    "content_length" : {"store" : "yes"},
                    "language" : {"store" : "yes"}
                }
            }
        }
    }
}

I can index a document:

PUT http://localhost:9200/test/mapper/66d2006d-d269-5de7-5aa8-53f5b7b730e8:

{
    "file": {
        "content": "UEsDBBQABgAIAAAAIQDtO5Wb8wE ...(shortened)... mAAAHdvcmQvc3R5bGVzLnhtbFBLAQItABQABgAIAAAAIQAUnjhLYgsAAINRAAAaAAAAAAAAAAAAAAAAAKprAAB3b3JkL3N0eWxlc1dpdGhFZmZlY3RzLnhtbFBLAQItABQABgAIAAAAIQAgxZ8kzgEAAP8HAAAUAAAAAAAAAAAAAAAAAER3AAB3b3JkL3dlYlNldHRpbmdzLnhtbFBLAQItABQABgAIAAAAIQAQKFQ4agEAAJICAAARAAAAAAAAAAAAAAAAAER5AABkb2NQcm9wcy9jb3JlLnhtbFBLBQYAAAAAHwAfAB8IAADlewAAAAA=",
        "_name": "test.DOCX",
        "_detect_language": true,
        "_indexed_chars": -1
    }
}

The indexed document looks as such:

GET http://localhost:9200/test/mapper/66d2006d-d269-5de7-5aa8-53f5b7b730e8:

{
    "_index": "test",
    "_type": "mapper",
    "_id": "66d2006d-d269-5de7-5aa8-53f5b7b730e8",
    "_version": 4,
    "found": true,
    "_source": {
        "file": {
            "content": "UEsDBBQABgAIAAAAIQDtO5Wb8wEAAFoKAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAI ...(shortened)... BLBQYAAAAAHwAfAB8IAADlewAAAAA=",
            "_name": "test.DOCX",
            "_detect_language": true,
            "_indexed_chars": -1
        }
    }
}

I can perform the following search: POST http://localhost:9200/test/mapper/_search with the following payload (searching for the term "ambiguous"):

{
  "fields": [ "file" ],
  "query": {
    "match": {
      "file": "ambiguous"
    }
  }
}

This is the result which includes the file field and the extracted content:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.023731936,
        "hits": [
            {
                "_index": "test",
                "_type": "mapper",
                "_id": "66d2006d-d269-5de7-5aa8-53f5b7b730e8",
                "_score": 0.023731936,
                "fields": {
                    "file": [
                        "\n\tRequest\n\tDR/BCP specialist\n\n\tLevel\n\tSenior/ expert\n\n\tYears of exp\n\t+7 years\n\n\tJob description\n\tOversees development, implementation, testing and maintenance of Disaster Recovery strategy.  Partners with SWING streams and GDC’s  to align processes and procedures within the organization in order to respond to a disaster and recover the critical business functions within the defined RTO/RPOs.   \nPerfor ...(shortened)... hnologies \nDatabase administration for SQL Server \nVMWare Virtualization products (Server, ESX) \nBackup software and disk to disk solutions \nEnterprise Storage and replication capabilities \nHost Based replication technologies\n\n\n\tLanguage Skills\n(mandatory)\n\tEglish / Dutch\n\n\tLanguage Skills\n(optional)\n\tN/A\n\n\tLocation\n\tAmsterdam \n\n\tSector\n\tFinance\n\n\tStart Date\n\tASAP\n\n\tContract Duration\n\t22 days\n\n\n\n"
                    ]
                }
            }
        ]
    }
}

To me this confirms the mapper-attachment plugin is working and I can extract my text.

So, my wish is to extract the text from attachments stored by river-imap as well. Therefore I made the following configuration of river-imap, containing type mappings and index settings:

{
   "type":"imap",
   "mail.store.protocol":"imap",
   "mail.imap.host":"imap.gmail.com",
   "mail.imap.port":993,
   "mail.imap.ssl.enable":true,
   "mail.imap.connectionpoolsize":"3",
   "mail.debug":"false",
   "mail.imap.timeout":10000,
   "user":"***PRIVATE***",
   "password":"***PRIVATE***",
   "schedule":null,
   "interval":"60s",
   "threads":5,
   "folderpattern":"^INBOX$",
   "bulk_size":100,
   "max_bulk_requests":"30",
   "bulk_flush_interval":"5s",
   "mail_index_name":"imapriverdata",
   "mail_type_name":"mail",
   "with_striptags_from_textcontent":true,
   "with_attachments":true,
   "with_text_content":true,
   "with_flag_sync":false,
   "index_settings" : {
        "index": {
            "analysis": {
                "analyzer": {
                    "email_analyzer": {
                        "type": "custom",
                        "filter": [
                            "lowercase"
                        ],
                        "tokenizer": "uax_url_email"
                    },
                    "fulltext_analyzer_icu": {
                        "type": "custom",
                        "char_filter": [
                            "html_strip"
                        ],
                        "filter": [
                            "lowercase",
                            "type_as_payload"
                        ],
                        "tokenizer": "icu_tokenizer"
                    }
                }
            },
            "index": {
                "number_of_replicas": "1",
                "number_of_shards": "5"
            }
        }
    },
    "type_mapping" : {
        "mail": {
            "dynamic" : "strict",
            "properties": {
                "attachmentCount": {
                    "type": "long"
                },
                "attachments": {
                  "type" : "nested",
                    "properties": {
                        "content": {
                            "type" : "attachment",
                            "path" : "full",
                            "fields" : {
                                "file" : {
                                  "type" : "string",
                                  "store" : true,
                                  "term_vector" : "with_positions_offsets"
                                },
                                "name" : {"store" : "yes"},
                                "title" : {"store" : "yes"},
                                "date" : {"store" : "yes"},
                                "author" : {"analyzer" : "fulltext_analyzer_icu"},
                                "keywords" : {"store" : "yes"},
                                "content_type" : {"store" : "yes"},
                                "content_length" : {"store" : "yes"},
                                "language" : {"store" : "yes"}
                            }
                        },
                        "contentType": {
                            "type": "string"
                        },
                        "filename": {
                            "type": "string"
                        },
                        "name": {
                            "type": "string"
                        },
                        "size": {
                            "type": "long"
                        }
                    }
                },
                "contentType": {
                    "type": "string"
                },
                "flaghashcode": {
                    "type": "integer"
                },
                "flags": {
                    "type": "string"
                },
                "folderFullName": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "folderUri": {
                    "type": "string"
                },
                "from": {
                    "properties": {
                        "email": {
                            "type": "string"
                        },
                        "personal": {
                            "type": "string"
                        }
                    }
                },
                "headers": {
                  "type" : "nested",
                    "properties": {
                        "name": {
                            "type": "string"
                        },
                        "value": {
                            "type": "string"
                        }
                    }
                },
                "mailboxType": {
                    "type": "string"
                },
                "receivedDate": {
                    "type": "date",
                    "format": "basic_date_time"
                },
                "sentDate": {
                    "type": "date",
                    "format": "basic_date_time"
                },
                "size": {
                    "type": "long"
                },
                "subject": {
                    "type": "string"
                },
                "textContent": {
                    "type": "string"
                },
                "to": {
                  "type" : "nested",
                    "properties": {
                        "email": {
                            "type": "string",
                            "index_analyzer": "email_analyzer"
                        },
                        "personal": {
                            "type": "string"
                        }
                    }
                },
                "uid": {
                    "type": "long"
                }
            }
        }
    }
}

As you can see I have configured content to be of type attachment, but with extended configuration for the mapper-attachments plugin.

I also made some changes to the mapping in order to do nested query searches on attachments, headers and to, which seem to be accepted.

When started, river-imap nicely syncs all messages from the IMAP server to the imapriverdata index. It creates the index nicely with the settings and mappings as specified in the above configuration.

When I perform a search I want to see the extracted text, just like when using the mapper-attachment in the test I described above. I perform this search:

POST http://localhost:9200/imapriverdata/mail/_search
with the following payload:

{
  "fields": [ "attachments.content" ],
    "query" : {
      "nested" : {
        "path" : "attachments",
        "query" : {
            "match" : {
                "attachments.content" : "ambiguous"
            }
        }
      }
    }
}

this results in the following:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.53624094,
        "hits": [
            {
                "_index": "imapriverdata",
                "_type": "mail",
                "_id": "84::imap://***PRIVATE***@imap.gmail.com/INBOX",
                "_score": 0.53624094,
                "fields": {
                    "attachments.content": [
                        "UEsDBBQABgAIAAAAIQDtO5Wb8wEAA ...(shortened)... AAAUAAAAAAAAAAAAAAAAAER3AAB3b3JkL3dlYlNldHRpbmdzLnhtbFBLAQItABQABgAIAAAAIQAQKFQ4agEAAJICAAARAAAAAAAAAAAAAAAAAER5AABkb2NQcm9wcy9jb3JlLnhtbFBLBQYAAAAAHwAfAB8IAADlewAAAAA="
                    ]
                }
            }
        ]
    }
}

As you can see, attachments.content shows the base64 encoded data, as opposed to my test above with just the mapper plugin. I'd say the attachments.content field is configured exactly the same as in the test.

Any suggestion why this is happening?

Incompatibility with river-web

Since I installed both river-web (https://github.com/codelibs/elasticsearch-river-web) and river-imap (https://github.com/salyh/elasticsearch-river-imap) I have 3 nodes out of 5 that fail to start.

Here's the only line of log I got about this :
[2014-09-27 23:53:48,622][ERROR][bootstrap ] {1.3.2}: Initialization Failed ...

  1. NoSuchMethodError[org.apache.commons.codec.binary.Base64.(I[BZ)V]

It seems the two plugins use different versions of Apache Commons (maybe via transitive dependencies...).

What is very weird is that 2 nodes start (but fail to synchronize due to the lack of other nodes).

I'm using version 1.3.2 of ES and the latest version of the two river plugins, installed following readme.md instructions.

Example with type_mapping

I would like to add a mapping for https://github.com/jprante/elasticsearch-langdetect
and get the possibility to sort by receivedDate, so a custom mapping is required.

There is no documentation on what is expected in

 type_mapping - optional mapping for the Elasticsearch index type

I tried this:

"type_mapping" : {

                        "properties" : {
                           "textContent" : { "type" : "langdetect" },
                           "receivedDate" : { "type": "date"}
                        }

                    }

River creation was OK, but the mapping was not applied.
Then I used this one:

"type_mapping" : {
                      "mail" : {
                        "properties" : {
                           "textContent" : { "type" : "langdetect" },
                           "receivedDate" : { "type": "date"}
                        }
                      }
                    }

No effect on the mapping. Of course I deleted the river and the index on each try.

In both cases I got following response:

{"_index":"_river","_type":"stefan","_id":"_meta","_version":1,"created":true}

NoNodeAvailableException: None of the configured nodes are available

Any Idea how to fix this error? :(

Exception while waiting for cluster state: YELLOW due to
org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: []

bin/importer.sh config.json

"client.transport.ignore_cluster_name": true,
"client.transport.ping_timeout": "5s",
"client.transport.nodes_sampler_interval": "5s",
"client.transport.sniff": false,
"cluster.name": "goldncompass-dev",
"elasticsearch.hosts": "127.0.0.1:9300"

elasticsearch.yml

# Use a descriptive name for your cluster:
#
cluster.name: goldncompass-dev

# Use a descriptive name for the node:
#
node.name: Shira

# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1

Elastic search v2.3.5 logs the following:

[2016-08-12 00:01:37,316][WARN ][transport.netty          ] [Shira] exception caught on transport layer [[id: 0x73a297fa, /127.0.0.1:56407 => /127.0.0.1:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (request) for requestId [0], action [cluster/nodes/info], readerIndex [39] vs expected [57]; resetting
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:121)

Flags not indexed?

First of all: excellent work!
I tried to find a query to select all unread messages, but could not find such a field in my dataset. It should be part of the IMAP protocol.

* OK [PERMANENTFLAGS (\Seen \Answered \Flagged \Deleted \Draft $MDNSent)] Permanent flags

Feature: index names based on sendDate / receiveDate

It would be really great to have index names per day like logstash does. This way old emails could easily be archived by closing indexes. Also querying is more performant this way if a time frame is given (like Kibana does).

What do you think, Hendrik?

Greetings,
Gabriel

River does not start after ES restart (IndexAlreadyExists Exception)

When I restart ElasticSearch I get this message and no logs about fetching mails.

[2014-05-12 19:47:08,398][ERROR][de.saly.elasticsearch.river.imap.IMAPRiver] Unable to start IMAPRiver due to org.elasticsearch.indices.IndexAlreadyExistsException: [stefan2] already exists
org.elasticsearch.indices.IndexAlreadyExistsException: [stefan2] already exists
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:152)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:465)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.access$100(MetaDataCreateIndexService.java:85)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:215)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:308)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

NoClassDefFoundError: org/quartz/ScheduleBuilder

I'm running into

failed to get _meta from [imap]/[riverimapdata]
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/quartz/ScheduleBuilder

in a fresh install of ES 1.1.1 with JDK version 1.8.0_05 and Maven version 3.2.1. The original pom file has not been modified...

Missing Text Content

I am using this for pulling emails from an IMAP server. While it seems to be indexing all emails, a proportion of those emails have their contents missing, i.e. textContent and htmlContent are empty in Elasticsearch. Unfortunately this happens randomly, so I have no idea what could be the problem.

I also did not see any error in the logs that could give me an idea of why these contents are not being indexed.

See the example extract from Sense below:

 "mailboxType": "IMAP",
               "popId": null,
               "receivedDate": 1449630321000,
               "sentDate": 1449630310000,
               "size": 8455,
               "subject": "Re: Newsletter: 9th December 2015",
               "textContent": "",
               "htmlContent": null ```

attachment mapping

Hi,
I'd like to add attachment mapping for:

https://github.com/elasticsearch/elasticsearch-mapper-attachments/

I'm not sure what the correct way to do that is. I would appreciate your help. Maybe it would also help to update the documentation, in case somebody else needs it.
Here is the modified mapping I'm trying to use:

curl -XGET 'http://localhost:9200/imapriverdata/_mapping?pretty'
{
  "imapriverdata" : {
    "mappings" : {
      "imapriverstate" : {
        "properties" : {
          "errormsg" : {
            "type" : "string"
          },
          "exists" : {
            "type" : "boolean"
          },
          "folderUrl" : {
            "type" : "string"
          },
          "lastCount" : {
            "type" : "long"
          },
          "lastIndexed" : {
            "type" : "long"
          },
          "lastSchedule" : {
            "type" : "long"
          },
          "lastTook" : {
            "type" : "long"
          },
          "lastUid" : {
            "type" : "long"
          },
          "messageid" : {
            "type" : "string"
          },
          "uidValidity" : {
            "type" : "long"
          }
        }
      },
      "mail" : {
        "properties" : {
          "attachmentCount" : {
            "type" : "long"
          },
          "attachments" : {
            "properties" : {
              "attachmentType" : {
                "type" : "attachment",
                "path" : "full",
                "fields" : {
                  "attachmentType" : {
                    "type" : "string"
                  },
                  "author" : {
                    "type" : "string"
                  },
                  "title" : {
                    "type" : "string"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "date" : {
                    "type" : "date",
                    "format" : "dateOptionalTime"
                  },
                  "keywords" : {
                    "type" : "string"
                  },
                  "content_type" : {
                    "type" : "string"
                  },
                  "content_length" : {
                    "type" : "integer"
                  },
                  "language" : {
                    "type" : "string"
                  }
                }
              },
              "content" : {
                "type" : "string"
              },
              "contentType" : {
                "type" : "string"
              },
              "filename" : {
                "type" : "string"
              },
              "name" : {
                "type" : "string"
              },
              "size" : {
                "type" : "long"
              }
            }
          },
          "contentType" : {
            "type" : "string"
          },
          "flaghashcode" : {
            "type" : "integer"
          },
          "flags" : {
            "type" : "string"
          },
          "folderFullName" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "folderUri" : {
            "type" : "string"
          },
          "from" : {
            "properties" : {
              "email" : {
                "type" : "string"
              },
              "personal" : {
                "type" : "string"
              }
            }
          },
          "headers" : {
            "properties" : {
              "name" : {
                "type" : "string"
              },
              "value" : {
                "type" : "string"
              }
            }
          },
          "mailboxType" : {
            "type" : "string"
          },
          "receivedDate" : {
            "type" : "date",
            "format" : "basic_date_time"
          },
          "sentDate" : {
            "type" : "date",
            "format" : "basic_date_time"
          },
          "size" : {
            "type" : "long"
          },
          "subject" : {
            "type" : "string"
          },
          "textContent" : {
            "type" : "string"
          },
          "to" : {
            "properties" : {
              "email" : {
                "type" : "string"
              },
              "personal" : {
                "type" : "string"
              }
            }
          },
          "uid" : {
            "type" : "long"
          }
        }
      }
    }
  }
}

I did restart elasticsearch after the change.

Thanks,
Gabriel

multiple imap rivers with multiple indexes

Create two imap rivers with two indexes:

curl -XPUT 'http://127.0.0.1:9200/_river/rv-hole-planreq/_meta' -d '{
   "type":"imap",
   "mail.store.protocol":"imap",
   "mail.imap.host":"192.168.xxx.xxx",
   "mail.imap.port":143,
   "mail.imap.ssl.enable":false,
   "mail.imap.connectionpoolsize":"6",
   "mail.debug":"false",
   "mail.imap.timeout":10000,
   "user":"user1",
   "password":"password1",
   "schedule":null,
   "interval":"60s",
   "threads":6,
   "folderpattern":null,
   "bulk_size":100,
   "max_bulk_requests":"30",
   "bulk_flush_interval":"5s",
   "mail_index_name":"ndx-hole-planreq",
   "mail_type_name":"mail",
   "with_striptags_from_textcontent":false,
   "with_attachments":true,
   "with_text_content":true,
   "with_flag_sync":false,
   "index_settings" : null,
   "type_mapping" :  null
}'

curl -XPUT 'http://127.0.0.1:9200/_river/rv-csx-tc-const-notes/_meta' -d '{
   "type":"imap",
   "mail.store.protocol":"imap",
   "mail.imap.host":"192.168.xxx.xxx",
   "mail.imap.port":143,
   "mail.imap.ssl.enable":false,
   "mail.imap.connectionpoolsize":"6",
   "mail.debug":"false",
   "mail.imap.timeout":10000,
   "user":"user2",
   "password":"password2",
   "schedule":null,
   "interval":"60s",
   "threads":6,
   "folderpattern":null,
   "bulk_size":100,
   "max_bulk_requests":"30",
   "bulk_flush_interval":"5s",
   "mail_index_name":"ndx-csx-tc-const-notes",
   "mail_type_name":"mail",
   "with_striptags_from_textcontent":false,
   "with_attachments":true,
   "with_text_content":true,
   "with_flag_sync":false,
   "index_settings" : null,
   "type_mapping" :  null
}'

Both of the create requests return "created":true, but only one of the rivers functions.

The error I receive shortly after the second river create is:

[2014-11-25 10:12:02,079][ERROR][de.saly.elasticsearch.river.imap.IMAPRiver] Unable to start IMAPRiver due to org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: 'intervaltrigger' and group: 'group1', because one already exists with this identification.
org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: 'intervaltrigger' and group: 'group1', because one already exists with this identification.
    at org.quartz.simpl.RAMJobStore.storeTrigger(RAMJobStore.java:415)
    at org.quartz.simpl.RAMJobStore.storeJobAndTrigger(RAMJobStore.java:252)
    at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:886)
    at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:249)
    at de.saly.elasticsearch.river.imap.IMAPRiver.start(IMAPRiver.java:283)
    at org.elasticsearch.river.RiversService.createRiver(RiversService.java:148)
    at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:275)
    at org.elasticsearch.river.RiversService$ApplyRivers$2.onResponse(RiversService.java:269)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:95)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Versions:
ElasticSearch 1.3.5
mapper-attachments 2.4.1
river-imap-0.3-7fbfd2d

{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "RDEee9gsR96-EtW_L6T87Q" : {
      "name" : "Hank McCoy",
      "transport_address" : "inet[/192.168.xxx.xxx:9300]",
      "host" : "archives.atl.raildocs.net",
      "ip" : "192.168.xxx.xxx",
      "version" : "1.3.5",
      "build" : "4a50e7d",
      "http_address" : "inet[/192.168.xxx.xxx:9200]",
      "settings" : {
        "path" : {
          "data" : "/var/lib/elasticsearch",
          "work" : "/tmp/elasticsearch",
          "home" : "/usr/share/elasticsearch",
          "conf" : "/etc/elasticsearch",
          "logs" : "/var/log/elasticsearch"
        },
        "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
        "cluster" : {
          "name" : "elasticsearch"
        },
        "config" : "/etc/elasticsearch/elasticsearch.yml",
        "name" : "Hank McCoy"
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "available_processors" : 2,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2500,
          "total_cores" : 2,
          "total_sockets" : 2,
          "cores_per_socket" : 1,
          "cache_size_in_bytes" : 25600
        },
        "mem" : {
          "total_in_bytes" : 3876798464
        },
        "swap" : {
          "total_in_bytes" : 1073737728
        }
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 14516,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      },
      "jvm" : {
        "pid" : 14516,
        "version" : "1.7.0_71",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.65-b04",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1416863302694,
        "mem" : {
          "heap_init_in_bytes" : 268435456,
          "heap_max_in_bytes" : 1056309248,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1056309248
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      },
      "thread_pool" : {
        "generic" : {
          "type" : "cached",
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "200"
        },
        "snapshot_data" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "bench" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "merge" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "suggest" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "50"
        },
        "optimize" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 6,
          "max" : 6,
          "queue_size" : "1k"
        },
        "percolate" : {
          "type" : "fixed",
          "min" : 2,
          "max" : 2,
          "queue_size" : "1k"
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 1,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
      "network" : {
        "refresh_interval_in_millis" : 5000,
        "primary_interface" : {
          "address" : "192.168.xxx.xxx",
          "name" : "eth0",
          "mac_address" : "00:50:56:8A:6F:67"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0.0.0.0:9300]",
        "publish_address" : "inet[/192.168.xxx.xxx:9300]"
      },
      "http" : {
        "bound_address" : "inet[/0.0.0.0:9200]",
        "publish_address" : "inet[/192.168.xxx.xxx:9200]",
        "max_content_length_in_bytes" : 104857600
      },
      "plugins" : [ {
        "name" : "mapper-attachments",
        "version" : "2.4.1",
        "description" : "Adds the attachment type allowing to parse difference attachment formats",
        "jvm" : true,
        "site" : false
      }, {
        "name" : "river-imap-0.3-7fbfd2d",
        "version" : "0.3",
        "description" : "IMAP River",
        "jvm" : true,
        "site" : false
      } ]
    }
  }
}

Fetch downloads new e-mails and the 2nd last e-mail

Hi,

Me again ;)

When fetching mail from an IMAP server, new e-mails + 1 already downloaded are downloaded.

This is my situation:

  • Using latest version of river-imap (0.3) on ES 1.3.2
  • I have 43 mails in the INBOX folder of an IMAP account
  • The imapriverdata is in sync with the INBOX folder, i.e. 43 mail type documents stored
  • Looking at the logs of ElasticSearch, I constantly see the logs below showing 43 messages in my inbox and 1 new (see below, even though 1 new message is not true because imapriverdata is in sync with the IMAP box).
  • Imap-river does not download any new message.
  • When sending 1 new message to the mailbox, Imap-river downloads the new message as well as the 2nd-last message. This overrides the 2nd last mail document in the index with the same UID, effectively resetting any changes made to this mail document.
  • When sending 2 new messages to the mailbox, Imap-river downloads the 2 new messages as well as the 3rd-last message, etc.

So, effectively it always downloads the number of new messages + 1.

This is the log when imapriverdata is in sync with the IMAP mailbox:

[2014-09-09 11:42:45,794][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Fetch mails from folder imap://***PRIVATE***@imap.gmail.com/INBOX (43)
[2014-09-09 11:42:45,917][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 1 new messages in folder INBOX
[2014-09-09 11:42:45,920][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Not initiailly processed 0 mails for folder INBOX
[2014-09-09 11:42:46,057][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 0 messages were locally deleted, because they are expunged on server.
[2014-09-09 11:42:46,427][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match Logs Google Apps Script
[2014-09-09 11:42:46,550][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Alle berichten
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Belangrijk
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Concepten
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Met ster
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Prullenbak
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Spam
[2014-09-09 11:42:46,551][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Verzonden berichten
[2014-09-09 11:43:44,766][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 98-Test
[2014-09-09 11:43:44,766][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 99-History
[2014-09-09 11:43:45,145][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Fetch mails from folder imap://***PRIVATE***@imap.gmail.com/INBOX (43)
[2014-09-09 11:43:45,268][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 1 new messages in folder INBOX
[2014-09-09 11:43:45,271][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Not initiailly processed 0 mails for folder INBOX
[2014-09-09 11:43:45,407][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 0 messages were locally deleted, because they are expunged on server.
[2014-09-09 11:43:45,784][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match Logs Google Apps Script
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Alle berichten
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Belangrijk
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Concepten
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Met ster
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Prullenbak
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Spam
[2014-09-09 11:43:45,911][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Verzonden berichten
[2014-09-09 11:44:45,284][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 98-Test
[2014-09-09 11:44:45,284][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 99-History

So, it says 1 new message in inbox and it does not download in the above log. When I send a new mail to this mailbox, the following happens: You see the number of messages going from 43 to 44 and imap-river downloads 2 mails at 2014-09-09 11:46:48,446.

[2014-09-09 11:45:45,705][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Fetch mails from folder imap://***PRIVATE***@imap.gmail.com/INBOX (43)
[2014-09-09 11:45:45,826][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 1 new messages in folder INBOX
[2014-09-09 11:45:45,829][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Not initiailly processed 0 mails for folder INBOX
[2014-09-09 11:45:45,963][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 0 messages were locally deleted, because they are expunged on server.
[2014-09-09 11:45:46,328][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match Logs Google Apps Script
[2014-09-09 11:45:46,450][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Alle berichten
[2014-09-09 11:45:46,450][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Belangrijk
[2014-09-09 11:45:46,450][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Concepten
[2014-09-09 11:45:46,451][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Met ster
[2014-09-09 11:45:46,451][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Prullenbak
[2014-09-09 11:45:46,451][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Spam
[2014-09-09 11:45:46,451][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Verzonden berichten
[2014-09-09 11:46:44,748][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 98-Test
[2014-09-09 11:46:44,749][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 99-History
[2014-09-09 11:46:45,127][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Fetch mails from folder imap://***PRIVATE***@imap.gmail.com/INBOX (44)
[2014-09-09 11:46:45,249][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 2 new messages in folder INBOX
[2014-09-09 11:46:47,825][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Not initiailly processed 2 mails for folder INBOX
[2014-09-09 11:46:47,959][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 0 messages were locally deleted, because they are expunged on server.
[2014-09-09 11:46:48,325][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match Logs Google Apps Script
[2014-09-09 11:46:48,439][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] New bulk actions queued [29] of [2 items], 1 outstanding bulk requests
[2014-09-09 11:46:48,446][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] Bulk actions done successfully [29] success [2 items] [7ms], 0 outstanding bulk requests, queue size is 0
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Alle berichten
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Belangrijk
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Concepten
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Met ster
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Prullenbak
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Spam
[2014-09-09 11:46:48,447][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Verzonden berichten
[2014-09-09 11:47:44,790][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 98-Test
[2014-09-09 11:47:44,790][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 99-History
[2014-09-09 11:47:45,170][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Fetch mails from folder imap://***PRIVATE***@imap.gmail.com/INBOX (44)
[2014-09-09 11:47:45,294][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 1 new messages in folder INBOX
[2014-09-09 11:47:45,296][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Not initiailly processed 0 mails for folder INBOX
[2014-09-09 11:47:45,436][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] 0 messages were locally deleted, because they are expunged on server.
[2014-09-09 11:47:45,808][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match Logs Google Apps Script
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Alle berichten
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Belangrijk
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Concepten
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Met ster
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Prullenbak
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Spam
[2014-09-09 11:47:45,932][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match [Gmail]/Verzonden berichten
[2014-09-09 11:48:44,627][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 98-Test
[2014-09-09 11:48:44,627][INFO ][de.saly.elasticsearch.mailsource.ParallelPollingIMAPMailSource] Pattern ^INBOX$ does not match 99-History

add new property

Hi,

Would it be possible to add a new property which would allow the deletion of processed emails from the source server? Something like:
"delete_received": "true/false"

Thanks,
Gabriel
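
For reference, a minimal sketch of how such a hypothetical option could sit in the existing JSON configuration, next to the related keep_expunged_messages flag (delete_received is only the name proposed above, not an implemented setting):

{
   "keep_expunged_messages" : false,
   "delete_received" : false
}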

Shutdown delay

When shutting down, ES waits for the IMAP river; since it is fetching many mails, the job keeps running for a long time. I think that on SIGTERM it should only finish the current bulk insert and then stop, to allow Elasticsearch to terminate the process.

[2014-05-12 22:24:53,429][INFO ][node                     ] [Carolyn Parmenter] stopping ...
[2014-05-12 22:24:57,950][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] New bulk actions queued [97] of [25 items], 1 outstanding bulk requests
[2014-05-12 22:24:57,965][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] Bulk actions done successfully [97] success [25 items] [14ms], 0 outstanding bulk requests, queue size is 0
[2014-05-12 22:25:02,951][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] New bulk actions queued [98] of [24 items], 1 outstanding bulk requests
[2014-05-12 22:25:02,965][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] Bulk actions done successfully [98] success [24 items] [13ms], 0 outstanding bulk requests, queue size is 0
^C[2014-05-12 22:25:07,953][INFO ][de.saly.elasticsearch.maildestination.ElasticsearchBulkMailDestination] New bulk actions queued [99] of [23 items], 1 outstanding bulk requests
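
For illustration only, a minimal Java sketch of the suggested SIGTERM behaviour: a shutdown hook sets a stop flag so the poll loop finishes the bulk it is already working on and then exits instead of starting another fetch. The class name and the placeholder methods are hypothetical and not part of the importer's actual code.

import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulShutdownSketch {

    private static final AtomicBoolean stopRequested = new AtomicBoolean(false);

    public static void main(String[] args) throws InterruptedException {
        Thread poller = new Thread(GracefulShutdownSketch::pollLoop, "imap-poller");
        // SIGTERM / Ctrl-C lands here: only set a flag and give the in-flight
        // bulk a bounded amount of time to flush, then let the JVM exit.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            stopRequested.set(true);
            try {
                poller.join(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        poller.start();
        poller.join();
    }

    private static void pollLoop() {
        while (!stopRequested.get()) {
            indexOneBulk();                 // finish the bulk that is already queued ...
            if (stopRequested.get()) break; // ... but do not start the next fetch
            sleepQuietly(60_000);           // poll interval, e.g. "interval": "60s"
        }
    }

    // Placeholder standing in for the real fetch + bulk-index logic.
    private static void indexOneBulk() { }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}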

type_mapping setting doesn't take effect

Hi @salyh,

Thanks for writing elasticsearch-imap - it is a very cool program. However, when I try to apply a mapping, for example the default mapping example in the readme.md, it doesn't seem to take effect on the index.

I have re-created the index many times and when I inspect it with elasticsearch-kopf it always tells me:

"receivedDate": { "type": "string" },

(receivedDate is my field of interest as I need timestamped data)

I have also tried creating the index manually with the default or another mapping, and when I do that it fails to actually import the emails (not sure why - I have tried setting type_mapping to null too, which doesn't make a difference).

Any ideas?
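
Not a confirmed fix, but for comparison, a minimal type_mapping sketch that declares receivedDate as a date for the default type name "mail" (the index and type names, and whether the date format the importer writes is accepted by Elasticsearch's default date parsing, are assumptions to check against a sample document):

{
   "mail_index_name" : "imapriverdata",
   "mail_type_name" : "mail",
   "type_mapping" : {
      "mail" : {
         "properties" : {
            "receivedDate" : { "type" : "date" }
         }
      }
   }
}

Note that if the index already exists with receivedDate mapped as string, the mapping cannot be changed in place; the index generally has to be deleted (or reindexed) and created again for a new mapping to apply.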

Error detecting flagchanges for message ...

In the latest version I get this error. It seems the first run is fine, but after the first interval it is thrown for every mail. It could be that it happens only if there are already imported mails...

[14:26:44,256][ERROR][ParallelPollingIMAPMailSource][DefaultQuartzScheduler_Worker-2] Error detecting flagchanges for message <[email protected]>
java.io.IOException: No flaghashcode field for id 1::imap://somename@someserver/Sent Messages
    at de.saly.elasticsearch.importer.imap.maildestination.ElasticsearchMailDestination.getFlaghashcode(ElasticsearchMailDestination.java:193)
    at de.saly.elasticsearch.importer.imap.mailsource.ParallelPollingIMAPMailSource.fetch(ParallelPollingIMAPMailSource.java:343)
    at de.saly.elasticsearch.importer.imap.mailsource.ParallelPollingIMAPMailSource.recurseFolders(ParallelPollingIMAPMailSource.java:570)
    at de.saly.elasticsearch.importer.imap.mailsource.ParallelPollingIMAPMailSource.recurseFolders(ParallelPollingIMAPMailSource.java:580)
    at de.saly.elasticsearch.importer.imap.mailsource.ParallelPollingIMAPMailSource.fetch(ParallelPollingIMAPMailSource.java:540)
    at de.saly.elasticsearch.importer.imap.mailsource.ParallelPollingIMAPMailSource.fetchAll(ParallelPollingIMAPMailSource.java:116)
    at de.saly.elasticsearch.importer.imap.support.MailFlowJob.execute(MailFlowJob.java:64)
    at de.saly.elasticsearch.importer.imap.support.MailFlowJob.execute(MailFlowJob.java:91)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)

Subfolders are not being indexed

When I try to index a folder structure like 'a/b/c' the indexing process stops at update_mapping.

For the "folderpattern" parameter I tried 'a/b/c', '.+c', and 'c'

AWS Configuration

Running into errors trying to configure this on AWS, as follows:

With respect to the local IP, which in this case is ip-171-20-0-254, I get the following:

00:02:41,032][WARN ][network][main] failed to resolve local host, fallback to loopback java.net.UnknownHostException: ip-171-20-0-254: ip-171-20-0-254: Name or service not known

After configuring the elasticsearch.hosts with my Amazon Elasticsearch endpoint I get:

Adding http://search-[redacted].us-east-1.es.amazonaws.com Exception in thread "main" java.lang.NumberFormatException: For input string: "//search-[redacted].us-east-1.es.amazonaws.com"

Is there a way to configure this tool for AWS usage?
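
Two observations, offered as guesses rather than a confirmed answer. The NumberFormatException suggests elasticsearch.hosts was given with a scheme ("http://...") and no port, while the importer appears to expect comma-separated host:port pairs, e.g.:

{
   "elasticsearch.hosts" : "search-[redacted].us-east-1.es.amazonaws.com:9300"
}

Even with that format, the client.transport.* settings and the 9300 port in the sample configuration suggest the importer uses the Elasticsearch transport client, and the Amazon Elasticsearch Service only exposes its HTTPS REST endpoint, not the transport protocol, so connecting directly to the managed endpoint is unlikely to work without a self-managed node in between.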
