oliver006 / elasticsearch-gmail

Index your Gmail Inbox with Elasticsearch
elasticsearch-gmail's Introduction

Elasticsearch For Beginners: Indexing your Gmail Inbox (and more: Supports any mbox and MH mailboxes)


What's this all about?

I recently looked at my Gmail inbox and noticed that I have well over 50k emails, taking up about 12GB of space, but there is no good way to tell which emails take up space, who I sent them to, who emails me, etc.

The goal of this tutorial is to load an entire Gmail inbox into Elasticsearch using bulk indexing and then start querying the cluster to get a better picture of what's going on.

Prerequisites

Set up Elasticsearch and make sure it's running at http://localhost:9200

A quick way to run Elasticsearch is via Docker (the CORS settings aren't strictly necessary but come in handy if you want to use e.g. dejavu to explore the index):

docker run --name es -d -p 9200:9200 -e http.port=9200 -e http.cors.enabled=true -e 'http.cors.allow-origin=*' -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.6.1
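Before indexing anything, it's worth checking that the cluster is actually up. A minimal check in plain Python (no extra dependencies; assumes the default http://localhost:9200):

import json
import urllib.request

# A healthy node answers the root endpoint with its name and version.
with urllib.request.urlopen("http://localhost:9200", timeout=5) as resp:
    info = json.loads(resp.read())
print(info["version"]["number"])  # e.g. "7.6.1"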

I use Python and Tornado for the scripts that import and query the data, plus beautifulsoup4 for stripping HTML/JS/CSS (if you want to use the body indexing flag).

Install the dependencies by running:

pip3 install -r requirements.txt

Aight, where do we start?

First, download your Gmail mailbox from Google Takeout; depending on how many emails you have accumulated, this might take a while. There's also a small sample.mbox file included in the repo for you to play around with while you're waiting for Google to prepare your download.

The downloaded archive is in mbox format, and Python's standard library ships a mailbox module for exactly this, so that part is easy.
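For instance, a few lines with the standard mailbox module are enough to peek at the sample file before indexing anything:

import mailbox

# Iterate over all messages in the sample file and print their subjects.
mbox = mailbox.mbox('sample.mbox')
for msg in mbox:
    print(msg['subject'])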

You can run the code (assuming Elasticsearch is running at localhost:9200) with the sample mbox file like this:

$ python3 src/index_emails.py --infile=sample.mbox
[I index_emails:173] Starting import from file sample.mbox
[I index_emails:101] Upload: OK - upload took: 1033ms, total messages uploaded:      3
[I index_emails:197] Import done - total count 16
$

Note: All examples focus on Gmail inboxes. Substitute any --infile= parameters with --indir= pointing to an MH directory to make them work with MH mailboxes instead.

The Source Code

The overall program will look something like this:

mbox = mailbox.mbox('emails.mbox')  # or mailbox.MH('inbox/')

for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_item_to_es(item)

print("Done!")

Ok, tell me more about the details

The full Python code is here: src/index_emails.py

Turn mailbox into JSON

First, we need to turn each message into a JSON document so we can insert it into Elasticsearch. Here is some sample code that was very useful when it came to normalizing and cleaning up the data.

A good first step:

def convert_msg_to_json(msg):
    result = {'parts': []}
    for (k, v) in msg.items():
        # header values may be str or email.header.Header objects; coerce to str
        result[k.lower()] = str(v)

Additionally, you also want to parse and normalize the To/Cc/Bcc and From email addresses:

for k in ['to', 'cc', 'bcc']:
    if not result.get(k):
        continue
    # strip all whitespace, then split the comma-separated address list
    emails_split = result[k].replace('\n', '').replace('\t', '').replace('\r', '').replace(' ', '').split(',')
    result[k] = [normalize_email(e) for e in emails_split]

if "from" in result:
    result['from'] = normalize_email(result['from'])
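normalize_email isn't shown here; a minimal sketch (an assumption for illustration, not necessarily what the repo does) would let email.utils.parseaddr strip display names and lowercase the address:

import email.utils

def normalize_email(e):
    # "John Doe <John@Example.COM>" -> "john@example.com"
    _name, addr = email.utils.parseaddr(e)
    return addr.lower()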

Elasticsearch expects timestamps in epoch milliseconds, so let's convert the date accordingly:

if "date" in result:
    tt = email.utils.parsedate_tz(result['date'])
    result['date_ts'] = int(calendar.timegm(tt) - tt[9]) * 1000
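As a sanity check: subtracting the timezone offset in tt[9] from calendar.timegm(tt) is exactly what email.utils.mktime_tz computes, so a Date header converts like this:

import calendar
import email.utils

tt = email.utils.parsedate_tz('Fri, 21 Nov 1997 09:55:06 -0600')
seconds = calendar.timegm(tt) - tt[9]
assert seconds == email.utils.mktime_tz(tt)  # UTC epoch seconds
print(seconds * 1000)                        # milliseconds, as stored in date_ts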

We also need to split up and normalize the labels

labels = []
if "x-gmail-labels" in result:
    labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
    del result["x-gmail-labels"]
result['labels'] = labels
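For example, a header value like "Important, Sent, Newsletters" (an illustrative value, not taken from a real mailbox) comes out as:

>>> [l.strip().lower() for l in "Important, Sent, Newsletters".split(',')]
['important', 'sent', 'newsletters']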

Email size is also interesting, so let's break that out:

parts = json_msg.get("parts", [])
json_msg['content_size_total'] = 0
for part in parts:
    json_msg['content_size_total'] += len(part.get('content', ""))

Index the data with Elasticsearch

The simplest approach is a PUT request per item:

def upload_item_to_es(item):
    es_url = "http://localhost:9200/gmail/email/%s" % (item['message-id'])
    request = HTTPRequest(es_url, method="PUT",
                          headers={"Content-Type": "application/json"},
                          body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in [200, 201]:
        print("\nfailed to add item %s" % item['message-id'])

However, Elasticsearch provides a better method for importing large chunks of data: bulk indexing. Instead of making an HTTP request per document and indexing individually, we batch them in chunks of e.g. 1,000 documents and then index each batch with a single request.
Bulk messages are of the format:

cmd\n
doc\n
cmd\n
doc\n
...

where cmd is the control message for each doc we want to index. For our example, cmd would look like this:

cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}

The final code looks something like this:

upload_data = list()
for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_data.append(item)
    if len(upload_data) == 100:
        upload_batch(upload_data)
        upload_data = list()

if upload_data:
    upload_batch(upload_data)

and

def upload_batch(upload_data):

    upload_data_txt = ""
    for item in upload_data:
        cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}
        upload_data_txt += json.dumps(cmd) + "\n"
        upload_data_txt += json.dumps(item) + "\n"

    request = HTTPRequest("http://localhost:9200/_bulk", method="POST",
                          headers={"Content-Type": "application/x-ndjson"},
                          body=upload_data_txt, request_timeout=240)
    response = http_client.fetch(request)
    result = json.loads(response.body)
    if result.get('errors'):
        print(result['errors'])
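The top-level errors flag only says that something failed; the per-document status lives in the items array of the bulk response. A small sketch for surfacing individual failures:

# Each entry in result['items'] mirrors one cmd/doc pair, in order.
for entry in result.get('items', []):
    op = entry.get('index', {})
    if op.get('status') not in (200, 201):
        print('failed: %s %s' % (op.get('_id'), op.get('error')))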

Ok, show me some data!

Once all your emails are indexed, we can start running queries.

Filters

If you want to search for emails from the last 6 months, you can use a range filter and search for gte the current time (now) minus 6 months. On recent Elasticsearch versions, filters live inside a bool query and the request needs a Content-Type header:

curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"query": { "bool": { "filter": { "range" : { "date_ts" : { "gte": "now-6M" } } } } } }
'

or you can filter for all emails from 2013 by using gte and lt:

curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"query": { "bool": { "filter": { "range" : { "date_ts" : { "gte": "2013-01-01T00:00:00.000Z", "lt": "2014-01-01T00:00:00.000Z" } } } } } }
'

You can also quickly query certain fields via the q parameter. This example shows all your Amazon shipping-info emails:

curl "localhost:9200/gmail/email/_search?pretty&q=from:[email protected]"
Aggregation queries

Aggregation queries let us bucket data by a given key and count the number of messages per bucket. Note that the old search_type=count parameter is gone from recent Elasticsearch versions; "size": 0 in the request body does the same job. For example, the number of messages grouped by recipient:

curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"size": 0,
"aggs": { "emails": { "terms" : { "field" : "to",  "size": 10 }
} } }
'

Result:

"aggregations" : {
"emails" : {
  "buckets" : [ {
       "key" : "[email protected]",
       "doc_count" : 1920
  }, { "key" : "[email protected]",
       "doc_count" : 1326
  }, { "key" : "[email protected]",
       "doc_count" : 263
  }, { "key" : "[email protected]",
       "doc_count" : 232
  }
  ...
  ]
}

This one gives us the number of emails per label:

curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"size": 0,
"aggs": { "labels": { "terms" : { "field" : "labels",  "size": 10 }
} } }
'

Result:

"hits" : {
  "total" : 51794,
},
"aggregations" : {
"labels" : {
  "buckets" : [       {
       "key" : "important",
       "doc_count" : 15430
  }, { "key" : "github",
       "doc_count" : 4928
  }, { "key" : "sent",
       "doc_count" : 4285
  }, { "key" : "unread",
       "doc_count" : 510
  },
  ...
   ]
}

Using a date histogram, you can also count how many emails you sent and received per year:

curl -s "localhost:9200/gmail/email/_search?pretty&search_type=count" -d '
{ "aggs": {
    "years": {
      "date_histogram": {
        "field": "date_ts", "interval": "year"
}}}}
'

Result:

"aggregations" : {
"years" : {
  "buckets" : [ {
    "key_as_string" : "2004-01-01T00:00:00.000Z",
    "key" : 1072915200000,
    "doc_count" : 585
  }, {
...
  }, {
    "key_as_string" : "2013-01-01T00:00:00.000Z",
    "key" : 1356998400000,
    "doc_count" : 12832
  }, {
    "key_as_string" : "2014-01-01T00:00:00.000Z",
    "key" : 1388534400000,
    "doc_count" : 7283
  } ]
}

Write aggregation queries to work out how much you spent on Amazon/Steam:

GET _search
{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "group_by_company": {
      "terms": { "field": "order_details.merchant" },
      "aggs": {
        "total_spent": { "sum": { "field": "order_details.order_total" } },
        "postage": { "sum": { "field": "order_details.postage" } }
      }
    }
  }
}

Todo

  • more interesting queries
  • schema tweaks
  • multi-part message parsing
  • blurb about performance
  • ...

Feedback

Open a pull request or an issue!

elasticsearch-gmail's People

Contributors: ad-ast, bitsofinfo, jaygooby, oliver-nyt, oliver006, prateepb, priyadharshinijaganathan, sfgeorge, tehp, timgates42, ugosan

elasticsearch-gmail's Issues

create_index() adds a mapping but doesn't create an index

Hi,

As the title says, the create_index function doesn't seem to create the 'gmail' index. It's missing lines like these (which need some improvement):

index_name = "gmail"
es_url = "http://localhost:9200/%s" % (index_name)
request = HTTPRequest(es_url, method="POST", request_timeout=10)
response = yield http_client.fetch(request)
if response.code not in [200, 201]:
    print("\nfailed to create index %s" % index_name)

How to run the example? (I get ImportError when running the script)

Can you please add a step-by-step explanation of how to run the example? (I have no experience with Python coding, I just want to run the example.)

Python is installed. I also executed your pip commands (pip was already included in Anaconda).
I downloaded your source code (one Python file) into the same folder as the emails.mbox file from Google. I tried:

python index_emails.py
Traceback (most recent call last):
  File "index_emails.py", line 12, in <module>
    from DelegatingEmailParser import DelegatingEmailParser
ImportError: No module named DelegatingEmailParser

info on how to run from cli

I tried working through this and it appears to be failing on the import... I am running a Vagrant VM and got everything installed just fine.
I don't know how to invoke the script properly...
I've tried so many ways. This seems like it should give results... though it does nothing much.

python index_emails.py test.mbox

Any help or tips are appreciated! This has been a fun project so far. Stumbling at the end. Thanks!

How to run queries

I am not able to run the queries shown in the readme.

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty&search_type=count' -d '{"aggs": { "emails":{"terms" : { "field" : "to",  "size": 10 }} } }'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No search type for [count]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No search type for [count]"
  },
  "status" : 400
}

$ curl -H 'Content-Type: application/json' 'http://localhost:9200/gmail/email/_search?pretty' -d '{"filter": { "range" : { "date_ts" : {"gte": "now-6M" } } } }'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown key for a START_OBJECT in [filter].",
        "line" : 1,
        "col" : 12
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown key for a START_OBJECT in [filter].",
    "line" : 1,
    "col" : 12
  },
  "status" : 400
}

I can see some output for:

curl -H 'Content-Type: application/json' 'http://localhost:9200/_search' -d'
{
    "query" : {
        "match_all" : {}
    }
}'

Appreciate any help.

can we index email body field?

Hey, thanks for a great tool! I got it to successfully import my large mbox file into ES with no problems. All of your queries on your readme worked fine.

It seems, though, that you aren't grabbing the email body. Is that right?
https://github.com/oliver006/elasticsearch-gmail/blob/master/src/index_emails.py#L89
Could you please add it to the code so that we index email bodies into ES too?

It looks like, from http://www.qmail.org/man/man5/mbox.html, that the body is between the From_ line and a blank line:

Between the From_ line and the blank line is a message in
RFC 822 format, as described in qmail-header(5), subject to
>From quoting as described below.

Thanks again!

AttributeError: module 'mailbox' has no attribute 'UnixMailbox'

When running the code out of the box (all dependencies are imported and work alright) I get:

Traceback (most recent call last):
  File "index_emails.py", line 220, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/ioloop.py", line 436, in run
    result = func()
  File "index_emails.py", line 169, in load_from_file
    mbox = mailbox.UnixMailbox(open(tornado.options.options.infile, 'rb'), email.message_from_file)
AttributeError: module 'mailbox' has no attribute 'UnixMailbox'
Exception ignored in: <bound method HTTPClient.__del__ of <tornado.httpclient.HTTPClient object at 0x10dfaf518>>
Traceback (most recent call last):
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/httpclient.py", line 82, in __del__
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/httpclient.py", line 87, in close
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/simple_httpclient.py", line 118, in close
  File "/anaconda3/envs/elasticsearch_gmail/lib/python3.6/site-packages/tornado/httpclient.py", line 203, in close
RuntimeError: inconsistent AsyncHTTPClient cache

Python 3.5.6

subject encoding

There is a problem when the subject of an email contains a special character such as "°".
In this case, the subject is

'subject': <email.header.Header object at 0x7f799db1d6d0>

Traceback:

Traceback (most recent call last):
  File "src/index_emails.py", line 231, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 576, in run_sync
    return future_cell[0].result()
  File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 547, in run
    result = func()
  File "src/index_emails.py", line 196, in load_from_file
    upload_batch(upload_data)
  File "src/index_emails.py", line 93, in upload_batch
    upload_data_txt += json.dumps(item) + "\n"
  File "/usr/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Header is not JSON serializable

Possible resolution in def convert_msg_to_json:

    if "subject" in result:
        result['subject'] = str(result['subject'])

tornado.httpclient.HTTPError: 406 Not Acceptable

Hi. I have a problem. I get this error:

python index_emails.py --infile=email.mbox
[I 180429 15:01:57 index_emails:170] Starting import from file email.mbox
Traceback (most recent call last):
  File "index_emails.py", line 222, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/Library/Python/2.7/site-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/Library/Python/2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/Library/Python/2.7/site-packages/tornado/ioloop.py", line 436, in run
    result = func()
  File "index_emails.py", line 188, in load_from_file
    upload_batch(upload_data)
  File "index_emails.py", line 92, in upload_batch
    response = http_client.fetch(request)
  File "/Library/Python/2.7/site-packages/tornado/httpclient.py", line 102, in fetch
    self._async_client.fetch, request, **kwargs))
  File "/Library/Python/2.7/site-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/Library/Python/2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
tornado.httpclient.HTTPError: HTTP 406: Not Acceptable

What can I do?

[UPDATE]
I tried to see what went wrong... and when it tries to create the index it gets:
[I 180429 15:08:35 index_emails:81] HTTP 400: Bad Request
