polo2ro / imapbox

Dump IMAP inbox to a local folder in a regular backupable format: HTML, JSON and attachments

License: MIT License

imapbox's Introduction

IMAPBOX

Dump IMAP inbox to a local folder in a regular backupable format: HTML, PDF, JSON and attachments.

This program aims to save a mailbox for archive using files in indexable or searchable formats. The produced files should be readable without external software, for example, to find an email in backups using only the terminal.

For each email in an IMAP mailbox, a folder is created with the following files:

File Description
message.html Created if an HTML part exists for the message body. message.html is always in UTF-8, and the embedded image links are rewritten to refer to the attachments subfolder.
message.pdf This file is optionally created from message.html when the wkhtmltopdf option is set in the config file.
attachments The attachments folder contains the attached files and the embedded images.
message.txt This file contains the body text if available in the original email, always converted to UTF-8.
metadata.json Various information in JSON format: date, recipients, body text, etc. This file can be used by external applications or a search engine like Elasticsearch.
raw.eml.gz A gzipped version of the email in .eml format.

Imapbox was designed to archive multiple mailboxes in one common directory tree. Copies of the same message spread across several accounts will be archived only once, using the Message-Id property.
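
The deduplication idea can be illustrated with a short sketch. It assumes the archive is keyed by a sanitized Message-Id under a per-year folder; the helper names `message_dir` and `should_save` and the sanitizing rule are hypothetical and are not taken from the imapbox source.

```python
import os
import re
import tempfile


def message_dir(local_folder, year, message_id):
    """Build an archive path from a Message-Id (hypothetical layout)."""
    # Drop the surrounding angle brackets and replace path-unsafe characters.
    safe_id = re.sub(r"[^A-Za-z0-9._@-]", "_", message_id.strip().strip("<>"))
    return os.path.join(local_folder, str(year), safe_id)


def should_save(local_folder, year, message_id):
    """Return True only the first time a given Message-Id is seen."""
    path = message_dir(local_folder, year, message_id)
    if os.path.isdir(path):
        return False  # already archived, possibly from another account
    os.makedirs(path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(should_save(root, 2022, "<abc.123@example.com>"))  # first copy
        print(should_save(root, 2022, "<abc.123@example.com>"))  # duplicate
```

Because the path depends only on the Message-Id, a second account holding the same message resolves to the same folder and is skipped.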

Install

This script requires Python 3 on the master branch, or Python 2 on the python2 branch, and the following libraries:

  • six
  • chardet – required for character encoding detection.
  • pdfkit – optionally required for archiving emails to PDF.

Use cases

  • I use the script to merge all my mail accounts in one searchable directory on my NAS server.
  • Publish the content of an email address, such as a mailing list, on a website.
  • Share the addresses of several employees to perform cross-searches on a common database.
  • Archiving an IMAP account because of mailbox size restrictions, or to restrain the used disk space on the IMAP server.
  • Archiving emails to PDF format.

Config file

Use ./config.cfg, ~/.config/imapbox/config.cfg or /etc/imapbox/config.cfg.

Example:

[imapbox]
local_folder=/var/imapbox
days=6
wkhtmltopdf=/opt/bin/wkhtmltopdf

[account1]
host=mail.autistici.org
username=username@domain
password=secret
ssl=True

[account2]
host=imap.googlemail.com
username=[email protected]
password=secret
remote_folder=INBOX
port=993

The imapbox section

Possible parameters for the imapbox section:

Parameter Description
local_folder The full path to the folder where the emails should be stored. If local_folder is not set, imapbox will download the emails to the current directory. This can be overridden with the shell argument -l.
days Number of days back to fetch from the IMAP account. This should be greater than or equal to the cronjob frequency. If this parameter is not set, imapbox will get all the emails from the IMAP account. This can be overridden with the shell argument -d.
wkhtmltopdf (optional) The location of the wkhtmltopdf binary. By default pdfkit will attempt to locate this using which (on UNIX type systems) or where (on Windows). This can be overridden with the shell argument -w.

Other sections

You can have as many configured accounts as you want, one per section. Section names may contain the account name.

Possible parameters for an account section:

Parameter Description
host IMAP server hostname
username Login id for the IMAP server.
password (optional) The password is saved in cleartext. For security reasons, run the imapbox script in userspace and set chmod 700 on your ~/.config/imapbox/config.cfg file. The user will be prompted for a password if this parameter is missing.
remote_folder (optional) IMAP folder name (multiple folder names are not supported for the moment). Default value is INBOX. You can use __ALL__ to fetch all folders.
port (optional) Default value is 993.
ssl (optional) Default value is False. Set to True to enable SSL
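
As an illustration of how these sections and defaults fit together, here is a minimal sketch that reads such a file with the standard `configparser` module. The function name `load_accounts`, the sample host and the returned dictionary layout are assumptions made for this example; imapbox's own loader may differ.

```python
import configparser

# Sample configuration; host and username are placeholders.
EXAMPLE = """
[imapbox]
local_folder=/var/imapbox
days=6

[account1]
host=mail.example.org
username=user@example.org
password=secret
ssl=True
"""


def load_accounts(text):
    """Split a config into global options and a list of account dicts."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    options = {
        # Fall back to the current directory when local_folder is missing.
        "local_folder": parser.get("imapbox", "local_folder", fallback="."),
        # days is optional; None stands for "fetch everything".
        "days": parser.getint("imapbox", "days", fallback=None),
    }

    accounts = []
    for name in parser.sections():
        if name == "imapbox":
            continue  # every other section describes one account
        section = parser[name]
        accounts.append({
            "name": name,
            "host": section["host"],
            "username": section["username"],
            "password": section.get("password"),  # None means: prompt the user
            "remote_folder": section.get("remote_folder", fallback="INBOX"),
            "port": section.getint("port", fallback=993),
            "ssl": section.getboolean("ssl", fallback=False),
        })
    return options, accounts


if __name__ == "__main__":
    print(load_accounts(EXAMPLE))
```

The fallbacks mirror the defaults listed in the tables above (INBOX, 993, ssl disabled).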

Metadata file

Property Description
Subject Email subject
Body A text version of the message
From Name and email of the sender
To An array of recipients
Cc An array of recipients
Attachments An array of file names
Date Message date with the timezone included, in the RFC 2822 format
Utc Message date converted in UTC, in the ISO 8601 format. This can be used to sort emails or filter emails by date
WithHtml Boolean, if the message.html file exists or not
WithText Boolean, if the message.txt file exists or not
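
The relationship between the Date and Utc properties can be sketched with the standard library. The compact `YYYYMMDDTHHMMSSZ` layout is an assumption inferred from the jq filter example in this README, and `to_utc_basic` is a hypothetical helper name.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_utc_basic(date_header):
    """Convert an RFC 2822 date string to a compact ISO 8601 UTC timestamp."""
    # parsedate_to_datetime returns an aware datetime when the header carries
    # a numeric offset; a "-0000" offset yields a naive value, which
    # astimezone() would then interpret as local time.
    parsed = parsedate_to_datetime(date_header)
    return parsed.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


if __name__ == "__main__":
    print(to_utc_basic("Sat, 21 Feb 2015 14:00:00 +0100"))
```

Timestamps in one fixed-width UTC format sort correctly as plain strings, which is what makes the Utc property convenient for filtering.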

Elasticsearch

The metadata.json file contains the necessary information for a search engine like Elasticsearch. Populating an Elasticsearch index with the email metadata can be done with a simple script.

Create an index:

curl -XPUT 'localhost:9200/imapbox?pretty'

Add all emails to index:

#!/bin/bash
cd emails/
for ID in */ ; do
    curl -XPUT "localhost:9200/imapbox/message/${ID}?pretty" --data-binary "@${ID}/metadata.json"
done

A front-end can be used to search in email archives:

  • Calaca is a beautiful, easy to use, search UI for Elasticsearch.
  • Facetview

Search in emails without an indexing process

jq is a lightweight and flexible command-line JSON processor.

Example command to browse emails:

find . -name "*.json" | xargs cat | jq '[.Date, .Id, .Subject, " ✉ "] + .From | join(" ")'

Example with a filter on UTC date:

find . -name "*.json" | xargs cat | jq 'select(.Utc > "20150221T130000Z")'
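
The same date filter can be written in Python when jq is not available. This is a sketch that assumes every message folder contains a metadata.json with the Utc and Subject properties described above; `iter_metadata` and `newer_than` are hypothetical helper names.

```python
import json
import os


def iter_metadata(root):
    """Yield the parsed content of every metadata.json below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "metadata.json" in filenames:
            path = os.path.join(dirpath, "metadata.json")
            with open(path, encoding="utf-8") as fh:
                yield json.load(fh)


def newer_than(root, utc_stamp):
    """Return the sorted subjects of messages whose Utc sorts after utc_stamp."""
    # Plain string comparison is enough because all timestamps are assumed
    # to share one fixed-width UTC format.
    return sorted(
        meta["Subject"]
        for meta in iter_metadata(root)
        if meta.get("Utc", "") > utc_stamp
    )


if __name__ == "__main__":
    for subject in newer_than(".", "20150221T130000Z"):
        print(subject)
```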

Docker compose

version: '3'
services:

  imapbox:
    image: mauricemueller/imapbox
    container_name: imapbox
    volumes:
      - imapbox_data:/var/imapbox
      # change the path to the config
      - ./test/config.cfg:/etc/imapbox/config.cfg

volumes:
  imapbox_data:

docker-compose run --rm imapbox

Build your own Docker image and push it to Docker Hub

  1. docker login
  2. docker-compose build
  3. docker tag imapbox:latest [USERNAME]/imapbox:latest
  4. docker push [USERNAME]/imapbox:latest

Similar projects

NoPriv is a python script to backup any IMAP capable email account to a browsable HTML archive and a Maildir folder.

License

The MIT License (MIT)

imapbox's People

Contributors

alichtman, bitdeli-chef, coryjthompson, filisko, jakeprice-me, jaraco, maurice-mueller, mhajder, miclaus, olajep, polo2ro, robdyke, xwtf


imapbox's Issues

Exclude IMAP Folder

I have a SPAM Folder with about 8000 eMails. Those eMails get used to train the Spam Filter on my Mail server. Would be great not to download all of them. ;-)

Feature: restorable format?

If I build a backup system with imapbox I'd also like to have a format which I can restore back to an email server in some form. The best way would probably be a Maildir, or alternatively mbox.

Ideally imapbox would either have an option to also create/update a Maildir on sync, or an option to convert the folder structure it produces to a Maildir.

MailboxClient.saveEmail() failed: None

Hey,

I'm using the docker version and get the following output:

account1/INBOX (on mx.mailserver.tld)
Saving folder: INBOX
~/imapbox/INBOX/2022/23244980.10761646738362496.JavaMail.tomcatlagaspintra1.domain.tld
MailboxClient.saveEmail() failed: None
~/imapbox/INBOX/2022/31433348.10671646824808489.JavaMail.tomcatlagaspintra1.domain.tld
MailboxClient.saveEmail() failed: None
~/imapbox/INBOX/2022/3064516.10971646911215249.JavaMail.tomcatlagaspintra1.domain.tld
MailboxClient.saveEmail() failed: None
3 emails created, 0 emails already exists

My config file is:
[imapbox]
local_folder=~/imapbox
days=3
wkhtmltopdf=/opt/bin/wkhtmltopdf

[account1]
host=mx.mailserver.tld
username=[email protected]
password=xxx
ssl=True
remote_folder=INBOX
port=993

OS: macOS Catalina 10.15.7

Any idea why this happens and how I can debug/fix this?

Kind regards,
David

How to get all gmail "folders" (labels)

Hi (again),

I'd like to archive my gmail account (among others) using IMAPBOX. Currently it only fetches the inbox.
Gmail uses IMAP folders to reflect its labels. An email can be assigned multiple labels so it appears in multiple IMAP folders. All archived emails appear under [Gmail]/All messages "folder".
The hierarchy shown in my email client (evolution) is as follows:

Gmail account

  • INBOX
  • Label 1
  • Label 2
  • Other Labels ...
  • [Gmail]
    • All messages
    • Sent
    • Drafts
    • Spam
    • Important
  • More labels ...

What to set the remote_folder to in order to archive all folders?

Another point (maybe beyond the scope of this script) is that due to the fact that the same email can be in multiple folders to save disk space some kind of checksum and symlinking would be great. So an email assigned to "Label 1" and "Label 2" would be symlinked from the "All messages" folder to the respective label folder.

CC not correctly filled out

I just processed a bit more than 3,000 emails. Of those 3 had no email address in the CC field.

Here are two examples from the raw.eml (exact copy/paste):

Example 1:

Cc: "[email protected]" [email protected], "Bo
Schou" [email protected]

This gives the following Cc part in the json:

"Cc": [
[
"[email protected]",
"[email protected]"
],
[
"Bo",
""
]
],

Example 2:

Cc: "[email protected]" [email protected], "Bernt
M=?iso-8859-1?Q?=F8?=ller" [email protected]

and the corresponding json:

"Cc": [
[
"[email protected]",
"[email protected]"
],
[
"Bernt",
""
]
],
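
For comparison, here is one way to parse a folded Cc header with quoted display names using only the standard library. This is a sketch with made-up addresses, not a description of how imapbox parses headers, and `cc_pairs` is a hypothetical helper name.

```python
import email
from email import policy

# A message whose Cc header is folded inside a quoted display name,
# similar in shape to Example 1 above (addresses are placeholders).
RAW = (
    'Cc: "first@example.com" <first@example.com>, "Bo\n'
    ' Schou" <bo@example.com>\n'
    "\n"
    "body\n"
)


def cc_pairs(raw_message):
    """Return (display name, address) pairs from the Cc header."""
    # policy.default unfolds the header and exposes structured addresses.
    msg = email.message_from_string(raw_message, policy=policy.default)
    header = msg["Cc"]
    if header is None:
        return []
    return [(addr.display_name, addr.addr_spec) for addr in header.addresses]


if __name__ == "__main__":
    for name, address in cc_pairs(RAW):
        print(name, address)
```

With this approach the line break inside the quoted name is unfolded before the address list is split, so the second recipient keeps both its full name and its address.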

wkhtmltopdf on Synology not working

After installing wkhtmltopdf via pip install wkhtmltopdf onto my synology, it doesn't work

config.cfg
wkhtmltopdf=/usr/lib/python2.7/site-packages/wkhtmltopdf

hello i got error?

Screenshot_2020-12-20-03-38-49-79

how to fix this??

mailboxclient.saveemail() failed module 'cgi' has no attribute 'escape'
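
For context: `cgi.escape` was deprecated in Python 3.2 and removed in Python 3.8, which is what produces this message on newer interpreters. The standard replacement is `html.escape`. A minimal sketch follows; the wrapper name `escape_text` is hypothetical and where exactly imapbox calls the function is not shown here.

```python
import html


def escape_text(text):
    """Escape &, < and > for safe inclusion in HTML."""
    # cgi.escape left quotes untouched by default, so quote=False keeps the
    # old behaviour; html.escape escapes quotes unless told otherwise.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text('<b> & "x"'))
```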

Archiving of subfolders fails

Hi,
I get the following Error

account1/ALL (on fqdn.domain.com)
Saving folder: Calendar
0 emails created, 0 emails already exists
Saving folder: Contacts
0 emails created, 0 emails already exists
Saving folder: Drafts
0 emails created, 1 emails already exists
Saving folder: INBOX
0 emails created, 3 emails already exists
Saving folder: INBOX.Archiv
Traceback (most recent call last):
File "imapbox.py", line 98, in <module>
main()
File "imapbox.py", line 93, in main
save_emails(account, options)
File "/Users/henegger/git/imapbox/mailboxresource.py", line 109, in save_emails
stats = mailbox.copy_emails(options['days'], options['local_folder'], options['wkhtmltopdf'])
File "/Users/henegger/git/imapbox/mailboxresource.py", line 36, in copy_emails
typ, data = self.mailbox.search(None, criterion)
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/imaplib.py", line 723, in search
typ, dat = self._simple_command(name, *criteria)
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/imaplib.py", line 1196, in _simple_command
return self._command_complete(name, self._command(name, *args))
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/imaplib.py", line 944, in _command
', '.join(Commands[name])))
imaplib.error: command SEARCH illegal in state AUTH, only allowed in states SELECTED

Any idea how to fix this?

imaplib.error: command SEARCH illegal in state AUTH

Hello. First of all, thank you for this app, it's a great idea. I want to download my Gmail archive so that I can delete my Google account. I set remote_folder to ALL and it started downloading a few folders but it fails when it gets to the main one called [Gmail]. So then I set remote_folder=[Gmail] and it does the same thing. Do you know what is causing this?

$ python imapbox.py 
account1/[Gmail] (on imap.gmail.com)
MailboxClient: Could not select remote folder '[Gmail]'
Traceback (most recent call last):
  File "imapbox.py", line 105, in <module>
    main()
  File "imapbox.py", line 102, in main
    save_emails(account, options)
  File "/usr/home/elliott/imapbox-master/mailboxresource.py", line 123, in save_emails
    stats = mailbox.copy_emails(options['days'], options['local_folder'], options['wkhtmltopdf'])
  File "/usr/home/elliott/imapbox-master/mailboxresource.py", line 44, in copy_emails
    typ, data = self.mailbox.search(None, criterion)
  File "/usr/local/lib/python3.7/imaplib.py", line 723, in search
    typ, dat = self._simple_command(name, *criteria)
  File "/usr/local/lib/python3.7/imaplib.py", line 1196, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/usr/local/lib/python3.7/imaplib.py", line 944, in _command
    ', '.join(Commands[name])))
imaplib.error: command SEARCH illegal in state AUTH, only allowed in states SELECTED
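
The traceback is consistent with the SELECT step having failed just before (see the "Could not select remote folder" line): `imaplib` reports a failed select through its return value, the connection stays in the AUTH state, and the following SEARCH is rejected. A hedged sketch of a guard, where `search_folder` is a hypothetical helper and `mailbox` is expected to be an `imaplib.IMAP4` or `imaplib.IMAP4_SSL` instance:

```python
def search_folder(mailbox, folder, criterion="ALL"):
    """Select folder and search it; return [] when it cannot be selected."""
    # Quote the name so folders containing spaces or brackets are passed
    # intact. Names containing a double quote would need extra escaping.
    typ, _data = mailbox.select('"{}"'.format(folder), readonly=True)
    if typ != "OK":
        # Container folders such as Gmail's "[Gmail]" are typically not
        # selectable, so skip them instead of searching.
        return []
    typ, data = mailbox.search(None, criterion)
    if typ != "OK" or not data or not data[0]:
        return []
    return data[0].split()  # list of message numbers as bytes
```

Skipping unselectable folders lets a run over all folders continue with the next one instead of stopping with an exception.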

improve message.html

Add a title with the email subject
Add metadata in the with author (sender), recipients and date

Can't backup anything with docker image

Hi,

I cloned the project using git clone https://github.com/polo2ro/imapbox.git into the folder /home/thomas/Projekte/imapbox.

Then I created the config file at /home/thomas/.config/imapbox/config.cfg:

[imapbox]
local_folder=/home/thomas/mailbackup
days=2

[GMXKonto]
host=imap.gmx.net
username=[email protected]
password=asdf
remote_folder=__ALL__
port=993

The docker-compose.yml file accordingly:

version: '3'
services:

  imapbox:
    image: mauricemueller/imapbox
    container_name: imapbox
    volumes:
      - imapbox_data:/var/imapbox
      # change the path to the config
      - /home/thomas/.config/imapbox/config.cfg:/etc/imapbox/config.cfg

volumes:
  imapbox_data:

After everything was set up, I created the base folder (mkdir /home/thomas/mailbackup).

I tried variations of the account name, but special chars do not seem to be the problem here. To run it using docker-compose I did exactly what is described in the readme file:

thomas@linux:~/Projekte/imapbox> docker-compose run --rm imapbox
Creating imapbox_imapbox_run ... done
GMXKonto/__ALL__ (on imap.gmx.net)
Traceback (most recent call last):
  File "/opt/bin/imapbox.py", line 98, in <module>
    main()
  File "/opt/bin/imapbox.py", line 91, in main
    print("Saving folder: " + folder_name[1])
IndexError: list index out of range

Also, the experiments with only the INBOX key in the config file throw errors:

thomas@linux:~/Projekte/imapbox> docker-compose run --rm imapbox
Creating imapbox_imapbox_run ... done
GMX Konto/INBOX (on imap.gmx.net)
/home/thomas/mailbackup/2021/5055368007.37200749.08360958643149newsletter.aljazeera.com
MailboxClient.saveEmail() failed
Expected object of type bytes or bytearray, got: <class 'str'>
12 emails created, 0 emails already exists

Even though I got success messages in every experiment, the folder was always empty:

thomas@linux:~/Projekte/imapbox> ls -lha /home/thomas/mailbackup/
insgesamt 0
drwxr-xr-x 1 thomas users   0 13. Feb 22:58 .
drwxr-xr-x 1 thomas users 912 13. Feb 22:58 ..

The Python version on the host system is Python 2.7.18. But, if I got it right, this doesn't matter, because everything is set up in the containers.
Host's os-release:

thomas@linux:~/Projekte/imapbox> cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20210210"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20210210"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20210210"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo"

The error for __ALL__ looks like an out-of-range list index in the console print statement. But I'm not familiar with Python; this is just a quick guess based on the error message.

Any help would be great :-)

Thanks in advance,

alpham8

TypeError: initial_value must be str or None, not bytes

Despite the declared python3 support, I was unable to run it on Ubuntu 17.10

  File "imapbox.py", line 97, in <module>
    main()
  File "imapbox.py", line 90, in main
    stats = mailbox.copy_emails(options['days'], options['local_folder'], options['wkhtmltopdf'])
  File "/home/imapbox/mailboxresource.py", line 39, in copy_emails
    if self.saveEmail(data):
  File "/home/imapbox/mailboxresource.py", line 72, in saveEmail
    msg = email.message_from_string(response_part[1])
  File "/usr/lib/python3.6/email/__init__.py", line 38, in message_from_string
    return Parser(*args, **kws).parsestr(s)
  File "/usr/lib/python3.6/email/parser.py", line 68, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
TypeError: initial_value must be str or None, not bytes

trying the same code with python2.7 leads to the desired result but from time to time this kind of error is popping out:

/mnt/volume-fra1-01/arc/2017/3128654e-fa30-40e8-823a-e6e745c7c350KAWGMEHUB01.gme.gbl
MailboxClient.saveEmail() failed
'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

which is probably related to default encoding in python2

I would gladly contribute with a PR but unfortunately, python is not my forte :(
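
On the Python 3 error above: under Python 3, `imaplib` returns message bodies as bytes, while `email.message_from_string` expects str. The standard library offers `email.message_from_bytes` for this case. A minimal sketch, where `parse_fetched` is a hypothetical helper operating on one `(envelope, payload)` tuple from a fetch response:

```python
import email


def parse_fetched(response_part):
    """Parse the payload of one fetch response tuple into a Message."""
    payload = response_part[1]
    if isinstance(payload, bytes):
        # Python 3: parse raw bytes directly, without guessing an encoding.
        return email.message_from_bytes(payload)
    return email.message_from_string(payload)


if __name__ == "__main__":
    part = (b"1 (RFC822 {20}", b"Subject: hi\r\n\r\nbody")
    print(parse_fetched(part)["Subject"])
```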

Can't connect via SSL, wrong version

Using the docker image with a config that has SSL set to true and the correct port, I get the following:

account1/__ALL__ (on imap.host)
Traceback (most recent call last):
  File "/opt/bin/imapbox.py", line 98, in <module>
    main()
  File "/opt/bin/imapbox.py", line 89, in main
    for folder_entry in get_folder_fist(account):
  File "/opt/bin/mailboxresource.py", line 115, in get_folder_fist
    mailbox = imaplib.IMAP4_SSL(account['host'], account['port'])
  File "/usr/local/lib/python3.7/imaplib.py", line 1288, in __init__
    IMAP4.__init__(self, host, port)
  File "/usr/local/lib/python3.7/imaplib.py", line 198, in __init__
    self.open(host, port)
  File "/usr/local/lib/python3.7/imaplib.py", line 1301, in open
    IMAP4.open(self, host, port)
  File "/usr/local/lib/python3.7/imaplib.py", line 299, in open
    self.sock = self._create_socket()
  File "/usr/local/lib/python3.7/imaplib.py", line 1293, in _create_socket
    server_hostname=self.host)
  File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1076)

Folder structure for accounts

Hello,

I am running this in a Docker container. I want to save multiple folders from a single account. As imapbox does not support multiple remote folders for one account, I just duplicated the account and used a different remote folder for each. Everything works, but...

The config is as follows

[imapbox]
local_folder=/var/imapbox
days=365
wkhtmltopdf=/opt/bin/wkhtmltopdf

[account1]
host=imap.example.com
username=[email protected]
password=password
remote_folder=INBOX

[account2]
host=imap.example.com
username=[email protected]
password=password
remote_folder=archive

[account3]
host=imap.example.com
username=[email protected]
password=password
remote_folder=sent

Imapbox runs and downloads all the mail, but the folder structure is as follows:

  • home
    • INBOX
      • 2023
      • 2022
      • archive
        • 2023
        • sent
          • 2023

Imapbox runs and writes the email, but then does not go back to home; it keeps going inside the account1 folder, and that just continues. I would expect to have the remote folders as subfolders of home. I bet that it is something simple, but I am not on that level yet. :)

I am not sure how to name this issue, so you might come up with something better than I did.

IndexError: list index out of range

Hello

I have tested the software; my problem:

account1/ALL (on imap.free.fr)
Traceback (most recent call last):
File "imapbox.py", line 98, in <module>
main()
File "imapbox.py", line 91, in main
print("Saving folder: " + folder_name[1])
IndexError: list index out of range

I have modified imapbox.py:
if account['remote_folder'] == "ALL":
    for folder_entry in get_folder_fist(account):
        folder_name = folder_entry.decode().split(' "." ')
        print(folder_name)  # line added for debugging
        print("Saving folder: " + folder_name[1])
        account['remote_folder'] = folder_name[1]
        save_emails(account, options)
else:
    save_emails(account, options)

the error is:
account1/ALL (on imap.free.fr)
['(\HasNoChildren) "/" "Apple"']
Traceback (most recent call last):
File "imapbox.py", line 99, in <module>
main()
File "imapbox.py", line 92, in main
print("Saving folder: " + folder_name[1])
IndexError: list index out of range

Could you help me?
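
The debug output above shows a LIST entry whose hierarchy delimiter is "/", while the code splits on ' "." ', so the split returns a single element and `folder_name[1]` raises IndexError. A sketch of a delimiter-independent parser follows; `LIST_LINE` and `folder_name` are hypothetical names, and folder names returned by the server as IMAP literals are not handled.

```python
import re

# (flags) "delimiter" name, where the delimiter may also be NIL
LIST_LINE = re.compile(r'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.*)')


def folder_name(list_entry):
    """Extract the folder name from one IMAP LIST response line."""
    text = list_entry.decode() if isinstance(list_entry, bytes) else list_entry
    match = LIST_LINE.match(text)
    if match is None:
        return None  # unexpected format; let the caller decide what to do
    return match.group("name").strip('"')


if __name__ == "__main__":
    print(folder_name(b'(\\HasNoChildren) "/" "Apple"'))
    print(folder_name(b'(\\HasNoChildren) "." "INBOX.Archiv"'))
```

Returning None for unparseable lines lets the caller skip the entry rather than crash on an index.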

SyntaxError: Missing parentheses in call to 'print'

Hi,

I get an error when launching the script.

$ python imapbox.py
Traceback (most recent call last):
  File "/opt/git/imapbox/imapbox.py", line 5, in <module>
    from mailboxresource import MailboxClient
  File "/opt/git/imapbox/mailboxresource.py", line 84
    print directory
                  ^
SyntaxError: Missing parentheses in call to 'print'
$ python --version
Python 3.6.1

Local Maildir as a target?

I don't believe there's a half-decent Maildir to HTML converter at the moment.

Would that be possible to add support for processing offline Maildirs?

OAuth2 Gmail Access

Greatly appreciate this code for enabling backup of mailboxes via IMAP.

I notice that Google is pushing to get rid of basic authentication this year 2024. I could only get this code to work by utilising app passwords.

However, it is possible Google may eventually sunset this too. Given Google's roadmap are there any plans to add OAuth to IMAPbox for connection to Google mailboxes?

Fetch more than one message at a time

While attempting to download a couple of thousand emails, I started wondering about the speed limitations of imapbox. After a brief investigation, I discovered that we are fetching only one message at a time.

@polo2ro did you try to download emails in bulk? I wonder if that solution could be faster.

See the documentation:

The message_set options to commands below is a string specifying one or more messages to be acted upon. It may be a simple message number ('1'), a range of message numbers ('2:4'), or a group of non-contiguous ranges separated by commas ('1:3,6:9'). A range can contain an asterisk to indicate an infinite upper bound ('3:*').
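
Building on the quoted documentation, message numbers returned by a search could be grouped into comma-separated sets and fetched in batches. This is only a sketch: `message_sets` and the batch size of 100 are assumptions, and whether batching is actually faster would need measuring.

```python
def message_sets(message_numbers, batch_size=100):
    """Group message numbers into comma-separated message_set strings."""
    numbers = [
        n.decode() if isinstance(n, bytes) else str(n)
        for n in message_numbers
    ]
    return [
        ",".join(numbers[i:i + batch_size])
        for i in range(0, len(numbers), batch_size)
    ]


if __name__ == "__main__":
    # With a selected imaplib mailbox, each set could be fetched in one call:
    #     typ, data = mailbox.fetch(message_set, "(RFC822)")
    for message_set in message_sets([b"1", b"2", b"3", b"4", b"5"], batch_size=2):
        print(message_set)
```

A batched fetch returns several messages in one response, so the calling code has to iterate over all tuples in the result instead of assuming a single message.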
