talon's Introduction

talon

Mailgun library to extract message quotations and signatures.

If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotations and signature parser should be like 😄

Usage

Here’s how you initialize the library and extract a reply from a text message:

import talon
from talon import quotations

talon.init()

text =  """Reply

-----Original Message-----

Quote"""

# generic entry point: pass the content type explicitly
reply = quotations.extract_from(text, 'text/plain')
# or call the plain-text function directly
reply = quotations.extract_from_plain(text)
# reply == "Reply"
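
To make the idea concrete, here is a minimal sketch of splitter-based reply extraction using only the standard library. It is not talon's implementation: `SPLITTER` and `extract_reply_sketch` are hypothetical names, and the single pattern covers only the "Original Message" style shown above.

```python
import re

# Hypothetical, simplified splitter: matches lines such as
# "-----Original Message-----" or "---- Reply Message ----".
SPLITTER = re.compile(
    r"^[ \t]*-+[ \t]*(Original|Reply) Message[ \t]*-+[ \t]*$",
    re.I | re.M,
)


def extract_reply_sketch(text):
    """Return the text before the first splitter line, or the whole text."""
    match = SPLITTER.search(text)
    if match is None:
        return text.strip()
    return text[:match.start()].strip()


text = """Reply

-----Original Message-----

Quote"""

reply = extract_reply_sketch(text)
```

A real parser needs many such patterns (one per mail client and language), which is why the issues further down keep proposing new splitters.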

To extract a reply from html:

html = """Reply
<blockquote>

  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;[email protected]&gt; wrote:
  </div>

  <div>
    Quote
  </div>

</blockquote>"""

# generic entry point: pass the content type explicitly
reply = quotations.extract_from(html, 'text/html')
# or call the HTML function directly
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"

Often the best way is the easiest one. Here’s how you can extract a signature from an email message without any fancy machine learning:

from talon.signature.bruteforce import extract_signature


message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
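
The delimiter-based approach can be sketched with the standard library alone. This is an illustration under assumptions, not the code in talon.signature.bruteforce: `extract_signature_sketch`, `max_lines` and `max_line_len` are hypothetical names, and the defaults (10 lines, 60 characters) are assumed values.

```python
import re

# A line made only of two or more dashes, optionally followed by spaces.
DELIMITER = re.compile(r"^-{2,}[ \t]*$")


def extract_signature_sketch(message, max_lines=10, max_line_len=60):
    """Split `message` into (text, signature) at the last delimiter line.

    Returns (message, None) when no plausible signature is found.
    """
    lines = message.splitlines()
    # Only inspect the tail of the message, scanning from the bottom up.
    start = max(0, len(lines) - max_lines)
    for i in range(len(lines) - 1, start - 1, -1):
        if DELIMITER.match(lines[i]):
            candidate = lines[i:]
            # Reject candidates with long lines; they are likely body text.
            if all(len(line) <= max_line_len for line in candidate):
                return "\n".join(lines[:i]).strip(), "\n".join(candidate).strip()
            break
    return message, None


message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature_sketch(message)
```

A fixed line-length cutoff keeps false positives down, but it also means a signature containing a single long line is rejected as a whole.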

It is quick and works like a charm 90% of the time. For the other 10% you can use the power of machine learning algorithms:

import talon
# don't forget to init the library first
# it loads machine learning classifiers
talon.init()

from talon import signature


message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# signature == "John Doe\nvia mobile"

For machine learning, talon currently uses the scikit-learn library to build SVM classifiers. The core of the machine learning algorithm lies in the talon.signature.learning package. It defines a set of features to apply to a message (featurespace.py), how data sets are built (dataset.py), and the classifier’s interface (classifier.py).

Currently the data used for training is taken from our personal email conversations and from the ENRON dataset. As a result of applying our set of features to the dataset, we provide the files classifier and train.data, which don’t contain any personal information but can be used to load the trained classifier. Those files should be regenerated every time the feature set or data set is changed.

To regenerate the model files, you can run

python train.py

or

from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
from talon.signature.learning.classifier import train, init
train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

Open-source Dataset

Recently we started a forge project to create an open-source, annotated dataset of raw emails. In the project we used a subset of ENRON data, cleansed of private, health and financial information by EDRM. At the moment over 190 emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to start using it for talon.

Training on your dataset

talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the forge project does. Then do:

from talon.signature.learning.dataset import build_extraction_dataset
from talon.signature.learning import classifier as c

build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")

Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).

Research

The library is inspired by the following research papers and projects:

talon's People

Contributors

ad-m, alexriina, ashishgandhi, cgc, conalsmith49, conapart3, dougkeen, ehsandarroudi, esetnik, ezrapagel, glaand, hnx116, horkhe, ivuk, jeremyschlatter, kevincathcart, mattdietz, obukhov-sergey, pborreli, savageman, scottmac, simonflore, szymonsobczak, tgwizard, thrawn01, tsheasha, umairwaheed, wdeeke, yfilali, yoks

talon's Issues

Open source the talon demo app?

Would it be possible to publish the code used for the talon demo app? http://talon.mailgun.net/

It looks like it's hosted on Heroku, which perfectly suits my needs (I want to access it as an external service rather than integrate the library into my app).

Pip install fails

When I install talon via pip from PyPI, it installs fine, but PyML is not installed along with it, and installing PyML separately also fails.
So I decided to install the latest version of talon from GitHub, and here is what I got:
https://gist.github.com/EugeneFeshchenko/6335f648838842209a00
I was trying to install it in a newly created env.

Am I doing something wrong, or does the setup file need fixing?

Also, in general, the installation process is not very obvious:
When you install from PyPI, you have to install PyML separately.
When you install from GitHub, PyML is installed as well.
I didn't find any note that numpy must be preinstalled in order to install PyML, but its docs say numpy is required.

Fails to find signatures with 61+ characters.

The first test email I pasted into http://talon.mailgun.net/ had a >60 character signature and it failed to find it. 60 characters works fine. Here's a simplified length checker test case suitable for inclusion into bruteforce_test.py:

def test_signature_lengths():
    for n in range(80):
        content = "CONTENT"
        sig = "\n-- \n" + "."*n
        msg_body = content + sig
        eq_((n, (content.strip(), sig.strip())),
                (n, bruteforce.extract_signature(msg_body)))

Better handling of Date: From: block

Usually "Date:" and "From:" indicate quotations but sometimes looking for them leads to false positives (e.g. when invitation details are provided with "Date:") or false negatives (when "Subject:" goes first).

Dutch Apple mail splitter

Hi all,

After a request to support, I got this link to submit a new Dutch splitter format. I hope this improves this great library!

"Op 14 jan. 2015, om 13:52 heeft ... het volgende geschreven:"
Op <:date>. om heeft het volgende geschreven:

How should I proceed?
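
One way to proceed is to express the format as a regular expression that could sit next to the other entries in SPLITTER_PATTERNS. The following is only a sketch: `RE_DUTCH_APPLE_MAIL` is a hypothetical name, and the exact date layout (day, abbreviated month, year, time) is assumed from the single example above.

```python
import re

# Hypothetical pattern for lines such as:
# "Op 14 jan. 2015, om 13:52 heeft <sender> het volgende geschreven:"
RE_DUTCH_APPLE_MAIL = re.compile(
    r"Op \d{1,2} \S+ \d{4},? om \d{1,2}:\d{2} heeft .+? het volgende geschreven:",
    re.I,
)

line = "Op 14 jan. 2015, om 13:52 heeft ... het volgende geschreven:"
is_splitter = RE_DUTCH_APPLE_MAIL.search(line) is not None
```

A matching test case with a real Dutch Apple Mail reply would show whether the assumed date layout holds for other months and locales.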

Issue with stripped_text when replying from Hotmail

I have found that if I send an email from mailgun to a hotmail account and I want to reply to that, the body looks like this:

My reply message...
<div><hr id="stopSpelling">
Subject: Sending a message from Mailgun's API<br>
From: [email protected]<br>
To: [email protected]<br>
Date: Thu, 18 Jun 2015 07:45:36 +0000<br>
etc.....

The stripped_text now is "My reply message... Subject: Sending a message from Mailgun's API" but it should be "My reply message..."

As I understand it, when extract_from_html runs, the cut_from_block rule is applied, which checks for "From" and "Date". But in this case "Subject" comes before both "From" and "Date", so the block slips past the check and doesn't get stripped.

This issue appears only when I send from Mailgun to Hotmail.

Parse email disclaimer

Sometimes emails have disclaimers. Disclaimer lines are generally longer than 60 characters, so the signature won't be extracted.

Error Initializing Talon

I installed talon using pip on Python 2.7. I'm trying to run the example, but I get an error as soon as talon.init() is called. I can't figure out what parser it is colliding with. Any ideas what is wrong? I didn't receive any errors when installing talon with pip.

Here's the code, error, and the other files I see on my system called parser.py.

Parser.py files:

vagrant@vagrant-ubuntu-trusty-64:/$ sudo find . -name "parser.py"
./usr/lib/python3.4/html/parser.py
./usr/lib/python3.4/email/parser.py
./usr/lib/python2.7/email/parser.py
./usr/lib/python2.7/dist-packages/jinja2/parser.py
./usr/lib/python2.7/dist-packages/yaml/parser.py
./usr/lib/python2.7/dist-packages/dateutil/parser.py
./usr/lib/python3/dist-packages/ufw/parser.py
./usr/local/lib/python2.7/dist-packages/cssselect/parser.py

Code:

import talon
from talon import quotations

talon.init()

text =  """Reply

-----Original Message-----

Quote"""

reply = quotations.extract_from(text, 'text/plain')
reply = quotations.extract_from_plain(text)
print(reply)

Error

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    talon.init()
  File "/usr/local/lib/python2.7/dist-packages/talon/__init__.py", line 7, in init
    signature.initialize()
  File "/usr/local/lib/python2.7/dist-packages/talon/signature/__init__.py", line 38, in initialize
    EXTRACTOR_DATA)
  File "/usr/local/lib/python2.7/dist-packages/talon/signature/learning/classifier.py", line 31, in load
    return joblib.load(saved_classifier_filename)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 425, in load
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 291, in load_build
    array = nd_array_wrapper.read(self)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 113, in read
    mmap_mode=unpickler.mmap_mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 394, in load
    return format.read_array(fid)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 437, in read_array
    shape, fortran_order, dtype = read_array_header_1_0(fp)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 334, in read_array_header_1_0
    d = safe_eval(header)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/utils.py", line 1128, in safe_eval
    ast = compiler.parse(source, mode="eval")
  File "/usr/lib/python2.7/compiler/transformer.py", line 53, in parse
    return Transformer().parseexpr(buf)
  File "/usr/lib/python2.7/compiler/transformer.py", line 132, in parseexpr
    return self.transform(parser.expr(text))
AttributeError: 'module' object has no attribute 'expr'

emails/ folder not included for training

Hi,

It seems like the raw emails you used for the ML training are not included in the repo. I'd like to train the AI on my own emails, can you tell me what's the right format to use?

Handle gmail_quote div in emails created by clicking the Reply button in gmail

If you click "reply" in gmail, and then change the subject line and undo the quotation in the WYSIWYG editor, you get something that looks a lot like a fresh email that you just typed yourself. If you send the email to another gmail user, it will render for them the way you would expect if you composed the email yourself from scratch.

But gmail wraps the content in a <div class="gmail_quote">, which gets stripped by quotations.extract_from_html. Ideally, talon would know when this was the case and not treat the div as a quote. This seems hard, though. It's not clear to me how to identify this case.

Talon does not detect signatures with this common format

The signature format:

Hi Mailgun,

Please fix the parsing logic so that it detects and strips signatures such as the one on this email.

Regards


David Perks
Managing Director
email: [email protected]<mailto:[email protected]> | mobile: 0424 282 465 | office: 1300 N REACH (1300 6 73224)
twitter: @withinreachsw | web: www.withinreach.com.au<http://www.withinreach.com.au/> | www.goalhuddle.com<http://www.goalhuddle.com/>

Within Reach Software Pty Ltd, Suite 102, 21 Berry St, North Sydney, NSW 2060

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. Please note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Finally, the recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email

Parsing of gmail forwarded messages

Forwarded email content sent from Gmail is pruned by Talon. Normally when you forward a message you need to preserve it.

A potential improvement could detect the ------Forwarded message------ text inside the first child node of the <div class="gmail_quote"> element and output the original message if gmail_quote div is not deeply nested.

Thanks,
Sokratis

Mailgun Talon: Signature extraction example throwing error

Crossposting with StackOverflow (http://stackoverflow.com/questions/25639506/mailgun-talon-signature-extraction-example-throwing-error)

I installed mailgun/talon on GCE and was trying out the example in the README section, but it threw the following error at me:

from talon import signature
message = """Thanks Sasha, I can't go any higher and is why I limited it to the
... homepage.
...
... John Doe
... via mobile"""
message
"Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage.\n\nJohn Doe\nvia mobile"
text,signtr = signature.extract(message, sender='[email protected]')
ERROR:talon.signature.extraction:ERROR when extracting signature with classifiers
Traceback (most recent call last):
File "talon/signature/extraction.py", line 57, in extract
markers = _mark_lines(lines, sender)
File "talon/signature/extraction.py", line 99, in _mark_lines
elif is_signature_line(line, sender, EXTRACTOR):
File "talon/signature/extraction.py", line 40, in is_signature_line
return classifier.decisionFunc(data, 0) > 0
AttributeError: 'NoneType' object has no attribute 'decisionFunc'

Do I need to train the model somehow (this signature seems to be the ML example)? I installed it using pip.

Improve german formats

Hi guys, I just got the link to this repo from your support. It's awesome that this is open source and I would like to help improve the german formats.

I just need some help to get started. E.g. I would like to respect this german Windows Outlook split pattern:

-----Ursprüngliche Nachricht-----

As you can see, this includes the special character ü, which is represented as =C3=BC (quoted-printable encoding, I guess). So what do I have to do? Something like this in quotations.py?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    re.compile("[\s]*[-]+[ ]*(Urspr=C3=BCngliche|Antwort) Nachricht[ ]*[-]+", re.I),
    ...

Is this correct? I'll try to add more cases that do not work at the moment (I tested a few, which led me to your support :) and then open a pull request. Also, how can I run the tests, and should I create tests for each new case? Thanks for your advice.
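
For reference, =C3=BC is how the quoted-printable transfer encoding represents the two UTF-8 bytes of ü. The sketch below assumes the message body is decoded before it reaches the splitter patterns, in which case the pattern can contain the literal character; `RE_GERMAN_SPLITTER` is a hypothetical name, not an existing talon constant.

```python
import quopri
import re

# Raw body line as it appears on the wire (quoted-printable, UTF-8).
raw = b"-----Urspr=C3=BCngliche Nachricht-----"
decoded = quopri.decodestring(raw).decode("utf-8")

# Hypothetical splitter written against the decoded text.
RE_GERMAN_SPLITTER = re.compile(
    r"[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I
)

matches = RE_GERMAN_SPLITTER.match(decoded) is not None
```

If the patterns were applied to the still-encoded body instead, the =C3=BC form would be needed, so it is worth checking which form the library actually sees.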

Date time appointment data is stripped off

Consider the following example:

Hi Xxx,

Good day to you. I have already scheduled an appointment with Dr. 
Xxx Xxx for you to pick up your xxx. Please see 
appointment details below:

Date: Monday, 8/25/2014
Time: 9:15 AM

If you have any questions, please let me know and I will be happy to help.

Have a great weekend!

Best,

Xxx X.
Perssist Assistant
www.example.com

The "Date: ..." part and the rest of the email is stripped off because it's mistaken for quotations.
In talon/talon/quotations.py the following splitter pattern is checked:

SPLITTER_PATTERNS = [
...
    re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
...
    ]

The pattern should be adjusted to check for both "From: ..." and "Date: ...". A possible implementation could be similar to the RE_ON_DATE_SMB_WROTE pattern; also see mark_message_lines() for how multi-line patterns are handled.
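
The multi-line check can be sketched as follows. This is an illustration, not a patch for quotations.py: `RE_HEADER`, `find_header_block` and `min_fields` are hypothetical names, and the rule (at least two distinct header names on consecutive lines, one of them "From") is an assumption about what would separate quoted headers from ordinary text.

```python
import re

# Header-like lines that mail clients put on top of a quoted message.
RE_HEADER = re.compile(r"^[*]?(From|Date|Sent|To|Subject):[*]? .+", re.I)


def find_header_block(lines, min_fields=2):
    """Return the index where a quoted-header block starts, or None.

    A block counts only if it has at least `min_fields` distinct header
    names on consecutive lines and one of them is "From".
    """
    i = 0
    while i < len(lines):
        fields = []
        j = i
        # Collect the run of consecutive header-like lines starting at i.
        while j < len(lines):
            match = RE_HEADER.match(lines[j].strip())
            if match is None:
                break
            fields.append(match.group(1).lower())
            j += 1
        if len(set(fields)) >= min_fields and "from" in fields:
            return i
        i = max(j, i + 1)
    return None


# Placeholder data modelled on the reports; the addresses are made up.
appointment = ["Please see appointment details below:", "",
               "Date: Monday, 8/25/2014", "Time: 9:15 AM"]
hotmail = ["My reply message...",
           "Subject: Sending a message from Mailgun's API",
           "From: sender@example.com", "To: me@example.com",
           "Date: Thu, 18 Jun 2015 07:45:36 +0000"]
```

With this rule the appointment text is left alone, while a header block that starts with "Subject:" is still recognized as the beginning of a quotation.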

Extract both msg and text from text message

I sent a plain-text message from Gmail and I am unable to extract the signature from it.

How can I do it? extract_signature doesn't work for plain messages.

In [34]: msg.text
Out[34]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>'

In [35]: quotations.extract_from(msg.text, 'text/plain')
Out[35]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:'

In [36]: extract_signature(msg.text)
Out[36]: 
(u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>',
 None)

I want to receive a result like:

In [40]: msg.text.replace(quotations.extract_from(msg.text, 'text/plain'),'')
Out[40]: u'\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>

JSON for Linking Data

Hey ho!

Thanks for creating this awesome library!

I was wondering if you have any thoughts or plans on supporting json for linking data. It is used to power gmail actions (see https://developers.google.com/gmail/actions/actions/actions-overview) and it generally seems like a great thing to have anyway.

I do not mind spending the time doing the development work to get this in but was wondering if you would accept a PR if I do this. Also, is there any specific way it should be done?

Detect signatures with long lines

Generally signatures have short lines - no more than 60 characters but there is also a class of signatures that have long lines with long URLs, etc.

Example:

Some text

-- 


John Smith
Co-Founder and CEO
Xxxxxxxxx

mobile: 555.115.4274 | book a mtg
<http://example.com/soooooome/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/path?t=looooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | @handle
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaaaaaaaath?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | linkedin
<http://example.com/loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | video
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaath?t=loooooooooooooooong-parameeeeeeeeeeeeeeeeeeeeeeeeter-query-string>

Currently talon doesn't parse signatures like this.

`extract_from_html` breaks when xhtml encoding specified in document

To reproduce, add the following line to the top of tests/fixtures/html_replies/hotmail.html:

<?xml version="1.0" encoding="UTF-8"?>

and run nosetests tests/html_quotations_test.py:test_hotmail_reply. lxml cannot parse html documents from a unicode string when the encoding is declared in the document.
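
Two common workarounds are to hand lxml bytes instead of a unicode string, or to drop the declaration before parsing. The second option is sketched below with the standard library only; `strip_xml_declaration` is a hypothetical helper, and where it would be called inside talon is left open.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
RE_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*", re.I)


def strip_xml_declaration(html):
    """Drop a leading XML declaration from an already-decoded string."""
    return RE_XML_DECLARATION.sub("", html, count=1)


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>Reply</body></html>'
cleaned = strip_xml_declaration(doc)
```

Since the string is already decoded at this point, the declared encoding carries no information the parser still needs.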

Failing Testcase

Just in case it's helpful, we developed this test data based on a signature we received that wasn't parsed correctly. Here's the data I pasted at http://talon.mailgun.net/ to reproduce:

From: <[email protected]>
To: <[email protected]>
Subject: Re: [SPF] Still trying to figure out your signature
Date: Tue, 16 Dec 2014 16:46:22 +0000
Message-Id: <D0B5BDE2.882AA%[email protected]>
Accept-Language: en-US, ja-JP
Content-Language: en-US
User-Agent: Microsoft-MacOutlook/14.4.6.141106
Content-Type: multipart/alternative; boundary="_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_"
Mime-Version: 1.0
Sender: [email protected]

--_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Sure thing - happy to help! Hope you can get it sorted out :)
-----
Testbot Maton
Manager, Communications Platforms
Company Liner

Phone: +1 555.555.5555
Mobile: +1 555.555.5555

[email protected]<mailto:[email protected]>
http://companysdomain.com<http://companysdomain.com/>

PyML failed while installing talon via pip (Ubuntu)

A few questions came up here:
Is there a reason PyML was used rather than scikit-learn, which would naturally ease deployment?
How does signature extraction work across a reply chain with multiple messages?

sh: 0: getcwd() failed: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
                             Dload  Upload   Total   Spent    Left  Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9/data: Cannot mkdir: No such file or directory

Improve Date: splitter

Lines like:

Date: ....
From: ...

usually indicate quotations. But sometimes "Date:" can be part of the text. Parsing could be improved by, for example, requiring several header lines to be present, such as both a "Date:" line and a "From:" line.

nested gmail_quote

Gmail started to nest "gmail_quote" tags:

<div class="gmail_quote">
On <date> [email protected] wrote:
  <blockquote class="gmail_quote">
    Original message
  </blockquote>
</div>

talon removes the nested tag with the quoted message but leaves the outer tag with the quotation splitter in it.
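
The intended behaviour, cutting at the outermost gmail_quote element rather than the nested one, can be sketched with the standard library's html.parser. This is not how talon processes HTML; `GmailQuoteStripper` is a hypothetical class that assumes well-formed markup and drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class GmailQuoteStripper(HTMLParser):
    """Re-emit HTML while skipping the outermost gmail_quote element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a gmail_quote element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif "gmail_quote" in classes:
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


html_doc = ('<p>Reply</p><div class="gmail_quote">On Monday someone wrote:'
            '<blockquote class="gmail_quote">Original message</blockquote></div>')
stripper = GmailQuoteStripper()
stripper.feed(html_doc)
stripper.close()
result = "".join(stripper.out)
```

Because the depth counter starts at the first gmail_quote element it meets, the splitter line and the nested blockquote are removed together.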

Update regex to 2014.12.24

Hello!

I tried installing talon in Python 3 today and ran into the following issue installing regex:

Downloading/unpacking regex==0.1.20110315
  Downloading regex-0.1.20110315.tar.gz (948kB): 948kB downloaded
  Running setup.py (path: .../regex/setup.py) egg_info for package regex
    No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
    Complete output from command python setup.py egg_info:
    No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)

[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]

----------------------------------------
Cleaning up...

I don't have this problem installing regex==2014.12.24. Could you please consider updating the dependency version?

I would be happy to bump the version, ensure tests pass, and open a PR. Thanks!

Arabic quotation splitter

Here's a new quotation splitter example:

في ١٨‏/٠٨‏/٢٠١٤، الساعة ٢:٣٣ م، كتب XXX <[email protected]>:

or:

\u202b\u0641\u064a \u0661\u0668\u200f/\u0660\u0668\u200f/\u0662\u0660\u0661\u0664\u060c \u0627\u0644\u0633\u0627\u0639\u0629 \u0662:\u0663\u0663 \u0645\u060c \u0643\u062a\u0628 XXX <[email protected]>:\u202c

Here's a translation to English:

On 08.18.2014, at 14:33, wrote XXX <[email protected]>:

Install via pip fails due to missing README.rst file

In a fresh environment:

~ > pip install talon
Downloading/unpacking talon
  Downloading talon-1.0.tar.gz
  Running setup.py (path:/private/tmp/pip_build_root/talon/setup.py) egg_info for package talon
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>
        long_description=open("README.rst").read(),
    IOError: [Errno 2] No such file or directory: 'README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>

    long_description=open("README.rst").read(),

IOError: [Errno 2] No such file or directory: 'README.rst'

Adjust quotations pattern

Here's a sample:

                                Second from gmail

2014-10-17 11:28 GMT+03:00 Postmaster <
[email protected]>:

> First from site
>

The quotation SPLITTER_PATTERNS in talon.quotations need to be adjusted. In this particular case no new pattern is needed; instead, one of the following needs to be fixed:

re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE)
re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
               '( \S+){3,6}@\S+:')

Consider merging them into one.

No extraction of forwarded part

If I read the code correctly, it is intended that

reply = quotations.extract_from_plain(body)

does not extract forwarded messages?

So something like:

test

----- Forwarded Message -----
From: "test" [email protected]
Sent: Thursday, January 28, 2016 4:31:32 PM
Subject: test

testmail

will be ignored and returned as is, as part of the reply?

Would it be possible to extract forwarded messages just like replies?

So either the forwarded part would be chopped off by quotations.extract_from_...(body), or there would be a dedicated function to run against the body when the reply equals the body?
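
The dedicated-function variant could look roughly like this. It is a sketch under assumptions: `RE_FORWARDED` and `split_forwarded` are hypothetical names that do not exist in talon, the marker pattern covers only the English "Forwarded message" line, and the address in the sample is a placeholder.

```python
import re

# Lines such as "----- Forwarded Message -----" or
# "---------- Forwarded message ----------".
RE_FORWARDED = re.compile(
    r"^[ \t]*-+[ \t]*Forwarded message[ \t]*-+[ \t]*$",
    re.I | re.M,
)


def split_forwarded(body):
    """Return (note, forwarded); `forwarded` is None if no marker is found."""
    match = RE_FORWARDED.search(body)
    if match is None:
        return body, None
    return body[:match.start()].strip(), body[match.end():].strip()


body = """test

----- Forwarded Message -----
From: "test" sender@example.com
Sent: Thursday, January 28, 2016 4:31:32 PM
Subject: test

testmail"""

note, forwarded = split_forwarded(body)
```

Keeping this separate from reply extraction lets the caller decide whether a forwarded message should be preserved or cut.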

Not extracting mail signatures for minor changes in email/signature.

The following code, taken from the example, works and returns the signature as expected.

message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')

But when I make a minor edit, such as changing the sender's name and email address, the signature is returned as None. The following values were passed to signature.extract():

message = """
Hello ,
Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

Sam John
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')

The signature was None for most of the messages I tried.

Zimbra HTML parsing

Run this through http://talon.mailgun.net/

It only strips the reply headers and misses the actual quote in the HTML part.
The text/plain extraction works fine.

Date: Thu, 4 Feb 2016 16:56:47 +0100 (CET)
From: [email protected]
To: [email protected]
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
Subject: Re: Lorem Ipsum
MIME-Version: 1.0
Content-Type: multipart/alternative; 
    boundary="----=_Part_35_1109890054.1454601407386"
X-Originating-IP: [1.1.1.1]
X-Mailer: Zimbra 8.6.0_GA_1153 (ZimbraWebClient - FF44 (Win)/8.6.0_GA_1153)
Thread-Topic: Lorem Ipsum
Thread-Index: ddFMd6wnxYPGpAbdA2oNKj8MgU0bH6/lWgJ/

------=_Part_35_1109890054.1454601407386
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 


From: [email protected] 
To: "admin" <[email protected]> 
Sent: Thursday, February 4, 2016 4:56:33 PM 
Subject: Lorem Ipsum 

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 


------=_Part_35_1109890054.1454601407386
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<html><body><div style=3D"font-family: arial, helvetica, sans-serif; font-s=
ize: 12pt; color: #000000"><div>Lorem ipsum dolor sit amet, consectetur adi=
piscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna al=
iqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris ni=
si ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehender=
it in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteu=
r sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt =
mollit anim id est laborum.</div><div><br></div><hr id=3D"zwchr" data-marke=
r=3D"__DIVIDER__"><div data-marker=3D"__HEADERS__"><b>From: </b>admin@mymon=
eyex.com<br><b>To: </b>"admin" &lt;[email protected]&gt;<br><b>Sent: </b>T=
hursday, February 4, 2016 4:56:33 PM<br><b>Subject: </b>Lorem Ipsum<br></di=
v><br><div data-marker=3D"__QUOTED_TEXT__"><div style=3D"font-family: arial=
, helvetica, sans-serif; font-size: 12pt; color: #000000"><div>Lorem ipsum =
dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididu=
nt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud =
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis =
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu =
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt=
 in culpa qui officia deserunt mollit anim id est laborum.</div></div><br><=
/div></div></body></html>
------=_Part_35_1109890054.1454601407386--

Signature Detection in HTML Emails

I haven't had a lot of success parsing signatures out of text/html emails. It seems to work pretty well for text/plain emails. Is there a good strategy to parse out the signature for text/html emails?

Thanks,
Pete

Improve handling dashes

Dashes can be mistaken for signature separators, e.g.:

Hi,

some item list:
- item 1
- item 2
item 2 continued
- item3
some text

Because "item3" is separated from the rest of the list (by the "item 2 continued" line), it is mistaken for a signature.
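One possible heuristic is sketched below. It assumes that a signature separator is a line made up of nothing but dashes (such as `--`), while a list item always has text after the dash. The name `is_signature_separator` is hypothetical and is not part of talon.

```python
import re

# Assumption: a separator line holds only two or more dashes,
# optionally surrounded by whitespace, e.g. "--" or "-- ".
SEPARATOR_RE = re.compile(r"^\s*-{2,}\s*$")


def is_signature_separator(line):
    """Return True only for lines that consist solely of dashes."""
    return bool(SEPARATOR_RE.match(line))


lines = ["- item 1", "- item 2", "item 2 continued", "- item3", "--"]
flags = [is_signature_separator(line) for line in lines]
```

With this check the list items from the example are never treated as separators, at the cost of missing signatures that start with a single dash.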

One div wraps quotation and reply

Consider email html like this:

<div>
<p>Reply</p>
<p>
  <b>From:</b>[email protected]
  ...
</p>
<p>Original email</p>
</div>

talon detects the From: ... block, goes up till it finds a "wrapping div" and cuts off the whole message!
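A minimal sketch of a narrower cut: instead of removing the wrapping div, remove only the child that holds the From: header and the siblings after it. This is an illustration, not talon's implementation. It uses the standard library's `xml.etree.ElementTree` on a well-formed snippet as a stand-in for the lxml tree, `cut_from_header` is a hypothetical helper, and `sender@example.com` is a placeholder address.

```python
import xml.etree.ElementTree as ET

# Placeholder markup modelled on the example above.
snippet = """<div>
<p>Reply</p>
<p><b>From:</b>sender@example.com</p>
<p>Original email</p>
</div>"""


def cut_from_header(wrapper):
    """Drop the child holding the From: header and every sibling after it."""
    children = list(wrapper)
    for index, child in enumerate(children):
        if "From:" in "".join(child.itertext()):
            for quoted in children[index:]:
                wrapper.remove(quoted)
            break
    return wrapper


root = cut_from_header(ET.fromstring(snippet))
remaining = [p.text for p in root]
```

Only the reply paragraph is left in `root`; the wrapping div itself survives.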

More granular message / signature parts classification

The algorithm would be more accurate if the classification were more detailed. For example, if it classified the message greeting, a sanity check could enforce that a signature can't start right after a greeting.

Similarly, if a closing phrase (e.g. "Kind Regards,", "Thanks,", etc.) is detected among the signature candidate lines, we could say for sure that it should be the first signature line.

Another example - disclaimers. Some messages have them and that totally breaks signature detection logic.
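The closing-phrase idea could be sketched roughly as follows. The phrase list is a small hand-picked assumption, and `signature_start` is a hypothetical helper rather than anything talon provides.

```python
import re

# Assumption: a short, hand-picked list of closing phrases; a real
# classifier would need a much larger, language-aware vocabulary.
CLOSING_RE = re.compile(
    r"^\s*(kind regards|best regards|regards|thanks|cheers)\s*[,!.]?\s*$",
    re.IGNORECASE,
)


def signature_start(lines):
    """Return the index of the last closing-phrase line, or None."""
    for index in range(len(lines) - 1, -1, -1):
        if CLOSING_RE.match(lines[index]):
            return index
    return None


body = ["Hi Bob,", "See you at noon.", "Kind Regards,", "Alice"]
start = signature_start(body)
```

Scanning from the bottom keeps a "Thanks," in the middle of a long message from winning over a closing phrase near the end.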

Numpy dep is obsolete

Talon requires Numpy 1.6.1, while the latest version is 1.9.2. Version 1.6.1 doesn't install on Mac OS X, failing with this error: clang: error: invalid argument '-faltivec' only allowed with 'ppc/ppc64/ppc64le'

fielding of emails?

After extracting a signature block, it would be nice to break the signature into fields, such as name, title, org, mobile-phone, work-phone, home-phone. Have you thought about building that?
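A rough sketch of what such fielding might look like for the phone fields. It assumes phone lines have the form "label: number" and that the first other non-empty line is the name; real signatures vary far more than this. `split_fields` is a hypothetical helper, not a talon function.

```python
import re

# Assumption: phone lines look like "Mobile: +1 555 0100".
PHONE_RE = re.compile(
    r"^\s*(mobile|work|home)\s*[:.]?\s*(\+?[\d\s().-]{7,})$",
    re.IGNORECASE,
)


def split_fields(signature_lines):
    """Map labelled phone lines to fields; treat the first other line as the name."""
    fields = {}
    for line in signature_lines:
        match = PHONE_RE.match(line)
        if match:
            fields[match.group(1).lower() + "-phone"] = match.group(2).strip()
        elif "name" not in fields and line.strip():
            fields["name"] = line.strip()
    return fields


fields = split_fields(["John Doe", "Mobile: +1 555 0100", "Work: 555-0101"])
```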

C port

Can we possibly run talon in C?

Detect disclaimers

Many messages have disclaimers at the bottom. Signature detection doesn't work unless you strip disclaimers first.
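Stripping a trailing disclaimer before signature detection could be sketched like this. The keyword list and the ten-line window are assumptions made for illustration, and `strip_disclaimer` is a hypothetical helper that talon does not ship.

```python
import re

# Assumption: disclaimers begin with a line containing one of these phrases.
DISCLAIMER_RE = re.compile(
    r"\b(confidential|disclaimer|intended recipient)\b", re.IGNORECASE
)


def strip_disclaimer(text, tail=10):
    """Cut the message at the first disclaimer-like line in its last `tail` lines."""
    lines = text.splitlines()
    first = max(0, len(lines) - tail)
    for index in range(first, len(lines)):
        if DISCLAIMER_RE.search(lines[index]):
            return "\n".join(lines[:index]).rstrip()
    return text


message = (
    "Thanks!\n--\nBob\n\n"
    "This email is confidential and intended for the recipient only."
)
cleaned = strip_disclaimer(message)
```

The cleaned text can then be passed to a signature extractor such as `extract_signature` from `talon.signature.bruteforce`.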

HtmlEntity(13)

A carriage return ("\r") is replaced with "&#13;" when HTML quotations are extracted. This happens when deepcopy is applied to the HTML tree.

To reproduce:

>>> from copy import deepcopy
>>> from lxml import html
>>>
>>> html_tree = html.document_fromstring("<html>\r\n</html>")
>>> html_tree_copy = deepcopy(html_tree)
>>> html.tostring(html_tree)
'<html>\r\n</html>'
>>> html.tostring(html_tree_copy)
'<html>&#13;\n</html>'
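One possible workaround, sketched under the assumption that the carriage returns carry no meaning for the message, is to normalize line endings before the markup is parsed. `normalize_newlines` is a hypothetical helper; this sidesteps the entity rather than fixing the deepcopy behaviour itself.

```python
def normalize_newlines(markup):
    """Convert Windows (CRLF) and old Mac (CR) line endings to LF."""
    return markup.replace("\r\n", "\n").replace("\r", "\n")


normalized = normalize_newlines("<html>\r\n</html>")
```

With no "\r" left in the input, there is nothing for the copy to serialize as "&#13;".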
