Comments (19)

jordan-wright commented on June 29, 2024

Interesting. And you're sure you're using the latest version of the library? I'll see if I can track this down later.

Thanks for the report!

from email.

jordan-wright commented on June 29, 2024

You get the most interesting emails 😄

That's a good thing, because I want this to be a comprehensive library that just works. In this case, the root cause is that the first subpart (Content-Transfer-Encoding: quoted-printable) has no Content-Type specified.

Now, I figured this would be caught by the checks I put in place before, but it doesn't look like that's happening. I'll see if I can get that fixed.

My question to you is - what would you like the email.Email struct to look like with this message? I see two what I would consider "text" parts, and one PGP signature - not sure where to put that. What would be the best way to handle these two text parts? One of them has to go in Email.Text, but I'm not sure which it should be. Any suggestions?

kalbasit commented on June 29, 2024

You get the most interesting emails 😄
lol! Well, at some point I want to port notmuchmail to Go so I am using their test database to drive mine. Gmuch is an HTTP/RPC API on top of Notmuch but at some point I will write my own Notmuch clone. The email database is found here: https://github.com/notmuch/notmuch/blob/master/test/test-databases/Makefile.local#L3 (follow the Makefile).

For quoted printable, perhaps look at https://godoc.org/mime/quotedprintable

My question to you is - what would you like the email.Email struct to look like with this message? I see two what I would consider "text" parts, and one PGP signature - not sure where to put that. What would be the best way to handle these two text parts? One of them has to go in Email.Text, but I'm not sure which it should be. Any suggestions?

I don't feel that this library should handle GPG itself, but it should at least expose the GPG parts. So perhaps add two new fields, Email.GPGSignature and Email.GPGEncrypted, so we can deal with them somewhere else.

jordan-wright commented on June 29, 2024

The email database is found here: https://github.com/notmuch/notmuch/blob/master/test/test-databases/Makefile.local#L3 (follow the Makefile)

Thanks! I'll see what I can do with this.

For quoted printable, perhaps look at https://godoc.org/mime/quotedprintable

The quoted printable part isn't my worry here. By reading a MIME part using a multipart.NewReader, we get the benefit of the automatic quoted-printable decoding introduced in Go 1.5. My question was really about which of the two text parts you felt should be the contents of Email.Text. The texts are really two separate parts with unique content (even though one looks like a signature), so it'd be hard to determine if one is the primary text content over the other.

I don't feel that this library should handle GPG itself, but it should at least expose the GPG parts. So perhaps add two new fields, Email.GPGSignature and Email.GPGEncrypted, so we can deal with them somewhere else.
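To illustrate the automatic decoding mentioned above, here is a minimal sketch using only the standard library. The helper name decodeFirstPart and the sample boundary "b" are made up for this example and are not part of the email library.

```go
package main

import (
	"fmt"
	"io"
	"mime/multipart"
	"strings"
)

// decodeFirstPart (hypothetical helper) reads the first part of a multipart
// body. It returns the part's content and whatever Content-Transfer-Encoding
// header is still visible on the part after multipart.Reader has processed it.
func decodeFirstPart(body, boundary string) (string, string, error) {
	mr := multipart.NewReader(strings.NewReader(body), boundary)
	p, err := mr.NextPart()
	if err != nil {
		return "", "", err
	}
	b, err := io.ReadAll(p)
	if err != nil {
		return "", "", err
	}
	return string(b), p.Header.Get("Content-Transfer-Encoding"), nil
}

func main() {
	// A part with a quoted-printable body and, like the message in this
	// thread, no Content-Type header.
	body := "--b\r\n" +
		"Content-Transfer-Encoding: quoted-printable\r\n" +
		"\r\n" +
		"1 + 1 =3D 2\r\n" +
		"--b--\r\n"

	text, cte, err := decodeFirstPart(body, "b")
	if err != nil {
		panic(err)
	}
	// The body arrives already decoded, and the encoding header is hidden.
	fmt.Printf("text=%q cte=%q\n", text, cte)
}
```

Because NextPart hides the Content-Transfer-Encoding header once it has wrapped the body in a quoted-printable decoder, calling code should not try to decode the part a second time.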

That's a good idea. I would likely incorporate in a new version and try to abstract it out a bit. For example, I might just have an Email.Signature and Email.Encrypted, or something similar. Maybe I could add PGP support where the library can handle the generation of the signature for the user (first, I'll admittedly need to learn more about how that works!).

kalbasit commented on June 29, 2024

The quoted printable part isn't my worry here. By reading a MIME part using a multipart.NewReader, we get the benefit of the automatic quoted-printable decoding introduced in Go 1.5.

Oh cool! I missed that.

The texts are really two separate parts with unique content (even though one looks like a signature), so it'd be hard to determine if one is the primary text content over the other.

Well, the signature is Content-Type: application/pgp-signature, the encrypted part is Content-Type: application/pgp-encrypted, and you also have application/pgp-keys. See https://www.ietf.org/rfc/rfc2015.txt for more details.
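As a rough sketch of how those three content types could be told apart, the snippet below classifies a Content-Type header value with the standard mime package. The pgpKind type, its constants, and classifyPGP are hypothetical names invented for this example; they do not exist in the email library.

```go
package main

import (
	"fmt"
	"mime"
)

// pgpKind is a hypothetical label for the PGP-related role of a MIME part.
type pgpKind string

const (
	kindSignature pgpKind = "signature"
	kindEncrypted pgpKind = "encrypted"
	kindKeys      pgpKind = "keys"
	kindOther     pgpKind = "other"
)

// classifyPGP maps a Content-Type header value to one of the media types
// defined by RFC 2015. Anything else, including unparsable values, is "other".
func classifyPGP(contentType string) pgpKind {
	// ParseMediaType lowercases the media type and strips parameters.
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return kindOther
	}
	switch mt {
	case "application/pgp-signature":
		return kindSignature
	case "application/pgp-encrypted":
		return kindEncrypted
	case "application/pgp-keys":
		return kindKeys
	default:
		return kindOther
	}
}

func main() {
	for _, ct := range []string{
		`application/pgp-signature; name="signature.asc"`,
		"application/pgp-encrypted",
		"application/pgp-keys",
		"text/plain; charset=us-ascii",
	} {
		fmt.Printf("%-50s -> %s\n", ct, classifyPGP(ct))
	}
}
```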

That's a good idea. I would likely incorporate in a new version and try to abstract it out a bit. For example, I might just have an Email.Signature and Email.Encrypted, or something similar. Maybe I could add PGP support where the library can handle the generation of the signature for the user.

That'd be cool! But again I feel that logic should probably live in a different library (or even in this one but different package) so it can be used independently. It would be nice to have methods on the Email object itself though.

(first, I'll admittedly need to learn more about how that works!).

Maybe this will help https://godoc.org/golang.org/x/crypto/openpgp

kalbasit commented on June 29, 2024

Once this is fixed, I'll run your library across all of my emails (currently 261380 emails). This would be a nice benchmark and a very good test case!

kalbasit commented on June 29, 2024

ok here goes. With this main I managed to generate this log for you. I guess we should probably open an issue per error. Let me know which email you want first.

BTW performance is not too bad: it took 5m54.04951563s to parse 261414 emails. That's about 738 emails per second! This is amazing! Now I am more encouraged to re-create Notmuch in Go.

jordan-wright commented on June 29, 2024

Wow!

That's fantastic! Not the results... since clearly I have work to do (:smile:)... but the sheer coverage is great! Thanks so much for running this!

I'll need to see if I can download my corpus of email from gmail and do the same thing.

As for the "No Content-Type found" errors, let me knock out the one we have here and we can re-run on the corpus. As for the others, yeah, we can absolutely look to make some issues with some test cases and see if it's my library or the mail client that's in the wrong.

Also, I noticed your main.go was done serially. If you want to parse those even faster, you can use a sync.WaitGroup to run those in goroutines so you have a bunch of emails being parsed in parallel. That'd be way faster.
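A minimal sketch of that suggestion is shown below. It assumes the raw messages are already in memory and uses net/mail's ReadMessage as a stand-in for the library's parser, so the function name parseAll and the worker count are illustrative only.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
	"sync"
)

// parseAll (hypothetical helper) parses every raw message using a fixed
// number of worker goroutines and returns how many messages failed to parse.
func parseAll(raws []string, workers int) int {
	if workers < 1 {
		workers = 1 // avoid blocking forever on the jobs channel
	}
	jobs := make(chan string)
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex // guards failed
		failed int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range jobs {
				// Stand-in parser: swap in the real one here.
				if _, err := mail.ReadMessage(strings.NewReader(raw)); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	for _, raw := range raws {
		jobs <- raw
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()   // block until every worker has returned
	return failed
}

func main() {
	raws := []string{
		"Subject: hello\r\n\r\nbody",
		"this line has no colon\r\n\r\nbody", // malformed header
	}
	fmt.Println("failed:", parseAll(raws, 4))
}
```

A bounded pool is used rather than one goroutine per message so that a corpus of a few hundred thousand files does not open them all at once. Whether this is actually faster depends on where the bottleneck is.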

Thanks again for all your help!

kalbasit commented on June 29, 2024

It was intentionally done serially, and I believe the bottleneck is really I/O! Wonderful result...

jordan-wright commented on June 29, 2024

Ok, I think this one in particular has been fixed as of 669ab2f.

I was checking for the default content type on the root part, but not on subparts before trying to get the content type from them, which failed.
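The idea behind that fix can be sketched as follows. This is not the code from the commit; it is a standalone illustration that applies the RFC 2045 default ("text/plain; charset=us-ascii") to any subpart lacking a Content-Type header. The name partMediaTypes is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"mime/multipart"
	"strings"
)

// defaultContentType is what RFC 2045 says to assume when a part has no
// Content-Type header.
const defaultContentType = "text/plain; charset=us-ascii"

// partMediaTypes (hypothetical helper) walks a multipart body and returns the
// media type of each part, falling back to the default when the header is
// missing instead of failing.
func partMediaTypes(body, boundary string) ([]string, error) {
	mr := multipart.NewReader(strings.NewReader(body), boundary)
	var types []string
	for {
		p, err := mr.NextPart()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		ct := p.Header.Get("Content-Type")
		if ct == "" {
			ct = defaultContentType
		}
		mt, _, err := mime.ParseMediaType(ct)
		if err != nil {
			return nil, err
		}
		types = append(types, mt)
	}
	return types, nil
}

func main() {
	body := "--b\r\n" +
		"Content-Transfer-Encoding: quoted-printable\r\n" + // no Content-Type
		"\r\n" +
		"hello\r\n" +
		"--b\r\n" +
		"Content-Type: application/pgp-signature\r\n" +
		"\r\n" +
		"sig\r\n" +
		"--b--\r\n"

	types, err := partMediaTypes(body, "b")
	if err != nil {
		panic(err)
	}
	fmt.Println(types)
}
```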

This should (hopefully!) fix a bunch of the "No Content-Type..." logs from your output.

Let me know if this fixes the issue, and we can move forward with a new log to work on! We'll knock 'em out one at a time 😄

jordan-wright commented on June 29, 2024

Also, due to the way I am setting e.Text, the last text part is what's set (in this case, it's the signature line). I can change this behavior if you think that, in general, the most recent (most relevant) text part will be the first one found. I don't know if the RFC has a spec for it, but I'll check.

jordan-wright commented on June 29, 2024

Whoops! Had a syntax error in my test (teaches me not to commit without running!)

Here's the commit that should fix it for you: f61123e

kalbasit commented on June 29, 2024

Also, due to the way I am setting e.Text, the last text part is what's set

I think this is undesirable behavior. How about putting the part with the application/pgp-signature content type in the Signature field?

jordan-wright commented on June 29, 2024

I could add a .Signature field, but I'd consider that another issue since it's another feature. Plus, I think we might be talking about two different things 😄

The "signature" I'm referring to is:

_______________________________________________
notmuch mailing list
[email protected]
http://notmuchmail.org/mailman/listinfo/notmuch

This is what I imagine is the signature in the email (not the PGP signature). This is what's being set in the text field, since I check for things with a Content-Type of "text/plain" to set as Email.Text. This skips over the PGP stuff.

Oh, and good news! I was inspired so I'm downloading my entire Gmail corpus as we speak, so I'll have yet another sample set to work with in tracking down issues. Thanks for the great idea! I'll post the results as soon as I have them.

kalbasit commented on June 29, 2024

The "signature" I'm referring to is:

notmuch mailing list
[email protected]
http://notmuchmail.org/mailman/listinfo/notmuch

Ah I see. Good question... We could:

  • Append the text to the current Text field
  • Make the Text a []string
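Either option could be sketched roughly as below. The part type and the collectText function are hypothetical stand-ins for already-decoded MIME parts; they are not the library's types. Collecting the parts into a slice covers the second option, and joining that slice gives the first.

```go
package main

import (
	"fmt"
	"strings"
)

// part is a hypothetical, already-decoded MIME part.
type part struct {
	MediaType string
	Body      string
}

// collectText returns every text/plain body in order of appearance, so no
// text part silently overwrites an earlier one.
func collectText(parts []part) []string {
	var texts []string
	for _, p := range parts {
		if p.MediaType == "text/plain" {
			texts = append(texts, p.Body)
		}
	}
	return texts
}

func main() {
	parts := []part{
		{MediaType: "text/plain", Body: "The actual message body."},
		{MediaType: "application/pgp-signature", Body: "(signature data)"},
		{MediaType: "text/plain", Body: "notmuch mailing list footer"},
	}

	texts := collectText(parts)          // option 2: Text as []string
	joined := strings.Join(texts, "\n")  // option 1: append into one string

	fmt.Println(len(texts))
	fmt.Println(joined)
}
```

Keeping the slice preserves the boundary between the body and the mailing list footer, which a single concatenated string would lose.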

Also I've updated the main and the log to reflect the latest version.

Closing this as fixed.

jordan-wright commented on June 29, 2024

Excellent! I'm glad to see that we fixed about 2k issues. I'm definitely ok with that!

So, from what I can tell, there are about 4 unique issues left:

  • quotedprintable: invalid unescaped byte
  • unexpected EOF
  • mime: invalid media parameter
  • malformed MIME header line:

Would there be any way you could add test cases for these 4 items? I'd be happy to create the issues for it this evening.
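For the "mime: invalid media parameter" case, one possible way to be tolerant (an assumption for illustration, not how the library handles it) is to keep the media type that the standard library still returns when only the parameters are malformed. The helper name lenientMediaType is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
)

// lenientMediaType (hypothetical helper) returns the media type even when its
// optional parameters cannot be parsed. Any other parse error is passed on.
func lenientMediaType(contentType string) (string, error) {
	mt, _, err := mime.ParseMediaType(contentType)
	if errors.Is(err, mime.ErrInvalidMediaParameter) {
		// The media type itself was found; only the parameters were bad.
		return mt, nil
	}
	return mt, err
}

func main() {
	// "charset" has no value, which the strict parser rejects.
	mt, err := lenientMediaType("text/plain; charset")
	fmt.Printf("media type=%q err=%v\n", mt, err)

	// An empty header is still an error.
	_, err = lenientMediaType("")
	fmt.Println("empty header is an error:", err != nil)
}
```

The trade-off is that parameters such as charset or boundary are dropped for those messages, so this only helps for parts whose parameters are not needed to read the body.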

kalbasit commented on June 29, 2024

Sure, I'll push a commit tonight with test cases.

jordan-wright commented on June 29, 2024

Good news! I managed to run a test with about 55k emails I had lying around. Your main worked perfectly; I just adjusted paths, etc. Thanks for that!

Here's the log that was generated. I'll see if I can generate some test cases out of this.

kalbasit commented on June 29, 2024

Cool! I'm happy that helped. Hopefully we'll get it parsing ALL emails, even those not respecting the RFC! I have emails dating back to 2008 from god knows what kind of email client, so I feel it's important to parse emails even if they do not follow the spec verbatim. In reality, most people have one or many of those lying around.
