mirage / mrmime



What do you mean?

License: MIT License

OCaml 99.86% Makefile 0.03% Standard ML 0.01% Perl 0.11%

mrmime's Introduction

Mr. MIME (Multipurpose Internet Mail Extensions)

mrmime is a library to parse and generate mail according to several RFCs:

RFC822: Standard For The Format of ARPA Internet Text Messages
RFC2822: Internet Message Format
RFC5321: Simple Mail Transfer Protocol
RFC5322: Internet Message Format
RFC2045: MIME Part One: Format of Internet Message Bodies
RFC2046: MIME Part Two: Media Types
RFC2047: MIME Part Three: Message-Header Extensions for Non-ASCII Text
RFC2049: MIME Part Five: Conformance Criteria and Examples
RFC6532: Internationalized Email Headers

mrmime is built on angstrom and parses mail on a best-effort basis. It was able to parse every message in a corpus of 2 billion mails; however, the results can diverge from what you expect.

On the other side, mrmime is able to generate valid mail from an OCaml description. Generation follows some rules:

the produced stream emits one line at a time
lines are limited to 78 characters on a best-effort basis
we follow RFC6532 and emit UTF-8 mail
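
To make the 78-character rule concrete, here is a minimal greedy folding sketch using only the OCaml standard library. The function name `fold` is hypothetical and this is not how mrmime itself lays out lines; it only illustrates what "best-effort" means: break at spaces when possible, and let a single over-long word exceed the limit.

```ocaml
(* Greedy folding sketch (hypothetical helper, not mrmime's implementation).
   Breaks at spaces so each line stays within [limit] characters when
   possible. Continuation lines start with a space (folding whitespace).
   A single word longer than [limit] is kept whole: best-effort only. *)
let fold ?(limit = 78) (s : string) : string list =
  let words = String.split_on_char ' ' s |> List.filter (fun w -> w <> "") in
  let rec go acc line = function
    | [] -> List.rev (if line = "" then acc else line :: acc)
    | w :: rest ->
      if line = "" then go acc w rest
      else if String.length line + 1 + String.length w <= limit then
        go acc (line ^ " " ^ w) rest
      else go (line :: acc) (" " ^ w) rest
  in
  go [] "" words
```

With `~limit:10`, the input `"aaaa bbbb cccc"` is split into `"aaaa bbbb"` and `" cccc"`.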

How to parse a mail?

There are different ways to parse a mail, depending on what you want. In some cases you are interested only in the header part; in others you probably want the bodies. We decided to split these tasks into 2 different APIs to fit these constraints.

For example, if you only want to extract the header, memory consumption probably matters, as in an SMTP server where only the header is of interest.

A stream API is provided for this case; with it, we were able to implement a DKIM checker which needs only one pass to verify your mail.

On the other side, if you want to extract the bodies of your mail, the parser provided is not a stream parser, because bodies must be extracted from a multipart mail. This document explains how to use it.

Parse only the header part

For many purposes, parsing only the header part of a mail is enough. In this case, the Hd sub-module should be what you want.

A complex example of Hd is available in the ocaml-dkim project, which extracts the DKIM signature from the header.

let dkim_signature = Mrmime.Field_name.v "DKIM-Signature"

let extract_dkim () =
  let open Mrmime in
  (* [tmp] receives raw bytes from stdin; [buffer] is handed to the decoder. *)
  let tmp = Bytes.create 0x1000 in
  let buffer = Bigstringaf.create 0x1000 in
  let decoder = Hd.decoder buffer in
  let rec decode () = match Hd.decode decoder with
    | `Field field ->
      ( match Location.prj field with
      | Field.Field (field_name, Unstructured, v)
          when Field_name.equal field_name dkim_signature ->
        (* DKIM field found: print its value and stop. *)
        Fmt.pr "%a: %a\n%!" Field_name.pp dkim_signature Unstructured.pp v
      | _ -> decode () )
    | `Malformed err -> failwith err
    (* End of the header; the unconsumed input is ignored here. *)
    | `End _rest -> ()
    | `Await ->
      (* The decoder needs more input: refill it from stdin. *)
      let len = input stdin tmp 0 (Bytes.length tmp) in
      ( match Hd.src decoder (Bytes.unsafe_to_string tmp) 0 len with
        | Ok () -> decode ()
        | Error (`Msg err) -> failwith err ) in
  decode ()

This little snippet parses a mail from stdin which is encoded with CRLF end-of-line (so you should convert your mail to this newline convention first). When it reaches a DKIM field, it prints its well-parsed value (in our case, an unstructured value). [Other] corresponds to other fields; a DKIM signature can appear there when we fail to parse its value as an unstructured value.
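
Since the decoder expects CRLF line endings, a mail stored with bare LF needs to be converted before it is fed in. The helper below is a standard-library-only sketch; the name `to_crlf` is hypothetical and is not part of mrmime.

```ocaml
(* Convert bare LF line endings to CRLF, leaving existing CRLF untouched.
   Hypothetical helper, not part of mrmime. *)
let to_crlf (s : string) : string =
  let buf = Buffer.create (String.length s + 16) in
  String.iteri
    (fun i c ->
      if c = '\n' && (i = 0 || s.[i - 1] <> '\r') then
        Buffer.add_string buf "\r\n"
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

This version works on a whole string; a streaming variant would also have to remember whether the previous chunk ended with `'\r'`.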

Parse entirely a mail

Of course, the initial goal of mrmime is to parse an entire mail. In this case, you should use the Mail sub-module, which provides an angstrom parser.

Bodies can be heavy, and if you want to store them yourself, we provide an API which expects consumers to consume bodies (and store them, for example, into UNIX files).

A complex example is available in ptt, which extracts bodies and saves them into UNIX files. For this we use:

val stream : emitters:(Header.t -> (string option -> unit) * 'id) -> (Header.t * 'id t) Angstrom.t

It calls emitters for each part of your mail. The parser properly decodes each part (according to its Content-Transfer-Encoding) and passes the decoded input to your consumer.
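
As an illustration of the shape expected by the `emitters` argument, here is a minimal in-memory consumer. It is a sketch under assumptions: the header is left fully polymorphic (in real use it would be a `Header.t`), the `'id` returned for each part is a fresh `Buffer.t`, and `None` is taken to mark the end of a part.

```ocaml
(* Hypothetical in-memory emitters: every part gets its own Buffer.t,
   which also serves as the 'id returned alongside the consumer.
   The header is ignored here; a real consumer could inspect it to pick
   a file name or to skip parts it does not care about. *)
let emitters (_header : 'header) : (string option -> unit) * Buffer.t =
  let buf = Buffer.create 0x1000 in
  let push = function
    | Some chunk -> Buffer.add_string buf chunk (* decoded body chunk *)
    | None -> ()                                (* assumed: end of part *)
  in
  (push, buf)
```

A consumer that stores bodies in files would open a file in place of the buffer and close it on `None`.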

How to emit a mail?

mrmime is able to generate a mail from an OCaml description of it. You have several ways to craft information such as an address or a Content-Type field for a specific part.

Many sub-modules of mrmime provide a way to construct a piece of information, such as the subject of your mail or its recipients. For example, the sub-module Mailbox provides an easy way to construct an address:

let romain_calascibetta =
  let open Mrmime.Mailbox in
  Local.[ w "romain"; w "calascibetta" ] @ Domain.(domain, [ a "x25519"; a "net" ])

The documentation was written to help you construct many of these values. Of course, Header is the module to construct a header:

let header =
  let open Mrmime in
  Field.[ Field (Field_name.subject, Unstructured,
                 Unstructured.Craft.(compile [ v "Simple"; sp 1; v "Email" ]))
        ; Field (Field_name.v "To", Addresses, [ `Mailbox romain_calascibetta ])
        ; Field (Field_name.date, Date, (Date.of_ptime ~zone:GMT (Ptime_clock.now ()))) ]
  |> Header.of_list

Then, Header provides a to_stream function which emits your header line by line (with the CRLF newline convention), mostly so that it can be plugged into an SMTP pipe.

Finally, for a multipart mail, the Mt sub-module is the most interesting: it makes parts from streams (a stream from a file or from standard input) associated with Content fields (like Content-Transfer-Encoding). mrmime takes care of encoding your stream (base64 or quoted-printable).

A complex example of how to use the Mt module is available in the facteur project, which is able to send a multipart mail.

Encoding

A real effort was made to treat every input/output of mrmime as a UTF-8 string. This is made possible by some underlying packages:

rosetta as universal unifier to unicode
uuuu as mapper from ISO-8859 to Unicode
coin as mapper from KOI8-{U,R} to Unicode
yuscii as mapper from UTF-7 to Unicode

The SMTP protocol constrains bodies to use only 7 bits per byte (a historical limitation). Encodings such as quoted-printable or base64 are therefore used to encode bodies and respect this limitation. mrmime uses:

pecu as a stream encoder/decoder
base64 (base64.rfc2045 sub-package) as a stream encoder/decoder
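
To show what such an encoding does to a body, here is a deliberately simplified quoted-printable encoder written against the standard library only. It is not pecu's API and not a complete RFC 2045 encoder: it emits no soft line breaks and escapes every byte outside the printable range 33 to 126, as well as `=`, which means spaces become `=20`.

```ocaml
(* Simplified quoted-printable sketch (hypothetical, not pecu):
   printable ASCII except '=' is copied as-is, everything else
   (including space and bytes >= 0x80) becomes "=XX" in upper-case hex.
   No soft line breaks are produced. *)
let qp_encode (s : string) : string =
  let buf = Buffer.create (String.length s * 3) in
  String.iter
    (fun c ->
      let code = Char.code c in
      if code >= 33 && code <= 126 && c <> '=' then Buffer.add_char buf c
      else Buffer.add_string buf (Printf.sprintf "=%02X" code))
    s;
  Buffer.contents buf
```

For example, `qp_encode "Hello World!"` gives `"Hello=20World!"`, and the output only ever contains 7-bit characters.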

Status of the project

mrmime is really experimental. Because it aims to cover many purposes (encoding, multipart), the API is likely to change often. We reached a first version because we are able to send a well-formed multipart mail with it; however, it is possible to hit odd cases where mrmime emits an invalid mail.

The same caveat applies to the parser: the mail format is not really respected by implementations in many cases, and the parser may fail on some of them for an odd reason.

Of course, feedback is expected to improve it. So you can use it, but you should not expect an industrial quality - I mean, not yet. So play with it, and enjoy your hacking!

mrmime has received funding from the Next Generation Internet Initiative (NGI) within the framework of the DAPSI Project.


mrmime's Issues

Prettym added some WS inside content-type parameters

Hi,
Last issue that I identified with the fuzzed mail generator: it seems that prettym sometimes adds white space at the beginning or at the end of content-type parameters inside the quotes. That makes the comparison functions of Content_type.Parameter not work properly.

If it is an expected behavior, we should add trim in the comparison functions.

Some seeds with this issue: 157, 160, 164, 221, 272, 319, 322.
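
A sketch of the trimming comparison suggested above, on plain strings. The name `equal_value` is hypothetical and this is not the Content_type.Parameter API; note also that trimming inside a quoted value changes what counts as equal, so it is only appropriate if the extra white space is indeed expected.

```ocaml
(* Hypothetical comparison on raw parameter values: leading and trailing
   white space is ignored before comparing. *)
let equal_value (a : string) (b : string) : bool =
  String.equal (String.trim a) (String.trim b)
```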

`prettym` should have its own release cycle

prettym is self-contained but used by mrmime (and multipart_form). We should cut prettym out as a new package. It aims to solve a more general problem than email.

Use uucd on RFC 6532 to know length of UTF-8 string

Issue with message-id header

I have an issue with the message-id header. I don't have the time to investigate, especially since the error does not seem to be reproducible (some weird things are happening here, but probably with a very simple explanation!). It may actually come from something else, but for now I can neither pinpoint the issue nor say more about it.

I will complete this issue when I get the time to investigate further.

Issue with header parsing

Seed 102.

There is an issue with header parsing. From what I found, everything is fine until the call to Unstrctrd_parser.unstrctrd in Mrmime.Field.Decoder.field. If you peek at the parsed string:

before this call: everything is ok, the pointer is just before the header value.
after this call: something went wrong and the pointer has moved to the next line.

However, I don't know whether the issue comes from the Unstrctrd function or from how the headers are encoded.

OCaml 5.00 support

Currently the tests fail with:

#=== ERROR while compiling mrmime.0.5.0 =======================================#
# context              2.2.0~alpha~dev | linux/x86_64 | ocaml-variants.5.00.0+trunk | file:///home/opam/opam-repository
# path                 ~/.opam/5.00/.opam-switch/build/mrmime.0.5.0
# command              ~/.opam/opam-init/hooks/sandbox.sh build dune runtest -p mrmime -j 255
# exit-code            1
# env-file             ~/.opam/log/mrmime-7917-fdd0d1.env
# output-file          ~/.opam/log/mrmime-7917-fdd0d1.out
### output ###
# File "examples/test.t", line 1, characters 0-0:
#          git (internal) (exit 1)
# (cd _build/.sandbox/8446b8ac2189e9654e3b176b92d7af06/default && /usr/bin/git --no-pager diff --no-index --color=always -u ../../../default/examples/test.t examples/test.t.corrected)
# diff --git a/../../../default/examples/test.t b/examples/test.t.corrected
# index dca929a..b2fd8ab 100644
# --- a/../../../default/examples/test.t
# +++ b/examples/test.t.corrected
# @@ -4,15 +4,15 @@ Simple email with attachment
#    From: [email protected]
#    Subject: A Simple Email
#    Date: Mon, 26 Apr 2021 16:20:50 GMT
# -  Content-Type: multipart/mixed; boundary=YlGxbWQC
# +  Content-Type: multipart/mixed; boundary=QYClM9fD
#    
# -  --YlGxbWQC
# +  --QYClM9fD
#    Content-Transfer-Encoding: quoted-printable
#    Content-Type: text/plain
#    
#    Hello=20World!
#    
# -  --YlGxbWQC
# +  --QYClM9fD
#    Content-Disposition: attachement; filename=mrmime.png
#    Content-Transfer-Encoding: base64
#    Content-Type: image/png
# @@ -1093,4 +1093,4 @@ Simple email with attachment
#    ZHRoADg1MJ7GNKcAAAAZdEVYdFRodW1iOjpNaW1ldHlwZQBpbWFnZS9wbmc/slZOAAAAF3RFWHRU
#    aHVtYjo6TVRpbWUAMTU1NDQ5MjI0MNxNkFcAAAASdEVYdFRodW1iOjpTaXplADQ4NEtCQmZRA9kA
#    AAAASUVORK5CYII=
# -  --YlGxbWQC--
# +  --QYClM9fD--
#      rfc2047 alias test/runtest
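
The diff above shows that only the generated boundary changed. One plausible reading, stated here as an assumption and not confirmed by the log, is that the test derives its boundary from a seeded pseudo-random generator whose output sequence differs between compiler versions (OCaml 5.0 replaced the algorithm behind Stdlib's Random module). The sketch below shows such a seeded boundary generator; the names `alphabet` and `boundary` are hypothetical.

```ocaml
(* Hypothetical boundary generator driven by a fixed seed.
   For a given compiler version the result is deterministic, but the
   characters produced for the same seed are not guaranteed to be the
   same across OCaml versions. *)
let alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

let boundary ~seed ~len =
  let st = Random.State.make [| seed |] in
  String.init len (fun _ ->
      alphabet.[Random.State.int st (String.length alphabet)])
```

If this is the cause, passing an explicit boundary in the test instead of relying on the generator would make the expected output independent of the compiler version.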

RFC7103

RFC2231

Content type automatically added by Mt.multipart

This issue is about the two following functions of the Mt module :

val multipart : rng:'g rng -> ?header:Header.t -> ?boundary:string -> part list -> multipart
val make : Header.t -> 'x body -> 'x -> t

There are two ways of building the "same" email with a predefined header, header.

More intuitive choice

let mail = 
   Mt.multipart ~rng:Mt.rng parts 
   |> Mt.make header Mt.multi

Second choice

let mail = 
   Mt.multipart ~header ~rng:Mt.rng parts 
   |> Mt.make Header.empty Mt.multi

The issue is the following: the multipart function checks the input header for the content type and adds it if necessary. So the first solution does not work properly: the final headers can contain multiple content-type headers (the one in the predefined header and the added one).

But the first solution is clearly suggested by the API, as header is an optional argument for multipart but not for make.

I think an easy solution would be to remove the header argument of the make function.
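
The duplication can be reproduced with a toy model in which a header is just a list of name/value pairs. Everything here is a hypothetical stand-in for the behaviour described in this issue (in particular, the final concatenation models what make is reported to do), not the Mt implementation.

```ocaml
(* Toy model of the reported behaviour, not mrmime's code. *)
type header = (string * string) list

let same_name a b = String.lowercase_ascii a = String.lowercase_ascii b

let count name (h : header) =
  List.length (List.filter (fun (n, _) -> same_name n name) h)

(* Models "multipart": add a Content-Type only when none is present. *)
let add_content_type_if_missing (h : header) : header =
  if count "Content-Type" h > 0 then h
  else h @ [ ("Content-Type", "multipart/mixed; boundary=generated") ]

let user : header = [ ("Content-Type", "multipart/mixed; boundary=user") ]

(* First choice: multipart sees an empty header and adds a Content-Type,
   then the user's header is attached on top of it. *)
let first : header = user @ add_content_type_if_missing []

(* Second choice: multipart sees the user's header and adds nothing. *)
let second : header = [] @ add_content_type_if_missing user
```

In this model `first` ends up with two Content-Type fields while `second` has exactly one.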

Some bits lost with Base64

Hi, some parsed base64 mails have a few bits fewer at the end of their body than their generated counterparts. I don't know why. What should I look for to correct it (if a correction is needed)?

Some seeds with this issue: 3, 16, 209, 435, 501.
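
When looking for bits lost at the end of a base64 body, the final group of the encoding is the usual place to check, because it carries padding. The arithmetic below follows from base64 itself (3 input bytes map to 4 output characters); the function names are hypothetical and nothing here diagnoses the actual cause of this issue.

```ocaml
(* Base64 tail arithmetic for an input of [n] bytes. *)

(* Number of output characters, padding included. *)
let encoded_length n = (n + 2) / 3 * 4

(* Number of '=' padding characters at the end. *)
let padding n = (3 - n mod 3) mod 3

(* Zero bits appended inside the last data character: a decoder that
   mishandles them, or an encoder that leaves them out, changes the tail. *)
let slack_bits n =
  match n mod 3 with
  | 0 -> 0
  | 1 -> 4 (* 8 data bits stored in 2 characters = 12 bits *)
  | _ -> 2 (* 16 data bits stored in 3 characters = 18 bits *)
```

Comparing the body lengths modulo 3 for the failing seeds would show whether the difference is confined to bodies whose length is not a multiple of 3.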

Update README.md

Add `Header.message_id` accessor

Bigstring on rosetta package

Currently, rosetta works on Bytes.t. For translation from an encoding to UTF-8, we chose this kind of buffer mostly because uutf works on Bytes.t. However, angstrom works on bigstring and, on the other side, fe (the internal encoder of mrmime) works with both.

So, because rosetta is under my responsibility, I can decide to provide a translation from a bigstring input. But the code will change a lot, and so will the internals.

From my point of view, and mostly because I did a lot of benchmarks with buffet, we get the same big question: should we enforce Bytes.t or Bigstring.t, functorize, or use a (G)ADT over the input? From benchmarks, the functor is the best, and flambda will be able to optimize it easily through specialization of the functor.

So we have different plans:

(middle) functorize rosetta (and pecu, and uuuu, and coin, and yuscii).

This solution moves the boilerplate into rosetta; we can then apply the functor in mrmime and use only bigstring. From this point, we avoid most of the copies when we translate an input from an encoding to UTF-8. However, we still have copies in the uutf part (which uses only Bytes.t).

From benchmarks, it's the best solution even if I don't like to functorize everything. flambda will then be able to optimize it, and the readability of the code is kept, unlike the second solution, which needs to pass a witness to every function that manipulates input.
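
A minimal sketch of what the functor approach looks like, using only the standard library. The signature `BUFFER`, the module names and the `count_non_ascii` function are all hypothetical; rosetta's real interface would be larger.

```ocaml
(* Hypothetical buffer abstraction: the decoder only needs length and get. *)
module type BUFFER = sig
  type t
  val length : t -> int
  val get : t -> int -> char
end

module Bytes_buffer : BUFFER with type t = Bytes.t = struct
  type t = Bytes.t
  let length = Bytes.length
  let get = Bytes.get
end

module Bigstring_buffer :
  BUFFER
    with type t =
      (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t =
struct
  type t =
    (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
  let length = Bigarray.Array1.dim
  let get = Bigarray.Array1.get
end

(* The algorithm is written once against BUFFER ... *)
module Make (B : BUFFER) = struct
  let count_non_ascii (b : B.t) : int =
    let n = ref 0 in
    for i = 0 to B.length b - 1 do
      if Char.code (B.get b i) >= 0x80 then incr n
    done;
    !n
end

(* ... and specialised by functor application in the client. *)
module On_bytes = Make (Bytes_buffer)
module On_bigstring = Make (Bigstring_buffer)
```

The functions inside `Make` take no extra argument: the choice of buffer is made once, at functor application.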

(middle) (G)ADT - (decompress's solution)

Avoid the functor but add an argument, the witness, to every function which manipulates input. flambda is not really able to optimize it, and specialization (even if we use a GADT) is hard.
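
For comparison, a minimal sketch of the witness approach with a GADT, again with hypothetical names (`witness`, `As_bytes`, `As_bigstring`, `count_non_ascii`) and only the standard library.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* The witness tells each function which concrete buffer it was given. *)
type 'a witness =
  | As_bytes : Bytes.t witness
  | As_bigstring : bigstring witness

let length : type a. a witness -> a -> int =
 fun w b ->
  match w with
  | As_bytes -> Bytes.length b
  | As_bigstring -> Bigarray.Array1.dim b

let get : type a. a witness -> a -> int -> char =
 fun w b i ->
  match w with
  | As_bytes -> Bytes.get b i
  | As_bigstring -> Bigarray.Array1.get b i

(* Every function that touches the input must take and forward [w]. *)
let count_non_ascii w b =
  let n = ref 0 in
  for i = 0 to length w b - 1 do
    if Char.code (get w b i) >= 0x80 then incr n
  done;
  !n
```

Here the dispatch on the witness happens at each `get`, inside the loop, which is the part that is hard to specialise away.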

move to bigstring (angstrom's solution)

Following angstrom, which uses a bigstring, we can move to this solution and enforce bigstring only in rosetta (and in the related packages). However, we lose the ability to use Bytes.t in some cases. But from a performance perspective, this is the best choice.

In my opinion, the first option should be the best but ... eh, another functor, and after my story with ocaml-git I'm a little bit sick of them. Anyway, I leave this issue open because the question is still open.

Newline at the end of a quoted printable body

Hi,
Some of the parsed mails differ from the generated ones by only one char: a newline char (=0A) has been added at the very end of the body. Is this expected behaviour?

Some seeds with this issue: 34, 299 and 421.