GithubHelp home page GithubHelp logo

Comments (13)

mre avatar mre commented on May 26, 2024 1

I quite enjoyed it tbh. 😃

from hyperjson.

mre avatar mre commented on May 26, 2024 1

Thanks to @RSabet for integrating our first hypothesis-based tests. This should make the codebase quite a bit more robust with regards to edge-cases.

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024 1

I've checked pyo3 bugtracker, and it seems the library needs some serious work before you can rely on it to be safe for passing arbitrary data. Check this out:

PyO3/pyo3#159
PyO3/pyo3#145
PyO3/pyo3#94
PyO3/pyo3#71

Some of these issues are over a year old. And these are just the bugs that have been sufficiently dramatic to cause a segfault. You need something to go very, very wrong to trigger a segfault - most memory corruption is not detected unless you build the binary with address sanitizer.

TL;DR: you would need to audit the relevant parts of pyo3 before you can claim any kind of safety.

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

To clarify, I have listed a bunch of approaches to software verification, and I do not expect fuzzing specifically to discover many interesting bugs, unless there is gross negligence in FFI. Still, for something that takes half an hour to set up, it provides a really good return on investment, so there's no reason not to do it.

https://github.com/rust-fuzz/targets already provides a fuzzing target for serde_json deserialization, including roundtrip equivalence check. It does not come with a starting corpus, though.

Do not forget to use address and memory sanitizers if you're looking for bugs in unsafe code, otherwise they will not be discovered. You can omit sanitizers when looking for panics or building the initial corpus.

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

All unsafe blocks in serde seem to be in serialization, not in deserialization. So fuzzing is unlikely to help you find those, since non-specialized fuzzers are unable to systematically produce interesting Python objects to serialize.

I've looked at what kind of unsafes there are in serde and serde-json, here's what I've found:

There is a number of str::from_utf8_unchecked() calls in both serde and serde-json. If the input is not actually utf8, the result is undefined behavior. This should be reasonably easy to verify since Python3 has a built-in unicode string type. You just need to make sure that the FFI library does not cast Python binary strings to Rust strings behind your back, and you're set.

serialize_display_bounded_length() in serde core is trickier because in addition to from_utf8_unchecked() it also trusts the Display trait implementation on an arbitrary input object to correctly report the length of data it has written.

from hyperjson.

dtolnay avatar dtolnay commented on May 26, 2024

I don't think you have accurately characterized the assumptions around utf8 in Serde:

it also trusts the Display trait implementation on an arbitrary input object to correctly report the length of data it has written.

  • The code only involves the Display impls of 4 specific types from std::net in the standard library, although this is not relevant for safety and the code is safe for an arbitrary implementation of Display.
  • A Display impl does not "report the length of data it has written".

The key observation, which is called out in a comment, is that there is no way to emit non-utf8 encoded bytes through any API in std::fmt. You can see this by looking at the methods of std::fmt::Write which the module is built around.

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

@dtolnay than you for the clarification! I'd really appreciate if you could add a code comment explaining why a poorly implemented Display trait would not be a security issue. Here's the relevant snippet:

macro_rules! serialize_display_bounded_length {
    ($value:expr, $max:expr, $serializer:expr) => {{
        let mut buffer: [u8; $max] = unsafe { mem::uninitialized() };
        let remaining_len = {
            let mut remaining = &mut buffer[..];
            write!(remaining, "{}", $value).unwrap();
            remaining.len()
        };
        let written_len = buffer.len() - remaining_len;
        let written = &buffer[..written_len];
        //  <snip>

By reading this code I observe that:

  1. This macro can be invoked on an arbitrary input object with an arbitrary, potentially incorrect, trait implementations
  2. This macro creates an array with uninitialized memory in it
  3. Said array is populated and its new length of it is determined via a call to write! macro.
  4. If the remaining slice length after calling write! macro is not set to a correct value, this function contains an information disclosure vulnerability.
  5. The write! macro invokes the (potentially incorrect) implementation of Display trait internally.

This leads me to believe that in order to verify that this is safe, I need to inspect the implementation of the write! macro and ensure that it always writes the bytes and modifies input slice length as advertised for an arbitrary implementation of Display trait; failing that, I need to audit the implementation of Display trait for all objects that macro can be invoked on.

This makes some critical invariants non-local. So an explanation in a code comment on why write! always writes the bytes as advertised and mutates the length of the input slice correctly for an arbitrary and possibly incorrect implementation of Display trait would be very helpful.

from hyperjson.

dtolnay avatar dtolnay commented on May 26, 2024

Thanks for the feedback @Shnatsel. I would love to get to the root of where our perspective differs. In response to your observations:

in order to verify that this is safe, I need to inspect the implementation of the write! macro and ensure that it always writes the bytes and modifies input slice length as advertised

I guess I don't see this use of write! as fundamentally different from any other place that Serde relies on standard library functionality to behave "as advertised". If our threat model includes write! possibly dumping garbage bytes into the output and returning a made-up integer then we are going to have a bad time no matter what, even in code that does not visibly use unsafe nearby. Practically everything is implemented in terms of something unsafe. Just calling to_string() on an integer or inserting into a BTreeMap involve dozens of lines of unsafe code. The only way to stay sane is assume the standard library behaves "as advertised" and inspect the implementation if there is any doubt, whether or not the callers are safe code.

[Emphasis mine in the following quote]

code comment on *why* write! always writes the bytes as advertised and mutates the length of the input slice correctly

As above, I'm not sure there is anything we could write in Serde that would justify why write! does what it says it does, beyond noting this is what it is supposed to do. The rest of Serde needs to trust every other standard library API (String, Vec, BTreeMap, etc) just as much as we trust write! here.

for an arbitrary and possibly incorrect implementation of Display trait

I've tried not to distinguish between "correct" and "incorrect" implementations of Display because the reasoning applies equally well to any possible way that Display could be implemented. Maybe I'm missing the sort of incorrectness you have in mind -- could you give an example of what you would consider an incorrect implementation of Display in this context?

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

I was wondering if it was possible for an implementation of Display to actually write a different amount of bytes than is eventually reported to this code via slice mutation in write!.

Turns out this is not possible because Display passes a string to the Write implementation in the formatter, which handles writing and updating the length. Display cannot mess up the length of the string it passes without abusing unsafe code, so this code is safe for any safe implementation of Display.

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

Thanks a lot to @dtolnay for clarifications and for bearing with me. This attitude to safety is commendable.

Apologies to mre for somewhat unrelated discussion on the issue.

from hyperjson.

RSabet avatar RSabet commented on May 26, 2024

We definetly need more tests and fuzzing is a good start, in Haskell i really love quickcheck and i was happy to see Rust-Quickcheck

what about pyo3?
In pyo3-0.2.7:

grep -rn unsafe . | wc -l
742

so maybe add hypothesis-python

from hyperjson.

Shnatsel avatar Shnatsel commented on May 26, 2024

pyo3 worries me both due to inherent unsafety of FFI and past track record of some of its developers. I would love to see it audited, but I have to admit my competence is insufficient for a meaningful audit.

from hyperjson.

davidhewitt avatar davidhewitt commented on May 26, 2024

Two years later, drive by hello from a new (ish) pyo3 maintainer. 😄

I treat the soundness of pyo3 as the number one priority of the library. IMO using Rust, a stricter language than C++, gives you design constraints which C++ doesn't have. Also, the C++ pybind11 library is already a very convenient choice for Python extensions written in a compiled language. So in my opinion most people who are using pyo3 really care about Rust's safety aspects.

It's a shame that pyo3 has historical safety issues; I have been working to see them removed. As of today I think I have submitted PRs to resolve the last of the issues labelled Unsound in the pyo3 bug tracker 🎉

I wouldn't be surprised if there are still skeletons in the closet though - if you do end up doing any fuzzing and find any issues related to pyo3, please do post them: I'll work to clean them up asap!

from hyperjson.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.