Comments (13)
I quite enjoyed it tbh. 😃
from hyperjson.
Thanks to @RSabet for integrating our first hypothesis-based tests. This should make the codebase quite a bit more robust with regards to edge-cases.
from hyperjson.
I've checked pyo3 bugtracker, and it seems the library needs some serious work before you can rely on it to be safe for passing arbitrary data. Check this out:
PyO3/pyo3#159
PyO3/pyo3#145
PyO3/pyo3#94
PyO3/pyo3#71
Some of these issues are over a year old. And these are just the bugs that have been sufficiently dramatic to cause a segfault. You need something to go very, very wrong to trigger a segfault - most memory corruption is not detected unless you build the binary with address sanitizer.
TL;DR: you would need to audit the relevant parts of pyo3 before you can claim any kind of safety.
from hyperjson.
To clarify, I have listed a bunch of approaches to software verification, and I do not expect fuzzing specifically to discover many interesting bugs, unless there is gross negligence in FFI. Still, for something that takes half an hour to set up, it provides a really good return on investment, so there's no reason not to do it.
https://github.com/rust-fuzz/targets already provides a fuzzing target for serde_json
deserialization, including roundtrip equivalence check. It does not come with a starting corpus, though.
Do not forget to use address and memory sanitizers if you're looking for bugs in unsafe code, otherwise they will not be discovered. You can omit sanitizers when looking for panics or building the initial corpus.
from hyperjson.
All unsafe
blocks in serde seem to be in serialization, not in deserialization. So fuzzing is unlikely to help you find those, since non-specialized fuzzers are unable to systematically produce interesting Python objects to serialize.
I've looked at what kind of unsafes there are in serde and serde-json, here's what I've found:
There is a number of str::from_utf8_unchecked()
calls in both serde and serde-json. If the input is not actually utf8, the result is undefined behavior. This should be reasonably easy to verify since Python3 has a built-in unicode string type. You just need to make sure that the FFI library does not cast Python binary strings to Rust strings behind your back, and you're set.
serialize_display_bounded_length()
in serde core is trickier because in addition to from_utf8_unchecked()
it also trusts the Display
trait implementation on an arbitrary input object to correctly report the length of data it has written.
from hyperjson.
I don't think you have accurately characterized the assumptions around utf8 in Serde:
it also trusts the
Display
trait implementation on an arbitrary input object to correctly report the length of data it has written.
- The code only involves the Display impls of 4 specific types from std::net in the standard library, although this is not relevant for safety and the code is safe for an arbitrary implementation of Display.
- A Display impl does not "report the length of data it has written".
The key observation, which is called out in a comment, is that there is no way to emit non-utf8 encoded bytes through any API in std::fmt. You can see this by looking at the methods of std::fmt::Write
which the module is built around.
from hyperjson.
@dtolnay than you for the clarification! I'd really appreciate if you could add a code comment explaining why a poorly implemented Display
trait would not be a security issue. Here's the relevant snippet:
macro_rules! serialize_display_bounded_length {
($value:expr, $max:expr, $serializer:expr) => {{
let mut buffer: [u8; $max] = unsafe { mem::uninitialized() };
let remaining_len = {
let mut remaining = &mut buffer[..];
write!(remaining, "{}", $value).unwrap();
remaining.len()
};
let written_len = buffer.len() - remaining_len;
let written = &buffer[..written_len];
// <snip>
By reading this code I observe that:
- This macro can be invoked on an arbitrary input object with an arbitrary, potentially incorrect, trait implementations
- This macro creates an array with uninitialized memory in it
- Said array is populated and its new length of it is determined via a call to
write!
macro. - If the
remaining
slice length after callingwrite!
macro is not set to a correct value, this function contains an information disclosure vulnerability. - The
write!
macro invokes the (potentially incorrect) implementation of Display trait internally.
This leads me to believe that in order to verify that this is safe, I need to inspect the implementation of the write!
macro and ensure that it always writes the bytes and modifies input slice length as advertised for an arbitrary implementation of Display
trait; failing that, I need to audit the implementation of Display trait for all objects that macro can be invoked on.
This makes some critical invariants non-local. So an explanation in a code comment on why write!
always writes the bytes as advertised and mutates the length of the input slice correctly for an arbitrary and possibly incorrect implementation of Display
trait would be very helpful.
from hyperjson.
Thanks for the feedback @Shnatsel. I would love to get to the root of where our perspective differs. In response to your observations:
in order to verify that this is safe, I need to inspect the implementation of the
write!
macro and ensure that it always writes the bytes and modifies input slice length as advertised
I guess I don't see this use of write!
as fundamentally different from any other place that Serde relies on standard library functionality to behave "as advertised". If our threat model includes write!
possibly dumping garbage bytes into the output and returning a made-up integer then we are going to have a bad time no matter what, even in code that does not visibly use unsafe nearby. Practically everything is implemented in terms of something unsafe. Just calling to_string()
on an integer or inserting into a BTreeMap involve dozens of lines of unsafe code. The only way to stay sane is assume the standard library behaves "as advertised" and inspect the implementation if there is any doubt, whether or not the callers are safe code.
[Emphasis mine in the following quote]
code comment on *why*
write!
always writes the bytes as advertised and mutates the length of the input slice correctly
As above, I'm not sure there is anything we could write in Serde that would justify why write!
does what it says it does, beyond noting this is what it is supposed to do. The rest of Serde needs to trust every other standard library API (String, Vec, BTreeMap, etc) just as much as we trust write!
here.
for an arbitrary and possibly incorrect implementation of
Display
trait
I've tried not to distinguish between "correct" and "incorrect" implementations of Display because the reasoning applies equally well to any possible way that Display could be implemented. Maybe I'm missing the sort of incorrectness you have in mind -- could you give an example of what you would consider an incorrect implementation of Display
in this context?
from hyperjson.
I was wondering if it was possible for an implementation of Display to actually write a different amount of bytes than is eventually reported to this code via slice mutation in write!
.
Turns out this is not possible because Display passes a string to the Write implementation in the formatter, which handles writing and updating the length. Display cannot mess up the length of the string it passes without abusing unsafe code, so this code is safe for any safe implementation of Display.
from hyperjson.
Thanks a lot to @dtolnay for clarifications and for bearing with me. This attitude to safety is commendable.
Apologies to mre for somewhat unrelated discussion on the issue.
from hyperjson.
We definetly need more tests and fuzzing is a good start, in Haskell i really love quickcheck and i was happy to see Rust-Quickcheck
what about pyo3?
In pyo3-0.2.7:
grep -rn unsafe . | wc -l
742
so maybe add hypothesis-python
from hyperjson.
pyo3 worries me both due to inherent unsafety of FFI and past track record of some of its developers. I would love to see it audited, but I have to admit my competence is insufficient for a meaningful audit.
from hyperjson.
Two years later, drive by hello from a new (ish) pyo3 maintainer. 😄
I treat the soundness of pyo3
as the number one priority of the library. IMO using Rust, a stricter language than C++, gives you design constraints which C++ doesn't have. Also, the C++ pybind11
library is already a very convenient choice for Python extensions written in a compiled language. So in my opinion most people who are using pyo3 really care about Rust's safety aspects.
It's a shame that pyo3 has historical safety issues; I have been working to see them removed. As of today I think I have submitted PRs to resolve the last of the issues labelled Unsound
in the pyo3 bug tracker 🎉
I wouldn't be surprised if there are still skeletons in the closet though - if you do end up doing any fuzzing and find any issues related to pyo3, please do post them: I'll work to clean them up asap!
from hyperjson.
Related Issues (20)
- [0.2.4] Low-level details in JSONDecodeError message by mistake? HOT 5
- dumps doesn't support indent HOT 1
- Dependabot can't resolve your Python dependency files
- Dependabot couldn't authenticate with https://pypi.python.org/simple/
- Dependabot couldn't authenticate with https://pypi.python.org/simple/
- Dependabot can't resolve your Python dependency files
- Dependabot can't resolve your Python dependency files
- Dependabot can't resolve your Python dependency files
- Profiling support HOT 1
- Linking error with make build HOT 7
- few questions HOT 3
- Error when installing via pip HOT 11
- Make install doesn't work. HOT 3
- hypers standard script will not install on macosbeta HOT 1
- Automate deployment to Pypi HOT 5
- Compare with orjson
- Use smallvec to avoid heap allocations
- Add optional support for simdjson-rs behind a feature flag
- Zero-copy string deserialization HOT 1
- Speed up boolean encoding/decoding HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hyperjson.