Comments (9)

mikeday commented on April 28, 2024

I'm questioning whether generating an actual Rust struct is necessary at all, although it could be helpful for relatively simple structs with fixed field layouts. Complex structs could be represented by an abstract slice with custom accessor methods.

from fathom.

mikeday commented on April 28, 2024

Even simple structs will need accessors anyway for endian swapping if we want zero-copy access.
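
To make the endian-swapping point concrete, here is a minimal sketch of such an accessor. `HeaderRef` and its field names are hypothetical, chosen for illustration only, and the sketch assumes two big-endian `u16` fields at fixed offsets:

```rust
/// Hypothetical zero-copy view over a header with two big-endian `u16` fields.
pub struct HeaderRef<'a> {
    buf: &'a [u8],
}

impl<'a> HeaderRef<'a> {
    /// Checks the length once so the accessors below cannot index out of bounds.
    pub fn new(buf: &'a [u8]) -> Option<HeaderRef<'a>> {
        if buf.len() >= 4 {
            Some(HeaderRef { buf })
        } else {
            None
        }
    }

    /// Decodes the big-endian value on every call; the buffer itself is never modified.
    pub fn version(&self) -> u16 {
        u16::from_be_bytes([self.buf[0], self.buf[1]])
    }

    pub fn num_tables(&self) -> u16 {
        u16::from_be_bytes([self.buf[2], self.buf[3]])
    }
}
```

The bytes stay in the original buffer in file order, so the swap has to happen in the accessor rather than in a stored field.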

brendanzab commented on April 28, 2024

Yeah, that may indeed be the case. We also want the output Rust API to be reasonably predictable, so changing the specification slightly should not greatly change the resulting code.

brendanzab commented on April 28, 2024

Note that I do just that in the cmap example above. Struct fields in Rust are private by default, so the only way to interact with this API is via the accessors.

mikeday commented on April 28, 2024

Right, so the public API doesn't hinge on there being an actual struct, just a type with the appropriate name. (Zero copy would require some lifetime trickery to make references live as long as the main buffer, I suppose.)

brendanzab commented on April 28, 2024

Yup. The CMap type is defined as:

#[repr(C)]
pub struct CMap {
    version: u16,
    num_tables: u16,
    encoding_records: [EncodingRecord],
}

The [EncodingRecord] field makes this struct 'dynamically sized', i.e. it can only be accessed via a pointer indirection. The only way to construct it is via pub fn from_buf(buf: &[u8]) -> Result<&CMap, ()>, so it is forced to be immutable forever.
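
As a small illustration of what 'dynamically sized' means in practice, here is a sketch using a hypothetical generic `Record` type (not the generated `CMap`), so that safe code can build a value and then view it through an unsized reference:

```rust
use std::mem;

/// Hypothetical record with a possibly unsized tail. Making the tail generic lets
/// safe code create a sized value and coerce a reference to it into `&Record<[u8]>`.
#[repr(C)]
pub struct Record<T: ?Sized> {
    pub version: u16,
    pub tail: T,
}

/// `Record<[u8]>` has no compile-time size, so it is only usable behind a pointer.
/// The reference carries the tail length alongside the address.
pub fn tail_len(record: &Record<[u8]>) -> usize {
    record.tail.len()
}

pub fn version(record: &Record<[u8]>) -> u16 {
    record.version
}

/// Size of a reference to the unsized record: address plus length metadata.
pub fn fat_pointer_size() -> usize {
    mem::size_of::<&Record<[u8]>>()
}
```

A `Record<[u8]>` cannot be held in a local variable or returned by value, which is why an API built on such a type has to hand out references.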

brendanzab commented on April 28, 2024

Another alternative might be to do something like:

pub struct CMap<'data> {
    data: &'data [u8],
}

pub fn from_buf<'data>(data: &'data [u8]) -> Result<CMap<'data>, ()> {
    ...
}

I would be interested to see how nom does deserialization, seeing as it claims to be zero-copy too.

Edit: It seems nom is zero-copy in the sense that it does not allocate copies of data from the input iterator. But it does require you to copy data into new structs in order to make use of the parsed data. For example, the gif decoder: https://github.com/Geal/gif.rs/blob/master/src/parser.rs
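
For comparison, here is a minimal sketch of how the borrowed `CMap<'data>` wrapper above could validate its input without copying anything out. The body of `from_buf` is an assumption for illustration, as are the constants and the `encoding_record` accessor; the 8-byte record size assumes the OpenType encoding record layout (two `u16` IDs plus a 32-bit offset):

```rust
/// Borrowed view over a `cmap` table header, mirroring the `CMap<'data>` wrapper above.
pub struct CMap<'data> {
    data: &'data [u8],
}

/// `version` (u16) followed by `num_tables` (u16).
const HEADER_SIZE: usize = 4;
/// Assumed size of one encoding record: platform ID (u16), encoding ID (u16), offset (u32).
const ENCODING_RECORD_SIZE: usize = 8;

pub fn from_buf<'data>(data: &'data [u8]) -> Result<CMap<'data>, ()> {
    // Reading the count through `get` rejects buffers shorter than the header.
    let count = data.get(2..4).ok_or(())?;
    let num_tables = u16::from_be_bytes([count[0], count[1]]) as usize;
    // One up-front check covers every encoding record the header promises.
    let needed = HEADER_SIZE + num_tables * ENCODING_RECORD_SIZE;
    if data.len() < needed {
        return Err(());
    }
    Ok(CMap { data })
}

impl<'data> CMap<'data> {
    pub fn num_tables(&self) -> u16 {
        u16::from_be_bytes([self.data[2], self.data[3]])
    }

    /// Returns the raw bytes of record `index`. The slice borrows from the
    /// original buffer (`'data`), not from this wrapper.
    pub fn encoding_record(&self, index: u16) -> Option<&'data [u8]> {
        if index >= self.num_tables() {
            return None;
        }
        let start = HEADER_SIZE + index as usize * ENCODING_RECORD_SIZE;
        self.data.get(start..start + ENCODING_RECORD_SIZE)
    }
}
```

Because the returned slices carry the `'data` lifetime, they can outlive the `CMap` value itself, but never the buffer they point into.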

brendanzab commented on April 28, 2024

Here are some different methods I've been exploring for the by-ref output:

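use std::{mem, slice};

// `#[repr(C)]` keeps the fields in declaration order; the default Rust layout
// does not guarantee that.
#[repr(C)]
pub struct Bmp {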
    w: u64,
    h: u64,
    data: [u8],
}

impl Bmp {
    pub fn new(buf: &[u8]) -> Result<&Bmp, ()> {
        if buf.len() < Bmp::min_size() {
            Err(())
        } else {
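            // Caveat: this reinterprets the slice's fat pointer as a `&Bmp`. It assumes `buf`
            // is aligned for `u64` and that both fat pointers share a layout, which is not
            // formally guaranteed.
            let bmp = unsafe { mem::transmute::<_, &Bmp>(buf) };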

            if buf.len() != bmp.exact_size() {
                Err(())
            } else {
                Ok(bmp)
            }
        }
    }

    fn min_size() -> usize {
        mem::size_of::<u64>() + // w
        mem::size_of::<u64>() // h
    }

    fn exact_size(&self) -> usize {
        Bmp::min_size() +
        mem::size_of::<u8>() * self.w() as usize * self.h() as usize // data
    }

    pub fn w(&self) -> u64 {
        u64::from_be(self.w)
    }

    pub fn h(&self) -> u64 {
        u64::from_be(self.h)
    }

    pub fn data(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.data.as_ptr(), self.w() as usize * self.h() as usize) }
    }
}

The second variant stores only the raw bytes and computes each field's offset by hand:

use std::{mem, ptr, slice};

pub struct Bmp {
    data: [u8],
}

impl Bmp {
    pub fn new(buf: &[u8]) -> Result<&Bmp, ()> {
        if buf.len() < Bmp::min_size() {
            Err(())
        } else {
            let bmp = unsafe { mem::transmute::<_, &Bmp>(buf) };

            if buf.len() != bmp.exact_size() {
                Err(())
            } else {
                Ok(bmp)
            }
        }
    }

    fn min_size() -> usize {
        Bmp::w_size() + Bmp::h_size()
    }

    fn exact_size(&self) -> usize {
        Bmp::min_size() + self.data_size()
    }

    fn w_size() -> usize { mem::size_of::<u64>() }
    fn h_size() -> usize { mem::size_of::<u64>() }
    fn data_size(&self) -> usize { mem::size_of::<u8>() * self.w() as usize * self.h() as usize }

    fn w_offset() -> isize { 0 }
    fn h_offset() -> isize { Bmp::w_offset() + Bmp::w_size() as isize }
    fn data_offset() -> isize { Bmp::h_offset() + Bmp::h_size() as isize }

    pub fn w(&self) -> u64 {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::w_offset()) as *const u64;
          // The byte buffer carries no alignment guarantee, so use an unaligned read.
          u64::from_be(ptr::read_unaligned(ptr))
        }
    }

    pub fn h(&self) -> u64 {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::h_offset()) as *const u64;
          // The byte buffer carries no alignment guarantee, so use an unaligned read.
          u64::from_be(ptr::read_unaligned(ptr))
        }
    }

    pub fn data(&self) -> &[u8] {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::data_offset()) as *const u8;
          slice::from_raw_parts(ptr, self.data_size())
        }
    }
}

Note that the generated assembly code is identical. This is very similar to what HarfBuzz is doing.

Advantages:

  • all bounds checking is done up-front
  • no reallocation occurs because it is reusing the same bytes as the buffer

Disadvantages:

  • requires the full buffer to be persisted in memory
  • requires a great deal of unsafe code under the hood to work which could be tricky to audit
  • 'interp' types have to be executed by need, without memoization (e.g. for endian conversions)
  • users might want to persist part of the struct beyond the lifetime of the buffer
  • not compatible with streaming
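
One possible way to soften the lifetime-related disadvantage, sketched under assumptions: pair the borrowed view with an owned copy of the decoded fields. `BmpHeaderRef`, `BmpHeader`, and `to_header` are hypothetical names for illustration, and the sketch uses safe slice indexing rather than the unsafe code above:

```rust
/// Borrowed view: valid only while the underlying buffer is alive.
pub struct BmpHeaderRef<'a> {
    buf: &'a [u8],
}

/// Owned copy of the decoded fields: can be kept after the buffer is dropped.
#[derive(Debug, Clone, PartialEq)]
pub struct BmpHeader {
    pub w: u64,
    pub h: u64,
}

impl<'a> BmpHeaderRef<'a> {
    /// Requires at least the two `u64` fields (16 bytes) to be present.
    pub fn new(buf: &'a [u8]) -> Option<BmpHeaderRef<'a>> {
        if buf.len() >= 16 {
            Some(BmpHeaderRef { buf })
        } else {
            None
        }
    }

    fn read_u64_be(&self, offset: usize) -> u64 {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.buf[offset..offset + 8]);
        u64::from_be_bytes(bytes)
    }

    pub fn w(&self) -> u64 {
        self.read_u64_be(0)
    }

    pub fn h(&self) -> u64 {
        self.read_u64_be(8)
    }

    /// Decodes every field once and detaches the result from the buffer's lifetime.
    pub fn to_header(&self) -> BmpHeader {
        BmpHeader { w: self.w(), h: self.h() }
    }
}
```

This trades the zero-copy property for the fields that are copied out, so it only makes sense for small fixed-size parts of the structure.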

brendanzab commented on April 28, 2024

I wanted to experiment with the above unsafe APIs because they followed HarfBuzz's technique, but using unsafe in generated code gives me the heebie-jeebies. Don't want to hit something like that Ragel bug that hit Cloudflare…

Here are some more ideas for the 'spidery api', but this time using byte slices as they are intended to be used:

//! Pixel = {
//!     r : u8,
//!     g : u8,
//!     b : u8,
//! };
//!
//! Bmp = struct {
//!     w : u64be,
//!     h : u64be,
//!     data : [Pixel; w * h],
//!     trailer_len : u32be,
//!     trailer_data : [u32be; trailer_len],
//! };

extern crate byteorder;

use byteorder::{BigEndian, ReadBytesExt};
use std::io::Cursor;
use std::mem;

pub struct BmpDataRef<'a> {
    buf: &'a [u8],
}

impl<'a> BmpDataRef<'a> {
    pub fn new(buf: &'a [u8]) -> BmpDataRef<'a> {
        BmpDataRef { buf }
    }
}

pub struct BmpRef<'a> {
    buf: &'a [u8],
}

impl<'a> BmpRef<'a> {
    pub fn new(buf: &'a [u8]) -> BmpRef<'a> {
        BmpRef { buf }
    }

    pub fn w(&self) -> u64 {
        let offset = 0;
        Cursor::new(&self.buf[offset..])
            .read_u64::<BigEndian>()
            .unwrap()
    }

    pub fn h(&self) -> u64 {
        let offset = mem::size_of::<u64>();
        Cursor::new(&self.buf[offset..])
            .read_u64::<BigEndian>()
            .unwrap()
    }

    pub fn data(&self) -> BmpDataRef<'a> {
        let offset = mem::size_of::<u64>() * 2;
        let size = mem::size_of::<u8>() * 3 * self.w() as usize * self.h() as usize;
        BmpDataRef::new(&self.buf[offset..offset + size])
    }
}
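
One thing the accessors above leave open is that `w * h` comes from untrusted input, so the size computation can overflow and the slice index can panic. A minimal sketch of a checked version follows; `data_len`, `data_range`, and the constants are hypothetical helpers, assuming the 16-byte header and 3-byte `Pixel` from the format description:

```rust
use std::convert::TryFrom;
use std::ops::Range;

/// Bytes per `Pixel` in the format described above (`r`, `g`, `b`, one byte each).
const PIXEL_SIZE: u64 = 3;
/// Offset of the `data` field: it follows the two `u64` fields `w` and `h`.
const DATA_OFFSET: usize = 16;

/// Byte length of the `data` field, or `None` if `w * h * 3` overflows
/// or does not fit in `usize`.
pub fn data_len(w: u64, h: u64) -> Option<usize> {
    let bytes = w.checked_mul(h)?.checked_mul(PIXEL_SIZE)?;
    usize::try_from(bytes).ok()
}

/// Byte range of the `data` field inside a buffer of `buf_len` bytes,
/// or `None` if the buffer is too short to hold it.
pub fn data_range(buf_len: usize, w: u64, h: u64) -> Option<Range<usize>> {
    let end = DATA_OFFSET.checked_add(data_len(w, h)?)?;
    if end <= buf_len {
        Some(DATA_OFFSET..end)
    } else {
        None
    }
}
```

An accessor could then return an `Option` or `Result` built from `data_range` instead of indexing directly.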

This one is a spidery api with 'staged copying'. The advantage is that it does up-front verification in the constructor and caches the results:

//! Pixel = {
//!     r : u8,
//!     g : u8,
//!     b : u8,
//! };
//!
//! Bmp = struct {
//!     w : u64be,
//!     h : u64be,
//!     data : [Pixel; w * h],
//!     trailer_len : u32be,
//!     trailer_data : [u32be; trailer_len],
//! };

extern crate byteorder;

use byteorder::{BigEndian, ReadBytesExt};
use std::io::{self, Cursor};
use std::mem;

#[derive(Copy, Clone)]
pub struct BmpRef<'a> {
    w: u64,
    h: u64,
    data: BmpDataCursor<'a>,
}

impl<'a> BmpRef<'a> {
    pub fn new(buf: &[u8]) -> Result<BmpRef, ()> {
        let w = Cursor::new(&buf.get(0..).ok_or(())?)
            .read_u64::<BigEndian>()
            .map_err(|_| ())?;
        let h = Cursor::new(&buf.get(mem::size_of_val(&w)..).ok_or(())?)
            .read_u64::<BigEndian>()
            .map_err(|_| ())?;

        let data = BmpDataCursor::new(&buf
            .get(mem::size_of_val(&w) + mem::size_of_val(&h)..)
            .ok_or(())?);

        Ok(BmpRef { w, h, data })
    }

    pub fn w(&self) -> u64 {
        self.w
    }

    pub fn h(&self) -> u64 {
        self.h
    }

    pub fn data(&self) -> BmpDataCursor<'a> {
        self.data
    }
}

#[derive(Copy, Clone)]
pub struct BmpDataCursor<'a> {
    buf: &'a [u8],
}

impl<'a> BmpDataCursor<'a> {
    pub fn new(buf: &[u8]) -> BmpDataCursor {
        BmpDataCursor { buf }
    }
    
    pub fn get(&self, index: usize) -> Option<Pixel> {
        unimplemented!()
    }
    
    pub fn iter(&self) -> BmpDataIter {
        unimplemented!()
    }
}

pub struct BmpDataIter {
    // TODO
}

impl Iterator for BmpDataIter {
    type Item = Pixel;
    
    fn next(&mut self) -> Option<Pixel> {
        unimplemented!()
    }
}

pub struct Pixel {
    r: u8,
    g: u8,
    b: u8,
}
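
The `unimplemented!()` bodies above could be filled in along these lines. This is only a sketch under assumptions: it gives `BmpDataIter` a lifetime parameter (which the stub above does not have), makes the `Pixel` fields public, and treats any trailing bytes that do not form a whole pixel as ignorable:

```rust
use std::slice::ChunksExact;

#[derive(Debug, Copy, Clone, PartialEq)]
pub struct Pixel {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

#[derive(Copy, Clone)]
pub struct BmpDataCursor<'a> {
    buf: &'a [u8],
}

impl<'a> BmpDataCursor<'a> {
    pub fn new(buf: &'a [u8]) -> BmpDataCursor<'a> {
        BmpDataCursor { buf }
    }

    /// Decodes the pixel at `index`, or returns `None` past the end of the buffer.
    pub fn get(&self, index: usize) -> Option<Pixel> {
        let start = index.checked_mul(3)?;
        let end = start.checked_add(3)?;
        let bytes = self.buf.get(start..end)?;
        Some(Pixel { r: bytes[0], g: bytes[1], b: bytes[2] })
    }

    /// Iterates over whole pixels; the iterator borrows from the buffer, not the cursor.
    pub fn iter(&self) -> BmpDataIter<'a> {
        BmpDataIter { chunks: self.buf.chunks_exact(3) }
    }
}

pub struct BmpDataIter<'a> {
    chunks: ChunksExact<'a, u8>,
}

impl<'a> Iterator for BmpDataIter<'a> {
    type Item = Pixel;

    fn next(&mut self) -> Option<Pixel> {
        self.chunks
            .next()
            .map(|c| Pixel { r: c[0], g: c[1], b: c[2] })
    }
}
```

Pixels are decoded on demand here, so only the `w` and `h` fields are cached by the constructor, in keeping with the 'staged copying' idea.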
