yotann / bcdb

A database and infrastructure for distributed processing of LLVM bitcode.

License: Other

Shell 1.08% CMake 1.45% Nix 5.32% C++ 68.68% C 0.15% LLVM 18.02% Python 4.90% Dockerfile 0.39%
bitcode distributed dynamic-linking llvm optimization outlining

bcdb's People

Contributors

andrewf29, dependabot[bot], theo25, yotann

bcdb's Issues

Splitting debug info is too inefficient

When applied to a module with megabytes of debug metadata, the splitter is extremely slow and uses way too much memory.

Backtrace:

llvm::MDNode::operator new
llvm::DISubprogram::getImpl
llvm::DISubprogram::cloneImpl
llvm::MDNode::clone
MDNodeMapper::mapTopLevelUniquedNode
Mapper::mapMetadata [clone .part.318]
Mapper::mapMetadata
Mapper::remapInstruction
Mapper::remapFunction
llvm::ValueMapper::remapFunction
llvm::RemapFunction
ExtractFunction
bcdb::SplitModule

Handle DICompileUnits properly

Currently, splitting and joining cause DICompileUnits to be duplicated, so each function gets its own copy of the compile unit. To fix this, we need to use the remainder module to keep track of which compile units are actually the same.
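
As a rough illustration of the bookkeeping this would need (not bcdb's actual code), the sketch below interns compile units in a table keyed by their contents, so that functions split from the same unit share one entry when joined again. `CompileUnit`, `CompileUnitTable`, `split`, and `join` are all hypothetical names, and a real DICompileUnit has many more fields than the two modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileUnit:
    # Hypothetical stand-in for a DICompileUnit; real ones carry many more fields.
    filename: str
    producer: str


class CompileUnitTable:
    """Hypothetical registry that would be stored with the remainder module."""

    def __init__(self):
        self._index = {}  # CompileUnit -> id
        self.units = []   # id -> CompileUnit

    def intern(self, cu):
        # Return the id of an equal compile unit, registering it if it is new.
        if cu not in self._index:
            self._index[cu] = len(self.units)
            self.units.append(cu)
        return self._index[cu]


def split(functions, table):
    # functions: list of (name, CompileUnit); returns name -> compile unit id.
    return {name: table.intern(cu) for name, cu in functions}


def join(split_functions, table):
    # Rebuild (name, CompileUnit) pairs; equal ids map to one shared object.
    return [(name, table.units[cu_id]) for name, cu_id in split_functions.items()]
```

With this shape, two functions that came from the same source file end up pointing at a single compile unit after a split/join round trip instead of two copies.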

Security implications of guided linking

One concern I have, in looking at guided linking, is that it potentially causes a huge increase in the ROP surface of all programs linked using it. If it were possible to specify portions of the optimized set which, in the optimized output, must not have any (transitive) dependency between them, this could significantly alleviate the issue.

To illustrate, let's start with the simple optimized set given in Fig. 1 of the paper: program1 IR needs library IR; program2 IR needs library IR. If we consider the case where program1 is some normal program expected to be run by unprivileged users, and program2 is a tiny helper program that is SetUID in order to obtain specific resources it then hands off to program1, but both use the same set of libraries, then guided linking may significantly reduce the overall security of the system.

If, however, it were possible to state that no dependency relationship may be created from code in program2 IR to code in program1 IR (with being in the same merged library counting as a dependency relationship in both directions), this problem could be avoided.
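
The requested constraint can be read as a reachability check on a dependency graph. The sketch below is only an illustration of that reading, not something the guided linker provides: `reachable` and `violates_isolation` are hypothetical helpers, the node names follow the Fig. 1 example, and co-location in the merged library is modeled as an edge in both directions, as the issue proposes.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start by following dependency edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def violates_isolation(graph, source, forbidden):
    # True if `source` transitively depends on `forbidden`.
    return forbidden in reachable(graph, source)


# Separate linking: both programs depend on the library, not on each other.
separate = {"program1": ["library"], "program2": ["library"]}

# Hypothetical merged layout: program1 code is placed in the merged library,
# which counts as a dependency in both directions.
merged = {
    "program1": ["library"],
    "program2": ["library"],
    "library": ["program1"],
}
```

Under these assumptions, `violates_isolation(separate, "program2", "program1")` is false, while the merged layout makes program1 code reachable from program2, which is the situation the constraint is meant to rule out.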

Support ThinLTO

Currently the guided linker combines the entire merged library into a single module. For large sets of software, optimizing and compiling this module is very slow (e.g., LLVM+Clang takes several hours). We should add ThinLTO support so the merged library can be optimized faster.

Detect violated constraints

For debugging purposes, we should add checks at run time and raise an error if any of the constraints are violated. How to do this is explained in the paper.

Upgrade FalseMemorySSA

lib/outlining/FalseMemorySSA.cpp is based on MemorySSA.cpp from LLVM 12. LLVM 13 has a few improvements to this file, which are probably worth copying over.

Consider giving names to split functions

Because split functions have no name, the globalopt pass (included in opt -O1) deletes them. If we give all the split functions a standardized name (like f) this won't be a problem. However, any name we choose could potentially conflict with other names used by the program.

Another option: store split functions without a name, but give users the option to add a name when retrieving a function from the BCDB.
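
A minimal sketch of the second option, assuming the caller knows which names are already taken in the module that will receive the function. `pick_name` is a hypothetical helper, and the `.N` suffix scheme is just one possible convention, not something bcdb defines.

```python
def pick_name(preferred, used_names):
    """Return `preferred`, or `preferred.N` for the smallest unused N."""
    if preferred not in used_names:
        return preferred
    suffix = 0
    while f"{preferred}.{suffix}" in used_names:
        suffix += 1
    return f"{preferred}.{suffix}"
```

Because the name is chosen at retrieval time, the stored function stays nameless and the conflict check only has to consider the names present in that one destination.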

Split functions which use blockaddresses normally

In the normal case, function @f may have blockaddresses stored in global constant @g, which is only used by function @f. The obvious solution in this case is to put @g in the split module along with @f.

In the general case, blockaddresses for function @f may be used by other functions, but this should be extremely rare and we don't need to handle it well.
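
The normal case can be expressed as a simple filter. The sketch below assumes the splitter already has two maps available (which functions each global takes block addresses from, and which functions use each global). Both maps and the helper `globals_to_move` are hypothetical and stand in for whatever use-list queries the real code would make.

```python
def globals_to_move(func, blockaddress_refs, users):
    """
    blockaddress_refs: global name -> set of functions whose blocks it addresses
    users:             global name -> set of functions that use the global
    Returns the globals that can be placed in func's split module.
    """
    return sorted(
        g
        for g, targets in blockaddress_refs.items()
        if targets == {func} and users.get(g, set()) <= {func}
    )
```

A global that is also used by some other function falls outside the normal case and is left out of the result, matching the rare general case described above.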

bcdb invalidate is too slow

When invalidating a function with >10,000 cached results, bcdb invalidate is extremely slow (many minutes, CPU-bound). It's much faster to just run sqlite3 /path/to/bcdb 'DELETE FROM call WHERE fid = -1;', even though that should be equivalent. Probably the BCDB's connection setting pragmas (maybe the write-ahead log?) are making it slow.
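
To compare against the direct deletion, the same statement can be issued from Python's standard sqlite3 module. This is a sketch under assumptions: only the `call` table and its `fid` column come from the issue, while the rest of the schema, the index, and the helper names are made up for the example. It does not reproduce what bcdb invalidate does internally.

```python
import sqlite3


def make_db(n_rows):
    # Hypothetical minimal schema: only the table name `call` and the column
    # `fid` are taken from the issue; the other columns are invented.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call (cid INTEGER PRIMARY KEY, fid INTEGER, result BLOB)"
    )
    conn.execute("CREATE INDEX call_fid ON call (fid)")
    conn.executemany(
        "INSERT INTO call (fid, result) VALUES (?, ?)",
        [(-1 if i % 2 == 0 else 7, b"x") for i in range(n_rows)],
    )
    conn.commit()
    return conn


def invalidate(conn, fid):
    # A single DELETE inside a single transaction.
    with conn:
        cur = conn.execute("DELETE FROM call WHERE fid = ?", (fid,))
    return cur.rowcount


def journal_mode(conn):
    # Report the journal mode in effect, to help test the pragma hypothesis.
    return conn.execute("PRAGMA journal_mode").fetchone()[0]
```

Timing `invalidate` on a copy of a real database, with and without the connection pragmas the BCDB sets, would be one way to check whether the pragmas account for the slowdown.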

Outlining code bugs

There are various cases in the outlining code that I haven't fully thought through, and some of them are probably handled incorrectly. If we try to perform actual outlining, this will lead to incorrect code being generated.

  • various TODOs and FIXMEs in lib/Outlining
  • tests should be much more thorough
  • if the original function has an address space, section, comdat, or garbage collector, how should we handle them?
  • parameter attributes may be handled incorrectly
  • function attributes may be handled incorrectly
    • see constructFunction in llvm/lib/Transforms/Utils/CodeExtractor.cpp
  • metadata may be handled incorrectly
    • debug
    • TBAA
    • noalias
    • callback
    • llvm.loop
    • prof
