yotann / bcdb

A database and infrastructure for distributed processing of LLVM bitcode.

License: Other

Shell 1.08% CMake 1.45% Nix 5.32% C++ 68.68% C 0.15% LLVM 18.02% Python 4.90% Dockerfile 0.39%

bitcode distributed dynamic-linking llvm optimization outlining

bcdb's Issues

Splitting debug info is too inefficient

When applied to a module with megabytes of debug metadata, the splitter is extremely slow and uses way too much memory.

Backtrace:

llvm::MDNode::operator new
llvm::DISubprogram::getImpl
llvm::DISubprogram::cloneImpl
llvm::MDNode::clone
MDNodeMapper::mapTopLevelUniquedNode
Mapper::mapMetadata [clone .part.318]
Mapper::mapMetadata
Mapper::remapInstruction
Mapper::remapFunction
llvm::ValueMapper::remapFunction
llvm::RemapFunction
ExtractFunction
bcdb::SplitModule

Handle DICompileUnits properly.

Currently, splitting and joining causes DICompileUnits to be duplicated so each function gets its own copy of the compile unit. To fix this, we need to use the remainder module to keep track of which compile units are actually the same.
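
A minimal sketch of the bookkeeping this would involve, assuming compile units are modeled as plain dictionaries and that the remainder module can carry a table of them. The names `split_functions`, `join_functions`, `unit_copy`, and `unit_id` are hypothetical and do not correspond to BCDB's actual API.

```python
def split_functions(functions):
    """Split (name, compile_unit) pairs into per-function pieces.

    Each piece carries its own copy of the compile unit, as happens today,
    plus an index into a table that would live in the remainder module
    (hypothetical layout).
    """
    remainder_units = []   # one entry per distinct original compile unit
    unit_ids = {}          # identity of original unit -> index in the table
    pieces = []
    for name, unit in functions:
        key = id(unit)
        if key not in unit_ids:
            unit_ids[key] = len(remainder_units)
            remainder_units.append(unit)
        pieces.append({
            "name": name,
            "unit_copy": dict(unit),      # duplicated per function
            "unit_id": unit_ids[key],     # remembers which unit it really was
        })
    return pieces, remainder_units


def join_functions(pieces, remainder_units):
    """Rejoin pieces, pointing each function at the shared unit again."""
    return [(p["name"], remainder_units[p["unit_id"]]) for p in pieces]


cu_a = {"file": "a.c"}
cu_b = {"file": "b.c"}
pieces, units = split_functions([("f", cu_a), ("g", cu_a), ("h", cu_b)])
joined = join_functions(pieces, units)
```

Here `f` and `g` end up sharing a single compile unit after joining, even though each split piece held its own copy.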

Security implications of guided linking

One concern I have, in looking at guided linking, is that it potentially causes a huge increase in the ROP surface of all programs linked using it. If it were possible to specify portions of the optimized set which, in the optimized output, must not have any (transitive) dependency between them, this could significantly alleviate the issue.

To illustrate, let's start with the simple optimized set given in Fig. 1 of the paper: program1 IR needs library IR; program2 IR needs library IR. If we consider the case where program1 is some normal program expected to be run by unprivileged users, and program2 is a tiny helper program that is SetUID in order to obtain specific resources it then hands off to program1, but both use the same set of libraries, then guided linking may significantly reduce the overall security of the system.

If, however, it were possible to state that no dependency relationship may be created from code in program2 IR to code in program1 IR (with placement in the same merged library counting as a dependency relationship in both directions), this problem could be avoided.
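
As a rough illustration of what such a constraint could check, the sketch below walks the dependency graph of the linked outputs and reports any output that is reachable from program2's code and also contains program1's code. The data model (`placement`, `deps`, `forbidden`) and the function name are assumptions made for illustration only; nothing like this exists in BCDB or in the paper.

```python
def isolation_violations(placement, deps, forbidden):
    """Return (src, dst, output) triples that break an isolation constraint.

    placement: output name -> set of source components whose code it contains
    deps:      output name -> set of output names it links against
    forbidden: (src, dst) pairs; no output reachable from src's code may
               contain dst's code
    """
    violations = []
    for src, dst in forbidden:
        # Start from every output that holds code from `src`.
        stack = [out for out, comps in placement.items() if src in comps]
        seen = set(stack)
        while stack:
            out = stack.pop()
            if dst in placement.get(out, ()):
                violations.append((src, dst, out))
                break
            for nxt in deps.get(out, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return violations


deps = {"program1": {"merged_library"}, "program2": {"merged_library"}}
forbidden = [("program2", "program1")]

# Guided linking moved all code into one merged library.
merged = isolation_violations(
    {"program1": set(), "program2": set(),
     "merged_library": {"program1", "program2", "library"}},
    deps, forbidden)

# Program code stays in its own output; only the library is shared.
separated = isolation_violations(
    {"program1": {"program1"}, "program2": {"program2"},
     "merged_library": {"library"}},
    deps, forbidden)
```

In this model the fully merged layout is flagged, while the layout that keeps each program's code in its own output is not. Passing both `("program2", "program1")` and `("program1", "program2")` would express the both-directions rule for code placed in the same library.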

Support ThinLTO

Currently the guided linker combines the entire merged library into a single module. For large sets of software, optimizing and compiling this module is very slow (e.g., LLVM+Clang takes several hours). We should add ThinLTO support so the merged library can be optimized faster.

Detect violated constraints

For debugging purposes, we should add checks at run time and raise an error if any of the constraints are violated. How to do this is explained in the paper.

Upgrade FalseMemorySSA

lib/outlining/FalseMemorySSA.cpp is based on MemorySSA.cpp from LLVM 12. LLVM 13 has a few improvements to this file, which are probably worth copying over.

Handle DICompileUnits properly

Consider giving names to split functions

Because split functions have no name, the globalopt pass (included in opt -O1) deletes them. If we give all the split functions a standardized name (like f) this won't be a problem. However, any name we choose could potentially conflict with other names used by the program.

Another option: store split functions without a name, but give users the option to add a name when retrieving a function from the BCDB.
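
A small sketch of the second option, assuming the caller can supply the set of names already in use when retrieving a function. The function name and the `preferred.N` suffix scheme are hypothetical, chosen only to show how a conflict could be avoided at retrieval time.

```python
def pick_function_name(existing_names, preferred="f"):
    """Pick a name for a retrieved function that is not already in use."""
    if preferred not in existing_names:
        return preferred
    suffix = 0
    # Try preferred.0, preferred.1, ... until an unused name is found.
    while f"{preferred}.{suffix}" in existing_names:
        suffix += 1
    return f"{preferred}.{suffix}"


no_conflict = pick_function_name({"main", "g"})
with_conflict = pick_function_name({"f", "f.0", "main"})
```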

Split functions which use blockaddresses normally

In the normal case, function @f may have blockaddresses stored in global constant @g, which is only used by function @f. The obvious solution in this case is to put @g in the split module along with @f.

In the general case, blockaddresses for function @f may be used by other functions, but this should be extremely rare and we don't need to handle it well.
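
The normal case can be recognized with a simple use check. The sketch below assumes two hypothetical lookup tables (which globals hold blockaddresses of a function, and which functions use each global); it is an illustration of the decision, not BCDB's splitter.

```python
def globals_to_move(func, blockaddress_holders, users):
    """Return globals that can go into func's split module.

    blockaddress_holders: function name -> set of globals holding its
                          blockaddresses
    users:                global name -> set of functions that use it
    A global qualifies only if func is its sole user.
    """
    return sorted(
        g for g in blockaddress_holders.get(func, ())
        if users.get(g, set()) == {func}
    )


holders = {"f": {"g", "h"}}
users = {"g": {"f"}, "h": {"f", "k"}}
movable = globals_to_move("f", holders, users)
```

Here `@g` is used only by `@f` and can move with it, while `@h` is also used by another function and falls into the rare general case.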

No CI for the guided linking experiments

bcdb invalidate is too slow

When invalidating a function with >10,000 cached results, bcdb invalidate is extremely slow (many minutes, CPU-bound). It's much faster to just run sqlite3 /path/to/bcdb 'DELETE FROM call WHERE fid = -1;', even though that should be equivalent. Probably the pragmas that BCDB sets on its connection (maybe the write-ahead log?) are making it slow.
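
One way to test the pragma hypothesis is to time the same DELETE under different journal modes on a throwaway database. The sketch below uses Python's standard sqlite3 module. Only the `call` table and `fid` column names come from the command above; the other columns, the row count, and the helper name are assumptions, and the real BCDB schema and pragmas may differ.

```python
import os
import sqlite3
import tempfile
import time


def time_delete(journal_mode, rows=20000):
    """Time DELETE FROM call WHERE fid = -1 under a given journal mode."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "test.db"))
        # The pragma returns the journal mode that is actually in effect.
        mode = conn.execute(f"PRAGMA journal_mode={journal_mode}").fetchone()[0]
        # Hypothetical table layout: only `call` and `fid` are from the issue.
        conn.execute("CREATE TABLE call (fid INTEGER, args BLOB, result BLOB)")
        conn.executemany(
            "INSERT INTO call VALUES (?, ?, ?)",
            ((-1 if i % 2 else 1, b"args", b"result") for i in range(rows)),
        )
        conn.commit()

        start = time.perf_counter()
        deleted = conn.execute("DELETE FROM call WHERE fid = -1").rowcount
        conn.commit()
        elapsed = time.perf_counter() - start

        conn.close()
        return mode, deleted, elapsed


for requested in ("DELETE", "WAL"):
    mode, deleted, elapsed = time_delete(requested)
    print(f"{mode}: deleted {deleted} rows in {elapsed:.4f}s")
```

If the timings are similar across modes, the slowdown is more likely in how bcdb invalidate issues its statements than in the journal settings.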

Outlining code bugs

There are various cases in the outlining code that I haven't fully thought through, and some of them are probably handled incorrectly. If we try to perform actual outlining, this will lead to incorrect code being generated.
