Comments (11)

LucasSte commented on August 23, 2024

> Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, the LLVM target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

> Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10 account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

Those regressions are a concern, but I have good news. I've tested platform-tools v1.41 on this OpenBook-v2 issue and on Anchor's tests/pyth and tests/ido-pool tests, and the new LLVM version made everything work again. We'll back-port v1.41 to Solana v1.18.

from agave.

LucasSte commented on August 23, 2024

I've found the cause for the problem in OpenBook. Your example @acheroncrypto is likely to be hitting the same problem, as I simplified the OpenBook contract so much that it looked like the code you showed.

What is the problem?

SBFv1 functions have a limited frame size of 4096 bytes (4 KiB), so using too many stack variables risks overwriting the frame of the caller function. In the OpenBook example, the Anchor-generated function try_accounts deserializes instructions and accounts and performs all account checks, with heavy stack use. Such a function can get quite big when an instruction uses many accounts, as is the case for OpenBook.

In the example, try_accounts writes a value into the frame of create_market, which had stored a pointer on its stack. create_market then reads a wrong pointer value from the stack and tries to access it, leading to a memory access violation, because the stack slot where it had stored the address now contains gibberish.

SBFv2 introduces dynamic stack frames, so this problem won't exist anymore once we migrate to the new runtime.
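To make the overwrite concrete, here is a toy model in Rust. It is not the real VM: it assumes frames are laid out back to back and that locals are addressed at negative offsets from the top of the current frame. ToyStack, store_u64, load_u64 and demo_overwrite are hypothetical names introduced only for this sketch.

```rust
/// Frame size taken from the discussion above: 4096 bytes per function.
const FRAME_SIZE: usize = 4096;

/// Toy stack (assumption): contiguous fixed-size frames, where the frame at
/// `depth` owns bytes [depth * FRAME_SIZE, (depth + 1) * FRAME_SIZE).
struct ToyStack {
    mem: Vec<u8>,
}

impl ToyStack {
    fn new(frames: usize) -> Self {
        ToyStack { mem: vec![0u8; frames * FRAME_SIZE] }
    }

    /// Address of a local stored `offset` bytes below the top of the frame.
    fn local_addr(depth: usize, offset: usize) -> usize {
        (depth + 1) * FRAME_SIZE - offset
    }

    fn store_u64(&mut self, depth: usize, offset: usize, value: u64) {
        let addr = Self::local_addr(depth, offset);
        self.mem[addr..addr + 8].copy_from_slice(&value.to_le_bytes());
    }

    fn load_u64(&self, depth: usize, offset: usize) -> u64 {
        let addr = Self::local_addr(depth, offset);
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.mem[addr..addr + 8]);
        u64::from_le_bytes(bytes)
    }
}

/// Returns the caller's saved pointer before and after the callee runs.
fn demo_overwrite() -> (u64, u64) {
    let mut stack = ToyStack::new(2);

    // The caller (depth 0) saves an arbitrary pointer value in its own frame.
    stack.store_u64(0, 8, 0x2_0000_0000);
    let before = stack.load_u64(0, 8);

    // The callee (depth 1) uses a local at an offset larger than FRAME_SIZE,
    // so the write lands inside the caller's frame instead of its own.
    stack.store_u64(1, FRAME_SIZE + 8, 0xDEAD_BEEF);
    let after = stack.load_u64(0, 8);

    (before, after)
}
```

In this model, the value the caller reads back after the call is the callee's 0xDEAD_BEEF rather than the pointer it saved, which mirrors the create_market failure described above.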

Why wasn't this a problem in v1.37?

Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, the LLVM target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

In v1.37, try_accounts already uses exactly 4096 bytes of the stack, so only a couple more allocations were needed to break the code. These extra allocations come from the new SROA pass.
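As a rough illustration of what scalar replacement means, the two functions below are a hand-written before/after pair. The real pass works on LLVM IR, not on Rust source, and the Fees struct and both function names are hypothetical. Whether each resulting scalar ends up in a register or in its own stack slot is decided later by the backend, which is one way stack usage can shift between compiler versions.

```rust
/// Hypothetical aggregate used only for this illustration.
#[derive(Clone, Copy)]
struct Fees {
    maker: u64,
    taker: u64,
    referral: u64,
}

/// "Before": the struct is one local value with three fields.
fn total_aggregate(base: u64) -> u64 {
    let fees = Fees {
        maker: base,
        taker: base * 2,
        referral: base / 2,
    };
    fees.maker + fees.taker + fees.referral
}

/// "After": the same computation with the struct replaced by one scalar
/// per field, which is the shape scalar replacement aims for.
fn total_scalarized(base: u64) -> u64 {
    let fees_maker = base;
    let fees_taker = base * 2;
    let fees_referral = base / 2;
    fees_maker + fees_taker + fees_referral
}
```

Both functions compute the same result; only the shape of the locals differs.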

Any solution?

Although we can disable the SROA pass, such a measure won't make try_accounts impervious to future optimization changes, or to overflowing its frame when a contract uses too many accounts. A suggestion would be to break that method down into smaller ones, decreasing stack usage.
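A minimal sketch of that suggestion, assuming each group of checks needs its own scratch buffer. All names here are hypothetical and this is not Anchor's generated code. #[inline(never)] is only a hint asking the compiler to keep the helper as a separate function with its own frame.

```rust
fn sum(buf: &[u8]) -> u64 {
    buf.iter().map(|&b| u64::from(b)).sum()
}

/// One big function: all four 1 KiB buffers are locals of the same function,
/// so they may all have to fit into a single frame.
fn validate_monolithic() -> u64 {
    let a = [1u8; 1024];
    let b = [2u8; 1024];
    let c = [3u8; 1024];
    let d = [4u8; 1024];
    sum(&a) + sum(&b) + sum(&c) + sum(&d)
}

/// Helper that owns a single buffer; its frame is released when it returns.
#[inline(never)]
fn check_group(seed: u8) -> u64 {
    let scratch = [seed; 1024];
    sum(&scratch)
}

/// Split version: the caller only holds the running total, and each helper
/// call needs room for one buffer at a time.
fn validate_split() -> u64 {
    (1..=4u8).map(check_group).sum()
}
```

Both versions return the same value; the difference is how much data has to be live in any one frame at a time.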

acheroncrypto commented on August 23, 2024

Anchor has a bunch of tests that fail after upgrading to 1.18 CLI, with the main difference coming from platform-tools v1.37 vs v1.39 (coral-xyz/anchor#2795 (comment)).

The tests work as long as the program is built using an earlier version than v1.39, independent of solana-cli, test-validator or solana-program version used.

LucasSte commented on August 23, 2024

As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.
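The thread does not say which enums were affected or what their sizes were. As a generic way to check how large the compiler makes an enum under different representations, a sketch like the following can help; the enum names and enum_sizes are hypothetical.

```rust
use std::mem::size_of;

/// Explicit one-byte representation.
#[allow(dead_code)]
#[repr(u8)]
enum OneByte {
    A,
    B,
}

/// Explicit four-byte representation.
#[allow(dead_code)]
#[repr(u32)]
enum FourBytes {
    A,
    B,
}

/// C-compatible representation: the size depends on the target.
#[allow(dead_code)]
#[repr(C)]
enum CLike {
    A,
    B,
}

/// Returns the sizes in bytes of the three enums above.
fn enum_sizes() -> (usize, usize, usize) {
    (
        size_of::<OneByte>(),
        size_of::<FourBytes>(),
        size_of::<CLike>(),
    )
}
```

Comparing these sizes across toolchain versions for the enums a program actually serializes is one way to spot a layout change like the one described here.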

The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command to generate a single file with the Rust code passed to the compiler and I've been ridding it of the code portions that do not influence the error.

I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

godmodegalactus commented on August 23, 2024

I guess we can create a simpler example then if we could pinpoint where the issue comes from.
Or we can try to test an anchor example.

acheroncrypto commented on August 23, 2024

> As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

Nice! Have you checked any of the other failures too?

> The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command to generate a single file with the Rust code passed to the compiler and I've been ridding it of the code portions that do not influence the error.

Here is a much shorter example that is likely related: https://beta.solpg.io/65cbb30bcffcf4b13384cf5b (run locally)

> I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

I think we might be using more memory somehow. The behavior on the example I've shared is very weird too.

LucasSte commented on August 23, 2024

> As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

> Nice! Have you checked any of the other failures too?

I had a look at ido-pool, but the problem I found is the same as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto, let us know if you need anything else to get your PR merged.

acheroncrypto commented on August 23, 2024

> Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, the LLVM target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10 account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

> Although we can disable the SROA pass, such a measure won't make try_accounts impervious to future optimization changes, or to overflowing its frame when a contract uses too many accounts. A suggestion would be to break that method down into smaller ones, decreasing stack usage.

The issue is that we can fix these problems in our tests, but it's likely that many of the production programs will also hit this problem once they start using solana-cli 1.18.

> I had a look at ido-pool, but the problem I found is the same as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto, let us know if you need anything else to get your PR merged.

Thanks, we'll first need a new release that has the fixes to get the PR merged.

We also have some token 2022 tests failing, which I haven't yet debugged, but they are most likely not related to platform-tools.

acheroncrypto commented on August 23, 2024

Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

LucasSte commented on August 23, 2024

> Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

Thanks for the feedback. Can we close this issue?

acheroncrypto commented on August 23, 2024

I think so, yes.
