Bug when changing from platform tools v1.37 to v1.39 (agave, closed)

anza-xyz commented on August 23, 2024

Comments (11)

LucasSte commented on August 23, 2024

> Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, had an update and is now breaking down structs in different places in the code, using more stack space than before.

> Thanks for the explanation! I'll note that this looks like a major regression for our case: not only does the 10-account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts than the v1.39 build can handle.

Those regressions are a concern, but I have good news. I've tested platform-tools v1.41 on this OpenBook-V2 issue and on Anchor's tests/pyth and tests/ido-pool tests, and the new LLVM version made everything work again. We'll back-port v1.41 to Solana v1.18.

LucasSte commented on August 23, 2024

I've found the cause of the problem in OpenBook. @acheroncrypto, your example is likely hitting the same problem: I simplified the OpenBook contract so much that it ended up looking like the code you showed.

What is the problem?

SBFv1 functions have a fixed frame size of 4096 bytes (4 KiB), so using too many stack variables risks overwriting the frame of the caller function. In the OpenBook example, the Anchor-generated function try_accounts deserializes instructions and accounts and performs all account checks, with heavy stack use. Such a function can get quite large when an instruction uses many accounts, as is the case for OpenBook.

In the example, try_accounts writes a value into the frame of create_market, which had stored a pointer on its stack. create_market then reads the wrong pointer value back from its stack and tries to access it, leading to a memory access violation, because the slot where it had stored the address now contains gibberish.
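To make this failure mode concrete, here is a simplified, hypothetical Rust model of fixed-size frames; it is not the SBF VM. It assumes frame d owns bytes [d * 4096, (d + 1) * 4096) of one array and that locals are addressed downward from the frame pointer. Stack, store, and load are illustrative names. A callee that addresses a slot 8 bytes beyond its 4 KiB frame lands on the caller's saved value, much as try_accounts clobbers create_market's pointer.

```rust
/// Simplified, hypothetical model of SBFv1-style fixed stack frames (not the
/// real VM). Frame `d` owns bytes [d * FRAME_SIZE, (d + 1) * FRAME_SIZE), and
/// locals are addressed downward from the frame pointer (`fp - offset`).
pub const FRAME_SIZE: usize = 4096;

pub struct Stack {
    memory: Vec<u8>,
}

impl Stack {
    /// Reserves room for `frames` nested calls.
    pub fn new(frames: usize) -> Self {
        Stack { memory: vec![0; frames * FRAME_SIZE] }
    }

    /// Frame pointer (one past the top byte) of the frame at call depth `depth`.
    fn frame_pointer(depth: usize) -> usize {
        (depth + 1) * FRAME_SIZE
    }

    /// Stores `value` in the 8 bytes starting at `fp - offset` (use offset >= 8).
    /// Nothing checks `offset <= FRAME_SIZE`, mirroring the overflow described
    /// in the thread: a function that needs more than 4 KiB spills downward.
    pub fn store(&mut self, depth: usize, offset: usize, value: u64) {
        let addr = Self::frame_pointer(depth) - offset;
        self.memory[addr..addr + 8].copy_from_slice(&value.to_le_bytes());
    }

    /// Loads the 8-byte value at `fp - offset`.
    pub fn load(&self, depth: usize, offset: usize) -> u64 {
        let addr = Self::frame_pointer(depth) - offset;
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.memory[addr..addr + 8]);
        u64::from_le_bytes(bytes)
    }
}
```

For example, storing a pointer at offset 8 in frame 0 (the caller) and then storing at offset FRAME_SIZE + 8 from frame 1 (the callee) makes load(0, 8) return the callee's value instead of the pointer.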

SBFv2 introduces dynamic stack frames, so this problem won't exist anymore once we migrate to the new runtime.

Why wasn't this a problem in v1.37?

Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, had an update and is now breaking down structs in different places in the code, using more stack space than before.
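SROA itself operates on LLVM IR, so it cannot be shown directly in Rust source. As a rough, source-level illustration only, the two hypothetical functions below compute the same result; the second shows the shape SROA produces, with the aggregate Range replaced by independent scalars. Where those scalars end up (registers or separate stack slots) is decided later in code generation, which is why a change in how SROA splits aggregates can change a function's frame size.

```rust
/// Illustration only: SROA works on LLVM IR, not on Rust source. These two
/// hypothetical functions show the *shape* of the rewrite.
pub struct Range {
    pub start: u64,
    pub end: u64,
}

/// Before: `r` is one aggregate value holding both fields.
pub fn span_aggregate(a: u64, b: u64) -> u64 {
    let r = Range { start: a.min(b), end: a.max(b) };
    r.end - r.start
}

/// After (conceptually): each field becomes its own scalar, which later
/// stages may keep in a register or spill to its own stack slot.
pub fn span_scalarized(a: u64, b: u64) -> u64 {
    let r_start = a.min(b);
    let r_end = a.max(b);
    r_end - r_start
}
```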

In v1.37, try_accounts uses exactly 4096 bytes of the stack, so only a couple more allocations were needed to break the code. These extra allocations come from the updated SROA pass.
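The arithmetic behind "exactly 4096 bytes" is simple: with zero headroom, any extra slot overflows. The sketch below assumes each stack slot is rounded up to 8-byte alignment, and the slot sizes used with it are invented for illustration, not measurements of try_accounts. align8, frame_usage, and fits_in_frame are hypothetical helpers.

```rust
/// Hypothetical frame-budget check for a fixed 4096-byte SBFv1 frame.
/// Assumes every stack slot is rounded up to 8-byte alignment.
pub const FRAME_SIZE: usize = 4096;

/// Rounds `n` up to the next multiple of 8.
pub fn align8(n: usize) -> usize {
    (n + 7) & !7
}

/// Total bytes used by a function's stack slots after alignment.
pub fn frame_usage(slot_sizes: &[usize]) -> usize {
    slot_sizes.iter().map(|&size| align8(size)).sum()
}

/// True if all slots fit within one frame.
pub fn fits_in_frame(slot_sizes: &[usize]) -> bool {
    frame_usage(slot_sizes) <= FRAME_SIZE
}
```

With 512 eight-byte slots the function uses exactly 4096 bytes and still fits; two more eight-byte slots, the kind of extra allocations an optimizer change can introduce, bring it to 4112 bytes and past the limit.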

Any solution?

Although we can disable the SROA pass, doing so won't make try_accounts immune to future optimization changes, nor stop it from overflowing its frame when a contract uses too many accounts. A suggestion would be to break that method down into smaller ones, decreasing stack usage.
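A minimal sketch of the suggested split, assuming a hypothetical, simplified Account type rather than Anchor's real account types: each account is checked in its own non-inlined helper, so that helper's locals occupy its own frame only while it runs instead of accumulating in one large caller frame. #[inline(never)] is only a hint, and actual frame sizes depend on the compiler, so this shows the shape of the refactor rather than a guaranteed stack bound. check_account and check_all are illustrative names.

```rust
/// Hypothetical, simplified account record; not Anchor's `AccountInfo`.
pub struct Account {
    pub owner: [u8; 32],
    pub data: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum CheckError {
    WrongOwner(usize),
    TooSmall(usize),
}

/// Validates a single account. `#[inline(never)]` asks the compiler to keep
/// this as a separate function, so its temporaries live in its own frame and
/// are released on return rather than piling up in the caller's frame.
#[inline(never)]
fn check_account(
    index: usize,
    account: &Account,
    expected_owner: &[u8; 32],
    min_len: usize,
) -> Result<(), CheckError> {
    if account.owner != *expected_owner {
        return Err(CheckError::WrongOwner(index));
    }
    if account.data.len() < min_len {
        return Err(CheckError::TooSmall(index));
    }
    Ok(())
}

/// The caller stays small: it loops over a helper instead of expanding every
/// account's checks into one monolithic body.
pub fn check_all(
    accounts: &[Account],
    expected_owner: &[u8; 32],
    min_len: usize,
) -> Result<(), CheckError> {
    for (index, account) in accounts.iter().enumerate() {
        check_account(index, account, expected_owner, min_len)?;
    }
    Ok(())
}
```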

acheroncrypto commented on August 23, 2024

Anchor has a bunch of tests that fail after upgrading to the 1.18 CLI, with the main difference coming from platform-tools v1.37 vs v1.39 (coral-xyz/anchor#2795 (comment)).

The tests work as long as the program is built using an earlier version than v1.39, independent of solana-cli, test-validator or solana-program version used.

LucasSte commented on August 23, 2024

As an update for this issue, the pyth tests failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.
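The thread does not say which enums changed size. As a hedged illustration, a program that maps enums onto raw account bytes can pin the layout with an explicit representation and guard it at compile time, so a toolchain change fails the build instead of silently shifting data. Side and from_byte are hypothetical names.

```rust
use std::mem::size_of;

/// Hypothetical enum stored as a single byte inside account data.
/// `#[repr(u8)]` fixes both its size and its discriminant values instead of
/// leaving them to the compiler's default layout rules.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Side {
    Bid = 0,
    Ask = 1,
}

// Compile-time guard: the build fails if the size of `Side` ever changes.
const _: () = assert!(size_of::<Side>() == 1);

impl Side {
    /// Decodes a stored discriminant byte, rejecting unknown values.
    pub fn from_byte(byte: u8) -> Option<Side> {
        match byte {
            0 => Some(Side::Bid),
            1 => Some(Side::Ask),
            _ => None,
        }
    }
}
```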

The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command, which generates a single file containing the Rust code passed to the compiler, and I've been stripping out the portions of code that do not influence the error.

I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

godmodegalactus commented on August 23, 2024

If we can pinpoint where the issue comes from, I guess we could create a simpler example. Or we could try testing with an Anchor example.

acheroncrypto commented on August 23, 2024

> As an update for this issue, the pyth tests failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

Nice! Have you checked any of the other failures too?

> The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command to generate a single file with the Rust code passed to the compiler and I've been ridding it of the code portions that do not influence the error.

Here is a much shorter example that is likely related: https://beta.solpg.io/65cbb30bcffcf4b13384cf5b (run locally)

> I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

I think we might be using more memory somehow. The behavior in the example I've shared is very weird too.

LucasSte commented on August 23, 2024

> As an update for this issue, the pyth tests failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

> Nice! Have you checked any of the other failures too?

I had a look at ido-pool, and the problem I found there is the same as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. @acheroncrypto, please let us know if you need anything else to get your PR merged.

acheroncrypto commented on August 23, 2024

> Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, had an update and is now breaking down structs in different places in the code, using more stack space than before.

Thanks for the explanation! I'll note that this looks like a major regression for our case: not only does the 10-account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts than the v1.39 build can handle.

> Although we can disable the SROA pass, doing so won't make try_accounts immune to future optimization changes, nor stop it from overflowing its frame when a contract uses too many accounts. A suggestion would be to break that method down into smaller ones, decreasing stack usage.

The issue is that we can fix these problems in our tests, but it's likely that many of the production programs will also hit this problem once they start using solana-cli 1.18.

> I had a look at ido-pool, and the problem I found there is the same as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. @acheroncrypto, please let us know if you need anything else to get your PR merged.

Thanks, we'll first need a new release that has the fixes to get the PR merged.

We also have some token 2022 tests failing, which I haven't yet debugged, but they are most likely not related to platform-tools.

acheroncrypto commented on August 23, 2024

Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

LucasSte commented on August 23, 2024

> Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

Thanks for the feedback. Can we close this issue?

acheroncrypto commented on August 23, 2024

I think so, yes.
