GithubHelp home page GithubHelp logo

Comments (6)

jrudolph avatar jrudolph commented on July 20, 2024

Yes, "vtable chunks" is related to dynamic dispatch. Unfortunately, I don't know of any good tools that would help explaining where these calls originate from (you would need to walk the stack to find out). Since you mention worker threads a blind guess would be that you have very short lived tasks and calling the tasks itself from the worker threads produce part of this overhead.

Another unrelated issue is that perf top always displays "unable to open vsyscalls".

I don't know about that, I guess that's an issue with perf top.

from perf-map-agent.

kotaravi avatar kotaravi commented on July 20, 2024

Thank you, Johannes. Based on perf top output, we initially suspected vtable lookups and dynamic dispatch as a key issue but could not really confirm it. perf-map-agent is a really great tool in that it separates out this dynamic dispatch overhead from user code. Most Java profilers appear to include this overhead against method time or is unaccounted for.

We certainly have a lot of “light” virtual methods that are called frequently and are likely contributing to this issue. We will review our invokevirtual call sites along with high TypeProfile counts to avoid dynamic dispatch for such code paths. Hopefully, we can also benefit from inlining as a result of it.

I have a few follow up questions — I am hoping @jrudolph or other experts on this forum can comment upon.

(1) Previously (last year) this overhead showed up against the symbol: vtable stub. Are these the same/ similar and defined in the Java runtime or is it defined in theperf-map-agent code ?

(2) Is the reported CPU time for vtable chunks primary due to dynamic dispatch or does it also include any other tasks ? How accurate is this timing ?

(3) Thoughts on why dynamic dispatch does not scale well. We’ve observed that it is also CPU-architecture-dependent to a large degree. Also curious if this has been an issue for anyone else.

Thanks,
Ravi

from perf-map-agent.

jrudolph avatar jrudolph commented on July 20, 2024

Wrt (1):

Now that you say it, I can only remember seeing vtable stub and not vtable chunks. They are both reported through the JVMTI DynamicCodeGenerated callback so perf-map-agent will just output whatever your JVM tells it.

I had a quick look and there's a slight code structure change going from JDK 8 -> JDK 9 which could explain that there was an (intermittent?) name change for the same thing:

http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/c25bfaaed7f2/src/share/vm/code/vtableStubs.cpp#l62

(2) TBH I don't know how accurate the reported times are. My current assumption is that for each non-optimizable (e.g. monomorphic) virtual call from a compiled method the virtual stub is run once. E.g. see this description: https://wikis.oracle.com/display/HotSpotInternals/VirtualCalls

On the other hand, as long as I haven't seen actual numbers accompanied with the code positions from where these calls originate I cannot say from experience that these numbers correspond to anything you can do something about. (This alludes to a paper that evaluated Java profilers measuring if the profiler produces "actionable" results which you could use to actually optimize the profiled program.)

(3) Wrt "scaling", I think it's known that dynamic program flow with unpredictable branching has some direct and indirect costs. The question is if the JVM amplifies some of these costs in certain situations. If you look at http://shipilev.net/blog/2015/black-magic-method-dispatch/ it seems that, yes, some calls through invokevirtual and invokeinterface are more expensive than a simulation with other branches constructs like if/else or switch.

IMO in most situations the solution is still not to micro-optimize these dynamic parts but to increase the size of code blocks that have a statically predictable execution flow. E.g. if the problem should really be dispatching of tasks, you'd need to increase the amount of work done during a task, etc. For other kind of work this just isn't as easily optimized (e.g. data structure processing with many different node types like ASTs).

Wrt CPU-architecture dependence: Out of interest: What kind of differences did you see and on which architectures did you test?

The architecture is quite a big factor, apart from the hardware itself you change to a different JIT compiler and a different performance event implementation, so it's hard to say if it makes sense to compare results (though, I agree, it would be nice if you could optimize for a larger range of architectures).

from perf-map-agent.

jrudolph avatar jrudolph commented on July 20, 2024

Thanks for starting this discussion, @kotaravi. I'm going to close this for now. Please comment or reopen if there are any questions left open.

from perf-map-agent.

kotaravi avatar kotaravi commented on July 20, 2024

Thank you @jrudolph. Your insights and perf-map-agent itself have helped set a new direction in our approach to solving our scalability issues. Appreciate all your support.

We compared Intel Westmere processors vs. SandyBridge and observed that the SB architecture based systems were impacted much more by ~20%. We've also used Intel's emon to quantify CPI, LLC miss latencies to get an understanding of how that contributes to the overall slowdown. We have also tested/eliminated NUMA as being an issue by controlling processor affinity with taskset and memory locality with numactl.

Finally, we have some work in progress to make code changes wrt. to being vtable and inlining friendly. I am not sure we will have test results and comparisons anytime soon. We can leave this issue closed. Thanks.

from perf-map-agent.

jrudolph avatar jrudolph commented on July 20, 2024

@kotaravi very interesting, thanks for sharing your insights.

from perf-map-agent.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.