Comments (6)
Yes, "vtable chunks" is related to dynamic dispatch. Unfortunately, I don't know of any good tools that would help explaining where these calls originate from (you would need to walk the stack to find out). Since you mention worker threads a blind guess would be that you have very short lived tasks and calling the tasks itself from the worker threads produce part of this overhead.
Another unrelated issue is that perf top always displays "unable to open vsyscalls".
I don't know about that, I guess that's an issue with perf top
.
from perf-map-agent.
Thank you, Johannes. Based on perf top
output, we initially suspected vtable lookups and dynamic dispatch as a key issue but could not really confirm it. perf-map-agent
is a really great tool in that it separates out this dynamic dispatch overhead from user code. Most Java profilers appear to include this overhead against method time or is unaccounted for.
We certainly have a lot of “light” virtual methods that are called frequently and are likely contributing to this issue. We will review our invokevirtual
call sites along with high TypeProfile
counts to avoid dynamic dispatch for such code paths. Hopefully, we can also benefit from inlining
as a result of it.
I have a few follow up questions — I am hoping @jrudolph or other experts on this forum can comment upon.
(1) Previously (last year) this overhead showed up against the symbol: vtable stub
. Are these the same/ similar and defined in the Java runtime or is it defined in theperf-map-agent
code ?
(2) Is the reported CPU time for vtable chunks
primary due to dynamic dispatch or does it also include any other tasks ? How accurate is this timing ?
(3) Thoughts on why dynamic dispatch does not scale well. We’ve observed that it is also CPU-architecture-dependent to a large degree. Also curious if this has been an issue for anyone else.
Thanks,
Ravi
from perf-map-agent.
Wrt (1):
Now that you say it, I can only remember seeing vtable stub
and not vtable chunks
. They are both reported through the JVMTI DynamicCodeGenerated callback so perf-map-agent will just output whatever your JVM tells it.
I had a quick look and there's a slight code structure change going from JDK 8 -> JDK 9 which could explain that there was an (intermittent?) name change for the same thing:
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/c25bfaaed7f2/src/share/vm/code/vtableStubs.cpp#l62
(2) TBH I don't know how accurate the reported times are. My current assumption is that for each non-optimizable (e.g. monomorphic) virtual call from a compiled method the virtual stub is run once. E.g. see this description: https://wikis.oracle.com/display/HotSpotInternals/VirtualCalls
On the other hand, as long as I haven't seen actual numbers accompanied with the code positions from where these calls originate I cannot say from experience that these numbers correspond to anything you can do something about. (This alludes to a paper that evaluated Java profilers measuring if the profiler produces "actionable" results which you could use to actually optimize the profiled program.)
(3) Wrt "scaling", I think it's known that dynamic program flow with unpredictable branching has some direct and indirect costs. The question is if the JVM amplifies some of these costs in certain situations. If you look at http://shipilev.net/blog/2015/black-magic-method-dispatch/ it seems that, yes, some calls through invokevirtual
and invokeinterface
are more expensive than a simulation with other branches constructs like if/else
or switch
.
IMO in most situations the solution is still not to micro-optimize these dynamic parts but to increase the size of code blocks that have a statically predictable execution flow. E.g. if the problem should really be dispatching of tasks, you'd need to increase the amount of work done during a task, etc. For other kind of work this just isn't as easily optimized (e.g. data structure processing with many different node types like ASTs).
Wrt CPU-architecture dependence: Out of interest: What kind of differences did you see and on which architectures did you test?
The architecture is quite a big factor, apart from the hardware itself you change to a different JIT compiler and a different performance event implementation, so it's hard to say if it makes sense to compare results (though, I agree, it would be nice if you could optimize for a larger range of architectures).
from perf-map-agent.
Thanks for starting this discussion, @kotaravi. I'm going to close this for now. Please comment or reopen if there are any questions left open.
from perf-map-agent.
Thank you @jrudolph. Your insights and perf-map-agent itself
have helped set a new direction in our approach to solving our scalability issues. Appreciate all your support.
We compared Intel Westmere
processors vs. SandyBridge
and observed that the SB architecture based systems were impacted much more by ~20%. We've also used Intel's emon
to quantify CPI, LLC miss latencies to get an understanding of how that contributes to the overall slowdown. We have also tested/eliminated NUMA
as being an issue by controlling processor affinity with taskset
and memory locality with numactl
.
Finally, we have some work in progress to make code changes wrt. to being vtable
and inlining
friendly. I am not sure we will have test results and comparisons anytime soon. We can leave this issue closed. Thanks.
from perf-map-agent.
@kotaravi very interesting, thanks for sharing your insights.
from perf-map-agent.
Related Issues (20)
- 'jvmticmlr.h' file not found HOT 1
- perf script using perf-java-report-stack output will get "Misordered timestamps: 0"
- perf-map-agent as jvmti agent on start HOT 4
- Unable to load agent sence Premain-Class manifest attribute is missing HOT 1
- Perf not listing more than 1 line in JVM stack HOT 4
- Lack of perf PMU event HOT 1
- Some C stack frames with Java ancestor not correctly shown HOT 9
- Improve support for JSVC
- bump up to new version for new release?
- Clarify license of shell scripts
- JDK inline info not accurate in all cases HOT 1
- when do i have to create the map HOT 2
- time to generate map file HOT 2
- ./bin/dtrace-java-record-stack is not working HOT 1
- Can't compile on Raspberry 4 with jdk 8u181
- Trouble to attach with high PID value HOT 1
- Running create-java-perf-map.sh in different containers in the same pod reports an error? HOT 2
- java.io.IOException: No such process HOT 1
- bjdump: /tmp/perf-1390290.map: File format not recognized HOT 3
- Cross-compiling for Aarch64 HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from perf-map-agent.