Comments (10)
How many items are you keeping?
What are the types you are sorting?
from scalding.
By the way, if you just want to get it done, it's only a couple of times slower to do:
.groupBy('key) {
  _.sortBy('field).reverse.take(10000)
}
to get the top 10k. However, this will not do any map-side aggregation.
We may need to make a special-case optimization for medium-sized sortWithTake situations. Currently, we are serializing every key into one row, and if the number of items is large, or the objects are large, this could present memory problems.
Lastly, we have some fixes coming to improve memory utilization for non-primitive types within the tuples.
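To make the trade-off concrete, here is a plain-Scala sketch (no Scalding dependency) of the two approaches: sorting a whole group and then taking k, versus keeping a bounded heap whose partial results can be merged, which is the property that makes map-side aggregation possible. The names `TopKSketch`, `sortThenTake`, `boundedTopK` and `merge` are made up for illustration, and this is not how sortWithTake is actually implemented.

```scala
import scala.collection.mutable

object TopKSketch {
  // Slower route: sort the whole group, then keep the k largest items.
  def sortThenTake[T](items: Seq[T], k: Int)(implicit ord: Ordering[T]): List[T] =
    items.sorted(ord).reverse.take(k).toList

  // Bounded route: never hold more than k items at once.
  def boundedTopK[T](items: Iterator[T], k: Int)(implicit ord: Ordering[T]): List[T] = {
    // With the reversed ordering, heap.head is the smallest item currently kept.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    items.foreach { item =>
      if (heap.size < k) heap.enqueue(item)
      else if (k > 0 && ord.gt(item, heap.head)) {
        heap.dequeue()      // evict the smallest kept item
        heap.enqueue(item)  // and keep the larger newcomer
      }
    }
    heap.toList.sorted(ord.reverse) // largest first
  }

  // Two partial top-k lists (say, from two mappers) combine into one top-k list.
  def merge[T](a: List[T], b: List[T], k: Int)(implicit ord: Ordering[T]): List[T] =
    boundedTopK((a ++ b).iterator, k)
}
```

Both routes return the same items for distinct inputs. The difference is that the bounded version holds at most k items per key at any time, though for a large k or large objects that one row can still be big.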
from scalding.
I was doing a _.sortedReverseTake to try and keep the most recent 1000 items, sorting 24-byte string ids. Just so I understand the slower solution -- I understand that the mapper would then emit each value with no map-side aggregation. How would the sort be accomplished, though? For each 'key, would the reducer receive values in a secondary sort order on 'field, or would the reducer actually have to sort an unordered stream of values?
from scalding.
Per Argyris' request -- here's a snippet (maybe I'm doing something wrong?)
.groupBy('venueId) {
  _.sortedReverseTake[String]('checkinId -> 'checkinId, 1000).toList[String]('shout -> 'shouts)
}
from scalding.
Can you try without the toList and without the sortedReverseTake?
I wonder which one is really causing the problem.
from scalding.
I guess the issue here is the:
.toList[String]('shout -> 'shouts)
You are putting all the shouts at a venue into one list in one row of a reducer, and this does not spill to disk. If there are a few venues with a lot of shouts, you could have an issue.
Could you try without that part just to verify? Java strings take like 45-50 bytes even for the shortest strings. When they are living in memory, that could potentially add up. If you can avoid using strings and use Longs or Ints, you will almost certainly see some big savings. For instance, I don't know if your checkinId can fit in a Long or not, but that might help.
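For illustration only, here is a plain-Scala sketch of what a bounded result per venue could look like: each shout stays attached to its check-in id and only the k largest ids are kept, so the list for a venue never exceeds k entries. The `Checkin` case class and `recentShouts` are hypothetical names that mirror the fields in the snippet above. This uses ordinary in-memory collections, so it shows the shape of the output rather than the memory behaviour on a cluster, and it assumes the ids sort in recency order.

```scala
object BoundedShouts {
  // Hypothetical record type; field names mirror the snippet in this thread.
  final case class Checkin(venueId: String, checkinId: String, shout: String)

  // For each venue, keep the shouts of the k check-ins with the largest ids.
  def recentShouts(checkins: Seq[Checkin], k: Int): Map[String, List[String]] =
    checkins
      .groupBy(_.venueId)
      .map { case (venue, rows) =>
        val newestFirst = rows.sortBy(_.checkinId)(Ordering[String].reverse)
        venue -> newestFirst.take(k).map(_.shout).toList
      }
}
```

In the original job the top 1000 ids and the full shout list are computed independently, so the shouts are not limited to those 1000 check-ins. Keeping them paired is one way to bound both.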
from scalding.
I know I'm totally bombarding this issue, but Chris pointed out that the default spill threshold (an internal cascading variable) is rather small. I bumped it to his recommended value (which could cause OOM if people have really low memory clusters). It hasn't caused any issues for us, and it may help your issue.
Let me know if the latest version works better for you (PS: you can also turn up or down the number of reducers for a particular part of the job and that kind of tuning could help).
from scalding.
Thanks, I'll give it a shot!
from scalding.
Ben, did Oscar's suggestion solve your issue? Are you still getting OOM errors?
from scalding.
Assuming our suggestions or cascading improvements helped here.
from scalding.