Comments (10)

johnynek commented on August 27, 2024

How many items are you keeping?

What are the types you are sorting?

from scalding.

johnynek commented on August 27, 2024

By the way, if you just want to get it done, it's only a couple times slower to do:

.groupBy('key) {
  _.sortBy('field).reverse.take(10000)
}

to get the top 10k. This will not do any map-side aggregation however.

We may need to add a special-case optimization for medium-sized sortWithTake situations. Currently, we are serializing every key into one row, and if the number of items is large, or the objects are large, this could cause memory problems.

Lastly, we have some fixes coming to improve memory utilization for non-primitive types within the tuples.
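
To illustrate why a bounded take can be aggregated map-side while a full sort cannot, here is a minimal sketch in plain Scala collections. It is not Scalding's implementation; `TopKSketch`, `topK`, and `merge` are hypothetical names used only for this example. The idea is that each partial result holds at most k items, and two partial results can be merged into another partial result of at most k items.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Scalding: a bounded "top k" that can be
// computed on partial data and merged later.
object TopKSketch {
  // Keep only the k largest items seen so far, using a heap of size <= k.
  def topK[T](items: Iterator[T], k: Int)(implicit ord: Ordering[T]): List[T] = {
    // PriorityQueue exposes the largest element under its ordering, so the
    // reversed ordering puts the smallest retained item at the head.
    val heap = new mutable.PriorityQueue[T]()(ord.reverse)
    items.foreach { item =>
      if (heap.size < k) heap.enqueue(item)
      else if (k > 0 && ord.gt(item, heap.head)) {
        heap.dequeue()      // drop the smallest retained item
        heap.enqueue(item)  // keep the new, larger one
      }
    }
    heap.toList.sorted(ord.reverse) // largest first
  }

  // Merging two partial results gives the top k of their union, which is
  // what lets the work be split between the map side and the reduce side.
  def merge[T](a: List[T], b: List[T], k: Int)(implicit ord: Ordering[T]): List[T] =
    topK((a ++ b).iterator, k)
}
```

Memory per key stays proportional to k, so this only helps when k items of the given type fit comfortably in memory. With k = 10000 and large objects, the single row per key can still be big.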

benlee commented on August 27, 2024

I was doing a _.sortedReverseTake to try to keep the most recent 1000 items, sorting 24-byte string ids. Just so I understand the slower solution -- I understand that the mapper would then emit each value with no map-side aggregation. How would the sort be accomplished, though? For each 'key, would the reducer receive values in a secondary sort order on 'field, or would the reducer actually have to sort an unordered stream of values?

benlee commented on August 27, 2024

Per Argyris' request -- here's a snippet (maybe I'm doing something wrong?)

.groupBy('venueId) {
  _.sortedReverseTake[String]('checkinId -> 'checkinId, 1000).toList[String]('shout -> 'shouts)
}

johnynek commented on August 27, 2024

Can you try without the toList and without the sortedReverseTake?

I wonder which one is really causing the problem.

johnynek commented on August 27, 2024

I guess the issue here is the:

.toList[String]('shout -> 'shouts)

You are putting all the shouts at a venue into one list in one row of a reducer, and this does not spill to disk. If there are a few venues with a lot of shouts, you could have an issue.

Could you try without that part just to verify? Java strings take like 45-50 bytes even for the shortest strings. When they are living in memory, that could potentially add up. If you can avoid using strings and use Longs or Ints, you will almost certainly see some big savings. For instance, I don't know if your checkinId can fit in a Long or not, but that might help.
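
As a rough sketch of the "avoid strings" idea: a 24-character id is too wide for a single Long if it is hexadecimal (96 bits), but it would fit in an Int plus a Long. The code below assumes the ids are 24 lowercase hex characters, which the thread does not confirm; `IdPacking`, `PackedId`, `pack`, and `unpack` are hypothetical names for illustration only.

```scala
// Assumption (not confirmed in this thread): ids are exactly 24 lowercase
// hexadecimal characters, i.e. 96 bits of data.
object IdPacking {
  // 32 high bits + 64 low bits = 96 bits of payload.
  final case class PackedId(high: Int, low: Long)

  def pack(hex: String): PackedId = {
    require(hex.length == 24, "expected 24 hex characters")
    // Parse as unsigned so values with the top bit set do not overflow.
    val high = java.lang.Integer.parseUnsignedInt(hex.substring(0, 8), 16)
    val low = java.lang.Long.parseUnsignedLong(hex.substring(8), 16)
    PackedId(high, low)
  }

  // Zero-padded lowercase hex, so unpack(pack(s)) == s for valid input.
  def unpack(id: PackedId): String =
    f"${id.high}%08x${id.low}%016x"

  // Unsigned comparison keeps the same order as comparing the original
  // fixed-width lowercase hex strings.
  val ordering: Ordering[PackedId] = new Ordering[PackedId] {
    def compare(a: PackedId, b: PackedId): Int = {
      val c = java.lang.Integer.compareUnsigned(a.high, b.high)
      if (c != 0) c else java.lang.Long.compareUnsigned(a.low, b.low)
    }
  }
}
```

The unsigned ordering matters if the packed value replaces the string as the sort field, since a signed comparison would put ids with the top bit set before the others.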

johnynek commented on August 27, 2024

I know I'm totally bombarding this issue, but Chris pointed out that the default spill threshold (an internal Cascading variable) is rather small. I bumped it to his recommended value (which could cause OOM if people have really low-memory clusters). It hasn't caused any issues for us, and it may help with your issue.

Let me know if the latest version works better for you (PS: you can also turn up or down the number of reducers for a particular part of the job and that kind of tuning could help).

benlee commented on August 27, 2024

Thanks, I'll give it a shot!

azymnis commented on August 27, 2024

Ben, did Oscar's suggestion solve your issue? Are you still getting OOM errors?

johnynek commented on August 27, 2024

Assuming our suggestions or the Cascading improvements helped here.
