
Comments (7)

wlam commented on September 24, 2024

When you run Mupd8Main, the parameter -threads specifies the number of threads that the process should reserve to run the tasks you mentioned. That's all you need--the system automatically figures out the rest for you!

For example, if an update task needs to process multiple events for the same key, the implementation will try to queue them on the same thread to ensure that each event sees the slate as updated by the previous event (as required by the model) as efficiently as possible. On the other hand, events with different keys will be automatically hashed onto different threads to improve the concurrent throughput of the application. Any of the threads can run any of the map/update tasks, so if there are more events for one map task over another (for example), the high-demand one can naturally run more often/in more threads to better match the load of the application.
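The same-key, same-thread idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Mupd8's actual code: the class `KeyedDispatchSketch` and its methods are hypothetical, and plain `hashCode` modulo the thread count stands in for whatever hashing Mupd8 really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Mupd8.
final class KeyedDispatchSketch {
    private final List<ExecutorService> slots = new ArrayList<>();

    KeyedDispatchSketch(int threads) {
        for (int i = 0; i < threads; i++) {
            // One single-threaded executor per slot: tasks queued on the
            // same slot run one at a time, in submission order.
            slots.add(Executors.newSingleThreadExecutor());
        }
    }

    // Map a key to a slot index; floorMod keeps the result non-negative
    // even when hashCode() is negative.
    static int slotFor(String key, int threads) {
        return Math.floorMod(key.hashCode(), threads);
    }

    // Queue a task on the slot that owns the key, so two events with the
    // same key never run concurrently.
    Future<?> submit(String key, Runnable task) {
        return slots.get(slotFor(key, slots.size())).submit(task);
    }

    // Stop accepting work and wait for queued tasks; returns false on
    // timeout or interruption.
    boolean shutdownAndWait(long seconds) {
        for (ExecutorService slot : slots) {
            slot.shutdown();
        }
        try {
            for (ExecutorService slot : slots) {
                if (!slot.awaitTermination(seconds, TimeUnit.SECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because every event for a given key lands on the same single-threaded slot, each update sees the state left by the previous one without extra locking, while different keys can proceed in parallel on other slots.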

Does the above answer your question? (If not, please let me know what more I can clarify/document for you!)

from mupd8.

Teots commented on September 24, 2024

Thank you, this answered my question, but it also raised some further questions.

If I need a different partitioning scheme, e.g. round robin, can I implement this by setting specific keys? To do so I would need to know the number of available threads. Is there any way to get this number?

Another question, related to passing data between tasks, is about the I/O layer. When a data tuple should be passed to another task, it is put on a queue; a thread from a thread pool reads this queue, so the tuple is eventually removed from the queue. The tuple is then passed via a TCP connection to the next task. Did I get this right from the source code (I've never worked with Scala so far)?


wlam commented on September 24, 2024

For the first question: I'm not sure I follow the question correctly, but I encourage you to think about the application's logic rather than Mupd8's event distribution first. (Hypothetically, the application is not supposed to worry about how independent slates and events are partitioned across threads or processes; that event hashing happens at all is an artifact of the Mupd8 implementation that can be completely replaced if necessary, e.g., if a new approach improves the system's performance for you, without changing the MapUpdate interface or your applications. Similarly, as a thought experiment you can imagine a simplified MapUpdate-framework implementation, MiniMU, that runs all events iteratively in one big while loop--in which case partitioning schemes and thread numbers are no longer meaningful, but hopefully your application still is.)
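The MiniMU thought experiment can be made concrete with a small sketch. Everything here is hypothetical (the `Event` type, the stream names "lines" and "words", and the hard-wired word-count logic); it only illustrates that an application's map and update logic can stay meaningful when no threads or partitions exist at all.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical "MiniMU": every name here is invented for illustration.
final class MiniMU {
    // An event is a (stream, key, value) triple.
    static final class Event {
        final String stream;
        final String key;
        final String value;

        Event(String stream, String key, String value) {
            this.stream = stream;
            this.key = key;
            this.value = value;
        }
    }

    // Runs a word-count application in one loop and returns the per-key
    // slates (here, simply the count for each word).
    static Map<String, Integer> run(Iterable<String> lines) {
        Queue<Event> queue = new ArrayDeque<>();
        Map<String, Integer> slates = new HashMap<>();
        for (String line : lines) {
            queue.add(new Event("lines", "", line));
        }
        while (!queue.isEmpty()) {
            Event e = queue.remove();
            if (e.stream.equals("lines")) {
                // Map step: emit one event per word, keyed by the word.
                for (String word : e.value.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        queue.add(new Event("words", word, "1"));
                    }
                }
            } else {
                // Update step: fold the event into the slate for its key.
                slates.merge(e.key, Integer.parseInt(e.value), Integer::sum);
            }
        }
        return slates;
    }
}
```

Calling `MiniMU.run(Arrays.asList("a b a", "b c"))` yields the counts a=2, b=2, c=1, with no notion of thread count or partitioning anywhere in the application logic.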

For the passing data ("events," in MapUpdate parlance): Yes, you basically have it right--TCP is used if the event is being passed to another Mupd8Main JVM (generally on another machine). If the event is being passed within the same JVM process (e.g., from one thread to another), then no network traffic is required to queue the new event internally.
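The local-versus-remote distinction can be sketched roughly as below. This is an assumption-laden illustration, not Mupd8's routing code: the class `EventRouterSketch`, the host list, the hash-based `ownerOf` rule, and the `remoteSender` callback (standing in for a TCP write) are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

// Hypothetical illustration only; none of these names come from Mupd8.
final class EventRouterSketch {
    private final List<String> hosts;
    private final String localHost;
    private final BlockingQueue<String> localQueue = new LinkedBlockingQueue<>();
    // Stand-in for the network path, e.g. writing the event to a TCP socket.
    private final BiConsumer<String, String> remoteSender;

    EventRouterSketch(List<String> hosts, String localHost,
                      BiConsumer<String, String> remoteSender) {
        this.hosts = hosts;
        this.localHost = localHost;
        this.remoteSender = remoteSender;
    }

    // Assumed ownership rule: hash the key onto the list of hosts.
    static String ownerOf(String key, List<String> hosts) {
        return hosts.get(Math.floorMod(key.hashCode(), hosts.size()));
    }

    // Returns true if the event was queued in-process, false if it was
    // handed to the remote sender.
    boolean route(String key, String event) {
        String owner = ownerOf(key, hosts);
        if (owner.equals(localHost)) {
            localQueue.add(event); // same JVM: no network traffic needed
            return true;
        }
        remoteSender.accept(owner, event); // other JVM: goes over the wire
        return false;
    }

    int localQueueSize() {
        return localQueue.size();
    }
}
```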

Does this reply help? I fear I may have missed an underlying question you wanted to ask above; if so, please poke me again!


Teots commented on September 24, 2024

Let's take a word count application as an example. In systems like Hadoop, I set the number of mappers to 100 and the number of reducers to 20. As far as I understand your explanation, this is not necessary in Mupd8.

To apply round-robin partitioning instead of hash partitioning, I would need to know the number of available threads in Mupd8. (In Hadoop I know that there are 20 reduce tasks, so I can apply round-robin partitioning easily.) Thus I was wondering whether I can acquire this number at runtime or whether I have to hardcode it.
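For readers comparing the two schemes, here is a minimal generic sketch of hash versus round-robin partitioning. It is not tied to Hadoop's or Mupd8's APIs; `PartitionerSketch` and its members are hypothetical names. It shows why round robin needs the partition count up front, and that it ignores the key entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not a Hadoop or Mupd8 API.
final class PartitionerSketch {
    // Hash partitioning: the same key always maps to the same partition.
    static int hashPartition(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Round-robin partitioning: cycles through partitions regardless of
    // key, so it must be constructed with the partition count.
    static final class RoundRobin {
        private final AtomicLong next = new AtomicLong();
        private final int partitions;

        RoundRobin(int partitions) {
            this.partitions = partitions;
        }

        int nextPartition() {
            return (int) (next.getAndIncrement() % partitions);
        }
    }
}
```

Note the trade-off: round robin balances load evenly, but successive events for the same key can land on different partitions, whereas hash partitioning keeps them together.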


zheguang commented on September 24, 2024

@Teots I believe that in Hadoop the number you set for mappers only serves as a hint to the system. The number of mappers is usually determined at run-time by how many input splits there are in HDFS. The actual number of mappers can be higher than the number you set in mapred.map.tasks. The number of reducers can indeed be controlled by yourself via the mapred.reduce.tasks parameter, but that kind of need usually comes with the need to customize the partitioner.

If my understanding of the question is correct, @Teots is asking whether the number of updaters (analogous to the reducers in MapReduce) can be customized or controlled by the application. Analogous to Hadoop, unless you need to customize the event partitioner in Mupd8, your application wouldn't need to control the number of updaters. (Wang needs to correct me if I'm wrong here.)


wlam commented on September 24, 2024

Correct, it is not necessary to set any number analogous to the 100 or the 20 (above) to run an application in Mupd8. You can set the (total) number of threads for map/update processing as a command-line parameter (-threads), but that's it--Mupd8 will automatically reuse all of those threads for the various mappers and updaters as needed, so there is no fixed number for each map or update.

Further, in light of the work (thanks to Zoheb Vacheri) in Mupd8 to help balance event load across threads (to encourage events of different keys to end up on different threads, and away from any threads handling popular/"hotspot" event keys), I'd also encourage you to use Mupd8's internal partitioning and see what it can do for you.


Teots commented on September 24, 2024

Thanks for your answer!

