
Comments (6)

danielplohmann commented on August 29, 2024

Oh and the original issue with the mongo error above is indeed related to int64/uint64, or rather BSON not being able to store uint64 values. My guess is that you have some binaries/smda reports where the base address is above 0x7FFFFFFFFFFFFFFF. I will look into the conversion of such problematic values, at least for the purpose of database storage.

UPDATE: I was able to replicate the issue with a crafted SMDA report. The original issue has now been fixed in the just-published mcrit 1.0.20. Potentially large values (base addresses, function offsets) are now converted to two's complement for storage to achieve BSON int64 compatibility.
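
For illustration, a hedged sketch of what such a conversion could look like (this is an assumption about the approach described above, not mcrit's verified code):

```python
INT64_BOUNDARY = 0x8000000000000000  # 2**63

def uint64_to_bson_int64(value: int) -> int:
    # Reinterpret an unsigned 64-bit value as a signed 64-bit value
    # (two's complement), e.g. 0xFFFFFFFFFFFFFFFF -> -1, so it fits BSON int64.
    assert 0 <= value < 2 * INT64_BOUNDARY
    return value - 2 * INT64_BOUNDARY if value >= INT64_BOUNDARY else value

def bson_int64_to_uint64(value: int) -> int:
    # Inverse mapping when loading values back from the database.
    return value + 2 * INT64_BOUNDARY if value < 0 else value
```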


danielplohmann commented on August 29, 2024

Hey!
Thanks for highlighting this.
Now what's interesting is that this never happened to me in 10k+ files, which has me pretty confused right now.
Looking at the serialization of the SampleEntry, there are barely any fields that I could imagine exceeding an 8-byte integer.
If I had to guess, it might be some uint/int issue, but without further information it's really hard to figure this out.
I'll have a look and improve the overall logging so that we get some more telemetry and can identify a reproducible case, upon which I can investigate and hopefully fix this issue.
I'll ping you here once MCRIT is updated with said extended logging.


yankovs commented on August 29, 2024

We're at 60k files and more than 50 million functions, pushing this system to its limits, it seems 🦔.

Thank you for the quick response. I'll be glad to help however I can with this issue.


danielplohmann commented on August 29, 2024

Oh, haha, I see. :D
What specs are you running this on (cores, RAM, disk)?
And which parts appear to become especially slow?

I mean, the default configuration given in the docker repo is pretty much tailored to the envisioned use case (up to ~10k files, i.e. manually curated data sets), but I would expect it to still work fine for a low multiple of that.

As you seem to aim for more than an order of magnitude beyond that, I would still expect that changing some of the parameters should yield tolerable performance.
The biggest impact should come from increasing the default lower threshold for matches (MinHashConfig.MINHASH_MATCHING_THRESHOLD) to something like 70 and lowering the bands used in MinHash indexing (StorageConfig._default_storage_bands) to something like {4: 10} or {5: 10} - the latter requires re-indexing though.
Both should help increase the matching speed and reduce the output report size.
There are some more parameters that are currently not exposed in configs that could also help when there is bigger hardware available.
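
For reference, a hedged sketch of such an adjustment (the attribute names come from the paragraph above, but the import path and the {band_size: bands} interpretation are assumptions, not mcrit's verified API):

```python
from mcrit.config import MinHashConfig, StorageConfig  # import path is an assumption

# Raise the lower bound for reported matches: fewer, stronger matches,
# faster matching, smaller output reports.
MinHashConfig.MINHASH_MATCHING_THRESHOLD = 70

# Use a coarser LSH banding, e.g. {4: 10} or {5: 10}; note that changing
# this requires re-indexing the stored MinHashes.
StorageConfig._default_storage_bands = {4: 10}
```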


yankovs commented on August 29, 2024

So, currently our setup is three EC2 instances: one for mongodb alone (4 cores/16 GB/disk that should be fast enough), one for the server plus 15 worker replicas (16 cores/64 GB), and a pretty small one for nginx/web.

The first immediate thing I noticed is that with a lot (10+) of submitter processes (submitting files to be indexed in mcrit), the default number of threads in waitress (8) was not enough. This made the web interface basically unusable while submitting files; even simple queries would take a long time. Adding threads=10 to the serve method in waitress solved this.

The next pretty immediate thing is that some parts of the code don't scale well. For example, in the statistics part of the web interface, the underlying operation is to run countDocuments on samples, families, etc. This works well when the numbers are relatively small, but for tens of millions of functions, functions.countDocuments() just takes a very, very long time. In this case, functions.stats().count or functions.estimatedDocumentCount() is immediate, but not as accurate (from my understanding). Another approach is to create a counts document that is linked to the function/sample collections and is updated on insert.
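
For illustration, a hedged pymongo sketch of the difference (database and collection names are assumptions based on the description above):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mcrit"]  # assumed DB name

# Exact count: inspects matching documents and gets slow at tens of
# millions of entries.
exact = db["functions"].count_documents({})

# Estimate from collection metadata: near-instant, but may drift from the
# true count (e.g. after an unclean shutdown).
estimate = db["functions"].estimated_document_count()
print(exact, estimate)
```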

Also, it seems like the workers don't quite work at full capacity. Those 15 workers we have don't really run 100% in parallel, even though there are more than 15 updateMinHashForSample jobs queued up. Could be related to the config options you mentioned.


danielplohmann commented on August 29, 2024

Sounds like a nice setup!

> The first immediate thing I noticed is that with a lot (10+) of submitter processes (submitting files to be indexed in mcrit), the default number of threads in waitress (8) was not enough. This made the web interface basically unusable while submitting files; even simple queries would take a long time. Adding threads=10 to the serve method in waitress solved this.

Do you mean this one?

```python
serve(wrapped_app, listen="*:8000")
```

Then I will make threads a configurable parameter as well, sounds very reasonable.
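
In the meantime, a hedged sketch of the adjusted call (waitress' serve() accepts a threads keyword; wrapped_app refers to the WSGI app from the snippet above):

```python
from waitress import serve

# More worker threads let waitress keep serving simple queries while
# multiple submitters push files concurrently.
serve(wrapped_app, listen="*:8000", threads=10)
```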

> The next pretty immediate thing is that some parts of the code don't scale well. For example, in the statistics part of the web interface, the underlying operation is to run countDocuments on samples, families, etc. This works well when the numbers are relatively small, but for tens of millions of functions, functions.countDocuments() just takes a very, very long time. In this case, functions.stats().count or functions.estimatedDocumentCount() is immediate, but not as accurate (from my understanding). Another approach is to create a counts document that is linked to the function/sample collections and is updated on insert.

Funnily enough, we noticed that at some point as well and had introduced internal counters in the family collection. I just noticed that we never updated the method that delivers statistics after this. This was just addressed in mcrit 1.0.19, pushed an hour ago. ;)
And yes indeed, this already dropped the response time for that API call on my instance from 60 sec to <5 sec.
Not sure what to do about the aggregation of PicHashes though. Those will remain a slow component in the statistics, but they are also pretty useful when trying to estimate how many "unique" functions are in the DB.
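
For illustration, a minimal sketch of such a maintained-counter approach (collection and field names here are hypothetical, not mcrit's actual schema):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mcrit"]  # assumed DB name

def insert_functions(family_id: int, function_entries: list) -> None:
    # Insert the new function documents, then bump a per-family counter
    # atomically; statistics can then be read in O(1) instead of
    # re-counting millions of documents.
    db["functions"].insert_many(function_entries)
    db["families"].update_one(
        {"family_id": family_id},
        {"$inc": {"num_functions": len(function_entries)}},
        upsert=True,
    )
```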

> Also, it seems like the workers don't quite work at full capacity. Those 15 workers we have don't really run 100% in parallel, even though there are more than 15 updateMinHashForSample jobs queued up. Could be related to the config options you mentioned.

This may also be related to how the queue works; I would need to replicate that with a bigger setup myself to get some introspection. Normally, both minhash calculation and matching should be fully parallelized.
