
danielplohmann / mcrit


The MinHash-based Code Relationship & Investigation Toolkit (MCRIT) is a framework created to simplify the application of the MinHash algorithm in the context of code similarity.

License: GNU General Public License v3.0

Makefile 0.06% Python 99.94%
code-similarity disassembly reverse-engineering

mcrit's Introduction

Hi! I'm Daniel and I do research around (malware) reverse engineering and analysis automation.

The root and motivation for most of my projects is Malpedia, a resource for rapid identification and actionable context when investigating malware. It was launched in December 2017 by Steffen Enders and me and has been maintained by us ever since.

SMDA is a minimalistic recursive disassembler, which internally uses capstone. It was created to study and improve heuristics for function entry point detection, especially in memory-mapped buffers and shellcode.

MCRIT is the MinHash-based Code Relationship & Investigation Toolkit, a binary code similarity analysis framework. It uses SMDA as its built-in disassembler, and picblocks for the hashing of basic blocks. For easy deployment, it comes as docker-mcrit, including its web UI mcritweb.

To filter out library code during analysis, we created mcrit-data, a collection of reference library code for various compilers (MSVC, MinGW, Go, Nim, ...) and commonly found third-party libraries. For this, the support tool lib2smda was created, which can be used to convert LIB/OBJ files into SMDA reports that can then be imported into MCRIT. A precursor to this was Empty MSVC, a collection of "empty main()" Visual Studio projects compiled with various options, which can also serve well as ground truth for commonly found compiler/library code.

During my research on dynamic Windows API imports in malware, I wrote ApiScout. It's a method/tool to reliably recover such dynamic imports and make them usable in other tools. We also showed that the entirety of Windows API imports used by a malware family can be used effectively for its identification.

In 2012, I created IDAscope, an IDA Pro plugin that provides various convenience functionality during reversing. It was one of the first plugins to make extensive use of PySide/PyQt in IDA and served as a template for many others.

Over the years, I occasionally wrote some blog posts, which cover many of the above projects or aspects of them in detail.

If you want to support my work, I would be happy if you'd buy me a coffee.

mcrit's People

Contributors

blattm, danielplohmann, dannyquist, yankovs


mcrit's Issues

Use child processes in client

Similar to the solution for memory issues observed in the backend, it might be worthwhile to adapt the spawning methodology for batch processing in the CLI client (e.g. malpedia update processing).
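A minimal sketch of what this could look like on the client side, assuming a hypothetical per-sample submission helper; running each task in a fresh child process returns its memory to the OS between samples:

import multiprocessing

def _process_one_sample(filepath):
    # hypothetical placeholder for submitting/processing a single sample via the MCRIT client
    ...

def process_batch(filepaths, max_workers=4):
    # maxtasksperchild=1 makes the pool spawn a fresh child process for every task,
    # so memory consumed while handling one sample is released when that child exits
    with multiprocessing.Pool(processes=max_workers, maxtasksperchild=1) as pool:
        pool.map(_process_one_sample, filepaths)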

Consider use of JSON marshalling accelerators

By transforming additional DTOs into full Python dataclasses, it would likely become possible to use an acceleration library like mashumaro for the (un)marshalling of MatchingResult, which is currently very expensive for large reports.
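A minimal sketch of the idea with mashumaro; the DTO fields below are illustrative only and not MCRIT's actual MatchingResult layout:

from dataclasses import dataclass, field
from typing import List

from mashumaro import DataClassDictMixin

@dataclass
class FunctionMatchDTO(DataClassDictMixin):
    function_id: int
    matched_function_id: int
    score: float

@dataclass
class MatchingResultDTO(DataClassDictMixin):
    sample_id: int
    function_matches: List[FunctionMatchDTO] = field(default_factory=list)

# (un)marshalling then goes through generated code instead of hand-written dict handling
result = MatchingResultDTO(sample_id=1, function_matches=[FunctionMatchDTO(1, 2, 0.97)])
as_dict = result.to_dict()
restored = MatchingResultDTO.from_dict(as_dict)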

Result filtering based on benign functions

Hey!

Awesome project! I've been reading the code and wondering, is there currently any way to filter benign functions using mcrit?

Let's say I keep a repository of functions extracted from compiler boilerplate and such (much like mcrit-data). Is there any way, when I index some malware sample, to basically remove these functions from the malware sample? Ideally, I'd like the MCRIT DB to index only the "interesting" functions of the binary and compute MinHashes based on those.

Maybe mcrit works in a different manner than what I'm thinking about, so my question may not be relevant, but it would be nice to know either way.

SpawningWorker: logging, check if process crashed

Hey!

I've tried the new spawning worker type, and I have some remarks about it:

In this section:

def _executeJobPayload(self, job_payload, job):
    # instead of execution within our own context, spawn a new process as worker for this job payload
    console_handle = subprocess.Popen(["python", "-m", "mcrit", "singlejobworker", "--job_id", str(job.job_id)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # extract result_id from console_output
    result_id = None
    try:
        stdout_result, stderr_result = console_handle.communicate(timeout=self._queue_config.QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT)
        stdout_result = stdout_result.strip().decode("utf-8")
        last_line = stdout_result.split("\n")[-1]
        # successful output should be just the result_id in a single line
        match = re.match("(?P<result_id>[0-9a-fA-F]{24})", last_line)
        if match:
            result_id = match.group("result_id")
    except subprocess.TimeoutExpired:
        LOGGER.error(f"Job {str(job.job_id)} running as child from SpawningWorker timed out during processing.")
    return result_id

it should be taken into account that the singlejobworker subprocess might crash. We've noticed cases where it does crash, and in those cases result_id = None is returned and "Finished Remote Job with result_id: None" is logged. This causes the job to appear as finished in the web UI, but trying to access the result leads to an error page saying the result doesn't exist.

I think this can be easily handled with (plus maybe logic to mark job as failed):

with job as j:
    LOGGER.info("Processing Remote Job: %s", job)
    result_id = self._executeJobPayload(j["payload"], job)
-   # result should have already been persisted by the child process,we repeat it here to close the job for the queue
-   job.result = result_id
-   LOGGER.info("Finished Remote Job with result_id: %s", result_id)
+   if result_id:
+       # result should have already been persisted by the child process,we repeat it here to close the job for the queue
+       job.result = result_id
+       LOGGER.info("Finished Remote Job with result_id: %s", result_id)

One other thing I wanted to mention is that log output from the child singlejobworker process does not show up when doing, for example, docker logs <worker-container-id>. Since the parent already captures stdout and stderr of the child process, I think it's just a matter of logging them from the SpawningWorker parent process.
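A minimal sketch of forwarding the child's output to the parent's logger, reusing stdout_result and stderr_result as captured in _executeJobPayload above; the helper name and log levels are illustrative:

def _logChildOutput(self, job, stdout_result, stderr_result):
    # forward the singlejobworker child's console output into the parent worker's log,
    # so it also shows up in e.g. `docker logs <worker-container-id>`
    if stdout_result:
        LOGGER.info("singlejobworker stdout for job %s:\n%s", job.job_id, stdout_result)
    if stderr_result:
        stderr_text = stderr_result.strip().decode("utf-8")
        if stderr_text:
            LOGGER.warning("singlejobworker stderr for job %s:\n%s", job.job_id, stderr_text)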

Other than this the feature seems to be working great, and thank you for taking the time to implement it :)

Sample Deletion is incomplete

When using the client's functionality to delete samples by function_id, the respective entries are not removed from the band_* collections. This means that when candidates are generated, there will be dangling entries among them, which will lead to errors as these cannot be resolved by their ID.

Generally, if we have a broken state, we can fix it like this:

all_gap_function_ids = []
previous_id = None

# find the gap of function_ids left behind by the deletion
for entry in database["functions"].find().sort("function_id", 1):
    if previous_id is not None and entry["function_id"] - previous_id > 1:
        print("we have a gap here!")
        print(previous_id, entry["function_id"])
        for fid in range(previous_id + 1, entry["function_id"]):
            all_gap_function_ids.append(fid)
        break
    previous_id = entry["function_id"]

# pull the dangling function_ids from all band_* collections
for band_number in range(0, 20):
    database[f"band_{band_number}"].update_many({}, {"$pull": {"function_ids": {"$in": all_gap_function_ids}}})

As a result, we probably want to incorporate the lower part of this snippet into the respective deletion method in MongoDbStorage, so that function_ids are also removed from the band_* collections (see the sketch below).
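A minimal sketch of such a repair step, assuming the collection layout used in the snippet above and that the band count of 20 matches the instance's MinHash configuration:

def remove_function_ids_from_bands(database, function_ids, num_bands=20):
    # pull the deleted function_ids from every band_* collection so that candidate
    # generation no longer produces dangling entries
    for band_number in range(num_bands):
        database[f"band_{band_number}"].update_many(
            {"function_ids": {"$in": function_ids}},
            {"$pull": {"function_ids": {"$in": function_ids}}},
        )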

Reassign unstarted jobs on crashed worker

Hey! :)
Regarding:

def release_all_jobs(self):
    # release all jobs associated with our consumer id if they are started, locked, but not finished.
    self._getCollection().update_many(
        filter={"locked_by": self.consumer_id, "started_at": {"$ne": None}, "finished_at": {"$eq": None}},
        update={"$set": {"locked_by": None, "locked_at": None}, "$inc": {"attempts_left": -1}}
    )

I'm not 100% sure how dispatching of jobs in MCRIT works, so this question might be irrelevant. Is it possible for jobs to be assigned to a worker while it is still processing some other job, meaning they are locked but started_at == None? If so, then it might also make sense to release those jobs (a broadened release is sketched after the list) so they can be either:

  • taken by other workers to be processed
  • taken by the same worker in case some worker recovery/restart mechanism is implemented
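A minimal sketch of a broadened release, assuming the same collection layout as release_all_jobs above; dropping the started_at condition also releases jobs that are locked but not yet started:

def release_all_jobs(self):
    # release every job locked by this consumer that has not finished yet, whether
    # or not processing has already started; whether attempts_left should also be
    # decremented for never-started jobs is a separate design decision
    self._getCollection().update_many(
        filter={"locked_by": self.consumer_id, "finished_at": {"$eq": None}},
        update={"$set": {"locked_by": None, "locked_at": None}, "$inc": {"attempts_left": -1}},
    )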

Binary removal on job deletion might cause issues for certain files

Hey!

Some thoughts on a scenario: let's say sample xyz.exe is indexed in an MCRIT instance, i.e., added using an addBinarySample job. Later, let's assume the same xyz.exe sample was queried, for example using getMatchesForUnmappedBinary.

From what I saw, I guess for optimization's sake, files in fs.files can have a list of jobs, meaning that a file is saved only once across multiple jobs. The issue is that, from what I saw while experimenting with this, both job types will be stored in the fs.files entry for this file.

Then, when the addBinarySample job is a week old (for example), it will be deleted during cleanup. I'm not sure what happens afterwards when trying to use the system.

A solution I thought about: only delete a sample from GridFS when (1) it is not indexed, AND (2) all jobs referencing the sample are older than a user-defined TTL, i.e. there is no newer job for it.

Edit: I think the existing clean method will work for this; it deletes the job_id from the fs.files entry, which should mitigate the issue. That is: delete job_ids from the file's list in fs.files, and only delete the file itself when it is safe to do so, i.e. when the list of jobs is empty (see the sketch below).
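A minimal sketch of that cleanup rule; field names like metadata.job_ids and metadata.is_indexed are assumptions, not MCRIT's actual fs.files layout:

import gridfs

def remove_job_reference(database, file_id, job_id):
    files = database["fs.files"]
    # drop this job's reference from the file entry
    files.update_one({"_id": file_id}, {"$pull": {"metadata.job_ids": job_id}})
    entry = files.find_one({"_id": file_id})
    metadata = (entry or {}).get("metadata", {})
    # only delete the binary itself once it is not indexed and no job references it anymore
    if entry and not metadata.get("is_indexed") and not metadata.get("job_ids"):
        gridfs.GridFS(database).delete(file_id)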

Index-out-of-range may occur in job's `sample_id` property

Hey! :)

I've recently updated to the current MCRIT/web and noticed I get a crash for basically every family I click on in the families view. The traceback is as follows:

[2023-12-12 16:39:05,607] ERROR in app: Exception on /explore/families/5 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/mcritweb/mcritweb/views/utility.py", line 39, in wrapped_view
    return view(**kwargs)
  File "/opt/mcritweb/mcritweb/views/authentication.py", line 204, in wrapped_view
    return view(**kwargs)
  File "/opt/mcritweb/mcritweb/views/explore.py", line 213, in family_by_id
    job_collection.filterToSampleIds([s.sample_id for s in samples])
  File "/usr/local/lib/python3.8/dist-packages/mcrit/queue/JobCollection.py", line 36, in filterToSampleIds
    if job.sample_id in sample_ids:
  File "/usr/local/lib/python3.8/dist-packages/mcrit/queue/LocalQueue.py", line 137, in sample_id
    return int(self.arguments[0][0])
IndexError: list index out of range

After a bit of code reading, it seems the issue is that when trying to list the samples of a family, the job's sample_id property tries to extract the sample_id as if the job were a getUniqueBlocks job, even though this is not the case.

Edit: the same behavior was observed in two other spots: (1) when clicking Explore -> Samples, (2) Data -> Jobs/Results -> Blocks.

Configurable UniqueBlocks queries

Instead of doing the UniqueBlocks analysis based on our best-practice settings from YARA-Signator, make it configurable with at least the following parameters (a sketch follows the list):

  • min/max instructions per selected block
  • min/max bytes per selected block
  • required blocks as coverage per sample (currently hardcoded to 10)
  • number of them required in the rule condition
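A minimal sketch of such a parameter set; names and defaults below are illustrative, not MCRIT's actual configuration:

from dataclasses import dataclass

@dataclass
class UniqueBlocksQueryConfig:
    min_block_instructions: int = 5
    max_block_instructions: int = 50
    min_block_bytes: int = 16
    max_block_bytes: int = 256
    blocks_per_sample: int = 10       # required coverage per sample (currently hardcoded to 10)
    blocks_in_condition: int = 7      # how many of the selected blocks the rule condition requires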

Performance: query on large collections

Hey!

We see MCRIT as a great tool for malware similarity purposes and want to see if it can be integrated into our malware pipeline, with emphasis on the API it provides. We have a DB with a lot of samples, some families with tens of thousands of files associated with them. Simple testing with a moderate number of files shows MCRIT indeed works great. However, when it grows to 100k+ files, it begins to significantly slow down and can take more than 10 minutes for a query about some file. Because of the amount of samples we already have and the daily amount of stuff we get from different sources, it is a matter of time before we reach 100k even if we start small and make a curated set of samples for each family.

Of course, this isn't a trivial thing and it requires further inspection of each step in the process. However, I think that it raises some questions worth discussing:

  • Why was MongoDB chosen for the project? Is it the right fit, if we keep scale in mind?
  • Is the database design optimal, or is there any place to improve in regard to the indexes chosen and the queries performed?
  • How does MCRIT deal with files that contain many functions (we have ones with over 80k!)? Is there any other way to deal with them?
  • Should MCRIT support managed solutions like Amazon's DocumentDB? Those kinds of solutions handle things like sharding the DB for horizontal scaling and are easy to deploy. However, DocumentDB in particular isn't quite 100% MongoDB compatible.
  • Where are places where a bottleneck can occur during a query?

I hope this doesn't come across as a complaint, because we think MCRIT is great and would really love to use it in production :)

recalculateMinHashes progress measure is inaccurate

I am running recalculateMinHashes via the web interface and the Progress indicator is showing 6187.50% and growing. I think, but am not certain, that the Worker.updateMinHashes function needs to set the total for progress instead of letting the calculateMinHashes function do it, since updateMinHashes currently batches multiple calls to calculateMinHashes.

Mongo: BSON document might be bigger than 16 MB

MongoDbStorage's insert_many method should probably check the total size of the documents, or whether one of the documents is itself too big. In some (pretty rare) cases, the size can exceed the 16 MB BSON limit and result in an exception:

mcrit-server                  | 2023-09-12 17:57:10 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 229, in _dbInsertMany
mcrit-server                  |     insert_result = self._database[collection].insert_many([self._toBinary(document) for document in data])
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 757, in insert_many
mcrit-server                  |     blk.execute(write_concern, session=session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 580, in execute
mcrit-server                  |     return self.execute_command(generator, write_concern, session)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 447, in execute_command
mcrit-server                  |     client._retry_with_session(self.is_retryable, retryable_bulk, s, self)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                  |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                  |     return func(self, *args, **kwargs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                  |     return func(session, conn, retryable)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 435, in retryable_bulk
mcrit-server                  |     self._execute_command(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/bulk.py", line 381, in _execute_command
mcrit-server                  |     result, to_send = bwc.execute(cmd, ops, client)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 966, in execute
mcrit-server                  |     request_id, msg, to_send = self.__batch_command(cmd, docs)
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 956, in __batch_command
mcrit-server                  |     request_id, msg, to_send = _do_batched_op_msg(
mcrit-server                  |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 1353, in _do_batched_op_msg
mcrit-server                  |     return _batched_op_msg(operation, command, docs, ack, opts, ctx)
mcrit-server                  | pymongo.errors.DocumentTooLarge: BSON document too large (60427090 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
mcrit-server                  |
mcrit-server                  | During handling of the above exception, another exception occurred:
mcrit-server                  |
mcrit-server                  | Traceback (most recent call last):
mcrit-server                  |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                  |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                  |     func(*args, **kwargs)
mcrit-server                  |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                  |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                  |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                  |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                  |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 622, in addSmdaReport
mcrit-server                  |     self._dbInsertMany("functions", function_dicts)
mcrit-server                  |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 238, in _dbInsertMany
mcrit-server                  |     raise ValueError("Database insert failed.")
mcrit-server                  | ValueError: Database insert failed.

Unfortunately, I didn't log which samples caused this, so I don't really have context to provide. Overall this is pretty uncommon; it happened 4 times over more than 120k files. A sketch of a possible size guard follows below.
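A minimal sketch of such a guard around insert_many, using the bson module that ships with pymongo; how oversized documents should ultimately be handled is a separate design question:

import bson

BSON_LIMIT = 16 * 1024 * 1024  # 16 MiB server-side document size limit

def checked_insert_many(collection, documents):
    # split off any single document that would exceed the BSON limit instead of
    # letting the whole batch fail with DocumentTooLarge
    safe, oversized = [], []
    for document in documents:
        if len(bson.encode(document)) >= BSON_LIMIT:
            oversized.append(document)
        else:
            safe.append(document)
    if safe:
        collection.insert_many(safe)
    return oversized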

Considerations for DbCleanup

  • think of orphan query samples and functions that don't have a job connected to them
  • think of doing a DB compact afterwards

KeyError on import

I am receiving the following error when attempting to import the MSVC/x86/mcrit/2003_Express_x86.mcrit file from the mcrit-data repo:

INFO:mcrit.index.MinHashIndex:Family remapping created: 1 families, 1 samples.
2024-02-09 23:46:06 [FALCON] [ERROR] POST /import => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
    func(*args, **kwargs)
  File "/opt/mcrit/mcrit/server/StatusResource.py", line 73, in on_post_import
    import_report = self.index.addImportData(import_data)
  File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 220, in addImportData
    sample_entry.family_id = family_id_remapping[sample_entry.family_id]
KeyError: 0

I am running docker-mcrit that is using mcrit v1.3.4. I am initiating the import using the command-line client.

Register Worker IDs to avoid zombie jobs

When workers fetch items from the queue, they register on the job with a dynamically generated worker ID.
If for whatever reason a job terminates/crashes unexpectedly, this job will remain marked in progress with the original worker's ID.
Now, should the worker be restarted, it will have a new ID, leading to a situation where the previous ID is no longer among the live workers and the job remains forever "in progress" while no longer being processed, making it a zombie job.

To address this issue, the following should be done:

  • Workers could be started with an additional parameter specifying an ID to be used instead of the dynamically generated one.
  • Workers could/should register centrally with their ID in a dedicated database collection, also providing additional information that allows deconflicting their IDs after restarts, and possibly a heartbeat indicating when they last processed a job. This would make it possible to clean up zombie jobs (see the sketch below).
  • Improve the resilience of workers to avoid crashes, so that when handling issues gracefully, they have a chance to de-register from the collection.
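A minimal sketch of such a registry and the resulting zombie-job cleanup; collection and field names are assumptions, not MCRIT's actual schema:

from datetime import datetime, timezone

def register_heartbeat(database, worker_id):
    # upsert this worker into a dedicated registry collection with a fresh heartbeat
    database["workers"].update_one(
        {"worker_id": worker_id},
        {"$set": {"last_heartbeat": datetime.now(timezone.utc)}},
        upsert=True,
    )

def release_zombie_jobs(database, heartbeat_timeout):
    # release jobs locked by workers whose heartbeat is older than the allowed timeout (a timedelta)
    cutoff = datetime.now(timezone.utc) - heartbeat_timeout
    stale_ids = [worker["worker_id"] for worker in database["workers"].find({"last_heartbeat": {"$lt": cutoff}})]
    if stale_ids:
        database["jobs"].update_many(
            {"locked_by": {"$in": stale_ids}, "finished_at": None},
            {"$set": {"locked_by": None, "locked_at": None}},
        )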

Add universal tagging

For various purposes, it might be worthwhile to introduce and support universal tagging on the level of families, samples, functions, possibly also matching reports.

Workers consume a lot of ram on query

While doing a query on a sample, the memory usage of a single worker often jumps to tens of GB, sometimes even more than 60 GB. There seems to be no limit at all: the workers are greedy with memory, and if some file requires 200 GB of RAM, the worker will try to get it and will probably crash for lack of memory. As a result, in a setup with multiple workers it happens quite often that one worker hoards all the memory and starves the others so that they crash.

Here is a list of hashes for samples that consistently consume a lot of RAM on a worker (on our MCRIT instance with 20 million functions):
cea60afdae38004df948f1d2c6cb11d2d0a9ab52950c97142d0a417d5de2ff87
d92f6dd996a2f52e86f86d870ef30d8c80840fe36769cb825f3e30109078e339
bab77145165ebe5ab733487915841c23b29be7efec9a4f407a111c6aa79b00ce
97f1ea0a143f371ecf377912cbe4565d1f5c6d60ed63742ffa0b35b51a83afa2
94433566d1cb5a9962de6279c212c3ab6aa5f18dbff59fe489ec76806b09b15f
a5b38fa9a0031e8913e19ef95ac2bd21cb07052e0ef64abb8f5ef03cf11cb4d5
085b68fa717510f527f74025025b6a91de83c229dc1080c58f0f7b13e8a39904
043aac85af1bda77c259b56cd76e4750c8c10c382d7b6ec29be48ee6e40faa00
84ad84a1f730659ac2e227b71528daec5d59b361ace00554824e0fddb4b453cf
1c4bdd70338655f16cd6cf1eb596cd82a1caaf51722d0015726ec95e719f7a27
29bd1ffe07d8820c4d34a7869dbd96c8a4733c496b225b1caf31be2a7d4ff6df
f72bb91a4569fb9ba2aa40db2499f39bb7aba4d20a5cb5f6dd1e2a9a4ce9af98
9119213b617e203fbc44348eb91150a4db009d78a4123a5cbce6dc6421982a91
a614ed116edc46301a4b3995067d5028af14c8949f406165d702496630cb02ce
0c9edded5ff2ac86b06c1b9929117eab3be54ee45d44fcdb0b416664c7183cbf

I am not sure what the correct way to handle this is, but I think there should at least be a way to limit each worker to a configurable amount of memory. One possible approach is sketched below.
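One possible (Linux-only) approach, sketched as an assumption rather than a proposal for MCRIT itself: cap the worker process's address space with resource.setrlimit, so that an oversized query fails with a MemoryError inside that worker instead of starving or OOM-killing its neighbours.

import resource

def limit_worker_memory(max_bytes):
    # best-effort address-space cap for the current (worker) process on Linux;
    # allocations beyond the limit raise MemoryError instead of invoking the OOM killer
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

limit_worker_memory(16 * 1024 ** 3)  # e.g. cap a worker at 16 GB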

QoL: Add last index timestamp to server statistics

It would be nice if part of the statistics was a timestamp of the last time a file was indexed (successfully) in the system.

This is useful when MCRIT's match reports are stored somewhere other than MCRIT itself. In my case, a short summary of each report is stored on a different system. Since the data already exists on that other platform, this sort of information helps with the question of whether to re-query a file for matches: if the DB didn't change at all, there's no need to even fetch the cached result.

Exception on POST /query/function

I am unsure if this indicates an error with the IDA script, the mcrit service, or smda. I am using the ida_mcrit.py in IDA via File->Script File... while I have a binary open in IDA. Within the mcrit window, I click on the fingerprint icon (Convert this IDB to a SMDA...), then click on the Upload icon (Reparse and upload the SMDA report...). I see a Python exception within the IDA console Output window and I see an exception in the mcrit logs (using docker-mcrit and docker compose).

The following is the exception I see in the mcrit-server log:

2024-02-14 21:51:31 [FALCON] [ERROR] POST /query/function => Traceback (most recent call last):
  File "falcon/app.py", line 365, in falcon.app.App.__call__
  File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
    func(*args, **kwargs)
  File "/opt/mcrit/mcrit/server/QueryResource.py", line 85, in on_post_query_smda_function
    summary = self.index.getMatchesForSmdaFunction(smda_report, **parameters)
  File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 340, in getMatchesForSmdaFunction
    match_report = matcher.getMatchesForSmdaFunction(smda_report)
  File "/opt/mcrit/mcrit/matchers/MatcherInterface.py", line 53, in wrapper
    result = func(*args, **kwargs)
  File "/opt/mcrit/mcrit/matchers/MatcherQueryFUnction.py", line 35, in getMatchesForSmdaFunction
    function_entry = FunctionEntry(self._sample_entry, smda_function, -1, minhash)
  File "/opt/mcrit/mcrit/storage/FunctionEntry.py", line 59, in __init__
    self.xcfg = smda_function.toDict()
  File "/usr/local/lib/python3.8/dist-packages/smda/common/SmdaFunction.py", line 257, in toDict
    "nesting_depth": self.nesting_depth,
AttributeError: 'SmdaFunction' object has no attribute 'nesting_depth'

I took a quick look at smda/common/SmdaFunction.py and it looks like the fromDict() function will only assign nesting_depth if a version is given. I do not know the code well enough to be sure this is the cause of the issue, but if so, then you might just need to move the else statement on line 235 out one nesting layer so that it becomes the else of the "if version and re.match..." statement.

Limit export

On bigger instances, trying to export the entire database will likely lead to an out-of-memory situation.

To avoid this, the maximum possible export should be capped or otherwise limited so that the server does not crash.

Improve performance of job page

The jobs page is currently very slow when the queue is large.
One reason is certainly that the queue barely uses any indices on its fields, but also that rendering the page carries out multiple full collection scans and data retrievals, as JobResource does the start/limit and filtering itself instead of letting the database/MongoDB do this efficiently.

In order to improve performance for this,

  • the jobs page could be split up into multiple pages per Job type (Matching, Query, Blocks, Other)
  • start/limits could be performed efficiently on the DB (see the sketch after the list)
  • cleanup functionality for the queue should be exposed to the front-end (related to #38)
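A minimal sketch of pushing filtering and pagination down to MongoDB; collection and field names are assumptions, not MCRIT's actual queue schema:

def fetch_jobs_page(database, job_filter, start=0, limit=50):
    # let MongoDB apply filter, sort, skip and limit instead of scanning the whole
    # collection and slicing the results in JobResource
    cursor = (
        database["jobs"]
        .find(job_filter)
        .sort("created_at", -1)
        .skip(start)
        .limit(limit)
    )
    return list(cursor)

# e.g. second page of matching jobs: fetch_jobs_page(database, {"job_type": "matching"}, start=50, limit=50)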

Add TTL to query_* documents

In an automated system, the query_* collections grow very large very quickly. A couple of weeks or months after a query, the MCRIT DB has probably changed, so the sample will likely require a re-query anyway; saving old results is not that useful.

It would be nice if there was an option to turn on a TTL and simply remove such query-related data after some user-defined period (see the sketch below).
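MongoDB's built-in TTL indexes could cover this. A minimal sketch, assuming the query-related documents carry (or would be given) a created_at datetime field; the collection name is illustrative:

from datetime import datetime, timezone

def enable_query_ttl(database, ttl_days=30):
    # MongoDB removes documents automatically once created_at is older than the TTL
    database["query_results"].create_index(
        "created_at", expireAfterSeconds=ttl_days * 24 * 3600
    )

def store_query_result(database, sample_sha256, result):
    database["query_results"].insert_one(
        {"sample_sha256": sample_sha256, "result": result, "created_at": datetime.now(timezone.utc)}
    )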

Question: database migrations

Hey!

I've been wondering: if at some point in the future MCRIT's Mongo schema changes (e.g. how function matches are indexed, or core changes to the LSH implementation that result in schema changes), will re-indexing of the whole DB be required?

Queue cleaning is unused

Regarding:

time_threshold = datetime.now() - timedelta(seconds=self.cache_time)

Is this intentional? Since cache_time is 10 ** 9, it will effectively cache for over 30 years, and this clean function is essentially a dead code path.

MongoDB may throw an overflow error

Hey!

A recurring Mongo-related error pops up in the mcrit server logs from time to time when running an indexing process that submits files to mcrit.

I am not sure if this is a Mongo issue or an issue in mcrit, but it seems to be related to the ID generation done in mcrit. Can some field in the metadata saved to Mongo be bigger than the 8-byte integer limit in BSON? (A small diagnostic sketch follows the traceback.)

mcrit-server                 | 2023-09-03 04:57:12 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 188, in _dbInsert
mcrit-server                 |     insert_result = self._database[collection].insert_one(self._toBinary(data))
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 671, in insert_one
mcrit-server                 |     self._insert_one(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 611, in _insert_one
mcrit-server                 |     self.__database.client._retryable_write(acknowledged, _insert_command, session)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1568, in _retryable_write
mcrit-server                 |     return self._retry_with_session(retryable, func, s, None)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                 |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                 |     return func(self, *args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                 |     return func(session, conn, retryable)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 599, in _insert_command
mcrit-server                 |     result = conn.command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/helpers.py", line 315, in inner
mcrit-server                 |     return func(*args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 960, in command
mcrit-server                 |     self._raise_connection_failure(error)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 932, in command
mcrit-server                 |     return command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/network.py", line 150, in command
mcrit-server                 |     request_id, msg, size, max_doc_size = message._op_msg(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 765, in _op_msg
mcrit-server                 |     return _op_msg_uncompressed(flags, command, identifier, docs, opts)
mcrit-server                 | OverflowError: MongoDB can only handle up to 8-byte ints
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | During handling of the above exception, another exception occurred:
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | Traceback (most recent call last):
mcrit-server                 |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                 |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                 |     func(*args, **kwargs)
mcrit-server                 |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                 |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                 |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                 |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 585, in addSmdaReport
mcrit-server                 |     self._dbInsert("samples", sample_entry.toDict())
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 197, in _dbInsert
mcrit-server                 |     raise ValueError("Database insert failed.")
mcrit-server                 | ValueError: Database insert failed.
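A small diagnostic sketch for narrowing this down: walk a document before insertion and flag any integer outside BSON's signed 64-bit range.

INT64_MIN, INT64_MAX = -2 ** 63, 2 ** 63 - 1

def find_oversized_ints(value, path="$"):
    # return the paths of all integers that cannot be represented as BSON int64
    hits = []
    if isinstance(value, dict):
        for key, child in value.items():
            hits.extend(find_oversized_ints(child, f"{path}.{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            hits.extend(find_oversized_ints(child, f"{path}[{index}]"))
    elif isinstance(value, bool):
        pass  # bool is a subclass of int, skip it
    elif isinstance(value, int) and not (INT64_MIN <= value <= INT64_MAX):
        hits.append(path)
    return hits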

Cross compare broken

When I use the cross compare feature, I get the following error:

mcritweb | [2024-05-30 07:49:34,433] ERROR in app: Exception on /data/jobs/66582cf4d8e38bfb930fc16b [GET]
mcritweb | Traceback (most recent call last):
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2529, in wsgi_app
mcritweb |     response = self.full_dispatch_request()
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1825, in full_dispatch_request
mcritweb |     rv = self.handle_user_exception(e)
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1823, in full_dispatch_request
mcritweb |     rv = self.dispatch_request()
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1799, in dispatch_request
mcritweb |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
mcritweb |   File "/opt/mcritweb/mcritweb/views/utility.py", line 40, in wrapped_view
mcritweb |     return view(**kwargs)
mcritweb |   File "/opt/mcritweb/mcritweb/views/authentication.py", line 204, in wrapped_view
mcritweb |     return view(**kwargs)
mcritweb |   File "/opt/mcritweb/mcritweb/views/data.py", line 754, in job_by_id
mcritweb |     return render_template('job_overview.html', job_info=job_info, auto_refresh=auto_refresh, child_jobs=child_jobs)
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/templating.py", line 147, in render_template
mcritweb |     return _render(app, template, context)
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/flask/templating.py", line 130, in _render
mcritweb |     rv = template.render(context)
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 1304, in render
mcritweb |     self.environment.handle_exception()
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 939, in handle_exception
mcritweb |     raise rewrite_traceback_stack(source=source)
mcritweb |   File "/opt/mcritweb/mcritweb/templates/job_overview.html", line 5, in top-level template code
mcritweb |     {% extends 'base.html' %}
mcritweb |   File "/opt/mcritweb/mcritweb/templates/base.html", line 146, in top-level template code
mcritweb |     {% block content %}{% endblock %}
mcritweb |   File "/opt/mcritweb/mcritweb/templates/job_overview.html", line 24, in block 'content'
mcritweb |     {{ job_table(child_jobs) }}
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 782, in _invoke
mcritweb |     rv = self._func(*arguments)
mcritweb |   File "/opt/mcritweb/mcritweb/templates/table/table.html", line 35, in template
mcritweb |     {{ _table_base(jobs, job_header, job_row, table_id=table_id, **kwargs) }}
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 782, in _invoke
mcritweb |     rv = self._func(*arguments)
mcritweb |   File "/opt/mcritweb/mcritweb/templates/table/table.html", line 13, in template
mcritweb |     {{ row_macro(row_data, parent=table_id, **kwargs) }}
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 782, in _invoke
mcritweb |     rv = self._func(*arguments)
mcritweb |   File "/opt/mcritweb/mcritweb/templates/table/job_row.html", line 191, in template
mcritweb |     <td job_id="{{ job.job_id }}" class="job-cell" valign="middle">{{ job_description(job, kwargs['families_by_id'], kwargs['samples_by_id']) }}</td>
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 782, in _invoke
mcritweb |     rv = self._func(*arguments)
mcritweb |   File "/opt/mcritweb/mcritweb/templates/table/job_row.html", line 17, in template
mcritweb |     {% set sample_entry = samples_by_id.get(job.sample_id) %}
mcritweb |   File "/usr/local/lib/python3.8/dist-packages/jinja2/environment.py", line 487, in getattr
mcritweb |     return getattr(obj, attribute)
mcritweb | jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'samples_by_id'

And I get an error 500.

Question: PicHash and MinHash recalculation results

Hey!

I'd ask this in private, but I assume this question applies to more people so it can help others.
We recently did the recalculation actions needed for the new upgraded SMDA, and these are the results:

[screenshots of the PicHash and PicBlockHash recalculation summaries]

In particular, we noticed that only about half of the updatable functions were updated in the PicHash recalculation, and similarly for PicBlockHashes only a small fraction was actually updated.

Is this behavior normal? How can we make sure these were left un-updated for legitimate reasons and not because of an error?

Thank you :)

Consider adding actors field to families

Hey :)!

It would be cool if the family summary included the actors associated with it, somewhat similar to the way Malpedia presents this info:

[screenshot of Malpedia's actor listing for a family]

It does change the DB's overall schema, I guess, but since it's only an addition of data on top of the existing JSONs, I think it's okay. What do you think?

Improve performance for displaying MatchingResult pages

With large data sets in MCRIT, results often increase drastically in size, yielding JSON files of up to several hundred MB.
Oftentimes, the aggregated result is of primary interest to an analyst (e.g. for family identification), while the detailed function matches are only relevant for deeper analysis/inspection.

To improve performance, a "thin" result could be delivered to the front-end that contains only the sample matches and their aggregated results, which would massively reduce the footprint of the result file (see the sketch below).
Also, investigate if specialized marshalling libraries (#44) can improve performance.
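A minimal sketch of trimming a result down to its aggregated part; the key names are illustrative, not MCRIT's actual MatchingResult schema:

def build_thin_result(full_result: dict) -> dict:
    # keep only the sample-level aggregation; the per-function match details are what
    # blow the JSON up to hundreds of MB and are only needed for deeper inspection
    return {
        "info": full_result.get("info", {}),
        "sample_matches": full_result.get("sample_matches", []),
        "num_function_matches": len(full_result.get("function_matches", [])),
    }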
