Comments (12)
These malware samples are indeed huge, tens of megabytes each. Our current code can't handle creating a JSON blob that large and is failing.
from threatexchange.
Just as an idea for now (temporarily):
How about automatically removing the sample field from the response (on the server side) if the sample size is over XX MB?
I mean just until we find a way to deliver bigger file size samples.
This would prevent HTTP 500 errors on the client side.
from threatexchange.
We could remove the field, or replace it with an error message. For example, for a request with
fields=status,sample
we could return either:
{
  "status": "MALICIOUS",
  "sample": "This sample is too large to return",
  "id": "1234456789"
}
or
{
  "status": "MALICIOUS",
  "id": "1234456789"
}
Personally, I would favor the former. My concern is that people will assume the field is present in the data being returned. Silently omitting it could lead to problems. What do you think?
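To make the two options concrete, here is a minimal sketch of how a server could produce either response. It is not the ThreatExchange implementation: the function `render_fields`, the `use_placeholder` switch, and the in-memory `record` are all hypothetical names chosen for illustration.

```python
TOO_LARGE_MESSAGE = "This sample is too large to return"


def render_fields(record, fields, max_sample_bytes, use_placeholder=True):
    """Build a response dict for the requested fields (hypothetical helper).

    An oversized sample is either replaced by a message or left out,
    depending on use_placeholder.
    """
    response = {"id": record["id"]}
    for name in fields:
        if name == "sample" and len(record["sample"]) > max_sample_bytes:
            if use_placeholder:
                response["sample"] = TOO_LARGE_MESSAGE
            # With use_placeholder=False the key is simply not added.
            continue
        response[name] = record[name]
    return response


# Toy record standing in for a malware object; the threshold is tiny on purpose.
record = {"id": "1234456789", "status": "MALICIOUS", "sample": "A" * 100}
with_message = render_fields(record, ["status", "sample"], max_sample_bytes=10)
without_field = render_fields(
    record, ["status", "sample"], max_sample_bytes=10, use_placeholder=False
)
```

With the first option, a client that reads the sample field without checking gets a string that is not a sample. With the second, the same client hits a missing key. Either way the client needs a check of its own.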
from threatexchange.
I would vote for not returning the sample field at all if a sample's size is above a certain threshold. If there's no way to download it, an error message is about as useful as the sample just not being available from a programmatic perspective.
Currently there's no "size" field returned by malware objects, but if there was one it would at least provide some context as to how large the sample is. It would even allow you to note in the documentation that samples above a specific size will not be available for download (at least until there's a way to do so?).
from threatexchange.
We can certainly add a field for the sample size. Should this be the size of the ZIP file which you'll download, or the actual file size? The former will be used for the cutoff of what can be downloaded, but I suspect the latter would be more useful during analysis. Thoughts?
from threatexchange.
Hmm, that's a really good question. Both of those would be useful, the latter probably more so from the perspective of an analyst or someone looking to pull down metadata about the sample. I'd probably vote for the latter myself and maybe just document the ZIP size threshold so people are aware. There's no reliable way to relate uncompressed size to compressed size, given the different data types and content.
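A quick way to see why the two sizes can't be related: the sketch below compresses two inputs of identical length with Python's `zlib` module. This is only a stand-in, since the thread does not say how the server builds its ZIP files, and the inputs are synthetic.

```python
import os
import zlib

size = 1_000_000

# Same uncompressed size, very different content.
repetitive = b"A" * size        # highly redundant data
random_like = os.urandom(size)  # behaves like packed or encrypted data

compressed_repetitive = len(zlib.compress(repetitive))
compressed_random = len(zlib.compress(random_like))

print(f"repetitive: {size} -> {compressed_repetitive} bytes")
print(f"random:     {size} -> {compressed_random} bytes")
```

The repetitive input shrinks to a tiny fraction of its size, while the random input barely changes. Packed or encrypted malware tends to behave like the second case.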
from threatexchange.
How about a compromise? Fields for sample_size and sample_size_compressed? We'd put in the documentation "If the compressed sample size is larger than 25MB, the sample field will be omitted."
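The proposed rule could look roughly like the following. This is a sketch under stated assumptions, not the deployed code: `describe_sample` is a hypothetical helper, `zlib` stands in for the real ZIP step, base64 is an assumed transport encoding, and "25MB" is read here as 25 * 1024 * 1024 bytes, which the thread does not specify.

```python
import base64
import os
import zlib

# Assumption: "25MB" interpreted as 25 MiB.
MAX_COMPRESSED_BYTES = 25 * 1024 * 1024


def describe_sample(raw, limit=MAX_COMPRESSED_BYTES):
    """Return both size fields, plus the sample only if it fits the limit."""
    compressed = zlib.compress(raw)
    result = {
        "sample_size": len(raw),
        "sample_size_compressed": len(compressed),
    }
    if len(compressed) <= limit:
        # Assumed encoding so the bytes can travel inside JSON.
        result["sample"] = base64.b64encode(compressed).decode("ascii")
    return result


# A small limit keeps the demonstration cheap.
small = describe_sample(b"hello" * 10, limit=1000)
large = describe_sample(os.urandom(2000), limit=1000)
```

Both size fields are always present, so a client can tell an omitted sample apart from a request that never asked for one.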
from threatexchange.
That would work for me!
from threatexchange.
Sounds like a good solution and would work for me too. :-)
from threatexchange.
The compromise solution is now live! Using the example from the top of the issue, I ran a query just now for:
/1068651733168127?fields=md5,sample,sample_size,sample_size_compressed
and got something like the following (actual values obfuscated):
{
  "md5": "3269e9fde81f7ea4e538ba595f77f52f",
  "sample_size": 71777777,
  "sample_size_compressed": 71755555,
  "id": "1068651733168127"
}
If this works for you, please close out the issue.
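On the client side, a response like the one above can be handled without assuming the sample key exists. The sketch below parses the obfuscated example values from this comment; the function `summarize` and its messages are hypothetical and not part of pytx.

```python
import json

# The example response from this comment (values obfuscated in the original).
response_text = """
{
  "md5": "3269e9fde81f7ea4e538ba595f77f52f",
  "sample_size": 71777777,
  "sample_size_compressed": 71755555,
  "id": "1068651733168127"
}
"""


def summarize(malware):
    """Describe whether the sample was returned, using the size fields if not."""
    if "sample" in malware:
        return "sample included"
    compressed = malware.get("sample_size_compressed")
    if compressed is None:
        return "sample not returned; size unknown"
    return f"sample omitted; compressed size is {compressed} bytes"


malware = json.loads(response_text)
summary = summarize(malware)
print(summary)
```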
from threatexchange.
Nice! I'll add that to pytx :)
On Wednesday, December 30, 2015, Jesse Kornblum wrote:
The compromise solution is now live! Using the example from the top of the issue, I ran a query just now for:
/1068651733168127?fields=md5,sample,sample_size,sample_size_compressed
and got something like the following (actual values obfuscated):
{
  "md5": "3269e9fde81f7ea4e538ba595f77f52f",
  "sample_size": 71777777,
  "sample_size_compressed": 71755555,
  "id": "1068651733168127"
}
If this works for you, please close out the issue.
from threatexchange.
Nice! Thx a lot!
from threatexchange.