cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers

License: MIT License

Python 100.00%

puremagic's Issues

Faster than a libmagic wrapper?

The Readme claims

Advantages over using a wrapper for 'file' or 'libmagic':

Faster

Do you have any actual evidence for that (a reproducible benchmark or similar)?
Typically, pure-Python re-implementations are slower than C library bindings, unless the pure-Python package uses significantly more efficient algorithms or the binding involves a lot of object transfer or FFI overhead.
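
One way to make such a claim reproducible is a small timing harness. The sketch below only shows how a benchmark could be structured and is not a measurement: `benchmark` and `toy_detector` are hypothetical names, and the stand-in detector only checks the PNG signature so the script runs without either library installed. In a real comparison the callables would wrap puremagic and the libmagic binding.

```python
import timeit


def benchmark(detectors, sample, repeat=5, number=1000):
    """Return the best per-call time in seconds for each named detector."""
    results = {}
    for name, func in detectors.items():
        # min() over several repeats reduces noise from other processes
        best = min(timeit.repeat(lambda: func(sample), repeat=repeat, number=number))
        results[name] = best / number
    return results


def toy_detector(data):
    # Stand-in detector (assumption): recognises only the PNG signature
    return ".png" if data.startswith(b"\x89PNG\r\n\x1a\n") else ""


if __name__ == "__main__":
    sample = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
    for name, seconds in benchmark({"toy": toy_detector}, sample).items():
        print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```

Import time and data loading would need to be timed separately, since a one-off startup cost matters differently from per-file cost.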

Version 2.0 Goals

Now that puremagic is picking up some outside traction and is used in places like MongoDB, I want to lay out clear future plans.

Stay backwards compatible. Anything changed or added has to be behind a feature flag.
#71 Faster. (Is a json file best way to store data? switch to tree lookup instead of loop iteration?)
Higher accuracy. Some ideas in #12
Even better test coverage. All platforms, all current python versions, both success and failure cases. (Started in #67)
Documentation improvements
Better sub variation names #69

Please keep comments on this page limited to overall goals, any specific conversations about any goal should be their own issue and will be updated here.

Webp image mime type is empty

Hello,

I encountered a discrepancy when running a test with the following code:

import puremagic

print(puremagic.from_file("test/resources/images/test.webp", mime=True)) # prints "image/webp"
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))  # prints ""

In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions.

I am uncertain whether this difference in behavior is intentional.

Thank you!
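
For reference, a WebP file is a RIFF container: the bytes `RIFF`, a four-byte size, then the form type `WEBP` at offset 8. The sketch below checks that layout directly on a byte string. `looks_like_webp` is a hypothetical helper for illustration and is not part of puremagic.

```python
def looks_like_webp(data: bytes) -> bool:
    """Return True if data starts with a RIFF header whose form type is WEBP."""
    # Bytes 0-3: b"RIFF", bytes 4-7: little-endian chunk size, bytes 8-11: b"WEBP"
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"
```

Because the check only looks at the first 12 bytes, it behaves the same for a file on disk and for an in-memory buffer.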

mimetype from stream

Are you interested in a pull-request with something like this?

import os
from io import BytesIO

import puremagic


def mimetype_from_stream(stream, filename=None):
    """
    Reads in stream, attempts to identify content based
    off magic number and will return the mimetype.
    If filename is provided it will be used in the computation.
    """
    assert isinstance(stream, BytesIO)
    head, foot = _stream_details(stream)
    ext = puremagic.ext_from_filename(filename) if filename else None
    return puremagic.main._magic(head, foot, True, ext)


def _stream_details(stream):
    """ Grab the start and end of the stream"""
    assert isinstance(stream, BytesIO)
    max_head, max_foot = puremagic.main._max_lengths()
    head = stream.read(max_head)
    try:
        stream.seek(-max_foot, os.SEEK_END)
    except IOError:
        stream.seek(0)
    foot = stream.read()
    return head, foot

I will mimic your from_string function more in the pull-request. In my case I only need the mime-type, not the extension.

JPEG XS Two mime types

I was having a look around the various JPEG X* flavours and came across https://en.wikipedia.org/wiki/JPEG_XS which is both a still image and video codec.

Just to be awkward they use the same fingerprint 0xFF10 FF50 for both image and video but then give it two mime types image/jxsc and video/jxsv.

What would be the best approach to handle this? Two entries in the .json one for each type? I'm not sure of other formats that would do this but I reckon they are out there.

Price-matching other repos for more file support

Take a look at these Python repos to see what puremagic can add to cover more formats.

https://github.com/floyernick/fleep-py/blob/master/fleep/data.json (193 stars)
https://github.com/h2non/filetype.py/tree/master/filetype/types (134 stars)
https://github.com/openpreserve/fido/blob/master/fido/conf/format_extensions.xml (84 stars)
https://github.com/omriher/Whatype/blob/master/whatype/magics.csv (12 stars)
https://github.com/schlerp/pyfsig/blob/master/src/pyfsig/file_signatures.py (9 stars)
https://github.com/7h3rAm/cigma/blob/master/cigma/magicbytes.json (1 star)

If we include non-Python repos:

same (mp3) file, different name ... different output: mp3 versus koz

same (mp3) file, different name ... different output

Make a copy:
sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
Verify it's there with same size:

sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3

... and same contents:

sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/testblabla.bla

... but puremagic says the first one is mp3 and the second is ... koz?

sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz

Is this wanted behaviour, or a bug?

PS: Linux' file reports it correctly as mp3:

sander@brixit:~/git/puremagic$ file  test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3:       Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo

Multi-part checks with negative offset for second match

I'm going through various files for some formats. In some cases I could increase the confidence, as the file has a solid position for both a header and a footer. Would it be possible for the multi-part check to match from a negative byte offset?

This would be handy: in one case I have a file that clashes with another match at 0.8, and adding the footer as an additional match should push past this to give a solid first match. I believe this would also help with things like the .svg matching confidence in #37.

Example entry for multi-part-headers:

"4352454D" : [
    ["444f4e4500000000", -8, ".ctm", "", "CreamTracker module"]
]
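
The negative-offset idea can be sketched as follows. This is an illustration of the proposal, not puremagic's implementation; `matches_at` is a hypothetical helper, and the byte values are the ones from the example entry above.

```python
def matches_at(data: bytes, signature: bytes, offset: int) -> bool:
    """Check signature at offset; a negative offset counts back from the end."""
    start = offset if offset >= 0 else len(data) + offset
    if start < 0:
        return False  # the data is shorter than the footer position
    return data[start:start + len(signature)] == signature


HEADER = bytes.fromhex("4352454D")          # matched at offset 0
FOOTER = bytes.fromhex("444f4e4500000000")  # matched 8 bytes before the end


def is_ctm(data: bytes) -> bool:
    # Both parts must match for the higher-confidence result
    return matches_at(data, HEADER, 0) and matches_at(data, FOOTER, -8)
```

The explicit `start < 0` guard matters because a plain negative slice on short data would silently compare against the wrong bytes.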

Create setup.py and integrate with PyPi

Speed Improvements

Talk about ideas to make PureMagic faster!

Initial thoughts:

How much does JSON slow us down? (Putting the data directly in code looks to be large speedup for repeated initialization, possibly 30%)
How much does iteration vs graph slow us down?
Are namedtuples the fastest way to store the data internally?

Optimizations in progress:

Remove max header length calculation that requires iterating through all data on startup. Provide a global integer. (~0.4% speedup)
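
To make the "iteration vs graph" question concrete, here is a minimal sketch of one alternative to a full scan: grouping offset-0 signatures by their first byte so a lookup only compares against a small bucket. The signature list is illustrative, `build_index` and `lookup` are hypothetical names, and no speedup is claimed without measuring.

```python
from collections import defaultdict

# Illustrative signatures only: (hex bytes, offset, extension)
SIGNATURES = [
    ("89504e470d0a1a0a", 0, ".png"),
    ("474946383961", 0, ".gif"),
    ("28b52ffd", 0, ".zst"),
]


def build_index(signatures):
    """Group offset-0 signatures by their first byte."""
    index = defaultdict(list)
    for hex_bytes, offset, ext in signatures:
        raw = bytes.fromhex(hex_bytes)
        if offset == 0:
            index[raw[0]].append((raw, ext))
    return index


def lookup(index, data):
    """Return extensions whose signature matches the start of data."""
    if not data:
        return []
    return [ext for raw, ext in index.get(data[0], []) if data.startswith(raw)]
```

Signatures at non-zero offsets would still need a separate pass, which is one reason a deeper tree may or may not pay off.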

New archivers support: Brotli, LZ4 and ZStd

A few new compression algorithms and formats have been published recently and are likely to become quite popular:
Brotli from Google: it is supported by all browsers and was standardized in an RFC, but it doesn't have a magic signature.
LZ4, which is ultra fast and lightweight.
Zstandard from Facebook, which is already widely supported by a lot of systems, including the Linux kernel. Its MIME type is application/zstd and its magic is \x28\xB5\x2F\xFD.
Could you add support for them? If yes, please also add their tar versions (tzst, tlz4, tbr).
The more programs support them, the easier it will be to migrate to these compressors.
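
A minimal sketch of what signature checks for two of these formats could look like. The Zstandard magic is the one quoted above; the LZ4 value is the LZ4 frame magic number 0x184D2204 stored little-endian, which is not stated in the request and is added here as background. `compression_extension` is a hypothetical helper. Brotli is absent because, as noted, it has no magic signature.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"       # from the request above
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # 0x184D2204, little-endian (added background)


def compression_extension(data: bytes) -> str:
    """Return a likely extension for Zstandard or LZ4 frame data, or ''."""
    if data.startswith(ZSTD_MAGIC):
        return ".zst"
    if data.startswith(LZ4_FRAME_MAGIC):
        return ".lz4"
    return ""
```

Telling a plain .zst from a .tzst would additionally require decompressing enough data to look for a tar header, which a magic-number check alone cannot do.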

py.typed is not included in wheel

In PR #54, the file py.typed was added; however, it was not added to the list of package_data files in setup.py, meaning it is not present in the wheel, as shown by the following command:

$ unzip -l ~/Downloads/puremagic-1.25-py3-none-any.whl
Archive:  /Users/user/Downloads/puremagic-1.25-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      102  06-19-2024 03:37   puremagic/__init__.py
       90  06-19-2024 03:37   puremagic/__main__.py
   143189  06-19-2024 03:37   puremagic/magic_data.json
    15622  06-19-2024 03:37   puremagic/main.py
     1086  06-19-2024 03:37   puremagic-1.25.dist-info/LICENSE
     5834  06-19-2024 03:37   puremagic-1.25.dist-info/METADATA
       92  06-19-2024 03:37   puremagic-1.25.dist-info/WHEEL
       10  06-19-2024 03:37   puremagic-1.25.dist-info/top_level.txt
      703  06-19-2024 03:37   puremagic-1.25.dist-info/RECORD
---------                     -------
   166728                     9 files

See this Github query for examples where py.typed is correctly added to the setup.py file.
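
The usual setuptools fix is to list py.typed in package_data next to the other non-Python files; the current value in setup.py is not shown here, so that part is left as prose. What can be sketched safely is the check itself: a wheel is a zip archive, so the listing above can be automated. `wheel_contains` is a hypothetical helper name.

```python
import zipfile


def wheel_contains(wheel_path, member="puremagic/py.typed"):
    """Return True if the wheel (a zip archive) lists the given member."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return member in wheel.namelist()
```

A check like this could run in CI after building the wheel so the marker file does not silently go missing again.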

Adding JPEG-XL Support

From https://github.com/ImageMagick/jpeg-xl/blob/main/doc/format_overview.md

JPEG-XL files start with either:
0xFF0A for a raw JPEG-XL codestream
0x0000000C 4A584C20 0D0A870A for an ISOBMFF-based container

Checking the files at https://jpegxl.info/art/2021-04_jon.html, I can confirm that 0xFF0A is used for Raw Codestreams
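
The two signatures above translate into a check like the following. This is a sketch for illustration, with `jpeg_xl_kind` as a hypothetical name; the longer container signature is tested first so it is reported separately from the codestream.

```python
JXL_CODESTREAM = bytes.fromhex("ff0a")
JXL_CONTAINER = bytes.fromhex("0000000c4a584c200d0a870a")


def jpeg_xl_kind(data: bytes) -> str:
    """Return 'container', 'codestream', or '' for non-JPEG-XL data."""
    if data.startswith(JXL_CONTAINER):
        return "container"
    if data.startswith(JXL_CODESTREAM):
        return "codestream"
    return ""
```

A two-byte signature such as 0xFF0A is short, so a match on it alone probably deserves a lower confidence than the 12-byte container signature.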

For Python 3.13: A drop-in replacement for `imghdr.what()`

Given the discussion in #67 about imghdr being removed from the Python Standard Library, it might be quite helpful to have a drop-in replacement for imghdr.what(). It would provide a smooth transition to Py3.13 if developers could confidently replace all instances of imghdr.what() with puremagic.what() -- same args, same results.

missing mime type for webp

For example, this is a webp image, download it to test.webp:

This is the difference between puremagic and magic:

In [22]: import puremagic

In [23]: puremagic.from_file("test.webp", mime=True)
Out[23]: ''

In [24]: import magic

In [25]: magic.from_file("test.webp", mime=True)
Out[25]: 'image/webp'

It seems that the MIME type is missing, but if I remove mime=True, I get the .webp extension:

In [26]: puremagic.from_file("test.webp")
Out[26]: '.webp'

Weird issue with non-compliant AIFF files

I was just starting a PR based on #85 and came across a weird issue. It appears that certain malformed AIFF files cannot be read under certain conditions. If we use this example in Python:

import puremagic
filename = r"r:\aiff\Fnonull.aif"  # raw string, so "\a" is not read as an escape
puremagic.magic_file(filename)

We get the following:

[PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='.aif', mime_type='audio/x-aiff', name='Audio Interchange File', confidence=0.9), PureMagicWithConfidence(byte_match=b'FORM\x00\x00\x00\\AIFC', offset=8, extension='.aifc', mime_type='audio/x-aiff', name='Audio Interchange File Format (Compressed)', confidence=0.8), PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='.aiff', mime_type='audio/aiff', name='Audio Interchange File', confidence=0.4), PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='.djv', mime_type='image/vnd.djvu', name='DjVu image', confidence=0.4), PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='.djv', mime_type='image/vnd.djvu+multipage', name='DjVu document', confidence=0.4), PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='', mime_type='application/x-iff', name='IFF file', confidence=0.4), PureMagicWithConfidence(byte_match=b'FORM', offset=0, extension='.sc2', mime_type='', name='SimCity 2000 Map File', confidence=0.4), PureMagicWithConfidence(byte_match=b'AIFC', offset=8, extension='.aiffc', mime_type='audio/x-aifc', name='AIFC audio', confidence=0.4)]

However, if we do the following in a .py file

import puremagic
with open(r"r:\aiff\Fnonull.aif", "rb") as file:
    print(puremagic.magic_stream(file))

We get:

Traceback (most recent call last):
  File "R:\WinUAE\pm.py", line 3, in <module>
    print(puremagic.magic_stream(file))
  File "C:\Users\Andy\AppData\Local\Programs\Python\Python310\lib\site-packages\puremagic\main.py", line 351, in magic_stream
    head, foot = _stream_details(stream)
  File "C:\Users\Andy\AppData\Local\Programs\Python\Python310\lib\site-packages\puremagic\main.py", line 229, in _stream_details
    stream.seek(-max_foot, os.SEEK_END)
OSError: [Errno 22] Invalid argument

From reading around it appears to have something to do with malformed files and seek errors, but I can't quite see how Puremagic can read it one way and not the other.

Any thoughts on this?

Test files.

aiff.zip
The files causing trouble are the ones labelled as Perverse Files from this page Samples

Is it possible to use filehandles / bytestream?

I would love to do something like this:

import puremagic

with open(file_path, "rb") as fh:
    ext = puremagic.from_file_handler(fh)

Especially the bytestream support might be nice in case the file is not / cannot / should not be stored on the disk (e.g. AWS Lambda)

Find something as quick yet safer than eval()

Instead of opening a separate data file and calling eval() on it, find an equally fast and equally easy way to store that information that is safer.

Integrate advanced file scanning techniques

Better identify common files, for example by opening .docx/.pptx/.xlsx files and inspecting the XML inside to figure out exactly which format they are.
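
A minimal sketch of that idea, assuming the conventional main part names used by Office Open XML packages. `ooxml_extension` is a hypothetical helper; a stricter version would read [Content_Types].xml rather than rely on part names.

```python
import zipfile

# Conventional main document part for each Office Open XML flavour
OOXML_PARTS = {
    "word/document.xml": ".docx",
    "ppt/presentation.xml": ".pptx",
    "xl/workbook.xml": ".xlsx",
}


def ooxml_extension(path_or_file) -> str:
    """Return the likely extension of an Office Open XML file, or ''."""
    if not zipfile.is_zipfile(path_or_file):
        return ""
    with zipfile.ZipFile(path_or_file) as archive:
        names = set(archive.namelist())
    for part, ext in OOXML_PARTS.items():
        if part in names:
            return ext
    return ""
```

Only the archive's file listing is read, so the cost stays small even for large documents.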

Confidence/Selection logic question

Hi,

I just found PureMagic and am trying to use it to identify whether a file my script receives is ELF or not. I am using a test ELF binary and, instead of getting back "ELF executable" as I would expect, I am getting ".AppImage".

I ran readelf against the file, and here are the results:

$ readelf -h elf_hello/chello
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x80482f0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1904 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         27
  Section header string table index: 26

$ python3 -m puremagic elf_hello/chello
'elf_hello/chello' : .AppImage

I also dug into the magic_data.json file and found out that those 2 file types share a lot of the same bytes:

"454c46", 1, ".AppImage"
"7f454c46", 0, "", "", "ELF executable"

After doing some more digging, it looks like puremagic finds both options but always returns the AppImage entry.

These are the 2 results from the confidence function:

PureMagicWithConfidence(byte_match=b'ELF', offset=1, extension='.AppImage', mime_type='application/x-iso9660-appimage', name='AppImage application bundle', confidence=0.9)
PureMagicWithConfidence(byte_match=b'\x7fELF', offset=0, extension='', mime_type='', name='ELF executable', confidence=0.9)

I admit that I may be totally blind and not seeing where the logic decides which one to choose. I'd get it if it looked at the file extension, saw that there wasn't one, and chose the ELF executable over the AppImage, but it looks like a toss-up when the confidence level is the same...

Thanks in advance for any insight, suggestions, etc :)
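
One possible tie-break, sketched here purely as an illustration and not as puremagic's actual selection logic: when confidence is equal, prefer the longer byte match, then the earlier offset. `Match` and `rank` are hypothetical names that mirror a subset of the fields printed above.

```python
from collections import namedtuple

Match = namedtuple("Match", "byte_match offset extension name confidence")


def rank(matches):
    """Sort by confidence, then prefer longer byte matches, then earlier offsets."""
    return sorted(
        matches,
        key=lambda m: (-m.confidence, -len(m.byte_match), m.offset),
    )


candidates = [
    Match(b"ELF", 1, ".AppImage", "AppImage application bundle", 0.9),
    Match(b"\x7fELF", 0, "", "ELF executable", 0.9),
]
best = rank(candidates)[0]
```

With the two results from this issue, the four-byte match at offset 0 sorts ahead of the three-byte match at offset 1.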

Some common filetypes are not detected

puremagic seems to fail to detect some very common file types, such as text files (.py, .txt, .md).

$ file changelog.txt
changelog.txt: ASCII English text

$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

SVG images not recognized

SVG images return mime_type='video/x-ms-asf'

Add mime type lookup for exts

EncodingWarning when PYTHONWARNDEFAULTENCODING

 @ env PYTHONWARNDEFAULTENCODING=1 pip-run puremagic -- -c 'import puremagic'
/var/folders/f2/2plv6q2n7l932m2x004jlw340000gn/T/pip-run-pb4us6gg/puremagic/main.py:76: EncodingWarning: 'encoding' argument not specified
  with open(filename) as f:
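
The warning comes from calling open() in text mode without an encoding argument. A minimal sketch of the usual fix follows, assuming the file being opened is the UTF-8 JSON data file; `load_magic_data` is a hypothetical name, not the function in main.py.

```python
import json


def load_magic_data(filename):
    """Load a JSON data file with an explicit encoding."""
    # Passing encoding= avoids the EncodingWarning emitted under
    # PYTHONWARNDEFAULTENCODING=1 when open() falls back to the locale default
    with open(filename, encoding="utf-8") as f:
        return json.load(f)
```

An explicit encoding also makes the result independent of the platform's locale settings.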

Add documentation

stream does not work properly on opened small files

using the following image (icon-32.png)

the following code fails

import puremagic

with open(r"icon-32.png", "rb") as file:
    print(puremagic.magic_stream(file))

here's the error

Traceback (most recent call last):
  File "/home/…/./playing-with-puremagic", line 4, in <module>
    print(puremagic.magic_stream(file))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/…/.venv/lib/python3.12/site-packages/puremagic/main.py", line 351, in magic_stream
    head, foot = _stream_details(stream)
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/…/.venv/lib/python3.12/site-packages/puremagic/main.py", line 229, in _stream_details
    stream.seek(-max_foot, os.SEEK_END)
OSError: [Errno 22] Invalid argument

not sure why, but maybe it's because it's trying to seek() to a position that's bigger than its size?
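
That guess matches how regular files behave: seeking to a position before the start of the file raises OSError with errno 22. A minimal sketch of a footer read that cannot go negative is shown below. `read_head_and_foot` is a hypothetical stand-in for the `_stream_details` function in the traceback, and `max_head` and `max_foot` are passed in as assumptions rather than taken from the library.

```python
import io
import os


def read_head_and_foot(stream, max_head, max_foot):
    """Read up to max_head bytes from the start and max_foot bytes from the end."""
    stream.seek(0)
    head = stream.read(max_head)
    # Find the size first, then clamp so we never seek before byte 0
    size = stream.seek(0, os.SEEK_END)
    stream.seek(max(0, size - max_foot))
    foot = stream.read()
    return head, foot


small = io.BytesIO(b"abc")
head, foot = read_head_and_foot(small, 10, 10)
```

For a stream shorter than `max_foot`, the footer is simply the whole content, so head and foot overlap instead of raising.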

Remove unsupported Python stuff

Since puremagic only supports Python 3.5 and up, I think you can remove:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L25

And:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L35

RuntimeWarning when running package

I am getting this warning in Python 3.6 and 3.7 beta when running `python3 -m puremagic`:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))

Missing mime types for some headers

Some of the headers in magic_data lack a MIME type, for example:
https://github.com/cdgriffith/puremagic/blob/master/puremagic/magic_data.json#L69
This causes from_string(str, mime=True) to return an empty string instead of the MIME type in some cases, e.g. for a JPEG image.

Is there a specific reason for this, or is the dataset just incomplete?

For Python 3.13: A drop-in replacement for `sndhdr.what()` and `sndhdr.whathdr()`

Like #72 but for sndhdr instead of imghdr. Given that puremagic.what() is now mentioned in What's new in Python 3.13 should we do something for the sound file formats in https://docs.python.org/3/library/sndhdr.html that will also be removed in Python 3.13? Does puremagic have support for the twelve sound file formats that sndhdr supports?

It might be quite helpful to have a drop-in replacement for sndhdr.what(). It would provide a smooth transition to Py3.13 if developers could confidently replace all instances of sndhdr.what() with puremagic.what() -- same args, same results.

@NebularNerd

imghdr matches in PureMagic?

I've re-titled this as it got a bit off-track from the original purpose of the issue.

@cclauss wanted to know if PureMagic could provide a 1:1 replacement for imghdr. To avoid crosstalk, I've moved the original content to #69.

Automatically build documentation with readthedocs.org

How to handle two sets of bytes for matching improvements?

Hi there,

I'm looking for a python package to help identify weird and wonderful files inside various scripts. I had seen fleep but that appears to be dead. Puremagic looks to offer the same functionality for what I want it for.

One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love:

puremagic/puremagic/magic_data.json

Line 1084 in ff042db

["464f524d", 0, "", "application/x-iff", "IFF file"],

But in Amiga land that .iff FORM header is used for many things (see Wikipedia: List_of_file_signatures).

Is there a way to help improve mapping and confidence by adding additional matching strings such as ILBM ACBM etc..? I'm happy to help with a PR if it can be done.
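
For illustration, an IFF file starts with `FORM`, a four-byte big-endian length, and then a four-character type ID such as ILBM or ACBM, so the second match would sit at offset 8. The sketch below reads that ID; `iff_form_type` is a hypothetical helper, not an existing puremagic function.

```python
def iff_form_type(data: bytes) -> str:
    """Return the four-character FORM type of an IFF file, or ''."""
    # Layout: b"FORM", 4-byte big-endian length, then the 4-byte type ID
    if len(data) < 12 or data[:4] != b"FORM":
        return ""
    return data[8:12].decode("ascii", errors="replace")
```

A mapping from type ID to extension and MIME type could then sit on top of this, for example ILBM for interleaved bitmap images.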

.epub listed as "INI Config file" in magic_data.json

A file with an .epub extension is extremely likely to be the zip+xml+html based, and extremely popular, e-book format rather than any form of INI configuration file. I cannot find any reference to .epub being a configuration file name for any system with a cursory search.

As it is listed directly after .ini in the data file, it was likely a copy/paste error.

The line should probably read as:

        ["", 0, ".epub", "application/epub+zip", "electronic book document"]

Variant field in magic.json?

Re-opening as #68 got a bit off-track from the original topic

Looking at #67 regarding imghdr, I started looking at the SGI file format, as that needs some love, much like I did for PCX in #50. I was about to start on a PR to add all the variants but had an idea regarding the naming convention.

At present the .json has a single name field. This works well enough, but depending on how people use that name there could be a better way.

For example with PCX we now have:

puremagic/puremagic/magic_data.json

Lines 831 to 866 in 88bc58f

    ["0a000101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, 1bpp)"],
    ["0a020101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, 1bpp)"],
    ["0a030101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, 1bpp)"],
    ["0a040101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, 1bpp)"],
    ["0a050101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, 1bpp)"],
    ["0a000001", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, no encoding, 1bpp)"],
    ["0a020001", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, no encoding, 1bpp)"],
    ["0a030001", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, no encoding, 1bpp)"],
    ["0a040001", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, no encoding, 1bpp)"],
    ["0a050001", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, no encoding, 1bpp)"],
    ["0a000102", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, 2bpp)"],
    ["0a020102", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, 2bpp)"],
    ["0a030102", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, 2bpp)"],
    ["0a040102", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, 2bpp)"],
    ["0a050102", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, 2bpp)"],
    ["0a000002", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, no encoding, 2bpp)"],
    ["0a020002", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, no encoding, 2bpp)"],
    ["0a030002", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, no encoding, 2bpp)"],
    ["0a040002", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, no encoding, 2bpp)"],
    ["0a050002", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, no encoding, 2bpp)"],
    ["0a030104", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, 4bpp)"],
    ["0a040104", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, 4bpp)"],
    ["0a050104", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, 4bpp)"],
    ["0a000004", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, no encoding, 4bpp)"],
    ["0a020004", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, no encoding, 4bpp)"],
    ["0a030004", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, no encoding, 4bpp)"],
    ["0a040004", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, no encoding, 4bpp)"],
    ["0a050004", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, no encoding, 4bpp)"],
    ["0a030108", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, 8bpp)"],
    ["0a040108", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, 8bpp)"],
    ["0a050108", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, 8bpp)"],
    ["0a000008", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, fixed EGA palette, no encoding, 8bpp)"],
    ["0a020008", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.5, modified EGA palette, no encoding, 8bpp)"],
    ["0a030008", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (2.8, no encoding, 8bpp)"],
    ["0a040008", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (Paintbrush for Windows, no encoding, 8bpp)"],
    ["0a050008", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file (3.0, no encoding, 8bpp)"],

This is nice as we can determine every variant this format can offer, but maybe it's a bit too 'wordy'. A possible enhancement could be a 'variant' field in the .json like so:

    ["0a000101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file", "(2.5, fixed EGA palette, 1bpp)"],
    ["0a020101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file", "(2.5, modified EGA palette, 1bpp)"],
    ["0a030101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file", "(2.8, 1bpp)"],
    ["0a040101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file", "(Paintbrush for Windows, 1bpp)"],
    ["0a050101", 0, ".pcx", "image/x-pcx", "ZSOFT Paintbrush file", "(3.0, 1bpp)"],

etc...

This would give those who only need a basic name a straightforward 'this is what I am', while those who would like to know precisely could use the variant to get specifics.
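
How a loader could accept both the current five-field rows and the proposed six-field rows can be sketched as below. This is only an illustration of the proposal: `Entry` and `parse_entry` are hypothetical names, and the optional `variant` field does not exist in the current data file.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    byte_match: bytes
    offset: int
    extension: str
    mime_type: str
    name: str
    variant: str = ""  # hypothetical sixth field, empty when absent


def parse_entry(row):
    """Build an Entry from a 5- or 6-element JSON row."""
    hex_bytes, *rest = row
    return Entry(bytes.fromhex(hex_bytes), *rest)
```

Defaulting the new field to an empty string keeps existing five-field rows valid, which fits the backwards-compatibility goal stated for version 2.0.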

Going back to the SGI format, we can see in the file specifications that it's got a small header with a lot of variant flags. At present there is one SGI entry in the .json at:

puremagic/puremagic/magic_data.json

Line 794 in 88bc58f

["01da01010003", 0, ".rgb", "image/x-rgb", "Silicon Graphics RGB Bitmap"],

Which equates to RLE compression, 2bpc, multiple 2D images. I'm happy to open a PR with a naming convention similar to the one I used for PCX, which would generate a similarly long list of SGI variants, but I had this idea and wanted to share it first.

Any thoughts on this?

PDF files are not always detected

From my testing, the %PDF- marker does not necessarily have to be at offset 0.
It can be located anywhere in the file. For example, I can type some junk at the beginning of the file and it still opens.

I have received multiple files like this from people, so there is something or someone out in the wild that adds extra characters in front of the magic sequence.

A detector would search for the substring inside a search window, something like this:

def is_pdf(file_path):
    with open(file_path, "rb") as file:
        # may throw IOError
        header = file.read(1024)
        return b"%PDF-" in header

From what I see currently the library is not built to handle this kind of situation.
So I'm leaving this ticket here with this code snippet in case more advanced detection is implemented.

