
Comments (6)

cdgriffith commented on September 3, 2024

Yes, I can add that note in the Readme!

from puremagic.

cdgriffith commented on September 3, 2024

I decided to test this further because it was bothering me: I knew puremagic was faster in the past (~10 years ago).

Tested on the develop branch for 1.27, using just my computer's Downloads folder.

puremagic Test File
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())


tracemalloc.start()

import_time_start = time.perf_counter()
import puremagic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        puremagic.from_file(file)
    except puremagic.PureError:
        unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
        unknown_total += 1
    except Exception as e:
        print(f"Error: {file} - {e}")
print("\nDownload file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

python-magic Test File
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()

import_time_start = time.perf_counter()
import magic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        result = magic.from_file(file)
    except Exception as e:
        print(f"Error: {file} - {e}")
    else:
        if result in ("ASCII text", "data"):
            unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
            unknown_total += 1
print("Download file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

puremagic results

$ time python speed_test_pure.py

Import time: 0.030981435003923252
Current memory usage: 1.009179MB

Testing 1131 files

Download file time: 2.9169464290025644
Unknown results types: {'.ovpn', '.docx', '.img', 'README'}
Unknown total: 4
Current memory usage: 1.022658MB

Peak memory usage: 1.355427MB

real    0m3.892s
user    0m1.001s
sys     0m0.134s

python-magic results

$ time python speed_test_pm.py

Import time: 0.061987199005670846
Current memory usage: 1.83419MB

Testing 1131 files
Download file time: 4.262647944997298
Unknown results types: {'.docx', '.vba', '.y4m', '.txt', '.stl', '.pem', '.json', '.bvr', 'README', '.ovpn', '.log', '.p7b'}
Unknown total: 30
Current memory usage: 1.849338MB

Peak memory usage: 1.88779MB

real    0m5.383s
user    0m0.301s
sys     0m0.290s

In this instance, puremagic was:

  • Faster
  • Lighter on memory
  • More accurate in its matches, when discounting "ASCII text" and "data" as real results from libmagic (which surprised me, honestly)

I also made sure that the overhead of checking for unknown types was the same in both cases, and that removing it produced the same speed difference.

The only time I saw the python-magic wrapper come out faster was when doing 1000+ iterations over a small test string. I don't have 1000+ different strings to test with, so I don't know whether that's because it is actually faster or because it just cached the results. That makes me think puremagic should maybe add an LRU cache with a configurable size.
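
To make the LRU cache idea concrete, here is a rough sketch using only the standard library. It is not puremagic's implementation: make_cached_detector and fake_detect are hypothetical names, and fake_detect is a stand-in detector so the snippet runs without either library installed.

```python
from functools import lru_cache
from typing import Callable


def make_cached_detector(
    detect: Callable[[bytes], str], maxsize: int = 128
) -> Callable[[bytes], str]:
    """Wrap a detector in an LRU cache whose size the caller chooses.

    Hypothetical helper for illustration; not part of puremagic.
    """
    # lru_cache requires hashable arguments; bytes and str both qualify.
    return lru_cache(maxsize=maxsize)(detect)


# Stand-in detector (an assumption) that records every real invocation.
calls = []


def fake_detect(data: bytes) -> str:
    calls.append(data)
    return ".py" if data.startswith(b"#!") else "unknown"


cached_detect = make_cached_detector(fake_detect, maxsize=2)

# Repeating the same input 1000 times only reaches the detector once.
for _ in range(1000):
    cached_detect(b"#!/usr/bin/env python")

print("Real detector calls:", len(calls))
print(cached_detect.cache_info())
```

If repeated identical inputs are what made the wrapper look faster, a cache like this would close that gap for that case only; it does nothing for 1000+ different inputs.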

Overall, this gives me lots to think about, and I'm happy with my findings. Thanks for the inspiration @mara004

Going to keep it as just "Faster" in the Readme, now with proof ™️


cdgriffith commented on September 3, 2024

Here's a quick test:

python-magic (libmagic wrapper)

import magic
print(magic.from_buffer("#!/usr/bin/env python"))

$ time python speed_test_pm.py
a /usr/bin/env python script, ASCII text executable, with no line terminators

real    0m0.108s
user    0m0.018s
sys     0m0.008s

puremagic

import puremagic
print(puremagic.from_string("#!/usr/bin/env python"))

$ time python speed_test_pure.py
.py

real    0m0.068s
user    0m0.015s
sys     0m0.000s


mara004 commented on September 3, 2024

For one thing, a single invocation isn't exactly reliable. For another, the above always includes import-time tasks, where libmagic is at a disadvantage because it has to locate and load the DLL.

A more reliable benchmark would be needed to actually support the "Faster" claim.
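
As a rough sketch of what such a benchmark could look like: time only the detection calls, after any import has already happened, and repeat the measurement several times. The helper name per_call_seconds, the sample inputs, and the stand-in fake_detect are all assumptions for illustration; in a real comparison you would pass each library's own detection function instead.

```python
import timeit
from typing import Callable, Sequence


def per_call_seconds(
    func: Callable[[bytes], object],
    samples: Sequence[bytes],
    repeat: int = 5,
    number: int = 100,
) -> float:
    """Best-of-`repeat` average seconds per call, excluding import cost."""

    def run_all() -> None:
        for sample in samples:
            func(sample)

    # timeit.repeat returns one total per round; the minimum is the round
    # least disturbed by other activity on the machine.
    totals = timeit.repeat(run_all, repeat=repeat, number=number)
    return min(totals) / (number * len(samples))


# Stand-in detector (an assumption) so the sketch runs without either library.
def fake_detect(data: bytes) -> str:
    return ".py" if data.startswith(b"#!") else "unknown"


samples = [b"#!/usr/bin/env python", b"%PDF-1.7", b"PK\x03\x04"]
elapsed = per_call_seconds(fake_detect, samples)
print(f"{elapsed * 1e6:.3f} microseconds per call")
```

Because the functions are imported before timing starts, DLL lookup and other import-time work are kept out of the per-call figure and can be reported separately.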


cdgriffith commented on September 3, 2024

The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.


mara004 commented on September 3, 2024

> The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

Well, that should be clarified in the Readme (e.g. "Faster to import" rather than just "Faster").
I took it to mean that the from_*(...) calls were claimed to be faster. 😅
If only importing is supposed to be faster, that will be true, but the primary concern is runtime, not startup time.
The 0.04s import-time difference may not be relevant to most users.
