bfraboni / fastgaussianblur
Fast Gaussian Blur algorithm
reproduce command:
./fastblur test12.png out.png 250
I'm not sure if it's an issue with sigma, but in OpenCV the same Gaussian computation works fine.
When you multiply the output by the float "iarr" and T is uchar, you should add +0.5f to round it; otherwise the output will be slightly darker. I suggest adding a constexpr check for whether T is uchar and rounding in that case; with that change the output is as expected.
GOOD - opencv blur as reference:
GOOD - Uchar rounded:
BAD - Uchar not rounded:
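The suggested fix could be sketched like this (a hypothetical helper; the names `acc` and `iarr` are assumed from the discussion, and the actual integration into the blur loops may differ):

```cpp
#include <type_traits>

// Hypothetical sketch: when converting the float accumulator back to an
// integral pixel type, round to nearest instead of truncating, so the
// blurred output is not slightly darker than the reference.
template <typename T>
T normalize_pixel(float acc, float iarr)
{
    if constexpr (std::is_integral_v<T>)
        return static_cast<T>(acc * iarr + 0.5f); // round instead of truncate
    else
        return static_cast<T>(acc * iarr);
}
```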
Is there room for implementing image compression?
A 62 KB JPG becomes a 419 KB PNG after blurring with sigma 50.
But after compressing it (I used Caesium) the size dropped to just 28 KB, using lossy JPG at quality 80.
I also tried cjpeg, but it doesn't seem to accept PNG input on Windows.
Any thoughts about this?
Thanks!
Hi,
This is not a bug report, but a request for help.
I'm working on a FOSS project called Pencil2D (www.pencil2d.org), software used for traditional hand-drawn 2D animation.
For the last year I've worked on the camera. The rewrite is almost finished; the only thing missing is depth of field, and for that I need a fast Gaussian blur.
I've looked at your code, and as I understand it, your software must be called from the command line and saves the resulting image as a file. What I/we need is a C++ class (or function) that can be called with an image, a blur value, and perhaps more as parameters, and that returns the same image with the blur applied.
I am not sure how to implement it. Could I persuade you to help me implement this?
Looking forward to hearing from you!
Yours,
David Lamhauge
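The kind of callable interface being asked for could be sketched as below. The naive clamped box average is only a stand-in for the repository's fast Gaussian blur, and the buffer layout (row-major, single channel) is an assumption for illustration:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the requested API: a plain C++ function taking an
// image buffer plus a blur radius and returning the blurred image. The body
// is a naive (2r+1)x(2r+1) box average with clamped borders, standing in for
// the real fast Gaussian blur.
std::vector<float> blur_image(const std::vector<float>& img, int w, int h, int r)
{
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
        {
            float sum = 0.f;
            int count = 0;
            // average over the window, clamping coordinates at the borders
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                {
                    const int xx = std::clamp(x + dx, 0, w - 1);
                    const int yy = std::clamp(y + dy, 0, h - 1);
                    sum += img[yy * w + xx];
                    ++count;
                }
            out[y * w + x] = sum / count;
        }
    return out;
}
```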
in horizontal_blur_extended (the same pattern should work for ..._crop as well)
TODO:
remove the ternary operator inside the "perform filtering loop"; a bit harder in this case
gist here, read comments :)
https://gist.github.com/michelerenzullo/89a047422fc0c0fc2f57432b80383676
The traditional Gaussian blur uses a Gaussian kernel, and this kernel has two important parameters: the kernel size and sigma. But I can't find the kernel size anywhere in your method.
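In box-blur approximations of a Gaussian (the approach used here), the kernel extent is derived from sigma rather than passed explicitly. A hedged sketch of the classic "ideal box width" rule for n successive box passes follows; the exact per-pass adjustment in the repository may differ:

```cpp
#include <cmath>

// Sketch: width w such that n successive box blurs of width w approximate a
// Gaussian of standard deviation sigma. The function name is illustrative.
int box_radius_for_gauss(float sigma, int n_passes)
{
    const float w_ideal = std::sqrt(12.f * sigma * sigma / n_passes + 1.f);
    int w = static_cast<int>(std::floor(w_ideal));
    if (w % 2 == 0) --w;  // box width must be odd to stay centered
    return (w - 1) / 2;   // radius corresponding to that width
}
```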
Integer multiplication is usually faster than floating-point multiplication, so when T is uchar (or any other integer type) we don't need to use float; this increases performance. Tested on an old laptop with a 24-bit 6000 x 8000 px image and r = 337.
You might add a check on the type T and use float or int based on it. Let me know.
with float
(horizontalblur only, tot - transp) : 4120.961600
(horizontalblur only, tot - transp) : 4184.653200
(horizontalblur only, tot - transp) : 4241.367500
(horizontalblur only, tot - transp) : 4068.939800
with int
(horizontalblur only, tot - transp) : 3496.743000
(horizontalblur only, tot - transp) : 3719.186500
(horizontalblur only, tot - transp) : 3510.837400
(horizontalblur only, tot - transp) : 3603.649300
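The suggestion could be sketched with a compile-time type selection; `acc_t` is an illustrative name, not from the repository:

```cpp
#include <type_traits>

// Sketch: select the accumulator type from the pixel type T at compile time,
// so integral images sum with integer arithmetic and floating-point images
// keep a float accumulator.
template <typename T>
using acc_t = std::conditional_t<std::is_integral_v<T>, int, float>;

static_assert(std::is_same_v<acc_t<unsigned char>, int>, "uchar sums as int");
static_assert(std::is_same_v<acc_t<float>, float>, "float keeps float");
```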
Hi, it's been a while since I last contributed, back when we discussed reflection padding and cache improvements.
I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is sometimes messy or impossible on certain platforms, so we could implement a POSIX-standard fallback. The performance is essentially equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, or its missing implementation on WebAssembly and Android, or the difficulties on Mac (for example here, OpenCV defaults to TBB or pthreads, same as on iOS).
I wrote a snippet, tested it properly, and made it possible to switch modes; please feel free to test it and eventually adopt it. I also improved the performance of flip_block
so that it behaves exactly the same when using threads. This transpose function was the reason for my request: as it stands, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work equally among the threads despite our increments of +=block.
#include <algorithm>
#include <thread>
#include <type_traits>
#include <vector>
#ifdef __OMP__
#include <omp.h>
#endif

template <typename T, typename op>
void hybrid_loop(T end, op operation)
{
    // Invoke the callable with or without a thread id, whichever it accepts.
    auto operation_wrapper = [&](T i, int tid = 0) {
        if constexpr (std::is_invocable_v<op, T>) operation(i);
        else operation(i, tid);
    };
#ifdef __SINGLE__
    for (T i = 0; i < end; ++i) operation_wrapper(i);
#elif __OMP__
    #pragma omp parallel for
    for (T i = 0; i < end; ++i) operation_wrapper(i, omp_get_thread_num());
#elif __STD_THREADS__
    const int num_threads = std::thread::hardware_concurrency();
    // Split into equal blocks, one per thread. E.g. 3 threads, start = 0, end = 8:
    // Thread 1: 0,1,2
    // Thread 2: 3,4,5
    // Thread 3: 6,7
    const T block_size = (end + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (int tid = 0; tid < num_threads; ++tid) {
        threads.emplace_back([=]() {
            const T block_start = tid * block_size;
            const T block_end = std::min<T>(block_start + block_size, end);
            for (T i = block_start; i < block_end; ++i) operation_wrapper(i, tid);
        });
    }
    for (auto &thread : threads) thread.join();
#endif
}
My flip_block:
#include <algorithm>
#include <cmath>

template <typename T, int C>
void flip_block(const T *in, T *out, const int w, const int h)
{
    // Suppose a square block of L2 cache size = 256KB,
    // divided by the number of channels and bytes per sample.
    const int block = std::sqrt(262144.0 / (C * sizeof(T))); // <-- note sqrt and also sizeof(T)
    const int w_blocks = std::ceil(static_cast<float>(w) / block);
    const int h_blocks = std::ceil(static_cast<float>(h) / block);
    hybrid_loop(w_blocks * h_blocks, [&](int n) {
        const int x = (n / h_blocks) * block;
        const int y = (n % h_blocks) * block;
        const T *p = in + y * w * C + x * C;
        T *q = out + y * C + x * h * C;
        const int blockx = std::min(w, x + block) - x;
        const int blocky = std::min(h, y + block) - y;
        for (int xx = 0; xx < blockx; xx++)
        {
            for (int yy = 0; yy < blocky; yy++)
            {
                for (int k = 0; k < C; k++)
                    q[k] = p[k];
                p += w * C;
                q += C;
            }
            p += -blocky * w * C + C;
            q += -blocky * C + h * C;
        }
    });
}
I think your current flip_block might be wrong in the cache calculation, because it misses the sqrt that I use in my code. Feel free to disagree or add thoughts, but if you can test my snippets (hybrid_loop with -D__STD_THREADS__ vs -D__OMP__) and compare against your current code, that might be clearer than reading my verbose notes.
Not strictly relevant, but also have a look at my deinterleave and interleave to get a better picture of these cache-friendly operations; there we deal with only one dimension at a time, so it's somewhat easier.
template <typename T, typename U>
void deinterleave_BGR(const T* const interleaved_BGR, U** const deinterleaved_BGR, const uint32_t total_size)
{
    // Cache-friendly deinterleave of BGR, split into blocks of 256 KB, inspired by flip_block.
    constexpr float round = std::is_integral_v<U> ? (std::is_integral_v<T> ? 0 : 0.5f) : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);
    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        U* const B = deinterleaved_BGR[0] + x;
        U* const G = deinterleaved_BGR[1] + x;
        U* const R = deinterleaved_BGR[2] + x;
        const T* const interleaved_ptr = interleaved_BGR + x * 3;
        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            B[xx] = interleaved_ptr[xx * 3 + 0] + round;
            G[xx] = interleaved_ptr[xx * 3 + 1] + round;
            R[xx] = interleaved_ptr[xx * 3 + 2] + round;
        }
    });
}
template <typename T, typename U>
void interleave_BGR(const U** const deinterleaved_BGR, T* const interleaved_BGR, const uint32_t total_size)
{
    constexpr float round = std::is_integral_v<T> ? (std::is_integral_v<U> ? 0 : 0.5f) : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);
    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        const U* const B = deinterleaved_BGR[0] + x;
        const U* const G = deinterleaved_BGR[1] + x;
        const U* const R = deinterleaved_BGR[2] + x;
        T* const interleaved_ptr = interleaved_BGR + x * 3;
        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            interleaved_ptr[xx * 3 + 0] = B[xx] + round;
            interleaved_ptr[xx * 3 + 1] = G[xx] + round;
            interleaved_ptr[xx * 3 + 2] = R[xx] + round;
        }
    });
}
Example of deinterleave_BGR:
std::vector<std::vector<float>> temp(3, std::vector<float>(sizes[0] * sizes[1]));
float* BGR[3] = { temp[0].data(), temp[1].data(), temp[2].data() };
deinterleave_BGR((const uint8_t*)padded.get(), BGR, sizes[0] * sizes[1]);