bfraboni / fastgaussianblur
Fast Gaussian Blur algorithm
reproduce command:
./fastblur test12.png out.png 250
I'm not sure if it's an issue with sigma, but in OpenCV the same Gaussian computation works fine.
When you multiply the output by the float "iarr" and T is uchar, you should add +0.5f to round it; otherwise the output will be slightly darker. I suggest adding a constexpr check for whether T is uchar and rounding in that case; with that change the output is as expected.
GOOD - opencv blur as reference:
GOOD - Uchar rounded:
BAD - Uchar not rounded:
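The suggested fix could be sketched like this (a hypothetical helper; the names `acc` and `iarr` are assumed from the discussion, and the actual integration into the blur loops may differ):

```cpp
#include <type_traits>

// Hypothetical sketch: when converting the float accumulator back to an
// integral pixel type, round to nearest instead of truncating, so the
// blurred output is not slightly darker than the reference.
template <typename T>
T normalize_pixel(float acc, float iarr)
{
    if constexpr (std::is_integral_v<T>)
        return static_cast<T>(acc * iarr + 0.5f); // round instead of truncate
    else
        return static_cast<T>(acc * iarr);
}
```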
Is there room for implementing image compression?
A 62 KB JPG becomes a 419 KB PNG after blurring with sigma 50.
But after compressing it (I used Caesium) the size dropped to just 28 KB, using lossy JPG at quality 80.
I also tried cjpeg, but it doesn't seem to accept PNG input on Windows.
Any thoughts about this?
Thanks!
Hi,
This is not a bug report, but a request for help.
I'm working on a FOSS project called Pencil2D (www.pencil2d.org), software used for traditional hand-drawn 2D animation.
For the last year I've worked on the camera. The rewrite is almost finished; the only thing missing is depth of field, and for that I need a fast Gaussian blur.
I've looked at your code, and as I understand it, your software must be called from the command line and saves the resulting image as a file. What I/we need is a C++ class (or function) that can be called with an image, a blur value, and perhaps more as parameters, and that returns the same image with the blur applied.
I am not sure how to implement it. Could I persuade you to help me implement this?
Looking forward to hearing from you!
Yours,
David Lamhauge
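The kind of callable interface being asked for could be sketched as below. The naive clamped box average is only a stand-in for the repository's fast Gaussian blur, and the buffer layout (row-major, single channel) is an assumption for illustration:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the requested API: a plain C++ function taking an
// image buffer plus a blur radius and returning the blurred image. The body
// is a naive (2r+1)x(2r+1) box average with clamped borders, standing in for
// the real fast Gaussian blur.
std::vector<float> blur_image(const std::vector<float>& img, int w, int h, int r)
{
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
        {
            float sum = 0.f;
            int count = 0;
            // average over the window, clamping coordinates at the borders
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                {
                    const int xx = std::clamp(x + dx, 0, w - 1);
                    const int yy = std::clamp(y + dy, 0, h - 1);
                    sum += img[yy * w + xx];
                    ++count;
                }
            out[y * w + x] = sum / count;
        }
    return out;
}
```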
in horizontal_blur_extended (the same pattern should work for ..._crop as well)
TODO:
remove the ternary operator inside the "perform filtering loop"; a bit harder in this case
gist here, read comments :)
https://gist.github.com/michelerenzullo/89a047422fc0c0fc2f57432b80383676
The traditional Gaussian blur uses a Gaussian kernel, and this kernel has two important parameters: the kernel size and sigma. But I can't find the kernel size anywhere in your method.
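In box-blur approximations of a Gaussian (the approach used here), the kernel extent is derived from sigma rather than passed explicitly. A hedged sketch of the classic "ideal box width" rule for n successive box passes follows; the exact per-pass adjustment in the repository may differ:

```cpp
#include <cmath>

// Sketch: width w such that n successive box blurs of width w approximate a
// Gaussian of standard deviation sigma. The function name is illustrative.
int box_radius_for_gauss(float sigma, int n_passes)
{
    const float w_ideal = std::sqrt(12.f * sigma * sigma / n_passes + 1.f);
    int w = static_cast<int>(std::floor(w_ideal));
    if (w % 2 == 0) --w;  // box width must be odd to stay centered
    return (w - 1) / 2;   // radius corresponding to that width
}
```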
Integer multiplication is usually faster than floating-point multiplication, so when T is uchar (or any other integer type) we don't need to use float; this increases performance. Tested on an old laptop with a 24-bit 6000 x 8000 px image and r = 337.
You might add a check on the type T and use float or int based on it. Let me know.
with float
(horizontalblur only, tot - transp) : 4120.961600
(horizontalblur only, tot - transp) : 4184.653200
(horizontalblur only, tot - transp) : 4241.367500
(horizontalblur only, tot - transp) : 4068.939800
with int
(horizontalblur only, tot - transp) : 3496.743000
(horizontalblur only, tot - transp) : 3719.186500
(horizontalblur only, tot - transp) : 3510.837400
(horizontalblur only, tot - transp) : 3603.649300
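The suggestion could be sketched with a compile-time type selection; `acc_t` is an illustrative name, not from the repository:

```cpp
#include <type_traits>

// Sketch: select the accumulator type from the pixel type T at compile time,
// so integral images sum with integer arithmetic and floating-point images
// keep a float accumulator.
template <typename T>
using acc_t = std::conditional_t<std::is_integral_v<T>, int, float>;

static_assert(std::is_same_v<acc_t<unsigned char>, int>, "uchar sums as int");
static_assert(std::is_same_v<acc_t<float>, float>, "float keeps float");
```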
Hi, it's been a while since I last contributed, back when we discussed reflection padding and cache improvements.
I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is sometimes messy or impossible on certain platforms, so we could implement a POSIX-standard fallback. The performance is essentially equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, or its missing implementation on WebAssembly and Android, or the difficulties on Mac (for example here, OpenCV defaults to TBB or pthreads, same as on iOS).
I wrote a snippet, tested it properly, and made it possible to switch modes; please feel free to test it and eventually adopt it. I also improved the performance of flip_block
so that it behaves exactly the same when using threads. This transpose function was the reason for my request: as it stands, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work equally among the threads despite our increments of +=block.
#include <algorithm>
#include <thread>
#include <type_traits>
#include <vector>
#ifdef __OMP__
#include <omp.h>
#endif

template <typename T, typename op>
void hybrid_loop(T end, op operation)
{
    // Invoke the callable with or without a thread id, whichever it accepts.
    auto operation_wrapper = [&](T i, int tid = 0) {
        if constexpr (std::is_invocable_v<op, T>) operation(i);
        else operation(i, tid);
    };
#ifdef __SINGLE__
    for (T i = 0; i < end; ++i) operation_wrapper(i);
#elif __OMP__
    #pragma omp parallel for
    for (T i = 0; i < end; ++i) operation_wrapper(i, omp_get_thread_num());
#elif __STD_THREADS__
    const int num_threads = std::thread::hardware_concurrency();
    // Split into equal blocks, one per thread. E.g. 3 threads, start = 0, end = 8:
    // Thread 1: 0,1,2
    // Thread 2: 3,4,5
    // Thread 3: 6,7
    const T block_size = (end + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (int tid = 0; tid < num_threads; ++tid) {
        threads.emplace_back([=]() {
            const T block_start = tid * block_size;
            const T block_end = std::min<T>(block_start + block_size, end);
            for (T i = block_start; i < block_end; ++i) operation_wrapper(i, tid);
        });
    }
    for (auto &thread : threads) thread.join();
#endif
}
My flip_block:
#include <algorithm>
#include <cmath>

template <typename T, int C>
void flip_block(const T *in, T *out, const int w, const int h)
{
    // Suppose a square block of L2 cache size = 256KB,
    // divided by the number of channels and bytes per sample.
    const int block = std::sqrt(262144.0 / (C * sizeof(T))); // <-- note sqrt and also sizeof(T)
    const int w_blocks = std::ceil(static_cast<float>(w) / block);
    const int h_blocks = std::ceil(static_cast<float>(h) / block);
    hybrid_loop(w_blocks * h_blocks, [&](int n) {
        const int x = (n / h_blocks) * block;
        const int y = (n % h_blocks) * block;
        const T *p = in + y * w * C + x * C;
        T *q = out + y * C + x * h * C;
        const int blockx = std::min(w, x + block) - x;
        const int blocky = std::min(h, y + block) - y;
        for (int xx = 0; xx < blockx; xx++)
        {
            for (int yy = 0; yy < blocky; yy++)
            {
                for (int k = 0; k < C; k++)
                    q[k] = p[k];
                p += w * C;
                q += C;
            }
            p += -blocky * w * C + C;
            q += -blocky * C + h * C;
        }
    });
}
I think your current flip_block might be wrong in the cache calculation, because it misses the sqrt that I use in my code. Feel free to disagree or add thoughts, but if you can test my snippets (hybrid_loop with -D__STD_THREADS__ vs -D__OMP__) and compare against your current code, that might be clearer than reading my verbose notes.
Not strictly relevant, but also have a look at my deinterleave and interleave to get a better picture of these cache-friendly operations; there we deal with only one dimension at a time, so it's somewhat easier.
template <typename T, typename U>
void deinterleave_BGR(const T* const interleaved_BGR, U** const deinterleaved_BGR, const uint32_t total_size)
{
    // Cache-friendly deinterleave of BGR, split into blocks of 256 KB, inspired by flip_block.
    constexpr float round = std::is_integral_v<U> ? (std::is_integral_v<T> ? 0 : 0.5f) : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);
    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        U* const B = deinterleaved_BGR[0] + x;
        U* const G = deinterleaved_BGR[1] + x;
        U* const R = deinterleaved_BGR[2] + x;
        const T* const interleaved_ptr = interleaved_BGR + x * 3;
        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            B[xx] = interleaved_ptr[xx * 3 + 0] + round;
            G[xx] = interleaved_ptr[xx * 3 + 1] + round;
            R[xx] = interleaved_ptr[xx * 3 + 2] + round;
        }
    });
}
template <typename T, typename U>
void interleave_BGR(const U** const deinterleaved_BGR, T* const interleaved_BGR, const uint32_t total_size)
{
    constexpr float round = std::is_integral_v<T> ? (std::is_integral_v<U> ? 0 : 0.5f) : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);
    hybrid_loop(num_blocks, [&](auto n) {
        const uint32_t x = n * block;
        const U* const B = deinterleaved_BGR[0] + x;
        const U* const G = deinterleaved_BGR[1] + x;
        const U* const R = deinterleaved_BGR[2] + x;
        T* const interleaved_ptr = interleaved_BGR + x * 3;
        const int blockx = std::min(total_size, x + block) - x;
        for (int xx = 0; xx < blockx; ++xx)
        {
            interleaved_ptr[xx * 3 + 0] = B[xx] + round;
            interleaved_ptr[xx * 3 + 1] = G[xx] + round;
            interleaved_ptr[xx * 3 + 2] = R[xx] + round;
        }
    });
}
Example of deinterleave_BGR:
std::vector<std::vector<float>> temp(3, std::vector<float>(sizes[0] * sizes[1]));
float* BGR[3] = { temp[0].data(), temp[1].data(), temp[2].data() };
deinterleave_BGR((const uint8_t*)padded.get(), BGR, sizes[0] * sizes[1]);