bfraboni / fastgaussianblur

89 stars · 3 watchers · 16 forks · 7.54 MB

Fast Gaussian Blur algorithm

Languages: Makefile 0.07% · C++ 7.65% · C 72.89% · Jupyter Notebook 19.39%
Topics: image-processing, cpp, gaussian-blur, blur


fastgaussianblur's Issues

output is slightly darker when using uchar arrays due to non-rounded multiply

When you multiply the accumulator by the float "iarr" and "T" is uchar, you should add +0.5f to round the result; otherwise the output will be slightly darker. I suggest adding a constexpr check for whether T is uchar and rounding in that case; with that change the output is as expected.

out[ti*C+ch] = acc[ch]*iarr;
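The effect can be sketched like this (a minimal illustration of truncation vs. +0.5f rounding on the uchar write-back; `normalize_truncate`/`normalize_round` are hypothetical helper names, not the repository's code):

```cpp
#include <cstdint>

// Truncating write-back (current behaviour): the cast drops the fractional
// part, biasing every pixel slightly downward.
inline uint8_t normalize_truncate(float acc, float iarr)
{
    return static_cast<uint8_t>(acc * iarr);
}

// Rounded write-back: add 0.5f before the uchar cast, as suggested above.
inline uint8_t normalize_round(float acc, float iarr)
{
    return static_cast<uint8_t>(acc * iarr + 0.5f);
}
```

For a sliding sum of 299 over a window of 3 pixels (true mean 99.67), truncation yields 99 while rounding yields 100.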

GOOD - OpenCV blur as reference: [opencv_blur]
GOOD - uchar, rounded: [FGB_uchar_rounded]
BAD - uchar, not rounded: [FGB_uchar]

Support for compressed jpeg output

Is there room for implementing image compression?

A 62 KB jpg image becomes a 419 KB png after a blur with sigma 50. But after compressing it (I used Caesium) the size dropped to just 28 KB, using lossy jpg at quality 80.
I also tried cjpeg, but it doesn't seem to accept png input on Windows.

Any thoughts about this?
Thanks!

Help for implementation?

Hi,
This is not a bug report but a request for help.
I'm working on a FOSS project called Pencil2D (www.pencil2d.org). It's software used for traditional, hand-drawn 2D animation.
For the last year I've worked on the camera. The rewrite is almost finished. The only thing missing is depth of field, and for that I need a fast Gaussian blur.
I've looked at your code, and as I understand it, your software must be called via the command line and saves the resulting image to a file. What I/we need is a C++ class (or function) that can be called with an image, a blur value, and perhaps more as parameters, and that returns the same image with the blur applied.
I'm not sure how to implement it. Could I persuade you to help me implement this?
Looking forward to hearing from you!
Yours,
David Lamhauge
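For what it's worth, the shape of such an in-memory API could look like the sketch below. `blur_image` is a hypothetical name, and its body is only a naive single horizontal box pass for illustration; a real implementation would call the repository's three-pass fast Gaussian blur instead:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical in-memory entry point: interleaved 8-bit pixels in, blurred
// copy out. The body is a naive single horizontal box pass, standing in for
// the repository's actual fast Gaussian blur.
std::vector<uint8_t> blur_image(const std::vector<uint8_t>& in,
                                int w, int h, int channels, int radius)
{
    std::vector<uint8_t> out(in.size());
    const int count = 2 * radius + 1;
    for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
    for (int c = 0; c < channels; ++c)
    {
        int acc = 0;
        for (int dx = -radius; dx <= radius; ++dx)
        {
            const int xx = std::clamp(x + dx, 0, w - 1); // clamp-to-edge padding
            acc += in[(y * w + xx) * channels + c];
        }
        out[(y * w + x) * channels + c] =
            static_cast<uint8_t>((acc + count / 2) / count); // rounded division
    }
    return out;
}
```

An application such as Pencil2D could then call this directly on its frame buffer without any files or command-line invocation involved.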

remove ternary operators, reduce redundancy, other improvements

In horizontal_blur_extended (the same pattern should work for ..._crop as well):

  1. Removed the ternary operator in the initial accumulation
  2. Reduced redundant constant calculation in the "initial accumulation" step
  3. Reduced redundant constant calculation in the "perform filtering" loop

TODO:
Remove the ternary operator inside the "perform filtering" loop; a bit harder in this case.

Gist here, read the comments :)
https://gist.github.com/michelerenzullo/89a047422fc0c0fc2f57432b80383676
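For readers who can't open the gist, the kind of transformation described can be sketched generically like this (a toy accumulator, not the repository's actual horizontal_blur_extended code):

```cpp
#include <algorithm>
#include <vector>

// Sum in[min(j, last)] for j in [start, start + r] without evaluating a
// ternary (or std::min) on every iteration: split the range into the part
// inside the image and the clamped tail, which just repeats the edge value.
int accumulate_clamped(const std::vector<int>& in, int start, int r)
{
    const int last = static_cast<int>(in.size()) - 1;
    const int inside = std::min(start + r, last); // last index still in range
    int acc = 0;
    for (int j = start; j <= inside; ++j) acc += in[j];
    acc += (start + r - inside) * in[last];       // clamped tail, hoisted out
    return acc;
}
```

The per-iteration branch disappears, which is exactly the kind of win items 1-3 above describe for the initial accumulation.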

kernel's size of blur

The traditional Gaussian blur has a Gaussian kernel, and this kernel has two important parameters: the kernel's size and sigma. But I can't see the kernel's size anywhere in your method.
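In answer to the question: in a box-blur approximation of Gaussian blur there is no explicit kernel-size parameter; the box widths are derived from sigma. The widely used derivation looks like the sketch below (I believe this project follows the same scheme, but treat the exact constants as my assumption, not a quote of its code):

```cpp
#include <cmath>
#include <vector>

// Compute the n box widths whose repeated application approximates a
// Gaussian of standard deviation sigma (n is the number of box passes,
// typically 3). The widths play the role of the missing "kernel size".
std::vector<int> boxes_for_gauss(float sigma, int n)
{
    const float wIdeal = std::sqrt(12.f * sigma * sigma / n + 1.f); // ideal width
    int wl = static_cast<int>(std::floor(wIdeal));
    if (wl % 2 == 0) wl--;                                          // force odd width
    const int wu = wl + 2;
    const float mIdeal =
        (12.f * sigma * sigma - n * wl * wl - 4.f * n * wl - 3.f * n)
        / (-4.f * wl - 4.f);
    const int m = static_cast<int>(std::round(mIdeal)); // passes using wl
    std::vector<int> sizes(n);
    for (int i = 0; i < n; ++i) sizes[i] = i < m ? wl : wu;
    return sizes;
}
```

So the kernel size is implicit: for sigma = 10 and three passes this yields box widths {19, 19, 21}.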

performance improvement on big images

Integer multiplication is usually faster than floating-point multiplication, so when T is uchar (or any other integer type) we don't need to use float; this improves performance. Tested on an old laptop with a 24-bit 6000 × 8000 px image and r = 337.
You might add a check for the type T and use float or int based on it. Let me know.

float fv[C], lv[C], acc[C]; // first value, last value, sliding accumulator

with float
(horizontalblur only, tot - transp) : 4120.961600
(horizontalblur only, tot - transp) : 4184.653200
(horizontalblur only, tot - transp) : 4241.367500
(horizontalblur only, tot - transp) : 4068.939800

with int
(horizontalblur only, tot - transp) : 3496.743000
(horizontalblur only, tot - transp) : 3719.186500
(horizontalblur only, tot - transp) : 3510.837400
(horizontalblur only, tot - transp) : 3603.649300
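A hedged sketch of what that type check could look like (`acc_t` and `normalize` are hypothetical names; the real change would touch the repository's accumulation and write-back loops):

```cpp
#include <cstdint>
#include <type_traits>

// Pick the accumulator type from the pixel type: integral pixels accumulate
// and normalise in int, floating-point pixels keep the float path.
template <typename T>
using acc_t = std::conditional_t<std::is_integral_v<T>, int, float>;

// Normalise a sliding-window sum over `span` samples back to a pixel value.
template <typename T>
T normalize(acc_t<T> acc, int span)
{
    if constexpr (std::is_integral_v<T>)
        return static_cast<T>((acc + span / 2) / span); // rounded int division
    else
        return static_cast<T>(acc * (1.f / span));      // float multiply
}
```

For uchar images this keeps the whole inner loop in integer arithmetic, which matches the speedup reported in the timings above.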

STD::THREADS + some improvements in flip_block

Hi, it's been a long time since I contributed, back when we discussed reflection padding and cache improvements.

I'm working on some WebAssembly projects, pthreads, etc., and I want to contribute again. Linking with OpenMP is sometimes messy or impossible on some platforms, so we can implement a POSIX-standard alternative. The performance is essentially equivalent, since the code is quite simple and "the power of OpenMP" doesn't justify the complexity, or its missing implementation on WebAssembly and Android, or the difficulties on Mac (for example, OpenCV defaults to TBB or pthreads there, same as on iOS).

I wrote a snippet, tested it properly, that lets you switch mode; please feel free to test it and eventually merge it. I also improved the performance of flip_block so that it is exactly the same when using threads. This transpose function was the reason for my request: as it stands, it can cause performance issues when not using OpenMP, which is smart enough to collapse the loops and spread the work equally among the threads despite our increments of +=block.

#include <algorithm>
#include <thread>
#include <type_traits>
#include <vector>

template <typename T, typename op>
void hybrid_loop(T end, op operation)
{
    auto operation_wrapper = [&](T i, int tid = 0) {
        if constexpr (std::is_invocable_v<op, T>) operation(i);
        else operation(i, tid);
    };
#ifdef __SINGLE__
    for (T i = 0; i < end; ++i) operation_wrapper(i);
#elif __OMP__
#pragma omp parallel for
    for (T i = 0; i < end; ++i) operation_wrapper(i, omp_get_thread_num());
#elif __STD_THREADS__
    const int num_threads = std::thread::hardware_concurrency();
    // Split the range into equal blocks, one per thread. Ex: 3 threads, end = 8
    // Thread 1: 0,1,2 / Thread 2: 3,4,5 / Thread 3: 6,7
    const T block_size = (end + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;

    for (int tid = 0; tid < num_threads; ++tid) {
        threads.emplace_back([=]() {
            const T block_start = tid * block_size;
            const T block_end = std::min<T>(block_start + block_size, end);
            for (T i = block_start; i < block_end; ++i) operation_wrapper(i, tid);
        });
    }

    for (auto& thread : threads) thread.join();
#endif
}

my flip block:

template <typename T, int C>
void flip_block(const T *in, T *out, const int w, const int h)
{
    // Suppose a square block of L2 cache size = 256KB
    // to be divided for the num of channels and bytes
    const int block = sqrt(262144.0 / (C * sizeof(T)));   // <-- note sqrt and also sizeof(T)
    const int w_blocks = std::ceil(static_cast<float>(w) / block);
    const int h_blocks = std::ceil(static_cast<float>(h) / block);

    hybrid_loop(w_blocks * h_blocks, [&](int n) {
            int x = (n / h_blocks) * block;
            int y = (n % h_blocks) * block;
            const T *p = in + y * w * C + x * C;
            T *q = out + y * C + x * h * C;

            const int blockx = std::min(w, x + block) - x;
            const int blocky = std::min(h, y + block) - y;
            for (int xx = 0; xx < blockx; xx++)
            {
                for (int yy = 0; yy < blocky; yy++)
                {
                    for (int k = 0; k < C; k++)
                        q[k] = p[k];
                    p += w * C;
                    q += C;
                }
                p += -blocky * w * C + C;
                q += -blocky * C + h * C;
            }
        });
}

I think your current flip_block might be wrong in the cache calculation because:

  • it doesn't consider the data type of the input: with 1 channel of float (4 bytes), a block should hold 65536 elements, since 65536 * 4 = 262144 bytes = 256 KB
  • it also doesn't account for the two dimensions of the block, which is why you see sqrt in my code

Feel free to disagree or add thoughts, but if you can test my snippets (hybrid_loop with -D__STD_THREADS__ vs -D__OMP__) and compare with your current code, that might be clearer than reading my verbose notes.

Not directly relevant, but also have a look at my deinterleave and interleave to get a better picture of these cache-friendly operations; in my opinion, dealing with only one dimension at a time makes things somewhat easier...

template<typename T, typename U>
void deinterleave_BGR(const T* const interleaved_BGR, U** const deinterleaved_BGR, const uint32_t total_size) {

    // Cache-friendly deinterleave BGR, splitting for blocks of 256 KB, inspired by flip-block
    constexpr float round = std::is_integral_v<U> ? std::is_integral_v<T> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);

    hybrid_loop(num_blocks, [&](auto n) {
		const uint32_t x = n * block;
		U* const B = deinterleaved_BGR[0] + x;
		U* const G = deinterleaved_BGR[1] + x;
		U* const R = deinterleaved_BGR[2] + x;
		const T* const interleaved_ptr = interleaved_BGR + x * 3;

		const int blockx = std::min(total_size, x + block) - x;
		for (int xx = 0; xx < blockx; ++xx)
		{
			B[xx] = interleaved_ptr[xx * 3 + 0] + round;
			G[xx] = interleaved_ptr[xx * 3 + 1] + round;
			R[xx] = interleaved_ptr[xx * 3 + 2] + round;
		}
	});

}

template<typename T, typename U>
void interleave_BGR(const U** const deinterleaved_BGR, T* const interleaved_BGR, const uint32_t total_size) {

    constexpr float round = std::is_integral_v<T> ? std::is_integral_v<U> ? 0 : 0.5f : 0;
    constexpr uint32_t block = 262144 / (3 * std::max(sizeof(T), sizeof(U)));
    const uint32_t num_blocks = std::ceil(total_size / (float)block);
	
    hybrid_loop(num_blocks, [&](auto n) {
		const uint32_t x = n * block;
		const U* const B = deinterleaved_BGR[0] + x;
		const U* const G = deinterleaved_BGR[1] + x;
		const U* const R = deinterleaved_BGR[2] + x;
		T* const interleaved_ptr = interleaved_BGR + x * 3;

		const int blockx = std::min(total_size, x + block) - x;
		for (int xx = 0; xx < blockx; ++xx)
		{
			interleaved_ptr[xx * 3 + 0] = B[xx] + round;
			interleaved_ptr[xx * 3 + 1] = G[xx] + round;
			interleaved_ptr[xx * 3 + 2] = R[xx] + round;
		}
	});

}

Example of deinterleave_BGR:

	std::vector<std::vector<float>> temp(3, std::vector<float>(sizes[0] * sizes[1]));
	float* BGR[3] = { temp[0].data(), temp[1].data(), temp[2].data() };
	deinterleave_BGR((const uint8_t*)padded.get(), BGR, sizes[0] * sizes[1]);
