
Comments (14)

william-silversmith commented on July 22, 2024

Hi Karol!

Thanks for the kind words! Currently cc3d works with arrays internally and doesn't have the ability to work with mmapped arrays. Even if it did, the output array will still be at least uint16, uint32, or uint64 depending on your data, so the output alone might be too much (I suppose it could be modified to use mmapped files for the output too).

However, with respect to larger-than-RAM files, I do have a (beta) solution for that if 6-way connectivity is sufficient. Check out https://github.com/seung-lab/igneous#connected-components-labeling-ccl-beta which has the ability to independently process image cutouts in parallel and perform CCL labeling. You'd have to first convert your image into a Neuroglancer Precomputed format using CloudVolume https://github.com/seung-lab/cloud-volume. However, then you'd be able to visualize your data too.

One warning. This procedure passes my automated tests and I've used it for pretty big volumes successfully. I did see it screw up on one very large volume and I haven't figured out why yet, but odds are it will work fine for you.

from connected-components-3d.

william-silversmith commented on July 22, 2024

The memory mapped numpy array does sound interesting. It would be very nice to just direct people to using that for very large volumes. I'll have to play around with this.


Karol-G commented on July 22, 2024

Hey William,

> The memory mapped numpy array does sound interesting. It would be very nice to just direct people to using that for very large volumes. I'll have to play around with this.

It would certainly be awesome if cc3d were compatible with memory-mapped arrays in the future. For large images, speed is often less important than memory consumption, so even if cc3d functions were slower on memory-mapped data, that shouldn't be a big issue.

> However, with respect to larger-than-RAM files, I do have a (beta) solution for that if 6-way connectivity is sufficient. Check out https://github.com/seung-lab/igneous#connected-components-labeling-ccl-beta which has the ability to independently process image cutouts in parallel and perform CCL labeling. You'd have to first convert your image into a Neuroglancer Precomputed format using CloudVolume https://github.com/seung-lab/cloud-volume. However, then you'd be able to visualize your data too.

This also sounds interesting, but for my use cases this solution probably involves too much pre- and postprocessing overhead and too much conversion of the images between formats.

I also had the idea to do CCL in a sliding-window manner. The image would be split into patches, every patch labeled via CCL, and the patches then assembled again. In the naive version of this approach, components that stretch over multiple patches would be split.
To prevent this, a 1-pixel patch overlap could be introduced in order to propagate the labels of components that stretch over multiple patches from the previous patch. It would also require keeping a running counter of the number of components and using this counter to offset the labels of the components in every new patch.

Does this approach make sense to you? Are there problems that I have not considered?
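
A minimal sketch of the patch-based scheme described above, on a small 2D binary image with 4-connectivity and pure Python so that it stays self-contained. The helper names (`label_chunk`, `find`, `chunked_ccl`) are hypothetical and are not taken from cc3d or Igneous. One point the sketch makes explicit: copying labels forward from the previous patch is not enough on its own, because two components that looked separate in an earlier patch can turn out to be joined further down (a "U" shape). The sketch therefore records label equivalences in a union-find structure and resolves them in a final relabeling pass.

```python
from collections import deque


def label_chunk(chunk, first_label):
    """Label a small 2D binary chunk (list of rows) with a 4-connected flood fill."""
    rows, cols = len(chunk), len(chunk[0])
    out = [[0] * cols for _ in range(rows)]
    next_label = first_label
    for r in range(rows):
        for c in range(cols):
            if not chunk[r][c] or out[r][c]:
                continue
            out[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and chunk[ny][nx] and not out[ny][nx]:
                        out[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return out, next_label


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def chunked_ccl(image, chunk_rows):
    """Label `image` in row chunks that overlap the previous chunk by one row."""
    parent = {}      # union-find over provisional labels
    labels = []      # provisional labels for the whole image
    next_label = 1   # running counter so labels are unique across chunks
    for start in range(0, len(image), chunk_rows):
        lo = max(start - 1, 0)  # include the last row of the previous chunk
        chunk_labels, new_next = label_chunk(image[lo:start + chunk_rows], next_label)
        for lab in range(next_label, new_next):
            parent[lab] = lab
        next_label = new_next
        if start > 0:
            # The overlap row was labeled twice: merge new labels with old ones.
            for new, old in zip(chunk_labels[0], labels[start - 1]):
                if new:
                    ra, rb = find(parent, new), find(parent, old)
                    if ra != rb:
                        parent[max(ra, rb)] = min(ra, rb)
            chunk_labels = chunk_labels[1:]  # drop the overlap row
        labels.extend(chunk_labels)
    # Final pass: replace each label by its root and renumber consecutively.
    remap = {}
    for row in labels:
        for c, lab in enumerate(row):
            if lab:
                row[c] = remap.setdefault(find(parent, lab), len(remap) + 1)
    return labels, len(remap)


# A "U" shape whose two arms only meet in the second chunk, plus one lone pixel.
image = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
]
labels, count = chunked_ccl(image, chunk_rows=2)
print(count)
for row in labels:
    print(row)
```

For a larger-than-RAM volume, the final relabeling pass would itself have to run chunk by chunk over the stored labels; the in-memory list here only keeps the example short.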

Best,
Karol


william-silversmith commented on July 22, 2024

Hi Karol,

I just gave it a try with an mmapped input file and it seems to work. However, it outputs to an in-memory array, which can be several times bigger than the input since it's 2-8 bytes per voxel. The union-find data structure will also be in memory, but usually it is 10x to 100x smaller than the input image so I'm not as worried about it.


william-silversmith commented on July 22, 2024

As for the strategy you suggested, yes, that's pretty much what I did in Igneous. It seems to work. 6-connected is much easier to implement than 26-connected in that scheme. If that interests you, you can look at the CCL code in Igneous for tips on implementing it. I may also just add mmap support shortly so give me a day or two before trying that.


william-silversmith commented on July 22, 2024

Hi Karol,

I implemented mmap output, and it already worked with mmap input. You can try experimenting with the master branch. I will also be releasing 3.11.0 shortly and you will be able to get it on PyPI. Check the front page README examples for how to use it. Let me know if you have any feedback!


Karol-G commented on July 22, 2024

Hey,

Thank you a lot! I first tested it with cc3d.connected_components and it worked like a charm. Memory consumption was very low, peaking at only 1-2 GB when running it on a (2598, 2833, 2857) uint8 array (~21 GB). The function estimated that uint32 would be needed for the output and created a memory-mapped output array with a size of 84 GB. uint16 would actually be completely fine, as there are only about 40,000 components, but that is another topic ;) The connected_components call took roughly 6 min, which is not a problem for my use case.
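
The sizes quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using decimal gigabytes; the variable names are illustrative, and it says nothing about how cc3d itself picks the output type.

```python
from math import prod

shape = (2598, 2833, 2857)
voxels = prod(shape)

# Bytes per voxel for the unsigned integer types mentioned in the thread.
bytes_per_voxel = {"uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8}
size_gb = {name: voxels * nbytes / 1e9 for name, nbytes in bytes_per_voxel.items()}

for name, gb in size_gb.items():
    print(f"{name}: {gb:.1f} GB")

# Smallest unsigned type whose maximum value can hold the final label count.
n_components = 40_000
smallest = next(
    name for name, nbytes in bytes_per_voxel.items()
    if n_components <= 2 ** (8 * nbytes) - 1
)
print(smallest)
```

The uint8 input comes out at about 21 GB and a uint32 output at about 84 GB, matching the figures above; a uint16 output of the same shape would be about 42 GB.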

Something I noticed is that the memory-mapped input has to be opened in 'r+' mode; with plain 'r' mode, cc3d.connected_components raises an exception. Does this mean that the function modifies the input array? This is not a problem in my current code but would be something to keep in mind if that is the case.
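
On the 'r' versus 'r+' question, the thread does not record an answer, so the following is background only. A mapping opened read-only exports a read-only buffer, and code that requests a writable buffer (for example, a Cython typed memoryview declared without `const`) rejects it even if it never writes. Whether that is what happens inside cc3d is an assumption. The sketch uses Python's standard `mmap` module as a stand-in for `numpy.memmap`, whose `mode="r"` and `mode="r+"` make the same distinction.

```python
import mmap
import os
import tempfile

# Write a small scratch file to map (a stand-in for a large volume on disk).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(16))
    path = f.name

try:
    # Read-only mapping, analogous to numpy.memmap(..., mode="r").
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        with memoryview(m) as view:
            read_only_flag = view.readonly

    # Read-write mapping, analogous to numpy.memmap(..., mode="r+").
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
        with memoryview(m) as view:
            writable_flag = not view.readonly
finally:
    os.remove(path)

print(read_only_flag, writable_flag)
```

A consumer that insists on a writable buffer would accept the second mapping and refuse the first, regardless of whether it modifies the data.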

I then ran cc3d.statistics with the memory-mapped output, which sadly ran seemingly forever and I had to quit it after 2-3 hours. On the plus side, it essentially did not consume any memory.
Do you think this could be a bug or is it simply very slow when using a memory-mapped array?

Best,
Karol


william-silversmith commented on July 22, 2024


william-silversmith commented on July 22, 2024

So, I looked into it. I think it's actually working, but very slow b/c it's swapping a lot. I was being cautious and used large data types assuming a reasonable number of regions, but if you have a noise dataset, then it takes 8 * 6 * N bytes, which in the case of a 1000^3 noise dataset with 742769605 regions, ends up being 35 GB for the bounds alone. When I pick a dataset small enough to not swap too much (e.g. 700^3), the calculation is quick. When it is big, it is very slow and may get killed by OOM.

What I can do is use tighter data types (uint16 for bounds, uint32 for label counts, and float for centroids), give an option for skipping converting the bounds into slices, and make sure to iterate in the contiguous direction. Do you know how many regions you have and how much RAM you've been using for statistics?
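
The "8 * 6 * N bytes" figure can be reproduced directly. This sketch reads the 6 as a minimum and a maximum coordinate along each of the three axes, which is an interpretation of the comment above rather than something taken from the cc3d source; `bounds_bytes` is a hypothetical helper.

```python
n_regions = 742_769_605   # regions in the 1000^3 noise volume quoted above
values_per_region = 6     # assumed: min and max along each of x, y, z


def bounds_bytes(n, bytes_per_value):
    """Memory needed to store one bounding box per region."""
    return n * values_per_region * bytes_per_value


wide = bounds_bytes(n_regions, 8)    # 8-byte values, as in 8 * 6 * N
tight = bounds_bytes(n_regions, 2)   # uint16 values, as proposed above

print(f"8-byte bounds: {wide / 1e9:.1f} GB")
print(f"uint16 bounds: {tight / 1e9:.1f} GB")
```

This gives roughly 35.7 GB for 8-byte values and about 8.9 GB for uint16. Note that uint16 can only represent coordinates up to 65,535 along each axis.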


william-silversmith commented on July 22, 2024

Hi Karol,

I did some memory and performance optimization on cc3d.statistics in the latest version, 3.12.0. You can give it another shot and see if it does any better.


Karol-G commented on July 22, 2024

Hey,

Sorry for the late reply and thanks a lot for the optimization! I am at a conference this week and will only be able to test it next week. I will give you an update then :)


Karol-G commented on July 22, 2024

Hey,

I was finally able to test cc3d.statistics and it runs perfectly now!
The memory consumption was negligible and the runtime was 309 s for a (2598, 2833, 2857) array with ~40,000 components.
This essentially enables my pipeline to be memory-efficient at every stage without major bottlenecks. Thank you a lot!

Best,
Karol


william-silversmith commented on July 22, 2024


Karol-G commented on July 22, 2024

Hey,

Glad to know that the fix was not too complicated! From my side, the issue is solved. Thank you again for your outstanding support :)

Best,
Karol
