Comments (14)
Hi Karol!
Thanks for the kind words! Currently cc3d works with arrays internally and doesn't have the ability to work with mmapped arrays. Even if it did, the output array will still be at least uint16, uint32, or uint64 depending on your data, so the output alone might be too much (I suppose it could be modified to use mmapped files for the output too).
However, with respect to larger-than-RAM files, I do have a (beta) solution for that if 6-way connectivity is sufficient. Check out https://github.com/seung-lab/igneous#connected-components-labeling-ccl-beta which has the ability to independently process image cutouts in parallel and perform CCL labeling. You'd have to first convert your image into a Neuroglancer Precomputed format using CloudVolume https://github.com/seung-lab/cloud-volume. However, then you'd be able to visualize your data too.
One warning. This procedure passes my automated tests and I've used it for pretty big volumes successfully. I did see it screw up on one very large volume and I haven't figured out why yet, but odds are it will work fine for you.
from connected-components-3d.
The memory mapped numpy array does sound interesting. It would be very nice to just direct people to using that for very large volumes. I'll have to play around with this.
Hey William,
> The memory mapped numpy array does sound interesting. It would be very nice to just direct people to using that for very large volumes. I'll have to play around with this.
It would certainly be awesome if cc3d were compatible with memory-mapped arrays in the future. For large images, speed is often less relevant than memory consumption, so even if cc3d functions were slower on memory-mapped data, that shouldn't be a big issue.
> However, with respect to larger-than-RAM files, I do have a (beta) solution for that if 6-way connectivity is sufficient. Check out https://github.com/seung-lab/igneous#connected-components-labeling-ccl-beta which has the ability to independently process image cutouts in parallel and perform CCL labeling. You'd have to first convert your image into a Neuroglancer Precomputed format using CloudVolume https://github.com/seung-lab/cloud-volume. However, then you'd be able to visualize your data too.
This also sounds interesting, but for my use cases this solution probably involves too much pre- and postprocessing overhead and too much conversion of the images between formats.
I also had the idea to do CCL in a sliding-window manner. The image would be split into patches, every patch labeled via CCL, and the patches then assembled again. In the naive version of this approach, components that stretch over multiple patches would be split.
To prevent this, a 1-pixel patch overlap could be introduced in order to propagate the labels of components that stretch over multiple patches from the previous patch. It would also require a running counter of the number of components, used to increment the labels of the components in every new patch.
Does this approach make sense to you? Are there problems that I have not considered?
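To make the proposal concrete, here is a minimal sketch of the overlap idea, assuming slabs along the first axis and 6-connectivity. It is not the Igneous or cc3d implementation: `label_patch` is a naive flood fill standing in for any per-patch CCL routine, and all function names are hypothetical. One detail it makes visible is that copying labels forward is not enough on its own: two components that look separate in an early slab can join in a later one, so the label equivalences are tracked with a union-find and resolved in a final pass.

```python
from collections import deque
import numpy as np

NEIGHBORS_6 = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def label_patch(mask):
    """Naive 6-connected flood-fill labeling of one 3D boolean patch."""
    labels = np.zeros(mask.shape, dtype=np.uint32)
    n = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        n += 1
        labels[seed] = n
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS_6:
                p = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(p, mask.shape))
                if inside and mask[p] and not labels[p]:
                    labels[p] = n
                    queue.append(p)
    return labels, n

def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def label_in_slabs(mask, slab=2):
    """Label `mask` slab by slab, merging across a 1-plane overlap."""
    out = np.zeros(mask.shape, dtype=np.uint32)
    parent = [0]  # parent[i] for global label i; index 0 is background
    for z0 in range(0, mask.shape[0], slab):
        lo = max(z0 - 1, 0)  # re-read the last plane of the previous slab
        z1 = min(z0 + slab, mask.shape[0])
        local, n = label_patch(mask[lo:z1])
        offset = len(parent) - 1  # running counter of labels issued so far
        local[local > 0] += offset
        parent.extend(range(offset + 1, offset + n + 1))
        if lo < z0:
            # The shared plane holds old labels in `out` and new ones in `local`.
            fg = out[lo] > 0
            for a, b in set(zip(out[lo][fg].tolist(), local[0][fg].tolist())):
                ra, rb = find(parent, a), find(parent, b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
            local = local[1:]
        out[z0:z1] = local
    # Final pass: replace every label by its root and renumber consecutively.
    roots = np.array([find(parent, i) for i in range(len(parent))])
    _, compact = np.unique(roots, return_inverse=True)
    return compact.astype(np.uint32)[out]
```

For a truly larger-than-RAM volume, `mask` and `out` would be memory-mapped and the final relabeling would itself be done slab by slab; here it is done in one step for brevity.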
Best,
Karol
Hi Karol,
I just gave it a try with an mmapped input file and it seems to work. However, it outputs to an in-memory array which can be several times bigger since it's 2-8 bytes per voxel. The union-find data structure will also be in memory, but usually it is 10x to 100x smaller than the input image so I'm not as worried about it.
As for the strategy you suggested, yes, that's pretty much what I did in Igneous. It seems to work. 6-connected is much easier to implement than 26-connected in that scheme. If that interests you, you can look at the CCL code in Igneous for tips on implementing it. I may also just add mmap support shortly so give me a day or two before trying that.
Hi Karol,
I implemented mmap output, and it already worked with mmap input. You can try experimenting with the master branch. I will also be releasing 3.11.0 shortly and you will be able to get it on PyPI. Check the front page README examples for how to use it. Let me know if you have any feedback!
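For readers who want a starting point, the sketch below shows roughly what this looks like. It assumes the `out_file` keyword of `cc3d.connected_components` as described in the cc3d README for 3.11.0 and later, a raw binary input in Fortran order, and hypothetical helper names and file paths; check the README for the authoritative example.

```python
import numpy as np

def open_volume(path, shape, dtype=np.uint8, mode="r+"):
    """Map a raw binary file as a Fortran-ordered array without loading it into RAM."""
    return np.memmap(path, dtype=dtype, mode=mode, shape=shape, order="F")

def label_on_disk(in_path, out_path, shape, dtype=np.uint8):
    """Label a memory-mapped volume, writing the labels to a second mmapped file."""
    import cc3d  # assumed: cc3d >= 3.11.0, which added mmap output

    labels_in = open_volume(in_path, shape, dtype)
    # out_file asks cc3d to back the result with a file on disk instead of RAM.
    # The returned array exposes the shape, dtype, and order of that file.
    return cc3d.connected_components(labels_in, out_file=out_path)
```

A call such as `label_on_disk("labels.bin", "out.bin", shape=(5000, 5000, 2000), dtype=np.uint32)` would then leave the labeled volume in `out.bin` (file names are placeholders).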
Hey,
Thank you a lot! I first tested it with cc3d.connected_components and it worked like a charm. Memory consumption is minimal, peaking at only 1-2 GB when running it on a (2598, 2833, 2857) uint8 array (~21 GB). The function estimated that uint32 would be needed for the output and created a memory-mapped output array with a size of 84 GB. Uint16 would actually be completely fine as there are only ~40,000 components, but that is another topic ;) The connected_components function took maybe ~6 min, which is not a problem for my use case.
Something I noticed is that the 'r+' mode is required; opening the input with just the 'r' mode raises an exception when running cc3d.connected_components. Does this mean that the method modifies the input array? This is not a problem in my current code but would be something to keep in mind if that is the case.
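The sizes quoted above follow directly from the voxel count. The short calculation below is only a sanity check of that arithmetic; the helper names are hypothetical and it says nothing about how cc3d itself chooses the output dtype.

```python
import math

shape = (2598, 2833, 2857)
voxels = math.prod(shape)

def size_gb(n_voxels, bytes_per_voxel):
    """Array size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_voxels * bytes_per_voxel / 1e9

def smallest_uint_bytes(max_label):
    """Width in bytes of the narrowest unsigned integer that can hold max_label."""
    for nbytes in (1, 2, 4, 8):
        if max_label < 2 ** (8 * nbytes):
            return nbytes
    raise ValueError("label does not fit in 64 bits")

print(f"uint8 input:   {size_gb(voxels, 1):.1f} GB")
print(f"uint32 output: {size_gb(voxels, 4):.1f} GB")
print(f"uint16 output: {size_gb(voxels, 2):.1f} GB")
print(f"bytes needed for 40,000 labels: {smallest_uint_bytes(40_000)}")
```

A uint16 output would halve the file size relative to uint32, which matches the remark that uint16 would suffice for roughly 40,000 components.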
I then ran cc3d.statistics with the memory-mapped output, which sadly ran seemingly forever and I had to quit it after 2-3 hours. On the plus side, it essentially did not consume any memory.
Do you think this could be a bug or is it simply very slow when using a memory-mapped array?
Best,
Karol
So, I looked into it. I think it's actually working, but very slow b/c it's swapping a lot. I was being cautious and used large data types assuming a reasonable number of regions, but if you have a noise dataset, then it takes 8 * 6 * N bytes, which in the case of a 1000^3 noise dataset with 742769605 regions, ends up being 35 GB for the bounds alone. When I pick a dataset small enough to not swap too much (e.g. 700^3), the calculation is quick. When it is big, it is very slow and may get killed by OOM.
What I can do is use tighter data types (uint16 for bounds, uint32 for label counts, and float for centroids), give an option for skipping converting the bounds into slices, and make sure to iterate in the contiguous direction. Do you know how many regions you have and how much RAM you've been using for statistics?
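To put numbers on this, the snippet below evaluates the 8 * 6 * N estimate for the 1000^3 noise example and compares it with 2-byte (uint16) bounds. It is a back-of-the-envelope sketch with a hypothetical helper name, not a measurement of cc3d.statistics.

```python
def bounds_bytes(n_regions, bytes_per_value=8, values_per_region=6):
    """Memory for bounding boxes: six values (min and max per axis) per region."""
    return bytes_per_value * values_per_region * n_regions

N = 742_769_605  # regions in the 1000^3 noise dataset mentioned above

wide = bounds_bytes(N)                      # 8-byte values
tight = bounds_bytes(N, bytes_per_value=2)  # uint16 values

print(f"8-byte bounds: {wide / 1e9:.1f} GB")
print(f"uint16 bounds: {tight / 1e9:.1f} GB")
```

Note that uint16 bounds can only represent coordinates below 65,536, so this saving applies when every axis of the volume is shorter than that.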
Hi Karol,
I did some memory and performance optimization on cc3d.statistics; it is available in the latest version, 3.12.0. You can give it another shot and see if it does any better.
Hey,
Sorry for the late reply and thanks a lot for the optimization! I am at a conference this week and will only be able to test it next week. I will give you an update then :)
Hey,
I was finally able to test it on cc3d.statistics and it runs perfectly now!
The memory consumption was negligible and the runtime was 309 s for a (2598, 2833, 2857) array with ~40,000 components.
This essentially enables my pipeline to be memory efficient throughout every stage without major bottlenecks. Thank you a lot!
Best,
Karol
Hey,
Glad to know that the fix was not too complicated! From my side, the issue is solved. Thank you again for your outstanding support :)
Best,
Karol