
Comments (22)

Wuziyi616 commented on July 1, 2024

That's true, I agree. You can definitely try that, and I'm also interested in how much local features will help. I'm just a bit worried about the GPU memory haha. But since you're playing around with 2 parts, maybe it's fine.

from multi_part_assembly.

Wuziyi616 commented on July 1, 2024

Yes, I believe you need to modify some parts of the code to integrate local features


Wuziyi616 commented on July 1, 2024
  • No, I'm talking about this line, i.e. PointNet uses global max-pooling to get the global feature
  • Yes, I think local features should also interact, like what's done in NSM
  • Iterative refinement is the same as what DGL does. We first predict [R1, T1] for each part, then transform them using these predicted poses, and predict again [R2, T2] (like a refinement), and transform, and predict again... Repeat this process multiple times
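A minimal sketch of that refinement loop (with `predict_pose` as a purely hypothetical stand-in for the pose prediction network, which here just nudges each part toward the origin):

```python
import numpy as np

def predict_pose(parts):
    """Hypothetical pose head: returns a rotation matrix and a translation
    for each part. Here it simply moves every part halfway to the origin."""
    R = np.stack([np.eye(3) for _ in parts])              # [P, 3, 3]
    T = -0.5 * np.stack([p.mean(axis=0) for p in parts])  # [P, 3]
    return R, T

def transform(parts, R, T):
    # Apply each part's predicted pose to its point cloud.
    return [pts @ R[i].T + T[i] for i, pts in enumerate(parts)]

def iterative_refine(parts, num_iters=3):
    poses = []
    for _ in range(num_iters):
        R, T = predict_pose(parts)      # predict [R_i, T_i] from current geometry
        parts = transform(parts, R, T)  # move the parts, then predict again
        poses.append((R, T))
    return parts, poses

parts = [np.random.randn(1024, 3) + c for c in ([5, 0, 0], [0, 5, 0])]
refined, poses = iterative_refine(parts)
```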


Wuziyi616 commented on July 1, 2024

Hi, thanks for your interest in our work. Here are my answers:

  • Regarding 1,2,3, unfortunately, that config is NOT for NSM, i.e. we don't implement NSM in this codebase. This is because NSM applies Transformer over every point in each part (i.e. the Transformer input is num_part x N tokens, N~1000), which requires lots of memory when the num_part is large (note NSM only uses 2 parts so the GPU memory is affordable). The pn_transformer implemented here applies Transformer over the PointNet global feature from each part (i.e. the Transformer input is only num_part tokens), so the memory is much lower.
  • Though we didn't implement NSM, we did find it works better than the baselines and pn_transformer in the everyday subset + 2-part only setting. I think this is because NSM learns local surface features, which are important for geometric assembly. On the other hand, all the baselines directly apply PointNet to extract global features from each part and perform reasoning over them, which ignores the rich surface features.
  • Loss matching: in semantic assembly, say we want to assemble a chair with 4 legs. The 4 legs are usually the same, i.e. they are geometrically equivalent in their canonical poses. So when calculating the loss, if leg1 is put in the correct position of leg2, and leg2 is put in the correct position of leg1, etc., the loss should be 0. To account for such equivalence, in semantic assembly we need to do loss matching, i.e. match the predicted parts with their ground-truth parts via Hungarian matching (see this function).
  • However, in geometric assembly, parts are randomly broken, so usually, there are no geometrically equivalent parts. Therefore, we don't need to apply the Hungarian algorithm for loss matching.
  • The final loss is calculated here. Basically, we take (key, value) from the loss dict, if the key looks like xxx_loss, we multiply it with the xxx_loss_w in the cfg and accumulate it to the total loss.
  • For the difference between geo. and sem. assembly, please see our paper and Twitter thread explanation (especially the 4/12 one). So basically, building a chair from legs, arms, seat, and back is considered sem. assembly, because all parts have semantic meanings. On the other hand, building a broken vase is considered geo. assembly, as they are random fractures, and don't have semantic meanings.
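To illustrate the loss-matching idea, here is a minimal sketch using SciPy's Hungarian solver (not the repo's actual function; for brevity the cost is just the squared distance between part centers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_loss(pred_centers, gt_centers):
    """Match predicted parts to GT parts so that geometrically equivalent
    parts (e.g. 4 identical chair legs) can be swapped at zero cost."""
    # Pairwise squared distances: cost[i, j] = ||pred_i - gt_j||^2
    cost = ((pred_centers[:, None] - gt_centers[None]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)  # Hungarian matching
    return cost[row, col].mean()

# Two identical legs predicted in swapped positions:
gt = np.array([[1.0, 0, 0], [-1.0, 0, 0]])
pred = gt[::-1].copy()              # leg1 placed in leg2's slot and vice versa
assert match_loss(pred, gt) == 0.0  # matching makes the swap free
```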

Feel free to ask if you have more questions. Hope this helps!


ttsesm commented on July 1, 2024

@Wuziyi616 thanks for your time and the feedback. Please find my comments inline below.

Hi, thanks for your interest in our work. Here are my answers:

* Regarding 1,2,3, unfortunately, that config is NOT for NSM, i.e. **we don't implement NSM in this codebase**. This is because NSM applies Transformer over every point in each part (i.e. the Transformer input is `num_part x N` tokens, N~1000), which requires lots of memory when the `num_part` is large (note NSM only uses 2 parts so the GPU memory is affordable). The `pn_transformer` implemented here applies Transformer over the PointNet global feature from each part (i.e. the Transformer input is only `num_part` tokens), so the memory is much lower.

I see, so in the end you just incorporated the adversarial loss on top of the `pn_transformer` implementation to check how it performs. Now it is a bit clearer. Interesting though.

* Though we didn't implement NSM, we did find it works better than the baselines and `pn_transformer` in the everyday subset + 2-part only setting. I think this is because **NSM learns local surface features**, which are important for geometric assembly. On the other hand, all the baselines directly apply PointNet to extract **global features** from each part and perform reasoning over them, which ignores the rich surface features.

What do you mean by "we did find it works better than the baselines and pn_transformer in the everyday subset + 2-part only setting"? As I understand it, you mean the NSM implementation, right? We agree that since NSM learns local surface features, it should be superior to the baselines, at least for geometric assembly, as you point out. Btw, is it possible to share the NSM implementation? The implementation from this repository is not complete, and when we contacted the author he pointed us to this repository, but on the other hand you do not have the NSM implementation here.

* Loss matching: in semantic assembly, say we want to assemble a chair with 4 legs. The 4 legs are usually the same, i.e. they are **geometrically equivalent** in their canonical poses. So when calculating the loss, if leg1 is put in the correct position of leg2, and leg2 is put in the correct position of leg1, etc., the loss should be 0. To account for such equivalence, in semantic assembly we need to do loss matching, i.e. match the predicted parts with their ground-truth parts via Hungarian matching (see this [function](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L184)).

* However, in geometric assembly, parts are randomly broken, so usually, there are no geometrically equivalent parts. Therefore, we don't need to apply the Hungarian algorithm for loss matching.

Ok, I see what you mean. It is clear now. Thanks.

* The final loss is calculated [here](https://github.com/Wuziyi616/multi_part_assembly/blob/dcda0aa88e5ddf9933095569932cdfbd34c6ff4e/multi_part_assembly/models/modules/base_model.py#L417-L422). Basically, we take `(key, value)` from the loss dict, if the key looks like `xxx_loss`, we multiply it with the `xxx_loss_w` in the `cfg` and accumulate it to the total loss.

I see, seems clear.
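Just to make sure I read it right, the accumulation would look roughly like this (a rough sketch assuming a flat loss dict and config, not the repo's actual code):

```python
def total_loss(loss_dict, cfg):
    """Sum every entry whose key ends in `_loss`, weighted by the
    matching `<name>_loss_w` entry in the config (sketch only)."""
    total = 0.0
    for key, value in loss_dict.items():
        if key.endswith('_loss'):
            total += value * cfg[f'{key}_w']
    return total

# Hypothetical example values:
loss_dict = {'trans_loss': 2.0, 'rot_loss': 1.0, 'point_acc': 0.9}
cfg = {'trans_loss_w': 1.0, 'rot_loss_w': 0.5}
total_loss(loss_dict, cfg)  # 2.0 * 1.0 + 1.0 * 0.5 = 2.5
```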

* For the difference between geo. and sem. assembly, please see our paper and [Twitter thread explanation](https://twitter.com/ycchen918/status/1586169332685471745) (especially the `4/12` one). So basically, building a chair from legs, arms, seat, and back is considered sem. assembly, because all parts have semantic meanings. On the other hand, building a broken vase is considered geo. assembly, as they are random fractures, and don't have semantic meanings.

Sure, understood. It makes sense; however, one could argue that sem. assembly is a sub-category of geo. assembly, since you could also consider the chair parts to have no semantic meaning and treat them as random fractures. In any case, I understand your point.

Thanks also for the Twitter link, it looks interesting.

Feel free to ask if you have more questions. Hope this helps!


Wuziyi616 commented on July 1, 2024

Indeed, I agree that sem. assembly seems to be a sub-category of geo. assembly. But there is much more information in sem. assembly, so it may be worth designing new algorithms there. I fully agree that a unified framework that can handle both tasks would be super interesting.

Regarding the NSM code, indeed I don't implement it here; I just tried a GAN over `pn_transformer` to see if it helps. Interestingly, the GAN adversarial loss doesn't seem to help. In my preliminary experiment with NSM, I didn't use the SDF loss + adv loss, so the implementation you mentioned should be good to use.


ttsesm commented on July 1, 2024

I see, and you found that the NSM from the aforementioned repository, considering only the rotation, translation, and point distance losses, performs better than the baselines in the everyday subset with the 2-part only setting? Because in my case it seems to perform worse.


Wuziyi616 commented on July 1, 2024

I used the same losses as the baselines, i.e. the `geometric_loss` with trans, rot, chamfer, l2, etc. I think some of these losses are important for geo. assembly.


ttsesm commented on July 1, 2024

So you switched the 3 losses from here to the ones you mentioned above? Interesting.

Also, in principle you could use local features instead of global ones in the baseline methods you are benchmarking here. Then, to avoid memory issues, you could test only a 2-part setting or a limited number of parts, e.g. up to 5.

Can you also explain a bit about the padding in the data? Because in practice, after some debugging, I did not see it used... 🤔


Wuziyi616 commented on July 1, 2024

I'm not sure how we can use, say, DGL with local features. Currently, given N global features, we can easily build a GNN over them. But if we have N x P per-point features for all parts, how do you build the graph? Do you mean to treat each point as a node? I haven't tried that, but wouldn't that be super slow? From my experiments, DGL over global features is already very slow.

The padding is simply for batch processing in PyTorch. Since different shapes have different numbers of parts (say a chair is of shape [3, 1024] and a table is of shape [6, 1024]), we cannot stack them to form a batch. Also, the PyTorch DataLoader requires all loaded data to have the same shape in order to batch them. Of course, you can write a custom sampler, but I chose not to do that...
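A minimal sketch of that padding (hypothetical shapes; a validity mask marks which slots hold real parts so the padded slots can be ignored later):

```python
import numpy as np

def pad_parts(parts, max_parts):
    """Pad a [P, N, 3] part tensor to [max_parts, N, 3] with zeros,
    plus a boolean mask marking which slots hold real parts."""
    P, N, C = parts.shape
    padded = np.zeros((max_parts, N, C), dtype=parts.dtype)
    padded[:P] = parts
    mask = np.zeros(max_parts, dtype=bool)
    mask[:P] = True
    return padded, mask

chair = np.random.randn(3, 1024, 3)   # 3 parts
table = np.random.randn(6, 1024, 3)   # 6 parts
batch = np.stack([pad_parts(chair, 20)[0], pad_parts(table, 20)[0]])
batch.shape  # (2, 20, 1024, 3): now stackable into one batch
```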


ttsesm commented on July 1, 2024

Ok, I got the point of the padding. Since I was playing with examples of only 2-part settings, it was not really making any difference, which is why I couldn't see it in practice. Thanks for the elaboration ;-).

For extracting local and global features I am still a bit puzzled. What I mean is, extracting per-point (local) or per-cloud (global) features is up to the settings you pass to the encoder: depending on whether the `global_feats` flag in `dgcnn` or `pointnet` is set to True or False, you get the corresponding behavior. So this means that you could activate local feature extraction in your baselines as well, no?


Wuziyi616 commented on July 1, 2024

Feel free to reopen it if you have further questions!

Best,
Ziyi


ttsesm commented on July 1, 2024

Sure, thanks for your time ;-)


ttsesm commented on July 1, 2024

@Wuziyi616 I have three more questions.

  1. Why do the benchmark reports include results of your Transformer implementations for semantic assembly but not for geometric assembly?
  2. Why do you obtain the results by training individually per category on the everyday dataset and then averaging over all categories? Wouldn't it make more sense, and be more scientifically correct, to train the models on all categories at once and test on all categories?
  3. What is the difference between the relative errors and the initially reported (absolute) errors?

Thanks.


Wuziyi616 commented on July 1, 2024
  1. This is simply because we didn't find improvement using Transformer in geo. assembly, so we don't include them. On the other hand, Transformer does outperform most of the baselines in sem. assembly, so we include them in the sem. report.
  2. We also have results of training on all categories and testing on all; see Table 11 in the Appendix of our paper. Basically, the upper part of the table shows results trained on each category individually, and the bottom part shows training/testing on all categories together.
  3. You can see the explanation of the relative metrics here (the first bullet point under Experimental). The relative metrics are also designed to handle symmetry ambiguity. Let's imagine a bottle that is broken into 5 pieces. As long as the relative poses between the pieces are correct, they will form a perfectly assembled bottle. In this case, the relative metrics will always give 0 error. On the other hand, the absolute metrics (e.g. rot_mse) may still give a non-zero error if there is a global transformation of the bottle compared to GT.
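The bottle example can be checked numerically (a sketch; rotations are omitted for brevity, using only part centers and a global shift):

```python
import numpy as np

def absolute_err(pred, gt):
    # Mean distance between predicted and GT part positions.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def relative_err(pred, gt):
    # Compare pairwise offsets between parts instead of absolute poses.
    pred_rel = pred[:, None] - pred[None]
    gt_rel = gt[:, None] - gt[None]
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()

gt = np.random.randn(5, 3)         # GT positions of 5 bottle pieces
pred = gt + np.array([1.0, 0, 0])  # perfect assembly, globally shifted

absolute_err(pred, gt)  # ~1.0 -> penalized by the absolute metric
relative_err(pred, gt)  # ~0   -> the relative metric ignores the global shift
```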


ttsesm commented on July 1, 2024
  1. Ok, I see.
  2. Sure, but if I look at the reported results in Table 11, you are still considering the results per category, which you average in the last column, or not?

As I understand it, what you did is train the model on all categories but still test individually per category, then average the results.

In my case, I've tried to train and test on all categories at the same time, based on the split data for the everyday subset, using the following commands respectively:
train:

python scripts/train.py --cfg_file $CFG --fp16 --cudnn

test:

python scripts/test.py --cfg_file $CFG --weight path/to/weight

and I've got on par results with the ones reported here. Thus, to be honest, I do not really see the reason to train individually per category (three times) and then average per category as well as over the three runs, when you can just report the results over all categories together. In my opinion it becomes too complicated without reason. Anyways, do not get me wrong, it is welcome that you have done this ablation study :-).

  3. Ok, I see the point.

Btw, I am trying to play with the settings for extracting local features instead of global ones, but it seems it is not something that can be applied directly by just setting the `global_feats=False` flag. Changes in the code need to be applied, right?


ttsesm commented on July 1, 2024

Does it make sense to fuse the local features into global ones per piece? For example, let's say I have a tensor like [Batch, Pieces, Points, Local Features], e.g. [6, 20, 1024, 256]; would it make sense to transform it to [Batch, Pieces, Fused Features], i.e. [6, 20, 256]? I guess this would make sense to be done on the Transformer side, right?


Wuziyi616 commented on July 1, 2024

This is a good question and something worth looking into. As you can see, currently we're transforming local features into global features simply with a max-pooling operation. You can try better aggregation methods, but I don't know how much they will help. I think another direction is to first let local features interact with local features from other parts, then aggregate them into a global one and predict the pose from it.

Also, we're currently using a PointNet to extract part features. PointNet is definitely not good at local feature extraction, so you could try replacing it with PointNet++ or DGCNN first.
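Shape-wise, the two schemes could be sketched like this (a toy sketch; the dot-product attention below is just a stand-in for a real Transformer layer, with small hypothetical sizes):

```python
import numpy as np

def pool_global(local_feats):
    """Current scheme: per-part max-pool, [P, N, C] -> [P, C]."""
    return local_feats.max(axis=1)

def interact_then_pool(local_feats):
    """Suggested alternative: let every point attend over all points of
    all parts first (toy dot-product attention), then max-pool per part."""
    P, N, C = local_feats.shape
    flat = local_feats.reshape(P * N, C)
    attn = flat @ flat.T / np.sqrt(C)                      # [P*N, P*N] scores
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))  # stable softmax
    attn /= attn.sum(axis=-1, keepdims=True)
    mixed = (attn @ flat).reshape(P, N, C)                 # cross-part mixing
    return mixed.max(axis=1)

feats = np.random.randn(2, 256, 32)  # 2 parts, 256 points, 32-dim features
pool_global(feats).shape        # (2, 32)
interact_then_pool(feats).shape  # (2, 32)
```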


ttsesm commented on July 1, 2024

Interesting, thanks for the feedback. Actually, I am trying to use your Transformer solution and apply changes based on it. By max pooling you mean the poolings in the encoders here and here, right?

So in principle what you are suggesting is to let the features interact in the Transformer and then aggregate them before passing them to the pose predictor.

Can you also elaborate a bit on what you do in the iterative refinement version, since I am not sure I got it right?

Indeed, PointNet is not good at extracting local features. DGCNN seems to be a better alternative, which I am trying at the moment, but I would also like to test KPConv (or SPConv), which seems to be far superior to both.


ttsesm commented on July 1, 2024

Thanks ;-)

