Fault-tolerant, highly scalable GPU orchestration, and a machine learning framework designed for training models with billions to trillions of parameters
I tried to use it with an internal GitHub instance (which behaves differently from public GitHub) and got an error.
In many cases the error messages are unhelpful, as in the first example.
I tried to use higgsfield manually and got a lot of messages like "something is not a string". Some quick debugging showed that I had either omitted or mistyped a command-line parameter; the messages could be improved by naming the offending parameter.
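For illustration only (this is not higgsfield's actual CLI, just a hypothetical sketch), per-parameter validation of the kind `argparse` gives makes such errors self-explanatory:

```python
import argparse

def build_parser():
    """Hypothetical CLI sketch: every flag is named and typed, so a bad or
    missing value produces an error pointing at the exact parameter."""
    p = argparse.ArgumentParser(prog="train")
    p.add_argument("--run-name", type=str, required=True,
                   help="experiment name (string)")
    p.add_argument("--lr", type=float, default=3e-4,
                   help="learning rate (float)")
    return p

# argparse reports e.g. "argument --lr: invalid float value: 'abc'"
# instead of a bare "something is not a string".
```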
When I import the llama loader, it automatically tries to access Hugging Face without asking my permission. Accessing the internet without an explicit call is a big red flag from a security point of view. In my case I've already downloaded everything and don't need to connect to HF at all.
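A workaround that might help until this is addressed: Hugging Face libraries honor offline environment variables, so local-only loading can be forced before importing any loader. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are standard Hugging Face env vars; whether higgsfield's loader respects them is an assumption.

```python
import os

# Force Hugging Face tooling into offline mode *before* importing any loader.
# These variables are honored by huggingface_hub / transformers / datasets.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# transformers also accepts local_files_only=True on from_pretrained(...)
# as a per-call equivalent (assumption: higgsfield forwards such kwargs).
```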
It would be nice to see more examples:
- a very simple, manually implemented architecture that supports DeepSpeed/ZeRO distributed training;
- an example showing how to run everything manually, without GitHub or HF access;
- the ability to run the code on a single machine with a single GPU, and on a single machine with multiple GPUs.
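As a starting point for the first request, a minimal DeepSpeed ZeRO stage-2 config could look like this (standard `ds_config.json` keys; the batch size and precision values are placeholders):

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  }
}
```

A single-machine multi-GPU run would then be something like `deepspeed --num_gpus=4 train.py --deepspeed_config ds_config.json`, while single-GPU debugging stays a plain `python train.py` (the script name here is hypothetical).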
Otherwise, how do you expect people to debug their code? I wanted to run a simple example without setting up my machines or going through GitHub, and found it impossible, which in my opinion is a big problem.
Overall, great job and a nice implementation, but it could be much more user-friendly.
Thanks!