I ran `scripts/search_dist.sh` for the llama_hf model on a single node with 8 A800 GPUs:
```shell
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
MODEL_SIZE="llama-13b"
MEMORY=75
MODEL_ARGS="
--model_size ${MODEL_SIZE}
--set_model_config_manually 0
--set_layernum_manually 0
--vocab_size 32000
--hidden_size 5120
--num_hidden_layers 40
--num_attention_heads 40
--seq_length 2048"
BSZ_ARGS="
--min_bsz 64
--max_bsz 64
--bsz_scale 16
--settle_bsz -1
--recommend_min_bsz 0
"
SEARCH_SPACE_ARGS="
--search_space full
--disable_dp 0
--disable_tp 0
--disable_pp 0
--disable_sdp 1
--disable_ckpt 0
--disable_tp_consec 0
--max_tp_deg 8
--max_pp_deg 8
"
SEARCH_ARGS="
${BSZ_ARGS}
${SEARCH_SPACE_ARGS}
${MODEL_ARGS}
--num_nodes ${NUM_NODES}
--num_gpus_per_node ${NUM_GPUS_PER_NODE}
--memory_constraint $MEMORY
--mixed_precision bf16
--pipeline_type pipedream_flush
--default_dp_type ddp
--embed_sdp 0
"
BACKGROUND=1
if [ $BACKGROUND -eq 1 ]; then
echo "Search in background..."
OUTPUT_FILE="Search_${MODEL_SIZE}${MEMORY}GB${NUM_NODES}Nodes_${NUM_GPUS_PER_NODE}GPUs_per_node_bsz64.log"
nohup python3 search_dist.py ${SEARCH_ARGS} 1> ${OUTPUT_FILE} 2>&1 &
else
echo "Search in foreground..."
python3 search_dist.py ${SEARCH_ARGS}
fi
```
The final search result reports `Max throughput=1.8973885326702808` samples/s.

I then ran `scripts/train_dist.sh` with the searched config file. The measured time per iteration is about 6.7 s, which with a global batch size of 64 corresponds to a throughput of about 9.55 samples/s. The prediction error seems large. Is anything wrong with my setup?
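For reference, here is how I computed the measured throughput and the gap against the search result (a quick sketch; the 6.7 s iteration time is taken from my `train_dist.sh` logs):

```python
# Compare the throughput predicted by search_dist.py with the one
# measured from an actual training run.
global_batch_size = 64          # matches --min_bsz/--max_bsz 64
iter_time_s = 6.7               # measured per-iteration wall time (from train logs)
predicted = 1.8973885326702808  # "Max throughput" reported by search_dist.py

measured = global_batch_size / iter_time_s   # samples/s
ratio = measured / predicted                 # how far off the prediction is

print(f"measured:  {measured:.2f} samples/s")
print(f"predicted: {predicted:.2f} samples/s")
print(f"ratio:     {ratio:.2f}x")
```

The measured throughput comes out to roughly 9.55 samples/s, about 5x the predicted value.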