Comments (1)
Workaround

If I change "bf16": {"enabled": True} to "fp16": {"enabled": True} in the DeepSpeed config, the output becomes:
Using quantizer for weights: CUDAQuantizer
[2024-06-10 21:48:13,828] [INFO] [partition_parameters.py:562:patch_init_and_builtins] Enable Zero3 engine with INT4 quantization.
[2024-06-10 21:48:14,225] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 603, num_elems = 3.30B
[2024-06-10 21:48:18,472] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-06-10 21:48:18,473] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-06-10 21:48:18,652] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-06-10 21:48:18,653] [INFO] [utils.py:780:see_memory_usage] MA 1.78 GB Max_MA 2.15 GB CA 2.26 GB Max_CA 2 GB
[2024-06-10 21:48:18,654] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 7.4 GB, percent = 23.6%
Parameter Offload: Total persistent parameters: 92160 in 45 params
[2024-06-10 21:48:18,847] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-06-10 21:48:18,847] [INFO] [utils.py:780:see_memory_usage] MA 1.78 GB Max_MA 1.78 GB CA 2.26 GB Max_CA 2 GB
[2024-06-10 21:48:18,848] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 7.4 GB, percent = 23.6%
[2024-06-10 21:48:18,849] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-06-10 21:48:18,850] [INFO] [config.py:1000:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-06-10 21:48:18,851] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-06-10 21:48:18,852] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-06-10 21:48:18,852] [INFO] [config.py:1000:print] amp_params ................... False
[2024-06-10 21:48:18,853] [INFO] [config.py:1000:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-06-10 21:48:18,855] [INFO] [config.py:1000:print] bfloat16_enabled ............. False
[2024-06-10 21:48:18,856] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-06-10 21:48:18,857] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-06-10 21:48:18,857] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-06-10 21:48:18,859] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-06-10 21:48:18,859] [INFO] [config.py:1000:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f2b900e22d0>
[2024-06-10 21:48:18,860] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-06-10 21:48:18,860] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-06-10 21:48:18,861] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-06-10 21:48:18,863] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-06-10 21:48:18,863] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-06-10 21:48:18,864] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-06-10 21:48:18,865] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-06-10 21:48:18,865] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-06-10 21:48:18,866] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-06-10 21:48:18,866] [INFO] [config.py:1000:print] dump_state ................... False
[2024-06-10 21:48:18,867] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-06-10 21:48:18,868] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-06-10 21:48:18,868] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-06-10 21:48:18,869] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-06-10 21:48:18,869] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-06-10 21:48:18,870] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-06-10 21:48:18,871] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-06-10 21:48:18,871] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-06-10 21:48:18,872] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-06-10 21:48:18,872] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-06-10 21:48:18,873] [INFO] [config.py:1000:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-06-10 21:48:18,874] [INFO] [config.py:1000:print] fp16_auto_cast ............... False
[2024-06-10 21:48:18,874] [INFO] [config.py:1000:print] fp16_enabled ................. True
[2024-06-10 21:48:18,875] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-06-10 21:48:18,875] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-06-10 21:48:18,877] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-06-10 21:48:18,878] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1
[2024-06-10 21:48:18,879] [INFO] [config.py:1000:print] gradient_clipping ............ 0.0
[2024-06-10 21:48:18,879] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-06-10 21:48:18,880] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-06-10 21:48:18,880] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-06-10 21:48:18,881] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 65536
[2024-06-10 21:48:18,882] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-06-10 21:48:18,883] [INFO] [config.py:1000:print] loss_scale ................... 0
[2024-06-10 21:48:18,883] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-06-10 21:48:18,884] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-06-10 21:48:18,885] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-06-10 21:48:18,885] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-06-10 21:48:18,886] [INFO] [config.py:1000:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-06-10 21:48:18,887] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-06-10 21:48:18,887] [INFO] [config.py:1000:print] optimizer_name ............... None
[2024-06-10 21:48:18,888] [INFO] [config.py:1000:print] optimizer_params ............. None
[2024-06-10 21:48:18,889] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-06-10 21:48:18,890] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-06-10 21:48:18,890] [INFO] [config.py:1000:print] pld_params ................... False
[2024-06-10 21:48:18,891] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-06-10 21:48:18,895] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-06-10 21:48:18,896] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-06-10 21:48:18,896] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-06-10 21:48:18,897] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-06-10 21:48:18,897] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-06-10 21:48:18,898] [INFO] [config.py:1000:print] steps_per_print .............. 10
[2024-06-10 21:48:18,899] [INFO] [config.py:1000:print] train_batch_size ............. 1
[2024-06-10 21:48:18,899] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 1
[2024-06-10 21:48:18,900] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-06-10 21:48:18,900] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-06-10 21:48:18,901] [INFO] [config.py:1000:print] wall_clock_breakdown ......... False
[2024-06-10 21:48:18,902] [INFO] [config.py:1000:print] weight_quantization_config ... q_type='symmetric' q_groups=1 enabled=True num_bits=8 quantized_initialization={'num_bits': 4, 'group_size': 64, 'group_dim': 1, 'symmetric': False} post_init_quant={}
[2024-06-10 21:48:18,902] [INFO] [config.py:1000:print] world_size ................... 1
[2024-06-10 21:48:18,903] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False
[2024-06-10 21:48:18,904] [INFO] [config.py:1000:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=True zero_quantized_nontrainable_weights=True zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-06-10 21:48:18,904] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-06-10 21:48:18,905] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-06-10 21:48:18,905] [INFO] [config.py:1000:print] zero_optimization_stage ...... 3
[2024-06-10 21:48:18,906] [INFO] [config.py:986:print_user_config] json = {
"zero_optimization": {
"load_from_fp32_weights": false,
"stage": 3,
"zero_quantized_weights": true,
"zero_quantized_nontrainable_weights": true
},
"train_micro_batch_size_per_gpu": 1,
"fp16": {
"enabled": true
},
"weight_quantization": {
"quantized_initialization": {
"num_bits": 4,
"group_size": 64,
"group_dim": 1,
"symmetric": false
}
}
}
['<s>metro.org/metro-pilots-plan-to-build-new-metro-']
The generated text is now "metro.org/metro-pilots-plan-to-build-new-metro-"; it no longer repeats the same token.
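For context, here is a minimal end-to-end sketch of how this workaround config might be applied. The config dict is taken verbatim from the `print_user_config` json in the log above; the model name, prompt, and generation parameters are illustrative placeholders, not taken from the original report.

```python
# Minimal sketch of the fp16 workaround (run under the deepspeed launcher).
# Assumptions: a Hugging Face causal LM of roughly the logged size (3.30B
# params); model name and prompt below are hypothetical.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "zero_optimization": {
        "load_from_fp32_weights": False,
        "stage": 3,
        "zero_quantized_weights": True,
        "zero_quantized_nontrainable_weights": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    # The workaround: "fp16" here instead of "bf16" avoids the repeated-token output.
    "fp16": {"enabled": True},
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 4,
            "group_size": 64,
            "group_dim": 1,
            "symmetric": False,
        }
    },
}

model_name = "EleutherAI/gpt-neo-2.7B"  # placeholder for a ~3B-parameter model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# zero.Init partitions and INT4-quantizes weights as the model is constructed;
# this is what triggers "Enable Zero3 engine with INT4 quantization" in the log.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.eval()

inputs = tokenizer("metro", return_tensors="pt").to(engine.device)  # prompt is illustrative
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(outputs))
```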