NCCL stuck #2
Comments
Hello, did you solve it?
Could you change to use GLOO as the backend? Does it work?
Hello, I'm hitting the same problem. Have you solved it?
Could you switch to the GLOO backend (for both P2P and collective communication) to see whether it works? We haven't fixed NCCL with torchrun yet.
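For reference, switching the backend typically comes down to the `torch.distributed.init_process_group` call. A minimal sketch using the standard `torch.distributed` API (nothing here is specific to this repo), assuming the environment-variable rendezvous that torchrun sets up:

```python
import torch.distributed as dist

# Use GLOO instead of NCCL for both P2P and collective communication.
# NCCL point-to-point ops can deadlock when two ranks send and receive
# at the same time; GLOO avoids that failure mode at some performance cost.
dist.init_process_group(backend="gloo", init_method="env://")
```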
I didn't solve the stuck problem, but I suspect it is caused by simultaneous sends and receives between ranks. Batched P2P communication ops, such as those used in Megatron-LM, could solve it. Code like this:

```python
# Excerpt from Megatron-LM's p2p_communication.py.
def send_forward_recv_backward(output_tensor: torch.Tensor,
                               tensor_shape: Shape,
                               config: ModelParallelConfig) -> torch.Tensor:
    """Batched send and recv with next rank in pipeline.

    See _communicate for argument details.
    """
    if core.parallel_state.is_pipeline_last_stage():
        # The last stage has no next rank to receive a gradient from.
        output_tensor_grad = None
    else:
        if config.timers is not None:
            config.timers('forward-send-backward-recv', log_level=2).start()
        # Send the forward activation and receive the backward gradient
        # as a single batched operation, so the two cannot deadlock.
        _, output_tensor_grad, _ = _communicate(
            tensor_send_next=output_tensor,
            tensor_send_prev=None,
            recv_prev=False,
            recv_next=True,
            tensor_shape=tensor_shape,
            config=config)
        if config.timers is not None:
            config.timers('forward-send-backward-recv').stop()
    return output_tensor_grad
```
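For context, the `_communicate` helper above fuses the paired send and recv into one batched group. A standalone sketch of the same idea using `torch.distributed.batch_isend_irecv` (the function name and rank arguments here are illustrative, not taken from this repo or Megatron-LM):

```python
import torch
import torch.distributed as dist

def send_recv_batched(send_tensor, recv_shape, next_rank, prev_rank):
    """Issue the send and the recv as one batched group so the backend
    cannot deadlock on the ordering of the two ops.
    next_rank/prev_rank are hypothetical rank variables for this sketch."""
    recv_tensor = torch.empty(recv_shape, device=send_tensor.device,
                              dtype=send_tensor.dtype)
    ops = [
        dist.P2POp(dist.isend, send_tensor, next_rank),
        dist.P2POp(dist.irecv, recv_tensor, prev_rank),
    ]
    # batch_isend_irecv launches all ops as a single fused group.
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    return recv_tensor
```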
It works! Thanks a lot!
I tried changing the Slurm script (i.e., prof_steps.sh) to torchrun and running it directly, but the run got stuck with NCCL as collective_backend. The torchrun script is as follows:
When I choose 'gpipe' or '1f1b' as the pipeline method, it works normally. However, selecting 'interleave' results in a loss of 0, while 'chimera' causes the program to get stuck and then raise a timeout error.