Our current design keeps the OrtAllocator CUDA allocator alive until you exit, so the CUDA memory pool will not shrink to zero until that point. We could potentially add a way to release this allocator once no objects are allocated from it.
At present this causes problems: on devices with less GPU memory it is not possible to run inference efficiently multiple times, and inference gets slower and slower as GPU memory usage approaches 100%.
I don't want to quit after every inference, because reloading the model causes a cold start and makes each single inference take longer. Ideally, the GPU memory used by an inference would be released once that inference finishes, while the loaded model stays in memory.
I am using Phi3.5mini-cuda-fp16 with an NVIDIA GPU (24 GB memory).
When I load the model, 8490 MiB of GPU memory is in use.
After running an inference of about 3K tokens, GPU memory usage rises to 10580 MiB.
If I continue the conversation afterwards, GPU memory usage keeps rising.
If I stop the conversation, the memory does not decrease even after leaving it idle for an hour.
I don't know whether this is a bug; the behavior seems to have existed since 0.4, and 0.5.2 behaves the same way.
Or did I miss something?
This is my code. I did not forget to release any object; of course, the Model object is not released, because we need to reuse it.
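(The original snippet is not reproduced here. As a rough illustration only, the usage pattern described above would look something like the sketch below. It assumes the onnxruntime-genai Python API; the `params.input_ids` / `compute_logits` calls follow the 0.4/0.5-era interface and may differ in newer versions, and the model path and prompt are placeholders.)

```python
import onnxruntime_genai as og

# Load the model once and keep it resident so later requests avoid a cold start.
model = og.Model("phi3.5-mini-instruct-cuda-fp16")  # placeholder path
tokenizer = og.Tokenizer(model)

def run_inference(prompt: str) -> str:
    """Run one generation and drop the per-request objects afterwards."""
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    params.input_ids = input_tokens  # 0.4/0.5-era API

    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

    text = tokenizer.decode(generator.get_sequence(0))

    # Release the per-request objects; only `model` and `tokenizer` stay alive.
    del generator
    del params
    return text

# Repeated calls reuse the loaded model, but the GPU memory held by the
# CUDA allocator is not returned to the system between calls.
print(run_inference("Hello, how are you?"))
```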