.Net How to free GPU memory after each inference #1131

Open
strikene opened this issue Dec 9, 2024 · 4 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

strikene commented Dec 9, 2024

I am using Phi-3.5-mini (CUDA, fp16) with an NVIDIA GPU (24 GB of memory).

When I load the model, 8490 MiB of GPU memory is in use.

[screenshot: GPU memory usage after loading the model]

After I run an inference of about 3K tokens, GPU memory usage rises to 10580 MiB.

[screenshot: GPU memory usage after one inference]

If I continue the conversation afterwards, GPU memory keeps rising.

[screenshot: GPU memory usage after further turns]

If I stop the conversation, the memory does not decrease even after sitting idle for an hour.

I don't know whether this is a bug; the behavior seems to have existed since 0.4, and 0.5.2 behaves the same way. Or did I miss something?

This is my code. I did not forget to release any object; of course, the Model object is not released because we need to reuse it.

[screenshot of the posted C# code]
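Since the code above exists only as a screenshot, here is a minimal sketch of the pattern being described: one Model reused across calls, with the per-inference objects disposed after each call. It assumes the Microsoft.ML.OnnxRuntimeGenAI C# API roughly as of 0.4/0.5 (Tokenizer, GeneratorParams, Generator); exact method names may differ between versions, and this class is illustrative rather than the code from the screenshot.

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Model and Tokenizer are loaded once and reused; only the
// per-inference objects are created and disposed on each call.
sealed class Phi35Runner : IDisposable
{
    private readonly Model _model;
    private readonly Tokenizer _tokenizer;

    public Phi35Runner(string modelPath)
    {
        _model = new Model(modelPath);      // intentionally kept alive for reuse
        _tokenizer = new Tokenizer(_model);
    }

    public string Run(string prompt)
    {
        var sequences = _tokenizer.Encode(prompt);

        using var generatorParams = new GeneratorParams(_model);
        generatorParams.SetSearchOption("max_length", 3072);
        generatorParams.SetInputSequences(sequences);

        using var generator = new Generator(_model, generatorParams);
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
        }

        // Decode the single output sequence.
        return _tokenizer.Decode(generator.GetSequence(0));
    }

    public void Dispose()
    {
        _tokenizer.Dispose();
        _model.Dispose();
    }
}
```

Even with every per-inference object disposed this way, the reporter sees GPU memory held between calls, which is what the rest of the thread discusses.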

RyanUnderhill (Member) commented Dec 11, 2024

Our current design keeps the CUDA OrtAllocator alive until you exit, so the CUDA memory pool will not shrink back to zero until that point. We could potentially add a way to release this allocator when no objects are allocated from it.

strikene (Author) commented

> Our current design keeps the CUDA OrtAllocator alive until you exit, so the CUDA memory pool will not shrink back to zero until that point. We could potentially add a way to release this allocator when no objects are allocated from it.

At present this is a real problem: on devices with less GPU memory, it is not possible to run many inferences efficiently, and inference gets slower and slower as GPU memory usage approaches 100%.
I don't want to exit after every inference (reloading the model means a cold start, which makes a single inference take longer). We would prefer to release the GPU memory used by an inference once it finishes, while keeping the model loaded.

RyanUnderhill added the enhancement (New feature or request) and bug (Something isn't working) labels on Dec 12, 2024
RyanUnderhill (Member) commented

The memory shouldn't be growing every time; that might be a bug. Marked this as an enhancement & bug to track.

strikene (Author) commented

> The memory shouldn't be growing every time; that might be a bug. Marked this as an enhancement & bug to track.

I look forward to the next update, although it may take a while.
