
azcopy.exe v10.27.0 memory usage #2855

Open
ep-ss-2021 opened this issue Nov 5, 2024 · 18 comments

@ep-ss-2021

Hi,

It was noticed that azcopy.exe v10.27.0 forces 16 GB RAM Windows Azure Virtual Machine Scale Set instances (Azure DevOps agents) to show the "Free memory is lower than 5%" pipeline warning after ~1 h of execution during a long cross-container copy within an Azure storage account.
It has happened since Thursday 31-Oct-2024. The previous version, 10.26.0, works fine.
The following agent log fragment, the last in-memory record, was shown for one of the pipeline failures:
"/golang.org/[email protected]/src/runtime/proc.go: 424, time.go: 285, asm_amd64.s:1700"
Since that version included a Golang 1.22.5 -> 1.23.1 dependency update, I am not sure whether this could be related.

Thanks

@gapra-msft
Member

Hi @ep-ss-2021, does your workload consist of a large directory structure / many directories?

@ep-ss-2021
Author

ep-ss-2021 commented Nov 5, 2024

Hi @ep-ss-2021, does your workload consist of a large directory structure / many directories?

Approximately: around 10 root "folders" with ~6 levels of subfolder depth, and 10-50 files in almost every subfolder. If it is more convenient, I can create a Microsoft support ticket and mention you there, so you can contact me through MS Teams if a call would be quicker.

@gapra-msft
Member

No, that's not necessary.

The only AzCopy change in this release that I can think of that increased memory requirements is this: we now cache folder names to reduce redundant calls to create directories.
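
For illustration only, a directory-creation cache of roughly this shape (a hypothetical sketch in Go, not AzCopy's actual implementation) holds one entry per folder ever seen, so on namespaces with millions of folders it can grow without bound:

```go
package main

import (
	"fmt"
	"sync"
)

// folderCache remembers every folder that has already been created so that
// later transfers into the same folder can skip the redundant create call.
type folderCache struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newFolderCache() *folderCache {
	return &folderCache{seen: make(map[string]struct{})}
}

// shouldCreate reports whether path still needs a create call, recording it
// either way. Entries are never evicted, so the map's memory footprint
// scales with the number of distinct folders in the job.
func (c *folderCache) shouldCreate(path string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[path]; ok {
		return false
	}
	c.seen[path] = struct{}{}
	return true
}

func main() {
	c := newFolderCache()
	fmt.Println(c.shouldCreate("root/sub/a")) // true: first sighting, create it
	fmt.Println(c.shouldCreate("root/sub/a")) // false: cached, skip the call
}
```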

Does the warning result in the job entirely failing? Would you be able to share how much memory AzCopy v10.26 used vs v10.27?

@ep-ss-2021
Author

Hi Gauri,

Yes, the agents fail entirely after ~1-1.5 h with "out of memory" or "stack overflow" errors, plus the Azure DevOps pipeline-level error "We stopped hearing from agent XXXX. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error." on the step where we used azcopy.
The agent is a Standard D4ds_v5 Azure SKU (16 GB of RAM); usage was checked in Task Manager over the first several minutes after the azcopy process started:

  1. v10.26 case:
    1.1) after 2 min the agent has 20% of RAM used; the process consumes 50-90% CPU and 0.9 GB RAM.
    1.2) after 5 min the agent has 20% of RAM used; the process consumes 50-90% CPU and 0.9 GB RAM.
  2. v10.27 case:
    2.1) after 2 min the agent has 40% of RAM used; the process consumes 80-90% CPU and 4 GB RAM.
    2.2) after 5 min the agent has 90% of RAM used; the process consumes 80-90% CPU and 9 GB RAM.

Thanks,
Sergey

@gapra-msft
Member

gapra-msft commented Nov 7, 2024

@ep-ss-2021 Could you please share logs for the v10.26 and v10.27 runs? That might help expedite the investigation here. You can email them to azcopydev AT microsoft.com.

AzCopy also supports memory profiling: set the AZCOPY_PROFILE_MEM environment variable to the file path where you want the memory profile written. If you could share those files for the 10.26 and 10.27 runs, that would also be helpful.
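
For anyone capturing these profiles: assuming AzCopy follows the standard Go runtime/pprof pattern for heap profiling, the mechanism looks roughly like the sketch below (illustrative only; the env-var handling here is an assumption, not AzCopy's actual code):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile to the path named by
// AZCOPY_PROFILE_MEM, if the variable is set. (Hypothetical sketch.)
func writeHeapProfile() {
	path := os.Getenv("AZCOPY_PROFILE_MEM")
	if path == "" {
		return // profiling not requested
	}
	f, err := os.Create(path)
	if err != nil {
		log.Printf("could not create memory profile: %v", err)
		return
	}
	defer f.Close()
	runtime.GC() // force a collection so the profile reflects live objects
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Printf("could not write memory profile: %v", err)
	}
}

func main() {
	// ... transfer work would happen here ...
	writeHeapProfile()
}
```

The resulting file is a standard pprof profile, so it can be inspected with `go tool pprof <path>` (for example, the `top` command inside the interactive prompt shows the allocation sites holding the most memory).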

@gapra-msft
Member

Hi @ep-ss-2021, we just released a patch that should resolve the issue. Please upgrade to 10.27.1.

@ep-ss-2021
Author

ep-ss-2021 commented Nov 14, 2024

Hi Colleagues,
Still the same error with 10.27.1 (CPU usage was lower, 60-90%; RAM still >90%). I'll try to raise a standard MS support request and mention you, so MS can take the logs from the exact agent instances.
Thanks,
Sergey.

@sarbiaAtAdobe

Hi there, I've been impacted by OOMs with both 10.27.0 and 10.27.1 when copying from a storage container with millions of blobs, and I'm using a Linux VM with 128 GB of RAM allocated. There were no issues with smaller storage containers, but in this case RAM availability quickly nosedived, resulting in consistently repeatable OOMs. 10.26.0 always worked fine in this regard, and I have now downgraded to that release to complete the cross-blob-storage copy I had to perform.

@ep-ss-2021
Author

Hi Gauri,
I've just created an MS ticket, so I hope the corresponding team finds you internally and we can have a session tomorrow if you are free.
Thanks,
Sergey

@sarbiaAtAdobe

sarbiaAtAdobe commented Nov 14, 2024

For the record, I could complete my azcopy copy using 10.26.0 (and, as a side note, it would be nice if it were easier to locate older binaries). So it is definitely not an issue with what I was attempting to copy (at times I feared some content of my storage container might have had an unusual feature that would cause even older releases to fail), but rather with the software I was using to attempt the copy.

@keikohashizume

keikohashizume commented Nov 27, 2024

We have faced the same issue with both versions, 10.27.0 and 10.27.1.
It is the same scenario, where we are copying hundreds of thousands of files.
@sarbiaAtAdobe did you get it to work with 10.27.1?

@sarbiaAtAdobe

Thanks @keikohashizume; as per my previous messages, "I've been impacted by OOMs with both 10.27.0 and 10.27.1" and "I could complete my azcopy copy using 10.26.0". Unfortunately, the issue we're commenting on has been closed, and I'm unsure if there's an open one for MSFT to address this. What I can say is that with a blob count in the hundreds of thousands, I could complete copies even with 10.27.0 (but I'm using a 128 GB RAM VM). But when we get to tens of millions, even 128 GB is not enough. It's clearly a memory leak.
I'm sticking to 10.26 for the foreseeable future; it would be good to find an open issue to track so we can see when this will actually be resolved.

@ep-ss-2021
Author

Hi Colleagues,
Let me share some updates: I've raised an official ticket with Microsoft and provided the details (logs, trace), so that's now under investigation. I've also suggested to the MS team that this issue should probably be reopened here.

@gidiLuke

Thanks @ep-ss-2021
Looking forward to hearing an update once you get a reply.

@ep-ss-2021
Author

Hi @gapra-msft, @seanmcc-msft, could you reopen this issue? Based on the MS ticket, I assume it's not resolved yet.

@echee-insitro

echee-insitro commented Dec 13, 2024

We confirmed that a 30M-object, 1.5 TB server-to-server copy (S3 --> Blob Storage) has a reproducible memory leak (it fails on both 32 GB and 128 GB clients). Client machines become unresponsive due to OOM.

The same dataset copies successfully when the azcopy binary is replaced with version 10.25.1 (running on the 128 GB client, system RAM maxes out at 3 GB).

AZCOPY_CONCURRENCY_VALUE=3000 was set in all cases.

[image attachment]

@keikohashizume

@ep-ss-2021 I would like to know when the fix for this issue will be released.

gapra-msft reopened this Jan 7, 2025
@gapra-msft
Member

Hi all, we are currently working on root-causing the memory usage issue. We hope to release a fix with the next scheduled AzCopy release later this month. We will keep you updated.
