Incomplete file cp to s3 mount #1133

Open
snowch opened this issue Nov 14, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@snowch

snowch commented Nov 14, 2024

Mountpoint for Amazon S3 version

1.10.0

AWS Region

n/a

Describe the running environment

Running on local S3 (Vast Data)

Mountpoint options

mount-s3 \
    --log-directory ~/s3.log \
    --debug-crt \
    --region VAST \
    --endpoint-url $AWS_ENDPOINT_URL \
    --allow-delete \
    --uid $(id -u jovyan) \
    --gid $(id -g jovyan) \
    --file-mode 0664 \
    --dir-mode 0775 \
    "$S3A_BUCKET" ${HOME}/s3

What happened?

Files copied to the S3 mount have a different checksum from the source file:

wget -c --quiet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

rm ../s3/nyc-data/yellow_tripdata_2024-01.parquet
cp yellow_tripdata_2024-01.parquet ../s3/nyc-data/
shasum yellow_tripdata_2024-01.parquet
shasum ../s3/nyc-data/yellow_tripdata_2024-01.parquet

The output:

a73b714de7672a752a58de34826e08fae203e91b  yellow_tripdata_2024-01.parquet
2cf39315d5c8e2cab85703477c0169ac9785fcb9  ../s3/nyc-data/yellow_tripdata_2024-01.parquet
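
A checksum mismatch alone does not show whether the copy is truncated, padded, or altered in place. A minimal sketch for narrowing that down, assuming the standard `wc` and `cmp` utilities are available; the `compare_copy` helper name is hypothetical and not part of Mountpoint:

```shell
# Hypothetical helper: report the sizes of two files and whether they differ.
compare_copy() {
    src=$1
    dst=$2
    # Byte counts: a size difference points at truncation or extra data.
    src_size=$(wc -c < "$src")
    dst_size=$(wc -c < "$dst")
    echo "source: $((src_size)) bytes, copy: $((dst_size)) bytes"
    # cmp exits 0 when the files are identical and non-zero when they differ,
    # printing the offset of the first differing byte.
    if cmp "$src" "$dst"; then
        echo "files are identical"
    else
        echo "files differ"
        return 1
    fi
}

# Usage with the paths from this report:
# compare_copy yellow_tripdata_2024-01.parquet ../s3/nyc-data/yellow_tripdata_2024-01.parquet
```

If the sizes differ, the problem is likely in how the object was stored or how its length is reported, rather than a flipped byte in transit.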

Relevant log output

https://gist.github.com/snowch/e2fc06bd420d92f060ee6e985ea3ed73
@snowch snowch added the bug Something isn't working label Nov 14, 2024
@monthonk
Contributor

monthonk commented Nov 15, 2024

Hi, thanks for reporting the issue. Are the checksums always different when uploading with Mountpoint, or only on some occasions?

I also noticed that you are using a third-party storage service. Do you know whether it supports additional checksums?

Mountpoint computes checksums for your data by default and sends them along with the data so that integrity can be verified on the server side. However, POSIX file operations such as read and write offer no built-in integrity mechanism, so data can still be corrupted in transit between your application and Mountpoint. More details are in the SEMANTICS doc.

@snowch
Author

snowch commented Nov 15, 2024

The checksums are always different. Thanks for sharing the semantics information. I've used s3cmd for my use case for now.

@monthonk
Contributor

Thanks for confirming. The logs you provided only cover the period up to completion of the MultipartUpload. Would you be able to also share the relevant logs from the read operation?

@snowch
Author

snowch commented Nov 19, 2024

Hopefully this should have everything? https://gist.github.com/snowch/1b401dcb5fc4320ee33fce60c4bc28c0

@muddyfish
Contributor

Hi, I've checked through your logs and I've found an inconsistency in the size of the object between writing and reading.

The original file is 49961641 bytes, and the total amount written in the MPU is 8388608 + 8388608 + 8388608 + 8388608 + 8388608 + 8018601 = 49961641 bytes. The server also reports the MPU as successful. However, when reading, we see that 49961922 bytes are requested from the service, which is 281 bytes more than were written. Given the information provided, we cannot be certain this is an issue with Mountpoint.
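
The arithmetic above can be checked directly. A minimal sketch using only the figures quoted from the logs; the variable names are illustrative:

```shell
# Part sizes and read size as quoted from the logs above.
full_part=8388608      # each of the first five parts is 8 MiB
last_part=8018601      # final, shorter part
written=$((5 * full_part + last_part))
requested=49961922     # bytes requested from the service on read

echo "written:   $written"
echo "requested: $requested"
echo "extra:     $((requested - written))"
```

The written total matches the original file size, so the discrepancy appears only on the read path.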

There are a few possibilities here:

  1. A general problem with Mountpoint.
  2. An incompatibility with the third-party service.
  3. The object has changed concurrently and has been overwritten externally.

Can you confirm that the object is not being changed on the service side? We will look into adding some additional logging to make it clearer in Mountpoint's logs when an object's content is changed on the backend.
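
One way to check for a server-side change is to look at the object's metadata right after the upload and again before reading. A rough sketch, assuming the AWS CLI is installed and configured with credentials for the same endpoint, and that the bucket root is mounted at ~/s3 so the key is `nyc-data/yellow_tripdata_2024-01.parquet`; the `show_object_metadata` helper name is hypothetical:

```shell
# Hypothetical helper: print size, ETag and last-modified time of the uploaded object.
# Assumes AWS_ENDPOINT_URL and S3A_BUCKET are set as in the mount command above.
show_object_metadata() {
    aws s3api head-object \
        --endpoint-url "$AWS_ENDPOINT_URL" \
        --bucket "$S3A_BUCKET" \
        --key "nyc-data/yellow_tripdata_2024-01.parquet" \
        --query '{Size: ContentLength, ETag: ETag, LastModified: LastModified}'
}

# Usage: run once after the cp completes and once before the read, then compare.
# show_object_metadata
```

If the reported size is not 49961641, or the ETag or last-modified time changes between the two calls, the object stored by the service is not the one Mountpoint uploaded.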

If you're able to reproduce this on Amazon S3, please get in contact with us or post an update here, including the request IDs.

@muddyfish
Contributor

In particular, please try disabling trailing checksums with --upload-checksums off and check whether the copy then works correctly. If that resolves the issue, your third-party service does not implement trailing checksums and also does not return an error when a client uses them, which can lead to this kind of silent corruption.
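
As a sketch of that test, the mount command from the original report can be reused with `--upload-checksums off` added. The `remount_without_upload_checksums` wrapper name is hypothetical, and the sketch assumes the file system is currently mounted at ~/s3 and can be unmounted with `umount`:

```shell
# Hypothetical wrapper: remount the bucket with upload checksums disabled.
# All other options are copied from the mount command in the original report.
remount_without_upload_checksums() {
    umount "${HOME}/s3"
    mount-s3 \
        --log-directory ~/s3.log \
        --debug-crt \
        --region VAST \
        --endpoint-url "$AWS_ENDPOINT_URL" \
        --allow-delete \
        --upload-checksums off \
        --uid "$(id -u jovyan)" \
        --gid "$(id -g jovyan)" \
        --file-mode 0664 \
        --dir-mode 0775 \
        "$S3A_BUCKET" "${HOME}/s3"
}

# After remounting, repeat the original steps and compare the two checksums:
# rm ../s3/nyc-data/yellow_tripdata_2024-01.parquet
# cp yellow_tripdata_2024-01.parquet ../s3/nyc-data/
# shasum yellow_tripdata_2024-01.parquet ../s3/nyc-data/yellow_tripdata_2024-01.parquet
```

Matching checksums after this change would point at the service's handling of trailing checksums rather than at the copy itself.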
