File Transfer with Bash¶
Uploading large files using S3 multipart protocol¶
For large file uploads, FirecREST provides upload URLs based on the S3 multipart protocol; the number of URLs depends on the file size and on the FirecREST settings. The user must split the file accordingly and upload each part to its assigned URL.
Once all parts have been uploaded, the user must call the provided complete upload URL to finalize the transfer. After completion, a remote job moves the file from the staging storage to its final destination.
Commented example¶
The first step is to determine the size of your large file, expressed in bytes. A reliable method is to use the command stat --printf "%s" "$LARGE_FILE_NAME".
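For example, the result can be stored in the LARGE_FILE_SIZE_IN_BYTES variable used in the request below:
LARGE_FILE_SIZE_IN_BYTES=$(stat --printf "%s" "$LARGE_FILE_NAME")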
Then call the /filesystem/{system}/transfer/upload endpoint as follows.
Call to transfer/upload to activate the multipart protocol
curl -s --location --globoff "${F7T_URL}/filesystem/${F7T_SYSTEM}/transfer/upload" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $ACCESS_TOKEN" \
--data "{
\"path\":\"${DESTINATION_PATH}\",
\"fileName\":\"${LARGE_FILE_NAME}\",
\"fileSize\":\"${LARGE_FILE_SIZE_IN_BYTES}\"
}"
The JSON response from this call follows the structure shown below. FirecREST calculates the number of parts the file must be split into, based on the provided file size and the maxPartSize setting. Each part is assigned a number from 1 to n and must be uploaded using the presigned URLs listed in partsUploadUrls. Once all parts have been successfully uploaded, the presigned URL in completeUploadUrl is used to finalize the upload sequence and initiate the transfer of the complete data file from S3 to its final destination.
FirecREST response from the /filesystem/{system}/transfer/upload endpoint
{
"transferJob": {
"jobId": nnnnnnnnn,
"system": "SYSTEM",
"workingDirectory": "/xxxxxxxxx",
"logs": {
"outputLog": "/xxxxxxxx.log",
"errorLog": "/xxxxxxxxx.log"
}
},
"partsUploadUrls": [
"https://part1-url", "https://part2-url", "https://part3-url"
],
"completeUploadUrl": "https://upload-complete-url",
"maxPartSize": 1073741824
}
Extract the most useful information from the response using jq:
- the list of presigned URLs for uploading the individual parts, kept as a single string that will be parsed later
- the presigned URL that closes the upload protocol
- the maximum part size allowed, used to prepare valid chunks of your large file for upload
Extract information from the FirecREST response.
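A minimal sketch, assuming the JSON response shown above has been saved to a file named response.json (hypothetical name); the shell variable names follow those used in the upload loop below:
# Keep the list of part URLs as a single JSON string, to be parsed later with jq
partsUploadUrls=$(jq -c '.partsUploadUrls' response.json)
# Presigned URL used to finalize the multipart upload
completeUploadUrl=$(jq -r '.completeUploadUrl' response.json)
# Maximum size in bytes allowed for each part
max_part_size=$(jq -r '.maxPartSize' response.json)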
Given the maxPartSize field in the /filesystem/{system}/transfer/upload endpoint response, split your large file accordingly, as in the sketch below:
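A minimal sketch, assuming GNU split and that PARTS_DIR (the directory read by the upload loop below) and the max_part_size value extracted above are already set:
# Produce parts named x01, x02, ... of at most max_part_size bytes each
split --numeric-suffixes=1 --bytes="${max_part_size}" "${LARGE_FILE_NAME}" "${PARTS_DIR}/x"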
This divides your large file into a set of parts numbered x01, x02, and so on, up to the number of parts that the S3 multipart upload protocol expects. The number of split parts must match the number of items in the partsUploadUrls list.
Upload each part in the correct order. After a successful upload, S3 responds with an ETag, which is a checksum of that specific part. This value is essential for completing the multipart upload, so be sure to record it.
The example below demonstrates a simple sequential upload. However, this approach is not mandatory, since S3 fully supports uploading parts in parallel.
Upload parts call
part_id=1
etags_xml=""
while read -r part_url; do
# Generate the name of the part file sequentially
part_file=$(printf "%s/x%02d" "$PARTS_DIR" "${part_id}")
# Upload data with curl and extract ETag
if line=$(curl -f --show-error -D - --upload-file "$part_file" "$part_url" | grep -i "^ETag: " ) ;
then
etag=$(echo $line | awk -F'"' '{print $2}')
etags_xml="${etags_xml}<Part><PartNumber>${part_id}</PartNumber><ETag>\"${etag%|*}\"</ETag></Part>"
else
echo "Error uploading part ${part_id}"
fi
part_id=$(( part_id + 1 ))
done <<< "$(echo "$partsUploadUrls" | jq -r '.[]')"
# Prepare ETag's XML collection
complete_upload_xml="<CompleteMultipartUpload xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">${etags_xml}</CompleteMultipartUpload>"
Note
The ETags have been assembled into an XML structure, as required by the S3 multipart upload protocol. This format ensures the upload can be finalized correctly. The XML must strictly follow the expected schema, including the XML namespace, quoted ETag values, and integer PartNumber entries.
ETag collection's XML.
<CompleteMultipartUpload
xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Part>
<PartNumber>1</PartNumber>
<ETag>"b93fa37618435783645da0f851497506"</ETag>
</Part>
<Part>
<PartNumber>2</PartNumber>
<ETag>"f3f341e6043429eb010fa335f0697390"</ETag>
</Part>
<Part>
<PartNumber>3</PartNumber>
<ETag>"f621481ce07eddab98291227f81d8248"</ETag>
</Part>
</CompleteMultipartUpload>
Complete the upload by calling the presigned completeUploadUrl
as in the example below. Pass the XML collection of ETags as body data.
Complete upload call
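A minimal sketch; the POST method and the application/xml content type are assumptions about the presigned request, and completeUploadUrl and complete_upload_xml are the variables prepared above:
# Finalize the multipart upload by sending the ETag collection to the presigned URL
curl -f -sS --request POST \
--header "Content-Type: application/xml" \
--data "${complete_upload_xml}" \
"${completeUploadUrl}"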
Script examples¶
Using split¶
To run the example, you first need to set up the environment file using the provided env-template file. Set the fields in the template as described in the user guide to match your deployment, and save the template as a new file. Launch the script as in the example below.
File upload using split
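A minimal launch sketch; the script name below is a placeholder for the example script provided with the documentation, run after saving your environment file:
# Hypothetical script name: substitute the actual example script from your checkout
bash ./upload_with_split.sh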
The script uploads your_data_fil.zip
to the designated cluster. Note that the split
command generates all temporary part files beforehand, so your local disk must have at least as much free space as the total size of the data being uploaded.
Using dd¶
To run the example, you first need to set up the environment file using the provided env-template file. Set the fields in the template as described in the user guide to match your deployment, and save the template as a new file. Launch the script as in the example below.
File upload using dd
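A minimal launch sketch; the script name below is a placeholder for the example script provided with the documentation, run after saving your environment file:
# Hypothetical script name: substitute the actual example script from your checkout
bash ./upload_with_dd.sh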
The script uploads your_data_fil.zip
to the specified cluster. When using dd
, only a single temporary part file is created and overwritten with each upload. In this case, your local disk must have at least as much free space as the max_part_size, which defaults to 1GB.
The key difference with respect to the split approach is highlighted in the upload loop below. The dd tool reads data blocks sequentially at each iteration by incrementing the skip offset. For correct operation, the part size and offset must be specified as multiples of BLOCK_SIZE bytes. By default, this example assumes 1MB per block. This value can be adjusted to optimize performance, but regardless of tuning, the part file size is always defined in blocks. The final part may be shorter than expected, which is acceptable and does not result in an error.
Creating temporary part files with dd
# Define the part size, depending on the block size
part_blocks=$(( max_part_size / BLOCK_SIZE ))
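# part_id, etags_xml, skip, PART_FILE and upload_error are assumed to be initialized earlier in the full script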
while read -r part_url; do
# Generate temporary part file
if ! dd if="${DATA_FILE}" of="${PART_FILE}" bs=${BLOCK_SIZE} count=${part_blocks} ${skip} status=none ; then
>&2 echo "Error generating part file for part ${part_id}"
upload_error=true
else
ls -hl
echo "Uploading part ${part_id}: ${PART_FILE}"
# Upload data with curl and extract ETag
if line=$(curl -f --show-error -D - --upload-file "$PART_FILE" "$part_url" | grep -i "^ETag: " ) ;
then
etag=$(echo $line | awk -F'"' '{print $2}')
etags_xml="${etags_xml}<Part><PartNumber>${part_id}</PartNumber><ETag>\"${etag%|*}\"</ETag></Part>"
else
>&2 echo "Error uploading part ${part_id}"
upload_error=true
fi
# Cleanup
rm "${PART_FILE}"
fi
# Increase part index
part_id=$(( part_id + 1 ))
# Offset for next chunk
skip="skip=$((( part_id - 1 ) * part_blocks))"
done <<< "$(echo "$parts_upload_urls" | jq -r '.[]')"