File Transfer with Bash

Uploading large files using S3 multipart protocol

For large file uploads, FirecREST provides upload URLs based on the S3 multipart protocol; the number of URLs depends on the file size and on the FirecREST settings. The user must split the file accordingly and upload each part to its assigned URL.

Once all parts have been uploaded, the user must call the provided complete upload URL to finalize the transfer. After completion, a remote job moves the file from the staging storage to its final destination.

Commented example

The first step is to determine the size of your large file, expressed in bytes. A reliable method is to use the command: stat --printf "%s" "$LARGE_FILE_NAME".
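The size can be captured directly into the variable used in the next request. A minimal sketch, assuming the LARGE_FILE_NAME and LARGE_FILE_SIZE_IN_BYTES variable names used throughout this example:

Determine the large file size in bytes

# Store the size of the file to upload, in bytes
LARGE_FILE_SIZE_IN_BYTES=$(stat --printf "%s" "$LARGE_FILE_NAME")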

Then call the /filesystem/{system}/transfer/upload endpoint as follows.

Call to transfer/upload to activate the multipart protocol

response=$(curl -s --location --globoff "${F7T_URL}/filesystem/${F7T_SYSTEM}/transfer/upload" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $ACCESS_TOKEN" \
--data "{
    \"path\":\"${DESTINATION_PATH}\",
    \"fileName\":\"${LARGE_FILE_NAME}\",
    \"fileSize\":\"${LARGE_FILE_SIZE_IN_BYTES}\"
}")

The JSON response from this call follows the structure shown below. FirecREST calculates the number of parts the file must be split into, based on the provided file size and the maxPartSize setting. Each part is assigned a number from 1 to n and must be uploaded using the presigned URLs listed in partsUploadUrls. Once all parts have been successfully uploaded, the presigned URL in completeUploadUrl is used to finalize the upload sequence and initiate the transfer of the complete data file from S3 to its final destination.

FirecREST response from /filesystem/{system}/transfer/upload endpoint

{
"transferJob": {
    "jobId": nnnnnnnnn,
    "system": "SYSTEM",
    "workingDirectory": "/xxxxxxxxx",
    "logs": {
        "outputLog": "/xxxxxxxx.log",
        "errorLog": "/xxxxxxxxx.log"
    }
},
"partsUploadUrls": [
    "https://part1-url", "https://part2-url", "https://part3-url"
],
"completeUploadUrl": "https://upload-complete-url",
"maxPartSize": 1073741824
}

Extract the most useful information from the response using jq: the list of presigned URLs for uploading the individual parts (kept as a single string that will be parsed later), the presigned URL that closes the upload protocol, and the maximum part size allowed, which is needed to prepare valid chunks of your large file.

Extract information from FirecREST response.

parts_upload_urls=$(echo "$response" | jq -r ".partsUploadUrls")
complete_upload_url=$(echo "$response" | jq -r ".completeUploadUrl")
max_part_size=$(echo "$response" | jq -r ".maxPartSize")
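As a quick sanity check, the number of entries in partsUploadUrls should equal the file size divided by maxPartSize, rounded up. A minimal sketch, reusing the variables defined above:

Sanity check of the expected number of parts

# Number of presigned part URLs actually returned
num_urls=$(echo "$parts_upload_urls" | jq -r 'length')
# Expected number of parts: ceil(fileSize / maxPartSize)
expected_parts=$(( (LARGE_FILE_SIZE_IN_BYTES + max_part_size - 1) / max_part_size ))
echo "Parts expected: ${expected_parts}, upload URLs received: ${num_urls}"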

Given the maxPartSize field in the /filesystem/{system}/transfer/upload endpoint response, split your large file accordingly:

Split large file to upload

split "$LARGE_FILE_NAME" -b "$max_part_size" --numeric-suffixes=1

This will divide your large file into a set of parts numbered x01, x02, and so on, up to the number of parts that the S3 multipart upload protocol expects. The number of split parts must match the number of items in the partsUploadUrls list.
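A quick check along these lines (a sketch; it assumes the part files were created in the current directory with split's default x prefix) can confirm the counts match before uploading:

Check that the part count matches the URL list

# Count the generated part files (x01, x02, ...) and compare with the
# number of presigned URLs (num_urls as computed in the sanity check above)
num_parts=$(ls x[0-9]* 2>/dev/null | wc -l)
if [ "$num_parts" -ne "$num_urls" ]; then
    >&2 echo "Part count mismatch: ${num_parts} files vs ${num_urls} upload URLs"
fi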

Upload each part in the correct order. After a successful upload, S3 responds with an ETag, which is a checksum of that specific part. This value is essential for completing the multipart upload, so be sure to record it.

The example below demonstrates a simple sequential upload. However, this approach is not mandatory, since S3 fully supports uploading parts in parallel; a parallel variant is sketched after the sequential example.

Upload parts call

part_id=1
etags_xml=""
while read -r part_url; do
    # Generate the name of the part file sequentially
    part_file=$(printf "%s/x%02d" "$PARTS_DIR" "${part_id}")
    # Upload data with curl and extract the ETag response header
    if line=$(curl -f --show-error -D - --upload-file "$part_file" "$part_url" | grep -i "^ETag: ") ;
    then
        etag=$(echo "$line" | awk -F'"' '{print $2}')
        etags_xml="${etags_xml}<Part><PartNumber>${part_id}</PartNumber><ETag>\"${etag}\"</ETag></Part>"
    else
        >&2 echo "Error uploading part ${part_id}"
    fi
    part_id=$(( part_id + 1 ))
done <<< "$(echo "$parts_upload_urls" | jq -r '.[]')"

# Prepare ETag's XML collection
complete_upload_xml="<CompleteMultipartUpload xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">${etags_xml}</CompleteMultipartUpload>"
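As noted above, the parts do not have to be uploaded sequentially. One possible parallel variant is sketched below; it is not part of the FirecREST API, assumes the part files produced by split, and writes each ETag to a temporary file named after its part number so the XML can still be assembled in part order:

Parallel upload of parts (sketch)

# Hypothetical parallel variant: each part is uploaded in a background subshell
mkdir -p etags
part_id=1
while read -r part_url; do
    part_file=$(printf "%s/x%02d" "$PARTS_DIR" "${part_id}")
    (
        # Upload one part and extract the ETag value from the response headers
        etag=$(curl -sf -D - --upload-file "$part_file" "$part_url" \
               | grep -i "^ETag: " | awk -F'"' '{print $2}')
        echo "$etag" > "etags/${part_id}"
    ) &
    part_id=$(( part_id + 1 ))
done <<< "$(echo "$parts_upload_urls" | jq -r '.[]')"
# Wait for all background uploads to finish
wait

# Assemble the XML in part order, exactly as in the sequential example
etags_xml=""
for i in $(seq 1 $(( part_id - 1 ))); do
    etags_xml="${etags_xml}<Part><PartNumber>${i}</PartNumber><ETag>\"$(cat "etags/${i}")\"</ETag></Part>"
done
complete_upload_xml="<CompleteMultipartUpload xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">${etags_xml}</CompleteMultipartUpload>"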

Note

The ETags have been assembled into an XML structure, as required by the S3 multipart upload protocol. This format ensures the upload can be finalized correctly. The XML must strictly follow the expected schema, including the XML namespace, quoted ETag values, and integer PartNumber entries.

ETag collection XML

<CompleteMultipartUpload
    xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Part>
        <PartNumber>1</PartNumber>
        <ETag>"b93fa37618435783645da0f851497506"</ETag>
    </Part>
    <Part>
        <PartNumber>2</PartNumber>
        <ETag>"f3f341e6043429eb010fa335f0697390"</ETag>
    </Part>
    <Part>
        <PartNumber>3</PartNumber>
        <ETag>"f621481ce07eddab98291227f81d8248"</ETag>
    </Part>
</CompleteMultipartUpload>

Complete the upload by calling the presigned completeUploadUrl as in the example below. Pass the XML collection of ETags as body data.

Complete upload call

curl -f --show-error -i -w "%{http_code}" -H "Content-Type: application/xml" -d "$complete_upload_xml" -X POST "$complete_upload_url"
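To let a script verify the result, one option (a sketch, using the same variable names) is to capture just the HTTP status code and check that it is 200:

Check the completion status

# Capture only the HTTP status code of the completion request
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "Content-Type: application/xml" \
    -d "$complete_upload_xml" -X POST "$complete_upload_url")
if [ "$http_code" != "200" ]; then
    >&2 echo "Completing the multipart upload failed with HTTP status ${http_code}"
fi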

Script examples

Using split

To run the example you first need to set up the environment file using the provided env-template file. Set the fields in the template as described in the user guide to match your deployment and save the template as a new file.

Launch the script as in the example

File upload using split

./multipart_upload_split your_data_file.zip cluster_name /home/user/data_dir/ environment_file

The script uploads your_data_file.zip to the designated cluster. Note that the split command generates all temporary part files beforehand, so your local disk must have at least as much free space as the total size of the data being uploaded.
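If you are unsure whether enough local space is available, a quick pre-check along these lines may help. This is a sketch, assuming GNU coreutils and that the temporary part files are created in the current directory; the file name is the one passed to the script:

Check local free space before splitting

# Compare the size of the data file with the free space of the current filesystem
required=$(stat --printf "%s" your_data_file.zip)
available=$(df -B1 --output=avail . | awk 'NR==2 {print $1}')
if [ "$available" -lt "$required" ]; then
    >&2 echo "Not enough free space for the temporary part files"
fi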

Using dd

To run the example you first need to set up the environment file using the provided env-template file. Set the fields in the template as described in the user guide to match your deployment and save the template as a new file.

Launch the script as in the example

File upload using dd

./multipart_upload_dd your_data_file.zip cluster_name /home/user/data_dir/ environment_file

The script uploads your_data_file.zip to the specified cluster. When using dd, only a single temporary part file is created and overwritten with each upload. In this case, your local disk must have at least as much free space as max_part_size, which defaults to 1GB.

The key difference from the split-based script is highlighted in the upload loop below. The dd tool reads data blocks sequentially at each iteration by incrementing the skip offset. For correct operation, the part size and offset must be specified as multiples of BLOCK_SIZE bytes. By default, this example assumes 1MB per block. This value can be adjusted to optimize performance, but regardless of tuning, the part file size is always defined in blocks. The final part may be shorter than expected, which is acceptable and does not result in an error.

Creating temporary part files with dd

# Define the part size in blocks, depending on the block size
part_blocks=$(( max_part_size / BLOCK_SIZE ))
while read -r part_url; do
    # Generate the temporary part file; ${skip} is empty on the first
    # iteration, so dd starts reading from the beginning of the data file
    if ! dd if="${DATA_FILE}" of="${PART_FILE}" bs="${BLOCK_SIZE}" count="${part_blocks}" ${skip} status=none ; then
        >&2 echo "Error generating part file for part ${part_id}"
        upload_error=true
    else
        echo "Uploading part ${part_id}: ${PART_FILE}"
        # Upload data with curl and extract the ETag response header
        if line=$(curl -f --show-error -D - --upload-file "$PART_FILE" "$part_url" | grep -i "^ETag: ") ;
        then
            etag=$(echo "$line" | awk -F'"' '{print $2}')
            etags_xml="${etags_xml}<Part><PartNumber>${part_id}</PartNumber><ETag>\"${etag}\"</ETag></Part>"
        else
            >&2 echo "Error uploading part ${part_id}"
            upload_error=true
        fi
        # Cleanup the temporary part file
        rm "${PART_FILE}"
    fi
    # Increase part index
    part_id=$(( part_id + 1 ))
    # Offset (in blocks) for the next chunk
    skip="skip=$(( ( part_id - 1 ) * part_blocks ))"
done <<< "$(echo "$parts_upload_urls" | jq -r '.[]')"