
Data Transfer

Motivation

There are two endpoints that perform data transfer (upload and download) using FirecREST:

  1. filesystem/<system>/ops/upload[|download], which is meant for small file transfers and blocks the FirecREST interface, and
  2. filesystem/<system>/transfer/upload[|download], which is designed to handle large data transfers by delegating the operation to another API or transfer service, and therefore does not block the FirecREST API.

In this section we discuss the latter.

Types of data transfer using FirecREST

FirecREST offers several types of data transfer, which can be selected through the Data Operation configuration.

S3DataTransfer

Note

Configuration for S3 Data Transfer can be found in this link

FirecREST enables users to upload and download large data files of up to 5TB each, utilizing S3 buckets as a data buffer. Users requesting data uploads or downloads to the HPC infrastructure receive presigned URLs to transfer data to or from the S3 storage.

Ownership of buckets and data remains with the FirecREST service account, but FirecREST creates one bucket per user. Each transferred file (uploaded or downloaded) is stored as a uniquely identified data object in the user's bucket. Data objects within the buckets are retained for a configurable period, managed through S3's lifecycle expiration functionality. This expiration period, expressed in days, can be specified using the bucket_lifecycle_configuration parameter.

The S3 storage can be either on-premises or cloud-based. In either case, a valid service account with sufficient permissions to create buckets and generate presigned URLs is required.

S3DataTransfer Upload

Uploading data from outside the HPC infrastructure requires users to follow the multipart upload protocol, so the data file must be divided into parts. The size limit of each part is defined in FirecREST's response to the upload call.

The maximum size of the parts can be configured via the max_part_size parameter. The specified value must be lower than S3's limit of 5GB per part and is applied to every upload call. When setting this value, keep in mind that clients using this protocol may need to hold each part in memory before streaming the data to S3, so a large value could impact clients utilizing FirecREST for data uploads. S3 limits file uploads to 5TB with a maximum of 10,000 parts. To fully utilize the maximum number of parts for transferring the largest allowable file, each part should be approximately 525 MB. FirecREST's default is 1GB per part.
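As an illustration of the part-size arithmetic above, the short sketch below computes how many parts a client would need for a given file; the function name and values are purely illustrative:

```python
import math

def number_of_parts(file_size_bytes: int, max_part_size_bytes: int) -> int:
    """Number of multipart-upload parts needed for a file of the given size."""
    return math.ceil(file_size_bytes / max_part_size_bytes)

# With FirecREST's default of 1 GB per part, a 5 TB file needs 5120 parts,
# well below S3's limit of 10,000 parts.
print(number_of_parts(5 * 1024**4, 1 * 1024**3))  # -> 5120
```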

The user needs to specify the name of the file to be uploaded, the destination path on the HPC cluster and the size of the file, which allows FirecREST to properly generate the set of presigned URLs for uploading each part.
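A minimal sketch of such a request using the Python requests library is shown below. The base URL, system name, token, and file paths are placeholders, and the field names follow the parameters listed in the workflow further down:

```python
import os
import requests

FIRECREST_URL = "https://firecrest.example.org"   # placeholder base URL
SYSTEM = "cluster"                                 # placeholder system name
TOKEN = os.environ["FIRECREST_TOKEN"]              # placeholder access token

local_file = "results.tar.gz"

# Request an upload through the non-blocking transfer endpoint.
resp = requests.post(
    f"{FIRECREST_URL}/filesystem/{SYSTEM}/transfer/upload",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/store/project/results",         # destination directory on the cluster
        "fileName": local_file,                    # destination file name
        "fileSize": os.path.getsize(local_file),   # size in bytes
    },
)
resp.raise_for_status()
upload_info = resp.json()  # maxPartSize, partsUploadUrls, completeUploadUrl, transferJob
```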

After the user completes the upload process, an already scheduled job transfers the data from the S3 bucket to its final destination on the HPC cluster, typically a dedicated storage system. To enhance performance, this job should be scheduled on a partition capable of supporting concurrent executions to enable parallel data transfers. The implementation of this partition may vary based on the hardware resources available in the cluster. Utilizing dedicated nodes could further improve responsiveness and the overall throughput of the data transfer.

The diagram below illustrates the sequence of calls required to correctly process an upload.

[Diagram: external storage upload]

  1. The user calls the API resource transfer/upload of the filesystem endpoint with the following parameters
    • path: destination of the file in the HPC cluster
    • fileName: destination name of the file
    • fileSize: size of the file to transfer expressed in bytes
  2. FirecREST processes the request and, if valid, generates a dedicated bucket on S3 for the specific user. All further file transfers are conducted within this bucket
  3. FirecREST schedules a job on the HPC cluster that waits for the completion of the upload in the S3 bucket
  4. FirecREST returns the following information to the user
    • maxPartSize: the maximum size for parts
    • partsUploadUrls: one distinct upload URL per part
    • completeUploadUrl: the URL to complete the multipart upload in compliance with the S3 protocol
    • transferJob: information to track the data transfer job
  5. The user uploads the data by applying the multipart protocol with the presigned URLs. An S3 object, labeled with the file name and uniquely tagged by a UUID, is created within the user's bucket to receive the file parts and merge them into the final uploaded file. The protocol to be followed by users consists of the following steps (see the sketch after this list):
    1. Split the file into parts of at most the specified maximum size
    2. Upload each part with the given URLs, collecting the returned ETags for data integrity verification
    3. Complete the upload with the given URL, providing the list of ETags
  6. The data transfer job detects the upload completion
  7. The data transfer job downloads the incoming data from S3 to the destination specified by the user
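The client side of steps 5.1 to 5.3 could look roughly like the sketch below. It reuses the upload_info response and local_file from the request example above, relies on the Python requests library, and treats the exact payload expected by the presigned completion URL as an assumption based on S3's standard CompleteMultipartUpload XML body:

```python
import requests

part_size = upload_info["maxPartSize"]
part_urls = upload_info["partsUploadUrls"]

# 1. Split the file into parts and 2. upload each part, collecting the ETags.
etags = []
with open(local_file, "rb") as f:
    for url in part_urls:
        chunk = f.read(part_size)
        r = requests.put(url, data=chunk)
        r.raise_for_status()
        etags.append(r.headers["ETag"])

# 3. Complete the multipart upload. The body below follows S3's standard
# CompleteMultipartUpload XML format; whether the presigned completion URL
# expects exactly this payload is an assumption in this sketch.
parts_xml = "".join(
    f"<Part><PartNumber>{i}</PartNumber><ETag>{etag}</ETag></Part>"
    for i, etag in enumerate(etags, start=1)
)
body = f"<CompleteMultipartUpload>{parts_xml}</CompleteMultipartUpload>"
r = requests.post(upload_info["completeUploadUrl"], data=body)
r.raise_for_status()
```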

S3DataTransfer Download

Exporting large data files from the HPC cluster to external systems begins with a user's request to download data. FirecREST returns a presigned URL to access the S3 object and then schedules a job that uploads the data to an S3 object in the user's data bucket. The user must wait until the upload process within the HPC infrastructure is fully complete before accessing the data on S3.

To address potential limitations on upload size, the scheduled job transfers the data to S3 using the multipart upload protocol. This process is entirely transparent to the user and ensures that the S3 object becomes accessible only once the transfer has completed successfully.

Once the presigned URL is provided by FirecREST, users can access the S3 bucket without any restrictions. The maximum file size allowed by S3 for a single download accommodates even large exported data files in a single transfer.

The diagram below illustrates the sequence of calls required to correctly process a download.

[Diagram: external storage download]

  1. The user calls the API resource transfer/download of the filesystem endpoint, providing the following parameter
    • path: source of the file in the HPC cluster
  2. FirecREST processes the request and creates a dedicated bucket on S3 for the specific user (the same bucket can be reused for upload transfers).
  3. FirecREST schedules the data transfer job and responds to the user with the following data:
    • downloadUrl: the presigned URL to access the S3 object containing the requested data
    • transferJob: information to track the data transfer job
  4. The transfer job generates an S3 object within the user's designated bucket, labeled with the file name and uniquely tagged with a UUID. It automatically transfers data from the specified source into the bucket, utilizing the multipart upload protocol
  5. Once the transfer is complete, the user can access the S3 object in the bucket until the expiration period ends (a minimal client sketch follows this list)
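A minimal client sketch for this flow, again using the Python requests library and the placeholder FIRECREST_URL, SYSTEM and TOKEN from the upload example above; how the completion of the transfer job is checked is left out of this sketch:

```python
import requests

# Request the download through the non-blocking transfer endpoint.
resp = requests.post(
    f"{FIRECREST_URL}/filesystem/{SYSTEM}/transfer/download",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"path": "/store/project/results/results.tar.gz"},  # source file on the cluster
)
resp.raise_for_status()
download_info = resp.json()              # contains downloadUrl and transferJob
download_url = download_info["downloadUrl"]

# ... wait until the transfer job has copied the data into the S3 object ...

# Fetch the exported file from S3 in a single request.
with requests.get(download_url, stream=True) as r:
    r.raise_for_status()
    with open("results.tar.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
```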

Although downloading the file in a single request is an option, S3 supports HTTP Range Requests, which can be used to download chunks of a file stored in the S3 bucket in parallel.
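As an illustration of that approach, the sketch below downloads the object in parallel ranges with a thread pool. The downloadUrl, file size and chunk size are placeholders, and the file size is assumed to be known in advance (e.g. from a prior stat of the source file on the cluster):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

CHUNK = 64 * 1024 * 1024  # 64 MB per range request (illustrative value)

def fetch_range(url: str, start: int, end: int) -> bytes:
    """Download bytes [start, end] of the S3 object via an HTTP Range Request."""
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    r.raise_for_status()
    return r.content

size = file_size  # size of the exported file in bytes, assumed known
ranges = [(start, min(start + CHUNK, size) - 1) for start in range(0, size, CHUNK)]

with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = pool.map(lambda rng: fetch_range(download_url, *rng), ranges)

with open("results.tar.gz", "wb") as out:
    for chunk in chunks:
        out.write(chunk)
```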

StreamerDataTransfer

Note

Configuration for Streamer Data Transfer can be found in this link

When requested, FirecREST creates a scheduler job that opens a websocket on a port from a range of available ports on a compute node of the cluster.

Once opened, this websocket is able to receive or transmit chunks of data using the firecrest-streamer Python library, which is developed and maintained by the FirecREST team.

Features

When compared to the S3DataTransfer, this method has a number of advantages:

  • Data transfer is performed point-to-point between the user and the target remote filesystem
  • The staging area is no longer needed, which prevents writing the data twice for one operation
  • There is no limit on the amount of data to be transferred, an improvement over the 5TB limit of S3DataTransfer
  • There is no need to split the file before the upload when it is larger than 5 GB
  • To prevent an idle transfer from occupying a shared resource such as a compute node, a wait_timeout parameter can be configured. Once this timeout is reached, the job is cancelled automatically.
  • Additionally, to prevent the transferred data from exceeding the capacity supported by the HPC centre, the inbound_transfer_limit parameter limits the amount of data that can be received.

Limitations

It's important to mention that using this data transfer type assumes that the compute nodes where the websocket is opened have a public IP or DNS address and that the range of ports selected for the data streaming is open to external networks as well.

Additionally, users must use the firecrest-streamer Python library (or CLI tool) in order to perform the data transfer.

StreamerDataTransfer Download and Upload

[Diagram: streamer transfer]

  1. The user calls the API resource transfer/download or transfer/upload, requesting the data transfer of a file.
  2. FirecREST creates a data transfer job using the scheduler, which launches the firecrest-streamer server.
  3. This server opens an available port (from the configured range of ports) and returns a unique "coordinate", which acts as a shared secret between the server and the user's client
  4. The FirecREST response holds the "coordinates" needed to perform the point-to-point data transfer between the user and the remote filesystem
  5. Using the firecrest-streamer client, the user performs the upload or download (see the sketch after this list).
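A rough sketch of the first steps of this flow from the client's side, reusing the placeholder FIRECREST_URL, SYSTEM and TOKEN from the examples above. The field name holding the coordinates is an assumption for illustration, and the firecrest-streamer client API itself is not shown:

```python
import requests

# Request a streamer-based download of a file on the cluster.
resp = requests.post(
    f"{FIRECREST_URL}/filesystem/{SYSTEM}/transfer/download",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"path": "/store/project/results/results.tar.gz"},  # source file on the cluster
)
resp.raise_for_status()
transfer_info = resp.json()

# The response holds the "coordinates" (shared secret) for the point-to-point
# transfer; the exact field name here is an assumption.
coordinates = transfer_info.get("coordinates")

# The coordinates are then handed to the firecrest-streamer client (Python
# library or CLI) to perform the actual transfer.
```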

WormholeDataTransfer

Note

Configuration for Wormhole Data Transfer can be found in this link

This data transfer type enables Magic Wormhole integration for data transfer through FirecREST.

The idea behind it is the same as with the StreamerDataTransfer: a job is created on the scheduler, which starts a Magic Wormhole server that can receive or send chunks of data using a Magic Wormhole Relay Server and a Rendezvous Server.

Note

For more information on Magic Wormhole servers, refer to this link

As with the Streamer data transfer type, users must use a Python client or a CLI, in this case provided by the developers of Magic Wormhole, to establish point-to-point communication between client and server.