File Delivery Retry logic

In this topic, you will find details about the retry logic implemented for file delivery operations in MFT (Managed File Transfer) when interacting with endpoints.
Explore the different types of errors by entity, including those that are retryable and those that are not, as well as the two primary retry policies.

Endpoint operations

For all external endpoint types, MFT performs the following operations on the endpoint as needed:

Connection Attempt
Get List of Files
Download File
Upload File
Delete File
Rename File
File Exists Check
Folder Exists Check

Error types by entity

An error can relate to an Endpoint, a Flow Endpoint, or an individual File.

Endpoints
- Connection Interrupted / Timeout
- Authentication Failure
- Bucket (S3) / Container (Azure) - Doesn't exist
- Unknown Host
- No Permission / Misconfiguration
Flow Endpoints
- Connection Interrupted / Timeout
- Folder Not Found
- No Permission / Misconfiguration
Files
- Connection Interrupted / Timeout
- No Permission / Misconfiguration

Non-retryable errors

The following error types will not be retried and are not captured by the retry policies:

Authentication Failure
Unknown Host
Bucket/Container - Doesn't exist
(Get List of Files) Folder Not Found
No Permission / Misconfiguration

Retryable errors

Any errors not in the above list are retried automatically. Some examples include:

Connection Timeout
Connection Interrupted
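
The rule above (retry everything except a fixed list of error types) can be sketched as a simple lookup. This is only an illustration: the string labels and the `is_retryable` helper are hypothetical names chosen for the example, not identifiers from MFT.

```python
# Error types this topic lists as non-retryable.
# The labels are illustrative, not MFT's internal identifiers.
NON_RETRYABLE = frozenset({
    "authentication_failure",
    "unknown_host",
    "bucket_or_container_not_found",
    "folder_not_found",            # raised by Get List of Files
    "no_permission_or_misconfiguration",
})


def is_retryable(error_type: str) -> bool:
    """Anything not explicitly listed as non-retryable is retried."""
    return error_type not in NON_RETRYABLE


print(is_retryable("connection_timeout"))   # True
print(is_retryable("unknown_host"))         # False
```

Treating the non-retryable list as the explicit set means that a new or unexpected error type is retried by default, which matches the wording of this section.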

Retry policies

We have two primary retry policies: one for the ‘Connect’ operation and another for all other operations. Both policies apply an exponentially increasing delay between attempts, and each delay includes some jitter.

For the ‘Connect’ retry policy: We retry for a maximum of three minutes before ultimately failing.
- Example 1 - MFT attempts to connect to an External S3 bucket. The operation times out after 100 seconds. We wait approximately two seconds (2^1) before trying again. The operation fails again after 100 seconds. At this point, the overall attempt has lasted over 200 seconds, which is more than three minutes, and we will terminate it.
- Example 2 - MFT attempts to connect to an External SFTP server. The operation fails due to unknown network reasons after five seconds. We wait two seconds and try again. The operation fails again, and we wait again, this time for four seconds. This process repeats until the sum of all time spent (attempts + waits) exceeds three minutes.
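
The arithmetic in the two ‘Connect’ examples can be reproduced with a small simulation. This is a sketch under assumptions, not the actual implementation: the function name, its parameters, the constant attempt duration, and the choice to check the three-minute budget only after a failed attempt are all hypothetical. Jitter is disabled by default so the numbers are reproducible.

```python
import random


def simulate_connect_retries(attempt_seconds, budget_seconds=180.0,
                             base=2.0, jitter_fraction=0.0):
    """Simulate a time-budgeted retry loop in which every attempt fails.

    attempt_seconds: how long each failed attempt takes (assumed constant).
    Returns (attempts_made, elapsed_seconds) at the point of giving up.
    """
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1
        elapsed += attempt_seconds            # the attempt itself, which fails
        if elapsed > budget_seconds:          # assumed: budget checked after an attempt
            return attempts, elapsed
        wait = base ** attempts               # 2, 4, 8, ... seconds
        wait += random.uniform(0, jitter_fraction * wait)
        elapsed += wait


print(simulate_connect_retries(100.0))  # Example 1: (2, 202.0)
print(simulate_connect_retries(5.0))    # Example 2: (8, 294.0)
```

Under these assumptions, Example 1 stops after two attempts at 202 seconds, and Example 2 stops after eight attempts. The real system may check the budget at a different point, which would change where Example 2 ends.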
For the ‘Other’ retry policy: We retry based on the number of attempts. If an error is retryable, we retry up to 10 times (11 attempts in total). The total wait time across all attempts is approximately 15-17 minutes.
- For example, MFT begins to download a file from an External FTP server. During the transfer, the connection is terminated. We wait two seconds before trying again. If the failure persists, this process repeats until the maximum ‘wait’ time between attempts exceeds 392 seconds.
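
An attempt-count policy of this kind can be sketched as a small wrapper around the operation. Everything named here (`RetryableError`, `run_with_retries`, and its parameters) is hypothetical and only illustrates the pattern. In particular, the plain doubling delay is an assumption: this topic does not spell out the exact schedule, and uncapped doubling over ten waits would total well over the 15-17 minutes quoted above, so the production schedule must differ in a way not described here.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for errors that qualify for a retry."""


def run_with_retries(operation, max_retries=10, base=2.0,
                     jitter_fraction=0.1, sleep=time.sleep):
    """Call operation(); on RetryableError, wait and retry up to max_retries times."""
    for attempt in range(1, max_retries + 2):   # 1 initial try + max_retries retries
        try:
            return operation()
        except RetryableError:
            if attempt > max_retries:
                raise                            # retries exhausted: surface the error
            wait = base ** attempt               # assumed schedule: 2, 4, 8, ... seconds
            wait += random.uniform(0, jitter_fraction * wait)
            sleep(wait)
```

Passing a stand-in for `sleep` (for example, a list's `append` method) makes it possible to inspect the chosen delays without actually waiting. When the final attempt fails, the error propagates to the caller, which is the point at which an alert would be raised.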

Under both policies, the total elapsed time (attempts + waits) varies with the error.
For example, a 'Timeout' error typically takes longer to occur, sometimes up to 60 seconds.
The user is notified via an Alert for every operation that exhausts its retry attempts. In addition, if the error was for an individual file, the file is marked with the 'Error' state in the system and is not retried further.

There are a few mechanisms in the system that can help manually retry operations:

FlowEndpoint schedule - Will retry Connection Attempt failures
'Run Endpoint Now' API call (source & target) - same result as FlowEndpoint schedule
'Replay' feature (individual file)