Monitoring FSx for ONTAP workload balance

If you have a file system with multiple HA pairs, its performance and throughput are spread across each of your HA pairs. FSx for ONTAP automatically balances your files as they are written to your file system, but in rare cases your workload data or I/O can become imbalanced across HA pairs, which can reduce your workload's overall performance. You can monitor your workload to ensure that it remains balanced across each of your file system’s HA pairs (and their corresponding file servers and aggregates, the storage pools that make up your primary storage tier).

Primary storage utilization balance

Your file system’s primary storage capacity is divided evenly among each of your HA pairs in storage pools called aggregates. Each HA pair has one aggregate. We recommend that you maintain an average utilization no higher than 80% for your primary storage tier on an ongoing basis. For file systems with multiple HA pairs, we recommend that you maintain an average utilization of up to 80% for every aggregate.

Keeping utilization at or below 80% ensures that there is free space for new incoming data and leaves healthy headroom for maintenance operations, which can temporarily claim free space on your aggregates.

If you notice that your aggregates are imbalanced, you can either increase your file system’s primary storage capacity (commensurately increasing the storage capacity of each aggregate), or you can move your volumes between aggregates using the volume move command in the ONTAP CLI.
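As a rough illustration of the 80% guidance, the following Python sketch flags aggregates whose utilization exceeds a threshold. The function name, the input dictionaries, and the capacity figures are hypothetical; in practice you would supply values taken from your own monitoring data.

```python
def overutilized_aggregates(used, capacity, threshold=0.80):
    """Return {aggregate: utilization} for aggregates above the threshold.

    `used` and `capacity` map aggregate names to sizes in the same unit.
    """
    flagged = {}
    for name, total in capacity.items():
        utilization = used[name] / total
        if utilization > threshold:
            flagged[name] = round(utilization, 3)
    return flagged


# Hypothetical figures (GiB) for a file system with two HA pairs.
capacity = {"aggr1": 1024, "aggr2": 1024}
used = {"aggr1": 900, "aggr2": 512}

print(overutilized_aggregates(used, capacity))  # {'aggr1': 0.879}
```

Here only aggr1 is flagged, which is the kind of imbalance that either a capacity increase or a volume move could address.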

File server and disk performance utilization imbalance

Your file system’s total performance capabilities (such as network throughput, file server to disk throughput and IOPS, and disk IOPS) are divided evenly among your file system’s HA pairs. We recommend that you maintain an average utilization below 50% (and a maximum peak utilization below 80%) for all performance limits on an ongoing basis. This applies both to the overall utilization of your file system’s file server resources across all HA pairs and to each file server individually.
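The per-file-server check described above can be sketched in Python as follows. This is a minimal example under stated assumptions: the function name and the utilization samples (percentages for one performance limit) are hypothetical, and only the 50% average and 80% peak limits come from the guidance in this section.

```python
from statistics import mean


def check_file_server(samples, avg_limit=50.0, peak_limit=80.0):
    """Compare utilization samples (in percent) with the recommended limits."""
    average = mean(samples)
    peak = max(samples)
    return {
        "average": average,
        "peak": peak,
        "within_guidance": average < avg_limit and peak < peak_limit,
    }


# Hypothetical utilization samples for two preferred file servers.
utilization = {
    "FsxId01234567890abcdef-01": [62.0, 71.0, 88.0, 67.0],
    "FsxId01234567890abcdef-03": [21.0, 35.0, 28.0, 24.0],
}

for name, samples in utilization.items():
    print(name, check_file_server(samples))
```

In this made-up data, the first file server exceeds both limits while the second is well within them, which would suggest an imbalance worth diagnosing.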

If you notice that your file server performance utilization is imbalanced, and the file servers on which your workload is imbalanced have an ongoing utilization of over 80%, you can use the ONTAP CLI and REST API to further diagnose the cause of the imbalance and remediate it. The following list describes possible imbalance indicators and the next steps for further diagnosis.

  • If your file system's file server disk throughput or file server disk IOPS are imbalanced: You may be experiencing I/O hotspotting on a subset of HA pairs (that is, a subset of your volumes holds an outsized share of the data being accessed), which can limit your workload's overall performance because it is bottlenecked on those HA pairs. For each highly-utilized file server, check the most-utilized volumes to see which volumes have the most activity within an aggregate. For more information on this procedure, see Rebalancing highly-utilized volumes.

  • If your file system's network throughput is imbalanced, but your file server disk throughput, file server disk IOPS, or disk IOPS are not imbalanced: Your data is evenly distributed across HA pairs, but your clients are not. For the file servers that have more network throughput utilization than others, check the top clients for each file server, then rebalance those clients by unmounting any volumes from those clients and remounting them using a different endpoint on a different HA pair. For more information on this procedure, see Rebalancing high-traffic clients.

Mapping CloudWatch dimensions to ONTAP CLI and REST API resources

Your scale-out file system has Amazon CloudWatch metrics with the FileServer or Aggregate dimension. In order to further diagnose cases of imbalance, you need to map these dimension values to specific file servers (or nodes) and aggregates in the ONTAP CLI or REST API.

  • For file servers, each file server name maps to a file server (or node) name in ONTAP (for example, FsxId01234567890abcdef-01). Odd-numbered file servers are preferred file servers (that is, they service traffic unless the file system has failed over to the secondary file server), while even-numbered file servers are secondary file servers (that is, they serve traffic only when their partner is unavailable). Because of this, secondary file servers will typically show less utilization than preferred file servers.

  • For aggregates, each aggregate name maps to an aggregate in ONTAP (for example, aggr1). There is one aggregate for every HA pair, meaning aggregate aggr1 is shared by file servers FsxId01234567890abcdef-01 (the preferred file server) and FsxId01234567890abcdef-02 (the secondary file server) in an HA pair, aggregate aggr2 is shared by file servers FsxId01234567890abcdef-03 and FsxId01234567890abcdef-04, and so on.
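The naming convention in the two bullets above can be expressed as a small Python helper. This is a sketch that assumes the pattern shown in the examples holds (a two-digit numeric suffix after the last hyphen, and aggregates named aggr1, aggr2, and so on); the function name and the returned keys are hypothetical.

```python
def describe_file_server(name):
    """Derive role, HA pair, aggregate, and partner from a file server name.

    Assumes names such as 'FsxId01234567890abcdef-03', as in the examples.
    """
    prefix, _, suffix = name.rpartition("-")
    index = int(suffix)
    is_preferred = index % 2 == 1          # odd-numbered servers are preferred
    ha_pair = (index + 1) // 2             # -01/-02 -> 1, -03/-04 -> 2, ...
    partner_index = index + 1 if is_preferred else index - 1
    return {
        "role": "preferred" if is_preferred else "secondary",
        "ha_pair": ha_pair,
        "aggregate": f"aggr{ha_pair}",
        "partner": f"{prefix}-{partner_index:02d}",
    }


print(describe_file_server("FsxId01234567890abcdef-03"))
```

For the name above, the helper reports a preferred file server in HA pair 2 that shares aggr2 with its partner FsxId01234567890abcdef-04.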

You can view the mappings between all aggregates and file servers using the ONTAP CLI.

  1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address
  2. Use the storage aggregate show command, specifying the -fields node parameter.

    ::> storage aggregate show -fields node
    aggregate                       node
    ------------------------------- -------------------------
    aggr1                           FsxId01234567890abcdef-01
    aggr2                           FsxId01234567890abcdef-03
    aggr3                           FsxId01234567890abcdef-05
    aggr4                           FsxId01234567890abcdef-07
    aggr5                           FsxId01234567890abcdef-09
    aggr6                           FsxId01234567890abcdef-11
    6 entries were displayed.
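If you capture the output of the storage aggregate show command as text, a short script can turn it into a lookup table. The following Python sketch assumes the two-column layout shown above (a header, a row of dashes, then one aggregate per line); the function name and the abbreviated sample text are hypothetical.

```python
def parse_aggregate_nodes(cli_output):
    """Map aggregate names to node names from captured CLI output."""
    mapping = {}
    in_table = False
    for line in cli_output.splitlines():
        fields = line.split()
        # The row of dashes marks the start of the data rows.
        if fields and all(set(field) == {"-"} for field in fields):
            in_table = True
            continue
        # Data rows have exactly two columns; the summary line has more.
        if in_table and len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


# Abbreviated sample modeled on the output above.
SAMPLE = """\
aggregate node
--------- -------------------------
aggr1     FsxId01234567890abcdef-01
aggr2     FsxId01234567890abcdef-03
2 entries were displayed.
"""

print(parse_aggregate_nodes(SAMPLE))
```

The resulting dictionary lets you translate a CloudWatch Aggregate dimension value into the file server that currently owns it.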

Rebalancing high-traffic clients

If you're experiencing I/O imbalance across file servers (specifically with Network throughput utilization), high I/O clients may be the cause. To identify high-traffic clients, use the ONTAP CLI.

  1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address
  2. To view the highest-traffic clients, use the statistics top client show ONTAP CLI command. You can optionally specify the -node parameter to view only the top clients for a specific file server. If you are diagnosing imbalance for a specific file server, set -node to the name of that file server (for example, FsxId01234567890abcdef-01).

    You can optionally add the -interval parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample of the amount of traffic driven by each client. The default is 5 (seconds).

    ::> statistics top client show -node FsxId01234567890abcdef-01 [-interval [5,300]]

    In the output, the top clients are shown by their IP address and port.

                                                           *Total     Total
    Client             Vserver   Node                         Ops     (Bps)
    ------------------ --------- ------------------------- ------ ---------
    172.17.236.53:938  svm01     FsxId01234567890abcdef-01   2143 140443648
    172.17.236.160:898 svm02     FsxId01234567890abcdef-01    812  53215232
  3. You can rebalance a subset of the listed high-traffic clients to other file servers. To do so, unmount the volume from the client and remount it using the DNS name for the SVM’s NFS/SMB endpoint. The DNS name resolves to a random endpoint corresponding to a random HA pair.

    We recommend that you reuse the DNS name, but you can also explicitly choose which HA pair a given client mounts to. To guarantee that a client mounts to a different endpoint, specify an endpoint IP address other than the one that corresponds to the node that is experiencing high traffic. To list the endpoint IP addresses and their current nodes, run the following command:

    ::> network interface show -vserver svm_name -lif nfs_smb_management* -fields address,curr-node
    vserver   lif                  address      curr-node
    --------- -------------------- ------------ -------------------------
    svm01     nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01
    svm01     nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03
    2 entries were displayed.

    According to the example output for the statistics top client show command, client 172.17.236.53 is driving high traffic to FsxId01234567890abcdef-01. The output of the network interface show command indicates that the endpoint on this node has the address 172.31.15.89. To mount to a different endpoint, select any other address (in this example, the only other address is 172.31.8.112, corresponding to FsxId01234567890abcdef-03).
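The selection logic in this procedure can be sketched in Python. The example below uses the values from the sample outputs above, already parsed into tuples; the function names and the tuple layouts are hypothetical, and the sketch only picks candidate addresses rather than performing the unmount and remount.

```python
def busiest_client(clients):
    """clients: (client, vserver, node, total_ops, total_bps) tuples."""
    return max(clients, key=lambda row: row[4])


def alternative_endpoints(interfaces, busy_node):
    """interfaces: (lif, address, curr_node) tuples for one SVM."""
    return [address for _, address, node in interfaces if node != busy_node]


clients = [
    ("172.17.236.53:938", "svm01", "FsxId01234567890abcdef-01", 2143, 140443648),
    ("172.17.236.160:898", "svm02", "FsxId01234567890abcdef-01", 812, 53215232),
]
interfaces = [
    ("nfs_smb_management_1", "172.31.15.89", "FsxId01234567890abcdef-01"),
    ("nfs_smb_management_3", "172.31.8.112", "FsxId01234567890abcdef-03"),
]

top = busiest_client(clients)
print(top[0], "->", alternative_endpoints(interfaces, busy_node=top[2]))
```

With these inputs the sketch identifies 172.17.236.53:938 as the busiest client and 172.31.8.112 as the only endpoint on a different node, matching the walkthrough above.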

Rebalancing highly-utilized volumes

If you're experiencing I/O imbalance across your volumes or aggregates, you can move volumes between aggregates to redistribute your I/O traffic more evenly.

Note

If you're experiencing storage utilization imbalance across your aggregates, there is generally not any performance impact unless the high utilization is coupled with I/O imbalance. While you can move volumes between aggregates to balance storage utilization, we recommend only moving volumes if you are seeing a performance impact, as moving volumes can have adverse impact on performance if you don't also consider the I/O driven to each volume you're considering moving.

  1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address
  2. Use the statistics volume show ONTAP CLI command to view the highest-traffic volumes for a given aggregate, with the following changes:

    • Replace aggregate_name with the aggregate’s name (for example, aggr1).

    • You can optionally add the -interval parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample for the amount of traffic driven to each volume. The default is 5 (seconds).

    ::> statistics volume show -aggregate aggregate_name -sort-key total_ops [-interval [5,300]]

    Depending on the interval you chose, it can take up to 5 minutes to show data. The command shows all volumes in the aggregate, along with the amount of traffic being driven to each volume.

                                 *Total Read Write Other      Read Write Latency
    Volume     Vserver Aggregate    Ops  Ops   Ops   Ops     (Bps) (Bps)    (us)
    ---------- ------- --------- ------ ---- ----- ----- --------- ----- -------
    vol1__0007 svm1    aggr1       4078 4078     0     0 267255808     0    1092
    vol1__0005 svm1    aggr1       4078 4078     0     0 267255808     0    1086
    vol1__0003 svm1    aggr1       4077 4077     0     0 267223040     0    1086
    vol1__0001 svm1    aggr1       4077 4077     0     0 267239424     0    1087
    vol1__0008 svm1    aggr2       2314 2314     0     0 151650304     0    1112
    vol1__0006 svm1    aggr2       2144 2144     0     0 140509184     0    1104
    vol1__0002 svm1    aggr2       2183 2183     0     0 143065088     0    1106
    vol1__0004 svm1    aggr2       2183 2183     0     0 143065088     0    1103

    The volume statistics are shown on a per-constituent basis (for example, vol1__0015 is the 15th constituent for FlexGroup vol1). In the example output, the constituents on aggr1 are more highly utilized than the constituents on aggr2. To balance traffic between aggregates, you can move constituent volumes between aggregates so that traffic is more evenly distributed.

  3. To move a volume between aggregates, use the volume move start ONTAP CLI command, replacing the following values:

    • Replace svm_name with the name of the SVM hosting the volume you're moving.

    • Replace volume_name with the name of the volume constituent (for example, vol1__0001).

    • Replace aggregate_name with the name of the destination aggregate for the volume.

    Important

    Volume movement consumes network and disk resources for the source and destination file servers. As a result, the performance of your workload can be impacted by any in-progress volume moves. In addition, there is a cut-over phase of the volume movement process that temporarily pauses I/O for any traffic to the volume.

    ::> volume move start -vserver svm_name -volume volume_name -destination aggregate_name -foreground false
    [Job 1] Job is queued: Move "vol1__0001" in Vserver "svm01" to aggregate "aggr2". Use the "volume move show -vserver svm01 -volume vol1__0001" command to view the status of this operation.

    To check the status of the volume move operation, use the volume move show ONTAP CLI command.

    ::> volume move show -vserver svm_name -volume volume_name
                         Vserver Name: svm01
                          Volume Name: vol1__0001
               Actual Completion Time: -
                      Bytes Remaining: 1.00TB
         Specified Action For Cutover: retry_on_failure
        Specified Cutover Time Window: 30
                Destination Aggregate: aggr2
                     Destination Node: FsxId01234567890abcdef-03
                      Detailed Status: Transferring data: 12.23GB sent.
                  Percentage Complete: 1%
                           Move Phase: replicating
             Prior Issues Encountered: -
         Estimated Remaining Duration: 00:40:25
               Replication Throughput: 434.3MB/s
                     Duration of Move: 00:00:27
                     Source Aggregate: aggr1
                          Source Node: FsxId01234567890abcdef-01
                           Move State: healthy

    The output includes the estimated time remaining for the move in the Estimated Remaining Duration field. When the operation finishes, the same command shows completed in the Move Phase field.

You should ensure that each FlexGroup is evenly distributed across your aggregates, ideally with the recommended 8 constituents per aggregate. If you move one constituent volume to another aggregate for an otherwise balanced FlexGroup, you should in turn move another (less-utilized) constituent volume to the source aggregate to maintain balance.
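To compare aggregates using the statistics volume show output from the procedure above, you can sum the total operations of the constituents on each aggregate. The following Python sketch uses the Volume, Aggregate, and *Total Ops values from the example output; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def ops_per_aggregate(rows):
    """Sum total ops per aggregate from (volume, aggregate, total_ops) rows."""
    totals = defaultdict(int)
    for _volume, aggregate, total_ops in rows:
        totals[aggregate] += total_ops
    return dict(totals)


rows = [
    ("vol1__0007", "aggr1", 4078),
    ("vol1__0005", "aggr1", 4078),
    ("vol1__0003", "aggr1", 4077),
    ("vol1__0001", "aggr1", 4077),
    ("vol1__0008", "aggr2", 2314),
    ("vol1__0006", "aggr2", 2144),
    ("vol1__0002", "aggr2", 2183),
    ("vol1__0004", "aggr2", 2183),
]

totals = ops_per_aggregate(rows)
print(totals)  # {'aggr1': 16310, 'aggr2': 8824}
```

In this sample, aggr1 serves roughly 65% of the total operations, which is the kind of skew that moving (or swapping) constituent volumes is meant to reduce.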