Data deduplication
Amazon FSx supports the use of Microsoft Data Deduplication to identify and eliminate redundant data. Large datasets often contain redundant data, which increases data storage costs. For example, with user file shares, multiple users can store many copies or versions of the same file. With software development shares, many binaries remain unchanged from build to build.
You can reduce your data storage costs by turning on data deduplication for your file system. Data deduplication reduces or eliminates redundant data by storing duplicated portions of the dataset only once. Data compression is enabled by default when you use data deduplication, further reducing the amount of data storage by compressing the data after deduplication. Data deduplication runs as a background process that continually and automatically scans and optimizes your file system, and it is transparent to your users and connected clients.
The storage savings that you can achieve with data deduplication depend on the nature of your dataset, including how much duplication exists across files. Typical savings average 50–60 percent for general-purpose file shares. Within shares, savings range from 30–50 percent for user documents to 70–80 percent for software development datasets. You can measure potential deduplication savings using the `Measure-FSxDedupFileMetadata` command described below.
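As a sketch of how you might estimate savings before deleting data, the following runs `Measure-FSxDedupFileMetadata` through the file system's Windows Remote PowerShell endpoint. The endpoint DNS name and folder paths are placeholders; substitute your own.

```powershell
# Estimate reclaimable space for two example folders by running the command
# on the file system's remote management endpoint (placeholder hostname).
Invoke-Command -ComputerName amznfsxzzzzzzzz.corp.example.com `
    -ConfigurationName FSxRemoteAdmin `
    -ScriptBlock { Measure-FSxDedupFileMetadata -Path "C:\share\reports","C:\share\archive" }
```

The output reports how much of the data in those folders is unique, which approximates the space you would reclaim by deleting them.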
You can also customize data deduplication to meet your specific storage needs. For example, you can configure deduplication to run only on certain file types, or you can create a custom job schedule. Because deduplication jobs can consume file server resources, we recommend monitoring the status of your deduplication jobs using the `Get-FSxDedupStatus` command described below.
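For example, you can check job status and current savings remotely as follows (the endpoint name is a placeholder for your file system's Windows Remote PowerShell endpoint):

```powershell
# Retrieve deduplication status, including savings and the completion
# status of the most recent jobs, from the remote endpoint (placeholder).
Invoke-Command -ComputerName amznfsxzzzzzzzz.corp.example.com `
    -ConfigurationName FSxRemoteAdmin `
    -ScriptBlock { Get-FSxDedupStatus } | Format-List
```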
For more information about data deduplication, see the Microsoft Understanding Data Deduplication documentation.
Note
See Best practices when using data deduplication before you enable the feature. If you encounter issues with getting data deduplication jobs to run successfully, see Troubleshooting data deduplication.
Warning
We recommend that you do not run certain Robocopy commands with data deduplication, because these commands can impact the data integrity of the Chunk Store. For more information, see the Microsoft Data Deduplication interoperability documentation.
Best practices when using data deduplication
Here are some best practices for using Data Deduplication:
- Schedule Data Deduplication jobs to run when your file system is idle: The default schedule includes a weekly `GarbageCollection` job at 2:45 UTC on Saturdays. It can take multiple hours to complete if you have a large amount of data churn on your file system. If this time isn't ideal for your workload, schedule this job to run at a time when you expect low traffic on your file system.
- Configure sufficient throughput capacity for Data Deduplication to complete: Higher throughput capacities provide higher levels of memory. Microsoft recommends having 1 GB of memory per 1 TB of logical data to run Data Deduplication. Use the Amazon FSx performance table to determine the memory that's associated with your file system's throughput capacity, and ensure that the memory resources are sufficient for the size of your data.
- Customize Data Deduplication settings to meet your specific storage needs and reduce performance requirements: You can constrain the optimization to run on specific file types or folders, or set a minimum file size and age for optimization. To learn more, see Data deduplication.
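As one possible sketch of the last practice, the following constrains optimization by excluding file types that are already compressed and by skipping recently modified files. The endpoint name and parameter values are example assumptions; tune them for your workload.

```powershell
# Exclude already-compressed file types and only optimize files that have
# not changed for 3 days (example values; endpoint hostname is a placeholder).
Invoke-Command -ComputerName amznfsxzzzzzzzz.corp.example.com `
    -ConfigurationName FSxRemoteAdmin `
    -ScriptBlock { Set-FSxDedupConfiguration -ExcludeFileType zip,mp4 -MinimumFileAgeDays 3 }
```

Excluding compressed media formats avoids spending CPU and memory on data that deduplicates poorly.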
Managing data deduplication
You can manage data deduplication on your file system using the Amazon FSx CLI for remote management on PowerShell. To learn how to use this CLI, see Using the Amazon FSx CLI for PowerShell.
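The commands in the table below run over PowerShell remoting. As a sketch (the endpoint DNS name is a placeholder for your file system's Windows Remote PowerShell endpoint), you can open an interactive session and then run the deduplication commands directly:

```powershell
# Open an interactive remote session against the FSx remote management
# endpoint (placeholder hostname), then run commands such as Enable-FSxDedup.
Enter-PSSession -ComputerName amznfsxzzzzzzzz.corp.example.com `
    -ConfigurationName FSxRemoteAdmin
# Run Exit-PSSession when finished to leave the session.
```

Alternatively, wrap a single command in `Invoke-Command` with the same `-ConfigurationName FSxRemoteAdmin` parameter.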
Following are commands that you can use for data deduplication.
| Data deduplication command | Description |
| --- | --- |
| `Enable-FSxDedup` | Enables data deduplication on the file share. Data compression after deduplication is enabled by default when you enable data deduplication. |
| `Disable-FSxDedup` | Disables data deduplication on the file share. |
| `Get-FSxDedupConfiguration` | Retrieves deduplication configuration information, including minimum file size and age for optimization, compression settings, and excluded file types and folders. |
| `Set-FSxDedupConfiguration` | Changes the deduplication configuration settings, including minimum file size and age for optimization, compression settings, and excluded file types and folders. |
| `Get-FSxDedupStatus` | Retrieves the deduplication status, including read-only properties that describe optimization savings and status on the file system, and the times and completion status of the last jobs on the file system. |
| `Get-FSxDedupMetadata` | Retrieves deduplication optimization metadata. |
| `Update-FSxDedupStatus` | Computes and retrieves updated data deduplication savings information. |
| `Measure-FSxDedupFileMetadata` | Measures and retrieves the potential storage space that you can reclaim on your file system if you delete a group of folders. Files often have chunks that are shared across other folders, and the deduplication engine calculates which chunks are unique and would be deleted. |
| `Get-FSxDedupSchedule` | Retrieves deduplication schedules that are currently defined. |
| `New-FSxDedupSchedule` | Creates and customizes a data deduplication schedule. |
| `Set-FSxDedupSchedule` | Changes configuration settings for existing data deduplication schedules. |
| `Remove-FSxDedupSchedule` | Deletes a deduplication schedule. |
| `Get-FSxDedupJob` | Gets status and information for all currently running or queued deduplication jobs. |
| `Stop-FSxDedupJob` | Cancels one or more specified data deduplication jobs. |
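To illustrate the scheduling commands, the following creates a custom optimization schedule. The schedule name, days, start time, and duration are example values, and the endpoint hostname is a placeholder.

```powershell
# Create a custom optimization job (example name and times) that runs on
# weekend days, starting at 09:00 with a 9-hour duration cap.
Invoke-Command -ComputerName amznfsxzzzzzzzz.corp.example.com `
    -ConfigurationName FSxRemoteAdmin `
    -ScriptBlock {
        New-FSxDedupSchedule -Name "CustomOptimization" -Type Optimization `
            -Days Saturday,Sunday -Start 09:00 -DurationHours 9
    }
```

You can verify the result with `Get-FSxDedupSchedule` and remove the schedule later with `Remove-FSxDedupSchedule`.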
The online help for each command provides a reference of all command options. To access this help, run the command with `-?`, for example `Enable-FSxDedup -?`.