Filtering data transferred by Amazon DataSync
Amazon DataSync lets you apply filters if you only want to transfer a subset of data (such as specific files, folders, or objects). For example, if your source location includes temporary files that end with .tmp, you can create an exclude filter that keeps these files from making their way to the destination.
You can use a combination of exclude and include filters in the same transfer task. You can add filters when creating or starting your task by using the DataSync console or the CreateTask or StartTaskExecution operations.
Filtering terms, definitions, and syntax
Familiarize yourself with some filtering terms and definitions:
- Filter
  The whole string that makes up a particular filter (for example, *.tmp|*.temp or /folderA|/folderB). Filters are made up of patterns delimited with a pipe (|). You don't need a delimiter when you add patterns in the DataSync console because you add each pattern separately.
  Note
  Filters are case sensitive. For example, filter /folderA won't match /FolderA.
- Pattern
  A pattern within a filter. For example, *.tmp is a pattern that's part of the *.tmp|*.temp filter.
- Folders
  All filters are relative to the source location path. For example, suppose that you specify /my_source/ as the source path when you create your source location and task and specify the include filter /transfer_this/. In this case, DataSync transfers only the directory /my_source/transfer_this/ and its contents.
  To specify a folder directly under the source location, include a forward slash (/) in front of the folder name. In the example preceding, the pattern uses /transfer_this, not transfer_this.
  DataSync interprets the following patterns the same way and matches both the folder and its content.
  /dir
  /dir/
  When you are transferring data from or to an Amazon S3 bucket, DataSync treats the / character in the object key as the equivalent of a folder on a file system.
- Special characters
  Following are special characters for use with filtering.
  * (wildcard)
    A character used to match zero or more characters. For example, /movies_folder* matches both /movies_folder and /movies_folder1.
  | (pipe delimiter)
    A character used as a delimiter between patterns. It enables specifying multiple patterns, any of which can match the filter. For example, *.tmp|*.temp matches files ending with either tmp or temp.
    Note
    This delimiter isn't needed when you add patterns on the console because you add each pattern on a separate line.
  \ (backslash)
    A character used for escaping special characters (*, |, \) in a file or object name.
    A double backslash (\\) is required when a backslash is part of a file name. Similarly, \\\\ represents two consecutive backslashes in a file name.
    A backslash followed by a pipe (\|) is required when a pipe is part of a file name.
    A backslash (\) followed by any other character, or at the end of a pattern, is ignored.
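The escaping and delimiter rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of DataSync: the helper names escape_name and build_filter are hypothetical, and the sketch assumes that you escape only literal names, leaving intentional wildcards untouched.

```python
from typing import Iterable


def escape_name(name: str) -> str:
    """Escape the filter special characters (backslash, pipe, asterisk) in a literal name."""
    # Escape backslashes first so the backslashes added below are not doubled again.
    escaped = name.replace("\\", "\\\\")
    escaped = escaped.replace("|", "\\|")
    escaped = escaped.replace("*", "\\*")
    return escaped


def build_filter(patterns: Iterable[str]) -> str:
    """Join already-escaped patterns into one filter string using the pipe delimiter."""
    return "|".join(patterns)


# A file whose name really contains a pipe, so the pipe must be escaped.
literal_pattern = "/" + escape_name("a|b.txt")
# A wildcard pattern: the * is meant as a wildcard, so it is left unescaped.
wildcard_pattern = "*.tmp"

exclude_filter = build_filter([literal_pattern, wildcard_pattern])
print(exclude_filter)
```

Keeping the escaping step separate from the joining step avoids accidentally escaping the pipes that act as delimiters.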
Excluding data from a transfer
Exclude filters define files, folders, and objects that are excluded when you transfer files from a source to a destination location. You can configure these filters when you create, edit, or start a task.
Folders excluded by default
DataSync automatically ignores some folders (or directories) during a data transfer, including folders commonly used for snapshots and to help facilitate the transfer. These are the folders that DataSync excludes by default:
- /.snapshot
  This directory is typically used for storing point-in-time snapshots of a storage system's files or directories.
- /.aws-datasync and /.awssync
  DataSync creates these directories in your location to help facilitate your transfer.
- /.zfs
  You might see this directory for Amazon FSx for OpenZFS locations.
Adding exclude filters
To create a transfer task with an exclude filter in the DataSync console, specify a list of patterns in the Data transfer configuration section under Exclude patterns. For example, to exclude the temporary folders named temp or tmp, you can specify */temp in the Exclude patterns text box, choose Add pattern, and then specify */tmp in the second text box. To add more patterns to the filter, choose Add pattern.
When you're using the Amazon Command Line Interface (Amazon CLI), single quotation marks (') are required around the filter and a | (pipe) is used as a delimiter. For this example, you would specify '*/temp|*/tmp':
aws datasync create-task \
    --source-location-arn 'arn:aws-cn:datasync:region:account-id:location/location-id' \
    --destination-location-arn 'arn:aws-cn:datasync:region:account-id:location/location-id' \
    --cloud-watch-log-group-arn 'arn:aws-cn:logs:region:account-id:log-group:your-log-group' \
    --excludes FilterType=SIMPLE_PATTERN,Value='*/temp|*/tmp'
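If you generate this command from a script, building the argument list in code sidesteps shell quoting of the pipe character. The following Python sketch only assembles and prints the command. It assumes the CLI's JSON form for the --excludes value instead of the shorthand shown above, the function name create_task_command is hypothetical, and the optional log group argument is omitted for brevity.

```python
import json
import shlex


def create_task_command(source_arn, destination_arn, exclude_patterns):
    """Return the argv list for `aws datasync create-task` with one exclude filter."""
    # One filter rule whose Value is the pipe-delimited list of patterns.
    excludes = json.dumps(
        [{"FilterType": "SIMPLE_PATTERN", "Value": "|".join(exclude_patterns)}]
    )
    return [
        "aws", "datasync", "create-task",
        "--source-location-arn", source_arn,
        "--destination-location-arn", destination_arn,
        "--excludes", excludes,
    ]


# Placeholder ARNs copied from the example above; replace them with real values.
argv = create_task_command(
    "arn:aws-cn:datasync:region:account-id:location/location-id",
    "arn:aws-cn:datasync:region:account-id:location/location-id",
    ["*/temp", "*/tmp"],
)

# Print a shell-safe rendering of the command for review.
print(shlex.join(argv))

# To run it, pass the list to subprocess.run(argv, check=True).
# Passing a list (no shell) means the | in the filter is never seen by a shell.
```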
After you create a task, you can still add or remove patterns from the exclude filter. Your changes are applied to future runs of the task.
When you start a task, you can also modify the exclude filter patterns.
Including data in a transfer
Include filters define files, folders, and objects that DataSync transfers when you run a task. You can configure include filters when you create, edit, or start a task.
To create a task with an include filter, choose the Specific files and folders option, and then specify a list of patterns to include under Include patterns.
DataSync scans and transfers only files and folders that match the include filters.
For example, to include a subset of your source folders, you might specify /important_folder_1|/important_folder_2.
After you have created a task, you can edit the task configuration to add or remove patterns from the include filter. Any changes that you make are applied to future executions of the task.
When you run a task, you can modify the include filter patterns by using the Start with overrides option. Any changes that you make are applied only to that execution of the task.
You can also use the Amazon CLI to create or edit an include filter. The following example shows the CLI command. Take note of the quotation marks (') around the filter and the | (pipe) that's used as a delimiter.
aws datasync start-task-execution \
    --task-arn 'arn:aws-cn:datasync:region:account-id:task/task-id' \
    --includes FilterType=SIMPLE_PATTERN,Value='/important_folder1|/important_folder2'
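The same override can be expressed as a request for the StartTaskExecution operation. The Python sketch below only builds the request dictionary. The helper name start_execution_request is hypothetical, and sending the request with boto3 is an assumption about your environment, so that part is left as a comment.

```python
def start_execution_request(task_arn, include_patterns):
    """Build keyword arguments for a StartTaskExecution call with one include filter."""
    return {
        "TaskArn": task_arn,
        "Includes": [
            {
                "FilterType": "SIMPLE_PATTERN",
                # Patterns are joined with the pipe delimiter into a single filter.
                "Value": "|".join(include_patterns),
            }
        ],
    }


# Placeholder task ARN copied from the CLI example above.
request = start_execution_request(
    "arn:aws-cn:datasync:region:account-id:task/task-id",
    ["/important_folder1", "/important_folder2"],
)
print(request)

# To send it (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("datasync").start_task_execution(**request)
```

Because the filter is passed only to this execution, the task's saved include filter is left unchanged.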
Note
Include filters support the wildcard (*) character only as the rightmost character in a pattern. For example, /documents*|/code* is supported, but *.txt isn't.
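You can check this restriction locally before submitting a filter. The following Python sketch is a rough client-side check based only on the rule stated in the note, not DataSync's own validation. The function names are hypothetical, and the sketch assumes a backslash escapes the character that follows it.

```python
def split_patterns(filter_string):
    """Split a filter on unescaped pipes, keeping escape sequences intact."""
    patterns, current, i = [], [], 0
    while i < len(filter_string):
        ch = filter_string[i]
        if ch == "\\" and i + 1 < len(filter_string):
            # Keep the backslash and the escaped character together.
            current.append(filter_string[i:i + 2])
            i += 2
            continue
        if ch == "|":
            patterns.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    patterns.append("".join(current))
    return patterns


def wildcard_positions(pattern):
    """Return the indexes of unescaped * characters in a pattern."""
    positions, i = [], 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # Skip the escaped character.
            continue
        if pattern[i] == "*":
            positions.append(i)
        i += 1
    return positions


def unsupported_include_patterns(filter_string):
    """List the patterns that use * anywhere other than as the last character."""
    bad = []
    for pattern in split_patterns(filter_string):
        last = len(pattern) - 1
        if any(pos != last for pos in wildcard_positions(pattern)):
            bad.append(pattern)
    return bad


print(unsupported_include_patterns("/documents*|/code*"))
print(unsupported_include_patterns("/documents*|*.txt"))
```

An empty list means every pattern passes this local check. DataSync might still reject the filter for other reasons.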
Example filters
The following examples show common filters you can use with DataSync.
Note
There are limits to how many characters you can use in a filter. For more information, see DataSync task quotas.
Exclude some folders from your source location
In some cases, you might want to exclude folders in your source location so that they aren't copied to your destination location. For example, if you have temporary work-in-progress folders, you can use something like the following filter:
*/.temp
To exclude folders with similar content (such as /reports2021 and /reports2022), you can use an exclude filter like the following:
/reports*
To exclude folders at any level in the file hierarchy, you can use an exclude filter like the following.
*/folder-to-exclude-1|*/folder-to-exclude-2
To exclude folders at the top level of the source location, you can use an exclude filter like the following.
/top-level-folder-to-exclude-1|/top-level-folder-to-exclude-2
Include a subset of the folders on your source location
In some cases, your source location might be a large share and you need to transfer a subset of the folders under the root. To include specific folders, start a task execution with an include filter like the following.
/folder-to-transfer/*
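Because filters are relative to the source location path, full paths have to be converted before they can be used as patterns. The following Python sketch shows one way to do that. The helper name to_pattern and the folder /my_source/reports/2022 are hypothetical, and the sketch does not escape special characters in folder names.

```python
import posixpath


def to_pattern(source_root, absolute_path):
    """Convert a full path under the source location into a filter pattern."""
    relative = posixpath.relpath(absolute_path, start=source_root)
    # Reject the source root itself and anything outside it.
    if relative in (".", "..") or relative.startswith("../"):
        raise ValueError(f"{absolute_path} is not inside {source_root}")
    # A leading slash anchors the pattern directly under the source location.
    return "/" + relative


source_root = "/my_source/"
folders = ["/my_source/transfer_this/", "/my_source/reports/2022"]

include_filter = "|".join(to_pattern(source_root, folder) for folder in folders)
print(include_filter)
```

The trailing slash is dropped during conversion, which is harmless here because /dir and /dir/ are interpreted the same way.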
Exclude specific file types
To exclude certain file types from the transfer, you can create a task execution with an exclude filter such as *.temp.
Transfer individual files you specify
To transfer a list of individual files, start a task execution with an include filter like the following:
"/folder/subfolder/file1.txt|/folder/subfolder/file2.txt"