AWS::Kendra::DataSource WebCrawlerUrls
Specifies the seed (starting point) URLs or the sitemap URLs of the websites you want to crawl.
You can include website subdomains. You can list up to 100 seed URLs and up to three sitemap URLs.
You can only crawl websites that use the secure communication protocol, Hypertext Transfer Protocol Secure (HTTPS). If you receive an error when crawling a website, it could be that the website is blocked from crawling.
When selecting websites to index, you must adhere to the Amazon Acceptable Use Policy.
Syntax
To declare this entity in your Amazon CloudFormation template, use the following syntax:
JSON
{
  "SeedUrlConfiguration" : WebCrawlerSeedUrlConfiguration,
  "SiteMapsConfiguration" : WebCrawlerSiteMapsConfiguration
}
YAML
SeedUrlConfiguration:
  WebCrawlerSeedUrlConfiguration
SiteMapsConfiguration:
  WebCrawlerSiteMapsConfiguration
Properties
SeedUrlConfiguration
Configuration of the seed or starting point URLs of the websites you want to crawl.
You can choose to crawl only the website host names, or the website host names with subdomains, or the website host names with subdomains and other domains that the web pages link to.
You can list up to 100 seed URLs.
Required: No
Type: WebCrawlerSeedUrlConfiguration
Update requires: No interruption
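As an illustration of the seed URL options described above, the following fragment is a minimal sketch of how this property might be filled in under the Urls property of a WebCrawlerConfiguration. The example.com URLs are placeholders, not values from this page. WebCrawlerMode selects the crawl scope: HOST_ONLY for the host names only, SUBDOMAINS to include subdomains, or EVERYTHING to also follow links to other domains.

```yaml
# Sketch only: the URLs below are hypothetical placeholders.
Urls:
  SeedUrlConfiguration:
    # Up to 100 seed URLs; each must use HTTPS.
    SeedUrls:
      - https://www.example.com
      - https://docs.example.com/guide
    # One of HOST_ONLY, SUBDOMAINS, or EVERYTHING.
    WebCrawlerMode: SUBDOMAINS
```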
SiteMapsConfiguration
Configuration of the sitemap URLs of the websites you want to crawl.
Only URLs belonging to the same website host names are crawled. You can list up to three sitemap URLs.
Required: No
Type: WebCrawlerSiteMapsConfiguration
Update requires: No interruption
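A sitemap-based configuration could look like the following sketch, again placed under the Urls property of a WebCrawlerConfiguration. The sitemap URL is a hypothetical placeholder. It assumes the sitemap is served over HTTPS and that the pages it lists are on the same host.

```yaml
# Sketch only: the sitemap URL below is a hypothetical placeholder.
Urls:
  SiteMapsConfiguration:
    # Up to three sitemap URLs; only URLs on the same host names are crawled.
    SiteMaps:
      - https://www.example.com/sitemap.xml
```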