Find keywords in arrays using regexp_like - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Find keywords in arrays using regexp_like

The following examples illustrate how to search a dataset for a keyword within an element inside an array, using the regexp_like function. It takes as an input a regular expression pattern to evaluate, or a list of terms separated by a pipe (|), evaluates the pattern, and determines if the specified string contains it.

The regular expression pattern needs to be contained within the string, and does not have to match it. To match the entire string, enclose the pattern with ^ at the beginning of it, and $ at the end, such as '^pattern$'.

Consider an array of sites containing their host name, and a flaggedActivity element. This element includes an ARRAY, containing several MAP elements, each listing different popular keywords and their popularity count. Assume you want to find a particular keyword inside a MAP in this array.

To search this dataset for sites with a specific keyword, we use regexp_like instead of the similar SQL LIKE operator, because searching for a large number of keywords is more efficient with regexp_like.

Example 1: Using regexp_like

The query in this example uses the regexp_like function to search for terms 'politics|bigdata', found in values within arrays:

WITH dataset AS ( SELECT ARRAY[ CAST( ROW('aws.amazon.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']), MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']), MAP(ARRAY['term', 'count'], ARRAY['iot', '74']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('news.cnn.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['politics', '241']), MAP(ARRAY['term', 'count'], ARRAY['technology', '211']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']), MAP(ARRAY['term', 'count'], ARRAY['iot', '170']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('netflix.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']), MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']), MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']), MAP(ARRAY['term', 'count'], ARRAY['iot', '4']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ) ] AS items ), sites AS ( SELECT sites.hostname, sites.flaggedactivity FROM dataset, UNNEST(items) t(sites) ) SELECT hostname FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags) WHERE regexp_like(flags['term'], 'politics|bigdata') GROUP BY (hostname)

This query returns two sites:

+----------------+ | hostname | +----------------+ | aws.amazon.com | +----------------+ | news.cnn.com | +----------------+
Example 2: Using regexp_like

The query in the following example adds up the total popularity scores for the sites matching your search terms with the regexp_like function, and then orders them from highest to lowest.

WITH dataset AS ( SELECT ARRAY[ CAST( ROW('aws.amazon.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']), MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']), MAP(ARRAY['term', 'count'], ARRAY['iot', '74']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('news.cnn.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['politics', '241']), MAP(ARRAY['term', 'count'], ARRAY['technology', '211']), MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']), MAP(ARRAY['term', 'count'], ARRAY['iot', '170']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ), CAST( ROW('netflix.com', ROW(ARRAY[ MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']), MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']), MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']), MAP(ARRAY['term', 'count'], ARRAY['iot', '4']) ]) ) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) )) ) ] AS items ), sites AS ( SELECT sites.hostname, sites.flaggedactivity FROM dataset, UNNEST(items) t(sites) ) SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS total FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags) WHERE regexp_like(flags['term'], 'politics|bigdata') GROUP BY (hostname) ORDER BY total DESC

This query returns two sites:

+------------------------------------+ | hostname | terms | total | +----------------+-------------------+ | news.cnn.com | politics | 241 | +----------------+-------------------+ | aws.amazon.com | bigdata | 10 | +----------------+-------------------+