VARIABLE_COLUMN_LOG_PARSE - Amazon Kinesis Data Analytics SQL Reference
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

VARIABLE_COLUMN_LOG_PARSE

VARIABLE_COLUMN_LOG_PARSE(  <character-expression>, <columns>, <delimiter-string>  [ , <escape-string>, <quote-string> ] )  <columns> := <number of columns> | <list of columns>  <number of columns> := <numeric value expression>  <list of columns> := '<column description>[, ...]'  <column description> := <identifier> TYPE <data type> [ NOT NULL ]  <delimiter string> := <character-expression>  <escape-string> := <character-expression>  <quote-string> := '<begin quote character> [ <end quote character> ]'

VARIABLE_COLUMN_LOG_PARSE splits an input string (its first argument, <character-expression>) into fields separated by a delimiter character or delimiter string. Thus it handles comma-separated values or tab-separated values. It can be combined with FIXED_COLUMN_LOG_PARSE to handle something like maillog, where some fields are fixed-length and others are variable-length.

Note

Parsing of binary files is not supported.

The arguments <escape-string> and <quote-string> are optional. Specifying an <escape-string> allows the value of a field to contain an embedded delimiter. As a simple example, if the <delimiter-string> specified a comma, and the <escape-string> specified a backslash, then an input of "a,b' would be split into two fields "a" and "b", but an input of "a\,b" would result in a single field "a,b".

Since Amazon Kinesis Data Analytics supports Expressions and Literals, a tab can also be a delimiter, specified using a unicode escape, e.g., u&'\0009', which is a string consisting only of a tab character.

Specifying a <quote-string> is another way to hide an embedded delimiter. The <quote-string> should be a one or two character expression: the first is used as the <begin quote character> character; the second, if present, is used as the <end quote character> character. If only one character is supplied, it is used as both to begin and to end quoted strings. When the input includes a quoted string, that is, a string enclosed in the characters specified as <quote-string>, then that string appears in one field, even if it contains a delimiter.

Note that the <begin quote character> and <end quote character> are single characters and can be different. The <begin quote character> can be used to start and end the quoted string, or the <begin quote character> can start the quoted string and the <end quote character> used to end that quoted string.

When a list of columns <list of columns> is supplied as the second parameter <columns>, the column specifications (<column description>) for types DATE, TIME, and TIMESTAMP support a format parameter allowing the user to specify exact time component layout. The parser uses the Java class java.lang.SimpleDateFormat to parse the strings for those types. Date and Time Patterns gives a full description of timestamp format strings, with examples. The following is an example of a column definition with a format string:

   "name" TYPE TIMESTAMP 'dd/MMM/yyyy:HH:mm:ss'

By default, the output columns are named COLUMN1, COLUMN2, COLUMN3, etc., each of SQL data type VARCHAR(1024).