

# Defining and managing classifiers
<a name="add-classifier"></a>

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was. 

Amazon Glue provides a set of built-in classifiers, but you can also create custom classifiers. Amazon Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, Amazon Glue might also invoke built-in classifiers. If a classifier returns `certainty=1.0` during processing, it indicates that it's 100 percent certain that it can create the correct schema. Amazon Glue then uses the output of that classifier. 

If no classifier returns `certainty=1.0`, Amazon Glue uses the output of the classifier that has the highest certainty. If no classifier returns a certainty greater than `0.0`, Amazon Glue returns the default classification string of `UNKNOWN`.
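The selection rules above can be sketched in a few lines of Python (a minimal illustration, not Amazon Glue's actual implementation; the classifier names and certainty values are hypothetical):

```python
def select_classification(results):
    """Pick a classification from ordered (name, certainty) classifier results.

    Mirrors the documented rules: a certainty of 1.0 wins immediately
    (classifiers run in order); otherwise the highest certainty wins;
    if nothing exceeds 0.0, the result is UNKNOWN.
    """
    best_name, best_certainty = "UNKNOWN", 0.0
    for name, certainty in results:
        if certainty == 1.0:
            return name  # 100 percent certain: use this classifier's output
        if certainty > best_certainty:
            best_name, best_certainty = name, certainty
    return best_name

# Hypothetical results from an ordered classifier list:
print(select_classification([("grok-logs", 0.0), ("json", 0.8), ("csv", 0.5)]))  # json
print(select_classification([("xml", 0.0), ("ion", 0.0)]))  # UNKNOWN
```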

## When do I use a classifier?
<a name="classifier-when-used"></a>

You use classifiers when you crawl a data store to define metadata tables in the Amazon Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can't recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data. 

 For more information about creating a classifier using the Amazon Glue console, see [Creating classifiers using the Amazon Glue console](console-classifiers.md). 

## Custom classifiers
<a name="classifier-defining"></a>

The output of a classifier includes a string that indicates the file's classification or format (for example, `json`) and the schema of the file. For custom classifiers, you define the logic for creating the schema based on the type of classifier. Classifier types include defining schemas based on grok patterns, XML tags, and JSON paths.

If you change a classifier definition, any data that was previously crawled using the classifier is not reclassified. A crawler keeps track of previously crawled data. New data is classified with the updated classifier, which might result in an updated schema. If the schema of your data has evolved, update the classifier to account for any schema changes when your crawler runs. To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier. 

For more information about creating custom classifiers in Amazon Glue, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

**Note**  
If your data format is recognized by one of the built-in classifiers, you don't need to create a custom classifier.

## Built-in classifiers
<a name="classifier-built-in"></a>

 Amazon Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.

If Amazon Glue doesn't find a custom classifier that fits the input data format with 100 percent certainty, it invokes the built-in classifiers in the order shown in the following table. The built-in classifiers return a result to indicate whether the format matches (`certainty=1.0`) or does not match (`certainty=0.0`). The first classifier that has `certainty=1.0` provides the classification string and schema for a metadata table in your Data Catalog.


| Classifier type | Classification string | Notes | 
| --- | --- | --- | 
| Apache Avro | avro | Reads the schema at the beginning of the file to determine format. | 
| Apache ORC | orc | Reads the file metadata to determine format. | 
| Apache Parquet | parquet | Reads the schema at the end of the file to determine format. | 
| JSON | json | Reads the beginning of the file to determine format. | 
| Binary JSON | bson | Reads the beginning of the file to determine format. | 
| XML | xml | Reads the beginning of the file to determine format. Amazon Glue determines the table schema based on XML tags in the document.  For information about creating a custom XML classifier to specify rows in the document, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml).  | 
| Amazon Ion | ion | Reads the beginning of the file to determine format. | 
| Combined Apache log | combined\_apache | Determines log formats through a grok pattern. | 
| Apache log | apache | Determines log formats through a grok pattern. | 
| Linux kernel log | linux\_kernel | Determines log formats through a grok pattern. | 
| Microsoft log | microsoft\_log | Determines log formats through a grok pattern. | 
| Ruby log | ruby\_logger | Reads the beginning of the file to determine format. | 
| Squid 3.x log | squid | Reads the beginning of the file to determine format. | 
| Redis monitor log | redismonlog | Reads the beginning of the file to determine format. | 
| Redis log | redislog | Reads the beginning of the file to determine format. | 
| CSV | csv | Checks for the following delimiters: comma (,), pipe (\|), tab (\\t), semicolon (;), and Ctrl-A (\\u0001). Ctrl-A is the Unicode control character for Start Of Heading. | 
| Amazon Redshift | redshift | Uses JDBC connection to import metadata. | 
| MySQL | mysql | Uses JDBC connection to import metadata. | 
| PostgreSQL | postgresql | Uses JDBC connection to import metadata. | 
| Oracle database | oracle | Uses JDBC connection to import metadata. | 
| Microsoft SQL Server | sqlserver | Uses JDBC connection to import metadata. | 
| Amazon DynamoDB | dynamodb | Reads data from the DynamoDB table. | 

Files in the following compressed formats can be classified:
+ ZIP (supported for archives containing only a single file). Note that ZIP archives are not well supported in other services, because an archive can contain multiple files.
+ BZIP
+ GZIP
+ LZ4
+ Snappy (supported for both standard and Hadoop native Snappy formats)

### Built-in CSV classifier
<a name="classifier-builtin-rules"></a>

The built-in CSV classifier parses CSV file contents to determine the schema for an Amazon Glue table. This classifier checks for the following delimiters:
+ Comma (,)
+ Pipe (\|)
+ Tab (\\t)
+ Semicolon (;)
+ Ctrl-A (\\u0001)

  Ctrl-A is the Unicode control character for `Start Of Heading`.
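The delimiter check can be approximated with a short sketch (an illustration only; the real classifier's heuristics are more involved):

```python
CANDIDATE_DELIMITERS = [",", "|", "\t", ";", "\u0001"]  # Ctrl-A is \u0001

def detect_delimiter(lines):
    """Return the first candidate delimiter that splits every line into the
    same number of columns (at least two), or None if no candidate qualifies."""
    for delim in CANDIDATE_DELIMITERS:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts.pop() >= 1:
            return delim
    return None

rows = ["id;name;state", "1;Alaska;ak", "2;Alabama;al"]
print(detect_delimiter(rows))  # ;
```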

To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as `col1`, `col2`, `col3`, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
+ Every column in a potential header parses as a STRING data type.
+ Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
+ Every column in a potential header must meet the Amazon Glue `regex` requirements for a column name.
+ The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
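These heuristics can be approximated in a short sketch (a hypothetical simplification; the built-in classifier's actual checks differ in detail, and the column-name pattern below is an assumption):

```python
import re

COLUMN_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")  # assumed name rule

def looks_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def infer_has_header(first_row, data_rows):
    """Approximate the documented header heuristics for a parsed CSV."""
    # Every potential header cell parses only as a STRING (not a number).
    if any(looks_numeric(cell) for cell in first_row):
        return False
    # Every cell except possibly the last is under 150 characters.
    if any(len(cell) >= 150 for cell in first_row[:-1]):
        return False
    # Every cell satisfies the (assumed) column-name pattern.
    if not all(COLUMN_NAME_RE.match(cell) for cell in first_row):
        return False
    # The header must differ from the data: some data cell must be non-STRING.
    return any(looks_numeric(cell) for row in data_rows for cell in row)

print(infer_has_header(["id", "name"], [["1", "Alaska"], ["2", "Alabama"]]))  # True
print(infer_has_header(["ak", "Alaska"], [["al", "Alabama"]]))  # False: all STRING
```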

**Note**  
If the built-in CSV classifier does not create your Amazon Glue table as you want, you might be able to use one of the following alternatives:  
+ Change the column names in the Data Catalog, set the `SchemaChangePolicy` to LOG, and set the partition output configuration to `InheritFromTable` for future crawler runs.
+ Create a custom grok classifier to parse the data and assign the columns that you want.
+ The built-in CSV classifier creates tables referencing the `LazySimpleSerDe` as the serialization library, which is a good choice for type inference. However, if the CSV data contains quoted strings, edit the table definition and change the SerDe library to `OpenCSVSerDe`. Adjust any inferred types to STRING, set the `SchemaChangePolicy` to LOG, and set the partition output configuration to `InheritFromTable` for future crawler runs. For more information about SerDe libraries, see [SerDe Reference](https://docs.amazonaws.cn/athena/latest/ug/serde-reference.html) in the Amazon Athena User Guide.

# Writing custom classifiers for diverse data formats
<a name="custom-classifier"></a>

You can provide a custom classifier to classify your data in Amazon Glue. You can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or comma-separated values (CSV). An Amazon Glue crawler calls a custom classifier. If the classifier recognizes the data, it returns the classification and schema of the data to the crawler. You might need to define a custom classifier if your data doesn't match any built-in classifiers, or if you want to customize the tables that are created by the crawler.

 For more information about creating a classifier using the Amazon Glue console, see [Creating classifiers using the Amazon Glue console](console-classifiers.md). 

Amazon Glue runs custom classifiers before built-in classifiers, in the order you specify. When a crawler finds a classifier that matches the data, the classification string and schema are used in the definition of tables that are written to your Amazon Glue Data Catalog.

**Topics**
+ [Writing grok custom classifiers](#custom-classifier-grok)
+ [Writing XML custom classifiers](#custom-classifier-xml)
+ [Writing JSON custom classifiers](#custom-classifier-json)
+ [Writing CSV custom classifiers](#custom-classifier-csv)

## Writing grok custom classifiers
<a name="custom-classifier-grok"></a>

Grok is a tool that is used to parse textual data given a matching pattern. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. Amazon Glue uses grok patterns to infer the schema of your data. When a grok pattern matches your data, Amazon Glue uses the pattern to determine the structure of your data and map it into fields.

Amazon Glue provides many built-in patterns, or you can define your own. You can create a grok pattern using built-in patterns and custom patterns in your custom classifier definition. You can tailor a grok pattern to classify custom text file formats.

**Note**  
Amazon Glue grok custom classifiers use the `GrokSerDe` serialization library for tables created in the Amazon Glue Data Catalog. If you are using the Amazon Glue Data Catalog with Amazon Athena, Amazon EMR, or Redshift Spectrum, check the documentation for those services for information about support for the `GrokSerDe`. Currently, you might encounter problems querying tables created with the `GrokSerDe` from Amazon EMR and Redshift Spectrum.

The following is the basic syntax for the components of a grok pattern:

```
%{PATTERN:field-name}
```

Data that matches the named `PATTERN` is mapped to the `field-name` column in the schema, with a default data type of `string`. Optionally, the data type for the field can be cast to `byte`, `boolean`, `double`, `short`, `int`, `long`, or `float` in the resulting schema.

```
%{PATTERN:field-name:data-type}
```

For example, to cast a `num` field to an `int` data type, you can use this pattern: 

```
%{NUMBER:num:int}
```

Patterns can be composed of other patterns. For example, you can have a pattern for a `SYSLOG` timestamp that is defined by patterns for month, day of the month, and time (for example, `Feb 1 06:25:43`). For this data, you might define the following pattern:

```
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
```
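Pattern composition can be demonstrated by expanding `%{NAME}` references into plain regex (a sketch using Python's `re` module; the pattern set below is a small, abridged stand-in for the built-in library):

```python
import re

# A small subset of built-in patterns (definitions abridged for illustration).
PATTERNS = {
    "MONTH": r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b",
    "MONTHDAY": r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])",
    "TIME": r"(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:[0-5]?[0-9]|60))?",
    "SYSLOGTIMESTAMP": r"%{MONTH} +%{MONTHDAY} %{TIME}",
}

def expand(pattern):
    """Recursively replace %{NAME} references with their regex definitions."""
    ref = re.compile(r"%\{(\w+)\}")
    while ref.search(pattern):
        pattern = ref.sub(lambda m: PATTERNS[m.group(1)], pattern)
    return pattern

regex = re.compile(expand("%{SYSLOGTIMESTAMP}"))
print(bool(regex.fullmatch("Feb 1 06:25:43")))  # True
```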

**Note**  
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also, line breaks within a pattern are not supported.

### Custom values for grok classifier
<a name="classifier-values"></a>

When you define a grok classifier, you supply the following values to create the custom classifier.

**Name**  
Name of the classifier.

**Classification**  
The text string that is written to describe the format of the data that is classified; for example, `special-logs`.

**Grok pattern**  
The set of patterns that are applied to the data store to determine whether there is a match. These patterns are from Amazon Glue [built-in patterns](#classifier-builtin-patterns) and any custom patterns that you define.  
The following is an example of a grok pattern:  

```
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
```
When the data matches `TIMESTAMP_ISO8601`, a schema column `timestamp` is created. The behavior is similar for the other named patterns in the example.

**Custom patterns**  
Optional custom patterns that you define. These patterns are referenced by the grok pattern that classifies your data, and each custom component pattern must be on a separate line. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used to define the pattern.   
The following is an example of using custom patterns:  

```
CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*
```
The first custom named pattern, `CRAWLERLOGLEVEL`, is a match when the data matches one of the enumerated strings. The second custom pattern, `MESSAGEPREFIX`, tries to match a message prefix string.
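The combined effect of the grok pattern and the custom patterns can be sketched by translating each `%{PATTERN:field}` term into a named regex group (illustrative only; the timestamp regex below is a simplified stand-in for `TIMESTAMP_ISO8601`, and the sample log line is hypothetical):

```python
import re

# Simplified stand-ins for the patterns used in the example classifier.
LOG_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # TIMESTAMP_ISO8601 (simplified)
    r" \[(?P<message_prefix>.*-.*-.*-.*-.*)\]"             # MESSAGEPREFIX
    r" (?P<loglevel>BENCHMARK|ERROR|WARN|INFO|TRACE)"      # CRAWLERLOGLEVEL
    r" : (?P<message>.*)"                                  # GREEDYDATA
)

line = "2023-04-01T06:25:43 [a1-b2-c3-d4-e5] INFO : Crawl started"
match = LOG_RE.match(line)
print(match.groupdict())
# {'timestamp': '2023-04-01T06:25:43', 'message_prefix': 'a1-b2-c3-d4-e5',
#  'loglevel': 'INFO', 'message': 'Crawl started'}
```

Each named group becomes a schema column, which is how the `timestamp`, `message_prefix`, `loglevel`, and `message` columns in the example arise.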

Amazon Glue keeps track of the creation time, last update time, and version of your classifier.

### Built-in patterns
<a name="classifier-builtin-patterns"></a>

Amazon Glue provides many common patterns that you can use to build a custom classifier. You add a named pattern to the `grok pattern` in a classifier definition.

The following list consists of a line for each pattern. In each line, the pattern name is followed by its definition. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used in defining the pattern.

```
# Amazon Glue built-in patterns
 USERNAME [a-zA-Z0-9._-]+
 USER %{USERNAME:UNWANTED}
 INT (?:[+-]?(?:[0-9]+))
 BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
 NUMBER (?:%{BASE10NUM:UNWANTED})
 BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
 BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
 BOOLEAN (?i)(true|false)
 
 POSINT \b(?:[1-9][0-9]*)\b
 NONNEGINT \b(?:[0-9]+)\b
 WORD \b\w+\b
 NOTSPACE \S+
 SPACE \s*
 DATA .*?
 GREEDYDATA .*
 #QUOTEDSTRING (?:(?<!\\)(?:"(?:\\.|[^\\"])*"|(?:'(?:\\.|[^\\'])*')|(?:`(?:\\.|[^\\`])*`)))
 QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
 UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
 
 # Networking
 MAC (?:%{CISCOMAC:UNWANTED}|%{WINDOWSMAC:UNWANTED}|%{COMMONMAC:UNWANTED})
 CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
 WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
 COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
 IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
 IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
 IP (?:%{IPV6:UNWANTED}|%{IPV4:UNWANTED})
 HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62}))*(\.?|\b)
 HOST %{HOSTNAME:UNWANTED}
 IPORHOST (?:%{HOSTNAME:UNWANTED}|%{IP:UNWANTED})
 HOSTPORT (?:%{IPORHOST}:%{POSINT:PORT})
 
 # paths
 PATH (?:%{UNIXPATH}|%{WINPATH})
 UNIXPATH (?>/(?>[\w_%!$@:.,~-]+|\\.)*)+
 #UNIXPATH (?<![\w\/])(?:/[^\/\s?*]*)+
 TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
 WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
 URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
 URIHOST %{IPORHOST}(?::%{POSINT:port})?
 # uripath comes loosely from RFC1738, but mostly from what Firefox
 # doesn't turn into %XX
 URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
 #URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
 URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]]*
 URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
 URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?
 
 # Months: January, Feb, 3, 03, 12, December
 MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
 MONTHNUM (?:0?[1-9]|1[0-2])
 MONTHNUM2 (?:0[1-9]|1[0-2])
 MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
 
 # Days: Monday, Tue, Thu, etc...
 DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
 
 # Years?
 YEAR (?>\d\d){1,2}
 # Time: HH:MM:SS
 #TIME \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?
 # TIME %{POSINT<24}:%{POSINT<60}(?::%{POSINT<60}(?:\.%{POSINT})?)?
 HOUR (?:2[0123]|[01]?[0-9])
 MINUTE (?:[0-5][0-9])
 # '60' is a leap second in most time standards and thus is valid.
 SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
 TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
 # datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
 DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
 DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
 DATESTAMP_US %{DATE_US}[- ]%{TIME}
 DATESTAMP_EU %{DATE_EU}[- ]%{TIME}
 ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
 ISO8601_SECOND (?:%{SECOND}|60)
 TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
 TZ (?:[PMCE][SD]T|UTC)
 DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
 DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
 DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
 DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}
 CISCOTIMESTAMP %{MONTH} %{MONTHDAY} %{TIME}
 
 # Syslog Dates: Month Day HH:MM:SS
 SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
 PROG (?:[\w._/%-]+)
 SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
 SYSLOGHOST %{IPORHOST}
 SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
 HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}
 
 # Shortcuts
 QS %{QUOTEDSTRING:UNWANTED}
 
 # Log formats
 SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
 
 MESSAGESLOG %{SYSLOGBASE} %{DATA}
 
 COMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
 COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}
 COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth} \[%{HTTPDATE:timestamp;date;dd/MMM/yyyy:HH:mm:ss Z}\] "(?:%{WORD:verb;string} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion;float})?|%{DATA:rawrequest})" %{NUMBER:response;int} (?:%{NUMBER:bytes;long}|-)
 
 
 # Log Levels
 LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
```

## Writing XML custom classifiers
<a name="custom-classifier-xml"></a>

XML defines the structure of a document with the use of tags in the file. With an XML custom classifier, you can specify the tag name used to define a row.

### Custom classifier values for an XML classifier
<a name="classifier-values-xml"></a>

When you define an XML classifier, you supply the following values to Amazon Glue to create the classifier. The classification field of this classifier is set to `xml`.

**Name**  
Name of the classifier.

**Row tag**  
The XML tag name that defines a table row in the XML document, without angle brackets `< >`. The name must comply with XML rules for a tag.  
The element containing the row data **cannot** be a self-closing empty element. For example, this empty element is **not** parsed by Amazon Glue:  

```
            <row att1="xx" att2="yy" />  
```
 Empty elements can be written as follows:  

```
            <row att1="xx" att2="yy"> </row> 
```

Amazon Glue keeps track of the creation time, last update time, and version of your classifier.

For example, suppose that you have the following XML file. To create an Amazon Glue table that only contains columns for author and title, create a classifier in the Amazon Glue console with **Row tag** as `AnyCompany`. Then add and run a crawler that uses this custom classifier.

```
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>   
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>
```
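The effect of the **Row tag** setting can be sketched with `xml.etree.ElementTree`: each `<AnyCompany>` element becomes a row, and its child elements become the columns (an illustration of the resulting table shape, not Amazon Glue's actual parser):

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>"""

rows = [
    {child.tag: child.text for child in row}
    for row in ET.fromstring(XML).iter("AnyCompany")  # the configured row tag
]
print(rows)
# [{'author': 'Rivera, Martha', 'title': 'AnyCompany Developer Guide'},
#  {'author': 'Stiles, John', 'title': 'Style Guide for AnyCompany'}]
```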

## Writing JSON custom classifiers
<a name="custom-classifier-json"></a>

JSON is a data-interchange format. It defines data structures with name-value pairs or an ordered list of values. With a JSON custom classifier, you can specify the JSON path to a data structure that is used to define the schema for your table.

### Custom classifier values in Amazon Glue
<a name="classifier-values-json"></a>

When you define a JSON classifier, you supply the following values to Amazon Glue to create the classifier. The classification field of this classifier is set to `json`.

**Name**  
Name of the classifier.

**JSON path**  
A JSON path that points to an object that is used to define a table schema. The JSON path can be written in dot notation or bracket notation. The following operators are supported:  

| Operator | Description | 
| --- | --- | 
| `$` | Root element of a JSON object. This starts all path expressions. | 
| `*` | Wildcard character. Available anywhere a name or number is required in the JSON path. | 
| `.<name>` | Dot-notated child. Specifies a child field in a JSON object. | 
| `['<name>']` | Bracket-notated child. Specifies a child field in a JSON object. Only a single child field can be specified. | 
| `[<number>]` | Array index. Specifies the value of an array by index. | 

Amazon Glue keeps track of the creation time, last update time, and version of your classifier.

**Example Using a JSON classifier to pull records from an array**  
Suppose that your JSON data is an array of records. For example, the first few lines of your file might look like the following:  

```
[
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ak",
    "name": "Alaska"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:1",
    "name": "Alabama's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:2",
    "name": "Alabama's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:3",
    "name": "Alabama's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:4",
    "name": "Alabama's 4th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:5",
    "name": "Alabama's 5th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:6",
    "name": "Alabama's 6th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:7",
    "name": "Alabama's 7th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:1",
    "name": "Arkansas's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:2",
    "name": "Arkansas's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:3",
    "name": "Arkansas's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:4",
    "name": "Arkansas's 4th congressional district"
  }
]
```
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array. For example, the schema might look like the following:  

```
root
|-- record: array
```
However, to create a schema that is based on each record in the JSON array, create a custom JSON classifier and specify the JSON path as `$[*]`. When you specify this JSON path, the classifier interrogates all 12 records in the array to determine the schema. The resulting schema contains separate fields for each object, similar to the following example:  

```
root
|-- type: string
|-- id: string
|-- name: string
```
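The difference that the `$[*]` path makes can be sketched in a few lines: each array element, rather than the array as a whole, contributes its keys to the schema (a simplified illustration of schema inference using two of the records above):

```python
import json

data = json.loads("""[
  {"type": "constituency", "id": "ocd-division/country:us/state:ak", "name": "Alaska"},
  {"type": "constituency", "id": "ocd-division/country:us/state:al/cd:1",
   "name": "Alabama's 1st congressional district"}
]""")

# Built-in JSON classifier: the whole document is the record -> one array field.
print(type(data).__name__)  # list

# Custom classifier with JSON path $[*]: every element is a record, and the
# union of the elements' keys becomes the schema fields.
fields = sorted({key for record in data for key in record})
print(fields)  # ['id', 'name', 'type']
```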

**Example Using a JSON classifier to examine only parts of a file**  
Suppose that your JSON data follows the pattern of the example JSON file `s3://awsglue-datasets/examples/us-legislators/all/areas.json` drawn from [http://everypolitician.org/](http://everypolitician.org/). Example objects in the JSON file look like the following:  

```
{
  "type": "constituency",
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
{
  "type": "constituency",
  "identifiers": [
    {
      "scheme": "dmoz",
      "identifier": "Regional\/North_America\/United_States\/Alaska\/"
    },
    {
      "scheme": "freebase",
      "identifier": "\/m\/0hjy"
    },
    {
      "scheme": "fips",
      "identifier": "US02"
    },
    {
      "scheme": "quora",
      "identifier": "Alaska-state"
    },
    {
      "scheme": "britannica",
      "identifier": "place\/Alaska"
    },
    {
      "scheme": "wikidata",
      "identifier": "Q797"
    }
  ],
  "other_names": [
    {
      "lang": "en",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "fr",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "nov",
      "note": "multilingual",
      "name": "Alaska"
    }
  ],
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
```
When you run a crawler using the built-in JSON classifier, the entire file is used to create the schema. You might end up with a schema like this:  

```
root
|-- type: string
|-- id: string
|-- name: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
```
However, to create a schema using just the "`id`" object, create a custom JSON classifier and specify the JSON path as `$.id`. Then the schema is based on only the "`id`" field:  

```
root
|-- record: string
```
The first few lines of data extracted with this schema look like this:  

```
{"record": "ocd-division/country:us/state:ak"}
{"record": "ocd-division/country:us/state:al/cd:1"}
{"record": "ocd-division/country:us/state:al/cd:2"}
{"record": "ocd-division/country:us/state:al/cd:3"}
{"record": "ocd-division/country:us/state:al/cd:4"}
{"record": "ocd-division/country:us/state:al/cd:5"}
{"record": "ocd-division/country:us/state:al/cd:6"}
{"record": "ocd-division/country:us/state:al/cd:7"}
{"record": "ocd-division/country:us/state:ar/cd:1"}
{"record": "ocd-division/country:us/state:ar/cd:2"}
{"record": "ocd-division/country:us/state:ar/cd:3"}
{"record": "ocd-division/country:us/state:ar/cd:4"}
{"record": "ocd-division/country:us/state:as"}
{"record": "ocd-division/country:us/state:az/cd:1"}
{"record": "ocd-division/country:us/state:az/cd:2"}
{"record": "ocd-division/country:us/state:az/cd:3"}
{"record": "ocd-division/country:us/state:az/cd:4"}
{"record": "ocd-division/country:us/state:az/cd:5"}
{"record": "ocd-division/country:us/state:az/cd:6"}
{"record": "ocd-division/country:us/state:az/cd:7"}
```
To create a schema based on a deeply nested object, such as "`identifier`," in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.identifiers[*].identifier`. Although the schema is similar to the previous example, it is based on a different object in the JSON file.   
The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data from the table shows that the schema is based on the data in the "`identifier`" object:  

```
{"record": "Regional/North_America/United_States/Alaska/"}
{"record": "/m/0hjy"}
{"record": "US02"}
{"record": "5879092"}
{"record": "4001016-8"}
{"record": "destination/alaska"}
{"record": "1116270"}
{"record": "139487266"}
{"record": "n79018447"}
{"record": "01490999-8dec-4129-8254-eef6e80fadc3"}
{"record": "Alaska-state"}
{"record": "place/Alaska"}
{"record": "Q797"}
{"record": "Regional/North_America/United_States/Alabama/"}
{"record": "/m/0gyh"}
{"record": "US01"}
{"record": "4829764"}
{"record": "4084839-5"}
{"record": "161950"}
{"record": "131885589"}
```
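The `$.identifiers[*].identifier` path can be approximated as follows (an illustration of how the path selects nested values, applied to an abridged version of the sample object above):

```python
record = {
    "type": "constituency",
    "identifiers": [
        {"scheme": "dmoz", "identifier": "Regional/North_America/United_States/Alaska/"},
        {"scheme": "freebase", "identifier": "/m/0hjy"},
        {"scheme": "fips", "identifier": "US02"},
    ],
    "id": "ocd-division/country:us/state:ak",
    "name": "Alaska",
}

# $.identifiers[*].identifier -> one output row per element of the array.
rows = [{"record": item["identifier"]} for item in record["identifiers"]]
for row in rows:
    print(row)
# {'record': 'Regional/North_America/United_States/Alaska/'}
# {'record': '/m/0hjy'}
# {'record': 'US02'}
```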
To create a table based on another deeply nested object, such as the "`name`" field in the "`other_names`" array in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.other_names[*].name`. Although the schema is similar to the previous example, it is based on a different object in the JSON file. The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data in the table shows that it is based on the data in the "`name`" object in the "`other_names`" array:  

```
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "ألاسكا"}
{"record": "ܐܠܐܣܟܐ"}
{"record": "الاسكا"}
{"record": "Alaska"}
{"record": "Alyaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Штат Аляска"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "আলাস্কা"}
```

## Writing CSV custom classifiers
<a name="custom-classifier-csv"></a>

Custom CSV classifiers allow you to specify a data type for each column in the custom CSV classifier field. You specify the columns' data types separated by commas. By specifying data types, you can override the crawler's inferred data types and ensure that the data is classified appropriately.

You can set the SerDe for processing CSV in the classifier, which will be applied in the Data Catalog.

When you create a custom classifier, you can also reuse the classifier for different crawlers.

CSV files that contain only headers (and no data) are classified as UNKNOWN because they don't provide enough information. If you specify that the CSV has headings in the *Column headings* option and provide the data types, Amazon Glue can classify these files correctly.

You can use a custom CSV classifier to infer the schema of various types of CSV data. The custom attributes that you can provide for your classifier include delimiters, a CSV SerDe option, options about the header, and whether to perform certain validations on the data.

### Custom classifier values in Amazon Glue
<a name="classifier-values-csv"></a>

When you define a CSV classifier, you provide the following values to Amazon Glue to create the classifier. The classification field of this classifier is set to `csv`.

**Classifier name**  
Name of the classifier.

**CSV Serde**  
Sets the SerDe for processing CSV in the classifier, which is applied in the Data Catalog. Options are `Open CSV SerDe`, `Lazy Simple SerDe`, and `None`. Specify `None` when you want the crawler to perform the detection.

**Column delimiter**  
A custom symbol to denote what separates each column entry in the row. Provide a Unicode character. If you cannot type your delimiter, you can copy and paste it. This works for printable characters, including those your system does not support (typically shown as □).

**Quote symbol**  
A custom symbol to denote what combines content into a single column value. Must be different from the column delimiter. Provide a Unicode character. If you cannot type your quote symbol, you can copy and paste it. This works for printable characters, including those your system does not support (typically shown as □).

**Column headings**  
Indicates the behavior for how column headings should be detected in the CSV file. If your custom CSV file has column headings, enter a comma-delimited list of the column headings.

**Processing options: Allow files with single column**  
Enables the processing of files that contain only one column.

**Processing options: Trim white space before identifying column values**  
Specifies whether to trim values before identifying the type of column values.

**Custom datatypes - *optional***  
Specifies the custom datatypes for columns in the CSV file. Enter the custom datatypes separated by a comma. Each custom datatype must be a supported datatype: `BINARY`, `BOOLEAN`, `DATE`, `DECIMAL`, `DOUBLE`, `FLOAT`, `INT`, `LONG`, `SHORT`, `STRING`, `TIMESTAMP`. Unsupported datatypes display an error. 
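These values map to the `CsvClassifier` structure of the Amazon Glue `CreateClassifier` API. The sketch below builds the request payload (the classifier name, column headings, and datatypes are hypothetical); building the payload requires no AWS credentials, while the commented-out call does:

```python
# Payload for glue.create_classifier(CsvClassifier=...).
# The name, headings, and datatypes below are hypothetical examples.
csv_classifier = {
    "Name": "sales-csv-classifier",
    "Serde": "OpenCSVSerDe",          # or "LazySimpleSerDe" / "None"
    "Delimiter": ",",
    "QuoteSymbol": '"',
    "ContainsHeader": "PRESENT",      # PRESENT | ABSENT | UNKNOWN
    "Header": ["id", "region", "amount"],
    "AllowSingleColumn": False,
    "DisableValueTrimming": False,    # trim whitespace before typing values
    "CustomDatatypeConfigured": True,
    "CustomDatatypes": ["LONG", "STRING", "DOUBLE"],
}

# With AWS credentials configured, create the classifier like this:
# import boto3
# glue = boto3.client("glue")
# glue.create_classifier(CsvClassifier=csv_classifier)
```

Providing one custom datatype per column heading, in the same order, keeps the override unambiguous.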

# Creating classifiers using the Amazon Glue console
<a name="console-classifiers"></a>

A classifier determines the schema of your data. You can write a custom classifier and point to it from Amazon Glue. 

## Creating classifiers
<a name="add-classifier-console"></a>

To add a classifier in the Amazon Glue console, choose **Add classifier**. When you define a classifier, you supply the following values:

**Classifier name**  
Provide a unique name for your classifier.

**Classifier type**  
Choose the type of classifier to create.

Depending on the type of classifier you choose, configure the following properties for your classifier:

------
#### [ Grok ]
+ **Classification** 

  Describe the format or type of data that is classified or provide a custom label. 
+ **Grok pattern** 

  The grok pattern is used to parse your data into a structured schema. It is composed of named patterns that describe the format of your data store. You write this grok pattern using the named built-in patterns provided by Amazon Glue and custom patterns you write and include in the **Custom patterns** field. We suggest that you try your pattern using some sample data with a grok debugger, although the debugger results might not match the results from Amazon Glue exactly. You can find grok debuggers on the web. The named built-in patterns provided by Amazon Glue are generally compatible with grok patterns that are available on the web. 

  Build your grok pattern by iteratively adding named patterns and checking your results in a debugger. This iteration gives you confidence that when the Amazon Glue crawler runs your grok pattern, your data can be parsed.
+ **Custom patterns** 

  For grok classifiers, these are optional building blocks for the **Grok pattern** that you write. When built-in patterns cannot parse your data, you might need to write a custom pattern. These custom patterns are defined in this field and referenced in the **Grok pattern** field. Each custom pattern is defined on a separate line. Just like the built-in patterns, it consists of a named pattern definition that uses [regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax. 

  For example, the following has the name `MESSAGEPREFIX` followed by a regular expression definition to apply to your data to determine whether it follows the pattern. 

  ```
  MESSAGEPREFIX .*-.*-.*-.*-.*
  ```
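As a quick check outside of Amazon Glue, you can exercise the regular expression body of a custom pattern like `MESSAGEPREFIX` directly. The sample lines below are hypothetical:

```python
import re

# The regex body of the custom MESSAGEPREFIX pattern from the example above:
# it requires at least four dashes in the input.
MESSAGEPREFIX = r".*-.*-.*-.*-.*"

# Hypothetical sample lines: the first contains four dashes and matches,
# the second contains none and does not.
matched = re.fullmatch(MESSAGEPREFIX, "us-east-1-app-42") is not None
unmatched = re.fullmatch(MESSAGEPREFIX, "plain message") is None
print(matched, unmatched)
```

Testing the regex in isolation first makes it easier to tell whether a classification failure comes from the pattern itself or from how it is named and referenced in the **Grok pattern** field.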

------
#### [ XML ]
+ **Row tag** 

  For XML classifiers, this is the name of the XML tag that defines a table row in the XML document. Type the name without angle brackets `< >`. The name must comply with XML rules for a tag.

  For more information, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml). 
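For a concrete sense of what the row tag selects, the sketch below parses a small hypothetical XML document and treats each `<record>` element as a table row:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; "record" plays the role of the row tag.
doc = """\
<items>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</items>"""

root = ET.fromstring(doc)

# One dict per row tag occurrence; child elements become the columns.
rows = [
    {child.tag: child.text for child in rec}
    for rec in root.iter("record")
]
print(rows)
```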

------
#### [ JSON ]
+ **JSON path** 

  For JSON classifiers, this is the JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using Amazon Glue supported operators. 

  For more information, see the list of operators in [Writing JSON custom classifiers](custom-classifier.md#custom-classifier-json). 

------
#### [ CSV ]
+ **Column delimiter** 

  A single character or symbol to denote what separates each column entry in the row. Choose the delimiter from the list, or choose `Other` to enter a custom delimiter.
+ **Quote symbol** 

  A single character or symbol to denote what combines content into a single column value. Must be different from the column delimiter. Choose the quote symbol from the list, or choose `Other` to enter a custom quote character.
+ **Column headings** 

  Indicates the behavior for how column headings should be detected in the CSV file. You can choose `Has headings`, `No headings`, or `Detect headings`. If your custom CSV file has column headings, enter a comma-delimited list of the column headings. 
+ **Allow files with single column** 

  To be classified as CSV, the data must have at least two columns and two rows of data. Use this option to allow the processing of files that contain only one column.
+ **Trim whitespace before identifying column values** 

  This option specifies whether to trim values before identifying the type of column values.
+  **Custom datatype** 

   (Optional) - Enter custom datatypes in a comma-delimited list. The supported datatypes are: `BINARY`, `BOOLEAN`, `DATE`, `DECIMAL`, `DOUBLE`, `FLOAT`, `INT`, `LONG`, `SHORT`, `STRING`, `TIMESTAMP`. 
+  **CSV Serde** 

   (Optional) - A SerDe for processing CSV in the classifier, which will be applied in the Data Catalog. Choose from `Open CSV SerDe`, `Lazy Simple SerDe`, or `None`. You can specify the `None` value when you want the crawler to do the detection. 

------

For more information, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

## Viewing classifiers
<a name="view-classifiers-console"></a>

To see a list of all the classifiers that you have created, open the Amazon Glue console at [https://console.amazonaws.cn/glue/](https://console.amazonaws.cn/glue/), and choose the **Classifiers** tab.

The list displays the following properties about each classifier:
+ **Classifier** – The classifier name. When you create a classifier, you must provide a name for it.
+ **Classification** – The classification type of tables inferred by this classifier.
+ **Last updated** – The last time this classifier was updated.

## Managing classifiers
<a name="manage-classifiers-console"></a>

From the **Classifiers** list in the Amazon Glue console, you can add, edit, and delete classifiers. To see more details for a classifier, choose the classifier name in the list. Details include the information you defined when you created the classifier. 