Example: Read CSV Example: Write CSV CSV reference Using optimized CSV reader

Using the CSV format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces you available features for using your data in Amazon Glue.

Amazon Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSVs often don't strictly conform to a standard, but you can refer to RFC 4180 and RFC 7111 for more information.

You can use Amazon Glue to read CSVs from Amazon S3 and from streaming sources as well as write CSVs to Amazon S3. You can read and write bzip and gzip archives containing CSV files from S3. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.

The following table shows which common Amazon Glue features support the CSV format option.

Read	Write	Streaming read	Group small files	Job bookmarks
Supported	Supported	Supported	Supported	Supported

Example: Read CSV files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the CSV files or folders that you want to read.

Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters. You can configure how the reader interprets CSV files in your format_options. For details, see CSV Configuration Reference.

The following Amazon Glue ETL script shows the process of reading CSV files or folders from S3.

We provide a custom CSV reader with performance optimizations for common workflows through the optimizePerformance configuration key. To determine if this reader is right for your workload, see Optimize read performance with vectorized SIMD CSV reader.

Python

For this example, use the create_dynamic_frame.from_options method.


# Example: Read CSV from S3
# For show, we handle a CSV with a header row.  Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).


dataFrame = spark.read\
    .format("csv")\
    .option("header", "true")\
    .load("s3://s3path")

Scala

For this example, use the getSourceWithFormat operation.


// Example: Read CSV from S3
// For show, we handle a CSV with a header row.  Set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"withHeader": true}"""),
      connectionType="s3",
      format="csv",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).


val dataFrame = spark.read
  .option("header","true")
  .format("csv")
  .load("s3://s3path“)

Example: Write CSV files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or a DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the writer interacts with S3 in connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters. You can configure how your operation writes the contents of your files in format_options. For details, see CSV Configuration Reference. The following Amazon Glue ETL script shows the process of writing CSV files and folders to S3.

CSV configuration reference

You can use the following format_options wherever Amazon Glue libraries specify format="csv":

separator –Specifies the delimiter character. The default is a comma, but any other character can be specified.
- Type: Text, Default: ","
escaper – Specifies a character to use for escaping. This option is used only when reading CSV files, not writing. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).
- Type: Text, Default: none
quoteChar – Specifies the character to use for quoting. The default is a double quote. Set this to -1 to turn off quoting entirely.
- Type: Text, Default: '"'
multiLine – Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. Enabling multiLine might decrease performance because it requires more cautious file-splitting while parsing.
- Type: Boolean, Default: false
withHeader – Specifies whether to treat the first line as a header. This option can be used in the DynamicFrameReader class.
- Type: Boolean, Default: false
writeHeader – Specifies whether to write the header to output. This option can be used in the DynamicFrameWriter class.
- Type: Boolean, Default: true
skipFirst – Specifies whether to skip the first data line.
- Type: Boolean, Default: false
optimizePerformance – Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow–based columnar memory formats. Only available in Amazon Glue 3.0+.
- Type: Boolean, Default: false
strictCheckForQuoting – When writing CSVs, Glue may add quotes to values it interprets as strings. This is done to prevent ambiguity in what is written out. To save time when deciding what to write, Glue may quote in certain situations where quotes are not necessary. Enabling a strict check will perform a more intensive computation and will only quote when strictly necessary. Only available in Amazon Glue 3.0+.
- Type: Boolean, Default: false

Optimize read performance with vectorized SIMD CSV reader

Amazon Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers.

The optimized reader:

Uses CPU SIMD instructions to read from disk
Immediately writes records to memory in a columnar format (Apache Arrow)
Divides the records into batches

This saves processing time when records would be batched or converted to a columnar format later on. Some examples are when changing schemas or retrieving data by column.

To use the optimized reader, set "optimizePerformance" to true in the format_options or table property.


glueContext.create_dynamic_frame.from_options(
    frame = datasource1,
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path"]}, 
    format = "csv", 
    format_options={
        "optimizePerformance": True, 
        "separator": ","
        }, 
    transformation_ctx = "datasink2")

Limitations for the vectorized CSV reader

Note the following limitations of the vectorized CSV reader:

It doesn't support the multiLine and escaper format options. It uses the default escaper of double quote char '"'. When these options are set, Amazon Glue automatically falls back to using the row-based CSV reader.
It doesn't support creating a DynamicFrame with ChoiceType.
It doesn't support creating a DynamicFrame with error records.
It doesn't support reading CSV files with multibyte characters such as Japanese or Chinese characters.

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Data format options

Parquet