
Known issues in Athena for Spark

This page documents some of the known issues in Athena for Apache Spark.

Illegal argument exception when creating a table

Although Spark does not allow databases to be created with an empty location property, databases in Amazon Glue can have an empty LOCATION property if they are created outside of Spark.

If you create a table and specify an Amazon Glue database that has an empty LOCATION field, an exception like the following can occur: IllegalArgumentException: Cannot create a path from an empty string.

For example, the following command throws an exception if the default database in Amazon Glue contains an empty LOCATION field:

spark.sql("create table testTable (firstName STRING)")

Suggested solution A – Use Amazon Glue to add a location to the database that you are using.

To add a location to an Amazon Glue database
  1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.

  2. In the navigation pane, choose Databases.

  3. In the list of databases, choose the database that you want to edit.

  4. On the details page for the database, choose Edit.

  5. On the Update a database page, for Location, enter an Amazon S3 location.

  6. Choose Update Database.
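
If you prefer to script this change instead of using the console, the same update can be made through the Amazon Glue UpdateDatabase API. The following is a minimal sketch using boto3; the database name default and the bucket name are placeholders for your own values.

import boto3

glue = boto3.client("glue")

# Point the default database at an Amazon S3 location
# (the bucket name is a placeholder).
glue.update_database(
    Name="default",
    DatabaseInput={
        "Name": "default",
        "LocationUri": "s3://DOC-EXAMPLE-BUCKET/",
    },
)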

Suggested solution B – Use a different Amazon Glue database that has an existing, valid location in Amazon S3. For example, if you have a database named dbWithLocation, use the command spark.sql("use dbWithLocation") to switch to that database.

Suggested solution C – When you use Spark SQL to create the table, specify a value for location, as in the following example.

spark.sql("create table testTable (firstName STRING) location 's3://DOC-EXAMPLE-BUCKET/'").

Suggested solution D – If you specified a location when you created the table, but the issue still occurs, make sure the Amazon S3 path you provide has a trailing forward slash. For example, the following command throws an illegal argument exception:

spark.sql("create table testTable (firstName STRING) location 's3://DOC-EXAMPLE-BUCKET'")

To correct this, add a trailing slash to the location (for example, 's3://DOC-EXAMPLE-BUCKET/').

Database created in a workgroup location

If you use a command like spark.sql('create database db') to create a database and do not specify a location for the database, Athena creates a subdirectory in your workgroup location and uses that location for the newly created database.
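
To confirm the location that Athena assigned, you can describe the database after you create it. A minimal sketch, using the database db from the example above:

spark.sql("create database db")

# The Location row in the output shows the subdirectory that Athena
# created under the workgroup location.
spark.sql("describe database db").show(truncate=False)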

Issues with Hive managed tables in the Amazon Glue default database

If the Location property of your default database in Amazon Glue is nonempty and points to a valid location in Amazon S3, and you use Athena for Spark to create a Hive managed table in that default database, the data is written to the Amazon S3 location specified in your Athena Spark workgroup instead of to the location specified by the Amazon Glue database.

This issue occurs because of how Apache Hive handles its default database. Apache Hive creates table data in the Hive warehouse root location, which can be different from the actual default database location.

When you use Athena for Spark to create a Hive managed table under the default database in Amazon Glue, the Amazon Glue table metadata can point to two different locations. This can cause unexpected behavior when you attempt an INSERT or DROP TABLE operation.

The following steps reproduce the issue; a condensed code sketch follows the list:

  1. In Athena for Spark, you use one of the following methods to create or save a Hive managed table:

    • A SQL statement like CREATE TABLE $tableName

    • A PySpark command like df.write.mode("overwrite").saveAsTable($tableName) that does not specify the path option in the DataFrame API.

    At this point, the Amazon Glue console may show an incorrect location in Amazon S3 for the table.

  2. In Athena for Spark, you use the DROP TABLE $tableName statement to drop the table that you created.

  3. After you run the DROP TABLE statement, you notice that the underlying files in Amazon S3 are still present.
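
The following PySpark sketch condenses these steps. The table name and sample data are hypothetical; in an Athena notebook, spark refers to the session that the notebook provides.

# Step 1: create a Hive managed table in the default database
# without specifying a path option.
df = spark.createDataFrame([("Alice",)], ["firstName"])
df.write.mode("overwrite").saveAsTable("default.reproTable")

# Step 2: drop the table.
spark.sql("DROP TABLE default.reproTable")

# Step 3: list the workgroup's Amazon S3 location; the table's underlying
# files may still be present even though the table was dropped.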

To resolve this issue, do one of the following:

Solution A – Use a different Amazon Glue database when you create Hive managed tables.

Solution B – Specify an empty location for the default database in Amazon Glue. Then, create your managed tables in the default database.
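
As an illustration of Solution A, the following sketch creates a separate database with an explicit Amazon S3 location and makes it current before creating the managed table. The database name and bucket are placeholders.

# Create a database with an explicit location and switch to it.
spark.sql("CREATE DATABASE IF NOT EXISTS mySparkDb LOCATION 's3://DOC-EXAMPLE-BUCKET/mySparkDb/'")
spark.sql("USE mySparkDb")

# The managed table's data now lands under the database location above.
spark.sql("CREATE TABLE managedTable (firstName STRING)")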

CSV and JSON file format incompatibility between Athena for Spark and Athena SQL

Due to a known issue with open source Spark, when you create a table in Athena for Spark on CSV or JSON data, the table might not be readable from Athena SQL, and vice versa.

For example, you might create a table in Athena for Spark in one of the following ways:

  • With the following USING csv syntax:

    spark.sql('''CREATE EXTERNAL TABLE $tableName (
        $colName1 $colType1,
        $colName2 $colType2,
        $colName3 $colType3)
    USING csv
    PARTITIONED BY ($colName1)
    LOCATION $s3_location''')
  • With the following DataFrame API syntax:

    df.write.format('csv').saveAsTable($tableName)

Because of this Spark issue, queries from Athena SQL on the resulting tables might not succeed.

Suggested solution – Try creating the table in Athena for Spark using Apache Hive syntax. For more information, see CREATE HIVEFORMAT TABLE in the Apache Spark documentation.
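
For example, a Hive-format version of the earlier CSV table might look like the following sketch. It reuses the same placeholder names; note that in Hive syntax the partition column is declared in the PARTITIONED BY clause with its type rather than in the column list.

spark.sql('''CREATE EXTERNAL TABLE $tableName (
    $colName2 $colType2,
    $colName3 $colType3)
PARTITIONED BY ($colName1 $colType1)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION $s3_location''')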