Learn about known issues in Athena for Spark
This page documents some of the known issues in Athena for Apache Spark.
Illegal argument exception when creating a table
Although Spark does not allow databases to be created with an empty location property, databases in Amazon Glue can have an empty LOCATION property if they are created outside of Spark.
If you create a table and specify an Amazon Glue database that has an empty LOCATION field, an exception like the following can occur:
IllegalArgumentException: Cannot create a path from an empty string.
For example, the following command throws an exception if the default database in Amazon Glue contains an empty LOCATION field:
spark.sql("create table testTable (firstName STRING)")
Suggested solution A – Use Amazon Glue to add a location to the database that you are using.
To add a location to an Amazon Glue database
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
- In the navigation pane, choose Databases.
- In the list of databases, choose the database that you want to edit.
- On the details page for the database, choose Edit.
- On the Update a database page, for Location, enter an Amazon S3 location.
- Choose Update Database.
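The same change can also be scripted with the Amazon Glue UpdateDatabase API through the AWS CLI. A sketch, assuming you have credentials configured; the database name and bucket are placeholders:

```shell
# Set the LocationUri of the Glue database "default" (placeholder values)
aws glue update-database \
  --name default \
  --database-input '{"Name": "default", "LocationUri": "s3://amzn-s3-demo-bucket/default/"}'
```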
Suggested solution B – Use a different Amazon Glue database that has an existing, valid location in Amazon S3. For example, if you have a database named dbWithLocation, use the command spark.sql("use dbWithLocation") to switch to that database.
Suggested solution C – When you use Spark SQL to create the table, specify a value for location, as in the following example.
spark.sql("create table testTable (firstName STRING) location 's3://amzn-s3-demo-bucket/'")
Suggested solution D – If you specified a location when you created the table, but the issue still occurs, make sure the Amazon S3 path you provide has a trailing forward slash. For example, the following command throws an illegal argument exception:
spark.sql("create table testTable (firstName STRING) location 's3://amzn-s3-demo-bucket'")
To correct this, add a trailing slash to the location (for example, 's3://amzn-s3-demo-bucket/').
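Because a missing trailing slash is easy to overlook, you can normalize locations in your own code before interpolating them into a CREATE TABLE statement. The following is a minimal sketch; the helper name normalize_s3_location is hypothetical and not part of any Athena or Spark API:

```python
def normalize_s3_location(location: str) -> str:
    """Return the S3 location with a guaranteed trailing slash.

    A location without a trailing slash can cause Spark table creation
    to fail with an illegal argument exception.
    """
    if not location.startswith("s3://"):
        # Hypothetical guard: only S3 URIs are expected here.
        raise ValueError(f"not an S3 URI: {location!r}")
    return location if location.endswith("/") else location + "/"
```

For example, normalize_s3_location("s3://amzn-s3-demo-bucket") returns "s3://amzn-s3-demo-bucket/", which is safe to use in the location clause shown above.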
Database created in a workgroup location
If you use a command like spark.sql('create database db') to create a database and do not specify a location for the database, Athena creates a subdirectory in your workgroup location and uses that location for the newly created database.
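If you want the database stored at a location you control rather than under the workgroup location, you can specify the location explicitly when you create the database. An illustrative fragment that assumes an active Spark session; the database name and bucket are placeholders:

```python
spark.sql("create database db location 's3://amzn-s3-demo-bucket/db/'")
```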
Issues with Hive managed tables in the Amazon Glue default database
If the Location property of your default database in Amazon Glue is nonempty and specifies a valid location in Amazon S3, and you use Athena for Spark to create a Hive managed table in your Amazon Glue default database, data is written to the Amazon S3 location specified in your Athena Spark workgroup instead of to the location specified by the Amazon Glue database.
This issue occurs because of how Apache Hive handles its default database. Apache Hive creates table data in the Hive warehouse root location, which can be different from the actual default database location.
When you use Athena for Spark to create a Hive managed table under the default database in Amazon Glue, the Amazon Glue table metadata can point to two different locations. This can cause unexpected behavior when you attempt an INSERT or DROP TABLE operation.
The steps to reproduce the issue are the following:
- In Athena for Spark, you use one of the following methods to create or save a Hive managed table:
  - A SQL statement like CREATE TABLE $tableName
  - A PySpark command like df.write.mode("overwrite").saveAsTable($tableName) that does not specify the path option in the Dataframe API.
  At this point, the Amazon Glue console may show an incorrect location in Amazon S3 for the table.
- In Athena for Spark, you use the DROP TABLE $table_name statement to drop the table that you created.
- After you run the DROP TABLE statement, you notice that the underlying files in Amazon S3 are still present.
To resolve this issue, do one of the following:
Solution A – Use a different Amazon Glue database when you create Hive managed tables.
Solution B – Specify an empty location for the default database in Amazon Glue. Then, create your managed tables in the default database.
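When you write tables through the DataFrame API, you can also sidestep the ambiguity by setting the path option explicitly, so the table's data location does not depend on the Hive warehouse root. An illustrative fragment that assumes an existing DataFrame df; the table name and bucket are placeholders:

```python
df.write.mode("overwrite") \
    .option("path", "s3://amzn-s3-demo-bucket/tables/my_table/") \
    .saveAsTable("my_table")
```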
CSV and JSON file format incompatibility between Athena for Spark and Athena SQL
Due to a known issue with open source Spark, when you create a table in Athena for Spark on CSV or JSON data, the table might not be readable from Athena SQL, and vice versa.
For example, you might create a table in Athena for Spark in one of the following ways:
- With the following USING csv syntax:
  spark.sql('''CREATE EXTERNAL TABLE $tableName ( $colName1 $colType1, $colName2 $colType2, $colName3 $colType3) USING csv PARTITIONED BY ($colName1) LOCATION $s3_location''')
- With the following DataFrame API syntax:
  df.write.format('csv').saveAsTable($table_name)
Due to the known issue with open source Spark, queries from Athena SQL on the resulting tables might not succeed.
Suggested solution – Try creating the table in Athena for Spark using Apache Hive syntax. For more information, see CREATE HIVEFORMAT TABLE.
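As a sketch of what Hive-format syntax can look like for CSV data (the table, column, and location names are placeholders):

```sql
CREATE EXTERNAL TABLE testTable (firstName STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/tables/testTable/'
```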