Import files and Python libraries to Athena for Spark
This document provides examples of how to import files and Python libraries to Amazon Athena for Apache Spark.
Considerations and Limitations
- 
                Python version – Currently, Athena for Spark uses Python version 3.9.16. Note that Python packages are sensitive to minor Python versions. 
- 
                Athena for Spark architecture – Athena for Spark uses Amazon Linux 2 on ARM64 architecture. Note that some Python libraries do not distribute binaries for this architecture. 
- 
                Binary shared objects (SOs) – Because the SparkContext addPyFile method does not detect binary shared objects, it cannot be used in Athena for Spark to add Python packages that depend on shared objects. 
- 
                Resilient Distributed Datasets (RDDs) – RDDs are not supported. 
- 
                Dataframe.foreach – The PySpark DataFrame.foreach method is not supported. 
Examples
The examples use the following conventions.
- 
                The placeholder Amazon S3 location s3://amzn-s3-demo-bucket. Replace this with your own S3 bucket location.
- 
                All code blocks that execute from a Unix shell are shown as directory_name$. For example, the commandlsin the directory/tmpand its output are displayed as follows:/tmp $ lsOutput file1 file2
Import text files for use in calculations
The examples in this section show how to import text files for use in calculations in your notebooks in Athena for Spark.
The following example shows how to write a file to a local temporary directory, add it to a notebook, and test it.
import os from pyspark import SparkFiles tempdir = '/tmp/' path = os.path.join(tempdir, "test.txt") with open(path, "w") as testFile: _ = testFile.write("5") sc.addFile(path) def func(iterator): with open(SparkFiles.get("test.txt")) as testFile: fileVal = int(testFile.readline()) return [x * fileVal for x in iterator] #Test the file from pyspark.sql.functions import udf from pyspark.sql.functions import col udf_with_import = udf(func) df = spark.createDataFrame([(1, "a"), (2, "b")]) df.withColumn("col", udf_with_import(col('_2'))).show()
Output
Calculation completed.
+---+---+-------+
| _1| _2|    col|
+---+---+-------+
|  1|  a|[aaaaa]|
|  2|  b|[bbbbb]|
+---+---+-------+The following example shows how to import a file from Amazon S3 into a notebook and test it.
To import a file from Amazon S3 into a notebook
- 
                        Create a file named test.txtthat has a single line that contains the value5.
- 
                        Add the file to a bucket in Amazon S3. This example uses the location s3://amzn-s3-demo-bucket.
- 
                        Use the following code to import the file to your notebook and test the file. from pyspark import SparkFiles sc.addFile('s3://amzn-s3-demo-bucket/test.txt') def func(iterator): with open(SparkFiles.get("test.txt")) as testFile: fileVal = int(testFile.readline()) return [x * fileVal for x in iterator] #Test the file from pyspark.sql.functions import udf from pyspark.sql.functions import col udf_with_import = udf(func) df = spark.createDataFrame([(1, "a"), (2, "b")]) df.withColumn("col", udf_with_import(col('_2'))).show()Output Calculation completed. +---+---+-------+ | _1| _2| col| +---+---+-------+ | 1| a|[aaaaa]| | 2| b|[bbbbb]| +---+---+-------+
Add Python files
The examples in this section show how to add Python files and libraries to your Spark notebooks in Athena.
The following example shows how to add Python files from Amazon S3 to your notebook and register a UDF.
To add Python files to your notebook and register a UDF
- 
                        Using your own Amazon S3 location, create the file s3://amzn-s3-demo-bucket/file1.pywith the following contents:def xyz(input): return 'xyz - udf ' + str(input);
- 
                        In the same S3 location, create the file s3://amzn-s3-demo-bucket/file2.pywith the following contents:from file1 import xyz def uvw(input): return 'uvw -> ' + xyz(input);
- 
                        In your Athena for Spark notebook, run the following commands. sc.addPyFile('s3://amzn-s3-demo-bucket/file1.py') sc.addPyFile('s3://amzn-s3-demo-bucket/file2.py') def func(iterator): from file2 import uvw return [uvw(x) for x in iterator] from pyspark.sql.functions import udf from pyspark.sql.functions import col udf_with_import = udf(func) df = spark.createDataFrame([(1, "a"), (2, "b")]) df.withColumn("col", udf_with_import(col('_2'))).show(10)Output Calculation started (calculation_id=1ec09e01-3dec-a096-00ea-57289cdb8ce7) in (session=c8c09e00-6f20-41e5-98bd-4024913d6cee). Checking calculation status... Calculation completed. +---+---+--------------------+ | _1| _2| col| +---+---+--------------------+ | 1 | a|[uvw -> xyz - ud... | | 2 | b|[uvw -> xyz - ud... | +---+---+--------------------+
You can use the Python addPyFile and import methods
                    to import a Python .zip file to your notebook.
Note
The .zip files that you import to Athena Spark may
                        include only Python packages. For example, including packages with C-based
                        files is not supported.
To import a Python .zip file to your
                        notebook
- 
                        On your local computer, in a desktop directory such as \tmp, create a directory calledmoduletest.
- 
                        In the moduletestdirectory, create a file namedhello.pywith the following contents:def hi(input): return 'hi ' + str(input);
- 
                        In the same directory, add an empty file with the name __init__.py.If you list the directory contents, they should now look like the following. /tmp $ ls moduletest __init__.py hello.py
- 
                        Use the zipcommand to place the two module files into a file calledmoduletest.zip.moduletest $ zip -r9 ../moduletest.zip *
- 
                        Upload the .zipfile to your bucket in Amazon S3.
- 
                        Use the following code to import the Python .zipfile into your notebook.sc.addPyFile('s3://amzn-s3-demo-bucket/moduletest.zip') from moduletest.hello import hi from pyspark.sql.functions import udf from pyspark.sql.functions import col hi_udf = udf(hi) df = spark.createDataFrame([(1, "a"), (2, "b")]) df.withColumn("col", hi_udf(col('_2'))).show()Output Calculation started (calculation_id=6ec09e8c-6fe0-4547-5f1b-6b01adb2242c) in (session=dcc09e8c-3f80-9cdc-bfc5-7effa1686b76). Checking calculation status... Calculation completed. +---+---+----+ | _1| _2| col| +---+---+----+ | 1| a|hi a| | 2| b|hi b| +---+---+----+
The following code examples show how to add and import two different versions of a Python library from a location in Amazon S3 as two separate modules. The code adds each the library file from S3, imports it, and then prints the library version to verify the import.
sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_15.zip') sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_17_6.zip') import simplejson_v3_15 print(simplejson_v3_15.__version__)
Output
3.15.0import simplejson_v3_17_6 print(simplejson_v3_17_6.__version__)
Output
3.17.6This example uses the pip command to download a Python .zip file
                    of the bpabel/piglatin
To import a Python .zip file from PyPI
- 
                        On your local desktop, use the following commands to create a directory called testpiglatinand create a virtual environment./tmp $ mkdir testpiglatin /tmp $ cd testpiglatin testpiglatin $ virtualenv .Output created virtual environment CPython3.9.6.final.0-64 in 410ms creator CPython3Posix(dest=/private/tmp/testpiglatin, clear=False, no_vcs_ignore=False, global=False) seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/user1/Library/Application Support/virtualenv) added seed packages: pip==22.0.4, setuptools==62.1.0, wheel==0.37.1 activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
- 
                        Create a subdirectory named unpackedto hold the project.testpiglatin $ mkdir unpacked
- 
                        Use the pipcommand to install the project into theunpackeddirectory.testpiglatin $ bin/pip install -t $PWD/unpacked piglatinOutput Collecting piglatin Using cached piglatin-1.0.6-py2.py3-none-any.whl (3.1 kB) Installing collected packages: piglatin Successfully installed piglatin-1.0.6
- 
                        Check the contents of the directory. testpiglatin $ lsOutput bin lib pyvenv.cfg unpacked
- 
                        Change to the unpackeddirectory and display the contents.testpiglatin $ cd unpacked unpacked $ lsOutput piglatin piglatin-1.0.6.dist-info
- 
                        Use the zipcommand to place the contents of the piglatin project into a file calledlibrary.zip.unpacked $ zip -r9 ../library.zip *Output adding: piglatin/ (stored 0%) adding: piglatin/__init__.py (deflated 56%) adding: piglatin/__pycache__/ (stored 0%) adding: piglatin/__pycache__/__init__.cpython-39.pyc (deflated 31%) adding: piglatin-1.0.6.dist-info/ (stored 0%) adding: piglatin-1.0.6.dist-info/RECORD (deflated 39%) adding: piglatin-1.0.6.dist-info/LICENSE (deflated 41%) adding: piglatin-1.0.6.dist-info/WHEEL (deflated 15%) adding: piglatin-1.0.6.dist-info/REQUESTED (stored 0%) adding: piglatin-1.0.6.dist-info/INSTALLER (stored 0%) adding: piglatin-1.0.6.dist-info/METADATA (deflated 48%)
- 
                        (Optional) Use the following commands to test the import locally. - 
                                Set the Python path to the library.zipfile location and start Python./home $ PYTHONPATH=/tmp/testpiglatin/library.zip /home $ python3Output Python 3.9.6 (default, Jun 29 2021, 06:20:32) [Clang 12.0.0 (clang-1200.0.32.29)] on darwin Type "help", "copyright", "credits" or "license" for more information.
- 
                                Import the library and run a test command. >>> import piglatin >>> piglatin.translate('hello')Output 'ello-hay'
 
- 
                                
- 
                        Use commands like the following to add the .zipfile from Amazon S3, import it into your notebook in Athena, and test it.sc.addPyFile('s3://amzn-s3-demo-bucket/library.zip') import piglatin piglatin.translate('hello') from pyspark.sql.functions import udf from pyspark.sql.functions import col hi_udf = udf(piglatin.translate) df = spark.createDataFrame([(1, "hello"), (2, "world")]) df.withColumn("col", hi_udf(col('_2'))).show()Output Calculation started (calculation_id=e2c0a06e-f45d-d96d-9b8c-ff6a58b2a525) in (session=82c0a06d-d60e-8c66-5d12-23bcd55a6457). Checking calculation status... Calculation completed. +---+-----+--------+ | _1| _2| col| +---+-----+--------+ | 1|hello|ello-hay| | 2|world|orld-way| +---+-----+--------+
This example imports the md2gemini
cjkwrap mistune wcwidth
To import a Python .zip file that has dependencies
- 
                        On your local computer, use the following commands to create a directory called testmd2geminiand create a virtual environment./tmp $ mkdir testmd2gemini /tmp $ cd testmd2gemini testmd2gemini$ virtualenv .
- 
                        Create a subdirectory named unpackedto hold the project.testmd2gemini $ mkdir unpacked
- 
                        Use the pipcommand to install the project into theunpackeddirectory./testmd2gemini $ bin/pip install -t $PWD/unpacked md2geminiOutput Collecting md2gemini Downloading md2gemini-1.9.0-py3-none-any.whl (31 kB) Collecting wcwidth Downloading wcwidth-0.2.5-py2.py3-none-any.whl (30 kB) Collecting mistune<3,>=2.0.0 Downloading mistune-2.0.2-py2.py3-none-any.whl (24 kB) Collecting cjkwrap Downloading CJKwrap-2.2-py2.py3-none-any.whl (4.3 kB) Installing collected packages: wcwidth, mistune, cjkwrap, md2gemini Successfully installed cjkwrap-2.2 md2gemini-1.9.0 mistune-2.0.2 wcwidth-0.2.5 ...
- 
                        Change to the unpackeddirectory and check the contents.testmd2gemini $ cd unpacked unpacked $ ls -lahOutput total 16 drwxr-xr-x 13 user1 wheel 416B Jun 7 18:43 . drwxr-xr-x 8 user1 wheel 256B Jun 7 18:44 .. drwxr-xr-x 9 user1 staff 288B Jun 7 18:43 CJKwrap-2.2.dist-info drwxr-xr-x 3 user1 staff 96B Jun 7 18:43 __pycache__ drwxr-xr-x 3 user1 staff 96B Jun 7 18:43 bin -rw-r--r-- 1 user1 staff 5.0K Jun 7 18:43 cjkwrap.py drwxr-xr-x 7 user1 staff 224B Jun 7 18:43 md2gemini drwxr-xr-x 10 user1 staff 320B Jun 7 18:43 md2gemini-1.9.0.dist-info drwxr-xr-x 12 user1 staff 384B Jun 7 18:43 mistune drwxr-xr-x 8 user1 staff 256B Jun 7 18:43 mistune-2.0.2.dist-info drwxr-xr-x 16 user1 staff 512B Jun 7 18:43 tests drwxr-xr-x 10 user1 staff 320B Jun 7 18:43 wcwidth drwxr-xr-x 9 user1 staff 288B Jun 7 18:43 wcwidth-0.2.5.dist-info
- 
                        Use the zipcommand to place the contents of the md2gemini project into a file calledmd2gemini.zip.unpacked $ zip -r9 ../md2gemini *Output adding: CJKwrap-2.2.dist-info/ (stored 0%) adding: CJKwrap-2.2.dist-info/RECORD (deflated 37%) .... adding: wcwidth-0.2.5.dist-info/INSTALLER (stored 0%) adding: wcwidth-0.2.5.dist-info/METADATA (deflated 62%)
- 
                        (Optional) Use the following commands to test that the library works on your local computer. - 
                                Set the Python path to the md2gemini.zipfile location and start Python./home $ PYTHONPATH=/tmp/testmd2gemini/md2gemini.zip /home python3
- 
                                Import the library and run a test. >>> from md2gemini import md2gemini >>> print(md2gemini('[abc](https://abc.def)'))Output https://abc.def abc
 
- 
                                
- 
                        Use the following commands to add the .zipfile from Amazon S3, import it into your notebook in Athena, and perform a non UDF test.# (non udf test) sc.addPyFile('s3://amzn-s3-demo-bucket/md2gemini.zip') from md2gemini import md2gemini print(md2gemini('[abc](https://abc.def)'))Output Calculation started (calculation_id=0ac0a082-6c3f-5a8f-eb6e-f8e9a5f9bc44) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status... Calculation completed. => https://abc.def (https://abc.def/) abc
- 
                        Use the following commands to perform a UDF test. # (udf test) from pyspark.sql.functions import udf from pyspark.sql.functions import col from md2gemini import md2gemini hi_udf = udf(md2gemini) df = spark.createDataFrame([(1, "[first website](https://abc.def)"), (2, "[second website](https://aws.com)")]) df.withColumn("col", hi_udf(col('_2'))).show()Output Calculation started (calculation_id=60c0a082-f04d-41c1-a10d-d5d365ef5157) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status... Calculation completed. +---+--------------------+--------------------+ | _1| _2| col| +---+--------------------+--------------------+ | 1|[first website](h...|=> https://abc.de...| | 2|[second website](...|=> https://aws.co...| +---+--------------------+--------------------+