从独立 Spark 应用程序连接到 Data Catalog - Amazon Glue
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅 中国的 Amazon Web Services 服务入门 (PDF)

从独立 Spark 应用程序连接到 Data Catalog

您可以使用 Apache Iceberg 连接器从独立应用程序连接到 Data Catalog。

  1. 为 Spark 应用程序创建 IAM 角色。

  2. 使用 Iceberg 连接器连接到 Amazon Glue Iceberg Rest 端点。

    # configure your application. Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html for best practices on configuring environment variables. export AWS_ACCESS_KEY_ID=$(aws configure get appUser.aws_access_key_id) export AWS_SECRET_ACCESS_KEY=$(aws configure get appUser.aws_secret_access_key) export AWS_SESSION_TOKEN=$(aws configure get appUser.aws_secret_token) export AWS_REGION=us-east-1 export REGION=us-east-1 export AWS_ACCOUNT_ID = {specify your aws account id here} ~/spark-3.5.3-bin-hadoop3/bin/spark-shell \ --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.0 \ --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \ --conf "spark.sql.defaultCatalog=spark_catalog" \ --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \ --conf "spark.sql.catalog.spark_catalog.type=rest" \ --conf "spark.sql.catalog.spark_catalog.uri=https://glue.us-east-1.amazonaws.com/iceberg" \ --conf "spark.sql.catalog.spark_catalog.warehouse = {AWS_ACCOUNT_ID}" \ --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \ --conf "spark.sql.catalog.spark_catalog.rest.signing-name=glue" \ --conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \ --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \ --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialProvider"
  3. 查询 Data Catalog 中的数据。

    spark.sql("create database myicebergdb").show()
    spark.sql("""CREATE TABLE myicebergdb.mytbl (name string) USING iceberg location 's3://bucket_name/mytbl'""")
    spark.sql("insert into myicebergdb.mytbl values('demo') ").show()