Amazon Glue 使用适用于 PHP 的 SDK 的示例 - Amazon SDK for PHP
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅 中国的 Amazon Web Services 服务入门 (PDF)

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

Amazon Glue 使用适用于 PHP 的 SDK 的示例

以下代码示例向您展示了如何使用with来执行操作和实现常见场景 Amazon Glue。 Amazon SDK for PHP

操作是大型程序的代码摘录,必须在上下文中运行。您可以通过操作了解如何调用单个服务函数,还可以通过函数相关场景和跨服务示例的上下文查看操作。

场景是展示如何通过在同一服务中调用多个函数来完成特定任务任务的代码示例。

每个示例都包含一个指向的链接 GitHub,您可以在其中找到有关如何在上下文中设置和运行代码的说明。

操作

以下代码示例演示如何使用 CreateCrawler

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$crawlerName = "example-crawler-test-" . $uniqid; $role = $iamService->getRole("AWSGlueServiceRole-DocExample"); $path = 's3://crawler-public-us-east-1/flight/2016/csv'; $glueService->createCrawler($crawlerName, $role['Role']['Arn'], $databaseName, $path); public function createCrawler($crawlerName, $role, $databaseName, $path): Result { return $this->customWaiter(function () use ($crawlerName, $role, $databaseName, $path) { return $this->glueClient->createCrawler([ 'Name' => $crawlerName, 'Role' => $role, 'DatabaseName' => $databaseName, 'Targets' => [ 'S3Targets' => [[ 'Path' => $path, ]] ], ]); }); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考CreateCrawler中的。

以下代码示例演示如何使用 CreateJob

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$role = $iamService->getRole("AWSGlueServiceRole-DocExample"); $jobName = 'test-job-' . $uniqid; $scriptLocation = "s3://$bucketName/run_job.py"; $job = $glueService->createJob($jobName, $role['Role']['Arn'], $scriptLocation); public function createJob($jobName, $role, $scriptLocation, $pythonVersion = '3', $glueVersion = '3.0'): Result { return $this->glueClient->createJob([ 'Name' => $jobName, 'Role' => $role, 'Command' => [ 'Name' => 'glueetl', 'ScriptLocation' => $scriptLocation, 'PythonVersion' => $pythonVersion, ], 'GlueVersion' => $glueVersion, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考CreateJob中的。

以下代码示例演示如何使用 DeleteCrawler

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

echo "Delete the crawler.\n"; $glueClient->deleteCrawler([ 'Name' => $crawlerName, ]); public function deleteCrawler($crawlerName) { return $this->glueClient->deleteCrawler([ 'Name' => $crawlerName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考DeleteCrawler中的。

以下代码示例演示如何使用 DeleteDatabase

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

echo "Delete the databases.\n"; $glueClient->deleteDatabase([ 'Name' => $databaseName, ]); public function deleteDatabase($databaseName) { return $this->glueClient->deleteDatabase([ 'Name' => $databaseName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考DeleteDatabase中的。

以下代码示例演示如何使用 DeleteJob

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

echo "Delete the job.\n"; $glueClient->deleteJob([ 'JobName' => $job['Name'], ]); public function deleteJob($jobName) { return $this->glueClient->deleteJob([ 'JobName' => $jobName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考DeleteJob中的。

以下代码示例演示如何使用 DeleteTable

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

echo "Delete the tables.\n"; foreach ($tables['TableList'] as $table) { $glueService->deleteTable($table['Name'], $databaseName); } public function deleteTable($tableName, $databaseName) { return $this->glueClient->deleteTable([ 'DatabaseName' => $databaseName, 'Name' => $tableName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考DeleteTable中的。

以下代码示例演示如何使用 GetCrawler

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

echo "Waiting for crawler"; do { $crawler = $glueService->getCrawler($crawlerName); echo "."; sleep(10); } while ($crawler['Crawler']['State'] != "READY"); echo "\n"; public function getCrawler($crawlerName) { return $this->customWaiter(function () use ($crawlerName) { return $this->glueClient->getCrawler([ 'Name' => $crawlerName, ]); }); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考GetCrawler中的。

以下代码示例演示如何使用 GetDatabase

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$databaseName = "doc-example-database-$uniqid"; $database = $glueService->getDatabase($databaseName); echo "Found a database named " . $database['Database']['Name'] . "\n"; public function getDatabase(string $databaseName): Result { return $this->customWaiter(function () use ($databaseName) { return $this->glueClient->getDatabase([ 'Name' => $databaseName, ]); }); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考GetDatabase中的。

以下代码示例演示如何使用 GetJobRun

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$jobName = 'test-job-' . $uniqid; $outputBucketUrl = "s3://$bucketName"; $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId']; echo "waiting for job"; do { $jobRun = $glueService->getJobRun($jobName, $runId); echo "."; sleep(10); } while (!array_intersect([$jobRun['JobRun']['JobRunState']], ['SUCCEEDED', 'STOPPED', 'FAILED', 'TIMEOUT'])); echo "\n"; public function getJobRun($jobName, $runId, $predecessorsIncluded = false): Result { return $this->glueClient->getJobRun([ 'JobName' => $jobName, 'RunId' => $runId, 'PredecessorsIncluded' => $predecessorsIncluded, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考GetJobRun中的。

以下代码示例演示如何使用 GetJobRuns

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$jobName = 'test-job-' . $uniqid; $jobRuns = $glueService->getJobRuns($jobName); public function getJobRuns($jobName, $maxResults = 0, $nextToken = ''): Result { $arguments = ['JobName' => $jobName]; if ($maxResults) { $arguments['MaxResults'] = $maxResults; } if ($nextToken) { $arguments['NextToken'] = $nextToken; } return $this->glueClient->getJobRuns($arguments); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考GetJobRuns中的。

以下代码示例演示如何使用 GetTables

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$databaseName = "doc-example-database-$uniqid"; $tables = $glueService->getTables($databaseName); public function getTables($databaseName): Result { return $this->glueClient->getTables([ 'DatabaseName' => $databaseName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考GetTables中的。

以下代码示例演示如何使用 ListJobs

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$jobs = $glueService->listJobs(); echo "Current jobs:\n"; foreach ($jobs['JobNames'] as $jobsName) { echo "{$jobsName}\n"; } public function listJobs($maxResults = null, $nextToken = null, $tags = []): Result { $arguments = []; if ($maxResults) { $arguments['MaxResults'] = $maxResults; } if ($nextToken) { $arguments['NextToken'] = $nextToken; } if (!empty($tags)) { $arguments['Tags'] = $tags; } return $this->glueClient->listJobs($arguments); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考ListJobs中的。

以下代码示例演示如何使用 StartCrawler

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$crawlerName = "example-crawler-test-" . $uniqid; $databaseName = "doc-example-database-$uniqid"; $glueService->startCrawler($crawlerName); public function startCrawler($crawlerName): Result { return $this->glueClient->startCrawler([ 'Name' => $crawlerName, ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考StartCrawler中的。

以下代码示例演示如何使用 StartJobRun

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整实例,了解如何进行设置和运行。

$jobName = 'test-job-' . $uniqid; $databaseName = "doc-example-database-$uniqid"; $tables = $glueService->getTables($databaseName); $outputBucketUrl = "s3://$bucketName"; $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId']; public function startJobRun($jobName, $databaseName, $tables, $outputBucketUrl): Result { return $this->glueClient->startJobRun([ 'JobName' => $jobName, 'Arguments' => [ 'input_database' => $databaseName, 'input_table' => $tables['TableList'][0]['Name'], 'output_bucket_url' => $outputBucketUrl, '--input_database' => $databaseName, '--input_table' => $tables['TableList'][0]['Name'], '--output_bucket_url' => $outputBucketUrl, ], ]); }
  • 有关 API 的详细信息,请参阅 Amazon SDK for PHP API 参考StartJobRun中的。

场景

以下代码示例展示了如何:

  • 创建爬网程序,爬取公有 Amazon S3 存储桶并生成包含 CSV 格式的元数据的数据库。

  • 列出有关中数据库和表的信息 Amazon Glue Data Catalog。

  • 创建任务,从 S3 存储桶提取 CSV 数据,转换数据,然后将 JSON 格式的输出加载到另一个 S3 存储桶中。

  • 列出有关作业运行的信息,查看转换后的数据,并清除资源。

有关更多信息,请参阅教程: Amazon Glue Studio 入门

适用于 PHP 的 SDK
注意

还有更多相关信息 GitHub。在 Amazon 代码示例存储库中查找完整示例,了解如何进行设置和运行。

namespace Glue; use Aws\Glue\GlueClient; use Aws\S3\S3Client; use AwsUtilities\AWSServiceClass; use GuzzleHttp\Psr7\Stream; use Iam\IAMService; class GettingStartedWithGlue { public function run() { echo("\n"); echo("--------------------------------------\n"); print("Welcome to the AWS Glue getting started demo using PHP!\n"); echo("--------------------------------------\n"); $clientArgs = [ 'region' => 'us-west-2', 'version' => 'latest', 'profile' => 'default', ]; $uniqid = uniqid(); $glueClient = new GlueClient($clientArgs); $glueService = new GlueService($glueClient); $iamService = new IAMService(); $crawlerName = "example-crawler-test-" . $uniqid; AWSServiceClass::$waitTime = 5; AWSServiceClass::$maxWaitAttempts = 20; $role = $iamService->getRole("AWSGlueServiceRole-DocExample"); $databaseName = "doc-example-database-$uniqid"; $path = 's3://crawler-public-us-east-1/flight/2016/csv'; $glueService->createCrawler($crawlerName, $role['Role']['Arn'], $databaseName, $path); $glueService->startCrawler($crawlerName); echo "Waiting for crawler"; do { $crawler = $glueService->getCrawler($crawlerName); echo "."; sleep(10); } while ($crawler['Crawler']['State'] != "READY"); echo "\n"; $database = $glueService->getDatabase($databaseName); echo "Found a database named " . $database['Database']['Name'] . "\n"; //Upload job script $s3client = new S3Client($clientArgs); $bucketName = "test-glue-bucket-" . $uniqid; $s3client->createBucket([ 'Bucket' => $bucketName, 'CreateBucketConfiguration' => ['LocationConstraint' => 'us-west-2'], ]); $s3client->putObject([ 'Bucket' => $bucketName, 'Key' => 'run_job.py', 'SourceFile' => __DIR__ . '/flight_etl_job_script.py' ]); $s3client->putObject([ 'Bucket' => $bucketName, 'Key' => 'setup_scenario_getting_started.yaml', 'SourceFile' => __DIR__ . '/setup_scenario_getting_started.yaml' ]); $tables = $glueService->getTables($databaseName); $jobName = 'test-job-' . $uniqid; $scriptLocation = "s3://$bucketName/run_job.py"; $job = $glueService->createJob($jobName, $role['Role']['Arn'], $scriptLocation); $outputBucketUrl = "s3://$bucketName"; $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId']; echo "waiting for job"; do { $jobRun = $glueService->getJobRun($jobName, $runId); echo "."; sleep(10); } while (!array_intersect([$jobRun['JobRun']['JobRunState']], ['SUCCEEDED', 'STOPPED', 'FAILED', 'TIMEOUT'])); echo "\n"; $jobRuns = $glueService->getJobRuns($jobName); $objects = $s3client->listObjects([ 'Bucket' => $bucketName, ])['Contents']; foreach ($objects as $object) { echo $object['Key'] . "\n"; } echo "Downloading " . $objects[1]['Key'] . "\n"; /** @var Stream $downloadObject */ $downloadObject = $s3client->getObject([ 'Bucket' => $bucketName, 'Key' => $objects[1]['Key'], ])['Body']->getContents(); echo "Here is the first 1000 characters in the object."; echo substr($downloadObject, 0, 1000); $jobs = $glueService->listJobs(); echo "Current jobs:\n"; foreach ($jobs['JobNames'] as $jobsName) { echo "{$jobsName}\n"; } echo "Delete the job.\n"; $glueClient->deleteJob([ 'JobName' => $job['Name'], ]); echo "Delete the tables.\n"; foreach ($tables['TableList'] as $table) { $glueService->deleteTable($table['Name'], $databaseName); } echo "Delete the databases.\n"; $glueClient->deleteDatabase([ 'Name' => $databaseName, ]); echo "Delete the crawler.\n"; $glueClient->deleteCrawler([ 'Name' => $crawlerName, ]); $deleteObjects = $s3client->listObjectsV2([ 'Bucket' => $bucketName, ]); echo "Delete all objects in the bucket.\n"; $deleteObjects = $s3client->deleteObjects([ 'Bucket' => $bucketName, 'Delete' => [ 'Objects' => $deleteObjects['Contents'], ] ]); echo "Delete the bucket.\n"; $s3client->deleteBucket(['Bucket' => $bucketName]); echo "This job was brought to you by the number $uniqid\n"; } } namespace Glue; use Aws\Glue\GlueClient; use Aws\Result; use function PHPUnit\Framework\isEmpty; class GlueService extends \AwsUtilities\AWSServiceClass { protected GlueClient $glueClient; public function __construct($glueClient) { $this->glueClient = $glueClient; } public function getCrawler($crawlerName) { return $this->customWaiter(function () use ($crawlerName) { return $this->glueClient->getCrawler([ 'Name' => $crawlerName, ]); }); } public function createCrawler($crawlerName, $role, $databaseName, $path): Result { return $this->customWaiter(function () use ($crawlerName, $role, $databaseName, $path) { return $this->glueClient->createCrawler([ 'Name' => $crawlerName, 'Role' => $role, 'DatabaseName' => $databaseName, 'Targets' => [ 'S3Targets' => [[ 'Path' => $path, ]] ], ]); }); } public function startCrawler($crawlerName): Result { return $this->glueClient->startCrawler([ 'Name' => $crawlerName, ]); } public function getDatabase(string $databaseName): Result { return $this->customWaiter(function () use ($databaseName) { return $this->glueClient->getDatabase([ 'Name' => $databaseName, ]); }); } public function getTables($databaseName): Result { return $this->glueClient->getTables([ 'DatabaseName' => $databaseName, ]); } public function createJob($jobName, $role, $scriptLocation, $pythonVersion = '3', $glueVersion = '3.0'): Result { return $this->glueClient->createJob([ 'Name' => $jobName, 'Role' => $role, 'Command' => [ 'Name' => 'glueetl', 'ScriptLocation' => $scriptLocation, 'PythonVersion' => $pythonVersion, ], 'GlueVersion' => $glueVersion, ]); } public function startJobRun($jobName, $databaseName, $tables, $outputBucketUrl): Result { return $this->glueClient->startJobRun([ 'JobName' => $jobName, 'Arguments' => [ 'input_database' => $databaseName, 'input_table' => $tables['TableList'][0]['Name'], 'output_bucket_url' => $outputBucketUrl, '--input_database' => $databaseName, '--input_table' => $tables['TableList'][0]['Name'], '--output_bucket_url' => $outputBucketUrl, ], ]); } public function listJobs($maxResults = null, $nextToken = null, $tags = []): Result { $arguments = []; if ($maxResults) { $arguments['MaxResults'] = $maxResults; } if ($nextToken) { $arguments['NextToken'] = $nextToken; } if (!empty($tags)) { $arguments['Tags'] = $tags; } return $this->glueClient->listJobs($arguments); } public function getJobRuns($jobName, $maxResults = 0, $nextToken = ''): Result { $arguments = ['JobName' => $jobName]; if ($maxResults) { $arguments['MaxResults'] = $maxResults; } if ($nextToken) { $arguments['NextToken'] = $nextToken; } return $this->glueClient->getJobRuns($arguments); } public function getJobRun($jobName, $runId, $predecessorsIncluded = false): Result { return $this->glueClient->getJobRun([ 'JobName' => $jobName, 'RunId' => $runId, 'PredecessorsIncluded' => $predecessorsIncluded, ]); } public function deleteJob($jobName) { return $this->glueClient->deleteJob([ 'JobName' => $jobName, ]); } public function deleteTable($tableName, $databaseName) { return $this->glueClient->deleteTable([ 'DatabaseName' => $databaseName, 'Name' => $tableName, ]); } public function deleteDatabase($databaseName) { return $this->glueClient->deleteDatabase([ 'Name' => $databaseName, ]); } public function deleteCrawler($crawlerName) { return $this->glueClient->deleteCrawler([ 'Name' => $crawlerName, ]); } }