Choose an Algorithm - Amazon SageMaker

Choose an Algorithm

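Machine learning can help you accomplish empirical tasks that require some sort of inductive inference. These tasks involve induction because they use data to train algorithms to make generalizable inferences. This means that the algorithms can make statistically reliable predictions or decisions, or complete other tasks, when applied to new data that was not used to train them.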

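To help you select the best algorithm for your task, we classify these tasks at various levels of abstraction. At the highest level of abstraction, machine learning attempts to find patterns or relationships between features or between less structured items, such as text in a data set. Pattern recognition techniques can be classified into distinct machine learning paradigms, each of which addresses specific problem types. There are currently three basic paradigms for machine learning used to address various problem types:

  • Supervised learning

  • Unsupervised learning

  • Reinforcement learning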

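The types of problems that each learning paradigm can address are identified by considering the inferences (or predictions, decisions, or other tasks) you want to make from the type of data that you have or could collect. Machine learning paradigms use algorithmic methods to address their various problem types. The algorithms provide recipes for solving these problems.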

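However, many algorithms, such as neural networks, can be deployed with different learning paradigms and on different types of problems. Multiple algorithms can also address a specific problem type. Some algorithms are more generally applicable, and others are quite specific to certain kinds of objectives and data. So the mapping between machine learning algorithms and problem types is many-to-many. In addition, there are various implementation options available for algorithms.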

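The following sections provide guidance on implementation options, machine learning paradigms, and algorithms appropriate for different problem types.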

Choose an algorithm implementation

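After choosing an algorithm, you must decide which implementation of it you want to use. Amazon SageMaker supports three implementation options that require increasing levels of effort.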

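  • Built-in algorithms require the least effort, and they scale if the data set is large and significant resources are needed to train and deploy the model.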

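  • If there is no built-in solution that works, try to develop one that uses the pre-made images for machine learning and deep learning frameworks, which are available for supported frameworks such as Scikit-learn, TensorFlow, PyTorch, MXNet, or Chainer.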

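  • If you need to run custom packages or use any code that is not part of a supported framework or available through PyPI, then you need to build your own custom Docker image that is configured to install the necessary packages or software. The custom image must also be pushed to an online repository such as the Amazon Elastic Container Registry.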

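Algorithm implementation guidance

Implementation        | Requires code | Pre-coded algorithms | Support for third-party packages | Support for custom code | Level of effort
Built-in              |               |                      |                                  |                         |
Scikit-learn          |               |                      | PyPI only                        |                         |
Spark ML              |               |                      | PyPI only                        |                         |
XGBoost (open source) |               |                      | PyPI only                        |                         |
TensorFlow            |               |                      | PyPI only                        |                         | Medium-high
PyTorch               |               |                      | PyPI only                        |                         | Medium-high
MXNet                 |               |                      | PyPI only                        |                         | Medium-high
Chainer               |               |                      | PyPI only                        |                         | Medium-high
Custom image          |               |                      | Yes, from any source             |                         |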
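
The choice among the three implementation options can be read as a simple decision rule: take the lowest-effort option that still meets your requirements. The following is a minimal sketch of that rule. The `Requirements` class, its field names, and `choose_implementation` are hypothetical names introduced for illustration and are not part of any SageMaker API.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical description of a project's needs (not a SageMaker type)."""

    builtin_algorithm_fits: bool  # a built-in algorithm solves the problem
    framework_supported: bool     # e.g. Scikit-learn, TensorFlow, PyTorch
    all_packages_on_pypi: bool    # every extra dependency is pip-installable


def choose_implementation(req: Requirements) -> str:
    """Return the lowest-effort option that satisfies the requirements."""
    if req.builtin_algorithm_fits:
        return "built-in algorithm"
    if req.framework_supported and req.all_packages_on_pypi:
        return "script mode on a supported framework"
    # Anything else needs software that the pre-made images cannot install.
    return "custom Docker image"
```

For example, `choose_implementation(Requirements(False, True, True))` selects script mode, while a dependency that is not hosted on PyPI pushes the same project to a custom Docker image.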

Use a built-in algorithm

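When choosing an algorithm for your type of problem and data, the easiest option is to use one of Amazon SageMaker's built-in algorithms. These built-in algorithms come with two major benefits.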

  • The built-in algorithms require no coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.

  • The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms (some algorithms may not be included due to inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its counterpart in SageMaker and input the hyperparameters you already know than to port it over using script mode on a supported framework.

For more information on the built-in algorithms provided by SageMaker, see Use Amazon SageMaker Built-in Algorithms.

For important information about Docker registry paths, data formats, recommended EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by SageMaker, see Common Information About Built-in Algorithms.

Use script mode in a supported framework

If the algorithm you want to use for your model is not supported by a built-in choice and you are comfortable coding your own solution, then you should consider using an Amazon SageMaker supported framework. This is referred to as "script mode" because you write your custom code (script) in a text file with a .py extension. As the table above indicates, SageMaker supports most of the popular machine learning frameworks. The framework images come preloaded with the corresponding framework and some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. They also allow you to install any Python package hosted on PyPI by including a requirements.txt file with your training code, or to include your own code directories. R is also supported natively in SageMaker notebook kernels. Some frameworks, like Scikit-learn and Spark ML, have pre-coded algorithms you can use easily, while other frameworks like TensorFlow and PyTorch may require you to implement the algorithm yourself. The only limitation when using a supported framework image is that you cannot import any software packages that are not hosted on PyPI or that are not already included with the framework's image.
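
To make the idea of a training script concrete, here is a minimal sketch of what a script-mode entry point might look like. It assumes that hyperparameters arrive as command-line arguments and that the container exposes the input channel and model output locations through the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables; the local fallback paths, the `--target-column` hyperparameter, and the deliberately trivial "model" (a mean) are illustrative assumptions rather than anything prescribed by SageMaker.

```python
import argparse
import csv
import json
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter: index of the CSV column to learn from.
    parser.add_argument("--target-column", type=int, default=0)
    # Assumed environment variables; the fallbacks are for local runs only.
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train")
    )
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "model")
    )
    return parser.parse_args(argv)


def train(train_dir, model_dir, target_column):
    """Fit a placeholder model: the mean of the target column."""
    values = []
    for name in sorted(os.listdir(train_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(train_dir, name), newline="") as handle:
            for row in csv.reader(handle):
                values.append(float(row[target_column]))
    if not values:
        raise ValueError(f"no training rows found in {train_dir}")
    model = {"mean_target": sum(values) / len(values), "n_rows": len(values)}
    # Write the artifact to the model directory so it can be collected.
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.json")
    with open(path, "w") as handle:
        json.dump(model, handle)
    return path


def main(argv=None):
    args = parse_args(argv)
    return train(args.train, args.model_dir, args.target_column)
```

Saved as a file such as train.py (a hypothetical name), the script would end with a call to `main()`, and any extra PyPI dependencies would be listed in a requirements.txt file next to it.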

For more information on the frameworks supported by SageMaker, see Use Machine Learning Frameworks, Python, and R with Amazon SageMaker.

Use a custom Docker image

Amazon SageMaker's built-in algorithms and supported frameworks should cover most use cases, but there are times when you may need to use an algorithm from a package not included in any of the supported frameworks. You might also have a pre-trained model pickled or persisted somewhere that you need to deploy. SageMaker uses Docker images to host the training and serving of all models, so you can supply your own custom Docker image if the package or software you need is not included in a supported framework. This may be your own Python package or an algorithm coded in a language like Stan or Julia. For these images, you must also configure the training of the algorithm and the serving of the model properly in your Dockerfile. This requires intermediate knowledge of Docker and is not recommended unless you are comfortable writing your own machine learning algorithm. Your Docker image must be uploaded to an online repository, such as the Amazon Elastic Container Registry (ECR), before you can train and serve your model properly.
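
As an illustration of what "serving of the model" involves inside a custom image, the following is a minimal sketch of an inference server. It assumes the SageMaker hosting convention in which the container listens on port 8080 and answers GET /ping (health check) and POST /invocations (inference). The `predict` function, the JSON payload shape, and `make_server` are hypothetical placeholders, not part of any SageMaker API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Placeholder model: returns the sum of the numbers it receives."""
    return {"prediction": sum(payload["instances"])}


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: an empty 200 response signals readiness.
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(predict(payload)).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            self._send(400)  # malformed request body
            return
        self._send(200, body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=8080, host="0.0.0.0"):
    """Build the server; the caller decides when to start serving."""
    return HTTPServer((host, port), InferenceHandler)
```

In a container, the entry point would call `make_server().serve_forever()`. A production image would more likely place a dedicated web server in front of the model code; the standard-library server is used here only to keep the sketch self-contained.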

For more information on custom Docker images in SageMaker, see Using Docker containers with SageMaker.

Problem types for the basic machine learning paradigms

The following three sections describe the main problem types addressed by the three basic paradigms for machine learning. For a list of the built-in algorithms that SageMaker provides to address these problem types, see Use Amazon SageMaker Built-in Algorithms.

Supervised learning

If your data set consists of features or attributes (inputs) that contain target values (outputs), then you have a supervised learning problem. If your target values are categorical (mathematically discrete), then you have a classification problem. It is a standard practice to distinguish binary from multiclass classification.

  • Binary classification is a type of supervised learning that assigns an individual to one of two predefined and mutually exclusive classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. A medical diagnosis of whether an individual has a disease, based on the results of diagnostic tests, is an example of binary classification.

  • Multiclass classification is a type of supervised learning that assigns an individual to one of several classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. An example is the prediction of the topic most relevant to a text document. A document may be classified as being about religion, politics, or finance, or as about one of several other predefined topic classes.
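
Both kinds of classification follow the same pattern: learn from labeled examples, then assign a class to a new individual. The sketch below uses a nearest-centroid classifier, chosen here only because it fits in a few lines; it is not one of SageMaker's built-in algorithms, and the function names and feature values are made up for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Average the feature vectors of each labeled class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Binary case: two hypothetical diagnostic test results per individual.
tests = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
diagnoses = ["negative", "negative", "positive", "positive"]
model = fit_centroids(tests, diagnoses)
```

With these made-up values, `predict(model, [2.9, 3.0])` falls closest to the "positive" centroid. The same two functions handle the multiclass case unchanged when the labels take more than two values.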

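If the target values you are trying to predict are mathematically continuous, then you have a regression problem. Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. An example is the prediction of house prices using features like the number of bathrooms and bedrooms and the square footage of the house and garden. Regression analysis can create a model that takes one or more of these features as input and predicts the price of a house.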
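
As a small worked example of regression, the sketch below fits a straight line by ordinary least squares, using square footage as the only feature. The single-feature restriction and the price figures are assumptions made to keep the example short; a realistic model would use several of the features mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is modeled as slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: square footage and sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [200000, 275000, 350000, 425000]

slope, intercept = fit_line(square_feet, prices)
predicted_price = slope * 1800 + intercept  # estimate for an unseen house
```

Because the made-up prices lie exactly on a line, the fit recovers a slope of 150 per square foot and an intercept of 50000, so the 1800 square foot house is estimated at 320000.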

For more information on the built-in supervised learning algorithms provided by SageMaker, see Supervised learning.

Unsupervised learning

If your data set consists of features or attributes (inputs) that do not contain labels or target values (outputs), then you have an unsupervised learning problem. In this type of problem, the output must be predicted based on the patterns discovered in the input data. The goal in unsupervised learning problems is to discover patterns such as groupings within the data. There is a large variety of tasks or problem types to which unsupervised learning can be applied. Principal component and cluster analyses are two of the main methods commonly deployed for preprocessing data. Here is a short list of problem types that can be addressed by unsupervised learning:

  • Dimension reduction is typically part of a data exploration step used to determine the most relevant features to use for model construction. The idea is to transform data from a high-dimensional, sparsely populated space into a low-dimensional space that retains most significant properties of the original data. This provides relief for the curse of dimensionality that can arise with sparsely populated, high-dimensional data on which statistical analysis becomes problematic. It can also be used to help understand data, reducing high-dimensional data to a lower dimensionality that can be visualized.

  • Cluster analysis is a class of techniques that are used to classify objects or cases into groups called clusters. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the features or attributes that you want the algorithm to use to determine similarity, select a distance function to measure similarity, and specify the number of clusters to use in the analysis.

  • Anomaly detection is the identification of rare items, events, or observations in a data set which raise suspicions because they differ significantly from the rest of the data. The identification of anomalous items can be used, for example, to detect bank fraud or medical errors. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.

  • Density estimation is the construction of estimates of unobservable underlying probability density functions based on observed data. A natural use of density estimates is for data exploration. Density estimates can discover features such as skewness and multimodality in the data. The most basic form of density estimation is a rescaled histogram.
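
The three choices named under cluster analysis (features, distance function, number of clusters) map directly onto the inputs of a clustering routine. The following is a minimal sketch of k-means (Lloyd's algorithm) with Euclidean distance. The function name, the fixed iteration count, and the choice of the first k points as initial centers are simplifying assumptions; this is not the implementation behind any SageMaker algorithm.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centers."""
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


# Two made-up features per object; k is the number of clusters to find.
data = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
centers, assignment = kmeans(data, k=2)
```

On this toy data the points near (1, 1) end up in one cluster and the points near (5, 5) in the other. Results of k-means generally depend on the initial centers, which is why practical implementations use more careful initialization than this sketch does.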

SageMaker provides several built-in machine learning algorithms that you can use for these unsupervised learning tasks. For more information on the built-in unsupervised algorithms provided by SageMaker, see Unsupervised learning.
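
The rescaled histogram mentioned under density estimation is simple enough to write out. In the sketch below, each bin count is divided by the number of samples times the bin width, so the bar areas sum to 1 and the result can be read as a probability density. The function name and the equal-width binning over a fixed range are assumptions made for illustration.

```python
def histogram_density(samples, bins, low, high):
    """Rescaled histogram: bin counts divided by (sample count * bin width)."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in samples:
        if low <= x <= high:
            # Clamp so that x == high falls into the last bin.
            index = min(int((x - low) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [low, high]")
    return [c / (total * width) for c in counts], width


densities, width = histogram_density(
    [0.1, 0.2, 0.3, 0.6, 0.9], bins=2, low=0.0, high=1.0
)
area = sum(d * width for d in densities)  # total area under the bars
```

Here three of the five samples fall in the first half of the range, so the first bar is taller than the second, and the total area under the bars is 1.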

Reinforcement learning

Reinforcement learning is a type of learning that is based on interaction with the environment. This type of learning is used by an agent that must learn behavior through trial-and-error interactions with a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain rewards with exploiting actions that have known rewards.
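
The trade-off between exploring and exploiting can be illustrated with an epsilon-greedy agent on a multi-armed bandit, a deliberately simplified setting with no environment state. In this sketch, the agent explores a random action with probability `epsilon` and otherwise exploits the action with the best reward estimate so far. The function name, the reward probabilities, and the parameter values are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent against Bernoulli-reward actions."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)       # times each action was tried
    estimates = [0.0] * len(reward_probs)  # running mean reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an action whose reward is still uncertain.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best known estimate.
            arm = max(range(len(estimates)), key=estimates.__getitem__)
        reward = 1 if rng.random() < reward_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the mean reward for the chosen action.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
```

Over enough steps, the agent's estimates approach the true reward probabilities and most of its actions go to the most rewarding option, while the small exploration rate keeps it from locking onto an early, unlucky guess.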

For more information on SageMaker's frameworks, toolkits, and environments for reinforcement learning, see Use Reinforcement Learning with Amazon SageMaker.