Evaluation recipe examples - Amazon SageMaker AI

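These recipes let you evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format xxx_general_text_benchmark_eval.yaml.

These recipes let you bring your own dataset for benchmarking and compare model outputs against reference answers using different types of metrics. They are provided in the format xxx_bring_your_own_dataset_eval.yaml.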

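The requirements for bringing your own dataset are as follows:

File format requirements
- You must provide a single gen_qa.jsonl file containing the evaluation examples.
- Your dataset must be uploaded to an S3 location that the SageMaker AI training job can access.
- The file must follow the schema required for a general Q&A dataset.
Schema format requirements: each line in the JSONL file must be a JSON object with the following fields:
- query: (Required) String containing the question or instruction that needs an answer
- response: (Required) String containing the expected model output
- system: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
- metadata: (Optional) String containing metadata associated with the entry, used for tagging

The following are example entries of a bring-your-own dataset: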


{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
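
The schema rules listed earlier can be checked locally before the file is uploaded. The following is a minimal sketch, not part of the SageMaker tooling: the function name validate_gen_qa_line is hypothetical, and it assumes that fields outside the four documented ones are simply ignored, which the source does not specify.

```python
import json

REQUIRED_FIELDS = ("query", "response")
OPTIONAL_FIELDS = ("system", "metadata")


def validate_gen_qa_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    # Required fields must be present and must be strings.
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"missing or non-string required field: {field}")
    # Optional fields may be absent, but must be strings when present.
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"optional field is not a string: {field}")
    return problems
```

Running the function over every line of gen_qa.jsonl and stopping at the first non-empty result is one way to catch schema problems before a job is submitted.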

To use your custom dataset, set the following required fields in your evaluation recipe and do not change any of their values:


evaluation:
  task: gen_qa 
  strategy: gen_qa 
  metric: all

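The following limitations apply:

Only one JSONL file is supported per evaluation.
The file must strictly follow the defined schema.
Context length limit: for each sample in the dataset, the context length (including system + query prompts) should be less than 3.5k.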

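Amazon Nova LLM-as-a-Judge is a model evaluation feature that lets you use a custom dataset to compare the response quality of one model against that of a baseline model. It takes a dataset containing prompts, baseline responses, and challenger responses, then uses a Nova Judge model to produce a win-rate metric based on Bradley-Terry probability through pairwise comparisons.

Recipes are provided in the format xxx_llm_judge_eval.yaml.

The requirements for LLM-as-a-Judge are as follows: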

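File format requirements
- You must provide a single llm_judge.jsonl file containing the evaluation examples. The file name must be llm_judge.jsonl.
- Your dataset must be uploaded to an S3 location that SageMaker AI HyperPod RIG can access.
- The file must follow the schema required for the llm_judge.jsonl dataset.
- The input dataset should ensure that every record has a context length of less than 12k.
Schema format requirements: each line in the JSONL file must be a JSON object with the following fields:
- prompt: (Required) String containing the prompt for the generated response.
- response_A: String containing the baseline response.
- response_B: String containing the alternative response to be compared with the baseline response.

The following are example LLM-as-a-Judge entries: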


{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
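
The entries above are spread over several lines for display, whereas a JSONL file holds one object per line. The following sketch shows one way to serialize records into that shape. The helper name to_jsonl is hypothetical, and treating all three fields as mandatory strings is an assumption drawn from the schema list rather than a documented validation rule.

```python
import json

JUDGE_FIELDS = ("prompt", "response_A", "response_B")


def to_jsonl(records: list[dict]) -> str:
    """Serialize judge records as JSONL: one JSON object per line."""
    lines = []
    for index, record in enumerate(records):
        for field in JUDGE_FIELDS:
            if not isinstance(record.get(field), str):
                raise ValueError(f"record {index}: field {field!r} must be a string")
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file named llm_judge.jsonl and uploaded to S3.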

To use your custom dataset, set the following required fields in your evaluation recipe and do not change any of their values:


evaluation:
  task: llm_judge
  strategy: judge
  metric: all

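The following limitations apply:

Only one JSONL file is supported per evaluation.
The file must strictly follow the defined schema.
The Amazon Nova Judge model is the same across all model family specifications (that is, Lite, Micro, and Pro).
Custom judge models are not supported at this time.
Context length limit: for each sample in the dataset, the context length (including system + query prompts) should be less than 7k.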

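Nova LLM Judge for multimodal (image), Nova MM_LLM Judge for short, is a model evaluation feature that lets you use a custom dataset to compare the response quality of one model against that of a baseline model. It takes a dataset containing prompts, baseline responses, and challenger responses, together with images as Base64-encoded strings, then uses a Nova Judge model to produce a win-rate metric based on Bradley-Terry probability through pairwise comparisons. Recipe format: xxx_mm_llm_judge_eval.yaml.

Nova LLM dataset requirements

File format:

A single mm_llm_judge.jsonl file containing the evaluation examples. The file name must be exactly llm_judge.jsonl.
You must upload your dataset to an S3 location that the SageMaker training job can access.
The file must follow the schema required for the mm_llm_judge dataset.
The input dataset should ensure that every record has a context length under 12k, excluding the image attributes.

Schema format: each line in the .jsonl file must be a JSON object with the following fields.

Required fields.

prompt: String containing the prompt for the generated response.

images: Array containing a list of objects with a data attribute (the value is a Base64-encoded image string).

response_A: String containing the baseline response.

response_B: String containing the alternative response to be compared with the baseline response.

Example entries

For readability, the following examples include line breaks and indentation, but in the actual dataset each record should be on a single line.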


{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "how many animals in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
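
A record of this shape can be assembled from raw image bytes with the standard library. This is a sketch under stated assumptions: the helper names image_to_data_uri and build_mm_record are hypothetical, and the data-URI prefix (including its capitalization) is copied from the example entries above rather than from a formal specification.

```python
import base64
import json


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URI in the style of the examples."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Prefix mirrors the documented example entries (assumption: this exact
    # form is what the evaluation job expects).
    return f"data:{mime_type};Base64,{encoded}"


def build_mm_record(prompt: str, images: list[bytes],
                    response_a: str, response_b: str) -> str:
    """Return one single-line JSON record for a multimodal judge dataset."""
    record = {
        "prompt": prompt,
        "images": [{"data": image_to_data_uri(image)} for image in images],
        "response_A": response_a,
        "response_B": response_b,
    }
    return json.dumps(record)
```

Because json.dumps emits no line breaks by default, each returned string can be written directly as one line of the dataset file.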

To use your custom dataset, set the following required fields in your evaluation recipe and do not change any of their values:


evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all

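Limitations

Only one .jsonl file is supported per evaluation.
The file must strictly follow the defined schema.
Nova MM Judge models support only image references.
Nova MM Judge models are the same across Amazon Nova Lite specifications.
Custom judge models are not supported at this time.
Amazon S3 image URIs are not supported.
The input dataset should ensure that every record has a context length under 12k, excluding the image attributes.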

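Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which provides only a preference verdict (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:

Dynamic criteria generation: automatically creates relevant evaluation dimensions based on the input prompt
Weighted scoring: assigns an importance weight to each criterion to reflect its relative significance
Granular evaluation: provides detailed scores for each criterion on a binary (true/false) or scale (1-5) basis
Quality metrics: computes a continuous quality score (0-1 scale) that quantifies the magnitude of the difference between responses

Example criterion generated by the model: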


price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3

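The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.

Recipe configuration

Rubric Judge recipe

Enable Rubric Judge by setting task: rubric_llm_judge in your recipe: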


run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

Original LLM-as-a-Judge recipe (for comparison)

The original judge model uses task: llm_judge:


run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
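
The [FIXED] evaluation values shown in the recipes above can be checked before a job is submitted. The sketch below works on an already-parsed evaluation section (a plain dict) so that it needs no YAML library. The function name check_evaluation_section is hypothetical, and the table covers only the four custom-dataset tasks described in this document.

```python
# Fixed strategy/metric values per task, as shown in the recipes on this page.
EXPECTED_EVALUATION = {
    "gen_qa": {"strategy": "gen_qa", "metric": "all"},
    "llm_judge": {"strategy": "judge", "metric": "all"},
    "mm_llm_judge": {"strategy": "judge", "metric": "all"},
    "rubric_llm_judge": {"strategy": "judge", "metric": "all"},
}


def check_evaluation_section(evaluation: dict) -> list[str]:
    """Return mismatches between a recipe's evaluation section and the fixed values."""
    task = evaluation.get("task")
    expected = EXPECTED_EVALUATION.get(task)
    if expected is None:
        return [f"task {task!r} is not one of the custom-dataset tasks covered here"]
    return [
        f"{key} should be {value!r} for task {task!r}, found {evaluation.get(key)!r}"
        for key, value in expected.items()
        if evaluation.get(key) != value
    ]
```

An empty list means the three fields match one of the documented combinations.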

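Input dataset format

The input dataset format is the same as for the original judge model:

Required fields:

prompt: String containing the input prompt and instructions
response_A: String containing the baseline model output
response_B: String containing the custom model output

Example dataset (JSONL format):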


{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

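Format requirements:

Each entry must be a single-line JSON object
Separate entries with newlines
Use the exact field names shown in the example

Evaluation output

Output structure

Compared with the original judge model, Rubric Judge produces enhanced evaluation metrics: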


{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
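
Once a job finishes, a few of the fields in the structure above are usually the ones worth reading first. The following sketch pulls them out of an output document passed in as a JSON string. The function name summarize_results and the choice of headline fields are assumptions for illustration. Where the output file is stored is not covered here.

```python
import json

# Fields selected for a quick summary (an illustrative choice, not a documented set).
HEADLINE_METRICS = (
    "winrate", "lower_rate", "upper_rate",
    "weighted_score_A", "weighted_score_B", "score_margin",
)


def summarize_results(results_json: str) -> dict:
    """Extract headline metrics for each task key under "results"."""
    document = json.loads(results_json)
    summary = {}
    for task_key, metrics in document["results"].items():
        # Keep only the headline fields that are actually present.
        summary[task_key] = {
            name: metrics[name] for name in HEADLINE_METRICS if name in metrics
        }
    return summary
```

Fields that are missing from a given output (for example, the weighted scores in an original judge run) are skipped rather than reported as errors.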

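New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide granular quality assessment:

Metric	Description
weighted_score_A	Average normalized quality score of response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_A_stderr	Standard error of the mean for weighted_score_A, indicating statistical uncertainty
weighted_score_B	Average normalized quality score of response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_B_stderr	Standard error of the mean for weighted_score_B, indicating statistical uncertainty
score_margin	Difference between the weighted scores (computed as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality
score_margin_stderr	Standard error of the mean for score_margin, indicating uncertainty in the quality-difference measurement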

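Understanding the weighted score metrics

Purpose: weighted scores provide a continuous quality measurement that complements the binary preference verdict, giving deeper insight into model performance.

Key difference from the original judge:

Original judge: outputs only a discrete preference (A>B, B>A, A=B)
Rubric Judge: outputs both a preference and continuous quality scores (0-1 scale) based on custom criteria

Interpreting score_margin:

score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A
|score_margin| < 0.1: small quality difference (close decision)
|score_margin| > 0.2: clear quality difference (confident decision)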
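
The thresholds above can be turned into a small helper for reading results. This is a sketch only: the function name describe_score_margin is hypothetical, and the source does not characterize the band from 0.1 to 0.2, so the code labels it as intermediate rather than guessing.

```python
def describe_score_margin(score_margin: float) -> str:
    """Describe a score_margin value using the documented thresholds."""
    # Positive margins favor response_A, negative margins favor response_B.
    leader = "response_A" if score_margin > 0 else "response_B"
    size = abs(score_margin)
    if size < 0.1:
        return "small quality difference (close decision)"
    if size > 0.2:
        return f"clear quality difference in favor of {leader} (confident decision)"
    # The source gives no label for margins between 0.1 and 0.2.
    return f"intermediate difference in favor of {leader}"
```

For the documented example value of -0.128, the helper reports an intermediate difference in favor of response_B.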

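Use cases:

Model improvement: identify specific areas where your model underperforms
Quality quantification: measure the size of performance gaps, not just win/loss ratios
Confidence assessment: distinguish close decisions from clear quality differences

Important

The final verdict is still based on the judge model's explicit preference label, to preserve holistic reasoning and to ensure that position bias is properly mitigated through forward/backward evaluation. Weighted scores serve as an observability tool, not as a replacement for the primary verdict.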

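Calculation method

Weighted scores are computed through the following process:

Extract criteria data: parse the judge's YAML output to extract criterion scores and weights
Normalize scores:
- Scale-type criteria (1-5): normalized to 0-1 by computing (score - 1) / 4
- Binary criteria (true/false): converted to 1.0/0.0
Apply weights: multiply each normalized score by its criterion weight
Aggregate: sum all weighted scores for each response
Compute margin: calculate score_margin = weighted_score_A - weighted_score_B

Example: if response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, score_margin is -0.13, which indicates that response_B is 13 percentage points higher in quality across all weighted criteria.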
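
The normalize, weight, aggregate, and margin steps can be traced with a short worked example. This is a sketch of the arithmetic only, not the judge's actual implementation: the per-criterion dict layout, the "score" key, and the "binary" type label are assumptions (the source shows only type: "scale"), and the weights are assumed to sum to 1 so that totals stay in the 0-1 range.

```python
def normalize(criterion_type: str, score) -> float:
    """Map a raw criterion score onto the 0-1 range."""
    if criterion_type == "scale":
        # Scale criteria run from 1 to 5.
        return (score - 1) / 4
    if criterion_type == "binary":
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {criterion_type!r}")


def weighted_score(criteria: list[dict]) -> float:
    """Sum weight * normalized score over all criteria for one response."""
    return sum(c["weight"] * normalize(c["type"], c["score"]) for c in criteria)


# Hypothetical criteria and scores for two responses to the same prompt.
response_a = [
    {"type": "scale", "weight": 0.3, "score": 3},
    {"type": "binary", "weight": 0.5, "score": True},
    {"type": "scale", "weight": 0.2, "score": 2},
]
response_b = [
    {"type": "scale", "weight": 0.3, "score": 5},
    {"type": "binary", "weight": 0.5, "score": False},
    {"type": "scale", "weight": 0.2, "score": 4},
]

weighted_score_a = weighted_score(response_a)  # 0.15 + 0.50 + 0.05, about 0.70
weighted_score_b = weighted_score(response_b)  # 0.30 + 0.00 + 0.15, about 0.45
score_margin = weighted_score_a - weighted_score_b  # about 0.25, favoring response_A
```

With these illustrative numbers, the margin of about 0.25 falls in the range that the interpretation guidance treats as a clear quality difference.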

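Reasoning model support

Reasoning model support enables evaluation with reasoning-capable Nova models, which perform explicit internal reasoning before generating the final response. This feature uses API-level control through the reasoning_effort parameter to dynamically enable or disable reasoning, which can improve response quality on complex analytical tasks.

Supported models:

amazon.nova-2-lite-v1:0:256k

Recipe configuration

Enable reasoning by adding the reasoning_effort parameter to the inference section of your recipe: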


run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for SageMaker HyperPod job
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod job (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)

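Using the reasoning_effort parameter

The reasoning_effort parameter controls the reasoning behavior of reasoning-capable models.

Prerequisites:

Model compatibility: set reasoning_effort only when model_type specifies a reasoning-capable model (currently amazon.nova-2-lite-v1:0:256k)
Error handling: using reasoning_effort with an unsupported model fails with ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."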
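
The compatibility rule can be approximated with a local pre-flight check. This is a sketch that mimics the documented behavior, not the service's own validation code: ConfigValidationError is defined here as a stand-in class, validate_reasoning is a hypothetical name, and the accepted values are taken from the recipe comment (low/medium/high or null).

```python
REASONING_MODELS = {"amazon.nova-2-lite-v1:0:256k"}
VALID_EFFORTS = {None, "low", "medium", "high"}


class ConfigValidationError(ValueError):
    """Local stand-in for the error named in the documentation."""


def validate_reasoning(model_type: str, reasoning_effort) -> None:
    """Raise if reasoning_effort is set for a model that cannot reason."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ConfigValidationError(f"Unknown reasoning_effort: {reasoning_effort!r}")
    # None (null in YAML) disables reasoning and is valid for any model.
    if reasoning_effort is not None and model_type not in REASONING_MODELS:
        raise ConfigValidationError(
            f"Reasoning mode is enabled but model '{model_type}' does not support "
            "reasoning. Please use a reasoning-capable model or disable reasoning mode."
        )
```

Calling validate_reasoning with the model_type and reasoning_effort values from a recipe surfaces the mismatch before the job is launched.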

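Available options:

Option	Behavior	Token limit	Use case
null (default)	Disables reasoning mode	N/A	Standard evaluation without reasoning overhead
low	Enables reasoning with constraints	4,000 tokens for internal reasoning	Scenarios that require concise reasoning; optimized for speed and cost
high	Enables unconstrained reasoning	No token limit on internal reasoning	Complex problems that require extensive analysis and step-by-step reasoning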

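When to enable reasoning

Use reasoning mode (low, medium, or high) for:

Complex problem-solving tasks (math, logic puzzles, coding)
Multi-step analytical questions that require intermediate reasoning
Tasks where detailed explanations or step-by-step thinking improve accuracy
Scenarios where response quality takes priority over speed

Use non-reasoning mode (omit the parameter) for:

Simple Q&A or factual queries
Creative writing tasks
Cases where faster response times are critical
Performance benchmarking where reasoning overhead should be excluded
Cost optimization when reasoning does not improve task performance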

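Troubleshooting

Error: "Reasoning mode is enabled but model does not support reasoning"

Cause: the reasoning_effort parameter is set to a non-null value, but the specified model_type does not support reasoning.

Resolution:

Verify that your model type is amazon.nova-2-lite-v1:0:256k
If you are using a different model, either switch to a reasoning-capable model or remove the reasoning_effort parameter from the recipe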