
Custom reward functions in your Amazon environment

Custom reward functions in your Amazon environment support single-turn RFT only. This trains the model on tasks where a single prompt receives a single response that is evaluated independently: the model receives a prompt and generates a response, your reward function scores it, and there is no back-and-forth conversation. This contrasts with multi-turn RFT, where the model interacts with an environment or user over several turns before receiving a final reward.

Architecture overview

The architecture consists of two main components:

Training VPC:

  • Rollout: loads the dataset and model, sends rollouts to the reward function, and receives rewards

  • Trainer: receives rollouts from the Rollout component, performs the forward and backward passes, and updates the model weights

Customer VPC:

  • Reward Lambda: a customer-implemented reward function that evaluates model responses and returns a reward score

Workflow:

  1. The Rollout component loads the dataset and model

  2. Rollout generates model responses and invokes the reward Lambda to obtain rewards

  3. The Lambda returns reward scores

  4. Rollout sends the rollouts to the Trainer

  5. The Trainer updates the policy weights based on the rewards

Recipe configuration

Use this recipe when your reward function completes its processing within 15 minutes.

## Nova Lite RLVR Training (PEFT)
run:
  name: my-rft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://example-bucket/train.jsonl
  output_s3_path: ""
  replicas: 2                  # Number of compute instances for training. All supported values: {2, 4, 8, 16}
  generation_replicas: 2       # LLM inference replicas
  rollout_worker_replicas: 1
  # Lambda functions for RFT
  reward_lambda_arn: ""

## Training config - essential fields for all services
training_config:
  max_length: 10240
  global_batch_size: 256
  reasoning_effort: high
  data:
    shuffle: false
  rollout:
    rollout_strategy:
      type: off_policy_async
      age_tolerance: 2
    advantage_strategy:
      number_generation: 8
    generator:
      max_new_tokens: 8192
      set_random_seed: true
      temperature: 1
      top_k: 0
  rewards:
    api_endpoint:
      lambda_arn: ${run.reward_lambda_arn}
      lambda_concurrency_limit: 100    # Lambda should be able to handle (rollout_worker_replicas * 64) requests

  # Training configuration
  trainer:
    max_steps: 100
    save_steps: 5
    save_top_k: 5

    # RL parameters
    refit_freq: 4
    clip_ratio_high: 0.2
    ent_coeff: 0.001
    loss_scale: 1

    optim_config:              # Optimizer settings
      lr: 7e-7                 # Learning rate
      weight_decay: 0.0        # L2 regularization strength (0.0–1.0)
      adam_beta1: 0.9
      adam_beta2: 0.95

    peft:                      # Parameter-efficient fine-tuning (LoRA)
      peft_scheme: "lora"      # Enable LoRA for PEFT
      lora_tuning:
        alpha: 32
        lora_plus_lr_ratio: 64.0   # LoRA+ learning rate scaling factor (0.0–100.0)
## Nova Lite RLVR Training
run:
  name: my-rft-run
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  data_s3_path: s3://example-bucket/train.jsonl
  output_s3_path: ""
  replicas: 2                  # Number of compute instances for training. All supported values: {2, 4, 8, 16}
  generation_replicas: 2       # LLM inference replicas
  rollout_worker_replicas: 1
  # Lambda functions for RFT
  reward_lambda_arn: ""

## Training config - essential fields for all services
training_config:
  max_length: 10240
  global_batch_size: 256
  reasoning_effort: high
  data:
    shuffle: false
  rollout:
    rollout_strategy:
      type: off_policy_async
      age_tolerance: 2
    advantage_strategy:
      number_generation: 8
    generator:
      max_new_tokens: 8192
      set_random_seed: true
      temperature: 1
      top_k: 0
  rewards:
    api_endpoint:
      lambda_arn: ${run.reward_lambda_arn}
      lambda_concurrency_limit: 100    # Lambda should be able to handle (rollout_worker_replicas * 64) requests

  # Training configuration
  trainer:
    max_steps: 100
    save_steps: 5
    save_top_k: 5

    # RL parameters
    refit_freq: 4
    clip_ratio_high: 0.2
    ent_coeff: 0.001
    loss_scale: 1

    optim_config:              # Optimizer settings
      lr: 7e-7                 # Learning rate
      weight_decay: 0.0        # L2 regularization strength (0.0–1.0)
      adam_beta1: 0.9
      adam_beta2: 0.95

    peft:                      # Parameter-efficient fine-tuning (LoRA)
      peft_scheme: "null"      # Disable LoRA for PEFT

Recipe parameters

  • max_steps: the number of gradient updates applied to the model. Each update uses global_batch_size × refit_freq samples, and each sample corresponds to one model generation. Total training samples = max_steps × global_batch_size (see the arithmetic sketch after this list).

  • max_seq_length: the maximum context length, in tokens, that the model processes during training. It should accommodate the input prompt length plus the generated response length. Setting it too short causes training errors; setting it too large wastes GPU memory and slows training. Available presets: 8K (default), 16K, 32K.

  • global_batch_size: the number of samples per gradient update. Larger values give more stable gradients but require more memory. Note that each sample corresponds to a model generation, not a prompt: a single prompt is used to create number_generation samples. Recommended: 64-4096, in powers of 2.

  • refit_freq: how often the generation model's weights are refreshed. The number of samples per refit is refit_freq * global_batch_size. Higher values increase the effective batch size and improve learning stability; lower values speed up training but increase variance. As refit_freq increases, more of the data becomes off-policy. Recommended: 4 (min: 1, max: 4).

  • rollout_strategy.off_policy_async: makes model updates off-policy, meaning the generations used to compute the loss may come from an earlier version of the model rather than the current one. Enabling off-policy training speeds up training, but it can become unstable if age_tolerance is set too high. Recommended: true (options: true, false).

  • rollout_strategy.age_tolerance: only takes effect when off_policy_async is enabled. Keeps only data from model versions that are at most age_tolerance versions older than the current model. Lower values discard more data; higher values include more data from earlier model versions. Recommended: 2 (min: 1, max: 20).

  • clip_ratio_high: clipping helps prevent large policy updates that can destabilize training. Larger values encourage updates that correct model errors but can destabilize training; smaller values mean less learning. Recommended: 0.3 (min: 0.1, max: 10).

  • ent_coeff: short for "entropy coefficient"; encourages exploration during training by adding an entropy bonus to the loss function. Higher values promote more diverse/exploratory behavior, while lower values focus on exploiting current knowledge. Recommended: 0.0 (min: 0, max: 0.1).
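
To make the relationship between these parameters concrete, the short sketch below works through the arithmetic for the example recipe above (max_steps = 100, global_batch_size = 256, number_generation = 8, rollout_worker_replicas = 1), using the formulas stated in the parameter descriptions; the helper itself is illustrative only, not part of the product.

# Back-of-the-envelope training-scale arithmetic for an RFT recipe (illustrative only).
def training_scale(max_steps: int, global_batch_size: int,
                   number_generation: int, rollout_worker_replicas: int) -> dict:
    total_samples = max_steps * global_batch_size        # total generations consumed by training
    total_prompts = total_samples // number_generation   # each prompt yields number_generation samples
    peak_lambda_requests = rollout_worker_replicas * 64  # concurrency the reward Lambda must handle
    return {
        "total_samples": total_samples,
        "total_prompts": total_prompts,
        "peak_lambda_requests": peak_lambda_requests,
    }

print(training_scale(max_steps=100, global_batch_size=256,
                     number_generation=8, rollout_worker_replicas=1))
# {'total_samples': 25600, 'total_prompts': 3200, 'peak_lambda_requests': 64}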

Reasoning mode selection

Choose from the three reasoning effort levels based on the complexity of your task:

Reasoning effort | Use case | Cost/latency trade-off
Omit the field (no reasoning) | Simple factual queries, classification | Optimized for speed and cost
low | Moderate complexity requiring some reasoning | Balanced performance and efficiency
high | Complex analytical tasks, multi-step problems | Maximum reasoning capability

Default behavior: if reasoning_effort is specified without a value, it defaults to high.

Guidelines:

  • Use high for complex analytical tasks (math, logic, code debugging) where step-by-step thinking adds value

  • Use low for moderately complex tasks that require some reasoning

  • Omit the field entirely for straightforward factual queries, simple classification, and when optimizing for speed and cost

Important

Higher reasoning modes improve performance on tasks that require logical analysis and complex reasoning, but they increase cost and latency during both training and deployment. They do not help with simple factual queries such as "What is the capital of France?"

Reward function implementation

The reward function (also called a scorer or grader) is the core component that evaluates model responses and provides the feedback signal for training. It must be implemented as a Lambda function that accepts model responses and returns reward scores.

Prerequisites

Make sure that your Lambda function and SQS queue follow the required naming formats and that your execution role has the necessary permissions.

Lambda ARN naming:

The Lambda ARN must follow this naming format:

arn:aws:lambda:*:*:function:*SageMaker*

SQS naming (applies only to remote reward functions in your own Amazon environment):

  • Ensure that the execution role created for the HyperPod cluster has SQS permissions

  • The SQS ARN must match one of the following naming formats:

    arn:aws:sqs:*:*:*SageMaker*
    arn:aws:sqs:*:*:*Sagemaker*
    arn:aws:sqs:*:*:*sagemaker*

  • In your SQS client, use an endpoint override (--endpoint https://sqs.us-west-2.amazonaws.com), because the legacy SQS service endpoint is not available through the VPC endpoint; see the boto3 sketch below
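
For example, with boto3 the override can be supplied when the SQS client is created (the region and the commented-out queue URL below are placeholders):

import boto3

# Point the SQS client at the regional endpoint explicitly; the default endpoint
# resolution is not available through the VPC endpoint.
sqs = boto3.client(
    "sqs",
    region_name="us-west-2",
    endpoint_url="https://sqs.us-west-2.amazonaws.com",
)

# Placeholder queue URL; the queue name must match one of the SageMaker naming patterns above.
# messages = sqs.receive_message(QueueUrl="https://sqs.us-west-2.amazonaws.com/<account>/<queue-name-with-SageMaker>")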

IAM policy for the execution role:

{ "Action": "lambda:InvokeFunction", "Resource": [ "arn:aws:lambda:*:*:function:*SageMaker*" ], "Effect": "Allow" }, { "Action": [ "sqs:DeleteMessage", "sqs:ReceiveMessage", "sqs:SendMessage" ], "Resource": [ "arn:aws:sqs:*:*:*SageMaker*" ], "Effect": "Allow" }

VPC endpoints:

For the HyperPod cluster to invoke your Lambda function, you must:

  • Create a VPC endpoint for the Lambda service in the HyperPod cluster's VPC

  • Associate the endpoint with the cluster's security group

  • Ensure that the VPC endpoint policy allows the lambda:InvokeFunction action

Confirm that you can see the Lambda endpoint attached to EKS in your VPC.
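
A minimal boto3 sketch of these steps is shown below; the VPC, subnet, and security group IDs are placeholders, and you can equally create the endpoint from the console or with infrastructure as code.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Create an interface VPC endpoint for the Lambda service in the HyperPod cluster's VPC
# (all IDs below are placeholders).
response = ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    VpcEndpointType="Interface",
    ServiceName="com.amazonaws.us-west-2.lambda",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],   # the cluster's security group
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])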

Interface format

Your reward function must accept and return data in the following formats.

Sample training input:

[{ "messages": [ { "role": "user", "content": "Do you have a dedicated security team?" } ], "metadata": { "reference_answer": { "compliant": "No", "explanation": "As an AI developed by Company, I do not have a traditional security team..." }, "my_key": "sample-001" } }]

Example payload sent to the reward Lambda:

The system appends the assistant turn (the generated response) as the last turn of the messages field and adds a unique id:

[{ "id": "123", "messages": [ { "role": "user", "content": "Do you have a dedicated security team?" }, { "role": "assistant", "content": "As an AI developed by Amazon, I do not have a dedicated security team..." } ], "metadata": { "reference_answer": { "compliant": "No", "explanation": "As an AI developed by Company, I do not have a traditional security team..." }, "my_key": "sample-001" } }]

Reward Lambda contract:

def lambda_handler(event, context):
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        samples: List of dictionaries in OpenAI format
        Example input (List of such sample):
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            "metadata": {
                "reference_answer": {
                    "compliant": "No",
                    "explanation": "As an AI developed by Company, I do not have a traditional security team..."
                },
                "my_key": "sample-001"
            }
        }

    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                         # Same id as input sample
            "aggregate_reward_score": float,   # Overall score for the sample
            "metrics_list": [                  # OPTIONAL: Component scores
                {
                    "name": str,               # Name of the component score
                    "value": float,            # Value of the component score
                    "type": str                # "Reward" or "Metric"
                }
            ]
        }
    """

Input fields:

Field | Description | Notes
id | Unique identifier for the sample | Echoed back in the output; string
messages | Ordered chat history in OpenAI format | Array of message objects
messages[].role | Speaker of the message | Common values: "user", "assistant", "system"
messages[].content | Text content of the message | Plain string
metadata | Free-form information that helps with scoring | Object; optional field passed through from the training data

Output fields:

Field | Description | Notes
id | Same identifier as the input sample | Must match the input
aggregate_reward_score | Overall score for the sample | Float (for example, 0.0-1.0 or a task-defined range)
metrics_list | Component scores that make up the aggregate | Array of metric objects
metrics_list[].name | Name of the component metric/reward | String (for example, "accuracy", "policy_reward")
metrics_list[].value | Value of the component metric/reward | Float
metrics_list[].type | Category of the component | String: "Reward" or "Metric"
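
Putting these fields together, a reward Lambda response for the earlier example payload might look like the following; the score and metric name are illustrative, not returned by any real grader in this guide.

[{
    "id": "123",
    "aggregate_reward_score": 0.0,
    "metrics_list": [
        {
            "name": "accuracy",
            "value": 0.0,
            "type": "Reward"
        }
    ]
}]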

Technical constraints

  • Timeout limit: maximum execution time of 15 minutes per Lambda invocation

  • Concurrency: must handle rollout_worker_replicas × 64 concurrent requests

  • Reliability: must implement proper error handling and return valid scores consistently

  • Performance: optimize for fast execution (seconds, not minutes) to keep training efficient

Best practices:

  • Minimize external API calls

  • Use efficient algorithms and data structures

  • Implement retry logic for transient failures (a sketch follows this list)

  • Cache computations that can be reused

  • Test thoroughly before training to ensure error-free execution
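
As a sketch of the retry guidance above, a small helper with exponential backoff might look like the following; the wrapped call is a placeholder for whatever external dependency your grader uses, and the helper is illustrative rather than part of any service API.

import time

def call_with_retries(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a call that may fail transiently, with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:                      # narrow this to the transient errors you expect
            if attempt == max_retries - 1:
                raise                          # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Usage (placeholder callable):
# score = call_with_retries(lambda: external_scoring_api(sample))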

Using custom reward functions

Implement a custom reward function when you have task-specific evaluation criteria:

  1. Define evaluation criteria: determine what makes a good response for your task

  2. Implement the Lambda function: create a Lambda function that follows the interface format

  3. Test locally: verify that your function returns the correct scores for sample inputs (see the local test sketch after the example Lambda below)

  4. Deploy to Amazon: deploy your Lambda and note its ARN

  5. Configure the recipe: add the Lambda ARN to the reward_lambda_arn field in your recipe

  6. Test with a small dataset: run RFT with minimal data to verify the integration

Example Lambda function

This example validates the input format and compares the model output with the reference answer. Replace the scoring logic with your actual evaluation criteria.

from typing import List
import json
from dataclasses import asdict, dataclass


@dataclass
class RewardOutput:
    """Reward service output."""
    id: str
    aggregate_reward_score: float


def lambda_handler(event, context):
    """Main lambda handler"""
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """Core grader function"""
    scores: List[RewardOutput] = []
    for sample in samples:
        # Extract components
        idx = sample["id"]
        ground_truth = sample.get("metadata", {}).get("reference_answer")

        if "messages" not in sample:
            print(f"Messages is None/empty for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue

        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue

        # Get model's response (last turn is assistant turn)
        last_message = sample["messages"][-1]
        assert last_message["role"] == "assistant", "Last message must be from assistant"
        model_text = last_message["content"]

        ground_truth_text = _extract_ground_truth_text(ground_truth)
        if model_text.lower() == ground_truth_text.lower():
            score = 1.0
        else:
            score = 0.0

        ro = RewardOutput(id=idx, aggregate_reward_score=score)
        scores.append(ro)

    # Convert to dict format for JSON serialization
    return [asdict(score) for score in scores]


def _extract_ground_truth_text(ground_truth) -> str:
    """Turn the `ground_truth` field into a plain string."""
    if isinstance(ground_truth, str):
        return ground_truth
    if isinstance(ground_truth, dict):
        # Common patterns: { "explanation": "...", "answer": "..." }
        if "explanation" in ground_truth and isinstance(ground_truth["explanation"], str):
            return ground_truth["explanation"]
        if "answer" in ground_truth and isinstance(ground_truth["answer"], str):
            return ground_truth["answer"]
        # Fallback: stringify the whole dict
        return json.dumps(ground_truth, ensure_ascii=False)
    # Fallback: stringify anything else
    return str(ground_truth)
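
To test locally (step 3 above) before deploying, you can call the handler directly with a payload shaped like the documented example. A minimal sketch follows, assuming the code above is saved in a module named reward_lambda; the module name is an assumption for illustration.

# Local smoke test for the example grader; run outside Lambda.
from reward_lambda import lambda_handler   # assumes the code above is saved as reward_lambda.py

sample_event = [{
    "id": "123",
    "messages": [
        {"role": "user", "content": "Do you have a dedicated security team?"},
        {"role": "assistant", "content": "As an AI developed by Company, I do not have a traditional security team..."}
    ],
    "metadata": {
        "reference_answer": {
            "compliant": "No",
            "explanation": "As an AI developed by Company, I do not have a traditional security team..."
        },
        "my_key": "sample-001"
    }
}]

print(lambda_handler(sample_event, context=None))
# Expected output: [{'id': '123', 'aggregate_reward_score': 1.0}]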

Using an LLM as a judge for the reward function

Large language models (LLMs) are increasingly used as judges in reinforcement fine-tuning (RFT) workflows, providing automated reward signals that guide model optimization. In this approach, an LLM evaluates model outputs against specified criteria (whether assessing correctness, quality, style adherence, or semantic equivalence) and assigns the rewards that drive the reinforcement learning process.

This is especially valuable for tasks where a traditional reward function is difficult to define programmatically, such as determining whether different representations (for example, "1/3", "0.333", and "one third") are semantically equivalent, or assessing nuanced qualities such as coherence and relevance. By using an LLM-based judge as the reward function, you can extend RFT to complex domains without extensive human annotation, enabling rapid iteration and continuous model improvement across use cases that go beyond traditional alignment problems.

Before deploying LLM-as-a-Judge in production, verify that the judge model's evaluations align with human judgment. This includes measuring the agreement rate between the LLM judge and human evaluators on a representative sample of your task, ideally confirming that LLM-human agreement meets or exceeds the human-human agreement rate. This validation step helps identify potential biases, ensures that the reward signal steers your model in the intended direction, and builds confidence that the automated evaluation process will produce models that meet your production quality standards.

Using LLM-as-a-Judge is a straightforward extension of reinforcement learning with verifiable rewards (RLVR) using Lambda functions: inside the Lambda function, you call one of the models hosted in Amazon Bedrock. To ensure that training and evaluation work well with the judge model, make sure the Amazon Bedrock model you use has sufficient throughput quota.

Configure your Lambda function with a long timeout, up to the 15-minute maximum. The Lambda default is 3 seconds, and you must change the timeout in the Lambda configuration to account for Amazon Bedrock model responses taking longer than logic-based reward functions. Lambdas are also invoked in parallel during training, so increase the concurrency to make full use of the available throughput. Note that the concurrency limit must be set in both the Lambda configuration and the training job recipe; a boto3 sketch follows.
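
For example, the timeout and reserved concurrency can be raised with boto3; the function name and values below are placeholders, and the same settings are available in the Lambda console.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Raise the timeout toward the 15-minute maximum (900 seconds) to absorb judge-model latency.
lambda_client.update_function_configuration(
    FunctionName="my-SageMaker-reward-judge",    # placeholder; must match the *SageMaker* naming pattern
    Timeout=900,
)

# Reserve enough concurrency for parallel invocations during training;
# keep this consistent with lambda_concurrency_limit in the recipe.
lambda_client.put_function_concurrency(
    FunctionName="my-SageMaker-reward-judge",
    ReservedConcurrentExecutions=12,
)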

Example training recipe:

display_name: "Nova Lite V2 LoRA RLVR SMTJ training on GPU" version: "1.0" instance_types: ["ml.p5.48xlarge", "ml.p5en.48xlarge"] run: name: <experiment_name> model_type: amazon.nova-2-lite-v1:0:256k model_name_or_path: "nova-lite-2/prod" data_s3_path: s3://<path>/<training_data>.jsonl replicas: 4 reward_lambda_arn: arn:aws:lambda:<region>:<account>:function:<lambda-name> ## SMTJ RFT Training specific configs training_config: max_length: 1200 # Context window (tokens) for inputs+prompt global_batch_size: 64 # Total samples per optimizer step across all replicas (16/32/64/128/256) reasoning_effort: high # Enables reasoning mode High / Low / or null for non-reasoning test_freq: 10 rollout: # How responses are generated for GRPO/advantage calc advantage_strategy: number_generation: 4 # N samples per prompt to estimate advantages (variance vs cost) generator: max_new_tokens: 1024 # Cap on tokens generated per sample set_random_seed: true # Seed generation for reproducibility across runs temperature: 1 # Softmax temperature top_k: 1 # Sample only from top-K logits rewards: preset_reward_function: null # Usage of reward functions built into Verl [exact_match, code_executions, math_answers] api_endpoint: lambda_arn: arn:aws:lambda:<region>:<account>:function:<lambda-name> lambda_concurrency_limit: 12 # Max concurrent Lambda invocations (throughput vs. throttling) trainer: max_steps: 100 # Steps to train for. One Step = global_batch_size save_steps: 20 test_freq:10 # RL parameters ent_coeff: 0.0 # A bonus added to the policy loss that rewards higher-output entropy kl_loss_coef: 0.0 # Weight on the KL penalty between the actor (trainable policy) and a frozen reference model optim_config: # Optimizer settings lr: 1e-6 # Learning rate weight_decay: 0.0 # L2 regularization strength (0.0–1.0) adam_beta1: 0.9 adam_beta2: 0.95

Example Lambda:

This Lambda function implements an LLM-as-a-Judge reward scoring system for reinforcement fine-tuning. It processes batches of model-generated responses by extracting answers from well-formatted output (looking for \boxed{} notation), then uses Claude Haiku as the judge model to rate the semantic similarity between the extracted answer and the ground-truth reference answer on a 0.0-1.0 scale. The judge compares the answers to determine whether they are semantically equivalent (even when represented differently, such as "1/3" versus "0.333"), handling cases where answers may be formatted in various ways. The function includes retry logic for throttling, validates the message structure, and returns a list of reward scores that can be used as the training signal during the reinforcement learning process, with a score of 0.0 when an answer cannot be extracted or validation fails.

import json
import re
import time
from dataclasses import asdict, dataclass
from typing import List, Optional

import boto3


def extract_solution_nova(solution_str: str, method: str = "strict") -> Optional[str]:
    """
    Extract solution from Nova-formatted response.

    Args:
        solution_str: The solution text from Nova model
        method: "strict" or "flexible" extraction method

    Returns:
        Extracted numerical answer or None
    """
    boxed_matches = re.findall(r'\\boxed\{([^}]+)\}', solution_str)
    if boxed_matches:
        final_answer = boxed_matches[-1].replace(",", "").replace("$", "")
        return final_answer
    return None


bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')
JUDGE_MODEL_ID = "global.anthropic.claude-haiku-4-5-20251001-v1:0"

SYSTEM_PROMPT = "You must output ONLY a number between 0.0 and 1.0. No explanations, no text, just the number."

JUDGE_PROMPT_TEMPLATE = """Compare the following two responses and rate how similar they are on a scale of 0.0 to 1.0, where:
- 1.0 means the responses are semantically equivalent (same meaning, even if worded differently)
- 0.5 means the responses are partially similar
- 0.0 means the responses are completely different or contradictory

Response A: {response_a}

Response B: {response_b}

Output ONLY a number between 0.0 and 1.0. No explanations."""


def lambda_graded(id: str, response_a: str, response_b: str, max_retries: int = 50) -> float:
    """Call Bedrock to compare responses and return similarity score."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(response_a=response_a, response_b=response_b)
    print(f"Calling judge: {JUDGE_MODEL_ID}")
    for attempt in range(max_retries):
        try:
            print(f"Attempt: {attempt}")
            response = bedrock_runtime.converse(
                modelId=JUDGE_MODEL_ID,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
                system=[{"text": SYSTEM_PROMPT}],
                inferenceConfig={"temperature": 0.0, "maxTokens": 10}
            )
            print(f"Bedrock call successful: {response}")
            output = response['output']['message']['content'][0]['text'].strip()
            score = float(output)
            print(f"Score parsed: {score}")
            return max(0.0, min(1.0, score))
        except Exception as e:
            if "ThrottlingException" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                print(f"Throttling {id}")
            else:
                print(f"Bedrock call failed: {e}")
                return 0.0
    print("Max retries reached. Unable to complete the request.")
    return 0.0


def compute_score(id: str, solution_str: str, ground_truth: str, method: str = "strict",
                  format_score: float = 0.0, score: float = 1.0,
                  data_source: str = 'dataset_name',
                  extra_info: Optional[dict] = None) -> float:
    """
    The scoring function for PandaLM with Nova format.

    Args:
        solution_str: The solution text from Nova model
        ground_truth: JSON string containing the ground truth answer
        method: The method to extract the solution, choices are 'strict' and 'flexible'
        format_score: The score for format compliance
        score: The score for correct answer
        data_source: Should match the data_source in the given dataset
        extra_info: Optional dict with additional fields. Required in function signature.

    Returns:
        Score between 0 and 1
    """
    answer = extract_solution_nova(solution_str=solution_str, method=method)
    if answer is None:
        return 0.0

    print(f"Answer: {str(answer)}, Reference: {str(ground_truth)}")

    # Clean both answers for comparison
    clean_answer = str(answer)
    clean_ground_truth = str(ground_truth)

    score = lambda_graded(id, response_a=clean_answer, response_b=clean_ground_truth)
    print(f"Raw score: {score}")
    return score


@dataclass
class RewardOutput:
    """Reward service."""
    id: str
    aggregate_reward_score: float


def lambda_handler(event, context):
    scores: List[RewardOutput] = []
    samples = event
    print(len(samples))

    for sample in samples:
        print("Sample: ", json.dumps(sample, indent=2))

        idx = "no id"
        if "id" not in sample:
            print(f"ID is None/empty for sample: {sample}")
        else:
            idx = sample["id"]

        if "messages" not in sample:
            print(f"Messages is None/empty for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue

        # Extract the ground-truth reference answer from the sample metadata
        ground_truth = sample.get("metadata", {}).get("reference_answer")
        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue

        # Get completion from last message (assistant message)
        last_message = sample["messages"][-1]
        if last_message["role"] not in ["assistant", "nova_assistant"]:
            print(f"Last message is not from assistant for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue

        if "content" not in last_message:
            print(f"Completion text is empty for id: {idx}")
            scores.append(RewardOutput(id=idx, aggregate_reward_score=0.0))
            continue
        completion_text = last_message["content"]

        judge_score = compute_score(id=idx, solution_str=completion_text, ground_truth=ground_truth)
        ro = RewardOutput(id=idx, aggregate_reward_score=judge_score)
        print(f"Response for id: {idx} is {ro}")
        scores.append(ro)

    return [asdict(score) for score in scores]