Rubric-based judge

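Overview

Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which only provides a preference verdict (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.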

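Key capabilities

Dynamic criteria generation - automatically creates relevant evaluation dimensions based on the input prompt
Weighted scoring - assigns an importance weight to each criterion to reflect its relative significance
Granular assessment - provides a detailed score for each criterion on either a binary (true/false) or scale (1-5) basis
Quality metrics - computes continuous quality scores (0-1 scale) that quantify the magnitude of the difference between responses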

Example criterion generated by the model


price_validation:  
  description: "The response includes validation to ensure price is a positive value."  
  type: "scale"  
  weight: 0.3

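The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.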

Recipe configuration

Rubric Judge recipe

Enable Rubric Judge by setting task: rubric_llm_judge in your recipe:


run:  
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type  
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job  
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job  
    
evaluation:  
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge  
  strategy: judge                                       # [FIXED] Evaluation strategy  
  metric: all                                           # [FIXED] Metric calculation method  
    
inference:  
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

Original LLM as a Judge recipe (for comparison)

The original judge model uses task: llm_judge:


run:  
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type   
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job  
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job  
    
evaluation:  
  task: llm_judge                                       # [FIXED] Original judge task  
  strategy: judge                                       # [FIXED] Evaluation strategy  
  metric: all                                           # [FIXED] Metric calculation method  
  
inference:  
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

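Input dataset format

The input dataset format is identical to that of the original judge model:

Required fields

prompt - string containing the input prompt and instructions
response_A - string containing the baseline model output
response_B - string containing the custom model output

Example dataset (JSONL format)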


{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}  
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}  
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

Format requirements

Each entry must be a single-line JSON object
Separate entries with newlines
Use the exact field names shown in the example
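
The format requirements above can be checked before a job is submitted. The following is a minimal Python sketch, not part of any official tooling: the function name validate_jsonl_lines is hypothetical, and the choice to skip blank lines is an assumption made here for convenience.

```python
import json

# Field names taken from the "Required fields" list; they are case-sensitive.
REQUIRED_FIELDS = ("prompt", "response_A", "response_B")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Assumption: blank lines are skipped here rather than reported.
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "entry is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((line_number, f"missing field: {field}"))
            elif not isinstance(record[field], str):
                problems.append((line_number, f"field is not a string: {field}"))
    return problems


sample = [
    '{"prompt": "How does photosynthesis work?", "response_A": "Plants use sunlight.", "response_B": "Chlorophyll absorbs light."}',
    # The second entry uses "response_a" (wrong case) to show a detected problem.
    '{"prompt": "Explain how a CPU works", "response_a": "It is the brain.", "response_B": "Fetch-execute cycle."}',
]
print(validate_jsonl_lines(sample))
```

To check a file on disk, pass an open file handle, for example validate_jsonl_lines(open(path, encoding="utf-8")), since iterating a text file yields one line per entry.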

Evaluation output

Output structure

Rubric Judge produces enhanced evaluation metrics compared to the original judge model:


{  
  "config_general": {  
    "lighteval_sha": "string",  
    "num_fewshot_seeds": "int",  
    "max_samples": "int | null",  
    "job_id": "int",  
    "start_time": "float",  
    "end_time": "float",  
    "total_evaluation_time_secondes": "string",  
    "model_name": "string",  
    "model_sha": "string",  
    "model_dtype": "string | null",  
    "model_size": "string"  
  },  
  "results": {  
    "custom|rubric_llm_judge_judge|0": {  
      "a_scores": "float",  
      "a_scores_stderr": "float",  
      "b_scores": "float",  
      "b_scores_stderr": "float",  
      "ties": "float",  
      "ties_stderr": "float",  
      "inference_error": "float",  
      "inference_error_stderr": "float",  
      "score": "float",  
      "score_stderr": "float",  
      "weighted_score_A": "float",  
      "weighted_score_A_stderr": "float",  
      "weighted_score_B": "float",  
      "weighted_score_B_stderr": "float",  
      "score_margin": "float",  
      "score_margin_stderr": "float",  
      "winrate": "float",  
      "lower_rate": "float",  
      "upper_rate": "float"  
    }  
  },  
  "versions": {  
    "custom|rubric_llm_judge_judge|0": "int"  
  }  
}
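
A minimal Python sketch of pulling the rubric-specific metrics out of a results document shaped like the structure above. The helper name summarize_rubric_metrics is hypothetical, and the numeric values are illustrative placeholders rather than real evaluation output.

```python
import json

# Task key as it appears under "results" in the output structure.
TASK_KEY = "custom|rubric_llm_judge_judge|0"
RUBRIC_METRICS = ("weighted_score_A", "weighted_score_B", "score_margin")


def summarize_rubric_metrics(results_json):
    """Map each rubric metric name to a (value, stderr) pair."""
    task_results = results_json["results"][TASK_KEY]
    return {
        name: (task_results[name], task_results[f"{name}_stderr"])
        for name in RUBRIC_METRICS
    }


# Illustrative values only; a real file also contains config_general and versions.
example = json.loads("""
{
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "weighted_score_A": 0.65, "weighted_score_A_stderr": 0.02,
      "weighted_score_B": 0.78, "weighted_score_B_stderr": 0.02,
      "score_margin": -0.13, "score_margin_stderr": 0.03
    }
  }
}
""")

for metric, (value, stderr) in summarize_rubric_metrics(example).items():
    print(f"{metric}: {value:+.3f} (stderr {stderr:.3f})")
```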

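New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide granular quality assessment:

Metric	Description
weighted_score_A	Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_A_stderr	Standard error of the mean for weighted_score_A, indicating statistical uncertainty
weighted_score_B	Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_B_stderr	Standard error of the mean for weighted_score_B, indicating statistical uncertainty
score_margin	Difference between the weighted scores (calculated as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality
score_margin_stderr	Standard error of the mean for score_margin, indicating the uncertainty in the quality difference measurement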

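Understanding weighted score metrics

Purpose: Weighted scores provide a continuous quality measurement that complements the discrete preference verdict, giving deeper insight into model performance.

Key differences from the original judge

Original judge - outputs only discrete preferences (A>B, B>A, A=B)
Rubric Judge - outputs both a preference and continuous quality scores (0-1 scale) based on custom criteria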

Interpreting score_margin

score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A
|score_margin| < 0.1: narrow quality difference (close decision)
|score_margin| > 0.2: clear quality difference (confident decision)
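
The thresholds above can be turned into a small helper. This is a sketch under stated assumptions: the function name interpret_score_margin is hypothetical, and the "moderate" label for magnitudes between 0.1 and 0.2 is an assumption, because the text does not name that band.

```python
def interpret_score_margin(score_margin):
    """Return (leader, strength) for a score_margin in the range [-1.0, 1.0]."""
    if not -1.0 <= score_margin <= 1.0:
        raise ValueError("score_margin must be within [-1.0, 1.0]")

    magnitude = abs(score_margin)
    if magnitude < 0.1:
        strength = "close"
    elif magnitude > 0.2:
        strength = "clear"
    else:
        # Assumption: the 0.1-0.2 band is not named in the text.
        strength = "moderate"

    # score_margin = weighted_score_A - weighted_score_B, so positive favors A.
    if score_margin > 0:
        leader = "response_A"
    elif score_margin < 0:
        leader = "response_B"
    else:
        leader = "neither"
    return leader, strength


print(interpret_score_margin(-0.128))  # ('response_B', 'moderate')
```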

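Use cases

Model improvement - identify specific areas where your model underperforms
Quality quantification - measure the size of performance gaps, not just win/loss ratios
Confidence assessment - distinguish close decisions from clear quality differences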

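Important

The final verdict is still based on the judge model's explicit preference label. This preserves holistic reasoning and ensures that position bias is properly mitigated through forward/backward evaluation. Weighted scores serve as an observability tool, not as a replacement for the primary verdict.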

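Calculation method

Weighted scores are calculated through the following process:

Extract criteria data - parse the judge's YAML output to extract criterion scores and weights
Normalize scores:
- Scale-type criteria (1-5): normalized to 0-1 by computing (score - 1) / 4
- Binary criteria (true/false): converted to 1.0/0.0
Apply weights - multiply each normalized score by its criterion weight
Aggregate - sum all weighted scores for each response
Calculate margin - compute score_margin = weighted_score_A - weighted_score_B

Example: If response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, score_margin is -0.13, which indicates that response_B is 13 percentage points higher in quality across all weighted criteria.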
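
The steps above can be sketched in Python. This is an illustration under stated assumptions, not the service's implementation: the exact schema of the judge's YAML output is not given here, so criteria are assumed to be already parsed into dicts with type, weight, and score keys; the "binary" type label and every criterion name other than price_validation are hypothetical; and weights are assumed to sum to 1 so that the result stays on the 0-1 scale.

```python
def normalize_score(criterion_type, score):
    """Map a raw criterion score onto the 0-1 scale."""
    if criterion_type == "scale":
        if not 1 <= score <= 5:
            raise ValueError("scale scores must be between 1 and 5")
        return (score - 1) / 4
    if criterion_type == "binary":  # assumed label for true/false criteria
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {criterion_type}")


def weighted_score(criteria):
    """Sum weight * normalized score over all criteria for one response."""
    return sum(c["weight"] * normalize_score(c["type"], c["score"]) for c in criteria)


def score_margin(criteria_a, criteria_b):
    """weighted_score_A - weighted_score_B; positive means response_A is better."""
    return weighted_score(criteria_a) - weighted_score(criteria_b)


# Same criteria and weights for both responses; only the scores differ.
criteria_a = [
    {"name": "price_validation", "type": "scale", "weight": 0.3, "score": 4},
    {"name": "error_handling", "type": "binary", "weight": 0.2, "score": True},
    {"name": "clarity", "type": "scale", "weight": 0.5, "score": 3},
]
criteria_b = [
    {"name": "price_validation", "type": "scale", "weight": 0.3, "score": 5},
    {"name": "error_handling", "type": "binary", "weight": 0.2, "score": True},
    {"name": "clarity", "type": "scale", "weight": 0.5, "score": 4},
]

print(f"weighted_score_A = {weighted_score(criteria_a):.3f}")
print(f"weighted_score_B = {weighted_score(criteria_b):.3f}")
print(f"score_margin     = {score_margin(criteria_a, criteria_b):+.3f}")
```

With these illustrative scores, response_A totals about 0.675 and response_B about 0.875, so the margin is about -0.2 in favor of response_B. The reported metrics are averages of such per-sample values across the dataset.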

