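Rubric-based judge
Overview
Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which provides only a preference verdict (A>B, B>A, or tie), Rubric Judge also generates evaluation criteria for each prompt and scores both responses against them.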
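Key capabilities
- Dynamic criteria generation: automatically creates relevant evaluation dimensions based on the input prompt
- Weighted scoring: assigns each criterion a weight that reflects its relative importance
- Granular assessment: provides a detailed score for each criterion, either binary (true/false) or on a scale (1-5)
- Quality metrics: computes continuous quality scores (0-1 scale) that quantify the magnitude of the difference between responses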
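Example criterion generated by the model

    price_validation:
      description: "The response includes validation to ensure price is a positive value."
      type: "scale"
      weight: 0.3

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.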
Recipe configuration
Rubric Judge recipe
Enable Rubric Judge by setting task: rubric_llm_judge in your recipe:
    run:
      name: nova-eval-job-name                  # [MODIFIABLE] Unique identifier for your evaluation job
      model_type: amazon.nova-2-lite-v1:0:256k  # [FIXED] Rubric Judge model type
      model_name_or_path: "nova-lite-2/prod"    # [FIXED] Path to model checkpoint or identifier
      replicas: 1                               # [MODIFIABLE] Number of replicas for SageMaker Training job
      data_s3_path: ""                          # [FIXED] Leave empty for SageMaker Training job
      output_s3_path: ""                        # [FIXED] Leave empty for SageMaker Training job
    evaluation:
      task: rubric_llm_judge                    # [FIXED] Evaluation task - enables Rubric Judge
      strategy: judge                           # [FIXED] Evaluation strategy
      metric: all                               # [FIXED] Metric calculation method
    inference:
      max_new_tokens: 12000                     # [MODIFIABLE] Maximum tokens to generate
      top_k: -1                                 # [MODIFIABLE] Top-k sampling parameter
      top_p: 1.0                                # [MODIFIABLE] Nucleus sampling parameter
      temperature: 0                            # [MODIFIABLE] Sampling temperature (0 = deterministic)
Original LLM-as-a-judge recipe (for comparison)
The original judge model uses task: llm_judge:
    run:
      name: eval-job-name                       # [MODIFIABLE] Unique identifier for your evaluation job
      model_type: amazon.nova-micro-v1:0:128k   # [FIXED] Model type
      model_name_or_path: "nova-micro/prod"     # [FIXED] Path to model checkpoint or identifier
      replicas: 1                               # [MODIFIABLE] Number of replicas for SageMaker Training job
      data_s3_path: ""                          # [FIXED] Leave empty for SageMaker Training job
      output_s3_path: ""                        # [FIXED] Leave empty for SageMaker Training job
    evaluation:
      task: llm_judge                           # [FIXED] Original judge task
      strategy: judge                           # [FIXED] Evaluation strategy
      metric: all                               # [FIXED] Metric calculation method
    inference:
      max_new_tokens: 12000                     # [MODIFIABLE] Maximum tokens to generate
      top_k: -1                                 # [MODIFIABLE] Top-k sampling parameter
      top_p: 1.0                                # [MODIFIABLE] Nucleus sampling parameter
      temperature: 0                            # [MODIFIABLE] Sampling temperature (0 = deterministic)
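Input dataset format
The input dataset format is the same as for the original judge model.
Required fields
- prompt: string containing the input prompt and instructions
- response_A: string containing the baseline model output
- response_B: string containing the custom model output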
Example dataset (JSONL format)
    {"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
    {"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
    {"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}
Format requirements
- Each entry must be a single-line JSON object
- Separate entries with a newline
- Use the exact field names shown in the example
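The format requirements above can be checked before a job is submitted. The following is a minimal sketch using only the Python standard library; the helper name `validate_jsonl` is hypothetical and not part of any Nova or SageMaker tooling, and skipping blank lines is an assumption the requirements do not address.

```python
import json

# Field names taken from the "Required fields" list.
REQUIRED_FIELDS = ("prompt", "response_A", "response_B")


def validate_jsonl(text):
    """Return a list of (line_number, message) problems found in JSONL text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored rather than rejected
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "entry is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            # Each required field must be present and hold a string.
            if not isinstance(entry.get(field), str):
                problems.append((number, f"missing or non-string field: {field}"))
    return problems
```

An empty list means every non-blank line parsed as a JSON object with the three required string fields.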
Evaluation output
Output structure
Rubric Judge produces richer evaluation metrics than the original judge model:
    {
      "config_general": {
        "lighteval_sha": "string",
        "num_fewshot_seeds": "int",
        "max_samples": "int | null",
        "job_id": "int",
        "start_time": "float",
        "end_time": "float",
        "total_evaluation_time_secondes": "string",
        "model_name": "string",
        "model_sha": "string",
        "model_dtype": "string | null",
        "model_size": "string"
      },
      "results": {
        "custom|rubric_llm_judge_judge|0": {
          "a_scores": "float",
          "a_scores_stderr": "float",
          "b_scores": "float",
          "b_scores_stderr": "float",
          "ties": "float",
          "ties_stderr": "float",
          "inference_error": "float",
          "inference_error_stderr": "float",
          "score": "float",
          "score_stderr": "float",
          "weighted_score_A": "float",
          "weighted_score_A_stderr": "float",
          "weighted_score_B": "float",
          "weighted_score_B_stderr": "float",
          "score_margin": "float",
          "score_margin_stderr": "float",
          "winrate": "float",
          "lower_rate": "float",
          "upper_rate": "float"
        }
      },
      "versions": {
        "custom|rubric_llm_judge_judge|0": "int"
      }
    }
New metrics in Rubric Judge
The following six metrics are unique to Rubric Judge and provide a granular quality assessment:
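| Metric | Description |
|---|---|
| weighted_score_A | Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) |
| weighted_score_A_stderr | Standard error of the mean for weighted_score_A, indicating statistical uncertainty |
| weighted_score_B | Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) |
| weighted_score_B_stderr | Standard error of the mean for weighted_score_B, indicating statistical uncertainty |
| score_margin | Difference between the weighted scores (computed as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality |
| score_margin_stderr | Standard error of the mean for score_margin, indicating uncertainty in the measured quality difference |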
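As a rough illustration of how the six rubric-specific metrics might be pulled out of a results document, here is a short Python sketch. The helper name `rubric_metrics` is hypothetical, and the numbers in `sample_report` are made up purely for illustration; only the key names come from the output structure shown earlier.

```python
import json

# Key under "results", as it appears in the output structure.
TASK_KEY = "custom|rubric_llm_judge_judge|0"

RUBRIC_METRICS = (
    "weighted_score_A", "weighted_score_A_stderr",
    "weighted_score_B", "weighted_score_B_stderr",
    "score_margin", "score_margin_stderr",
)


def rubric_metrics(report):
    """Pick the six Rubric Judge metrics out of a parsed results document."""
    task_results = report["results"][TASK_KEY]
    return {name: task_results[name] for name in RUBRIC_METRICS}


# Illustrative numbers only; this is not real evaluation output.
sample_report = json.loads("""
{
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": 3.0, "b_scores": 6.0, "ties": 1.0,
      "weighted_score_A": 0.65, "weighted_score_A_stderr": 0.02,
      "weighted_score_B": 0.78, "weighted_score_B_stderr": 0.02,
      "score_margin": -0.13, "score_margin_stderr": 0.03
    }
  }
}
""")

for name, value in rubric_metrics(sample_report).items():
    print(f"{name}: {value}")
```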
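Understanding the weighted score metrics
Purpose: weighted scores provide a continuous quality measurement that complements the discrete preference verdict, giving deeper insight into model performance.
Key differences from the original judge
- Original judge: outputs only a discrete preference (A>B, B>A, A=B)
- Rubric Judge: outputs a preference and continuous quality scores (0-1 scale) based on custom criteria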
Interpreting score_margin
- score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A
- |score_margin| < 0.1: small quality difference (close decision)
- |score_margin| > 0.2: clear quality difference (confident decision)
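The reading rules above can be sketched as a small Python helper. The function name `describe_margin` is hypothetical, and the label for the band between 0.1 and 0.2 is an assumption, since the text names only the two outer bands.

```python
def describe_margin(score_margin):
    """Map a score_margin value to the qualitative reading described above."""
    if not -1.0 <= score_margin <= 1.0:
        raise ValueError("score_margin is expected to lie in [-1.0, 1.0]")

    size = abs(score_margin)
    if size < 0.1:
        strength = "close decision"
    elif size > 0.2:
        strength = "clear quality difference"
    else:
        # Assumption: the source does not name the 0.1-0.2 band.
        strength = "moderate difference"

    # Positive margins favor response_A, negative margins favor response_B.
    if score_margin > 0:
        leader = "response_A"
    elif score_margin < 0:
        leader = "response_B"
    else:
        leader = "neither response"

    return f"{leader} ahead by {size * 100:.1f} points ({strength})"


print(describe_margin(-0.128))
```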
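Use cases
- Model improvement: identify the specific areas where a model underperforms
- Quality quantification: measure the size of a performance gap, not just the win/loss ratio
- Confidence assessment: distinguish close decisions from clear quality differences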
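Important
The final verdict is still based on the judge model's explicit preference label. This preserves holistic reasoning and ensures that position bias is properly mitigated through forward/backward evaluation. Weighted scores serve as an observability tool, not as a replacement for the primary verdict.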
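Calculation methodology
Weighted scores are computed through the following process:
- Extract criterion data: parse the judge's YAML output to extract the criterion scores and weights
- Normalize scores:
  - Scale criteria (1-5): normalize to 0-1 by computing (score - 1) / 4
  - Binary criteria (true/false): convert to 1.0/0.0
- Apply weights: multiply each normalized score by its criterion weight
- Aggregate: sum all weighted scores for each response
- Compute the margin: score_margin = weighted_score_A - weighted_score_B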
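Example: if response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, then score_margin is -0.13, which indicates that response_B scores 13 percentage points higher in quality across all weighted criteria.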
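The calculation steps can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the service's implementation: the data layout (a list of criterion dicts plus a per-response score mapping), the type label `"binary"`, and the criteria `handles_errors` and `readability` are all hypothetical, and the weights are assumed to sum to 1 so that the result stays on a 0-1 scale. Only `price_validation`, the `"scale"` type, and the normalization formulas come from the text.

```python
def normalize(score, criterion_type):
    """Map a raw criterion score onto the 0-1 range."""
    if criterion_type == "scale":
        return (score - 1) / 4          # 1-5 scale -> 0-1
    if criterion_type == "binary":      # assumption: label for true/false criteria
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {criterion_type}")


def weighted_score(criteria, scores):
    """Sum weight * normalized score over all criteria for one response."""
    return sum(
        c["weight"] * normalize(scores[c["name"]], c["type"])
        for c in criteria
    )


# Hypothetical rubric; weights are assumed to sum to 1.
criteria = [
    {"name": "price_validation", "type": "scale", "weight": 0.3},
    {"name": "handles_errors", "type": "binary", "weight": 0.5},
    {"name": "readability", "type": "scale", "weight": 0.2},
]
scores_a = {"price_validation": 4, "handles_errors": True, "readability": 2}
scores_b = {"price_validation": 5, "handles_errors": True, "readability": 3}

weighted_a = weighted_score(criteria, scores_a)
weighted_b = weighted_score(criteria, scores_b)
score_margin = weighted_a - weighted_b   # negative means response_B scored higher

print(round(weighted_a, 3), round(weighted_b, 3), round(score_margin, 3))
```

The reported metrics are averages over the whole dataset, so a per-sample calculation like this one would be repeated for each entry before the mean and standard error are taken.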