
Available benchmark tasks

A sample code package is provided to demonstrate how to compute benchmark metrics with the SageMaker AI model evaluation capability for Amazon Nova. To access the code package, see Sample-nova-lighteval-custom-task.

The following industry-standard benchmarks are supported. You can specify any of them in the eval_task parameter. In the Strategy column, zs is zero-shot, zs_cot is zero-shot chain-of-thought, and fs is few-shot. An illustrative selection check follows the subtask lists at the end of this page.

| Benchmark | Modality | Description | Metrics | Strategy | Available subtasks |
| --- | --- | --- | --- | --- | --- |
| mmlu | Text | Multitask language understanding: tests knowledge across 57 subjects. | accuracy | zs_cot | Yes (listed below) |
| mmlu_pro | Text | MMLU professional subset: focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | Text | Advanced reasoning tasks: a suite of challenging problems that test higher-level cognitive and problem-solving abilities. | accuracy | zs_cot | Yes (listed below) |
| gpqa | Text | General physics question answering: assesses understanding of physics concepts and related problem-solving skills. | accuracy | zs_cot | No |
| math | Text | Mathematical problem solving: measures mathematical reasoning across areas such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes (listed below) |
| strong_reject | Text | Quality-control task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | No |
| IFEval | Text | Instruction-following evaluation: measures how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No |
| gen_qa | Text | Custom dataset evaluation: bring your own dataset and benchmark model outputs against reference answers with metrics such as ROUGE and BLEU (see the sketch after this table). | all | gen_qa | No |
| llm_judge | Text | LLM-as-a-judge preference comparison: uses a Nova judge model to determine the preference between paired responses (B and A) for a prompt, computing the probability that B is preferred over A (see the sketch after this table). | all | judge | No |
| humaneval | Text | HumanEval: a benchmark dataset designed to evaluate the code-generation capabilities of large language models. | pass@1 | zs | No |
| mm_llm_judge | Multimodal (image) | Behaves the same as the text-based llm_judge benchmark above; the only difference is that it supports image reasoning. | all | judge | No |
| rubric_llm_judge | Text | Rubric Judge: an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which returns only a preference verdict, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns fine-grained scores across multiple dimensions. | all | judge | No |
| aime_2024 | Text | AIME 2024: American Invitational Mathematics Examination problems that test advanced mathematical reasoning and problem solving. | exact_match | zs_cot | No |
| calendar_scheduling | Text | Natural Plan calendar-scheduling task: tests planning capabilities by scheduling meetings across multiple days and multiple attendees. | exact_match | fs | No |
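
The gen_qa comparison against reference answers runs inside the evaluation job, but the same kind of scoring is easy to reproduce offline. Below is a minimal sketch, assuming the open-source rouge-score and sacrebleu packages and made-up example strings; it is not the implementation SageMaker AI uses.

```python
# Offline reference comparison in the spirit of gen_qa.
# Assumes `pip install rouge-score sacrebleu`; illustrative only.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Corpus-level BLEU; sacrebleu takes one list per set of references.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L F-measure for each prediction/reference pair.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for ref, pred in zip(references, predictions):
    print(f"ROUGE-L F1: {scorer.score(ref, pred)['rougeL'].fmeasure:.3f}")
```

The llm_judge metric likewise reduces to a simple aggregation once the judge has emitted a verdict for each (A, B) pair. Here is a minimal sketch of that aggregation with hypothetical verdict labels; the Nova judge model's actual output format is not shown here.

```python
def preference_rate(verdicts: list[str]) -> float:
    """Fraction of pairs in which response B was preferred over response A.

    `verdicts` is a hypothetical list of per-prompt judge outputs,
    each "A" or "B"; anything else does not count as a win for B.
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(v == "B" for v in verdicts) / len(verdicts)

print(preference_rate(["B", "A", "B", "B"]))  # 0.75
```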

The following mmlu subtasks are available:

```python
MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions",
]
```

The following bbh subtasks are available:

```python
BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting",
]
```

The following math subtasks are available:

```python
MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry", "intermediate_algebra",
    "number_theory", "prealgebra", "precalculus",
]
```
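
For the three benchmarks that expose subtasks, an evaluation names both the task and the subtask. The helper below is a hypothetical pre-flight check (SUBTASKS_BY_TASK and validate_selection are illustrative names, not part of the AWS tooling) that verifies a selection against the lists above before you build an evaluation job:

```python
# Hypothetical pre-flight check; assumes the MMLU_SUBTASKS, BBH_SUBTASKS,
# and MATH_SUBTASKS lists defined above are in scope.
SUBTASKS_BY_TASK = {
    "mmlu": MMLU_SUBTASKS,
    "bbh": BBH_SUBTASKS,
    "math": MATH_SUBTASKS,
}

def validate_selection(eval_task: str, subtask: str | None = None) -> None:
    """Raise if an (eval_task, subtask) pair is not listed on this page."""
    if subtask is None:
        return  # whole-benchmark runs need no subtask
    allowed = SUBTASKS_BY_TASK.get(eval_task)
    if allowed is None:
        raise ValueError(f"{eval_task!r} does not expose subtasks")
    if subtask not in allowed:
        raise ValueError(f"unknown {eval_task} subtask: {subtask!r}")

validate_selection("mmlu", "college_physics")  # passes
validate_selection("math", "number_theory")    # passes
```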