Available benchmark tasks
A sample code package is provided to demonstrate how to compute benchmark metrics using the SageMaker AI model evaluation capability for Amazon Nova. To access the code package, see Sample-nova-lighteval-custom-task.
The following is the list of supported industry-standard benchmarks. You can specify any of these benchmarks in the eval_task parameter:
| Benchmark | Modality | Description | Metrics | Strategy | Subtasks available |
|---|---|---|---|---|---|
| mmlu | Text | Multi-task language understanding: tests knowledge across 57 subjects. | accuracy | zs_cot | Yes |
| mmlu_pro | Text | MMLU professional subset, focused on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | Text | Advanced reasoning tasks: a suite of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs_cot | Yes |
| gpqa | Text | General physics question answering: evaluates understanding of physics concepts and related problem-solving skills. | accuracy | zs_cot | No |
| math | Text | Mathematical problem solving: measures mathematical reasoning across areas such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes |
| strong_reject | Text | Quality-control task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes |
| IFEval | Text | Instruction-following evaluation: measures how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No |
| gen_qa | Text | Custom dataset evaluation: lets you bring your own dataset for benchmarking, comparing model outputs with reference answers using metrics such as ROUGE and BLEU. | all | gen_qa | No |
| llm_judge | Text | LLM-as-a-Judge preference comparison: uses a Nova Judge model to determine preference between paired responses (B and A) for a prompt, computing the probability that B is preferred over A. | all | judge | No |
| humaneval | Text | HumanEval: a benchmark dataset designed to evaluate the code-generation capabilities of large language models. | pass@1 | zs | No |
| mm_llm_judge | Multi-modal (image) | This new benchmark behaves like the text-based llm_judge benchmark above, but accepts prompts that include images. | all | judge | No |
| rubric_llm_judge | Text | Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which provides only a preference verdict, it evaluates responses against scoring rubrics. | all | judge | No |
| aime_2024 | Text | AIME 2024: problems from the American Invitational Mathematics Examination that test advanced mathematical reasoning and problem solving. | exact_match | zs_cot | No |
| Calendar scheduling | Text | Natural Plan: the calendar-scheduling task tests planning ability by scheduling meetings across multiple days and participants. | exact_match | fs | No |
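As a quick illustration of the eval_task parameter, the sketch below selects one of the benchmarks from the table. The task names come from the table above, but the config layout and the helper `make_eval_config` are assumptions for this sketch, not the exact recipe schema; see the sample code package referenced above for the real structure.

```python
# Minimal sketch: choosing a benchmark via eval_task.
# The task names come from the table above; the config layout and the
# helper name are illustrative assumptions only. The calendar-scheduling
# task is omitted because this section does not give its identifier.
SUPPORTED_TASKS = {
    "mmlu", "mmlu_pro", "bbh", "gpqa", "math", "strong_reject",
    "IFEval", "gen_qa", "llm_judge", "humaneval", "mm_llm_judge",
    "rubric_llm_judge", "aime_2024",
}

def make_eval_config(eval_task: str) -> dict:
    """Return a hypothetical evaluation config for the given benchmark."""
    if eval_task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported eval_task: {eval_task!r}")
    return {"eval_task": eval_task}

print(make_eval_config("mmlu"))  # {'eval_task': 'mmlu'}
```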
The following mmlu subtasks are available:
```python
MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions",
]
```
The following bbh subtasks are available:
```python
BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding",
    "disambiguation_qa", "dyck_languages", "formal_fallacies",
    "geometric_shapes", "hyperbaton", "logical_deduction_five_objects",
    "logical_deduction_seven_objects", "logical_deduction_three_objects",
    "movie_recommendation", "multistep_arithmetic_two", "navigate",
    "object_counting", "penguins_in_a_table",
    "reasoning_about_colored_objects", "ruin_names",
    "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting",
]
```
The following math subtasks are available:
```python
MATH_SUBTASKS = [
    "algebra", "counting_and_probability", "geometry",
    "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
]
```
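For benchmarks that support subtasks, a run can be narrowed to a single subtask from the lists above. The sketch below is a hypothetical pre-flight check, assuming the three lists above are in scope; the helper name `validate_subtask` and the idea of validating locally are illustrative assumptions, not part of the evaluation API.

```python
# Hypothetical pre-flight check: confirm a (task, subtask) pair against
# the subtask lists defined above before launching an evaluation job.
SUBTASKS_BY_TASK = {
    "mmlu": MMLU_SUBTASKS,
    "bbh": BBH_SUBTASKS,
    "math": MATH_SUBTASKS,
}

def validate_subtask(eval_task: str, subtask: str) -> None:
    allowed = SUBTASKS_BY_TASK.get(eval_task)
    if allowed is None:
        raise ValueError(f"no subtask list for {eval_task!r} in this section")
    if subtask not in allowed:
        raise ValueError(f"{subtask!r} is not a known {eval_task} subtask")

validate_subtask("math", "number_theory")  # passes silently
```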