
evaluating-llms-harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag)

Category: AI & Machine Learning
Developer: davila7
Created: Jan 15, 2026
Updated: Jan 15, 2026
Tags: 3

Description

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
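The harness referenced here is presumably EleutherAI's lm-evaluation-harness (the lm-eval Python package); under that assumption, a minimal sketch of benchmarking a HuggingFace model on a couple of the listed tasks could look like the following. The checkpoint name, task identifiers, batch size, and device are placeholders, not values taken from the skill.

# Hedged sketch, not the skill's actual contents: run a HuggingFace model
# through EleutherAI's lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder checkpoint
    tasks=["hellaswag", "gsm8k"],                    # task names as registered in the harness
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics (accuracy, exact match, etc.) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)

simple_evaluate returns a dictionary whose "results" entry maps each task to its metrics, which is what you would compare across models or report.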

Skill File

SKILL.md
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
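The skill file also lists vLLM among the supported backends. Assuming the same lm-evaluation-harness wrapper, switching backends is mostly a matter of changing the model type and model_args; the checkpoint name and tensor_parallel_size below are illustrative assumptions, not values from the skill file.

# Hedged sketch: the same evaluation pointed at a vLLM backend rather than
# plain HuggingFace; all argument values here are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",                                                            # vLLM backend
    model_args="pretrained=meta-llama/Llama-3.1-8B,tensor_parallel_size=1",  # placeholder
    tasks=["mmlu"],
    batch_size=16,
)

for task, metrics in results["results"].items():
    print(task, metrics)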

Tags

api, ai, ml

