ADeLe Question Classification¶
ADeLe (Assessment of Difficulty Level) classifies benchmark questions across 18 cognitive dimensions, from domain knowledge required to reasoning complexity. Each dimension produces an ordinal score from 0 (none) to 5 (very high). Use classification results to understand your benchmark's difficulty profile, filter questions by complexity, or create rubrics based on ADeLe traits.
This tutorial is useful when you have a set of questions and want to characterize what your benchmark measures before running verification. Classification can also guide question selection: keep only high-reasoning questions for a challenging benchmark, or balance difficulty across categories.
What you'll learn:

- Create a `QuestionClassifier` with model configuration
- Classify a single question with `classify_single()`
- Inspect scores, labels, and summary format
- Classify a benchmark in batch with `classify_batch()`
- Store classifications in `custom_metadata` via `to_checkpoint_metadata()`
- Reload classifications with `from_checkpoint_metadata()`
- Create a rubric from ADeLe traits with `create_adele_rubric()`
- Filter questions by cognitive complexity
The 18 Dimensions¶
ADeLe organizes its dimensions into four categories plus a catch-all Other group. Each dimension uses a six-level ordinal scale, from "none" (0) to "very high" (5).
| Category | Dimension | What It Measures |
|---|---|---|
| Attention | `attention_and_scan` | Visual scanning or search required |
| | `atypicality` | Uncommon or surprising elements |
| Comprehension | `comprehension_complexity` | Structural complexity of the question |
| | `comprehension_evaluation` | Evaluation or judgment required |
| | `conceptualization_and_learning` | Concept formation or learning demand |
| Knowledge | `knowledge_applied_sciences` | Applied science knowledge needed |
| | `knowledge_cultural` | Cultural or humanities knowledge |
| | `knowledge_formal_sciences` | Math or formal science knowledge |
| | `knowledge_natural_sciences` | Natural science domain expertise |
| | `knowledge_social_sciences` | Social science knowledge needed |
| Metacognition | `metacognition_relevance` | Relevance judgment required |
| | `metacognition_task_planning` | Planning or strategy needed |
| | `metacognition_uncertainty` | Uncertainty handling required |
| Other | `mind_modelling` | Theory of mind or perspective-taking |
| | `logical_reasoning_logic` | Formal logical deduction |
| | `logical_reasoning_quantitative` | Quantitative reasoning |
| | `spatial_physical_understanding` | Spatial or physical reasoning |
| | `volume` | Amount of content to process |
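For reference, the numeric levels correspond to these labels (an illustrative mapping assembled from the outputs shown later in this tutorial, not a library export):

```python
# Ordinal levels and their labels as they appear in classification outputs
# (assembled from this tutorial's examples, not imported from karenina).
ADELE_LEVELS = {
    0: "none",
    1: "very_low",
    2: "low",
    3: "moderate",
    4: "high",
    5: "very_high",
}
```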
Create a Classifier¶
The `QuestionClassifier` wraps an LLM to evaluate questions against ADeLe dimensions. By default it uses `claude-haiku-4-5` for efficiency, since classification runs one LLM call per question in batch mode.

```python
from karenina.integrations.adele import QuestionClassifier

classifier = QuestionClassifier(
    model_name="claude-haiku-4-5",
    provider="anthropic",
)
print(f"Classifier ready: model={classifier._model_name}, mode={classifier._trait_eval_mode}")
```

```
Classifier ready: model=claude-haiku-4-5, mode=batch
```
Classify a Single Question¶
`classify_single()` evaluates a question against all 18 dimensions and returns a `QuestionClassificationResult` with scores, labels, and usage metadata. (The examples in this tutorial patch the classifier with mock responses, using `_mock_classify_single` and `_mock_classify_batch` from the notebook's setup, so they run without live LLM calls.)

```python
from unittest.mock import patch

with patch.object(QuestionClassifier, "classify_single", _mock_classify_single):
    result = classifier.classify_single(
        question_text="What is the approved pharmacological target of venetoclax?",
        question_id="q1",
    )

print(f"Question: {result.question_text}")
print(f"Model: {result.model}")
print(f"Dimensions classified: {len(result.scores)}")
```

```
Question: What is the approved pharmacological target of venetoclax?
Model: claude-haiku-4-5
Dimensions classified: 18
```
The `scores` dict maps each dimension to an integer (0 to 5), while `labels` maps each to the corresponding level name:

```python
# Inspect a few scores and labels
for dim in ["knowledge_natural_sciences", "logical_reasoning_logic", "volume"]:
    print(f"  {dim}: score={result.scores[dim]}, label={result.labels[dim]}")
```

```
  knowledge_natural_sciences: score=4, label=high
  logical_reasoning_logic: score=2, label=low
  volume: score=3, label=moderate
```
The `get_summary()` method produces a compact "label (score)" format for each dimension:

```python
summary = result.get_summary()
for dim, value in list(summary.items())[:5]:
    print(f"  {dim}: {value}")
```

```
  attention_and_scan: low (2)
  atypicality: very_low (1)
  comprehension_complexity: low (2)
  comprehension_evaluation: moderate (3)
  conceptualization_and_learning: very_low (1)
```
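Because `scores` and `labels` are plain dicts, ordinary dict tooling applies. For example, to rank the most demanding dimensions for a question:

```python
# Rank dimensions by score to surface what makes this question demanding.
top = sorted(result.scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
for dim, score in top:
    print(f"  {dim}: {result.labels[dim]} ({score})")
```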
Select Specific Traits¶
When you only need a subset of dimensions, pass `trait_names` to skip the rest. This reduces token usage and latency.

```python
with patch.object(QuestionClassifier, "classify_single", _mock_classify_single):
    partial = classifier.classify_single(
        question_text="What is the approved pharmacological target of venetoclax?",
        question_id="q1",
        trait_names=["knowledge_natural_sciences", "logical_reasoning_logic"],
    )

print(f"Dimensions classified: {len(partial.scores)}")
for dim, score in partial.scores.items():
    print(f"  {dim}: {partial.labels[dim]} ({score})")
```

```
Dimensions classified: 2
  knowledge_natural_sciences: high (4)
  logical_reasoning_logic: low (2)
```
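Trait names must match ADeLe's identifiers exactly. One way to fail fast on typos is to check requests against the `ADELE_TRAIT_NAMES` constant (used again later in this tutorial); a minimal sketch:

```python
from karenina.integrations.adele import ADELE_TRAIT_NAMES

# Validate requested trait names before spending any LLM calls.
requested = ["knowledge_natural_sciences", "logical_reasoning_logic"]
unknown = [name for name in requested if name not in ADELE_TRAIT_NAMES]
if unknown:
    raise ValueError(f"Unknown ADeLe traits: {unknown}")
```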
Batch Classification¶
For a full benchmark, `classify_batch()` accepts a list of `(question_id, question_text)` tuples and returns a dict mapping each ID to its classification result. The optional `on_progress` callback receives `(completed, total)` counts.

```python
questions = [
    ("q1", "What is the approved pharmacological target of venetoclax?"),
    ("q2", "Compare the mechanisms of action of imatinib and dasatinib."),
    ("q3", "What is the half-life of amoxicillin?"),
]

with patch.object(QuestionClassifier, "classify_batch", _mock_classify_batch):
    batch_results = classifier.classify_batch(
        questions=questions,
        on_progress=lambda done, total: print(f"  Classified {done}/{total}"),
    )

print(f"\nClassified {len(batch_results)} questions")
for qid, res in batch_results.items():
    print(f"  {qid}: knowledge_natural_sciences={res.scores['knowledge_natural_sciences']}, "
          f"logical_reasoning_logic={res.scores['logical_reasoning_logic']}")
```

```
  Classified 1/3
  Classified 2/3
  Classified 3/3

Classified 3 questions
  q1: knowledge_natural_sciences=4, logical_reasoning_logic=2
  q2: knowledge_natural_sciences=5, logical_reasoning_logic=3
  q3: knowledge_natural_sciences=2, logical_reasoning_logic=0
```
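With batch results in hand, you can aggregate the per-question scores into a difficulty profile for the whole benchmark. A minimal sketch using only the `scores` dicts:

```python
from collections import defaultdict

# Mean score per dimension across all classified questions.
totals: defaultdict[str, int] = defaultdict(int)
for res in batch_results.values():
    for dim, score in res.scores.items():
        totals[dim] += score

profile = {dim: total / len(batch_results) for dim, total in totals.items()}
for dim in ("knowledge_natural_sciences", "logical_reasoning_logic"):
    print(f"  {dim}: mean={profile[dim]:.2f}")
```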
Store in Checkpoint Metadata¶
`to_checkpoint_metadata()` converts a classification result into a dict suitable for storing in a question's `custom_metadata` field. This preserves classifications across save/load cycles.

```python
metadata = result.to_checkpoint_metadata()
print("Checkpoint metadata keys:", list(metadata.keys()))
print("Inner keys:", list(metadata["adele_classification"].keys()))
```

```
Checkpoint metadata keys: ['adele_classification']
Inner keys: ['scores', 'labels', 'classified_at', 'model']
```
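The payload is plain dictionary data, so you can inspect it directly. The `default=str` guards against `classified_at` being a datetime rather than a string (an assumption, since the stored type is not documented here):

```python
import json

# Pretty-print the stored classification payload.
print(json.dumps(metadata["adele_classification"], indent=2, default=str))
```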
In a real workflow, you would update the question's metadata in the benchmark:

```python
# In a real workflow with a benchmark loaded:
#
# for qid, classification in batch_results.items():
#     question = benchmark.get_question(qid)
#     existing_meta = question.get("custom_metadata", {})
#     existing_meta.update(classification.to_checkpoint_metadata())
#     benchmark.update_question_metadata(qid, custom_metadata=existing_meta)
#
# benchmark.save("checkpoint.jsonld")

print("Classifications stored in custom_metadata under 'adele_classification' key")
```

```
Classifications stored in custom_metadata under 'adele_classification' key
```
Reload Classifications¶
`from_checkpoint_metadata()` reconstructs a `QuestionClassificationResult` from stored metadata, completing the round trip.

```python
from karenina.integrations.adele.schemas import QuestionClassificationResult

# Simulate loading from checkpoint
stored_metadata = result.to_checkpoint_metadata()
reloaded = QuestionClassificationResult.from_checkpoint_metadata(
    metadata=stored_metadata,
    question_id="q1",
    question_text="What is the approved pharmacological target of venetoclax?",
)

print(f"Reloaded: {reloaded.question_id}")
print(f"Scores match: {reloaded.scores == result.scores}")
print(f"Labels match: {reloaded.labels == result.labels}")
print(f"Model: {reloaded.model}")
```

```
Reloaded: q1
Scores match: True
Labels match: True
Model: claude-haiku-4-5
```
If the metadata does not contain an `adele_classification` key, `from_checkpoint_metadata()` returns `None`:

```python
empty = QuestionClassificationResult.from_checkpoint_metadata({})
print(f"Missing classification returns: {empty}")
```

```
Missing classification returns: None
```
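This makes opportunistic reloading straightforward: iterate over a benchmark, rebuild classifications where they exist, and skip the rest. A sketch following the commented workflow above (the `benchmark` accessors are assumptions, not verified API):

```python
# In a real workflow with a benchmark loaded:
#
# reloaded_results = {}
# for qid in ["q1", "q2", "q3"]:
#     question = benchmark.get_question(qid)
#     meta = question.get("custom_metadata", {})
#     classification = QuestionClassificationResult.from_checkpoint_metadata(meta)
#     if classification is not None:
#         reloaded_results[qid] = classification
```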
Filter by Complexity¶
Use scores to select question subsets by cognitive profile. This is useful for building targeted benchmarks or stratifying evaluation by difficulty.

```python
# Find questions requiring significant logical reasoning (logical_reasoning_logic >= 3)
high_reasoning = {
    qid: res for qid, res in batch_results.items()
    if res.scores.get("logical_reasoning_logic", 0) >= 3
}
print(f"High logical reasoning questions: {len(high_reasoning)}")
for qid in high_reasoning:
    score = high_reasoning[qid].scores["logical_reasoning_logic"]
    print(f"  {qid}: logical_reasoning_logic={score}")

# Find low-knowledge questions (knowledge_natural_sciences <= 2)
low_knowledge = {
    qid: res for qid, res in batch_results.items()
    if res.scores.get("knowledge_natural_sciences", 0) <= 2
}
print(f"\nLow natural science knowledge questions: {len(low_knowledge)}")
for qid in low_knowledge:
    score = low_knowledge[qid].scores["knowledge_natural_sciences"]
    print(f"  {qid}: knowledge_natural_sciences={score}")
```

```
High logical reasoning questions: 1
  q2: logical_reasoning_logic=3

Low natural science knowledge questions: 1
  q3: knowledge_natural_sciences=2
```
You can also combine filters for more specific selection:

```python
# Questions requiring high domain knowledge AND logical reasoning
complex_questions = {
    qid: res for qid, res in batch_results.items()
    if res.scores.get("knowledge_natural_sciences", 0) >= 4
    and res.scores.get("logical_reasoning_logic", 0) >= 3
}
print(f"High knowledge + logical reasoning: {len(complex_questions)} questions")
```

```
High knowledge + logical reasoning: 1 questions
```
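The same pattern extends to stratification, for example bucketing questions into difficulty tiers along a single dimension:

```python
# Bucket questions into tiers by natural-science knowledge demand.
tiers: dict[str, list[str]] = {"low (0-1)": [], "mid (2-3)": [], "high (4-5)": []}
for qid, res in batch_results.items():
    score = res.scores.get("knowledge_natural_sciences", 0)
    if score <= 1:
        tiers["low (0-1)"].append(qid)
    elif score <= 3:
        tiers["mid (2-3)"].append(qid)
    else:
        tiers["high (4-5)"].append(qid)

for tier, qids in tiers.items():
    print(f"  {tier}: {qids}")
```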
Create Rubric from ADeLe Traits¶
ADeLe traits can be used directly as rubric traits for verification. `create_adele_rubric()` builds a `Rubric` containing the specified ADeLe dimensions as `LLMRubricTrait` objects with `kind="literal"`.

```python
from karenina.integrations.adele import ADELE_TRAIT_NAMES, create_adele_rubric

# Create a rubric with selected traits
rubric = create_adele_rubric(
    trait_names=["knowledge_natural_sciences", "logical_reasoning_logic"]
)

print(f"Rubric traits: {len(rubric.llm_traits)}")
for trait in rubric.llm_traits:
    print(f"  {trait.name}: kind={trait.kind}, classes={len(trait.classes)}")
```

```
Rubric traits: 2
  knowledge_natural_sciences: kind=literal, classes=6
  logical_reasoning_logic: kind=literal, classes=6
```
To use all 18 traits, pass `None` (or omit `trait_names`):

```python
full_rubric = create_adele_rubric()
print(f"Full rubric: {len(full_rubric.llm_traits)} traits")
print(f"Available trait names ({len(ADELE_TRAIT_NAMES)}):")
for name in ADELE_TRAIT_NAMES[:6]:
    print(f"  - {name}")
print(f"  ... and {len(ADELE_TRAIT_NAMES) - 6} more")
```

```
Full rubric: 18 traits
Available trait names (18):
  - attention_and_scan
  - atypicality
  - comprehension_complexity
  - comprehension_evaluation
  - conceptualization_and_learning
  - knowledge_applied_sciences
  ... and 12 more
```
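Because the trait names follow a consistent naming convention, you can also build category-specific rubrics by filtering `ADELE_TRAIT_NAMES`, for example a rubric covering only the five knowledge dimensions:

```python
# Select the knowledge_* dimensions by prefix and build a focused rubric.
knowledge_traits = [name for name in ADELE_TRAIT_NAMES if name.startswith("knowledge_")]
knowledge_rubric = create_adele_rubric(trait_names=knowledge_traits)
print(f"Knowledge rubric: {len(knowledge_rubric.llm_traits)} traits")
```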
Cleanup¶
```python
# No temporary files were created in this tutorial.
print("Done.")
```

```
Done.
```
Next Steps¶
- ADeLe Concept Page: Full dimension reference and scoring details
- Rubrics: Deep dive into rubric concepts and trait types
- Scaled Authoring: Bulk workflows, template generation, and classification in context
- Running Verification: Execute benchmarks with ADeLe rubrics