LLM-as-a-Judge approaches with reliability calibration
Inter-Rater Reliability & Agreement: Cohen's κ, Fleiss' κ, and practical calibration workflows
Benchmarking Test Frameworks: How to evaluate test ...