Which evaluation metrics, whether single-dimension measures or composite indices, are sensitive enough to detect quality differences among similar processes yet robust enough to support valid comparison across different topics, geographies, and participant populations?