Code/llm-evaluation

Community-curated code

github.com

Open Source LLM Evaluation Framework - DeepEval

DeepEval is an open-source framework for evaluating large language models, offering customizable metrics and seamless integration with popular AI frameworks.

evaluation-framework, evaluation-metrics, llm, llm-evaluation, llm-evaluation-framework

flux

github.com

AI Skill Evaluation Framework for Agents

Evaluate AI agent skills with this TypeScript-based tool that provides objective performance assessments.

agent-evals, agent-skills, agentskills, ai-agents, cli

flux

github.com

MLflow: Open Source AI Engineering Platform

MLflow is an open source platform for managing AI applications, enabling teams to optimize and monitor production-quality models.

agentops, agents, ai, ai-agents, ai-governance

flux

github.com

Open Source LLM Observability Platform - Helicone

Helicone is an open-source LLM observability platform that enables AI engineers to monitor and evaluate models efficiently.

agent-monitoring, analytics, evaluation, gpt, langchain

flux

github.com

Open Source LLM Engineering Platform - Langfuse

Langfuse is an open source platform for LLM observability and management, enabling teams to develop and debug AI applications efficiently.

analytics, autogen, evaluation, langchain, large-language-models

flux

github.com

Agenta: Open-Source LLMOps Platform for Developers

Agenta is an open-source platform for building reliable LLM applications with integrated management, evaluation, and observability tools.

agents, evaluation, llm, llm-as-a-judge, llm-evaluation

flux

github.com

Promptfoo: CLI for LLM Evaluation and Security Testing

Promptfoo is a CLI tool for evaluating and securing LLM applications through automated testing and red teaming.

ci, ci-cd, cicd, claude, evaluation

flux