Community-curated code

DeepEval is an open-source framework for evaluating large language models, offering customizable metrics and seamless integration with popular AI frameworks.

Evaluate AI agent skills with this TypeScript-based tool that provides objective performance assessments.

MLflow is an open-source platform for managing AI applications, enabling teams to optimize and monitor production-quality models.

Helicone is an open-source LLM observability platform that enables AI engineers to monitor and evaluate models efficiently.

Langfuse is an open-source platform for LLM observability and management, enabling teams to develop and debug AI applications efficiently.

Agenta is an open-source platform for building reliable LLM applications with integrated management, evaluation, and observability tools.

Promptfoo is a CLI tool for evaluating and securing LLM applications through automated testing and red teaming.