duarteocarmo.com

AMÁLIA: Advancing European Portuguese LLMs

AMÁLIA is an open-source LLM for European Portuguese, with an emphasis on European Portuguese training data and new benchmarks.

Summary

AMÁLIA is a Large Language Model (LLM) developed for European Portuguese, backed by a significant investment from the Portuguese government. The initiative aims to improve the representation of European Portuguese in natural language processing.

Key features:

Open Source - AMÁLIA is designed to be fully open source, although not all components are publicly accessible yet.
Data Utilization - The model aims to use a substantial amount of European Portuguese data, primarily sourced from Arquivo.pt.
Benchmarking - The team has created four new benchmarks for evaluating the model's performance in European Portuguese.
Collaboration - AMÁLIA is the result of a collaboration between several prestigious Portuguese universities and research labs.

Despite these promising features, there are concerns about how much European Portuguese data was actually used in training: only 5.5% of the training tokens are clearly identified as European Portuguese. The article discusses the implications of this figure and the importance of transparency in the model's development.
