Barun Saha's blog on AI and Networks

Cohere Aya Dataset: Exploring the Split-by-language Collection

March 17, 2024

A snapshot of the Aya collection (Bengali). Image taken from HuggingFace.

In February 2024, Cohere launched Aya, a multilingual Large Language Model (LLM). Alongside the model, a set of datasets used to train Aya has also been released. For example, the aya_dataset consists of around 205K examples annotated by humans. On the other hand, the recently released aya_collection_language_split is a gigantic dataset with more than 500 million data points spread across more than 100 languages. As the name suggests, this dataset is split by language. For example, all data points in Bengali, irrespective of the underlying task, can be found in a single split. Apart from the original human-annotated examples from the aya_dataset, aya_collection_language_split also contains a lot of translated and templated data. The dataset is released under an Apache-2.0 license, allowing academic and commercial use.

The Bengali Language Split

Each language split in the Aya collection has three splits. The Bengali split,...
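
As a rough illustration of how one language's data might be loaded and inspected, here is a minimal sketch using the Hugging Face datasets library. The repository id CohereForAI/aya_collection_language_split and the configuration name bengali are assumptions based on the dataset's name and should be checked against the dataset page. The helper names load_language_split and summarize_splits are hypothetical and only serve this example.

```python
from typing import Dict, Mapping, Sized

# Assumed Hub repository id and configuration name; verify both on the dataset page.
REPO_ID = "CohereForAI/aya_collection_language_split"
LANGUAGE = "bengali"


def load_language_split(language: str = LANGUAGE):
    """Download all splits for one language.

    Requires `pip install datasets` and network access. The download can be
    large, so this function is defined here but not called automatically.
    """
    # Imported lazily so that summarize_splits() works without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, language)


def summarize_splits(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows in each split of a dict-like dataset."""
    return {name: len(rows) for name, rows in splits.items()}
```

Calling summarize_splits(load_language_split()) should return a dictionary that maps each split name to its row count, which is a quick way to see how the data points are distributed before working with any of them.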