A snapshot of the Aya collection (Bengali). Image taken from Hugging Face.

In February 2024, Cohere launched Aya, a multilingual Large Language Model (LLM). Alongside the model, Cohere also released a set of datasets used to train Aya. For example, the aya_dataset consists of around 205K examples annotated by humans. On the other hand, the recently released aya_collection_language_split is a gigantic dataset with more than 500 million data points spread across more than 100 languages. As the name suggests, this dataset is split by language. For example, all data points in Bengali, irrespective of the underlying task, can be found in a single split. Apart from the original human-annotated examples from the aya_dataset, aya_collection_language_split also contains a lot of translated and templated data. The dataset is released under the Apache-2.0 license, which allows both academic and commercial use.

The Bengali Language Split

Each language split in the Aya collection is itself divided into three splits. The Bengali split,...
Hi Barun! I had a chance to meet your mentor (Dr. Sudip Misra, if I am not wrong) when he came to deliver an invited session on Opportunistic Networks at Jadavpur University. I am Suvadip Batabyal, pursuing a PhD on the very same topic, i.e. OpNet. I have been involved in this area of research for the past 1.5 years. I am trying to solve the very question that you have asked, and I have come up with some interesting results.
Suvadip,
Thanks for the message! Great to know about your research, and that you have found answers to the questions raised here. Wishing you the best!