
Cohere Aya Dataset: Exploring the Split-by-language Collection

A snapshot of the Aya collection (Bengali). Image taken from HuggingFace.

In February 2024, Cohere launched Aya, a multilingual Large Language Model (LLM). Alongside the model, the datasets used to train Aya were also released. For example, the aya_dataset consists of around 205K examples annotated by humans. The more recently released aya_collection_language_split, in contrast, is a gigantic dataset with more than 500 million data points spread across more than 100 languages. As the name suggests, this dataset is split by language: all data points in Bengali, for example, irrespective of the underlying task, can be found in a single split. Apart from the original human-annotated examples from the aya_dataset, aya_collection_language_split also contains a large amount of translated and templated data. The dataset is released under the Apache-2.0 license, allowing both academic and commercial use.
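For readers who want to poke at the data themselves, the collection can be loaded with the Hugging Face datasets library. The following is only a minimal sketch; the repository id and the "bengali" config name are assumptions based on the dataset card, so do verify them there.

from datasets import load_dataset

# Load the Bengali language split of the Aya collection.
# The config name "bengali" is an assumption; check the dataset card for exact names.
bengali = load_dataset("CohereForAI/aya_collection_language_split", "bengali")
print(bengali)  # prints a DatasetDict with the train/validation/test splits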

The Bengali Language Split

Each language in the Aya collection is further divided into three splits. The Bengali portion, for example, contains:

  •     3,601,287 examples in 'train'
  •     274,546 examples in 'validation'
  •     276,504 examples in 'test'

Let us take a closer look at the Bengali split, specifically focusing on the tasks and data sources.
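Assuming the dataset has been loaded as in the earlier sketch, the split sizes above, as well as the distinct task types and data sources discussed next, can be listed as follows. The column names 'task_type' and 'dataset_name' are assumptions here, matching the Task and Dataset fields shown in the dumps below.

# Number of examples in each split
for split_name, split in bengali.items():
    print(len(split), "examples in", repr(split_name))

# Distinct task types and source datasets in the training split
# ('task_type' and 'dataset_name' column names are assumptions)
task_types = bengali["train"].unique("task_type")
dataset_names = bengali["train"].unique("dataset_name")
print(len(task_types), "task types:", task_types)
print(len(dataset_names), "datasets:", dataset_names)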

All Task Types with Examples

There are 10 different task types. These are:

  •     'summarization'
  •     'paraphrasing'
  •     'text-simplification'
  •     'question-answering'
  •     '-'
  •     'dialogue'
  •     'translation'
  •     'generation'
  •     'event-linking'
  •     'paraphrase-identification'

An example of each task is provided below. For the sake of brevity, all texts in the following are truncated after 80 characters. (The '-' task type corresponds to the human-annotated Aya-Dataset rows, as can be seen by comparing example #5 here with example #19 in the next section.)
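The dumps below were generated along the lines of the following sketch (same assumptions as before): iterate over the training split and print the first row encountered for each task type.

# Print one (truncated) example per task type
seen = set()
for row in bengali["train"]:
    task = row["task_type"]
    if task in seen:
        continue
    seen.add(task)
    print("> Task:", task)
    print("inputs:", row["inputs"][:80])
    print("targets:", row["targets"][:80])
    print("template_id:", row["template_id"])
    print("=" * 60)
    if len(seen) == 10:  # all 10 task types found, as listed above
        break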

#1
> Task: summarization
inputs: āĻ¸ংāĻ°āĻ•্āĻˇāĻŖেāĻ° āĻ†āĻ—ে āĻ¯েāĻ•োāĻ¨ āĻ–াāĻĻ্āĻ¯েāĻ° āĻ…āĻŦāĻļিāĻˇ্āĻŸাংāĻļ āĻŦা āĻĻাāĻ— āĻ…āĻĒāĻ¸াāĻ°āĻŖ āĻ•āĻ°া āĻĒ্āĻ°āĻ¯়োāĻœāĻ¨ āĻ•াāĻ°āĻŖ āĻāĻ—ুāĻ˛ি āĻ¤াāĻĻ
targets: āĻ¸ংāĻ°āĻ•্āĻˇāĻŖেāĻ° āĻ†āĻ—ে āĻ¨িāĻļ্āĻšিāĻ¤ āĻ•āĻ°ুāĻ¨ āĻ¯ে āĻ†āĻĒāĻ¨াāĻ° āĻ•োāĻ¯়েāĻ˛āĻŸি āĻĒāĻ°িāĻˇ্āĻ•াāĻ°। āĻ•োāĻ¯়াāĻ°্āĻŸāĻ•ে āĻ­্āĻ¯াāĻ•ুāĻ¯়াāĻŽ āĻ•āĻ°ো
template_id: 1
============================================================
#2
> Task: paraphrasing
inputs: āĻ­িāĻ¨্āĻ¨ āĻļāĻŦ্āĻĻāĻ—ুāĻš্āĻ› āĻŦ্āĻ¯āĻŦāĻšাāĻ° āĻ•āĻ°ে āĻ¨িāĻšেāĻ° āĻŦাāĻ•্āĻ¯āĻŸি āĻ˛েāĻ–: "āĻ–āĻŦāĻ° āĻĒেāĻ¯়ে āĻĒুāĻ˛িāĻļ āĻ˜āĻŸāĻ¨াāĻ¸্āĻĨāĻ˛ে āĻĒৌঁāĻ›ে
targets: "āĻĒুāĻ˛িāĻļ āĻ–āĻŦāĻ° āĻĒেā§Ÿে āĻ˜āĻŸāĻ¨াāĻ¸্āĻĨāĻ˛ে āĻĒৌঁāĻ›ে āĻ†āĻšāĻ¤āĻĻেāĻ° āĻ‰āĻĻ্āĻ§াāĻ° āĻ•āĻ°ে āĻ¸্āĻĨাāĻ¨ীāĻ¯় āĻšাāĻ¸āĻĒাāĻ¤াāĻ˛ে āĻ¨িāĻ¯়ে āĻ¯াāĻ¯়।
template_id: 1
============================================================
#3
> Task: text-simplification
inputs: āĻāĻ‡ āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻ†āĻ°ো āĻœāĻŸিāĻ˛ āĻ¸ংāĻ¸্āĻ•āĻ°āĻŖ āĻ¤ৈāĻ°ি āĻ•āĻ°ুāĻ¨'''āĻāĻ• āĻ­াāĻˇা āĻĨেāĻ•ে āĻ…āĻ¨্āĻ¯ āĻ­াāĻˇাāĻ¯় āĻ…āĻ¨ুāĻŦাāĻĻ āĻ•āĻ°াāĻ° āĻ¸āĻŽ
targets: āĻ…āĻŦāĻļ্āĻ¯āĻ‡, āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻ†āĻ°ো āĻœāĻŸিāĻ˛ āĻ¸ংāĻ¸্āĻ•āĻ°āĻŖ āĻšāĻ˛ "''āĻ¤িāĻ¨ি āĻāĻ• āĻ­াāĻˇা āĻĨেāĻ•ে āĻ…āĻ¨্āĻ¯ āĻ­াāĻˇাāĻ¯় āĻ…āĻ¨ুāĻŦাāĻĻ āĻ•āĻ°া
template_id: 1
============================================================
#4
> Task: question-answering
inputs: āĻ¤ুāĻ°্āĻ•ি āĻœāĻ¨āĻ—āĻŖ (), āĻŦা āĻ¤ুāĻ°্āĻ•িāĻ°া (), āĻ¯া āĻ†āĻ¨াāĻ¤োāĻ˛িāĻ¯়াāĻ¨ āĻ¤ুāĻ°্āĻ•ি āĻ¨াāĻŽেāĻ“ āĻĒāĻ°িāĻšিāĻ¤ (), āĻāĻ•āĻŸি āĻ¤ুāĻ°্
targets: ā§§। āĻ†āĻ¨াāĻ¤োāĻ˛িāĻ¯়াāĻ° āĻ—্āĻ°াāĻŽāĻŦাāĻ¸ী ā§¨। āĻ¨া ā§Š। āĻš্āĻ¯াঁ ā§Ē. āĻ¤ুāĻ°্āĻ•ি ā§Ģ। āĻ¨া ā§Ŧ। āĻĒāĻļ্āĻšিāĻŽ āĻ‡āĻ‰āĻ°োāĻĒ ā§­। āĻŸাāĻ°্āĻ—
template_id: 1
============================================================
#5
> Task: -
inputs: āĻ¨িāĻšেāĻ° āĻ…āĻ¨ুāĻš্āĻ›েāĻĻেāĻ° āĻŦিāĻˇāĻ¯় āĻ•ি ?

āĻŸাāĻ™্āĻ—ুāĻ¯়াāĻ° āĻšাāĻ“āĻ° (āĻ¸িāĻ˛েāĻŸি: ꠐꠣꠋꠉꠥꠀꠞ ꠀꠅꠞ) āĻŦাংāĻ˛াāĻĻেāĻļেāĻ° āĻŦৃ
targets: āĻŸাāĻ™্āĻ—ুāĻ¯়াāĻ° āĻšাāĻ“āĻ° |
template_id: 0
============================================================
#6
> Task: dialogue
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻŦিāĻˇāĻ¯়েāĻ° āĻ‰āĻĒāĻ° āĻ­িāĻ¤্āĻ¤ি āĻ•āĻ°ে āĻāĻ•āĻŸি āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻŦāĻ°্āĻŖāĻ¨া āĻ˛িāĻ–ুāĻ¨ঃ  āĻŦ্āĻ¯āĻ•্āĻ¤ি āĻāĻ•্āĻ¸ āĻ—িāĻ°
targets: āĻ…āĻŦāĻļ্āĻ¯āĻ‡, āĻāĻ–াāĻ¨ে āĻāĻ•āĻŸি āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻ…āĻ¨ুāĻš্āĻ›েāĻĻ āĻ°āĻ¯়েāĻ›ে: āĻ§āĻ°্āĻŽেāĻ° āĻĒ্āĻ°āĻ¤ি āĻ†āĻ—্āĻ°āĻšী āĻšāĻ“āĻ¯়াāĻ¯় āĻ­েāĻĻা āĻ—
template_id: 1
============================================================
#7
> Task: translation
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāĻ‡ āĻŦোāĻŸেāĻ° āĻ¸াāĻ‰āĻ¨্āĻĄāĻŦাāĻ°āĻŸি āĻāĻ–āĻ¨āĻ“ āĻ¸āĻŦ āĻ¸্āĻĒিāĻ•াāĻ°েāĻ° āĻœāĻ¨্āĻ¯ āĻ¤াāĻ°েāĻ° āĻ¸ংāĻ¯োāĻ—। āĻāĻ‡āĻšāĻĄিāĻāĻŽāĻ†āĻ‡ āĻĒোāĻ°্āĻŸ āĻ¸āĻŦ āĻĄিāĻ­
template_id: 1
============================================================
#8
> Task: generation
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻĻুāĻŸি āĻŦাāĻ•্āĻ¯ āĻĨেāĻ•ে āĻ•োāĻ¨āĻŸি āĻ¸াāĻ§াāĻ°āĻŖ āĻœ্āĻžাāĻ¨েāĻ° āĻŦিāĻ°ুāĻĻ্āĻ§ে? āĻŦিāĻ•āĻ˛্āĻĒঃ - āĻĒ্āĻ°āĻĨāĻŽ āĻŦাāĻ•্āĻ¯:
targets: āĻ•āĻŽāĻ˛া āĻ°āĻ¸েāĻ° āĻ¸্āĻŦাāĻĻ āĻļāĻ¸্āĻ¯েāĻ° āĻ¸াāĻĨে āĻ­াāĻ˛ āĻšāĻ¯় āĻ¨া। āĻšূāĻĄ়াāĻ¨্āĻ¤ āĻ‰āĻ¤্āĻ¤āĻ°: A āĻŦাāĻ•্āĻ¯।
template_id: 1
============================================================
#9
> Task: event-linking
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻŦাāĻ•্āĻ¯āĻŸি āĻ¸āĻŽ্āĻĒূāĻ°্āĻŖ āĻ•āĻ°ুāĻ¨: ā§§ā§¯ā§Šā§Ļ āĻāĻ° āĻĻāĻļāĻ•ে āĻ¨াā§ŽāĻ¸ি āĻœাāĻ°্āĻŽাāĻ¨িāĻ° āĻ‰āĻ¤্āĻĨাāĻ¨ āĻ…āĻ¸্āĻŸ্āĻ°িāĻ¯়া
targets: ā§§ā§¯ā§Šā§­ āĻ¸াāĻ˛ে āĻ…āĻ¸্āĻŸ্āĻ°িāĻ¯়াāĻ•ে āĻ¸ংāĻ¯ুāĻ•্āĻ¤ āĻ•āĻ°াāĻ° āĻ¸āĻŽāĻ¯় āĻĒāĻ°িāĻŦাāĻ°āĻ•ে āĻ¤াāĻ° āĻŦ্āĻ¯াংāĻ•িং āĻ•াāĻ°্āĻ¯āĻ•্āĻ°āĻŽ āĻŦিāĻ•্āĻ°ি
template_id: 1
============================================================
#10
> Task: paraphrase-identification
inputs: āĻŦাāĻ•্āĻ¯ ā§§ঃ (ā§§ā§Žā§¯ā§¨-ā§§ā§¯ā§Ŧā§¨) āĻ›িāĻ˛েāĻ¨ āĻ‡āĻ¯়াāĻ° āĻāĻ•াāĻĄেāĻŽি āĻœিāĻŽāĻ°িāĻ—েāĻ° (āĻ“āĻ¯়েāĻ˛āĻļ āĻāĻ•াāĻĄেāĻŽি) āĻĒ্āĻ°āĻĨāĻŽ āĻ¸āĻ­াāĻĒāĻ¤ি।
targets: āĻš্āĻ¯াঁ
template_id: 1
============================================================

Names of All Datasets with Examples

As noted earlier, the Aya collection draws data from different sources. Overall, it contains 23 distinct datasets. These are:

  •     'WIKI QA (T)'
  •     'Flan-GEM-wiki-lingua (T)'
  •     'SODA-inst (T)'
  •     'Joke-explaination-inst (T)'
  •     'IndicSentiment-inst'
  •     'Wiki-split-inst (T)'
  •     'Dolly-v2 (T)'
  •     'HotpotQA (T)'
  •     'Mintaka-inst (T)'
  •     'Xlel_wd-inst (T)'
  •     'IndicXParaphrase-inst'
  •     'Flan-lambada (T)'
  •     'PAWS-Wiki (T)'
  •     'CNN-Daily-Mail (T)'
  •     'Flan-Coqa (T)'
  •     'Xlel_wd-inst'
  •     'NQ-Open (T)'
  •     'Flan-CoT-submix (T)'
  •     'Aya-Dataset'
  •     'Adversarial QA (T)'
  •     'PIQA (T)'
  •     'Flan-unified-QA (T)'
  •     'News-summary-instruct'

In the following, a sample Bengali data point from each of the above 23 datasets is presented (all texts are again truncated):
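These samples can be extracted with the same loop pattern as before, keyed on the dataset name instead of the task type (again a sketch under the same column-name assumptions).

# Print one (truncated) example per source dataset
seen = set()
for row in bengali["train"]:
    name = row["dataset_name"]
    if name in seen:
        continue
    seen.add(name)
    print("> Dataset:", name)
    print("inputs:", row["inputs"][:80])
    print("targets:", row["targets"][:80])
    print("template_id:", row["template_id"])
    print("=" * 60)
    if len(seen) == 23:  # all 23 datasets found, as listed above
        break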

#1
> Dataset: WIKI QA (T)
inputs: āĻĒ্āĻ°āĻļ্āĻ¨āĻŸি āĻ•ীঃ ""6 āĻĢুāĻŸ 7 āĻĢুāĻŸ" (āĻāĻ›াāĻĄ়াāĻ“ "6'7" āĻšিāĻ¸াāĻŦে āĻ¸্āĻŸাāĻ‡āĻ˛ āĻ•āĻ°া āĻšāĻ¯়) āĻ†āĻŽেāĻ°িāĻ•াāĻ¨ āĻ°্āĻ¯াāĻĒ
targets: "ā§Ŧ āĻĢুāĻŸ ā§­ āĻĢুāĻŸ āĻ‰āĻš্āĻšāĻ¤াāĻ¯় āĻ•োāĻ¨ āĻ—াāĻ¨āĻŸি āĻ—াāĻ“āĻ¯়া āĻšāĻ¯়?"
template_id: 1
============================================================
#2
> Dataset: Flan-GEM-wiki-lingua (T)
inputs: āĻ¸ংāĻ°āĻ•্āĻˇāĻŖেāĻ° āĻ†āĻ—ে āĻ¯েāĻ•োāĻ¨ āĻ–াāĻĻ্āĻ¯েāĻ° āĻ…āĻŦāĻļিāĻˇ্āĻŸাংāĻļ āĻŦা āĻĻাāĻ— āĻ…āĻĒāĻ¸াāĻ°āĻŖ āĻ•āĻ°া āĻĒ্āĻ°āĻ¯়োāĻœāĻ¨ āĻ•াāĻ°āĻŖ āĻāĻ—ুāĻ˛ি āĻ¤াāĻĻ
targets: āĻ¸ংāĻ°āĻ•্āĻˇāĻŖেāĻ° āĻ†āĻ—ে āĻ¨িāĻļ্āĻšিāĻ¤ āĻ•āĻ°ুāĻ¨ āĻ¯ে āĻ†āĻĒāĻ¨াāĻ° āĻ•োāĻ¯়েāĻ˛āĻŸি āĻĒāĻ°িāĻˇ্āĻ•াāĻ°। āĻ•োāĻ¯়াāĻ°্āĻŸāĻ•ে āĻ­্āĻ¯াāĻ•ুāĻ¯়াāĻŽ āĻ•āĻ°ো
template_id: 1
============================================================
#3
> Dataset: SODA-inst (T)
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻŦিāĻˇāĻ¯়েāĻ° āĻ‰āĻĒāĻ° āĻ­িāĻ¤্āĻ¤ি āĻ•āĻ°ে āĻāĻ•āĻŸি āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻŦāĻ°্āĻŖāĻ¨া āĻ˛িāĻ–ুāĻ¨ঃ  āĻŦ্āĻ¯āĻ•্āĻ¤ি āĻāĻ•্āĻ¸ āĻ—িāĻ°
targets: āĻ…āĻŦāĻļ্āĻ¯āĻ‡, āĻāĻ–াāĻ¨ে āĻāĻ•āĻŸি āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻ…āĻ¨ুāĻš্āĻ›েāĻĻ āĻ°āĻ¯়েāĻ›ে: āĻ§āĻ°্āĻŽেāĻ° āĻĒ্āĻ°āĻ¤ি āĻ†āĻ—্āĻ°āĻšী āĻšāĻ“āĻ¯়াāĻ¯় āĻ­েāĻĻা āĻ—
template_id: 1
============================================================
#4
> Dataset: Joke-explaination-inst (T)
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻ•ৌāĻ¤ুāĻ•āĻŸি āĻŦ্āĻ¯াāĻ–্āĻ¯া āĻ•āĻ°ুāĻ¨ঃ āĻ•āĻŽ্āĻĒিāĻ‰āĻŸাāĻ° āĻ¯āĻ–āĻ¨ āĻ•্āĻ˛াāĻ¨্āĻ¤ āĻšāĻ¯় āĻ¤āĻ–āĻ¨ āĻ•ী āĻ•āĻ°ে? āĻ‰ঃ āĻāĻŸা āĻ•
targets: āĻŦ্āĻ¯াāĻ–্āĻ¯াঃ āĻ†āĻĒāĻ¨াāĻ° āĻ•āĻŽ্āĻĒিāĻ‰āĻŸাāĻ° āĻ•ি āĻ•āĻ–āĻ¨āĻ“ āĻ•াāĻœ āĻ•āĻ°া āĻŦāĻ¨্āĻ§ āĻ•āĻ°ে āĻĻেāĻ¯় (āĻĢ্āĻ°িāĻœ) āĻŦা āĻ¯āĻ–āĻ¨ āĻ†āĻĒāĻ¨ি āĻāĻŸি
template_id: 2
============================================================
#5
> Dataset: IndicSentiment-inst
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāĻ‡ āĻŦোāĻŸেāĻ° āĻ¸াāĻ‰āĻ¨্āĻĄāĻŦাāĻ°āĻŸি āĻāĻ–āĻ¨āĻ“ āĻ¸āĻŦ āĻ¸্āĻĒিāĻ•াāĻ°েāĻ° āĻœāĻ¨্āĻ¯ āĻ¤াāĻ°েāĻ° āĻ¸ংāĻ¯োāĻ—। āĻāĻ‡āĻšāĻĄিāĻāĻŽāĻ†āĻ‡ āĻĒোāĻ°্āĻŸ āĻ¸āĻŦ āĻĄিāĻ­
template_id: 1
============================================================
#6
> Dataset: Wiki-split-inst (T)
inputs: āĻāĻ‡ āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻ†āĻ°ো āĻœāĻŸিāĻ˛ āĻ¸ংāĻ¸্āĻ•āĻ°āĻŖ āĻ¤ৈāĻ°ি āĻ•āĻ°ুāĻ¨'''āĻāĻ• āĻ­াāĻˇা āĻĨেāĻ•ে āĻ…āĻ¨্āĻ¯ āĻ­াāĻˇাāĻ¯় āĻ…āĻ¨ুāĻŦাāĻĻ āĻ•āĻ°াāĻ° āĻ¸āĻŽ
targets: āĻ…āĻŦāĻļ্āĻ¯āĻ‡, āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻ†āĻ°ো āĻœāĻŸিāĻ˛ āĻ¸ংāĻ¸্āĻ•āĻ°āĻŖ āĻšāĻ˛ "''āĻ¤িāĻ¨ি āĻāĻ• āĻ­াāĻˇা āĻĨেāĻ•ে āĻ…āĻ¨্āĻ¯ āĻ­াāĻˇাāĻ¯় āĻ…āĻ¨ুāĻŦাāĻĻ āĻ•āĻ°া
template_id: 1
============================================================
#7
> Dataset: Dolly-v2 (T)
inputs: āĻ­াāĻ°্āĻœিāĻ¨ āĻ…āĻ¸্āĻŸ্āĻ°েāĻ˛িāĻ¯়া āĻ•āĻ–āĻ¨ āĻ•াāĻœ āĻļুāĻ°ু āĻ•āĻ°ে?
Context:āĻ­াāĻ°্āĻœিāĻ¨ āĻ…āĻ¸্āĻŸ্āĻ°েāĻ˛িāĻ¯়া, āĻ­াāĻ°্āĻœিāĻ¨ āĻ…āĻ¸্
targets: āĻ­াāĻ°্āĻœিāĻ¨ āĻ…āĻ¸্āĻŸ্āĻ°েāĻ˛িāĻ¯়া ā§Šā§§ āĻ†āĻ—āĻ¸্āĻŸ ā§¨ā§Ļā§Ļā§Ļ āĻ¸াāĻ˛ে āĻ­াāĻ°্āĻœিāĻ¨ āĻŦ্āĻ˛ু āĻ¨াāĻŽে āĻāĻ•āĻŸি āĻ°ুāĻŸে āĻĻুāĻŸি āĻŦিāĻŽাāĻ¨ āĻĻ
template_id: 1
============================================================
#8
> Dataset: HotpotQA (T)
inputs: "āĻ āĻ¨াāĻ‡āĻŸ āĻ†āĻ‰āĻŸ āĻ‡āĻ¨ āĻ˛āĻ¨্āĻĄāĻ¨" āĻšāĻ˛ āĻšāĻ¤ুāĻ°্āĻĨ āĻĒāĻ°্āĻŦ āĻ¯েāĻ–াāĻ¨ে āĻŦ্āĻ°িāĻŸিāĻļ āĻ¸িāĻŸāĻ•āĻŽ āĻ¸িāĻŽোāĻ¨ āĻŦাāĻ°্āĻĄ āĻ…āĻ­িāĻ¨ীāĻ¤?
targets: "āĻ‡āĻ¨āĻŦিāĻŸুāĻ‡āĻ¨াāĻ°্āĻ¸"
template_id: 3
============================================================
#9
> Dataset: Mintaka-inst (T)
inputs: āĻāĻ‡ āĻŦিāĻˇāĻ¯়āĻļ্āĻ°েāĻŖীāĻ° āĻŽāĻ§্āĻ¯ে āĻāĻ•āĻŸি āĻ¸াāĻ§াāĻ°āĻŖ āĻŦিāĻˇāĻ¯়েāĻ° āĻ‰āĻĻাāĻšāĻ°āĻŖ āĻĻাāĻ“: āĻ­ূāĻ—োāĻ˛
targets: āĻ‰āĻ¤্āĻ¤āĻ° āĻ†āĻŽেāĻ°িāĻ•াāĻ° āĻ¸āĻĒ্āĻ¤āĻŽ āĻ¸āĻ°্āĻŦোāĻš্āĻš āĻĒāĻ°্āĻŦāĻ¤ āĻ•োāĻ¨āĻŸি? āĻŽাāĻ‰āĻ¨্āĻŸ āĻ˛ুāĻ¸াāĻ¨িāĻ¯়া
template_id: 1
============================================================
#10
> Dataset: Xlel_wd-inst (T)
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻŦাāĻ•্āĻ¯āĻŸি āĻ¸āĻŽ্āĻĒূāĻ°্āĻŖ āĻ•āĻ°ুāĻ¨: ā§§ā§¯ā§Šā§Ļ āĻāĻ° āĻĻāĻļāĻ•ে āĻ¨াā§ŽāĻ¸ি āĻœাāĻ°্āĻŽাāĻ¨িāĻ° āĻ‰āĻ¤্āĻĨাāĻ¨ āĻ…āĻ¸্āĻŸ্āĻ°িāĻ¯়া
targets: ā§§ā§¯ā§Šā§­ āĻ¸াāĻ˛ে āĻ…āĻ¸্āĻŸ্āĻ°িāĻ¯়াāĻ•ে āĻ¸ংāĻ¯ুāĻ•্āĻ¤ āĻ•āĻ°াāĻ° āĻ¸āĻŽāĻ¯় āĻĒāĻ°িāĻŦাāĻ°āĻ•ে āĻ¤াāĻ° āĻŦ্āĻ¯াংāĻ•িং āĻ•াāĻ°্āĻ¯āĻ•্āĻ°āĻŽ āĻŦিāĻ•্āĻ°ি
template_id: 1
============================================================
#11
> Dataset: IndicXParaphrase-inst
inputs: āĻ­িāĻ¨্āĻ¨ āĻļāĻŦ্āĻĻāĻ—ুāĻš্āĻ› āĻŦ্āĻ¯āĻŦāĻšাāĻ° āĻ•āĻ°ে āĻ¨িāĻšেāĻ° āĻŦাāĻ•্āĻ¯āĻŸি āĻ˛েāĻ–: "āĻ–āĻŦāĻ° āĻĒেāĻ¯়ে āĻĒুāĻ˛িāĻļ āĻ˜āĻŸāĻ¨াāĻ¸্āĻĨāĻ˛ে āĻĒৌঁāĻ›ে
targets: "āĻĒুāĻ˛িāĻļ āĻ–āĻŦāĻ° āĻĒেā§Ÿে āĻ˜āĻŸāĻ¨াāĻ¸্āĻĨāĻ˛ে āĻĒৌঁāĻ›ে āĻ†āĻšāĻ¤āĻĻেāĻ° āĻ‰āĻĻ্āĻ§াāĻ° āĻ•āĻ°ে āĻ¸্āĻĨাāĻ¨ীāĻ¯় āĻšাāĻ¸āĻĒাāĻ¤াāĻ˛ে āĻ¨িāĻ¯়ে āĻ¯াāĻ¯়।
template_id: 1
============================================================
#12
> Dataset: Flan-lambada (T)
inputs: ` ` āĻšāĻŽā§ŽāĻ•াāĻ°, āĻ†āĻŽি āĻ†āĻļা āĻ•āĻ°āĻ›িāĻ˛াāĻŽ āĻ¤ুāĻŽি āĻ•āĻ°āĻŦে. āĻ†āĻŽাāĻ° āĻ•্āĻˇāĻŽা āĻšাāĻ“āĻ¯়াāĻ° āĻœāĻ¨্āĻ¯ āĻ¤োāĻŽাāĻ•ে āĻ­āĻ¯় āĻĻেāĻ–াāĻ¨ো
targets: āĻ†āĻĻāĻŽ
template_id: 1
============================================================
#13
> Dataset: PAWS-Wiki (T)
inputs: āĻŦাāĻ•্āĻ¯ ā§§ঃ (ā§§ā§Žā§¯ā§¨-ā§§ā§¯ā§Ŧā§¨) āĻ›িāĻ˛েāĻ¨ āĻ‡āĻ¯়াāĻ° āĻāĻ•াāĻĄেāĻŽি āĻœিāĻŽāĻ°িāĻ—েāĻ° (āĻ“āĻ¯়েāĻ˛āĻļ āĻāĻ•াāĻĄেāĻŽি) āĻĒ্āĻ°āĻĨāĻŽ āĻ¸āĻ­াāĻĒāĻ¤ি।
targets: āĻš্āĻ¯াঁ
template_id: 1
============================================================
#14
> Dataset: CNN-Daily-Mail (T)
inputs: āĻ¨িāĻŦāĻ¨্āĻ§āĻŸি āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻ•āĻ°ে āĻŦāĻ˛ুāĻ¨:  āĻĒাāĻ•িāĻ¸্āĻ¤াāĻ¨েāĻ° āĻĒেāĻļোāĻ¯়াāĻ°েāĻ° āĻāĻ•āĻŸি āĻ¸্āĻ•ুāĻ˛েāĻ° āĻšāĻ˛েāĻ° āĻ­েāĻ¤āĻ° āĻĻিāĻ¯়
targets: āĻĒাāĻ•িāĻ¸্āĻ¤াāĻ¨েāĻ° āĻĒ্āĻ°āĻ¤িāĻ°āĻ•্āĻˇা āĻŽāĻ¨্āĻ¤্āĻ°ী āĻŦāĻ˛েāĻ¨, āĻ¸āĻ¨্āĻ¤্āĻ°াāĻ¸েāĻ° āĻŦিāĻ°ুāĻĻ্āĻ§ে āĻ¯ুāĻĻ্āĻ§েāĻ° āĻĒ্āĻ°āĻĨāĻŽ āĻ¸াāĻ°িāĻ¤ে āĻļি
template_id: 1
============================================================
#15
> Dataset: Flan-Coqa (T)
inputs: āĻ¤ুāĻ°্āĻ•ি āĻœāĻ¨āĻ—āĻŖ (), āĻŦা āĻ¤ুāĻ°্āĻ•িāĻ°া (), āĻ¯া āĻ†āĻ¨াāĻ¤োāĻ˛িāĻ¯়াāĻ¨ āĻ¤ুāĻ°্āĻ•ি āĻ¨াāĻŽেāĻ“ āĻĒāĻ°িāĻšিāĻ¤ (), āĻāĻ•āĻŸি āĻ¤ুāĻ°্
targets: ā§§। āĻ†āĻ¨াāĻ¤োāĻ˛িāĻ¯়াāĻ° āĻ—্āĻ°াāĻŽāĻŦাāĻ¸ী ā§¨। āĻ¨া ā§Š। āĻš্āĻ¯াঁ ā§Ē. āĻ¤ুāĻ°্āĻ•ি ā§Ģ। āĻ¨া ā§Ŧ। āĻĒāĻļ্āĻšিāĻŽ āĻ‡āĻ‰āĻ°োāĻĒ ā§­। āĻŸাāĻ°্āĻ—
template_id: 1
============================================================
#16
> Dataset: Xlel_wd-inst
inputs: Complete the following phrase:  āĻĻ্āĻŦিāĻ¤ীāĻ¯় āĻ†āĻŦāĻĻুāĻ˛ āĻšাāĻŽিāĻĻ
targets: āĻ¤āĻ°ুāĻŖ āĻ¤ুāĻ°্āĻ•ি āĻŦিāĻĒ্āĻ˛āĻŦ āĻĻ্āĻŦাāĻ°া āĻĻ্āĻŦিāĻ¤ীāĻ¯় āĻ¸াংāĻŦিāĻ§াāĻ¨িāĻ• āĻ¯ুāĻ—েāĻ° āĻ¸ূāĻšāĻ¨া āĻ•āĻ°ে āĻ¸াংāĻŦিāĻ§াāĻ¨িāĻ• āĻ°াāĻœāĻ¤āĻ¨্āĻ¤
template_id: 1
============================================================
#17
> Dataset: NQ-Open (T)
inputs: āĻĒ্āĻ°āĻļ্āĻ¨: āĻ¤াāĻ°া āĻ•োāĻĨাāĻ¯় āĻ—āĻ°āĻŽ āĻŸāĻŦ āĻŸাāĻ‡āĻŽ āĻŽেāĻļিāĻ¨ āĻĢিāĻ˛্āĻŽ āĻ•āĻ°েāĻ›ে āĻ‰āĻ¤্āĻ¤āĻ°ঃ
targets: āĻĢাāĻ°্āĻ¨ি āĻ†āĻ˛্āĻĒাāĻ‡āĻ¨ āĻ°িāĻ¸োāĻ°্āĻŸ
template_id: 2
============================================================
#18
> Dataset: Flan-CoT-submix (T)
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻĻুāĻŸি āĻŦাāĻ•্āĻ¯ āĻĨেāĻ•ে āĻ•োāĻ¨āĻŸি āĻ¸াāĻ§াāĻ°āĻŖ āĻœ্āĻžাāĻ¨েāĻ° āĻŦিāĻ°ুāĻĻ্āĻ§ে? āĻŦিāĻ•āĻ˛্āĻĒঃ - āĻĒ্āĻ°āĻĨāĻŽ āĻŦাāĻ•্āĻ¯:
targets: āĻ•āĻŽāĻ˛া āĻ°āĻ¸েāĻ° āĻ¸্āĻŦাāĻĻ āĻļāĻ¸্āĻ¯েāĻ° āĻ¸াāĻĨে āĻ­াāĻ˛ āĻšāĻ¯় āĻ¨া। āĻšূāĻĄ়াāĻ¨্āĻ¤ āĻ‰āĻ¤্āĻ¤āĻ°: A āĻŦাāĻ•্āĻ¯।
template_id: 1
============================================================
#19
> Dataset: Aya-Dataset
inputs: āĻ¨িāĻšেāĻ° āĻ…āĻ¨ুāĻš্āĻ›েāĻĻেāĻ° āĻŦিāĻˇāĻ¯় āĻ•ি ?

āĻŸাāĻ™্āĻ—ুāĻ¯়াāĻ° āĻšাāĻ“āĻ° (āĻ¸িāĻ˛েāĻŸি: ꠐꠣꠋꠉꠥꠀꠞ ꠀꠅꠞ) āĻŦাংāĻ˛াāĻĻেāĻļেāĻ° āĻŦৃ
targets: āĻŸাāĻ™্āĻ—ুāĻ¯়াāĻ° āĻšাāĻ“āĻ° |
template_id: 0
============================================================
#20
> Dataset: Adversarial QA (T)
inputs: "āĻĢিāĻ˛্āĻĄ-āĻ‡āĻĢেāĻ•্āĻŸ" āĻāĻ•āĻŸি āĻ˛েāĻŦেāĻ˛ āĻ¯া āĻāĻ•āĻŸি āĻ§āĻ°āĻ¨েāĻ° āĻŦāĻ°্āĻŖāĻ¨া āĻ•āĻ°āĻ¤ে āĻŦ্āĻ¯āĻŦāĻšৃāĻ¤ āĻšāĻ¯়? āĻĒূāĻ°্āĻŦāĻŦāĻ°্āĻ¤ী āĻĒ্āĻ°āĻļ
targets: āĻĻুāĻŸি āĻ§āĻ°āĻŖেāĻ° āĻŸ্āĻ°াāĻ¨āĻœিāĻ¸্āĻŸāĻ° āĻ°āĻ¯়েāĻ›ে, āĻ¯া āĻāĻ•āĻŸি āĻ¸াāĻ°্āĻ•িāĻŸে āĻ•ীāĻ­াāĻŦে āĻŦ্āĻ¯āĻŦāĻšৃāĻ¤ āĻšāĻ¯় āĻ¤াāĻ° āĻ¸াāĻŽাāĻ¨্āĻ¯ āĻĒ
template_id: 1
============================================================
#21
> Dataset: PIQA (T)
inputs: āĻ¨িāĻŽ্āĻ¨āĻ˛িāĻ–িāĻ¤ āĻŦাāĻ•্āĻ¯āĻŸি āĻ¸āĻ°্āĻŦোāĻ¤্āĻ¤āĻŽ āĻŦিāĻ•āĻ˛্āĻĒেāĻ° āĻ¸াāĻĨে āĻļেāĻˇ āĻ•āĻ°ুāĻ¨ঃ.āĻ•ুāĻ•িāĻ¨ি āĻļীāĻ°্āĻˇāĻ—ুāĻ˛ি āĻĻিāĻ¯়ে āĻ•ী āĻ•
targets: āĻ¤াāĻĻেāĻ° āĻ˛āĻŦāĻŖাāĻ•্āĻ¤ āĻĒাāĻ¨িāĻ¤ে āĻ¸াāĻĻা āĻ•āĻ°ুāĻ¨ āĻāĻŦং āĻ¤াāĻĻেāĻ° āĻāĻ•āĻŸি āĻ¸্āĻ¨্āĻ¯াāĻ• āĻšিāĻ¸াāĻŦে āĻĒāĻ°িāĻŦেāĻļāĻ¨ āĻ•āĻ°ুāĻ¨
template_id: 1
============================================================
#22
> Dataset: Flan-unified-QA (T)
inputs: āĻ•োāĻ¨ āĻŦিāĻŦৃāĻ¤ি āĻ¸āĻ িāĻ•āĻ­াāĻŦে āĻŦāĻ°্āĻŖāĻ¨া āĻ•āĻ°ে āĻ¯ে āĻ•েāĻ¨ āĻāĻĒ্āĻ°িāĻ˛েāĻ° āĻĒ্āĻ°āĻ¤িāĻŸি āĻĻিāĻ¨ে āĻ†āĻ˛াāĻ¸্āĻ•াāĻ¤ে ā§¨ā§Ē āĻ˜āĻŖ্āĻŸাāĻ°āĻ“
targets: (āĻ—)
template_id: 1
============================================================
#23
> Dataset: News-summary-instruct
inputs: āĻ†āĻ°ো āĻ•āĻŽ āĻļāĻŦ্āĻĻে āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻŽূāĻ˛āĻ­াāĻŦ āĻŦāĻ°্āĻŖāĻ¨া āĻ•āĻ°: āĻ¸্āĻŸ্āĻ¯াāĻ¨্āĻĄাāĻ°্āĻĄ āĻšাāĻ°্āĻŸাāĻ°্āĻĄ āĻŦ্āĻ¯াংāĻ•েāĻ° āĻ¨āĻ¤ুāĻ¨ āĻĒ্
targets: āĻŦাāĻ•্āĻ¯āĻŸিāĻ° āĻ¸ংāĻ•্āĻˇিāĻĒ্āĻ¤ āĻŽূāĻ˛āĻ­াāĻŦ āĻšāĻ˛ো, āĻ¸্āĻŸ্āĻ¯াāĻ¨্āĻĄাāĻ°্āĻĄ āĻšাāĻ°্āĻŸাāĻ°্āĻĄেāĻ° āĻ¨āĻ¤ুāĻ¨ āĻ¸িāĻ‡āĻ“ āĻ†āĻŦāĻ°াāĻ°।
template_id: 1
============================================================

Templates

There appear to be 16 different templates used by the templated data. To quote Cohere: "Templated data: We collaborated with fluent speakers to create templates that allowed for the automatic expansion of existing datasets into various languages."
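This count can be checked directly from the template_id column (a sketch, under the same assumptions as the earlier snippets).

# Distinct template ids across the training split
# ('template_id' column name is an assumption, matching the dumps above)
template_ids = bengali["train"].unique("template_id")
print(len(template_ids), "templates:", sorted(template_ids))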

Conclusion

The Aya collection is a huge multilingual dataset that can be quite useful for language-specific training and fine-tuning. How do you plan to use the Aya dataset/collection?

