Cohere Aya Dataset: Exploring the Split-by-language Collection

A snapshot of the Aya collection (Bengali). Image taken from HuggingFace.

In February 2024, Cohere launched Aya, a multilingual Large Language Model (LLM). Alongside the model, Cohere also released the datasets used to train Aya. For example, the aya_dataset consists of around 205K examples annotated by humans. The recently released aya_collection_language_split, on the other hand, is a gigantic dataset with more than 500 million data points spread across more than 100 languages. As the name suggests, this dataset is split by language: all data points in Bengali, for example, irrespective of the underlying task, can be found in a single split. Apart from the original human-annotated examples from the aya_dataset, aya_collection_language_split also contains a large amount of translated and templated data. The dataset is released under the Apache 2.0 license, which allows both academic and commercial use.
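For reference, the collection can be loaded with the Hugging Face datasets library. Below is a minimal sketch; the configuration name "bengali" is an assumption based on the split-by-language naming, so check the dataset card for the exact names.

# Minimal sketch: load the Bengali subset of the Aya collection.
# The configuration name "bengali" is an assumption; see the dataset
# card on HuggingFace for the exact names. The download is large.
from datasets import load_dataset

dataset = load_dataset("CohereForAI/aya_collection_language_split", "bengali")

# List the available splits and their sizes.
for split_name, split in dataset.items():
    print(split_name, split.num_rows)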

The Bengali Language Split

Each language in the Aya collection comes with three splits: train, validation, and test. The Bengali portion, for example, contains:

  •     3,601,287 examples in 'train'
  •     274,546 examples in 'validation'
  •     276,504 examples in 'test'

Let us take a closer look at the Bengali split, focusing on the task types and data sources.
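Before diving in, it helps to peek at a raw record. The following sketch reuses the dataset object loaded above; the field names are assumptions that match the listings in the later sections.

# Peek at the first training record; long values are truncated for
# display, as done in the listings below.
example = dataset["train"][0]
for key, value in example.items():
    print(key, ":", str(value)[:80])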

All Task Types with Examples

There are 10 different task types. These are:

  •     'summarization'
  •     'paraphrasing'
  •     'text-simplification'
  •     'question-answering'
  •     '-'
  •     'dialogue'
  •     'translation'
  •     'generation'
  •     'event-linking'
  •     'paraphrase-identification'
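These values can be enumerated directly from the task_type column; the column name is an assumption, matching the "Task:" field in the listings below.

# Enumerate the distinct task types in the training split.
# `task_type` is an assumed column name, based on the "Task:" field
# shown in the examples that follow.
task_types = dataset["train"].unique("task_type")
print(len(task_types), "task types:", task_types)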

An example of each task type is provided below. For the sake of brevity, all texts are truncated at 80 characters:

#1
> Task: summarization
inputs: āϏংāϰāĻ•্āώāĻŖেāϰ āφāĻ—ে āϝেāĻ•োāύ āĻ–াāĻĻ্āϝেāϰ āĻ…āĻŦāĻļিāώ্āϟাংāĻļ āĻŦা āĻĻাāĻ— āĻ…āĻĒāϏাāϰāĻŖ āĻ•āϰা āĻĒ্āϰāϝ়োāϜāύ āĻ•াāϰāĻŖ āĻāĻ—ুāϞি āϤাāĻĻ
targets: āϏংāϰāĻ•্āώāĻŖেāϰ āφāĻ—ে āύিāĻļ্āϚিāϤ āĻ•āϰুāύ āϝে āφāĻĒāύাāϰ āĻ•োāϝ়েāϞāϟি āĻĒāϰিāώ্āĻ•াāϰ। āĻ•োāϝ়াāϰ্āϟāĻ•ে āĻ­্āϝাāĻ•ুāϝ়াāĻŽ āĻ•āϰো
template_id: 1
============================================================
#2
> Task: paraphrasing
inputs: āĻ­িāύ্āύ āĻļāĻŦ্āĻĻāĻ—ুāϚ্āĻ› āĻŦ্āϝāĻŦāĻšাāϰ āĻ•āϰে āύিāϚেāϰ āĻŦাāĻ•্āϝāϟি āϞেāĻ–: "āĻ–āĻŦāϰ āĻĒেāϝ়ে āĻĒুāϞিāĻļ āϘāϟāύাāϏ্āĻĨāϞে āĻĒৌঁāĻ›ে
targets: "āĻĒুāϞিāĻļ āĻ–āĻŦāϰ āĻĒে⧟ে āϘāϟāύাāϏ্āĻĨāϞে āĻĒৌঁāĻ›ে āφāĻšāϤāĻĻেāϰ āωāĻĻ্āϧাāϰ āĻ•āϰে āϏ্āĻĨাāύীāϝ় āĻšাāϏāĻĒাāϤাāϞে āύিāϝ়ে āϝাāϝ়।
template_id: 1
============================================================
#3
> Task: text-simplification
inputs: āĻāχ āĻŦাāĻ•্āϝāϟিāϰ āφāϰো āϜāϟিāϞ āϏংāϏ্āĻ•āϰāĻŖ āϤৈāϰি āĻ•āϰুāύ'''āĻāĻ• āĻ­াāώা āĻĨেāĻ•ে āĻ…āύ্āϝ āĻ­াāώাāϝ় āĻ…āύুāĻŦাāĻĻ āĻ•āϰাāϰ āϏāĻŽ
targets: āĻ…āĻŦāĻļ্āϝāχ, āĻŦাāĻ•্āϝāϟিāϰ āφāϰো āϜāϟিāϞ āϏংāϏ্āĻ•āϰāĻŖ āĻšāϞ "''āϤিāύি āĻāĻ• āĻ­াāώা āĻĨেāĻ•ে āĻ…āύ্āϝ āĻ­াāώাāϝ় āĻ…āύুāĻŦাāĻĻ āĻ•āϰা
template_id: 1
============================================================
#4
> Task: question-answering
inputs: āϤুāϰ্āĻ•ি āϜāύāĻ—āĻŖ (), āĻŦা āϤুāϰ্āĻ•িāϰা (), āϝা āφāύাāϤোāϞিāϝ়াāύ āϤুāϰ্āĻ•ি āύাāĻŽেāĻ“ āĻĒāϰিāϚিāϤ (), āĻāĻ•āϟি āϤুāϰ্
targets: ā§§। āφāύাāϤোāϞিāϝ়াāϰ āĻ—্āϰাāĻŽāĻŦাāϏী ⧍। āύা ā§Š। āĻš্āϝাঁ ā§Ē. āϤুāϰ্āĻ•ি ā§Ģ। āύা ā§Ŧ। āĻĒāĻļ্āϚিāĻŽ āχāωāϰোāĻĒ ā§­। āϟাāϰ্āĻ—
template_id: 1
============================================================
#5
> Task: -
inputs: āύিāϚেāϰ āĻ…āύুāϚ্āĻ›েāĻĻেāϰ āĻŦিāώāϝ় āĻ•ি ?

āϟাāĻ™্āĻ—ুāϝ়াāϰ āĻšাāĻ“āϰ (āϏিāϞেāϟি: ꠐꠣꠋꠉꠥꠀꠞ ꠀꠅꠞ) āĻŦাংāϞাāĻĻেāĻļেāϰ āĻŦৃ
targets: āϟাāĻ™্āĻ—ুāϝ়াāϰ āĻšাāĻ“āϰ |
template_id: 0
============================================================
#6
> Task: dialogue
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻŦিāώāϝ়েāϰ āωāĻĒāϰ āĻ­িāϤ্āϤি āĻ•āϰে āĻāĻ•āϟি āϏংāĻ•্āώিāĻĒ্āϤ āĻŦāϰ্āĻŖāύা āϞিāĻ–ুāύঃ  āĻŦ্āϝāĻ•্āϤি āĻāĻ•্āϏ āĻ—িāϰ
targets: āĻ…āĻŦāĻļ্āϝāχ, āĻāĻ–াāύে āĻāĻ•āϟি āϏংāĻ•্āώিāĻĒ্āϤ āĻ…āύুāϚ্āĻ›েāĻĻ āϰāϝ়েāĻ›ে: āϧāϰ্āĻŽেāϰ āĻĒ্āϰāϤি āφāĻ—্āϰāĻšী āĻšāĻ“āϝ়াāϝ় āĻ­েāĻĻা āĻ—
template_id: 1
============================================================
#7
> Task: translation
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāχ āĻŦোāϟেāϰ āϏাāωāύ্āĻĄāĻŦাāϰāϟি āĻāĻ–āύāĻ“ āϏāĻŦ āϏ্āĻĒিāĻ•াāϰেāϰ āϜāύ্āϝ āϤাāϰেāϰ āϏংāϝোāĻ—। āĻāχāϚāĻĄিāĻāĻŽāφāχ āĻĒোāϰ্āϟ āϏāĻŦ āĻĄিāĻ­
template_id: 1
============================================================
#8
> Task: generation
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻĻুāϟি āĻŦাāĻ•্āϝ āĻĨেāĻ•ে āĻ•োāύāϟি āϏাāϧাāϰāĻŖ āϜ্āĻžাāύেāϰ āĻŦিāϰুāĻĻ্āϧে? āĻŦিāĻ•āϞ্āĻĒঃ - āĻĒ্āϰāĻĨāĻŽ āĻŦাāĻ•্āϝ:
targets: āĻ•āĻŽāϞা āϰāϏেāϰ āϏ্āĻŦাāĻĻ āĻļāϏ্āϝেāϰ āϏাāĻĨে āĻ­াāϞ āĻšāϝ় āύা। āϚূāĻĄ়াāύ্āϤ āωāϤ্āϤāϰ: A āĻŦাāĻ•্āϝ।
template_id: 1
============================================================
#9
> Task: event-linking
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻŦাāĻ•্āϝāϟি āϏāĻŽ্āĻĒূāϰ্āĻŖ āĻ•āϰুāύ: ā§§ā§¯ā§Šā§Ļ āĻāϰ āĻĻāĻļāĻ•ে āύাā§ŽāϏি āϜাāϰ্āĻŽাāύিāϰ āωāϤ্āĻĨাāύ āĻ…āϏ্āϟ্āϰিāϝ়া
targets: ā§§ā§¯ā§Šā§­ āϏাāϞে āĻ…āϏ্āϟ্āϰিāϝ়াāĻ•ে āϏংāϝুāĻ•্āϤ āĻ•āϰাāϰ āϏāĻŽāϝ় āĻĒāϰিāĻŦাāϰāĻ•ে āϤাāϰ āĻŦ্āϝাংāĻ•িং āĻ•াāϰ্āϝāĻ•্āϰāĻŽ āĻŦিāĻ•্āϰি
template_id: 1
============================================================
#10
> Task: paraphrase-identification
inputs: āĻŦাāĻ•্āϝ ā§§ঃ (ā§§ā§Žā§¯ā§¨-⧧⧝ā§Ŧ⧍) āĻ›িāϞেāύ āχāϝ়াāϰ āĻāĻ•াāĻĄেāĻŽি āϜিāĻŽāϰিāĻ—েāϰ (āĻ“āϝ়েāϞāĻļ āĻāĻ•াāĻĄেāĻŽি) āĻĒ্āϰāĻĨāĻŽ āϏāĻ­াāĻĒāϤি।
targets: āĻš্āϝাঁ
template_id: 1
============================================================

Names of All Datasets with Examples

As noted earlier, the Aya collection contains data from different sources. Overall, there are 23 distinct source datasets. These are:

  •     'WIKI QA (T)'
  •     'Flan-GEM-wiki-lingua (T)'
  •     'SODA-inst (T)'
  •     'Joke-explaination-inst (T)'
  •     'IndicSentiment-inst'
  •     'Wiki-split-inst (T)'
  •     'Dolly-v2 (T)'
  •     'HotpotQA (T)'
  •     'Mintaka-inst (T)'
  •     'Xlel_wd-inst (T)'
  •     'IndicXParaphrase-inst'
  •     'Flan-lambada (T)'
  •     'PAWS-Wiki (T)'
  •     'CNN-Daily-Mail (T)'
  •     'Flan-Coqa (T)'
  •     'Xlel_wd-inst'
  •     'NQ-Open (T)'
  •     'Flan-CoT-submix (T)'
  •     'Aya-Dataset'
  •     'Adversarial QA (T)'
  •     'PIQA (T)'
  •     'Flan-unified-QA (T)'
  •     'News-summary-instruct'
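The per-source distribution can be computed in the same way; the column name dataset_name is an assumption, matching the "Dataset:" field in the listings below.

from collections import Counter

# Count Bengali training examples per source dataset. Note that this
# loads the entire column into memory.
source_counts = Counter(dataset["train"]["dataset_name"])
for name, count in source_counts.most_common():
    print(f"{name}: {count}")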

Below, a sample Bengali data point from each of the above 23 datasets is presented (all texts are again truncated):

#1
> Dataset: WIKI QA (T)
inputs: āĻĒ্āϰāĻļ্āύāϟি āĻ•ীঃ ""6 āĻĢুāϟ 7 āĻĢুāϟ" (āĻāĻ›াāĻĄ়াāĻ“ "6'7" āĻšিāϏাāĻŦে āϏ্āϟাāχāϞ āĻ•āϰা āĻšāϝ়) āφāĻŽেāϰিāĻ•াāύ āϰ্āϝাāĻĒ
targets: "ā§Ŧ āĻĢুāϟ ā§­ āĻĢুāϟ āωāϚ্āϚāϤাāϝ় āĻ•োāύ āĻ—াāύāϟি āĻ—াāĻ“āϝ়া āĻšāϝ়?"
template_id: 1
============================================================
#2
> Dataset: Flan-GEM-wiki-lingua (T)
inputs: āϏংāϰāĻ•্āώāĻŖেāϰ āφāĻ—ে āϝেāĻ•োāύ āĻ–াāĻĻ্āϝেāϰ āĻ…āĻŦāĻļিāώ্āϟাংāĻļ āĻŦা āĻĻাāĻ— āĻ…āĻĒāϏাāϰāĻŖ āĻ•āϰা āĻĒ্āϰāϝ়োāϜāύ āĻ•াāϰāĻŖ āĻāĻ—ুāϞি āϤাāĻĻ
targets: āϏংāϰāĻ•্āώāĻŖেāϰ āφāĻ—ে āύিāĻļ্āϚিāϤ āĻ•āϰুāύ āϝে āφāĻĒāύাāϰ āĻ•োāϝ়েāϞāϟি āĻĒāϰিāώ্āĻ•াāϰ। āĻ•োāϝ়াāϰ্āϟāĻ•ে āĻ­্āϝাāĻ•ুāϝ়াāĻŽ āĻ•āϰো
template_id: 1
============================================================
#3
> Dataset: SODA-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻŦিāώāϝ়েāϰ āωāĻĒāϰ āĻ­িāϤ্āϤি āĻ•āϰে āĻāĻ•āϟি āϏংāĻ•্āώিāĻĒ্āϤ āĻŦāϰ্āĻŖāύা āϞিāĻ–ুāύঃ  āĻŦ্āϝāĻ•্āϤি āĻāĻ•্āϏ āĻ—িāϰ
targets: āĻ…āĻŦāĻļ্āϝāχ, āĻāĻ–াāύে āĻāĻ•āϟি āϏংāĻ•্āώিāĻĒ্āϤ āĻ…āύুāϚ্āĻ›েāĻĻ āϰāϝ়েāĻ›ে: āϧāϰ্āĻŽেāϰ āĻĒ্āϰāϤি āφāĻ—্āϰāĻšী āĻšāĻ“āϝ়াāϝ় āĻ­েāĻĻা āĻ—
template_id: 1
============================================================
#4
> Dataset: Joke-explaination-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻ•ৌāϤুāĻ•āϟি āĻŦ্āϝাāĻ–্āϝা āĻ•āϰুāύঃ āĻ•āĻŽ্āĻĒিāωāϟাāϰ āϝāĻ–āύ āĻ•্āϞাāύ্āϤ āĻšāϝ় āϤāĻ–āύ āĻ•ী āĻ•āϰে? āωঃ āĻāϟা āĻ•
targets: āĻŦ্āϝাāĻ–্āϝাঃ āφāĻĒāύাāϰ āĻ•āĻŽ্āĻĒিāωāϟাāϰ āĻ•ি āĻ•āĻ–āύāĻ“ āĻ•াāϜ āĻ•āϰা āĻŦāύ্āϧ āĻ•āϰে āĻĻেāϝ় (āĻĢ্āϰিāϜ) āĻŦা āϝāĻ–āύ āφāĻĒāύি āĻāϟি
template_id: 2
============================================================
#5
> Dataset: IndicSentiment-inst
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāχ āĻŦোāϟেāϰ āϏাāωāύ্āĻĄāĻŦাāϰāϟি āĻāĻ–āύāĻ“ āϏāĻŦ āϏ্āĻĒিāĻ•াāϰেāϰ āϜāύ্āϝ āϤাāϰেāϰ āϏংāϝোāĻ—। āĻāχāϚāĻĄিāĻāĻŽāφāχ āĻĒোāϰ্āϟ āϏāĻŦ āĻĄিāĻ­
template_id: 1
============================================================
#6
> Dataset: Wiki-split-inst (T)
inputs: āĻāχ āĻŦাāĻ•্āϝāϟিāϰ āφāϰো āϜāϟিāϞ āϏংāϏ্āĻ•āϰāĻŖ āϤৈāϰি āĻ•āϰুāύ'''āĻāĻ• āĻ­াāώা āĻĨেāĻ•ে āĻ…āύ্āϝ āĻ­াāώাāϝ় āĻ…āύুāĻŦাāĻĻ āĻ•āϰাāϰ āϏāĻŽ
targets: āĻ…āĻŦāĻļ্āϝāχ, āĻŦাāĻ•্āϝāϟিāϰ āφāϰো āϜāϟিāϞ āϏংāϏ্āĻ•āϰāĻŖ āĻšāϞ "''āϤিāύি āĻāĻ• āĻ­াāώা āĻĨেāĻ•ে āĻ…āύ্āϝ āĻ­াāώাāϝ় āĻ…āύুāĻŦাāĻĻ āĻ•āϰা
template_id: 1
============================================================
#7
> Dataset: Dolly-v2 (T)
inputs: āĻ­াāϰ্āϜিāύ āĻ…āϏ্āϟ্āϰেāϞিāϝ়া āĻ•āĻ–āύ āĻ•াāϜ āĻļুāϰু āĻ•āϰে?
Context:āĻ­াāϰ্āϜিāύ āĻ…āϏ্āϟ্āϰেāϞিāϝ়া, āĻ­াāϰ্āϜিāύ āĻ…āϏ্
targets: āĻ­াāϰ্āϜিāύ āĻ…āϏ্āϟ্āϰেāϞিāϝ়া ā§Šā§§ āφāĻ—āϏ্āϟ ⧍ā§Ļā§Ļā§Ļ āϏাāϞে āĻ­াāϰ্āϜিāύ āĻŦ্āϞু āύাāĻŽে āĻāĻ•āϟি āϰুāϟে āĻĻুāϟি āĻŦিāĻŽাāύ āĻĻ
template_id: 1
============================================================
#8
> Dataset: HotpotQA (T)
inputs: "āĻ āύাāχāϟ āφāωāϟ āχāύ āϞāύ্āĻĄāύ" āĻšāϞ āϚāϤুāϰ্āĻĨ āĻĒāϰ্āĻŦ āϝেāĻ–াāύে āĻŦ্āϰিāϟিāĻļ āϏিāϟāĻ•āĻŽ āϏিāĻŽোāύ āĻŦাāϰ্āĻĄ āĻ…āĻ­িāύীāϤ?
targets: "āχāύāĻŦিāϟুāχāύাāϰ্āϏ"
template_id: 3
============================================================
#9
> Dataset: Mintaka-inst (T)
inputs: āĻāχ āĻŦিāώāϝ়āĻļ্āϰেāĻŖীāϰ āĻŽāϧ্āϝে āĻāĻ•āϟি āϏাāϧাāϰāĻŖ āĻŦিāώāϝ়েāϰ āωāĻĻাāĻšāϰāĻŖ āĻĻাāĻ“: āĻ­ূāĻ—োāϞ
targets: āωāϤ্āϤāϰ āφāĻŽেāϰিāĻ•াāϰ āϏāĻĒ্āϤāĻŽ āϏāϰ্āĻŦোāϚ্āϚ āĻĒāϰ্āĻŦāϤ āĻ•োāύāϟি? āĻŽাāωāύ্āϟ āϞুāϏাāύিāϝ়া
template_id: 1
============================================================
#10
> Dataset: Xlel_wd-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻŦাāĻ•্āϝāϟি āϏāĻŽ্āĻĒূāϰ্āĻŖ āĻ•āϰুāύ: ā§§ā§¯ā§Šā§Ļ āĻāϰ āĻĻāĻļāĻ•ে āύাā§ŽāϏি āϜাāϰ্āĻŽাāύিāϰ āωāϤ্āĻĨাāύ āĻ…āϏ্āϟ্āϰিāϝ়া
targets: ā§§ā§¯ā§Šā§­ āϏাāϞে āĻ…āϏ্āϟ্āϰিāϝ়াāĻ•ে āϏংāϝুāĻ•্āϤ āĻ•āϰাāϰ āϏāĻŽāϝ় āĻĒāϰিāĻŦাāϰāĻ•ে āϤাāϰ āĻŦ্āϝাংāĻ•িং āĻ•াāϰ্āϝāĻ•্āϰāĻŽ āĻŦিāĻ•্āϰি
template_id: 1
============================================================
#11
> Dataset: IndicXParaphrase-inst
inputs: āĻ­িāύ্āύ āĻļāĻŦ্āĻĻāĻ—ুāϚ্āĻ› āĻŦ্āϝāĻŦāĻšাāϰ āĻ•āϰে āύিāϚেāϰ āĻŦাāĻ•্āϝāϟি āϞেāĻ–: "āĻ–āĻŦāϰ āĻĒেāϝ়ে āĻĒুāϞিāĻļ āϘāϟāύাāϏ্āĻĨāϞে āĻĒৌঁāĻ›ে
targets: "āĻĒুāϞিāĻļ āĻ–āĻŦāϰ āĻĒে⧟ে āϘāϟāύাāϏ্āĻĨāϞে āĻĒৌঁāĻ›ে āφāĻšāϤāĻĻেāϰ āωāĻĻ্āϧাāϰ āĻ•āϰে āϏ্āĻĨাāύীāϝ় āĻšাāϏāĻĒাāϤাāϞে āύিāϝ়ে āϝাāϝ়।
template_id: 1
============================================================
#12
> Dataset: Flan-lambada (T)
inputs: ` ` āϚāĻŽā§ŽāĻ•াāϰ, āφāĻŽি āφāĻļা āĻ•āϰāĻ›িāϞাāĻŽ āϤুāĻŽি āĻ•āϰāĻŦে. āφāĻŽাāϰ āĻ•্āώāĻŽা āϚাāĻ“āϝ়াāϰ āϜāύ্āϝ āϤোāĻŽাāĻ•ে āĻ­āϝ় āĻĻেāĻ–াāύো
targets: āφāĻĻāĻŽ
template_id: 1
============================================================
#13
> Dataset: PAWS-Wiki (T)
inputs: āĻŦাāĻ•্āϝ ā§§ঃ (ā§§ā§Žā§¯ā§¨-⧧⧝ā§Ŧ⧍) āĻ›িāϞেāύ āχāϝ়াāϰ āĻāĻ•াāĻĄেāĻŽি āϜিāĻŽāϰিāĻ—েāϰ (āĻ“āϝ়েāϞāĻļ āĻāĻ•াāĻĄেāĻŽি) āĻĒ্āϰāĻĨāĻŽ āϏāĻ­াāĻĒāϤি।
targets: āĻš্āϝাঁ
template_id: 1
============================================================
#14
> Dataset: CNN-Daily-Mail (T)
inputs: āύিāĻŦāύ্āϧāϟি āϏংāĻ•্āώিāĻĒ্āϤ āĻ•āϰে āĻŦāϞুāύ:  āĻĒাāĻ•িāϏ্āϤাāύেāϰ āĻĒেāĻļোāϝ়াāϰেāϰ āĻāĻ•āϟি āϏ্āĻ•ুāϞেāϰ āĻšāϞেāϰ āĻ­েāϤāϰ āĻĻিāϝ়
targets: āĻĒাāĻ•িāϏ্āϤাāύেāϰ āĻĒ্āϰāϤিāϰāĻ•্āώা āĻŽāύ্āϤ্āϰী āĻŦāϞেāύ, āϏāύ্āϤ্āϰাāϏেāϰ āĻŦিāϰুāĻĻ্āϧে āϝুāĻĻ্āϧেāϰ āĻĒ্āϰāĻĨāĻŽ āϏাāϰিāϤে āĻļি
template_id: 1
============================================================
#15
> Dataset: Flan-Coqa (T)
inputs: āϤুāϰ্āĻ•ি āϜāύāĻ—āĻŖ (), āĻŦা āϤুāϰ্āĻ•িāϰা (), āϝা āφāύাāϤোāϞিāϝ়াāύ āϤুāϰ্āĻ•ি āύাāĻŽেāĻ“ āĻĒāϰিāϚিāϤ (), āĻāĻ•āϟি āϤুāϰ্
targets: ā§§। āφāύাāϤোāϞিāϝ়াāϰ āĻ—্āϰাāĻŽāĻŦাāϏী ⧍। āύা ā§Š। āĻš্āϝাঁ ā§Ē. āϤুāϰ্āĻ•ি ā§Ģ। āύা ā§Ŧ। āĻĒāĻļ্āϚিāĻŽ āχāωāϰোāĻĒ ā§­। āϟাāϰ্āĻ—
template_id: 1
============================================================
#16
> Dataset: Xlel_wd-inst
inputs: Complete the following phrase:  āĻĻ্āĻŦিāϤীāϝ় āφāĻŦāĻĻুāϞ āĻšাāĻŽিāĻĻ
targets: āϤāϰুāĻŖ āϤুāϰ্āĻ•ি āĻŦিāĻĒ্āϞāĻŦ āĻĻ্āĻŦাāϰা āĻĻ্āĻŦিāϤীāϝ় āϏাংāĻŦিāϧাāύিāĻ• āϝুāĻ—েāϰ āϏূāϚāύা āĻ•āϰে āϏাংāĻŦিāϧাāύিāĻ• āϰাāϜāϤāύ্āϤ
template_id: 1
============================================================
#17
> Dataset: NQ-Open (T)
inputs: āĻĒ্āϰāĻļ্āύ: āϤাāϰা āĻ•োāĻĨাāϝ় āĻ—āϰāĻŽ āϟāĻŦ āϟাāχāĻŽ āĻŽেāĻļিāύ āĻĢিāϞ্āĻŽ āĻ•āϰেāĻ›ে āωāϤ্āϤāϰঃ
targets: āĻĢাāϰ্āύি āφāϞ্āĻĒাāχāύ āϰিāϏোāϰ্āϟ
template_id: 2
============================================================
#18
> Dataset: Flan-CoT-submix (T)
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻĻুāϟি āĻŦাāĻ•্āϝ āĻĨেāĻ•ে āĻ•োāύāϟি āϏাāϧাāϰāĻŖ āϜ্āĻžাāύেāϰ āĻŦিāϰুāĻĻ্āϧে? āĻŦিāĻ•āϞ্āĻĒঃ - āĻĒ্āϰāĻĨāĻŽ āĻŦাāĻ•্āϝ:
targets: āĻ•āĻŽāϞা āϰāϏেāϰ āϏ্āĻŦাāĻĻ āĻļāϏ্āϝেāϰ āϏাāĻĨে āĻ­াāϞ āĻšāϝ় āύা। āϚূāĻĄ়াāύ্āϤ āωāϤ্āϤāϰ: A āĻŦাāĻ•্āϝ।
template_id: 1
============================================================
#19
> Dataset: Aya-Dataset
inputs: āύিāϚেāϰ āĻ…āύুāϚ্āĻ›েāĻĻেāϰ āĻŦিāώāϝ় āĻ•ি ?

āϟাāĻ™্āĻ—ুāϝ়াāϰ āĻšাāĻ“āϰ (āϏিāϞেāϟি: ꠐꠣꠋꠉꠥꠀꠞ ꠀꠅꠞ) āĻŦাংāϞাāĻĻেāĻļেāϰ āĻŦৃ
targets: āϟাāĻ™্āĻ—ুāϝ়াāϰ āĻšাāĻ“āϰ |
template_id: 0
============================================================
#20
> Dataset: Adversarial QA (T)
inputs: "āĻĢিāϞ্āĻĄ-āχāĻĢেāĻ•্āϟ" āĻāĻ•āϟি āϞেāĻŦেāϞ āϝা āĻāĻ•āϟি āϧāϰāύেāϰ āĻŦāϰ্āĻŖāύা āĻ•āϰāϤে āĻŦ্āϝāĻŦāĻšৃāϤ āĻšāϝ়? āĻĒূāϰ্āĻŦāĻŦāϰ্āϤী āĻĒ্āϰāĻļ
targets: āĻĻুāϟি āϧāϰāĻŖেāϰ āϟ্āϰাāύāϜিāϏ্āϟāϰ āϰāϝ়েāĻ›ে, āϝা āĻāĻ•āϟি āϏাāϰ্āĻ•িāϟে āĻ•ীāĻ­াāĻŦে āĻŦ্āϝāĻŦāĻšৃāϤ āĻšāϝ় āϤাāϰ āϏাāĻŽাāύ্āϝ āĻĒ
template_id: 1
============================================================
#21
> Dataset: PIQA (T)
inputs: āύিāĻŽ্āύāϞিāĻ–িāϤ āĻŦাāĻ•্āϝāϟি āϏāϰ্āĻŦোāϤ্āϤāĻŽ āĻŦিāĻ•āϞ্āĻĒেāϰ āϏাāĻĨে āĻļেāώ āĻ•āϰুāύঃ.āĻ•ুāĻ•িāύি āĻļীāϰ্āώāĻ—ুāϞি āĻĻিāϝ়ে āĻ•ী āĻ•
targets: āϤাāĻĻেāϰ āϞāĻŦāĻŖাāĻ•্āϤ āĻĒাāύিāϤে āϏাāĻĻা āĻ•āϰুāύ āĻāĻŦং āϤাāĻĻেāϰ āĻāĻ•āϟি āϏ্āύ্āϝাāĻ• āĻšিāϏাāĻŦে āĻĒāϰিāĻŦেāĻļāύ āĻ•āϰুāύ
template_id: 1
============================================================
#22
> Dataset: Flan-unified-QA (T)
inputs: āĻ•োāύ āĻŦিāĻŦৃāϤি āϏāĻ িāĻ•āĻ­াāĻŦে āĻŦāϰ্āĻŖāύা āĻ•āϰে āϝে āĻ•েāύ āĻāĻĒ্āϰিāϞেāϰ āĻĒ্āϰāϤিāϟি āĻĻিāύে āφāϞাāϏ্āĻ•াāϤে ⧍ā§Ē āϘāĻŖ্āϟাāϰāĻ“
targets: (āĻ—)
template_id: 1
============================================================
#23
> Dataset: News-summary-instruct
inputs: āφāϰো āĻ•āĻŽ āĻļāĻŦ্āĻĻে āĻŦাāĻ•্āϝāϟিāϰ āĻŽূāϞāĻ­াāĻŦ āĻŦāϰ্āĻŖāύা āĻ•āϰ: āϏ্āϟ্āϝাāύ্āĻĄাāϰ্āĻĄ āϚাāϰ্āϟাāϰ্āĻĄ āĻŦ্āϝাংāĻ•েāϰ āύāϤুāύ āĻĒ্
targets: āĻŦাāĻ•্āϝāϟিāϰ āϏংāĻ•্āώিāĻĒ্āϤ āĻŽূāϞāĻ­াāĻŦ āĻšāϞো, āϏ্āϟ্āϝাāύ্āĻĄাāϰ্āĻĄ āϚাāϰ্āϟাāϰ্āĻĄেāϰ āύāϤুāύ āϏিāχāĻ“ āφāĻŦāϰাāϰ।
template_id: 1
============================================================

Templates

There appear to be 16 different templates used for the templated data. To quote Cohere, "Templated data: We collaborated with fluent speakers to create templates that allowed for the automatic expansion of existing datasets into various languages."
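The usage of templates can be tallied from the template_id field that accompanies each record, as seen in the listings above. A sketch, reusing the dataset object loaded earlier:

from collections import Counter

# Tally how often each template is used in the training split. As
# before, this loads the whole column into memory.
template_counts = Counter(dataset["train"]["template_id"])
for template_id, count in sorted(template_counts.items()):
    print(f"template {template_id}: {count}")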

Conclusion

The Aya collection is a huge multilingual dataset that can be quite useful for language-specific training and fine-tuning. How do you plan to use the Aya dataset/collection?

