A snapshot of the Aya collection (Bengali). Image taken from HuggingFace.
In February 2024, Cohere launched Aya, a multilingual Large Language Model (LLM). Alongside it, Cohere also released the datasets used to train Aya. For example, the aya_dataset consists of around 205K examples annotated by humans. The recently released aya_collection_language_split, on the other hand, is a gigantic dataset with more than 500 million data points spread across more than 100 languages. As the name suggests, this dataset is split by language: all data points in Bengali, for example, can be found in a single split, irrespective of the underlying task. Apart from the original human-annotated examples from the aya_dataset, aya_collection_language_split also contains a lot of translated and templated data. The dataset is released under an Apache-2.0 license, allowing both academic and commercial use.
The Bengali Language Split
Each language split in the Aya collection is further divided into three splits: train, validation, and test. The Bengali split, for example, contains:
- 3,601,287 examples in 'train'
- 274,546 examples in 'validation'
- 276,504 examples in 'test'
Let us take a closer look at the Bengali split, focusing specifically on the tasks and data sources.
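Assuming the collection is hosted on the HuggingFace Hub, the Bengali split can be pulled down with the `datasets` library. Below is a minimal sketch; the hub id `CohereForAI/aya_collection_language_split` and the config name `bengali` are assumptions, so check the dataset card for the exact identifiers before use:

```python
# Sketch: loading the Bengali split of the Aya collection.
# The hub id and config name are assumptions -- verify them on the dataset card.

# Split sizes as reported above, kept around for a quick sanity check.
SPLIT_SIZES = {"train": 3_601_287, "validation": 274_546, "test": 276_504}


def total_examples(sizes):
    """Total number of Bengali data points across the three splits."""
    return sum(sizes.values())


def load_bengali(streaming=True):
    """Load the Bengali split; streaming avoids materializing ~4.1M rows."""
    from datasets import load_dataset  # pip install datasets

    return load_dataset(
        "CohereForAI/aya_collection_language_split",
        "bengali",
        streaming=streaming,
    )
```

With streaming enabled, `next(iter(load_bengali()["train"]))` yields the first record; its fields should include `inputs`, `targets`, `task_type`, and `template_id` (field names again taken from the dataset card, so treat them as assumptions).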
All Task Types with Examples
There are 10 different task types ('-' marks data points without a task label). These are:
- 'summarization'
- 'paraphrasing'
- 'text-simplification'
- 'question-answering'
- '-'
- 'dialogue'
- 'translation'
- 'generation'
- 'event-linking'
- 'paraphrase-identification'
An example for each task is provided below. For the sake of brevity, all texts in the following are truncated after 80 characters:
#1
> Task: summarization
inputs: āϏংāϰāĻ্āώāĻŖেāϰ āĻāĻে āϝেāĻোāύ āĻাāĻĻ্āϝেāϰ āĻāĻŦāĻļিāώ্āĻাংāĻļ āĻŦা āĻĻাāĻ āĻāĻĒāϏাāϰāĻŖ āĻāϰা āĻĒ্āϰāϝ়োāĻāύ āĻাāϰāĻŖ āĻāĻুāϞি āϤাāĻĻ
targets: āϏংāϰāĻ্āώāĻŖেāϰ āĻāĻে āύিāĻļ্āĻিāϤ āĻāϰুāύ āϝে āĻāĻĒāύাāϰ āĻোāϝ়েāϞāĻি āĻĒāϰিāώ্āĻাāϰ। āĻোāϝ়াāϰ্āĻāĻে āĻ্āϝাāĻুāϝ়াāĻŽ āĻāϰো
template_id: 1
============================================================
#2
> Task: paraphrasing
inputs: āĻিāύ্āύ āĻļāĻŦ্āĻĻāĻুāĻ্āĻ āĻŦ্āϝāĻŦāĻšাāϰ āĻāϰে āύিāĻেāϰ āĻŦাāĻ্āϝāĻি āϞেāĻ: "āĻāĻŦāϰ āĻĒেāϝ়ে āĻĒুāϞিāĻļ āĻāĻāύাāϏ্āĻĨāϞে āĻĒৌঁāĻে
targets: "āĻĒুāϞিāĻļ āĻāĻŦāϰ āĻĒেā§ে āĻāĻāύাāϏ্āĻĨāϞে āĻĒৌঁāĻে āĻāĻšāϤāĻĻেāϰ āĻāĻĻ্āϧাāϰ āĻāϰে āϏ্āĻĨাāύীāϝ় āĻšাāϏāĻĒাāϤাāϞে āύিāϝ়ে āϝাāϝ়।
template_id: 1
============================================================
#3
> Task: text-simplification
inputs: āĻāĻ āĻŦাāĻ্āϝāĻিāϰ āĻāϰো āĻāĻিāϞ āϏংāϏ্āĻāϰāĻŖ āϤৈāϰি āĻāϰুāύ'''āĻāĻ āĻাāώা āĻĨেāĻে āĻāύ্āϝ āĻাāώাāϝ় āĻāύুāĻŦাāĻĻ āĻāϰাāϰ āϏāĻŽ
targets: āĻāĻŦāĻļ্āϝāĻ, āĻŦাāĻ্āϝāĻিāϰ āĻāϰো āĻāĻিāϞ āϏংāϏ্āĻāϰāĻŖ āĻšāϞ "''āϤিāύি āĻāĻ āĻাāώা āĻĨেāĻে āĻāύ্āϝ āĻাāώাāϝ় āĻāύুāĻŦাāĻĻ āĻāϰা
template_id: 1
============================================================
#4
> Task: question-answering
inputs: āϤুāϰ্āĻি āĻāύāĻāĻŖ (), āĻŦা āϤুāϰ্āĻিāϰা (), āϝা āĻāύাāϤোāϞিāϝ়াāύ āϤুāϰ্āĻি āύাāĻŽেāĻ āĻĒāϰিāĻিāϤ (), āĻāĻāĻি āϤুāϰ্
targets: ā§§। āĻāύাāϤোāϞিāϝ়াāϰ āĻ্āϰাāĻŽāĻŦাāϏী ⧍। āύা ā§Š। āĻš্āϝাঁ ā§Ē. āϤুāϰ্āĻি ā§Ģ। āύা ā§Ŧ। āĻĒāĻļ্āĻিāĻŽ āĻāĻāϰোāĻĒ ā§। āĻাāϰ্āĻ
template_id: 1
============================================================
#5
> Task: -
inputs: āύিāĻেāϰ āĻāύুāĻ্āĻেāĻĻেāϰ āĻŦিāώāϝ় āĻি ?
āĻাāĻ্āĻুāϝ়াāϰ āĻšাāĻāϰ (āϏিāϞেāĻি: ę ꠣꠋę ꠥę ę ę ęę ) āĻŦাংāϞাāĻĻেāĻļেāϰ āĻŦৃ
targets: āĻাāĻ্āĻুāϝ়াāϰ āĻšাāĻāϰ |
template_id: 0
============================================================
#6
> Task: dialogue
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻŦিāώāϝ়েāϰ āĻāĻĒāϰ āĻিāϤ্āϤি āĻāϰে āĻāĻāĻি āϏংāĻ্āώিāĻĒ্āϤ āĻŦāϰ্āĻŖāύা āϞিāĻুāύঃ āĻŦ্āϝāĻ্āϤি āĻāĻ্āϏ āĻিāϰ
targets: āĻāĻŦāĻļ্āϝāĻ, āĻāĻাāύে āĻāĻāĻি āϏংāĻ্āώিāĻĒ্āϤ āĻāύুāĻ্āĻেāĻĻ āϰāϝ়েāĻে: āϧāϰ্āĻŽেāϰ āĻĒ্āϰāϤি āĻāĻ্āϰāĻšী āĻšāĻāϝ়াāϝ় āĻেāĻĻা āĻ
template_id: 1
============================================================
#7
> Task: translation
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāĻ āĻŦোāĻেāϰ āϏাāĻāύ্āĻĄāĻŦাāϰāĻি āĻāĻāύāĻ āϏāĻŦ āϏ্āĻĒিāĻাāϰেāϰ āĻāύ্āϝ āϤাāϰেāϰ āϏংāϝোāĻ। āĻāĻāĻāĻĄিāĻāĻŽāĻāĻ āĻĒোāϰ্āĻ āϏāĻŦ āĻĄিāĻ
template_id: 1
============================================================
#8
> Task: generation
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻĻুāĻি āĻŦাāĻ্āϝ āĻĨেāĻে āĻোāύāĻি āϏাāϧাāϰāĻŖ āĻ্āĻাāύেāϰ āĻŦিāϰুāĻĻ্āϧে? āĻŦিāĻāϞ্āĻĒঃ - āĻĒ্āϰāĻĨāĻŽ āĻŦাāĻ্āϝ:
targets: āĻāĻŽāϞা āϰāϏেāϰ āϏ্āĻŦাāĻĻ āĻļāϏ্āϝেāϰ āϏাāĻĨে āĻাāϞ āĻšāϝ় āύা। āĻূāĻĄ়াāύ্āϤ āĻāϤ্āϤāϰ: A āĻŦাāĻ্āϝ।
template_id: 1
============================================================
#9
> Task: event-linking
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻŦাāĻ্āϝāĻি āϏāĻŽ্āĻĒূāϰ্āĻŖ āĻāϰুāύ: ā§§ā§¯ā§Šā§Ļ āĻāϰ āĻĻāĻļāĻে āύাā§āϏি āĻাāϰ্āĻŽাāύিāϰ āĻāϤ্āĻĨাāύ āĻāϏ্āĻ্āϰিāϝ়া
targets: ā§§ā§¯ā§Šā§ āϏাāϞে āĻāϏ্āĻ্āϰিāϝ়াāĻে āϏংāϝুāĻ্āϤ āĻāϰাāϰ āϏāĻŽāϝ় āĻĒāϰিāĻŦাāϰāĻে āϤাāϰ āĻŦ্āϝাংāĻিং āĻাāϰ্āϝāĻ্āϰāĻŽ āĻŦিāĻ্āϰি
template_id: 1
============================================================
#10
> Task: paraphrase-identification
inputs: āĻŦাāĻ্āϝ ā§§ঃ (ā§§ā§Žā§¯ā§¨-⧧⧝ā§Ŧ⧍) āĻিāϞেāύ āĻāϝ়াāϰ āĻāĻাāĻĄেāĻŽি āĻিāĻŽāϰিāĻেāϰ (āĻāϝ়েāϞāĻļ āĻāĻাāĻĄেāĻŽি) āĻĒ্āϰāĻĨāĻŽ āϏāĻাāĻĒāϤি।
targets: āĻš্āϝাঁ
template_id: 1
============================================================
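A breakdown like the one above can also be computed programmatically. A minimal sketch, assuming each row carries a `task_type` field as in the examples (with `'-'` marking rows without a task label):

```python
from collections import Counter


def task_type_counts(rows):
    """Count how often each task type occurs in an iterable of row dicts."""
    return Counter(row["task_type"] for row in rows)


# Toy rows shaped like the examples above:
sample = [
    {"task_type": "summarization"},
    {"task_type": "translation"},
    {"task_type": "summarization"},
    {"task_type": "-"},  # unlabeled data point, as in example #5
]
counts = task_type_counts(sample)
```

On the real data, the same function can be fed a streamed split to tally all ten task types without loading everything into memory.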
Names of All Datasets with Examples
As noted earlier, the Aya collection has data from different sources. Overall, the Aya collection contains 23 distinct source datasets. These are:
- 'WIKI QA (T)'
- 'Flan-GEM-wiki-lingua (T)'
- 'SODA-inst (T)'
- 'Joke-explaination-inst (T)'
- 'IndicSentiment-inst'
- 'Wiki-split-inst (T)'
- 'Dolly-v2 (T)'
- 'HotpotQA (T)'
- 'Mintaka-inst (T)'
- 'Xlel_wd-inst (T)'
- 'IndicXParaphrase-inst'
- 'Flan-lambada (T)'
- 'PAWS-Wiki (T)'
- 'CNN-Daily-Mail (T)'
- 'Flan-Coqa (T)'
- 'Xlel_wd-inst'
- 'NQ-Open (T)'
- 'Flan-CoT-submix (T)'
- 'Aya-Dataset'
- 'Adversarial QA (T)'
- 'PIQA (T)'
- 'Flan-unified-QA (T)'
- 'News-summary-instruct'
In the following, a sample Bengali data point from each of the above 23 datasets is presented (all texts are again truncated):
#1
> Dataset: WIKI QA (T)
inputs: āĻĒ্āϰāĻļ্āύāĻি āĻীঃ ""6 āĻĢুāĻ 7 āĻĢুāĻ" (āĻāĻাāĻĄ়াāĻ "6'7" āĻšিāϏাāĻŦে āϏ্āĻাāĻāϞ āĻāϰা āĻšāϝ়) āĻāĻŽেāϰিāĻাāύ āϰ্āϝাāĻĒ
targets: "ā§Ŧ āĻĢুāĻ ā§ āĻĢুāĻ āĻāĻ্āĻāϤাāϝ় āĻোāύ āĻাāύāĻি āĻাāĻāϝ়া āĻšāϝ়?"
template_id: 1
============================================================
#2
> Dataset: Flan-GEM-wiki-lingua (T)
inputs: āϏংāϰāĻ্āώāĻŖেāϰ āĻāĻে āϝেāĻোāύ āĻাāĻĻ্āϝেāϰ āĻāĻŦāĻļিāώ্āĻাংāĻļ āĻŦা āĻĻাāĻ āĻāĻĒāϏাāϰāĻŖ āĻāϰা āĻĒ্āϰāϝ়োāĻāύ āĻাāϰāĻŖ āĻāĻুāϞি āϤাāĻĻ
targets: āϏংāϰāĻ্āώāĻŖেāϰ āĻāĻে āύিāĻļ্āĻিāϤ āĻāϰুāύ āϝে āĻāĻĒāύাāϰ āĻোāϝ়েāϞāĻি āĻĒāϰিāώ্āĻাāϰ। āĻোāϝ়াāϰ্āĻāĻে āĻ্āϝাāĻুāϝ়াāĻŽ āĻāϰো
template_id: 1
============================================================
#3
> Dataset: SODA-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻŦিāώāϝ়েāϰ āĻāĻĒāϰ āĻিāϤ্āϤি āĻāϰে āĻāĻāĻি āϏংāĻ্āώিāĻĒ্āϤ āĻŦāϰ্āĻŖāύা āϞিāĻুāύঃ āĻŦ্āϝāĻ্āϤি āĻāĻ্āϏ āĻিāϰ
targets: āĻāĻŦāĻļ্āϝāĻ, āĻāĻাāύে āĻāĻāĻি āϏংāĻ্āώিāĻĒ্āϤ āĻāύুāĻ্āĻেāĻĻ āϰāϝ়েāĻে: āϧāϰ্āĻŽেāϰ āĻĒ্āϰāϤি āĻāĻ্āϰāĻšী āĻšāĻāϝ়াāϝ় āĻেāĻĻা āĻ
template_id: 1
============================================================
#4
> Dataset: Joke-explaination-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻৌāϤুāĻāĻি āĻŦ্āϝাāĻ্āϝা āĻāϰুāύঃ āĻāĻŽ্āĻĒিāĻāĻাāϰ āϝāĻāύ āĻ্āϞাāύ্āϤ āĻšāϝ় āϤāĻāύ āĻী āĻāϰে? āĻঃ āĻāĻা āĻ
targets: āĻŦ্āϝাāĻ্āϝাঃ āĻāĻĒāύাāϰ āĻāĻŽ্āĻĒিāĻāĻাāϰ āĻি āĻāĻāύāĻ āĻাāĻ āĻāϰা āĻŦāύ্āϧ āĻāϰে āĻĻেāϝ় (āĻĢ্āϰিāĻ) āĻŦা āϝāĻāύ āĻāĻĒāύি āĻāĻি
template_id: 2
============================================================
#5
> Dataset: IndicSentiment-inst
inputs: Translate from English to Bengali: "This boat's soundbar is still wire-connectiv
targets: "āĻāĻ āĻŦোāĻেāϰ āϏাāĻāύ্āĻĄāĻŦাāϰāĻি āĻāĻāύāĻ āϏāĻŦ āϏ্āĻĒিāĻাāϰেāϰ āĻāύ্āϝ āϤাāϰেāϰ āϏংāϝোāĻ। āĻāĻāĻāĻĄিāĻāĻŽāĻāĻ āĻĒোāϰ্āĻ āϏāĻŦ āĻĄিāĻ
template_id: 1
============================================================
#6
> Dataset: Wiki-split-inst (T)
inputs: āĻāĻ āĻŦাāĻ্āϝāĻিāϰ āĻāϰো āĻāĻিāϞ āϏংāϏ্āĻāϰāĻŖ āϤৈāϰি āĻāϰুāύ'''āĻāĻ āĻাāώা āĻĨেāĻে āĻāύ্āϝ āĻাāώাāϝ় āĻāύুāĻŦাāĻĻ āĻāϰাāϰ āϏāĻŽ
targets: āĻāĻŦāĻļ্āϝāĻ, āĻŦাāĻ্āϝāĻিāϰ āĻāϰো āĻāĻিāϞ āϏংāϏ্āĻāϰāĻŖ āĻšāϞ "''āϤিāύি āĻāĻ āĻাāώা āĻĨেāĻে āĻāύ্āϝ āĻাāώাāϝ় āĻāύুāĻŦাāĻĻ āĻāϰা
template_id: 1
============================================================
#7
> Dataset: Dolly-v2 (T)
inputs: āĻাāϰ্āĻিāύ āĻāϏ্āĻ্āϰেāϞিāϝ়া āĻāĻāύ āĻাāĻ āĻļুāϰু āĻāϰে?
Context:āĻাāϰ্āĻিāύ āĻāϏ্āĻ্āϰেāϞিāϝ়া, āĻাāϰ্āĻিāύ āĻāϏ্
targets: āĻাāϰ্āĻিāύ āĻāϏ্āĻ্āϰেāϞিāϝ়া ā§Šā§§ āĻāĻāϏ্āĻ ā§¨ā§Ļā§Ļā§Ļ āϏাāϞে āĻাāϰ্āĻিāύ āĻŦ্āϞু āύাāĻŽে āĻāĻāĻি āϰুāĻে āĻĻুāĻি āĻŦিāĻŽাāύ āĻĻ
template_id: 1
============================================================
#8
> Dataset: HotpotQA (T)
inputs: "āĻ āύাāĻāĻ āĻāĻāĻ āĻāύ āϞāύ্āĻĄāύ" āĻšāϞ āĻāϤুāϰ্āĻĨ āĻĒāϰ্āĻŦ āϝেāĻাāύে āĻŦ্āϰিāĻিāĻļ āϏিāĻāĻāĻŽ āϏিāĻŽোāύ āĻŦাāϰ্āĻĄ āĻāĻিāύীāϤ?
targets: "āĻāύāĻŦিāĻুāĻāύাāϰ্āϏ"
template_id: 3
============================================================
#9
> Dataset: Mintaka-inst (T)
inputs: āĻāĻ āĻŦিāώāϝ়āĻļ্āϰেāĻŖীāϰ āĻŽāϧ্āϝে āĻāĻāĻি āϏাāϧাāϰāĻŖ āĻŦিāώāϝ়েāϰ āĻāĻĻাāĻšāϰāĻŖ āĻĻাāĻ: āĻূāĻোāϞ
targets: āĻāϤ্āϤāϰ āĻāĻŽেāϰিāĻাāϰ āϏāĻĒ্āϤāĻŽ āϏāϰ্āĻŦোāĻ্āĻ āĻĒāϰ্āĻŦāϤ āĻোāύāĻি? āĻŽাāĻāύ্āĻ āϞুāϏাāύিāϝ়া
template_id: 1
============================================================
#10
> Dataset: Xlel_wd-inst (T)
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻŦাāĻ্āϝāĻি āϏāĻŽ্āĻĒূāϰ্āĻŖ āĻāϰুāύ: ā§§ā§¯ā§Šā§Ļ āĻāϰ āĻĻāĻļāĻে āύাā§āϏি āĻাāϰ্āĻŽাāύিāϰ āĻāϤ্āĻĨাāύ āĻāϏ্āĻ্āϰিāϝ়া
targets: ā§§ā§¯ā§Šā§ āϏাāϞে āĻāϏ্āĻ্āϰিāϝ়াāĻে āϏংāϝুāĻ্āϤ āĻāϰাāϰ āϏāĻŽāϝ় āĻĒāϰিāĻŦাāϰāĻে āϤাāϰ āĻŦ্āϝাংāĻিং āĻাāϰ্āϝāĻ্āϰāĻŽ āĻŦিāĻ্āϰি
template_id: 1
============================================================
#11
> Dataset: IndicXParaphrase-inst
inputs: āĻিāύ্āύ āĻļāĻŦ্āĻĻāĻুāĻ্āĻ āĻŦ্āϝāĻŦāĻšাāϰ āĻāϰে āύিāĻেāϰ āĻŦাāĻ্āϝāĻি āϞেāĻ: "āĻāĻŦāϰ āĻĒেāϝ়ে āĻĒুāϞিāĻļ āĻāĻāύাāϏ্āĻĨāϞে āĻĒৌঁāĻে
targets: "āĻĒুāϞিāĻļ āĻāĻŦāϰ āĻĒেā§ে āĻāĻāύাāϏ্āĻĨāϞে āĻĒৌঁāĻে āĻāĻšāϤāĻĻেāϰ āĻāĻĻ্āϧাāϰ āĻāϰে āϏ্āĻĨাāύীāϝ় āĻšাāϏāĻĒাāϤাāϞে āύিāϝ়ে āϝাāϝ়।
template_id: 1
============================================================
#12
> Dataset: Flan-lambada (T)
inputs: ` ` āĻāĻŽā§āĻাāϰ, āĻāĻŽি āĻāĻļা āĻāϰāĻিāϞাāĻŽ āϤুāĻŽি āĻāϰāĻŦে. āĻāĻŽাāϰ āĻ্āώāĻŽা āĻাāĻāϝ়াāϰ āĻāύ্āϝ āϤোāĻŽাāĻে āĻāϝ় āĻĻেāĻাāύো
targets: āĻāĻĻāĻŽ
template_id: 1
============================================================
#13
> Dataset: PAWS-Wiki (T)
inputs: āĻŦাāĻ্āϝ ā§§ঃ (ā§§ā§Žā§¯ā§¨-⧧⧝ā§Ŧ⧍) āĻিāϞেāύ āĻāϝ়াāϰ āĻāĻাāĻĄেāĻŽি āĻিāĻŽāϰিāĻেāϰ (āĻāϝ়েāϞāĻļ āĻāĻাāĻĄেāĻŽি) āĻĒ্āϰāĻĨāĻŽ āϏāĻাāĻĒāϤি।
targets: āĻš্āϝাঁ
template_id: 1
============================================================
#14
> Dataset: CNN-Daily-Mail (T)
inputs: āύিāĻŦāύ্āϧāĻি āϏংāĻ্āώিāĻĒ্āϤ āĻāϰে āĻŦāϞুāύ: āĻĒাāĻিāϏ্āϤাāύেāϰ āĻĒেāĻļোāϝ়াāϰেāϰ āĻāĻāĻি āϏ্āĻুāϞেāϰ āĻšāϞেāϰ āĻেāϤāϰ āĻĻিāϝ়
targets: āĻĒাāĻিāϏ্āϤাāύেāϰ āĻĒ্āϰāϤিāϰāĻ্āώা āĻŽāύ্āϤ্āϰী āĻŦāϞেāύ, āϏāύ্āϤ্āϰাāϏেāϰ āĻŦিāϰুāĻĻ্āϧে āϝুāĻĻ্āϧেāϰ āĻĒ্āϰāĻĨāĻŽ āϏাāϰিāϤে āĻļি
template_id: 1
============================================================
#15
> Dataset: Flan-Coqa (T)
inputs: āϤুāϰ্āĻি āĻāύāĻāĻŖ (), āĻŦা āϤুāϰ্āĻিāϰা (), āϝা āĻāύাāϤোāϞিāϝ়াāύ āϤুāϰ্āĻি āύাāĻŽেāĻ āĻĒāϰিāĻিāϤ (), āĻāĻāĻি āϤুāϰ্
targets: ā§§। āĻāύাāϤোāϞিāϝ়াāϰ āĻ্āϰাāĻŽāĻŦাāϏী ⧍। āύা ā§Š। āĻš্āϝাঁ ā§Ē. āϤুāϰ্āĻি ā§Ģ। āύা ā§Ŧ। āĻĒāĻļ্āĻিāĻŽ āĻāĻāϰোāĻĒ ā§। āĻাāϰ্āĻ
template_id: 1
============================================================
#16
> Dataset: Xlel_wd-inst
inputs: Complete the following phrase: āĻĻ্āĻŦিāϤীāϝ় āĻāĻŦāĻĻুāϞ āĻšাāĻŽিāĻĻ
targets: āϤāϰুāĻŖ āϤুāϰ্āĻি āĻŦিāĻĒ্āϞāĻŦ āĻĻ্āĻŦাāϰা āĻĻ্āĻŦিāϤীāϝ় āϏাংāĻŦিāϧাāύিāĻ āϝুāĻেāϰ āϏূāĻāύা āĻāϰে āϏাংāĻŦিāϧাāύিāĻ āϰাāĻāϤāύ্āϤ
template_id: 1
============================================================
#17
> Dataset: NQ-Open (T)
inputs: āĻĒ্āϰāĻļ্āύ: āϤাāϰা āĻোāĻĨাāϝ় āĻāϰāĻŽ āĻāĻŦ āĻাāĻāĻŽ āĻŽেāĻļিāύ āĻĢিāϞ্āĻŽ āĻāϰেāĻে āĻāϤ্āϤāϰঃ
targets: āĻĢাāϰ্āύি āĻāϞ্āĻĒাāĻāύ āϰিāϏোāϰ্āĻ
template_id: 2
============================================================
#18
> Dataset: Flan-CoT-submix (T)
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻĻুāĻি āĻŦাāĻ্āϝ āĻĨেāĻে āĻোāύāĻি āϏাāϧাāϰāĻŖ āĻ্āĻাāύেāϰ āĻŦিāϰুāĻĻ্āϧে? āĻŦিāĻāϞ্āĻĒঃ - āĻĒ্āϰāĻĨāĻŽ āĻŦাāĻ্āϝ:
targets: āĻāĻŽāϞা āϰāϏেāϰ āϏ্āĻŦাāĻĻ āĻļāϏ্āϝেāϰ āϏাāĻĨে āĻাāϞ āĻšāϝ় āύা। āĻূāĻĄ়াāύ্āϤ āĻāϤ্āϤāϰ: A āĻŦাāĻ্āϝ।
template_id: 1
============================================================
#19
> Dataset: Aya-Dataset
inputs: āύিāĻেāϰ āĻāύুāĻ্āĻেāĻĻেāϰ āĻŦিāώāϝ় āĻি ?
āĻাāĻ্āĻুāϝ়াāϰ āĻšাāĻāϰ (āϏিāϞেāĻি: ę ꠣꠋę ꠥę ę ę ęę ) āĻŦাংāϞাāĻĻেāĻļেāϰ āĻŦৃ
targets: āĻাāĻ্āĻুāϝ়াāϰ āĻšাāĻāϰ |
template_id: 0
============================================================
#20
> Dataset: Adversarial QA (T)
inputs: "āĻĢিāϞ্āĻĄ-āĻāĻĢেāĻ্āĻ" āĻāĻāĻি āϞেāĻŦেāϞ āϝা āĻāĻāĻি āϧāϰāύেāϰ āĻŦāϰ্āĻŖāύা āĻāϰāϤে āĻŦ্āϝāĻŦāĻšৃāϤ āĻšāϝ়? āĻĒূāϰ্āĻŦāĻŦāϰ্āϤী āĻĒ্āϰāĻļ
targets: āĻĻুāĻি āϧāϰāĻŖেāϰ āĻ্āϰাāύāĻিāϏ্āĻāϰ āϰāϝ়েāĻে, āϝা āĻāĻāĻি āϏাāϰ্āĻিāĻে āĻীāĻাāĻŦে āĻŦ্āϝāĻŦāĻšৃāϤ āĻšāϝ় āϤাāϰ āϏাāĻŽাāύ্āϝ āĻĒ
template_id: 1
============================================================
#21
> Dataset: PIQA (T)
inputs: āύিāĻŽ্āύāϞিāĻিāϤ āĻŦাāĻ্āϝāĻি āϏāϰ্āĻŦোāϤ্āϤāĻŽ āĻŦিāĻāϞ্āĻĒেāϰ āϏাāĻĨে āĻļেāώ āĻāϰুāύঃ.āĻুāĻিāύি āĻļীāϰ্āώāĻুāϞি āĻĻিāϝ়ে āĻী āĻ
targets: āϤাāĻĻেāϰ āϞāĻŦāĻŖাāĻ্āϤ āĻĒাāύিāϤে āϏাāĻĻা āĻāϰুāύ āĻāĻŦং āϤাāĻĻেāϰ āĻāĻāĻি āϏ্āύ্āϝাāĻ āĻšিāϏাāĻŦে āĻĒāϰিāĻŦেāĻļāύ āĻāϰুāύ
template_id: 1
============================================================
#22
> Dataset: Flan-unified-QA (T)
inputs: āĻোāύ āĻŦিāĻŦৃāϤি āϏāĻ িāĻāĻাāĻŦে āĻŦāϰ্āĻŖāύা āĻāϰে āϝে āĻেāύ āĻāĻĒ্āϰিāϞেāϰ āĻĒ্āϰāϤিāĻি āĻĻিāύে āĻāϞাāϏ্āĻাāϤে ⧍ā§Ē āĻāĻŖ্āĻাāϰāĻ
targets: (āĻ)
template_id: 1
============================================================
#23
> Dataset: News-summary-instruct
inputs: āĻāϰো āĻāĻŽ āĻļāĻŦ্āĻĻে āĻŦাāĻ্āϝāĻিāϰ āĻŽূāϞāĻাāĻŦ āĻŦāϰ্āĻŖāύা āĻāϰ: āϏ্āĻ্āϝাāύ্āĻĄাāϰ্āĻĄ āĻাāϰ্āĻাāϰ্āĻĄ āĻŦ্āϝাংāĻেāϰ āύāϤুāύ āĻĒ্
targets: āĻŦাāĻ্āϝāĻিāϰ āϏংāĻ্āώিāĻĒ্āϤ āĻŽূāϞāĻাāĻŦ āĻšāϞো, āϏ্āĻ্āϝাāύ্āĻĄাāϰ্āĻĄ āĻাāϰ্āĻাāϰ্āĻĄেāϰ āύāϤুāύ āϏিāĻāĻ āĻāĻŦāϰাāϰ।
template_id: 1
============================================================
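Because every row records its source, the collection can be narrowed to a single origin, e.g. only the human-annotated 'Aya-Dataset' rows. A minimal sketch (the field name `dataset_name` is an assumption based on the listing above; verify it against the dataset card):

```python
def filter_by_dataset(rows, name="Aya-Dataset"):
    """Yield only rows originating from the given source dataset.

    The 'dataset_name' field is assumed from the collection's schema;
    'Aya-Dataset' selects the original human-annotated examples.
    """
    for row in rows:
        if row.get("dataset_name") == name:
            yield row


# Toy usage with rows shaped like the examples above:
rows = [
    {"dataset_name": "Aya-Dataset", "task_type": "-"},
    {"dataset_name": "PIQA (T)", "task_type": "generation"},
]
human_rows = list(filter_by_dataset(rows))  # keeps only the first row
```

The same generator composes naturally with a streamed split, so the human-annotated subset can be extracted without downloading the full collection.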
Templates
There appear to be 16 different templates used for the templated data. To quote Cohere, "Templated data: We collaborated with fluent speakers to create templates that allowed for the automatic expansion of existing datasets into various languages."
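The number of distinct templates can be checked by collecting the `template_id` values across rows (field name again an assumption based on the examples above):

```python
def distinct_template_ids(rows):
    """Collect the set of template ids seen across an iterable of row dicts."""
    return {row["template_id"] for row in rows}


# The examples shown earlier use template ids 0 through 3:
sample = [{"template_id": i} for i in (1, 0, 2, 1, 3)]
ids = distinct_template_ids(sample)  # {0, 1, 2, 3}
```

Run over the full collection, this should yield the 16 template ids mentioned above.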
Conclusion
The Aya collection is a huge multilingual dataset that can be quite useful for language-specific training and fine-tuning. How do you plan to use the Aya dataset/collection?
Further references:
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [Paper]
- Aya collection language split [Colab Notebook; used by this blog post]