Pali (pi) subword embeddings

Vocab size | Vocab | Model | 25 dim | 50 dim | 100 dim | 200 dim | 300 dim
1000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
3000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
5000 | vocab | model | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix | txt, bin; bokeh, umap, matrix
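
The txt downloads listed above can be read without any special library. The sketch below assumes the txt files use the word2vec text layout (a header line with the entry count and dimension, then one subword and its vector per line); that layout is an assumption and is not stated on this page. The function name read_w2v_text and the sample values are hypothetical and only illustrate the parsing step.

```python
from typing import Dict, Iterable, List, Tuple


def read_w2v_text(lines: Iterable[str]) -> Tuple[int, Dict[str, List[float]]]:
    """Parse embeddings in word2vec text layout (assumed, see lead-in).

    The first line holds "<entry count> <dimension>"; every following line
    holds a subword followed by <dimension> space-separated floats.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            # Skip blank or malformed lines instead of failing.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Hypothetical stand-in for the first lines of a downloaded txt file.
SAMPLE = [
    "3 4\n",
    "▁yā 0.1 0.2 0.3 0.4\n",
    "▁pana 0.5 0.6 0.7 0.8\n",
    "attaṃ -0.1 -0.2 -0.3 -0.4\n",
]

dim, vectors = read_w2v_text(SAMPLE)
print(dim, sorted(vectors))
```

With a real download, the same function could be called on an open file handle, for example `read_w2v_text(open(path, encoding="utf-8"))`, where `path` points at the extracted txt file.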

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size | piwiki sample
original {| border=0 align=right cellpadding=0 cellspacing=0 width=000 style="margin: 0 0 0em 0em; background: #f0f0f0; border: 0px #aaaaaa solid; border-colla
00. yā pana bhikkhunī purebhattaṃ kulāni upasaṅkamitvā āsane nisīditvā sāmike anāpucchā pakkameyya, pācittiyaṃ.
uddiṭṭhā satta adhikaraṇasamathā dhammā, ettakaṃ tassa bhagavato suttāgataṃ suttapariyāpannaṃ anvaddhamāsaṃ uddesaṃ āgacchati, tattha sabbāheva samagg
1000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
▁00. ▁yā ▁pana ▁bhikkhunī ▁pu re bh attaṃ ▁kul āni ▁up as aṅk am itvā ▁ā s an e ▁n is ī dit vā ▁sām ik e ▁an āp ucchā ▁pa kk am eyya , ▁pācittiyaṃ .
▁udd i ṭṭhā ▁s att a ▁ad hik araṇ asam athā ▁dhammā , ▁e tt akaṃ ▁tassa ▁bhagavat o ▁s utt ā gat aṃ ▁s utt ap ari yā p ann aṃ ▁an va dd ham ās aṃ ▁udd esaṃ ▁ā g ac ch ati , ▁tattha ▁sa bb ā heva ▁sama gg
3000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
▁00. ▁yā ▁pana ▁bhikkhunī ▁pure bhattaṃ ▁kulāni ▁upasaṅkamitvā ▁ā s ane ▁nis ī dit vā ▁sāmike ▁anāpucchā ▁pakkam eyya , ▁pācittiyaṃ .
▁uddiṭṭhā ▁satt a ▁adhikaraṇ asam athā ▁dhammā , ▁e tt akaṃ ▁tassa ▁bhagavato ▁sutt ā gataṃ ▁sutt apariyāpann aṃ ▁anvaddhamāsaṃ ▁udd esaṃ ▁ā g ac ch ati , ▁tattha ▁sabb ā heva ▁samagg
5000 ▁{| ▁border =0 ▁align = right ▁cellpadding =0 ▁cellspacing =0 ▁width =000 ▁style =" margin : ▁0 ▁0 ▁0 em ▁0 em ; ▁background : ▁# f 0 f 0 f 0; ▁border : ▁0 px ▁# aaaaaa ▁solid ; ▁border - c oll a
▁00. ▁yā ▁pana ▁bhikkhunī ▁purebhattaṃ ▁kulāni ▁upasaṅkamitvā ▁āsane ▁nis ī dit vā ▁sāmike ▁anāpucchā ▁pakkameyya , ▁pācittiyaṃ .
▁uddiṭṭhā ▁satta ▁adhikaraṇ asamathā ▁dhammā , ▁e tt akaṃ ▁tassa ▁bhagavato ▁sutt āgataṃ ▁suttapariyāpann aṃ ▁anvaddhamāsaṃ ▁uddesaṃ ▁āgacch ati , ▁tattha ▁sabbā heva ▁samagg
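
The effect of vocabulary size in the sample above can be checked with a few lines of code. The sketch below takes a short excerpt from the second sample line at each vocabulary size, counts the subword pieces, and joins them back into text. It assumes the usual convention that "▁" marks the start of a word; the helper names decode and pieces_per_word are hypothetical.

```python
from typing import List

WORD_START = "\u2581"  # the "▁" marker, assumed to flag the start of a word

# Excerpts copied from the encoded sample above, keyed by vocabulary size.
ENCODED = {
    1000: "▁yā ▁pana ▁bhikkhunī ▁pu re bh attaṃ ▁kul āni",
    3000: "▁yā ▁pana ▁bhikkhunī ▁pure bhattaṃ ▁kulāni",
    5000: "▁yā ▁pana ▁bhikkhunī ▁purebhattaṃ ▁kulāni",
}


def decode(pieces: List[str]) -> str:
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


def pieces_per_word(encoded: str) -> float:
    """Average number of subword pieces used per whitespace-separated word."""
    pieces = encoded.split()
    words = decode(pieces).split()
    return len(pieces) / len(words)


for vocab_size, encoded in ENCODED.items():
    pieces = encoded.split()
    print(vocab_size, len(pieces), round(pieces_per_word(encoded), 2), decode(pieces))
```

All three encodings decode to the same text; only the number of pieces changes, with larger vocabularies keeping more words whole.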