Old English (ca. 450-1100) (ang) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
10000       vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
25000       vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
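
The txt downloads listed above can be read without any third-party library. The following is a minimal sketch, assuming the txt files use the common word2vec text layout (a header line with the entry count and the dimension, then one subword and its vector per line). That layout is an assumption, not something this page states. The function name `read_w2v_text` and the three sample vectors are made up for illustration and are not values from the real files.

```python
import io
from typing import Dict, List, TextIO


def read_w2v_text(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec-style text: a 'count dim' header, then 'token v1 ... vN' rows."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} rows, parsed {len(vectors)}")
    return vectors


# Hypothetical 2-dimensional sample standing in for a downloaded txt file.
SAMPLE = "3 2\n▁sc 0.1 -0.2\nēa 0.3 0.4\np -0.5 0.6\n"

vectors = read_w2v_text(io.StringIO(SAMPLE))
print(len(vectors), vectors["▁sc"])
```

For a real file, the same function would be called on `open(path, encoding="utf-8")`, where `path` points at whichever txt download was chosen.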

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size  angwiki sample
original scēap scēap is ǣnig syndrigra willenra cliferfētedēora, ac gemǣnlīcost þæt getemod scēap (''ovis aries''), þe gewēne ofcymþ of wildum urisc of sūþmidd
xhosa: * sūðaffrica (mid afrikaans, nīwum englisce, ndebele, northern sotho, sotho, swazi, tsonga, tswana, venda, zulu)
etymologists also try to reconstruct information about languages that are too old for any direct information (such as writing) to be known. by compari
1000 ▁sc ēa p ▁sc ēa p ▁is ▁ǣ n ig ▁s ynd r ig ra ▁will en ra ▁cl if er f ē t ed ēor a , ▁ac ▁gem ǣ n līc ost ▁þæt ▁ge te m od ▁sc ēa p ▁('' ov is ▁ar ies '' ), ▁þe ▁gew ē ne ▁of c y m þ ▁of ▁w ild um ▁ ur isc ▁of ▁sūþ mid d
▁ x h os a : ▁* ▁sū ð af f ric a ▁( mid ▁af ri k a ans , ▁nīw um ▁englisce , ▁n de b ele , ▁nor ther n ▁so th o , ▁so th o , ▁sw az i , ▁t s ong a , ▁t sw an a , ▁v enda , ▁z ul u )
▁e t y m ol og ist s ▁also ▁t ry ▁to ▁re c on st ru ct ▁in for m ation ▁ab o ut ▁lang u ag es ▁that ▁are ▁to o ▁o ld ▁for ▁an y ▁d ire ct ▁in for m ation ▁( s u ch ▁as ▁w rit ing ) ▁to ▁be ▁k n own . ▁by ▁comp ar i
3000 ▁scēa p ▁scēa p ▁is ▁ǣnig ▁syndrig ra ▁will en ra ▁cl if er f ē ted ēor a , ▁ac ▁gemǣn līc ost ▁þæt ▁ge tem od ▁scēa p ▁('' ov is ▁ar ies ''), ▁þe ▁gew ēne ▁of cy mþ ▁of ▁w ild um ▁ur isc ▁of ▁sūþ mid d
▁x h os a : ▁* ▁sūð aff rica ▁( mid ▁af ri ka ans , ▁nīwum ▁englisce , ▁n de b ele , ▁northern ▁so th o , ▁so th o , ▁sw az i , ▁t s ong a , ▁t sw ana , ▁v enda , ▁z ul u )
▁et ym olog ists ▁also ▁t ry ▁to ▁rec on st ru ct ▁in form ation ▁about ▁langu ages ▁that ▁are ▁to o ▁old ▁for ▁any ▁dire ct ▁in form ation ▁( s uch ▁as ▁writ ing ) ▁to ▁be ▁known . ▁by ▁comp ar i
5000 ▁scēa p ▁scēa p ▁is ▁ǣnig ▁syndrig ra ▁will en ra ▁cl if er f ē ted ēor a , ▁ac ▁gemǣn līc ost ▁þæt ▁ge tem od ▁scēa p ▁('' ov is ▁ar ies ''), ▁þe ▁gew ēne ▁of cy mþ ▁of ▁wild um ▁ur isc ▁of ▁sūþ midd
▁x h osa : ▁* ▁sūð affrica ▁( mid ▁af ri ka ans , ▁nīwum ▁englisce , ▁n de b ele , ▁northern ▁so tho , ▁so tho , ▁sw az i , ▁ts ong a , ▁t sw ana , ▁v enda , ▁z ul u )
▁et ym olog ists ▁also ▁t ry ▁to ▁rec on st ruct ▁in formation ▁about ▁languages ▁that ▁are ▁to o ▁old ▁for ▁any ▁direct ▁in formation ▁( s uch ▁as ▁writ ing ) ▁to ▁be ▁known . ▁by ▁comp ari
10000 ▁scēa p ▁scēa p ▁is ▁ǣnig ▁syndrig ra ▁will enra ▁cl ifer fē ted ēor a , ▁ac ▁gemǣn līc ost ▁þæt ▁ge tem od ▁scēa p ▁('' ov is ▁ar ies ''), ▁þe ▁gew ēne ▁of cymþ ▁of ▁wild um ▁ur isc ▁of ▁sūþ midd
▁xhosa : ▁* ▁sūðaffrica ▁( mid ▁afrikaans , ▁nīwum ▁englisce , ▁ndebele , ▁northern ▁sotho , ▁sotho , ▁swazi , ▁tsonga , ▁tswana , ▁venda , ▁zulu )
▁etym ologists ▁also ▁try ▁to ▁recon st ruct ▁information ▁about ▁languages ▁that ▁are ▁too ▁old ▁for ▁any ▁direct ▁information ▁( such ▁as ▁writing ) ▁to ▁be ▁known . ▁by ▁comp ari
25000 ▁scēap ▁scēap ▁is ▁ǣnig ▁syndrigra ▁will enra ▁cl ifer fē ted ēora , ▁ac ▁gemǣn līc ost ▁þæt ▁getem od ▁scēap ▁('' ov is ▁aries ''), ▁þe ▁gewēne ▁ofcymþ ▁of ▁wildum ▁ur isc ▁of ▁sūþ midd
▁xhosa : ▁* ▁sūðaffrica ▁( mid ▁afrikaans , ▁nīwum ▁englisce , ▁ndebele , ▁northern ▁sotho , ▁sotho , ▁swazi , ▁tsonga , ▁tswana , ▁venda , ▁zulu )
▁etym ologists ▁also ▁try ▁to ▁reconst ruct ▁information ▁about ▁languages ▁that ▁are ▁too ▁old ▁for ▁any ▁direct ▁information ▁( such ▁as ▁writing ) ▁to ▁be ▁known . ▁by ▁comp ari
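
In the encoded rows above, `▁` marks a piece that starts a new word, so the original text can be recovered by joining the pieces and turning each `▁` back into a space. The sketch below illustrates this on the first three words of the sample, with piece sequences copied from the 1000, 3000, and 25000 rows. The helper name `detokenize` is an illustrative choice, not part of any library.

```python
from typing import Dict, List

WORD_START = "\u2581"  # the "▁" marker that prefixes a word-initial piece


def detokenize(pieces: List[str]) -> str:
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


# First three words of the sample, as encoded at three vocabulary sizes.
ENCODED: Dict[int, str] = {
    1000: "▁sc ēa p ▁sc ēa p ▁is",
    3000: "▁scēa p ▁scēa p ▁is",
    25000: "▁scēap ▁scēap ▁is",
}

piece_counts: Dict[int, int] = {}
for vocab_size, line in ENCODED.items():
    pieces = line.split(" ")
    piece_counts[vocab_size] = len(pieces)
    print(vocab_size, len(pieces), detokenize(pieces))
```

The same three words take 7, 5, and 3 pieces respectively, which shows the trade-off between vocabulary sizes: a larger vocabulary yields shorter sequences, while a smaller one splits words into more, shorter pieces.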