Tok Pisin (tpi) subword embeddings

Vocab size  vocab  model  25 dim                 50 dim                 100 dim                200 dim                300 dim
1000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
3000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
5000        vocab  model  txt | bin              txt | bin              txt | bin              txt | bin              txt | bin
                          bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix  bokeh | umap | matrix
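
The txt downloads listed above can be read without special tooling if they follow the common word2vec text layout. The sketch below rests on that assumption, which this page does not confirm: a header line giving the row count and dimension, then one subword per line followed by its vector components. The function name `load_w2v_text` and the toy data are hypothetical and only illustrate the idea.

```python
import io


def load_w2v_text(handle):
    """Parse embeddings stored as word2vec text.

    ASSUMPTION: line 1 is "<count> <dim>"; every later line is
    "<token> <v1> ... <vdim>", separated by whitespace.
    """
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        token, values = fields[0], [float(x) for x in fields[1:]]
        if len(values) != dim:
            raise ValueError(f"{token!r}: expected {dim} values, got {len(values)}")
        vectors[token] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} rows, found {len(vectors)}")
    return vectors


# Hypothetical three-row, two-dimensional stand-in for a real txt file.
TOY_FILE = "3 2\n▁long 0.1 0.2\n▁na 0.3 0.4\nuk -0.5 0.6\n"

vectors = load_w2v_text(io.StringIO(TOY_FILE))
print(len(vectors), vectors["▁na"])
```

For a downloaded file, the same function would take an open file handle, for example `open(path, encoding="utf-8")`, in place of the `io.StringIO` object.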

Embedding matrix plots

Training corpus sample, encoded with different BPE vocabulary sizes

Vocab size  tpiwiki sample
original    wanpela famili bilong tokples emi wanpela grup bilong ol tokples. olgeta tokples insait long wanpela famili igat wanpela tumbuna tokples tasol long ta
            long septemba 0000, duke na duchess long chambridge (duk na dases long kaimbris) i go lukim ol ailan long makim namba 00 yia bihain ilisabet i tanim l
            dispela namba em i makim olsem wanpela ten tu pesen (00%) bilong olgeta tokples long wol, na i no gat narapela kantri long wol i winim dispela namba b
1000        ▁wanpela ▁famili ▁bilong ▁tokples ▁emi ▁wanpela ▁gr up ▁bilong ▁ol ▁tokples . ▁olgeta ▁tokples ▁insait ▁long ▁wanpela ▁famili ▁igat ▁wanpela ▁tu mb una ▁tokples ▁tasol ▁long ▁ta
            ▁long ▁se pt emba ▁0000, ▁d uk e ▁na ▁du ch es s ▁long ▁ch am b ri d ge ▁( d uk ▁na ▁d as es ▁long ▁k a im b ris ) ▁i ▁go ▁luk im ▁ol ▁ailan ▁long ▁makim ▁namba ▁00 ▁yia ▁bihain ▁i lis ab et ▁i ▁t an im ▁l
            ▁dispela ▁namba ▁em ▁i ▁makim ▁olsem ▁wanpela ▁ten ▁tu ▁p esen ▁( 00 % ) ▁bilong ▁olgeta ▁tokples ▁long ▁wol , ▁na ▁i ▁no ▁gat ▁narapela ▁kantri ▁long ▁wol ▁i ▁winim ▁dispela ▁namba ▁b
3000        ▁wanpela ▁famili ▁bilong ▁tokples ▁emi ▁wanpela ▁grup ▁bilong ▁ol ▁tokples . ▁olgeta ▁tokples ▁insait ▁long ▁wanpela ▁famili ▁igat ▁wanpela ▁tumbuna ▁tokples ▁tasol ▁long ▁ta
            ▁long ▁septemba ▁0000, ▁d uk e ▁na ▁duch ess ▁long ▁ch am bri d ge ▁( d uk ▁na ▁d ases ▁long ▁ka im bris ) ▁i ▁go ▁lukim ▁ol ▁ailan ▁long ▁makim ▁namba ▁00 ▁yia ▁bihain ▁i lis abet ▁i ▁tanim ▁l
            ▁dispela ▁namba ▁em ▁i ▁makim ▁olsem ▁wanpela ▁ten ▁tu ▁p esen ▁(00 % ) ▁bilong ▁olgeta ▁tokples ▁long ▁wol , ▁na ▁i ▁no ▁gat ▁narapela ▁kantri ▁long ▁wol ▁i ▁winim ▁dispela ▁namba ▁b
5000        ▁wanpela ▁famili ▁bilong ▁tokples ▁emi ▁wanpela ▁grup ▁bilong ▁ol ▁tokples . ▁olgeta ▁tokples ▁insait ▁long ▁wanpela ▁famili ▁igat ▁wanpela ▁tumbuna ▁tokples ▁tasol ▁long ▁ta
            ▁long ▁septemba ▁0000, ▁duk e ▁na ▁duch ess ▁long ▁ch ambridge ▁( d uk ▁na ▁d ases ▁long ▁kaim bris ) ▁i ▁go ▁lukim ▁ol ▁ailan ▁long ▁makim ▁namba ▁00 ▁yia ▁bihain ▁ilisabet ▁i ▁tanim ▁l
            ▁dispela ▁namba ▁em ▁i ▁makim ▁olsem ▁wanpela ▁ten ▁tu ▁pesen ▁(00%) ▁bilong ▁olgeta ▁tokples ▁long ▁wol , ▁na ▁i ▁no ▁gat ▁narapela ▁kantri ▁long ▁wol ▁i ▁winim ▁dispela ▁namba ▁b
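
The encoded samples show two things that a few lines of Python can make concrete: the "▁" marker records where a word starts, so the pieces can be joined back into the original text, and a larger vocabulary splits the same text into fewer pieces. The sketch below uses a fragment of the second sample sentence copied from the table; the helper name `detokenize` is hypothetical and not part of any published tooling.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that prefixes word-initial pieces


def detokenize(pieces):
    """Join subword pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


# The same fragment of the second sample sentence, copied from the table.
ENCODED = {
    1000: "▁d uk e ▁na ▁du ch es s ▁long ▁ch am b ri d ge",
    3000: "▁d uk e ▁na ▁duch ess ▁long ▁ch am bri d ge",
    5000: "▁duk e ▁na ▁duch ess ▁long ▁ch ambridge",
}

piece_counts = {}
for vocab_size, encoded in ENCODED.items():
    pieces = encoded.split()
    piece_counts[vocab_size] = len(pieces)
    # Print vocabulary size, number of pieces, and the recovered text.
    print(vocab_size, len(pieces), detokenize(pieces))
```

All three encodings recover the same text, while the piece count falls from 15 to 12 to 8 as the vocabulary grows.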