KSII Transactions on Internet and Information Systems
Monthly Online Journal (eISSN: 1976-7277)

Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages

Vol. 18, No. 11, November 30, 2024
DOI: 10.3837/tiis.2024.11.001

Abstract

Texts in low-resource languages, including those of the Dravidian language family, have complex morphological structures that pose substantial challenges for large language models. Although transformer models have proven effective in numerous applications, these morphological features leave low-resource languages underrepresented. To address this problem, we present the Tokenization Stability Index (TSI), a new metric that objectively captures the differences and similarities between tokenization techniques. TSI assesses token stability, the degree of vocabulary integration, multi-token matching, and the overall ratio of total tokens to unique tokens. We offer a robust mathematical overview, theoretical implications, and case studies to show that TSI provides a reliable framework for improving transformer models for low-resource languages. Custom tokenization techniques were developed and tested on Tamil text inputs. The modified BERT model significantly surpassed the baseline and IndicBERT models, illustrating further potential for refining tokenization frameworks to enhance text-processing accuracy for Dravidian and other low-resource languages.
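As a rough illustration of the kind of quantities the abstract mentions, the sketch below compares two toy tokenizers on a short Tamil sample by counting total tokens, unique tokens, and tokens per word. It is not the paper's TSI formula, which the abstract does not state. The function names (`whitespace_tokenize`, `chunk_tokenize`, `token_stats`), the fixed-size chunking scheme, and the sample text are all assumptions made for illustration.

```python
def whitespace_tokenize(text):
    """Split on whitespace: one token per word."""
    return text.split()


def chunk_tokenize(text, size=3):
    """Hypothetical subword tokenizer: cut each word into fixed-size chunks.

    Chunks are counted in Unicode code points, so Tamil vowel signs may be
    separated from their base consonants. A real tokenizer would not do this;
    the function only stands in for "a tokenizer that fragments words".
    """
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def token_stats(text, tokenize):
    """Return simple token counts for one tokenizer on one text.

    These are illustrative statistics only, not the TSI components
    as defined in the paper.
    """
    words = text.split()
    tokens = tokenize(text)
    unique = len(set(tokens))
    return {
        "total_tokens": len(tokens),
        "unique_tokens": unique,
        "total_to_unique": len(tokens) / unique if unique else 0.0,
        "tokens_per_word": len(tokens) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    # Three inflected forms of the Tamil word for "house" (assumed sample).
    sample = "வீடு வீட்டில் வீட்டுக்கு"
    for name, tokenizer in [("whitespace", whitespace_tokenize),
                            ("chunk", chunk_tokenize)]:
        print(name, token_stats(sample, tokenizer))
```

Comparing the two rows shows how a tokenizer that fragments inflected forms raises the tokens-per-word figure, which is the sort of difference between tokenization techniques that a metric like TSI is meant to quantify.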


Cite this article

[IEEE Style]
V. N and A. N, "Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages," KSII Transactions on Internet and Information Systems, vol. 18, no. 11, pp. 3109-3128, 2024. DOI: 10.3837/tiis.2024.11.001.

[ACM Style]
Venkatesan N and Arulanand N. 2024. Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages. KSII Transactions on Internet and Information Systems, 18, 11, (2024), 3109-3128. DOI: 10.3837/tiis.2024.11.001.

[BibTeX Style]
@article{tiis:101544, title="Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages", author="Venkatesan N and Arulanand N", journal="KSII Transactions on Internet and Information Systems", DOI={10.3837/tiis.2024.11.001}, volume={18}, number={11}, year="2024", month={November}, pages={3109-3128}}