Zero-Shot Tokenizer Transfer
Method for seamlessly swapping tokenizers to enhance model flexibility across languages and coding tasks.
Zero-Shot Tokenizer Transfer (ZeTT) is a method that introduces a hypernetwork training paradigm for substituting tokenizers on the fly, making models more flexible. It addresses the constraint that language models (LMs) are tied to the tokenizer they were trained with, which restricts their adaptability across languages and coding tasks.
At the core of ZeTT lies the challenge of finding embeddings for the tokens in a new tokenizer's vocabulary. Traditional heuristics for initializing embeddings often fall short in this zero-shot setting, so the authors instead train a hypernetwork that predicts the embeddings for any given input tokenizer. Empirical evaluations demonstrate the effectiveness of this approach, showing that it generalizes to diverse tokenizers across both encoder and decoder LMs.
The methodology involves defining distributions over texts and tokenizers, with an emphasis on sampling diverse tokenizers to foster generalization. For each training step, a tokenizer is constructed by sampling substrings from the text and building a UnigramLM over them. The loss is computed on a subset of tokens in each batch, and the hypernetwork's parameters are updated to minimize the LM's loss under the new tokenizer.
Importantly, the proposed method significantly reduces the length of tokenized sequences while maintaining performance close to that of the model with its original tokenizer. The remaining gap can be narrowed further through continued training on a relatively small token corpus. Notably, a ZeTT hypernetwork trained for a base LM can also be applied to fine-tuned variants of the same model without additional training, enhancing its practical applicability.