Salesforce introduces XGen-7B: Revolutionizing Long Sequence Modeling with State-of-the-Art Language Models

Breakthrough 7B LLM Trained with an 8K Input Sequence Length Matches or Beats Similarly Sized Models and Extends Long-Context Possibilities

WHAT

The XGen project introduces XGen-7B, a series of 7B-parameter large language models (LLMs) trained with up to 8K sequence length. The models use standard dense attention and are fine-tuned on public-domain instructional data. They achieve comparable or better results than other state-of-the-art LLMs of similar size on standard NLP benchmarks, with strong performance on both text tasks (e.g., MMLU, QA) and code tasks (e.g., HumanEval).
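
For readers who want to try the released checkpoints, the sketch below shows one way to load a model with the Hugging Face transformers library and generate text. The Hub ID Salesforce/xgen-7b-8k-base and the need for trust_remote_code (for the custom tokenizer) are assumptions to verify against the official release; treat this as a minimal starting point, not official usage instructions.

    # Minimal sketch: load an XGen-7B checkpoint and generate a continuation.
    # The Hub ID below is an assumption; check the official release for the
    # exact model names and tokenizer requirements.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Salesforce/xgen-7b-8k-base"  # assumed Hugging Face Hub ID

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # With an 8K training context, the prompt can be a long document.
    prompt = "Summarize the following meeting transcript:\n..."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))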

WHY

Most existing open-source LLMs are trained with a maximum sequence length of around 2K tokens, which limits their ability to model long inputs. XGen-7B addresses this limitation by training with up to 8K sequence length, enabling the model to capture long-distance structural dependencies. The goal is better performance on tasks where long context is crucial, such as text summarization, code writing, and protein sequence prediction.

HOW IT WORKS

  • Pre-training Data: XGen-7B uses a two-stage training strategy with different data mixtures. The first stage incorporates datasets such as RedPajama-CommonCrawl, RedPajama-GitHub, Wikipedia, and others, totaling 1.37T tokens. In the second stage, additional code data from Starcoder is mixed with the Stage 1 data, for a combined 110B tokens in this stage (a hypothetical mixture-sampling sketch appears after this list).

  • Training Details: XGen-7B models are trained with the JaxFormer library, optimized for TPU-v4 hardware. The training process involves handling "loss spikes" and supports sequence lengths of up to 8,192 tokens. As the models adapt to longer sequences, perplexity continues to improve as more context becomes available.

  • Dense Attention: XGen-7B uses standard dense attention, in which every token can attend to every other token in the input sequence. This lets the model capture dependencies between distant tokens and consider long-distance structural relationships, which is crucial for tasks involving extended context (a minimal dense-attention sketch appears after this list).

  • Fine-tuning: After pre-training, XGen-7B models are fine-tuned on public-domain instructional data. This step further improves performance and yields results comparable or superior to other state-of-the-art LLMs on standard NLP benchmarks (a toy fine-tuning sketch appears after this list).
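
To make the two-stage data strategy concrete, here is a small illustrative sketch of sampling training examples from weighted dataset mixtures. The dataset names and token budgets come from the description above, but the mixture weights and helper function are purely hypothetical and not the actual XGen-7B recipe.

    # Hypothetical two-stage data mixture; the weights are illustrative only.
    import random

    STAGE1_MIX = {  # ~1.37T tokens, natural-language-heavy (weights assumed)
        "redpajama-commoncrawl": 0.60,
        "redpajama-github": 0.10,
        "wikipedia": 0.05,
        "other-sources": 0.25,
    }
    STAGE2_MIX = {  # ~110B tokens, Stage 1 data plus Starcoder code (weights assumed)
        "stage1-mixture": 0.50,
        "starcoder": 0.50,
    }

    def sample_source(mixture: dict) -> str:
        """Pick a data source with probability proportional to its weight."""
        names = list(mixture)
        weights = [mixture[name] for name in names]
        return random.choices(names, weights=weights, k=1)[0]

    # Draw a few example sources for each training stage.
    print([sample_source(STAGE1_MIX) for _ in range(5)])
    print([sample_source(STAGE2_MIX) for _ in range(5)])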
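
The dense-attention sketch below is a generic, single-head scaled dot-product attention written in PyTorch; it is not XGen-7B's actual implementation, but it shows why a full 8K-token context produces an 8,192 x 8,192 attention matrix per head and therefore quadratic cost in sequence length.

    # Generic dense (full) causal self-attention for one head; illustrative only.
    import math
    import torch

    def dense_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model) token states; returns (seq_len, d_model)."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = (q @ k.T) / math.sqrt(k.shape[-1])          # (seq_len, seq_len)
        # Causal mask: each position attends only to itself and earlier tokens.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # An 8K context means an 8192 x 8192 score matrix (~256 MB in float32).
    d_model = 64
    x = torch.randn(8192, d_model)
    w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
    print(dense_attention(x, w_q, w_k, w_v).shape)  # torch.Size([8192, 64])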
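
Below is a toy sketch of how instruction fine-tuning is commonly done for causal LLMs: an instruction and its reference response are concatenated and the model is trained with the standard next-token cross-entropy loss. The Hub ID, the single hand-written example, and the hyperparameters are assumptions for illustration; fine-tuning a 7B model in practice requires multi-GPU sharding and a real instruction dataset.

    # Toy instruction fine-tuning sketch (not the actual XGen recipe).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Salesforce/xgen-7b-8k-base"  # assumed Hugging Face Hub ID
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Illustrative instruction/response pair standing in for a public dataset.
    examples = [
        {"instruction": "List three tasks that benefit from long context.",
         "response": "Text summarization, code completion, and document QA."},
    ]

    model.train()
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["response"]
        batch = tokenizer(text, return_tensors="pt")
        # Setting labels equal to input_ids gives the causal LM loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()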

RESULTS

XGen-7B achieves strong results on standard benchmarks such as MMLU (Massive Multitask Language Understanding), where it outperforms similarly sized LLMs such as LLaMA, OpenLLaMA, Falcon, and MPT. It also performs well on general zero-shot NLP tasks involving commonsense reasoning and QA. The model checkpoints and codebase are publicly available for further research and exploration.

CONCLUSION

XGen-7B represents a significant advance in long-sequence modeling with LLMs. By training on sequences of up to 8K tokens with standard dense attention, the models deliver improved performance on a range of tasks compared to existing LLMs. The project opens new possibilities for applications that depend on modeling long dependencies, such as text summarization and code generation.