Aug 16, 2024

Alpine Ibex Delta

Introducing AlpineIbexDelta: a large multi-step instruct dataset full of code problems and solutions spanning multiple programming languages.

Dataset Summary

Alpine-Ibex-Delta is a synthetically generated dataset for supervised fine-tuning using the new Codestral-22B model.

The dataset contains multi-step Q&A for writing and solving code problems. All explanations are generated in a short and to-the-point style to reduce verbosity and save programmers time.

🤗 Codestral-22B

Data Generation

The dataset was generated using a single 4x4090 machine:

  • Generating the question-answer pairs took ~ 168 hours.

  • Computing the embeddings & assessing quality ~ 1 hour.

Data Structure

The data has the following structure:

  • id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format: dataset:UUID.

  • messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.

  • prompt: The generated instruct prompt in the messages array.

The Dataset

Download Dataset

  • HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under the Apache 2.0 license.

If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.