Aug 16, 2024
Alpine Ibex Delta
Introducing AlpineIbexDelta: a large multi-step instruct dataset full of code problems and solutions spanning multiple programming languages.
Dataset Summary
Alpine-Ibex-Delta
is a synthetically generated dataset for supervised fine-tuning using the new Codestral-22B
model.
The dataset contains multi-step Q&A for writing and solving code problems. All explanations are generated in a short and to-the-point style to reduce verbosity and save programmers time.
Data Generation
The dataset was generated using a single 4x4090 machine:
Generating the question-answer pairs took ~ 168 hours.
Computing the embeddings & assessing quality ~ 1 hour.
Data Structure
The data has the following structure:
id
: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format:dataset:UUID
.messages
: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.prompt
: The generated instruct prompt in the messages array.
The Dataset
Download Dataset
HuggingFace
: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under theApache 2.0 license
.
If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.