Aug 10, 2024
Code Malade
Introducing Code Malade: a large instruct dataset designed to give LLMs a conversational understanding of code vulnerabilities & fixes.
Dataset Summary
Code-Malade
is a synthetically generated dataset for supervised fine-tuning using the new Mistral-NeMo
model, together with Mixtral-8x22b
.
The dataset contains Q&A for finding and fixing code vulnerabilities. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.
Data Generation
The dataset was generated using a single 4xA100 machine:
Generating the question-answer pairs took ~ 8 hours.
Computing the embeddings & assessing quality ~ 1 hour.
Thanks to CyberNative/Code_Vulnerability_Security_DPO
the generation of this dataset was fairly straightforward since the code & classifications were already written.
Data Structure
The data has the following structure:
id
: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format:dataset:UUID
.messages
: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.prompt
: The generated instruct prompt in the messages array.
The Dataset
Download Dataset
HuggingFace
: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under theApache 2.0 license
.
If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.