Aug 10, 2024

Code Malade

Introducing Code Malade: a large instruct dataset designed to give LLMs a conversational understanding of code vulnerabilities & fixes.

Dataset Summary

Code-Malade is a synthetically generated dataset for supervised fine-tuning using the new Mistral-NeMo model, together with Mixtral-8x22b.

The dataset contains Q&A for finding and fixing code vulnerabilities. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.

🤗 Mistral NeMo

Data Generation

The dataset was generated using a single 4xA100 machine:

  • Generating the question-answer pairs took ~ 8 hours.

  • Computing the embeddings & assessing quality ~ 1 hour.

Thanks to CyberNative/Code_Vulnerability_Security_DPO the generation of this dataset was fairly straightforward since the code & classifications were already written.

Data Structure

The data has the following structure:

  • id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format: dataset:UUID.

  • messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.

  • prompt: The generated instruct prompt in the messages array.

The Dataset

Download Dataset

  • HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under the Apache 2.0 license.

If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.