Aug 10, 2024

Marmottes Données

Introducing Marmottes Données: a large instruct dataset designed to give LLMs a modern understanding of cybersecurity events, surveys, threat mediation, and policies.

Dataset Summary

Marmottes-Données is a synthetically generated dataset for supervised fine-tuning using the new Mistral-large model, together with Mistral-8x22b.

The dataset contains Q&A for cybersecurity events, surveys, threat mediation, and policies. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.

🤗 Mistral Large

Data Generation

The dataset was generated using a single 4xA100 machine:

  • Crawling GPT-BOT allowed sites took ~ 36 hours

  • Generating the question-answer pairs took ~ 168 hours.

  • Assessing quality ~ 4 hours.

Data Structure

The data has the following structure:

  • id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format: dataset:UUID.

  • messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.

  • prompt: The generated instruct prompt in the messages array.

The Dataset

Download Dataset

  • HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under the Apache 2.0 license.

If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.