Aug 10, 2024
Marmottes Données
Introducing Marmottes Données: a large instruct dataset designed to give LLMs a modern understanding of cybersecurity events, surveys, threat mediation, and policies.
Dataset Summary
Marmottes-Données is a synthetically generated dataset for supervised fine-tuning using the new Mistral-large model, together with Mistral-8x22b.
The dataset contains Q&A for cybersecurity events, surveys, threat mediation, and policies. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.
Data Generation
The dataset was generated using a single 4xA100 machine:
Crawling
GPT-BOTallowed sites took ~ 36 hoursGenerating the question-answer pairs took ~ 168 hours.
Assessing quality ~ 4 hours.
Data Structure
The data has the following structure:
id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format:dataset:UUID.messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.prompt: The generated instruct prompt in the messages array.
The Dataset
Download Dataset
HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under theApache 2.0 license.
If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.