Aug 10, 2024
Marmottes Données
Introducing Marmottes Données: a large instruct dataset designed to give LLMs a modern understanding of cybersecurity events, surveys, threat mediation, and policies.
Dataset Summary
Marmottes-Données
is a synthetically generated dataset for supervised fine-tuning using the new Mistral-large
model, together with Mistral-8x22b
.
The dataset contains Q&A for cybersecurity events, surveys, threat mediation, and policies. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.
Data Generation
The dataset was generated using a single 4xA100 machine:
Crawling
GPT-BOT
allowed sites took ~ 36 hoursGenerating the question-answer pairs took ~ 168 hours.
Assessing quality ~ 4 hours.
Data Structure
The data has the following structure:
id
: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format:dataset:UUID
.messages
: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.prompt
: The generated instruct prompt in the messages array.
The Dataset
Download Dataset
HuggingFace
: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under theApache 2.0 license
.
If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.