Aug 10, 2024

Marmottes Données

Introducing Marmottes Données: a large instruct dataset designed to give LLMs a modern understanding of cybersecurity events, surveys, threat mediation, and policies.

Dataset Summary

Marmottes-Données is a synthetically generated dataset for supervised fine-tuning using the new Mistral-large model, together with Mistral-8x22b.

The dataset contains Q&A for cybersecurity events, surveys, threat mediation, and policies. All explanations in this dataset are generated in a verbose markdown style to give a Chat-GPT-like feel.

🤗 Mistral Large

🤗 Mixtral 8x22B

Data Generation

The dataset was generated using a single 4xA100 machine:

Crawling GPT-BOT allowed sites took ~ 36 hours
Generating the question-answer pairs took ~ 168 hours.
Assessing quality ~ 4 hours.

Data Structure

The data has the following structure:

id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format: dataset:UUID.
messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.
prompt: The generated instruct prompt in the messages array.

The Dataset

Download Dataset

HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under the Apache 2.0 license.

If you have any questions or problems related to this dataset's Q&A, don’t hesitate to contact me.

🤗 HuggingFace

Marmottes Données

Dataset Summary

Data Generation

Data Structure

The Dataset

Download Dataset

Eoin Shearer

Socials

Contact