Aug 1, 2024

Modified Aug 7, 2024

Hello Logic

Introducing Cyber Logic: a comprehensive collection of questions, answers, and explanations aimed at fostering a deep understanding of cybersecurity, networking, and IT support.

Dataset Summary

Cyber-Logic is a synthetically generated dataset for supervised fine-tuning using the new Mistral-NeMo model, together with other Mistral models like Mixtral-8x7b and Mixtral-8x22b.

The dataset contains questions and explanations for various tech-based tasks like cybersecurity, networking, and IT support. All explanations in this dataset are generated in the COT (Chain of Thought) style.

🤗 Mixtral 8x22B

Data Generation

The dataset was generated using a single 4x4090 machine:

  • Generating the question-answer pairs took ~ 482 hours

  • Generating the question-explanation pairs took ~ 40 hours.

  • Computing the embeddings, assessing quality, and classifying the questions into the three categories ~ 38 hours.

Data Structure

The data has the following structure:

  • id: The ID of the prompt. This is for error tracing and data validation. The prompt ID is structured in the following format: dataset:UUID.

  • messages: The generated instruct prompt and explanation are stored in an OpenAI-compatible dictionary structure.

  • prompt: The generated instruct prompt in the messages array.

The Dataset

Download Dataset

  • HuggingFace: In the interest of free and opensource development of AI models, the entire dataset has been published to HuggingFace free of charge under the Apache 2.0 license.

  • GitHub: For those studying for Network+ or Security+, the questions, answers, and explanations have been published in markdown on GitHub. If you have any questions or problems related to this dataset's questions or explanations, don’t hesitate to contact me.