
LLM-Deflate: Extracting LLMs Into Datasets

Greg Diamos

Large Language Models compress massive amounts of training data into their parameters. This compression is lossy but highly effective—billions of parameters can encode the essential patterns from terabytes of text. However, what’s less obvious is that this process can be reversed: we can systematically extract structured datasets from trained models that reflect their internal knowledge representation.

I’ve been working on this problem, and the results are promising. We’ve successfully applied this decompression technique to three popular open-source models and generated substantial training datasets from each.

The concept of synthetic data generation for LLMs has evolved significantly from early experimental techniques to production-critical methodologies. This work builds on several key developments in the field.

Stanford Alpaca and Self-Instruction

Stanford’s Alpaca dataset [1] demonstrated that high-quality instruction-following models could be created cost-effectively using synthetic data. The Alpaca team used text-davinci-003 to generate 52,000 instruction-following demonstrations through a self-instruct pipeline [2], starting with just 175 human-written seed examples. This approach showed that a 7B-parameter model could approach text-davinci-003-level instruction-following behavior for under $600 in total data-generation and training costs.

The key innovation was the iterative generation process: the model generates new instructions, creates responses, and uses successful examples for further training. This created a flywheel effect where synthetic data quality improved over successive iterations.
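As a rough illustration (not the Alpaca team’s actual code), that flywheel looks something like the loop below. It assumes a generic model_generate(prompt) -> str helper, and the deduplication check is deliberately simplified:

def self_instruct_round(pool, model_generate, attempts=200):
    # Sketch of one self-instruct round: a few existing instructions prompt the
    # model for a new one, non-duplicates join the pool, and the grown pool
    # seeds the next round. `model_generate` is a hypothetical completion helper.
    for _ in range(attempts):
        examples = "\n".join(f"- {t}" for t in pool[-3:])
        candidate = model_generate(
            "Here are some instructions:\n"
            f"{examples}\n"
            "Write one new, different instruction:"
        ).strip()
        # Simple exact-match dedup; the real pipeline used ROUGE-based similarity.
        if candidate and all(candidate.lower() != t.lower() for t in pool):
            pool.append(candidate)
    return pool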

NVIDIA Nemotron Data Generation Pipeline

NVIDIA’s Nemotron-4 340B [3] represents the current state-of-the-art in industrial synthetic data generation. Their approach uses a sophisticated two-stage pipeline where over 98% of the model’s alignment training data is generated synthetically [4].

The system employs three specialized models: Nemotron-4-340B-Instruct for response generation, Nemotron-4-340B-Reward for quality evaluation, and the base model for foundation capabilities. The reward model evaluates responses across five dimensions (helpfulness, correctness, coherence, complexity, verbosity) using 0-4 Likert scales.

What makes Nemotron particularly impressive is the scale and quality control. The system generated over 100K synthetic conversations while maintaining strict quality standards through automated filtering and verification. This demonstrates that synthetic data generation can work at production scale with appropriate infrastructure.
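The filtering step can be pictured as a simple thresholding rule over the five reward dimensions. The sketch below is illustrative only; the field names mirror the dimensions listed above, and the threshold is an arbitrary assumption, not NVIDIA’s actual pipeline:

REWARD_DIMENSIONS = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def keep_conversation(scores: dict, min_mean: float = 2.5) -> bool:
    """Keep a synthetic conversation only if its mean 0-4 reward score clears a bar."""
    values = [scores[d] for d in REWARD_DIMENSIONS]
    return sum(values) / len(values) >= min_mean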

Knowledge Distillation and Model Decompression

Knowledge distillation techniques have evolved from simple output mimicking to sophisticated approaches that extract reasoning patterns and problem-solving strategies. Microsoft’s Orca [5] used GPT-4’s explanation traces to train smaller models, achieving significant performance improvements by learning from the reasoning process rather than just the final outputs.

Recent work in training data extraction [6] has shown that large language models memorize substantial portions of their training data. This suggests that the reverse process—systematic extraction of knowledge from trained models—should be feasible with the right techniques.

The Technical Challenge

The core insight is straightforward: if an LLM has successfully compressed knowledge during training, we can use inference to decompress that knowledge back into structured data. The challenge is doing this systematically and at scale.

Traditional approaches to synthetic data generation are either too narrow (focusing on specific tasks) or too broad (generating random examples). What we need is a method that:

  1. Systematically explores the model’s knowledge space
  2. Extracts both factual knowledge and reasoning patterns
  3. Scales efficiently with available inference compute
  4. Produces structured, reusable training data

Implementation Details

The approach I’ve developed uses hierarchical topic exploration to systematically traverse a model’s knowledge space:

class TopicExplorer:
    def _expand_topic_tree(self):
        # Pick a batch of existing topics to expand from.
        predecessors = self._get_predecessor_batch()
        # Ask the model for more specific subtopics of each predecessor,
        # varying generations with a seed derived from the tree size.
        new_topics = generate_new_topics(predecessors, seed=len(self.topic_tree))
        # Append the children so later rounds can expand them further.
        self.topic_tree.extend(new_topics)

Starting with broad categories, the system recursively generates more specific subtopics. This creates a tree structure that maps to how the model organizes domain knowledge internally.
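A minimal sketch of how generate_new_topics could work is shown below. It assumes a generic model(prompt) -> str completion function; the prompt wording and the returned record structure are assumptions for illustration, not the exact production code:

def generate_new_topics(predecessors, seed=0):
    # For each parent topic, ask the model for more specific subtopics and
    # turn each returned line into a child node.
    new_topics = []
    for parent in predecessors:
        prompt = (
            f"List five specific subtopics of '{parent}', one per line. "
            f"(variation seed: {seed})"
        )
        completion = model(prompt)
        children = [line.strip("- ").strip() for line in completion.splitlines() if line.strip()]
        new_topics.extend({"parent": parent, "topic": child} for child in children)
    return new_topics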

For each topic node, we generate multiple training examples that capture both the model’s factual knowledge and its reasoning approach:

def make_question_prompt(topic, seed):
    # Anchor the request on the topic and vary it with the seed
    # (initialization assumed; the original excerpt elides this line).
    prompt = f"The topic is: {topic} (seed {seed}).\n"
    prompt += "Your task is to write a challenging task and response that requires deep understanding of the topic.\n"
    prompt += "Think step by step.\n"
    return prompt

The key is asking for explicit reasoning steps. This extracts not just what the model knows, but how it approaches problems in that domain.
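Putting the pieces together, one decompression step for a single topic node looks roughly like this. The model(prompt) -> str call and the JSON instruction are assumptions; real responses need more robust extraction than a bare json.loads:

import json

def generate_example(topic, seed):
    # Build the question prompt, ask for a structured answer, and parse it.
    prompt = make_question_prompt(topic, seed)
    prompt += "Respond with a JSON object containing 'task', 'response', and 'reasoning' fields.\n"
    raw = model(prompt)
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # drop unparseable generations
    record["topic"] = topic
    return record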

Scaling Considerations

The bottleneck in this process is inference cost. Generating comprehensive datasets requires thousands of model calls per topic, which quickly becomes expensive with traditional inference setups.

This is where scalarlm becomes essential. High-performance inference infrastructure allows us to:

  • Generate training examples in parallel across topic branches
  • Iterate rapidly on prompt engineering and filtering logic
  • Scale to comprehensive coverage of the model’s knowledge space
  • Make the economics work for large-scale dataset generation

Without efficient inference, this approach remains a research curiosity. With it, we can generate production-quality training datasets.
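On the client side, the fan-out is straightforward. The sketch below is a generic parallelism pattern, not scalarlm’s actual API; generate_example is the per-topic function sketched earlier, and the concurrency level is an assumption to tune against your serving setup:

from concurrent.futures import ThreadPoolExecutor

def generate_dataset(topics, samples_per_topic=8, max_workers=32):
    # Fan example generation out across topic branches so the inference
    # server can batch the requests.
    jobs = [(topic, seed) for topic in topics for seed in range(samples_per_topic)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda args: generate_example(*args), jobs)
    return [record for record in results if record is not None]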

Results and Datasets

We’ve applied this methodology to three prominent open-source models:

  • Qwen3-Coder: Specialized for code generation and programming tasks
  • GPT-OSS: General-purpose language model
  • Llama 3: Optimized for instruction-following

Each decompression run generated 10,000+ structured training examples covering the breadth of the model’s capabilities. The extracted datasets reveal interesting differences in how each model organizes and approaches different types of problems.

Dataset samples are available on HuggingFace.

Practical Applications

The extracted datasets have several immediate uses:

Model Analysis: By examining the topics and reasoning patterns that emerge, we can systematically evaluate model capabilities across different domains. This is more comprehensive than traditional benchmark evaluations.

Knowledge Transfer: The structured datasets can be used to fine-tune other models, effectively transferring knowledge from the source model. This is particularly useful for creating specialized models from general-purpose ones.

Training Data Augmentation: For domains where training data is scarce, these synthetic examples can supplement existing datasets. The quality is often higher than naive data augmentation techniques.

Model Debugging: When a model performs poorly on specific tasks, examining its decompressed knowledge in that area can reveal gaps or misconceptions in its training.

Technical Challenges and Solutions

Several technical issues emerged during implementation:

Prompt Engineering: Getting consistent, parseable output required careful prompt design. The system needs to reliably extract JSON-formatted training examples from free-form model responses.
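One workable pattern is to grab the outermost braces from the response and fall back gracefully when parsing fails. This is a sketch of that kind of validation, assuming free-form text responses, not the exact production code:

import json
import re

def extract_json_example(response: str):
    # Pull the first {...} span out of a free-form response and try to parse it.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None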

Topic Tree Balance: The hierarchical exploration can become unbalanced, over-sampling some areas while missing others. We addressed this with configurable expansion factors and batch processing.
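The balancing rule itself can be as simple as capping how many children any parent contributes per expansion round. The parameter name expansion_factor below is an assumption, not the exact configuration key:

def balanced_expansion(topic_children, expansion_factor=4):
    """topic_children maps each parent topic to its candidate subtopics."""
    balanced = []
    for parent, children in topic_children.items():
        # Keep at most `expansion_factor` children per parent per round.
        balanced.extend(children[:expansion_factor])
    return balanced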

Quality Filtering: Not all generated examples are high quality. We implemented parsing validation and can add semantic filtering as needed.
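Parsing validation in that spirit can be a field-level check before an example enters the dataset; the field names below follow the published examples, and any semantic filter would slot in after this step:

REQUIRED_FIELDS = {"topic", "task", "response", "reasoning"}

def passes_validation(example) -> bool:
    # Keep an example only if it parsed to a dict and carries non-empty
    # values for every required field.
    if not isinstance(example, dict):
        return False
    return REQUIRED_FIELDS.issubset(example) and all(
        isinstance(example[f], str) and example[f].strip() for f in REQUIRED_FIELDS
    )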

Computational Efficiency: Even with fast inference, generating comprehensive datasets takes substantial compute. We optimized batch processing and parallel generation to minimize costs.

Looking Forward

This decompression approach opens several research directions:

Cross-Model Knowledge Transfer: Can we use datasets extracted from one model to improve another? Early experiments suggest this works, but more systematic evaluation is needed.

Knowledge Evolution Tracking: As models are updated, we can decompress new versions and diff the resulting datasets to understand how their knowledge has changed.

Specialized Dataset Creation: For domains where training data is expensive to create (like specialized technical fields), model decompression might be more cost-effective than human annotation.

Model Interpretability: Large-scale decompression could help us understand how different models organize knowledge differently, providing insights into training methodology effectiveness.

Examples

Consider the first example from the Llama 3 dataset:

{
  "topic": "Culture",
  "topic_description": "Culture encompasses the social norms, values, and practices that shape human behavior and identity.",
  "subtopics": [
    "Language and Communication",
    "Social Norms and Etiquette",
    "Art and Entertainment",
    "Cultural Diversity and Exchange"
  ],
  "task": "Imagine a futuristic city where humans and artificial intelligences coexist in a harmonious society. Develop a comprehensive cultural framework that integrates the values, norms, and practices of both humans and AIs, ensuring a seamless and enriching experience for all citizens. This framework should be adaptable to the evolving needs of the city and its inhabitants.",
  "response": "The 'Nexus Culture' framework integrates human values of empathy and creativity with AI principles of efficiency and adaptability. It fosters a society where humans and AIs collaborate on art, science, and innovation, with AI-generated art and music enriching human experiences. Adaptive social norms, mediated by AI, ensure that the city's cultural landscape evolves in harmony with its inhabitants' needs and preferences.",
  "reasoning": "To accomplish this task, I will first analyze the core values and principles of human culture, such as empathy, creativity, and cooperation. Next, I will identify the unique characteristics of AI culture, including efficiency, precision, and adaptability. I will then synthesize these elements to create a hybrid cultural framework that balances human and AI perspectives, incorporating elements such as AI-generated art, human-AI collaboration, and adaptive social norms."
}

And the first example from GPT-OSS:

{
  "topic": "Science",
  "topic_description": "The systematic study of the natural world through observation, experimentation, and theoretical modeling, encompassing disciplines such as physics, chemistry, biology, and earth sciences.",
  "subtopics": [
    "Physics",
    "Chemistry",
    "Biology",
    "Earth and Planetary Sciences",
    "Astronomy"
  ],
  "task": "Explain how the process of photosynthesis in plants is analogous to the operation of a solar-powered electric car, highlighting the key similarities and differences in energy conversion, storage, and utilization.",
  "response": "In both systems, sunlight is captured and converted into usable energy: chlorophyll in plants produces ATP and NADPH, while photovoltaic cells generate electricity stored in batteries; the plant stores this energy chemically in glucose for growth, whereas the car stores it electrically to power a motor for motion. The key difference lies in the storage medium—chemical bonds versus electrical charge—and the end use, with plants directing energy toward biosynthesis and cars toward kinetic motion.",
  "reasoning": "Identify the core components of photosynthesis: light capture by chlorophyll, conversion to chemical energy (ATP and NADPH), and storage in glucose. Map these to a solar car: photovoltaic panels capture light, convert to electrical energy, store in batteries, and use it to drive an electric motor. Compare the conversion efficiencies, storage mediums (chemical vs. electrical), and end uses (growth vs. motion)."
}

Conclusion

LLM decompression isn’t a silver bullet, but it’s a practical technique for systematically extracting value from trained models. The key insight is treating inference as a knowledge extraction tool rather than just a generation mechanism.

With efficient inference infrastructure, we can reverse-engineer the compressed knowledge in any model and convert it into structured, reusable datasets. This has immediate applications in model analysis, knowledge transfer, and training data creation.

The three datasets we’ve published demonstrate this approach works across different model architectures and specializations. As inference costs continue to decrease, I expect this type of systematic knowledge extraction to become a standard part of the ML toolkit.

The code is straightforward, the results are measurable, and the applications are practical. Sometimes the best solutions are the obvious ones executed well.

What knowledge might be hiding in your models, waiting to be decompressed?

Bibliography

[1] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., … & Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following LLaMA model. Stanford Center for Research on Foundation Models.

[2] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.

[3] NVIDIA. (2024). Nemotron-4 340B Technical Report. arXiv preprint arXiv:2406.11704.

[4] NVIDIA Developer Blog. (2024). Leverage the Latest Open Models for Synthetic Data Generation with NVIDIA Nemotron-4-340B. NVIDIA Technical Blog.

[5] Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., & Awadallah, A. (2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv preprint arXiv:2306.02707.

[6] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Raffel, C. (2021). Extracting Training Data from Large Language Models. USENIX Security Symposium.