Multimodal AI, DeepSeek · 10 min read · Updated March 12, 2026

DeepSeek Multimodal Models: VL, Janus & JanusFlow Explained

Explore DeepSeek AI multimodal models from DeepSeek-VL to Janus and JanusFlow. Learn how each architecture advances vision-language understanding and generation.


DeepSeek AI's Innovations in Multimodal Understanding and Generation

TL;DR: Between March 2024 and January 2025, DeepSeek AI released four major open-source multimodal model families: DeepSeek-VL (hybrid vision encoder, 1.3B/7B parameters), DeepSeek-VL2 (Mixture-of-Experts with dynamic tiling), Janus (unified understanding and generation via decoupled visual encoding), and JanusFlow/Janus-Pro (JanusFlow combines an autoregressive language model with rectified flow for image generation; Janus-Pro scales the Janus recipe with optimized training, more data, and larger models). Each model is fully open-source and achieves competitive or state-of-the-art performance on vision-language benchmarks.

Key Takeaways

  • DeepSeek AI has released four major multimodal model families: DeepSeek-VL, DeepSeek-VL2, Janus, and JanusFlow/Janus-Pro -- all open-source.
  • DeepSeek-VL (March 2024) introduced a hybrid vision encoder for real-world vision-language understanding with 1.3B and 7B parameter models.
  • DeepSeek-VL2 (December 2024) added Mixture-of-Experts architecture with Multi-head Latent Attention and dynamic tiling for high-resolution images.
  • The Janus family (October 2024 onwards) pioneered unified multimodal understanding and generation by decoupling visual encoding pathways within a single transformer.
  • JanusFlow harmonizes autoregressive language models with rectified flow for image generation, while Janus-Pro scales these capabilities with optimized training and larger models.

DeepSeek AI has quickly become one of the most prolific contributors to open-source multimodal AI. From their first vision-language model in March 2024 to the unified understanding-and-generation capabilities of Janus-Pro in January 2025, DeepSeek has released a progression of models that push the boundaries of what open-source multimodal systems can do.

This article traces the full DeepSeek multimodal journey -- covering DeepSeek-VL, DeepSeek-VL2, Janus, JanusFlow, and Janus-Pro -- explaining the architectural innovations in each and how they build on one another. Whether you are evaluating these models for production use or studying multimodal AI architectures, this overview provides the technical context you need.

Below is a chronological overview of the research papers DeepSeek AI has published in multimodal understanding and generation.

The Foundational Vision-Language Model: DeepSeek-VL

Published in March 2024, the paper "DeepSeek-VL: Towards Real-World Vision-Language Understanding" laid the groundwork for DeepSeek AI's contributions to the field. The central theme of DeepSeek-VL was the development of an open-source VL model specifically designed for practical, real-world applications.

A key focus of DeepSeek-VL was the diversity and scalability of its training data. The researchers aimed to create a model that could effectively understand a wide range of visual and textual information encountered in everyday scenarios, including web screenshots, PDFs, OCR content, charts, and general knowledge. To enhance the model's usability, the team also developed a taxonomy of real-world user scenarios and used this to create an instruction-tuning dataset. This approach aimed to improve the model's ability to function effectively as a vision-language chatbot in practical settings.

Technically, DeepSeek-VL incorporated a hybrid vision encoder designed to efficiently process high-resolution images (1024 x 1024) without incurring excessive computational costs. Furthermore, the research emphasized the importance of preserving strong language abilities during the pretraining process. The pretraining strategy was carefully designed to integrate language model training from the outset, managing the interaction between visual and linguistic information to ensure that the model maintained proficiency in language-centric tasks.
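The hybrid design can be pictured as two branches: a low-resolution encoder for coarse semantics and a high-resolution encoder for fine detail such as small text in screenshots and charts, with the two feature streams fused per position. The toy sketch below uses average-pooling stubs in place of real vision backbones, and the grid size and feature dimensions are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

def pool_features(img, grid, dim):
    """Stub encoder: average-pool the image into a grid x grid map of
    dim-dimensional tokens (stands in for a real vision transformer)."""
    h, w, _ = img.shape
    th, tw = h // grid, w // grid
    tokens = []
    for i in range(grid):
        for j in range(grid):
            patch = img[i*th:(i+1)*th, j*tw:(j+1)*tw]
            tokens.append(np.resize(patch.mean(axis=(0, 1)), dim))
    return np.stack(tokens)                      # (grid*grid, dim)

def hybrid_encode(image_1024):
    # Low-resolution branch: crude ~3x downsample for coarse semantics.
    low = image_1024[::3, ::3]
    semantic = pool_features(low, grid=24, dim=1024)
    # High-resolution branch: full 1024x1024 input for fine details.
    detail = pool_features(image_1024, grid=24, dim=1024)
    # Fuse per position by concatenating the two feature streams.
    return np.concatenate([semantic, detail], axis=-1)

img = np.random.rand(1024, 1024, 3)
print(hybrid_encode(img).shape)  # (576, 2048)
```

The point of the two-branch layout is the cost trade-off the paper describes: only one branch pays the quadratic price of the full 1024x1024 input, while the semantic branch stays cheap.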

The DeepSeek-VL family included both 1.3 billion and 7 billion parameter models, both of which were made publicly accessible. The paper reported that these models achieved competitive or even state-of-the-art performance on various visual-language benchmarks while also maintaining robust language capabilities. The community engagement on the paper's Hugging Face page indicates interest in the model and its potential applications.

Advancing Performance with Mixture-of-Experts: DeepSeek-VL2

Building upon the success of DeepSeek-VL, DeepSeek AI introduced "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding" in December 2024. The primary goal of DeepSeek-VL2 was to significantly improve upon its predecessor by incorporating a Mixture-of-Experts (MoE) architecture.

The MoE approach was primarily applied to the language component of the model, leveraging DeepSeekMoE models with a Multi-head Latent Attention mechanism. This mechanism is designed to compress the key-value cache into latent vectors, thereby enabling more efficient inference and higher throughput. For the vision component, DeepSeek-VL2 introduced a dynamic tiling vision encoding strategy. This innovation was specifically aimed at improving the model's ability to process high-resolution images with varying aspect ratios, a common challenge in real-world visual inputs. The model was also trained on an improved vision-language dataset, further contributing to its enhanced performance.
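Dynamic tiling can be illustrated as: choose a tile grid whose aspect ratio best matches the input image under a tile budget, split the resized image into fixed-size local tiles, and append a downsampled global view. The sketch below is a simplified interpretation; the tile size (384), budget, candidate grids, and nearest-neighbour resize are all assumptions, not the exact DeepSeek-VL2 procedure:

```python
import numpy as np

def choose_grid(width, height, max_tiles=9):
    """Pick the (cols, rows) grid whose aspect ratio best matches the
    image, subject to a budget on the total number of tiles."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def tile_image(img, tile=384, max_tiles=9):
    """Split an arbitrary-resolution image into fixed-size local tiles
    plus one downsampled global view."""
    h, w = img.shape[:2]
    cols, rows = choose_grid(w, h, max_tiles)
    # Crude nearest-neighbour resize to the chosen grid resolution.
    ys = (np.arange(rows * tile) * h // (rows * tile)).clip(0, h - 1)
    xs = (np.arange(cols * tile) * w // (cols * tile)).clip(0, w - 1)
    resized = img[np.ix_(ys, xs)]
    tiles = [resized[r*tile:(r+1)*tile, c*tile:(c+1)*tile]
             for r in range(rows) for c in range(cols)]
    # Global thumbnail preserves overall layout alongside local detail.
    gys = (np.arange(tile) * h // tile).clip(0, h - 1)
    gxs = (np.arange(tile) * w // tile).clip(0, w - 1)
    return tiles + [img[np.ix_(gys, gxs)]]

wide = np.zeros((600, 1800, 3))          # 3:1 panoramic input
views = tile_image(wide)                 # 3 local tiles + 1 global view
print(len(views), views[0].shape)
```

Because every view has the same fixed resolution, the vision encoder runs on uniform inputs regardless of the original image's size or aspect ratio, which is what makes the strategy robust to real-world inputs.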

DeepSeek-VL2 was released in three variants with different activated parameter counts: Tiny (1.0B), Small (2.8B), and Base (4.5B). The research demonstrated that these models achieved superior capabilities across a range of multimodal tasks, including visual question answering, optical character recognition, document, table, and chart understanding, and visual grounding. Notably, DeepSeek-VL2 was reported to achieve competitive or state-of-the-art performance with similar or fewer activated parameters than other open-source dense and MoE-based models. Like its predecessor, the code and pretrained models for DeepSeek-VL2 were made publicly accessible, fostering further community research and development. The comments on the paper's Hugging Face page show users engaging with the model and its capabilities.

The Janus Family: Unified Multimodal Understanding and Generation

In parallel to the advancements in the DeepSeek-VL series, DeepSeek AI also introduced the Janus family of models, focusing on the ambitious goal of unified multimodal understanding and generation within a single framework. This family includes Janus, JanusFlow, and Janus-Pro, each building upon the previous work to create more versatile and capable multimodal AI systems.

Janus: Decoupling for Enhanced Performance

The first in this series, "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation," was published in October 2024. The core idea behind Janus was to address the potential conflict arising from using a single visual encoder for both understanding and generation tasks, which often have different requirements in terms of information granularity.

Janus introduced an autoregressive framework that unified these two functionalities by decoupling the visual encoding pathways into separate streams, one optimized for understanding and the other for generation. Despite this decoupling, Janus still utilized a single, unified transformer architecture for processing the information. This design allowed each component (understanding and generation) to independently select the most suitable encoding methods, enhancing the overall flexibility and performance of the model. Experiments reportedly showed that Janus surpassed previous unified models and matched or even exceeded the performance of task-specific models on various benchmarks. The authors highlighted the simplicity, flexibility, and effectiveness of Janus as key strengths, positioning it as a promising architecture for next-generation unified multimodal models. Community interest in Janus is evident from the comments and model citations on its Hugging Face page.
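Conceptually, the decoupling looks like this: understanding inputs pass through a continuous semantic encoder and a projection, generation inputs pass through a discrete tokenizer's own embedding table, and both token streams flow into one shared transformer. The sketch below uses toy dimensions and an identity stand-in for the backbone; all names and sizes are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # toy hidden size of the shared transformer

# Two independent visual pathways feeding one shared backbone.
W_und = rng.standard_normal((768, DIM)) * 0.02   # projection for understanding features
E_gen = rng.standard_normal((1024, DIM)) * 0.02  # embedding table for generation codes

def encode_for_understanding(patch_feats):
    """Continuous features from a semantic encoder, projected into the
    transformer's embedding space -- optimized for rich semantics."""
    return patch_feats @ W_und

def encode_for_generation(code_ids):
    """Discrete codes from a VQ-style tokenizer, looked up in their own
    embedding table -- optimized for pixel-level reconstruction."""
    return E_gen[code_ids]

def shared_transformer(tokens):
    """Stand-in for the unified autoregressive transformer: tokens from
    both pathways pass through the same backbone."""
    return tokens  # identity placeholder

und_tokens = shared_transformer(
    encode_for_understanding(rng.standard_normal((576, 768))))
gen_tokens = shared_transformer(
    encode_for_generation(rng.integers(0, 1024, size=256)))
print(und_tokens.shape, gen_tokens.shape)
```

The design choice being illustrated: each pathway can pick the representation its task needs (semantic features for understanding, reconstructable codes for generation), while weight sharing happens only in the transformer where the conflict between the two tasks is weakest.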

JanusFlow: Harmonizing Autoregression and Rectified Flow

Published shortly after in November 2024, "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation" took a different approach to unified multimodal capabilities. JanusFlow aimed to integrate autoregressive language models with rectified flow, a state-of-the-art generative modeling technique, within a minimalist architectural framework.

A key finding of this research was that rectified flow could be effectively trained within a large language model framework without necessitating complex architectural modifications. To further enhance the performance of their unified model, the researchers adopted two key strategies: decoupling the understanding and generation encoders (similar to the original Janus) and aligning their representations during unified training. This alignment process helped to ensure that the information learned by the understanding encoder was effectively utilized by the generation encoder. Extensive experiments reportedly demonstrated that JanusFlow achieved comparable or even superior performance to specialized models in their respective domains while significantly outperforming existing unified approaches across standard benchmarks. This work was presented as a significant step towards developing more efficient and versatile vision-language models. The Hugging Face page for JanusFlow shows community engagement and links to the model's code.
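The rectified-flow side can be summarized by its training objective: interpolate linearly between a noise sample x0 and a data sample x1 as x_t = (1 - t) x0 + t x1, and train the model to predict the constant velocity x1 - x0; generation then integrates the learned velocity field from noise to data. A minimal numpy sketch of both steps, with a zero-velocity placeholder standing in for a real network:

```python
import numpy as np

def rectified_flow_loss(x1, velocity_fn, rng):
    """One training step of the rectified-flow objective: sample noise x0
    and a time t, interpolate x_t = (1 - t)*x0 + t*x1, and regress the
    predicted velocity onto the straight-line target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)        # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # one time per example
    xt = (1 - t) * x0 + t * x1                # linear interpolation
    target = x1 - x0                          # constant velocity along the line
    pred = velocity_fn(xt, t)
    return np.mean((pred - target) ** 2)

def euler_sample(velocity_fn, shape, steps, rng):
    """Generation: integrate dx/dt = v(x, t) from noise at t=0 to t=1."""
    x = rng.standard_normal(shape)
    for i in range(steps):
        t = np.full((shape[0], 1), i / steps)
        x = x + velocity_fn(x, t) / steps
    return x

rng = np.random.default_rng(0)
data = rng.standard_normal((8, 16))
oracle = lambda xt, t: np.zeros_like(xt)      # placeholder "model"
loss = rectified_flow_loss(data, oracle, rng)
sample = euler_sample(oracle, (4, 16), steps=10, rng=rng)
print(float(loss), sample.shape)
```

Because the objective is just a regression on (x_t, t) pairs, the velocity network can be any function approximator -- which is the finding the paper highlights: a large language model framework can host it without complex architectural modifications.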

Janus-Pro: Scaling Data and Models for Advanced Capabilities

The latest evolution in the Janus family, "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling," was published in January 2025. Janus-Pro was explicitly presented as an advanced version of the original Janus model.

The significant advancements achieved by Janus-Pro were attributed to three main factors: an optimized training strategy, expanded training data, and scaling to a larger model size. Through these improvements, Janus-Pro reportedly achieved significant progress in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. The researchers expressed hope that this work would inspire further exploration in the field, and like the other DeepSeek AI models, the code and models for Janus-Pro were made publicly available. The Hugging Face page for Janus-Pro lists several models citing the paper, indicating community adoption and further development.

Overarching Themes and Future Directions

Collectively, the research papers from DeepSeek AI showcase a strong and consistent effort to push the boundaries of multimodal AI. Several overarching themes emerge from their work:

  • Focus on Real-World Applications: DeepSeek-VL explicitly targeted practical use cases, and the advancements in DeepSeek-VL2 and the Janus family are likely aimed at improving performance in real-world scenarios.
  • Unified Multimodal Architectures: The Janus family represents a significant investment in developing models that can seamlessly handle both understanding and generation of multimodal content.
  • Architectural Innovation: Each paper introduces novel architectural elements, such as MoE layers, dynamic tiling, decoupled encoding, and the integration of rectified flow.
  • Importance of Scale and Data: Janus-Pro highlights the critical role of data quality, training strategies, and model size in achieving state-of-the-art performance.
  • Commitment to Open Science: The public availability of code and pretrained models across all these projects underscores DeepSeek AI's commitment to fostering open research and development within the AI community.

The progression from specialized vision-language models like DeepSeek-VL to the more unified and advanced capabilities demonstrated by the Janus family highlights a clear trajectory towards more versatile, efficient, and practically applicable multimodal AI systems. DeepSeek AI's contributions are not only advancing the technical frontiers of the field but also enabling broader access and utilization of these powerful technologies through their open-source initiatives.

About the Author

Aaron is a senior software engineer and AI researcher specializing in generative AI, multimodal systems, and cloud-native AI infrastructure. He writes about cutting-edge AI developments, practical tutorials, and deep technical analysis at fp8.co.

Cite this Article

Aaron. "DeepSeek Multimodal Models: VL, Janus & JanusFlow Explained." fp8.co, March 15, 2025. https://fp8.co/articles/DeepSeek-AI-Journey-in-Multimodal-Understanding-and-Generation
