Microsoft has unveiled three new models in its Phi 3.5 series, which it says outperform rival models across a range of tasks and benchmarks. They succeed the original Phi-3 family, which debuted in April 2024.
“Our model is designed to accelerate research on language and multimodal models for use as a building block for generative AI-powered features,” Microsoft stated.
Performance and Capabilities of Phi 3.5 Models
The Phi 3.5-mini-instruct model, with 3.8 billion parameters, is optimized for environments with memory and compute constraints, making it well suited for quick reasoning tasks. It is an evolution of the June 2024 instruction-tuned Phi-3 Mini, benefiting from additional post-training data, and is particularly effective at code generation, mathematical problem-solving, and logic-based reasoning.
Despite its compact size, the Phi 3.5-mini-instruct model outperforms larger models such as Meta’s Llama 3.1 and Mistral 7B on benchmarks like RepoQA. It also offers multilingual support, having been trained on a curated set of languages that includes major global languages such as English, Chinese, and Arabic.
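For developers who want to experiment, the instruction-tuned model can be run with the Hugging Face transformers library. The sketch below is a minimal, illustrative example that assumes the published model ID microsoft/Phi-3.5-mini-instruct, a recent transformers release, and a GPU with enough memory; consult the official model card for the exact prompt format and recommended generation settings.

```python
# Minimal sketch: running Phi 3.5-mini-instruct with Hugging Face transformers.
# The model ID and settings below reflect common usage of the Phi-3 family;
# check the official model card for the recommended configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 3.8B parameters fit on a single modern GPU
    device_map="auto",
    trust_remote_code=True,       # may be unnecessary on recent transformers versions
)

# The instruction-tuned model expects a chat-formatted prompt.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```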
The Flagship Phi 3.5-MoE-instruct Model
The Phi 3.5-MoE-instruct model is the flagship of the series, boasting 42 billion parameters and utilizing a Mixture of Experts (MoE) architecture. This design allows only the most relevant sub-models to activate for a given task, optimizing both performance and resource usage.
It excels particularly in tasks that require deep, context-aware understanding and decision-making. Microsoft claims that this model outperforms Google’s Gemini 1.5 Flash in several benchmarks, including BigBench, MMLU, and ARC Challenge. However, it falls short compared to OpenAI’s GPT-4o-mini.
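To make the routing idea concrete, the toy sketch below implements a generic top-k Mixture of Experts layer in PyTorch. It is not Microsoft’s implementation; the layer sizes, the number of experts, and the top-2 routing are illustrative assumptions chosen to show how only a few experts run for each token.

```python
# Illustrative top-k Mixture-of-Experts layer in PyTorch -- a toy sketch of the
# general technique, not Phi 3.5-MoE's actual architecture or dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (batch, seq, d_model)
        scores = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                     # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Only top_k experts contribute to each token, so compute scales with the
# active experts rather than the full parameter count.
layer = ToyMoELayer()
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

Scaling this idea up is what lets a model with 42 billion total parameters keep the compute profile of a much smaller one.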
Microsoft acknowledges that the Phi 3.5-MoE-instruct model has limitations in storing extensive factual knowledge, stating, “The model simply does not have the capacity to store too much factual knowledge, therefore, users may experience factual incorrectness.”
To mitigate this, they suggest augmenting the model with a search engine when used in Retrieval-Augmented Generation (RAG) settings.
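That suggested pattern looks roughly like the sketch below: retrieve supporting passages first, then place them in the prompt so factual content comes from the retrieved text rather than the model’s weights. The tiny in-memory corpus and keyword-overlap retriever are stand-ins for a real search engine or vector store and are not part of any Microsoft API.

```python
# Rough RAG pattern: retrieve supporting passages, then build a prompt so the
# model answers from retrieved text rather than its internal knowledge.
# The in-memory corpus and keyword-overlap retriever are illustrative stand-ins
# for a real search engine or vector store.

CORPUS = [
    "Phi 3.5-mini-instruct has 3.8 billion parameters.",
    "Phi 3.5-MoE-instruct uses a Mixture of Experts architecture.",
    "Phi 3.5-vision-instruct processes both text and images.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank passages by how many words they share with the query."""
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda p: len(words & set(p.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using only the context below. If it is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# The finished prompt is then passed to the model (e.g. model.generate as in
# the earlier snippet), so factual claims are grounded in retrieved passages.
print(build_rag_prompt("How many parameters does Phi 3.5-mini-instruct have?"))
```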
Advancements in Multimodal AI
Expanding into the realm of multimodal AI, the Phi 3.5-vision-instruct model integrates text and image processing capabilities. With 4.2 billion parameters, it is adept at handling tasks such as detailed image comparison, multi-image summarization/storytelling, and video summarization. Its support for a 128K token context length allows it to maintain coherence over long sequences of input.
According to Microsoft, the model outperforms rivals such as Claude 3.5 Sonnet and GPT-4o-mini on several visual benchmarks despite having far fewer parameters. Its architecture combines an image encoder, a connector, a projector, and the Phi-3 Mini language model, making it suitable for complex visual tasks.
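A typical way to query the vision model is again through the transformers library. The sketch below is illustrative only: the model ID microsoft/Phi-3.5-vision-instruct, the <|image_1|> image placeholder, and the processor call follow the pattern documented for the Phi-3 vision family, but the exact format and settings should be checked against the official model card.

```python
# Illustrative sketch: asking Phi 3.5-vision-instruct about a single image via
# Hugging Face transformers. The <|image_1|> placeholder and processor call
# follow the Phi-3 vision pattern; verify against the official model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")  # any local image file
messages = [{"role": "user", "content": "<|image_1|>\nSummarize what this image shows."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```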
Open-Source Release and Industry Impact
Training these models required significant computational resources. For instance, the Phi 3.5-mini-instruct model was trained on 3.4 trillion tokens over ten days using 512 Nvidia H100-80G GPUs. The Phi 3.5-MoE-instruct model was trained on 4.9 trillion tokens over 23 days with the same number of GPUs. The Phi 3.5-vision-instruct model was trained on 500 billion tokens over six days using 256 Nvidia A100-80G GPUs.
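Using only those reported figures, a quick back-of-envelope calculation gives a sense of the sustained throughput the smallest model’s run required:

```python
# Back-of-envelope throughput for the Phi 3.5-mini-instruct run, using only the
# reported figures: 3.4 trillion tokens, 10 days, 512 Nvidia H100-80G GPUs.
tokens = 3.4e12
gpus = 512
seconds = 10 * 24 * 3600  # ten days of wall-clock time

per_gpu_per_second = tokens / (gpus * seconds)
print(f"~{per_gpu_per_second:,.0f} tokens per GPU per second")  # ~7,686
```

That is on the order of 7,700 tokens streaming through every GPU each second, sustained around the clock for ten days.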
Microsoft’s decision to release these models under an open-source MIT license is aligned with its commitment to advancing AI technology while ensuring accessibility for developers globally. This move is expected to spur widespread adoption, particularly in fields that require advanced AI capabilities but may lack the resources to develop models from scratch.
Overall, Microsoft’s Phi 3.5 models set a new bar for compact, resource-efficient AI, but the race to surpass OpenAI’s GPT-4o-mini remains ongoing.