Topic
Multimodal
Vision-language models, audio AI, and cross-modal capabilities
Featured

14 things Claude Opus 4.7 actually does better than Opus 4.6 and Sonnet 4.6
All Stories

Pulse AI and Bedrock Cut Financial Document Processing From Days to Hours
AWS and Pulse AI have demonstrated a financial document processing pipeline that combines Pulse's document…

Thinking Machines Previews Full-Duplex AI for Real-Time Conversation
Thinking Machines, the AI startup founded by former OpenAI CTO Mira Murati and researcher John Schulman, has unveiled a…

Amazon Nova Multimodal Embeddings Unlock Cross-Modal Search for Manufacturing Docs
Amazon has released Nova Multimodal Embeddings, a model that maps text, images, and document pages into a shared vector…

OpenAI Adds Reasoning to Realtime Voice Models
OpenAI has released new realtime voice models available through its API that can reason, translate, and transcribe…

Pet Camera Startup Cuts Inference Costs with AWS Inferentia2
Tomofun, maker of the Furbo pet camera, migrated its vision-language model inference from GPU-based EC2 instances to…

ChatGPT Images 2.0 gains traction in India, lags elsewhere
ChatGPT Images 2.0 has gained significant traction among Indian users who are leveraging the tool to create personal…

Google TV Adds Gemini Photo and Video Tools
Google TV is integrating additional Gemini AI features, including photo and video transformation capabilities powered…

Poly-DPO and ViPO: Scaling Visual Preference Optimization
Researchers introduced Poly-DPO, an algorithmic extension to preference optimization that adds a polynomial term to…

NVIDIA Nemotron 3 Nano Omni Consolidates Multimodal AI for Agents
NVIDIA and AWS announced day-zero availability of Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a…

Audio-Omni Unifies Generation and Editing Across Sound, Music, Speech
Researchers have introduced Audio-Omni, a unified framework that combines audio understanding, generation, and editing…

New Multilingual Medical AI Benchmark Reveals Language and Vision Gaps
Researchers have developed EuropeMedQA, a multilingual and multimodal medical examination dataset drawn from official…

ActorMind Brings Emotional Speech Role-Playing to AI
Researchers have introduced ActorMind, a reasoning framework that enables AI models to perform speech role-playing by…

Web Video as Training Data for 3D Scene Understanding
Researchers demonstrate that unlabeled internet videos can be automatically processed into training data for 3D scene…

MIT's AromaGen Generates Custom Scents from Text Using LLMs
Researchers at MIT and collaborators have developed AromaGen, an AI-powered wearable that generates custom scents from…

Multimodal LLM for Materials Science Accelerates Discovery
Researchers publishing in Nature Machine Intelligence have introduced MatterChat, a multimodal framework that combines material…

Multimodal AI Models Tackle Healthcare's Data Silos
AWS is positioning itself as a unified platform for deploying multimodal biological foundation models that integrate…

Lightweight Model Beats GPT-4o at Robot Gesture Prediction
Researchers have developed a lightweight transformer model that generates co-speech gestures for robots by predicting…

Comic Strips Bypass Safety in Multimodal AI Models
Researchers have identified a new class of jailbreak attacks against multimodal large language models that embed…

Vision Models Outperform LLMs for Time Series Anomaly Detection
Researchers propose VAN-AD, a framework that adapts visual Masked Autoencoders pretrained on ImageNet for time series…

OpenAI Releases ChatGPT Images 2.0 with Text, Infographics, and UI Generation
OpenAI has released ChatGPT Images 2.0, a significant upgrade to its image generation capabilities that can now produce…

Specialized Vision-Language Model Beats Human Experts at Factory Defect Detection
Researchers have developed AD-Copilot, a vision-language model specialized for industrial anomaly detection that…

DeepSeek-OCR Adapted for Molecular Recognition, Hits Sequence-Model Limits
Researchers adapted DeepSeek-OCR-2 to recognize molecular structures in 2D chemical diagrams by framing the task as…

OpenAI Tests Photorealistic Image Model to Drive ChatGPT Growth
OpenAI is testing a new image generation model, internally referred to as 'gpt-image-2,' that produces photorealistic…

FlowCoMotion Bridges Semantic and Motion Fidelity in Text-to-Motion
Researchers propose FlowCoMotion, a text-to-motion generation framework that combines continuous and discrete motion…

NVIDIA Releases Fast Multilingual OCR Model Trained on Synthetic Data
NVIDIA released Nemotron OCR v2, a multilingual optical character recognition model trained on 12 million synthetic…

Sentence Transformers Adds Multimodal Embeddings and Reranking
Sentence Transformers v5.4 adds multimodal embedding and reranking capabilities, allowing developers to encode and…

NVIDIA Cosmos Reason 2 Tops Physical AI Leaderboards
NVIDIA released Cosmos Reason 2, an open-source reasoning vision-language model designed to improve how robots and AI…

Google DeepMind Upgrades Gemini Robotics for Spatial Reasoning
Google DeepMind released Gemini Robotics ER 1.6, an update focused on improving spatial reasoning and multi-view…