Topic

Multimodal

Vision-language models, audio AI, and cross-modal capabilities

Featured

AI Agents

NVIDIA Releases Nemotron 3 Nano Omni, Unifying Multimodal AI at 9x Efficiency

19 days ago · NVIDIA Blog (AI)

Prompt Engineering

14 things Claude Opus 4.7 actually does better than Opus 4.6 and Sonnet 4.6

28 days ago · Manual Paste

All Stories

AI for Business

Pulse AI and Bedrock Cut Financial Document Processing From Days to Hours

AWS and Pulse AI have demonstrated a financial document processing pipeline that combines Pulse's document…

3 days ago · AWS Machine Learning Blog

Model Releases · Trending

Thinking Machines Previews Full-Duplex AI for Real-Time Conversation

Thinking Machines, the AI startup founded by former OpenAI CTO Mira Murati and researcher John Schulman, has unveiled a…

5 days ago · VentureBeat AI

Model Releases

Amazon Nova Multimodal Embeddings Unlock Cross-Modal Search for Manufacturing Docs

Amazon has released Nova Multimodal Embeddings, a model that maps text, images, and document pages into a shared vector…

6 days ago · AWS Machine Learning Blog

Voice & Video AI · Trending

OpenAI Adds Reasoning to Realtime Voice Models

OpenAI has released new realtime voice models available through its API that can reason, translate, and transcribe…

9 days ago · OpenAI

AI Hardware

Pet Camera Startup Cuts Inference Costs with AWS Inferentia2

Tomofun, maker of the Furbo pet camera, migrated its vision-language model inference from GPU-based EC2 instances to…

11 days ago · AWS Machine Learning Blog

Generative AI

ChatGPT Images 2.0 gains traction in India, lags elsewhere

ChatGPT Images 2.0 has gained significant traction among Indian users who are leveraging the tool to create personal…

16 days ago · TechCrunch AI

Generative AI · Trending

Google TV Adds Gemini Photo and Video Tools

Google TV is integrating additional Gemini AI features, including photo and video transformation capabilities powered…

17 days ago · TechCrunch AI

Generative AI

Poly-DPO and ViPO: Scaling Visual Preference Optimization

Researchers introduced Poly-DPO, an algorithmic extension to preference optimization that adds a polynomial term to…

17 days ago · ArXiv (cs.AI)

AI Agents

NVIDIA Nemotron 3 Nano Omni Consolidates Multimodal AI for Agents

NVIDIA and AWS announced day-zero availability of Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a…

18 days ago · AWS Machine Learning Blog

Generative AI

Audio-Omni Unifies Generation and Editing Across Sound, Music, Speech

Researchers have introduced Audio-Omni, a unified framework that combines audio understanding, generation, and editing…

19 days ago · ArXiv (cs.AI)

Data & Training

New Multilingual Medical AI Benchmark Reveals Language and Vision Gaps

Researchers have developed EuropeMedQA, a multilingual and multimodal medical examination dataset drawn from official…

20 days ago · ArXiv (cs.AI)

AI Agents

ActorMind Brings Emotional Speech Role-Playing to AI

Researchers have introduced ActorMind, a reasoning framework that enables AI models to perform speech role-playing by…

20 days ago · ArXiv (cs.AI)

Data & Training

Web Video as Training Data for 3D Scene Understanding

Researchers demonstrate that unlabeled internet videos can be automatically processed into training data for 3D scene…

20 days ago · ArXiv (cs.AI)

Generative AI

MIT's AromaGen Generates Custom Scents from Text Using LLMs

Researchers at MIT and collaborators have developed AromaGen, an AI-powered wearable that generates custom scents from…

20 days ago · ArXiv (cs.AI)

AI for Business

Multimodal LLM for Materials Science Accelerates Discovery

Researchers writing in Nature Machine Intelligence have introduced MatterChat, a multimodal framework that combines material…

23 days ago · Nature Machine Intelligence

AI for Business

Multimodal AI Models Tackle Healthcare's Data Silos

AWS is positioning itself as a unified platform for deploying multimodal biological foundation models that integrate…

23 days ago · AWS Machine Learning Blog

AI Agents

Lightweight Model Beats GPT-4o at Robot Gesture Prediction

Researchers have developed a lightweight transformer model that generates co-speech gestures for robots by predicting…

23 days ago · ArXiv (cs.AI)

AI Risk & Security

Comic Strips Bypass Safety in Multimodal AI Models

Researchers have identified a new class of jailbreak attacks against multimodal large language models that embed…

23 days ago · ArXiv (cs.AI)

Multimodal

Vision Models Outperform LLMs for Time Series Anomaly Detection

Researchers propose VAN-AD, a framework that adapts visual Masked Autoencoders pretrained on ImageNet for time series…

24 days ago · ArXiv (cs.AI)

Multimodal · Trending

OpenAI Releases ChatGPT Images 2.0 with Text, Infographics, and UI Generation

OpenAI has released ChatGPT Images 2.0, a significant upgrade to its image generation capabilities that can now produce…

25 days ago · VentureBeat AI

AI for Business

Specialized Vision-Language Model Beats Human Experts at Factory Defect Detection

Researchers have developed AD-Copilot, a vision-language model specialized for industrial anomaly detection that…

25 days ago · ArXiv (cs.AI)

Model Releases

DeepSeek-OCR Adapted for Molecular Recognition, Hits Sequence-Model Limits

Researchers adapted DeepSeek-OCR-2 to recognize molecular structures in 2D chemical diagrams by framing the task as…

25 days ago · ArXiv (cs.AI)

Model Releases · Trending

OpenAI Tests Photorealistic Image Model to Drive ChatGPT Growth

OpenAI is testing a new image generation model, internally referred to as 'gpt-image-2,' that produces photorealistic…

26 days ago · The Information

Generative AI

FlowCoMotion Bridges Semantic and Motion Fidelity in Text-to-Motion

Researchers propose FlowCoMotion, a text-to-motion generation framework that combines continuous and discrete motion…

26 days ago · ArXiv (cs.AI)

Multimodal

NVIDIA Releases Fast Multilingual OCR Model Trained on Synthetic Data

NVIDIA released Nemotron OCR v2, a multilingual optical character recognition model trained on 12 million synthetic…

27 days ago · Hugging Face Blog

Open Source

Sentence Transformers Adds Multimodal Embeddings and Reranking

Sentence Transformers v5.4 adds multimodal embedding and reranking capabilities, allowing developers to encode and…

29 days ago · Hugging Face Blog

Voice & Video AI

NVIDIA Cosmos Reason 2 Tops Physical AI Leaderboards

NVIDIA released Cosmos Reason 2, an open-source reasoning vision-language model designed to improve how robots and AI…

29 days ago · Hugging Face Blog

AI Agents

Google DeepMind Upgrades Gemini Robotics for Spatial Reasoning

Google DeepMind released Gemini Robotics ER 1.6, an update focused on improving spatial reasoning and multi-view…

29 days ago · Google DeepMind