Google Launches Gemini 3 Pro: Next-Gen Multimodal AI That Reasons Spatially & Converts Documents to Code

by Nam Phong · December 9, 2025

Google has unveiled Gemini 3 Pro, a new generation of multimodal models that not only see images and video but reason about what is taking place within them. According to the company, it is Google’s most powerful visual and spatial AI to date: it sets new benchmark records in document understanding, screen comprehension, complex schematic analysis, and long-form video reasoning, and it is aimed at concrete applications ranging from education and medicine to law and finance.

One of the most significant advances in Gemini 3 Pro is its ability to understand real-world documents. Unlike polished textbook examples, real documents are often chaotic: photographed pages; interwoven images, tables, formulas, and diagrams; illegible handwriting; and convoluted layouts. The model pairs high-precision OCR with visual and logical analysis, enabling it not merely to read such documents but to reconstruct their structure as markup or code that renders them again: HTML, LaTeX, or Markdown. Demonstrations include reconstructing a complex handwritten table from an eighteenth-century trade journal, converting a photographed formula into valid LaTeX, and turning Florence Nightingale’s famous diagram into an interactive chart.

From there, deeper reasoning comes into play. Gemini 3 Pro can work through long reports step by step, correlating tables, charts, and narrative analysis. In one demonstration, the model parses the U.S. Census Bureau’s 62-page report Income in the United States: 2022. It locates the relevant Gini index tables for “money income” and “income after taxes,” compares year-over-year changes, and ties those trends to textual explanations — such as the expiration of crisis-relief programs and stimulus payments. It then inspects data on income share for the lowest quintile and determines whether that share rose or fell. On the CharXiv Reasoning benchmark for tasks of this type, Gemini 3 Pro even surpasses average human performance.
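For readers unfamiliar with the quantities involved, the following is a minimal Python sketch of the arithmetic behind such a comparison: a Gini coefficient computed from a list of incomes, and a year-over-year percent change between two index values. The function names and all sample numbers are illustrative assumptions; none of the figures come from the Census report, and this is not how the model itself performs the analysis.

```python
def gini(incomes):
    """Gini coefficient of non-negative incomes (0 = equal, near 1 = unequal)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero income")
    # Standard rank-weighted formula over incomes sorted in ascending order.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


def percent_change(old, new):
    """Year-over-year change of an index value, in percent."""
    return (new - old) / old * 100


# Made-up sample data, for illustration only.
sample_incomes = [12_000, 25_000, 40_000, 65_000, 150_000]
print(f"Gini of sample: {gini(sample_incomes):.3f}")
print(f"Change from 0.50 to 0.45: {percent_change(0.50, 0.45):.1f}%")
```

A falling index between two years indicates a more even income distribution, which is the kind of trend the model is described as linking back to the report's narrative.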

Its spatial reasoning has also been significantly strengthened. Gemini 3 Pro can identify the precise coordinates of objects in an image and operate over sequences of such points — enabling, for instance, pose estimation or trajectory tracking. The model uses an open vocabulary: one can ask, “Create a plan to clean up this messy desk and sort the trash,” and it will rely not on rigid taxonomies but on its understanding of the objects and their roles. Similarly, it can be embedded into AR/XR devices: a user may view a manual and ask the assistant, “Show me which screw the instructions refer to,” and the model highlights the correct object in the real scene.
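To make the idea of object coordinates concrete, here is a small Python sketch that maps model-reported points onto pixel positions in the original image. It assumes, as a convention not stated in this article, that points arrive as JSON entries of the form [y, x] normalized to a 0-1000 range with a text label; the helper name to_pixels and the sample response are hypothetical.

```python
import json


def to_pixels(points_json, width, height, scale=1000):
    """Convert normalized [y, x] points to pixel coordinates.

    Assumption: each entry looks like {"point": [y, x], "label": "..."}
    with coordinates normalized to the range 0..scale.
    """
    result = []
    for item in json.loads(points_json):
        y_norm, x_norm = item["point"]
        result.append({
            "label": item["label"],
            "x": round(x_norm / scale * width),
            "y": round(y_norm / scale * height),
        })
    return result


# Hypothetical response for a 1920x1080 image.
sample = '[{"point": [500, 250], "label": "screw"}]'
print(to_pixels(sample, width=1920, height=1080))
```

An AR overlay or a trajectory tracker would apply the same conversion to each point in a sequence before drawing or measuring anything.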

These same capabilities underpin its understanding of digital screens. Google notes that Gemini 3 Pro handles desktop and mobile interfaces with confidence and can act as the “engine” behind agents that perform routine computer actions. In a demonstration, the model interacts with an Excel spreadsheet: accurately clicking the required cells, creating a pivot table, and generating a revenue summary across promotion types on a separate sheet. This level of UI comprehension lends itself to automated testing, user training, and UX analytics.

Video receives special attention. Gemini 3 Pro has been optimized to process video at high frame rates: up to 10 FPS, ten times the default sampling rate. This is crucial for tasks requiring fine-grained motion analysis, such as examining the mechanics of an athletic movement. The enhanced “thinking mode” lets the model go beyond enumerating what appears on screen to infer causal relationships and explain why events unfold as they do. The model can also convert long videos into structured knowledge for downstream automation, extracting key information from lectures or tutorials and translating it directly into working code or formalized workflows.
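The effect of the sampling rate is easy to see with a little arithmetic. The Python sketch below lists the timestamps at which frames would be taken for a given clip length and rate; the function name is hypothetical, and the 1 FPS comparison value is an assumption about the default rate rather than something this article specifies.

```python
def frame_timestamps(duration_s, fps):
    """Timestamps (in seconds) of frames sampled at a fixed rate."""
    if fps <= 0 or duration_s < 0:
        raise ValueError("fps must be positive and duration non-negative")
    count = int(duration_s * fps)
    return [i / fps for i in range(count)]


# A 3-second golf swing: assumed 1 FPS default versus 10 FPS.
low = frame_timestamps(3, 1)
high = frame_timestamps(3, 10)
print(len(low), "frames at 1 FPS")
print(len(high), "frames at 10 FPS")
```

At the lower rate, a fast movement may fall entirely between two sampled frames, which is why the higher rate matters for motion analysis. It also means proportionally more frames to process.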

Google highlights a wide range of sector-specific applications. In education, improved visual reasoning helps students and teachers unpack math, physics, and chemistry problems involving diagrams or drawings, from elementary school to university level. The same technology powers Nano Banana Pro, which can, for example, mark on a photo of a student’s notebook the exact step where an error occurred and annotate the correction directly on the image rather than as dry text.

In medicine and biomedical research, Gemini 3 Pro is positioned as Google’s most capable general-purpose model for imaging tasks. It achieves state-of-the-art results on MedXpertQA-MM (advanced medical reasoning), VQA-RAD (radiology question-answering), and MicroVQA (microscopy image analysis). Demonstrations include interpreting high-magnification micrographs and linking the observed structures to diagnoses or experimental conditions.

Lawyers and financial specialists can use Gemini 3 Pro to dissect voluminous documents, contracts, and reports. Contract-management platforms can delegate complex revision scenarios with extensive redlines and footnotes to the model. Harvey.ai, a legal-AI startup, reports marked improvements in sophisticated legal reasoning and document comprehension — particularly valuable for corporate counsel handling large flows of internal and external agreements.

For developers, Gemini 3 Pro introduces major improvements in visual data handling. The model now preserves the original aspect ratio of images, enhancing overall understanding. A new media_resolution parameter allows users to control the resolution — and thus resource cost — at which images or videos are processed. High resolution benefits dense text, intricate documents, and complex scenes; lower resolution suits general scene recognition or long-context analysis where performance and cost are paramount.
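The trade-off described above can be sketched as a small helper that picks a resolution level per task. This is only an illustration in plain Python: the helper names, the task categories, and the enum-style strings MEDIA_RESOLUTION_HIGH and MEDIA_RESOLUTION_LOW are assumptions modeled on the parameter name in this article, and should be checked against the official API reference before use.

```python
# Task types that, per the article, benefit from high resolution.
# The category names themselves are hypothetical.
DENSE_TASKS = {"ocr", "document", "complex_scene"}


def pick_media_resolution(task):
    """Return an assumed resolution setting for a given task type."""
    if task in DENSE_TASKS:
        return "MEDIA_RESOLUTION_HIGH"  # dense text, intricate layouts
    return "MEDIA_RESOLUTION_LOW"       # general scenes, long contexts


def build_generation_config(task):
    """Build a config dict; the field name mirrors the article's parameter."""
    return {"media_resolution": pick_media_resolution(task)}


print(build_generation_config("document"))
print(build_generation_config("long_video"))
```

Keeping the choice in one function makes it easy to adjust the mapping later, for example if cost measurements show that a given task works well at a lower setting.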

Taken together, Gemini 3 Pro represents a shift from mere recognition to a fully fledged visual intelligence capable of linking images, text, and actions. Google anticipates that such multimodal systems will form the backbone of next-generation assistants and industry solutions — from warehouse robotics to legal platforms and educational tools.
