WonderZoom:

Cross-scale 3D Scene Generation

  • Anonymous Authors

Teaser Image

WonderZoom fuses real‑time rendering with multi‑scale 3D world generation. Starting from a single image, the system represents the scene using Multi-Scale Gaussian Surfels and synthesizes finer details on‑demand as you pan, zoom, or describe new content in text. This lets you dive seamlessly from sweeping landscapes down to microscopic objects while maintaining geometric and stylistic coherence. (Videos below are time‑lapsed recordings of this cross‑scale generation process.)

Generated Virtual World

Here are some examples of generated worlds with different camera-path styles: zoom-in, curvy, and rotating.

Interactive Viewing

Keyboard: Move with "W/A/S/D" and look around with "I/J/K/L". Press "F" to rotate around the center point. Click the "Scale 0/1/2" buttons to zoom in or out.
Touch Screen: Move with a one-finger drag; look around with a two-finger drag.
Note: Clicking an image automatically downloads a generated virtual-world example from an anonymous source; each example is about 100 MB, so the download may take a while. After loading, click on the canvas to activate the controls. Rendering runs in real time on your device. Because the web renderer does not support our proposed LoD and applies depth quantization, the rendering quality is noticeably degraded.
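To give a rough sense of why depth quantization costs quality, here is a minimal, hypothetical sketch of a quantize/dequantize round trip. The bit width (8 bits) and depth range are illustrative assumptions, not the actual parameters of the web renderer's splat format.

```python
import numpy as np

def quantize_depth(depth, d_min=0.1, d_max=100.0, bits=8):
    """Quantize depth to `bits` bits over [d_min, d_max] (illustrative assumption)."""
    levels = 2 ** bits - 1
    normalized = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    return np.round(normalized * levels).astype(np.uint16)

def dequantize_depth(codes, d_min=0.1, d_max=100.0, bits=8):
    """Recover approximate depth from the quantized codes."""
    levels = 2 ** bits - 1
    return d_min + (codes / levels) * (d_max - d_min)

# The round-trip error is a fixed fraction of the full depth range, so close-up
# (fine-scale) content loses proportionally the most precision.
depth = np.array([0.12, 1.5, 10.0, 75.0])
recovered = dequantize_depth(quantize_depth(depth))
print(np.abs(depth - recovered))
```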


Approach

WonderZoom begins with a single input image and progressively constructs a hierarchy of 3D scenes covering ever‑finer spatial scales. Two key innovations make this possible:

  • Multi-Scale Gaussian Surfels – a dynamically updatable representation that lets new fine-scale surfels be inserted without re-optimizing the entire scene, while still rendering in real time (see the sketch after this list).
  • Progressive detail synthesizer – an autoregressive module that leverages image, depth, and large-language models to hallucinate and register novel 3D content whenever the user zooms into a region or issues a text prompt (see the sketch below).
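This page does not ship code, but the following minimal Python sketch shows one way such a dynamically updatable multi-scale structure could be organized: surfels live in per-scale buffers, new fine-scale surfels are appended without touching coarser levels, and a simple scale-dependent query picks which levels to render. All class names, fields, and the LoD policy are hypothetical, not the actual implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class SurfelLevel:
    """One scale level: flat arrays of surfel attributes (hypothetical layout)."""
    positions: np.ndarray = field(default_factory=lambda: np.empty((0, 3), np.float32))
    radii: np.ndarray = field(default_factory=lambda: np.empty((0,), np.float32))
    colors: np.ndarray = field(default_factory=lambda: np.empty((0, 3), np.float32))

    def append(self, positions, radii, colors):
        # New surfels are concatenated in place; existing surfels stay untouched,
        # so an insertion never triggers a global re-optimization.
        self.positions = np.concatenate([self.positions, positions.astype(np.float32)])
        self.radii = np.concatenate([self.radii, radii.astype(np.float32)])
        self.colors = np.concatenate([self.colors, colors.astype(np.float32)])

class MultiScaleSurfels:
    """A stack of scale levels, from coarse (level 0) to progressively finer levels."""
    def __init__(self):
        self.levels: list[SurfelLevel] = [SurfelLevel()]

    def insert(self, scale: int, positions, radii, colors):
        # Grow the hierarchy lazily the first time a finer scale is generated.
        while scale >= len(self.levels):
            self.levels.append(SurfelLevel())
        self.levels[scale].append(positions, radii, colors)

    def visible_levels(self, camera_scale: int, window: int = 1):
        # A toy LoD policy: render only the levels near the camera's current scale.
        lo = max(0, camera_scale - window)
        hi = min(len(self.levels) - 1, camera_scale + window)
        return list(range(lo, hi + 1))
```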

These components work together to enable interactive exploration from macro panoramas to micro textures, outperforming prior single‑scale generators in both perceptual quality and scale consistency.
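As a rough illustration of the autoregressive zoom loop described above, the sketch below wires on-demand synthesis together in Python. The renderer, image-refinement, and depth-estimation objects are placeholder interfaces standing in for the models used by the system, and the camera intrinsics, `camera.scale`, and `scene.insert` (matching the hypothetical structure sketched earlier) are assumptions for illustration only.

```python
import numpy as np

def backproject_to_surfels(image, depth, camera):
    """Back-project each pixel to a surfel with a pinhole model (camera-frame
    coordinates; the camera-to-world transform is omitted for brevity)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - camera.cx) * depth / camera.fx
    y = (v - camera.cy) * depth / camera.fy
    positions = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    radii = (depth / camera.fx).reshape(-1)          # roughly one pixel's footprint
    colors = image.reshape(-1, 3).astype(np.float32)
    return positions, radii, colors

def zoom_step(scene, camera, prompt, renderer, image_model, depth_model):
    """One refinement step, triggered when the user zooms in or issues a text
    prompt. All model interfaces here are hypothetical placeholders."""
    # 1. Render the current multi-scale scene from the zoomed-in camera.
    partial_view = renderer.render(scene, camera)

    # 2. Hallucinate finer detail conditioned on the rendered view and the prompt.
    detailed_image = image_model.refine(partial_view, prompt)

    # 3. Lift the new detail to 3D with monocular depth, aligned to the rendered
    #    depth so new content stays consistent with the coarser scales.
    depth = depth_model.estimate(detailed_image, align_to=renderer.depth(scene, camera))

    # 4. Register the new pixels as fine-scale surfels at the next scale level,
    #    without re-optimizing the existing surfels.
    positions, radii, colors = backproject_to_surfels(detailed_image, depth, camera)
    scene.insert(scale=camera.scale + 1, positions=positions, radii=radii, colors=colors)
    return scene
```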