- Link to the paper: LuxDiT
- Problem: estimate scene lighting from visual cues
- Output/representation: HDR environment map (HDRI)
- Key challenge: generating an envmap requires reasoning about geometry, materials, and indirect cues (not just a pixel-to-pixel LDR→HDR mapping)
- Implication: simple SD-XL fine-tuning tends to under-enforce physical consistency; the training recipe must force cue-based extrapolation to lighting
- Data bottleneck: real HDRI capture is expensive → not enough diversity to generalize
- Main approach: train on large synthetic data so the model learns physically grounded cues (direction, intensity)
- Synthetic data recipe: procedurally generated scenes with varied inserted Objaverse objects and diverse lighting conditions
- Feels reminiscent of the NeuralGaffer data setup
- Plus: LDR panoramic videos to expose temporal/sequence behavior
- Base model: CogVideoX
-
Training pipeline (chronological):
- Phase 1a (base image training, full fine-tune): treat inputs as single images (1-frame videos), 12k iterations on synthetic data
- Phase 1b (base video training, full fine-tune): continue 12k iterations with sequences (9/17/25 frames) from synthetic data to learn temporal consistency and lighting changes across frames
- Phase 2 (LoRA adaptation): freeze base weights; train LoRA for 5k iterations on real data (perspective crops of real HDR panoramas + real LDR panoramic videos) to improve semantic alignment and reduce sim-to-real gap
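As a mental model, the three-phase schedule above can be sketched as a small config. This is purely my reconstruction from the notes; the class and field names are invented, not the authors' released code, and the clip lengths for the real-data phase are not stated.

```python
# Hypothetical sketch of the three LuxDiT training phases; names/fields are
# my own invention, reconstructed from the notes above, not the released code.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Phase:
    name: str
    iterations: int
    data: str                                # "synthetic" or "real"
    frame_counts: Optional[Tuple[int, ...]]  # clip lengths sampled; None = unspecified
    trainable: str                           # "full" fine-tune vs. "lora" adapter only

SCHEDULE = (
    Phase("1a_base_image", 12_000, "synthetic", (1,), "full"),        # 1-frame videos
    Phase("1b_base_video", 12_000, "synthetic", (9, 17, 25), "full"), # temporal consistency
    # Real data = perspective crops of HDR panoramas + LDR panoramic videos;
    # the clip lengths used in this phase aren't stated in the notes above.
    Phase("2_lora_real", 5_000, "real", None, "lora"),                # sim-to-real gap
)

total = sum(p.iterations for p in SCHEDULE)
print(total)  # → 29000
```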
Qualitative eval
- Given their reported metrics and the visuals in the paper, the method looks like a clear step up over predecessors (StyleLight, DiffusionLight, DiffusionLight-Turbo).
- Nonetheless, given the data setup, the method has limitations. Below are some qualitative results to get a feel for where it works well and where it struggles.
Poly Haven
- 50 samples from Poly Haven, cropped to a vertical 60° FOV, and tone-mapped following the StyleLight evaluation protocol (specifically the implementation released by the DiffusionLight team: https://github.com/DiffusionLight/DiffusionLight-evaluation).
- For predictions, I used their image-based model with the provided LoRA (as released in their repo; no extra tricks).
- Visualizations: predict an HDR environment map from the input image, then use Blender Cycles to render a perfectly reflective ball at normal exposure and underexposure.
- Styling inspired by the DiffusionLight project website.
- Swipe or use arrows to move between images.
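To make the eval setup above concrete, here is a minimal NumPy sketch of the two ingredients: sampling a pinhole view with a 60° vertical FOV out of an equirectangular panorama, and a naive clip-and-gamma LDR preview with an exposure knob (negative `ev` gives the underexposed view). This is an illustration only, not the released DiffusionLight-evaluation code; resolution, interpolation, and the actual tone-mapping operator in that protocol likely differ.

```python
import numpy as np

def perspective_crop(pano, out_h=256, out_w=256, v_fov_deg=60.0, yaw_deg=0.0):
    """Nearest-neighbour pinhole crop from an equirectangular panorama.

    A rough sketch of the 60-degree-vertical-FOV crop used in the
    StyleLight/DiffusionLight evaluation protocol; not the released code.
    """
    H, W = pano.shape[:2]
    # Focal length (in pixels) implied by the vertical field of view.
    f = (out_h / 2.0) / np.tan(np.radians(v_fov_deg) / 2.0)
    xs = np.arange(out_w) - (out_w - 1) / 2.0
    ys = np.arange(out_h) - (out_h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    z = np.full_like(x, f)
    norm = np.sqrt(x**2 + y**2 + z**2)
    dx, dy, dz = x / norm, y / norm, z / norm
    lon = np.arctan2(dx, dz) + np.radians(yaw_deg)   # azimuth; 0 = pano centre
    lat = np.arcsin(np.clip(dy, -1.0, 1.0))          # elevation (+ is image-down)
    u = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W - 1)  # wrap horizontally
    v = (lat / np.pi + 0.5) * (H - 1)
    return pano[np.round(v).astype(int), np.round(u).astype(int)]

def tonemap(hdr, ev=0.0, gamma=2.4):
    """Naive clip-and-gamma preview; ev=-2 would approximate 'underexposed'."""
    return np.clip(hdr * 2.0**ev, 0.0, 1.0) ** (1.0 / gamma)
```

The clip in `tonemap` is exactly what discards the above-1 radiance an HDR envmap carries, which is why the underexposed renders in the visualizations are informative: they reveal whether the predicted map actually stores bright sources rather than saturating them.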
Out of domain
- Since they fine-tune a LoRA on real data (including HDRIs from Poly Haven), here are some results on out-of-domain inputs.
- Some inputs are images of my own; others are typical hard cases (bird’s-eye view, low-angle views, animated/stylized imagery, etc.).
- The key question: does it generalize well to in-the-wild data?
References
- LuxDiT. https://research.nvidia.com/labs/toronto-ai/LuxDiT/ (accessed 2026-04-02).
- DiffusionLight. https://diffusionlight.github.io/ (accessed 2026-04-02).
- DiffusionLight-Turbo. https://diffusionlight.github.io/turbo/ (accessed 2026-04-02).
- StyleLight. https://style-light.github.io/ (accessed 2026-04-02).
- NeuralGaffer. https://neural-gaffer.github.io/ (accessed 2026-04-02).