- Link to the paper: LuxDiT
- Problem: estimate scene lighting from visual cues
- Output/representation: HDR environment map (HDRI)
- Key challenge: generating an envmap requires reasoning about geometry, materials, and indirect cues (not just a pixel-to-pixel LDR→HDR mapping)
- Implication: simple SD-XL fine-tuning tends to under-enforce physical consistency; the training recipe must force cue-based extrapolation to lighting
- Data bottleneck: real HDRI capture is expensive → not enough diversity to generalize
- Main approach: train on large synthetic data so the model learns physically grounded cues (direction, intensity)
- Synthetic data recipe: procedurally generated scenes with varied inserted Objaverse objects and diverse lighting conditions
- Feels reminiscent of the NeuralGaffer data setup
- Plus: LDR panoramic videos to expose temporal/sequence behavior
- Base model: CogVideoX
-
Training pipeline (chronological):
- Phase 1a (base image training, full fine-tune): treat inputs as single images (1-frame videos), 12k iterations on synthetic data
- Phase 1b (base video training, full fine-tune): continue 12k iterations with sequences (9/17/25 frames) from synthetic data to learn temporal consistency and lighting changes across frames
- Phase 2 (LoRA adaptation): freeze base weights; train LoRA for 5k iterations on real data (perspective crops of real HDR panoramas + real LDR panoramic videos) to improve semantic alignment and reduce sim-to-real gap
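As a mental model, the three-phase schedule above can be sketched as a small config. This is purely my reconstruction from the notes; the class and field names are invented, not the authors' released code, and the clip lengths for the real-data phase are not stated.

```python
# Hypothetical sketch of the three LuxDiT training phases; names/fields are
# my own invention, reconstructed from the notes above, not the released code.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Phase:
    name: str
    iterations: int
    data: str                                # "synthetic" or "real"
    frame_counts: Optional[Tuple[int, ...]]  # clip lengths sampled; None = unspecified
    trainable: str                           # "full" fine-tune vs. "lora" adapter only

SCHEDULE = (
    Phase("1a_base_image", 12_000, "synthetic", (1,), "full"),        # 1-frame videos
    Phase("1b_base_video", 12_000, "synthetic", (9, 17, 25), "full"), # temporal consistency
    # Real data = perspective crops of HDR panoramas + LDR panoramic videos;
    # the clip lengths used in this phase aren't stated in the notes above.
    Phase("2_lora_real", 5_000, "real", None, "lora"),                # sim-to-real gap
)

total = sum(p.iterations for p in SCHEDULE)
print(total)  # → 29000
```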
Qualitative eval
- Given their reported metrics and the visuals in the paper, the method looks like a clear step up over predecessors (StyleLight, DiffusionLight, DiffusionLight-Turbo).
- Nonetheless, given the data setup, the method has limitations. Below are some qualitative results to get a feel for where it works well and where it struggles.
Poly Haven
- 50 samples from Poly Haven, cropped to a vertical 60° FOV, and tone-mapped following the StyleLight evaluation protocol (specifically the implementation released by the DiffusionLight team: https://github.com/DiffusionLight/DiffusionLight-evaluation).
- For predictions, I used their image-based model with the provided LoRA (as released in their repo; no extra tricks).
- Visualizations: predict an HDR environment map from the input image, then use Blender Cycles to render a perfectly reflective ball at normal exposure and underexposure.
- Styling inspired by the DiffusionLight project website.
- Swipe or use arrows to move between images.
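To make the eval setup above concrete, here is a minimal NumPy sketch of the two ingredients: sampling a pinhole view with a 60° vertical FOV out of an equirectangular panorama, and a naive clip-and-gamma LDR preview with an exposure knob (negative `ev` gives the underexposed view). This is an illustration only, not the released DiffusionLight-evaluation code; resolution, interpolation, and the actual tone-mapping operator in that protocol likely differ.

```python
import numpy as np

def perspective_crop(pano, out_h=256, out_w=256, v_fov_deg=60.0, yaw_deg=0.0):
    """Nearest-neighbour pinhole crop from an equirectangular panorama.

    A rough sketch of the 60-degree-vertical-FOV crop used in the
    StyleLight/DiffusionLight evaluation protocol; not the released code.
    """
    H, W = pano.shape[:2]
    # Focal length (in pixels) implied by the vertical field of view.
    f = (out_h / 2.0) / np.tan(np.radians(v_fov_deg) / 2.0)
    xs = np.arange(out_w) - (out_w - 1) / 2.0
    ys = np.arange(out_h) - (out_h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    z = np.full_like(x, f)
    norm = np.sqrt(x**2 + y**2 + z**2)
    dx, dy, dz = x / norm, y / norm, z / norm
    lon = np.arctan2(dx, dz) + np.radians(yaw_deg)   # azimuth; 0 = pano centre
    lat = np.arcsin(np.clip(dy, -1.0, 1.0))          # elevation (+ is image-down)
    u = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W - 1)  # wrap horizontally
    v = (lat / np.pi + 0.5) * (H - 1)
    return pano[np.round(v).astype(int), np.round(u).astype(int)]

def tonemap(hdr, ev=0.0, gamma=2.4):
    """Naive clip-and-gamma preview; ev=-2 would approximate 'underexposed'."""
    return np.clip(hdr * 2.0**ev, 0.0, 1.0) ** (1.0 / gamma)
```

The clip in `tonemap` is exactly what discards the above-1 radiance an HDR envmap carries, which is why the underexposed renders in the visualizations are informative: they reveal whether the predicted map actually stores bright sources rather than saturating them.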
Out of domain
- Since they fine-tune a LoRA on real data (including HDRIs from Poly Haven), here are some results on out-of-domain inputs.
- Some inputs are images of my own; others are typical hard cases (bird’s-eye view, low-angle views, animated/stylized imagery, etc.).
- The key question: does it generalize well to in-the-wild data?
References
- LuxDiT. https://research.nvidia.com/labs/toronto-ai/LuxDiT/ (accessed 2026-04-02).
- DiffusionLight. https://diffusionlight.github.io/ (accessed 2026-04-02).
- DiffusionLight-Turbo. https://diffusionlight.github.io/turbo/ (accessed 2026-04-02).
- StyleLight. https://style-light.github.io/ (accessed 2026-04-02).
- NeuralGaffer. https://neural-gaffer.github.io/ (accessed 2026-04-02).