• Link to the paper: LuxDiT

  • Problem: estimate scene lighting from visual cues

  • Output/representation: HDR environment map (HDRI)

  • Key challenge: generating an envmap requires reasoning about geometry, materials, and indirect cues (not just a pixel-to-pixel LDR→HDR mapping)

  • Implication: a simple SD-XL fine-tune tends to under-enforce physical consistency; the training recipe must force the model to extrapolate lighting from these cues

  • Data bottleneck: real HDRI capture is expensive → not enough diversity to generalize

  • Main approach: train on large synthetic data so the model learns physically grounded cues (direction, intensity)

  • Synthetic data recipe: procedurally generated scenes with varied inserted Objaverse objects and diverse lighting conditions

  • Feels reminiscent of the NeuralGaffer data setup

  • Plus: LDR panoramic videos to expose temporal/sequence behavior

  • Base model: CogVideoX

  • Training pipeline (chronological; a schedule sketch in Python follows this list):

    • Phase 1a (base image training, full fine-tune): treat inputs as single images (1-frame videos), 12k iterations on synthetic data
    • Phase 1b (base video training, full fine-tune): continue for another 12k iterations on synthetic sequences (9/17/25 frames) to learn temporal consistency and lighting changes across frames
    • Phase 2 (LoRA adaptation): freeze base weights; train LoRA for 5k iterations on real data (perspective crops of real HDR panoramas + real LDR panoramic videos) to improve semantic alignment and reduce sim-to-real gap
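
As a reading aid, here is a minimal sketch of that schedule in Python. The iteration counts, frame counts, and data sources mirror the notes above; `run_schedule`, `train_step`, and the loaders are hypothetical stand-ins, not the authors' code, and the Phase-2 frame counts are my assumption (real data mixes single images and video clips).

```python
import random

# Three-phase schedule as described above; iteration counts from the paper notes.
PHASES = [
    # Phase 1a: single images treated as 1-frame videos, full fine-tune.
    {"name": "1a_image", "iters": 12_000, "frames": [1],            "data": "synthetic", "mode": "full"},
    # Phase 1b: multi-frame clips for temporal consistency, full fine-tune.
    {"name": "1b_video", "iters": 12_000, "frames": [9, 17, 25],    "data": "synthetic", "mode": "full"},
    # Phase 2: base weights frozen, LoRA adapters trained on real data
    # (perspective crops of HDR panoramas + LDR panoramic videos).
    {"name": "2_lora",   "iters": 5_000,  "frames": [1, 9, 17, 25], "data": "real",      "mode": "lora"},
]

def run_schedule(train_step, loaders):
    """train_step(batch, mode) and loaders[data](n_frames) are hypothetical
    stand-ins for the diffusion training step and the data pipeline."""
    for phase in PHASES:
        for _ in range(phase["iters"]):
            n_frames = random.choice(phase["frames"])
            batch = loaders[phase["data"]](n_frames)
            train_step(batch, phase["mode"])  # "lora" => base stays frozen
```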

Qualitative eval

  • Given their reported metrics and the visuals in the paper, the method looks like a clear step up over predecessors (StyleLight, DiffusionLight, DiffusionLight-Turbo).

  • Nonetheless, given the data setup, the method has limitations. Below are some qualitative results to get a feel for where it works well and where it struggles.

Poly Haven

  • 50 samples from Poly Haven, cropped to a vertical 60° FOV and tone-mapped following the StyleLight evaluation protocol (specifically the implementation released by the DiffusionLight team: https://github.com/DiffusionLight/DiffusionLight-evaluation); a cropping sketch follows this list.
  • For predictions, I used their image-based model with the provided LoRA (as released in their repo; no extra tricks).
  • Visualizations: predict an HDR environment map from the input image, then use Blender Cycles to render a perfectly reflective ball at normal exposure and underexposure (a rendering sketch appears after the sample list below).
  • Styling inspired by the DiffusionLight project website.
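
To make the protocol concrete, here is a minimal sketch of the cropping step, assuming an equirectangular panorama as input. The gamma curve at the end is only a placeholder for the actual tone mapping, which I take from the DiffusionLight-evaluation repo linked above; file names are examples.

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # must be set before importing cv2 to read .exr

import cv2
import numpy as np

def perspective_crop(pano, out_h=512, out_w=512, vfov_deg=60.0):
    """Forward-facing perspective crop out of an equirectangular panorama."""
    H, W = pano.shape[:2]
    f = (out_h / 2) / np.tan(np.radians(vfov_deg) / 2)  # focal length in pixels
    ys, xs = np.mgrid[0:out_h, 0:out_w].astype(np.float32)
    # Camera-space ray directions (x right, y down, z forward); yaw/pitch
    # rotations are omitted, so the crop looks straight ahead.
    d = np.stack([(xs + 0.5 - out_w / 2) / f,
                  (ys + 0.5 - out_h / 2) / f,
                  np.ones_like(xs)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    lon = np.arctan2(d[..., 0], d[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))  # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(np.float32)
    v = ((lat / np.pi + 0.5) * H).astype(np.float32)
    return cv2.remap(pano, u, v, cv2.INTER_LINEAR)

pano = cv2.imread("abandoned_waterworks_2k.exr", cv2.IMREAD_UNCHANGED)
crop = perspective_crop(pano)
ldr = np.clip(crop, 0.0, 1.0) ** (1 / 2.2)  # placeholder; see linked repo for the real tone map
cv2.imwrite("crop.png", (ldr * 255).astype(np.uint8))
```
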
abandoned_waterworks_2k
afrikaans_church_interior_2k
anniversary_lounge_2k
autumn_meadow_2k
bambanani_sunset_2k
barnaslingan_01_2k
between_bridges_2k
bloem_train_track_cloudy_2k
brown_photostudio_02_2k
carpentry_shop_02_2k
christmas_photo_studio_07_2k
circus_maximus_1_2k
citrus_orchard_puresky_2k
distribution_board_2k
drachenfels_cellar_2k
dry_meadow_2k
ferndale_studio_02_2k
furry_clouds_2k
greenwich_park_02_2k
hay_bales_2k
hochsal_forest_2k
immenstadter_horn_2k
modern_buildings_2k
moonless_golf_2k
mud_road_puresky_2k
overcast_soil_puresky_2k
paul_lobe_haus_2k
preller_drive_2k
pump_house_2k
qwantani_dusk_2_2k
qwantani_sunset_2k
rogland_moonlit_night_2k
rosendal_plains_2_2k
rotes_rathaus_2k
skylit_garage_2k
small_harbor_02_2k
snowy_forest_path_01_2k
squash_court_2k
studio_small_05_2k
teufelsberg_ground_1_2k
thatch_chapel_2k
twilight_sunset_2k
versveldpas_2k
wasteland_clouds_2k
wide_street_01_2k
winter_orchard_2k
wobbly_bridge_2k
yoga_room_2k
zwartkops_curve_afternoon_2k
zwartkops_start_sunset_2k
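
And here is a minimal sketch of the visualization step in Blender's Python API (bpy). It assumes a scene that already has a camera; the file paths and the -5 EV underexposure value are my choices, not a documented part of the protocol.

```python
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'

# Use the predicted HDR environment map as the only light source.
world = bpy.data.worlds.new("PredictedEnv")
world.use_nodes = True
env = world.node_tree.nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("predicted_envmap.exr")  # hypothetical path
world.node_tree.links.new(env.outputs["Color"],
                          world.node_tree.nodes["Background"].inputs["Color"])
scene.world = world

# A perfectly reflective ball: fully metallic, zero roughness.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, segments=64, ring_count=32)
ball = bpy.context.active_object
bpy.ops.object.shade_smooth()
mat = bpy.data.materials.new("Chrome")
mat.use_nodes = True
bsdf = mat.node_tree.nodes["Principled BSDF"]
bsdf.inputs["Metallic"].default_value = 1.0
bsdf.inputs["Roughness"].default_value = 0.0
ball.data.materials.append(mat)

# Render once at normal exposure and once underexposed (-5 EV, my choice)
# so that HDR intensity peaks in the predicted envmap become visible.
for tag, ev in [("normal", 0.0), ("under", -5.0)]:
    scene.view_settings.exposure = ev
    scene.render.filepath = f"//ball_{tag}.png"
    bpy.ops.render.render(write_still=True)
```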

Out of domain

  • Since they fine-tune a LoRA on real data (including HDRIs from Poly Haven), the Poly Haven results above are partly in-distribution; here are some results on genuinely out-of-domain inputs.
  • Some inputs are images of my own; others are typical hard cases (bird’s-eye view, low-angle views, animated/stylized imagery, etc.).
  • The key question: does it generalize well to in-the-wild data?
bahn
bird_eye
car_interior
coffee
coral_reef
fab
historicum_1
historicum_2
moon
rango
shrek
trunk_shot
tum_tower

References