fixing wobbly AI-generated GIF frames with phase correlation

easygif.lol (previous post) has been using Gemini’s image generation for a while now, and while the results are often OK, sometimes the frames have this annoying jitter - the subject shifts around between frames even though it shouldn’t, which makes the animation look wobbly.

time to fix that.

original
from this
fixed
to this

the problem

when easygif asks Gemini to generate “a cute white bear dancing on a white background” as a sequence of animation frames, each frame is generated somewhat independently. while I am looking into ways to improve that (and make the generation faster/cheaper), it’s what I have. the model tries to keep things consistent, but it’s not perfect:

  • misalignment: the bear might be centered in frame 1, then shift 10 pixels left in frame 2, then 5 pixels right in frame 3
  • zoom inconsistency: the bear might appear slightly larger or smaller between frames

the result is a GIF that wobbles and jitters instead of having smooth animation. not good enough.

exploring solutions

I looked into several approaches, keeping in mind that it should not be too slow on my cheap VPS:

phase correlation

  • uses FFT to find translation offsets between frames in the frequency domain
  • fast, robust to noise
  • only handles translation (not rotation/scale)
  • OpenCV has cv2.phaseCorrelate() built-in
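for intuition, here’s roughly what phase correlation computes under the hood - a pure-numpy sketch, not the OpenCV call itself (the function name and the 1e-12 epsilon are mine):

```python
import numpy as np

def phase_correlate(ref, cur):
    """Estimate the (dx, dy) translation of `cur` relative to `ref` via FFT."""
    f_ref = np.fft.fft2(ref)
    f_cur = np.fft.fft2(cur)
    # cross-power spectrum, normalised to unit magnitude (keeps only phase info)
    cps = np.conj(f_ref) * f_cur
    cps /= np.abs(cps) + 1e-12
    corr = np.real(np.fft.ifft2(cps))
    # the correlation surface peaks at the translation offset
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # offsets past half the image wrap around to negative shifts
    dy, dx = (p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))
    return dx, dy, float(corr[peak])  # peak height ≈ confidence (1.0 = perfect)
```

cv2.phaseCorrelate() does the same thing but with windowing and sub-pixel peak interpolation, which is why it’s the one actually used here.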

feature-based alignment (ORB + homography)

  • detect keypoints, match them across frames, compute transformation matrix
  • can handle translation, rotation, AND scale
  • more complex, can fail if subject changes appearance
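the “compute transformation matrix” step from matched keypoints can be sketched with a least-squares affine fit in plain numpy (`estimate_affine` is a made-up name; actual keypoint detection and matching, e.g. ORB plus a brute-force matcher, are assumed to have happened already):

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine transform mapping src points onto dst points.

    src, dst: (N, 2) arrays of matched keypoint coordinates, N >= 3.
    Solves for [a, b, tx, c, d, ty] in x' = a*x + b*y + tx, y' = c*x + d*y + ty.
    """
    n = len(src)
    A = np.zeros((2 * n, 6))
    b = dst.reshape(-1)            # interleaved [x0', y0', x1', y1', ...]
    A[0::2, 0:2] = src             # rows for x' equations
    A[0::2, 2] = 1.0
    A[1::2, 3:5] = src             # rows for y' equations
    A[1::2, 5] = 1.0
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params.reshape(2, 3)
```

a bad match set feeds garbage point pairs into this fit, which is exactly how you end up with the wild transformations described below.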

bounding box normalisation

  • detect subject bounding box in each frame
  • scale frames to normalise subject size
  • addresses zoom directly
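the threshold-based subject detection behind this (the part that later turned out unreliable) looks something like the sketch below - `subject_bbox` and the 240 cutoff are my own illustrative choices, assuming a near-white background:

```python
import numpy as np

def subject_bbox(frame, white_thresh=240):
    """Bounding box of non-background pixels, assuming a near-white background.

    Returns (x0, y0, x1, y1) or None if no subject pixels are found.
    """
    if frame.ndim == 3:
        mask = frame.min(axis=-1) < white_thresh  # any channel darker than bg
    else:
        mask = frame < white_thresh
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```

the fragility is visible in the signature: one global threshold has to work for every prompt, background and lighting, and when it misfires the bbox (and therefore the rescale) is nonsense.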

first attempt - align everything to frame 0

I ran experiments on 10 sessions for which I still had the data… results were mixed.

phase correlation worked reasonably well, but feature-based and bbox normalisation were producing broken images. like, completely distorted. the homography was finding matches but applying wild transformations that made no sense.

feature normalisation is broken

my theory is that with actual differences between frames (and especially when comparing to frame 0), features are easily lost and the matcher gets confused.

chain to previous frame

while previous results were OK, I tried comparing each frame n to frame n-1 instead, for better jitter detection.

the chained alignment implementation (python):
def align_sequence(self, frames, chain_to_previous=True):
    aligned_frames = [frames[0].copy()]  # first frame is the anchor, unchanged
    metrics_list = []

    for i, current in enumerate(frames[1:], start=1):
        if chain_to_previous:
            reference = aligned_frames[i - 1]  # previous ALIGNED frame
        else:
            reference = frames[0]

        aligned, metrics = self.compute_alignment(reference, current)
        aligned_frames.append(aligned)
        metrics_list.append(metrics)  # keep per-frame shift/confidence data

    return aligned_frames, metrics_list

now phase correlation only corrects the incremental jitter between consecutive frames, not the overall position changes from animation.

confidence threshold

manually reviewing possible fixes from the phase correlation showed that the confidence score was a good indicator of, well, how confident I should be that the detected shift should indeed be corrected. applying a somewhat arbitrary 0.1 threshold did a good job, so I added a threshold check:
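the check itself is tiny; something like this (`maybe_apply_shift` and the roll-based translation are illustrative - real code would warp with proper border handling rather than wrapping pixels around):

```python
import numpy as np

MIN_CONFIDENCE = 0.1  # below this, the detected shift is probably noise

def maybe_apply_shift(frame, shift, confidence, min_confidence=MIN_CONFIDENCE):
    """Correct the frame only when phase correlation is confident enough."""
    if confidence < min_confidence:
        return frame, False  # leave the frame untouched
    dx, dy = shift
    # integer-pixel translation via roll, for illustration only;
    # negating the shift moves the frame back toward the reference
    corrected = np.roll(frame, (-round(dy), -round(dx)), axis=(0, 1))
    return corrected, True
```

the second return value makes the “applied?” column in the result tables below trivial to log.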

results

ran the full experiment on 146 debug sessions (893 frames total).

processing time: 28 minutes on my machine (~20 seconds per session, ~125ms per frame for phase correlation). very much not multi-threaded.

example 1: dancing white bear

session: 20250322_113430_a_cute_funny_white_bear_dancing_white_background

frame | dx (px) | dy (px) | confidence | applied?
------|---------|---------|------------|------------
0     | -       | -       | -          | (reference)
1     | +8.23   | +10.10  | 0.757      | ✓ yes
2     | +41.25  | +20.39  | 0.496      | ✓ yes
3     | -12.11  | +27.95  | 0.768      | ✓ yes
4     | -11.94  | +11.03  | 0.784      | ✓ yes

the bear was drifting around quite a bit (up to 46px magnitude). phase correlation caught and fixed all of it. the bear now stays centered.

total processing time: 124ms for 5 frames.

GIF comparison (original vs fixed):

original
Original - notice the wobble
fixed
Phase correlation applied
individual frame comparison (frame 2, the worst offender at +41px dx):

original frame 2
Original frame 2
corrected frame 2
Corrected frame 2

example 2: rsfgwrsgwrsg

session: 20250322_111643_rsfgwrsgwrsg

(yes that’s the actual prompt. can’t remember what I was doing.)

frame | dx (px) | dy (px) | confidence | applied?
------|---------|---------|------------|------------
0     | -       | -       | -          | (reference)
1     | -0.02   | +0.03   | 0.403      | ✓ yes
2     | +0.18   | -0.08   | 0.342      | ✓ yes
3     | +0.85   | +39.80  | 0.417      | ✓ yes
4     | +5.21   | +27.53  | 0.455      | ✓ yes
5     | +1.53   | +28.45  | 0.780      | ✓ yes

frames 3-5 had significant vertical drift (up to 40px!). all fixed.

GIF comparison (original vs fixed):

original
Original - vertical drift
fixed
Phase correlation applied
individual frame comparison (frame 3, biggest shift at +39.8px dy):

original frame 3
Original frame 3
corrected frame 3
Corrected frame 3

example 3: meowing black cat (subtle fix)

session: 20250321_215009_cute_meowing_black_cat_white_background

frame | dx (px) | dy (px) | confidence | applied?
------|---------|---------|------------|---------------------
0     | -       | -       | -          | (reference)
1     | +0.04   | +0.03   | 0.759      | ✓ yes
2     | +0.00   | +0.17   | 0.721      | ✓ yes
3     | -3.53   | +1.02   | 0.615      | ✓ yes
4     | -1.13   | +1.43   | 0.366      | ✓ yes
5     | -6.38   | -0.27   | 0.094      | ✗ no (low confidence)

this one is more subtle - small shifts of a few pixels. but the cat’s tail was wobbling in the original. now it’s stable.

GIF comparison (original vs fixed):

original
Original - tail wobble
fixed
Phase correlation applied
individual frame comparison (frame 3, shift of -3.5px):

original frame 3
Original frame 3
corrected frame 3
Corrected frame 3

what about feature-based and bbox?

I tested all three algorithms. here’s the verdict:

phase correlation

  • fast (~30ms per frame)
  • handles translation well
  • confidence score lets us skip uncertain fixes
  • doesn’t break images

feature-based

  • too aggressive for animation frames
  • gets confused, not worth it

bounding box normalisation

  • struggles with subject detection
  • threshold-based detection unreliable across varied content
  • when it gets the bbox wrong, results are just bad