All blog posts

We Tested Gemma 4 12B for Ad Automation - What Actually Works

June 22, 2026 · 8 min read

Soku Team

Soku Team

We Tested Gemma 4 12B for Ad Automation - What Actually Works

We tested Gemma 4 12B against the work marketers actually want to automate: reviewing creative, producing variant briefs, spotting policy risk, and deciding what to change in a campaign. The result is useful but bounded. Gemma 4 12B looks promising as a read-only creative reviewer. It is not a media buyer.

For context, read what Gemma 4 12B means for AI marketers. If you want to reproduce the workflow, use the setup guide for Meta and Google Ads teams.

Gemma 4 12B ad automation test scorecard
Gemma 4 12B ad automation test scorecard

The test setup

We used a review packet that mirrors a real ad-team handoff:

  • one campaign brief,
  • three creative variants,
  • brand and policy notes,
  • a small performance export,
  • the target platform and objective,
  • a fixed prompt that required a table output.

The model was not allowed to write to Meta, Google, or TikTok. That restriction is intentional. The goal was to answer a narrower question: can Gemma produce review output good enough to send better assets into Soku or a human approval workflow?

What worked

Creative QA

This was the strongest result. Gemma was good at checking whether the creative matched the brief, whether the hook promised something the body never delivered, and whether the visual structure matched the platform. It was especially useful for turning fuzzy review notes into a table: asset, issue, severity, fix, and reviewer required.

That is real leverage. Most creative reviews are inconsistent because each reviewer has a slightly different rubric. A fixed Gemma prompt made the review more repeatable.

Variant briefs

Gemma was also useful for producing the next batch of variants. Given a winning hook and a fatigue signal, it could propose missing angles: social proof, objection handling, competitor contrast, price framing, and product-demo variants.

The important caveat: the ideas were only good when the performance export and brief were specific. Vague input produced generic variants. Specific input produced usable briefs.

Audio and video notes

The native audio angle matters for ad teams. Voiceover pacing, claim density, and mismatch between spoken CTA and on-screen CTA are exactly the kind of issues that slow human review. Gemma was useful as a first-pass reviewer here.

What did not work

Causal diagnosis

When asked "why did performance drop?", the model wanted to answer even when the data was insufficient. That is dangerous. A small export cannot prove auction pressure, audience saturation, seasonality, or creative fatigue by itself.

The safe pattern is to force a data-grounded response:

For each hypothesis, show the field that supports it.
If the export does not contain enough evidence, say "not knowable from this packet."

Budget decisions

Gemma should not decide budgets. It can summarize evidence and produce a suggested test plan, but spend allocation needs account context, platform constraints, and approval controls. That is where an ad agent like Soku belongs.

The safe operating model

Use Gemma before the campaign write path:

  1. Soku or the marketer identifies a creative gap.
  2. The team generates variants.
  3. Gemma reviews the variants locally.
  4. A human approves the output.
  5. Soku handles the campaign workflow and measures the result.

That keeps Gemma in the part of the system where it is strongest: multimodal review and structured briefing. It keeps campaign execution inside a tool designed for approvals, integrations, and measurement.

The scorecard

TaskResultUse it?
Creative QAStrongYes
Variant briefingStrong with specific inputsYes
Policy risk triageUseful first passYes, with human review
Performance causalityWeak unless data is richCaution
Budget changesNot enough contextNo

What we would deploy first

The first production workflow should be a "creative gate" for one channel. Do not boil the ocean. Pick TikTok hooks, Meta static ads, or Google Demand Gen assets. Run 20-50 assets through the same prompt. Compare Gemma's flags with human reviewer notes. Then measure whether the approved batch performs better than the unreviewed control.

That is how a model becomes useful in marketing: not by sounding smart, but by improving the quality of the work that reaches the auction.

FAQ

Did Gemma 4 12B replace a media buyer in the test?

No. It worked as a reviewer and briefer, not as a spend decision-maker.

Where should Soku sit in this workflow?

After review and before/after launch: Soku connects the approved assets to campaign execution and performance measurement.

What is the biggest risk?

Over-trusting causal explanations from thin data. Force the model to show evidence or say the answer is not knowable.

Related Tools

Related Use Cases

Relevant Reads