Section 3: Hidden Challenges in Creative Testing


Of course, something as complex as creative testing has sticking points. Here’s a summary of some of those challenges, and how we work around them: 


  • Multiple strategies for testing ads:  It’s nice to have choices, but they can complicate things. You can test creative on Facebook with their split-test feature, or by setting up one ad per ad set, or by setting up many ads within an ad set (which is actually what Facebook recommends). The approach you pick will affect your testing results. 
  • Data integrity: The data for each of your tests won’t come in evenly. Some ads will get more impressions than others. The CPM for different ads and ad sets will vary. This makes for noise in the data, which makes it harder to determine the winning ad.
  • Cost: Testing has an extremely high ROI, but it can also have a very high investment cost. If you don’t set up your creative testing right, it can be prohibitively expensive. 
  • Control Bias: Facebook’s algorithm prefers winning ads and maintains creative history on new ads, ad sets and campaigns. 
  • Google: Running tests in Google Ads has challenges, but it got easier last year when Google Ads launched asset reporting.


Creative Testing: Statistical Significance vs. Cost-Effectiveness


Let’s take a closer look at the cost aspect of creative testing. 

In classic testing, you need a 95% confidence level to declare a winner. That’s nice to have, but getting to 95% confidence for in-app purchases may end up costing you $20,000 per creative variation.

Why so expensive? Because to reach a 95% confidence level, you’ll need about 100 purchases. With a 1% purchase rate (which is typical for gaming apps) and a $200 cost per purchase, you’ll end up spending $20,000 for each variation to accrue enough data for that 95% confidence level.

Sample math to reach 95% statistical relevance for a single variation (gaming averages):

  • 1% install-to-purchase rate
  • 100 purchases
  • $200 cost per purchase
  • $20,000 per variation (OUCH!)
  • Variations must beat the control by >25%

That’s actually the best-case scenario, too. Because of the way the math works, you’d also have to find a variation that beats the control by 25% or more for it to cost “only” $20,000. A variation that beat the control by 5% or 10% would have to run even longer to achieve a 95% confidence level. 
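
To see why a smaller lift takes longer to confirm, here is a minimal sketch using the standard two-proportion sample-size approximation. It is a textbook formula, not the calculation behind the 100-purchase rule of thumb, so the absolute numbers will differ from the sample math; the 80% power setting and the function name are assumptions for illustration. What matters is how quickly the required sample grows as the lift shrinks.

```python
from math import ceil
from statistics import NormalDist


def installs_needed_per_variation(base_rate, relative_lift,
                                  confidence=0.95, power=0.80):
    """Approximate installs per variation needed to detect a relative lift
    in purchase rate (two-sided two-proportion z-test approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)  # assumed power, not from the source
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 1% install-to-purchase rate, as in the sample math
for lift in (0.25, 0.10, 0.05):
    n = installs_needed_per_variation(0.01, lift)
    print(f"{lift:.0%} lift: about {n:,} installs per variation")
```

Under these assumptions, confirming a 5% lift takes roughly twenty times the sample needed for a 25% lift, which is why small improvements are so expensive to prove.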

There aren’t a lot of advertisers who can afford to spend $20,000 per variation, especially if 95% of new creative fails to beat the control. 

So, what to do? 

What we do is move the conversion event we’re targeting up a little in the sales funnel. For mobile apps, instead of optimizing for purchases, we optimize for installs per thousand impressions (IPM, or installs per mille). For websites, we optimize for the impression-to-top-funnel conversion rate. To be clear, this is not a Facebook-recommended best practice; it is our own voodoo magic/secret sauce that we’re brewing.


IPM Testing Is Cost-Effective


The obvious concern here is that ads with high CTRs and high conversion rates for top-funnel events may not be true winners for down-funnel conversions and ROI / ROAS. But while there is a risk of identifying false positives with this method, we’d rather take that risk than the risk, time and expense of optimizing for bottom-funnel metrics.

Optimizing for installs is more efficient than optimizing for purchases. Most importantly, it lets you run tests for far less money per variation. For many advertisers, that alone can make more testing financially viable: a $200 testing cost per variation versus $20,000 can mean the difference between running a couple of tests and having an ongoing, robust testing program. Note: this process may generate false negatives and false positives.

Sample math to reach 95% statistical relevance for a single variation (gaming averages):

  • 0.5% impression-to-install rate
  • 100 installs
  • $2.00 cost per install
  • $200 per variation (HUGE SAVINGS!)
  • Variations must beat the control by >25%


How We’ve Been Testing Creative Until Now


For the past few years, to streamline our Facebook and Google creative testing and reduce non-converting spend, we’ve been testing new video concepts using IPM (installs per mille, that is, installs per thousand impressions) as the primary metric. For the record, using IPM this way is not the Facebook-recommended best practice, which is to let ad sets exit the learning phase by gathering enough data to become statistically valid.

When testing creative, we would typically try three to five videos along with a control video using Facebook’s split test feature. We would show these ads to broad or 5-10% lookalike (LAL) audiences, restrict distribution to the Facebook newsfeed and to Android only, and use mobile app install (MAI) bidding to get about 100-250 installs.

If one of those new “contender” ads beat the control video’s IPM or came within 10%-15% of its performance, we would launch those potential new winning videos into the ad sets with the control video and let them fight it out to generate ROAS.
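
The promotion rule described here can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the impression and install counts are made up, and the 0.15 tolerance is one assumed reading of "within 10%-15%".

```python
def ipm(installs, impressions):
    """Installs per mille: installs per 1,000 impressions."""
    return installs / impressions * 1000


def promote_contender(contender_ipm, control_ipm, tolerance=0.15):
    """True if the contender beats the control's IPM or lands within
    `tolerance` of it (0.15 is an assumed reading of "10%-15%")."""
    return contender_ipm >= control_ipm * (1 - tolerance)


# Hypothetical counts, not client data
control = ipm(installs=100, impressions=12_500)    # about 8.0
contender = ipm(installs=100, impressions=14_000)  # about 7.1

print(f"Control IPM: {control:.1f}, contender IPM: {contender:.1f}")
print("Promote contender:", promote_contender(contender, control))
```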

Unexpected Results


We’ve seen hints of what we’re about to describe across numerous ad accounts and have confirmed with other advertisers that they have experienced similar results. But for purposes of explanation, let’s focus on one particular client of ours and how their ads performed in recent creative tests. 

In two months, we produced 60+ new video concepts for a client. All of them failed to beat the control video’s IPM. This struck us as odd, and statistically very unlikely. We expected to generate a new winner 5% of the time, or 1 out of 20 videos, so about 3 winners out of 60. Since we felt confident in our creative ideas, we decided to look deeper into our custom, money-saving testing method.
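
As a rough check on how unusual that losing streak is, the sketch below treats each video as an independent trial with a 5% chance of winning. Independence is a simplifying assumption for illustration; real creative tests are unlikely to be fully independent.

```python
expected_win_rate = 0.05  # 1 winner per 20 videos, as stated above
videos_tested = 60

expected_winners = expected_win_rate * videos_tested
# Probability that every one of the 60 videos loses
p_zero_winners = (1 - expected_win_rate) ** videos_tested

print(f"Expected winners: {expected_winners:.0f}")
print(f"Chance of zero winners: {p_zero_winners:.1%}")
```

Under that assumption, zero winners out of 60 would happen less than one time in twenty, which is rare enough to justify questioning the test method itself.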

Traditional testing methodology includes the idea of testing the testing system itself with an A/A test. A/A tests are like A/B tests, but instead of testing multiple creatives, you test the same creative in each “slot” of the test.

If your testing system/platform is working as expected, all “variations” should produce similar results, assuming you get close to statistical significance. If your A/A test results are very different, and the testing platform/methodology concludes that one variation significantly outperforms or underperforms the others, there could be an issue with the testing method or the quantity of data gathered.
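
One way to judge whether A/A results are "very different" is a chi-square test of homogeneity on installs versus impressions. The sketch below is a minimal pure-Python version. The impression and install counts are hypothetical (not the client's data), the function name is ours, and 9.488 is the standard 95% critical value for 4 degrees of freedom (five variations).

```python
def chi_square_homogeneity(installs, impressions):
    """Chi-square statistic testing whether all variations share one install rate."""
    pooled_rate = sum(installs) / sum(impressions)
    stat = 0.0
    for k, n in zip(installs, impressions):
        expected_installs = n * pooled_rate
        expected_misses = n * (1 - pooled_rate)
        stat += (k - expected_installs) ** 2 / expected_installs
        stat += ((n - k) - expected_misses) ** 2 / expected_misses
    return stat


CRITICAL_95_DF4 = 9.488  # 95% critical value, 5 variations -> 4 degrees of freedom

impressions = [10_000] * 5                  # hypothetical, equal delivery
even_installs = [90, 88, 93, 85, 91]        # hypothetical healthy A/A result
skewed_installs = [150, 60, 70, 65, 62]     # hypothetical suspicious A/A result

for label, installs in (("even", even_installs), ("skewed", skewed_installs)):
    stat = chi_square_homogeneity(installs, impressions)
    verdict = "differs more than chance suggests" if stat > CRITICAL_95_DF4 else "consistent with chance"
    print(f"{label}: chi-square = {stat:.1f} ({verdict})")
```

A statistic above the critical value on identical creatives points at the platform, the method, or the sample size rather than the ads.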

Here’s how we set up an A/A test to validate our custom approach to Facebook testing. The purpose of this test was to understand if Facebook maintains a creative history for the control and thus gives the control a performance boost making it very difficult to beat – if you don’t allow split test ads to exit the learning phase and reach statistical relevance.

  • We copied the control video four times and added one black pixel in different locations in each of the new “variations.” This allowed us to run what would look like the same video to humans but would be different videos in the eyes of Facebook’s platform. The goal was to get Facebook to assign new hash IDs for cloned videos and then test them together for maximum IPMs. 
  • These are the ads we ran… except we didn’t run the hotdog dogs; I’ve replaced the actual ads with cute doges to avoid disclosing the advertiser’s identity. The IPM for each ad is shown at the far right of the image.


Things to note here: 

  • The far-right ad (in the blue square) is the control.
  • All the other ads are clones of the control with one black pixel added.  
  • The far-left ad/clone outperformed the control by 149%. As described earlier, a difference like that shouldn’t happen if the platform was truly variation agnostic, BUT – to save money, we did not follow best practices to allow the ad set(s) to exit the learning phase.  

We ran this test for only 100 installs, which is our standard operating procedure for creative testing designed to save time and money.

Once our first test reached 100 installs, we paused the campaign to analyze the results. We turned the campaign back on to scale to 500 installs to get closer to statistical significance. We wanted to see if more data would result in IPM normalization (in other words, if the test results would settle back down to more even performance across the variations). However, the results of the second test remained similar. Note: the ad set(s) did not exit the learning phase and we did not follow Facebook’s best practice.

The results of these tests, while not statistically significant and not based on best practices, were surprising enough to merit additional tests. So we tested on!


Second A/A test of video creative


For our second test we ran the six videos shown below. Four of them were controls with different headers; two of them were new concepts that were very similar to the control. Again, we didn’t run the hotdog dogs; they’ve been inserted to protect the advertiser’s identity and to offer you cuteness!

The IPMs for all ads ranged between 7 and 11, even for the new ads that did not share a thumbnail with the control. The IPM for each ad is shown at the far right of the image.

Fourth A/A test of video creative


This was when we had our “ah-ha!” moment. We tested six very different video concepts: the one control and five brand new ideas, all of which were visually very different from the control and did not share the same thumbnail.

The control’s IPM was consistent in the 8-9 range, but the IPMs for the new visual concepts ranged between 0 and 2. The IPM for each ad is shown at the far right of the image.

Here are our impressions from the above tests

  • Facebook’s split tests appear to maintain creative history for the control video. This gives the control an advantage when using our IPM testing.
  • We remain unclear if Facebook can group variations with a similar look and feel to the control. If it can, similar-looking ads could also start with a higher IPM based on influence from the control — or perhaps similar thumbnails drive influence.
  • Creative concepts that are visually very different from the control appear to not share a creative history. IPMs for these variations are independent of the control. It appears that new, “out of the box” visual concepts vs the control may require more impressions to quantify their performance or get closer to statistical relevance.
  • Our IPM testing methodology is valid if we do NOT use a control video as the benchmark for success.


IPM Testing Summary


Here are the line graphs from the second, third, and fourth tests. 

And here’s what we think they mean:


Creative Testing 2.0 Recommendations:


Given the above results, those of us testing with IPM have an opportunity to re-test earlier concepts in IPM tests that exclude the control video, to determine whether we’ve been killing potential winners. As such, we recommend the following three-phase testing plan.


Creative Testing Phase 1: Initial IPM Test


  • Use 3~6 creatives in one ad set with MAI bidding (NEVER include the control in the ad set)
    • Less expensive than Facebook split testing, but not a best practice and will not achieve statistical relevance
  • 5% LAL in the US (for other countries, still use 5% LAL)
    • This will give you an audience reach of 10M or smaller (other geos)
  • Isolate one OS (iOS or Android)
  • Facebook Newsfeed only
  • Generate over 100 installs (50 installs are acceptable in high CPI scenarios)
    • 100 installs: 70% confidence with 5% margin of error
    • 160 installs: 80% confidence with 5% margin of error
    • 270 installs: 90% confidence with 5% margin of error
  • Lifetime budget: $500~$1,000 to drive installs that reach more than 70% confidence level
  • The goal is to kill IPM losers quickly and inexpensively, and then take the top 1~2 IPM winners to Phase 2
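
The install counts and confidence levels in the list above are in line with the standard sample-size formula for estimating a proportion. A minimal sketch, assuming a normal approximation and a worst-case proportion of 0.5 (both are assumptions for illustration, and the function name is hypothetical):

```python
from math import ceil
from statistics import NormalDist


def sample_size(confidence, margin_of_error=0.05, proportion=0.5):
    """Sample size for estimating a proportion at a given confidence level
    and margin of error (normal approximation, worst-case p = 0.5)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2)


for confidence in (0.70, 0.80, 0.90):
    print(f"{confidence:.0%} confidence: about {sample_size(confidence)} installs")
```

The results land close to the 100, 160, and 270 install figures, so the same function can be used to estimate install targets for other confidence levels or margins of error.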

Creative Testing Phase 2: Initial ROAS Test

  • Once you have high IPM winners identified, you can move into initial ROAS testing to see if high IPMs also will generate revenue
  • Create a new campaign
  • Test IPM winners from Phase 1 with AEO or VO
  • 10% LAL, auto-placement, Android or iOS. Do NOT test using worldwide (WW) audiences; choose only one country
  • 1 ad set with IPM winners from phase 1
    • Create new campaigns for new IPM winners from next rounds – do not add winners from other tests
  • Lifetime budget: $800~$1,500


Creative Testing Phase 3: ROAS Scale Test


  • Choose winners from Phase 2 with good ROAS
  • Use CBO: create a new ad set and roll the winners out to the target audiences that produce good results for the control
  • Create a new ad set for new creative winners from each testing round
  • Never have new creative compete against the control within an ad set; instead, keep them in separate ad sets and let them compete for budget within the same campaign


Note: We’re still testing many of our assumptions and non-standard practices.

  • Is it helpful to warm up IPM winners and establish “creative history” by purchasing installs in inexpensive countries?
  • How long should IPM winners be “isolated” from the control to allow time for learning to be built up?
  • Is the 5-10% LAL size contingent on the population of the country being tested?
  • Do results change when running 1 ad per ad set versus many ads per ad set? (So far, they don’t appear to.)
  • Will lifetime vs daily budgets matter?
  • Does a new campaign matter?
  • Does resetting the post ID matter?
  • Should creative testing be isolated to a test account?

We look forward to hearing how you’re testing and sharing more of what we uncover soon.

