Eideticker: Limitations in cross-browser performance testing

Last summer I wrote a bit about using Eideticker to measure the relative performance of Firefox for Android versus other browsers (Chrome, stock, etc.). At the time I was pretty optimistic about Eideticker’s usefulness as a truly “objective” measure of user experience, one that would give us a more accurate view of how we compare against the competition than traditional benchmarking suites (which, more often than not, measure things that a user will never normally see while browsing the web). Since then, there have been some things I’ve discovered, as well as some developments in the “state of the art” of mobile browsing, that have caused me to reconsider that view. While I haven’t given up entirely on this concept (and I’m still very much convinced of Eideticker’s utility as an internal benchmarking tool), there are definitely some limitations in terms of what we can do that I’m not sure how to overcome.

Essentially, there are currently three different types of Eideticker performance tests:

  • Animation tests: Measure the smoothness of an animation by comparing successive frames and counting how many are different. Currently the only example of this is the canvas “clock” test, but many others are possible. (A rough sketch of this frame comparison follows the list.)
  • Startup tests: Measure the amount of time it takes from when the application is launched to when the browser is fully running/available. There are currently two variants of this test in the dashboard, both of which measure the time taken to fully render Firefox’s home screen (the only difference between the two is whether the browser profile is fully initialized). The dirty-profile variant probably most closely resembles what a user would usually experience.
  • Scrolling tests: Measure the amount of undrawn area shown while the user is panning a website. Most of the current Eideticker tests are of this kind. A good example is the taskjs benchmark.
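The frame comparison behind the animation tests is conceptually simple. Here is a minimal sketch of the idea (not the actual Eideticker code), assuming the captured video has already been decoded into a list of numpy arrays, one HxWx3 uint8 array per frame; the change threshold is an arbitrary placeholder:

```python
import numpy as np

def changed_pixel_counts(frames, threshold=10):
    """For each pair of consecutive frames, count the pixels whose value
    changed by more than `threshold` in any colour channel."""
    counts = []
    for prev, cur in zip(frames, frames[1:]):
        delta = np.abs(cur.astype(np.int16) - prev.astype(np.int16)).max(axis=2)
        counts.append(int((delta > threshold).sum()))
    return counts

# A capture where most consecutive frames differ (few zeros in the list)
# generally corresponds to a smoother animation; long runs of identical
# frames mean dropped or repeated frames.
```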

In this blog post, I’m going to focus on startup and scrolling tests. Animation tests are interesting, but they are also generally the sort of thing that is easiest to measure in synthetic ways (e.g. by putting a frame counter in your JavaScript code), and they have thus far not been a huge focus of Eideticker development.

As it turns out, it’s unfortunately been rather difficult to create truly objective tests which measure the difference between browsers in these two categories. I’ll go over them in order.

Startup tests

There are essentially two types of startup tests: one where you measure the amount of time to get to the browser’s home screen when you explicitly launch the app (e.g. by pressing the Firefox icon in the app chooser), another where you load a web page in a browser from another app (e.g. by clicking on a link in the Twitter application).

The second is actually fairly easy to test across browsers, although we are not currently doing so. There’s not really a good reason for that; it was just an oversight, so I filed bug 852744 to add something like this.

The first case (startup to the browser’s home screen) is a bit more difficult. The problem here is that, in a nutshell, an apples-to-apples comparison is very difficult if not impossible, simply because different browsers do different things when the user presses the application icon. Here’s what we see with Firefox:

And here’s what we see with Chrome:

And here’s what we see with the stock browser:

As you can see, Chrome and the stock browser do something totally different: they try to “restore” the browser to its state from the previous session (in Chrome’s case, I was last visiting taskjs.org; in the stock browser’s case, I was just on the homepage).

Personally I prefer Firefox’s behaviour (generally I want to browse somewhere new when I press the icon on my phone), but that’s really beside the point. It’s possible to hack around what Chrome is doing by restoring the profile between sessions to some sort of clean “new tab” state, but at that point you’re not really reproducing a realistic user scenario. Sure, we can draw a comparison, but how valid is it really? It seems to me that the comparison is mostly only useful in a very broad “how quickly does the user see something useful” sense.

Panning tests

I had quite a bit of hope for these initially. They seemed like a place where Eideticker could do something that conventional benchmarking suites can’t, since things like panning a web page are not presently possible to trigger from JavaScript. The main measure I tried to compare across browsers was something called “checkerboarding”, which essentially represents the amount of time the user spends waiting for the page to redraw while panning around.

At the time I wrote these tests, most browsers filled in regions that were not yet drawn while panning with the page background colour. We figured it would thus be possible to detect regions of the page that had not yet been drawn by looking for that background colour while a panning action was in progress. I therefore hacked up existing web pages to have a magenta background, then wrote some image analysis code to detect regions of that colour (on the assumption that magenta only rarely appears in real web pages). It worked pretty well.
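To give a rough idea of what that analysis boils down to, here is a simplified sketch (not the real Eideticker implementation; the exact colour tolerance is a placeholder), assuming each captured frame is an HxWx3 uint8 numpy array:

```python
import numpy as np

# Background colour of the hacked-up test pages.
MAGENTA = np.array([255, 0, 255], dtype=np.int16)

def checkerboard_fraction(frame, tolerance=30):
    """Fraction of the frame's pixels that are close to the magenta
    background, i.e. regions that have not been drawn yet."""
    delta = np.abs(frame.astype(np.int16) - MAGENTA)
    return float((delta <= tolerance).all(axis=2).mean())

# Averaging this fraction over all frames captured during a panning action
# gives a single "checkerboarding" score for the run.
```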

The world has moved on a bit since I wrote that: modern browsers like Chrome and Firefox now use something like progressive drawing to display a lower-resolution “tile” in regions that have not yet been fully rendered, so the user at least sees something resembling the actual page while panning on a slower device. To see what I mean, try visiting a slow-to-render site like taskjs.org and panning down quickly. You should see something like this:

Unfortunately, while this is certainly a better user experience, it is not so easy to detect and measure. :) For Firefox, we’ve disabled this behaviour so that we still see the old checkerboard pattern. This is useful for our internal measurements (we can see whether both our drawing code and our heuristics about when to draw are getting better or worse over time), but it only works for us.

If anyone has any suggestions on what to do here, let me know, as I’m a bit stuck. There are other metrics we could still compare (e.g. how smooth the panning animation is, i.e. frames per second), but these aren’t nearly as interesting.

5 thoughts on “Eideticker: Limitations in cross-browser performance testing”

  1. You could detect checkerboarding in other browsers by measuring when the output stops changing. In the absence of animation, that will be when the page is fully drawn. Of course that’s going to be much harder because you can’t do the magenta trick, and even more difficult to do while panning is actually taking place, but it sounds possible to me…
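    (A minimal sketch of this “output stops changing” idea, assuming decoded frames as numpy uint8 arrays; the threshold and the length of the “quiet” run are arbitrary placeholders:)

```python
import numpy as np

def first_stable_frame(frames, threshold=10, quiet_frames=5):
    """Return the index of the frame that begins a run of `quiet_frames`
    consecutive frames with no pixel changing by more than `threshold`,
    or None if the capture never settles."""
    quiet = 0
    for i, (prev, cur) in enumerate(zip(frames, frames[1:]), start=1):
        changed = np.abs(cur.astype(np.int16) - prev.astype(np.int16)).max() > threshold
        quiet = 0 if changed else quiet + 1
        if quiet >= quiet_frames:
            return i - quiet_frames + 1
    return None
```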

  2. @Jody: Yeah, the problem is that while panning we might not be able to fully draw a region until it’s completely out of view. So we have to have some kind of detection that the image of the page is not what it would be in an ideal world. Noticing that the page is out of focus is something that a human can do easily but I’m not sure how to train a computer to do the same thing. :/

    I wonder if I could convince the Chromium developers to add some hooks to change the way partially-drawn regions are painted…

  3. Would it be possible to use some image detection to detect the panning itself? i.e. compare what was on screen the previous frame to what’s on screen now to see how far the content has shifted (assuming there’s more than blank space on screen of course). That would probably need some sort of tolerance to deal with subpixel differences, not to mention when the lower resolution content changes into the high resolution content, but it might give you some information on how smoothly new content scrolls into view (and perhaps even how quickly the lower resolution preview changes into the real thing, if you could reliably detect that – perhaps by downscaling both frames then comparing them with a tolerance factor).
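    (A rough sketch of the first part of this idea, estimating how far the content shifted vertically between two greyscale frames held as 2-D numpy arrays; a real implementation would need the tolerances mentioned above:)

```python
import numpy as np

def vertical_scroll_offset(prev, cur, max_shift=200):
    """Estimate how many pixels the content moved up between two greyscale
    frames by finding the vertical shift that best aligns their row-mean
    profiles (a crude stand-in for full 2-D correlation)."""
    prev_profile = prev.mean(axis=1)
    cur_profile = cur.mean(axis=1)
    best_shift, best_err = 0, np.inf
    for shift in range(0, min(max_shift, len(prev_profile) - 1) + 1):
        a = prev_profile[shift:]
        b = cur_profile[:len(cur_profile) - shift]
        err = float(np.mean((a - b) ** 2))
        if err < best_err:
            best_shift, best_err = shift, err
    return best_shift
```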

  4. @Emanuel: I hadn’t really thought of doing some kind of downscale to detect changes from low-resolution -> high resolution content. It’s an interesting idea, albeit probably difficult to implement in practice (or to put it another way: a fun challenge!). :)

    For smoothness tests, just comparing the difference between frames seems to be a reliable enough indicator of that (more different frames generally means smoother transitions), so I’m not too concerned there.

  5. To me, the home-screen test is supposed to measure how long it takes the browser to become responsive, to accept user input. For example, how long it takes before I can tap on the address bar, but I don’t know how to translate that to an Eideticker test. On the other hand, the page loading test is supposed to measure how long it takes for me to see the page, without interaction involved. This sounds right up Eideticker’s alley, so maybe it’s better to put focus on the page loading test instead of the home-screen test.

    For detecting low resolution tiles, I suppose you can put solid color boxes in the page, and you should be able to detect how sharp the edges of the boxes are. For example, if you put the boxes along the left side of the page, even when panning, you should be able to track these boxes, knowing their size and general location. For a more general approach, you can look into image processing techniques such as Fourier transforms. Sharpness corresponds to the frequency content of the image — blurry means less high frequency content. Fourier transform turns the image into its frequency content, so in effect, you can quantitatively examine “how blurry” the picture is.
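    (A minimal sketch of this frequency-based sharpness measure for a greyscale frame held as a 2-D numpy array; the cutoff is an arbitrary placeholder:)

```python
import numpy as np

def high_frequency_ratio(gray, cutoff=0.25):
    """Fraction of the frame's spectral energy above `cutoff` of the maximum
    spatial frequency; blurrier (low-resolution) content gives a smaller
    value than fully rendered content."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalised distance of each frequency bin from the centre (DC) of the spectrum.
    r = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
    return float(spectrum[r > cutoff].sum() / spectrum.sum())
```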
