Category Archives: Mozilla

End of Q2 Eideticker update: Flame tests, future plans

[ For more information on the Eideticker software I’m referring to, see this entry ]

Just wanted to give an update on where Eideticker is at the end of Q2 2014. The big news is that we’ve started to run startup tests against the Flame, the results of which are starting to appear on the dashboard:

eideticker-contacts-flame [link]

It is expected that these tests will provide a useful complement to the existing startup tests we’re running with b2gperf, in particular answering the “is this regression real?” question.

Pending work for Q3:

  • Enable scrolling tests on the Flame. I got these working against the Hamachi a few months ago but because of some weird input issue we’re seeing we can’t yet enable them on the Flame. This is being tracked in bug 1028824. If anyone has background on the behaviour of the touch screen driver for this device I would appreciate some help. :)
  • Enable tests for multiple branches on the Flame (currently we’re only doing master). This is pretty much ready to go (bug 1017834), just need to land it.
  • Annotate eideticker graphs with internal benchmark information. Eli Perelman of the FirefoxOS performance team has come up with a standard set of on-load events for the Gaia apps (app chrome loaded, app content loaded, …) that each app will generate, feeding into tools like b2gperf and test-perf. We want to show this information in Eideticker’s frame-by-frame analysis (example) so we can verify that the app’s behaviour is consistent with what it is claimed. This is being tracked in bug 1018334
  • Re-enable Eideticker for Android and run tests more frequently. Sadly we haven’t been consistently generating new Eideticker results for Android for the last quarter because of networking issues in the new Mountain View office, where the test rig for those live. One way or another, we want to fix this next quarter and hopefully run tests more frequently against mozilla-inbound (instead of just nightly builds)

The above isn’t an exhaustive list: there’s much more that we have in mind for the future that’s not yet scheduled or defined well (e.g. get Eideticker reporting to Treeherder’s new performance module). If you have any questions or feedback on anything outlined above I’d love to hear it!

Managing test manifests: ManifestDestiny -> manifestparser

Just wanted to make a quick announcement that ManifestDestiny, the python package we use internally here at Mozilla for declaratively managing lists of tests in Mochitest and other places, has been renamed to manifestparser. We kept the versioning the same (0.6), so the only thing you should need to change in your python package dependencies is a quick substitution of “ManifestDestiny” with “manifestparser”. We will keep ManifestDestiny around indefinitely on pypi, but only to make sure old stuff doesn’t break. New versions of the software will only be released under the name “manifestparser”.

Quick history lesson: “Manifest destiny” refers to a philosophy of exceptionalism and expansionism that was widely held by American settlers in the 19th century. The concept is considered offensive by some, as it was used to justify displacing and dispossessing Native Americans. Wikipedia’s article on the subject has a good summary if you want to learn more.

Here at Mozilla Tools & Automation, we’re most interested in creating software that everyone can feel good about depending on, so we agreed to rename it. When I raised this with my peers, there were no objections. I know these things are often the source of much drama in the free software world, but there’s really none to see here.

Happy manifest parsing!

mozregression: New maintainer, issues tracked in bugzilla

Just wanted to give some quick updates on mozregression, your favorite regression-finding tool for Firefox:

  1. I moved all issue tracking in mozregression to bugzilla from github issues. Github unfortunately doesn’t really scale to handle notifications sensibly when you’re part of a large organization like Mozilla, which meant many problems were flying past me unseen. File your new bugs in bugzilla, they’re now much more likely to be acted upon.
  2. Sam Garrett has stepped up to be co-maintainer of the project with me. He’s been doing a great job whacking out a bunch of bugs and keeping things running reliably, and it was time to give him some recognition and power to keep things moving forward. :)
  3. On that note, I just released mozregression 0.17, which now shows the revision number when running a build (a request from the graphics team, bug 1007238) and handles respins of nightly builds correctly (bug 1000422). Both of these were fixed by Sam.

If you’re interested in contributing to Mozilla and are somewhat familiar with python, mozregression is a great place to start. The codebase is quite approachable and the impact will be high — as I’ve found out over the last few months, people all over the Mozilla organization (managers, developers, QA …) use it in the course of their work and it saves tons of their time. A list of currently open bugs is here.

PyCon 2014 impressions: ipython notebook is the future & more

This year’s PyCon US (Python Conference) was in my city of residence (Montréal) so I took the opportunity to go and see what was up in the world of the language I use the most at Mozilla. It was pretty great!

ipython

The highlight for me was learning about the possibilities of ipython notebooks, an absolutely fantastic interactive tool for debugging python in a live browser-based environment. I’d heard about it before, but it wasn’t immediately apparent how it would really improve things — it seemed to be just a less convenient interface to the python console that required me to futz around with my web browser. Watching a few presentations on the topic made me realize how wrong I was. It’s already changed the way I do work with Eideticker data, for the better.

Using ipython to analyze some eideticker data
Using ipython to analyze some eideticker data

I think the basic premise is really quite simple: a better interface for typing in, experimenting with, and running python code. If you stop and think about it, the modern web interface supports a much richer vocabulary of interactive concepts that the console (or even text editors like emacs): there’s no reason we shouldn’t take advantage of it.

Here are the (IMO) killer features that make it worth using:

  • The ability to immediately re-execute a block of code after editing and seeing an error (essentially merging the immediacy of the python console with the permanency / cut & pastability of an actual script)
  • Live-printing out graphs of numerical results using matplotlib. ZOMG this is so handy. Especially in conjunction with the live-editing outlined above, there’s no better tool for fine-tuning mathematical/statistical analysis.
  • The shareability of the results. Any ipython notebook can be saved and then saved to a public website. Many presentations at PyCon 2014, in fact, were done entirely with ipython notebooks. So handy for answering questions like “how did you get that”?

To learn more about how to use ipython notebooks for data analysis, I highly recommend Julie Evan’s talk Diving into Open Data with IPython Notebook & Pandas, which you can find on pyvideo.org.

Other Good Talks

I saw some other good talks at the conference, here are some of them:

  • All Your Ducks In A Row: Data Structures in the Standard Library and Beyond – A useful talk by Brandon Rhoades on the implementation of basic data structures in Python, and how to select the ones to use for optimal performance. It turns out that lists aren’t the best thing to use for long sequences of numerical data (who knew?)
  • Fast Python, Slow Python – An interesting talk by Alex Gaynor about how to write decent performing pure-python code in a single-threaded context. Lots of intelligent stuff about producing robust code that matches your intention and data structures, and caution against doing fancy things in the name of being “pythonic” or “general”.
  • Analyzing Rap Lyrics with Python – Another data analysis talk, this one about a subject I knew almost nothing about. The best part of it (for me anyway) was learning how the speaker (Julie Lavoie) narrowed her focus in her research to the exact aspects of the problem that would let her answer the question she was interested in (“Can we automatically find out which rap lyrics are the most sexist?”) as opposed to interesting problems (“how can I design the most general scraping library possible?”) that don’t answer the question. In my opinion, this ability to focus is one of the key things that seperates successful projects from unsuccessful ones.

Upcoming travels to Japan and Taiwan

Just a quick note that I’ll shortly be travelling from the frozen land of Montreal, Canada to Japan and Taiwan over the next week, with no particular agenda other than to explore and meet people. If any Mozillians are interested in meeting up for food or drink, and discussion of FirefoxOS performance, Eideticker, entropy or anything else… feel free to contact me at wrlach@gmail.com.

Exact itinerary:

  • Thu Mar 20 – Sat Mar 22: Tokyo, Japan
  • Sat Mar 22 – Tue Mar 25: Kyoto, Japan
  • Tue Mar 25 – Thu Mar 27: Tokyo, Japan
  • Thu Mar 27 – Sun Mar 30: Taipei, Taiwan

I will also be in Taipei the week of the March 31st, though I expect most of my time to be occupied with discussions/activities inside the Taipei office about FirefoxOS performance matters (the Firefox performance team is having a work week there, and I’m tagging along to talk about / hack on Eideticker and other automation stuff).

It’s all about the entropy

[ For more information on the Eideticker software I’m referring to, see this entry ]

So recently I’ve been exploring new and different methods of measuring things that we care about on FirefoxOS — like startup time or amount of checkerboarding. With Android, where we have a mostly clean signal, these measurements were pretty straightforward. Want to measure startup times? Just capture a video of Firefox starting, then compare the frames pixel by pixel to see how much they differ. When the pixels aren’t that different anymore, we’re “done”. Likewise, to measure checkerboarding we just calculated the areas of the screen where things were not completely drawn yet, frame-by-frame.

On FirefoxOS, where we’re using a camera to measure these things, it has not been so simple. I’ve already discussed this with respect to startup time in a previous post. One of the ideas I talk about there is “entropy” (or the amount of unique information in the frame). It turns out that this is a pretty deep concept, and is useful for even more things than I thought of at the time. Since this is probably a concept that people are going to be thinking/talking about for a while, it’s worth going into a little more detail about the math behind it.

The wikipedia article on information theoretic entropy is a pretty good introduction. You should read it. It all boils down to this formula:

wikipedia-entropy-formula

You can see this section of the wikipedia article (and the various articles that it links to) if you want to break down where that comes from, but the short answer is that given a set of random samples, the more different values there are, the higher the entropy will be. Look at it from a probabilistic point of view: if you take a random set of data and want to make predictions on what future data will look like. If it is highly random, it will be harder to predict what comes next. Conversely, if it is more uniform it is easier to predict what form it will take.

Another, possibly more accessible way of thinking about the entropy of a given set of data would be “how well would it compress?”. For example, a bitmap image with nothing but black in it could compress very well as there’s essentially only 1 piece of unique information in it repeated many times — the black pixel. On the other hand, a bitmap image of completely randomly generated pixels would probably compress very badly, as almost every pixel represents several dimensions of unique information. For all the statistics terminology, etc. that’s all the above formula is trying to say.

So we have a model of entropy, now what? For Eideticker, the question is — how can we break the frame data we’re gathering down into a form that’s amenable to this kind of analysis? The approach I took (on the recommendation of this article) was to create a histogram with 256 bins (representing the number of distinct possibilities in a black & white capture) out of all the pixels in the frame, then run the formula over that. The exact function I wound up using looks like this:


def _get_frame_entropy((i, capture, sobelized)):
    frame = capture.get_frame(i, True).astype('float')
    if sobelized:
        frame = ndimage.median_filter(frame, 3)

        dx = ndimage.sobel(frame, 0)  # horizontal derivative
        dy = ndimage.sobel(frame, 1)  # vertical derivative
        frame = numpy.hypot(dx, dy)  # magnitude
        frame *= 255.0 / numpy.max(frame)  # normalize (Q&D)

    histogram = numpy.histogram(frame, bins=256)[0]
    histogram_length = sum(histogram)
    samples_probability = [float(h) / histogram_length for h in histogram]
    entropy = -sum([p * math.log(p, 2) for p in samples_probability if p != 0])

    return entropy

[Context]

The “sobelized” bit allows us to optionally convolve the frame with a sobel filter before running the entropy calculation, which removes most of the data in the capture except for the edges. This is especially useful for FirefoxOS, where the signal has quite a bit of random noise from ambient lighting that artificially inflate the entropy values even in places where there is little actual “information”.

This type of transformation often reveals very interesting information about what’s going on in an eideticker test. For example, take this video of the user panning down in the contacts app:

If you graph the entropies of the frame of the capture using the formula above you, you get a graph like this:

contacts scrolling entropy graph
[Link to original]

The Y axis represents entropy, as calculated by the code above. There is no inherently “right” value for this — it all depends on the application you’re testing and what you expect to see displayed on the screen. In general though, higher values are better as it indicates more frames of the capture are “complete”.

The region at the beginning where it is at about 5.0 represents the contacts app with a set of contacts fully displayed (at startup). The “flat” regions where the entropy is at roughly 4.25? Those are the areas where the app is “checkerboarding” (blanking out waiting for graphics or layout engine to draw contact information). Click through to the original and swipe over the graph to see what I mean.

It’s easy to see what a hypothetical ideal end state would be for this capture: a graph with a smooth entropy of about 5.0 (similar to the start state, where all contacts are fully drawn in). We can track our progress towards this goal (or our deviation from it), by watching the eideticker b2g dashboard and seeing if the summation of the entropy values for frames over the entire test increases or decreases over time. If we see it generally increase, that probably means we’re seeing less checkerboarding in the capture. If we see it decrease, that might mean we’re now seeing checkerboarding where we weren’t before.

It’s too early to say for sure, but over the past few days the trend has been positive:

entropy-levels-climbing
[Link to original]

(note that there were some problems in the way the tests were being run before, so results before the 12th should not be considered valid)

So one concept, at least two relevant metrics we can measure with it (startup time and checkerboarding). Are there any more? Almost certainly, let’s find them!

Eideticker for FirefoxOS: Becoming more useful

[ For more information on the Eideticker software I’m referring to, see this entry ]

Time for a long overdue eideticker-for-firefoxos update. Last time we were here (almost 5 months ago! man time flies), I was discussing methodologies for measuring startup performance. Since then, Dave Hunt and myself have been doing lots of work to make Eideticker more robust and useful. Notably, we now have a setup in London running a suite of Eideticker tests on the latest version of FirefoxOS on the Inari on a daily basis, reporting to http://eideticker.mozilla.org/b2g.

b2g-contacts-startup-dashboard

There were more than a few false starts with and some of the earlier data is not to be entirely trusted… but it now seems to be chugging along nicely, hopefully providing startup numbers that provide a useful counterpoint to the datazilla startup numbers we’ve already been collecting for some time. There still seem to be some minor problems, but in general I am becoming more and more confident in it as time goes on.

One feature that I am particularly proud of is the detail view, which enables you to see frame-by-frame what’s going on. Click on any datapoint on the graph, then open up the view that gives an account of what eideticker is measuring. Hover over the graph and you can see what the video looks like at any point in the capture. This not only lets you know that something regressed, but how. For example, in the messages app, you can scan through this view to see exactly when the first message shows up, and what exact state the application is in when Eideticker says it’s “done loading”.

Capture Detail View
[link to original]

(apologies for the low quality of the video — should be fixed with this bug next week)

As it turns out, this view has also proven to be particularly useful when working with the new entropy measurements in Eideticker which I’ve been using to measure checkerboarding (redraw delay) on FirefoxOS. More on that next week.

mozregression now supports inbound builds

Just wanted to send out a quick note that I recently added inbound support to mozregression for desktop builds of Firefox on Windows, Mac, and Linux.

For the uninitiated, mozregression is an automated tool that lets you bisect through builds of Firefox to find out when a problem was introduced. You give it the last known good date, the last known bad date and off it will go, automatically pulling down builds to test. After each iteration, it will ask you whether this build was good or bad, update the regression range accordingly, and then the cycle repeats until there are no more intermediate builds.

Previously, it would only use nightlies which meant a one day granularity — this meant pretty wide regression ranges, made wider in the last year by the fact that so much more is now going into the tree over the course of the day. However, with inbound support (using the new inbound archive) we now have the potential to get a much tighter range, which should be super helpful for developers. Best of all, mozregression doesn’t require any particularly advanced skills to use which means everyone in the Mozilla community can help out.

For anyone interested, there’s quite a bit of scope to improve mozregression to make it do more things (FirefoxOS support, easier installation…). Feel free to check out the repository, the issues list (I just added an easy one which would make a great first bug) and ask questions on irc.mozilla.org#ateam!

Automatically measuring startup / load time with Eideticker

So we’ve been using Eideticker to automatically measure startup/pageload times for about a year now on Android, and more recently on FirefoxOS as well (albeit not automatically). This gives us nice and pretty graphs like this:

flot-startup-times-gn

Ok, so we’re generating numbers and graphing them. That’s great. But what’s really going on behind the scenes? I’m glad you asked. The story is a bit different depending on which platform you’re talking about.

Android

On Android we connect Eideticker to the device’s HDMI out, so we count on a nearly pixel-perfect signal. In practice, it isn’t quite, but it is within a few RGB values that we can easily filter for. This lets us come up with a pretty good mechanism for determining when a page load or app startup is finished: just compare frames, and say we’ve “stopped” when the pixel differences between frames are negligible (previously defined at 2048 pixels, now 4096 — see below). Eideticker’s new frame difference view lets us see how this works. Look at this graph of application startup:

frame-difference-android-startup
[Link to original]

What’s going on here? Well, we see some huge jumps in the beginning. This represents the animated transitions that Android makes as we transition from the SUTAgent application (don’t ask) to the beginnings of the FirefoxOS browser chrome. You’ll notice though that there’s some more changes that come in around the 3 second mark. This is when the site bookmarks are fully loaded. If you load the original page (link above) and swipe your mouse over the graph, you can see what’s going on for yourself. :)

This approach is not completely without problems. It turns out that there is sometimes some minor churn in the display even when the app is for all intents and purposes started. For example, sometimes the scrollbar fading out of view can result in a significantish pixel value change, so I recently upped the threshold of pixels that are different from 2048 to 4096. We also recently encountered a silly problem with a random automation app displaying “toasts” which caused results to artificially spike. More tweaking may still be required. However, on the whole I’m pretty happy with this solution. It gives useful, undeniably objective results whose meaning is easy to understand.

FirefoxOS

So as mentioned previously, we use a camera on FirefoxOS to record output instead of HDMI output. Pretty unsurprisingly, this is much noisier. See this movie of the contacts app starting and note all the random lighting changes, for example:

My experience has been that pixel differences can be so great between visually identical frames on an eideticker capture on these devices that it’s pretty much impossible to settle on when startup is done using the frame difference method. It’s of course possible to detect very large scale changes, but the small scale ones (like the contacts actually appearing in the example above) are very hard to distinguish from random differences in the amount of light absorbed by the camera sensor. Tricks like using median filtering (a.k.a. “blurring”) help a bit, but not much. Take a look at this graph, for example:

plotly-contacts-load-pixeldiff
[Link to original]

You’ll note that the pixel differences during “static” parts of the capture are highly variable. This is because the pixel difference depends heavily on how “bright” each frame is: parts of the capture which are black (e.g. a contacts icon with a black background) have a much lower difference between them than parts that are bright (e.g. the contacts screen fully loaded).

After a day or so of experimenting and research, I settled on an approach which seems to work pretty reliably. Instead of comparing the frames directly, I measure the entropy of the histogram of colours used in each frame (essentially just an indication of brightness in this case, see this article for more on calculating it), then compare that of each frame with the average of the same measure over 5 previous frames (to account for the fact that two frames may be arbitrarily different, but that is unlikely that a sequence of frames will be). This seems to work much better than frame difference in this environment: although there are plenty of minute differences in light absorption in a capture from this camera, the overall color composition stays mostly the same. See this graph:

plotly-contacts-load-entropy
[Link to original]

If you look closely, you can see some minor variance in the entropy differences depending on the state of the screen, but it’s not nearly as pronounced as before. In practice, I’ve been able to get extremely consistent numbers with a reasonable “threshold” of “0.05”.

In Eideticker I’ve tried to steer away from using really complicated math or algorithms to measure things, unless all the alternatives fail. In that sense, I really liked the simplicity of “pixel differences” and am not thrilled about having to resort to this: hopefully the concepts in this case (histograms and entropy) are simple enough that most people will be able to understand my methodology, if they care to. Likely I will need to come up with something else for measuring responsiveness and animation smoothness (frames per second), as likely we can’t count on light composition changing the same way for those cases. My initial thought was to use edge detection (which, while somewhat complex to calculate, is at least easy to understand conceptually) but am open to other ideas.

First Eideticker Responsiveness Tests

[ For more information on the Eideticker software I’m referring to, see this entry ]

Time for another update on Eideticker. In the last quarter, I’ve been working on two main items:

  1. Responsiveness tests (Android / FirefoxOS)
  2. Eideticker for FirefoxOS

The focus of this post is the responsiveness work. I’ll talk about Eideticker for FirefoxOS soon. :)

So what do I mean by responsiveness? At a high-level, I mean how quickly one sees a response after performing an action on the device. For example, if I perform a swipe gesture to scroll the content down while browsing CNN.com, how long does it take after
I start the gesture for the content to visibly scroll down? If you break it down, there’s a multi-step process that happens behind the scenes after a user action like this:

input-events

If anywhere in the steps above, there is a significant delay, the user experience is likely to be bad. Usability research
suggests that any lag that is consistently above 100 milliseconds will lead the user to perceive things as being laggy. To keep our users happy, we need to do our bit to make sure that we respond quickly at all levels that we control (just the application layer on Android, but pretty much everything on FirefoxOS). Even if we can’t complete the work required on our end to completely respond to the user’s desire, we should at least display something to acknowledge that things have changed.

But you can’t improve what you can’t measure. Fortunately, we have the means to do calculate of the time delta between most of the steps above. I learned from Taras Glek this weekend that it should be possible to simulate the actual capacitative touch event on a modern touch screen. We can recognize when the hardware event is available to be consumed by userspace by monitoring the `/dev/input` subsystem. And once the event reaches the application (the Android or FirefoxOS application) there’s no reason we can’t add instrumentation in all sorts of places to track the processing of both the event and the rendering of the response.

My working hypothesis is that it’s application-level latency (i.e. the time between the application receiving the event and being able to act on it) that dominates, so that’s what I decided to measure. This is purely based on intuition and by no means proven, so we should test this (it would certainly be an interesting exercise!). However, even if it turns out that there are significant problems here, we still care about the other bits of the stack — there’s lots of potentially-latency-introducing churn there and the risk of regression in our own code is probably higher than it is elsewhere since it changes so much.

Last year, I wrote up a tool called Orangutan that can directly inject input events into an input device on Android or FirefoxOS. It seemed like a fairly straightforward extension of the tool to output timestamps when these events were registered. It was. Then, by synchronizing the time between the device and the machine doing the capturing, we can then correlate the input timestamps to events. To help visualize what’s going on, I generated this view:

taskjs-framediff-view

[Link to original]

The X axis in that graph represents time. The Y-axis represents the difference between the frame at that time with the previous one in number of pixels. The “red” represents periods in capture when events are ongoing (we use different colours only to
distinguish distinct events). [1]

For a first pass at measuring responsiveness, I decided to measure the time between the first event being initiated and there being a significant frame difference (i.e. an observable response to the action). You can see some preliminary results on the eideticker dashboard:

taskjs-responsiveness

[Link to original]

The results seem pretty highly variable at first because I was synchronizing time between the device and an external ntp server, rather than the host machine. I believe this is now fixed, hopefully giving us results that will indicate when regressions occur. As time goes by, we may want to craft some special eideticker tests for responsiveness specifically (e.g. a site where there is heavy javascript background processing).

[1] Incidentally, these “frame difference” graphs are also quite useful for understanding where and how application startup has regressed in Fennec — try opening these two startup views side-by-side (before/after a large regression) and spot the difference: [1] and [2])