Category Archives: Mozilla

More Perfherder updates

Since my last update, we’ve been trucking along with improvements to Perfherder, the project for making Firefox performance sheriffing and analysis easier.

Compare visualization improvements

I’ve been spending quite a bit of time trying to fix up the display of information in the compare view, to address feedback from developers and hopefully generally streamline things. Vladan (from the perf team) referred me to Blake Winton, who provided tons of awesome suggestions on how to present things more concisely.

Here’s an old versus new picture:

[Screenshots: compare view, old (left) vs. new (right)]

Summary of significant changes in this view:

  • Removed or consolidated several types of numerical information that were overwhelming or confusing (e.g. presenting both the absolute and percentage standard deviation in their own columns).
  • Added tooltips all over the place to explain what’s being displayed.
  • Highlighted more strongly when it appears there aren’t enough runs to make a definitive determination on whether there was a regression or improvement.
  • Improved the visual indicator of the magnitude of a regression or improvement (providing a pseudo-scale showing where the change falls in a range from 0% to 20%+).
  • Provided more detail on the two changesets being compared in the header and made it easier to retrigger them (thanks to Mike Ling).
  • Much better and more intuitive error handling when something goes wrong (also thanks to Mike Ling).

The point of these changes isn’t necessarily to make everything “immediately obvious” to people. We’re not building general purpose software here: Perfherder will always be a rather specialized tool which presumes significant domain knowledge on the part of the people using it. However, even for our audience, it turns out that there’s a lot of room to improve our presentation: reducing the amount of extraneous noise helps people zero in on the things they really need to care about.

Special thanks to everyone who took time out of their schedules to provide so much good feedback, in particular Avi Halmachi, Glandium, and Joel Maher.

Of course more suggestions are always welcome. Please give it a try and file bugs against the perfherder component if you find anything you’d like to see changed or improved.

Getting the word out

Hammersmith:mozilla-central wlach$ hg push -f try
pushing to ssh://hg.mozilla.org/try
no revisions specified to push; using . to avoid pushing multiple heads
searching for changes
remote: waiting for lock on repository /repo/hg/mozilla/try held by 'hgssh1.dmz.scl3.mozilla.com:8270'
remote: got lock after 4 seconds
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: Trying to insert into pushlog.
remote: Inserted into the pushlog db successfully.
remote:
remote: View your change here:
remote: https://hg.mozilla.org/try/rev/e0aa56fb4ace
remote:
remote: Follow the progress of your build on Treeherder:
remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e0aa56fb4ace
remote:
remote: It looks like this try push has talos jobs. Compare performance against a baseline revision:
remote: https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=e0aa56fb4ace

Try pushes incorporating Talos jobs now automatically link to Perfherder’s compare view, both in the output from mercurial and in the emails the system sends. One of the challenges we’ve been facing up to this point is just letting developers know that Perfherder exists and that it can help them either avoid or resolve performance regressions. I believe this will help.

Data quality and ingestion improvements

Over the past couple weeks, we’ve been comparing our regression detection code when run against Graphserver data to Perfherder data. In doing so, we discovered that we’ve sometimes been using the wrong algorithm (geometric mean) to summarize some of our tests, leading to unexpected and less meaningful results. For example, the v8_7 benchmark uses a custom weighting algorithm for its score, to account for the fact that the things it tests have a particular range of expected values.
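To make the distinction concrete, here’s a minimal sketch, in the spirit of (but not taken from) the actual Talos/Perfherder code, contrasting a geometric mean summary with a suite-specific weighted summary. The subtest values and weights below are invented for illustration.

from math import exp, log

# Summarize subtest results as a geometric mean: reasonable for suites whose
# subtests are on comparable scales, wrong for suites like v8_7 that define
# their own weighting.
def geometric_mean(values):
    return exp(sum(log(v) for v in values) / len(values))

# Summarize subtest results with per-subtest weights, for suites whose
# subtests have very different expected ranges.
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

subtests = [148.0, 12500.0, 970.0]  # hypothetical raw subtest scores
weights = [1.0, 0.01, 0.1]          # hypothetical per-subtest weights

print(geometric_mean(subtests))          # treats every subtest as equally important
print(weighted_mean(subtests, weights))  # respects the suite's intended weighting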

To hopefully prevent this from happening again in the future, we’ve decided to move the test summarization code out of Perfherder back into Talos (bug 1184966). This has the additional benefit of creating a stronger connection between the content of the Talos logs and what Perfherder displays in its comparison and graph views, which has thrown people off in the past.

Continuing data challenges

Having better tools for visualizing this stuff is great, but it also highlights some continuing problems we’ve had with data quality. It turns out that our automation setup often produces qualitatively different performance results for the exact same set of data, depending on when and how the tests are run.

A certain amount of random noise is always expected when running performance tests. As much as we might try to make them uniform, our testing machines and environments are just not 100% identical. That we expect and can deal with: our standard approach is just to retrigger runs, to make sure we get a representative sample of data from our population of machines.

The problem comes when there’s a pattern to the noise: we’ve already noticed that tests run on the weekends produce different results (see Joel’s post from a year ago, “A case of the weekends”), but it seems as if there are other circumstances where one set of results will be different from another, depending on the time that each set of tests was run. Some tests and platforms (e.g. the a11yr suite, MacOS X 10.10) seem particularly susceptible to this issue.

We need to find better ways of dealing with this problem, as it can result in a lot of wasted time and energy, for both sheriffs and developers. See for example bug 1190877, which concerned a completely spurious regression on the tresize benchmark that was initially blamed on some changes to the media code. In this case, Joel speculates that the linux64 test machines we use might have changed from under us in some way, but we really don’t know yet.

I see two approaches possible here:

  1. Figure out what’s causing the same machines to produce qualitatively different result distributions and address that. This is of course the ideal solution, but it requires coordination with other parts of the organization that are likely quite busy, so it might be hard to pull off.
  2. Figure out better ways of detecting and managing these sorts of cases. I have noticed that the standard deviation inside the results tends to be higher when we have spurious regressions/improvements (see for example this compare view for the aforementioned “regression”). Knowing what we do, maybe there are some statistical methods we can use to detect bad data? A rough sketch of what that might look like follows this list.
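To sketch option (2): flag a set of replicates whose spread is much larger than what we typically see for that test, so a sheriff knows to retrigger or distrust the comparison. The replicate values, the typical coefficient of variation, and the threshold below are all invented for illustration; nothing like this is implemented in Perfherder yet.

import statistics

# Return True if the spread of this set of replicates (as a coefficient of
# variation) is much larger than the test's typical spread.
def looks_noisy(replicates, typical_cv, factor=2.0):
    cv = statistics.stdev(replicates) / statistics.mean(replicates)
    return cv > factor * typical_cv

runs = [612.0, 688.0, 590.0, 705.0, 601.0]  # hypothetical a11yr replicates
print(looks_noisy(runs, typical_cv=0.02))   # True: spread is suspiciously large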

For now, I’m leaning towards (2). I don’t think we’ll ever completely solve this problem and I think coming up with better approaches to understanding and managing it will pay the largest dividends. Open to other opinions of course!

Perfherder update

Haven’t been doing enough blogging about Perfherder (our project to make Talos and other per-checkin performance data more useful) recently. Let’s fix that. We’ve been making some good progress, helped in part by a group of new contributors that joined us through an experimental “summer of contribution” program.

Comparison mode

Inspired by Compare Talos, we’ve designed something similar which hooks into the perfherder backend. This has already gotten some interest: see this post on dev.tree-management and this one on dev.platform. We’re working towards building something that will be really useful both for (1) illustrating that the performance regressions we detect are real and (2) helping developers figure out the impact of their changes before they land them.

[Screenshots: comparison mode]

Most of the initial work was done by Joel Maher with lots of review for aesthetics and correctness by me. Avi Halmachi from the Performance Team also helped out with the t-test model for detecting the confidence that we have that a difference in performance was real. Lately Mike Ling (one of our summer of contribution members) and I have been working on further improving the interface for usability — I’m hopeful that we’ll soon have something implemented that’s broadly usable and comprehensible to the Mozilla Firefox and Platform developer community.
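For the curious, the confidence model is based on a t-test. Here’s a rough illustration of the idea using scipy’s Welch’s t-test on two invented sets of replicates; this is not Perfherder’s actual code.

from scipy import stats

base_runs = [251.3, 249.8, 252.1, 250.7, 251.0]  # hypothetical replicates from the base push
new_runs = [258.9, 257.4, 259.3, 258.1, 260.0]   # hypothetical replicates from the new push

# Welch's t-test: does the difference between the two samples look real,
# or could it plausibly be noise?
t_stat, p_value = stats.ttest_ind(base_runs, new_runs, equal_var=False)

base_avg = sum(base_runs) / len(base_runs)
new_avg = sum(new_runs) / len(new_runs)
print("change: %+.1f%%, p=%.4f" % ((new_avg / base_avg - 1) * 100, p_value))
# A small p-value means the difference is unlikely to be noise alone, though
# with only a handful of replicates it still pays to retrigger more runs.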

Graphs improvements

Although it’s received slightly less attention lately than the comparison view above, we’ve been making steady progress on the graphs view of performance series. Aside from demonstrations and presentations, the primary use case for this is visually detecting sustained changes in the result distribution for Talos tests, which is often necessary to confirm regressions. Notable recent changes include a much easier way of selecting tests to add to the graph from Mike Ling and more readable and parseable URLs from Akhilesh Pillai (another summer of contribution participant).

[Screenshot: graphs view]

Performance alerts

I’ve also been steadily working on making Perfherder generate alerts when there is a significant discontinuity in the performance numbers, similar to what GraphServer does now. Currently we have an option to generate a static CSV file of these alerts, but the eventual plan is to insert them into a persistent database. After that’s done, we can actually work on creating a UI inside Perfherder to replace alertmanager (which currently uses GraphServer data) and start using this thing to sheriff performance regressions — putting the herder into perfherder.
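The core idea behind alert generation is just to look for a sustained step in a performance series. Here’s a toy illustration, not the actual alert-generation library: a real detector also needs to account for variance (e.g. with a t-test) and coalesce adjacent detections into a single alert.

import statistics

# Flag indices where the mean of the window after a point differs from the
# mean of the window before it by more than threshold_pct percent.
def find_discontinuities(values, window=12, threshold_pct=2.0):
    alerts = []
    for i in range(window, len(values) - window):
        before = statistics.mean(values[i - window:i])
        after = statistics.mean(values[i:i + window])
        change_pct = (after - before) / before * 100
        if abs(change_pct) > threshold_pct:
            alerts.append((i, change_pct))
    return alerts

series = [400.0] * 30 + [420.0] * 30  # synthetic series with a 5% step at index 30
print(find_discontinuities(series))   # flags a run of indices around the step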

As part of this, I’ve converted the graphserver alert generation code into a standalone python library, which has already proven useful as a component in the Raptor project for FirefoxOS. Yay modularity and reusability.

Python API

I’ve also been working on creating and improving a python API to access Treeherder data, which includes Perfherder. This lets you do interesting things, like dynamically run various types of statistical analysis on the data stored in the production instance of Perfherder (no need to ask me for a database dump or other credentials). I’ve been using this to perform validation of the data we’re storing and debug various tricky problems. For example, I found out last week that we were storing up to 200 duplicate entries in each performance series due to double data ingestion — oops.
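As an illustration, the validation that turned up those duplicates only needs a few lines once a series has been fetched. This is a sketch rather than the actual script: it assumes the series is already in hand as a list of dicts, and the field names used here are assumptions for illustration.

from collections import Counter

# Count datapoints that appear more than once in a performance series.
# "result_set_id" and "value" are assumed field names for this sketch.
def find_duplicates(series):
    counts = Counter((point["result_set_id"], point["value"]) for point in series)
    return {key: n for key, n in counts.items() if n > 1}

series = [
    {"result_set_id": 101, "value": 412.5},
    {"result_set_id": 101, "value": 412.5},  # ingested twice
    {"result_set_id": 102, "value": 408.1},
]
print(find_duplicates(series))  # {(101, 412.5): 2}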

You can also use this API to dynamically create interesting graphs and visualizations using ipython notebook. Here’s a simple example of me plotting the last 7 days of youtube.com pageload data inline in a notebook:

[Screenshot: ipython notebook plotting youtube.com pageload data inline]
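For reference, the notebook cell behind a plot like this is only a few lines once the data is in hand. A minimal sketch with invented numbers standing in for the fetched series (the actual fetching via the Treeherder API is omitted):

import datetime
import matplotlib.pyplot as plt

# Invented stand-ins for the last 7 days of youtube.com pageload results.
timestamps = [datetime.datetime(2015, 7, day, 12) for day in range(8, 15)]
load_times = [410.2, 405.7, 433.1, 409.8, 401.5, 399.9, 404.3]  # milliseconds

plt.plot(timestamps, load_times, marker="o")
plt.title("youtube.com pageload (hypothetical data)")
plt.ylabel("load time (ms)")
plt.gcf().autofmt_xdate()  # tilt the date labels so they don't overlap
plt.show()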


PyCon 2015

So I went to PyCon 2015. While I didn’t leave quite as inspired as I did in 2014 (when I discovered iPython), it was a great experience and I learned a ton. Once again, I was incredibly impressed with the organization of the conference and the diversity and quality of the speakers.

Since Mozilla was nice enough to sponsor my attendance, I figured I should do another round up of notable talks that I went to.

Technical stuff that was directly relevant to what I work on:

  • To ORM or not to ORM (Christine Spang): Useful talk on when using a database ORM (object-relational mapper) can be helpful and even faster than using a database directly. I feel like there’s a lot of misinformation and FUD on this topic, so this was refreshing to see. video slides
  • Debugging hard problems (Alex Gaynor): Exactly what it says — how to figure out what’s going on when things aren’t behaving as they should. Great advice and wisdom in this one (hint: take nothing for granted, dive into the source of everything you’re using!). video slides
  • Python Performance Profiling: The Guts And The Glory (Jesse Jiryu Davis): Quite an entertaining talk on how to properly profile python code. I really liked his systematic and realistic approach — which also discussed the thought process behind how to do this (hint: again it comes down to understanding what’s really going on). Unfortunately the video is truncated, but even the first few minutes are useful. video

Non-technical stuff:

  • The Ethical Consequences Of Our Collective Activities (Glyph): A talk on the ethical implications of how our software is used. I feel like this is an under-discussed topic — how can we know that the results of our activity (programming) serves others and does not harm? video
  • How our engineering environments are killing diversity (and how we can fix it) (Kate Heddleston): This was a great talk on how to make the environments in which we develop more welcoming to under-represented groups (women, minorities, etc.). This is something I’ve been thinking a bunch about lately, especially in the context of expanding the community of people working on our projects in Automation & Tools. The talk had some particularly useful advice (to me, anyway) on giving feedback. video slides

I probably missed out on a bunch of interesting things. If you also went to PyCon, please feel free to add links to your favorite talks in the comments!

Perfherder update: Summary series drilldown

Just wanted to give another quick Perfherder update. Since the last time, I’ve added summary series (which is what GraphServer shows you), so we now have (in theory) the best of both worlds when it comes to Talos data: aggregate summaries of the various suites we run (tp5, tart, etc.), with the ability to dig into individual results as needed. This kind of analysis wasn’t possible with Graphserver, and I’m hopeful it will help us track down the root causes of Talos regressions more effectively.

Let’s give an example of where this might be useful by showing how it can highlight problems. Recently we tracked a regression in the Customization Animation Tests (CART) suite from the commit in bug 1128354. Using Mishra Vikas’s new “highlight revision mode” in Perfherder (combined with the hash of the revision that was pushed to inbound), we can quickly zero in on its location:

[Screenshot: CART graph with the suspect revision highlighted]

It does indeed look like things ticked up after this commit for the CART suite, but why? By clicking on the datapoint, you can open up a subtest summary view beneath the graph:

[Screenshot: subtest summary view beneath the graph]

We see here that it looks like the 3-customize-enter-css.all.TART entry ticked up a bunch. The related test 3-customize-enter-css.half.TART ticked up a bit too. The changes elsewhere look minimal. But is that a trend that holds across the data over time? We can add some of the relevant subtests to the overall graph view to get a closer look:

[Screenshot: relevant subtests added to the graph view]

As is hopefully obvious, this confirms that the affected subtest continues to hold its higher value while another test just bounces around more or less in the range it was in before.

Hope people find this useful! If you want to play with this yourself, you can access the perfherder UI at http://treeherder.mozilla.org/perf.html.

Measuring e10s vs. non-e10s performance with Perfherder

For the past few months I’ve been working on a sub-project of Treeherder called Perfherder, which aims to provide a workflow that will let us more easily detect and manage performance regressions in our products (initially just those detected in Talos, but there’s room to expand on that later). This is a long term project and we’re still sorting out the details of exactly how it will work, but I thought I’d quickly announce a milestone.

As a first step, I’ve been hacking on a graphical user interface to visualize the performance data we’re now storing inside Treeherder. It’s pretty bare bones so far, but already it has two features which graphserver doesn’t: the ability to view sub-test results (i.e. the page load time for a specific page in the tp5 suite, as opposed to the geometric mean of all of them) and the ability to see results for e10s builds.

Here’s an example, comparing the tp5o 163.com page load times on Windows 7 with e10s enabled (and not):

[Graph: tp5o 163.com page load times, e10s vs. non-e10s]

Green is e10s, red is non-e10s (the legend picture doesn’t reflect this because we have yet to deploy a fix to bug 1130554, but I promise I’m not lying). As you can see, the gap has been closing (in particular, something landed in mid-January that improved the e10s numbers quite a bit), but page load times are still measurably slower with this feature enabled.

mozregression updates

Lots of movement in mozregression (a tool for automatically determining when a regression was introduced in Firefox by bisecting builds on ftp.mozilla.org) in the last few months. Here’s some highlights:

  • Support for win64 nightly and inbound builds (Kapil Singh, Vaibhav Agarwal)
  • Support for using an http cache to reduce time spent downloading builds (Sam Garrett)
  • Way better logging and printing of remaining time to finish bisection (Julien Pagès)
  • Much improved performance when bisecting inbound (Julien)
  • Support for automatically determining whether a build is good/bad via a custom script (Julien)
  • Tons of bug fixes and other robustness improvements (me, Sam, Julien, others…)

Also thanks to Julien, we have a spiffy new website which documents many of these features. If it’s been a while, be sure to update your copy of mozregression to the latest version and check out the site for documentation on how to use the new features described above!

Thanks to everyone involved (especially Julien) for all the hard work. Hopefully the payoff will be a tool that’s just that much more useful to Firefox contributors everywhere. :)

Using Flexbox in web applications

Over the last few months, I discovered the joy that is CSS Flexbox, which solves the “how do I lay out this set of divs horizontally or vertically?” problem. I’ve used it in three projects so far:

  • Centering the timer interface in my meditation app, so that it scales nicely from a 320×480 FirefoxOS device all the way up to a high definition monitor
  • Laying out the chart / sidebar elements in the Eideticker dashboard so that maximum horizontal space is used
  • Fixing various problems in the Treeherder UI on smaller screens (see bug 1043474 and its dependent bugs)

When I talk to people about their troubles with CSS, layout comes up really high on the list. Historically, basic layout problems like a panel of vertical buttons have been ridiculously difficult, requiring hacks with floating divs and absolute positioning, or JavaScript layout libraries. This is why people write articles entitled “Give up and use tables”.

Flexbox has pretty much put an end to these problems for me. There’s no longer any need to “give up and use tables” because using flexbox is pretty much just *like* using tables for layout, just with more uniform and predictable behaviour. :) It’s so great. I think we’re pretty close to Flexbox being supported across all the major browsers, so it’s fair to start using it for custom web applications where compatibility with (e.g.) IE8 is not an issue.

To try and spread the word, I wrote up a howto article on using flexbox for web applications on MDN, covering some of the common use cases I mention above. If you’ve been curious about flexbox but unsure how to use it, please have a look.

mozregression 0.24

I just released mozregression 0.24. This would be a good time to note some of the user-visible fixes / additions that have gone in recently:

  1. Thanks to Sam Garrett, you can now specify a branch other than inbound to get finer-grained regression ranges from. E.g. if you’re pretty sure a regression occurred on fx-team, you can do something like:

    mozregression --inbound-branch fx-team -g 2014-09-13 -b 2014-09-14

  2. Fixed a bug where we could get an incorrect regression range (bug 1059856). Unfortunately the underlying issue is still unresolved (it’s a bit tricky to match mozilla-central commits to those of other branches), but I think this most recent fix should make things work in 99.9% of cases. Let me know if I’m wrong.
  3. Thanks to Julien Pagès, we now download the inbound build metadata in parallel, which speeds up inbound bisection quite significantly.

If you know a bit of python, contributing to mozregression is a great way to have a high impact on Mozilla. Many platform developers use this project in their day-to-day work, but there’s still lots of room for improvement.

Hacking on the Treeherder front end: refreshingly easy

Over the past two weeks, I’ve been working a bit on the Treeherder front end (our interface for managing build and test jobs from mercurial changesets), trying to help get things in shape so that the sheriffs can feel comfortable transitioning to it from tbpl by the end of the quarter.

One thing that has pleasantly surprised me is just how easy it’s been to get going and be productive. The process looks like this on Linux or Mac:


git clone https://github.com/mozilla/treeherder-ui.git
cd treeherder-ui/webapp
./scripts/web-server.js

Then just load http://localhost:8000 in your favorite web browser (Firefox) and you should be good to go (it will load data from the actual treeherder site). If you want to make modifications to the HTML, JavaScript, or CSS, just go ahead and do so with your favorite editor and the changes will be immediately reflected.

We have a fair backlog of issues to get through, many of them related to the front end. If you’re interested in helping out, please have a look:

https://wiki.mozilla.org/Auto-tools/Projects/Treeherder#Bugs_.26_Project_Tracking

If nothing jumps out at you, please drop by irc.mozilla.org #treeherder and we can probably find something for you to work on. We’re most active during Pacific Time working hours.

A new meditation app

I had some time on my hands two weekends ago and was feeling a bit of an itch to build something, so I decided to do a project I’ve had in the back of my head for a while: a meditation timer.

If you’ve been following this log, you’ll know that meditation has been a pretty major interest of mine for the past year. The foundation of my practice is a daily round of seated meditation at home, where I attempt to follow the breath and generally try to connect with the world for a set period every day (usually varying between 10 and 30 minutes, depending on how much of a rush I’m in).

Clock watching is rather distracting while sitting, so having a tool to notify you when a certain amount of time has elapsed is quite useful. Writing a smartphone app to do this is an obvious idea, and indeed approximately a zillion of these things have been written for Android and iOS. Unfortunately, most are not very good. Really, I just want something that does this:

  1. Select a meditation length (somewhere between 10 and 40 minutes).
  2. Sound a bell after a short preparation to demarcate the beginning of meditation.
  3. While the meditation period is ongoing, do a countdown of the time remaining (not strictly required, but useful for peace of mind in case you’re wondering whether you’ve really only sat for 25 minutes).
  4. Sound a bell when the meditation ends.
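The actual app is written in HTML5 and JavaScript, but the flow above is simple enough to sketch in a few lines of console Python, with a terminal beep standing in for the bell:

import sys
import time

def bell():
    sys.stdout.write("\a*ding*\n")  # stand-in for playing a bell sound
    sys.stdout.flush()

def meditate(minutes=20, prep_seconds=10):
    time.sleep(prep_seconds)  # short preparation period
    bell()                    # bell marks the start of meditation
    remaining = minutes * 60
    while remaining > 0:
        sys.stdout.write("\r%02d:%02d remaining " % divmod(remaining, 60))
        sys.stdout.flush()
        time.sleep(1)
        remaining -= 1
    sys.stdout.write("\n")
    bell()                    # bell marks the end

if __name__ == "__main__":
    meditate(minutes=20)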

Yes, meditation can get more complex than that. In Zen practice, for example, sometimes you have several periods of varying length, broken up with kinhin (walking meditation). However, that mostly happens in the context of a formal setting (e.g. a Zendo) where you leave your smartphone at the door. Trying to shoehorn all that into an app needlessly complicates what should be simple.

Even worse are the apps which “chart” your progress or have other gimmicks to connect you to a virtual “community” of meditators. I have to say I find that kind of stuff really turns me off. Meditation should be about connecting with reality in a more fundamental way, not charting gamified statistics or interacting online. We already have way too much of that going on elsewhere in our lives without adding even more to it.

So, you might ask why the alarm feature of most clock apps isn’t sufficient. Really, it is most of the time. A specialized app can make selecting the interval slightly more convenient and we can preselect an appropriate bell sound up front. It’s also nice to hear something to demarcate the start of a meditation session. But honestly I didn’t have much of a reason to write this other than the fact that I could. Outside of work, I’ve been in a bit of a creative rut lately and felt like I needed to build something, anything, and put it out into the world (even if it’s tiny and only a very incremental improvement over what’s out there already). So here it is:

[Screenshot: meditation timer app]

The app was written entirely in HTML5 so it should work fine on pretty much any reasonably modern device, desktop or mobile. I tested it on my Nexus 5 (Chrome, Firefox for Android)[1], FirefoxOS Flame, and on my laptop (Chrome, Firefox, Safari). It lives on a subdomain of this site, or you can grab it from the Firefox Marketplace if you’re using some variant of Firefox (OS). The source, such as it is, can be found on GitHub.

I should acknowledge taking some design inspiration from the Mind application for iOS, which has a similarly minimalistic take on things. Check that out too if you have an iPhone or iPad!

Happy meditating!

[1] Note that there isn’t a way to inhibit the screen/device from going to sleep with these browsers, which means that you might miss the ending bell. On FirefoxOS, I used the requestWakeLock API to make sure that doesn’t happen. I filed a bug to get this implemented on Firefox for Android.