Python dependency gotchas: always go to the source

Aug 16th, 2021

Mozilla Python

Getting back into the swing of things at Mozilla after my extended break. I’m currently working on enhancing and extending Looker support for Glean-based applications, which eventually led me back to working on bigquery-etl, our framework for creating derived datasets in our data lake.

I spent some time working on improving the initial developer experience of bigquery-etl early this year, so I figured it would be no problem to get going again despite an extended hiatus from it (I think it’s probably been ~2–3 months since I last touched it). Unfortunately the first thing I got after creating a fresh virtual environment (to pick up the new dependency updates) was this exciting looking error:

wlach@antwerp bigquery-etl % ./bqetl --help
Traceback (most recent call last):
  ...
  File "/Users/wlach/src/bigquery-etl/venv/lib/python3.9/site-packages/google/cloud/bigquery_v2/types/__init__.py", line 16, in <module>
    from .encryption_config import EncryptionConfiguration
  File "/Users/wlach/src/bigquery-etl/venv/lib/python3.9/site-packages/google/cloud/bigquery_v2/types/encryption_config.py", line 26, in <module>
    class EncryptionConfiguration(proto.Message):
  File "/Users/wlach/src/bigquery-etl/venv/lib/python3.9/site-packages/proto/message.py", line 200, in __new__
    file_info = _file_info._FileInfo.maybe_add_descriptor(filename, package)
  File "/Users/wlach/src/bigquery-etl/venv/lib/python3.9/site-packages/proto/_file_info.py", line 42, in maybe_add_descriptor
    descriptor=descriptor_pb2.FileDescriptorProto(
TypeError: descriptor to field 'google.protobuf.FileDescriptorProto.name' doesn't apply to 'FileDescriptorProto' object

What I did

Since we have pretty decent continuous integration at Mozilla, when I see an error like this I am usually pretty sure it’s some kind of strange interaction between my local development environment and whatever dependencies we’ve specified for the repository in question. Usually these problems are pretty easy to solve.

First thing I tried was to type the error into Google, to see if this had come up for anyone else before. I tried several variations of TypeError: descriptor to field and FileDescriptorProto and nothing really turned up. This strategy almost always turns up something. When it doesn’t it usually indicates that something pretty strange is happening.

To see if this was a strange problem particular to us, I asked on our internal channel but no one had offhand seen or heard of this error either. One of my colleagues (who had a working setup on a Mac, the same environment I was using) suggested I set up pyenv to isolate my development environment, which was a good idea but did not seem to solve the problem: both Python 3.8 and 3.9 installed via pyenv ran into the exact same issue.

After flailing around trying a number of other failed approaches (maybe I need to upgrade the version of virtualenv that we’re using?), I broke down and looked harder at the error itself. It seemed to be some kind of typing error in Google’s protobuf library, which google-cloud-bigquery is calling. If this sort of thing was happening to everyone, we probably would have seen it happening more broadly. So my guess, again, was that it was happening due to an obscure interaction between some variable on my machine and this particular combination of dependencies.

At this point, I systematically went through our set of python dependencies to see what might be the matter. For the most part, I found nothing surprising or suspicious. google-api-core was at the latest version, as was google-cloud-bigquery. However, I did notice that the version of protobuf we were using was a little older (3.15.8 when the latest “official” version on pypi was 3.17.3).
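A check like this can also be scripted. What follows is a minimal sketch, not something that exists in bigquery-etl: the helper names (parse_version, installed_version, outdated) are made up for illustration, and the "latest" version is hard-coded from the numbers above rather than fetched from PyPI.

```python
from importlib import metadata


def parse_version(version):
    """Turn a plain dotted version such as '3.15.8' into a comparable tuple.

    Only purely numeric releases are handled; pre-release suffixes are out of scope.
    """
    return tuple(int(part) for part in version.split("."))


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def outdated(installed, latest):
    """Map each package to (installed, latest) when the installed one is older."""
    return {
        name: (installed[name], latest[name])
        for name in installed
        if name in latest
        and parse_version(installed[name]) < parse_version(latest[name])
    }


# Version numbers taken from the post; everything else here is hypothetical.
pinned = {"protobuf": "3.15.8"}
latest = {"protobuf": "3.17.3"}
print(outdated(pinned, latest))  # {'protobuf': ('3.15.8', '3.17.3')}
```

In a real virtual environment, installed_version("protobuf") would supply the "installed" side instead of a hard-coded dictionary.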

It seemed like a longshot that the problem was there, but it seemed like upgrading the dependency was worth a try just in case. So I bumped the version of protobuf to the latest version in my local checkout (pip install protobuf==3.17.3)…

… and sure enough, after doing so, the problem was fixed and ./bqetl --help started working again:

wlach@antwerp bigquery-etl % ./bqetl --help
Usage: bqetl [OPTIONS] COMMAND [ARGS]...

  CLI tools for working with bigquery-etl.

...

After doing so, I did up a quick pull request and the problem is now fixed, at least for me.

It’s a bit unfortunate that dependabot (which we have configured for this repository) didn’t send an update for protobuf, which would have fixed this problem earlier.1 It seems like it’s not completely reliable for python packages, for whatever reason: I have also noticed this problem with mozregression.

I suspect (though can’t confirm) that the problem here is a backwards-incompatible change made to either protobuf or one of the packages that uses it. However, the nature of the incompatibility seems subtle: bigquery-etl works fine with the old set of dependencies we run in continuous integration and it appears to only come up in specific circumstances (i.e. mine). Unfortunately, I need to get back to what I was actually planning to work on and don’t have time to unwind the rather complex set of interactions going on here. Maybe later!

What I would have done differently

This kind of illustrates (again) to me that while some shortcuts and heuristics can save a bunch of time and mental effort (Googling things all the time is basically standard practice in the industry at this point), sometimes you really just need to look a little closer at the problem to find a solution. I was hesitant to do this in this case because I’m never sure where those kinds of rabbit holes are going to take me (e.g. I spent several days debugging a bad interaction between Kubernetes and our airflow cluster in late 2019 with not much to show for the effort), but often all it takes is understanding the general shape of the problem to move you to a quick solution.

Other lessons

Here’s a couple of other things this experience reinforced for me (these are more subjective, take them or leave them):

  1. As an aside, the main reason we use dependabot and aggressively update packages like google-api-core is due to a bug in pip


Lightweight dashboards and reports with Irydium and surge.sh

Aug 3rd, 2021

Irydium Recurse

One of my main goals with Irydium is to allow it to be a part of as many data science and engineering workflows as possible (including ones I haven’t thought of). Yes, like Iodide and other products, I am (slowly) building a web-based interface for building and sharing dashboards, reports, and similar things. However, I also want to fully support local and command-line based workflows. Beyond the obvious utility of being able to use your favorite text-editor to create documents, this also opens up the possibility of combining Irydium with other tools and workflows. For a slightly longer exposition on why this is desirable, I would highly recommend reading Ryan Harter’s post on the subject: Don’t make me code in your text box.

Using the irydium template

To make getting started easier, I just created an irydium-template: a simple GitHub repository which contains a minimal markdown document (a big mac index visualization) which you can use as a base, as well as a bit of npm scaffolding to get you up and running quickly. To check it out via the console, I recommend using degit (the tool of choice for such things in the Svelte community):

npx degit git@github.com:irydium/irydium-template.git my-notebook
cd my-notebook
npm install
npm run dev

This will create a webserver which renders the document (index.md) at port 3000, along with some debugging options. As you edit and save the document, the site should update automatically.

Publishing your work

When you’re happy with the results, you can create a static version of the site (an index.html file) by running npm run build. You can publish this via whatever you like: GitHub pages, Netlify / Vercel or… my new favorite service, surge.sh. Surge provides a really simple hosting service for hosting static sites and works great with Irydium. Installing and running it locally is two commands:

npm install -g surge
surge

Surge will prompt you for an email and a password, then will automatically publish your site at a unique URL. As an example, I published a site for the above template: few-blade.surge.sh

Interested in chatting more about this? Feel free to reach out on the Irydium Gitter chat.


Irydium @ Recurse Updates

Jul 28th, 2021

Irydium Recurse

Some quick updates on where Irydium is at, roughly a week-and-a-half before my mini-sabbatical at the Recurse Centre ends.

JupyterBook and MyST

I’d been admiring JupyterBook from afar for some time: their project philosophy appealed to me greatly. In particular, the MyST extensions to markdown seemed like a natural fit for this project and a natural point of collaboration and cross-pollination. A couple of weeks ago, I finally got in touch with some people working on that project, which prompted a few small efforts:

I’ve become convinced that building on top of MyST is right for both Irydium and the larger community. Increasing Irydium’s support for MyST is tracked in irydium/irydium#123.

Using Irydium to build Irydium

I’ve been spending a fair bit of time thinking of how to make it easier for people to build Irydium documents through composition of existing documents, and have landed the first pieces of this. The first is the ability to “import” a code chunk from another irydium document. There’s a few examples of this in the new components section of irydium.dev:

In a sense this allows you to define a reusable piece of code along with both documentation and usage examples. I think this concept will be particularly useful for supporting language plugins (which I will write about in an upcoming post).

It’s a real project now

I spent a bit of time last week doing some community gardening. I still consider Irydium an “experiment” but I’d like to at least open up the possibility of it being something larger. To help make that happen, I started working on some basic project governance pieces, namely:

Next steps

There’s not a ton of time left at RC, so some of these things may have to be done in my spare time after the batch ends. That said, here’s my near-term roadmap:


10 years at Mozilla

Jul 12th, 2021

Mozilla Recurse

Yesterday (July 11, 2021) was the 10 year anniversary of starting at the Mozilla Corporation. My life has changed a ton in those years: in that time I ended a marriage, changed the city in which I live two times, and took up religion1. Mozilla has also changed pretty drastically in my time here, especially in the last year.

Yet somehow I’m still at it, for more or less the same reasons that led me to accept my initial offer to join the A-team.2 The Internet has the immense potential to be a force for individual empowerment and yet more than ever, we see this technology used to consolidate unchecked power, spread misinformation, and generally exploit people. Mozilla is not perfect (no organization is: 10 years anywhere will teach you that), but it’s one of the few remaining counter-forces to these accelerating trends. While I’m currently taking a bit of a break to explore some stuff on my own, I am looking forward to getting back to work on the mission when I return in mid-August.

  1. To the extent that Zen Buddhism is a religion. 

  2. I’ve since moved to Data @ Mozilla 


Adding persistence to Irydium with Supabase

Jul 5th, 2021

Recurse Irydium

Entering the second week of Recurse. Besides orientation and a few adventures in pair programming (special shout out to Jane Adams for trying out Irydium with me!), I spent most of my time attempting to get document saving & loading working with Irydium.

I learned from Iodide that not having a good document sharing story really inhibits collaboration and sharing, which is something I explicitly want to do here at the Recurse Centre (and in general for this project). That said, this isn’t actually an area I want to spend a lot of time on right now: it’s the shape of problem I’ve solved many times before (and that has been solved by many others). I’d rather spend my time over the next few weeks on things I haven’t had much of a chance to look at or pursue in my day-to-day.

So, to try to keep the complexity down, I decided to take the same approach as the svelte repl, which aims only to allow the reproduction of simple examples. It allows you to save anything you type in it and also browse anything that you had previously saved. That’s not going to replace GitHub, but it’s more than enough to get started.

Supabase

So with that goal in mind, how to go about it? If I wanted to completely fall back on my previous knowledge, I could have gone for the tried + true approach of Django / Heroku to add a persistence layer (what I did for Iodide). That would have had the benefit of being familiar but would also have increased the overall implementation complexity of Irydium considerably. In the past year, I’ve become convinced that serverless approaches to building web applications are the wave of the future, at least for applications like this one. They’re easier to set up, easier to develop, and (generally speaking) cheaper to deploy. Just before I launched, I set up irydium.dev as a static site on Netlify and it’s been a great experience: deploys are super fast and it’s easy to reason about what’s going on “under the hood” (since there’s not much of a hood to look under).

With that in mind, I decided to take a (small) gamble and give Supabase a try for this one after determining it would be compatible with the approach I wanted to take. Supabase bills itself as a “Firebase Alternative” (Firebase is another popular solution for bootstrapping simple web applications with persistence). In contrast to Firebase, Supabase uses standard database technologies (Postgres!) and has a nice JavaScript SDK and a bunch of well-written tutorials (including one especially for Svelte).

The naive model for integrating with Supabase is pretty simple:

I’d say it probably took me 20–30 hours to get the feature working end-to-end (including documentation), which wasn’t too bad. My impressions were pretty positive: the aforementioned tutorial is pretty decent, the supabase-js library provides a nice ORM-like abstraction over SQL and integrates nicely with Svelte. In general working with Supabase felt pretty familiar to me from previous experiences writing database-backed applications, which I take as a very good sign.

The part that felt the weirdest was writing raw SQL to set up the “documents” table that Irydium uses: SQL is something I’m fairly used to writing because of my experiences at Mozilla, but I imagine this might be off-putting to someone newer to writing these types of things. Also, I have some concerns about how maintainable a Supabase database is over the long term: while it was easy enough to document the currently-simple setup instructions in the README, I do somewhat fear the prospect of managing my database via their SQL console. Something like Django’s schema migrations and management commands would be a welcome addition to Supabase’s SDK.

Netlify functions

The above approach isn’t what most people would consider to be “best practice”1. In particular, storing credentials in localStorage is probably not the best idea for an application presenting interactive content like Irydium: it wouldn’t be particularly difficult for a malicious document to steal someone’s secret and send it somewhere it shouldn’t be.

I’m not so worried about it at this stage of the project, but one intriguing possibility here (that’s compatible with our current deploy set up) would be to write some simple Netlify Functions to do the actual interaction with Supabase, while delegating to Netlify for the authentication itself (using Netlify Identity).

I experimented writing a simple function to prove out this approach and it seems to work quite well (source, example). This particular function is making an anonymous query to the database, but I see no obstacle to handling authenticated ones as well. Having an API under a .netlify namespace seems kinda weird on first blush, but I can probably get used to it.
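To give a rough idea of what an anonymous query looks like on the wire, here is a small sketch. It is not the Netlify function linked above (which would be JavaScript): it only builds, without sending, a request against the REST endpoint that Supabase exposes for each table. The project URL, the key, and the helper name build_documents_query are placeholders; the table name is assumed to be the “documents” table mentioned earlier.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_documents_query(project_url, anon_key, limit=10):
    """Build (but do not send) a GET request for rows of a `documents` table.

    `project_url` and `anon_key` are placeholders for a project's URL and
    its public (anonymous) API key.
    """
    # Keep '*' unescaped so the query string stays readable.
    query = urlencode({"select": "*", "limit": limit}, safe="*")
    return Request(
        f"{project_url}/rest/v1/documents?{query}",
        headers={
            "apikey": anon_key,
            "Authorization": f"Bearer {anon_key}",
        },
        method="GET",
    )


request = build_documents_query("https://example.supabase.co", "public-anon-key")
print(request.full_url)
# https://example.supabase.co/rest/v1/documents?select=*&limit=10
```

Passing the request to urllib.request.urlopen and decoding the JSON body would complete the round trip. Doing this inside a server-side function means the key never has to live in the browser.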

I want to move on to other things now (parsers! document state visualizations!) but might poke at this more later. In the meantime, if you write/build something cool at irydium.dev/repl, let me know!


Irydium: Points of departure

Jun 28th, 2021

Recurse Irydium

So it’s my first day at the Recurse Centre, which I blogged briefly about last week. I thought I’d start out by going into a bit more detail about what I’m trying to do with Irydium. This post might be a bit discursive and some of my thoughts are only half-formed: my intent here is to express some of these ideas at all rather than to come up with the perfect formulation for them, which is going to take time. It is based partly on a presentation I gave at Mozilla last Friday (just before going on my 6-week leave, which starts today).

First principles

The premise of Irydium is that despite obvious advances in the ability of computers to crunch numbers and analyze data, sharing whatever we learn from these analyses is still far too difficult, especially for people new to the field. Even for domain experts (those with the job title “Data Engineer” or “Data Scientist” or similar) this is still more difficult than one would like.

I’ve made a few observations over the past couple years of trying to explain and document Mozilla’s data platform that I think form a good starting point for trying to close the gap:

Ok, so what is Irydium?

Irydium is, at heart, a way to translate markdown documents into an interactive, compelling visual presentation.

My view is that publishing markdown text on the web is very close to a solved problem, and that we should build on that success rather than invent something new. This is not necessarily a new point of view (e.g. Rmarkdown and JupyterBook have similar premises) but I think some aspects of Irydium’s approach are mildly novel (or at least within the space of “not generally accepted ideas”).

If you want to get a bit of a flavor for how it works, visit the demonstration site (irydium.dev) and play with some of the examples.

What makes Irydium different from <X>?

While there are a bunch of related projects in this space, there’s a few design principles about Irydium that make it a little different from most of what’s already out there1:

With the above caveats, there are still a number of projects that overlap with Irydium’s ideas and/or design goals. A few that seem worth mentioning here:

Success criteria

My intent with Irydium, at this point in its development, is to prove out some concepts and see where they lead. While I’d welcome it if Irydium became a successful, widely adopted environment for building interactive data visualizations, I’d also be totally happy with other outcomes, such as:

  1. Providing a source of ideas and/or code for other people.
  2. Working on (or with) Irydium being a good learning experience both for myself and others

  1. Please don’t conflate “unique” with “superior”: I’m well aware that all designs come with trade offs. In particular, Irydium’s approach will almost certainly make it difficult / impossible to directly interact with “big data” systems in an efficient way. 

  2. There is at least one effort (Dataflow) to allow editing Observable documents without using Observable itself, which is interesting. 


Mini-sabbatical and introducing Irydium

Jun 23rd, 2021

Mozilla Recurse Irydium

Approaching my 10-year moz-iversary in July, I’ve decided it’s time to take a bit of a mini-sabbatical: I’ll be out (and trying as hard as possible not to check bugmail) from Friday, June 25th until August 9th. During this time, I’ll be doing a batch at the Recurse Centre (something like a writer’s retreat for programmers), exploring some of my interests around data visualization and analysis that don’t quite fit into my role as a Data Engineer here at Mozilla.

In particular, I’m planning to work a bunch on a project tentatively called “Irydium”, which pursues some of the ideas I sketched out last year in my Iodide retrospective and a few more besides. I’ve been steadily working on it in my off hours, but it’s become clear that some of the things I want to pursue would benefit from more dedicated attention and the broader perspective that I’m hoping the Recurse community will be able to provide.

I had meant to write up a proper blog post to announce the project before I left, but it looks like I’m pretty much out of time. Instead, I’ll just offer up the examples on the newly-minted irydium.dev and invite people to contact me if any of the ideas on the site sounds interesting. I’m hoping to blog a whole bunch while I’m there, but probably not under the Mozilla tag. Feel free to add wrla.ch to your RSS feed if you want to follow what I’m up to!


Glean Dictionary updates

Jun 2nd, 2021

Mozilla Glean

(this is a cross-post from the data blog)

Lots of progress on the Glean Dictionary since I made the initial release announcement a couple of months ago. For those coming in late, the Glean Dictionary is intended to be a data dictionary for applications built using the Glean SDK and Glean.js. This currently includes Firefox for Android and Firefox iOS, as well as newer initiatives like Rally. Desktop Firefox will use Glean in the future, see Firefox on Glean (FoG).

Production URL

We’re in production! You can now access the Glean Dictionary at dictionary.telemetry.mozilla.org. The old protosaur-based URL will redirect.

Glean Dictionary + Looker = ❤️

At the end of last year, Mozilla chose Looker as our internal business intelligence tool. Frank Bertsch, Daniel Thorn, Anthony Miyaguchi and others have been building out first class support for Glean applications inside this platform, and we’re starting to see these efforts bear fruit. Looker’s explores are far easier to use for basic data questions, opening up data based inquiry to a much larger cross section of Mozilla.

I recorded a quick example of this integration here:

Note that Looker access is restricted to Mozilla employees and NDA’d volunteers. Stay tuned for more public data to be indexed inside the Glean Dictionary in the future.

Glean annotations!

I did up the first cut of a GitHub-based system for adding annotations to metrics — acting as a knowledge base for things data scientists and others have discovered about Glean Telemetry in the field. This can be invaluable when doing new analysis. A good example of this is the annotation added for the opened as default browser metric for Firefox for iOS, which has several gotchas:

Many thanks to Krupa Raj and Leif Oines for producing the requirements which led up to this implementation, as well as their evangelism of this work more generally inside Mozilla. Last month, Leif and I did a presentation about this at Data Club, which has been syndicated onto YouTube:

Since then, we’ve had a very successful working session with some people from Data Science and have started to fill out an initial set of annotations. You can see the progress in the glean-annotations repository.

Other Improvements

Lots more miscellaneous improvements and fixes have gone into the Glean Dictionary in the last several months: see our releases for a full list. One thing that irrationally pleases me is the set of new labels Linh Nguyen added last week: colorful and lively, they make it easy to see when a Glean Metric is coming from a library:

Future work

The Glean Dictionary is just getting started! In the next couple of weeks, we’re hoping to:

If you’re interested in getting involved, join us! The Glean Dictionary is developed in the open using cutting edge front-end technologies like Svelte. Our conviction is that being transparent about the data Mozilla collects helps us build trust with our users and the community. We’re a friendly group and hang out on the #glean-dictionary channel on Matrix.


mozregression update May 2021

May 10th, 2021

Mozilla Glean Telemetry mozregression

Just wanted to give some quick updates on the state of mozregression.

Anti-virus false positives

One of the persistent issues with mozregression is that it seems to be regularly detected as a virus by many popular anti-virus scanners. The causes for this are somewhat complex, but at root the problem is that mozregression requires fairly broad permissions to do the things it needs to do (install and run copies of Firefox) and thus its behavior is hard to distinguish from a piece of software doing something malicious.

Recently there have been a number of mitigations which seem to be improving this situation:

It’s tempting to lament the fact that this is happening, but in a way I can understand it’s hard to reliably detect what kind of software is legitimate and what isn’t. I take the responsibility for distributing this kind of software seriously, and have pretty strict limits on who has access to the mozregression GitHub repository and what pull requests I’ll merge.

CI ported to GitHub Actions

Due to changes in Travis’s policies, we needed to migrate continuous integration for mozregression to GitHub actions. You can see the gory details in bug 1686039. One possibly interesting wrinkle to others: due to Mozilla’s security policy, we can’t use (most) external actions inside our GitHub repository. I thus rewrote the logic for uploading a mozregression release to GitHub for MacOS and Linux GUI builds (Windows builds are still happening via AppVeyor for now) from scratch. Feel free to check the above out if you have a similar need.
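For anyone with a similar need, the core of such an upload is a single authenticated POST to GitHub’s release asset endpoint. The sketch below is not the mozregression implementation (see the bug for that): it only builds the request, and the owner, repository, release id, file name, and token are all placeholders.

```python
from urllib.parse import quote
from urllib.request import Request


def build_asset_upload(owner, repo, release_id, filename, payload, token):
    """Build (but do not send) a request that attaches a file to a release.

    Every argument is a placeholder supplied by the caller; `payload` is
    the raw bytes of the file to upload.
    """
    url = (
        f"https://uploads.github.com/repos/{owner}/{repo}"
        f"/releases/{release_id}/assets?name={quote(filename)}"
    )
    return Request(
        url,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


request = build_asset_upload(
    "example-org", "example-repo", 1, "app.tar.gz", b"binary data", "placeholder-token"
)
print(request.get_method(), request.full_url)
```

In a CI job, the token would come from the environment the workflow provides, and the release id from an earlier API call that creates or looks up the release.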

MacOS Big Sur

As of version 4.0.17, the mozregression GUI now works on MacOS Big Sur. It is safe to ask community members to install and use it on this platform (though note the caveats due to the bundle being unsigned).

Usage Dashboard

Fulfilling a promise I implied last year, I created a public dataset for mozregression and built a dashboard tracking mozregression use using Observable. There are a few interesting insights and trends there that can be gleaned from our telemetry. I’d be curious if the community can find any more!


Blog moving back to wrla.ch

Mar 21st, 2021

Mozilla meta

Housekeeping news: I’m moving this blog back to the wrla.ch domain from wlach.github.io. This domain sorta kinda worked before (I set up a netlify deploy a couple years ago), but the software used to generate this blog referenced github all over the place in its output, so it didn’t really work as you’d expect. Anyway, this will be the last entry published on wlach.github.io: my plan is to turn that domain into a set of redirects in the future.

I don’t know how many of you are out there who still use RSS, but if you do, please update your feeds. I have filed a bug to update my Planet Mozilla entry, so hopefully the change there will be seamless.

Why? Recent events have made me not want to tie my public web presence to a particular company (especially a larger one, like Microsoft). I don’t have any immediate plans to move this blog off of github, but this gives me that option in the future. For those wondering, the original rationale for moving to github is in this post. Looking back, the idea of moving away from a VPS and WordPress made sense, the move away from my own domain less so. I think it may have been harder to set up static hosting (esp. with HTTPS) at that time… or I might have just been ignorant.

In related news, I decided to reactivate my twitter account: you can once again find me there as @wrlach (my old username got taken in my absence). I’m not totally thrilled about this (I basically stand by what I wrote a few years ago, except maybe the concession I made to Facebook being “ok”), but Twitter seems to be where my industry peers are. As someone who doesn’t have a large organic following, I’ve come to really value forums where I can share my work. That said, I’m going to be very selective about what I engage with on that site: I appreciate your understanding.