Cícero #68

Notes on Taylor and Maclaurin series

A Maclaurin series is a power series - a polynomial with carefully chosen coefficients and an infinite number of terms - used to approximate functions that satisfy certain conditions (e.g. differentiability). The Maclaurin series does this for input values close to 0; it is a special case of the Taylor series, which can be used to find a polynomial approximation around any value.

Intuition

Let's say we have a function f(x) and we want to approximate it with some other - polynomial - function p(x). To make sure that p(x) is as close as possible to f(x), we'll construct a polynomial whose derivatives at a chosen point match those of f(x).

  • We start with a constant polynomial, such that p(0)=f(0). This approximation is perfect at 0 itself, but not as much elsewhere.
  • We want p(x) to behave similarly to f(x) around 0, so we'll set the derivative of our approximation to be the same as the derivative of f(x) at 0; in other words p'(0)=f'(0). This approximation will be decent very close to 0 (at least in the direction of the slope), but will become progressively worse as we get farther away from 0.
  • We continue this process, by setting the second derivative to be p''(0)=f''(0), the third derivative to be p'''(0)=f'''(0) and so on, for as many terms as we need to achieve a good approximation in our desired range. Intuitively, if many derivatives of p(x) are identical to the corresponding derivatives of f(x) at some point, the two functions will have very similar behaviors around that point [1].

The full Maclaurin series that accomplishes this approximation is:

\[p(x) = f(0)+\frac{f'(0)}{1!}x+\frac{f''(0)}{2!}x^2+\frac{f'''(0)}{3!}x^3+\cdots=\sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!}x^n\]

We'll get to how this equation is found in a moment, but first an example that demonstrates its approximation capabilities. Suppose we want to find a polynomial approximation for f(x)=cos(x). Following the definition of the Maclaurin series, it's easy to calculate:

\[p_{cos}(x)=1-\frac{x^2}{2!}+\frac{x^4}{4!}-\frac{x^6}{6!}+\frac{x^8}{8!}-\cdots\]

(try it as an exercise).
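
To see this approximation numerically, here is a small Python sketch (my own illustration, not part of the original notes) that sums the first k terms of p_{cos}(x) and compares them against math.cos:

import math

def maclaurin_cos(x: float, k: int) -> float:
    """Sum the first k terms of the Maclaurin series for cos(x)."""
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n) for n in range(k))

for k in (1, 2, 3, 5, 8):
    print(f"k={k}: {maclaurin_cos(2.0, k):+.6f}   (cos(2.0) = {math.cos(2.0):+.6f})")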

Successive approximation of cos(x) with Maclaurin series

The dark blue line is the cosine function f(x)=cos(x). The light blue lines are successive approximations, with k terms of the power series p_{cos}(x) included:

  • With k=1, p_{cos}(x)=1 since that's just the value of cos(x) at 0.
  • With k=2, p_{cos}(x)=1-\frac{x^2}{2}, and indeed the curve looks parabolic.
  • With k=3 we get a 4th-degree polynomial that tracks the function better, and so on.

With more terms in the power series, the approximation resembles cos(x) more and more, at least close to 0. The farther away we get from 0, the more terms we need for a good approximation [2].

How the Maclaurin series works

This section shows how one arrives at the formula for the Maclaurin series, and connects it to the intuition of equating derivatives.

We'll start by observing that the Maclaurin series is developed around 0 for a good reason. The generalized form of a power series is:

\[p(x)=a_0+a_1 x+a_2 x^2 + a_3 x^3 + a_4 x^4 + \cdots\]

To properly approximate a function, we need this series to converge; therefore, it's desirable for its terms to decrease. An x value close to zero (in particular, |x| < 1) guarantees that x^n becomes smaller and smaller with each successive term. There's a whole section on convergence further down with more details.

Recall from the Intuition section that we're looking for a polynomial that passes through the same point as f(x) at 0, and that has derivatives equal to those of f(x) at that point.

Let's calculate a few of the first derivatives of p(x); the function itself can be considered as the 0-th derivative:

\[\begin{align*} p(x)&=a_0+a_1 x+a_2 x^2 + a_3 x^3+ a_4 x^4+\cdots\\ p'(x)&= a_1 +2 a_2 x + 3 a_3 x^2+4 a_4 x^3+\cdots\\ p''(x)&= 2 a_2 + 3 \cdot 2\, a_3 x+ 4 \cdot 3\, a_4 x^2+\cdots\\ p'''(x)&= 3\cdot 2\, a_3 + 4\cdot 3 \cdot 2\, a_4 x+\cdots \\ \cdots \end{align*}\]

Now, equate these to corresponding derivatives of f(x) at x=0. All the non-constant terms drop out, and we're left with:

\[\begin{align*} f(0)&=p(0)=a_0\\ f'(0)&=p'(0)= a_1 \\ f''(0)&=p''(0)= 2 a_2 \\ f'''(0)&=p'''(0)= 3\cdot 2\, a_3 \\ \cdots\\ f^{(n)}(0)&=p^{(n)}(0)=n!\,a_n\\ \cdots\\ \end{align*}\]

So we can set the coefficients of the power series, generalizing the denominators using factorials:

\[\begin{align*} a_0 &= f(0)\\ a_1 &= \frac{f'(0)}{1!}\\ a_2 &= \frac{f''(0)}{2!}\\ a_3 &= \frac{f'''(0)}{3!}\\ \cdots \\ a_n &= \frac{f^{(n)}(0)}{n!} \end{align*}\]

Which gives us the definition of the Maclaurin series:

\[p(x) = f(0)+\frac{f'(0)}{1!}x+\frac{f''(0)}{2!}x^2+\frac{f'''(0)}{3!}x^3+\cdots=\sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!}x^n\]
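
If you have SymPy available, the coefficient formula is easy to verify mechanically. This short sketch (an aside, not part of the original derivation) computes a_n = f^{(n)}(0)/n! for f(x)=cos(x) and checks it against SymPy's own series expansion:

import sympy as sp

x = sp.symbols("x")
f = sp.cos(x)

# a_n = f^(n)(0) / n!, the Maclaurin coefficients derived above
coeffs = [sp.diff(f, x, n).subs(x, 0) / sp.factorial(n) for n in range(9)]
print(coeffs)             # [1, 0, -1/2, 0, 1/24, 0, -1/720, 0, 1/40320]
print(f.series(x, 0, 9))  # SymPy's own expansion agrees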

Taylor series

The Maclaurin series is suitable for finding approximations for functions around 0; what if we want to approximate a function around a different value? First, let's see why we would even want that. A couple of major reasons come to mind:

  1. We have a non-cyclic function and we're really interested in approximating it around some specific value of x; if we use the Maclaurin series, we get a good approximation around 0, but its quality diminishes the farther away we get. We may be able to use far fewer terms for a good approximation if we center the series around our target value.
  2. The function we're approximating is not well behaved around 0.

It's the second reason which is most common, at least in calculus. By "not well behaved" I mean a function that's not finite at 0 (or close to it), or that isn't differentiable at that point, or whose derivatives aren't finite.

There's a very simple and common example of such a function - the natural logarithm ln(x). This function is undefined at 0 (it approaches -\infty). Moreover, its derivatives are:

\[\begin{align*} ln'(x)&= \frac{1}{x}\\ ln''(x)&= -\frac{1}{x^2}\\ ln'''(x)&= \frac{2}{x^3}\\ ln^{(4)}(x)&= -\frac{6}{x^4}\\ ln^{(5)}(x)&= \frac{24}{x^5}\\ \cdots \end{align*}\]

None of these is defined at 0 either! The Maclaurin series won't work here, and we'll have to turn to its generalization - the Taylor series:

\[p(x) = f(a)+\frac{f'(a)}{1!}(x-a)+\frac{f''(a)}{2!}(x-a)^2+\frac{f'''(a)}{3!}(x-a)^3+\cdots=\sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n\]

This is a power series that provides an approximation for f(x) around any point a where f(x) is finite and differentiable. It's easy to use exactly the same technique to develop this series as we did for Maclaurin.

Let's use this to approximate ln(x) around x=1, where the function is well behaved. ln(1)=0, and substituting x=1 into its derivatives (as listed above), we get:

\[f'(1)=1\quad f''(1)=-1\quad f'''(1)=2\quad f^{(4)}(1)=-6\quad f^{(5)}(1)=24\]

There's a pattern here: generally, the n-th derivative at 1 is (n-1)! with an alternating sign. Substituting into the Taylor series equation from above we get:

\[p_{ln}(x)=(x-1)-\frac{1}{2}(x-1)^2+\frac{1}{3}(x-1)^3-\frac{1}{4}(x-1)^4+\cdots\]
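
A quick numeric check of this series (a Python sketch mirroring the formula above, not something from the original notes):

import math

def taylor_ln(x: float, k: int) -> float:
    """Sum the first k terms of the Taylor series for ln(x) around a=1."""
    return sum((-1) ** (n + 1) * (x - 1) ** n / n for n in range(1, k + 1))

for x in (0.5, 1.5, 1.9):
    print(f"ln({x}) ≈ {taylor_ln(x, 200):.6f}   (math.log: {math.log(x):.6f})")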

Here's a plot of approximations with the first k terms (the function itself is dark blue, as before):

Successive approximation of ln(x) with Taylor series around a=1

While the approximation looks good in the vicinity of 1, it seems like all approximations diverge dramatically at some point. The next section helps understand what's going on.

Convergence of power series and the ratio test

When approximating a function with power series (e.g. with Maclaurin or Taylor series), a natural question to ask is: does the series actually converge to the function it's approximating, and what are the conditions on this convergence?

Now it's time to treat these questions a bit more rigorously. We'll be using the ratio test to check for convergence. Generally, for a series:

\[\sum_{n=1}^\infty a_n\]

We'll administer this test:

\[L = \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|\]

And check the conditions for which L < 1, meaning that our series converges absolutely.

Let's start with our Maclaurin series for cos(x):

\[p_{cos}(x)=1-\frac{x^2}{2!}+\frac{x^4}{4!}-\frac{x^6}{6!}+\frac{x^8}{8!}-\cdots=1+\sum_{n=1}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}\]

Ignoring the constant term, we'll write out the ratio limit. Note that because of the absolute value, we can ignore the power-of-minus-one term too:

\[\begin{align*} L &= \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|\\ &= \lim_{n\to\infty}\left| \frac{x^{2n+2} (2n)!}{(2n+2)! x^{2n}}\right|\\ &= \lim_{n\to\infty}\left| \frac{x^2}{(2n+1)(2n+2)}\right| \end{align*}\]

Since x is fixed while the denominator grows without bound, L=0 for any x. This means that the series converges to cos(x) at any x, given an infinite number of terms. This matches our intuition for this function, which is well behaved (smooth everywhere).

Now on to ln(x) with its Taylor series around x=1. The series is:

\[p_{ln}(x)=(x-1)-\frac{1}{2}(x-1)^2+\frac{1}{3}(x-1)^3-\frac{1}{4}(x-1)^4+\cdots=\sum_{n=1}^{\infty} \frac{(-1)^{n+1} (x-1)^n}{n}\]

Once again, writing out the ratio limit:

\[\begin{align*} L &= \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|\\ &= \lim_{n\to\infty}\left| \frac{(x-1)^{n+1} n}{(n+1) (x-1)^n}\right|\\ &= \lim_{n\to\infty}\left| \frac{n(x-1)}{(n+1)}\right|\\ &= \left|x-1\right| \lim_{n\to\infty}\left| \frac{n}{(n+1)}\right|=\left| x-1\right| \end{align*}\]

To converge, we require:

\[L=\left| x-1\right|<1\]

The solution of this inequality is 0 < x < 2, so the series converges to ln(x) in this range (the ratio test is inconclusive at the endpoints themselves: at x=2 the alternating series still converges, to ln(2), while at x=0 it diverges). This is also what we observe in the latest plot. Another way to say it: the radius of convergence of the series around x=1 is 1.
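
The radius of convergence is easy to observe numerically as well. In this Python sketch (my own illustration), the partial sums settle down inside the interval but blow up at x=2.5, outside it:

def partial_sum(x: float, k: int) -> float:
    """First k terms of the Taylor series for ln(x) around a=1."""
    return sum((-1) ** (n + 1) * (x - 1) ** n / n for n in range(1, k + 1))

for x in (1.5, 1.99, 2.5):   # well inside, near the edge, and outside the radius
    print(x, [round(partial_sum(x, k), 3) for k in (10, 20, 40, 80)])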


[1]If this explanation and the plot of cos(x) following it don't convince you, consider watching this video by 3Blue1Brown - it includes more visualizations as well as a compelling alternative intuition using integrals and area.
[2]Note that since cos(x) is cyclic, all we really need is a good approximation in the range [-\pi, \pi). Our plot only shows the positive x axis; it looks like a mirror image on the negative side, so we see that a pretty good approximation is achieved by the time we reach k=5.

This is also a good place to note that while Maclaurin series are important in calculus, they're not the best approximation for numerical analysis purposes; there are other approximations that converge faster.

Como eu vejo esta foto #195

The nyan cat doesn't exist… O.O

TIL: testing in the future using the faketime command

Last week's blog post accidentally got published a few hours early1. One of the keen-eyed among you even submitted it to the orange site before it was officially up, since it was in my RSS feed briefly and was picked up by various RSS readers. Resolving that issue led me to discover the command faketime and a wonderful way of validating processes that are time and timezone dependent.

I'm going to first talk about the bug, then separately about how I tested a fix. Feel free to skip ahead to the testing part if you want to skip the story.

The bug that published my post early

Last week's post went up early because I was testing out a new way of publishing previews of posts, and that process had a bug. Previously, I would publish my entire site with drafts to a separate hosting stack just for blog previews. I didn't love that this required two separate deployment processes, though, and I kept admiring how a friend has unlisted posts on her regular site for previews. So I wanted to do that!

Since my static site generator doesn't have hidden pages, I accomplished it by customizing my site templates. One of the comments in the related issue thread suggests a way to achieve this. Adapting it for all the files I needed, I put something like this inside my templates for atom.xml and my blog/tag pages2:

<!-- snippet from templates/blog.html -->
{% set ts_now = now() | date(format="%Y-%m-%d") | date(format="%s") | int %}

{% for page in section.pages %}
  {% set ts_page = page.date|default(value=0)|date(format="%s")|int %}
  {% if ts_page <= ts_now %}
    <li>{{page.date}}: <a href="{{ page.permalink | safe }}">{{ page.title }}</a></li>
  {%- endif -%}
{% endfor %}

This worked great, and I was able to get feedback on last week's post by sending a link to the hidden page! Neat!

The problem came when I published a typo fix before going to bed on Sunday. The post was scheduled for Monday, and by the time I pushed the fix, it had already become Monday in UTC. I am on the US east coast, and my computer is set to use Eastern Time. So imagine my surprise when, upon publishing the typo fix, the scheduled post also became public and appeared in the feeds! My static site generator was using UTC for the post dates.

I quickly made a small change to remove it from the feeds (I set the post a year in the future). But I couldn't let go of the bug, and I came back to it this week.

I eventually made another tweak to my templates to, effectively, strip out timezone information. It's hacky, but it works. But how do I make sure of that?
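
The mismatch is easy to reproduce outside of any template. Here's a minimal Python sketch (my illustration of the failure mode, not the actual template logic) comparing a date-only post timestamp, which is effectively midnight UTC, against a Sunday-evening clock on the US east coast:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A post dated "2024-07-22" with no time or zone is effectively midnight UTC.
post_date = datetime(2024, 7, 22, tzinfo=timezone.utc)

# 11pm Eastern on Sunday the 21st is already past midnight UTC on Monday...
now_eastern = datetime(2024, 7, 21, 23, 0, tzinfo=ZoneInfo("America/New_York"))

# ...so a "post date <= now" check done in UTC publishes the post a few hours early.
print(post_date <= now_eastern.astimezone(timezone.utc))  # True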

Testing in the future

To verify my change, I had to figure out how to check it at a few critical times. Since I want posts to publish on a particular calendar day in my local timezone, I wanted to check that 11pm the night before filters a post out and that 1am on the publication day includes it.

I stumbled across faketime, which I installed from a system package. It's available on Fedora via sudo dnf install libfaketime, and similar packages exist on other distributions.

Using it is straightforward. You give it a timestamp and a program to run. Then it runs the program while intercepting system calls to time functions. This lets you very easily test something out at a few different times without any modifications to your program or system clock.

Here's how I used it for testing this issue:

# verify that the post disappears before publication time
faketime "sunday 11pm" zola serve

# verify that the post appears after publication time
faketime "monday 1am" zola serve

It can do a few other really useful things, too.

  • Set a specific time: faketime "2024-01-01 12:00:00" zola serve
  • Start at a time, and go 10x faster: faketime -f "@2024-07-21 23:59:00 x10" zola serve
  • Advance by an interval (here, 10 seconds) on each call to get the system time: faketime -f "@2024-07-21 23:59:00 i10.0" zola serve

I just use the simple ones with relative times usually, but it's very nice being able to speed up time! faketime is available as a program or a C library, so it can also be integrated into other programs for testing.
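
If you want to script checks like these, faketime also composes nicely with a small subprocess wrapper. Here's a Python sketch of mine (assuming faketime and the standard date command are installed) showing that a child process observes the faked clock while the real clock is untouched:

import subprocess

def run_under_faketime(fake_when: str, cmd: list[str]) -> str:
    """Run cmd under faketime and return its output, as seen from the faked clock."""
    result = subprocess.run(["faketime", fake_when] + cmd,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# The child process sees the fake time; the real system clock is untouched.
print(run_under_faketime("sunday 11pm", ["date"]))
print(run_under_faketime("monday 1am", ["date"]))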


1. With a deep dose of irony, this post also published early. I spent a couple of hours digging in last night and fixing something, because originally I'd not really fixed the bug! Now it is truly fixed, but wow was that a funny twist.

2. This approach does yield a minor bug: hidden posts are included in the count for tags, but are not displayed in the list. In an ideal world, I'd not include them in the count. But this is not an ideal world.

President Venn Diagram

Hard to imagine political rhetoric more microtargeted at me than 'I love Venn diagrams. I really do, I love Venn diagrams. It's just something about those three circles.'

Master Boot Record Poster Tour

Mentirinhas #2172

The color of a macaw when it flees.

CrowdStrike

We were going to try swordfighting, but all my compiling is on hold.

Developing domain expertise: get your hands dirty.

Recently, I’ve been thinking about developing domain expertise, and wanted to collect my thoughts here. Although I covered some parts of this in Your first 90 days as CTO (understanding product analytics, shadowing customer support, talking to customers, and talking with your internal experts), I missed the most important dimension of effective learning: getting your hands dirty.

At Carta, I’m increasingly spending time focused on our fund financials business, which requires a deep understanding of accounting. I did not join Carta with a deep understanding of accounting. Initially, I hoped that I would learn enough accounting through customer escalations, project review, and so on, but recently decided I also needed to work through Financial Accounting, 11th Edition by Weygandt, Kimmel, and Kieso.

The tools for building domain expertise vary quite a bit across companies, and I found the same tools ranged from excellent to effectively useless when applied across Stripe (an increasingly expansive platform for manipulating money online), SocialCode (a Facebook advertising optimization company), and Carta (a platform for fund administration and a platform for cap table management). Here are some notes about approaches taken at specific companies, followed by some generalized recommendations.

Uber

Uber likely had the simplest and most effective strategy of any product I’ve worked on: each employee got several hundred dollars of Uber credits every month to use the product. This, combined with the fact that almost all early hires lived in markets that had an active Uber marketplace going, meant that our employees intimately experienced the challenges.

This was particularly impactful for folks who traveled to other cities or countries and experienced using Uber there. Often the experience was pretty inconsistent across cities, and experiencing that inconsistency directly was extremely valuable.

Carta

Returning to my starting paragraph on Carta, Carta operates in a series of deep, complex domains: equity management is a complex legal domain, and fund administration is a complex accounting domain. Ramping in either, let alone both, is difficult.

Carta has an unlimited individual book budget, and they pay for the Certified Equity Professional (CEP) test. These are good educational benefits, but are more a platform that you can build on than the actual learning itself. Teams working on products tend to develop deep domain expertise by building in that domain, but that approach is difficult to apply as an executive as I’m typically simultaneously engaging with so many different products and problems.

In addition to the standard foundation of domain learning (talking to customers, digging into product and business analytics, etc), I’ve found three mechanisms particularly helpful: our executive sponsor program, reading textbooks, and initiative-specific deep dives.

For our executive sponsor program, we assign a C-level executive to each key customer; the sponsor is involved in every escalation, holds periodic check-ins, and advocates for those customers in our roadmap planning. By design, being a sponsor is painful when things don’t go well, and that is a very pure, effective learning mechanism: figure out the customer’s problem, and then track resolving it through the company. Some days I don’t enjoy being a sponsor, but it’s the most effective learning mechanism I’ve found for our exceptionally deep domains, and I’m grateful we rolled the program out.

Second, I’ve found book learning very effective at creating a foundation to dig into product and technical considerations in the accounting domain. For example, soon after joining I got a short refresher on accounting by reading Accounting Made Simple by Mike Piper in a couple of hours. Later, I worked through the Partnership Accounting course on Udemy, and now I’m working through two textbooks, Financial Accounting, 11th Edition and Understanding Partnership Accounting.

Finally, initiative-specific deep dives have been a good opportunity to work directly with a team on a narrow problem until we solved a complex problem together. This taught me a lot about the domain and the individuals, and hopefully gave them a better sense of, and relationship with, me as an executive sponsoring a project they also cared about. My first big project was working with our payments infrastructure team to support automated money movement in our fund administration product, and I learned so much from the team on that project. I also know there’s no chance I’d understand the complexities at the intersection of money movement and fund administration so well if I hadn’t gotten to work with them on that project.

Stripe

At the time I joined Stripe, all new employees were encouraged to read Payment Systems in the US. More ambitious folks usually built a straightforward Stripe store of some sort: David Singleton created a site to sell journals, and Michelle Bu maintained a store that sold t-shirts with the seconds since epoch printed on them. Building a store was a great educational experience, but maintaining the store live was significantly more valuable in understanding the friction that bothered our users. Things like forced upgrades or late tax forms are abstract when imposed on others, and illuminating when you experience them directly.

As Stripe got increasingly broad and complex, it became increasingly difficult for anyone to maintain a deep understanding of the entire stack. To combat that trend, executives relied more on mechanisms like project-driven learning on high-priority projects, and executive sponsors for key customers. They certainly also relied on standard mechanisms like talking to customers frequently, frequently reviewing product data, and so on.

Intercom

Some years back I met Brian Scanlan, who told me that executives at Intercom would start each offsite by doing a quick integration of their product into a new website. The goal wasn’t to do a novel integration, but to stay close to the integration experience of their typical user. I still find this a fairly profound idea, and I tried a variation of it at Carta’s most recent executive offsite, making every executive start the offsite by performing a handful of fund administration tasks on a demo fund’s data.

Felt

Chatting with one of the founders at Felt, Can Duruk, about this topic, he mentioned that they maintain an introduction to Geographic Information Systems for both employees and users to understand the domain. They also hired an in-house cartographer who helps educate the team on the details of map making.

Recommendations

The recommendations I would make are embedded in the specific stories above, but I’ll compact them into a list as well for easier reference. Some particularly useful mechanisms for senior leaders to develop domain expertise are:

  • Review product analytics on a recurring basis. Your goal is to build an intuition around how the data should move, and then refine that intuition against how the data moves in reality.
  • Shadow customer support to see customer issues and how those issues are resolved.
  • Assign named executive sponsors for key customers. Those sponsors should meet with those customers periodically, be a direct escalation point for those customers, be aware of issues impacting those customers, and be an advocate for those customers’ success.
  • Directly use or integrate with the product. Try to find ways that more closely mirror different customer cohorts rather than just the usage you find most common. For example, if you only used Uber in San Francisco in 2014, you had a radically misguided intuition about how well Uber worked.
  • Make an executive offsite ritual around using the product. Follow Intercom’s approach to routinely integrate the core parts of your product from scratch, experiencing the challenges of your new users over and over to ensure they don’t degrade.
  • Use executive initiatives as an opportunity to dig deep into particular areas of the business. Over the past year, the areas at Carta that I’ve learned the best are the ones where I embedded myself temporarily into a team dealing with a critical problem and kept with them until the problem was resolved.
  • Use a textbook or course driven approach to understand the underlying domain that you’re working in. This applies from Uber’s marketplace management to Carta’s accounting.

The details of ramping up on a specific domain will always vary a bit, but hopefully something in there gives you a useful starting point for digging into yours. So often executives take a view that the constraints are a problem for their teams, but I think great executive leadership only exists when individuals can combine the abstract mist of grand strategy with the refined nuance of how things truly work. If this stuff seems like the wrong use of your time, that’s something interesting to reflect on.

Organ Meanings

IMO the thymus is one of the coolest organs and we should really use it in metaphors more.

Inside an IBM/Motorola mainframe controller chip from 1981

In this article, I look inside a chip in the IBM 3274 Control Unit.1 But before I discuss the chip, I need to give some background on mainframes. (I didn't completely analyze the chip, so don't expect a nice narrative or solid conclusions.)

Die photo of the Motorola/IBM SC81150 chip. Click this image (or any other) for a larger version.

IBM's vintage mainframes were extremely underpowered compared to modern computers; a System/370 mainframe ran well under 1 million instructions per second, while a modern laptop executes billions of instructions per second. But these mainframes could support rooms full of users, while my 2017 laptop can barely handle one person.2 Mainframes achieved their high capacity by offloading much of the data entry overhead so the mainframe could focus on the "important" work. The mainframe received data directly into memory in bulk over high-speed I/O channels, without needing to handle character-by-character editing. For instance, a typical data entry terminal (a "3270") let the user update fields on the screen without involving the computer. When the user had filled out the screen, pressing the "Enter" key sent the entire data record to the mainframe at once. Thus, the mainframe didn't need to process every keystroke; it only dealt with complete records. (This is also why many modern keyboards have an "Enter" key.)

A room with IBM 3179 Color Display Stations, 1984. Note that these are terminals, not PCs. From 3270 Information Display System Introduction.

But that was just the beginning of the hierarchy of offloaded processing in a mainframe system. Terminals weren't attached directly to the mainframe. You could wire 16 terminals to a terminal multiplexer (such as the 3299). This would in turn be connected to a 3274 Control Unit that merged the terminal data and handled the network protocols. The Control Unit was connected to the mainframe's channel processor which handled I/O by moving data between memory and peripherals without slowing down the CPU. All these layers allowed the mainframe to focus on the important data processing while the layers underneath dealt with the details.3

An overview of the IBM 3270 Information Display System attachment. The yellow highlights indicate the 3274 Control Unit. From 3270 Information Display System: Introduction.

The 3274 Control Unit (highlighted above) is the source of the chip I examined. The purpose of the Control Unit "is to take care of all communication between the host system and your organization's display stations and printers". The diagram above shows how terminals were connected to a mainframe, with the 3274 Control Unit (indicated by arrows) in the middle. The 3274 was an all-purpose box, handling terminals, printers, modems, and encryption (if needed). It could communicate with the mainframe at up to 650,000 characters per second. The control unit itself (below) is a boring beige box. The control panel is minimal since people normally didn't interact with the unit. On the back are coaxial connectors for the lines to the terminals, as well as connectors to interface with the computer and other peripherals.

An IBM 3274-41D Control Unit. From bitsavers.

The Keystone II board

In 1983, IBM announced new Control Unit models with twice the speed: these were the Model 41 and Model 61. These units were built around a board called Keystone II, shown below. The board is constructed in IBM's peculiar PCB style, arranged as a grid of squares with the traces too small to see unless you zoom in. Most of the decoupling capacitors are in IBM's thin, rectangular packages, although I see a few capacitors in more standard blue packages. IBM is almost a parallel universe with its unusual packaging for ICs and capacitors as well as the strange circuit board appearance.

The Keystone II board. The box is labeled Keystone II FCS [i.e. First Customer Shipment] July 23, 1982. Photo from bitsavers, originally from Bob Roberts.

Most of the chips on the board are IBM chips packaged in square aluminum cans, known as MST (Monolithic System Technology). The first line on each package is the IBM part number, which is usually undocumented. The empty socket can hold a ROS chip; ROS is Read-Only Store, known as ROM to people outside IBM. The Texas Instruments ICs in the upper right are easier to identify; the 74LS641 chips are octal bus transceivers, presumably connecting this board to the rest of the system. Similarly, the 561 5843 is a 74S240 octal bus driver while the 561 6647 chips are 74LS245 octal bus transceivers.

The memory chips on the left side of this board are interesting: each one consists of two "piggybacked" 16-kilobit DRAM chips. IBM's part number 8279251 corresponds to the Intel 4116 chip, originally made by Mostek. With 18 piggybacked chips, the board holds 64 kilobytes of parity-protected memory.

The photo below shows the Keystone II board mounted in the 3274 Control Unit. The board is in slot E towards the left and the purple Motorola IC is visible.

The Keystone II card in slot E of a 3274-41D Control Unit. Photo from bitsavers.

The Motorola/IBM chip

The board has a Motorola chip in a purple ceramic package; this is the chip that I examined. Popping off the golden lid reveals the silicon die underneath. The package has the part number "SC81150R", indicating a Motorola Special/Custom chip. This part number is also visible on the die, as shown below.

The corner of the die is marked with the SC81150 part number. Bond pads and bond wires are also visible.

While the outside of the IC is labeled "Motorola", there are no signs of Motorola internally. Instead, the die is marked "IBM" with the eight-striped logo. My guess is that IBM designed the chip and Motorola manufactured it.

The IBM logo on the die.

The diagram below shows the chip with some of the functional blocks identified. Around the outside are the bond pads and the bond wires that are connected to the chip's grid of pins. At the right is the 16×16 block of memory, along with its associated control, byte swap, and output circuitry. The yellowish-white lines are the metal layer on top of the chip that provides the chip's wiring. The thick metal lines distribute power and ground throughout the chip. Unlike modern chips, this chip only has a single metal layer, so power and ground distribution tends to get in the way of useful circuitry.

The die with some functional blocks identified.

The chip is centered around a 16-bit bus (yellow line) that connects many parts of the chip. To write to the bus, a circuit pulls bus lines low. The bus lines are kept high by default by 16 pull-up transistors. This approach was fairly common in the NMOS era. However, performance is limited by the relatively weak pull-up current, making bus lines slow to go high due to R-C delays. For higher performance, some chips would precharge the bus high during one clock cycle and then pull lines low during the next cycle.

The two groups of I/O pins at the bottom are connected to the input buffer on the left and the output buffer on the right. The input buffer includes XOR circuits to compute the parity of each byte. Curiously, only 6 bits of the inputs are connected to the main bus, although other circuits use all 8 bits. The buffer also has a circuit to test for a zero value, but only using 5 of the bits.

I've put red boxes around the numerous PLAs, which can be identified by their grids of transistors. This chip has an unusually large number of PLAs. Eric Schlaepfer hypothesizes that the chip was designed on a prototype circuit board using commercial PAL chips for flexibility, and then they transferred the prototype to silicon, preserving the PLA structure. I didn't see any obvious structure to the PLAs; they all seemed to have wires going all over.

The miscellaneous logic scattered around the chip includes many latches and bus drivers; the latch circuit is similar to the memory cells. I didn't fully reverse-engineer this circuitry but I didn't see anything that looked particularly interesting, such as an ALU or counter. The circuitry near the PLAs could be latches as part of state machines, but I didn't investigate further.

I was hoping to find a recognizable processor inside the package, maybe a Motorola 6809 or 68000 processor. Instead, I found a complicated chip that doesn't appear to be a processor. It has a 16×16 memory block along with about 20 PLAs (Programmable Logic Arrays), a curiously large number. PLAs are commonly used in processors for decoding instructions, since they can match bit patterns. I couldn't find a datapath in the chip; I expected to see the ALU and registers organized in a large but regular 8-bit or 16-bit block of circuitry. The chip doesn't have any ROM4 so there's no microcode on the chip. For these reasons, I think the chip is not a processor or microcontroller, but a specialized data-handling chip, maybe using the PLAs to interpret bits of a protocol.

The chip is built with NMOS technology, the same as the 6502 and 8086 for instance, rather than CMOS technology that is used in modern chips. I measured the transistor features and the chip appears to be built with a 3.5 µm process (not nm!), which Motorola also used for the 68000 processor (1979).

The memory buffer

The chip has a 16×16 memory buffer, which could be a register file or a FIFO buffer. One interesting feature is that the buffer is triple-ported, so it can handle two reads and one write at the same time. The buffer is implemented as a grid of cells, each storing one bit. Each row corresponds to a 16-bit word, while each column corresponds to one bit in a word. Horizontal control lines (made of polysilicon) select which word gets written or read, while vertical bit lines of metal transmit each bit of the word as it is written or read.

The microscope photo below shows two memory cells. These cells are repeated to create the entire memory buffer. The white vertical lines are metal wiring. The short segments are connections within a cell. The thicker vertical lines are power and ground. The thinner lines are the read and write bit lines. The silicon die itself is underneath the metal. The pinkish regions are active silicon, doped to make it conductive. The speckled golden lines are polysilicon wires between the silicon and the metal. Polysilicon has two roles: most importantly, when it crosses active silicon, it forms the gate of a transistor. But polysilicon is also used as wiring, important since this chip only has one layer of metal. The large, dark circles are contacts, connections between the metal layer and the silicon. Smaller square regions are contacts between silicon and polysilicon.

Two memory cells, side by side, as they appear under the microscope.

It was too difficult to interpret the circuits when they were obscured by the metal layer so I dissolved the metal layer and oxide with hydrochloric acid and Armour Etch respectively. The photo below shows the die with the metal removed; the greenish areas are remnants in areas where the metal was thick, mostly power and ground supplies. The dark regions in this image are regions of doped silicon. These are the active areas of the chip, showing the blocks of circuitry. There are also some thin lines of polysilicon wiring. The memory buffer is the large block on the right, just below the center.

The chip with the metal layer removed. Click to zoom in on the image.

Like most implementations of static RAM, each storage cell of the buffer is implemented with cross-coupled inverters, with the output of one inverter feeding into the input of the other. To write a new value to the cell, the new value simply overpowers the inverter output, forcing the cell to the new state. To support this, one of the inverters is designed to be weak, generating a smaller signal than a regular inverter. Most circuits that I've examined make the inverter weak by using a transistor with a longer gate. This chip, however, uses a circuit that I haven't seen before: an additional transistor, configured to limit the current from the inverter.

The schematic below shows one cell. Each cell uses ten transistors, so it is a "10T" cell. To support multiple reads and writes, each row of cells has three horizontal control signals: one to write to the word, and two to read. Each bit position has one vertical bit line to provide the write data and two vertical bit lines for the data that is read. Pass transistors connect the bit lines to the selected cells to perform a read or a write, allowing the data to flow in or out of the cell. The symbol that looks like an op-amp is a two-transistor NMOS buffer to amplify the signal when reading the cell.

Schematic of one memory cell.

With the metal layer removed, it is easier to see the underlying silicon circuitry and reverse-engineer it. The diagram below shows the silicon and polysilicon for one storage cell, corresponding to the schematic above. (Imagine vertical metal lines for power, ground, and the three bitlines.)

One memory cell with the metal layer removed. I etched the die a few seconds too long so some of the polysilicon is very thin or missing.

The output from the memory unit contains a byte swapper. A 16-bit word is assembled with one byte taken from the read 1 output and the other byte from the read 2 output, and the two bytes can be swapped. This was probably used to read a 16-bit word that was unaligned in memory and present it as an aligned word.

Parity circuits

In the lower right part of the chip are two parity circuits, each computing the parity of an 8-bit input. The parity of an input is computed by XORing the bits together through a tree of 2-input XOR gates. First, four gates process pairs of input bits. Next, two XOR gates combine the outputs of the first gates. Finally, an XOR gate combines the two previous outputs to generate the final parity.
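
In software terms, the same three-level XOR tree looks like this (a Python sketch for illustration, not the chip's actual netlist):

def parity8(byte: int) -> int:
    """XOR all 8 bits together with a tree of 2-input XORs (1 if an odd number of bits are set)."""
    bits = [(byte >> i) & 1 for i in range(8)]
    # First level: four XOR gates, each processing a pair of input bits.
    level1 = [bits[i] ^ bits[i + 1] for i in range(0, 8, 2)]
    # Second level: two XOR gates combine the outputs of the first level.
    level2 = [level1[0] ^ level1[1], level1[2] ^ level1[3]]
    # Final XOR gate produces the parity bit.
    return level2[0] ^ level2[1]

assert parity8(0b10110100) == 0  # four bits set, so the XOR of all bits is 0
assert parity8(0b10110101) == 1  # five bits set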

The arrangement of the 14 XOR gates to compute parity of the two 8-bit values A and B.

The schematic below shows how an XOR gate is built from a NOR gate and an AND-NOR gate. If both inputs are 0, the first NOR gate forces the output to 0. If both inputs are 1, the AND gate forces the output to 0. Thus, the circuit computes XOR. Each labeled block above implements the XOR circuit below.

Schematic of an XOR gate.

Conclusion

My conclusion is that the processor for the Keystone II board is probably one of the other chips, one of the IBM metal-can MST packages, and this chip helps with data movement in some way. It would be possible to trace out the complete circuitry of the chip and determine exactly how it functions, but that is too time-consuming a project for this relatively obscure chip.

Follow me on Twitter @kenshirriff or RSS for more chip posts. I'm also on Mastodon occasionally as @[email protected]. Thanks to Al Kossow for providing the chip and Dag Spicer for providing photos. Thanks to Eric Schlaepfer for discussion.

Notes and references

  1. The 3274 Control Unit was replaced by the 3174 Establishment Controller, introduced in 1986. An "Establishment Controller" managed a cluster of peripherals or PCs connected to a host mainframe, essentially a box that provided a "kitchen-sink" of functionality including terminal support, local disk storage, Ethernet or token-ring networking, ASCII terminal support, encryption/decryption, and modem support. These units ranged from PC-sized boxes to mini-fridge-sized boxes, depending on how much functionality was required. 

  2. I'm serious that my laptop can barely handle one person; my 2017 MacBook Air starts dropping characters if it has even a moderate load, and I have to start one-finger typing. You would think that a 1.8 GHz dual-core i5 processor could handle more than 2 characters per second. I don't know if there's something wrong with it, or if modern software just has too much overhead. Don't worry, I upgraded and do most of my work on a faster, more recent laptop. 

  3. The IBM hardware model had the CPU focusing on the big picture, while the hierarchy of boxes underneath processed data, performed storage, handled printing, and so forth. In a sense, this paralleled the structure of offices in that era, where executives had assistants and secretaries to do the tedious work for them: typing, filing, and so forth. Nowadays, the computer hierarchy and the office hierarchy are both considerably flatter. Maybe there's a connection? 

  4. A ROM and a PLA are similar in many ways. The general distinction is that a ROM activates one word (row) at a time, while a PLA can activate multiple rows at a time and combine the values, giving more flexibility. A ROM generally has a binary decoder to select the row. This decoder can be recognized by its binary structure: transistors alternating by 1's, by 2's, by 4's, and so forth. 

Programming bots for Mastodon: fetching data via the API

Mastodon has a rich, easy-to-use API, and developing clients or bots to interact with it is within your reach in a number of languages.

Developing interfaces or clients for Mastodon has a complexity curve to climb, as with everything in life, but there's no reason to start out attempting difficult, authenticated operations involving multiple instances: the ideal is to begin with simple operations, involving a single API endpoint at a time and accessing only public data, until you get the hang of it and can aim higher.

Today's article is an introduction to this kind of application for those who already program and are familiar with web APIs. We'll use a shell script to list the hashtags featured in an instance's Trending list, like the ones in this example - but obtaining them via an API endpoint, not by scraping Mastodon's web interface.

The Mastodon API is well documented, and I recommend up front the chapters "Getting Started with the API", "Playing with public data", and "trends API methods" for a deeper understanding of what I'll present in far less detail below.

To run the examples below, you need a Bash-compatible shell, with curl and gron (which makes it easier to consume the JSON responses) installed, on a POSIX-compatible system (such as most Linux distributions, macOS, and Unix in general).

The Mastodon API: querying and receiving the response

First of all, note that the Mastodon API is accessed over the web, by sending queries or commands (sometimes with extra detail - for example, via form fields or URL parameters) and receiving responses in JSON format.

For today's example we'll use the "View trending tags" endpoint of the Mastodon API, which can be accessed - on open instances - through a URL like this one: https://social.br-linux.org/api/v1/trends/tags

Open the URL above in a browser and the response will be a long line of text, in JSON format, listing the hashtags that the instance currently sees as trending, and exposing a few attributes for each of them, such as the number of accounts that mentioned it and how many times it was mentioned, among others.

Working with the API response

By making the request above in a browser, you've already used the API the same way more advanced tools would, but you received the response in a format that's inconvenient to work with - unlike what happens when we access it through higher-level libraries such as Mastodon.py and the many other Mastodon libraries available for various languages.

Let's get to know the trending tags API response and break it down, item by item.

We won't cover those libraries today; instead, we'll dissect the API response at this endpoint - chosen as an example because it's simple - to better cement how things work.

First, access the endpoint from the shell, using the already-mentioned curl and gron, to see a formatted version of the API response. To do so, type:

curl -s "https://social.br-linux.org/api/v1/trends/tags" | gron

Note that the response brings back a series of tags, and for each of them the daily statistics are presented, as in this excerpt:

(...)
json[7].history[0] = {};
json[7].history[0].accounts = "22";
json[7].history[0].day = "1720483200";
json[7].history[0].uses = "29";
(...)
json[7].name = "introductions";
json[7].url = "https://social.br-linux.org/tags/introductions";
(...)

Note that I highlighted two lines, because they're precisely the ones that interest us for this example: the instance's view of the number of distinct accounts that used the tag today (history[0].accounts) and the name of the tag (name).

From that observation, and using regular expressions to select lines like the two highlighted above, we arrive at the following script:

#!/usr/bin/env bash
#
# bot_exemplo.sh - displays the trending hashtags of a Mastodon instance
#
# Copyright (c) 2024,  Augusto Campos (http://augustocampos.net/).
# Licensed under the Apache License, Version 2.0.
#

instancia="https://social.br-linux.org"

curl -s "$instancia/api/v1/trends/tags" | gron | awk '

/0].accounts/ { gsub(/(.* = |[";])/,""); contas=$0 }       # accounts that used the tag today
/\.name/      { gsub(/.* = |[";]/,""); print contas, $0 }  # tag name: print "accounts name"

' | sort -nr

It queries the API (curl and gron), uses awk to filter lines like the ones highlighted above, extracting just the values from them (removing the quotes, the semicolons, and everything that comes before the equals sign), and finally sorts (sort) the tags by how many accounts mentioned them, in descending order.
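
For comparison, here is a roughly equivalent sketch in Python using only the standard library (my own illustration, not part of the original article): it hits the same endpoint and sorts by the per-account count.

import json
import urllib.request

instance = "https://social.br-linux.org"

with urllib.request.urlopen(f"{instance}/api/v1/trends/tags") as response:
    tags = json.load(response)

# Sort by how many distinct accounts used each tag today (history[0].accounts).
tags.sort(key=lambda tag: int(tag["history"][0]["accounts"]), reverse=True)

for tag in tags:
    print(tag["history"][0]["accounts"], tag["name"])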

Running it today produced the following output:

Note that, in its current form, this script would be more accurately described as a client than as a bot. Even so, its functionality is an integral part of the bot that generates the data for the curation of the @TrendsBR profile, and other pieces of that same bot (or of others I maintain) will probably show up in future articles that go deeper into the concepts we've begun to see today.

The article "Programando bots para o Mastodon: obtendo dados via API" was originally published on the TRILUX site, by Augusto Campos.

Beam of Light

Einstein's theories solved a longstanding mystery about Mercury: Why it gets so hot. "It's because," he pointed out, "the sun is right there."

Approximating the Sierpinski Triangle on my CNC

One of my big hobbies outside of tech is chess. I like to play it, and I also help run our town's chess club. As part of that, we like to run rated tournaments to get our members some experience in a low-pressure tournament environment. These are my responsibility to organize and run, since I'm our club's only certified tournament director.

We hosted our first tournament last winter. The whole thing went off well, except that I was supposed to have a prize medal for the winner. Our vendor fell through, and I had nothing except congratulations for the winner1.

With our second tournament coming in May, I needed a solution. Or rather, I used the "solution" as an excuse to get a new toy. Instead of buying a medal, I'd buy a tool that I could use to make medals for the winners. And then they're also one-of-a-kind.

I made a simple model of a medal and convinced myself that this would work, that I could somehow figure it out and pull it off. I could either make the medal on a 3D printer or cut it out of wood with an automated woodworking tool (called a CNC), and since I already have a well-equipped woodshop, I opted for the latter, to complement what I already have.

Before long, my very cheap CNC2 arrived in the mail, and I had to get to work. But I didn't know what I was doing, so first I played around to learn how to use it. My first project was very fun, and it accelerated my learning by containing many of the difficult cases that my real use case wouldn't have exposed.

What's a CNC?

"CNC" means "computer numerical control", and it refers to tools which are automated and controlled by computers. This is in contrast to most tools, which are controlled manually with at best precise digital measurements.

Within the category of CNC, you have a lot of different tools. The most common are routers and mills, but the rest are interesting as well.

  • CNC routers have spinny sharp bits on them that can cut slots, holes, fancy shapes, whatever! They move around on an x-y plane, but also have some z-axis depth control, so they work in 3 dimensions. Routers usually cut wood and soft materials.
  • CNC mills also use spinny sharp bits, and are similar to routers. The distinction comes in their axes (mills tend to have smaller working areas, but much more depth capacity) and rigidity and torque (mills are much better at cutting harder things). Mills also tend to have more axes, such as 5, by adding rotation to be able to produce more complicated parts. You usually cut metal and hard materials on a mill.
  • 3D printers are CNC!
  • Laser cutters are CNC!
  • You can have a CNC lathe!

The category is gigantic. In my case, I got a CNC router, but I often say I'm milling something on it. Now let's see how we do that.

How do we CNC something?

See, while I had made a model and bought the CNC, I hadn't accounted for any of the rest of the process of actually milling something. In my head, the process for making something on the CNC was roughly:

  1. Design the part in CAD.
  2. Run it on the CNC and get finished part!

This illusion was shattered when one of my friends who does a lot of 3D printing asked me what I was using for my CAM software. "CAM software? Uhhh..."

It turns out, after you make a part in CAD, you then have to convert that into tool paths for the machine. These are the instructions for how it moves around, how fast it spins the motor, and, well, everything it does. The software that does that is called CAM (computer-aided machining), and it turns out it's not trivial. And if you design your part without toolpaths in mind, you'll likely make a part that you can't actually mill!

So the real process for making something on my CNC is more like:

  1. Design the part in CAD.
  2. Put it into CAM and make toolpaths, then go to 1 to fix design mistakes. Repeat a lot.
  3. Upload it to my CNC, run it, break a bit, and go to 2 to fix toolpath mistakes.

Fixing bugs in a physical manufacturing process can be a lot slower and more expensive than in software.

Let's look at what I wanted to make, then see how to do it. Since I'm a Linux user, this process required using some less common tools; the usual ones are Windows/Mac only.

What's the Sierpinski triangle?

A fractal is basically a shape that's self-repeating. The most famous fractal is probably the famed Mandelbrot set. This one would be fun to make, but the rapid approach toward infinitely small curves makes it hard to mill.

Instead, I picked the Sierpinski triangle, which is another fractal. It starts with an equilateral triangle. Then inside of it, you draw another equilateral triangle, upside down. This partitions it into 3 "up-facing" triangles and 1 "down-facing" triangle. Then you just repeat this process (smaller!) inside each of the up-facing triangles.

Illustration of Sierpinski triangle, four iterations

Here we can see four iterations of it. The real fractal goes on infinitely, getting infinitesimally small. This forms a fascinating image. And more important for my purposes, it's something you can sort of mill! Obviously you can't go infinitely small in a physical process, but it's a lot easier to approximate this than it is to approximate a Mandelbrot set on my CNC.

So now we have to turn it into something on the computer, working our way closer to instructions the CNC uses.
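
The recursion itself is easy to sketch before any CAD gets involved. Here is a small Python version (an illustration of mine, not the OpenSCAD model below) that collects the corner points of every up-facing triangle after a given number of iterations:

def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

def sierpinski(a, b, c, depth):
    """Return the corners of every up-facing triangle after `depth` subdivisions."""
    if depth == 0:
        return [(a, b, c)]
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    # Recurse into the three corner (up-facing) triangles; the middle one is left alone.
    return (sierpinski(a, ab, ca, depth - 1)
            + sierpinski(ab, b, bc, depth - 1)
            + sierpinski(ca, bc, c, depth - 1))

triangles = sierpinski((0, 0), (1, 0), (0.5, 3 ** 0.5 / 2), 4)
print(len(triangles))  # 3**4 = 81 small up-facing triangles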

Modeling it with OpenSCAD

The first step for me was modeling it with a CAD program. This is a natural fit for OpenSCAD, which lets you generate models through code3. I'd dabbled before to make a proof-of-concept model of a prize medal, but doing this required me to go deeper into OpenSCAD and start using functions and modules.

The strategy I took for modeling this was to first focus on generating a model of the triangle, with each iteration stacked atop the previous one, and then separately figure out how to remove that from the wood block we're going to be working with. This turned out to be very helpful for debugging, since I could separate out the layers—if this were subtracted out of our stock, those would be stuck invisibly inside a block! After we have the triangle model, we'll make a rectangular prism (our block of wood) and subtract our triangle out of it.

My first step was laying out some parameters for the model as constants. This way, if we need to change anything, we can update these. INCH is included as a constant, since OpenSCAD is unitless, but I'm working with it assuming it is millimeters (which my CNC expects) while my woodworking equipment is in Imperial units (tablesaw and thickness planer in particular). You'll notice that the layers are very thin, less than 1mm! That's because, again, my CNC is cheap and slow, and any more than that was going to take far too long to produce. But let's call it an ✨ aesthetic choice ✨.

INCH=25.4;
buffer=0.5*INCH;
width=4*INCH + buffer;
thickness=0.25*INCH;
layer_height=0.75;

For this model, I took a very iterative approach, drawing from all my software engineering experience. (I don't know what the equivalent of a unit test would be in this world, though. If you do, please let me know!) To start out, I made a model of the Sierpinski triangle in OpenSCAD. I did one layer first, to get an equilateral triangle rendering. Here are some of the functions I ended up with4.

function sq(a) = a*a;

function midpoint(a, b) = [(b[0]+a[0])/2, (b[1]+a[1])/2];

function triangle_top(a, b) =
    let (length = sqrt(sq(a[0]-b[0]) + sq(a[1]-b[1])),
        height = length * sqrt(3) / 2,
        mp = midpoint(a,b),
        xd = (b[1]-a[1]) / length * height,
        yd = (b[0]-a[0]) / length * height)
     [mp[0] - xd, mp[1] + yd];

module eq_triangle(a, b) {
    c = triangle_top(a, b);
    points = [a, b, c];
    offset(1.5, $fn=20)offset(delta=-3)offset(1.5)polygon(points);
}

Then I did the iterative step, to work out the math of it. One of my early attempts wound up with this beauty:

Screenshot of an attempted Sierpinski triangle, with the layers spreading out in the x-y axes instead of stacking atop each other.

I do think that I made art here, but it's really not what I was going for—and it's not going to be something I can mill! So I fixed my math, and with some struggles I got a working model. Here's the module for that, along with the render.

module sierpinski(layers, width) {
    origin = [0,0];
    a = origin;
    b = [origin[0]+width, origin[1]];
    mp = midpoint(a, b);

    // to subtract it out of the block instead, use 1*layer_height
    translate([0,0,1*layer_height]) {
        linear_extrude(layer_height+0.1) {
            eq_triangle(a, b);
        }

    if (layers > 0) {
            translate([0,0,0])
            sierpinski(layers-1, width/2);

            translate([mp[0],mp[1],0])
            sierpinski(layers-1, width/2);

            tt = triangle_top(a, mp);
            translate([tt[0],tt[1],0])
            sierpinski(layers-1, width/2);
        }
    }
}

Screenshot of a Sierpinski triangle going upways.

You'll notice that this looks sorta like some of those kids' building blocks from a notoriously litigious toy company. Why's that? Because if you have two straight edges contacting each other, OpenSCAD will happily display it but will then complain about a 2-manifold something-or-another when you try to render it for real [5]. One of my friends explained that this is because the model is assumed to have those properties for optimization purposes (renders can already be slow) so if they're violated, such as two straight edges contacting each other that have nothing else attached to them, it can't compute the model! We resolve this by adding a fillet on the inner corners so they're rounded, and we get this look!

This is ultimately to our benefit, though, because we can't produce sharp inner corners on the CNC. We're using a round bit, spinning in circles. So this better models what will actually happen on the CNC, and we'll get fewer surprises in the later steps.

Now we have a Sierpinski triangle going upways, but we ultimately want to cut it out of our stock. To do that I adjusted a constant. I could probably actually flip it in the model, but I was tired and picked the first thing that worked. And then we subtract the flipped model out of our stock!

difference() {
    linear_extrude(thickness) {
        square([width,width*sqrt(3)/2]);
    };
    translate([buffer/2,buffer/2,thickness]) {
        sierpinski(5, width - buffer);
    };
}

Screenshot of a render of Sierpinski triangle carved out of a block of wood.

Whew, now we have the model! That was the hard part, right? ...Right? Ahhh hahaha, sweet summer child that I was.

Turning it into commands for the CNC

This is the part where we break out the CAM software. CAM (computer-aided manufacturing) software turns your model into commands that your machine can run. This is often G-code. You can think of G-code as sort of like assembly code that a CNC runs.

Here's a snippet from one of my models:

G21
G90
M3 S1000
G0 X1.4560 Y0.6702 F6000
G0 Z1.0 F300
G1 Z-0.6250 F250
G1 X2.3053 Y0.2921 F500

Each of these commands either sets a mode on the machine (G21 sets the unit to be millimeters) or performs a command (M3 starts the spindle, G0 and G1 are forms of movement). This would be incredibly tedious to write out by hand, but it's theoretically doable [6].

To get the model into this form, we pop it into our CAM software and do some work. The CAM software I use is Kiri:moto. This software is a whole other thing you have to learn.

The crux of it is this: You tell it which operations you want the machine to do, and then it tries to figure out how to do it. Along the way, there are a ton of parameters to tune. Of course you have to tell it the tools you have (in my case, a 1mm endmill bit) and it has to know some information about your CNC. And then for each operation, you need to tell it things like how fast to turn the spindle, how much to move over or down on each pass, if you want to leave excess material (very handy to do rough passes first, then come back and clean it up).

Here's what it looks like from my most recent run of this model.

Screenshot of Kiri:moto running in Chromium.

When I first opened this software, I was overwhelmed. What are all these boxes? You don't have to understand each of them, but understanding them will help you avoid broken bits and repeated trial runs on your CNC.

What's really handy are the preview and animation tabs, which let you see the paths it's going to generate and watch it pretend to mill out your part. Really neat, and a good way to validate a design!

After something looks good in your CAM software (which took me as long as modeling the part the first time, but is a lot faster now), then you download the G-code and go to the workshop to run it.

Making it real

With the G-code in hand, I ran to the workshop and made the part. And it worked! I was happy with it, but also... It had blemishes and it had artifacts from the machining, where my toolpaths were clearly bad. It was rough, and it showed my inexperience. So I did it again, and the second one I made is where I learned a lot of ways to improve (and some more silly mistakes to make).

Here's the first one, fresh off the machine.

Then the second one in progress.

And finally, the second one side-by-side with the first one. The second is on the left (I'm a monster, sorry), and if you zoom in on the vertices of the triangles of each, you can really see the artifacts on the first one. It's so sloppy! The second one is so clean!

As a bonus, here's the oak medal I made for a chess tournament, also fresh off the CNC. I finished in time, with a day or two to spare, and the tournament went off without a hitch!

Broken bits and deep soulful joy

This project taught me a lot of lessons very quickly. I broke a few bits making parts and left scars on my machine. Each time it was for something silly, and each one was a lesson. A lesson in setting up parts on the machine. In designing good toolpaths to improve schedules and end results. In how to remove parts from the machine without breaking them or your bits. And in how to design things that can be physically produced.

The lessons are hard-won, and each one usually comes with some physical marker of your failure. Maybe it's a broken bit that you needed to produce your part, so you're blocked until new ones arrive. Or maybe it's a ruined part and a lost day's work. Or maybe it's physical scars on your machine, forever commemorating that silly mistake.

These hard-won lessons can wear you down.

The iterations were long, and each time I was sort of wondering, why is it that I'm doing this? Fixing bugs in software is usually a lot faster and doesn't result in wasted material. But when it worked? Then I remembered exactly why I'm doing this.

Because making physical things is joyous and makes my soul sing. There is a joy that I get from holding a small little piece that I made that is so often missing in my work as a software engineer.

It doesn't matter what it is, making physical things is a joyous and vexing process. Baking a cake, making a fractal, framing a photo. Each of these connects me to reality and grounds me in our physical world in a way that's often missing from software alone.

Getting to hold a thing you made, and show it to a friend? It makes all the broken bits worth it.


Thank you to Dan Reich for the helpful feedback on a draft of this post!


[1]

If for some reason he's reading this (or you know him; he's not from our club), email me! I'd love to hook you up with a retroactive medal.

[2]

I'm talking $250 cheap. This thing isn't going to do well on metal, and it won't win any speed awards, but it can do small jobs in wood.

[3]

I did also try modeling some other things with FreeCAD, not least because you can do CAM inside it as well. I had it crash on me repeatedly, and it doesn't fit my brain as well as OpenSCAD (since I'm first and foremost a programmer). Maybe I'll try out another one someday, but so far OpenSCAD is treating me well!

[4]

The offset bits are to fillet the corners, which comes back around later. This code is presented in the logical order, but not chronological order that I developed it in. I don't think anyone needs to see the chaos that is my development process.

[5]

Now I can get the model to render without this issue and without fillets, I think because all the layers touch and it's in the stock. But at any rate, this better shows what will actually happen on the CNC.

[6]

For another project that's going on in the background, but delayed due to health issues, I'm planning to generate G-code directly from another program. Still not doing it by hand, but I'll have to do a lot of inspection and reading of the G-code.

Melbourne Photography X - Night

South Wharf

A collection of nightscapes shot on the Fuji GW690iii medium format camera with Kodak Vision 3 500T ECN-2 film.

Palais

Yarra

South Wharf ii

Luna Park

South Wharf iii

Dinner

Yarra ii

Under construction

Yarra iii

The Corner

System Design - CQRS (Command Query Responsibility Segregation)

This chapter aims to add a few more strategies for handling data in modern systems, distributed or not, to your toolbox. The need to broaden one's repertoire of design patterns for dealing with large-scale domain data has become increasingly present in the daily work of software engineers and architects, and it is an important marker of seniority. In the long run and at scale, I consider data the most critical and difficult part to deal with among all the disciplines of software engineering. In what follows we will cover the CQRS pattern and some implementation options that can be adapted, combined and extended as teams gain experience and their knowledge of the business domains matures.

Defining CQRS

CQRS, or Command Query Responsibility Segregation, is an architectural pattern whose goal is to separate a system's write and read responsibilities. In CQRS, write operations are called "commands", since the write side is meant to perform imperative operations that change the state of one or more entities of the system. Read operations are called "queries", whose only purpose is to provide an optimized way to read the data of that domain.

The central goal of CQRS is to increase the performance and scalability of a service through data models that are specifically optimized for their respective tasks, betting on the idea that, by separating command and query operations, each part of the system can be scaled independently, allowing a more efficient use of the computational resources allocated to each of those tasks.

In short, the CQRS pattern involves using two or more databases that replicate the same data, each with a structure tailored to a different need. We will explore these ideas, along with more complex and powerful approaches, throughout the chapter.


Separation of Responsibilities

The central principle of CQRS is the separation of responsibilities between read operations and write operations, using different infrastructures and data models.

Conceptual diagram of the separation of responsibilities in CQRS

Commands encapsulate all the information needed to perform write operations, such as creating, updating or deleting a record, and also apply all the validation rules required to guarantee data integrity. Conceptually, a command tends to refer to the act of "processing something", changing state in response to a behavior, but it can also be applied to manipulate anemic entities when necessary. The write model should focus on guaranteeing data consistency and integrity. It is common to use relational databases that support transactions and provide ACID guarantees (Atomicity, Consistency, Isolation, Durability) to ensure consistency and execute transactions atomically. Write databases that need strong consistency rely on normalization to optimize performance and integrity.

Queries are responsible for returning data without changing the state of the system. Their databases are optimized for fast, efficient retrieval of information, often using techniques such as caching, read replicas or data denormalization to improve performance in this kind of scenario. NoSQL databases are frequently used in this context, since they offer high query performance and scale horizontally well, although SQL databases can also be used in a denormalized fashion without any problem.

In short, a very simple application of CQRS would be to use a normalized model in a SQL write database to guarantee all the consistency and integrity and, driven by the command events, perform a second write to another database holding a materialized, denormalized view optimized for retrieval, or to a NoSQL database whose document structure is very close to the response payload.

A Perspective on Domain Models

The command model is responsible for manipulating the system's data and guaranteeing the consistency and integrity of its operations. This model is usually the more complex one, since it incorporates all the business rules, validations and logic that need to be applied whenever the state of the system changes. The command model often follows the Rich Domain Model pattern, where the business logic is embedded in the domain entities, and relies on ACID transactions to guarantee consistent state changes over the lifecycle of the domain data. Picture a scenario where, in a hypothetical hospital medical-records system, a doctor needs to create a new prescription for a patient. The command must verify that the doctor is valid, that the patient is valid, that the medication exists, that the doctor is authorized to prescribe that medication given their specialty, and finally persist it in the database. All of this logic is encapsulated inside the command.
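
To make that concrete, here is a minimal Python sketch of what such a command handler could look like. The class names, the injected repositories and the event bus are hypothetical placeholders rather than any specific framework; the point is only that validation, persistence and event publication live together on the command side.

from dataclasses import dataclass

@dataclass
class CreatePrescription:
    doctor_id: int
    patient_id: int
    medications: list  # e.g. [{"medication_id": 1, "schedule": "08:00", "dosage": "100mg"}]

class CreatePrescriptionHandler:
    def __init__(self, doctors, patients, medications, prescriptions, event_bus):
        self.doctors = doctors              # repositories backed by the normalized write database
        self.patients = patients
        self.medications = medications
        self.prescriptions = prescriptions
        self.event_bus = event_bus          # publishes the events that feed the query model

    def handle(self, cmd: CreatePrescription) -> int:
        doctor = self.doctors.get(cmd.doctor_id)
        patient = self.patients.get(cmd.patient_id)
        if doctor is None or patient is None:
            raise ValueError("unknown doctor or patient")
        for item in cmd.medications:
            medication = self.medications.get(item["medication_id"])
            if medication is None:
                raise ValueError("unknown medication")
            if not doctor.can_prescribe(medication):
                raise ValueError("doctor is not authorized to prescribe this medication")
        prescription_id = self.prescriptions.save(cmd)  # ACID write in the command model
        self.event_bus.publish("PrescricaoCriada", {"id_prescricao": prescription_id})
        return prescription_id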

The query model is optimized for fast reading and retrieval of data. Unlike the command model, it does not need to carry complex business logic or validations, since its sole responsibility is to provide data to be displayed or used in other parts of the system after a command has already been executed. For example, a denormalized model of the prescriptions can be created to group the doctor, patient and prescribed-medication information in a readable, fast-to-fetch form.


Implementation Models

CQRS implementations can range from the simplest ones, which stay within the bounded context of a single domain or feature, to more complex ones that incrementally aggregate information from several sources and stages of a larger process. Here we will look at a few alternatives and implementation models that help illustrate how far this kind of architecture can go in solving scale and resilience problems.


CQRS with SQL Databases and Materialized Views

One of the simplest examples of a CQRS implementation is projecting a normalized SQL model into another, denormalized SQL model. The simplicity of this approach means the new denormalized table may or may not live in the same instance or schema as the rest of the domain's normalized tables. Evolving to a separate database is a step that can happen easily later, although it would require additional processes and infrastructure if needed.

Let's assume a medication-prescription feature of a fictional hospital system, with the tables Medicos, Pacientes, Medicamentos, Prescricoes and Prescricao_Medicamentos, the last one holding the 1:N link to the prescribed medications. This model provides strong relational consistency: unregistered medications cannot be prescribed, unregistered patients cannot be treated, and unregistered doctors cannot operate or prescribe medication.

CREATE TABLE IF NOT EXISTS Medicos (
    id SERIAL PRIMARY KEY,
    nome VARCHAR(255) NOT NULL,
    especialidade VARCHAR(255) NOT NULL,
    crm VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS Pacientes (
    id SERIAL PRIMARY KEY,
    nome VARCHAR(255) NOT NULL,
    data_nascimento DATE NOT NULL,
    endereco VARCHAR(255)
);

CREATE TABLE IF NOT EXISTS Medicamentos (
    id SERIAL PRIMARY KEY,
    nome VARCHAR(255) NOT NULL,
    descricao TEXT
);

CREATE TABLE IF NOT EXISTS Prescricoes (
    id SERIAL PRIMARY KEY,
    id_medico INT NOT NULL,
    id_paciente INT NOT NULL,
    data_prescricao TIMESTAMP NOT NULL,
    FOREIGN KEY (id_medico) REFERENCES Medicos(id),
    FOREIGN KEY (id_paciente) REFERENCES Pacientes(id)
);

CREATE TABLE IF NOT EXISTS Prescricao_Medicamentos (
    id SERIAL PRIMARY KEY,
    id_prescricao INT NOT NULL,
    id_medicamento INT NOT NULL,
    horario VARCHAR(50) NOT NULL,
    dosagem VARCHAR(50) NOT NULL,
    FOREIGN KEY (id_prescricao) REFERENCES Prescricoes(id),
    FOREIGN KEY (id_medicamento) REFERENCES Medicamentos(id)
);

Example of the normalized write model, designed for relationship integrity

As superficial as it is, this model guarantees data integrity while the data is being manipulated. However, another feature of the prescription system is generating reports and work orders for the hospital pharmacy so it can prepare the medications and control their exit from stock. This feature is critical: the medications need triage, tracking, accounting and a visually easy way to be separated and routed to the room/ward where the patient is. To build a view like this on a highly normalized model, a large number of joins across tables is required.

SELECT
    p.id AS id_prescricao,
    p.data_prescricao,
    m.nome AS nome_medico,
    m.especialidade,
    pac.nome AS nome_paciente,
    pac.data_nascimento,
    pac.endereco,
    med.nome AS nome_medicamento,
    pm.horario,
    pm.dosagem
FROM
    Prescricoes p
    LEFT JOIN Medicos m ON p.id_medico = m.id
    LEFT JOIN Pacientes pac ON p.id_paciente = pac.id
    LEFT JOIN Prescricao_Medicamentos pm ON p.id = pm.id_prescricao
    LEFT JOIN Medicamentos med ON pm.id_medicamento = med.id
WHERE
    p.id = 1; -- ID of the specific prescription

Example of a query to retrieve the medications requested by a medical prescription

Output

1	2023-05-20 14:30:00.000	Dr. João Silva	Cardiologia	Maria Oliveira	1985-07-10	Rua das Flores, 123	Aspirina	08:00	100mg
1	2023-05-20 14:30:00.000	Dr. João Silva	Cardiologia	Maria Oliveira	1985-07-10	Rua das Flores, 123	Paracetamol	20:00	500mg
1	2023-05-20 14:30:00.000	Dr. João Silva	Cardiologia	Maria Oliveira	1985-07-10	Rua das Flores, 123	Aspirina	08:00	100mg

To externalize this query into a specialized model, a first possibility would be to create a second, semi-denormalized table, keeping only the consistency between IDs and relationships to avoid corruption at a basic level, and flattening the prescribed medications into descriptive rows. This removes the need to constantly join tables, delivering exactly the view the pharmacy subsystem needs.

CREATE TABLE IF NOT EXISTS vw_prescricoes_medicamentos_detalhadas (
    id SERIAL PRIMARY KEY,
    id_prescricao INT,
    data_prescricao TIMESTAMP NOT NULL,
    id_medico INT NOT NULL,
    nome_medico VARCHAR(255) NOT NULL,
    especialidade_medico VARCHAR(255) NOT NULL,
    crm_medico VARCHAR(8) NOT NULL,
    id_paciente INT NOT NULL,
    nome_paciente VARCHAR(255) NOT NULL,
    data_nascimento_paciente DATE NOT NULL,
    endereco_paciente VARCHAR(255),
    id_medicamento INT NOT NULL,
    nome_medicamento VARCHAR(255) NOT NULL,
    descricao_medicamento TEXT,
    horario VARCHAR(50) NOT NULL,
    dosagem VARCHAR(50) NOT null,
    FOREIGN KEY (id_medico) REFERENCES Medicos(id),
    FOREIGN KEY (id_paciente) REFERENCES Pacientes(id),
    FOREIGN KEY (id_medicamento) REFERENCES Medicamentos(id),
    FOREIGN KEY (id_prescricao) REFERENCES Prescricoes(id)
);

Example of a read model optimized for pharmacy triage

To illustrate, assuming the query-optimized table lives in the same database, we can perform an initial load with the data already present locally, using the earlier query with all the necessary joins. After that, retrieving the detailed prescription data becomes a simple select on a single analytical table where the data is already compiled.

INSERT INTO vw_prescricoes_medicamentos_detalhadas (
    id_prescricao,
    data_prescricao,
    id_medico,
    nome_medico,
    especialidade_medico,
    crm_medico,
    id_paciente,
    nome_paciente,
    data_nascimento_paciente,
    endereco_paciente,
    id_medicamento,
    nome_medicamento,
    descricao_medicamento,
    horario,
    dosagem
)
SELECT
    p.id AS id_prescricao,
    p.data_prescricao,
    m.id AS id_medico,
    m.nome AS nome_medico,
    m.especialidade AS especialidade_medico,
    m.crm as crm_medico,
    pac.id AS id_paciente,
    pac.nome AS nome_paciente,
    pac.data_nascimento AS data_nascimento_paciente,
    pac.endereco AS endereco_paciente,
    med.id AS id_medicamento,
    med.nome AS nome_medicamento,
    med.descricao AS descricao_medicamento,
    pm.horario,
    pm.dosagem
FROM
    Prescricoes p
    JOIN Medicos m ON p.id_medico = m.id
    JOIN Pacientes pac ON p.id_paciente = pac.id
    JOIN Prescricao_Medicamentos pm ON p.id = pm.id_prescricao
    JOIN Medicamentos med ON pm.id_medicamento = med.id;

Example of the initial load of the view table with the data present in the normalized model

SELECT * FROM vw_prescricoes_medicamentos_detalhadas WHERE id_prescricao = 1;
1	1	2023-05-20 14:30:00.000	1	Dr. João Silva	Cardiologia	CRM12345	1	Maria Oliveira	1985-07-10	Rua das Flores, 123	1	Aspirina	Analgésico e anti-inflamatório	08:00	100mg
2	1	2023-05-20 14:30:00.000	1	Dr. João Silva	Cardiologia	CRM12345	1	Maria Oliveira	1985-07-10	Rua das Flores, 123	2	Paracetamol	Analgésico	20:00	500mg
20	1	2023-05-20 14:30:00.000	1	Dr. João Silva	Cardiologia	CRM12345	1	Maria Oliveira	1985-07-10	Rua das Flores, 123	1	Aspirina	Analgésico e anti-inflamatório	08:00	100mg

With the example above, we have an initial read-model optimization for the hospital pharmacy, from which its systems can retrieve data in a simplified way. This kind of strategy is very common for building specialized views in many types of systems, and it enables some interesting, simple approaches to segregating read and write responsibilities. However, running a data load like the one illustrated is not viable in transactional systems with significant volume: selecting the entire database to load a specialized table would not solve, and might even worsen, the scaling problems around that data. For that, we need to add extra responsibilities to the command and query models, very often accepting eventual consistency in the read models.

CQRS example diagram

To synchronize the write and read models in a healthy way, using messaging and events as intermediaries between them helps decouple the responsibilities and lets each side scale independently of the other. Eventual consistency, however, is a side effect that must be taken into account in the architecture design to make this behavior viable.


Eventual Consistency in CQRS

In the context of CQRS, eventual consistency can be of great value when it is anticipated and accepted in the design of the solution. Unlike traditional systems that can guarantee immediate consistency between data models, accepting an eventually consistent system presumes that it can operate inconsistently for some period of time without major problems, and also presumes that, over time, the system or entity will become consistent.

Diagram: CQRS applied to the pharmacy example

In practice, looking at a CQRS implementation that supports this kind of scenario, where the command and query models are separated and owned by distinct features and implementations, write operations are processed by the command model and then events or messages are produced to update the query model asynchronously. This implies there may be a delay before the query model reflects the latest changes made through the command model. During that interval, the system is in a state of "eventual consistency".

Diagram: event flow between the command and query models

Synchronizing the models requires additional computational effort: asynchronous messaging processes that move data through topics or queues and write to the query model, building views optimized for retrieval. This presupposes an additional, independent piece of behavior, one that should not aggressively impact performance.

To illustrate, after the data is processed and persisted in the write model, the process encapsulated in the command sends a message or event containing all the data a synchronization application or process needs to build the record's representation in the query model.
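
A minimal sketch of that synchronization step is shown below, assuming a PrescricaoCriada event that already carries the joined doctor, patient and medication data, and a read model stored in the vw_prescricoes_medicamentos_detalhadas table defined earlier. The connection string and the queue wiring that delivers raw_message are placeholders.

import json
import psycopg2  # assuming the read model lives in the PostgreSQL table created above

def handle_prescription_event(raw_message: str, read_db_dsn: str) -> None:
    event = json.loads(raw_message)
    conn = psycopg2.connect(read_db_dsn)
    with conn, conn.cursor() as cur:
        # One flattened row per prescribed medication, mirroring the read table.
        for med in event["medicamentos"]:
            cur.execute(
                """
                INSERT INTO vw_prescricoes_medicamentos_detalhadas
                    (id_prescricao, data_prescricao, id_medico, nome_medico,
                     especialidade_medico, crm_medico, id_paciente, nome_paciente,
                     data_nascimento_paciente, endereco_paciente, id_medicamento,
                     nome_medicamento, descricao_medicamento, horario, dosagem)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                """,
                (
                    event["id_prescricao"], event["data_prescricao"],
                    event["medico"]["id_medico"], event["medico"]["nome"],
                    event["medico"]["especialidade"], event["medico"]["crm"],
                    event["paciente"]["id_paciente"], event["paciente"]["nome"],
                    event["paciente"]["data_nascimento"], event["paciente"]["endereco"],
                    med["id_medicamento"], med["nome"], med.get("descricao"),
                    med["horario"], med["dosagem"],
                ),
            )
    conn.close()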


CQRS and Read Replicas

As write intensity grows because of model synchronization, the read model itself tends to become saturated by the workload, since it still concentrates a lot of the system's and the clients' concurrent reading and writing, however optimized those operations are. Looking at what CQRS sets out to solve, we may notice that, over time, we have merely moved the problem somewhere else. There are, however, other ways to optimize the read model in a SQL approach.

Diagram: CQRS with read replicas

If we take advantage of the eventual consistency we have already accepted between the models, we can use additional read replicas as the main database for the query model, leaving the primary instance dedicated to absorbing the synchronization writes and avoiding contention with API usage. This kind of approach considerably increases operational costs, but adds an extra layer of data resilience. In practical terms, if we assume that synchronization between the models happens through writes to both databases, and that queries cannot change entity state, we can add read-only instances to the setup to gain additional levels of performance.
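
A sketch of what that routing can look like inside the query service, assuming the database's own replication keeps the replica up to date; the connection strings are placeholders and the table is the one from the earlier examples.

import psycopg2

PRIMARY_DSN = "host=read-model-primary dbname=farmacia"   # receives the synchronization writes
REPLICA_DSN = "host=read-model-replica dbname=farmacia"   # read-only, serves the query API

def fetch_prescription(prescription_id: int):
    # Queries never change state, so they can always be answered by the replica.
    conn = psycopg2.connect(REPLICA_DSN)
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM vw_prescricoes_medicamentos_detalhadas WHERE id_prescricao = %s",
            (prescription_id,),
        )
        rows = cur.fetchall()
    conn.close()
    return rows

def apply_sync_event(insert_sql: str, params: tuple) -> None:
    # Only the synchronization process touches the primary instance.
    conn = psycopg2.connect(PRIMARY_DSN)
    with conn, conn.cursor() as cur:
        cur.execute(insert_sql, params)
    conn.close()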


CQRS and NoSQL Databases

Using NoSQL models to fulfill the read responsibility can be an interesting alternative, trading isolation, relationships and atomicity for read and write performance. Since we do not need ACID features in the read models, we can accept that tradeoff for query optimization with more confidence.

Topologically speaking, implementing this model is exactly the same as using SQL on both sides, except that the applications or processes in the domain's context need to know both dialects and handle the translation between them through intermediate processes.

Diagram: CQRS with a NoSQL read model

Imagine that we need to convert all of a client's prescription data to build electronic medical records for the hospital's internal follow-up and management, and also to generate medical prescriptions from appointments and hand them directly to the patient. The two cases are very similar, and we can create a NoSQL query model very close to the response payload of a medical record or prescription.

Diagram: NoSQL medical-record read model

We can use the CQRS pattern to turn this query model into a document-oriented NoSQL model. In this case, all of a prescription's information is grouped into a medical-record or prescription document, similar to an HTTP response that would otherwise be assembled by hand from the returned rows. When write commands are executed in the system, events or messages with all of the record's information need to be produced on the brokers to build the optimized view.

For this example, we will use Elasticsearch as the NoSQL store. In it, we can create a mapping to guarantee the minimal structure of fields and types needed to build a safe view. This mapping can be modeled closely after the incoming event and the response payload, guaranteeing a base format optimized for the expected response of the domain's query API.

// PUT /prontuarios
{
  "mappings": {
    "properties": {
      "id_prescricao": { "type": "integer" },
      "data_prescricao": { "type": "date" },
      "medico": {
        "properties": {
          "id_medico": { "type": "integer" },
          "nome": { "type": "text" },
          "crm": { "type": "text" },
          "especialidade": { "type": "text" }
        }
      },
      "paciente": {
        "properties": {
          "id_paciente": { "type": "integer" },
          "nome": { "type": "text" },
          "data_nascimento": { "type": "date" },
          "endereco": { "type": "text" }
        }
      },
      "medicamentos": {
        "type": "nested",
        "properties": {
          "id_medicamento": { "type": "integer" },
          "nome": { "type": "text" },
          "horario": { "type": "text" },
          "dosagem": { "type": "text" }
        }
      }
    }
  }
}
{
	"acknowledged": true,
	"shards_acknowledged": true,
	"index": "prontuarios"
}

After creating the mapping, we need a process that receives the domain event produced by a write command and transforms it into the established document format. The choice of database technology matters here, since a single event may or may not contain all the data needed to build the read view in full. If this process works with distributed data that arrives, is consolidated and is made available asynchronously and incrementally, the NoSQL model must be able to receive partial increments of each record.
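
As a rough sketch of such a process, for the simple case where a single event carries the whole record, the consumer below builds the document and indexes it using the prescription id as the document id, which keeps reprocessing idempotent. It produces a request equivalent to the one shown next; the endpoint address is a placeholder and plain requests is used instead of any particular Elasticsearch client.

import json
import requests

ES_URL = "http://localhost:9200"  # placeholder address

def index_prescription(raw_event: str) -> None:
    event = json.loads(raw_event)
    doc = {
        "id_prescricao": event["id_prescricao"],
        "data_prescricao": event["data_prescricao"],
        "medico": event["medico"],
        "paciente": event["paciente"],
        "medicamentos": event["medicamentos"],
    }
    # Indexing with an explicit id means reprocessing the same event simply
    # overwrites the same document instead of creating a duplicate.
    response = requests.put(f"{ES_URL}/prescricoes/_doc/{doc['id_prescricao']}", json=doc)
    response.raise_for_status()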

// POST /prescricoes/_doc/1
{
    "id_prescricao": 1,
    "data_prescricao": "2023-05-20T14:30:00.000Z",
    "medico": {
        "id_medico": 1,
        "nome": "Dr. João Silva",
        "especialidade": "Cardiologia",
        "crm": "CRM123123"
    },
    "paciente": {
        "id_paciente": 1,
        "nome": "Maria Oliveira",
        "data_nascimento": "1985-07-10",
        "endereco": "Rua das Flores, 123"
    },
    "medicamentos": [
        {
            "id_medicamento": 1,
            "nome": "Aspirina",
            "horario": "08:00",
            "dosagem": "100mg"
        },
        {
            "id_medicamento": 2,
            "nome": "Paracetamol",
            "horario": "20:00",
            "dosagem": "500mg"
        }
    ]
}
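
For the incremental case mentioned earlier, where the record arrives in pieces, one option on recent Elasticsearch versions is the _update endpoint with doc_as_upsert, which merges a partial fragment into the existing document or creates it if it does not exist yet. The fragment and address below are illustrative placeholders.

import requests

ES_URL = "http://localhost:9200"  # placeholder address

def apply_partial_fragment(prescription_id: int, fragment: dict) -> None:
    # Merge the fragment into the document, creating it if necessary.
    body = {"doc": fragment, "doc_as_upsert": True}
    response = requests.post(f"{ES_URL}/prescricoes/_update/{prescription_id}", json=body)
    response.raise_for_status()

# e.g. a later event that only carries the patient data for prescription 1:
apply_partial_fragment(1, {"paciente": {"id_paciente": 1, "nome": "Maria Oliveira"}})

Note that this merge works field by field on objects, but a whole array field such as medicamentos is replaced rather than appended to, so truly incremental additions to the list would require a scripted update.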

This kind of database model opens up a very wide range of query possibilities. If the key of your model's index or collection is known and maintained by the original write model, the lookup can be done directly by that key, which invariably guarantees optimized retrieval of the data.

// GET /prescricoes/_doc/1
{
	"_index": "prescricoes",
	"_type": "_doc",
	"_id": "1",
	"_version": 1,
	"_seq_no": 0,
	"_primary_term": 1,
	"found": true,
	"_source": {
		"id_prescricao": 1,
		"data_prescricao": "2023-05-20T14:30:00.000Z",
		"medico": {
			"id_medico": 1,
			"nome": "Dr. João Silva",
			"especialidade": "Cardiologia",
			"crm": "CRM123123"
		},
		"paciente": {
			"id_paciente": 1,
			"nome": "Maria Oliveira",
			"data_nascimento": "1985-07-10",
			"endereco": "Rua das Flores, 123"
		},
		"medicamentos": [
			{
				"id_medicamento": 1,
				"nome": "Aspirina",
				"horario": "08:00",
				"dosagem": "100mg"
			},
			{
				"id_medicamento": 2,
				"nome": "Paracetamol",
				"horario": "20:00",
				"dosagem": "500mg"
			}
		]
	}
}


CQRS in Distributed Systems

When applied to distributed, fine-grained systems, the CQRS architecture can offer significant gains in resilience and performance, and make it easier to summarize domain data spread across the contexts of multiple microservices. When we adopt a microservices model in which each type of service gets its own specialized database, it becomes harder to build queries that join and return data from different services. This kind of implementation can offer consolidation approaches to optimize query operations and data replication.

Building optimized views from the data of several services through events and messages can simplify some scenarios, but it equally increases the complexity and granularity of the environment, which can become a tricky topic in the solution architecture. This approach can be somewhat controversial from a purist domain perspective, which limits the separation of command and query to the responsibility of a single domain, but the ability to extend these concepts to deliver consolidated models with information from different domains can be a great addition to your solution-architecture toolbox.

Consolidating events from several command event stores to compose read models out of distributed data

The price of eventual consistency in this kind of scenario tends to grow with the number of event sources that must be handled and summarized. Let's extend the hospital example one more time: we now need a way to retrieve a patient's entire history for auditing, billing and model training. We have services spread across the architecture responsible for the initial triage, the medical prescriptions, the laboratory exams and the imaging exams performed for a given patient. This information needs to be retrieved consolidated per individual medical encounter, but we also need to return the patient's full history over their years of relationship with the hospital.

This is an interesting case for a consolidated view across several domains that expose their data through consolidation topics. We can create listeners both for command events and for response topics that confirm successful execution, building a query model that aggregates the data.

This read model must allow incremental updates and accept continuous eventual consistency, since the time needed to build a record can vary with demand and with the number of data sources. The approach is useful for retrieving data in near real time, even if eventually consistent. It is an alternative to ETL jobs that do this aggregation in scheduled batches, or to patterns such as API Composition, which can compromise service availability due to the tight coupling between the services involved in building the response.
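
A rough sketch of the consolidation side, assuming each upstream domain publishes to its own topic. The topic names, the event fields (id_paciente, id_atendimento) and the in-memory dictionary standing in for the consolidated read store are all hypothetical placeholders; a real implementation would hang this handler on a broker consumer and persist to a database.

import json
from collections import defaultdict

TOPICS = ["triagem", "prescricoes", "exames-laboratoriais", "exames-imagem"]

# Stand-in for the consolidated read model, keyed by patient id.
patient_history = defaultdict(lambda: {"atendimentos": {}})

def handle_event(topic: str, raw_event: str) -> None:
    event = json.loads(raw_event)
    patient_id = event["id_paciente"]
    encounter_id = event["id_atendimento"]
    encounter = patient_history[patient_id]["atendimentos"].setdefault(encounter_id, {})
    # Each source only fills in its own slice; the record becomes complete over time
    # (eventual consistency), and partial views remain readable in the meantime.
    encounter[topic] = event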


The Dual-Write Pattern in the Context of CQRS

When we look coldly at the command side through the lens of distributed-systems resilience, an important consistency question appears: with two steps (persisting to the database and publishing the event), what happens if one of them fails and the other is executed? Imagine that, because of a temporary outage of the message broker, the data is persisted to the database during the command's execution, but publishing the message fails. This scenario leads to a systemic inconsistency in which the state changed by the command is never reflected in the query APIs.

Now consider the opposite scenario, where the message is published as expected but an unexpected failure happens in the database. In that case we have a similar level of inconsistency, in which data that does not exist in the transactional write database is exposed by the query APIs as if the executed command had taken effect.

Both scenarios are problematic for systems that need strong integrity, and there are some design patterns that can help us add more safety to the command and read processes. One of them is Dual-Write.

The Dual-Write pattern applies when we need to confirm the consistency of a piece of data across two distinct, dependent destinations, even if asynchronously. In CQRS it is applied to keep the command and query models in sync. When a command is issued to change the state of the system, it is processed by the command model; this includes validations, business logic and the update of the write database. After the write operation completes successfully, a corresponding event is produced. That event describes the change that happened and must be propagated to the query model. To guarantee that neither step happens in isolation, the pattern seeks to ensure that the data is not changed if the event fails to be published, and that the event is not published if the database write fails, each one guarding the other.

Example of Dual-Write implemented to guarantee both the database write and the event publication

To make this level of reliability possible, all write-database operations must happen inside atomic transactions, where every state modification is part of a single, indivisible activity. This only works effectively on ACID transactional databases that support transactions with commit and rollback. In this case, every command must open a transaction before performing all the necessary modifications. If everything goes as expected, including the event publication, the commit is performed, applying all the operations at once.

Example of failures in processes and integrations being answered with a rollback

If any step of the process fails, a rollback must be initiated, discarding the write operations performed inside the transaction.
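
A minimal sketch of that flow, assuming a PostgreSQL write model accessed through psycopg2 and some broker client with a synchronous publish call; both the broker and the event payload shape are placeholders, and the table comes from the earlier schema.

import psycopg2

def create_prescription(dsn: str, broker, prescription_event: dict) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO Prescricoes (id_medico, id_paciente, data_prescricao) "
                "VALUES (%s, %s, NOW())",
                (prescription_event["medico"]["id_medico"],
                 prescription_event["paciente"]["id_paciente"]),
            )
        # Publish while the transaction is still open: if the broker is unavailable,
        # the exception below triggers a rollback and the write never becomes visible.
        broker.publish("prescricoes", prescription_event)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

Publishing before the commit means a broker outage rolls the write back; the residual risk is a failure between the publish and the commit, which is exactly the window the outbox approach described next tries to close.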


The Outbox Pattern in the Context of CQRS

The Transactional Outbox pattern is an alternative to the Dual-Write pattern, designed to guarantee consistency between the database write and the event publication in distributed systems that use SQL databases. In the context of CQRS, it is extremely important that state-change events are correctly propagated to the query models. Guaranteeing that a state change in the database and the publication of the corresponding event happen atomically means that either both complete successfully or neither is executed.

The Outbox pattern addresses this by storing events in an outbox table inside the same transactional database used to persist the state. When a command is processed and a state change is made, a corresponding event is written to the outbox table in the same database transaction. An asynchronous intermediary service or process periodically reads the events from the outbox table in order, publishes them to the messaging system and then marks them as published or removes them from the table.

Diagram: the outbox message relay

This pattern involves three main components: the outbox table, the publication process (also known as the message relay), and the error handling that is mandatory to make the pattern actually resilient for the purpose it was designed for. The outbox table is a table in the transactional database where events are stored temporarily; it must be written to in the same transaction as the data write to guarantee atomicity. The publication process is an asynchronous service or process that periodically reads events from the outbox table, publishes them to the system's messaging layer and then marks them as published or (preferably) removes them from the table. The error handling must include mechanisms to deal with publication failures, such as retries and monitoring, to guarantee that every event is eventually published and only removed once that publication is confirmed.

Imagine we need to build a view of the prescriptions in the medical record. For that, we will create an outbox table to store the prescription events inside the same transaction that saves the prescriptions in the main write table.

CREATE TABLE IF NOT EXISTS OutboxPrescricaoMedica (
    id SERIAL PRIMARY KEY,
    aggregate_id INT NOT NULL,
    aggregate_type VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN NOT NULL DEFAULT FALSE
);

Now let's look at a SQL transaction that simulates a transactional flow grouping, in the same transaction, the write of the prescribed medications of a medical prescription and the event in the outbox table.

-- Start a SQL transaction
BEGIN;

-- Insert the prescribed medications for prescription id 1
INSERT INTO Prescricao_Medicamentos (id_prescricao, id_medicamento, horario, dosagem)
VALUES  (1, 1, '08:00', '100mg'),
        (1, 2, '20:00', '500mg'),
        (1, 1, '20:00', '500mg');

-- Insert the event into the outbox table
INSERT INTO OutboxPrescricaoMedica (aggregate_id, aggregate_type, event_type, payload)
    VALUES (
        1,
        'Prescricao',
        'PrescricaoCriada',
        jsonb_build_object(
            'id_prescricao', 1,
            'data_prescricao', NOW(),
            'medico', jsonb_build_object(
                'id_medico', 1,
                'nome', 'Dr. João Silva',
                'especialidade', 'Cardiologia',
                'crm', 'CRM123123'
            ),
            'paciente', jsonb_build_object(
                'id_paciente', 1,
                'nome', 'Maria Oliveira',
                'data_nascimento', '1985-07-10',
                'endereco', 'Rua das Flores, 123'
            ),
            'medicamentos', (
                SELECT jsonb_agg(
                    jsonb_build_object(
                        'id_medicamento', Medicamentos.id,
                        'nome', Medicamentos.nome,
                        'horario', Prescricao_Medicamentos.horario,
                        'dosagem', Prescricao_Medicamentos.dosagem
                    )
                )
                FROM Prescricao_Medicamentos
                JOIN Medicamentos ON Medicamentos.id = Prescricao_Medicamentos.id_medicamento
                WHERE Prescricao_Medicamentos.id_prescricao = 1
            )
        )
    );

-- If everything succeeds, commit the transaction
COMMIT;

Compared with Dual-Write, the Outbox implementation takes the mediation of event publication inside the transactional boundary more seriously, betting on the simplicity of the approach to guarantee propagation, even though it depends on an additional process to read and publish the events.
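
That additional process, the relay, can be as simple as the following sketch: it polls the OutboxPrescricaoMedica table defined above, publishes each pending event and only marks it as processed once the publish succeeded. The broker client and connection string are placeholders, and delivery is at-least-once, so consumers should tolerate duplicates.

import time
import psycopg2

def run_outbox_relay(dsn: str, broker, poll_interval: float = 1.0) -> None:
    while True:
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:
                cur.execute(
                    "SELECT id, event_type, payload FROM OutboxPrescricaoMedica "
                    "WHERE processed = FALSE ORDER BY id LIMIT 100"
                )
                for outbox_id, event_type, payload in cur.fetchall():
                    broker.publish(event_type, payload)
                    cur.execute(
                        "UPDATE OutboxPrescricaoMedica SET processed = TRUE WHERE id = %s",
                        (outbox_id,),
                    )
        except Exception:
            # A failed publish rolls back the `processed` updates of this batch,
            # so those events are picked up again on the next cycle.
            pass
        finally:
            conn.close()
        time.sleep(poll_interval)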

Even though it offers some extra degrees of safety around publication guarantees, this approach adds computational cost to the database because of the constant reading and writing on the outbox table, which can become a bottleneck as scale grows. Read and write concurrency can be compromised and more easily hurt the performance of the system as a whole in this kind of transactional scenario, which tends to be the hardest part to scale in large systems.


Images generated by DALL-E




Asking an LLM to build a simple web tool

I've been really enjoying following Simon Willison's blog posts recently. Simon shows other programmers the way LLMs will be used for code assistance in the future, and posts full interactions with LLMs to build small tools or parts of larger applications.

A recent post caught my attention; here Simon got an LLM (Claude 3.5 Sonnet in this case) to build a complete tool that lets one configure/tweak box shadow settings and copy the resulting CSS code for use in a real application. One thing that seemed interesting is that the LLM in this case used some heavyweight dependencies (React + JSX) to implement this; almost 3 MiB of dependencies for something that clearly needs only a few dozen lines of HTML + JS to implement; yikes.

So I've decided to try my own experiment and get an LLM to do this without any dependencies. It turned out to be very easy, because the LLM I used (in this case ChatGPT 4o, but it could really have been any of the top-tier LLMs, I think) opted for the no-dependency approach from the start. I was preparing to ask it to adjust the code to remove dependencies, but this turned out to be unnecessary.

The resulting tool is very similar to Simon's in functionality; it's deployed at https://eliben.org/box-shadow-tool/; here's a screenshot:

Screenshot of box shadow tool

Here are my prompts:

CSS for a slight box shadow, build me a tool that helps me twiddle settings and preview them and copy and paste out the CSS

ChatGPT produced a working tool but it didn't really look good on the page.

Yes, make the tool itself look a bit better with some CSS so it's all centered on the screen and there's enough space for the preview box

It still wasn't quite what I wanted.

the container has to be wider so all the text and sliders fix nicely, and there's still not enough space for the shadows of the preview box to show without overlapping with other elements

Now it was looking better; I wanted a button to copy-paste, like in Simon's demo:

this looks better; now add a nice-looking button at the bottom that copies the resulting css code to the clipboard

The code ChatGPT produced for the clipboard copy operation was flagged by vscode as deprecated, so I asked:

it seems like "document.execCommand('copy')" is deprecated; is there a more accepted way to do this?

The final version can be seen in the online demo (view-source). The complete ChatGPT transcript is available here.

Insights

Overall, this was a positive experience. While a tool like this is very simple to implement manually, doing it with an LLM was even quicker. The results are still not perfect in terms of alignment and space, but they're good enough. At this point one would probably just take over and do the final tweaks manually.

I was pleasantly surprised by how stable the LLM managed to keep its output throughout the interaction; it only modified the parts I asked it to, and the rest of the code remained identical. Stability has been an issue with LLMs (particularly for images), and I'm happy to see it holds well for code (there could be some special tuning or prompt engineering for ChatGPT to make this work well).

What I learned building a $1K MRR SaaS in 6 weeks


The post What I learned building a $1K MRR SaaS in 6 weeks appeared first on Vadim Kravcenko.

Entering text in the terminal is complicated

The other day I asked what folks on Mastodon find confusing about working in the terminal, and one thing that stood out to me was “editing a command you already typed in”.

This really resonated with me: even though entering some text and editing it is a very “basic” task, it took me maybe 15 years of using the terminal every single day to get used to using Ctrl+A to go to the beginning of the line (or Ctrl+E for the end – I think I used Home/End instead).

So let’s talk about why entering text might be hard! I’ll also share a few tips that I wish I’d learned earlier.

it’s very inconsistent between programs

A big part of what makes entering text in the terminal hard is the inconsistency between how different programs handle entering text. For example:

  1. some programs (cat, nc, git commit --interactive, etc) don’t support using arrow keys at all: if you press arrow keys, you’ll just see ^[[D^[[D^[[C^[[C^
  2. many programs (like irb, python3 on a Linux machine and many many more) use the readline library, which gives you a lot of basic functionality (history, arrow keys, etc)
  3. some programs (like /usr/bin/python3 on my Mac) do support very basic features like arrow keys, but not other features like Ctrl+left or reverse searching with Ctrl+R
  4. some programs (like the fish shell or ipython3 or micro or vim) have their own fancy system for accepting input which is totally custom

So there’s a lot of variation! Let’s talk about each of those a little more.

mode 1: the baseline

First, there's "the baseline" – what happens if a program just accepts text by calling fgets() or whatever and doing absolutely nothing else to provide a nicer experience. Here's what using these tools typically looks like for me – if I start the version of dash installed on my machine (a pretty minimal shell) and press the left arrow keys, it just prints ^[[D to the terminal.

$ ls l-^[[D^[[D^[[D

At first it doesn’t seem like all of these “baseline” tools have much in common, but there are actually a few features that you get for free just from your terminal, without the program needing to do anything special at all.

The things you get for free are:

  1. typing in text, obviously
  2. backspace
  3. Ctrl+W, to delete the previous word
  4. Ctrl+U, to delete the whole line
  5. a few other things unrelated to text editing (like Ctrl+C to interrupt the process, Ctrl+Z to suspend, etc)

This is not great, but it means that if you want to delete a word you generally can do it with Ctrl+W instead of pressing backspace 15 times, even if you’re in an environment which is offering you absolutely zero features.

You can get a list of all the ctrl codes that your terminal supports with stty -a.

mode 2: tools that use readline

The next group is tools that use readline! Readline is a GNU library to make entering text more pleasant, and it’s very widely used.

My favourite readline keyboard shortcuts are:

  1. Ctrl+E (or End) to go to the end of the line
  2. Ctrl+A (or Home) to go to the beginning of the line
  3. Ctrl+left/right arrow to go back/forward 1 word
  4. up arrow to go back to the previous command
  5. Ctrl+R to search your history

And you can use Ctrl+W / Ctrl+U from the “baseline” list, though Ctrl+U deletes from the cursor to the beginning of the line instead of deleting the whole line. I think Ctrl+W might also have a slightly different definition of what a “word” is.

There are a lot more (here’s a full list), but those are the only ones that I personally use.

The bash shell is probably the most famous readline user (when you use Ctrl+R to search your history in bash, that feature actually comes from readline), but there are TONS of programs that use it – for example psql, irb, python3, etc.

tip: you can make ANYTHING use readline with rlwrap

One of my absolute favourite things is that if you have a program like nc without readline support, you can just run rlwrap nc to turn it into a program with readline support!

This is incredible and makes a lot of tools that are borderline unusable MUCH more pleasant to use. You can even apparently set up rlwrap to include your own custom autocompletions, though I’ve never tried that.

some reasons tools might not use readline

I think reasons tools might not use readline might include:

  • the program is very simple (like cat or nc) and maybe the maintainers don’t want to bring in a relatively large dependency
  • license reasons, if the program’s license is not GPL-compatible – readline is GPL-licensed, not LGPL
  • only a very small part of the program is interactive, and maybe readline support isn’t seen as important. For example git has a few interactive features (like git add -p), but not very many, and usually you’re just typing a single character like y or n – most of the time you need to really type something significant in git, it’ll drop you into a text editor instead.

For example idris2 says they don’t use readline to keep dependencies minimal and suggest using rlwrap to get better interactive features.

how to know if you’re using readline

The simplest test I can think of is to press Ctrl+R, and if you see:

(reverse-i-search)`':

then you’re probably using readline. This obviously isn’t a guarantee (some other library could use the term reverse-i-search too!), but I don’t know of another system that uses that specific term to refer to searching history.

the readline keybindings come from Emacs

Because I'm a vim user, it took me a very long time to understand where these keybindings come from (why Ctrl+A to go to the beginning of a line??? so weird!)

My understanding is these keybindings actually come from Emacs – Ctrl+A and Ctrl+E do the same thing in Emacs as they do in Readline and I assume the other keyboard shortcuts mostly do as well, though I tried out Ctrl+W and Ctrl+U in Emacs and they don’t do the same thing as they do in the terminal so I guess there are some differences.

There’s some more history of the Readline project here.

mode 3: another input library (like libedit)

On my Mac laptop, /usr/bin/python3 is in a weird middle ground where it supports some readline features (for example the arrow keys), but not the other ones. For example when I press Ctrl+left arrow, it prints out ;5D, like this:

$ python3
>>> import subprocess;5D

Folks on Mastodon helped me figure out that this is because in the default Python install on Mac OS, the Python readline module is actually backed by libedit, which is a similar library which has fewer features, presumably because Readline is GPL licensed.

Here’s how I was eventually able to figure out that Python was using libedit on my system:

$ python3 -c "import readline; print(readline.__doc__)"
Importing this module enables command line editing using libedit readline.

Generally Python uses readline though if you install it on Linux or through Homebrew. It’s just that the specific version that Apple includes on their systems doesn’t have readline. Also Python 3.13 is going to remove the readline dependency in favour of a custom library, so “Python uses readline” won’t be true in the future.

I assume that there are more programs on my Mac that use libedit but I haven’t looked into it.

mode 4: something custom

The last group of programs is programs that have their own custom (and sometimes much fancier!) system for editing text. This includes:

  • most terminal text editors (nano, micro, vim, emacs, etc)
  • some shells (like fish), for example it seems like fish supports Ctrl+Z for undo when typing in a command. Zsh’s line editor is called zle.
  • some REPLs (like ipython), for example IPython uses the prompt_toolkit library instead of readline
  • lots of other programs (like atuin)

Some features you might see are:

  • better autocomplete which is more customized to the tool
  • nicer history management (for example with syntax highlighting) than the default you get from readline
  • more keyboard shortcuts

custom input systems are often readline-inspired

I went looking at how Atuin (a wonderful tool for searching your shell history that I started using recently) handles text input. Looking at the code and some of the discussion around it, their implementation is custom but it’s inspired by readline, which makes sense to me – a lot of users are used to those keybindings, and it’s convenient for them to work even though atuin doesn’t use readline.

prompt_toolkit (the library IPython uses) is similar – it actually supports a lot of options (including vi-like keybindings), but the default is to support the readline-style keybindings.

This is like how you see a lot of programs which support very basic vim keybindings (like j for down and k for up). For example Fastmail supports j and k even though most of its other keybindings don’t have much relationship to vim.

I assume that most “readline-inspired” custom input systems have various subtle incompatibilities with readline, but this doesn’t really bother me at all personally because I’m extremely ignorant of most of readline’s features. I only use maybe 5 keyboard shortcuts, so as long as they support the 5 basic commands I know (which they always do!) I feel pretty comfortable. And usually these custom systems have much better autocomplete than you’d get from just using readline, so generally I prefer them over readline.

lots of shells support vi keybindings

Bash, zsh, and fish all have a “vi mode” for entering text. In a very unscientific poll I ran on Mastodon, 12% of people said they use it, so it seems pretty popular.

Readline also has a “vi mode” (which is how Bash’s support for it works), so by extension lots of other programs have it too.

I’ve always thought that vi mode seems really cool, but for some reason even though I’m a vim user it’s never stuck for me.

understanding what situation you’re in really helps

I’ve spent a lot of my life being confused about why a command line application I was using wasn’t behaving the way I wanted, and it feels good to be able to more or less understand what’s going on.

I think this is roughly my mental flowchart when I’m entering text at a command line prompt:

  1. Do the arrow keys not work? Probably there’s no input system at all, but at least I can use Ctrl+W and Ctrl+U, and I can rlwrap the tool if I want more features.
  2. Does Ctrl+R print reverse-i-search? Probably it’s readline, so I can use all of the readline shortcuts I’m used to, and I know I can get some basic history and press up arrow to get the previous command.
  3. Does Ctrl+R do something else? This is probably some custom input library: it’ll probably act more or less like readline, and I can check the documentation if I really want to know how it works.

Being able to diagnose what’s going on like this makes the command line feel more predictable and less chaotic.

some things this post left out

There are lots more complications related to entering text that we didn’t talk about at all here, like:

  • issues related to ssh / tmux / etc
  • the TERM environment variable
  • how different terminals (gnome terminal, iTerm, xterm, etc) have different kinds of support for copying/pasting text
  • unicode
  • probably a lot more

Further simplifying self-referential types for Rust

In my last post I discussed how we might be able to introduce ergonomic self-referential types (SRTs) to Rust, mostly by introducing features we know we want in some form anyway. The features listed were:

  1. Some form of 'unsafe and 'self lifetimes.
  2. A safe out-pointer notation for Rust (super let / -> super Type).
  3. A way to introduce out-pointers without breaking backwards-compat.
  4. A new Move auto-trait that can be used to mark types as immovable (!Move).
  5. View types which make it possible to safely initialize self-referential types.

That post was received quite well, and I thought the discussion which followed was quite interesting. I learned about a number of things that I think would help refine the design further, and I thought it would be good to write it up.

Not all self-referential types are !Move

Niko Matsakis pointed out that not all self-referential types are necessarily !Move. For example: if the data being referenced is heap-allocated, then the type doesn't actually have to be !Move. When writing protocol parsers, it's actually fairly common to read data into a heap-allocated type first. It seems likely that a fair number of self-referential types don't actually need to be !Move, or rely on any concept of Move at all, to function. Which also means we don't need some form of super let / -> super Type to construct those types in-place.

If we just want to enable self-references for heap-allocated types, then all we need for that is a way to initialize them (view types) and an ability to describe the self-lifetimes ('unsafe as a minimum). That should give us a good idea of what we can prioritize to begin enabling a limited form of self-references.
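
To make this concrete, here is a minimal sketch of a heap-backed self-referential type in today's Rust. The Cat / data / name names are borrowed from the examples later in this post, and a raw pointer stands in for the proposed &'self.data reference; the point is that because the String's buffer lives on the heap, moving the struct only copies its (pointer, length, capacity) triple, so the self-pointer into the buffer stays valid.

struct Cat {
    data: String,
    name: *const str, // stands in for the proposed Option<&'self.data str>
}

impl Cat {
    fn new(data: String) -> Self {
        let mut cat = Cat { data, name: "" as *const str };
        // Point `name` into the heap buffer owned by `data`.
        cat.name = cat.data.split(' ').next().unwrap() as *const str;
        cat // returning moves the struct; the heap buffer does not move
    }
}

fn main() {
    let cat = Cat::new("chashu tuna".to_string());
    let moved = cat; // moving the struct again is fine for the same reason
    // SAFETY: `name` points into `data`'s heap allocation, which never moved.
    println!("{}", unsafe { &*moved.name });
}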

The 'self lifetime is insufficient

Speaking of lifetimes, Mattieum pointed out that 'self likely wasn't going to be enough. 'self points to an entire struct, which ends up being too coarse to be practical. Instead we need to be able to point to individual fields to describe lifetimes.

Apparently Niko has also come up with a feature for this, in the form of lifetimes based on places. Rather than having abstract lifetimes like 'a that we use to link values together, it would be nicer if references just always had implicit, unique lifetime names. With access to that, we can rewrite the motivating example from our last post from being based on 'self:

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'self str>,        // ← Note the `'self` lifetime
}

To instead be based on paths:

struct GiveManyPatsFuture {
    resume_from: GivePatsState,
    first: Option<String>,
    data: Option<&'self.first str>, // ← Note the `'self.first` lifetime
}

This might not seem like it's that important in this example; but once we introduce mutability things quickly spiral. And not introducing a magic 'self in favor of always requiring 'self.field seems like it would generally be better. And that requires having lifetimes which can be based on places, which seems like a great idea regardless.

Automatic referential stability

Earlier in this post we established that we don't actually need to encode !Move for self-referential types which store their values on the heap. That's not all self-referential types - but does describe a fair number of them. Now what if we didn't need to encode !Move for almost the entire remaining set of self-referential types?

If that sounds like move constructors, you'd be right - but with a catch! DoveOfHope noted that, unlike with the Relocate trait I described in my last post, we might not even need a trait for this to work. After all: if the compiler already knows that we're pointing to a field contained within the struct - can't the compiler make sure to update the pointers when we try to move the structure?

I was skeptical about the possibility of this until I read about place-based lifetimes. With that it seems like we would actually have enough granularity to know how to update which fields when they are moved. In terms of cost: that's just updating the value of a pointer on move - which is effectively free. And it would rid us almost entirely of needing to encode !Move.
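
As a rough illustration of what that extra codegen amounts to, here is a hand-written sketch in today's Rust. The PatsFuture struct, its fields, and the explicit raw pointer are all made up for the example; the idea is simply that a move becomes a bitwise copy plus a rebase of the self-pointer onto the new location, which is the fixup the compiler would emit automatically for an &'self.data-style field.

struct PatsFuture {
    data: [u8; 32],
    cursor: *const u8, // stands in for the proposed &'self.data u8
}

unsafe fn move_with_fixup(src: *mut PatsFuture, dst: *mut PatsFuture) {
    // 1. The ordinary bitwise move the compiler performs today.
    std::ptr::copy_nonoverlapping(src, dst, 1);
    // 2. The extra step: rebase the self-pointer so it keeps the same
    //    offset into `data`, but relative to the value's new address.
    let offset = (*dst).cursor as usize - (*src).data.as_ptr() as usize;
    (*dst).cursor = (*dst).data.as_ptr().add(offset);
}

fn main() {
    let mut a = PatsFuture { data: [0; 32], cursor: std::ptr::null() };
    a.cursor = a.data.as_ptr();
    let mut b = PatsFuture { data: [0; 32], cursor: std::ptr::null() };
    unsafe { move_with_fixup(&mut a, &mut b) };
    assert_eq!(b.cursor, b.data.as_ptr()); // the self-pointer now targets the new home
}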

The only cases not covered by this would be 'unsafe references or actual *const T / *mut T pointers to stack data. The compiler doesn't actually know what those point to, and so cannot update them on move. For that some form of Relocate trait actually seems like it would be useful to have. But that's something that wouldn't need to be added straight away either.

Raw pointer operations and automatic referential stability

This section was added after publication, on 2024-07-08.

While it should be possible for the compiler to guarantee that the codegen is correct for e.g. mem::swap, we can't make those same guarantees for raw pointer operations such as ptr::swap. And because existing structures may be freely using those operations internally, that means on-stack self-referential types can't just be made to work without any caveats the way we can for on-heap SRTs. That's indeed a problem, and I want to thank The_8472 for pointing this out.

I was really hoping to be able to avoid extra bounds, so that on-stack SRTs could match the experience of on-heap SRTs. But that doesn't seem possible, so perhaps some minimal set of bounds, which we could flip to being set by default (like Sized) over an edition, might be enough to do the trick here. I'm currently thinking of something like:

  1. Introduce a new auto marker trait Transfer to complement Relocate, as a dual to Rust's Destruct / Drop system. Transfer is the name of the bound people would use; Relocate provides the hooks to extend the Transfer system.
  2. All types with 'self lifetimes automatically implement Transfer.
  3. Only bounds which include + Transfer can take impl Transfer types.
  4. All relevant raw pointer move operations must uphold additional safety invariants of what to do with impl Transfer types.
  5. We gradually update the stdlib to support + Transfer in all bounds.
  6. Over some edition we make this opt-out rather than opt-in (T: Transfer → T: ?Transfer).

auto trait Transfer {}

trait Relocate { ... }

I was really hoping we could avoid something like this. And it does put into question whether this is actually simpler than immovable types. But the_8472 is quite right that this is an issue, and so we need to address it. Luckily we've already done something like this before with const. And I don't think that this is something we can generalize. I'll write more about this at some later date.

Relocate should probably take &own self

Now, even if we don't expect people to need to write their own pointer update logic, basically like ever, it's still something that should be provided. And when we do, we should encode it correctly. Nadrieril very helpfully pointed out that the &mut self bound on the Relocate trait might not actually be what we want - because we're not just borrowing a value - we actually want to destruct it. Instead they informed me about the work done towards &own, which would give access to something called: "owned references".

Daniel Henry-Mantilla is the author of the stackbox crate as well as the main person responsible for the lifetime extension system behind the pin! macro in the stdlib. A while back he shared a very helpful writeup about &own. The core of the idea is that we should decouple the concepts of: "Where is the data concretely stored?" from: "Who logically owns the data?" Resulting in the idea of having a reference which doesn't just provide temporary unique access - but can take permanent unique access. In his post, Daniel helpfully provides the following table:

           Semantics for T                       For the backing allocation
&T         Shared access                         Borrowed
&mut T     Exclusive access                      Borrowed
&own T     Owned access (drop responsibility)    Borrowed

Applying this to our post, we would change the trait Relocate from taking &mut self, which temporarily takes exclusive access to a type but can't actually drop it:

trait Relocate {
    fn relocate(&mut self) -> super Self;
}

To instead take &own, which takes permanent exclusive access to a type, and can actually drop it:

trait Relocate {
    fn relocate(&own self) -> super Self;
}

edit 2024-07-08: This example was added later. To explain what &own solves, let's take a look at the Relocate example impl from our last post. In it we say the following:

We're making one sketchy assumption here: we need to be able to take the owned data from self, without running into the issue where the data can't be moved because it is already borrowed from self.

struct Cat {
    data: String,
    name: &'self str,
}
impl Cat {
    fn new(data: String) -> super Self { ... }
}
impl Relocate for Cat {
    fn relocate(&mut self) -> super Self {
        let mut data = String::new();                   // ← dummy type, does not allocate
        mem::swap(&mut self.data, &mut data);           // ← take owned data
        super let cat = Cat { data };                   // ← construct new instance
        cat.name = cat.data.split(' ').next().unwrap(); // ← create self-ref
        cat                                             // ← return new instance
    }
}

What &own gives us is a way to correctly encode the semantics here. Because the type is not moved we can't actually move by-value. But logically we do still want to claim unique ownership of the value so we can destruct the type and move individual fields. That's kind of the way moving Box by-value works too, but rather than having the allocation on the heap, the allocation can be anywhere. With that we could rewrite the rather sketchy mem::swap code above into more normal-looking destructuring + initialization instead:

struct Cat {
    data: String,
    name: &'self str,
}
impl Cat {
    fn new(data: String) -> super Self { ... }
}
impl Relocate for Cat {
    fn relocate(&own self) -> super Self {
        let Self { data, .. } = self;                   // ← destruct `self`
        super let cat = Cat { data };                   // ← construct new instance
        cat.name = cat.data.split(' ').next().unwrap(); // ← create self-ref
        cat                                             // ← return new instance
    }
}

Now, because this actually does need to construct types in fixed memory locations, this trait would need to have some form of -> super Self syntax. After all: this would be the one place where it would still be needed. For anyone interested in keeping up to date with &own, here is the Rust issue for it (which also happens to have been filed by Niko).

Motivating example, reworked again

With this in mind, we can rework the motivating example from the last post once again. To refresh everyone's memory, this is the high-level async/.await-based Rust code we want to desugar:

async fn give_pats() {
    let data = "chashu tuna".to_string();
    let name = data.split(' ').next().unwrap();
    pat_cat(&name).await;
    println!("patted {name}");
} 

async fn main() {
    give_pats().await;
}

And using the updates in this post we can proceed to desugar it. This time around without the need for any reference to Move or in-place construction, thanks to path-based lifetimes and the compiler automatically preserving referential stability:

enum GivePatsState {
    Created,
    Suspend1,
    Complete,
}

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'self.data str>, // ← Note the `'self.data` lifetime
}
impl GivePatsFuture {   // ← No in-place construction needed
    fn new() -> Self {
        Self {
            resume_from: GivePatsState::Created,
            data: None,
            name: None
        }
    }
}
impl Future for GivePatsFuture {
    type Output = ();
    fn poll(&mut self, cx: &mut Context<'_>)  // ← No `Pin` needed
        -> Poll<Self::Output> { ... }
}

The definition is significantly simpler than what we had before. And even the actual desugaring of the invocation ends up being simpler: we no longer need the intermediate IntoFuture constructor to guarantee in-place construction.

let into_future = GivePatsFuture::new();
let mut future = into_future.into_future(); // ← No `pin!` needed
loop {
    match future.poll(&mut current_context) {
        Poll::Ready(ready) => break ready,
        Poll::Pending => yield Poll::Pending,
    }
}

All that's needed for this is for the compiler to update the addresses of self-pointers on move. That's a little bit of extra codegen whenever a value is moved - rather than just a bitwise copy, it also needs to update pointer values. But it seems quite doable to implement, should actually perform really well, and most importantly: users would rarely if ever need to think about it. Writing &'self.field would always just work.

When immovable types are still needed

I don't want to discount the idea of immovable types entirely though. There are definitely benefits to having types which cannot be moved. Especially when working with FFI structures that require immovability. Or some high-performance data structures which use lots of self-references to stack-based data which would be too expensive to update. Use cases certainly exist, but they're going to be fairly niche. For example: Rust for Linux uses immovable types for their intrusive linked lists - and I think those probably need some form of immovability to actually work.

However, if the compiler doesn't require immovable types to provide self-references, then immovable types suddenly go from being load-bearing to instead becoming something closer to an optimization. It's likely still worth adding them since they certainly are more efficient. But if we do it right, adding immovable types will be backwards-compatible, and would be something we can introduce later as an optimization.

When it comes to whether async {} should return impl Future or impl IntoFuture: I think the answer really should be impl IntoFuture. In the 2024 edition we're changing ranges (0..12) from implementing Iterator to implementing IntoIterator. This matches Swift's behavior, where ranges conform to Sequence rather than IteratorProtocol. I think this is a good indication that async {} and gen {} probably also should return impl Into* traits rather than their respective traits.

Conclusion

I like it when something I write is discussed and I end up picking up on other relevant work. I think to enable self-referential types I'm now definitely favoring a form of built-in pointer updates as part of the language over immovable types (edit: maybe I spoke too soon). However, if we do want immovable types - I think my last post provides a coherent and user-friendly design to get us there.

There are a fairly large number of dependencies if we want to implement a complete story for self-referential types. Luckily we can implement features one at a time, enabling increasingly more expressive forms of self-referential types. In terms of importance: some form of 'unsafe seems like a good starting point. Followed by place-based lifetimes. View types seem useful, but aren't in the critical path since we can work around phased initialization using an option dance. Here's a graph of all features and their dependencies.

A graph showing the various dependencies between language items

Breaking down the features like this has actually further reinforced my perception that this all seems quite doable. 'unsafe doesn't seem like it's that far out. And Niko has sounded somewhat serious about path-based lifetimes and view types. We'll have to see how quickly those will actually end up being developed in practice - but having laid this all out like this, I'm feeling somewhat optimistic!

Why I kept my startup job for seven years (and counting)

Software engineers typically don't stay anywhere for very long. If you're not moving, you're losing out on opportunities1. And yet, I've made the choice to join and stay at one company for seven years. That's more than half my career to date. Why did I do that? And would I do it again?

Why have I stayed so long?

People change companies for a lot of different reasons. The factors I see most often are:

  • To get more money or a promotion
  • For better work conditions
  • Because they're bored
  • To change roles (into or out of management, into product, etc.)
  • To get a better culture or different team.

When I look at why people typically change jobs, it's very clear why I've stayed at this particular company. I don't think I'd have a better job anywhere else.

Here's what we've done to build that department so that I didn't want to leave, and so that few people do leave the company: our turnover has been remarkably low for a software company, especially one hiring very good engineers2.

We paid enough and promoted people actively

I've not set salaries, since my one stint of management still had my direct reports formally reporting to our VP for compensation purposes. But I've played the Salary Game3 with coworkers past and present, and I've had deep discussions with my various bosses about compensation strategy. Unlike many companies, we understood from the early days that if you don't raise salaries as market conditions change, many of them will leave for those other roles4. For me, while the pay isn't what I'd get at a big tech company, it's always been enough that pay wouldn't be a driving factor for me to leave.

The same is true with promotions. It frustrates me to no end that at many companies, the best way to get promoted from mid to senior, from senior to staff, etc. is to change companies. You've built all the knowledge already at your current role, and that knowledge walks out the metaphorical remote-work door when your employee shuts her laptop for the last time. All because another company recognizes that yes, she is a senior software engineer now, and yours didn't.

Obviously this isn't always possible. Sometimes there isn't budget for raises. And sometimes people want a role that we simply don't have available. For example, one of our long-time engineers left when we had no management openings available. We had a going away party, because he was a cherished member of the team, and we genuinely wished him the best. He needed a change, and he got it. He also learned what he was leaving behind (harder to appreciate it without a change), and he's become a stronger engineer for seeing multiple companies.

You can't always satisfy what people need or want, but it's foolish not to try. If you pay people more when market rates rise and promote people when they're on the verge of a new role, they'll stay. If you don't, you'll leak out your best employees.

We have great working conditions

I've worked for companies where the expectation was that you get in before 9 and leave well after 6, staying late regularly to ship things on time. At those companies, I did my own thing, riding on my talent to just walk out the door at 5:30 pm. No one questioned it, because I was that productive and that skilled (in a niche role), and I could set my working conditions without repercussions.

But it really sucks to work on a team where everyone is expected to work late except one person. It's bad for that one person, and it's bad for the whole team. The entire team deserves to work reasonable hours with a workspace that promotes their productivity instead of destroying it.

Our work environment has changed over the years, but we've always shaped it with a mind toward what our engineers need to be productive and content. When we had a physical office space in Manhattan, we had a partition put up to separate the engineering desks from the louder departments, so we could focus. We had focus time rules to build space and time for deep work. And now with remote work (and a team that was mostly hired to work remote from the outset) we've shifted our patterns but still worked deliberately to be mindful of what folks need.

And in 2022, we started doing four day workweeks. We don't do four 10-hour days, we do four 8-hour days. Just a shorter week! We still get at least as much impact done with fewer hours, and we're all happier and less drained as a result.

I haven't gotten bored yet (mostly)

At some previous jobs, things became routine after a while. You end up specializing in one area. For me, that doesn't really work out: I need a constant drip of dopamine from learning new things or I'm unable to make myself work on tasks5.

Not everyone can bounce between different areas, but I've had the luxury of having exposure to a lot of different things here. As a Principal Software Engineer, I oversee technical direction across our company. This means I see aspects of almost everything we do with computers. I've done lots of backend work. I've done some ML work. I've done frontend bug fixes6. I've worked with Salesforce (once, never again) and helped fix people's laptops. I've helped us step up our application security game, and helped our platform scale its capacity by 50x.

I'm not lacking any dopamine at work. One of my favorite things is that I've got a reputation as a great debugger, so I get pulled into the trickiest bugs and the trickiest incidents. It's a lot of fun!

I've learned so much and grown a lot

When I started this job, I knew much less than I do now, despite being a pretty good engineer then. The nature of my roles has led to me being able to have continuous growth during my tenure, and I'm deeply proud of my growth and learning.

I've gotten a lot better at working with people by getting some management experience, and a lot of leadership experience, and growing to understand the difference between the two. Kind coworkers who explain how others think have been tremendously helpful here.

I've learned a lot about web development. My frontend development skills were much weaker before. I've dramatically improved at backend web development (focusing previously mostly on data engineering). I've learned so much about investigating and improving application performance. And my understanding of application security has deepened.

This might have given me unreasonable expectations of what I'll be able to learn in future roles, but I guess that means I'll have to craft those roles myself to foster continuous learning!

My role has changed

I was hired initially as an individual contributor on a team of three engineers. My manager intentionally asked what my goals were (management vs. individual contributor track) and ensured that we found opportunities for me to try things out. I've had the opportunity to change my role a few times.

The major shifts were from senior software engineer to tech lead manager, then tech lead manager to staff engineer (our individual contributor track was established for me), and then eventually from staff engineer to principal engineer.

Those shifts led to doing my first ever management, then learning about leading without authority, and eventually into being a company leader rather than just one for our department. We've fostered role changes in our other engineers, as well. One of our long-time engineers started as a marketing intern, and people have moved into management or into the technical track. And one of our product designers started out in a different department, too! This has been an intentional approach at the company.

There aren't many better cultures (but I'm biased)

Culture is relative, and it's hard to pin down. We've built the culture we have as deliberately as possible. It's characterized by kindness and genuine feedback, by compassion and helping people grow. And it's characterized by a focus on excellence paired with a recognition that we're fallible humans, and that when we make mistakes it's usually not our fault.

As part of team, department, and then company leadership, I've played a strong role in shaping what our culture is. Rather than patting myself on the back, I'd like to demonstrate a few of the ways I've made mistakes and how other leaders at the company used those as teaching moments. These made me a better engineer, employee, and person, ultimately improving the company.

  • I made an awful negotiation mistake, and our CEO taught me how to negotiate better. When we were still under 20 employees, I was negotiating for a raise. I'd asked for one number, then later asked for a different one. I'd done more research but I didn't present it as a change, or have any explanation. He taught me how to handle that situation better, recommended a book, and gave me the higher number I asked for alongside the lesson. Many managers would've stuck to the lower number, and few would have given the lesson. This helped me not have to go look at other companies to get money, which means they kept a great employee longer.
  • I approached collective action poorly. Another time, I helped organize collective action when we were upset about the potential approach to a benefits change. I made a horrible error, though: I was in the room where that approach was previously discussed, then sprung collective action on my leadership team colleagues instead of talking directly to them about my concerns first. The biggest reasons were that I didn't feel welcome to speak up, and also that I truly didn't understand how they would feel about it. Instead of in any way penalizing me, our CTO worked to understand why I approached it that way, then he (and my therapist) helped me understand how to approach it differently next time7. He also made sure that there was space for me in the leadership meeting, allowing me to bloom more as a leader. And the benefits change? We got much of what we asked for.

I haven't so far deleted any production databases, but we've had some "whoopsies" in production and we've handled those by looking at where the system went wrong to let it happen. We do blameless post mortems to understand what happened and where things were able to go off the rails. Then we fix that, rather than blaming individuals. As a neurodivergent person, I'm very glad this approach has extended into mistakes with human interactions, too. We've worked to fix the system instead of penalizing people for not understanding the nuances of how people interact.

As our principal engineer, I've been at the company since we were 11 people and 3 engineers, and I've seen our practices evolve and grow as the company expanded and severely contracted. Our culture has changed, but it has kept this core brightness that is special. Not many places have that spark.

I'd do it again

When I joined this company, I thought it was a short-term thing to get us a mortgage, settle into a house, and then go back to my consulting work. But now? I'm in no hurry to leave. I want to see where we go and help us get there, but most of all, I just love this environment where I've grown and thrived.

If I knew what I know now, I'd do things differently and avoid some mistakes, but I'd join this company again and enjoy it just as much. I've made some friends for life, and I've learned more than I dreamed I could.

My hope for each of you reading this is that you'll find your own company like this. You deserve a team where you have a home base you never want to leave, where you have great working conditions and fair pay and as much (or as little) growth as you want. And if you are a leader? Please make this sort of culture happen.


Thank you to Dan Reich for the helpful feedback on a draft of this post!


1

Well, in the previous economy, anyway. This new one isn't as freely giving.

2

Doesn't everyone say they hire great engineers? It's a cliche. But this is also the best engineering team I've worked on along a few axes. Who we hire is one aspect, and the environment is another, which allows people to do some of their best work.

3

The rules of the salary game are:

  • Both people have to share their salary with the other.
  • Neither of you are allowed to get mad at the other about it.
  • You can use the information but not attribute it in negotiations.

Note that you are allowed to get mad at your employer, just not the other person.

4

It's also possible that other companies do get it, but want to encourage turnover to do things like claw back issued stock options.

5

I often wonder how, in retrospect, it took me until my 30s to be diagnosed with ADHD.

6

One of these was out of pure spite when someone said he didn't think it was really a bug, so I spite-reproduced it then spite-mostly-fixed it.

7

Collective action is a wonderful thing. The error here was more that I was in the room and had the influence to change things directly but didn't use it and didn't talk to my direct peers first.

Standard cells: Looking at individual gates in the Pentium processor

Intel released the powerful Pentium processor in 1993, a chip to "separate the really power-hungry folks from ordinary mortals." The original Pentium was followed by the Pentium Pro, the Pentium II, and others, spawning a long-running brand of high-performance processors, Intel's flagship line until the Core processors took over in 2006. The Pentium eventually became virtually synonymous with "PC" and even made it into pop culture.

Even though the Pentium is a complex chip with 3.3 million transistors, its transistors are visible under a microscope, unlike modern chips. By examining the chip, we can see the interesting circuits used for gates, flip-flops, and other circuits, including the use of an unusual technology called BiCMOS. In this article, I take a close look at the original Pentium chip1, showing how much of its circuitry was built out of structured rows of tiny transistors, a technique known as standard-cell design.

The die photo below shows the Pentium's fingernail-sized silicon die under a microscope. I removed the chip's four metal layers to show the underlying silicon, revealing the individual transistors, which are obscured in most die photos by the layers of metal. Standard-cell circuitry, indicated by red boxes, is recognizable because the circuitry is arranged in uniform columns of cells, giving it a characteristic striped appearance. In contrast, the chip's manually-optimized functional blocks are denser and more structured, giving them a darker appearance. Examples are the caches on the left, the datapaths in the middle, and the microcode ROMs on the right.

Die photo of the Intel Pentium processor with standard cells highlighted in red. The edges of the chip suffered some damage when I removed the metal layers. Click this image (or any other) for a larger version.

Standard-cell design

Early processors in the 1970s were usually designed by manually laying out every transistor individually, fitting transistors together like puzzle pieces to optimize their layout. While this was tedious, it resulted in a highly dense layout. Federico Faggin, designer of the popular Z80 processor, was almost done when he ran into a problem. The last few transistors wouldn't fit, so he had to erase three weeks of work and start over. The closeup of the resulting Z80 layout below shows that each transistor has a different, complex shape, optimized to pack the transistors as tightly as possible.2

A closeup of transistors in the Zilog Z80 processor (1976). This chip is NMOS, not CMOS, which provides more layout flexibility. The metal and polysilicon layers have been removed to expose the underlying silicon. The lighter stripes over active silicon indicate where the polysilicon gates were. I think this photo is from the Visual 6502 project but I'm not sure.

Because manual layout is slow, difficult, and error-prone, people developed automated approaches such as standard-cell.3 The idea behind standard-cell is to create a standard library of blocks (cells) to implement each type of gate, flip-flop, and other low-level component. To use a particular circuit, instead of arranging each transistor, you use the standard design from the library. Each cell has a fixed height but the width varies as needed, so the standard cells can be arranged in rows. The Pentium die photo below shows seven cells in a row. (The rectangular blobs are doped silicon while the long, thin vertical lines are polysilicon.) Compare the orderly arrangement of these transistors with the Z80 transistors above.

Some standard cell circuitry in the Pentium. I removed the metal to show the underlying silicon and polysilicon.

The photo below zooms out to show five rows of standard cells (the dark bands) and the wiring in between. Because CMOS circuitry uses two types of transistors (NMOS and PMOS), each standard-cell row appears as two closely-spaced bands: one of NMOS transistors and one of PMOS transistors. The space between rows is used as a "wiring channel" that holds the wiring between the cells. Power and ground for the circuitry run along the top and bottom of each row.

Some standard cells in the Pentium processor.

The fixed structure of standard cell design makes it suitable for automation, with the layout generated by "automatic place and route" software. The first step, placement, consists of determining an arrangement of cells that minimizes the distance between connected cells. Running long wires between cells wastes space on the die, since you end up with a lot of unnecessary metal wiring. But more importantly, long paths have higher capacitance, slowing down the signals. Once the cells are placed in their positions, the "routing" step generates the wiring to connect the cells. Placement and routing are both difficult optimization problems that are NP-complete.

Intel started using automated place and route techniques for the 386 processor, since it was much faster than manual layout and dramatically reduced the number of errors. Placement was done with a program called Timberwolf, developed by a Berkeley grad student. As one member of the 386 team said, "If management had known that we were using a tool by some grad student as a key part of the methodology, they would never have let us use it." Intel developed custom software for routing, using an iterative heuristic approach. Standard-cell design is still used in current processors, but the software is much more advanced.

A brief overview of CMOS

Before looking at the standard cell circuits in detail, I'll give a quick overview of how CMOS circuits are implemented. Modern processors are built from CMOS circuitry, which uses two types of transistors: NMOS and PMOS. The diagram below shows how an NMOS transistor is constructed. The transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain regions (green) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon. The gate consists of a layer of polysilicon (red), separated from the silicon by a very thin insulating oxide layer. Whenever polysilicon crosses active silicon, a transistor is formed.

Diagram showing the structure of an NMOS transistor.

The NMOS and PMOS transistors are opposite in their construction and operation. A PMOS transistor swaps the N-type and P-type silicon, so it consists of P+ regions in a substrate of N silicon. In operation, an NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low.4 An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high. In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed; the "C" in CMOS indicates this "Complementary" approach. NMOS and PMOS transistors are not entirely symmetrical, however, due to the underlying semiconductor physics. Instead, PMOS transistors need to be larger than NMOS transistors, which helps to distinguish PMOS transistors from NMOS transistors on the die.

The layers of circuitry in the Pentium

The construction of the Pentium is more complicated than the diagram above, with four layers of metal wiring that connect the transistors.5 Starting at the surface of the silicon die, the Pentium's transistors are similar to the diagram, with regions of silicon doped to change their semiconductor properties. Polysilicon wiring is created on top of the silicon. The most important role of the polysilicon is that when it crosses doped silicon, a transistor is formed, with the polysilicon as the gate. However, polysilicon is also used as wiring over short distances.

Above the silicon, four layers of metal connect the components: multiple metal layers allow signals to crisscross the chip without running into each other. The metal layers are numbered M1 through M4, with M1 on the bottom. A few rules control the wiring: a metal layer can connect with the layer above or below through a tungsten plug called a "via". Only the bottom metal, M1, can connect to the silicon or polysilicon, through a "contact". The layers usually alternate between horizontal wiring and vertical wiring (at least locally). Thus, a signal from a transistor may travel through M1, bounce up to M2 and M3 to cross other signals, and then go back down to M1 to connect to another transistor. As you can see, automated place and route software has a complicated task, producing millions of complicated wiring paths as densely as possible.

The diagram below shows how the layers appear on the chip. (This photo shows one of the rare spots on the chip where all the layers are visible.) The M4 metal layer on top of the chip is the thickest, so it is mostly used for power, ground, and clock signals rather than data. An M4 ground wire covers the top of this photo. The next layer down is M3. In this part of the chip, M3 lines run vertically. (Due to optical effects, the vertical M3 lines may look like they are on top of M4, but they are below.) The horizontal M2 metal lines are lower and appear brown rather than golden, due to the oxide layers that cover them. The bottom metal layer is M1. The vertical M1 lines are thick in this part of the chip because they provide power to the circuitry.

The Pentium is constructed with four layers of metal. Because the chip has a three-dimensional structure, I used focus stacking to get a clearer image.

The silicon and polysilicon are mostly obscured in the above photo. By removing all the metal layers, I obtained the image below. This image shows the same region as the image above, but it is hard to see the correlation because the metal layers almost completely obscure the silicon. The orderly columns of transistors reveal the standard-cell design. The irregular dark regions are doped silicon, which forms the chip's transistors. The dark or shiny horizontal bands are polysilicon. I will explain below how these regions form gates and other circuits.

A closeup of the silicon and polysilicon.

Inverter

The fundamental CMOS gate is an inverter, shown in the schematic below. The inverter is built from one PMOS transistor (top) and one NMOS transistor (bottom). If the gate input is a "1", the bottom transistor turns on, pulling the output to ground (0). A "0" input turns on the top transistor, pulling the output high (1). Thus, this two-transistor circuit implements an inverter.10

Schematic diagram of a CMOS inverter.

The diagram below shows two views of how a standard-cell inverter appears on the Pentium die, with and without metal. The inverter consists of two transistors, just like the schematic above. The input is connected to the two polysilicon gates of the transistors. The metal output wire is connected to the two transistors (the left sides, specifically).

A standard-cell CMOS inverter in the Pentium.

In more detail, the image on the left includes the bottom (M1) metal layer, but I removed the other metal layers. Two thick metal lines at the top and bottom provide power and ground to the standard cells. The multiple dark circles are vias between the M1 metal layer and the metal layer on top (M2), providing a path for power and ground that eventually reaches the top (M4) metal layer and then the chip's pins. (The power and ground wires are thick to provide sufficient current to the circuitry while minimizing voltage drops and noise.) The small, lighter circles are contacts that connect the M1 metal layer to the underlying silicon or polysilicon. The input to the gate is provided from the M2 metal, which connects to the M1 layer at the indicated via. The smaller black dots at the top and bottom of this metal strip are contacts, connections to the underlying silicon.

For the image on the right, I removed all four metal layers, revealing the polysilicon and doped silicon. Recall that a transistor is constructed from regions of doped silicon with a stripe of polysilicon between the regions, forming the transistor's gate. The diagram shows the two transistors that form the inverter. When combined with the metal wiring, they form the inverter schematic shown earlier. The final feature is the "well tap". The PMOS transistors are constructed in a "well" of N-doped silicon. The well must be kept at a positive voltage, so periodic "taps" connect the well to the +3.3V supply. As mentioned earlier, the PMOS transistor is larger than the NMOS transistor, which allowed me to figure out the transistor types in the photo.

By the way, the chip is built with a 600 nm process, so the width of the polysilicon lines is approximately 600 nm. For comparison, the wavelength of visible light is 400 to 700 nm, with 600 nm corresponding to orange light. This explains why the microscope photos are somewhat fuzzy; the features are the size of the wavelength of light.6

NAND gate

Another common gate in the Pentium is the NAND gate. The schematic below shows a NAND gate with two PMOS transistors above and two NMOS transistors below. If both inputs are high, the two NMOS transistors turn on, pulling the output low. If either input is low, a PMOS transistor turns on, pulling the output high. (Recall that NMOS and PMOS are opposites: a high voltage turns an NMOS transistor on while a low voltage turns a PMOS transistor on.) Thus, the CMOS circuit below produces the desired output for the NAND function.

Schematic of a CMOS NAND gate.
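
For readers who find it easier to think in code, the pull-up/pull-down logic described above can be written as a tiny behavioral model (a software sketch only, nothing to do with the actual die): the output is pulled low only when both series NMOS transistors conduct, pulled high when either parallel PMOS transistor conducts, and exactly one of the two networks conducts for any input.

// Toy model of the CMOS NAND gate's complementary networks.
fn cmos_nand(a: bool, b: bool) -> bool {
    let pull_down = a && b;        // both series NMOS transistors conduct
    let pull_up = !a || !b;        // either parallel PMOS transistor conducts
    assert!(pull_up != pull_down); // complementary: exactly one network is on
    pull_up                        // the output is high unless both inputs are high
}

fn main() {
    for (a, b) in [(false, false), (false, true), (true, false), (true, true)] {
        println!("{a} NAND {b} = {}", cmos_nand(a, b));
    }
}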

The implementation of the gate as a standard cell, below, follows the schematic. The left photo shows the circuit with one layer of metal (M1). A thick metal line provides 3.3 volts to the gate; it has two contacts that provide power to the two PMOS transistors. The metal line for ground is similar, except only one NMOS transistor is grounded. The thinner metal in the middle has two contacts to get the transistor outputs and a via to connect the output to the M2 metal layer on top. Finally, two tiny bits of M1 metal connect the inputs from the M2 layer to the underlying polysilicon.

Implementation of a CMOS NAND gate as a standard cell.

The right photo shows the circuit with all metal removed, showing the polysilicon and silicon. Since a transistor is formed where a polysilicon line crosses doped silicon, the two polysilicon lines create four transistors. Polysilicon functions both as local wiring and as the transistor gates. In particular, the inputs can be connected at the top or bottom of the circuit (or both), depending on what works best for wiring the circuitry. Note that the transistors are squashed together so the silicon in the middle is part of two transistors. An important asymmetry is that the output is taken from the middle of the PMOS transistors, wiring them in parallel, while the output is taken from the right side of the NMOS transistors, wiring them in series.

Zooming out a bit, the photo below shows three NAND gates. Although the underlying standard cell is the same for each one, there are differences between the gates. At the top, horizontal wiring links the inputs to M2 through vias. The length of each polysilicon line depends on the position of the metal. Moreover, in the middle of each gate, the metal connection to the output is positioned differently. Finally, note that the power wiring shifts upward in the upper right corner; this is to make room for a larger cell to the right. The point is that the standard cells aren't simply copies of each other, but are adjusted in each case to put the inputs, outputs, and power in the right location. Also note that these standard cells are not isolated, but are squeezed together so the PMOS transistors are touching. This optimization slightly increases the density.

Three NAND gates in the Pentium.

OR-NAND gate

The standard cell library includes some complex gates. For instance, the gate below is a 5-input OR-NAND gate, computing ~((A+B+C+D)⋅E). In the NMOS circuit, transistors A through D are paralleled while E is in series. The PMOS circuit is the opposite, with A through D in series and E in parallel. To provide sufficient current, the PMOS circuit has two sets of transistors for A through D, so the PMOS block is much larger than the NMOS block.

The OR-NAND gate as it appears on the die. The left image shows the M1 metal layer while the right image shows the silicon and polysilicon.

Latch

One of the key building blocks of the Pentium's circuitry is the latch. The idea of the latch is to hold one bit, controlled by the clock signal. A latch is "transparent": the latch's input immediately appears on the output while the clock is high. But when the clock is low, the latch holds its previous value. The latch is implemented with a feedback loop that passes the latch's output back into the latch. The heart of this latch circuit is the multiplexer (mux), which selects either the previous output (when the clock is low) or the new input (when the clock is high). The inverters amplify the feedback signal so it doesn't decay in the loop. An inverter also amplifies the output so it can drive other circuitry.

The circuit for a latch.

The circuit for a multiplexer is interesting since it uses "pass transistors". That is, the transistors simply pass their input through to the output, rather than pulling a signal to power or ground as in a typical logic gate. The schematic shows how this works. First, suppose that the select line is low. This will turn on the two transistors connected to the first input, allowing its level to flow to the output. Meanwhile, both transistors connected to the second input will be turned off, blocking that signal. But if the select line is high, everything switches. Now, the two transistors connected to the second input turn on, passing its level to the output. Thus, the multiplexer selects the first input if the control signal is low, and the second input if the control signal is high.

A multiplexer and its implementation in CMOS.

The diagram below shows a multiplexer, part of a latch. On the left, an inverter feeds into one input of the multiplexer.7 On the right is the other input to the multiplexer. The output is taken from the middle, between the pairs of the transistors.

A multiplexer as it appears on the Pentium die.

Note that the multiplexer's circuit is opposite, in a way, to a logic gate. In a logic gate, you want either the NMOS transistor on or the PMOS transistor on, so the output is pulled low or high respectively. This is accomplished by giving the signals on the transistor gates the same polarity, so the same polysilicon line runs through both transistors. In a multiplexer, however, you want the corresponding PMOS and NMOS transistors to turn on at the same time, so they can pass the signal. This requires the signals on the transistor gates to have opposite polarity. One polysilicon line runs through the right PMOS transistor and the left NMOS transistor. The other polysilicon line runs through the left PMOS transistor and the right NMOS transistor, connected by metal wiring (not shown). The multiplexer includes an inverter to provide the necessary signal, but I cropped it out of the diagram above.

The flip-flop

The Pentium makes extensive use of flip-flops. A flip-flop is similar to a latch, except its clock input is edge-sensitive instead of level-sensitive. That is, the flip-flop "remembers" its input at the moment the clock goes from low to high, and provides that value as its output. This difference may seem unimportant, but it turns out to make the flip-flop more useful in counters, state machines, and other clocked circuits.

In the Pentium, a flip-flop is constructed from two latches: a primary latch and a secondary latch. The primary latch passes its value through while the clock is low and holds its value when the clock is high. The output of the primary latch is fed into the secondary latch, which has the opposite clock behavior. The result is that when the clock switches from low to high, the primary latch stops updating its output at the same time that the secondary starts passing this value through, providing the desired flip-flop behavior.
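
As a behavioral sketch of that two-latch structure (software only, not a description of the actual circuit), each latch can be modeled as a level-sensitive storage element, and driving the two latches with opposite clock phases yields an output that only updates on the rising clock edge:

#[derive(Default)]
struct Latch { state: bool }

impl Latch {
    // Transparent while `clock` is high; holds its value while `clock` is low.
    fn update(&mut self, clock: bool, input: bool) -> bool {
        if clock { self.state = input; }
        self.state
    }
}

#[derive(Default)]
struct FlipFlop { primary: Latch, secondary: Latch }

impl FlipFlop {
    // The primary latch is open while the clock is low, the secondary while it
    // is high, so the output only changes on the low-to-high clock edge.
    fn update(&mut self, clock: bool, input: bool) -> bool {
        let mid = self.primary.update(!clock, input);
        self.secondary.update(clock, mid)
    }
}

fn main() {
    let mut ff = FlipFlop::default();
    ff.update(false, true);          // clock low: the primary latch captures the input
    assert!(ff.update(true, false)); // rising edge: that captured value appears at the output
    assert!(ff.update(true, false)); // clock stays high: the output holds
}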

A standard-cell flip-flop.

The photo above shows a standard-cell flip-flop, with an intricate pattern of metal wiring connecting the various sub-components. There are a few variants; with minor logic changes, the flip-flop can have "set" or "reset" inputs, bypassing the clock to force the output to the desired state. (Set and reset functions are useful for initializing flip-flops to a desired value, for example when the processor starts up.)

The BiCMOS buffer

Although I've been discussing CMOS circuits so far, the Pentium was built with BiCMOS, a process that allows circuits to use bipolar transistors in addition to CMOS. By adding a few extra processing steps to the regular CMOS manufacturing process, bipolar (NPN and PNP) transistors can be created. The Pentium made extensive use of BiCMOS circuits since they reduced signal delays by up to 35%. Intel also used BiCMOS for the Pentium Pro, Pentium II, Pentium III, and Xeon processors (but not the Pentium MMX). However, as chip voltages dropped, the benefit from bipolar transistors dropped too and BiCMOS was eventually abandoned.

The schematic below shows a standard-cell BiCMOS buffer in the Pentium chip.8 This circuit is more complex than a CMOS buffer: it uses two inverters, an NPN pull-up transistor, an NMOS pull-down transistor, and a PMOS pull-up transistor.9

Reverse-engineered schematic of the BiCMOS buffer.

In the die images below, note the circular structure of the NPN transistor, very different from the linear structure of the NMOS and PMOS transistors and considerably larger. A sign of the buffer's high-current drive capacity is the output's thick metal wiring, much thicker than the typical signal wiring.

A BiCMOS buffer in the Pentium.

Conclusions

Standard-cell layout is extensively used in modern chips. Modern processors, with their nanometer-scale transistors, are much too small to study under a microscope. The Pentium, on the other hand, has features large enough that its circuits can be observed and reverse engineered. Of course, with 3.3 million transistors, the Pentium is too much for me to reverse engineer in depth, but I still find it interesting to study small-scale circuits and see how they were implemented. This post presented a small sample of the standard cells in the Pentium. The full standard-cell library is much larger, with dozens, if not hundreds, of different cells: many types of logic gates in a variety of sizes and drive strengths. But the fundamental design and layout principles are the same as the cells described here.

One unusual feature of the Pentium is its use of BiCMOS circuitry, which had a peak of popularity in the 1990s, right around the era of the Pentium. Although changing tradeoffs made BiCMOS impractical for digital circuitry, BiCMOS still has an important role in analog ICs, especially high-frequency applications. The Pentium in a sense is a time capsule with its use of BiCMOS.

I hope that you have enjoyed this look at some of the Pentium's circuits. I find it reassuring to see that even complex processors are made up of simple transistor circuits and you can observe and understand these circuits if you look closely.

For more on standard-cell circuits, I wrote about standard cells in an IBM chip and standard cells in the 386 (the 386 article has a lot of overlap with this one). Follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @[email protected].

Notes and references

  1. In this blog post, I'm focusing on the "P54C" version of the original Pentium processor. Intel produced many different versions of the Pentium, and it can be hard to keep them straight. Part of the problem is that "Pentium" is a brand name, with multiple microarchitectures, lines, and products. At a high level, the Pentium (1993) was followed by the Pentium Pro (1995), Pentium II (1997), Pentium III (1999), Pentium 4 (2000), and so on. The original Pentium used the P5 microarchitecture, a superscalar microarchitecture that was advanced but still executed instructions in order like traditional microprocessors. The Pentium Pro was a major jump, implementing a microarchitecture called P6 that broke instructions into micro-operations and executed them out of order using dataflow techniques. The next microarchitecture version was NetBurst, first used with the Pentium 4. NetBurst provided a deep pipeline and introduced hyper-threading, but it was disappointingly slow and was replaced by the Core microarchitecture. The Core microarchitecture is based on the P6 and is Intel's current microarchitecture.

    I'll focus now on the original Pentium, which went through several substantial revisions. The first Pentium product was the 80501 (codenamed P5), running at 60 or 66 MHz and using 5 volts. These chips were built with an 800 nm process and contained 3.1 million transistors.

    The power consumption of these chips was disappointing, so Intel improved the chip, producing the 80502. These chips, codenamed P54C, used 3.3 volts and ran at 75-120 MHz. The chip's architecture remained essentially the same but support was added for multiprocessing, boosting the transistor count to 3.3 million. The P54C had a much more advanced clock circuit, allowing the external bus speed to stay low (50-66 MHz) while the internal clock speed—and thus performance—climbed to 100 MHz. The chips were built with a smaller 600 nm process with four layers of metal, compared to the previous three. Visually, the die of the P54C is almost the same as the P5, with the additional multiprocessing logic at the bottom and the clock circuitry at the top. For this article, I examined the P54C, but the standard cells should be similar in other versions.

    Next, Intel moved to the 350 nm process, producing a smaller, faster Pentium chip, codenamed the P54CS; the die looks almost identical to the P54C (but smaller), with subtle changes to the bond pads. Another variant was designed for mobile use: the Pentium processor with "Voltage Reduction Technology" reduced power consumption by using a 2.9- or 3.1-volt supply for the core and a 3.3-volt supply to drive the I/O pins. These were built first with the 600 nm process (75-100 MHz) and then the 350 nm process (100-150 MHz).

    The biggest change to the original Pentium was the Pentium MMX, with part number 80503 and codename P55C. This chip extended the x86 instruction set with 57 new instructions for vector processing. It was built on a 350 nm process before moving to 280 nm, and had 4.5 million transistors. More obscure variants of the original Pentium include the P54CQS, P54CS, P54LM, P24T, and Tillamook, but I won't get into them. 

  2. Circuits that had a high degree of regularity, such as the arithmetic/logic unit (ALU) or register storage were typically constructed by manually laying out a block to implement the circuitry for one bit and then repeating the block as needed. Because a circuit was repeated 32 times for the 32-bit processor, the additional effort was worthwhile. 

  3. An alternative layout technique is the gate array, which doesn't provide as much flexibility as a standard cell approach. In a gate array (sometimes called a master slice), the chip had a fixed array of transistors (and often resistors). The chip could be customized for a particular application by designing the metal layer to connect the transistors as needed. The density of the chip was usually poor, but gate arrays were much faster to design, so they were advantageous for applications that didn't need high density or produced a relatively small volume of chips. Moreover, manufacturing was much faster because the silicon wafers could be constructed in advance with the transistor array and warehoused. Putting the metal layer on top for a particular application could then be quick. Similar gate arrays used a fixed arrangement of logic gates or flip-flops, rather than transistors. Gate arrays date back to 1967.

  4. The behavior of MOS transistors is complicated, so the description above is simplified, just enough to understand digital circuits. In particular, MOS transistors don't simply switch between "on" and "off" but have states in between. This allows MOS transistors to be used in a wide variety of analog circuits. 

  5. The earliest Pentiums had three layers of metal wiring, but Intel moved to a four-layer process with the P54C die, the version that I'm examining. 

  6. To get this level of magnification with my microscope, I had to use an oil immersion lens. Instead of looking at the chip in air, as with a normal lens, I had to put a drop of special microscope oil on the chip. I carefully lower the lens until it dips into the oil (making sure I don't crash the lens into the chip). The purpose of the oil is that its index of refraction is almost the same as glass, much higher than air. This gives the lens a higher "numerical aperture", allowing the lens to resolve smaller details. 

  7. For completeness, I'll mention that the inverter feeding the multiplexer isn't exactly an inverter. Specifically, the inverter's two transistors are not tied together to produce an output. Instead, the inverter's NMOS transistor provides an input to the multiplexer's NMOS transistor and likewise, the PMOS transistor provides an input to the PMOS transistor. The omission of this connection does not affect the circuit's behavior, but it makes calling the circuit an inverter and a multiplexer a bit of an abstraction. 

  8. Intel called this gate "BiNMOS" rather than "BiCMOS" because it uses a bipolar transistor and an NMOS transistor to drive the output, rather than two bipolar transistors. The Pentium's BiCMOS circuitry is described in a conference paper, showing a second NPN transistor to protect the first one. I don't see the second transistor on the die so the two transistors may be implemented in one silicon structure. Reference: R. F. Krick et al., “A 150 MHz 0.6 µm BiCMOS superscalar microprocessor,” IEEE Journal of Solid-State Circuits, vol. 29, no. 12, Dec. 1994, doi:10.1109/4.340418

  9. The Pentium contains multiple types of BiCMOS standard cells, which I'll show in this footnote. The cell below is an inverter. It is similar to the BiCMOS buffer described earlier, except it lacks the first inverter in the circuit. To make room for the NPN transistor on the left, the PMOS transistors are shifted to the right. As a result, they don't line up with the PMOS transistors in other cells. This is a break from the traditional orderliness of standard cells.

    A BiCMOS inverter with PMOS on the left and NMOS on the right. The input is at the bottom and the output is in the middle.

    The BiCMOS inverter below is similar, except it uses two NPN transistors, providing more output drive. I removed the M1 metal layer to provide a better view of the transistors.

    A BiCMOS inverter with two NPN transistors. The PMOS transistors are in the lower left and the NMOS transistors are in the lower right.

    Another interesting BiCMOS circuit is the D flip-flop with enable and BiCMOS output, shown below. This is similar to the earlier flip-flop except it has an enable input, allowing it to either load a new value triggered by the clock, or to hold its earlier value. This allows the flip-flop to remember a value for more than one clock cycle. The additional functionality is implemented by another multiplexer, selecting either the old value or the new value. (This multiplexer is, in a way, one level higher than the multiplexer in each latch.) The transistor for the BiCMOS output is in the upper right, poking out from under the metal. (This circuit might be implemented as two independent cells, one for the flip-flop and one for the driver; I'm not sure.)

    A D flip-flop in the Pentium.

  10. One puzzling inverter variant is used in a gate I'll call the "slow buffer". This buffer consists of two inverters, so it passes its input through to the output, buffered. The strange part is that the first inverter uses transistors with long gates, which makes these transistors much weaker than regular transistors. As a result, the first inverter will be slow to switch states. My guess is that this circuit is used to delay signals, for example, to keep a signal aligned with another signal that is delayed by multiple logic gates.

    The buffer consists of two inverters. The first inverter uses long, weak transistors.

    You might expect that larger transistors would be stronger, not weaker. The problem is that these transistors are larger in the wrong dimension. If you make the gate wider, the effect is similar to multiple transistors in parallel, providing more current. But if you make the gate longer (as in this case), the effect is similar to multiple transistors in series, so the resistances add and the total current is reduced. In most cases, transistors are constructed with the smallest gate length possible, which is determined by the manufacturing process, so the transistors here are unusual. This chip was manufactured with an 800 nm process, so the smallest gate length is approximately 800 nm. The gate width (the normal direction for variation) varies dramatically depending on the circuit, optimized to provide maximum performance. 

Files.fileExists or file.exists?

How would you design a class that abstracts, say, a file on a disk with certain properties? Let’s say you need to be able to check whether the file exists on the disk or has already been deleted. Would you create an object first and then call the exists() method on it, or would you call Disk.fileExists() first and only then, if TRUE is returned, make an instance of the File class and continue working with it? This may sound like a matter of taste, but it’s not that simple.

Capote (2005) by Bennett Miller

Let’s see how we can check whether a file exists on the disk or not in different programming languages and their SDKs:

Language        How to check if file exists?
JDK 7           new File("a.txt").exists()
JDK 8           Files.exists(Paths.get("a.txt"))
.Net            File.Exists("a.txt")
Node            fs.existsSync('a.txt')
Python          os.path.exists("a.txt")
Python (3.4+)   pathlib.Path("a.txt").exists()
Ruby            File.exist?("a.txt")
Perl            if -e "a.txt"
PHP             file_exists('a.txt')
Smalltalk       (File name: 'a.txt') exists ifTrue: ...

There are basically two different design decisions: either you make a File object first, then ask it for its existence on the disk, or you ask the disk whether the file is there and only after that you make an instance of the File class. Which design is better? Let’s forget for a moment that static methods are evil and imagine that Files is not a utility class, but an abstraction of a disk. How would you design the exists() method if you were the designer of a new SDK for a new programming language?

To answer this question, we must answer a more fundamental one: what is the message an SDK would be sending to a programmer by placing the exists() method either on the File or on the Disk?

This may sound like a trivial and cosmetic issue to an experienced programmer, but let me convince you that it’s not. Consider the design of a list of payment bills in a database. A bill may either be “paid” or “not yet paid,” which a programmer may check through the paid() method. The first design choice is this (it’s Java):

Bill b = bills.get(42)
if (b.paid()) {
  // do something
}

The second choice would be the following:

if (bills.paid(42)) {
  // do something
}

What is the message in the first snippet? I believe it’s the following: “A bill may either be paid or not.” What is the message in the second design option? It’s this: “If a bill exists, it is paid.” In other words, in the first snippet, two qualities of a bill (“I exist” and “I’m paid”) co-exist, while in the second snippet they are merged into one (“I’m paid”).

At the persistence layer, this dichotomy of qualities may mean either a nullable column paid in an SQL-database table or one with the NOT NULL constraint. The first snippet may return a bill object that exists in the database as a row, but the paid column is set to NULL. A programmer who uses your design can easily grasp the idea of the “being paid” status of a bill: it’s not the same as the status of its existence. A programmer must first get the bill and only then check its payment status. A programmer would also expect two points of possible failure—a bill may be absent, or a bill may not be paid—throwing different exceptions or returning different types of results.

As you see, this issue is not cosmetic but very much existential: the design of the methods of a Bill or Bills helps programmers understand on what terms the bills exist.

Now, the answer to the original question about the exists() method of a file is easy to find. Locating a file on a disk is the first task, which checks whether the name of the file is correct and the file may potentially exist on the disk:

// Here, an exception may be raised if,
// for example, the name of the file is
// wrong or simply a NULL.
File f = new File("a.txt");

Then, the existence of the file, at this particular moment, on the disk, is checked:

// Here, an exception may be raised if,
// for example, the disk is not mounted or
// the permissions are not sufficient for
// checking the existence of the file.
boolean e = f.exists();

We may now conclude that the way Python, JS, Ruby, and many others let us check the existence of a file on the disk is wrong. JDK 7 was right, but the inventors of JDK 8 ruined it (most probably for the sake of performance).

By the way, there are many more examples of different “file checking” design decisions in many other programming languages.

Locally patching dependencies in Go

In a previous post I talked about how each Go module is its own self-contained "virtual environment" during development. Among other benefits, this makes the dependencies of a module explicit and simple to tweak.

Locally patching a dependency

To use a concrete example, suppose our module depends on the popular package go-cmp, that lets us deep-compare arbitrary Go values. Say we're debugging an intricate scenario and want to either:

  • Add a log statement inside the dependency to see what our code is passing to it (e.g. "do I ever invoke cmp.Equal with these specific options?")
  • Test a suspicion of a bug in the dependency by temporarily modifying its code and seeing if this has an effect on our module.

The Go module system makes this easy to accomplish; this post will demonstrate several ways of doing this.

Setting up

Let's set up a test module to demonstrate this. The full code can be found on GitHub, or just follow along:

In a directory, run go mod init example.com (the module name is just a placeholder - it's a local experiment, we don't intend it to be imported or even published online). This creates a go.mod file; now, let's write this code:

package main

import (
  "fmt"

  "github.com/google/go-cmp/cmp"
  "github.com/google/go-cmp/cmp/cmpopts"
)

func main() {
  s1 := []int{42, 12, 23, 2}
  s2 := []int{12, 2, 23, 42}

  if cmp.Equal(s1, s2, cmpopts.SortSlices(intLess)) {
    fmt.Println("slices are equal")
  }
}

func intLess(x, y int) bool {
  return x < y
}

And then run go mod tidy; this should get the github.com/google/go-cmp dependency, and the go.mod file will look something like:

module example.com

go 1.22.2

require github.com/google/go-cmp v0.6.0

(your Go version and the dependency version will likely be different, of course)

Now, we'll download the dependency locally and patch it. Clone the https://github.com/google/go-cmp/ repository into a local directory; we'll call it $DEP (on my machine DEP=/home/eliben/test/go-cmp). Next, edit $DEP/cmp/compare.go to add a log statement:

func Equal(x, y interface{}, opts ...Option) bool {
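  // note: this assumes "log" has been added to the imports at the top of compare.go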
  log.Println("options:", opts)
  s := newState(opts)
  s.compareAny(rootStep(x, y))
  return s.result.Equal()
}

If we run our test module now we don't see any effect yet:

$ go run .
slices are equal

This is to be expected! Go has no idea we've cloned the dependency locally and want it to be used in the build process of our test module. This is the next step.

Using a module replace directive

The most basic way to accomplish what we need is using a replace directive in the go.mod file of our test module.

In our module directory, run:

$ go mod edit -replace github.com/google/go-cmp=$DEP

If you look in your go.mod file, you'll see a new replace directive added there, redirecting uses of github.com/google/go-cmp to whatever directory DEP stands for on your machine.
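
For reference, the resulting go.mod would look roughly like this (the replace path is whatever $DEP is on your machine; here I'm using the path from my earlier example):

module example.com

go 1.22.2

require github.com/google/go-cmp v0.6.0

replace github.com/google/go-cmp => /home/eliben/test/go-cmp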

If we now run the test module, it will pick up the patched dependency:

$ go run .
2024/06/29 06:57:17 options: [FilterValues(cmpopts.sliceSorter.filter, Transformer(cmpopts.SortSlices, cmpopts.sliceSorter.sort))]
slices are equal

Using Go workspaces

Go workspaces (go.work files) have been with us since version 1.18; a workspace makes it easier to work with multi-module repositories and large monorepos. It can also be leveraged to implement our use case very easily.

Get back to a clean go.mod file without a replace directive (you can either undo the change using source control, run go mod edit -dropreplace ... or just remove the replace directive from the go.mod file).

Now, run these commands in the test module's directory:

$ go work init
$ go work use . $DEP

This asks the Go tool to:

  1. Initialize an empty workspace in the current directory; a go.work file will be created.
  2. Add use directives to go.work for including the current directory . and the place where we checked out a local version of the dependency ($DEP).

If you look around, a new file was created - go.work; go.mod itself was not modified. If we run the module with go run ., we'll see that the local patch was picked up!
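
The generated go.work file is tiny; with the $DEP path from earlier it would look roughly like this (the go version line will match your toolchain):

go 1.22.2

use (
    .
    /home/eliben/test/go-cmp
)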

I like this approach a bit more than planting replace directives in the go.mod file, since it provides a cleaner separation between temporary patching and the module's actual source code. While go.mod files are checked into source control and provide a critical source of truth for building the module, go.work files aren't typically checked in and are used to set up a convenient local development environment. Using go.work for temporary patching is thus safer - it's more difficult to leave behind a replace directive in the go.mod file and commit it (this can cause all kinds of inconveniences when testing, for example).

Using gohack

gohack is a tool designed especially to address our use case; it predates Go workspaces. Start by installing it:

$ go install github.com/rogpeppe/gohack@latest

Now run:

$ gohack get github.com/google/go-cmp
github.com/google/go-cmp => $HOME/gohack/github.com/google/go-cmp

This invocation does two things:

  1. Fetch the dependency's code and store it somewhere locally. You can control where these are stored by setting the $GOHACK env var; the default is $HOME/gohack.
  2. Add a replace line to our go.mod file to point there.

Since gohack placed the dependency in a new location, we'll have to edit its cmp/compare.go file again to add the log statement. If we go run . in our test module, we'll see the change picked up.

It's also fairly easy to undo changes with the gohack undo command.

Which approach to use?

gohack can be useful in some cases where a quick check is all you need. Since gohack obtains the dependency on its own, it makes it a bit faster to use than cloning manually. That said, I'd be concerned about committing the replace line accidentally, which is why I think the workspace approach is safer (and also more explicit).

Update 2024-07-05: Sean Liao reminded me that go mod vendor is yet another way to accomplish this. This approach comes with its own tradeoffs; read the documentation to learn more.

Reasons to use your shell's job control

Hello! Today someone on Mastodon asked about job control (fg, bg, Ctrl+z, wait, etc). It made me think about how I don’t use my shell’s job control interactively very often: usually I prefer to just open a new terminal tab if I want to run multiple terminal programs, or use tmux if it’s over ssh. But I was curious about whether other people used job control more often than me.

So I asked on Mastodon for reasons people use job control. There were a lot of great responses, and it even made me want to consider using job control a little more!

In this post I’m only going to talk about using job control interactively (not in scripts) – the post is already long enough just talking about interactive use.

what’s job control?

First: what’s job control? Well – in a terminal, your processes can be in one of 3 states:

  1. in the foreground. This is the normal state when you start a process.
  2. in the background. This is what happens when you run some_process &: the process is still running, but you can’t interact with it anymore unless you bring it back to the foreground.
  3. stopped. This is what happens when you start a process and then press Ctrl+Z. This pauses the process: it won’t keep using the CPU, but you can restart it if you want.

“Job control” is a set of commands for seeing which processes are running in a terminal and moving processes between these 3 states.

how to use job control

  • fg brings a process to the foreground. It works on both stopped processes and background processes. For example, if you start a background process with cat < /dev/zero &, you can bring it back to the foreground by running fg
  • bg restarts a stopped process and puts it in the background.
  • Pressing Ctrl+z stops the current foreground process.
  • jobs lists all processes that are active in your terminal
  • kill sends a signal (like SIGKILL) to a job (this is the shell builtin kill, not /bin/kill)
  • disown removes the job from the list of running jobs, so that it doesn’t get killed when you close the terminal
  • wait waits for all background processes to complete. I only use this in scripts though.
  • apparently in bash/zsh you can also just type %2 instead of fg %2

I might have forgotten some other job control commands but I think those are all the ones I’ve ever used.

You can also give fg or bg a specific job to foreground/background. For example if I see this in the output of jobs:

$ jobs
Job Group State   Command
1   3161  running cat < /dev/zero &
2   3264  stopped nvim -w ~/.vimkeys $argv

then I can foreground nvim with fg %2. You can also kill it with kill -9 %2, or just kill %2 if you want to be more gentle.

how is kill %2 implemented?

I was curious about how kill %2 works – does %2 just get replaced with the PID of the relevant process when you run the command, the way environment variables are? Some quick experimentation shows that it isn’t:

$ echo kill %2
kill %2
$ type kill
kill is a function with definition
# Defined in /nix/store/vicfrai6lhnl8xw6azq5dzaizx56gw4m-fish-3.7.0/share/fish/config.fish

So kill is a fish builtin that knows how to interpret %2. Looking at the source code (which is very easy in fish!), it uses jobs -p %2 to expand %2 into a PID, and then runs the regular kill command.

on differences between shells

Job control is implemented by your shell. I use fish, but my sense is that the basics of job control work pretty similarly in bash, fish, and zsh.

There are definitely some shells which don’t have job control at all, but I’ve only used bash/fish/zsh so I don’t know much about that.

Now let’s get into a few reasons people use job control!

reason 1: kill a command that’s not responding to Ctrl+C

I run into processes that don’t respond to Ctrl+C pretty regularly, and it’s always a little annoying – I usually switch terminal tabs to find and kill the process. A bunch of people pointed out that you can do this in a faster way using job control!

How to do this: Press Ctrl+Z, then kill %1 (or the appropriate job number if there’s more than one stopped/background job, which you can get from jobs). You can also kill -9 if it’s really not responding.

reason 2: background a GUI app so it’s not using up a terminal tab

Sometimes I start a GUI program from the command line (for example with wireshark some_file.pcap), forget to start it in the background, and don’t want it eating up my terminal tab.

How to do this:

  • move the GUI program to the background by pressing Ctrl+Z and then running bg.
  • you can also run disown to remove it from the list of jobs, to make sure that the GUI program won’t get closed when you close your terminal tab.

Personally I try to avoid starting GUI programs from the terminal if possible because I don’t like how their stdout pollutes my terminal (on a Mac I use open -a Wireshark instead because I find it works better), but sometimes you don’t have another choice.

reason 2.5: accidentally started a long-running job without tmux

This is basically the same as the GUI app thing – you can move the job to the background and disown it.

I was also curious about if there are ways to redirect a process’s output to a file after it’s already started. A quick search turned up this Linux-only tool which is based on nelhage’s reptyr (which lets you for example move a process that you started outside of tmux to tmux) but I haven’t tried either of those.

reason 3: running a command while using vim

A lot of people mentioned that if they want to quickly test something while editing code in vim or another terminal editor, they like to use Ctrl+Z to stop vim, run the command, and then run fg to go back to their editor.

You can also use this to check the output of a command that you ran before starting vim.

I’ve never gotten in the habit of this, probably because I mostly use a GUI version of vim. I feel like I’d also be likely to switch terminal tabs and end up wondering “wait… where did I put my editor???” and have to go searching for it.

reason 4: preferring interleaved output

A few people said that they prefer to have the output of all of their commands interleaved in the terminal. This really surprised me because I usually think of having the output of lots of different commands interleaved as being a bad thing, but one person said that they like to do this with tcpdump specifically and I think that actually sounds extremely useful. Here’s what it looks like:

# start tcpdump
$ sudo tcpdump -ni any port 1234 &
tcpdump: data link type PKTAP
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type PKTAP (Apple DLT_PKTAP), snapshot length 524288 bytes

# run curl
$ curl google.com:1234
13:13:29.881018 IP 192.168.1.173.49626 > 142.251.41.78.1234: Flags [S], seq 613574185, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2730440518 ecr 0,sackOK,eol], length 0
13:13:30.881963 IP 192.168.1.173.49626 > 142.251.41.78.1234: Flags [S], seq 613574185, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2730441519 ecr 0,sackOK,eol], length 0
13:13:31.882587 IP 192.168.1.173.49626 > 142.251.41.78.1234: Flags [S], seq 613574185, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2730442520 ecr 0,sackOK,eol], length 0
 
# when you're done, kill the tcpdump in the background
$ kill %1 

I think it’s really nice here that you can see the output of tcpdump inline in your terminal – when I’m using tcpdump I’m always switching back and forth and I always get confused trying to match up the timestamps, so keeping everything in one terminal seems like it might be a lot clearer. I’m going to try it.

reason 5: suspend a CPU-hungry program

One person said that sometimes they’re running a very CPU-intensive program, for example converting a video with ffmpeg, and they need to use the CPU for something else, but don’t want to lose the work that ffmpeg already did.

You can do this by pressing Ctrl+Z to pause the process, and then run fg when you want to start it again.

reason 6: you accidentally ran Ctrl+Z

Many people replied that they didn’t use job control intentionally, but that they sometimes accidentally ran Ctrl+Z, which stopped whatever program was running, so they needed to learn how to use fg to bring it back to the foreground.

There were also some mentions of accidentally running Ctrl+S too (which stops your terminal and I think can be undone with Ctrl+Q). My terminal totally ignores Ctrl+S so I guess I’m safe from that one though.

reason 7: already set up a bunch of environment variables

Some folks mentioned that they already set up a bunch of environment variables that they need to run various commands, so it’s easier to use job control to run multiple commands in the same terminal than to redo that work in another tab.

reason 8: it’s your only option

Probably the most obvious reason to use job control to manage multiple processes is “because you have to” – maybe you’re in single-user mode, or on a very restricted computer, or SSH’d into a machine that doesn’t have tmux or screen and you don’t want to create multiple SSH sessions.

reason 9: some people just like it better

Some people also said that they just don’t like using terminal tabs: for instance a few folks mentioned that they prefer to be able to see all of their terminals on the screen at the same time, so they’d rather have 4 terminals on the screen and then use job control if they need to run more than 4 programs.

I learned a few new tricks!

I think my two main takeaways from this post are that I’ll probably try out job control a little more for:

  1. killing processes that don’t respond to Ctrl+C
  2. running tcpdump in the background with whatever network command I’m running, so I can see both of their output in the same place

Testing a WebSocket that could hang open for hours

I recently ran into a bug in some Go code that no one had touched in a few years. The code in question was not particularly complicated, and had been reviewed by multiple people. It included a timeout, and its job was straightforward: accept a WebSocket connection so the client can test that it can open one successfully, and then close it.

The weird thing is that some of these connections were being held open for a long time. There was a timeout of one second, but sometimes these were still open after twelve hours. That's not good!

This bug ended up being instructive in both Go and in how WebSockets work. Let's dive in and see what was going on, then what it tells us!

Comic showing a caterpillar and a butterfly, representing the transformation of HTTP requests into WebSockets.

Identifying the bug

The preliminary investigation found that this was happening for users with a particular VPN. Weird, but not particularly helpful.

After the logs turned up little useful info, I turned to inspecting the code. It was pretty easy to see that the code itself had a bug, in a classic new-to-Go fashion. The trickier thing (for later) was how to reproduce the bug and verify it in a test.

The bug was something like this:

for {
    select {
    case <-ctx.Done():
        // we timed out, so probably log it and quit!
        return

    default:
        _, _, err := conn.ReadMessage()

        if err != nil {
            // ...
        }
    }
}

There are two conspiring factors here: first, we're using a default case in the select, and second, that default case has no read deadline. The default case is run when no other case is ready, which is the case until we time out. The issue is that we won't interrupt this case when the other one becomes ready. And in that case, conn.ReadMessage() will wait until it receives something if no read deadline has been set.

The question then becomes, how do we actually run into this case?

How does this happen?

This is a weird case, because it requires the end client to misbehave. Right before the bugged for loop, the server sent a WebSocket close frame to the client. If you have such a connection open in your browser, then when it receives the close frame it will send one back. This is part of the closing handshake for WebSockets. So if we get nothing back, that means that something went wrong.
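
For context, here's roughly what the server does right before that loop. This is a sketch rather than the actual production code, and it assumes a gorilla/websocket-style connection (which is what the conn.ReadMessage() calls in this post look like); the one-second deadline is just an illustration.

import (
    "time"

    "github.com/gorilla/websocket"
)

// startClosingHandshake sends a close frame to the client. The client is
// then supposed to answer with its own close frame, which is what the
// buggy loop above sits around waiting for.
func startClosingHandshake(conn *websocket.Conn) error {
    deadline := time.Now().Add(1 * time.Second)
    msg := websocket.FormatCloseMessage(websocket.CloseNormalClosure, "")
    return conn.WriteControl(websocket.CloseMessage, msg, deadline)
}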

Taking a step back, let's refresh some details about WebSockets. WebSocket connections are bidirectional, much like TCP connections: the client and the server can send messages and these messages can interleave with each other. In contrast, a regular HTTP connection follows a request-response pattern where the client sends a request and then the server sends a single response1.

But the cool thing is that WebSockets start out life as a regular HTTP request. When you send a WebSocket request, the request starts as something like this2:

GET /websocket/ HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Version: 13

After this request, the server ideally responds saying it'll switch protocols with something like this response:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: HSmrc0sMlYUkAGmm5OPpG2HaGWk=

After that's done, then both ends switch to a different binary protocol that's not related to HTTP. Pretty neat that it starts life as a regular HTTP request!

Now that we have a WebSocket open, the server and client can each send messages. These are either data messages or control messages. Data messages are what we send and receive in our applications and are what you usually see and handle. Control messages are used to terminate the connection or do other operational things, and are usually hidden from the application.

When the connection ends, you're supposed to send a particular control message: a close frame. After receiving it, the other side is supposed to respond with a close frame. And then you can both close the underlying network connection and move on with your lives.

But it turns out that sometimes that doesn't happen! It could be that the client connecting to your server is doing something naughty and didn't send it, leaving you hanging. Or maybe the network was cut and the message didn't get back to you, or maybe the other end of the connection vanished in a blaze of thermite.

Whatever the cause, when this happens, if you're waiting for that close frame you'll be waiting a long time. So now we have to reproduce it in a test.

Leaving the server hanging in a test

Reproducing the bug was a bit tricky since I couldn't use any normal ways of opening a WebSocket. Those implementations all assume you want a correct implementation but oh, no, I want a bad implementation. To do that, you have to roll up your sleeves and do the request by hand on top of TCP.

The test relies on opening a TCP connection, sending the upgrade request, and then just... not responding or sending anything. Then you periodically try to read from the connection. If you get back a particular error code on the read, you know the server has closed the TCP connection. If you don't, then it's still open!

This is what it looks like, roughly. Here I've omitted error checks and closing connections for brevity; this isn't production code, just an example. First, we open our raw TCP connection.

addr := server.Addr().String()
conn, err := net.Dial("tcp", addr)

Then we send our HTTP upgrade request. Go has a nice facility for doing this: we can form an HTTP request and put it onto our TCP connection3.

req, err := http.NewRequest("GET", url, nil)
req.Header.Add("Upgrade", "websocket")
req.Header.Add("Connection", "Upgrade")
req.Header.Add("Sec-WebSocket-Key", "9x3JJHMbDL1EzLkh9GBhXDw==")
req.Header.Add("Sec-WebSocket-Version", "13")

err = req.Write(conn)

We know the server is going to send us back an upgrade response, so let's snag that from the connection. Ideally we'd check that it is an upgrade response but you know, cutting corners for this.

buf := make([]byte, 1024)
_, err = conn.Read(buf)

And then we get to the good part. Here, what we have to do is we just wait and keep checking if the connection is open! The way we do that is we try to read from the connection with a read deadline. If we get io.EOF, then we know that the connection closed. But if we get nothing (or we read data) then we know it's still open.

You don't want your test to run forever, so we set a timeout4 and if we reach that, we say that the test failed: it was held open longer than we expected! But if we get io.EOF before then, then we know it was closed as we hoped. So we'll loop and select from two channels, one which ticks every 250 ms, and the other which finishes after 3 seconds.

ticker := time.NewTicker(250 * time.Millisecond)
timeout := time.After(3 * time.Second)

for {
    select {
        case <-ticker.C:
            conn.SetReadDeadline(time.Now().Add(10 * time.Millisecond))
            buf := make([]byte, 1)
            _, err = conn.Read(buf)

            if err == io.EOF {
                // connection is closed, huzzah! we can return, success
                return
            }

        case <-timeout:
            // if we get here, we know that the connection didn't close.
            // we have a bug, how sad!
            assert.Fail(t, "whoops, we timed out!")
            return
    }
}

Resolving the bug

To resolve the bug, you have two options: you can set a read deadline, or you can run the reads in a goroutine which sends a result back when you're done.

Setting a read deadline is straightforward, as seen above. You can use it and then you'll be happy, because the connection can't hang forever on a read! The problem is, in the library we were using, conn.SetReadDeadline sets a deadline on the underlying network connection, and if a read then times out, the whole WebSocket is corrupt and future reads will fail.
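
As a rough sketch (not the code that shipped), applying that first option to the buggy loop from earlier would look something like this, with an arbitrary per-read deadline:

for {
    select {
    case <-ctx.Done():
        // we timed out, so probably log it and quit!
        return

    default:
        // bound each read so it can't block forever waiting for the
        // client's close frame
        conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))

        _, _, err := conn.ReadMessage()
        if err != nil {
            // a timed-out read lands here; as noted above, in the
            // library we used this leaves the WebSocket unusable
            return
        }
    }
}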

So instead, we do it as a concurrent task. This would look something like this:

waitClosed := make(chan error)
go func() {
    _, _, err := conn.ReadMessage()
    if err != nil {
        // ...
    }

    waitClosed <- err
}()

timeout := time.After(3 * time.Second)

for {
    select {
    case <-timeout:
        // we timed out, so close the connection and quit!
        conn.Close()
        return

    case <-waitClosed:
        // success! nothing needed here
        return
    }
}

It looks like it will leak resources, because won't that goroutine stay open even if we hit the timeout? The key is that when we hit the timeout we close the underlying network connection. This will cause the read to finish (with an error) and then that goroutine will also terminate.


It turns out, there are a lot of places for bugs to hide in WebSockets code and other network code. And with existing code, a bug like this which isn't causing any obvious problems can lurk for years before someone stumbles across it. That's doubly true if the code was trying to do the right thing but had a bug that's easy to miss if you're not very familiar with Go.

Debugging things like this is a joy, and always leads to learning more about what's going on. Every bug is an opportunity to learn more.


Thanks to Erika Rowland and Dan Reich for providing feedback on a draft of this post.


1

There are other ways that HTTP requests can work, such as with server-sent events. And a single connection can send multiple resources. But the classic single-request single-response is a good mental model for HTTP most of the time.

2

This example is from the WebSockets article on Wikipedia.

3

I wanted to do this in Rust (my default choice) but found this part of it much easier in Go. I'd still like to write a tool that checks WebSockets for this behavior (and other naughty things), so I might dig in some more with Rust later.

4

The first time I wrote this test, I had the timeout inline in the case, which resulted in never timing out, because it was created fresh every loop.
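
In other words, the broken version of the loop looked roughly like this (my reconstruction, not the original test code): each iteration builds a brand-new 3-second timer, so the 250 ms ticker always wins the select and the timeout case never fires.

for {
    select {
    case <-ticker.C:
        // ... try the read, as in the working version above ...

    case <-time.After(3 * time.Second):
        // BUG: time.After creates a fresh timer on every loop iteration,
        // so this case is never ready before the ticker fires again
        assert.Fail(t, "whoops, we timed out!")
        return
    }
}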

Ergonomic Self-Referential Types for Rust

I've been thinking a little about self-referential types recently, and while it's technically possible to write them today using Pin (limits apply), they're not at all convenient. So what would it take to make it convenient? Well, as far as I can tell there are four components involved in making it work:

  1. The ability to write 'self lifetimes.
  2. The ability to construct types from functions in fixed memory locations.
  3. A way to mark types as "immovable" in the type system.
  4. The ability to safely initialize self-references in structs without going through an option-dance.

It's only once we have all four of these components that writing self-referential types can become accessible to most regular Rust programmers. And that seems important, because as we've seen with async {} and Future: once you start writing sufficiently complex state machines, being able to track references into data becomes incredibly useful.

Speaking of async and Future: in this post we'll be using that as a motivating example for how these features can work together. Because if it seems realistic that we can make a case as complex as that work, other simpler cases should probably work too.

Oh and before we dive in, I want to give a massive shout-out to Eric Holk. We've spent several hours working through the type-system implications of !Move together, and worked through a number of edge cases and issues. I can't be solely credited for the ideas in this post. However any mistakes in this post are mine, and I’m not claiming to speak for the both of us.

Disclaimer: This post is not a fully-formed design. It is an early exploration of how several features could work together to solve a broader problem. My goal is primarily to narrow down the design space to a tangible list of features which can be progressively implemented, and share it with the broader Rust community for feedback. I'm not on the lang team, nor do I speak for the lang team.

Motivating example

Let's take async {} and Future as our examples here. When we borrow local variables in an async {} block across .await points, the resulting state machine will store both the concrete value and a reference to that value in the same state machine struct. That state machine is what we call self-referential, because it has a reference which points to something in self. And because references are pointers to concrete memory addresses, there are challenges around ensuring they are never invalidated as that would result in undefined behavior. Let's look at an example async function:

async fn give_pats() {
    let data = "chashu tuna".to_string();       // ← Owned value declared
    let name = data.split(' ').next().unwrap(); // ← Obtain a reference
    pat_cat(&name).await;                       // ← `.await` point here
    println!("patted {name}");                  // ← Reference used here
} 

async fn main() {
    give_pats().await; // Calls the `give_pats` function.
}

This is a pretty simple program, but the idea should come across well enough: we declare an owned value in-line, we call an .await function, and later on we reference the owned value again. This keeps a reference live across an .await point, and that requires self-referential types. We can desugar this to a future state machine like so:

enum GivePatsState {
    Created,       // ← Marks our future has been created
    Suspend1,      // ← Marks the first `.await` point
    Complete,      // ← Marks the future is now done
}

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&str>,  // ← Note the lack of a lifetime here
}

impl GivePatsFuture {
    fn new() -> Self {
        Self {
            resume_from: GivePatsState::Created,
            data: None,
            name: None,
        }
    }
}

impl Future for GivePatsFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> { ... }
}

The lifetime of GivePatsFuture::name is unknown, mainly because we can't name it. And because the desugaring happens in the compiler, it doesn't have to name the lifetime either. We'll talk more about that later in this post. Because this generates a self-referential state machine, this future will need to be fixed in-place using Pin first. Once pinned, the Future::poll method can be called in a loop until the future yields Ready. The desugaring for that will look something like this:

let mut future = IntoFuture::into_future(GivePatsFuture::new());
let mut pinned = unsafe { Pin::new_unchecked(&mut future) };
loop {
    match pinned.poll(&mut current_context) {
        Poll::Ready(ready) => break ready,
        Poll::Pending => yield Poll::Pending,
    }
}

And finally, just for reference, here is what the traits we're using look like today. The main bit that's interesting here for the purpose of this post is that Future takes a Pin<&mut Self>, which we'll be explaining how it can be replaced with a simpler system throughout the remainder of this post.

pub trait IntoFuture {
    type Output;
    type IntoFuture: Future<Output = Self::Output>;
    fn into_future(self) -> Self::IntoFuture;
}

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

Now that we've taken a look at how self-referential futures are desugared by the compiler today, let's take a look at how we can incrementally replace it with a safe, user-constructible system.

Self-referential lifetimes

In our motivating example we showed the GivePatsFuture which has the name field that points at the data field. It's a clearly a reference, but it does not carry any lifetime:

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&str>, // ← Note the lack of a lifetime
}

The reason this doesn't have a lifetime is not inherent, it's because we can't actually name the lifetime here. It's not 'static because it isn't valid for the remainder of the program. In the compiler today I believe we can just omit the lifetime because the codegen happens after the lifetimes have already been checked. But say we wanted to write this by hand today as-is; we would need a concept for an "unchecked lifetime", something like this:

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'unsafe str>, // ← Unchecked lifetime
}

Being able to write a lifetime that isn't checked by the compiler would be a first step toward enabling writing self-referential structs by hand. It would just require a lot of unsafe and anxiety to get right. But at least it would be possible. I believe folks on T-compiler are already working on adding this, which seems like a great idea.

But even better would be if we could describe checked lifetimes here. What we actually want to write here is a lifetime which is valid for the duration of the value - and it would always be guaranteed to be valid. Adding this lifetime would come with additional constraints we'll get into later in this post (e.g. the type wouldn't be able to move), but what we really want is to be able to write something like this:

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'self str>, // ← Valid for the duration of `Self`
}

To sidetrack slightly; adding a named 'self lifetime could also allow us to remove the where Self: 'a boilerplate when using lifetimes in generic associated types. If you've ever worked with lifetimes in associated types, then you'll likely have run into the "missing required bounds" error. Niko suggested in this issue using 'self as one possible solution for the bounds boilerplate. I think its meaning would be slightly different than when used with self-references? But I don't believe using this would be ambiguous either. And all in all I think it looks rather neat:

// A lending iterator trait as we have to write it today:
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Self::Item<'_>;
}

// A lending iterator trait if we had `'self`:
trait LendingIterator {
    type Item<'self>;
    fn next(&mut self) -> Self::Item<'_>;
}

Constructing types in-place

In order for 'self to be valid, we have to promise our value won't move in memory. And before we can promise a value won't move, we have to first construct it somewhere we can be sure it can stay and not be moved further. Looking at our motivating example, the way we achieve this using Pin is a little goofy. Here's that same example again:

impl GivePatsFuture {
    fn new() -> Self {
        Self {
            resume_from: GivePatsState::Created,
            data: None,
            name: None,
        }
    }
}

async fn main() {
    let mut future = IntoFuture::into_future(GivePatsFuture::new());
    let mut pinned = unsafe { Pin::new_unchecked(&mut future) };
    loop {
        match pinned.poll(&mut current_context) {
            Poll::Ready(ready) => break ready,
            Poll::Pending => yield Poll::Pending,
        }
    }
}

In this example we're seeing GivePatsFuture being constructed inside of the new function, be moved out of that, and only then pinned in-place using Pin::new_unchecked. Even if GivePatsFuture: !Unpin, the Unpin trait only affects types once they are held inside of a Pin structure. And we can't just return Pin from new, because the function's stack frames are discarded the moment the function returns.

It would be better if we enabled types to describe how they can construct themselves in-place: no more external Pin::new_unchecked calls, but exclusively internally provided constructors. This makes self-referential types entirely self-contained, with the internal constructor replacing the external pin dance. Here's how we could rewrite GivePatsFuture::new to use an internal constructor instead:

impl GivePatsFuture {
    fn new(slot: Pin<&mut MaybeUninit<Self>>) {
        let slot = unsafe { slot.get_unchecked_mut() };
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { 
           addr_of_mut!((*this).resume_from).write(GivePatsState::Created);
           addr_of_mut!((*this).data).write(None);
           addr_of_mut!((*this).name).write(None);
        };
    }
}

If you don't like this: that's understandable. I don't think anyone does. But bear with me; we're on a little didactic sausage-making journey here together. I'm sorry about the sights; let's quickly move on.

In a recent blog post I posited we might be able to think of parameters like these as "spicy return". James Munns pointed out that in C++ this feature has a name: out-pointers. And Jack Huey made an interesting connection between this and the super let design. Just so we don't have to look at a pile of unsafe code again, let's pretend we can combine these into something coherent:

impl GivePatsFuture {
    fn new() -> super Pin<&'super mut Self> {
        pin!(Self { // just pretend this works
            resume_from: GivePatsState::Created,
            data: None,
            name: None,
        })
    }
}

I know, I know - I'm showing syntax here. I'm personally meh about the way it looks and we'll talk more later about how we can improve this, but I hope we can all agree that the function body itself is approximately 400% more legible than the pile of unsafe we were working with earlier. We'll also get to how we can entirely remove Pin from the signature, so please don't get too hung up on that either.

I'm asking you to play along here for a second, and pretend we might be able to do something like this for now, so we can get to the part later on in this post where we can actually fix it. To maybe state this more clearly: this post is less about proposing concrete designs for a problem, and more about how we can tease apart the problem of "immovable types" into separate features we can tackle independently from one another.

Converting into immovable types

Alright, so we have an idea of how we might be able to construct immovable types in-place, provided as a constructor defined on a type. Now while that's nice, we've also lost an important property: whenever GivePatsFuture is constructed, it needs to have a fixed place in memory, whereas before we could freely move it around until we started .awaiting it.

One of the main reasons why async is useful is because it enables ad-hoc concurrent execution. That means we want to be able to take futures and pass them to concurrency operations to enable concurrency through composition. We can't move futures which have a fixed location in memory, so we need a brief moment where futures can be moved before they're ready to be kept in place and polled to completion.

The way Pin works with this today is that a type can be !Unpin - but that only becomes relevant once it's placed inside of a Pin structure. With futures that typically doesn't happen until it begins being polled, usually via .await, and so we get the liberty of moving !Unpin futures around until we start .awaiting them. That's why !Unpin doesn't mark: "A type which cannot be moved", it marks: "A type which cannot be moved once it has been pinned". This is definitely confusing, so don't worry if it's hard to follow.

fn foo<T: Unpin>(t: &mut T);   // Type is not pinned, type can be moved.
fn foo<T: !Unpin>(t: &mut T);  // Type is not pinned, type can be moved.
fn foo<T: Unpin>(t: Pin<&mut T>);  // Type is pinned, type can be moved.
fn foo<T: !Unpin>(t: Pin<&mut T>); // Type is pinned, type can't be moved.

If we want "immovability" to be unconditionally part of a type, we can't make it behave the same way Unpin does. Instead it seems better to separate the movable / immovable requirements into two separate types. We first construct a type which can be freely moved around - and once we're ready to drive it to completion, we convert it to a type which is immovable and we begin calling that. This maps perfectly to the separation between IntoFuture and Future we already use.

Let's take a look at our first example again, but modify it slightly. What I'm proposing here is that rather than give_pats returning an impl Future, it should instead return an impl IntoFuture. This type is not pinned, and can be freely moved around. It's only once we're ready to .await it that we call .into_future to obtain the immovable future - and then we call that.

struct GivePatsFuture { ... } 
impl GivePatsFuture {
    fn new() -> super Pin<&'super mut Self> { ... } // suspend belief pls
}

struct GivePatsIntoFuture;
impl IntoFuture for GivePatsIntoFuture {
    type Output = ();
    type IntoFuture = GivePatsFuture;

    // We call the `Future::new` constructor which gives us a
    // `Pin<&'super GivePatsFuture>`, and then rather than writing
    // it into the current function's stack frame we write it in the
    // caller's stack frame.
    //
    // (keep belief suspended a little longer)
    fn into_future(self) -> super Pin<&'super mut GivePatsFuture> {
        GivePatsFuture::new() // create in caller's scope
    }
}

Just like we can keep returning values from functions to pass them further up the call stack, so should we be able to use out-pointers / emplacement / spicy returns to allocate in a stack frame further up the call stack. Though even if we didn't support that out of the gate, we could probably in-line GivePatsFuture::new into GivePatsIntoFuture::into_future and things would still work. And with that, our .await desugaring could then look something like this:

async fn main() {
    let into_future: GivePatsIntoFuture = give_pats();
    let mut future: Pin<&mut GivePatsFuture> = into_future.into_future();
    loop {
        match future.poll(&mut current_context) {
            Poll::Ready(ready) => break ready,
            Poll::Pending => yield Poll::Pending,
        }
    }
}

To reiterate why this section exists: we can get the same functionality Pin + Unpin provide today by creating two separate types: one which can be freely moved around, and another which, once constructed, will not move locations in memory.

So far the only framing of "immovable types" I've seen is a single type which has both these properties - just like Unpin does today. What I'm trying to articulate here is that we can avoid that issue if we choose to create two types instead, enabling one to construct the other, with each providing separate guarantees. I think that's a novel insight, and one I thought was important to spend some time on.

Immovable types

Alright, I've been asking folks to suspend belief that we can in fact perform in-place construction of a Pin<&mut Self> type somehow and that would all work out the way we want it to. I'm not sure myself, but for the sake of the narrative of this post it was easier if we just pretended we could for a second.

The real solution here, of course, is to get rid of Pin entirely. Instead, types themselves should be able to communicate whether they have a stable memory location or not. The simplest formulation for this would be to add a new built-in auto-trait, Move, which tells the compiler whether a type can be moved or not.

auto trait Move {}

This is of course not a new idea: we've known about the possibility of Move since at least 2017 - before I started working on Rust. There were some staunch advocates for Move in the Rust community, but ultimately that wasn't the design we ended up going with. I think in hindsight most of us will acknowledge that the downsides of Pin are real enough that revisiting Move and working through its limitations seems like a good idea 1. To explain what the Move trait is: it would be a language-level trait which governs access to the following capabilities:

1

For anyone looking to assign blame here or pull skeletons out of closets: please don't. The higher-order bit here is that we have Pin today, it clearly doesn't work as well as was hoped at the time, and we'd like to replace it with something better. I think the most interesting thing to explore here is how we can move forward and do better.

  • The ability to be passed by-value into functions and types.
  • The ability to be passed by mutable reference to mem::swap, mem::take, and mem::replace.
  • The ability to be used with any syntactic equivalents to the earlier points, such as assigning to mutable references, closure captures, and so on.

Conversely, when a type implements !Move it would not have access to any of these capabilities - making it so it cannot be moved once it has a fixed memory location. And by default we would assume in all bounds that types are Move, except for places that explicitly opt out by using + ?Move. Here are examples of things that constitute moves:

// # examples of moving

// ## swapping two values
let mut x = new_thing();
let mut y = new_thing();
swap(&mut x, &mut y);

// ## passing by value
fn foo<T>(x: T) {}
let x = new_thing();
foo(x);

// ## returning a value
fn make_value() -> Foo {
    Foo {
        x: 42
    }
}

// ## `move` closure captures
let x = new_thing();
thread::spawn(move || {
    let x = x;
});

And here are some things that do not constitute moves:

// # things that are not moves

// ## passing a reference
fn take_ref<T>(x: &T) {}
let x = new_thing();
take_ref(&x);

// ## passing mutable references is also okay,
//    but you have to be careful how you use it
fn take_mut_ref<T>(x: &mut T) {}
let mut x = new_thing();
take_mut_ref(&mut x);

Passing types by-value will never be compatible with !Move types because that’s what a move is. Passing types by-reference will always be compatible with !Move types because they are immutable 2. The only place with some ambiguity is when we work with mutable references, as things like mem::swap allow us to violate the immovability guarantees.

2

Yes yes, we'll get to interior mutability in a second here.

If a function wants to take a mutable reference which may be immovable, they will have to add + ?Move to it. If a function does not use + ?Move on their mutable reference, then a !Move type cannot be passed to it. In practice this will work as follows:

fn meow<T>(cat: T);            // by-value,   can't pass `!Move` values
fn meow<T>(cat: &T);           // by-ref,     can pass `!Move` values
fn meow<T>(cat: &mut T);       // by-mut-ref, can't pass `!Move` values
fn meow<T: ?Move>(cat: &mut T); // by-mut-ref, can pass `!Move` values

By default all cat: &mut T bounds would imply + Move. And only where we opt-in to + ?Move could !Move types be passed. In practice it seems likely most places will probably be fine adding + ?Move, since it's far more common to write to a field of a mutable reference than it is to replace it whole-sale using mem::swap. Things like interior mutability are probably also largely fine under these rules, since even if accesses go through shared references, updating the values in the pointers will have to interact with the earlier rules we've set out - and those are safe by default.

To be entirely accurate we also have to consider interior mutability. That allows us to mutate values through shared references - but only by conditionally converting them to &mut references at runtime. Just because we allow casting &T to &mut T at runtime doesn't mean that the rules we've applied to the system stop working. Say we held a T: !Move inside of a Mutex. If we tried to call the deref_mut method on the guard, we'd get a compile error because that bound hasn't yet declared T: ?Move. We could probably add that, but because it doesn't work by default we'd have an opportunity to validate its soundness before doing so.

Anyway, that's enough theory about how this should probably work for now. Let's try and update our earlier example, replacing Pin with !Move. That should be as simple as adding a !Move impl on GivePatsFuture.

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'self str>,
}
impl !Move for GivePatsFuture {}

And once we have that, we can change our constructors to return super Self instead of super Pin<&'super mut Self>. We already know that emplacement using something like super Self (not actual notation) to write to fixed memory locations seems plausible. All we then need to do is add an auto-trait which tells the type-system that further move operations aren't allowed.

struct GivePatsFuture { ... } 
impl !Move for GivePatsFuture {}
impl GivePatsFuture {
    fn new() -> super Self { ... } // create in caller's scope
}

struct GivePatsIntoFuture;
impl IntoFuture for GivePatsIntoFuture {
    type Output = ();
    type IntoFuture = GivePatsFuture;
    fn into_future(self) -> super GivePatsFuture {
        GivePatsFuture::new() // create in caller's scope
    }
}

I probably should have said this sooner, but I'll say it now: in this post I'm intentionally not bothering with backwards-compat. The point, again, is to break the complicated design space of "immovable types" into smaller problems we can tackle one-by-one. Figuring out how to bridge Pin and !Move is something we will want to figure out at some point - but not now.

As far as async {} and Future are concerned: this should work! This allows us to freely move around async blocks which desugar into IntoFuture. And only once we're ready to start polling them do we call into_future to obtain an impl Future + !Move. A system like that is equivalent to the existing Pin system, but does not need Pin in its signature. For good measure, here's how we would be able to rewrite the signature of Future with this change:

// The current `Future` trait
// using `Pin<&mut Self>`
pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

// The `Future` trait leveraging `Move`
// using `&mut self`
pub trait Future {
    type Output;
    fn poll(&mut self, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

That would also mean: no more Pin-projections. No more incompatibilities with Drop. Because it's an auto-trait that governs language behavior, as long as the base rules are sound, the interaction with all other parts of Rust would be sound.

Also, most relevant for me probably, this would make it possible to write future state machines using methods and functions, rather than the current status quo where we just lump everything into the poll function body. After having written a ridiculous amount of futures by hand over the past six years, I can't tell you how much I'd love to be able to do that.

Motivating example, reworked

Now that we've covered self-referential lifetimes and in-place construction, established that async {} should return IntoFuture, and introduced !Move, we're ready to bring these features together and rework our motivating example. This is what we started with, using regular async/.await code:

async fn give_pats() {
    let data = "chashu tuna".to_string();
    let name = data.split(' ').next().unwrap();
    pat_cat(&name).await;
    println!("patted {name}");
} 

async fn main() {
    give_pats().await;
}

And here is what, with these new capabilities, async fn give_pats would be able to desugar to. Note the 'self lifetime, the !Move impl for the future, the omission of Pin everywhere, and the in-place construction of the type.

enum GivePatsState {
    Created,
    Suspend1,
    Complete,
}

struct GivePatsFuture {
    resume_from: GivePatsState,
    data: Option<String>,
    name: Option<&'self str>,        // ← Note the `'self` lifetime
}
impl !Move for GivePatsFuture {}     // ← This type is immovable
impl Future for GivePatsFuture {
    type Output = ();
    fn poll(&mut self, cx: &mut Context<'_>)  // ← No `Pin` needed
        -> Poll<Self::Output> { ... }
}

struct IntoGivePatsFuture {}
impl IntoFuture for IntoGivePatsFuture {
    type Output = ();
    type IntoFuture = GivePatsFuture;
    fn into_future(self)
        -> super GivePatsFuture {  // ← Writes to a stable addr
        GivePatsFuture {
            resume_from: GivePatsState::Created,
            data: None,
            name: None,
        }
    }
}

And finally, we can then desugar the give_pats().await call to concrete types we construct and call to completion:

let into_future = IntoGivePatsFuture {};
let mut future = into_future.into_future(); // ← Immovable without `Pin`
loop {
    match future.poll(&mut current_context) {
        Poll::Ready(ready) => break ready,
        Poll::Pending => yield Poll::Pending,
    }
}

And with that, we should have a working example of async {} blocks, desugared to concrete types and traits that don't use Pin anywhere at all. Accessing fields within it wouldn't go through any kind of pin projection, and there would no longer be any need for things like stack-pinning. Immovability would just be a property of the types themselves, constructed when we need them in the place where we want to use them.

Oh and I guess just to mention it: functions working with these traits would always want to use T: IntoFuture rather than T: Future. That's not a big change, and actually something people should already be doing today. But I figured I'd mention it in case people are confused about what the bounds should be for concurrency operations.

Phased initialization

We didn't show this in our example, but there is one more aspect to self-referential types worth covering: phased initialization. This is when you initialize parts of a type at separate points in time. In our motivating example we didn't have to use that, because the self-references lived inside of an Option. That means that when we initialized the type we could just pass None, and things were fine. However, say we did want to initialize a self-reference, how would we go about that?

struct Cat {
    data: String,
    name: &'self str,
}
impl !Move for Cat {}

impl Cat {
    fn new(data: String) -> super Self {
        Cat {
            data,
            name: /* How do we reference `self.data` here? */
        }
    }
}

Now of course, because a String stores its contents on the heap, the address of that data is actually stable, and so we could write something like this:

struct Cat {
    data: String,
    name: &'self str,
}
impl !Move for Cat {}

impl Cat {
    fn new(data: String) -> super Self {
        Cat {
            name: data.split(' ').next().unwrap(),
            data, 
        }
    }
}

That's clearly cheating, and not what we want people to have to do. But it does point us at how the solution here should probably work: we first need a stable address to point to. And once we have that address, we can refer to it. We can't do that if we have to build the entire thing in a single go. But what if we could do it in multiple phases? That's what Niko's recent post on borrow checking and view types went into. That would allow us to change our example to instead be written like this:

struct Cat {
    data: String,
    name: &'self str,
}
impl !Move for Cat {}

impl Cat {
    fn new(data: String) -> super Self {
        super let this = Cat { data };                             // ← partial init
        this.name = this.data.split(' ').next().unwrap();         // ← finish init
        this
    }
}

We initialize the owned data in Cat first. And once we have that, we can then initialize the references to it. These references would be 'self, we sprinkle in a super let annotation to indicate we're placing this in the caller's scope, and everything should subsequently check out.

Migrating from Pin to Move

What we didn't cover in this post is any migration story from the existing Pin-based APIs to the new Move-based system. If we want to move off of Pin in favor of Move, the only plausible path I see is minting new traits that don't carry Pin in their signatures, and providing bridging impls from the old traits to the new traits. A basic conversion with an explicit method could look like this, though blanket impls could also be a possibility:

pub trait NewFuture {
    type Output;
    fn poll(&mut self, ...) { ... }
}

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, ...) { ... }

    /// Convert this future into a `NewFuture`.
    fn into_new_future(self: Pin<&mut Self>) -> NewFutureWrapper<'_, Self> { ... }
}

/// A wrapper bridging the old future trait to the new future trait.
struct NewFutureWrapper<'a, F: Future>(Pin<&'a mut F>);
impl !Move for NewFutureWrapper {}
impl<'a, F> NewFuture for NewFutureWrapper<'a, F> { ... }

I've been repeating this line for at least three years now: if we want to fix the problems with Pin, the first step we need to take is to not make the problem worse. If the stdlib needs to fix the Future trait once, that sucks, but it's fine and we'll find a way to do it. But if we tie Pin up into a number of other traits, the problems will compound and I'm no longer sure we can rid ourselves of Pin. And that's a problem, because Pin is broadly disliked and we actively want to get rid of it.

Compatibility with self-referential types is not only relevant for iteration; it's a generalized property which ends up interacting with nearly every trait, function, and language feature. Move simply composes with any other trait, so there's no need for a special PinnedRead or anything. A type would instead just implement Read + !Move, and that would be enough for a self-referential reader to function. And we can repeat that for any other combination of traits.

In-place construction of course does change the signature of traits. But in order to support that in a backwards-compatible way, all we'd need to do is enable traits to opt-in to "can perform in-place construction". And being able to gradually roll-out capabilities like that is exactly why we're working on effect generics.

pub trait IntoFuture {
    type Output;
    type IntoFuture: Future<Output = Self::Output>;

    // Marked as compatible with in-place construction, with
    // implementations being able to decide whether they want
    // to use it or not.
    fn into_future(self) -> #[maybe(super)] Self::IntoFuture;
}

If we want self-referential types to be generally useful, they need to compose in practice with most other features we have. And so really, the first step to getting there is to stop stabilizing any new traits in the stdlib which use Pin in their signatures.

Making immovable types movable

So far we’ve been talking a lot about self-referential types and how we need to make sure they cannot be moved, because moving them would be bad. But what if we did allow them to be moved? In C++ this is possible using a feature called "move constructors", and if we supported self-referential types in Rust, it doesn't seem like a big leap to support that too.

Before we go any further I want to preface this: I've heard from people who have worked with move constructors in C++ that they can be rather tricky to work with. I haven't worked with them, so I can't speak from experience. Personally I don't really have any uses where I feel like I would have wanted move constructors, so I'm not particularly in favor or against supporting them. I'm writing this section mostly out of academic interest, because I know there will be people wondering about this. And the rules for how this should work seem fairly straightforward.

Niko Matsakis recently wrote a two-parter (first, second) proposing a new Claim trait to fill the gap between Clone and Copy. This trait would be for types which are "cheap" to clone, such as the Arc and Rc types. And using autoclaim, the compiler would automatically insert calls to .claim as needed. For example, when a move || closure captures a type implementing Claim that is already in use somewhere else, the compiler would automatically call .claim so the code compiles.

Enabling immovable types to relocate would work much the same as auto-claiming would. We would need to introduce a new trait, which we'll call Relocate here, with a method relocate. Whenever we tried to move an otherwise immovable value, we would automatically call .relocate instead. The signature of the Relocate trait would take self as a mutable reference. And return an instance of Self, constructed in-place:

trait Relocate {
    fn relocate(&mut self) -> super Self;
}

Note the signature of self here: we take it by mutable reference - not owned, not shared. That is because what we're writing is effectively the immovable equivalent of Into, but we can't take self by-value - so we have to take it by-reference instead and tell people to just mem::swap away. Applying this to our earlier Cat example, we would be able to implement it as follows:

struct Cat {
    data: String,
    name: &'self str,
}
impl Cat {
    fn new(data: String) -> super Self { ... }
}
impl Relocate for Cat {
    fn relocate(&mut self) -> super Self {
        let mut data = String::new(); // dummy type, does not allocate
        mem::swap(&mut self.data, &mut data); // take owned data
        super let cat = Cat { data }; // construct new instance
        cat.name = cat.data.split(' ').next().unwrap(); // create self-ref
        cat
    }
}

We're making one sketchy assumption here: we need to be able to take the owned data out of self, without running into the issue where the data can't be moved because it is still borrowed by self. This is a general problem we need to solve. One way we could work around it is by adding dummy fields to the struct, so that the self-references can be re-pointed at them and remain valid while we take the owned data out:

struct Cat {
    data: String,
    dummy_data: String, // placeholder; never holds real data
    name: &'self str,
}

impl Relocate for Cat {
    fn relocate(&mut self) -> super Self {
        self.name = &self.dummy_data; // no more references to `self.data`
        let data = mem::take(&mut self.data); // shorter than `mem::swap`
        super let cat = Cat { data };
        cat.name = cat.data.split(' ').next().unwrap();
        cat
    }
}

In this example the Cat would implement Move even though it has a 'self lifetime, because we can freely move it around. When a value has been relocated via Relocate, its old location should not have its Drop impl run, because semantically we're not dropping the value - all we're doing is updating its location in memory. Under these rules, access to 'self in structs would be available if either Self: !Move or Self: Relocate.

I want to again emphasize that I'm not directly advocating here for the introduction of move constructors to Rust. Personally I'm pretty neutral about them, and I can be convinced either way. I mainly wanted to have at least once walked through the way move constructors could work, because it seems like a good idea to know that a gradual path here should be possible. Hopefully that point is coming across okay here.

Further reading

The Pin RFC is an interesting read as it describes the system for immovable types we ended up going with today. Specifically, the comparison between Pin and Move and the section on drawbacks are interesting to read back up on - especially when we compare them with the pin docs and see what was not present in the RFC but later turned out to be a major practical issue (e.g. pin projections, interactions with the rest of the language).

Tmandry presented an interesting series (blog 1, blog 2, talk) on async internals. Specifically he covers how async {} blocks desugar into Future-based state machines. This post uses that desugaring as its motivating example, so for those keen to learn more about what happens behind the scenes, this is an excellent resource. The section on .await desugaring in the Rust reference is also a good read, as it captures the status quo in the compiler.

More recently, Miguel Young de la Sota's (mcyoung) talk and crate supporting C++ move constructors are interesting to read up on. Something I haven't fully processed yet, but find interesting, is the New trait it exposes. This can be used to construct types in-place on both the stack and the heap, which ideally something like the super let/super Type notation could support too. You can think of C++'s move constructors as a further evolution of immovable types, so it's no surprise that there are lots of shared concepts.

Two years ago I tried formulating a way we could leverage view types for safe pin projection (post), and I came up short in a few regards. In particular I wasn't sure how to deal with the interactions with #[repr(packed)], how to make Drop compatible, and how to mark Unpin as unsafe. There might be a path for the latter, but I'm not aware of any practical solutions for the first two issues. This post is basically a sequel to that post, but changing the premise from: "How can we fix Pin?" to "How can we replace Pin?".

Niko's series on view types is also well worth reading. His first post discusses what view types are, how they'd work, and why they're useful. And in one of his most recent posts he discusses how view types fall into a broader "4-part roadmap for the borrow checker" (aka: "the borrow checker within"). In his last post he directly covers phased initialization using view types as well, which is one of the features we discuss in this post in relation to self-referential types.

Finally I'd suggest looking at the ouroboros crate. It enables safe, phased initialization for self-referential types on stable Rust by leveraging macros and closures. The way it works is that fields using owned data are initialized first. And then the closures are executed to initialize fields referencing the data. Phased initialization using view types as described in this post emulates that approach, but enables it directly from the language through a generally useful feature.

Conclusion

In this post we've deconstructed "self-referential types" into five constituent parts:

  1. 'unsafe (unchecked) and 'self (checked) lifetimes which will make it possible to express self-referential lifetimes.
  2. A moral equivalent to super let / -> super Type to safely support out-pointers.
  3. A way to backwards-compatibly add optional -> super Type notations.
  4. A new Move auto-trait which governs access to move operations.
  5. A view types feature which will make it possible to construct self-referential types without going through an Option dance.

The final insight this post provides is that today's Pin + Unpin system can be emulated with Move by creating Move wrappers which can return !Move types. In the context of async, the pattern would be to construct an impl IntoFuture + Move wrapper, which constructs an impl Future + !Move future in-place via an out-pointer.

People generally dislike Pin, and as far as I can tell there is broad support for exploring alternative solutions such as Move. Right now the only trait that uses Pin in the stdlib is Future. In order to facilitate a migration off of Pin to something like Move, we would do well not to further introduce any Pin-based APIs to the stdlib. Migrating off of a single API will take effort, but seems ultimately doable. Migrating off of a host of APIs will take more effort, and makes it more likely we'll forever be plagued with the difficulties of Pin.

The purpose of this post has been to untangle the big scary problem of "immovable types" into its constituent parts so we can begin tackling them one-by-one. None of the syntax or semantics in this post is meant to be concrete or final. I mainly wanted to have at least once walked through everything required to make immovable types work - so that others can dig in, think along, and we can begin refining concrete details.

Summary of reading: April - June 2024

  • "River of the Gods: Genius, Courage, and Betrayal in the Search for the Source of the Nile" by Candice Millard - while the book is readable, I found it disappointing because it focuses much more on the personalities involved and their numerous feuds than on the actual exploration, geography and history of the region. The scientific parts of the book would comfortably fit into a couple of pages - the rest is fluff.
  • "We have no idea" by J. Cham and D. Whiteson - a whimsical and introductory look at the state of modern physics - what we know, and more importantly - what we don't know. The book is pretty good, but is full of potty humor and endless childish puns which I found distracting. It also adds comics drawings on almost every page for the entertainment factor, without adding information. I suppose these are good to try to lure kids to read it.
  • "Quantum computing explained - for beginners" by Pantheon Space Academy - this is certainly the worst book I've read in the last year or more. I honestly don't understand how it managed to get this many positive reviews on Amazon, unless it's some sort of mistake - or scam. The book is a tedious, repetitive jumble of ever-changing, shallow analogies trying to explain concepts from quantum computing. Except that it doesn't really explain anything; it mostly tries to teach the jargon - the kind of understanding that makes you seem knowledgeable at cocktail parties. There's not a single equation in this book, not a single algorithm or circuit, not a single snippet of code. I still think it's some sort of scam. It can't even be AI generated, because any modern LLM will give you much more useful information in your first 10 minutes of interaction than this book contains in its entirety.
  • "The Lions of Al-Rassan" by Guy Gavriel Kay - not my usual fare! A dip into the world of historic fantasy. The plot is loosely based on the period of the Spanish reconquista, but everything is changed - the names, the religions, even the celestial bodies (two moons); some magic is also mixed in, though in very mild amounts. Good, high-paced story in most places, though I found the protagonists to be too idealized and some of the dialogues long and tedious.
  • "Boyd: The Fighter Pilot Who Changed the Art of War" by Robert Coram - the biography of John Boyd, an air force fighter pilot (roughly between the Korean War and the Vietnam War) who later made significant contributions to aircraft design (particularly the F-15 and F-16) and to the theory of conducting military operations. Very interesting book overall, with insights about how large organizations work.
  • "Slow Productivity" by Cal Newport - a reshuffle of the author's ideas from "Deep work", served as "how to work less and achieve more" advice. Way too general, IMHO, and thus of limited usefulness. The book is short, at least, but the author apparently struggled to fill it up because it contains so many barely-relevant detours. I really find cherry-picked stories about notable people unconvincing because of the huge selection bias inherent in them. The author even admits this in one scenario towards the end of the book ("I knew so many professors who took a sabbatical to focus on something new but never accomplished anything"). Unfortunately, among Newport's books I've read so far, this is certainly my least favorite.
  • "Interplanetary Robots" by Rod Pyle - a good overview of the robotic space missions sent by humanity up until the Perseverance rover (the book was written before it launched). Good writing and a decent amount of technical detail.
  • "Quantum: Einstein, Bohr and the Great Debate about the nature of reality" by Manjit Kumar - a historical account of the development of quantum mechanics, focusing on its philosophical meaning and the opposing views held by Einstein and Bohr. I liked this book - it describes the history and scientists involved in it well and doesn't shy away from some actual physics.
  • "Quantum Computing for Everyone" by Chris Bernhardt - a short and effective introduction to QC. This is a very good book! Starting with only some basic familiarity with linear algebra concepts, the book develops the fundamentals of QC methodically and at just the right pace. I wish it was longer, and also wish the author did fewer simplifications - for example, hadn't omitted the use of complex numbers. The author clearly has a talent for explaining technical, math-heavy material; in fact, this is one of the things that impressed me most about this book and is something I intend to learn from. The book's language is very terse, logically organized and yet simple! It's like reading the Simple English version of Wikipedia; no nonsense or useless detours, no embellishments; few and well-placed analogies. A fantastic example for anyone writing such material.
  • "Mr. Popper's Penguins" by Richard and Florence Atwater - a somewhat silly but good-natured children's book about a small-town house painter raising a family of penguins. This is a rare case where I think I liked the movie version more, even though the plot is very different from the book.
  • "The Anxious Generation: How the Great Rewiring of Childhood Is Causing an Epidemic of Mental Illness" by Jonathan Haidt - the author reports on the sharp decline in mental health of teens and pre-teens (especially girls) since 2010, and provides compelling evidence that a combination of smartphones and social media is the most likely cause. The book is fairly troubling, and certainly thought-provoking. The tangential observations on the challenges faced by Gen Z folks are also insightful.
  • "Something Deeply Hidden: Quantum Worlds and the Emergence of Spacetime" by Sean Carroll - the book's main goal is to explain the many-worlds theory of quantum mechanics. It does so reasonably well, but otherwise contains a lot of other information that seems to be poorly organized and only loosely related to this main goal. I'd say that the first third of the book was insightful, and the rest so so.
  • "Die With Zero" by Bill Perkins - it's hard to read FI/RE-related discussions these days without running into mentions of "Die With Zero", so I decided to give this book a try. In a sign of just how affluent modern society has become, the author preaches not over-saving, and instead using one's savings earlier in life when one still has the physical and mental capacity to enjoy the experiences money can buy. It's an interesting premise, but IMHO the book is mainly of inspirational value - as opposed to practical value. The author tries to develop the idea of "peak savings" but it could certainly use more work, with more details.

Re-reads:

  • "The Working Poor: Invisible in America" by David K. Shipler
  • "The Magic of Reality" by Richard Dawkins
  • "A Philosophy of Software Design" by John Ousterhout
  • "All Quiet on the Western Front" by Erich Maria Remarque

My recipe for a tasty, practical lunch

🍳 My foolproof recipe for a tasty, practical lunch, for today's #almocodedomingo, is made in the air fryer.

Leave it preheating while you gather and season the ingredients, then put together:
- chopped vegetables (seasoned with soy sauce, butter, or olive oil)
- a steak (with salt and black pepper)
- a handful of frozen french fries.

Put everything in together, count 7 minutes, then turn the air fryer off without opening it and wait 2 more minutes.

🍽️ Just serve!

If you use a small aluminum-foil tray, cleanup gets even easier! (but it's less sustainable)

The article "My recipe for a tasty, practical lunch" ("Minha receita de almoço gostoso e prático") was originally published on the TRILUX website, by Augusto Campos.

Trading Spotify for a pen drive

A nice thing I did this week was take an old 4GB pen drive that was lying around the house, format it, load it with music I already owned, and take it to the car - so I can go back to listening to music while driving, but now without depending on connectivity or streaming.

The pen drive is just like the one in the photo, has several years of mileage on it, and my idea is to use it to go back to listening to albums in full, and in order.

I started with 2 compilations (Madonna and The Clash), and for the second week I opened a poll on Mastodon with 4 candidates to pick the next 2 albums to listen to. The winner (by a wide margin) was "The Offspring - Greatest Hits" - but I'll also load up the runner-up, "Paramore - Riot!".

By the way, did you know about searching for "vtwin88cube" on torrent sites?

It can be faster than ripping the original physical CDs you already have in your collection, because it returns literally hundreds of well-seeded torrents with compilations and discographies, mostly from artists that have appeared on the US rock charts.

Also read the previous post: Reducing monthly spending on online service subscriptions by 78%, 2024 edition.

The article "Trading Spotify for a pen drive" ("Trocando o Spotify por um pen drive") was originally published on the TRILUX website, by Augusto Campos.

Physics and perception.

At one point in 2019, several parts of Stripe’s engineering organization were going through a polite civil war. The conflict was driven by one group’s belief that Java should replace Ruby. Java would, they posited, address the ongoing challenge of delivering a quality platform in the face of both a rapidly growing business and a rapidly growing engineering organization. The other group believed Stripe’s problems were driven by a product domain with high essential complexity and numerous, demanding external partners ranging from users to financial institutions to governments; switching programming languages wouldn’t address any of those issues. I co-wrote the internal version of Magnitudes of exploration in an attempt to find a useful framework for navigating that debate, but nonetheless the two groups struggled to make much progress in understanding one another.

I was reminded of those discussions while reading the "Innovation versus Shipping: The Cairo Project" chapter of Steven Sinofsky's Hardcore Software:

Landing on my desk early in 1993 was the first of many drafts of Cairo plans and documents. Cairo took the maturity of the NT product process—heavy on documentation and architectural planning—and amped it up. Like a well-oiled machine, the Cairo team was in short order producing reams of documents assembled into three-inch binders detailing all the initiatives of the product. Whenever I would meet with people from Cairo, they would exude confidence in planning and their processes. … While any observer should have rightfully taken the abundance of documentation and confidence of the team as a positive sign, the lack of working code and ever-expanding product definition seemed to set off some minor alarms, especially with the Apps person in me. While the Cairo product had the appearance of the NT project in documentation, it seemed to lack the daily rigorous builds, ongoing performance and benchmarking, and quality and compatibility testing. There was a more insidious dynamic, and one that would prove a caution to many future products across the company but operating systems in particular.

The simple narrative regarding both the Cairo development and Java migration is that there’s a group doing the “right” thing, and another group doing the “wrong” thing. The Cairo team was shipping vaporware. The Java team was incorrectly diagnosing the underlying problems. These sorts of descriptions are comforting because they create the familiar narrative structure of “good” in conflict with “evil.” Unfortunately, I’ve never found these sorts of narratives very useful for understanding what causes a conflict, and they’re worse than useless at actually resolving conflicts.

What I have found useful is studying what each faction knows that the other doesn't, and trying to understand those gaps deeply enough to find a solution. Sometimes I summarize this as "solving for both physics and perception."

Solving for perception

Sinofsky’s represents Cairo as an impossibly broad project that didn’t ship, but he also explains why it picked up so many features:

Cairo tended to take this as a challenge to incorporate more and more capabilities. New things that would come along would be quickly added to the list of potential features in the product. Worse, something that BillG might see conceptually related, like an application from a third party for searching across all the files on your hard disk, might become a competitive feature to Cairo. Or more commonly “Can’t Cairo just do this with a little extra work?” and then that little extra work was part of the revised product plans.

It wasn’t ill-intentioned, rather they simply wanted to live up to their CEO’s expectations. They wanted to be perceived as succeeding within their company’s value system, because they correctly understood that their project would be canceled otherwise.

Many incoming leaders find themselves immediately stuck in similar circumstances. They’ve just joined and don’t understand the domain or team very well, but are being told they need to immediately make progress on a series of problems that have foiled the company’s efforts thus far. They know they need to appear to be doing something valuable, so they do anything that might look like progress. It’s particularly common for leaders to begin a Grand Migration at that moment, which they hope will solve the problems at hand, but no matter what will be perceived as a brave, audacious initiative.

Image of stacked layers, with each layer belonging to a different team. Some of these layers are grouped into perception, and some are grouped into physics. Neither perception nor physics represents the entire set.

This isn’t a problem unique to executives or product engineers, I frequently see platform teams make the same mistake when they undertake large-scale migrations. Many platform migrations are structured as an organizational program where a platform team tells product teams they need to complete a certain task (e.g. “move to our monorepo”) by a certain date, along with tracking dashboards that inform executives which teams have or haven’t completed their tasks. This does a great job of applying pressure to the underlying teams, and a good job of managing perceptions by appearing to push hard, but these migrations often fail because there’s little emphasis on the underlying ergonomics of the migration itself. If you tell teams they are failing if they miss a date, they will try to hit the date; if it’s hard, they’ll still fail. Platform teams in that case often blame the product teams for not prioritizing their initiative, when instead the platform teams should have the self-awareness to recognize that they made things difficult by not simplifying the underlying physics for the product teams they asked to migrate.

There’s nothing wrong about solving for perception, and indeed it’s a necessary skill to be an effective leader. Rather the lesson here is that most meaningful projects require solving for both perception and physics.

Solving for physics

When I joined Stripe, one of the first projects I wanted to take on was migrating to Kubernetes and away from hand-rolled tooling for managing VMs directly. This was heavily influenced by what I had learned migrating Uber from a monolithic Python application to polyglot applications in a polyrepo. After a few months of trying to build alignment within engineering, I postponed the Kubernetes migration for a few years because I couldn't convince them it solved a pressing problem. (I did come back to it, and it was a success when I did.) I could have forced the team to work on that project, but it goes against my instincts: generally when engineers push back on leadership ideas, there's a good reason for doing so.

Similarly, my initial push at Stripe was not toward the Ruby typing work that became Sorbet, but rather to design an incremental migration towards an existing statically-typed language such as Java or Go. The argument I got back was that this was impractical because it required too large a migration effort, and that Facebook’s Hack had already proven out the viability of moving from PHP to a PHP-like typed language. I took my time to understand the pushback, and over time shifted my thinking to focus instead on sequencing these efforts: even if we wanted to move to a different language, first we needed to improve the architecture to support migrating modules, and that effort would benefit from typing Ruby.

I was fortunate in these cases, because there were few perceptions that I needed to solve for, and I was able to mostly focus on the physics. Indeed, the opportunity to focus on physics is one of the undervalued advantages of working within infrastructure engineering. You’ll rarely be lucky enough in senior leadership roles to focus on the physics.

For example, when I joined Carta, there was pressure across the industry and internally to increase our investment into using LLMs. Most engineers were quite skeptical of the opportunity to use LLMs, so if I’d listened exclusively to the physics, I would have probably ignored the pressure to adopt. However, that would have led me astray in two ways. First, I would have seriously damaged the wider executive team’s belief in my ability to incorporate new ideas. Second, physics are anchored on how we understand the world today, and LLMs are a place where things are evolving quickly. Our approach to using LLMs in our product is better than anything we would have gotten to by only solving for physics. (And vastly better than we’d have come up with if we’d only solved for perception.)

I think the LLM example is instructive because it violates the expectation that “physics” are real and “perceptions” are false. It can go both ways, depending on the circumstances. As soon as you get complacent about your perspective representing reality, you’ll quickly be disabused of that notion.

Balancing physics and perception

Effective leaders meld perception and physics into approaches that solve both. This is hard to do, takes a lot of energy, and when done well often doesn’t even look like you’re doing that much. Many leaders try to solve both, but eventually give in to the siren’s song of applying perception pressure without a point of view on how that pressure should be channeled into a physical plan. Applying pressure without a plan is the same issue as the infrastructure migration, where you can certainly create accountability, but it’s pretty likely to fail.

Pressure without a plan is appropriate at some level of seniority, and it’s important to understand within a given organization where responsibility lies for appending a plan to the pressure. In a small startup (10s of people), that’s probably the founders. In a medium-sized company (100s of people), that’s likely the executive team. As the company grows, more and more of the plan will be devised further from the physics, but you always have to decide where planning should start.

There is always a point where an organization will simply give up on planning and allow the pressure to cascade undeterred. In a high-functioning organization, that pressure point is quite high. In lower-functioning organizations, it will occur frequently even if there’s little pressure.

If you can reduce pressure too little, you can also reduce pressure too much. One of my biggest regrets from my time at Stripe is that I allowed too little pressure to hit my organization, which over time created a values oasis that operated with a clear plan but also limited pressure. When I left, the pressure regulator came off, and my organization had a rough patch learning to operate in the new circumstances.

Altogether, this balance is difficult to maintain. I'm still getting better at it slowly over time, learning mostly from mistakes. As a final thought here, respecting physics doesn't necessarily mean doing what engineers want you to do: those who speak for physics aren't necessarily right. Instead, it's about making a deliberate, calculated tradeoff between the two that's appropriate to the circumstances. Sometimes that's courageously pushing back on an impossible timeline; sometimes it's firing a leader who insists change is impossible.

TIL: 8 versions of UUID and when to use them

About a month ago1, I was onboarding a friend into one of my side project codebases and she asked me why I was using a particular type of UUID. I'd heard about this type while working on that project, and it's really neat. So instead of hogging that knowledge for just us, here it is: some good uses for different versions of UUID.

What are the different versions?

Usually when we have multiple numbered versions, the higher numbers are newer and presumed to be better. In contrast, the 8 UUID versions (v1 through v8) are simply different from one another, and all of them are defined in the standard.

Here, I'll provide some explanation of what they are at a high level, linking to the specific section of the RFC in case you want more details.

  • UUID Version 1 (v1) is generated from timestamp, monotonic counter, and a MAC address.
  • UUID Version 2 (v2) is reserved for security IDs with no known details2.
  • UUID Version 3 (v3) is generated from MD5 hashes of some data you provide. The RFC suggests DNS and URLs among the candidates for data.
  • UUID Version 4 (v4) is generated from entirely random data. This is probably what most people think of and run into with UUIDs.
  • UUID Version 5 (v5) is generated from SHA-1 hashes of some data you provide. As with v3, the RFC suggests DNS or URLs as candidates.
  • UUID Version 6 (v6) is generated from timestamp, monotonic counter, and a MAC address. These are the same data as Version 1, but they change the order so that sorting them will sort by creation time.
  • UUID Version 7 (v7) is generated from a timestamp and random data.
  • UUID Version 8 (v8) is entirely custom (besides the required version/variant fields that all versions contain).

When should you use them?

With eight different versions, which should you use? There are a few common use cases that dictate which you should use, and some have been replaced by others.

You'll usually be picking between two of them: v4 or v7. There are also some occasions to pick v5 or v8.

  • Use v4 when you just want a random ID. This is a good default choice.
  • Use v7 if you're using the ID in a context where you want to be able to sort. For example, consider using v7 if you are using UUIDs as database keys.
  • v5 or v8 are used if you have your own data you want in the UUID, but generally, you will know if you need it.
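
To make this concrete, here's a short Python sketch: v4 and v5 come straight from the standard library's uuid module, and - since the uuid module in many Python versions doesn't ship a v7 generator - a minimal hand-rolled v7 following the RFC 9562 bit layout. Treat it as an illustration rather than production code.

import secrets
import time
import uuid

random_id = uuid.uuid4()  # v4: purely random, a good default
stable_id = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/widget/42")  # v5: name-based

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit millisecond timestamp, version/variant bits, random tail."""
    ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & 0xFFFFFFFFFFFF) << 80   # 48-bit Unix timestamp in milliseconds
    value |= 0x7 << 76                       # 4-bit version field = 7
    value |= secrets.randbits(12) << 64      # 12 random bits (rand_a)
    value |= 0b10 << 62                      # 2-bit variant field
    value |= secrets.randbits(62)            # 62 random bits (rand_b)
    return uuid.UUID(int=value)

sortable_id = uuid7()  # v7: IDs generated later sort later (by timestamp)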

What about the other ones?

  • Per the RFC, v7 improves on v1 and v6 and should be used over those if possible. So you usually won't want v1 or v6. If you do need one of those, prefer v6 over v1.
  • v2 is reserved for unspecified security things. If you are using these, you probably can't tell me or anyone else about it, and you're probably not reading this post to figure out more about them.
  • v3 is superseded by v5, which uses a stronger hash. This is one where you'll probably know if you need it.

1

Despite the title of "today I learned," I did learn this over a month ago. In between, that month contained a lot of sickness and low energy, and I'm finally getting back into a cadence of having energy for some extra writing or extra coding.

2

These were used in a project that either failed or is extremely secretive. I can't find much information about it and the official page's copyright notice was last updated in 2020.

Keeping things in sync: derive vs test

An extremely common problem in programming is that multiple parts of a program need to be kept in sync – they need to do exactly the same thing or behave in a consistent way. It is in response to this problem that we have mantras like “DRY” (Don’t Repeat Yourself), or, as I prefer it, OAOO, “Each and every declaration of behaviour should appear Once And Only Once”.

For both of these mantras, if you are faced with possible duplication of any kind, the answer is simply “just say no”. However, since programming mantras are to be understood as proverbs, not absolute laws, there are times that obeying this mantra can hurt more than it helps, so in this post I’m going to discuss other approaches.

Most of what I say is fairly language agnostic I think, but I’ve got specific tips for Python and web development.

The essential problem

To step back for a second, the essential problem that we are addressing here is that if making a change to a certain behaviour requires changing more than one place in the code, we have the risk that one will be forgotten. This results in bugs, which can be of various degrees of seriousness depending on the code in question.

To pick a concrete example, suppose we have a rule that says that items in a deleted folder get stored for 30 days, then expunged. We’re going to need some code that does the actual expunging after 30 days, but we’re also going to need to tell the user about the limit somewhere in the user interface. “Once And Only Once” says that the 30 days limit needs to be defined in a single place somewhere, and then reused.

There is a second kind of motivating example, which I think often crops up when people quote “Don’t Repeat Yourself”, and it’s really about avoiding tedious things from a developer perspective. Suppose you need to add an item to a menu, and you find out that first you’ve got to edit the MENU_ITEMS file to add an entry, then you’ve got to edit the MAIN_MENU constant to refer to the new entry, then you’ve got to define a keyboard shortcut in the MENU_SHORTCUTS file, then a menu icon somewhere else etc. All of these different places are in some way repeating things about how menus work. I think this is less important in general, but it is certainly life-draining as a developer if code is structured in this way, especially if it is difficult to discover or remember all the things that have to be done.

The ideal solution: derive

OAOO and DRY say that we aim to have a single place that defines the rule or logic, and any other place should be derived from this.

Regarding the simple example of a time limit displayed in the UI and used in the backend, this might be as simple as defining a constant e.g. in Python:

from datetime import timedelta

EXPUNGE_TIME_LIMIT = timedelta(days=30)

We then import and use this constant in both our UI and backend.

An important part of this approach is that the “deriving” process should be entirely automatic, not something that you can forget to do. In the case of a Python import statement, that is very easy to achieve, and relatively hard to get wrong – if you change the constant where it is defined in one module, any other code that uses it will pick up the change the next time the Python process is restarted.

Alternative solution: test

By "test", I mean ideally an automated test, but manual tests may also work if they are properly scripted. The idea is that you write a test that checks that the behaviour of the code is synced. Often, one (or more) of the places that need the behaviour will define it using some constant as above - let's say the "backend" code. Then, for another instance, e.g. the UI, you would hard-code "30 days" without using the constant, but have a test that uses the backend constant to build a string, and checks the UI for that string.
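
For instance, a minimal sketch of what this could look like in Python (the constant and help text below are illustrative, not from a real codebase):

from datetime import timedelta

# Backend: the single place where the rule is actually enforced.
EXPUNGE_TIME_LIMIT = timedelta(days=30)

# UI: the limit is hard-coded in the help text rather than derived.
DELETED_FOLDER_HELP = "Items in the deleted folder are removed after 30 days."

def test_help_text_matches_expunge_limit():
    # If the backend constant changes, this test fails and points at the stale UI copy.
    assert f"after {EXPUNGE_TIME_LIMIT.days} days" in DELETED_FOLDER_HELP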

Examples

In the example above, it might be hard to see why you want to use the fundamentally less reliable, less automatic method I’m suggesting. So I now have to show some motivating examples where the “derive” method ends up losing to the cruder, simpler alternative of “test”.

Example 1 - external data sources

My first example comes from the project I’m currently working on, which involves creating CAM files from input data. Most of the logic for that is driven using code, but there are some dimensions that are specified as data tables by the engineers of the physical product.

These data tables look something like the one below. The details here aren't important, and I've changed them - it's enough to know that we are creating some physical "widgets" which need to have specific dimensions specified:

Widgets have length 150mm unless specified below

Widget id | Location | Length (mm)
A         | start    | 100
A         | end      | 120
F         | start    | 105
F         | end      | 110

These tables are supplied at design-time rather than run-time i.e. they are bundled with the software and can’t be changed after the code is shipped. But it is still convenient to read them in automatically rather than simply duplicate the tables in my code by some process. So, for the body of the table, that’s exactly what my code does on startup – it reads the bundled XLSX/CSV files.

So we are obeying “derive” here — there is a single, canonical source of data, and anywhere that needs it derives it by an entirely automatic process.

But what about that “150mm” default value specified in the header of that table?

It would be possible to “derive” it by having a parser. Writing such a parser is not hard to do – for this kind of thing in Python I like parsy, and it is as simple as:

import parsy as P

default_length_parser = (
  P.string("Widgets have length ") >>
  P.regex(r"\d+").map(int)
  << P.string("mm unless specified below")
)

In fact I do something similar in some cases. But in reality, the “parser” here is pretty simplistic – it can’t deal with the real variety of English text that might be put into the sentence, and to claim I’m “deriving” it from the table is a bit of a stretch – I’m just matching a specific, known pattern. In addition, it’s probably not the case that any value for the default length would work – most likely if it was 10 times larger, there would be some other problem, and I’d want to do some manual checking.

So, let’s admit that we are really just checking for something expected, using the “test” approach. You can still define a constant that you use in most of the code:

DEFAULT_LENGTH_MM = 150

And then you test it is what you expect when you load the data file:

assert worksheets[0].cell(1, 1).value == f"Widgets have length {DEFAULT_LENGTH_MM}mm unless specified below"

So, I’ve achieved my aim: a guard against the original problem of having multiple sources of information that could potentially be out of sync. But I’ve done it using a simple test, rather than a more complex and fragile “derive” that wouldn’t have worked well anyway.

By the way, for this specific project – we’re looking for another contract developer! It’s a very worthwhile project, and one I’m really enjoying – a small flexible team, with plenty of problem solving and fun challenges, so if you’re a talented developer and interested give me a shout.

Example 2 - defining UI behaviour for domain objects

Suppose you have a database that stores information about some kind of entity, like customers say, and you have different types of customer, represented using an enum of some kind, perhaps a string enum like this in Python:

from enum import StrEnum


class CustomerType(StrEnum):
    ENTERPRISE = "Enterprise"
    SMALL_FRY = "Small fry"  # Let’s be honest! Try not to let the name leak…
    LEGACY = "Legacy"

We need a way to edit the different customer types, and they are sufficiently different that we want quite different interfaces. So, we might have a dictionary mapping the customer type to a function or class that defines the UI. If this were a Django project, it might be a different Form class for each type:

CUSTOMER_EDIT_FORMS = {
    CustomerType.ENTERPRISE: EnterpriseCustomerForm,
    CustomerType.SMALL_FRY: SmallFryCustomerForm,
    CustomerType.LEGACY: LegacyCustomerForm,
}

Now, the DRY instinct kicks in and we notice that we now have two things we have to remember to keep in sync — any addition to the customer enum requires a corresponding addition to the UI definition dictionary. Maybe there are multiple dictionaries like this.

We could attempt to solve this by “deriving”, or some “correct by construction” mechanism that puts the creation of a new customer type all in one place.

For example, maybe we’ll have a base Customer class with get_edit_form_class() as an abstractmethod, which means it is required to be implemented. If I fail to implement it in a subclass, I can’t even construct an instance of the new customer subclass – it will throw an error.

from abc import ABC, abstractmethod

class Customer(ABC):
    @abstractmethod
    def get_edit_form_class(self):
        pass


class EnterpriseCustomer(Customer):
    def get_edit_form_class(self):
        return EnterpriseCustomerForm

class LegacyCustomer(Customer):
    ...  # etc.

I still need my enum value, or at least a list of valid values that I can use for my database field. Maybe I could derive that automatically by looking at all the subclasses?

CUSTOMER_TYPES = [
    cls.__name__.upper().replace("CUSTOMER", "")
    for cls in Customer.__subclasses__()
]

Or maybe an __init_subclass__ trick, and I can perhaps also set up the various mappings I’ll need that way?
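To make that concrete, the kind of __init_subclass__ magic I have in mind might look roughly like this – a sketch with made-up names, shown only to illustrate the mechanism I’m about to argue against:

class Customer:
    # registry filled in automatically as each subclass is defined
    edit_forms: dict[type, type] = {}

    def __init_subclass__(cls, *, edit_form=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if edit_form is None:
            raise TypeError(f"{cls.__name__} must declare edit_form=...")
        Customer.edit_forms[cls] = edit_form


class EnterpriseCustomerForm:
    ...


class EnterpriseCustomer(Customer, edit_form=EnterpriseCustomerForm):
    pass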

It’s at this point you should stop and think. In addition to requiring you to mix UI concerns into the Customer class definitions, it’s getting complex and magical.

The alternative I’m suggesting is this: require manual syncing of the two parts of the code base, but add a test to ensure that you did it. All you need is a few lines after your CUSTOMER_EDIT_FORMS definition:

CUSTOMER_EDIT_FORMS = {
    # etc as before
}

for c_type in CustomerType:
    assert (
        c_type in CUSTOMER_EDIT_FORMS
    ), f"You've defined a new customer type {c_type}, you need to add an entry in CUSTOMER_EDIT_FORMS"

You could do this as a more traditional unit test in a separate file, but for simple things like this, I think an assertion right next to the code works much better. It really helps local reasoning to be able to look and immediately conclude “yes, I can see that this dictionary must be exhaustive because the assertion tells me so.” Plus you get really early failure – as soon as you import the code.

This kind of thing crops up a lot – if you create a class here, you’ve got to create another one over there, or add a dictionary entry etc. In these cases, I’m finding simple tests and assertions have a ton of advantages when compared to clever architectural contortions (or other things like advanced static typing gymnastics):

  • they are massively simpler to create and understand.

  • you can write your own error message in the assertion. If you make a habit of using really clear error messages, like the one above, your code base will literally tell you how to maintain it.

  • you can easily add things like exceptions. “Every Customer type needs an edit UI defined, except Legacy because they are read only” is an easy, small change to the above.

    • This contrasts with cleverer mechanisms, which might require relaxing other constraints to the point where you defeat the whole point of the mechanism, or create more difficulties for yourself.

  • the rule about how the code works is very explicit, rather than implicit in some complicated code structure, and typically needs no comment other than what you write in the assertion message.

  • you express and enforce the rule, with any complexities it gains, in just one place. Ironically, if you try to enforce this kind of constraint using type systems or hierarchies to eliminate repetition or the need for any kind of code syncing, you may find that when you come to change the constraint it actually requires touching far more places.

  • temporarily silencing the assertion while developing is easy and doesn’t have far reaching consequences.

Of course, there are many times when being able to automatically derive things at the code level, including some complex relationships between parts of the code, can be a win, and it’s the kind of thing you can do in Python with its many powerful techniques.

But my point is that you should remember the alternative: “synchronise manually, and have a test to check you did it.” Being able to add any kind of executable code at module level – the same level as class/function/constant definitions – is a Python super-power that you should use.

Example 3 - external polymorphism and static typing

A variant of the above problem is when, instead of an enum defining different types, I’ve got a set of classes that all need some behaviour defined.

Often we just use polymorphism where a base class defines the methods or interfaces needed and sub-classes provide the implementation. However, as in the previous case, this can involve mixing concerns – e.g. user interface code, possibly of several types, ends up mixed in with the base domain objects. It also imposes constraints on class hierarchies.

Recently, for these kinds of cases, I’m more likely to prefer external polymorphism to avoid these problems. To give an example, in my current project I’m using the Command pattern or plan-execute pattern extensively, and it involves manipulating CAM objects using a series of command objects that look something like this:

from dataclasses import dataclass
from typing import TypeAlias


@dataclass
class DeleteFeature:
    feature_name: str


@dataclass
class SetParameter:
    param_name: str
    value: float


@dataclass
class SetTextSegment:
    text_name: str
    segment: int
    value: str


Command: TypeAlias = DeleteFeature | SetParameter | SetTextSegment

Note that none of them share a base class, but I do have a union type that gives me the complete set.

It’s much more convenient to define the behaviour associated with these separately from these definitions, and so I have multiple other places that deal with Command, such as the place that executes these commands and several others. One example that requires very little code to show is where I’m generating user-presentable tables that show groups of commands. I convert each of these Command objects into key-value pairs that are used for column headings and values:

def get_command_display(command: Command) -> tuple[str, str | float | bool]:
    match command:
        case DeleteFeature(feature_name=feature_name):
            return (f"Delete {feature_name}", True)
        case SetParameter(param_name=param_name, value=value):
            return (param_name, value)
        case SetTextSegment(text_name=text_name, segment=segment, value=value):
            return (f"{text_name}[{segment}]", value)

This is giving me a similar problem to the one I had before: if I add a new Command, I have to remember to add the new branch to get_command_display.

I could split out get_command_display into a dictionary of functions, and apply the same technique as in the previous example, but it’s more work, a less natural fit for the problem and potentially less flexible.

Instead, all I need to do is add exhaustiveness checking with one more branch:

from typing import assert_never  # Python 3.11+ (or from typing_extensions)

match command:
    # ... the cases from before ...
    case _:
        assert_never(command)

Now, pyright will check that I didn’t forget to add branches here for any new Command. The error message is not controllable, in contrast to hand-written asserts, but it is clear enough.

The theme here is that additions in one part of the code require synchronised additions in other parts of the code, rather than being automatically correct “by construction”, but you have something that tests you didn’t forget.

Example 4 - generated code

In web development, ensuring consistent design and keeping different things in sync is a significant problem. There are many approaches, but let’s start with the simple case of using a single CSS stylesheet to define all the styles.

We may want a bunch of components to have a consistent border colour, and a first attempt might look like this (ignoring the many issues of naming conventions here):

.card-component, .bordered-heading {
   border-color: #800;
}

This often becomes impractical when we want to organise by component, rather than by property, which introduces duplication:

.card-component {
   border-color: #800;
}

/* somewhere far away … */

.bordered-heading {
   border-color: #800;
}

Thankfully, CSS has variables, so the first application of “derive” is straightforward – we define a variable which we can use in multiple places:

:root {
    --primary-border-color: #800;
}

/* elsewhere */

.bordered-heading {
    border-bottom: 1px solid var(--primary-border-color);
}

However, as the project grows, we may find that we want to use the same variables in different contexts where CSS isn’t applicable. So the next step at this point is typically to move to Design Tokens.

Practically speaking, this might mean that we now have our variables defined in a separate JSON file. Maybe something like this (using a W3C draft spec):

{
  "primary-border-color": {
    "$value": "#800000",
    "$type": "color"
  },
  "primary-highlight-color": {
    "$value": "#FBC100",
    "$type": "color"
  }
}

From this, we can automatically generate CSS fragments that contain the same variables quite easily – for simple cases, this isn’t more than a 50 line Python script.
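As a rough illustration of what I mean – the file name and the flat token structure are assumptions, not a finished tool – the core of such a script could be:

import json


def tokens_to_css(tokens: dict) -> str:
    """Turn a flat design-token dict into a :root block of CSS custom properties."""
    lines = [":root {"]
    for name, token in tokens.items():
        lines.append(f"    --{name}: {token['$value']};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    with open("design-tokens.json") as f:
        print(tokens_to_css(json.load(f)))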

However, we’ve got some choices when it comes to how we put everything together. I think the general assumption in the web development world is that a fully automatic “derive” is the only acceptable answer. This typically means you have to put your own CSS in a separate file, and then you have a build tool that watches for changes and compiles your CSS plus the generated CSS into the final output that gets sent to the browser.

In addition, once you’ve bought into these kind of tools you’ll find they want to do extensive changes to the output, and define more and more extensions to the underlying languages. For example, postcss-design-tokens wants you to write things like:

.foo {
     color: design-token('color.background.primary');
 }

And instead of using CSS variables in the output, it puts the value of the token right in to every place in your code that uses it.

This approach has various problems, in particular that you become more and more dependent on the build process, and the output gets further from your input. You can no longer use the Dev Tools built in to your browser to do editing – the flow of using Dev Tools to experiment with changing a single spacing or colour CSS variable for global changes is broken; you need your build tool. You can’t easily copy changes from Dev Tools back into the source because of the transformation step, and debugging can be similarly difficult. And then you’ll probably want special IDE support for the special CSS extensions, rather than being able to lean on your editor simply understanding CSS, and any other tools that want to look at your CSS now need support too.

It’s also a lot of extra infrastructure and complexity to solve this one problem, especially when our design tokens JSON file is probably not going to change that often, or is going to have long periods of high stability. There are good reasons to want to be essentially build free. The current state of the art in this space is that to get your build tool to compile your CSS you add import './styles.css' in your entry point Javascript file! What if I don’t even have a Javascript file? I think I understand how this sort of thing came about, but don’t try to tell me that it’s anything less than completely bonkers.

Do we have an alternative to the fully automatic derive?

Using the “test” approach, we do. We can even stick with our single CSS file – we just write it like this:

/* DESIGN TOKENS START */
/* auto-created block - do not edit */
:root {
    --primary-border-color: #800000;
    --primary-highlight-color: #FBC100;
}
/* DESIGN TOKENS END */

/* the rest of our CSS here */

The contents of this block will almost certainly be auto-generated. We won’t have a process that fully automatically updates it, however, because this is the same file where we put our custom CSS, and we don’t want any possibility of lost work due to the file being overwritten while we are editing it.

On the other hand we don’t want things to get out of sync, so we’ll add a test that checks whether the current styles.css contains the block of design tokens that we expect to be there, based on the JSON. For actually updating the block, we’ll need some kind of manual step – maybe a script that can find and update the DESIGN TOKEN START block, maybe cog – which is a perfect little tool for this use case — or we could just copy-paste.
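A sketch of that test, assuming the little tokens_to_css generator from earlier and hypothetical file names:

import json
from pathlib import Path


def test_styles_css_contains_current_design_tokens():
    tokens = json.loads(Path("design-tokens.json").read_text())
    expected_block = tokens_to_css(tokens)  # the generator sketched above
    css = Path("styles.css").read_text()
    assert expected_block in css, (
        "styles.css is out of sync with design-tokens.json - "
        "regenerate the block between the DESIGN TOKENS START/END markers"
    )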

There are also slightly simpler solutions in this case, like using a CSS import if you don’t mind having multiple CSS files.

Conclusion

For all the examples above, the solutions I’ve presented might not work perfectly for your context. You might also want to draw the line in a different place than I do. But my main point is that we don’t have to go all the way with a fully automatic derive solution to eliminate any manual syncing. Having some manual work plus a mechanism to test that two things are in sync is a perfectly legitimate solution, and it can avoid some of the large costs that come with structuring everything around “derive”.

The boutique punks, as sung by The Clash

One thing I find very symbolic is that already at the very beginning of their career (and before their success exploded in the US and internationally), The Clash had completely lost any illusion about the effectiveness of the image of rebellion brought by the growing punk scene:

Punk rockers in the UK
They won't notice anyway
They're all too busy fighting
For a good place under the lighting

The new groups are not concerned
With what there is to be learned
They got Burton suits, ha you think it's funny
Turning rebellion into money

The explicit verse saying that the new bands don't care and just want to turn rebellion into money came as early as 1978, in the lyrics of the single “White man in Hammersmith Palais”.

O artigo "Os punks de boutique, cantados pelo The Clash" foi originalmente publicado no site TRILUX, de Augusto Campos.

Projections and Projection Matrices

We'll start with a visual and intuitive representation of what a projection is. In the following diagram, we have vector b in the usual 3-dimensional space and two possible projections - one onto the z axis, and another onto the x,y plane.

Projection of a 3d vector onto axis and plane

If we think of 3D space as spanned by the usual basis vectors, a projection onto the z axis is simply:

\[b_z=\begin{bmatrix} 0 \\ 0 \\ z \end{bmatrix}\]

A couple of intuitive ways to think about what a projection means:

  • The projection of b on the z axis is a vector in the direction of the z axis that's closest to b.
  • The projection of b on the z axis is the shadow cast by b when a flashlight is pointed at it in the direction of the z axis.

We'll see a more formal definition soon. A projection onto the x,y plane is similarly easy to express.

Projection onto a line

Projecting onto an axis is easy - as the diagram shows, it's simply taking the vector component in the direction of the axis. But how about projections onto arbitrary lines?

Projection of a 3d vector onto another 3D vector

In vector space, a line is just all possible scalings of some vector [1].

Speaking more formally now, we're interested in the projection of \vec{b} onto \vec{a}, where the arrow over a letter means it's a vector. The projection (which we call \vec{b_a}) is the closest vector to \vec{b} in the direction of \vec{a}. In other words, the dotted line in the diagram is at a right angle to the line a; therefore, the error vector \vec{e} is orthogonal to \vec{a}.

This orthogonality gives us the tools we need to find the projection. We'll want to find a constant c such that:

\[\vec{b_a}=c\vec{a}\]

\vec{e} is orthogonal to \vec{a}, meaning that their dot product is zero: \vec{e}\cdot\vec{a}=0. We'll use the distributive property of the dot product in what follows:

\[\begin{align*} \vec{a}\cdot\vec{e}&=0 \\ \vec{a}\cdot(\vec{b}-c\vec{a})&=0\\ \vec{a}\cdot\vec{b}-c\vec{a}\cdot\vec{a}&=0\\ c&=\frac{\vec{a}\cdot\vec{b}}{\vec{a}\cdot\vec{a}} \end{align*}\]

Note that \vec{a}\cdot\vec{a} is the squared magnitude of \vec{a}; for a unit vector this would be 1. This is why it doesn't matter if \vec{a} is a unit vector or not - we normalize it anyway.

We have a formula for c now - we can find it given \vec{a} and \vec{b}. To prepare for what comes next, however, we'll switch notations. We'll use matrix notation, in which vectors are - by convention - column vectors, and a dot product can be expressed by a matrix multiplication between a row and a column vector. Therefore:

\[\begin{align*} c&=\frac{a^T b}{a^T a} \Rightarrow \\ b_a&=\frac{a^T b}{a^T a}a \end{align*}\]

Projection matrix

Since the fraction representing c is a constant, we can switch the order of the multiplication by a, and then use the fact that matrix multiplication is associative to write:

\[\begin{align*} b_a&=a\frac{a^T b}{a^T a}\\ b_a&=\frac{a a^T}{a^T a}b \end{align*}\]

In our case, since a is a 3D vector, a a^T is a 3x3 matrix [2], while a^Ta is a scalar. Thus we get our projection matrix - call it P:

\[\begin{align*} P&=\frac{a a^T}{a^T a}\\ b_a&=Pb \end{align*}\]

A recap: given some vector \vec{a}, we can construct a projection matrix P. This projection matrix can take any vector \vec{b} and help us calculate its projection onto \vec{a} by means of a simple matrix multiplication!

Example of line projection

Consider our original example - projection on the z axis. First, we'll find a vector that spans the subspace represented by the z axis: a trivial vector is the unit vector:

\[a_z=\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\]

What's the projection matrix corresponding to this vector?

\[P = \frac{a_z a_{z}^{T}}{1} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\begin{bmatrix}0&0&1\end{bmatrix}=\begin{bmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{bmatrix}\]

Now, given any arbitrary vector \vec{b} we can find its projection onto the z axis by multiplying with P. For example:

\[b_a=Pb=\begin{bmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{bmatrix}\begin{bmatrix} x\\ y\\ z \end{bmatrix}=\begin{bmatrix} 0\\ 0\\ z \end{bmatrix}\]

Another example - less trivial this time. Say we want to project vectors onto the line spanned by the vector:

\[a=\begin{bmatrix} 1 \\ 3 \\ 7 \end{bmatrix}\]

Let's compute the projection matrix:

\[P = \frac{a a^{T}}{a^T a} = \frac{1}{59}\begin{bmatrix} 1 \\ 3 \\ 7 \end{bmatrix}\begin{bmatrix}1&3&7\end{bmatrix}=\frac{1}{59}\begin{bmatrix} 1&3&7\\ 3&9&21\\ 7&21&49 \end{bmatrix}\]

Now we'll use it to calculate the projection of b=\begin{bmatrix}2 & 8 & -4\end{bmatrix}^T onto this line:

\[b_a=Pb=\frac{1}{59}\begin{bmatrix} 1&3&7\\ 3&9&21\\ 7&21&49 \end{bmatrix}\begin{bmatrix} 2\\ 8\\ -4 \end{bmatrix}=\frac{1}{59}\begin{bmatrix} -2\\ -6\\ -14 \end{bmatrix}\]

To verify this makes sense, we can calculate the error vector \vec{e}:

\[\begin{align*} e&=b-b_a=\begin{bmatrix} 2\\ 8\\ -4 \end{bmatrix}-\frac{1}{59}\begin{bmatrix} -2\\ -6\\ -14 \end{bmatrix}=\frac{1}{59}\begin{bmatrix} 120\\ 478\\ -222 \end{bmatrix} \end{align*}\]

And check that it's indeed orthogonal to \vec{a}:

\[a\cdot e = \frac{1}{59}(1\cdot 120 + 3\cdot 478 + 7 \cdot -222)=0\]
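If you'd like to double-check this numerically, here is a small NumPy sketch (just a verification aid, not part of the derivation):

import numpy as np

a = np.array([1.0, 3.0, 7.0])
b = np.array([2.0, 8.0, -4.0])

P = np.outer(a, a) / np.dot(a, a)   # projection matrix a a^T / (a^T a)
b_a = P @ b                         # projection of b onto the line spanned by a
e = b - b_a                         # error vector

print(b_a * 59)       # [ -2.  -6. -14.], i.e. b_a = (1/59) [-2, -6, -14]
print(np.dot(a, e))   # ~0: the error is orthogonal to a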

Projection onto a vector subspace

A subspace of a vector space is a subset of vectors from the vector space that's closed under vector addition and scalar multiplication. For \mathbb{R}^3, some common subspaces include lines that go through the origin and planes that go through the origin.

Therefore, the projection onto a line scenario we've discussed so far is just a special case of a projection onto a subspace. We'll look at the general case now.

Suppose we have an m-dimensional vector space \mathbb{R}^m, and a set of n linearly independent vectors \vec{a_1},\dots,\vec{a_n} \in \mathbb{R}^m. We want to find a combination of these vectors that's closest to some target vector \vec{b} - in other words, to find the projection of \vec{b} onto the subspace spanned by \vec{a_1},\dots,\vec{a_n}.

Arbitrary m-dimensional vectors are difficult to visualize, but the derivation here follows exactly the path we've taken for projections onto lines in 3D. There, we were looking for a constant c such that c\vec{a} was the closest vector to \vec{b}. Now, we're looking for a vector \vec{c} which represents a linear combination of \vec{a_1},\dots,\vec{a_n} that is closest to a target \vec{b}.

If we organize \vec{a_1},\dots,\vec{a_n} as columns into a matrix called A, we can express this as:

\[\vec{b_a}=A\vec{c}\]

This is a matrix multiplication: \vec{c} is a list of coefficients that describes some linear combination of the columns of A. As before, we want the error vector \vec{e}=\vec{b}-\vec{b_a} to be orthogonal to the subspace onto which we're projecting: this means it's orthogonal to every one of \vec{a_1},\dots,\vec{a_n}. The fact that each of the vectors \vec{a_1},\dots,\vec{a_n} is orthogonal to \vec{e} can be expressed as [3]:

\[\begin{align*} a_{1}^{T}e&=0\\ \vdots\\ a_{n}^{T}e&=0 \end{align*}\]

This is a system of linear equations, and thus it can be represented as a matrix multiplication by a matrix with vectors a_{k}^T in its rows; this matrix is just A^T:

\[A^T e=0\]

But e=b-Ac, so:

\[\begin{align*} A^T (b-Ac)&=0 \Rightarrow \\ A^Tb&=A^TAc \end{align*}\]

Since the columns of A are linearly independent, A^T A is an invertible matrix [4], so we can isolate c:

\[c=(A^T A)^{-1}A^T b\]

Then the projection \vec{b_a} is:

\[b_a=Ac=A(A^T A)^{-1}A^T b\]

Similarly to the line example, we can also define a projection matrix as:

\[P=A(A^T A)^{-1}A^T\]

Given a vector \vec{b}, P projects it onto the subspace spanned by the vectors \vec{a_1},\dots,\vec{a_n}:

\[b_a=Pb\]

Let's make sure the dimensions work out. Recall that A consists of n columns, each with m rows. So we have:

\[\begin{matrix} A & (m\times n) \\ A^T & (n\times m)\\ A^T A & (n\times n) \\ (A^T A)^{-1} & (n\times n) \\ A(A^T A)^{-1} & (m\times n) \\ A(A^T A)^{-1}A^T & (m\times m) \\ \end{matrix}\]

Since the vector \vec{b} is m-dimensional, Pb is valid and the result is another m-dimensional vector - the projection \vec{b}_a.

Example of subspace projection

At the beginning of this post there's a diagram showing the projection of an arbitrary vector \vec{b} onto a line and onto a plane. We'll find the projection matrix for the plane case now. The projection is onto the xy plane, which is spanned by these vectors:

\[a_x=\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} a_y=\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\]

Collecting them into a single matrix A, we get:

\[A=\begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{bmatrix}\]

To find P, let's first calculate A^T A:

\[A^T A= \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{bmatrix}= \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix}\]

This happens to be the identity matrix, so its inverse is itself. Thus, we get:

\[P=A(A^T A)^{-1}A^T=AIA^T=AA^T= \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \end{bmatrix}= \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{bmatrix}\]

We can now project an arbitrary vector \vec{b} onto this plane by multiplying it with this P:

\[b_a=Pb= \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}= \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}\]

Granted, this is a fairly trivial example - but it works in the general case. As an exercise, pick a different pair of independent vectors and find the projection matrix onto the plane spanned by them; then, verify that the resulting error is orthogonal to the plane.
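If you want to check your work with NumPy, a sketch along these lines (with an arbitrary choice of spanning vectors) does the job:

import numpy as np

# two linearly independent vectors spanning a plane through the origin
A = np.column_stack([[1.0, 1.0, 0.0],
                     [0.0, 1.0, 2.0]])

P = A @ np.linalg.inv(A.T @ A) @ A.T   # P = A (A^T A)^-1 A^T

b = np.array([3.0, -1.0, 4.0])
b_a = P @ b
e = b - b_a

print(A.T @ e)   # approximately [0, 0]: the error is orthogonal to the plane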

Properties of projection matrices

Projection matrices have some interesting properties that are educational to review.

First, projection matrices are symmetric. To understand why, first recall how a transpose of a matrix product is done:

\[(AB)^T=B^T A^T\]

As a warm-up, we can show that A^T A is symmetric:

\[(A^T A)^T=A^T (A^T)^T=A^T A\]

Now, let's transpose P:

\[\begin{align*} P&=A(A^T A)^{-1}A^T \\ P^T&=(A(A^T A)^{-1}A^T)^T\\ &=((A^T A)^{-1}A^T)^T A^T\\ &=A(A^T A)^{-1}A^T=P \end{align*}\]

Here we've used the fact that the inverse of a symmetric matrix is also symmetric, and we see that indeed P^T=P.

Second, projection matrices are idempotent: P^2=P; this isn't hard to prove either:

\[\begin{align*} P^2&=A(A^T A)^{-1}A^T A(A^T A)^{-1}A^T\\ &=A(A^T A)^{-1}(A^T A)(A^T A)^{-1}A^T\\ &=A(A^T A)^{-1}[(A^T A)(A^T A)^{-1}]A^T\\ &=A(A^T A)^{-1}IA^T\\ &=A(A^T A)^{-1}A^T=P \end{align*}\]

Intuitive explanation: think about what a projection does - given some \vec{b}, it calculates the closest vector to it in the desired subspace. If we try to project this projection again - what will we get? Well, still the closest vector in that subspace - itself! In other words:

\[b_a=Pb=P(Pb)\]

Projections onto orthogonal subspaces

There's another special case of projections that is interesting to discuss: projecting a vector onto orthogonal subspaces. We'll work through this using an example.

Consider the vector:

\[a_1=\begin{bmatrix} 1 \\ -2 \\ 3 \end{bmatrix}\]

We'll find the projection matrix for this vector:

\[P_1=\frac{a_1 a_{1}^T}{a_{1}^T a_1}= \frac{1}{14} \begin{bmatrix} 1 & -2 & 3\\ -2 & 4 & -6\\ 3 & -6 & 9 \end{bmatrix}\]

Now, consider the following vector, which is orthogonal to \vec{a_1}:

\[a_2=\begin{bmatrix} -3 \\ 0 \\ 1 \end{bmatrix}\]

Its projection matrix is:

\[P_2=\frac{a_2 a_{2}^T}{a_{2}^T a_2}= \frac{1}{10} \begin{bmatrix} 9 & 0 & -3\\ 0 & 0 & 0\\ -3 & 0 & 1 \end{bmatrix}\]

It's trivial to check that both P_1 and P_2 satisfy the properties of projection matrices; what's more interesting is that P_1 + P_2 does as well - so it's also a proper projection matrix!

To take it a step further, consider yet another vector:

\[a_3=\begin{bmatrix} -1 \\ -5 \\ -3 \end{bmatrix}\]

The vectors (\vec{a_1},\vec{a_2},\vec{a_3}) are all mutually orthogonal, and thus form an orthogonal basis for \mathbb{R}^3. We can calculate P_3 in the usual way, and get:

\[P_3=\frac{a_3 a_{3}^T}{a_{3}^T a_3}= \frac{1}{35} \begin{bmatrix} 1 & 5 & 3\\ 5 & 25 & 15\\ 3 & 15 & 9 \end{bmatrix}\]

Not only is P_1+P_2+P_3 a projection matrix, it's a very familiar matrix in general:

\[P_1+P_2+P_3=I\]

This is equivalent to saying that for any vector \vec{b}:

\[(P_1+P_2+P_3)b=b\]

Hopefully this makes intuitive sense because it's just expressing \vec{b} in an alternative basis for \mathbb{R}^3 [5].
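A quick NumPy check of this claim (again, only a verification aid):

import numpy as np


def proj(a: np.ndarray) -> np.ndarray:
    """Projection matrix onto the line spanned by a."""
    return np.outer(a, a) / np.dot(a, a)


a1 = np.array([1.0, -2.0, 3.0])
a2 = np.array([-3.0, 0.0, 1.0])
a3 = np.array([-1.0, -5.0, -3.0])

P = proj(a1) + proj(a2) + proj(a3)
print(np.allclose(P, np.eye(3)))   # True: the three projections sum to the identity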


[1]We're dealing with vector spaces, where we don't really have lines - only vectors. A line is just a visual way to think about certain subspaces of the vector space \mathbb{R}^3. Specifically, a line through the origin (lines that don't go through the origin belong in affine spaces) is a way to represent \forall c, c\vec{a} where \vec{a} is a vector in the same direction as this line and c is a constant; in other words it's the subspace of \mathbb{R}^3 spanned by \vec{a}.
[2]By the rules of matrix multiplication: we're multiplying a column vector (a 3x1 matrix) by a row vector (a 1x3 matrix). The multiplication is allowed because the inner dimensions match, and the result is a 3x3 matrix.
[3]Recall from the earlier example: we're dropping the explicit vector markings to be able to write matrix arithmetic more naturally. By default vectors are column vectors, so v^T w expresses the dot product between vectors \vec{v} and \vec{w}.
[4]It's possible to prove this statement, but this post is already long enough.
[5]This is a special case of a change of basis, in which the basis is orthogonal.

What is Self Hosted? What is a Stack?

My colleague Ben Vingar wrote a tool called Counterscale which I would describe as “deploy your own analytics”. Except there is a catch: it needs Cloudflare to run. Is it really self hosted if your only way to deploy it is some proprietary cloud vendor?

What's a Stack?

Many years ago we talked about software stacks. A common one happened to be “LAMP”, short for Linux, Apache, MySQL and typically PHP, though Python and Perl were options for the P as well. LAMP lends itself very well to self hosting because all of it is Open Source software you can run and operate yourself free of charge. There was however also a second stack which was not entirely unpopular: “WAMP” (the W meaning Microsoft Windows). You would not necessarily run it yourself if you had a choice, but I deployed more than one of these. Why? Because some SMEs were already running Windows. If you wrote some software in PHP, having people run the software on their already existing Windows servers was preferable to also running some Linux thing they did not know how to operate.

What makes LAMP, WAMP and whatever else work are a few basic technological choices. Originally one of those abstractions was a protocol called CGI which allowed you to marry a programming language to the web server. Later, things like FastCGI appeared to deal with some of the performance challenges that CGI brought, and there were also attempts to move PHP right into the web server as an embedded language with mod_php. For the database, the abstraction in many cases was a dialect of SQL. I built a tool a long time ago that a company ended up running on Microsoft's SQL Server with rather minimal changes. So in some sense what made this work was that one was targeting some form of abstraction.

What's Self Hosted?

Counterscale targets something that the open source ecosystem does not really have abstracted today: an analytics engine and some serverless runtime. What CGI and SQL were for LAMP are, in Counterscale's case, a serverless runtime environment and a column store. All these things do exist in the Open Source ecosystem. All the pieces are there to build your own serverless runtime, and all the pieces are there to build an analytics store on top of ClickHouse, DuckDB or similar databases and Kafka. But we have not agreed on protocols, and we definitely do not have that stuff today in a neatly packaged and reusable form.

Now of course you can build software that runs entirely on Open Source software. In the case of Counterscale you don't even have to look very far: Plausible exists. It's also Open Source, it's also an analytics tool, but rather than being like a “CGI script” in spirit, it's a pretty heavy thing. You gotta run docker containers, run a rather beefy ClickHouse installation, I believe it needs Kafka etc. Running Plausible yourself is definitely not nearly as easy as setting up Counterscale. You do, however, have the benefit of not relying on Cloudflare.

Level up the Protocols

So what does that leave us with? I'm not sure, but I'm starting to think that the web needs new primitives. We now run some things commonly, but the abstractions over them are not ideal, so people target (proprietary) systems directly. The modern web needs CGI-type protocols for queues, for authentication, for column stores, for caches etc. Why does it need that? I think it needs it to lower the cost of building small-scale open source software.

The reason it's so easy and appealing to build something like Counterscale directly against Cloudflare or similar services is that they give you higher level abstractions than you would find otherwise. You don't have to think about scaling workers, you don't have to think about scaling databases. The downside of course is that it locks you into that platform.

But what would be necessary to have your “own Cloudflare” thing you can run once and then run all your cool mini CGI-like scripts on top of? We are missing some necessary protocols. Yet building these protocols is tricky because you often target the least common denominator. Technology is hardly the problem here. We don't need any new innovative technology, but we do need the social contract and the mindset. Those are hard things, they require dedication and marketing. I have not yet seen that, but I'm somewhat confident that we might see it.

We probably want these protocols and systems built on top of them because it makes a lot of things easier. Sometimes, when the cost of doing something drops low enough, it enables a whole new range of things to exist.

Many times when you start building abstractions over these things, you simplify. Even CGI was an incredibly high level abstraction over HTTP if you think about it. CGI in many ways is the original serverless. It abstracts over both HTTP and how a process spawns and its lifecycle. Serverless is bringing back a bit of that, but so far not in a way where this is actually portable between different clouds.

Abstract over Great Ideas

If you have ever chucked up an OG CGI app you might remember the magic. You write a small script, throw it into a specific folder and you are off to the races. No libraries, no complex stuff. CGI at its core was a great idea: make a web server dynamic via a super trivial protocol anyone can implement. There are more ideas like that. Submitting tasks to a worker queue is a great idea, batch writing a lot of data into a system is a great idea, Kafka-like topics are a great idea, caches are a great idea, so are SQL databases, column stores and much more.
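For anyone who never wrote one of those OG CGI scripts: the whole protocol is “the server runs your program once per request, passes request data through environment variables and stdin, and sends whatever you print to stdout back to the browser”. A minimal sketch in Python:

#!/usr/bin/env python3
# Drop into the server's cgi-bin directory and mark executable.
# Headers first, then a blank line, then the body.
import os

print("Content-Type: text/html")
print()
print("<h1>Hello from CGI</h1>")
print(f"<p>Query string: {os.environ.get('QUERY_STRING', '')}</p>")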

Laravel Forge does a tiny bit of that, I feel. Forge goes a bit in that direction in the sense that it says quite clearly that some components are useful: databases, caches, SSL, crons etc. However, its ambition stops at the boundary of the Laravel ecosystem, which is understandable.

Yet maybe over time we can see more of a “SaaS in a box” kind of experience. A thing you run, that you can plug your newfangled, serverless mini tools in, that can leverage auth and all the needs of a modern web application like queues, column stores, caches etc.

⚙️ Tech Breakdown: Fixing Unreal's Horizontal Field of View

Unreal Engine hard-codes applying the aspect ratio of the viewport to a fixed Horizontal Field of View (“FOV”), rather than the Vertical FOV. This means, e.g., if you switch from standard 16:9 to ultrawide it will crop the top and bottom of the view and zoom, rather than revealing more to the left and right as you’d expect, and if you shrink it to 4:3 you end up with an extreme fisheye effect.

FOVX

Unreal Default: content on the sides of the viewport is fixed, causing weird cropping and zooming.

FOVY

Our Camera: fixing content on the top and bottom of the viewport, like we expect.

The default is basically never what I want, but there’s no way to override it with just configuration, so we turn to code. 😖

For the player camera I can just inspect the aspect and compute an adjusted FOV on the fly, but I also want “default” Unreal Cameras to behave correctly, too, so I can use Sequencer out-of-the-box for cutscenes without having to check Constrain Aspect Ratio on the camera and force-lock the viewport size with letterbox bars.

Following the implementation of CineCamera, I opted to create a transparently-replaceable subtype which does the adjustment just-in-time when the desired view is computed.

To start, we define two new subtypes: one for the actual camera component, and another for the wrapper camera actor.

/* A new camera component which overrides the default desired view */
UCLASS( meta=(BlueprintSpawnableComponent) )
class UMyCameraComponent : public UCameraComponent
{
	GENERATED_BODY()

public:

	virtual void GetCameraView(float DeltaTime, FMinimalViewInfo& DesiredView) override;
};

/* a wrapper for the camera actor which overrides the camera component subtype */
UCLASS()
class AMyCamera : public ACameraActor
{
	GENERATED_BODY()

public:
	AMyCamera( const FObjectInitializer& Init );

	UMyCameraComponent* GetMyCamera() const { return static_cast<UMyCameraComponent*>(GetCameraComponent()); }
};

Aside: The component/actor distinction is kind of annoying, because PlayerController and CameraManager handle camera components wrapped in camera actors slightly differently than camera components embedded in other kinds of actors. For CineCamera, it’s doubly-annoying because functionality is split between the two classes, so you lose features like Look at Tracking when embedding the component without the wrapper.

But I digress, for the implementation we do a little math in our override of GetCameraView and sprinkle a magic dependency-injection incantation in the wrapper’s constructor to instantiate the correct subcomponent (incidentally, this is also how you can change the default CharacterMovementComponent in subtypes of Character).

namespace CameraUtil
{

static const FName MEMBER_CameraComponent( "CameraComponent" );

static float CalcAspectRatio( UWorld* W )
{
	if( ULocalPlayer* Player = W->GetFirstLocalPlayerFromController() )
	{
		FVector2D Size( 0.0f, 0.0f );
		Player->ViewportClient->GetViewportSize( Size );
		if( Size.Y > KINDA_SMALL_NUMBER )
			return Size.X / Size.Y;
	}

	/* fallback to default when there's no viewport */
	return 16.0f / 9.0f;
}


static float GetFOVX( float Aspect, float FOVY )
{
	return 2.0f * R2D( FMath::Atan( FMath::Tan( 0.5f * D2R(FOVY) ) * Aspect ) );
}

static float GetFOVY( float Aspect, float FOVX )
{
	return 2.0f * R2D( FMath::Atan( FMath::Tan( 0.5f * D2R(FOVX) ) / Aspect ) );
}

}

/*virtual*/ void UMyCameraComponent::GetCameraView( float DeltaTime, FMinimalViewInfo& DesiredView ) /*override*/
{
	Super::GetCameraView( DeltaTime, DesiredView );

	/* early-out when we're letterboxed */
	if( bConstrainAspectRatio )
		return;

	/* for some reason, DesiredView.AspectRatio is not always right, so actually query the viewport */
	const float ActualAspect = CameraUtil::CalcAspectRatio( GetWorld() );
	const float DesiredFOVY = CameraUtil::GetFOVY( 16.0f / 9.0f, DesiredView.FOV );
	const float AdjustedFOVX = CameraUtil::GetFOVX( ActualAspect, DesiredFOVY );
	DesiredView.FOV = AdjustedFOVX;
}

AMyCamera::AMyCamera( const FObjectInitializer& Init )
	: Super( Init.SetDefaultSubobjectClass<UMyCameraComponent>( CameraUtil::MEMBER_CameraComponent ) )
{
}

Nothing to write home about in this implementation, but I’ll call out a few details:

FOV Tangent Diagram

  • The Aspect Ratio applies to the Tangent of the Half-Angle of the FOV, not the angle itself, so we have to do a coordinate-space sandwich (degrees to tangent of the half-angle, scale, then back to degrees) when we multiply or divide it.
  • I’ve wrapped FMath::RadiansToDegrees and FMath::DegreesToRadians in R2D and D2R macros because they’re annoyingly verbose for such common subroutines.
  • I retrieve the actual-aspect-ratio from the local player, rather than the screen, to account for Split-Screen Multiplayer (I’m assuming all split-screens have the same aspect, so we only need to look up Player 1).

If you’re really in a pinch and can’t control the subtype of CameraActor, then you can use a custom CameraModifier to post-hook the change to DesiredView with the same code. Workflow-wise this is just more annoying in the common-case, so I avoided it as the by-default solution.

Chilling at Valley Peaks

You can wishlist the game on Steam.

13ft, the successor to 12ft for skipping site paywalls

If you used 12ft to inspect the content of paywalled sites before deciding whether it was worth paying, and suffered when it stopped working properly, 13ft is an alternative that has worked well in my tests.

It presents itself as Google's indexing robot, so sites serve up the full content, which it then passes on transparently to your browser.

It's open source and easy to install on Mac, Windows and Linux, but if you prefer (or don't master the packaging of blambers into flearows), you can use it on some public site, such as Killwall.

O artigo "13ft, o sucessor do 12ft para pular paywalls de sites" foi originalmente publicado no site TRILUX, de Augusto Campos.

People used to disappear - forever!

One thing that people of my generation forget, and that current generations can't picture, is that until the mid-90s your close friends were practically a subset of your classmates, neighbours, relatives, and of their colleagues and neighbours.

And when a close friend (schoolmate or neighbour) moved to another city, he vanished from your life forever, all at once, and immediately. It was rare to move already knowing what your new phone number would be, since a line took weeks or months (with luck!) to be installed, and writing personal letters had already gone out of fashion back in the 50s.

This is among the things the internet revolutionized: it is now possible to have friends and contacts with geographic continuity.

And yes, I'm talking about the friendship experience of children and teenagers, but that of adults wasn't much different.

O artigo "As pessoas desapareciam - pra sempre!" foi originalmente publicado no site TRILUX, de Augusto Campos.

If it never breaks, you're doing it wrong

When the power goes out, most people are understanding. Yet the most livid I've seen people is when web apps or computers they use have a bug or go down. But most of the time, it's a really bad sign if this never happens [1].

I was talking to my dad about this recently. For most of his career, he was a corporate accountant for a public utility company. Our professional interests overlap in risk, systems, internal controls, and business processes. These all play into software engineering, but risk in particular is why we should expect our computer systems to fail us.

The power goes out sometimes

As a motivating example, let's talk about the power company. When's the last time you had a power outage? If you're in the US, it's probably not that long ago. My family had our last outage for about an hour last year, and my parents had their power go out for half a day a few weeks ago.

Both of these outages were from things that were preventable.

My family's power outage was because a tree came down on an above ground power line. This could have been prevented by burying the cables. This would take quite a bit of digging, and it's common in a lot of new developments, but where we are everything is above ground for legacy reasons. Or maybe we could have removed more of the trees around the power lines! But that's probably not a great idea, because trees are important for a lot of reasons, including preventing erosion and mitigating floods.

My parents' power outage was from an animal climbing into some equipment (this makes me very sad, poor thing). This could have been prevented by protecting and sealing the equipment. Perhaps there was protection and it was broken, and an inspection could have found it. Or perhaps the equipment needed other forms of protection and sealing.

There are also power failures for reasons that are a failure to recognize and acknowledge risk, or a change to the risk levels. In particular, I think about the failures of Texas's power grid recently. These failures involved an overloading of the grid in a way that was predicted, and resulted in catastrophic failures. The risk that this would happen changed as our climate has changed, and utilities infrastructure is difficult to quickly update to reflect this change in reality [2].

The thing is, all of these interventions are known. We can do all of these things, and they're discussed. Each of them comes with a cost. There are two aspects of this cost: there are the literal dollars we pay to make these interventions, and there is the opportunity cost of what we don't do instead. In a world of limited resources, we must consider both.

When you're deciding which changes to make, you have to weigh the cost of interventions against the cost of doing nothing. Your cost of not doing anything is roughly the probability of an event happening times the expected cost of such an event. You can calculate that, and you should! Whereas your cost of doing an intervention is the cost of the intervention plus any lost gains from the things you opt not to do instead (this can be lost revenue or it can be from other failures you get from doing this intervention over other ones).

What does your downtime cost you?

This all comes back to software. Let's look at an example, using fake numbers for ease of calculation.

Let's say you have a web app that powers an online store. People spend $1 in your shop each minute, and you know you have a bug that gives you a 10% chance of going down for an hour once a month. Should you fix it?

We want to say yes by default, because geez, one hour of downtime a month is a lot! But this is a decision we can put numbers behind. Off the bat, we want to say that the cost of an outage would be 0.1 * 60 * 1, or $6 a month. If your software developers cost you $3/hour, and can fix this in 10 hours, then you'd expect to make a profit on fixing this in five months.
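Spelled out as a tiny calculation (same made-up numbers as above):

p_outage_per_month = 0.10     # 10% chance of an hour-long outage each month
outage_minutes = 60
revenue_per_minute = 1.00     # dollars

expected_loss_per_month = p_outage_per_month * outage_minutes * revenue_per_minute  # $6

dev_rate_per_hour = 3.00
hours_to_fix = 10
cost_to_fix = dev_rate_per_hour * hours_to_fix                 # $30

months_to_break_even = cost_to_fix / expected_loss_per_month   # 5 months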

But this also ignores some real-world aspects of the issue: How will downtime or uptime affect your reputation, and will people still be willing to buy from you? If you're down, do you lose the money or do people return later and spend it (are you an essential purchase)? Are purchases uniformly distributed across time as we used here for simplicity, or are there peak times when you lose more from being down? Is your probability of going down uniform or is it correlated to traffic levels (and thus probably to revenue lost)?

Quantifying the loss from going down is hard, but it's doable. You have to make your assumptions clear and well known.

What do you give up instead?

The other lens to look at this through is what you give up to ensure no downtime. Downtime is expensive, and so is increasing amounts of uptime.

Going from 90% to 99% uptime is pretty cheap. Going from 99% to 99.9% uptime gets a little trickier. And going from 99.9% uptime to 99.99% uptime is very expensive. Pushing further than that gets prohibitively expensive, not least because you will be seeking to be more reliable than the very components you depend on [3]! That shift to be more reliable than the components you use means a significant shift in thinking and how you design things, and it comes with a cost.

When you work to increase uptime, it's at the expense of something else. Maybe you have to cut a hot new feature out of the roadmap in order to get a little more stability. There goes a big contract from a customer that wanted that feature. Or maybe you have to reduce your time spent on resolving tech debt. There goes your dev velocity, right out the window.

This can even be a perverse loop. Pushing toward more stability can increase complexity in your system while robbing you of the time to resolve tech debt, and both complexity and tech debt increase the rate of bugs in your system. And this leads to more instability and more downtime!

There are some team configurations and companies who can set up engineering systems in a way where they're able to really push uptime to incredible levels. What the major cloud providers and CDNs do is incredible. On the other hand, small teams have some inherent limits to what they're able to achieve here. With a handful of engineers you're not going to be able to set up the in-house data centers and power supplies that are necessary to even have a possibility of pushing past a certain point of uptime. Each team has a limit to what they can do, and it gets exceedingly expensive the closer you push to that limit.

Why do people get upset?

An interesting question is why people get upset when software fails, especially when we're not similarly upset by other failures. I'm not entirely sure, since I'm generally understanding when systems fail (this has always been my nature, but it's been refined through my job and experience). But I have a few hypotheses.

  • It's hard to be patient when you have money on the line. If you have money on the line from a failure (commission for people selling the software, revenue for people using it in their business, etc.) then this is going to viscerally hurt, and it takes deliberate effort to see past that pain.
  • We don't see the fallible parts of software. We see power lines every day, and we can directly understand the failures: a tree fell on a line, it's out, makes sense. But with software, we mostly see a thin veneer over the top of the system, and none of its inner workings. This makes it a lot harder to understand why it might fail without being a trained professional.
  • Each failure seems unique. When the power goes out, we experience it the same way each time, so we get used to it. But when a piece of software fails, it may fail in different ways each time, and we don't have a general "all software fails at once" moment but rather many individual softwares failing independently. This makes us never really get used to running into these issues, and they're a surprise each time.
  • We know who to be mad at. When the power goes out, we don't really know who we can be upset at. We shouldn't be upset at the line workers, because they're not deciding what to maintain; who, then? Whereas with software, we know who to be mad at: the software engineers of course! (Let's just ignore the fact that software engineers are not often making the business decision of what to focus development efforts on.)
  • We don't actually get more mad, I just see it more because I'm in software. This one is interesting: we might not actually be more mad when power goes out, I might just be more aware of it. I'm not sure how to check this, but I'd be curious to hear from people in other fields about when things fail and how understanding folks are.

I'm sure there are more reasons! At any rate, it's a tricky problem. We can start to shift it by talking openly about the risk we take and the costs involved. Trade-offs are so fundamental to the engineering process.


Thank you to Erika Rowland for reviewing a draft of this post and providing very helpful feedback!


[1] Exceptions apply in areas that are safety critical, where a failure can result in very real loss of life. Even in these situations, though, it's not crystal clear: Would you rather a hospital invest in shifting from 99.99% power uptime to 99.999%, or spend that same budget on interventions that apply more often? The former saves many lives in the case of an unlikely disaster, while the latter saves fewer lives but does so more certainly in more common situations. We always have limited resources available, and how we spend them reflects trade-offs.

[2] This is not an excuse, though. We saw this coming. Our climate has been changing for quite a while, and people have been predicting changes in load on the grid. But plenty of people want to deny this reality, shift the blame onto other people, or hope for a miraculous solution. Or they simply like to watch the world burn, literally. Either way, now that we're where we are, it's going to be a slow process to fix it.

[3] My friend Erika pointed me to this great short, approachable resource on how complex systems fail. She also has a great note going through four different ways that people use the word "resilience", which is very helpful.

“Self-hosting at home” is good for those who already have it and those who enjoy it, but it's not a smooth path

Once again, a reminder that if you don't already have a server running stably at home, the statement "I'll set up a server at home to solve my need X" is rarely a good solution for that need X.

Besides that:

🫰🏻 It tends not to be an economical solution,

📆 It tends not to be a long-lasting solution, and

⏱️ It tends not to be a quick solution.

The statements above don't always apply to those who already have a stable server running at home, however. Nor to those who have two separate, independent projects: “deploy a server at home” and “solve my need X”.

O artigo "“Self hosted em casa” é bom pra quem já tem e pra quem curte, mas não é um caminho suave" foi originalmente publicado no site TRILUX, de Augusto Campos.

Inside the tiny chip that powers Montreal subway tickets

To use the Montreal subway (the Métro), you tap a paper ticket against the turnstile and it opens. The ticket works through a system called NFC, but what's happening internally? How does the ticket work without a battery? How does it communicate with the turnstile? And how can it be so cheap that you can throw the ticket away after one use? To answer these questions, I opened up a ticket and examined the tiny chip inside.

The image below shows the chip inside the ticket, highly magnified. The four golden squares in the corner are the connections to the antenna. The tan-colored lines are the metal wiring layer on top of the chip; the thickest lines wire the antenna to other parts of the chip. The darker region that takes up the majority of the chip is the chip's digital logic. To the left is the analog circuitry that handles the signal from the antenna.

The MIFARE Ultralight die under the microscope. (Click this image (or any other) for a larger view.)

The chip uses NFC (Near-Field Communication). The idea behind NFC is that a reader (i.e. the turnstile) and an NFC tag (i.e. the ticket) communicate over a short distance through magnetic fields, allowing them to exchange data. The reader generates a magnetic field that both powers the tag and sends data to the tag. Both the reader and the tag have coil-like antennas so the reader's magnetic field can be picked up by the tag [1]. When you tap your ticket on the turnstile, the NFC communication happens in 35 milliseconds, faster than an eyeblink. The data provided by the NFC tag shows that you have a valid ticket and then you can enter the subway.

The photo below shows the subway ticket, made of printed paper [2]. At the right, the ticket appears to have golden smart-card contacts, like a credit card with an EMV chip. However, those contacts are completely fake, just printed onto the card with ink, and there is no chip there. Presumably, the makers thought that making the card look like a smart card would help people understand it. The card actually uses an entirely different technology.

A Montreal subway card. This card is for occasional use and is disposable. Regular travel uses a rigid plastic card containing a different chip.

Although the subway card is paper on the outside, its core is a thin plastic sheet, shown below. The sheet has a coiled antenna made from a layer of metal foil. If you look closely, you can see the tiny NFC chip in the lower right, a black speck connected to two sides of the antenna wire [3]. The diagonal metal stripe in the upper left makes the antenna into a loop; topologically, a spiral antenna won't work on a 2-D sheet, so the diagonal bridge completes the circuit.

The antenna and chip inside the subway card.

I want to emphasize the absurdly small size of the chip: 570 µm × 485 µm. The photo below shows that it is about the size of a grain of salt. The chip is also extremely thin—75 µm or 120 µm—so you can't even feel the chip inside the ticket.

The chip next to grains of salt. I composited two images, one illuminated from above to show the die and one illuminated from below to show the salt.

Functions of the chip

There are many different types of NFC chips with varying levels of functionality [4]. This one is called the MIFARE Ultralight EV1 [5], a low-cost chip designed for one-time ticketing applications. The basic function of the Ultralight chip is simple: providing a block of data to the reader. The chip holds its data in a small EEPROM; this chip has 48 bytes of user memory, while another variant has 108 bytes of user memory.

The Ultralight chip lacks the cryptography support found in more advanced chips. The Ultralight isn't much more secure than a printed ticket with a QR code or barcode, like you'd download for a show. It's up to the reader to validate the data and make sure the same ticket isn't being used multiple times.6

The Ultralight chip has a few features beyond a printed ticket, though. The chips are manufactured with a unique 7-byte identification code (UID). Moreover, the UID is signed, ensuring that fake UIDs cannot be generated.7 The chip also supports password-protected memory access and locking of memory pages to prevent modification. Since the password is transmitted without encryption, the security is weak, but better than nothing.8

Another interesting feature of the chip is the one-way counter. The chip has three 24-bit counters that can be incremented but not decremented. The counters can be used to allow the ticket to be used a particular number of times, for instance.9
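
To make the counter scheme concrete, here is a minimal sketch (my own illustration in Rust, not code from the chip or its datasheet) of how a ticketing backend might model a saturating 24-bit one-way counter that permits a fixed number of uses:

// Illustrative model of a 24-bit one-way counter; not real chip code.
const LIMIT: u32 = 0xFF_FFFF; // all-ones for 24 bits

struct OneWayCounter {
    value: u32, // only the low 24 bits are meaningful
}

impl OneWayCounter {
    // To allow `uses` rides, start the counter `uses` steps below the limit.
    fn with_remaining_uses(uses: u32) -> Self {
        OneWayCounter { value: LIMIT - uses }
    }

    // Increment only; the counter refuses to move past the all-ones limit.
    fn increment(&mut self, n: u32) -> Result<(), &'static str> {
        match self.value.checked_add(n) {
            Some(v) if v <= LIMIT => { self.value = v; Ok(()) }
            _ => Err("counter limit reached"),
        }
    }

    fn remaining_uses(&self) -> u32 {
        LIMIT - self.value
    }
}

fn main() {
    let mut counter = OneWayCounter::with_remaining_uses(5);
    for ride in 1..=6 {
        match counter.increment(1) {
            Ok(()) => println!("ride {ride} accepted, {} left", counter.remaining_uses()),
            Err(e) => println!("ride {ride} rejected: {e}"),
        }
    }
}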

Photographing the chip

To photograph the chip, I went through several steps to remove the chip from the ticket and then strip the chip down to the bare silicon. First, to extract the plastic sheet with the chip and the antenna from the paper ticket, I simply soaked the ticket in water. This turned the paper into mush, which could be scraped off to reveal the plastic core. Next, I cut out a small square of plastic that included the chip and put it in boiling sulfuric acid for about 30 seconds. This removed the plastic and adhesive, leaving the silicon die. (I try to avoid boiling acids, but processing a tiny chip like this only required a few drops of sulfuric acid, minimizing the risk.)

The die was covered with a passivation layer to protect its surface, a sandwich of silicon nitride and PSG (phosphosilicate glass) 1.1 µm thick according to the datasheet. The chip's underlying circuitry was visible, but slightly hazy due to this layer. I removed the passivation layer by boiling the chip in phosphoric acid for a few minutes. The image below shows the chip after this step. The top metal layer is much more visible, although some of the metal was dissolved by the acid. The thick metal lines connect the four bond pads to various parts of the analog circuitry, while many thin vertical metal lines provide interconnections of the logic circuitry.

The die after treatment with phosphoric acid to remove the passivation layer. Click for a much larger version.

Next, I put the die through several cycles of Armour Etch to dissolve the oxide layer and hydrochloric acid to dissolve the metal. I think the chip had three layers of metal wiring on top of the silicon. Unfortunately, my process doesn't remove the metal layers cleanly, but causes them to come off in chaotic tangles. Since I wasn't interested in tracing the circuitry layer by layer, this wasn't a significant problem.

With the metal layers and polysilicon removed, I was left with the bare silicon. At this point, the underlying structure of the chip is visible. The doped silicon regions show the transistors, although they are extremely small at this scale. The white rectangles are capacitors. The chip has capacitors for many reasons: producing the right resonant frequency with the antenna, filtering the power, and boosting the voltage with charge pumps.

The die after stripping it down to the silicon.

My biggest concern while processing this chip was to avoid losing it. With a chip this small, bumping the chip or even breathing on it can send the chip flying perhaps never to be seen again. Even trying to pick up the chip with tweezers is risky, since it can easily pop out and disappear. It's no fun examining the floor, inch by inch, trying to figure out if a speck is the lost chip or a bit of dirt. I found that the best way to move the chip between processing and a microscope slide was to put the chip in a few drops of water and move it with a pipette. Even so, there were a couple of times that I lost track of the chip and had to check some specks under the microscope to determine which was the chip and which were dirt.

Overview of the chip

The block diagram below shows the high-level structure of the chip. At the left, the antenna is connected to the RF interface, the analog circuitry that converts the high-frequency signals into digital data. This circuitry also extracts power from the antenna's signal to power the chip.

Block diagram of the MIFARE Ultralight chip, from the datasheet.

The majority of the chip contains digital logic to process the 18 different commands that it can receive from the reader. Some commands, such as Wake-up or Halt, control the chip's state. Other commands, such as Read or Write, provide access to the EEPROM storage. The specialized Read_Cnt and Incr_Cnt commands access the chip's counters.

The chip has an "intelligent anticollision function" that allows multiple cards to be read without conflict if they are presented to the reader simultaneously. If a conflict is detected, the reader uses a standard NFC algorithm to select the cards one at a time, based on their identification numbers. The anticollision algorithm uses four of the chip's commands.

Finally, the chip has an EEPROM to store its data. Unlike RAM, the EEPROM holds data even when unpowered; it is designed to hold data for 10 years. To store data in the EEPROM, it must be written with a higher voltage than the rest of the chip uses. The EEPROM interface circuit produces the necessary signals.

The diagram shows the chip with its functional blocks labeled. The majority of the die is occupied with digital logic; I'll explain below how it is implemented with standard-cell logic. At the top is the EEPROM, a square of storage cells. To the right of the EEPROM is a charge pump, a circuit to boost the voltage through switched capacitors. The EEPROM interface circuitry is between the EEPROM and the digital logic.

The die, stripped down to the silicon, with presumed functional blocks labeled.

The remainder of the chip contains analog circuitry that is harder to interpret, so my labels are somewhat speculative. The four bond pads are where the antenna is connected to the chip. There are four pads to support two parallel antennas if desired. The first die photo shows the metal wiring between the bond pads and the structures that I've labeled as RF transistors and RF diodes. The "RF transistors" in the upper left are large, oval-shaped structures. These may be the transistors that send data back to the reader by modifying the load. Alternatively, they could be Zener diodes to regulate the voltage powering the chip, since Zener diodes often have an oval shape. The "RF diodes" at the bottom may rectify the signal from the antenna, producing the power for the chip. The rectified signal is also demodulated and processed by the analog logic to extract the digital data sent from the reader.

Sending data from the tag to the reader: load modulation

You might expect the tag to send data back to the receiver by transmitting a signal through the antenna. However, transmitting a signal takes power and the tag doesn't have much power available, just the power that it extracts from the reader's signal. Instead, the tag uses a clever technique called load modulation to send data to the reader. The idea is that if the tag changes the load across the antenna, it will absorb more or less energy from the reader. The reader can detect this change as a small variation in voltage across its transmitting antenna. Thus, the tag can dynamically change its load to send data back to the reader. Even though the signal produced by load modulation is extremely weak (80 dB less than the transmitted signal), the reader can detect it and extract the data.

In more detail, the reader transmits at a carrier frequency of 13.56 MHz.10 To send data back, the tag switches its load on and off at 848 kHz (1/16 of the carrier frequency), producing a subcarrier on top of the reader's signal. This load modulation is then keyed on or off to transmit data at 106 kilobits per second (1/8 of the subcarrier frequency). The reader, in turn, extracts the subcarrier with a filter to receive the data bits from the tag.
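
The numbers in the previous paragraph are just successive divisions of the carrier frequency; the short sketch below (purely illustrative, in Rust) reproduces them:

fn main() {
    let carrier_hz = 13_560_000.0_f64;     // the reader's 13.56 MHz carrier
    let subcarrier_hz = carrier_hz / 16.0; // load switched at ~848 kHz
    let bit_rate = subcarrier_hz / 8.0;    // ~106 kilobits per second
    println!("subcarrier: {:.1} kHz", subcarrier_hz / 1_000.0); // 847.5 kHz
    println!("bit rate:   {:.1} kbit/s", bit_rate / 1_000.0);   // 105.9 kbit/s
}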

An NFC tag can apply a load that is either a resistor or a capacitor; a resistor absorbs the signal directly, while a capacitor changes the antenna's resonant frequency and thus the amount of signal transferred to the tag. The die contains many capacitors, but I didn't see any significant resistors, so I suspect that this chip uses a capacitor for the load.

The chip's manufacturing process

The image below shows an extreme closeup of the die. The red box surrounds a region of doped silicon, forming five MOS transistors in series. Each dark vertical line corresponds to the gate of one transistor, so the width of this line corresponds to the feature size. I estimate that the chip's feature size is 180 nm. In comparison, the wavelength of visible light is 400-700 nm. Since the features are smaller than the wavelength of light, it's not surprising that the image appears blurry.

A closeup of the die, pushing the limits of my microscope.

The 180 nm process was popular in the late 1990s. These features are very large, however, compared to recent chips with features that are a few nanometers across. At the time the MIFARE Ultralight EV1 chip was released (October 2012), the newest semiconductor manufacturing process was 22 nm, so the 180 nm process they used was old even then.

However, it makes sense that the chip would be manufactured with an older process for several reasons. First, much of the chip's area is occupied by analog circuitry and the four bond pads, so shrinking the digital logic won't reduce the overall size much. Moreover, a significantly smaller chip would be impractical to attach to the antenna; I expect even the current chip is a pain to mount. Finally, this chip is designed for the extremely low-cost (i.e. disposable) market, so the chip is manufactured as inexpensively as possible. With a more modern process, more chips would fit on a wafer, dropping the price, but manufacturing each wafer would be more expensive, so there is a tradeoff.

Standard-cell logic

The chip's digital circuitry is implemented with standard-cell logic, a common way of implementing digital logic. The idea behind standard-cell logic is to use automated tools to create the chip layout from a description of the desired logic. The process starts with a library of standard cells. Each cell is a standardized implementation of a simple circuit such as a NAND gate or a flip-flop. The cells are designed so they have a fixed height and can be arranged in rows. The cells are then connected by metal wiring on top of the cells to produce the desired circuitry. Although the resulting circuitry isn't as dense and efficient as a fully customized and optimized layout, standard cell logic is much faster (and thus cheaper) to design than a hand-tuned layout. Thus, standard-cell logic has been heavily used for integrated circuit design since the 1980s.

The photo below shows four rows of gates implemented with standard-cell logic. The chip (like most modern chips) uses CMOS logic, with each logic gate built from two types of transistors: NMOS and PMOS. To simplify manufacturing, the NMOS and PMOS transistors are arranged in separate rows. Thus, each row of logic consists of a row of PMOS transistors on top and a row of NMOS transistors below, or vice versa. Due to the physics of semiconductors, the PMOS transistors are larger, which allows the transistor types to be distinguished in the image.

A closeup of the standard cell logic.

Looking at some of the cells and extrapolating, I estimate about 8000 gates in the logic section with about 45,000 transistors. One question is if the chip is implemented as a hardcoded state machine, or if it contains a processor (microcontroller). The transistor count is barely large enough to implement a simple microcontroller such as an 8051, but that wouldn't leave many transistors left over for other necessary circuitry. If a microcontroller were present, it would need software stored somewhere. Given the simplicity of the protocol and the relatively small number of transistors, my guess is that the chip is implemented in hardware (state machines and counters) rather than through a microcontroller.

The diagram below shows how a standard cell implements a 2-input NAND. (This cell is from the Intel 386, not the NFC chip, but the structures are similar.) The cell contains four transistors. The yellow region is the P-type silicon that forms two PMOS transistors; the transistor gates are where the polysilicon (red) crosses the yellow region. (The middle yellow region is the drain for both transistors; there is no discrete boundary between the transistors.) Likewise, the two NMOS transistors are at the bottom, where the polysilicon (red) crosses the active silicon (green). The blue lines indicate the metal wiring for the cell. The black circles are contacts, connections between the metal and the silicon or polysilicon. Finally, the well taps are the opposite type of silicon, connected to the underlying silicon well or substrate to keep it at the proper voltage.

A standard cell for NAND in the Intel 386.

EEPROM

The chip stores its data in an EEPROM, similar to flash memory. The chip provides 640 or 1312 bits of EEPROM, based on the part number; I believe both versions use the same EEPROM implementation, but the cheaper version limits the amount that can be used. I think the EEPROM is the matrix shown below, with row and column drive circuitry to the right and below. (The diagonal lines are accidental scratches while I was processing the chip.)

A closeup of the presumed EEPROM circuitry on the die.

In the photo, the EEPROM appears to be a 64×64 grid, 4K bits of storage rather than the advertised 1312 bits. There are several possible explanations. First, I could be miscounting the capacity (it is easy to be off by a factor of 2, depending on the cell structure). Second, the chip stores data that isn't reflected in the EEPROM memory map; for instance, the one-way counters and the UID signature are not included in the EEPROM storage count. Another possibility is that the extra EEPROM space holds code for a microcontroller (if the chip has one).

An EEPROM requires a relatively high voltage (10-20V) to force electrons into the storage cell for a bit. This voltage is generated by a charge pump circuit that switches capacitors at high frequency to boost the voltage. To the right of the EEPROM is a circuit with several large capacitors, presumably the charge pump.

Conclusions

It's remarkable that these NFC chips can be manufactured so cheaply that they are disposable. To keep the price down, the chips are sold by the wafer and then mounted in the tickets.11 You can buy an eight-inch silicon wafer with the chips for $9000 from Digikey. This may seem expensive until you realize that a single wafer provides an astonishing 100,587 chips, yielding a per-chip price of nine cents. According to the datasheet, a wafer has 103,682 potential good dies per wafer (PGDW). Some dies will be faulty, of course, so the wafer comes with a file telling you which dies are the good ones, 97% of them. (During the manufacturing of a typical chip, the faulty ones are marked with a spot of ink. But that won't work in this case since each die is much smaller than an ink spot.) If you need more chips, you can buy a 12" wafer for $19,000, providing 215,712 chips. A ticket manufacturer mounts each chip on an antenna sheet and then prints the ticket, adding a few cents to the cost of the ticket. The result is an inexpensive ticket that can be used once and discarded.
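
As a quick sanity check on the per-chip price, the arithmetic (using only the prices and good-die counts quoted above) works out as follows; this is an illustrative calculation, not additional vendor data:

fn main() {
    // 8-inch wafer: $9,000 for 100,587 good dies
    let eight_inch = 9_000.0_f64 / 100_587.0;
    // 12-inch wafer: $19,000 for 215,712 good dies
    let twelve_inch = 19_000.0_f64 / 215_712.0;
    println!("8-inch wafer:  {:.1} cents per chip", eight_inch * 100.0);  // ~8.9 cents
    println!("12-inch wafer: {:.1} cents per chip", twelve_inch * 100.0); // ~8.8 cents
}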

I'll leave you with one last die photo. In my first attempt at processing the chip, I treated it with Armour Etch. Although this failed to remove the passivation layer, it thinned it slightly, enough to generate some wild colors due to thin-film interference. I call this the "tie die".

The die after treatment with Armour Etch.

Follow me on Twitter @kenshirriff or RSS for more. I'm also on Mastodon as oldbytes.space@kenshirriff. If you're interested in this type of chip, a few years ago, I looked at two RFID race timing chips, the Monza R4 and Monza R6.

Notes and references

  1. Because the card and the reader are positioned close together, the two antennas use "inductive coupling", coupled by magnetic fields rather than radio waves. That is, the two antennas act like transformer windings, transmitting the signal from the reader to the card. 

  2. The Montreal subway uses multiple types of cards. In this blog post, I examine the Occasional card (L'Occasionnelle). This is a non-rechargeable card that works for a single trip or up to three days, and then is discarded. For long-term usage, Montreal uses the Opus card, which provides more security and implements the Calypso standard. An Opus card is plastic rather than paper, giving it a longer life. The Calypso standard is much more secure, using cryptography such as AES, DES, and ECC (spec) and provides much larger EEPROM storage. Thus, the transit system uses the Occasional card for cheap, disposable tickets and the Opus card for a long-term ticket, where spending a dollar or two on the physical card isn't an issue.

    I haven't examined an Opus card, so I don't know what type of chip it uses or even who manufactures the chip. Many companies produce Calypso cards; for instance, the STMicroelectronics CD21 Calypso chip is based on an Arm core. 

  3. If you look closely at the lower right corner of the NFC card, it has three positions that can hold a chip, with the chip in position #3. Presumably, this allows three different NFC chips to be mounted in one card, so one card could have three functions. The NFC protocol is designed to avoid collisions if multiple chips respond, so the three chips won't interfere with each other. 

  4. You can easily examine NFC cards like this using your phone, with an app such as NFC Tools or NXP's Taginfo. Tapping a card will display the type of the card and allow the memory to be read (subject to security restrictions). It's entertaining to tap various NFC cards and see what type of chip they use; I found that hotels typically use the MIFARE Classic chip, more advanced than the MIFARE Ultralight chip in the subway ticket.

    The NFC Tools app shows that this card is a MIFARE Ultralight EV1.

  5. The part number, as provided by the chip, is MF0UL1101DUx. "MF0UL" indicates the MIFARE Ultralight EV1, a chip in the Ultralight family manufactured by NXP. An "H" if present indicates 50 pF input capacitance, rather than 17 pF in the chip I examined, allowing a different antenna. Next, "1" indicates a chip with 384 bits of user memory, while "2" would indicate 1024 bits. This is followed by "101D", and then a code indicating the specific package: "U" indicates a wafer, while "A" indicates a plastic leadless module carrier (LCC). Other characters specify the wafer diameter and thickness. 

  6. It is instructive to think about the security of a printed ticket for a concert with a barcode. You could print out a hundred copies of the ticket, but it will only get you into the concert once. (This assumes that the venue has a centralized database so they can keep track of which tickets have been scanned.) Most of the security is implemented in the backend system, not the ticket itself. The ticket numbers need to be unforgeable, either by generating random numbers or using cryptography. (If the tickets just have QR codes with the numbers 1 to 100, for instance, it would be trivial to make fake tickets.) Moreover, there is nothing to ensure that the person scanning the ticket is legitimate; someone malicious could scan your ticket in line, print out a copy, and get into the concert instead of you. The MIFARE Ultralight chip is similar to a paper ticket in many ways with only slightly more security. 

  7. The UID signing is done with an ECC (elliptic-curve cryptography) algorithm. Note that the chip doesn't need any cryptographic support for this; the chip just holds the signature that was programmed during manufacturing. As far as the chip is concerned, it is just providing some stored bytes. 

  8. The MIFARE Ultralight has enough security to work as a limited-use ticket, but more advanced applications such as reloadable stored-value cards require a chip that supports encryption such as the DESFire. This allows the market to be partitioned, with the inexpensive Ultralight supporting the low-end market, while the more costly DESFire is required for more advanced applications.

    There are many types of MIFARE cards and it's hard to keep them straight, but the diagram below from NXP may help. The different families are arranged left to right: Ultralight, Classic, Plus, DESFire, and SmartMX. The Y dimension indicates the official security certification level. The Z dimension (front to back) shows the evolution within a family over time. I've added a red arrow to indicate the "Ultralight EV1" chip, the focus of this blog post. (Personally, if you need a three-dimensional diagram to explain your product line, the product line may be excessively complicated.)

    The various MIFARE NFC types. Diagram from a MIFARE Plus Product Family document.

  9. In more detail, a 3-byte counter can be incremented by a specified value until it reaches the all-1's state (0xFFFFFF), at which point it stops. If you wanted to allow, say, 5 uses of a ticket, you could initialize the counter to all-1's minus 5. Then the counter could be incremented 5 times before reaching the limit.

    One complication is that the counters have an "anti-tearing" feature for additional security. The problem is that if you tear the card away from the reader in the middle of an update, there is a possibility for counters to be partially updated, yielding a bad result. The anti-tearing feature ensures that a counter will be atomically updated, avoiding a partial update. 

  10. There are multiple NFC standards with differences in speed, protocol, and range, including NFC-A, NFC-B, NFC-F, and NFC-V. The MIFARE Ultralight cards use NFC-A, which is defined by the standard "ISO/IEC 14443 Type A". Annoyingly, each part of the standard costs $70. The NFC Forum Analog Technical Specification provides a lot of detail, though. 

  11. Instead of a wafer, you can buy the chips on tape but it costs more than twice as much. 

Win the Medals While Young

One of the most frequently asked questions I hear from junior programmers (no matter their age) is: What should I focus on now to build the best career I can? There are multiple options, including creating a startup, getting a PhD, contributing to open source, working for Google, and many others. In my opinion, the most common mistake is trying to get rich fast. Obviously, money matters and is the ultimate metric of career success, but trying to get it too early is nothing more than gambling with your life at stake. Instead, I suggest focusing on winning some “medals,” which can later be converted to cash, not the other way around.

The Elusive Avengers (Неуловимые мстители, 1966) by Edmond Keosayan

When your career is young (no matter your age), people with money, whether employers or investors, are very hesitant to trust you with it. Even if your skills are strong or your pitch looks promising, the list of achievements on your CV is still pretty short or simply empty. In their eyes, you are a junior and therefore, very unreliable.

You may get rich from this position of zero reliability, but it will mostly be a matter of luck. Having no leverage, you will lose a very valuable resource—your time. Initially, jumping from company to company, you might get a 25% raise every year, but in a few years, the growth will slow down, and eventually, you will become a middle-level programmer with almost no chances of getting truly rich. You will be an old sergeant under the command of a much younger colonel. You don’t want this to happen.

A much better alternative for a junior is being a hero while young. Earn some medals: prove your exceptional value and become a member of the elite. Here is an incomplete list of medals you can put on your CV as a software engineer (the most respected at the top):

  • ACM or IEEE award winner
  • ICPC finalist or winner
  • SPLASH best paper award winner
  • Creator of a 20K+ stars GitHub project (not “awesome-“)
  • Author of a book published by O’Reilly
  • PhD (preferably from MIT or Stanford)
  • Java Champion
  • Author of 25+ merged pull requests into the Linux kernel
  • 3000+ rating on Codeforces
  • Winner of $100K at Kaggle
  • Oracle/IBM/Microsoft certificate holder
  • 1K+ stars GitHub project
  • 50K+ StackOverflow reputation
  • A-class conference org-team member
  • Industry conference speaker
  • InfoQ, DZone, or Habr author
  • Local workshop organizer

Some of these medals may take more than five years to earn. Of course, you must make some money while working on them. The money may be, and will be, smaller than what your friends are getting. Don’t pay attention to this. Eventually, you will get even. Big time.

BTW, I borrowed the idea of “medals” from Alexander Panov, the founder of Neiry, who I had a chance to video-interview recently (watch the video, he says what I’m saying in this blog post, but without as many details).

My impressions of ReScript

I maintain a GitHub Action called check-for-changed-files. For the purpose of this blog post, what the action does isn't important, but the fact that I originally authored it in TypeScript is. See, one day I tried to update the NPM dependencies. Unfortunately, that update broke everything in a really bad way due to how the libraries I used to access PR details changed and how the TypeScript types changed. I had also gotten tired of updating the NPM dependencies for security concerns I didn't have, since this code is only run in CI by others for their own use (i.e. regex denial-of-service isn't a big concern). As such, I was getting close to burning out on the project: it was nothing but a chore to keep it up-to-date, and I wasn't motivated to keep the code current since TypeScript felt more like a cost than a benefit for such a small code base where I'm the sole maintainer (there's only been one other contributor to the project since the initial commit 4.5 years ago). I converted the code base to JavaScript in hopes of simplifying my life, and it went better than I expected, but it still wasn't enough to keep me interested in the project.

And so I did what I needed to in order to be engaged with the project again: I rewrote it in another programming language that could run easily under Node. 😁 I decided I wanted to do the rewrite piecemeal so I could quickly tell whether I was going to like the eventual outcome, rather than doing a complete rewrite from scratch and being unhappy with where I ended up (doing this while on parental leave made me prioritize my spare time immensely, so failing fast was paramount). During my parental leave I learned Gleam because I loved their statement on expectations for community conduct on their homepage, but while it does compile to JavaScript, I realized it works better when JavaScript is used as an escape hatch rather than when Gleam is used to port an existing code base, and so it wasn't a good fit for this use case.

My next language to attempt the rewrite with was ReScript, thanks to my friend Dusty liking it. One of the first things I liked about the language was that it had a clear migration path from JavaScript to ReScript in 5 easy steps. And since step 1 was "wrap your JavaScript code in %%raw blocks and change nothing" and step 5 was the optional "clean up" step, there were really only 3 main steps (I did have a hiccup with step 1, though, due to a bug not escaping backticks for template literals appropriately, but it was a mostly mechanical change to undo the template literals and switch to string concatenation).

A key thing that drew me to the language is its OCaml history. ReScript can have very strict typing, but ReScript's OCaml background also means there's type inference, so the typing doesn't feel that heavy. ReScript also has a functional programming leaning which I appreciate.

💡
When people say "ML" for "machine learning" it still throws me as I instinctively think they are actually referring to "Standard ML".

But having said all of that, ReScript does realize folks will be migrating or working with a preexisting JavaScript code base or libraries, and so it tries to be pragmatic for that situation. For instance, while the language has roots in OCaml, the syntax feels comfortable to JavaScript developers. While supporting a functional style of programming, the language still has things like if/else and for loops. And while the language is strongly typed, ReScript has things like its object type, where the types of the fields can be inferred based on usage, to make it easier to bring over JavaScript objects.

As part of the rewrite I decided to lean in on testing to help make sure things worked as I expected them to. But I ran into an issue where the first 3 testing frameworks I looked into didn't work with ReScript 11 (which came out in January 2024 and is the latest major version as I write this). Luckily the 4th one, rescript-zora, worked without issue (it also happens to be by my friend, Dusty, so I was able to ask questions of the author directly 😁; I initially avoided it so I wouldn't pester him about stuff, but I made up for it by contributing back). Since ReScript's community isn't massive, it isn't unexpected to have some delays in projects keeping up with new releases. Luckily the ReScript forum is active, so you can get your questions answered quickly if you get stuck. But aside from this hiccup and the one involving %%raw and template literals, the process was overall rather smooth.

In the end I would say the experience was a good one. I liked the language and transitioning from JavaScript to ReScript went relatively smoothly. As such, I have ported check-for-changed-files over to ReScript permanently in the 1.2.1 release, and hopefully no one noticed the switch. 🤞

€3.50 Red Wine by Glass

See this image again, but printed on paper. Get the book.

I published a photo book that presents the landscape and environment of Hydra, Greece. In October 2023, I spent a week on the island. The photography gods were with me, and the photographs I took formed a cohesive narrative. I wrote about this earlier in the Hydra blog – check it out for some commentary and photos of my time there.

My time on Hydra led me to develop a personal take on the spirit of the place. After developing and scanning the film negatives, I was encouraged to see that my photographs reflected this interpretation of the island. A day walk on the island to the summit of Mt Eros led to the eventual narrative of the book.

€3.50 Red Wine by Glass

Through photography, the book tells a story starting at Hydra township, walking up to the Prophet Elias Monastery, to the peak of Mt Eros, along the remote spine of the island, descending to the coast, and ending back at the township for sunset. The fifty-two photographs are a mixture of landscapes and documentary-style shots.

€3.50 Red Wine by Glass

The book consists of A4 landscape pages enclosed in a hard cover. The photographs were taken on the Olympus OM-1 35mm film camera with Kodak Portra 160 film.

There are still copies available. Reach out to me at [email protected] and I’ll send you the book.

Of Hydra

in-place construction seems surprisingly simple?

introduction

I've been thinking a little bit about self-referential types recently, and one of their requirements is that a type, when constructed, has a fixed location in memory 1. That's needed because a reference is a pointer to a memory address - and if the memory address it points to changes, that would invalidate the pointer, which can lead to undefined behavior.

1

Eric Holk and I toyed around for a little while with the idea of offset-schemes, where we only track offsets from the start of structs. But that gets weird once you involve non-linear memory such as when using references in the heap. Using real, actual addresses is probably the better solution since that should work in all cases - even if it brings its own sets of challenges.

Where this becomes tricky is that returning data from a function constitutes a move. A constructor that creates a type within itself and returns it will change the address of the type it constructs. Take this program, which constructs a type Cat in the function Cat::new (playground):

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let cat = Cat::new(4);
    dbg!(addr_of!(cat));      // ← second call to `addr_of!`
}

If we run this program we get the following output:

[src/main.rs:7:9] addr_of!(this) = 0x00007ffe0b3575d7 # first call
[src/main.rs:14:5] addr_of!(cat) = 0x00007ffe0b357747 # second call

This shows that the address of Cat within Cat::new is different from the address once it has been returned from the function. Returning types from functions means changing addresses. Languages like C++ have ways to work around this by providing something called move constructors, which enable types to update their own internal addresses when they are moved in memory. But instead, what if we could just construct types in-place so they didn't have to move in the first place?

in-place construction

We already perform in-place construction when desugaring async {} blocks, so this is something we know how to do. And in the ecosystem there are also the moveit and ouroboros crates. These are all great, but they all do additional things like "self-references" or "move constructors". Constructing in-place rather than returning from a function can be useful just for the sake of reducing copies - so let's roll our own version of just that.

The way we can do this is by creating a stable location in memory where we can store our value. Rather than returning a value, a constructor should take a mutable reference to this memory location and write directly into it instead. And because we're starting with a location and writing into it later, this location needs to be MaybeUninit. If we put those pieces together, we can adapt our earlier example to the following (playground):

use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { 
           addr_of_mut!((*this).age).write(age);
           dbg!(addr_of!(*this));   // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot));      // ← first call to `addr_of!`
    Cat::new(4, &mut slot);
    let cat: &mut Cat = unsafe { (slot).assume_init_mut() };
    dbg!(addr_of!(*cat));      // ← third call to `addr_of!`
}

If we run the program it will print the following:

[src/main.rs:15:5] addr_of!(slot) = 0x00007ffc9daa590f  # first call
[src/main.rs:9:9] addr_of!(*this) = 0x00007ffc9daa590f  # second call
[src/main.rs:18:5] addr_of!(*cat) = 0x00007ffc9daa590f  # third call

To folks who aren't used to writing unsafe Rust this might look a little overwhelming. But what we've done is a fairly mechanical translation. Rather than returning the type Self, we've created a MaybeUninit<Self> and passed it by reference. The constructor then writes into it, initializing the memory. From that point onward, all references to Cat are valid and can be assumed to be initialized.

Unfortunately calling assume_init on the actual value of Cat is not possible because the compiler treats that as a move - which makes sense since it takes a type and returns another. But that's mostly a limitation of how we're doing things - not what we're doing.
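
One way to tidy up the calling side is a small helper that owns the assume_init_mut call. This is just a sketch of the pattern (not an existing API; the emplace_in name is made up here), and the safety argument still rests on the closure fully initializing the slot:

use std::mem::MaybeUninit;

// Sketch of a helper, not a real library API: run an in-place constructor
// against a caller-provided slot and hand back a reference to the
// now-initialized value. The value never moves after construction.
//
// Safety contract: `init` must fully initialize `slot`; otherwise the
// returned reference would point at uninitialized memory.
unsafe fn emplace_in<'a, T>(
    slot: &'a mut MaybeUninit<T>,
    init: impl FnOnce(&mut MaybeUninit<T>),
) -> &'a mut T {
    init(slot);
    // SAFETY: the caller promises that `init` initialized the slot.
    unsafe { slot.assume_init_mut() }
}

// Usage with the `Cat::new(age, slot)` constructor from the example above:
//
//     let mut slot = MaybeUninit::uninit();
//     let cat: &mut Cat = unsafe { emplace_in(&mut slot, |s| Cat::new(4, s)) };
//     dbg!(cat.age);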

indirect in-place construction

Now what happens if there is a degree of indirection? What if, rather than constructing just a Cat, we want to construct a Cat inside of a Bed? We would have to take the memory location of the outer type and use that as the location for the inner type. Let's extend our first example by doing exactly that (playground):

use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> Self {
        let cat = Cat::new(4);
        Self { cat }
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> Self {
        let this = Self { age };
        dbg!(addr_of!(this)); // ← first call to `addr_of!`
        this
    }
}

fn main() {
    let bed = Bed::new();
    dbg!(addr_of!(bed));      // ← second call to `addr_of!`
    dbg!(addr_of!(bed.cat));  // ← third call to `addr_of!`
}

If we run the program it will print the following:

[src/main.rs:15:9] addr_of!(this) = 0x00007fff3910702f
[src/main.rs:22:5] addr_of!(bed) = 0x00007fff391071b7
[src/main.rs:23:5] addr_of!(bed.cat) = 0x00007fff391071b7

Adapting our return-based example to preserve referential stability is once again very mechanical. Rather than returning Self from a function, we pass a mutable reference to MaybeUninit<Self>. In order for Cat to be constructed in Bed, all we have to do is make sure Bed contains a slot for Cat to be written to. Put together, we end up with the following (playground):

use std::ptr::{addr_of, addr_of_mut};
use std::mem::MaybeUninit;

struct Bed { cat: MaybeUninit<Cat> }
impl Bed {
    fn new(slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        Cat::new(4, unsafe { &mut (*this).cat });
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8, slot: &mut MaybeUninit<Self>) {
        let this: *mut Self = slot.as_mut_ptr();
        unsafe { 
            addr_of_mut!((*this).age).write(age);
            dbg!(addr_of!(*this)); // ← second call to `addr_of!`
        };
    }
}

fn main() {
    let mut slot = MaybeUninit::uninit();
    dbg!(addr_of!(slot));      // ← first call to `addr_of!`
    Bed::new(&mut slot);
    let bed: &mut Bed = unsafe { (slot).assume_init_mut() };
    dbg!(addr_of!(*bed));      // ← third call to `addr_of!`
}

If we run the program, it will print the following addresses. They are all the same because Cat is the only field inside of Bed, so they all happen to point to the same memory location:

[src/main.rs:23:5] addr_of!(slot) = 0x00007fff8271d86f   # first call
[src/main.rs:17:9] addr_of!(*this) = 0x00007fff8271d86f  # second call
[src/main.rs:26:5] addr_of!(*bed) = 0x00007fff8271d86f   # third call

future possibilities

If we squint here, it's not hard to see how this could be converted into a language feature. In this post we've mechanically performed a transformation by hand. Rather than returning a type T from a function, we're taking a &mut MaybeUninit<T> and writing into that. It feels like it's basically just a spicy return, and it seems like something we could introduce some kind of notation for.

Though admittedly things get trickier once we also want to enable self-references, phased initialization, immovable types, and so on. But those all depend on being able to write to a fixed place in memory - and it feels like perhaps these are concepts which we can decouple from one another? Anyway, if we just take in-place construction as a feature, I think we might be able to get away with something like this for our first example:

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self { // ← new notation
        Self { age }
    }
}

fn main() {
    let cat = Cat::new(4);
    // ^ cat was constructed in-place
}

This is obviously not a new idea - but it stood out to me how simple the actual underpinnings of in-place construction seem. The change from a regular return to an in-place return feels mechanical in nature - and that seems like a good sign. For good measure let's also adapt our second example:

use std::ptr::addr_of;

struct Bed { cat: Cat }
impl Bed {
    fn new() -> #[in_place] Self {         // ← new notation
        Self { cat: Cat::new(4) }
        //     ^ cat was constructed in-place
    }
}

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> #[in_place] Self {  // ← new notation
        Self { age }
    }
}

fn main() {
    let bed = Bed::new();
    // ^ bed was constructed in-place
}

Admittedly I haven't read up on the 10-year discourse of placement-new, so I assume there are plenty of details left out that make this hard in the general sense. Things like heap-addresses and intermediate references. But for the simplest case? It seems surprisingly doable. Not quite something which we can proc-macro - but not far off either. And maybe scoping that as a first target would be enough? I don't know.

the connection to super let

edit 2024-06-25: This section was added after first publishing this post.

Jack Huey reached out after this post was published and mentioned there might be a connection with the super let feature. I think that's a super interesting point, and it's not hard to see why! Take this example from Mara's post:

let writer = {
    println!("opening file...");
    let filename = "hello.txt";
    super let file = File::create(filename).unwrap();
    Writer::new(&file)
};

The super let file notation here allows the file's lifetime to be scoped to the outer scope, making it valid for Writer to take a reference to file and return. Without super let this would result in a lifetime error. Its vibes are very similar to the #[in_place] notation we posited in this post.

Perhaps a synthesis of both features could exist to create a form of "generalized super scope" feature? There definitely appears to be some kind of connection. Like, we could imagine writing something like super let to denote a "value which is allocated in the caller's frame", and a function returning super Type to signal from the type signature that this is actually an out-pointer 2.

2

Shout out to James Munns for teaching me about the term "outpointer" - that's apparently the term C++ uses for this, via the outptr keyword.

use std::ptr::addr_of;

struct Cat { age: u8 }
impl Cat {
    fn new(age: u8) -> super Self {
        super let this = Self { age };  // declare in the caller's frame
        dbg!(addr_of!(this));
        this
    }
}

It's worth noting, though, that this is not an endorsement for actually going with super let or super Type - it's merely speculation about how there might be a possible connection between both features. I think it's fun, and the connection between both seems worthy of further exploration!

conclusion

In this post I've shown how we can construct types in-place using MaybeUninit, and it surprised me how simple it ended up being. I mostly wanted to have gone through the motions at least once - and now I have, and that was fun!

edit 2024-06-22: Thanks to Jordan Rose and Simon Sapin for helping un-break the unsafe pointer code in an earlier version of this post. Goes to show: Rust's pointer ergonomics really could use an overhaul.

The hero of time

abueloretrowave:

reader 3.13 released – scheduled updates

Hi there!

I'm happy to announce version 3.13 of reader, a Python feed reader library.

What's new? #

Here are the highlights since reader 3.12.

Scheduled updates #

reader now allows updating feeds at different rates via scheduled updates.

The way it works is quite simple: each feed has an update interval that determines when the feed should be updated next; calling update_feeds(scheduled=True) updates only the feeds that should be updated at or before the current time.

The interval can be configured by the user globally or per-feed through the .reader.update tag. In addition, you can specify a jitter; for an interval of 24 hours, a jitter of 0.25 means the update will occur any time in the first 6 hours of the interval.

In the future, the same mechanism will be used to handle 429 Too Many Requests.

Improved documentation #

As part of rewriting the Updating feeds user guide section to talk about scheduled updates, I've added a new section about being polite to servers.

Also, we have a new recipe for adding custom headers when retrieving feeds.

mark_as_read reruns #

You can now re-run the mark_as_read plugin for existing entries by adding the .reader.mark-as-read.once tag to a feed. Thanks to Michael Han for the pull request!


That's it for now. For more details, see the full changelog.

Want to contribute? Check out the docs and the roadmap.

Learned something new today? Share this with others, it really helps!

What is reader? #

reader takes care of the core functionality required by a feed reader, so you can focus on what makes yours different.

reader allows you to:

  • retrieve, store, and manage Atom, RSS, and JSON feeds
  • mark articles as read or important
  • add arbitrary tags/metadata to feeds and articles
  • filter feeds and articles
  • full-text search articles
  • get statistics on feed and user activity
  • write plugins to extend its functionality

...all these with:

  • a stable, clearly documented API
  • excellent test coverage
  • fully typed Python

To find out more, check out the GitHub repo and the docs, or give the tutorial a try.

Why use a feed reader library? #

Have you been unhappy with existing feed readers and wanted to make your own, but:

  • never knew where to start?
  • it seemed like too much work?
  • you don't like writing backend code?

Are you already working with feedparser, but:

  • want an easier way to store, filter, sort and search feeds and entries?
  • want to get back type-annotated objects instead of dicts?
  • want to restrict or deny file-system access?
  • want to change the way feeds are retrieved by using Requests?
  • want to also support JSON Feed?
  • want to support custom information sources?

... while still supporting all the feed types feedparser does?

If you answered yes to any of the above, reader can help.

The reader philosophy #

  • reader is a library
  • reader is for the long term
  • reader is extensible
  • reader is stable (within reason)
  • reader is simple to use; API matters
  • reader features work well together
  • reader is tested
  • reader is documented
  • reader has minimal dependencies

Why make your own feed reader? #

So you can:

  • have full control over your data
  • control what features it has or doesn't have
  • decide how much you pay for it
  • make sure it doesn't get closed while you're still using it
  • really, it's easier than you think

Obviously, this may not be your cup of tea, but if it is, reader can help.

Metroid Chill .2024

pixeljeff:

Visit: https://www.instagram.com/pixeljeff_design/