<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://sim642.eu/feed.xml" rel="self" type="application/atom+xml"/><link href="https://sim642.eu/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-20T15:41:50+00:00</updated><id>https://sim642.eu/feed.xml</id><title type="html">blank</title><subtitle></subtitle><entry><title type="html">Times with math: newtx vs Termes</title><link href="https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes/" rel="alternate" type="text/html" title="Times with math: newtx vs Termes"/><published>2026-01-24T00:00:00+00:00</published><updated>2026-02-18T00:00:00+00:00</updated><id>https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes</id><content type="html" xml:base="https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes/"><![CDATA[<p>The <a href="https://cs.ut.ee/en/">Institute of Computer Science</a> at the <a href="https://ut.ee/en/">University of Tartu</a> has a <a href="https://www.overleaf.com/latex/templates/unitartucs-phd-template/hmxdhtwgvzvm">LaTeX template for PhD theses</a>. The same template is also suggested by the <a href="https://tyk.ee/en/requirements-and-recommendations">University of Tartu Press</a> for non-Word users. One major requirement from the press is the use of Times New Roman as the text font.</p> <p>And that’s what the LaTeX template does, using the <a href="https://ctan.org/pkg/mathptmx?lang=en"><code class="language-plaintext highlighter-rouge">mathptmx</code></a> LaTeX package. As the package name suggests, it also provides a version of the Times font for math typesetting, which is quite important for most LaTeX users. 
One of the many flaws of <code class="language-plaintext highlighter-rouge">mathptmx</code> is that it’s <strong>extremely obsolete</strong> and everyone strongly advises against it:</p> <ol> <li>Its <a href="https://ctan.org/pkg/mathptmx?lang=en">CTAN page</a> says “reckoned to be obsolete”.</li> <li>The 3rd edition of “The LaTeX Companion” published in 2023 warns against it.</li> <li>The esteemed TeX StackExchange user <a href="https://tex.stackexchange.com/users/4427/egreg">egreg</a> has said “remove <code class="language-plaintext highlighter-rouge">mathptmx</code>, which is a 25-year-old hack” in <a href="https://tex.stackexchange.com/a/731332/383946">2024</a>.</li> <li>etc.</li> </ol> <p>One reason against <code class="language-plaintext highlighter-rouge">mathptmx</code> is that its math font is an inconsistent mess, as described by its <a href="https://ctan.org/pkg/mathptmx?lang=en">CTAN page</a>:</p> <blockquote> <p>[…] provides maths support using glyphs from the Symbol, Chancery and Computer Modern fonts together with letters, etc., from Times Roman.</p> </blockquote> <p>The successor of <code class="language-plaintext highlighter-rouge">mathptmx</code> is <a href="https://ctan.org/pkg/txfonts?lang=en"><code class="language-plaintext highlighter-rouge">txfonts</code></a>, which is by now also obsolete. And its successor is <a href="https://ctan.org/pkg/newtx?lang=en"><code class="language-plaintext highlighter-rouge">newtx</code></a>, which isn’t obsolete yet. Another non-obsolete alternative is to use the TeX Gyre Termes Math font. 
In this post, I will try switching my PhD thesis to both modern alternatives to compare them and find the best one to use instead of the ancient <code class="language-plaintext highlighter-rouge">mathptmx</code>.</p> <h2 id="setup">Setup</h2> <p>Before diving into the comparison, here are the three LaTeX setups I will be comparing (click each tab).</p> <ul id="setup" class="tab" data-tab="d81d76b5-48c5-4019-88cd-236f6717b7a7" data-name="setup"> <li class="active" id="setup-mathptmx"> <a href="#">mathptmx </a> </li> <li id="setup-newtx"> <a href="#">newtx </a> </li> <li id="setup-tex-gyre-termes"> <a href="#">TeX Gyre Termes </a> </li> </ul> <ul class="tab-content" id="d81d76b5-48c5-4019-88cd-236f6717b7a7" data-name="setup"> <li class="active"> <p>Under pdfLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>mathptmx<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amsmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amssymb<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>stmaryrd<span class="p">}</span>
<span class="k">\renewcommand</span><span class="p">{</span><span class="k">\sfdefault</span><span class="p">}{</span>phv<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[mono]</span><span class="p">{</span>inconsolata<span class="p">}</span>
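<span class="c">% (My untested suggestion, not part of the template: instead of the</span>
<span class="c">% \sfdefault line above, a scaled Helvetica clone could be loaded with</span>
<span class="c">% \usepackage[scaled=0.92]{helvet} to reduce the size mismatch with Times.)</span>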
</code></pre></div></div> <p>This setup is the relevant part from the PhD thesis template; only <code class="language-plaintext highlighter-rouge">inconsolata</code> is my addition. It might not even show up in any of the examples below, but its placement in the package loading sequence matters.</p> <p>Helvetica (<code class="language-plaintext highlighter-rouge">phv</code>) loaded like this is visibly larger than Times, both for lower- and uppercase letters, but this doesn’t seem to have bothered anyone using the template?!</p> </li> <li> <p>Under pdfLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="na">[largesc]</span><span class="p">{</span>newtxtext<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[mono]</span><span class="p">{</span>inconsolata<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>newtxmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
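<span class="c">% Note: stmaryrd is not needed just for \llbracket/\rrbracket here,</span>
<span class="c">% since newtxmath provides them itself (other stmaryrd symbols would</span>
<span class="c">% still require loading the package).</span>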
</code></pre></div></div> <p>The newtx packages take care to load a Helvetica clone (TeX Gyre Heros) at a more reasonable scale (0.94 in newer versions).</p> </li> <li> <p>Under LuaLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>fontspec<span class="p">}</span>
<span class="k">\setmainfont</span><span class="p">{</span>TeXGyreTermesX<span class="p">}</span>
<span class="k">\setsansfont</span><span class="p">{</span>TeX Gyre Heros<span class="p">}</span>[Scale=0.94]
<span class="k">\setmonofont</span><span class="p">{</span>inconsolata<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amsmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amssymb<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>unicode-math<span class="p">}</span>
<span class="k">\setmathfont</span><span class="p">{</span>TeX Gyre Termes Math<span class="p">}</span>
<span class="k">\AtBeginDocument</span><span class="p">{</span><span class="c">%</span>
    <span class="k">\NewCommandCopy</span><span class="p">{</span><span class="k">\llbracket</span><span class="p">}{</span><span class="k">\lBrack</span><span class="p">}</span><span class="c">%</span>
    <span class="k">\NewCommandCopy</span><span class="p">{</span><span class="k">\rrbracket</span><span class="p">}{</span><span class="k">\rBrack</span><span class="p">}</span><span class="c">%</span>
<span class="p">}</span>
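<span class="c">% unicode-math also redefines \vdots to be math-only, so in running</span>
<span class="c">% text it needs to be wrapped in math mode, e.g. $\vdots$.</span>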
</code></pre></div></div> <p>Since TeX Gyre Termes Math only exists in the OpenType format, I tried it out under LuaLaTeX. This requires a different way of loading text and math fonts (with <code class="language-plaintext highlighter-rouge">fontspec</code> and <code class="language-plaintext highlighter-rouge">unicode-math</code>, respectively).</p> <p>To minimize the visual differences from the newtx setup, I’m using TeXGyreTermesX from newtx instead of TeX Gyre Termes for the text. Moreover, I’m using the same scaling for Heros and Inconsolata as the newtx setup does implicitly. This scaling might not be optimal, though: neither lowercase nor uppercase letter height is matched.</p> <p>Since <code class="language-plaintext highlighter-rouge">unicode-math</code> doesn’t define <code class="language-plaintext highlighter-rouge">\llbracket</code> and <code class="language-plaintext highlighter-rouge">\rrbracket</code> (like <code class="language-plaintext highlighter-rouge">stmaryrd</code> and <code class="language-plaintext highlighter-rouge">newtxmath</code> do), they’re defined using the corresponding <code class="language-plaintext highlighter-rouge">unicode-math</code> symbols so that the rest of my thesis compiles without changes.</p> </li> </ul> <h2 id="text-mode">Text mode</h2> <p>Although this post isn’t about Times in text mode, there is one difference worth pointing out.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p><em>(Click on images to zoom.)</em></p> <p>The small caps (<code class="language-plaintext highlighter-rouge">\textsc</code>) in newtx are a bit heavier than in mathptmx, but also a bit shorter. 
And this is already newtx loaded with the <code class="language-plaintext highlighter-rouge">largesc</code> option to get <em>large</em> small caps; by default, it offers <em>petite</em> caps, which are even smaller.</p> <p>Since Termes in the rightmost figure is actually TeXGyreTermesX from newtx but in OpenType format, the small caps look <em>almost</em> the same as in newtx, but not quite! Somehow LuaLaTeX and the OpenType version allow for overly close kerning between certain letter pairs, causing uneven spacing across the whole word:</p> <ul> <li>“ac” in “RacerF”,</li> <li>“ag” in “Deagle”,</li> <li>“Da” in “Dartagnan”,</li> <li>“pa” in “UTaipan”,</li> <li>“Ac” in “CPAchecker”.</li> </ul> <p>On a side note, <code class="language-plaintext highlighter-rouge">unicode-math</code> in LuaLaTeX redefines <code class="language-plaintext highlighter-rouge">\vdots</code> to only work in math mode, which is why it’s missing in the figure (but easily fixed by just using it in math mode). The LaTeX default <code class="language-plaintext highlighter-rouge">\vdots</code> works in both modes because it’s not actually using a symbol from the font.</p> <h2 id="math-mode">Math mode</h2> <p>With text mode out of the way, the rest of the post compares Times in math mode.</p> <h3 id="parenthesis-kerning">Parenthesis kerning</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> 
</figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source 
class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>Kerning between the left parenthesis and certain letters is bad in Termes. The parenthesis outright collides with ‘f’. 
The parenthesis doesn’t collide with ‘p’, but at normal text size it feels suspiciously close, even though the other fonts actually leave a similar gap.</p> <p>The latter comes down to the different parentheses:</p> <ol> <li>In mathptmx they are the lightest but also tallest, extending below the descenders of ‘p’ and ‘f’.</li> <li>In newtx they are the heaviest but of medium height, in line with the descenders.</li> <li>In Termes they are of medium weight but the shortest, not reaching all the way around the descenders.</li> </ol> <p>On a side note, in mathptmx the <code class="language-plaintext highlighter-rouge">\mathsf</code> used for “unique” and “create” doesn’t actually use Helvetica (or its clone) but just Computer Modern Sans. This causes a dissonance with the sans-serif font in text mode (e.g. <code class="language-plaintext highlighter-rouge">\textsf</code>), which does use a version of Helvetica. Thus, the same document mixes two different sans-serif fonts.</p> <h3 id="subscript-kerning">Subscript kerning</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img 
src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes.png" 
class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>Ignoring the exact letterforms, subscript kerning is quite similar in mathptmx and newtx. Both have a bit too much space between the ‘x’ and the subscript ‘j’. This issue is acknowledged by newtx, which offers the package option <code class="language-plaintext highlighter-rouge">subscriptcorrection</code>. Using some LaTeX hacking, it implements special behavior to reduce the left kerning of certain letters appearing first in subscripts. 
This fixes the issue with ‘j’ but for some strange reason unnecessarily reduces the spacing in front of other non-problematic subscripts which aren’t even declared in its default <code class="language-plaintext highlighter-rouge">subscriptcorrectionfile</code>.</p> <p>Subscript kerning in Termes is all over the place:</p> <ol> <li>It gets the kerning in <code class="language-plaintext highlighter-rouge">x_j</code> correct out of the box.</li> <li>It leaves too little space in <code class="language-plaintext highlighter-rouge">\mathrm{st}_f</code>.</li> <li>It leaves way too much space for subscripts on <code class="language-plaintext highlighter-rouge">\mathcal</code> letters.</li> </ol> <h3 id="double-brackets">(Double) brackets</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" 
loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The double brackets (<code class="language-plaintext highlighter-rouge">\llbracket</code> and <code class="language-plaintext highlighter-rouge">\rrbracket</code>) in newtx are noticeably heavier than all other symbols and really stand out on a full page. Like in mathptmx, they consist of two normal (single) brackets close together (duh), but it becomes too much with the thicker single brackets (which match the parentheses) in newtx. Termes solves the weight issue by making the single brackets within a double bracket slightly lighter, giving a more uniform look overall.</p> <p>Furthermore, the double brackets in newtx are slightly taller than the normal brackets, which is not the case for the other two fonts. Given the bracket nesting, it might even be a good thing in this example, but it’s a strange mismatch in general.</p> <p>Additionally, bracket kerning is quite generous in newtx. In mathptmx, letters almost look like they reach <em>into</em> the brackets (they actually don’t, but there’s no additional space either), whereas in newtx there is additional space. 
The gap between the double bracket and the single bracket is particularly visible in newtx.</p> <p>On a side note, in mathptmx the <code class="language-plaintext highlighter-rouge">\mathbb{T}</code> in the subscript is clearly distinguishable from a normal <code class="language-plaintext highlighter-rouge">T</code>, but not so well in the others. In newtx this situation can perhaps be improved by choosing a different <code class="language-plaintext highlighter-rouge">\mathbb</code> variant using package options.</p> <h3 id="leq-vs-preceq"><code class="language-plaintext highlighter-rouge">\leq</code> vs <code class="language-plaintext highlighter-rouge">\preceq</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/leq-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 
mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/leq-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-termes-480.webp 
480w,/assets/times-with-math-newtx-vs-termes/leq-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\preceq</code> relation in Termes is mistakably similar to <code class="language-plaintext highlighter-rouge">\leq</code>, especially at normal text size, because it doesn’t bend as much as the others. The same probably applies to <code class="language-plaintext highlighter-rouge">\prec</code> vs <code class="language-plaintext highlighter-rouge">&lt;</code>, etc.</p> <p>The <code class="language-plaintext highlighter-rouge">\preceq</code> in newtx has a particularly small gap between <code class="language-plaintext highlighter-rouge">\prec</code> and the bottom equality line, but I can live with that over Termes.</p> <p>On a side note, in mathptmx the ‘i’ appears much closer to the <code class="language-plaintext highlighter-rouge">\leq</code> relation than the ‘j’.</p> <h3 id="setminus"><code class="language-plaintext highlighter-rouge">\setminus</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\setminus</code> operator in Termes is unusually wide: both in terms of its angle and its (left) kerning. I suppose it’s intended to match the width of other set operators (e.g. <code class="language-plaintext highlighter-rouge">\cup</code>, <code class="language-plaintext highlighter-rouge">\cap</code>) but I’m so used to the thinner one from Computer Modern. 
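</p> <p>A possible workaround (a sketch, untried with Termes here) might be the narrower <code class="language-plaintext highlighter-rouge">\smallsetminus</code>: <code class="language-plaintext highlighter-rouge">amssymb</code> provides it for classic engines, and <code class="language-plaintext highlighter-rouge">unicode-math</code> also knows the name, though whether Termes actually distinguishes it from <code class="language-plaintext highlighter-rouge">\setminus</code> is another question:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Sketch: \smallsetminus as a narrower alternative to \setminus.
% With pdfLaTeX it comes from amssymb; with unicode-math it comes
% from the math font itself (if the font distinguishes the two).
\usepackage{amssymb}
% ...
$A \setminus B$ vs. $A \smallsetminus B$
</code></pre></div></div> <p>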
On the other hand, Termes has the most balanced left vs right kerning for it, with mathptmx being particularly uneven.</p> <h3 id="nabla-and-delta"><code class="language-plaintext highlighter-rouge">\nabla</code> and <code class="language-plaintext highlighter-rouge">\Delta</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img 
src="/assets/times-with-math-newtx-vs-termes/nabla-delta-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\nabla</code> and <code class="language-plaintext highlighter-rouge">\Delta</code> are specifically used as binary operators in abstract interpretation and, thus, properly wrapped in <code class="language-plaintext highlighter-rouge">\mathbin</code> here. Surprisingly, they have different widths (or kerning) in mathptmx, as clearly seen from the misalignment of their right arguments.</p> <p>Annoyingly, <code class="language-plaintext highlighter-rouge">\nabla</code> in newtx is heavier than <code class="language-plaintext highlighter-rouge">\Delta</code> by having two thick sides instead of one. Not only are they asymmetric, but the heaviness of <code class="language-plaintext highlighter-rouge">\nabla</code> also stands out on a full page. Newtx offers <code class="language-plaintext highlighter-rouge">\laplace</code> as a similarly heavier version of <code class="language-plaintext highlighter-rouge">\Delta</code>, but that would make the latter stand out as well. Instead, I would like the opposite: a lighter version of <code class="language-plaintext highlighter-rouge">\nabla</code> that matches the <code class="language-plaintext highlighter-rouge">\Delta</code>, but newtx doesn’t offer that.</p> <p>On a side note, in Termes <code class="language-plaintext highlighter-rouge">\mathrm{\Delta}</code> does not work, despite <code class="language-plaintext highlighter-rouge">\Delta</code> being already upright. 
This is a bit strange since <code class="language-plaintext highlighter-rouge">unicode-math</code> documentation explicitly mentions <code class="language-plaintext highlighter-rouge">\mathup\Delta</code> and that <code class="language-plaintext highlighter-rouge">\mathup</code> is just an alias for <code class="language-plaintext highlighter-rouge">\mathrm</code>. I’m not sure why I had used <code class="language-plaintext highlighter-rouge">\mathrm{\Delta}</code> in the first place, perhaps it’s a relic from using the operator macro with a different font which defaulted to italic uppercase Greek letters.</p> <h3 id="times"><code class="language-plaintext highlighter-rouge">\times</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div 
class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The visual perception may be somewhat due to the not-so-common use of the <code class="language-plaintext highlighter-rouge">\times</code> operator, but:</p> <ol> <li>In mathptmx it is a bit light compared to the rest.</li> <li>In newtx it is heavier but really touches the baseline and doesn’t look as good.</li> <li>In Termes it is smaller (but suitably heavy) and closer to the middle line of the numbers, which looks the best in this context.</li> </ol> <blockquote class="block-tip"> <h5 id="texttimes"><code class="language-plaintext highlighter-rouge">\texttimes</code></h5> <div class="row justify-content-center mt-3"> <div class="col-sm-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> </div> <p>After switching to newtx I noticed that 
another similar instance looked different because it used the Unicode <code class="language-plaintext highlighter-rouge">×</code> in the LaTeX source. Turns out this is mapped to <code class="language-plaintext highlighter-rouge">\texttimes</code> which differs from <code class="language-plaintext highlighter-rouge">$\times$</code> in newtx and fixes my problem.</p> </blockquote> <h3 id="miscellany">Miscellany</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div 
class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>First, there is no <code class="language-plaintext highlighter-rouge">\coloneqq</code> in Termes. However, there is <code class="language-plaintext highlighter-rouge">\coloneq</code>, so the former omission is odd.</p> <p>Second, there is no <code class="language-plaintext highlighter-rouge">\square</code> in Termes. The one in the Termes figure comes from <code class="language-plaintext highlighter-rouge">amssymb</code> and is thus the same one as in the mathptmx figure. 
It is too light for Times.</p> <p>Third, differences between text mode and math mode parentheses are visible:</p> <ol> <li>In mathptmx the math mode ones are too light.</li> <li>In newtx they appear to be the same, which is great for consistency.</li> <li>In Termes they are quite similar, although the math mode ones are slightly shorter but not too light.</li> </ol> <p>Fourth, the <code class="language-plaintext highlighter-rouge">\bowtie</code> operators are very different in size:</p> <ol> <li>In mathptmx it is the largest and most clearly visible.</li> <li>In newtx it is scaled down and heavier. This kind of matches with the smaller <code class="language-plaintext highlighter-rouge">\square</code> in newtx though.</li> <li>In Termes it is microscopic and horizontally squashed. The holes in the bowtie are hardly visible at normal text size.</li> </ol> <blockquote class="block-tip"> <h5 id="join"><code class="language-plaintext highlighter-rouge">\Join</code></h5> <div class="row justify-content-center mt-3"> <div class="col-sm-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> </div> <p>After digging into newtx font files with <a href="https://fontforge.org">FontForge</a> I accidentally noticed that it contains 
two different bowties! Turns out the other one is less squashed and is provided as <code class="language-plaintext highlighter-rouge">\Join</code> in newtx.</p> </blockquote> <h2 id="conclusion">Conclusion</h2> <p>Clearly, mathptmx is inferior to both newtx and Termes. Based on my observations above, I think <strong>newtx is the better one of the two</strong>. Termes needs some improvements to kerning and math operators.</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[The Institute of Computer Science at the University of Tartu has a LaTeX template for PhD theses. The same template is also suggested by the University of Tartu Press for non-Word users. One major requirement from the press is the use of Times New Roman as the text font.]]></summary></entry><entry><title type="html">DOI to Bib(La)TeX – a misery</title><link href="https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery/" rel="alternate" type="text/html" title="DOI to Bib(La)TeX – a misery"/><published>2025-10-06T00:00:00+00:00</published><updated>2025-10-06T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery</id><content type="html" xml:base="https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery/"><![CDATA[<p>My PhD-thesis–to–be combines 8 papers from the last 5 years. Their Bib(La)TeX bibliography entries come in a wide range of quality and style. I would like some consistency but it’s quite an effort to achieve across 234 entries. So I was wondering if there’s any good-quality and consistent source from where I could (hopefully automatically) update their data via their DOI.</p> <h2 id="services">Services</h2> <p>Let’s look at a bunch of services for getting BibTeX entries by DOI. 
I’ll use the DOI <a href="https://doi.org/10.1007/978-3-031-50524-9_4">10.1007/978-3-031-50524-9_4</a> (one of my papers) as the example.</p> <p><em>Click on each tab to see the BibTeX entry from each service and my comments about it.</em></p> <ul id="service" class="tab" data-tab="8a8aefea-d3d4-4567-8a94-7b577db2d1fc" data-name="service"> <li class="active" id="service-doi"> <a href="#">DOI </a> </li> <li id="service-doi-formatter"> <a href="#">DOI formatter </a> </li> <li id="service-doi2bib"> <a href="#">doi2bib </a> </li> <li id="service-springer"> <a href="#">Springer </a> </li> <li id="service-acm"> <a href="#">ACM </a> </li> <li id="service-dblp-condensed"> <a href="#">DBLP condensed </a> </li> <li id="service-dblp-standard"> <a href="#">DBLP standard </a> </li> </ul> <ul class="tab-content" id="8a8aefea-d3d4-4567-8a94-7b577db2d1fc" data-name="service"> <li class="active"> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nc">@inbook</span><span class="p">{</span><span class="nl">Saan_2023</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span> <span class="na">ISBN</span><span class="p">=</span><span class="s">{9783031505249}</span><span class="p">,</span> <span class="na">ISSN</span><span class="p">=</span><span class="s">{1611-3349}</span><span class="p">,</span> <span class="na">url</span><span class="p">=</span><span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">DOI</span><span class="p">=</span><span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">booktitle</span><span class="p">=</span><span class="s">{Verification, Model Checking, and Abstract Interpretation}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span 
class="s">{Springer Nature Switzerland}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span> <span class="na">month</span><span class="p">=</span><span class="nv">dec</span><span class="p">,</span> <span class="na">pages</span><span class="p">=</span><span class="s">{74–97}</span> <span class="p">}</span>
</code></pre></div></div> <p>This is returned by <a href="https://citation.doi.org/docs.html">DOI Content Negotiation</a> which simply means making an HTTP(S) request to the usual DOI URL <a href="https://doi.org/10.1007/978-3-031-50524-9_4">https://doi.org/10.1007/978-3-031-50524-9_4</a> but with the <code class="language-plaintext highlighter-rouge">Accept: application/x-bibtex</code> HTTP header, i.e.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LH</span> <span class="s2">"Accept: application/x-bibtex"</span> https://doi.org/10.1007/978-3-031-50524-9_4
</code></pre></div></div> <p>For this particular DOI, this actually delegates to the <a href="https://www.crossref.org/documentation/retrieve-metadata/rest-api/">Crossref API</a> at</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-L</span> https://api.crossref.org/works/10.1007/978-3-031-50524-9_4/transform/application/x-bibtex
</code></pre></div></div> <h4 id="comments">Comments</h4> <ol> <li>The entry type is <code class="language-plaintext highlighter-rouge">@inbook</code>, although <code class="language-plaintext highlighter-rouge">@inproceedings</code> would be more precise for this work.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field has value <a href="http://dx.doi.org/10.1007/978-3-031-50524-9_4">http://dx.doi.org/10.1007/978-3-031-50524-9_4</a>. There are two things wrong with that: <ol> <li>It’s HTTP, not HTTPS.</li> <li>It uses <a href="https://dx.doi.org">dx.doi.org</a>, not just <a href="https://doi.org">doi.org</a>.</li> </ol> <p>The former options in both points are <a href="https://www.doi.org/the-identifier/resources/factsheets/doi-resolution-documentation">no longer preferred</a>, yet the official DOI metadata service doesn’t follow its own recommendations.</p> </li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field is actually not specified for <code class="language-plaintext highlighter-rouge">@inbook</code> in <a href="https://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf">BibTeX</a>. It is specified for <code class="language-plaintext highlighter-rouge">@inproceedings</code>, so it really should be that. 
In <a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a>, <code class="language-plaintext highlighter-rouge">booktitle</code> is also specified for <code class="language-plaintext highlighter-rouge">@inbook</code> but only because BibLaTeX gives <code class="language-plaintext highlighter-rouge">@inbook</code> a slightly different meaning than BibTeX.</li> <li>The whole result is on one line (fine) and has a spurious single space in the beginning (which is odd).</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nc">@misc</span><span class="p">{</span><span class="nl">Saan_Schwarz_Erhard_Seidl_Tilscher_Vojdani_2023</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span> <span class="na">url</span><span class="p">=</span><span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">DOI</span><span class="p">=</span><span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">journal</span><span class="p">=</span><span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span class="s">{Springer Nature Switzerland}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span> <span class="na">month</span><span class="p">=</span><span class="nv">dec</span><span class="p">,</span> <span class="na">pages</span><span class="p">=</span><span class="s">{74–97}</span><span 
class="p">,</span> <span class="na">language</span><span class="p">=</span><span class="s">{en}</span> <span class="p">}</span>
</code></pre></div></div> <p>This is returned by the <a href="https://citation.doi.org/">DOI Citation Formatter</a> for the style <code class="language-plaintext highlighter-rouge">bibtex</code>, which can also be accessed through an <a href="https://citation.doi.org/api-docs.html">API</a>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="s1">'https://citation.doi.org/format?doi=10.1007%2F978-3-031-50524-9_4&amp;style=bibtex&amp;lang=en-US'</span>
</code></pre></div></div> <h4 id="comments">Comments</h4> <p>It’s quite similar to the previous one from DOI Content Negotiation, but objectively worse:</p> <ol> <li>The entry type is now just <code class="language-plaintext highlighter-rouge">@misc</code>.</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field is missing (it’s not specified for <code class="language-plaintext highlighter-rouge">@misc</code> anyway), and the title “Verification, Model Checking, and Abstract Interpretation” isn’t in any other field either.</li> <li>The <code class="language-plaintext highlighter-rouge">journal</code> field is now present (it’s not specified for <code class="language-plaintext highlighter-rouge">@misc</code> either!) and has value “Lecture Notes in Computer Science”, which isn’t a journal but a <a href="https://link.springer.com/series/558">book series</a> (which belongs to the <code class="language-plaintext highlighter-rouge">series</code> field, if it wasn’t for <code class="language-plaintext highlighter-rouge">@misc</code>).</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inbook</span><span class="p">{</span><span class="nl">Saan2023</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">ISBN</span> <span class="p">=</span> <span class="s">{9783031505249}</span><span class="p">,</span>
  <span class="na">ISSN</span> <span class="p">=</span> <span class="s">{1611-3349}</span><span class="p">,</span>
  <span class="na">url</span> <span class="p">=</span> <span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
  <span class="na">DOI</span> <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
  <span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Verification,  Model Checking,  and Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">publisher</span> <span class="p">=</span> <span class="s">{Springer Nature Switzerland}</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Saan,  Simmo and Schwarz,  Michael and Erhard,  Julian and Seidl,  Helmut and Tilscher,  Sarah and Vojdani,  Vesal}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2023}</span><span class="p">,</span>
  <span class="na">month</span> <span class="p">=</span> <span class="nv">dec</span><span class="p">,</span>
  <span class="na">pages</span> <span class="p">=</span> <span class="s">{74–97}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by <a href="https://www.doi2bib.org">doi2bib</a> at <a href="https://www.doi2bib.org/bib/10.1007/978-3-031-50524-9_4">https://www.doi2bib.org/bib/10.1007/978-3-031-50524-9_4</a>. <a href="https://www.doi2bib.org">doi2bib</a> is just a browser frontend for DOI Content Negotiation and performs client-side reformatting. As far as I have seen, many other tools actually do this under the hood.</p> <h4 id="comments">Comments</h4> <p>It has all the issues of DOI Content Negotiation and only the following differences:</p> <ol> <li>The formatting is generally more human-friendly.</li> <li>The formatting adds double spaces after commas in field values. This shouldn’t affect Bib(La)TeX, but is odd nevertheless.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@InProceedings</span><span class="p">{</span><span class="nl">10.1007/978-3-031-50524-9_4</span><span class="p">,</span>
<span class="na">author</span><span class="p">=</span><span class="s">"Saan, Simmo
and Schwarz, Michael
and Erhard, Julian
and Seidl, Helmut
and Tilscher, Sarah
and Vojdani, Vesal"</span><span class="p">,</span>
<span class="na">editor</span><span class="p">=</span><span class="s">"Dimitrova, Rayna
and Lahav, Ori
and Wolff, Sebastian"</span><span class="p">,</span>
<span class="na">title</span><span class="p">=</span><span class="s">"Correctness Witness Validation by Abstract Interpretation"</span><span class="p">,</span>
<span class="na">booktitle</span><span class="p">=</span><span class="s">"Verification, Model Checking, and Abstract Interpretation"</span><span class="p">,</span>
<span class="na">year</span><span class="p">=</span><span class="s">"2024"</span><span class="p">,</span>
<span class="na">publisher</span><span class="p">=</span><span class="s">"Springer Nature Switzerland"</span><span class="p">,</span>
<span class="na">address</span><span class="p">=</span><span class="s">"Cham"</span><span class="p">,</span>
<span class="na">pages</span><span class="p">=</span><span class="s">"74--97"</span><span class="p">,</span>
<span class="na">abstract</span><span class="p">=</span><span class="s">"Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7{\%} to 47{\%} in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own."</span><span class="p">,</span>
<span class="na">isbn</span><span class="p">=</span><span class="s">"978-3-031-50524-9"</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “<a href="https://citation-needed.springer.com/v2/references/10.1007/978-3-031-50524-9_4?format=bibtex&amp;flavour=citation">Download citation (.BIB)</a>” feature of Springer Link which the particular DOI points to:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="s1">'https://citation-needed.springer.com/v2/references/10.1007/978-3-031-50524-9_4?format=bibtex&amp;flavour=citation'</span>
</code></pre></div></div> <h4 id="comments">Comments</h4> <ol> <li>This is completely different from the previous ones based on DOI Content Negotiation. I guess that’s because those actually come from <a href="https://www.crossref.org/">Crossref</a>’s database, while this one comes from Springer’s own database, but as a user I shouldn’t have to know or care. It’s still Springer submitting data to Crossref, and the DOI URL itself redirects to Springer under normal conditions (i.e. a plain HTTP request without content negotiation).</li> <li>The entry type is <code class="language-plaintext highlighter-rouge">@InProceedings</code>, which is more accurate than all the previous ones.</li> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field is missing. The DOI is in the entry key, although that doesn’t make the DOI show up in a Bib(La)TeX bibliography.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field is also missing. Thus, there would be no digital reference in a rendered bibliography.</li> <li>The formatting is multiline, but not indented.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">10.1007/978-3-031-50524-9_4</span><span class="p">,</span>
<span class="na">author</span> <span class="p">=</span> <span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span>
<span class="na">title</span> <span class="p">=</span> <span class="s">{Correctness Witness Validation by&amp;nbsp;Abstract Interpretation}</span><span class="p">,</span>
<span class="na">year</span> <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
<span class="na">isbn</span> <span class="p">=</span> <span class="s">{978-3-031-50523-2}</span><span class="p">,</span>
<span class="na">publisher</span> <span class="p">=</span> <span class="s">{Springer-Verlag}</span><span class="p">,</span>
<span class="na">address</span> <span class="p">=</span> <span class="s">{Berlin, Heidelberg}</span><span class="p">,</span>
<span class="na">url</span> <span class="p">=</span> <span class="s">{https://doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
<span class="na">doi</span> <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
<span class="na">abstract</span> <span class="p">=</span> <span class="s">{Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7\% to 47\% in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own.}</span><span class="p">,</span>
<span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Verification, Model Checking, and Abstract Interpretation: 25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I}</span><span class="p">,</span>
<span class="na">pages</span> <span class="p">=</span> <span class="s">{74–97}</span><span class="p">,</span>
<span class="na">numpages</span> <span class="p">=</span> <span class="s">{24}</span><span class="p">,</span>
<span class="na">keywords</span> <span class="p">=</span> <span class="s">{Correctness Witness, Witness Validation, Software Verification, Program Analysis, Abstract Interpretation}</span><span class="p">,</span>
<span class="na">location</span> <span class="p">=</span> <span class="s">{London, United Kingdom}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “Export Citation” feature of ACM Digital Library at <a href="https://dl.acm.org/doi/10.1007/978-3-031-50524-9_4">https://dl.acm.org/doi/10.1007/978-3-031-50524-9_4</a>. Although the particular work is published by Springer, ACM seems to index it.</p> <h4 id="comments">Comments</h4> <ol> <li>The <code class="language-plaintext highlighter-rouge">title</code> field value includes <code class="language-plaintext highlighter-rouge">&amp;nbsp;</code>, which is inappropriate for Bib(La)TeX.</li> <li>The <code class="language-plaintext highlighter-rouge">publisher</code> and <code class="language-plaintext highlighter-rouge">address</code> field values “Springer-Verlag” and “Berlin, Heidelberg” seem wrong because Springer itself returned “Springer Nature Switzerland” and “Cham”. (Although personally I don’t care: I would drop the <code class="language-plaintext highlighter-rouge">address</code> and simplify <code class="language-plaintext highlighter-rouge">publisher</code> to “Springer”.)</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value includes the book’s subtitle “25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I”. In BibTeX, there’s no other place to put the subtitle (except omitting it, like all the previous services do). 
<a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> specifies the <code class="language-plaintext highlighter-rouge">booksubtitle</code> field, and even more appropriate ones like <code class="language-plaintext highlighter-rouge">eventtitle</code>, <code class="language-plaintext highlighter-rouge">venue</code> and <code class="language-plaintext highlighter-rouge">eventdate</code> (as also pointed out in <a href="https://tex.stackexchange.com/a/697291">this TeX StackExchange answer</a>).</li> <li>The formatting is multiline, but not indented.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">DBLP:conf/vmcai/SaanSESTV24</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Simmo Saan and
                  Michael Schwarz and
                  Julian Erhard and
                  Helmut Seidl and
                  Sarah Tilscher and
                  Vesal Vojdani}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">booktitle</span>    <span class="p">=</span> <span class="s">{{VMCAI} {(1)}}</span><span class="p">,</span>
  <span class="na">series</span>       <span class="p">=</span> <span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span>
  <span class="na">volume</span>       <span class="p">=</span> <span class="s">{14499}</span><span class="p">,</span>
  <span class="na">pages</span>        <span class="p">=</span> <span class="s">{74--97}</span><span class="p">,</span>
  <span class="na">publisher</span>    <span class="p">=</span> <span class="s">{Springer}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “export record (BibTeX)” feature of DBLP at <a href="https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=0">https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=0</a>. DBLP offers multiple BibTeX formats, this being the condensed one.</p> <h4 id="comments">Comments</h4> <ol> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field is missing and, unlike Springer, it’s not in the entry key either.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field is also missing.</li> <li> <p>The <code class="language-plaintext highlighter-rouge">volume</code> field value is “14499” which actually corresponds to the <code class="language-plaintext highlighter-rouge">series</code> “Lecture Notes in Computer Science”. This is wrong in both BibTeX and BibLaTeX: it should instead be the <code class="language-plaintext highlighter-rouge">number</code> field with the value “14499”.</p> <p><a href="https://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf">BibTeX</a> specifies:</p> <blockquote> <dl> <dt>number</dt> <dd>The number of […] a work in a series. […] sometimes books are given numbers in a named series.</dd> <dt>volume</dt> <dd>The volume of a journal or multivolume book.</dd> </dl> </blockquote> <p><a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> specifies:</p> <blockquote> <dl> <dt>number</dt> <dd>[…] the volume/number of a book in a series.</dd> <dt>volume</dt> <dd>The volume of a multi-volume book or a periodical.</dd> </dl> </blockquote> <p>This has also been pointed out in <a href="https://tex.stackexchange.com/a/697291">this TeX StackExchange answer</a>.</p> </li> <li> <p>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value is essentially “VMCAI (1)”, where the 1 refers to the part. 
The latter is what actually should go into the <code class="language-plaintext highlighter-rouge">volume</code> field according to the specifications above.</p> <p>Alternatively, <a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> also specifies:</p> <blockquote> <dl> <dt>part</dt> <dd>The number of a partial volume. This field applies to books only, not to journals. It may be used when a logical volume consists of two or more physical ones. In this case the number of the logical volume goes in the <code class="language-plaintext highlighter-rouge">volume</code> field and the number of the part of that volume in the <code class="language-plaintext highlighter-rouge">part</code> field.</dd> </dl> </blockquote> <p>The distinction between logical and physical is a bit hazy in this case. Even <a href="https://link.springer.com/book/10.1007/978-3-031-50524-9">Springer</a> cannot make up their mind about the terminology:</p> <ol> <li>The subtitle of the book ends with “Part I”.</li> <li>The Springer Link page for the book has the section “<a href="https://link.springer.com/book/10.1007/978-3-031-50524-9#other-volumes">Other volumes</a>”.</li> <li>The “<a href="https://link.springer.com/book/10.1007/978-3-031-50524-9#about-this-book">About this book</a>” section on the same page mentions both, while starting with “The two-volume set LNCS 14499 and 14500 […]”.</li> </ol> </li> <li>The formatting is the nicest of them all. Although, when copying the BibTeX code from the DBLP website, the copied text includes two empty leading and trailing lines for some reason. The empty lines are not present in the downloadable .bib file.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">DBLP:conf/vmcai/SaanSESTV24</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Simmo Saan and
                  Michael Schwarz and
                  Julian Erhard and
                  Helmut Seidl and
                  Sarah Tilscher and
                  Vesal Vojdani}</span><span class="p">,</span>
  <span class="na">editor</span>       <span class="p">=</span> <span class="s">{Rayna Dimitrova and
                  Ori Lahav and
                  Sebastian Wolff}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">booktitle</span>    <span class="p">=</span> <span class="s">{Verification, Model Checking, and Abstract Interpretation - 25th International
                  Conference, {VMCAI} 2024, London, United Kingdom, January 15-16, 2024,
                  Proceedings, Part {I}}</span><span class="p">,</span>
  <span class="na">series</span>       <span class="p">=</span> <span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span>
  <span class="na">volume</span>       <span class="p">=</span> <span class="s">{14499}</span><span class="p">,</span>
  <span class="na">pages</span>        <span class="p">=</span> <span class="s">{74--97}</span><span class="p">,</span>
  <span class="na">publisher</span>    <span class="p">=</span> <span class="s">{Springer}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
  <span class="na">url</span>          <span class="p">=</span> <span class="s">{https://doi.org/10.1007/978-3-031-50524-9\_4}</span><span class="p">,</span>
  <span class="na">doi</span>          <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9\_4}</span><span class="p">,</span>
  <span class="na">timestamp</span>    <span class="p">=</span> <span class="s">{Sat, 10 Feb 2024 18:04:44 +0100}</span><span class="p">,</span>
  <span class="na">biburl</span>       <span class="p">=</span> <span class="s">{https://dblp.org/rec/conf/vmcai/SaanSESTV24.bib}</span><span class="p">,</span>
  <span class="na">bibsource</span>    <span class="p">=</span> <span class="s">{dblp computer science bibliography, https://dblp.org}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “export record (BibTeX)” feature of DBLP at <a href="https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=1">https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=1</a>. DBLP offers multiple BibTeX formats, this being the standard one.</p> <h4 id="comments">Comments</h4> <p>It is mostly an extension of the previous one from DBLP, but with additional fields which can be (and are) treated incorrectly:</p> <ol> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field value has the underscore escaped. This is unnecessary and even wrong: <a href="https://doi.org/10.1007/978-3-031-50524-9\_4">DOI lookup</a> returns “DOI Not Found”.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field value also has the underscore escaped. This is again unnecessary and even wrong: the <a href="https://doi.org/10.1007/978-3-031-50524-9\_4"><code class="language-plaintext highlighter-rouge">url</code></a> is broken.</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value is uncondensed, but has the same issues as with ACM.</li> </ol> </li> </ul> <p><em>The tab content ends here.</em><sup id="fnref:tabs-css"><a href="#fn:tabs-css" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <hr/> <h3 id="comparison">Comparison</h3> <p>Here’s a table to summarize some aspects of the entries returned by the services. 
The values I consider acceptable are in <strong>bold</strong><sup id="fnref:bold-monospace"><a href="#fn:bold-monospace" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> and the values I prefer are in <em>italic</em>.</p> <table> <thead> <tr> <th>Feature</th> <th>DOI</th> <th>DOI formatter</th> <th>doi2bib</th> <th>Springer</th> <th>ACM</th> <th>DBLP condensed</th> <th>DBLP standard</th> </tr> </thead> <tbody> <tr> <td>Entry type</td> <td><code class="language-plaintext highlighter-rouge">@inbook</code></td> <td><code class="language-plaintext highlighter-rouge">@misc</code></td> <td><code class="language-plaintext highlighter-rouge">@inbook</code></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@InProceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">doi</code></td> <td><strong><em>Yes</em></strong></td> <td><strong><em>Yes</em></strong></td> <td><strong><em>Yes</em></strong></td> <td>No</td> <td><strong><em>Yes</em></strong></td> <td>No</td> <td>Yes<sup id="fnref:underscore-problem"><a href="#fn:underscore-problem" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">url</code></td> <td>dx.doi.org</td> <td>dx.doi.org</td> <td>dx.doi.org</td> <td><strong><em>No</em></strong></td> <td><strong>doi.org</strong></td> <td><strong><em>No</em></strong></td> <td>doi.org<sup id="fnref:underscore-problem:1"><a href="#fn:underscore-problem" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">year</code></td> 
<td>2023</td> <td>2023</td> <td>2023</td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> </tr> <tr> <td>Event info</td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td><strong><em>No</em></strong></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> </tr> <tr> <td>LNCS №</td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td>In <code class="language-plaintext highlighter-rouge">volume</code></td> <td>In <code class="language-plaintext highlighter-rouge">volume</code></td> </tr> <tr> <td>Book part</td> <td>No</td> <td>No</td> <td>No</td> <td>No</td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> </tr> </tbody> </table> <p>The table also compares the <code class="language-plaintext highlighter-rouge">year</code> field values, which weren’t discussed above. Surprisingly, there isn’t even consensus about such a basic fact. It probably has to do with “<a href="https://doi.org/10.1007/978-3-031-50524-9_4">First Online: 30 December 2023</a>”. The Crossref data for the DOI seems to correspond to that, while Springer itself considers the publication to be in 2024, which is also when the conference took place. This just goes to show that the DOI Content Negotiation data, which gets used by many other services, may be inaccurate w.r.t. 
the very basics.</p> <h2 id="conclusion">Conclusion</h2> <p>I learned about <a href="https://citation.doi.org/docs.html">DOI Content Negotiation</a> and how bad it actually is for BibTeX. The databases (Springer, ACM, DBLP) are better, but none is perfect or even good enough, as the comparison table reveals. I guess I’ll end up doing a lot of manual work, although some is semi-automatable using BibLaTeX <em>source maps</em> (which are a story for another time).</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:tabs-css"> <p>The styling of tabs in the website theme I’m using clearly isn’t great if I have to point it out. I should fix that. <a href="#fnref:tabs-css" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:bold-monospace"> <p>The CSS of the website theme I’m using is such that <strong>bold</strong> doesn’t work together with <code class="language-plaintext highlighter-rouge">monospace</code>. I should fix that. But until then, just imagine <code class="language-plaintext highlighter-rouge">@inproceedings</code> (and <code class="language-plaintext highlighter-rouge">@InProceedings</code>) being bold in the table. <a href="#fnref:bold-monospace" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:underscore-problem"> <p>Has problem with escaping underscores. <a href="#fnref:underscore-problem" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:underscore-problem:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[My PhD-thesis–to–be combines 8 papers from the last 5 years. Their Bib(La)TeX bibliography entries come in a wide range of quality and style. I would like some consistency but it’s quite an effort to achieve across 234 entries. 
So I was wondering if there’s any good quality and consistent source from where I could (hopefully automatically) update their data via their DOI.]]></summary></entry><entry><title type="html">Scraping barcodes with suffix trees</title><link href="https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees/" rel="alternate" type="text/html" title="Scraping barcodes with suffix trees"/><published>2025-08-29T00:00:00+00:00</published><updated>2025-08-29T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees</id><content type="html" xml:base="https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees/"><![CDATA[<p>Estonia has a <a href="https://en.wikipedia.org/wiki/Container-deposit_legislation">deposit-refund system for drink bottles and cans</a> (like the German <em>Pfand</em>). It is operated by Eesti Pandipakend whose website includes the <a href="https://eestipandipakend.ee/en/packaging-register">package registry</a><sup id="fnref:packaging-register"><a href="#fn:packaging-register" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. The registry isn’t just a list of all products whose packages are part of the system but rather a search.</p> <h4 id="package-registry-search">Package registry search</h4> <p>One can search for the entire barcode of a product to see whether it’s included in the registry and some extra data about it. But one can also search for a part of the barcode to find products whose barcodes include it as a substring. However, there’s an important restriction: <strong>the search returns at most 10 products</strong> (like SQL’s <code class="language-plaintext highlighter-rouge">LIMIT 10</code>). Additionally, the empty substring cannot really be searched (misleadingly, it returns zero products).</p> <h2 id="scraping-problem">Scraping problem</h2> <p>With the search at hand, let’s consider scraping the entire package registry, i.e. 
finding barcodes of all the registered packages and the extra data about them.<sup id="fnref:motivation"><a href="#fn:motivation" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> The main focus is on the barcodes because the extra data is a simple byproduct: when the barcode of a product is confirmed to be in the registry by seeing it in the results of some search query, then we also get its extra data and can record it on the side.</p> <p>The package registry is essentially <strong>rate-limited to 1 query per second</strong> (technically, 60 queries per minute). Therefore, the goal is to minimize the number of queries needed to completely scrape the registry.</p> <h3 id="naïve-solution">Naïve solution</h3> <p>Since the barcodes are numeric and (mostly) of length 13, there’s an obvious algorithm to scrape the registry: just query each complete barcode to see whether zero or one results are returned. Obviously, this algorithm is completely impractical because it requires \(10^{13}\) queries. (This could be reduced by exploiting the structure of <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-13</a> barcodes but that’s not enough to make it practical.)</p> <p>To further complicate things, the registry also contains small numbers of 12-digit <a href="https://en.wikipedia.org/wiki/Universal_Product_Code">UPC-A</a> and 8-digit <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-8</a> barcodes, which require additional queries to scrape. Furthermore, <em>a priori</em> there’s no indication which lengths of barcodes exist in the registry. The 10 result limit for a query means that an <strong>extra assumption</strong> is needed to make the scraping problem properly solvable at all. Namely, for every complete barcode \(b\) in the registry, the registry contains at most 10 barcodes which have \(b\) as a substring (including \(b\) itself). 
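</p> <p>This assumption can be sanity-checked offline on any set of barcodes. A minimal sketch (the function name, signature and default limit are mine, not part of the registry or the scraper):</p>

```python
def assumption_holds(barcodes, limit=10):
    """Check that every barcode in the set is a substring of at most
    `limit` barcodes in the same set (including itself)."""
    barcodes = list(barcodes)
    for b in barcodes:
        # Number of barcodes that contain b as a substring.
        containing = sum(1 for other in barcodes if b in other)
        if containing > limit:
            return False
    return True
```

<p>On the scraped registry this is a quadratic loop over ~10000 barcodes, which should still be fast enough for a one-off check.</p> <p>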
If this wasn’t the case, then there would be no guarantee that querying \(b\) would necessarily return \(b\). This assumption (probably severely) limits the potential search space.<sup id="fnref:search-space-exercise"><a href="#fn:search-space-exercise" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <h3 id="basic-online-solution">Basic online solution</h3> <p>The naïve algorithm, besides being impractical, has another limitation: it first fixes all the many queries to be made and then performs them all. For a feasible solution, we can use an <a href="https://en.wikipedia.org/wiki/Online_algorithm">online algorithm</a>, which determines new queries to make based on the results of previous queries.</p> <p>My first idea for such an algorithm was the following. First, query <code class="language-plaintext highlighter-rouge">0</code>: if it returns fewer than 10 results (which is unlikely), then we immediately know all barcodes in the registry which include <code class="language-plaintext highlighter-rouge">0</code>; otherwise, recursively query <code class="language-plaintext highlighter-rouge">00</code>, <code class="language-plaintext highlighter-rouge">01</code>, …, <code class="language-plaintext highlighter-rouge">09</code>. For each of those, behave similarly: if there are fewer than 10 results, then stop recursing; otherwise, continue recursing to all 10 extensions of the query. And so on… Once all of this is done, repeat the same process starting with <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, …, <code class="language-plaintext highlighter-rouge">9</code>. (The general approach would be to start one recursion from the empty string, but due to the empty query corner case that wouldn’t actually work.)</p> <blockquote class="block-warning"> <p>I have not proven this algorithm to be correct (nor defined what correctness means, yet). 
So if it isn’t, then this post has already gone off the rails. Comment below if you believe that’s the case!</p> </blockquote> <h3 id="suffix-tree-solution">Suffix tree solution</h3> <p>The basic online algorithm can be slightly improved using a <a href="https://en.wikipedia.org/wiki/Suffix_tree">(generalized) suffix tree</a>. To this end, every barcode returned by any query (including those with at least 10 results) is added to a generalized suffix tree, e.g. efficiently using <a href="https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm">Ukkonen’s algorithm</a>. (For everything related to suffix trees, I would recommend reading <a class="citation" href="#Gusfield_1997">(Gusfield, 1997)</a>.) Before making any actual query, we can efficiently simulate the query on the suffix tree to check if we have already seen at least 10 barcodes with the query as substring. If this is the case, then we can skip the actual query and immediately proceed with recursion, because the actual query would also return 10 results and force us to recurse anyway.</p> <p>I used this approach to scrape the actual registry on 2025-08-22, which yielded <strong>10171 barcodes</strong> (9535 <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-13</a>, 377 <a href="https://en.wikipedia.org/wiki/Universal_Product_Code">UPC-A</a> and 259 <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-8</a>).<sup id="fnref:github"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> Hoping this algorithm is correct, I have a copy of the entire registry at hand locally, which makes it <em>much</em> easier and faster to experiment with different solutions: the registry can be simulated without any rate limiting.<sup id="fnref:local-registry-large-query"><a href="#fn:local-registry-large-query" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> On this local simulation, the suffix tree algorithm uses <strong>73713 queries</strong> 
to scrape the registry. With the actual rate limiting, this requires almost 20.5 hours. In comparison, the basic algorithm uses 81250 queries, which is ~10% more.</p> <h3 id="better-solutions">Better solutions?</h3> <p>I suspect the suffix tree solution is far from optimal because it discovers each barcode in the registry numerous times, corresponding to each suffix of each barcode. Intuitively, it should be possible to do better. For example, if <code class="language-plaintext highlighter-rouge">0000</code> and its extensions have already been scraped, then later it would be a waste to scrape <code class="language-plaintext highlighter-rouge">10000</code> and its extensions, because all of them have already been found. <strong>Except, that’s not true!</strong> If the <code class="language-plaintext highlighter-rouge">0000</code> query has at least 10 results and <code class="language-plaintext highlighter-rouge">00001</code>, …, <code class="language-plaintext highlighter-rouge">00009</code> have all been scraped, then we might not have seen barcodes that <em>end with</em> <code class="language-plaintext highlighter-rouge">0000</code>. For this tricky reason, my attempts at further optimization (e.g. also using a suffix tree of all performed queries to find such query inclusions) have failed to correctly scrape the entire registry. Such pruning can reduce the number of queries by over 50%, but can also cause the scrape to miss a handful of barcodes, which is no good (<a href="https://web.archive.org/web/20221205165923/https://theprofoundprogrammer.com/post/28974600028/text-it-doesnt-work-but-its-fast">“It doesn’t work, but it’s fast”</a>).</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a better <em>correct</em> solution!</p> </blockquote> <h2 id="verification-problem">Verification problem</h2> <p>With the scraped registry at hand, let’s consider checking the result, i.e. that the scraped dataset has exactly the same barcodes as the actual registry. 
This set equality can be viewed as two set inclusions: all the scraped barcodes exist in the actual registry, but also all the barcodes in the actual registry were scraped. Let’s call these properties <em>soundness</em> and <em>completeness</em>, respectively.</p> <p>In a way, verification is like solving the scraping problem while knowing the answer to begin with. So intuitively, it should be doable with fewer queries.</p> <h3 id="soundness">Soundness</h3> <h4 id="naïve-solution-1">Naïve solution</h4> <p>The obvious algorithm is to just query each scraped barcode from the actual registry and confirm that it is included in the results. (If any of the results contains new barcodes, we’ve inadvertently disproven completeness.) Thus, this requires as many queries as there are barcodes in the registry, which can hardly be optimal. For the registry scraped above, it requires <strong>10171 queries</strong>.</p> <h4 id="suffix-tree-solution-1">Suffix tree solution</h4> <p>Since the answer is already known, a generalized suffix tree of all the barcodes can be constructed to begin with. The verification algorithm is then to traverse the suffix tree and query from the actual registry the substring corresponding to each suffix tree path (from its root) where the subtree under that node has at most 10 barcodes (while its parent has more). Unlike the suffix tree solution to the scraping problem, nodes representing exactly 10 barcodes should also be fine here.</p> <p>As with the suffix tree solution to the scraping problem, this also checks each barcode multiple times, corresponding to each suffix of each barcode. 
Having implemented this approach, for the registry scraped above, it requires <strong>29794 queries</strong>, which is much worse than even the naïve algorithm.<sup id="fnref:github:1"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h4 id="suffix-tree-and-set-cover-solution">Suffix tree and set cover solution</h4> <p>Since we have the answer, we can do some optimization to avoid verifying each barcode numerous times. Like the previous solution, the suffix tree provides all substrings we <em>could</em> query and their expected results. Minimizing the number of queries to cover all the barcodes is an instance of the <a href="https://en.wikipedia.org/wiki/Set_cover_problem">set cover problem</a>, which, somewhat unfortunately, is NP-hard. So curiously, to efficiently verify the solution to the scraping problem, we have to solve (not verify) an instance of the difficult set cover problem.</p> <p>Luckily, the greedy algorithm for the set cover problem isn’t too shabby in this case: having implemented it, for the registry scraped above, it requires <em>just</em> <strong>2159 queries</strong>, roughly a fifth of what the naïve solution requires.<sup id="fnref:github:2"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> Theoretical results about the greedy algorithm show that its approximation ratio here is \(H(10) \approx 2.93\), where \(H\) is the harmonic number and 10 is the size of the largest set (picked from the suffix tree). In other words, the greedy algorithm can be at most ~2.93 times worse than the optimal set cover size. Or put another way, for the registry scraped above, the optimal number of verification queries is <em>at least</em> 738.</p> <p>As far as I managed to find, similar ideas have been proposed to solve a slightly different <strong>string barcoding problem</strong> <a class="citation" href="#10.1145/565196.565229">(Rash &amp; Gusfield, 2002)</a>. 
Coincidentally, that paper also mentions barcodes by name, albeit with a completely different meaning: it’s a problem in computational biology for constructing some kind of probes to distinguish a set of DNA sequences. Nevertheless, their solution also processes nodes of a suffix tree in relation to the strings contained in their corresponding subtrees. However, they use it to construct an <a href="https://en.wikipedia.org/wiki/Integer_programming">integer linear programming (ILP)</a> problem to then solve (i.e. optimize). Also coincidentally, the set cover problem can be stated as an ILP problem. Hence, ILP is also NP-hard, but it can be approximated via relaxation to non-integer linear programming. That approximation doesn’t necessarily ensure as good a solution as the greedy algorithm does for the scraped registry, because here it only has an approximation ratio of 11.</p> <h3 id="completeness">Completeness</h3> <p>Checking completeness is not as easy: we have to check whether every barcode in the actual registry is in our scraped answer, but knowing the former is the scraping problem itself! It seems more promising to check the contrapositive: whether every barcode <em>not</em> in our scraped answer is also <em>not</em> in the actual registry. This doesn’t sound much easier at first: most barcodes are not in the answer and iterating over them would be unrealistic.</p> <h4 id="suffix-tree-solution-2">Suffix tree solution</h4> <p>Although I have not fully worked out the algorithm for this, I believe the suffix tree used for soundness can also be useful for completeness. Namely, every branch which is <em>missing</em> in the suffix tree corresponds to a partial barcode which isn’t a substring of any barcode in the scraped answer. 
The suffix tree path (from its root) to this hypothetical missing branch serves as a query which should return zero results from the actual registry, otherwise our answer is incomplete.</p> <h4 id="better-solutions-1">Better solutions?</h4> <p>As always, such suffix-tree–based checking incurs some redundancy. For example, to prove that the barcode <code class="language-plaintext highlighter-rouge">000008</code> isn’t in the registry, it’s enough that the query <code class="language-plaintext highlighter-rouge">0000</code> returns zero results, making it unnecessary to also query <code class="language-plaintext highlighter-rouge">0008</code>, which might also be missing. However, the latter might still be necessary to rule out other substrings.</p> <p>Removing the redundancy using set cover isn’t as simple for the negative case: each of the missing branches (likely) represents a very large (complete) suffix subtree. Thus, the resulting set cover problem instance would not be realistically solvable.</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a better solution!</p> </blockquote> <h3 id="joint-verification">Joint verification</h3> <p>It is possible to exploit a single query for both soundness and completeness (as already noted in the soundness discussion). If a query returns fewer than 10 barcodes, we can confirm both that our scraped ones are included and that no others containing the queried substring are present. Thus, a joint set cover problem to cover all barcodes in the answer and all missing branches in the suffix tree might yield a not-too-bad solution overall.</p> <h2 id="update-problem">Update problem</h2> <p>I scraped the registry at some point in time (or more precisely, over a duration of time) and hypothetically even verified it to be correct soon after. 
However, the actual registry will inevitably change: new products with new barcodes will be added (and perhaps some old ones removed, although I’m not sure if the registry would ever do that — it still contains some quite old products). Hence, another interesting problem arises: how to update the scraped registry without scraping it from scratch?</p> <p>I haven’t put much thought into this, but I suspect the suffix tree will again come in handy. Maybe one can try verifying the current scraped registry and, when a mismatch is detected, do some scraping from that point, but it’s just a vague idea. I would expect the amount of extra scraping to be somewhat proportional to the size of the registry change. Of course, if the registry is completely replaced with a disjoint set of barcodes (a hypothetical worst case scenario), then it’s probably hopeless to beat scraping from scratch.</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a solution!</p> </blockquote> <h2 id="conclusion">Conclusion</h2> <p>What started as a silly exercise in scraping a registry turned into a set of quite complicated algorithmic problems. And in no way can I claim any of these to be solved. 
Therefore, I challenge the reader to come up with better solutions and quickly try them out on the locally simulated registry.<sup id="fnref:github:3"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h2 id="references">References</h2> <div class="publications"> <ol class="bibliography"><li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100">Book</abbr> </div> <div id="Gusfield_1997" class="col-sm-8"> <div class="title">Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</div> <div class="author"> Dan Gusfield </div> <div class="periodical"> 1997 </div> <div class="periodical"> </div> <div class="links"> </div> </div> </div> </li> <li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100" style="background-color:#d9d9d9"> <a href="https://dl.acm.org/conference/recomb">RECOMB</a> </abbr> </div> <div id="10.1145/565196.565229" class="col-sm-8"> <div class="title">String barcoding: uncovering optimal virus signatures</div> <div class="author"> Sam Rash and Dan Gusfield </div> <div class="periodical"> <em>In Research in Computational Molecular Biology</em>, 2002 </div> <div class="periodical"> </div> <div class="links"> <a class="abstract btn btn-sm z-depth-0" role="button">Abs</a> <a href="https://doi.org/10.1145/565196.565229" class="btn btn-sm z-depth-0" role="button">DOI</a> </div> <div class="abstract hidden"> <p>There are many critical situations when one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. 
There is a limit to the number of cells on a chip, and a range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens. Previous work addressed this problem, but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substrings lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer-linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.</p> </div> </div> </div> </li></ol> </div> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:packaging-register"> <p>They call it the packaging register, but that’s a bit of an odd translation of the Estonian term <em>pakendiregister</em>. <a href="#fnref:packaging-register" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:motivation"> <p>I don’t actually have a use for all this data, but if you do, then you’re in luck! 
I just got fascinated by the (surprisingly complex) computer science problem of doing so. <a href="#fnref:motivation" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:search-space-exercise"> <p>It might be a fun combinatorics problem to calculate the maximum possible size of a set of barcodes (up to a certain length) that satisfies this assumption. <a href="#fnref:search-space-exercise" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:github"> <p><a href="https://github.com/sim642/pandipakend">This GitHub repository</a> includes my Python code and the scraped dataset. <a href="#fnref:github" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:github:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:github:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:github:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:local-registry-large-query"> <p>Note that for queries with more than 10 results, such local simulation might not return the same 10 results that the actual registry does. In turn, this may cause some algorithms to behave slightly differently. <a href="#fnref:local-registry-large-query" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="algorithms"/><category term="programming"/><summary type="html"><![CDATA[Estonia has a deposit-refund system for drink bottles and cans (like the German Pfand). It is operated by Eesti Pandipakend whose website includes the package registry1. The registry isn’t just a list of all products whose packages are part of the system but rather a search. They call it the packaging register, but that’s a bit of an odd translation of the Estonian term pakendiregister. 
&#8617;]]></summary></entry><entry><title type="html">Securing applications with oauth2-proxy on Synology NAS</title><link href="https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas/" rel="alternate" type="text/html" title="Securing applications with oauth2-proxy on Synology NAS"/><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas</id><content type="html" xml:base="https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas/"><![CDATA[<p>A <a href="https://www.synology.com/">Synology NAS</a> can be convenient for hosting various Docker-based applications for personal use. Some applications have authentication built in, but others don’t. The latter approach is not unreasonable: it doesn’t make sense for every application to include its own custom user management and authentication.</p> <p>Nevertheless, it may be desirable to make them accessible only to authenticated users. Luckily, <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> is a (reverse) proxy service that lets us put authentication in front of any such application by delegating the authentication part to another service of choice, e.g. Google, GitHub, etc. In order to not depend on any external authentication service, it’s also possible to use the Synology DSM and its user management for authentication, just like it is for Synology’s own applications. And this post will show how to do exactly that.</p> <h2 id="introduction">Introduction</h2> <p>This post assumes that you have the Synology DSM accessible over HTTPS at <a href="https://example.com:5001">https://example.com:5001</a>, i.e. “example.com” stands for your own domain and Synology DSM is already configured with a valid HTTPS certificate for it. 
For example, this can be achieved using <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt</a> or <a href="/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/">Tailscale</a>.</p> <p>The tutorial will set up the following:</p> <ul> <li>For example’s sake, the insecure unauthenticated application will be <a href="https://github.com/postmanlabs/httpbin">httpbin</a>, which will not be exposed directly. This can be replaced with the insecure unauthenticated application of choice.</li> <li>The insecure <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> will be exposed at <a href="http://example.com:5400">http://example.com:5400</a> and will serve as a reverse proxy to <a href="https://github.com/postmanlabs/httpbin">httpbin</a> once the user is authenticated. It is insecure only in the sense of HTTP — it is still authenticated.</li> <li>The secure <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> will be exposed at <a href="https://example.com:5401">https://example.com:5401</a> via a reverse proxy to the insecure version. 
This allows the HTTPS certificate for “example.com” to be reused without having to also set up an HTTPS certificate in <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> itself.</li> <li>The authentication will be provided by Synology DSM at <a href="https://example.com:5001">https://example.com:5001</a>.</li> </ul> <h2 id="step-by-step">Step-by-step</h2> <p>All of the following steps are to be performed in the Synology DSM.</p> <h3 id="reverse-proxy">Reverse Proxy</h3> <ol> <li>Navigate to <strong>Control Panel → Login Portal → Advanced → Reverse Proxy</strong>.</li> <li> <p>Create a new reverse proxy (<strong>Create</strong>) with the following details:</p> <ul> <li>Reverse Proxy Name: “oauth2-proxy”.</li> <li><strong>Source</strong>: <ul> <li>Protocol: HTTPS.</li> <li>Hostname: “example.com”.</li> <li>Port: 5401.</li> </ul> </li> <li><strong>Destination</strong>: <ul> <li>Protocol: HTTP.</li> <li>Hostname: “localhost”.</li> <li>Port: 5400.</li> </ul> </li> </ul> </li> <li>Press “Save” and close the Reverse Proxy window.</li> </ol> <h3 id="sso-server">SSO Server</h3> <ol> <li>Install “SSO Server” from <strong>Package Center</strong>.</li> <li>Open <strong>SSO Server</strong>.</li> <li>Under <strong>General Settings</strong>, set Server URL: “example.com:5001” (the “https://” prefix is implied).</li> <li>Under <strong>Service → OIDC</strong>, Enable OIDC server.</li> <li> <p>Under <strong>Application</strong>, add a new application (<strong>Add</strong>) with the following details:</p> <ol> <li>Select an SSO protocol: OIDC.</li> <li>Application Name: “oauth2-proxy”.</li> <li>Redirect URI: “<a href="https://example.com:5401/oauth2/callback">https://example.com:5401/oauth2/callback</a>”.</li> </ol> </li> <li>Select the just-added application from the list and click “Edit” in the toolbar.</li> <li>Make note of the generated “Application ID” and “Application secret”, which will be needed in the next step. 
You can just keep the Edit window open and come back to copy them when needed.</li> </ol> <h3 id="oauth2-proxy">oauth2-proxy</h3> <h4 id="configuration">Configuration</h4> <ol> <li>In <strong>File Station</strong>, create a folder for <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration files, e.g. <code class="language-plaintext highlighter-rouge">/docker/oauth2-proxy</code>.</li> <li>Using <strong>Text Editor</strong>, create the files <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> and <code class="language-plaintext highlighter-rouge">authenticated_emails</code> with the contents described below, and save them into the previously-created folder.</li> </ol> <h5 id="oauth2-proxycfg"><code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code></h5> <p>Copy the following:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">http_address</span><span class="p">=</span><span class="s">"0.0.0.0:5400"</span>
<span class="py">reverse_proxy</span><span class="p">=</span><span class="s">"true"</span>
<span class="py">upstreams</span><span class="p">=</span><span class="s">"http://httpbin"</span>

<span class="c"># cookies</span>
<span class="py">cookie_secret</span><span class="p">=</span><span class="s">"TODO"</span>
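<span class="c"># e.g. generated (per the oauth2-proxy docs) with:</span>
<span class="c"># python -c 'import os,base64; print(base64.urlsafe_b64encode(os.urandom(32)).decode())'</span>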
<span class="py">cookie_secure</span><span class="p">=</span><span class="s">"true"</span>
<span class="py">cookie_domains</span><span class="p">=[</span><span class="s">"example.com"</span><span class="p">]</span>
<span class="py">whitelist_domains</span><span class="p">=[</span><span class="s">"example.com:5401"</span><span class="p">]</span>

<span class="c"># Synology</span>
<span class="py">provider</span><span class="p">=</span><span class="s">"oidc"</span>
<span class="py">oidc_issuer_url</span><span class="p">=</span><span class="s">"https://example.com:5001/webman/sso"</span>
<span class="py">client_id</span><span class="p">=</span><span class="s">"TODO"</span>
<span class="py">client_secret</span><span class="p">=</span><span class="s">"TODO"</span>
<span class="py">redirect_url</span><span class="p">=</span><span class="s">"https://example.com:5401/oauth2/callback"</span>
<span class="py">code_challenge_method</span><span class="p">=</span><span class="s">"S256"</span>
<span class="py">skip_provider_button</span><span class="p">=</span><span class="s">"true"</span>

<span class="c"># authentication</span>
<span class="py">oidc_email_claim</span><span class="p">=</span><span class="s">"sub"</span>
<span class="py">authenticated_emails_file</span><span class="p">=</span><span class="s">"/authenticated_emails"</span>
</code></pre></div></div> <p>Replace all TODOs as follows:</p> <ul> <li>Replace <code class="language-plaintext highlighter-rouge">cookie_secret</code> with a freshly-generated cookie secret, as described <a href="https://oauth2-proxy.github.io/oauth2-proxy/configuration/overview#generating-a-cookie-secret">here</a>.</li> <li>Replace <code class="language-plaintext highlighter-rouge">client_id</code> with “Application ID” from SSO Server.</li> <li>Replace <code class="language-plaintext highlighter-rouge">client_secret</code> with “Application secret” from SSO Server.</li> </ul> <h5 id="authenticated_emails"><code class="language-plaintext highlighter-rouge">authenticated_emails</code></h5> <p>Write a list of authorized Synology usernames, one on each line. For example:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myuser
</code></pre></div></div> <blockquote class="block-danger"> <p>Despite the file and the <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration calling them emails, we don’t use emails here. Using emails from Synology users would be insecure because users can arbitrarily change their email. Hence, we make <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> use usernames in place of emails using <code class="language-plaintext highlighter-rouge">oidc_email_claim="sub"</code>. I learned about this trick from Okta’s tutorial <a href="https://developer.okta.com/blog/2022/07/14/add-auth-to-any-app-with-oauth2-proxy">Add Auth to Any App with OAuth2 Proxy</a>.</p> </blockquote> <h5 id="permissions">Permissions</h5> <ol> <li>In <strong>File Station</strong>, navigate to the previously-created folder for <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration files.</li> <li>Right-click on <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> and select “Properties”.</li> <li>Switch to “Permission” tab.</li> <li> <p>Press “Create” in the toolbar and enter the following details:</p> <ul> <li>User or group: Everyone.</li> <li>Type: Allow.</li> <li>Permission: Read.</li> </ul> </li> <li>Press “Done” and then “Save”.</li> <li>Repeat for <code class="language-plaintext highlighter-rouge">authenticated_emails</code>.</li> </ol> <p>This is <a href="https://kb.synology.com/en-global/DSM/tutorial/Docker_container_cant_access_the_folder_or_file">recommended by Synology</a> to fix permissions issues with files mapped into Docker containers (in the next step).</p> <h4 id="container">Container</h4> <ol> <li>Install “Container Manager” from <strong>Package Center</strong>.</li> <li>Navigate to <strong>Container Manager → Project</strong>.</li> <li> <p>Create a new project (<strong>Create</strong>) with the following details:</p> <ul> <li>Project name: “oauth2-proxy”.</li> <li>Path: “<code 
class="language-plaintext highlighter-rouge">/docker/oauth2-proxy</code>”.</li> <li>Source: Create docker-compose.yml.</li> </ul> </li> <li>Paste the following Docker Compose file into the text box below: <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3.0'</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">oauth2-proxy</span><span class="pi">:</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">oauth2-proxy</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">quay.io/oauth2-proxy/oauth2-proxy:v7.10.0</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s">--config /oauth2-proxy.cfg</span>
    <span class="na">hostname</span><span class="pi">:</span> <span class="s">oauth2-proxy</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">./oauth2-proxy.cfg:/oauth2-proxy.cfg:ro"</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">./authenticated_emails:/authenticated_emails:ro"</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">5400:5400/tcp</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">httpbin</span>
  <span class="na">httpbin</span><span class="pi">:</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">httpbin</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">kennethreitz/httpbin</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">ports</span><span class="pi">:</span> <span class="pi">[]</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
<span class="na">networks</span><span class="pi">:</span>
  <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
</code></pre></div> </div> </li> <li>Press “Next”, “Next” and “Done”.</li> <li>Wait for the Docker containers to be downloaded and started.</li> </ol> <h2 id="conclusion">Conclusion</h2> <p>If all went well, then you can now navigate to <a href="https://example.com:5401">https://example.com:5401</a> in your browser. This should first redirect to Synology DSM login. However, since you should already be logged into Synology DSM, you will not be prompted to log in again. Finally, you will be redirected back to <a href="https://example.com:5401">https://example.com:5401</a>, but this time you’ll see the <a href="https://github.com/postmanlabs/httpbin">httpbin</a> application instead.</p> <p>You can replace the <a href="https://github.com/postmanlabs/httpbin">httpbin</a> container with your desired Docker-based application. Importantly, it should <em>not</em> expose any <code class="language-plaintext highlighter-rouge">ports</code>. Instead, it will only be accessed via the Docker network (also called httpbin in this case) by <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a>, which has access to the same network. You may need to adapt <code class="language-plaintext highlighter-rouge">upstreams</code> in <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> for your application and restart the <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> container for changes to take effect.</p>]]></content><author><name></name></author><category term="synology"/><category term="networking"/><category term="security"/><category term="tutorial"/><summary type="html"><![CDATA[A Synology NAS can be convenient for hosting various Docker-based applications for personal use. Some applications have authentication built in, but others don’t. 
The latter approach is not unreasonable: it doesn’t make sense for every application to include its own custom user management and authentication.]]></summary></entry><entry><title type="html">Highlighting parts of lines in minted</title><link href="https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted/" rel="alternate" type="text/html" title="Highlighting parts of lines in minted"/><published>2025-07-18T00:00:00+00:00</published><updated>2025-07-18T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted</id><content type="html" xml:base="https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted/"><![CDATA[<p><a href="/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/">To repeat</a>, I prefer to use the <a href="https://github.com/gpoore/minted/">minted</a> package for typesetting syntax-highlighted code in LaTeX. For example:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; i &lt; 100; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>This is rendered as:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>For completeness, here’s the preamble used for examples throughout this post:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>minted<span class="p">}</span>
<span class="k">\setminted</span><span class="p">{</span>style=tango<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[svgnames]</span><span class="p">{</span>xcolor<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>inconsolata<span class="p">}</span>
</code></pre></div></div> <h2 id="full-line-highlight">Full line highlight</h2> <p>Occasionally it is useful to highlight a particular line of code. Conveniently, <a href="https://github.com/gpoore/minted/">minted</a> provides the <code class="language-plaintext highlighter-rouge">highlightlines</code> option to do so by line number:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos,highlightlines=1]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; i &lt; 100; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to no highlight (hover/swipe across the image):<sup id="fnref:highlightlines-offset"><a href="#fn:highlightlines-offset" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/line-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/line-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Full line highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h2 id="partial-line-highlight">Partial line highlight</h2> <p>Occasionally it would be useful to not highlight the entire line of code but only a 
part of it. Unfortunately, <a href="https://github.com/gpoore/minted/">minted</a> does not provide a straightforward way to do so. So let’s try to come up with a custom solution!</p> <h3 id="attempt-1">Attempt 1</h3> <p>The obvious way is to use a <code class="language-plaintext highlighter-rouge">\colorbox</code> within <a href="https://github.com/gpoore/minted/">minted</a>’s <code class="language-plaintext highlighter-rouge">escapeinside</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span>#1<span class="p">}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The color <code class="language-plaintext highlighter-rouge">LightCyan</code> is just what <a href="https://github.com/gpoore/minted/">minted</a> has as default for <code class="language-plaintext highlighter-rouge">highlightcolor</code>.</p> <p>The following image comparison illustrates the change in rendering compared to full line highlight:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/line-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/line-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Full line highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 1)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>This has multiple issues. 
For one, the characters in the <code class="language-plaintext highlighter-rouge">\colorbox</code> don’t align with the rest of the columns of monospaced text.</p> <h3 id="attempt-2">Attempt 2</h3> <p>This is caused by the default value of <code class="language-plaintext highlighter-rouge">\fboxsep</code>, which is used to pad the contents of the <code class="language-plaintext highlighter-rouge">\colorbox</code> on all sides. To avoid the padding causing misalignment, it can be set to <code class="language-plaintext highlighter-rouge">0pt</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span>#1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to the first attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 1)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 2)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>The columns of monospaced characters now align, but the highlighting is very tight, also vertically (unlike <code class="language-plaintext highlighter-rouge">highlightlines</code>).</p> <h3 id="attempt-3">Attempt 3</h3> <p>We could do some trickery to effectively get different horizontal and vertical <code class="language-plaintext highlighter-rouge">\fboxsep</code>. 
However, it’s much easier to just insert a <code class="language-plaintext highlighter-rouge">\strut</code>, which is an invisible zero-width box that (more-or-less) accounts for the line height:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span><span class="k">\strut</span> #1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to the second attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 2)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 3)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>This fixes the alignment and padding issues,<sup id="fnref:strut-offset"><a href="#fn:strut-offset" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> but there’s still a noticeable problem: the text, which we have highlighted for the user using the background color, is not being code highlighted by <a href="https://github.com/gpoore/minted/">minted</a> at all.</p> <h3 id="attempt-4-solution">Attempt 4 (solution)</h3> <p>As far as I 
know, it’s not really possible to put <code class="language-plaintext highlighter-rouge">\mintinline</code> inside the <code class="language-plaintext highlighter-rouge">minted</code> environment to somehow try to fix this. Instead, we can do something strange with how our <code class="language-plaintext highlighter-rouge">\codehighlight</code> macro is used:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span><span class="k">\strut</span> #1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{{</span>|i &lt; 100|<span class="p">}}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>Previously, we had a single <code class="language-plaintext highlighter-rouge">escapeinside</code> that contained the macro applied to its contents. Now, there are two <code class="language-plaintext highlighter-rouge">escapeinside</code>s: one before the contents and one after:</p> <ol> <li>The one before only calls the macro and starts its argument. Importantly, this needs double <code class="language-plaintext highlighter-rouge">{{</code> because <a href="https://github.com/gpoore/minted/">minted</a> puts the <code class="language-plaintext highlighter-rouge">escapeinside</code> contents into a group (i.e. between braces like <code class="language-plaintext highlighter-rouge">{\codehighlight{{}</code>). The first brace starts the macro argument, the second is only there to cancel out the group-closing brace. Without the latter, <code class="language-plaintext highlighter-rouge">\codehighlight</code> would just be given an empty argument.</li> <li>The one after only ends the macro argument. Analogously, this needs double <code class="language-plaintext highlighter-rouge">}}</code> (i.e. grouped like <code class="language-plaintext highlighter-rouge">{}}}</code>). 
The first brace is only there to cancel out the group-opening brace and the second ends the macro argument.</li> </ol> <p>The following image comparison illustrates the change in rendering compared to the third attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 3)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 4)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>The strange amalgamation fixes the code highlighting “within” <code class="language-plaintext highlighter-rouge">\codehighlight</code>.</p> <p>Finally, this image comparison illustrates the change in rendering compared to no highlight:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> 
<picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 4)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:highlightlines-offset"> <p>Only now, when producing the comparison images for this post, I noticed that <code class="language-plaintext highlighter-rouge">highlightlines</code> seems to slightly reduce line spacing. I’m not sure why, seems like a bug. <a href="#fnref:highlightlines-offset" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:strut-offset"> <p>Although our <code class="language-plaintext highlighter-rouge">\colorbox</code> now also seems to slightly reduce line spacing. I’m not sure why, but this will be irrelevant in the end. 
<a href="#fnref:strut-offset" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[To repeat, I prefer to use the minted package for typesetting syntax-highlighted code in LaTeX. For example: \begin{minted}[linenos]{c} for (int i=0; i &lt; 100; i++) { printf("i = %d\n", i); } \end{minted} This is rendered as:]]></summary></entry><entry><title type="html">Referencing lines in fancyvrb/minted</title><link href="https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/" rel="alternate" type="text/html" title="Referencing lines in fancyvrb/minted"/><published>2025-05-01T00:00:00+00:00</published><updated>2025-06-03T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted</id><content type="html" xml:base="https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/"><![CDATA[<p>I prefer to use the <a href="https://github.com/gpoore/minted/">minted</a> package for typesetting syntax-highlighted code in LaTeX. Occasionally it is necessary to refer to a particular line of code by its number in the accompanying text. Of course one can just hard-code the line numbers into the text, but that’s not very TeX-like. This becomes very error-prone when the code needs to be modified and all hard-coded line number references manually synchronized. So I want this to be automatic, just like numbering and referencing of sections, figures, etc.</p> <p>Luckily, <a href="https://github.com/gpoore/minted/">minted</a> builds on the much older <a href="https://ctan.org/pkg/fancyvrb?lang=en">fancyvrb</a> package, which provides basic support for referencing lines. 
For example, the desired line can be marked with <code class="language-plaintext highlighter-rouge">\label</code> via <code class="language-plaintext highlighter-rouge">escapeinside</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>python<span class="p">}</span>
print("foo")
print("bar")|<span class="k">\label</span><span class="p">{</span>ln:bar<span class="p">}</span>|
print("baz")
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>Then elsewhere <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> expands to “2”.</p> <h2 id="problem">Problem</h2> <p>Unfortunately, that’s also where the support ends. In particular, fancyvrb is <strong>incompatible with</strong> two other often-used packages:</p> <ol> <li> <p>The <strong><a href="https://ctan.org/pkg/hyperref?lang=en">hyperref</a></strong> package makes <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> a hyperlink, but it’s a link to the wrong place! The link doesn’t go to the specific line of code, nor even to the beginning of the <code class="language-plaintext highlighter-rouge">minted</code> environment it is in. Instead, it goes to whatever happens to be the previous hyperref anchor at the point of <code class="language-plaintext highlighter-rouge">\label{ln:bar}</code>, e.g. a previous section heading, figure caption, etc. If you’re lucky, this might at least be on the same page as the code, but it could also be many pages before.</p> </li> <li> <p>The <strong><a href="https://ctan.org/pkg/cleveref?lang=en">cleveref</a></strong> package adds <code class="language-plaintext highlighter-rouge">\cref</code> and friends, which prefix the reference number with its kind, e.g. <code class="language-plaintext highlighter-rouge">\cref{sec:introduction}</code> might expand to “section 1” instead of just “1”. The incompatibility with fancyvrb is similar to the one with hyperref: <code class="language-plaintext highlighter-rouge">\cref{ln:bar}</code> expands to a complete reference to something before the code, e.g. “section 1”. Not only is the “section” prefix wrong, but the number isn’t even the line number “2”. 
(At least it’s self-consistent: the number is for whatever is being referenced instead.)</p> </li> </ol> <h3 id="texnical-details">TeXnical details</h3> <p>Deep inside LaTeX, the problem stems from the fact that both hyperref and cleveref achieve their functionality by redefining <code class="language-plaintext highlighter-rouge">\refstepcounter</code>. However, fancyvrb does not use that standard command and instead defines its own version:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\def\FV</span>@refstepcounter#1<span class="p">{</span>
  <span class="k">\stepcounter</span><span class="p">{</span>#1<span class="p">}</span>
  <span class="k">\protected</span>@edef<span class="k">\@</span>currentlabel<span class="p">{</span><span class="k">\csname</span> p@#1<span class="k">\endcsname\arabic</span><span class="p">{</span>FancyVerbLine<span class="p">}}</span>
<span class="p">}</span>
</code></pre></div></div> <p>It explicitly uses <code class="language-plaintext highlighter-rouge">\arabic{FancyVerbLine}</code> instead of <code class="language-plaintext highlighter-rouge">\theFancyVerbLine</code>, which the standard definition would use. As far as I can tell, it needs to do that because it defines the latter as:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\def\theFancyVerbLine</span><span class="p">{</span><span class="k">\rmfamily\tiny\arabic</span><span class="p">{</span>FancyVerbLine<span class="p">}}</span>
</code></pre></div></div> <p>which is weird because it puts the number formatting (for display in the code environment) into the counter value formatting itself, which no other sensible counter would do. And <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> is explicitly there to <em>not</em> make the number tiny at <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code>.</p> <h2 id="non-solutions">Non-solutions</h2> <p>There have been some attempts at avoiding the incompatibilities:</p> <ol> <li><a href="https://www.reddit.com/r/LaTeX/comments/zaohbs/how_do_i_reference_a_line_of_code_in_minted/">A reddit thread</a> suggests using <code class="language-plaintext highlighter-rouge">|\phantomsection\label{ln:bar}|</code> in the <code class="language-plaintext highlighter-rouge">minted</code> environment instead. The <code class="language-plaintext highlighter-rouge">\phantomsection</code> provided by hyperref inserts a fresh anchor which <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> will link to. This comes with two problems: <ol> <li>It’s outright annoying to have to <em>manually</em> insert <code class="language-plaintext highlighter-rouge">\phantomsection</code> before every <code class="language-plaintext highlighter-rouge">\label</code> in code.</li> <li>The anchor inserted by <code class="language-plaintext highlighter-rouge">\phantomsection</code> is not at the beginning of the line of code, but at the specific column where the escaped label is placed. 
One could put all code line labels at the beginning of lines, but that makes the code even less readable: at the end of lines the code at least maintains its indentation in the LaTeX sources.</li> </ol> </li> <li><a href="https://github.com/muzimuzhi/latex-examples/blob/78e3e1a55d30c648efba74dd99a94fb17b9da3a7/examples/fancyvrb-improvements.tex">One file on GitHub</a> uses <code class="language-plaintext highlighter-rouge">\let\FV@refstepcounter\refstepcounter</code> to make fancyvrb use the standard command and thus allow hyperref/cleveref to modify it as usual. In order to avoid the tiny numbers at <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code>, it goes on to patch fancyvrb in various places to move the <code class="language-plaintext highlighter-rouge">\tiny</code> to a more appropriate place by adding <code class="language-plaintext highlighter-rouge">numberstyle</code> customization option. This is morally the right approach, but unfortunately also comes with two problems: <ol> <li>It breaks fancyvrb’s <code class="language-plaintext highlighter-rouge">firstnumber</code> option and causes the first two lines to have the same number. Seems like fancyvrb’s logic is very particular to its oddities.</li> <li>The resulting hyperref anchors are based on the <code class="language-plaintext highlighter-rouge">FancyVerbLine</code> counter, which isn’t globally unique. 
So <code class="language-plaintext highlighter-rouge">pdflatex</code> warns about duplicate destinations and all of them link into the first <code class="language-plaintext highlighter-rouge">minted</code> environment, not the one where the <code class="language-plaintext highlighter-rouge">\label</code> actually is.</li> </ol> </li> <li><a href="https://github.com/wsmoses/Paper-MEng/blob/4272960994d5b43a4959225d33999b59facfb1c3/codehilite.sty#L37-L47">Another file on GitHub</a> only patches <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> to add <code class="language-plaintext highlighter-rouge">\refstepcounter</code> of an additional global counter, avoiding the last issue. But it’s also not well-behaved, at least with minted: extra empty space appears before <code class="language-plaintext highlighter-rouge">minted</code> environments. I believe this is because minted works in two phases: <ol> <li>The contents of a <code class="language-plaintext highlighter-rouge">minted</code> environment are not typeset, but written to a file (to run Pygments on). While doing so, fancyvrb still steps the counter which inserts spurious hyperref anchors but nothing else is typeset yet.</li> <li>After running Pygments, its result is somehow input into some fancyvrb environment for actual typesetting. This is what actually displays the syntax-highlighted code along with the line numbers (which are produced by stepping the line counter again).</li> </ol> </li> <li><a href="https://tex.stackexchange.com/a/410667/383946">A TeX StackExchange</a> answer defines alternative <code class="language-plaintext highlighter-rouge">\label</code> and <code class="language-plaintext highlighter-rouge">\ref</code> commands to use just for lines of code, which isn’t entirely satisfactory (one doesn’t need separate commands for sections, figures, etc.). 
It bypasses the usual hyperref anchor mechanism and uses <code class="language-plaintext highlighter-rouge">\hypertarget</code> and <code class="language-plaintext highlighter-rouge">\hyperlink</code> to directly work with custom PDF destinations. It also tries to provide a hint for cleveref, but admits that it still produces a wrong reference.</li> </ol> <h2 id="solution">Solution</h2> <p>After digging deep into the implementation of fancyvrb, minted, hyperref and cleveref to (try to) understand how they work, I managed to put together a solution which seems to achieve the desired functionality without any of the downsides listed above. I wrapped my solution into <strong>my new <a href="https://github.com/sim642/fancyvrbref">fancyvrbref</a> package</strong>. I haven’t (yet) published it on CTAN, but you can just copy it into your project for the time being.</p> <p>The solution is the following:</p> <ol> <li>For hyperref compatibility, the fancyvrb internal command for actually typesetting a line is patched to insert a globally unique anchor with <code class="language-plaintext highlighter-rouge">FancyVerbLine*</code> prefix. 
Since this isn’t in <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code>, it doesn’t screw up minted.</li> <li>For cleveref compatibility, the counter <code class="language-plaintext highlighter-rouge">FancyVerbRefLine</code> is defined as an alias for <code class="language-plaintext highlighter-rouge">FancyVerbLine</code>, but without tiny font, and the <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> command is extended to define <code class="language-plaintext highlighter-rouge">\cref@currentlabel</code> (for cleveref), <code class="language-plaintext highlighter-rouge">\@currentlabel</code> (for consistency with <code class="language-plaintext highlighter-rouge">\ref</code>), and <code class="language-plaintext highlighter-rouge">\@currentcounter</code> (for cleveref with LaTeX2e format since 2024-11-01). This avoids modifying <code class="language-plaintext highlighter-rouge">\theFancyVerbLine</code>.</li> </ol> <p>Let me know if you find this package useful or find any issues with it!</p> <hr/>]]></content><author><name></name></author><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[I prefer to use the minted package for typesetting syntax-highlighted code in LaTeX. Occasionally it is necessary to refer to a particular line of code by its number in the accompanying text. Of course one can just hard-code the line numbers into the text, but that’s not very TeX-like. This becomes very error-prone when the code needs to be modified and all hard-coded line number references manually synchronized. 
So I want this to be automatic, just like numbering and referencing of sections, figures, etc.]]></summary></entry><entry><title type="html">Trends in UniTartuCS theses</title><link href="https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses/" rel="alternate" type="text/html" title="Trends in UniTartuCS theses"/><published>2025-04-13T00:00:00+00:00</published><updated>2025-07-06T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses</id><content type="html" xml:base="https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses/"><![CDATA[<p>The <a href="https://cs.ut.ee/en/">Institute of Computer Science</a> at the <a href="https://ut.ee/en/">University of Tartu</a> (UniTartuCS for short) has a (new) <a href="https://thesis.cs.ut.ee/">register for bachelor’s and master’s theses</a>. Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.</p> <p>As of July 6, 2025,<sup id="fnref:updated"><a href="#fn:updated" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> the register contained <strong>2555 theses in total</strong>. I have excluded some from the following analysis:</p> <ul> <li>8 theses from before 2010, because it seems like the data for those years is incomplete.</li> </ul> <p>This leaves <strong>2547 theses</strong> for the analysis. Note that the data for 2025 is not yet final.</p> <h2 id="word-processors-used">Word processors used</h2> <p>The institute provides thesis templates for Microsoft Word and LaTeX. I wanted to find out how much each of them is used by the students.</p> <p>This is complicated by the fact that theses are submitted as PDFs. Luckily, PDF file metadata contains two fields which give a lot of insight: PDF creator and PDF producer. 
Although the content of these fields is not standardized, it’s not as messy as <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Browser_detection_using_the_user_agent">web browser User-Agent headers</a> (yet). Working with the data, I reached the following classification:</p> <ul> <li><strong>Microsoft Word</strong> if the PDF creator matches <code class="language-plaintext highlighter-rouge">Microsoft®? (Office )?Word|Acrobat PDFMaker .* for Word</code>.</li> <li><strong>TeX</strong> if the PDF creator contains <code class="language-plaintext highlighter-rouge">TeX</code>.</li> <li><strong>Google Docs</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">Google Docs|Skia/PDF</code>.</li> <li><strong>LibreOffice</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">(Libre|Open)Office</code>.</li> <li><strong>Quartz</strong> if the PDF creator contains <code class="language-plaintext highlighter-rouge">Quartz PDFContext</code>. These are somehow created by macOS, but it’s unclear to me how.</li> <li><strong>Print</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer</code>. These are various PDF printers.</li> <li><strong>Unknown</strong> otherwise.</li> </ul> <h3 id="overall">Overall</h3> <p>First, let’s look at the overall word processor breakdown:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "total_count"
      }],
      "groupby": []
    },
    {
      "calculate": "datum.count / datum.total_count",
      "as": "fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 400,
  "mark": "arc",
  "encoding": {
    "theta": {
      "field": "count",
      "type": "quantitative",
      "aggregate": "sum",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "fraction",
        "format": ".0%",
        "title": " Theses"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This shows that <strong>Microsoft Word is used slightly more than LaTeX</strong>. Notably, Word-like WYSIWYG editors make up the majority.</p> <p>Now, let’s dig a little deeper to see how the breakdown depends on the year and the curriculum.</p> <h3 id="by-year">By year</h3> <p>Second, let’s look at the word processor breakdown across the years:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "defence_year"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "year_count"
      }],
      "groupby": ["defence_year"]
    },
    {
      "calculate": "datum.count / datum.year_count",
      "as": "year_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "year_fraction",
        "format": ".0%",
        "title": "Theses (of year)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This reveals two main trends:</p> <ol> <li>OpenOffice/LibreOffice usage has mostly diminished.</li> <li><strong>Google Docs usage has become widespread.</strong></li> </ol> <p>The latter is worrying because Google Docs is (in my opinion) inadequate for typesetting a thesis. Having supervised and reviewed a number of theses (although relatively few compared to senior staff members), I can often tell from the poor and inconsistent formatting alone that a thesis has been typeset in Google Docs. Importing the Microsoft Word template into Google Docs is a lossy conversion because Docs has limited features and customizability, even compared to Word.</p> <h3 id="by-curriculum">By curriculum</h3> <p>Third, let’s look at the relationship between word processor usage and curricula:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": {
    "step": 50
  },
  "height": {
    "step": 50
  },
  "mark": "rect",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_classification",
      "title": "PDF creator",
      "sort": [
        "TeX",
        "Unknown",
        "Print",
        "Quartz",
        "Google Docs",
        "LibreOffice",
        "Microsoft Word"
      ]
    },
    "color": {
      "aggregate": "count",
      "type": "quantitative",
      "scale": {
        "type": "log"
      },
      "title": "Theses"
    },
    "tooltip": {
      "aggregate": "count",
      "type": "quantitative"
    }
  }
}
</code></pre> <p>Although the heatmap shows some trends, the logarithmic color scale<sup id="fnref:curriculum-log"><a href="#fn:curriculum-log" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> makes exact comparisons difficult. Thus, let’s look at the word processor breakdown across different curricula in a different way:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "curriculum"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "curriculum_count"
      }],
      "groupby": ["curriculum"]
    },
    {
      "calculate": "datum.count / datum.curriculum_count",
      "as": "curriculum_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "curriculum_fraction",
        "format": ".0%",
        "title": "Theses (of curriculum)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This reveals the following:</p> <ol> <li>Over 70% of “BSc - Computer Science” students use Word-like WYSIWYG editors and only 20% use LaTeX.</li> <li><strong>LaTeX usage is most popular among “MSc - Computer Science” and “MSc - Data Science” students.</strong> This is probably motivated by the need to typeset more mathematics or do more data visualization.</li> <li>LaTeX usage is (almost) nonexistent among “MSc - Conversion Master in IT” and “MA - Teacher of Mathematics and Informatics” students. This is probably because students in those curricula are less technical.</li> </ol> <h2 id="page-count-by-curriculum">Page count by curriculum</h2> <p>There’s another piece of PDF file metadata that can be analyzed: PDF page count. It only makes sense to consider curricula separately for this because the expected page counts (set by the guidelines) differ:</p> <ul> <li>Bachelor’s theses should be ~20 pages (excluding appendices).</li> <li>Master’s theses should be 40-50 pages (excluding appendices).</li> </ul> <p>Since the PDF files also contain the appendices, the PDF page count can be expected to be higher.</p> <p>From the following plots I’ve additionally excluded the following outliers:</p> <ul> <li>A 373-page thesis, because it would screw with the scale of the plots.</li> <li>All theses with under 11 pages, because these appear to be abstracts for theses with publishing restrictions and would skew the results.</li> </ul> <h3 id="overall-1">Overall</h3> <p>First, let’s look at the thesis page count statistics by curricula:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 &amp;&amp; datum.pdf_pages &gt;= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "height": 350,
  "mark": {
    "type": "boxplot",
    "size": 30,
    "ticks": true
  },
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "title": "Pages"
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative"
    }
  }
}
</code></pre> <h3 id="by-year-1">By year</h3> <p>Second, let’s look at the thesis page count average across the years, still by curricula: <em>(click on a curriculum name in the legend for a more focused view)</em></p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 &amp;&amp; datum.pdf_pages &gt;= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 650,
  "height": 350,
  "mark": {
    "type": "line",
    "point": true
  },
  "params": [{
    "name": "curriculum_name",
    "select": {"type": "point", "fields": ["curriculum_name"]},
    "bind": "legend"
  }],
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "title": "Pages (average)"
    },
    "color": {
      "field": "curriculum_name",
      "type": "nominal",
      "title": "Curriculum",
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "format": ".1f"
    },
    "opacity": {
      "condition": {"param": "curriculum_name", "value": 1},
      "value": 0.2
    }
  }
}
</code></pre> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:updated"> <p>The post has been updated with 2025 data: initially the data was as of March 16, 2025 and excluded theses from 2025. None of the findings have changed with the update. <a href="#fnref:updated" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:curriculum-log"> <p>The logarithmic scale is necessary because the thesis counts differ in orders of magnitude. A linear color scale would be dominated by a few most popular combinations. <a href="#fnref:curriculum-log" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="teaching"/><category term="university"/><category term="typesetting"/><category term="latex"/><summary type="html"><![CDATA[The Institute of Computer Science at the University of Tartu (UniTartuCS for short) has a (new) register for bachelor’s and master’s theses. 
Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.]]></summary></entry><entry><title type="html">My (not-so-great) experience with switching Android phones in 2025</title><link href="https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025/" rel="alternate" type="text/html" title="My (not-so-great) experience with switching Android phones in 2025"/><published>2025-03-21T00:00:00+00:00</published><updated>2025-05-01T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025</id><content type="html" xml:base="https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025/"><![CDATA[<p>At the beginning of February I switched from my old <a href="https://www.gsmarena.com/samsung_galaxy_a52-10641.php">Samsung Galaxy A52</a> to a new <a href="https://www.gsmarena.com/samsung_galaxy_s25+-13609.php">Samsung Galaxy S25+</a>, which had just been released. One can find endless complaining online about how the S25 series isn’t worth it, which might be the case for someone coming from the S24 series. However, I’m doing a 4-year leap from a mid-range phone that has become genuinely problematic:</p> <ol> <li>There’s a <a href="https://www.reddit.com/r/GalaxyA52/comments/15af16f/a52_back_panel_peeling/">well</a>-<a href="https://www.reddit.com/r/GalaxyA52/comments/15klqaq/a52s_adhesive_failing_after_six_months/">known</a> <a href="https://www.reddit.com/r/GalaxyA52/comments/x6ijmb/anyone_else_have_this_issue_my_a52_4gs_back_is/">issue</a> with the A52’s plastic back panel coming off.
This issue also started to develop on my phone in July 2024, but I alleviated it by getting a phone case (for the first time in my life).<sup id="fnref:a52case"><a href="#fn:a52case" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></li> <li>The A52 got its <a href="https://www.androidupdatetracker.com/p/samsung-galaxy-a52">last Android update (to 14) in January 2024</a>.</li> <li>I’ve increasingly found it frustratingly sluggish.</li> <li>Its <em>virtual</em> proximity sensing is quite inaccurate: the screen often activates in the pocket and even triggers presses. In particular, such presses would often mess with media controls which are directly accessible from the lock screen (e.g. randomly seeking, skipping to the next podcast).</li> </ol> <p>Anyway, I digress. I used Samsung Smart Switch to transfer everything from the old phone to the new one. Unfortunately, <em>everything</em> does not get automatically transferred, and I’m writing this post to document/complain/rant about everything that did not go smoothly. This wasn’t my first time switching Android phones, by the way: in 2021 I went from a <a href="https://www.gsmarena.com/samsung_galaxy_s6_edge+-7467.php">Samsung Galaxy S6 edge+</a> to the A52. I don’t recall that switch being as painful as this one, but I may have blocked out those memories or I simply wasn’t yet using many of the problematic apps at the time.</p> <p>Before getting into the bad, I want to briefly touch on the good. I was particularly glad to see that all of my workout data in <a href="https://www.gymrun.app/">GymRun</a> was smoothly and automatically transferred. I had prepared myself for something worse and had made explicit backups on my old phone, such that I could restore them on the new one.
Luckily that didn’t turn out to be necessary.</p> <p>But now to the bad…</p> <h2 id="meta-apps">Meta apps</h2> <h3 id="whatsapp">WhatsApp</h3> <p>Transferring WhatsApp data, particularly chat history, between phones is notorious for being problematic and I was aware of that. Even Smart Switch has an explicit screen about it. So, on the old phone I made sure to have chat history backed up to a Google Drive account. But that didn’t prepare me for what was going to happen.</p> <p>Following very reasonable advice online, I moved my SIM card to my new phone before ever starting it up, such that I could set it up with everything in place. After logging into WhatsApp with my phone number on the new phone, my chats were there, but all with empty history! Digging into the respective menus (Settings → Chats → Chat backup), I see that there isn’t any backup.</p> <p>Naturally, I go back to WhatsApp on my old phone to check. Surprisingly, this turns into a whole fiasco, although I cannot recall the exact sequence of events and all the details. Basically, because I’ve already logged into WhatsApp on my new phone, the old one doesn’t let me into the app to even check the backup (or do any backup-related things) and wants me to log in again. But at this point my SIM is in the new phone. Somehow the login still works, at least initially.
I guess WhatsApp is fine with not re-verifying the phone number or even having the SIM?</p> <h4 id="chat-backup-view">Chat backup view</h4> <p>Anyway, here’s roughly what the Chat backup view looked like on the old phone:</p> <div class="row justify-content-center mt-3"> <div class="col-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/Screenshot_20250216_102524_WhatsApp-480.webp 480w,/assets/Screenshot_20250216_102524_WhatsApp-800.webp 800w,/assets/Screenshot_20250216_102524_WhatsApp-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/Screenshot_20250216_102524_WhatsApp.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="WhatsApp Chat backup view" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>I had pressed “Back up”, I had selected Google for backup storage, I had selected the Google account and authorized it. <strong>But that’s not enough!</strong> Given the <em>default</em> Frequency setting, I should’ve pressed “Back up” again after setting up Google storage for WhatsApp backups.</p> <p>Perhaps I should’ve noticed the “Last Backup: Never” but I would argue that this view is weird and unintuitive. Following it top-down yields the following order of operations:</p> <ol> <li>It first describes backing up to Google for the purpose of switching phones. Great, that’s exactly what I want!</li> <li>Then there’s a big green button to supposedly do it. But at this point only a local backup would be made. (There’s no description of local backups anywhere, by the way.)</li> <li>Then I can choose and authorize a Google account.</li> </ol> <p>In my opinion, the sensible design would be to have Google account selection before doing the backup because it’s necessary for the use case described at the top.</p> <p>Furthermore, the backup status text above the “Back up” button is confusing. 
The “Last Backup” there seems to refer to the last <em>Google</em> backup, while “Local” seems to refer to the last <em>local</em> backup (again, unexplained).</p> <p>Anyway, I pressed “Back up” again to have chat history backed up to Google for real this time.</p> <h4 id="race-condition">Race condition</h4> <p>Now I want to restore the backup I just made on my new phone. Because I logged into WhatsApp on my old phone again, it logged the new phone out. Since I already got into the old phone, I think I triggered the login process on the new phone before I was done with all the backup stuff on the old one. And this seems to have caused a race condition.</p> <p>The new phone was in the middle of the login process: my phone number was already entered (probably just autofilled from before) and I pressed some button to continue. And then <strong>it just got stuck “Initializing…”</strong>:</p> <ul> <li>Killing WhatsApp didn’t help: it opened right back up in the middle of “Initializing…”.</li> <li>Deleting WhatsApp data didn’t help either: the “Clear data” button in Android’s settings for WhatsApp just took me back to WhatsApp, which was “Initializing…”. I expected to see the usual warning about causing loss of data and unexpected behavior before agreeing, but somehow WhatsApp does something I had never seen before. I guess it tried to send me into WhatsApp itself to do some data clearing, but that didn’t go to the right screen because it was “Initializing…”.</li> </ul> <p>I’m not sure if I did anything more or just waited long enough, but some time later, opening WhatsApp again, it had given up on “Initializing…” and allowed me to do the login from scratch. This time, without a race condition, the login worked instantly.</p> <h4 id="transfer-chats">Transfer chats</h4> <p>But what still didn’t work was restoring the backup. It didn’t happen automatically: maybe because I hadn’t gone into WhatsApp settings on the new phone to authorize it for Google Drive?
(Although it’s impossible to get into those settings without logging in first, which would be a cyclic dependency.) Surprisingly, I couldn’t do it manually either: <strong>there’s no “Restore” button!</strong> So however it’s supposed to work, it didn’t, and I went through all the Google storage backup trouble for nothing.</p> <p>What I ended up doing was using the “Transfer chats” feature instead: it does the transfer without Google via some local wireless connection. And that actually worked! (Although it took surprisingly long given how few messages of chat history I even have on WhatsApp.)</p> <h4 id="no-excuses">No excuses</h4> <p>To be honest, I could’ve moved on fine without having my WhatsApp chat history. I don’t really use WhatsApp and have two small (dead) chats. However, I wanted to do the transfer purely on principle: it’s 2025 and it’s supposed to be possible.</p> <p>Not just possible, it’s not supposed to be this hard. The fact that WhatsApp is end-to-end encrypted and thus doesn’t store chat history on servers but only on client devices is no excuse for the process being so painful. Signal is very similar to WhatsApp (login by phone number, end-to-end encrypted chats, only local chat history), but its transfer process was very smooth. It’s unbelievable that in 2025 Meta cannot get this right.</p> <h2 id="google-apps">Google apps</h2> <h3 id="google-maps-timeline">Google Maps Timeline</h3> <p>Google Maps has had the Timeline feature, constantly tracking location for later viewing, for a long time. Recently, it was changed to no longer store location history on Google servers, but only on the phone itself.
Thus, to not lose all that location data (almost a decade of it in my case), it needs to be <em>manually</em> transferred when switching phones.</p> <p>Conveniently, Google Maps offers a way to (automatically) back that up on Google servers<sup id="fnref:google-maps-timeline-backup"><a href="#fn:google-maps-timeline-backup" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> and I had already set that up. Restoring that backup on the new phone wasn’t as easy as it should’ve been:</p> <ol> <li>On the new phone the “Your Timeline” button just wasn’t there. After some time it just appeared.</li> <li>Once it appeared, I could select “Import” for the backup from my old phone. It instantly said the backup was imported, but no old location data was actually present in the Timeline! Trying again some time later, it actually imported the data. <strong>So, manually check that the import worked!</strong> Some people on <a href="https://us.community.samsung.com/t5/Galaxy-S25/Google-Maps-Timeline-backup-import-not-working/td-p/3127065">Samsung Community forums</a> have reported similar issues with the Timeline on the S25 series, but it’s weird that Google Maps would have a bug that only appears on S25 phones: there shouldn’t be anything specific to these models.</li> </ol> <h3 id="google-calendar">Google Calendar</h3> <p>At least some (if not all) Google Calendar settings failed to transfer:</p> <ol> <li>The “Start of the week” day went back to Sunday from my chosen Monday.</li> <li>The selection of which calendars are synced to and shown on the phone.</li> </ol> <h3 id="gboard">Gboard</h3> <p>I’ve grown accustomed to Google’s Gboard over Samsung’s default keyboard.
Annoyingly, <em>nothing</em> about the keyboard was transferred:</p> <ol> <li>The default keyboard changed to Samsung’s.</li> <li>All the languages and layouts in Gboard reset to a single default one.</li> <li>All personal dictionaries were empty on the new phone.</li> <li>Probably all other Gboard settings went to default as well.</li> </ol> <h3 id="google-accounts">Google accounts</h3> <p>I have two Google accounts set up on my phone: a personal one and a work one. As expected, the personal one was transferred fully automatically, no login even needed if I remember correctly. Oddly enough, the work one did not get transferred at all.</p> <h2 id="authentication-apps">Authentication apps</h2> <h3 id="microsoft-authenticator">Microsoft Authenticator</h3> <p>I use Microsoft Authenticator for 2FA of my work account. Turns out, transferring Microsoft Authenticator data to a new phone, which can only be done via cloud backup, requires a <em>personal</em> Microsoft account. Andrew Wegner has written a blog post about this very issue: <a href="https://andrewwegner.com/ms-authenticator-without-personal-account.html">Moving MS Authenticator to a new phone without a personal account</a>. The only alternative seems to be manually re-adding all the accounts on the new phone…</p> <p>Lucky for me, since I only use this app for my work account, there were only two accounts to re-add. However, the personal account restriction is arbitrary and unnecessary: my workplace already pays Microsoft (probably exorbitant amounts of) money to store emails and OneDrive files in regulatorily-compliant ways. Surely, forcing users to use personal accounts for work-related authentication is <strong>not acceptable for organizational security</strong> and data compliance.</p> <h3 id="smart-id">Smart-ID</h3> <p><a href="https://www.smart-id.com/">Smart-ID</a> is a digital identification and signing service used in Estonia.
It offers a convenient alternative to the same features provided by the national ID card without requiring a computer with an ID card reader, while having equivalent legal binding in almost all cases.<sup id="fnref:e-voting"><a href="#fn:e-voting" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <p>Surprisingly, “transferring” a Smart-ID account from one device to another (<a href="https://www.smart-id.com/help/faq/registering/using-your-existing-smart-id-account-to-register-a-new-account/">Using your existing Smart-ID account to register a new account</a> in FAQ) requires that</p> <blockquote> <ol> <li>you registered your previous active account <strong>in a bank office</strong>, with the help of a bank teller</li> </ol> </blockquote> <p>That isn’t the case for me: I just registered for Smart-ID online on my own computer using an ID card.<sup id="fnref:smart-id-bank"><a href="#fn:smart-id-bank" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> If there’s a reason for this weird restriction, then nowhere is that explained.<sup id="fnref:smart-id-strength"><a href="#fn:smart-id-strength" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> The suggested alternative is just registering for a new Smart-ID account for the new phone on a computer, again using an ID card.</p> <h2 id="samsung-apps">Samsung apps</h2> <h3 id="galaxy-watch">Galaxy Watch</h3> <p>I had my Galaxy Watch 6 paired with my old phone. After Samsung Smart Switch, the watch suggested resetting itself before pairing with the new phone. It’s probably possible to skip the reset, however, some <a href="https://www.reddit.com/r/GalaxyS24Ultra/comments/1c6b2oo/camera_controller_missing/">Reddit threads</a> suggest doing so to unlock Galaxy Watch features that are (for no good reason) locked to Samsung S-series phones. 
For example, I couldn’t use my old A52 to access the Camera Controller on my watch, but the new S25+ would allow it.</p> <p>Resetting the watch is supposed to be a seamless experience: its data is backed up<sup id="fnref:galaxy-watch-cloud-backup"><a href="#fn:galaxy-watch-cloud-backup" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> and automatically restored after reset. It’s not perfect though: at least two things failed to back up/restore correctly:</p> <ol> <li>It restored the correct watch face, but not the complication customizations I had made.</li> <li>It did not restore the three favorite exercises on the Samsung Health Exercises tile.</li> </ol> <h2 id="other-apps">Other apps</h2> <h3 id="antennapod">AntennaPod</h3> <p><a href="https://9to5google.com/2023/09/26/google-podcasts-youtube-music/">When Google killed Google Podcasts</a> (or more precisely, announced it would), I switched to <a href="https://antennapod.org/">AntennaPod</a> for all my podcasting needs. It’s decentralized and open source, so I could be reasonably confident that my data is not held hostage. Hence, it’s another one of those apps whose data needs manual transferring to a new phone. Luckily <a href="https://antennapod.org/documentation/general/backup">AntennaPod provides import/export of its internal database</a> for my podcast subscriptions and listening statistics (very important), so I could get those across to my new phone without a problem. Unfortunately, not <em>everything</em> about AntennaPod gets transferred this way.</p> <p>For one, its database does not include app settings (which I don’t understand) nor does it provide any alternative way to transfer them. The only way is a manual transfer: have two phones side-by-side and navigate through all settings menus in parallel, changing settings on the new phone to match the old one. Either the database should include app settings or they should be separately transferable (e.g.
via usual Google app settings backup).</p> <p>The other thing which does not get transferred is downloaded podcast episodes. The argument seems to be that it’s too much data (easily gigabytes) to transfer via Google app backup (which has some size limit). However, there’s also no way to do it offline or manually! By default, it stores downloads in <code class="language-plaintext highlighter-rouge">/Android/data/</code> which on modern Android is inaccessible by file browsers (without root).</p> <h4 id="androiddata"><code class="language-plaintext highlighter-rouge">/Android/data/</code></h4> <p>There is a <a href="https://www.reddit.com/r/AndroidQuestions/comments/192vfjt/is_there_a_way_to_access_the_datadata_folder_in/">workaround</a>: Android’s own file chooser has access to those directories. Thus, there are <a href="https://play.google.com/store/apps/details?id=com.marc.files">apps</a> whose sole purpose is to open that file chooser (as if it’s a file browser). With that, it’s possible to go into the right directory and see all the downloaded podcast audio files, organized into subdirectories. It even seems possible to copy an individual file out of there. Weirdly enough, directories cannot be copied: their copying is non-recursive and only creates an empty directory at the target. There is an option to compress files and directories, but that too only works for individual files: compressing a directory produces just an empty directory in the archive. With dozens of downloads in different subdirectories, it would be tedious to copy them one by one.</p> <p>There’s <a href="https://www.reddit.com/r/Android/comments/wru35i/clearing_up_confusion_about_how_to_access/ikvfe39/">one more workaround</a> described online: when Android’s file chooser is opened in split screen with itself, one can move data by dragging it from one half of the split into the other. Well, I tried that, with <strong>catastrophic consequences</strong>.
Perhaps it would’ve worked with individual files, but I dragged and dropped a directory containing all the podcast downloads, which ended up creating yet another empty directory. Because this <em>moves</em> (not copies), all the original files were actually deleted, even though they weren’t successfully created in the target directory.<sup id="fnref:android-safe-move"><a href="#fn:android-safe-move" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p> <p>At this point I no longer had any downloads to transfer, so the only solution was redownloading them all on my new phone. Luckily all of them were still available, although there’s no guarantee (RSS feeds or individual episode files can easily disappear from the internet). A big reason to keep some (favorite) downloaded episodes is to safeguard against their disappearance. Therefore, AntennaPod should really provide a proper migration path. That is, by default store downloads in a sensible, accessible place where they can be transferred manually, or even automatically by Smart Switch (which transfers all user-accessible files outside <code class="language-plaintext highlighter-rouge">/Android/data</code>, regardless of app).</p> <p>The redownloading isn’t entirely straightforward either: the transferred database contains the list of downloaded episodes and their paths (in the inaccessible directory). Thus, on the new phone, <a href="https://github.com/AntennaPod/AntennaPod/issues/3037">AntennaPod shows them as downloaded, and without a download button</a>. Only when you go to play the episode does it realize that the file is missing and a download button re-appears. Alternatively, you can delete the downloads in AntennaPod (although there’s nothing to actually delete), which then also allows redownloading.
However, this is also very problematic: you have to remember what you had downloaded (or at least favorite those episodes just for this purpose).</p> <h3 id="nonplay-store-apps">Non–Play-Store apps</h3> <p>Although Smart Switch transfers many things directly phone-to-phone, it doesn’t seem to do that with the installed apps themselves. Rather, the new phone just installs them from Google Play Store (and perhaps Galaxy Store). Any apps previously installed on the old phone which are not available from those stores are not installed on the new phone:</p> <ol> <li>A handful of Play Store apps have been removed from the store (for whatever reason). Their names and icons are transferred to the new phone and show up in the apps list but grayed out. Launching them just goes to a Play Store view saying “Item not found”.<sup id="fnref:flappy-bird"><a href="#fn:flappy-bird" class="footnote" rel="footnote" role="doc-noteref">8</a></sup></li> <li>Samsung has a group of apps collectively known as “Good Lock” which allow additional customization of their One UI. For some reason they are not available via the app stores in Estonia, but they can be installed via APKs found online.<sup id="fnref:good-lock-store"><a href="#fn:good-lock-store" class="footnote" rel="footnote" role="doc-noteref">9</a></sup></li> <li>Any other apps installed unofficially from APKs, like YouTube ReVanced.</li> </ol> <p>Sure, Google and Samsung want people to only install apps from official sources to avoid malware, but not transferring others doesn’t help with that. I already installed them on my old phone and can just as well re-install them on the new phone, it’s just unnecessarily annoying. 
Arguably, it would be better to reuse the APKs from the old phone, which I know to be safe, than to force me to go online and download new APKs, risking new malware.</p> <h3 id="app-logins">App logins</h3> <p>A general annoyance is that pretty much all apps are logged out of my accounts on the new phone, even though the same apps are automatically installed. It’s such a tedious process to log in to each one again manually.</p> <h2 id="settings">Settings</h2> <h3 id="time">Time</h3> <p>After the transfer, which I did late enough in the day, I noticed that the new phone used the 12-hour format for the clock. So for some reason the 24-hour clock format setting did not get transferred, which seems like an obvious one.</p> <h3 id="home-screen">Home screen</h3> <p>Smart Switch does transfer home screen and apps screen layouts, although there was a catch. My old phone used a 5×5 grid, but the new one defaults to 5×6, and Smart Switch went with that default instead of what I had before. At least in this direction it’s harmless because only an extra empty row appeared.</p> <p>Although it transfers home screen widgets, their settings do not get transferred. In particular:</p> <ol> <li>For the AntennaPod widget I had configured which buttons it shows. For quite a while I thought the widget looked different because its slightly different size caused it to go with a different layout.</li> <li>For the GitHub widget I had chosen my one and only GitHub account to show. Thus, it only showed “Sign in” which took me to GitHub login (again), even though I had already logged into the GitHub app with the same account. Doing that login didn’t even set it to be shown on the widget. I had to manually choose it in the widget settings again.</li> </ol> <h3 id="wi-fi">Wi-Fi</h3> <p>The names and passwords (or whatever login details) of known Wi-Fi networks are transferred automatically, which is very convenient.
However, eduroam did not work out of the box: all of its (enterprise) login details seemed to have been transferred and authentication to eduroam seemed successful, but I had no network access. If I recall correctly, Android reported that it could not obtain an IP address. Not sure what that was about, so I just ended up reconfiguring eduroam from scratch and succeeded. Perhaps eduroam didn’t like some transferred certificate or whatever suddenly being used with a different MAC address?</p> <h2 id="data-usage">Data usage</h2> <p>Android records data usage both (and separately) for Wi-Fi and mobile data. These monthly statistics were not transferred from the old phone to the new one. Moreover, the related settings (mobile data warning, etc.) also were not. Although this didn’t pose a problem for me, it could be annoying for some people, especially when switching phones mid–billing-cycle.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:a52case"> <p>Luckily some were still available but not many. <a href="#fnref:a52case" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:google-maps-timeline-backup"> <p>So the whole thing just seems like privacy theater with the extra inconvenience of not being able to browse the Timeline via web browser anymore. <a href="#fnref:google-maps-timeline-backup" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:e-voting"> <p>E-voting is the only exception that I can immediately think of. And even there, I’m not sure if the limitation is legal or technical. <a href="#fnref:e-voting" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:smart-id-bank"> <p>At some point bank tellers were really pushing this for some reason (perhaps commission?), although anyone who could do online banking with an ID card could just register for Smart-ID on their own. And a grandma who doesn’t do online banking at all won’t have any use for Smart-ID either. 
<a href="#fnref:smart-id-bank" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:smart-id-strength"> <p>Perhaps Smart-ID is only legally binding if directly registered by a national ID card, not when indirectly registered by a previous Smart-ID account which was registered by a national ID card. This restriction then suggests that Smart-ID accounts registered in a bank are somehow legally “stronger”. <a href="#fnref:smart-id-strength" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:galaxy-watch-cloud-backup"> <p>Backups from the watch onto the phone work reliably, but having those also backed up to Samsung Cloud requires the stars to really align: <a href="https://www.sammobile.com/news/why-your-galaxy-watch-has-not-backed-up-to-samsung-cloud-in-ages/">Why your Galaxy Watch hasn’t backed up to Samsung Cloud in ages</a>. And the stupidest part is: there’s no way to do that <strong>manually</strong>. I believe this wasn’t the problem in my case though. <a href="#fnref:galaxy-watch-cloud-backup" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:android-safe-move"> <p>This means that at least some Android APIs implement a horrendously unsafe filesystem move operation which fails basic validation and atomicity requirements. This is Android itself, not any non-stock app! <a href="#fnref:android-safe-move" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:flappy-bird"> <p>So <a href="https://www.theguardian.com/technology/2014/feb/10/phones-flappy-bird-ebay-app-store">Flappy Bird phones</a> still remain a thing. <a href="#fnref:flappy-bird" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:good-lock-store"> <p>Oddly enough, once installed from APK, Galaxy Store lists them as installed in some places. 
<a href="#fnref:good-lock-store" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="android"/><category term="samsung"/><category term="usability"/><category term="rant"/><summary type="html"><![CDATA[In the beginning of February I switched from my old Samsung Galaxy A52 to a new Samsung Galaxy S25+, which was just released. One can find endless complaining online about how the S25 series isn’t worth it, which might be the case for someone coming from S24 series. However, I’m doing a 4-year leap from a mid-range phone that has become genuinely problematic: There’s a well-known issue with the A52’s plastic back panel coming off. This issue started to develop on my phone also in July 2024, but I alleviated it by getting a phone case (for the first time in my life).1 The A52 got its last Android update (to 14) in January 2024. I’ve increasingly found it frustratingly sluggish. Its virtual proximity sensing is quite inaccurate: the screen often activates in the pocket and even triggers presses. In particular, such presses would often mess with media controls which are directly accessible from the lock screen (e.g. randomly seeking, skipping to next podcast). Luckily some were still available but not many. 
&#8617;]]></summary></entry><entry><title type="html">Shifting dates and times when resetting a Moodle course</title><link href="https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course/" rel="alternate" type="text/html" title="Shifting dates and times when resetting a Moodle course"/><published>2025-02-08T00:00:00+00:00</published><updated>2025-03-16T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course</id><content type="html" xml:base="https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course/"><![CDATA[<p>When resetting a Moodle course for a new year/semester, it can be very tedious to update the availability and due dates for all activities/assignments/etc in the course. <a href="https://docs.moodle.org/405/en/Reset_course#Course_start_date">Moodle (supposedly) provides a convenient way to do this</a>:</p> <blockquote> <p><em>NOTE: If you set a new course start date, then all course dates will be shifted by the same amount.</em></p> </blockquote> <h2 id="goal">Goal</h2> <p>For example, suppose that the old course start date (and time) is set to <strong>14.02.2024 10:00</strong> and activity’s availability/due date (and time) is <strong>14.02.2024 12:00</strong> (i.e. 2 hours after the course start). Both intuitively and according to the quoted documentation, the new course start date should be <strong>12.02.2025 10:00</strong> if the same activity’s date should become <strong>12.02.2025 12:00</strong> (i.e. shifted forward by 2 days less than 1 year).</p> <h2 id="problem">Problem</h2> <p>Unfortunately (and unsurprisingly), Moodle is not intuitive nor properly documented. If you actually reset the course and choose <strong>12.02.2025 10:00</strong> as the new start date, then the same activity’s date becomes <strong>12.02.2025 22:00</strong> (note the incorrect hour).</p> <p>Good luck figuring out why! 
Also, it’s not so easy to just try again by offsetting the new start date, because resetting changed the old start date. This offers hours of fun trial and error just to reset a Moodle course.</p> <h3 id="code">Code</h3> <p>Alternatively, one can dig into Moodle source code (something every teacher using Moodle definitely can/wants to do) and eventually find the corresponding logic in <a href="https://github.com/moodle/moodle/blob/139a0ad5f0458caaff7506c8b26081eea1c85231/lib/moodlelib.php#L5076-L5077"><code class="language-plaintext highlighter-rouge">moodlelib.php</code></a>:</p> <div class="language-php highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Time part of course startdate should be zero.</span>
<span class="nv">$data</span><span class="o">-&gt;</span><span class="n">timeshift</span> <span class="o">=</span> <span class="nv">$data</span><span class="o">-&gt;</span><span class="n">reset_start_date</span> <span class="o">-</span> <span class="nf">usergetmidnight</span><span class="p">(</span><span class="nv">$data</span><span class="o">-&gt;</span><span class="n">reset_start_date_old</span><span class="p">);</span>
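<span class="c1">// Illustration (not in Moodle source): for old start 14.02.2024 10:00 and new start 12.02.2025 10:00,</span>
<span class="c1">// usergetmidnight() yields 14.02.2024 00:00, making $timeshift 10 hours larger than the</span>
<span class="c1">// intended start-to-start difference, so every activity date lands 10 hours too late.</span>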
</code></pre></div></div> <p>For some reason which is completely beyond me,<sup id="fnref:reason"><a href="#fn:reason" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> the date (and time) shift is not calculated using the old course start date directly. Instead, the <code class="language-plaintext highlighter-rouge">usergetmidnight</code> changes the old <strong>14.02.2024 10:00</strong> into <strong>14.02.2024 00:00</strong> (note the midnight hour) before calculating the difference (with the new course start date that still has user-set hour). In this case, that causes the shift to be 10 hours greater, causing the activity’s date to be shifted 10 hours later.</p> <p>This means that the undesired extra time shift depends on the old course start time instead of being constant. If it were the latter (e.g. because of some timezone issue), then it would be much simpler to compensate for by trial and error.</p> <p>According to <code class="language-plaintext highlighter-rouge">git blame</code>, that midnight calculation has been there for 17 years! It took 12 years for anyone to complain in a Moodle issue (<a href="https://tracker.moodle.org/browse/MDL-65233">MDL-65233</a>), but has been repeatedly reported since (<a href="https://tracker.moodle.org/browse/MDL-76882">MDL-76882</a>, <a href="https://tracker.moodle.org/browse/MDL-82206">MDL-82206</a>, …). I wonder how many more years it will take to get fixed.</p> <blockquote class="block-tip"> <h5 id="fixed">Fixed</h5> <p>The answer is <em>one month</em>! I went ahead and submitted a patch to remove the <code class="language-plaintext highlighter-rouge">usergetmidnight</code>. It was accepted by Moodle developers and the fix should ship in Moodle versions 4.4.7 and 4.5.3. 
Nevertheless, I suggest following the recommended workaround below until you can be sure that your institution has updated.</p> </blockquote> <h2 id="workaround">Workaround</h2> <p>Anyone who has a course to run cannot wait, so here are a few workarounds, illustrated with the example above.</p> <h3 id="workaround-1">Workaround 1</h3> <ol> <li>Reset the course to have new start date <strong>12.02.2025 00:00</strong> (i.e. the intended new start date minus the 10-hour time part of the old start date). The activity’s date will get shifted as intended.</li> <li>The course start date after the reset will have the wrong time, so change it to <strong>12.02.2025 10:00</strong> in course settings (not by resetting the course!).</li> </ol> <h3 id="workaround-2-recommended">Workaround 2 (recommended)</h3> <ol> <li>Change the old course start date to <strong>14.02.2024 00:00</strong> in course settings (not by resetting the course!). Already having the time be midnight cancels the weirdness of <code class="language-plaintext highlighter-rouge">usergetmidnight</code>.</li> <li>Reset the course to have new start date <strong>12.02.2025 00:00</strong> (also midnight). Both the new course start date and the activity’s date will be as intended.</li> </ol> <p>This workaround is a bit more future-proof:</p> <ol> <li>By always having the course start date at midnight, you hopefully avoid the issue during future course resets because the time shift will be intuitive then.</li> <li>If <del>Moodle ever fixes the calculation (to remove <code class="language-plaintext highlighter-rouge">usergetmidnight</code>) and</del> your institution finally updates Moodle, then this does not impact your course resetting workflow.
(With workaround 1 you’d have to know about the change and stop manually compensating on each reset, because then you’d be overcompensating.)</li> </ol> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:reason"> <p>Comment below if you think this makes any sense or have any idea why this would ever be desirable. <a href="#fnref:reason" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="teaching"/><category term="university"/><category term="moodle"/><category term="tutorial"/><summary type="html"><![CDATA[When resetting a Moodle course for a new year/semester, it can be very tedious to update the availability and due dates for all activities/assignments/etc in the course. Moodle (supposedly) provides a convenient way to do this:]]></summary></entry><entry><title type="html">Clickable, breakable, colored &amp;amp; underlined URLs in LaTeX</title><link href="https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex/" rel="alternate" type="text/html" title="Clickable, breakable, colored &amp;amp; underlined URLs in LaTeX"/><published>2025-01-26T00:00:00+00:00</published><updated>2025-01-26T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex</id><content type="html" xml:base="https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex/"><![CDATA[<h2 id="requirements">Requirements</h2> <p>Suppose the goal is to <em>simultaneously</em> achieve all of the following for typesetting URLs with the <code class="language-plaintext highlighter-rouge">\url</code> command:<sup id="fnref:url-command"><a href="#fn:url-command" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <ol> <li>URLs are clickable.</li> <li>URLs are (line-)breakable.</li> <li>URLs are colored.</li> <li>URLs are underlined.<sup 
id="fnref:pdfborderstyle"><a href="#fn:pdfborderstyle" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></li> </ol> <p>Many StackOverflow questions and answers cover a subset of these, but none that I’ve found does all four. Moreover, such answers usually don’t combine to get all of the above.</p> <h3 id="a-rant">A rant</h3> <p>The esteemed typographer Robert Bringhurst states in his “The Elements of Typographic Style”:</p> <blockquote> <p>3.5.1 Change one parameter at a time.</p> </blockquote> <p>Moreover, underlining is shunned in general, e.g. by the <a href="https://texfaq.org/FAQ-underline">TeX FAQ</a> and the <a href="http://mirrors.ctan.org/macros/generic/soul/soul-ori.pdf"><code class="language-plaintext highlighter-rouge">soul-ori</code> package documentation</a> (with references to other typographic texts).</p> <p>Hence, requiring URLs in texts to be underlined (<em>and</em> colored) is madness, but this is what happens when Word users make the rules…</p> <h2 id="lualatex">LuaLaTeX</h2> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>hyperref<span class="p">}</span> <span class="c">% provides clickable \url command</span>
<span class="k">\hypersetup</span><span class="p">{</span>
    breaklinks, <span class="c">% allow line breaks in links</span>
    colorlinks, <span class="c">% allow colors for links</span>
    allcolors=black, <span class="c">% disable obnoxious colors for \ref, \cite, etc. (optional)</span>
    urlcolor=blue, <span class="c">% choose color for \url (default)</span>
<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>xurl<span class="p">}</span> <span class="c">% allows arbitrary line breaks in \url (optional)</span>

<span class="k">\usepackage</span><span class="p">{</span>lua-ul<span class="p">}</span> <span class="c">% provides underlining for LuaLaTeX</span>
<span class="k">\makeatletter</span> <span class="c">% allow accessing \@underLine below</span>
<span class="k">\DeclareUrlCommand</span><span class="p">{</span><span class="k">\Hurl</span><span class="p">}{</span><span class="c">% redefine hyperref's internal \Hurl instead of \url to preserve clickability and color</span>
    <span class="k">\def\UrlLeft</span>##1<span class="k">\UrlRight</span><span class="p">{</span><span class="k">\@</span>underLine##1<span class="p">}</span> <span class="c">% underline \url, use internal command instead of \underLine to preserve breakability</span>
<span class="p">}</span>
<span class="k">\makeatother</span>
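<span class="c">% Hypothetical usage with the setup above: in the document body,</span>
<span class="c">%   \url{https://example.com/some/very/long/path}</span>
<span class="c">% produces a blue, underlined, clickable URL that can break across lines.</span>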
</code></pre></div></div> <h2 id="pdflatex">pdfLaTeX</h2> <p>Currently I am not aware of a way to simultaneously achieve breakability and underlining with pdfLaTeX. Neither the <code class="language-plaintext highlighter-rouge">soul</code> nor the <code class="language-plaintext highlighter-rouge">ulem</code> package for underlining with pdfLaTeX provides a command like <code class="language-plaintext highlighter-rouge">\@underLine</code> which can be used to underline without forcing the argument into an unbreakable box. Both of these packages and the <code class="language-plaintext highlighter-rouge">url</code> package (loaded by <code class="language-plaintext highlighter-rouge">hyperref</code>) for the <code class="language-plaintext highlighter-rouge">\url</code> command do some low-level <code class="language-plaintext highlighter-rouge">\catcode</code> trickery with their arguments, but they don’t seem to play well together. The <code class="language-plaintext highlighter-rouge">href-ul</code> package seems to do nothing at all.</p> <p><strong>Comment below if you have a solution for pdfLaTeX!</strong></p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:url-command"> <p>It has to be the standard <code class="language-plaintext highlighter-rouge">\url</code> command from the <code class="language-plaintext highlighter-rouge">url</code> package, not an alternative custom command. The latter will lead to inconsistencies, e.g. URLs in BibLaTeX bibliography would not use the custom command. <a href="#fnref:url-command" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:pdfborderstyle"> <p><code class="language-plaintext highlighter-rouge">pdfborderstyle</code> from the <code class="language-plaintext highlighter-rouge">hyperref</code> package is not an appropriate and portable way to achieve this.
<a href="#fnref:pdfborderstyle" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[Requirements]]></summary></entry><entry><title type="html">TP-Link cannot get IPv6 firewall right</title><link href="https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right/" rel="alternate" type="text/html" title="TP-Link cannot get IPv6 firewall right"/><published>2024-08-24T00:00:00+00:00</published><updated>2025-01-05T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right/"><![CDATA[<p>What led to two of my <a href="/blog/2024/08/10/firefox-hsts-bypass/">recent</a> <a href="/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/">posts</a> is TP-Link’s inability to get IPv6 firewall right on their routers.<sup id="fnref:err"><a href="#fn:err" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <h2 id="allowing-all-incoming-ipv6-traffic">Allowing all incoming IPv6 traffic</h2> <p>For a long time my home NAS was publically accessible from the internet under my own subdomain. Although risky, I did my best to secure things. First, in my TP-Link router I had forwarded to my NAS only a handful of ports for the Synology DSM and Home Assistant. 
Second, both of these used HTTPS with <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt certificates via Synology’s built-in functionality</a>.</p> <p>As I was removing all the port forwarding, I noticed something odd: I had never forwarded port 80 to my NAS, yet my Synology had been renewing Let’s Encrypt certificates for years using its HTTP-01 validation<sup id="fnref:dns-01"><a href="#fn:dns-01" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, which requires port 80 to be exposed. Moreover, even after removing all the port forwarding, those ports were still accessible from the internet (e.g., a DigitalOcean VPS). I hadn’t DMZ-ed the NAS in the TP-Link router, so how is this even possible?</p> <p>To my absolute horror, I <em>eventually</em> realized that my <strong>TP-Link Archer C6 v2.0<sup id="fnref:firmware"><a href="#fn:firmware" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> simply allows all incoming IPv6 traffic to all LAN devices</strong>. While for IPv4 the NAT port forwarding (or the lack of it) acts as a de facto firewall, TP-Link must’ve been thinking that IPv6 not needing NAT means that there’s no need for any kind of firewall whatsoever. This is unbelievably insecure because non-expert (but also quite advanced) users would never realize this. Furthermore, there’s no way to change it other than <strong>disabling IPv6 altogether</strong>. So much for IPv6 adoption…</p> <p>Some deep digging reveals that this massive security flaw has been noticed a few times on <a href="https://community.tp-link.com/en/home/forum/topic/160757">TP-Link Community forums</a> and <a href="https://www.reddit.com/r/TpLink/comments/xcgica/how_to_block_network_access_from_my_lan_to/">Reddit</a> before. 
Some other <a href="https://www.reddit.com/r/TpLink/comments/1ek6u2u/ipv6_firewall_rules_on_tplink_routers/">Reddit</a> <a href="https://www.reddit.com/r/TpLink/comments/wibkgp/ipv6_firewall_not_present_in_tplink_archer_c64/">posts</a> don’t mention the allow-all behavior explicitly, but just wonder about an IPv6 firewall of any sort.</p> <h2 id="blocking-all-incoming-ipv6-traffic">Blocking all incoming IPv6 traffic</h2> <p>Luckily (not for me), someone at TP-Link must’ve realized their stupidity at some point, but fixed it only for some newer router models — no firmware updates are available for mine. More often, people on <a href="https://community.tp-link.com/en/home/forum/topic/220864">TP-Link</a> <a href="https://community.tp-link.com/en/home/forum/topic/567744">Community</a> <a href="https://community.tp-link.com/en/home/forum/topic/185422">forums</a> and <a href="https://www.reddit.com/r/HomeNetworking/comments/1als6lu/how_do_i_expose_ipv6_port_to_wan_tp_link_ax_5400/">various</a> <a href="https://www.reddit.com/r/TpLink/comments/1bst93b/ax11000_v116_vs_v2_ipv6_firewall_support/">Reddit</a> <a href="https://www.reddit.com/r/TpLink/comments/15vjnuy/when_will_ax11000_have_proper_ipv6_firewall/">posts</a> have complained about their <strong>TP-Link routers blocking all incoming IPv6 traffic without any configurability</strong> (unlike IPv4). At least this is secure enough to not expose everything to the internet, but it doesn’t help IPv6 adoption either…</p> <h2 id="providing-non-functional-ipv6-firewall-configuration">Providing non-functional IPv6 firewall configuration</h2> <p>For some even newer router models, it seems that TP-Link tried to also solve that problem by making the IPv6 firewall finally configurable. 
Judging by posts on <a href="https://community.tp-link.com/en/home/forum/topic/654622">TP-Link</a> <a href="https://community.tp-link.com/en/home/forum/topic/670230">Community</a> <a href="https://community.tp-link.com/en/home/forum/topic/682276">forums</a>, however, <strong>this configurability doesn’t seem to actually work</strong> and all incoming IPv6 traffic is still blocked.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:err"> <p>The insecurity of TP-Link routers has recently come to broad attention: <a href="https://news.err.ee/1609557511/chinese-routers-to-be-banned-in-the-us-also-widespread-in-estonia">Chinese routers to be banned in the US also widespread in Estonia</a>. <a href="#fnref:err" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:dns-01"> <p>Synology only supports HTTP-01 for custom domains. <a href="#fnref:dns-01" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:firmware"> <p>Using firmware “1.3.6 Build 20200902 rel.65591(4555)” which happens to be the latest for this router. <a href="#fnref:firmware" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="networking"/><category term="security"/><category term="rant"/><summary type="html"><![CDATA[What led to two of my recent posts is TP-Link’s inability to get IPv6 firewall right on their routers.1 The insecurity of TP-Link routers has recently come to broad attention: [Chinese routers to be banned in the US also widespread in Estonia][err-article]. 
&#8617;]]></summary></entry><entry><title type="html">Tailscale HTTPS certificate on Synology NAS</title><link href="https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/" rel="alternate" type="text/html" title="Tailscale HTTPS certificate on Synology NAS"/><published>2024-08-11T00:00:00+00:00</published><updated>2024-11-10T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/"><![CDATA[<p>I recently discovered <a href="https://tailscale.com/">Tailscale</a> for setting up a private VPN. My main goal was to use it for accessing my <a href="https://www.synology.com/">Synology NAS</a> at home from anywhere in the world. So far I had kept my home NAS publically accessible from the internet, which had been fine but risky nevertheless.</p> <p>In order to secure web connections to the Synology DSM and various Docker-based services, I had set up <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt on Synology</a> under my own subdomain. Since my NAS is no longer publically accessible, it cannot obtain new Let’s Encrypt certificates for the subdomain<sup id="fnref:dns-01"><a href="#fn:dns-01" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Instead, I needed HTTPS certificates for the Tailscale full domain of the NAS.</p> <p>Tailscale has <a href="https://tailscale.com/kb/1131/synology">a guide for setting Tailscale itself up on Synology</a> and <a href="https://tailscale.com/kb/1153/enabling-https">a guide for obtaining HTTPS certificates using <code class="language-plaintext highlighter-rouge">tailscale cert</code></a>. 
Surprisingly, neither documents the best solution, which is the <a href="https://tailscale.com/kb/1080/cli#configure-alpha">undocumented</a> command</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tailscale configure synology-cert
</code></pre></div></div> <p>Prior to its introduction, users came up with their own scripts <a href="https://github.com/tailscale/tailscale/issues/4674">under this Tailscale issue</a>, but using the official command is now the easiest way.</p> <h2 id="step-by-step">Step-by-step</h2> <ol> <li><a href="https://tailscale.com/kb/1131/synology">Set up Tailscale on your Synology NAS</a> or update it to at least <strong>version 1.64.0</strong>.</li> <li>Navigate in the Synology DSM to <strong>Control Panel → Task Scheduler</strong>.</li> <li> <p>Create a new scheduled task with a user-defined script (<strong>Create → Scheduled Task → User-defined script</strong>) with the following details:</p> <ul> <li><strong>General</strong>: <ul> <li>Task (name): “Tailscale Certificate” (or whatever you want).</li> <li>User: root (the Tailscale command needs that).</li> </ul> </li> <li><strong>Schedule</strong>: <ul> <li>“Run on the following days”: “Weekly”, “Monday” (“Monthly” does not seem frequent enough to renew the 90-day Let’s Encrypt certificate reliably, because calendar months and the 90-day validity period do not stay nicely in sync).</li> </ul> </li> <li><strong>Task Settings</strong>: <ul> <li>User-defined script: <code class="language-plaintext highlighter-rouge">tailscale configure synology-cert</code> (the magic command).</li> </ul> </li> </ul> </li> <li>Press “OK” and follow the on-screen instructions for setting up the root script.</li> <li>Right click on the created task and select “Run” to get the first certificate immediately.</li> <li>Navigate in the Synology DSM to <strong>Control Panel → Security → Certificate</strong>.</li> <li>You should now see a certificate for your <code class="language-plaintext highlighter-rouge">ts.net</code> subdomain in this list.</li> <li>Use the Tailscale certificate in one of two ways, depending on your use case: <ol> <li>Right click on the certificate and select “Edit”. 
Then tick “Set as default certificate” and press “OK”.</li> <li>Click “Settings” in the toolbar. Change the certificate on a per-service basis.</li> </ol> </li> </ol> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:dns-01"> <p>This would be possible with Let’s Encrypt’s DNS-01 domain validation (as opposed to HTTP-01), but Synology only supports HTTP-01 for custom domains. <a href="#fnref:dns-01" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="synology"/><category term="networking"/><category term="security"/><category term="tutorial"/><summary type="html"><![CDATA[I recently discovered [Tailscale] for setting up a private VPN. My main goal was to use it for accessing my Synology NAS at home from anywhere in the world. So far I had kept my home NAS publically accessible from the internet, which had been fine but risky nevertheless.]]></summary></entry><entry><title type="html">Firefox HSTS bypass</title><link href="https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass/" rel="alternate" type="text/html" title="Firefox HSTS bypass"/><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass/"><![CDATA[<p>HSTS is a mechanism to force browsers to use HTTPS instead of HTTP to connect to a site. The intention being that an attacker cannot replace it with an insecure version.</p> <p>However, it might be desirable to undo this enforcement for valid and safe reasons, e.g., during web development and testing. 
In my case, I needed to override the protection after disabling “Automatically redirect HTTP connection to HTTPS for DSM desktop” in my Synology NAS settings.</p> <p>While other browsers (Chrome/Edge) provide a way for power users to bypass HSTS for a site, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1528738">Firefox insists on not offering any means to do so due to “No User Recourse” from the HSTS RFC</a>. Yet, the “solutions” presented by Mozilla employees still allow users to do just that, while also deleting other data and being significantly less secure than just bypassing for a single site…</p> <h2 id="non-solutions">Non-solutions</h2> <p>There are a few supposed solutions online; however, I consider each one a non-solution:</p> <ol> <li> <p><strong><a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/27794/highlight/true#M15411">The official solution</a></strong> is to find the site in Firefox History and select “Forget this site” for it.</p> <p>This is a non-solution because it deletes <em>all</em> data related to the site, not just its HSTS state.</p> </li> <li> <p><a href="https://www.thesslstore.com/blog/clear-hsts-settings-chrome-firefox/#h-how-to-delete-hsts-settings-in-firefox">Editing <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.txt</code></a> to remove the HSTS entry for a specific site.</p> <p>While this only deletes the HSTS state, this is a non-solution because it no longer works: <a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/52339/highlight/true#M30458">recent versions of Firefox use a proprietary binary file <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> instead</a>.</p> </li> <li> <p><a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/52339/highlight/true#M30458">Deleting <code class="language-plaintext 
highlighter-rouge">SiteSecurityServiceState.bin</code></a> to remove HSTS entries for <em>all</em> sites.</p> <p>This is a non-solution because it deletes HSTS data related to <em>other</em> unrelated sites and unnecessarily gives up the security provided by HSTS. It’s the most insane “solution” of them all.</p> </li> </ol> <h2 id="the-solution">The solution</h2> <ol> <li> <p>Find your Firefox profile path. You can do this as follows:</p> <ol> <li>Navigate to the “URL” <code class="language-plaintext highlighter-rouge">about:profiles</code>.</li> <li>Find your profile from the list. This is likely the one with “This is the profile in use and it cannot be deleted.” under it.</li> <li>Copy the “Root Directory” or click “Open Directory” after it.</li> </ol> </li> <li> <p>Close Firefox.</p> </li> <li> <p>Back up the <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> file in your Firefox profile path, for example, by copying it as <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin.bak</code>. This is in case the binary file somehow ends up corrupted when modifying it in the next step.</p> </li> <li> <p>Open the <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> file in a hex editor. I used <a href="https://wiki.gnome.org/Apps/Ghex">GHex</a> on Linux.</p> </li> <li> <p>Use the hex editor’s “Find” feature to find the desired site’s domain in the file.</p> </li> <li> <p>Replace the Unix timestamp in milliseconds (like <code class="language-plaintext highlighter-rouge">1723280965123</code>) after it (there are many NUL/zero bytes in between) with one in the past. 
I changed it to <code class="language-plaintext highlighter-rouge">1696969696969</code>.</p> <p>The file seems to have a similar format to the old <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.txt</code> file, but since it’s in a proprietary binary format, it’s not as simple as deleting a line from it. So the safest way is to just change the HSTS expiry timestamp in-place.</p> </li> <li> <p>Save the file in the hex editor.</p> </li> <li> <p>Open Firefox.</p> </li> </ol>]]></content><author><name></name></author><category term="firefox"/><category term="open source"/><category term="security"/><category term="rant"/><category term="tutorial"/><summary type="html"><![CDATA[HSTS is a mechanism to force browsers to use HTTPS instead of HTTP to connect to a site. The intention being that an attacker cannot replace it with an insecure version.]]></summary></entry><entry><title type="html">Springer anti-typesetters, part 2</title><link href="https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2/" rel="alternate" type="text/html" title="Springer anti-typesetters, part 2"/><published>2024-07-22T00:00:00+00:00</published><updated>2024-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2/"><![CDATA[<p>This post continues the <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">Springer typesetting saga from part 1</a> on a pair of papers in March 2024. We had two papers <a class="citation" href="#GOBLINTVALIDATOR-SVCOMP24">(Saan et al., 2024; Saan et al., 2024)</a> accepted into the same conference proceedings and they were edited <em>very</em> differently. The following compares our submitted camera-ready version with the many proofs from Springer typesetters.</p> <h2 id="proof-1">Proof 1</h2> <p>The first pair of proofs for the two papers were near-perfect. 
In both papers they just removed spaces between authors’ email addresses, i.e., <code class="language-plaintext highlighter-rouge">{a, b}@c.d</code> was replaced with <code class="language-plaintext highlighter-rouge">{a,b}@c.d</code>, which is harder to read in typewriter font but whatever. (They also did this for the paper in <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">part 1</a>.)</p> <p>However, Springer still managed to introduce inconsistencies. Both papers are <em>open access</em> and have a paragraph about <a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a> after the references. In the Goblint Validator paper they inserted a dot after the bolded “Open Access” which begins the paragraph and changed the license URL into typewriter font (unlike all other URLs in the paper which were unchanged).</p> <h2 id="proof-2">Proof 2</h2> <p>After accepting the first proofs, Springer emailed a second pair of proofs because they forgot to add artifact evaluation badges to both papers. Apparently their fabulous proof checking system cannot be used a second time so this had to be done completely by email.</p> <p>In the Goblint (verifier) paper, the only change was indeed the addition of the badges. In the Goblint Validator paper, they intentionally screwed everything else up while adding those badges. 
Here’s the worst of what they did.</p> <h3 id="misencoding-editor-names">Misencoding editor names</h3> <p>In the year 2024, Springer still struggles with encodings: the name of the proceedings editor <a href="https://lkovacs.com/">Laura Kovács</a> appeared like this:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Copyright notice in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h3 id="replacing-small-caps-font-in-title">Replacing small caps font in title</h3> <p>It’s common for tool names to be in small caps (<code class="language-plaintext highlighter-rouge">\textsc</code>) in paper titles, especially for these SV-COMP tool papers in TACAS proceedings. For some reason, small caps wasn’t enough for Springer and they made it bold italic small caps. I have never seen this in a paper title (or anywhere for that matter).</p> <p>The following image comparisons illustrate the pessimization from our camera-ready version to Springer’s proof (it’s very easy to tell which side is which version). 
Hover/swipe across the images to fully appreciate the horror.</p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="25"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Title in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Title in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h3 id="expanding-author-emails">Expanding author emails</h3> <p>In this iteration they went a step further and expanded all author emails, i.e., <code class="language-plaintext highlighter-rouge">{a,b}@c.d</code> was replaced 
with <code class="language-plaintext highlighter-rouge">a@c.d</code>, <code class="language-plaintext highlighter-rouge">b@c.d</code>, which takes up more space. I don’t understand why it was necessary for <em>this</em> version of <em>this</em> paper, but not any other.</p> <h3 id="reformatting-tables-entirely">Reformatting tables entirely</h3> <p>Just like in <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">part 1</a>, Springer redid <a href="https://ctan.org/pkg/booktabs?lang=en"><code class="language-plaintext highlighter-rouge">booktabs</code></a> tables with the following changes:</p> <ol> <li><em>All</em> columns are left-aligned (again). That is especially bad for numeric data spanning multiple orders of magnitude. Our tables with <a href="https://ctan.org/pkg/siunitx?lang=en"><code class="language-plaintext highlighter-rouge">siunitx</code></a> columns that properly align digits and decimal points were ruined. Columns of centered checkmarks and crosses became awkward. Centered <code class="language-plaintext highlighter-rouge">\multicolumn</code> spans became odd.</li> <li>Line breaks from multi-line column headers were removed. 
This makes some columns overly wide.</li> <li>Row spacing was increased.</li> <li>The font for tables was changed to something not matching the rest of the paper (again).</li> </ol> <p>This time they didn’t add vertical column-separating rules between all columns!</p> <h4 id="table-1">Table 1</h4> <p>Note how they changed “2,015” to “2015” in the bottom right cell, thinking it’s a year or something.</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h4 id="table-2">Table 2</h4> <p>Yes, they moved this table onto a separate page and rotated it 90° (in addition to all of 
the above).</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h2 id="proof-3">Proof 3</h2> <p>After painstakingly listing all the unnecessary changes they made as typesetting errors in the Goblint Validator paper, Springer gave up on (most of) their stupidity and reverted to the first proof (with artifact badges correctly added this time).</p> <p>And yet, they still had to mess up something. In two places, end-of-line punctuation was shifted from the baseline to above the text. 
It is beyond me how one could do this accidentally.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Shifted comma in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Shifted dot in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h2 id="conclusion">Conclusion</h2> <p>At least this time Springer listened and properly undid all the ugliness, so I don’t have to feel shame about how the publisher’s versions of my papers look. But clearly they haven’t stopped making arbitrary unnecessary changes, which wasted everyone’s time, as well as unfathomable accidental ones.</p> <p>Stay tuned for part 3! 
(It’s bound to happen…)</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[This post continues the Springer typesetting saga from part 1 on a pair of papers in March 2024. We had two papers (Saan et al., 2024; Saan et al., 2024) accepted into the same conference proceedings and they were edited very differently. The following compares our submitted camera-ready version with the many proofs from Springer typesetters.]]></summary></entry><entry><title type="html">Springer anti-typesetters, part 1</title><link href="https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1/" rel="alternate" type="text/html" title="Springer anti-typesetters, part 1"/><published>2024-07-21T00:00:00+00:00</published><updated>2024-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1/"><![CDATA[<p>This post describes (only some of) my frustrations with Springer’s typesetting of one paper <a class="citation" href="#10.1007/978-3-031-50524-9_4">(Saan et al., 2024)</a> in December 2023. It was also written around that time, but not published. It compares our submitted camera-ready version (which is very similar to the nicely-formatted <a href="https://arxiv.org/abs/2310.16572">arXiv version</a>) with the proofs from Springer typesetters.</p> <h2 id="discrediting-other-authors">Discrediting other authors</h2> <p>In our paper the citation “Beyer and Strejček [23]” was edited by Springer to simply “Strejček [23]”, discrediting <a href="https://www.sosy-lab.org/people/beyer/">Dirk Beyer</a>. We used <code class="language-plaintext highlighter-rouge">\citet{Beyer2022}</code> in LaTeX and both authors are still listed in the corresponding References entry. 
<a href="https://link.springer.com/chapter/10.1007/978-3-031-22308-2_8">The cited paper</a> is published via Springer, so they should have no doubt about the authorship. What reason would Springer have to replace <code class="language-plaintext highlighter-rouge">\citet</code> with a reduced author list which is no longer consistent with the References? Or how would one do that accidentally?</p> <h2 id="reformatting-tables-entirely">Reformatting tables entirely</h2> <p>We use the <a href="https://ctan.org/pkg/booktabs?lang=en"><code class="language-plaintext highlighter-rouge">booktabs</code></a> LaTeX package to typeset beautiful professional tables. For whatever reason Springer entirely reformats tables:</p> <ol> <li>Vertical column-separating rules are added between all columns.</li> <li><em>All</em> columns are left-aligned. That is especially bad for numeric data spanning multiple orders of magnitude. Our tables with <a href="https://ctan.org/pkg/siunitx?lang=en"><code class="language-plaintext highlighter-rouge">siunitx</code></a> columns that properly align digits and decimal points were ruined. Columns of centered checkmarks and crosses became awkward. Centered <code class="language-plaintext highlighter-rouge">\multicolumn</code> spans became odd.</li> <li>The font for tables was changed to something not matching the rest of the paper.</li> </ol> <p>Nothing in <a href="https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines">the Springer guidelines</a> requires any such changes to tables, instead requiring:</p> <blockquote> <p>It is essential that all illustrations are clear and legible.</p> </blockquote> <p>By unnecessarily reformatting all tables, Springer editors have done the complete opposite of their own guidelines.</p> <h3 id="comparisons">Comparisons</h3> <p>The following image comparisons illustrate the pessimization from our camera-ready version to Springer’s proof (it’s very easy to tell which side is which version). 
Hover/swipe across the images to fully appreciate the horror.</p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <h4 id="table-1">Table 1</h4> <p>Extra ugly is how the vertical rules have gaps in them and some cells are not completely colored.</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table1-camera-480.webp 480w,/assets/springer-anti-typesetters/unassume-table1-camera-800.webp 800w,/assets/springer-anti-typesetters/unassume-table1-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table1-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table1-proof-480.webp 480w,/assets/springer-anti-typesetters/unassume-table1-proof-800.webp 800w,/assets/springer-anti-typesetters/unassume-table1-proof-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table1-proof.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h4 id="table-2">Table 2</h4> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table2-camera-480.webp 
480w,/assets/springer-anti-typesetters/unassume-table2-camera-800.webp 800w,/assets/springer-anti-typesetters/unassume-table2-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table2-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table2-proof-480.webp 480w,/assets/springer-anti-typesetters/unassume-table2-proof-800.webp 800w,/assets/springer-anti-typesetters/unassume-table2-proof-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table2-proof.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h3 id="history">History</h3> <p>Apparently, this behavior is far from new: <a href="https://twitter.com/ducha_aiki/status/1059444234711977984?lang=en">Dmytro Mishkin already complained about it in 2018</a>. Meanwhile in the first half of 2023, <code class="language-plaintext highlighter-rouge">booktabs</code> tables were fine again, as evidenced by some of <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_34">our</a> <a href="https://link.springer.com/chapter/10.1007/978-3-031-30044-8_2">papers</a>. Clearly, Springer lacks any kind of policy on this and their typesetters are rulers of a wild west, each making up their own rules. 
</p> <h2 id="making-references-inconsistent">Making references inconsistent</h2> <h3 id="abbreviating-partially">Abbreviating partially</h3> <p>We explicitly edited our BibTeX bibliography to be consistent across all entries, in particular about <code class="language-plaintext highlighter-rouge">booktitle</code>s/<code class="language-plaintext highlighter-rouge">journal</code>s. As provided by Springer Link, we used unabbreviated names like “Static Analysis” and “Tools and Algorithms for the Construction and Analysis of Systems”. Springer says:</p> <blockquote> <p>References are also modified to make them compatible with CrossRef, which will permit cross referencing within SpringerLink […]</p> </blockquote> <p>Fair enough, they abbreviated such names to, e.g., SAS and TACAS. However, their editing is completely inconsistent: some entries use abbreviations, while others still have full names for the very same proceedings/journals.</p> <h3 id="editing-titles">Editing titles</h3> <p>We cite <a href="https://link.springer.com/chapter/10.1007/978-3-030-72013-1_28">two</a> <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_34">papers</a> which have the tool name Goblint set in small caps in their title. As provided by Springer Link, we did not have small caps in our bibliography. Springer edited one to use small caps, but left the other (right after the first) in normal font. 
If there were any policy, then it should be enforced consistently by Springer Link and Springer editors, or at minimum be consistent within a single References section.</p> <p>Furthermore, Springer typesetters like to remove all crucial capitalization from cited titles:</p> <ul> <li>replacing <a href="https://dl.acm.org/doi/10.1145/3470569">“C programs”</a> with “c programs”,</li> <li>replacing <a href="https://dl.acm.org/doi/10.1145/3470569">“Frama-C”</a> with “frama-c”,</li> <li>replacing <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_39">“CommuHash”</a> with “commuhash”.</li> </ul> <h2 id="adding-random-dots">Adding random dots</h2> <p>We reference items in a prior <code class="language-plaintext highlighter-rouge">enumerate</code> like “Items 2 and 4 illustrate”, which was edited by Springer to “Items 2 and 4. illustrate”. The additional dot was consistently (!) added after item number 4 throughout the paper (multiple instances of “Item 4 from Example 4” were changed to “Item 4. from Example 4”). But all other item references into the same list did not get a dot — Springer <a href="https://xkcd.com/221/">randomly</a> did it to Item 4.</p> <h2 id="conclusion">Conclusion</h2> <p>After listing every occurrence of all of these issues explicitly in the response to typesetting proofs, Springer did manage to fix (i.e., undo) most, but not all, of these issues. In the publisher’s version the tables still aren’t as nice as our original version.</p> <p><del>Stay tuned for</del> Check out <a href="/blog/2024/07/22/springer-anti-typesetters-part-2/">part 2</a>!</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[This post describes (only some) of my frustrations with Springer’s typesetting of one paper (Saan et al., 2024) in December 2023 and was also written around that time, but not published. 
It compares our submitted camera-ready version (which is very similar to the nicely-formatted arXiv version) with the proofs from Springer typesetters.]]></summary></entry><entry><title type="html">Automata-theoretic approach to regex crosswords</title><link href="https://sim642.eu/blog/2024/07/20/regex-crossword-automata/" rel="alternate" type="text/html" title="Automata-theoretic approach to regex crosswords"/><published>2024-07-20T00:00:00+00:00</published><updated>2024-07-20T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/20/regex-crossword-automata</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/20/regex-crossword-automata/"><![CDATA[<p>This post documents my automata-theoretic approach to solving <a href="https://regexcrossword.com/">regex crosswords</a>, which is unlike other approaches out there (see <a href="#related-work">Related work</a> below). The biggest limitation of this computer science theory approach is that it only works for truly regular regexes (so no capture groups, look-aheads, etc). 
I have <a href="https://github.com/sim642/regex-crossword/">implemented it in Java</a> using the <a href="https://www.brics.dk/automaton/"><code class="language-plaintext highlighter-rouge">dk.brics.automaton</code></a> library.</p> <h2 id="automata-theoretic-approach">Automata-theoretic approach</h2> <p>Let a <em>rectangular regex crossword</em> with \(h\) rows and \(w\) columns be defined by:</p> <ol> <li>row regexes \(R_0, R_1, \dots, R_{h-1}\) (matching left-to-right),</li> <li>column regexes \(C_0, C_1, \dots, C_{w-1}\) (matching top-to-bottom).</li> </ol> <p>Let’s view the \(w \times h\) rectangle as a string of length \(w \times h\) where the rows have been concatenated (i.e., <a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order">row-major order</a>).</p> <h4 id="example">Example</h4> <p>Consider <a href="https://regexcrossword.com/challenges/intermediate/puzzles/48f25c7e-0416-410e-b96a-c7ee19dfa110">Intermediate: Always remember from regexcrossword.com</a> as an example:</p> \[\begin{array} {|r|c|c|c|}\hline &amp; \texttt{UB|IE|AW} &amp; \texttt{[TUBE]*} &amp; \texttt{[BORF].} \\ \hline \texttt{[NOTAD]*} &amp; ? &amp; ? &amp; ? \\ \hline \texttt{WEL|BAL|EAR} &amp; ? &amp; ? &amp; ? \\ \hline \end{array}\] <h3 id="row-automata">Row automata</h3> <p>First, construct a <em>width automaton</em> \(W\) which accepts all strings of length \(w\). 
This automaton corresponds to the regex <code class="language-plaintext highlighter-rouge">.{w}</code>.</p> <p>Then, for each row regex \(R_i\) construct a <em>row automaton</em> \(R_i'\) as follows:</p> \[R_i' = W^{i} \circ (R_i \cap W) \circ W^{h - i - 1},\] <p>where</p> <ul> <li>\(\circ\) is the binary operator for concatenating two automata,</li> <li>exponentiation self-concatenates the indicated number of copies of the automaton (power 0 gives the automaton whose language contains only the empty string),</li> <li>\(\cap\) is the binary operator for intersection of two automata.</li> </ul> <h4 id="example-1">Example</h4> <p>For the above example, the automata are the following.</p> <p>\(W\) is the <em>width automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">.{3}</code>:</p> <pre><code class="language-mermaid">graph LR
    start:::hidden
    w0(($$w_0$$))
    w1(($$w_1$$))
    w2(($$w_2$$))
    w3((($$w_3$$)))
    start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w3

    classDef hidden display: none;
</code></pre> <p>For brevity, a single character regex is used to describe the possible transitions, as opposed to parallel transitions for each character in the alphabet. Dead/trap states and transitions are also omitted.</p> <ol> <li>\(R_0\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[NOTAD]*</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     r0((($$r_0$$)))
     start--&gt;r0--&gt;|"[NOTAD]"|r0

     classDef hidden display: none;
</code></pre> <p>For brevity, a single character regex is used to describe the possible transitions, as opposed to 5 self-loops.</p> <p>\(R_0' = (R_0 \cap W) \circ W\) is the <em>row automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">[NOTAD]{3}.{3}</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     q0(( ))
     q1(( ))
     q2(( ))
     q3(( ))
     q4(( ))
     q5(( ))
     q6((( )))
     start--&gt;q0--&gt;|"[NOTAD]"|q1--&gt;|"[NOTAD]"|q2--&gt;|"[NOTAD]"|q3--&gt;|.|q4--&gt;|.|q5--&gt;|.|q6

     classDef hidden display: none;
</code></pre> </li> <li>\(R_1\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">WEL|BAL|EAR</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     r0(($$r_0$$))
     r1(($$r_1$$))
     r2(($$r_2$$))
     r3(($$r_3$$))
     r4(($$r_4$$))
     r5(($$r_5$$))
     r6(($$r_6$$))
     r7((($$r_7$$)))
     start--&gt;r0--&gt;|W|r1--&gt;|E|r2--&gt;|L|r7
     r0--&gt;|B|r3--&gt;|A|r4--&gt;|L|r7
     r0--&gt;|E|r5--&gt;|A|r6--&gt;|R|r7

     classDef hidden display: none;
</code></pre> <p>(For symmetry, this has not been minimized.)</p> <p>\(R_1' = W \circ (R_1 \cap W)\) is the <em>row automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">.{3}(WEL|BAL|EAR)</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     w0(( ))
     w1(( ))
     w2(( ))
     w3(( ))
     start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w3
     r1(( ))
     r2(( ))
     r3(( ))
     r4(( ))
     r5(( ))
     r6(( ))
     r7((( )))
     w3--&gt;|W|r1--&gt;|E|r2--&gt;|L|r7
     w3--&gt;|B|r3--&gt;|A|r4--&gt;|L|r7
     w3--&gt;|E|r5--&gt;|A|r6--&gt;|R|r7

     classDef hidden display: none;
</code></pre> </li> </ol> <h3 id="column-automata">Column automata</h3> <p>While the construction of automata for each row regex is rather intuitive, it’s significantly more involved for column regexes. That is because the characters in the row-major string that make up the <em>subsequence</em> which needs to match the column regex are not consecutive. Nevertheless, it is possible using a novel<sup id="fnref:novel"><a href="#fn:novel" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> construction.</p> <p>First, for each column regex \(C_i\) construct a <em>guard automaton</em> \(O^i \circ W^*\) which is in an accepting state at every position in the row-major string that belongs to column \(i\). Here, \(O\) (for offset) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">.</code>. The guard automaton is a repetition of the <em>width automaton</em> \(W\), because every \(w\)-th position in the row-major string is in the same column, prefixed by an offset automaton \(O^i\) to start the repetition from the corresponding column. This automaton corresponds to the regex <code class="language-plaintext highlighter-rouge">.{i}(.{w})*</code>.</p> <p>Then, for each column regex \(C_i\) construct a <em>column automaton</em> \(C_i'\) as follows:</p> \[C_i' = C_i \triangleleft (O^i \circ W^*),\] <p>where \(A \triangleleft G\) is a special product-like <em>guarded automaton</em> (\(A\) guarded by \(G\)), where \(A\) transitions only if \(G\) is in an accepting state. 
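</p> <p>Before the formal definition, here is a minimal Python sketch of this guarded stepping rule (a toy DFA encoding with hypothetical names, which simulates the product on an input word instead of materializing its state set — not the actual <code class="language-plaintext highlighter-rouge">dk.brics.automaton</code>-based Java implementation):</p>

```python
from collections import namedtuple

# Toy DFA: initial state, accepting-state predicate, and transition function
# delta(state, char) -> state, or None for the (omitted) dead/trap state.
DFA = namedtuple("DFA", ["init", "accepting", "delta"])

def guarded_accepts(a, g, word):
    """Simulate the guarded automaton A ◁ G on word:
    A steps on a character only when G is currently in an accepting state,
    otherwise A stays put; G steps on every character.
    The run accepts iff the A-component ends in an accepting state."""
    qa, qg = a.init, g.init
    for c in word:
        if g.accepting(qg):
            qa = a.delta(qa, c)  # A and G step together
            if qa is None:
                return False     # A fell into its dead state
        qg = g.delta(qg, c)      # G always steps
        if qg is None:
            return False
    return a.accepting(qa)

# Third column of the running example: C_2 for regex "[BORF]." guarded by
# the guard automaton for "..(.{3})*" (i.e. i = 2, w = 3).
c2 = DFA(0, lambda q: q == 2,
         lambda q, c: 1 if q == 0 and c in "BORF" else (2 if q == 1 else None))
guard = DFA(0, lambda q: q == 2,  # states 0,1: offset; 2,3,4: width-3 cycle
            lambda q, c: {0: 1, 1: 2, 2: 3, 3: 4, 4: 2}[q])

assert guarded_accepts(c2, guard, "ATOWEL")      # column 2 reads "OL"
assert not guarded_accepts(c2, guard, "ATXWEL")  # "X" is not in [BORF]
```

<p>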
In general \(A \triangleleft G\) is defined as:</p> <ol> <li>Its states \((a, g)\) are from the product set of states \(A \times G\).</li> <li>Its initial state consists of the initial states of \(A\) and \(G\).</li> <li>Its state \((a, g)\) is accepting if \(a\) is an accepting state of \(A\).</li> <li>In state \((a, g)\) with input character \(c\) the automaton steps to <ol> <li>\((a', g')\) if \(g\) is an accepting state of \(G\), \(A\) steps from \(a\) to \(a'\) with \(c\) and \(G\) steps from \(g\) to \(g'\) with \(c\),</li> <li>\((a, g')\) if \(g\) is <em>not</em> an accepting state of \(G\) and \(G\) steps from \(g\) to \(g'\) with \(c\).</li> </ol> </li> </ol> <p>(This has some similarities to <a href="https://dl.acm.org/doi/abs/10.5555/954014.954036">stretching of automata</a>.)</p> <h4 id="example-2">Example</h4> <p>For the above example, the automata are the following.</p> <ol> <li>\(C_0\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">UB|IE|AW</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0(($$c_0$$))
     c1(($$c_1$$))
     c2(($$c_2$$))
     c3(($$c_3$$))
     c4((($$c_4$$)))
     start--&gt;c0--&gt;|U|c1--&gt;|B|c4
     c0--&gt;|I|c2--&gt;|E|c4
     c0--&gt;|A|c3--&gt;|W|c4

     classDef hidden display: none;
</code></pre> <p>\(W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_0' = C_0 \triangleleft W^*\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">(U..B|I..E|A..W)..</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0w0(($$c_0,w_0$$))
     c1w1(($$c_1,w_1$$))
     c1w2(($$c_1,w_2$$))
     c1w0(($$c_1,w_0$$))
     c4w1((($$c_4,w_1$$)))
     c2w1(($$c_2,w_1$$))
     c2w2(($$c_2,w_2$$))
     c2w0(($$c_2,w_0$$))
     c3w1(($$c_3,w_1$$))
     c3w2(($$c_3,w_2$$))
     c3w0(($$c_3,w_0$$))
     c4w2((($$c_4,w_2$$)))
     c4w0((($$c_4,w_0$$)))
     start--&gt;c0w0--&gt;|U|c1w1--&gt;|.|c1w2--&gt;|.|c1w0--&gt;|B|c4w1
     c0w0--&gt;|I|c2w1--&gt;|.|c2w2--&gt;|.|c2w0--&gt;|E|c4w1
     c0w0--&gt;|A|c3w1--&gt;|.|c3w2--&gt;|.|c3w0--&gt;|W|c4w1
     c4w1--&gt;|.|c4w2--&gt;|.|c4w0

     classDef hidden display: none;
</code></pre> </li> <li>\(C_1\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[TUBE]*</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0((($$c_0$$)))
     start--&gt;c0--&gt;|"[TUBE]"|c0

     classDef hidden display: none;
</code></pre> <p>\(O^1 \circ W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">.(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     o0(($$o_0$$))
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;o0--&gt;|.|w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_1' = C_1 \triangleleft (O^1 \circ W^*)\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">.(([TUBE]..)*([TUBE].?)?)?</code> – this is uglier than the automaton):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0o0((($$c_0,o_0$$)))
     c0w0((($$c_0,w_0$$)))
     c0w1((($$c_0,w_1$$)))
     c0w2((($$c_0,w_2$$)))
     start--&gt;c0o0--&gt;|.|c0w0--&gt;|"[TUBE]"|c0w1--&gt;|.|c0w2--&gt;|.|c0w0

     classDef hidden display: none;
</code></pre> <p>(Note that although all states shown are accepting, \(C_1'\) does not accept all strings – the dead state for mismatches at <code class="language-plaintext highlighter-rouge">[TUBE]</code> is not shown.)</p> </li> <li>\(C_2\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[BORF].</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0(($$c_0$$))
     c1(($$c_1$$))
     c2((($$c_2$$)))
     start--&gt;c0--&gt;|"[BORF]"|c1--&gt;|.|c2

     classDef hidden display: none;
</code></pre> <p>\(O^2 \circ W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">..(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     o0(($$o_0$$))
     o1(($$o_1$$))
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;o0--&gt;|.|o1--&gt;|.|w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_2' = C_2 \triangleleft (O^2 \circ W^*)\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">..[BORF]...</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0o0(($$c_0,o_0$$))
     c0o1(($$c_0,o_1$$))
     c0w0(($$c_0,w_0$$))
     c1w1(($$c_1,w_1$$))
     c1w2(($$c_1,w_2$$))
     c1w0(($$c_1,w_0$$))
     c2w1((($$c_2,w_1$$)))
     start--&gt;c0o0--&gt;|.|c0o1--&gt;|.|c0w0--&gt;|"[BORF]"|c1w1--&gt;|.|c1w2--&gt;|.|c1w0--&gt;|.|c2w1

     classDef hidden display: none;
</code></pre> </li> </ol> <h3 id="solution">Solution</h3> <p>Finally, construct the <em>solution automaton</em> \(S\) as an intersection of all row and column automata:</p> \[S = \left(\bigcap_{i=0}^{h-1} R_i'\right) \cap \left(\bigcap_{i=0}^{w-1} C_i'\right).\] <p>This automaton describes <em>all</em> solutions to the regex crossword. If the regex crossword has a unique solution, this automaton is linear and describes exactly one accepted string.</p> <h4 id="example-3">Example</h4> <p>For the above example, the solution automaton is \(S = R_0' \cap R_1' \cap C_0' \cap C_1' \cap C_2'\) (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">ATOWEL</code>):</p> <pre><code class="language-mermaid">graph LR
    start:::hidden
    q0(( ))
    q1(( ))
    q2(( ))
    q3(( ))
    q4(( ))
    q5(( ))
    q6((( )))
    start--&gt;q0--&gt;|A|q1--&gt;|T|q2--&gt;|O|q3--&gt;|W|q4--&gt;|E|q5--&gt;|L|q6

    classDef hidden display: none;
</code></pre> <p>The solution to the regex crossword can be read off the automaton: the only accepted string is <code class="language-plaintext highlighter-rouge">ATOWEL</code>.</p> <h3 id="performance">Performance</h3> <p>Although I have <a href="https://github.com/sim642/regex-crossword/">implemented it in Java</a>, I don’t have useful information about its performance (especially compared to other approaches). The runtimes on small test cases are negligible.</p> <p>This approach involves a lot of product automata constructions (for intersections and guarded products) which, at least in theory, can yield quite large automata. My hunch is that the intermediate automata, when minimized, are relatively small compared to the theoretical bounds (as also seen in the example). Conventionally, regex crosswords have unique solutions, so as more automata are intersected more restrictions are combined, converging towards a smaller language with a smaller automaton.</p> <h2 id="related-work">Related work</h2> <p>The following table gives an overview of various approaches to the regex crossword problem. 
Most seem to use more brute force (backtracking, search, SMT), but also target non-regular regexes which cannot be expressed as finite automata.</p> <table> <thead> <tr> <th>Approach</th> <th>Implementation</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>Logic programming</td> <td><a href="https://github.com/lvh/regex-crossword">Clojure</a></td> <td><a href="https://www.lvh.io/posts/solving-regex-crosswords/">Blog</a></td> </tr> <tr> <td>Search (“heuristic”)</td> <td><a href="https://github.com/antoine-trux/regex-crossword-solver">C++</a></td> <td><a href="https://solving-regular-expression-crosswords.blogspot.com/2016/05/blog-post.html?m=1">Blog</a></td> </tr> <tr> <td>SMT (“string constraint solving”)</td> <td><a href="https://github.com/blukat29/regex-crossword-solver">Python</a></td> <td><a href="https://blukat.me/2016/01/regex-crossword-solver/">Blog</a></td> </tr> <tr> <td>Custom regex engine</td> <td><a href="https://github.com/almost/regex-crossword-solver">Haskell</a></td> <td><a href="https://almostobsolete.net/regex-crossword/part1.html">Blog part 1</a>, <a href="https://almostobsolete.net/regex-crossword/part2.html">part 2</a></td> </tr> <tr> <td>Go regex DFA inspection (“backtracking”)</td> <td><a href="https://github.com/hermanschaaf/regex-crossword-solver">Go</a></td> <td><a href="https://web.archive.org/web/20190111061731/http://herman.asia/solving-regex-crosswords-using-go">Blog (archived)</a></td> </tr> <tr> <td>Evolutionary algorithm (“heuristic”)</td> <td><a href="https://github.com/maxymczech/gp-regex-crossword">JavaScript</a></td> <td>—</td> </tr> <tr> <td>Search (“backtracking”)</td> <td><a href="https://github.com/purple4reina/regex-crossword-solver">Python</a></td> <td>—</td> </tr> <tr> <td>SMT</td> <td>—</td> <td><a href="https://link.springer.com/chapter/10.1007/978-981-99-8664-4_12">Paper</a></td> </tr> <tr> <td><em>This (automata-theoretic)</em></td> <td><a href="https://github.com/sim642/regex-crossword/">Java</a></td> 
<td><em>Above</em></td> </tr> </tbody> </table> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:novel"> <p>As far as I am aware. Please let me know otherwise. <a href="#fnref:novel" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="regular expressions"/><category term="programming"/><category term="java"/><summary type="html"><![CDATA[This post documents my automata-theoretic approach to solving regex crosswords, which is unlike other approaches out there (see Related work below). The biggest limitation of this computer science theory approach is that it only works for truly regular regexes (so no capture groups, look-aheads, etc). I have implemented it in Java using the dk.brics.automaton library.]]></summary></entry><entry><title type="html">Error in Conway’s “Regular Algebra and Finite Machines”</title><link href="https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines/" rel="alternate" type="text/html" title="Error in Conway’s “Regular Algebra and Finite Machines”"/><published>2024-07-15T00:00:00+00:00</published><updated>2024-07-15T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines/"><![CDATA[<p>I was following “Proof Pearl: Regular Expression Equivalence and Relation Algebra” <a class="citation" href="#krauss12">(Krauss &amp; Nipkow, 2012)</a> to implement a <strong>regular expression equivalence checker</strong> in OCaml to validate it as a project idea for my “Advanced Topics in Automata, Languages and Compilers” course.</p> <p>Once the neat implementation was done, I wanted to test it, especially with pairs of regular expressions that aren’t trivially equivalent. 
I scoured the internet (mostly Stack Overflow) and the literature for such examples. Eventually I stumbled upon <strong>“Regular Algebra and Finite Machines”</strong> <a class="citation" href="#conway71">(Conway, 1971)</a>.</p> <h2 id="the-error">The error</h2> <p>Deep in the book, page 121 contains the following exercises:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/conway-regular-algebra-and-finite-machines-exercises-480.webp 480w,/assets/conway-regular-algebra-and-finite-machines-exercises-800.webp 800w,/assets/conway-regular-algebra-and-finite-machines-exercises-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/conway-regular-algebra-and-finite-machines-exercises.png" class="img-fluid rounded" width="100%" height="auto" alt="Exercises from Conway's &quot;Regular Algebra and Finite Machines&quot;" title="Exercises from Conway's &quot;Regular Algebra and Finite Machines&quot;" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>(Here \(+\) indicates the regex operator <code class="language-plaintext highlighter-rouge">|</code>, \(1\) is the empty string aka \(\varepsilon\) and \([]\) are just nested parentheses for grouping, not a regex character class.)</p> <p>Exercise 3 asks us to prove that <code class="language-plaintext highlighter-rouge">(xy)*(x|xy(yy*x)*)*</code> is equivalent to <code class="language-plaintext highlighter-rouge">((xy*y)*yx|x)*(xy)*</code>. However, my checker didn’t agree and spat out the <strong>counterexample <code class="language-plaintext highlighter-rouge">yx</code></strong>. Indeed, it’s not too hard to verify by hand that <a href="https://regex101.com/r/DIYvXh/1">the first regex does not match <code class="language-plaintext highlighter-rouge">yx</code></a> while <a href="https://regex101.com/r/Yp9Tzv/1">the second regex does</a>. 
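</p> <p>This mismatch is also easy to reproduce mechanically, e.g. with Python’s <code class="language-plaintext highlighter-rouge">re</code> module (the book’s expressions transcribed into standard regex syntax; this check is independent of my OCaml checker):</p>

```python
import re

# Exercise 3 as printed in the book, in standard regex syntax:
lhs = r"(xy)*(x|xy(yy*x)*)*"   # left-hand side
rhs = r"((xy*y)*yx|x)*(xy)*"   # right-hand side

# The counterexample found by the equivalence checker:
assert re.fullmatch(lhs, "yx") is None        # first regex rejects "yx"
assert re.fullmatch(rhs, "yx") is not None    # second regex accepts "yx"
```

<p>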
Hence, they aren’t equivalent!</p> <h2 id="the-fix">The fix</h2> <p>The correct formulation of exercise 3 would be:</p> \[(\boldsymbol{yx})^*[x+xy(yy^*x)^*]^* = [(xy^*y)^*yx+x]^*(xy)^*.\] <p>The difference compared to the book is shown in <strong>bold</strong>: beginning of the left-hand side should have <code class="language-plaintext highlighter-rouge">yx</code> instead of <code class="language-plaintext highlighter-rouge">xy</code>. My checker agrees that the two are now equivalent.</p> <p>This is indirectly corroborated later in the book as well. Solutions to the exercises say:</p> <blockquote> <p>A proof of 3 (not from C1-14) is implicit in later exercises.</p> </blockquote> <p>And exercise 9 includes the fixed left-hand side regular expression.</p> <h2 id="references">References</h2> <div class="publications"> <ol class="bibliography"><li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100" style="background-color:#007f4a"> <a href="https://link.springer.com/journal/10817">J. Autom. Reason.</a> </abbr> </div> <div id="krauss12" class="col-sm-8"> <div class="title">Proof Pearl: Regular Expression Equivalence and Relation Algebra</div> <div class="author"> Alexander Krauss and Tobias Nipkow </div> <div class="periodical"> <em>Journal of Automated Reasoning</em>, 2012 </div> <div class="periodical"> </div> <div class="links"> <a class="abstract btn btn-sm z-depth-0" role="button">Abs</a> <a href="https://doi.org/10.1007/s10817-011-9223-4" class="btn btn-sm z-depth-0" role="button">HTML</a> <a href="https://www21.in.tum.de/~krauss/papers/rexp.pdf" class="btn btn-sm z-depth-0" role="button">PDF</a> </div> <div class="abstract hidden"> <p>We describe and verify an elegant equivalence checker for regular expressions. It works by constructing a bisimulation relation between (derivatives of) regular expressions. 
By mapping regular expressions to binary relations, an automatic and complete proof method for (in)equalities of binary relations over union, composition and (reflexive) transitive closure is obtained. The verification is carried out in the theorem prover Isabelle/HOL, yielding a practically useful decision procedure.</p> </div> </div> </div> </li> <li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100">Book</abbr> </div> <div id="conway71" class="col-sm-8"> <div class="title">Regular Algebra and Finite Machines</div> <div class="author"> John Horton Conway </div> <div class="periodical"> 1971 </div> <div class="periodical"> </div> <div class="links"> </div> </div> </div> </li></ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="regular expressions"/><category term="academia"/><summary type="html"><![CDATA[I was following “Proof Pearl: Regular Expression Equivalence and Relation Algebra” (Krauss &amp; Nipkow, 2012) to implement a regular expression equivalence checker in OCaml to validate it as a project idea for my “Advanced Topics in Automata, Languages and Compilers” course.]]></summary></entry><entry><title type="html">OCaml linting tools and techniques</title><link href="https://sim642.eu/blog/2024/05/01/ocaml-linting/" rel="alternate" type="text/html" title="OCaml linting tools and techniques"/><published>2024-05-01T00:00:00+00:00</published><updated>2025-08-27T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/05/01/ocaml-linting</id><content type="html" xml:base="https://sim642.eu/blog/2024/05/01/ocaml-linting/"><![CDATA[<p>Recently (but also 3 years ago), I was interested in <a href="https://github.com/goblint/analyzer/pull/1435">finding all catch-all exception handlers in Goblint</a>, which is written in OCaml, in order to prevent “uncatchable” exceptions from being caught and accidentally swallowed. 
“Uncatchable” exceptions are those which should not be ignored, e.g. <code class="language-plaintext highlighter-rouge">Out_of_memory</code>. My first attempt was using a <a href="https://semgrep.dev/">Semgrep</a> rule, but it turned out to be <a href="https://github.com/semgrep/semgrep/issues/10193">too buggy</a> <a href="https://github.com/semgrep/semgrep/issues/3822">to reliably</a> <a href="https://github.com/semgrep/semgrep/issues/3821">do the job</a>. Therefore, I sought out code linters for OCaml.</p> <h2 id="tools">Tools</h2> <p>The following table summarizes all OCaml linting tools I managed to find: active or dead, general or special-purpose, standalone or Ppx, monolithic or modular. In this post I focus on linting (based on syntax and possibly types) and exclude program analyzers like <a href="https://github.com/rescript-association/reanalyze">reanalyze</a> and <a href="https://salto.gitlabpages.inria.fr/">Salto</a>. Ocamllint and ocp-lint are the most universal attempts at OCaml linting; however, they’re long dead and no replacement seems to have emerged.</p> <table> <thead> <tr> <th>Tool</th> <th>Status</th> <th>Use case</th> <th>Mode</th> <th>Structure</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/cryptosense/ocamllint">ocamllint</a></td> <td>Archived</td> <td>General</td> <td>Ppx</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/OCamlPro/typerex-lint/">ocp-lint/typerex-lint</a></td> <td>Inactive</td> <td>General</td> <td>Standalone</td> <td>Modular &amp; extensible</td> </tr> <tr> <td><a href="https://github.com/upenn-cis1xx/camelot">camelot</a></td> <td>Semiactive</td> <td>General/teaching</td> <td>Standalone</td> <td>Modular</td> </tr> <tr> <td><a href="https://github.com/Kakadu/zanuda">zanuda</a></td> <td>Active</td> <td>General</td> <td>Standalone</td> <td>Modular</td> </tr> <tr> <td><a href="https://github.com/NathanReb/bene-gesselint">bene-gesselint</a></td> <td>Inactive</td> <td>Framework</td> 
<td>Ppxlib</td> <td>Modular &amp; extensible</td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a></td> <td>Active</td> <td>Company</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/janestreet/base/tree/3ce90cf26c60eea1f965f2358111e7cabc924953/lint">base ppx_base_lint</a></td> <td>Active</td> <td>Project</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/tree/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version">mina ppx_version</a></td> <td>Active</td> <td>Project</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/just-max/less-power/tree/da8ad093c5fb593917c140f955e7820e08f9231c/src/ast-check">less-power ast-check</a></td> <td>Active</td> <td>Teaching</td> <td>Standalone/Ppxlib</td> <td>Monolithic</td> </tr> </tbody> </table> <p>There are two general <strong>execution modes</strong>:</p> <ol> <li>Ppx, which are OCaml AST preprocessors, executed by the build system similarly to other Ppx-es like <code class="language-plaintext highlighter-rouge">@@deriving</code> features. These are relatively easy to integrate into modern dune-based workflows.</li> <li>Standalone, which are to be executed outside of the usual compilation process. These don’t integrate into modern dune-based workflows due to <a href="https://github.com/ocaml/dune/issues/3471">very limited linting support in dune</a>.</li> </ol> <p>There are three general <strong>structures</strong> to these linters:</p> <ol> <li>Monolithic, where new rules would have to be implemented intertwined with already existing rules. This is reasonable for special-purpose linters and allows all checks to be performed in a single AST pass.</li> <li>Modular (but not extensible), where rules are implemented independently from others but form a fixed ruleset. These can be more difficult to combine into a single AST pass and might mean multiple passes in some cases. 
They are non-extensible because new rules must be integrated into the core tool itself.</li> <li>Modular and extensible, which has the benefits from the previous point, but also allows custom rules to be added without modifying the tool itself. Thus, they feature some sort of plugin system.</li> </ol> <h3 id="non-ppxlib-tools">Non-Ppxlib tools</h3> <p>The following table provides more details about the non-Ppxlib tools. Notably, some support type information in rules, which allows more expressive and accurate checks, but also means that they cannot be part of the usual Ppx preprocessing step.</p> <table> <thead> <tr> <th>Tool</th> <th><code class="language-plaintext highlighter-rouge">Parsetree</code> traversal</th> <th>Type support</th> <th><code class="language-plaintext highlighter-rouge">Typedtree</code> traversal</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/cryptosense/ocamllint">ocamllint</a></td> <td><code class="language-plaintext highlighter-rouge">Ast_mapper</code></td> <td>No</td> <td>-</td> </tr> <tr> <td><a href="https://github.com/OCamlPro/typerex-lint/">ocp-lint/typerex-lint</a></td> <td>Recursion</td> <td>Yes</td> <td><code class="language-plaintext highlighter-rouge">TypedtreeIter</code></td> </tr> <tr> <td><a href="https://github.com/upenn-cis1xx/camelot">camelot</a></td> <td>Copy of <code class="language-plaintext highlighter-rouge">Ast_iterator</code></td> <td>No</td> <td>-</td> </tr> <tr> <td><a href="https://github.com/Kakadu/zanuda">zanuda</a></td> <td><code class="language-plaintext highlighter-rouge">Ast_iterator</code></td> <td>Yes</td> <td><code class="language-plaintext highlighter-rouge">Tast_iterator</code></td> </tr> </tbody> </table> <h3 id="ppxlib-tools">Ppxlib tools</h3> <p>The following table provides more details about the Ppxlib-based tools. All of these integrate with dune in one way or another. 
In some cases, different parts of the same linter from the first table work by slightly different means.</p> <table> <thead> <tr> <th>Tool</th> <th>Dune integration</th> <th>Ppxlib phase</th> <th>Traversal</th> <th>Output</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/NathanReb/bene-gesselint">bene-gesselint</a></td> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code>&amp;<code class="language-plaintext highlighter-rouge">Ast_pattern</code></td> <td><code class="language-plaintext highlighter-rouge">register_correction</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a> (<code class="language-plaintext highlighter-rouge">enforce_cold</code>)</td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a> (other)</td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/base/tree/3ce90cf26c60eea1f965f2358111e7cabc924953/lint">base ppx_base_lint</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext 
highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/blob/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version/lint_primitive_uses.ml">mina ppx_version (lint_primitive_uses)</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code>/<code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/blob/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version/lint_version_syntax.ml">mina ppx_version (lint_version_syntax)</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error</code>/<code class="language-plaintext highlighter-rouge">eprintf</code></td> </tr> <tr> <td><a href="https://github.com/just-max/less-power/tree/da8ad093c5fb593917c140f955e7820e08f9231c/src/ast-check">less-power ast-check</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">map_with_context</code></td> <td><code class="language-plaintext highlighter-rouge">error_extensionf</code></td> </tr> </tbody> </table> <p>There are two possible ways of <strong>dune integration</strong>:</p> <ol> <li><code class="language-plaintext highlighter-rouge">(preprocess)</code> stanza, which is the usual way to add Ppx preprocessors to the build of a library/executable. 
This runs unconditionally during the normal build process.</li> <li><code class="language-plaintext highlighter-rouge">(lint)</code> stanza, which has similar syntax but is <em>undocumented</em>. This doesn’t run by default in dune, but rather requires <code class="language-plaintext highlighter-rouge">dune build @lint</code> to be executed, which is very easy to forget. A rare example of this exists in <a href="https://github.com/ocaml/dune/blob/69a24a41e993306d4f1335f3106436b4cdf3f535/test/blackbox-tests/test-cases/lint.t/correct/dune">dune’s test suite</a>.</li> </ol> <p>Either way, a major inconvenience is that the linter has to be added to <em>every</em> dune library and executable. There’s no way right now to define entire-project linters, which is error-prone, as one may simply forget to add the linter to a new library.</p> <p>There are two main <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#driver_execution"><strong>Ppxlib phases</strong></a> used for such linters:</p> <ol> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#global-transfo-phase"><code class="language-plaintext highlighter-rouge">~impl</code> (or <code class="language-plaintext highlighter-rouge">~intf</code>)</a>, which is usually used for defining AST transformations. However, linters wouldn’t actually transform the program, but just output warnings during such a pass.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#the-linter-phase"><code class="language-plaintext highlighter-rouge">~lint_impl</code> (or <code class="language-plaintext highlighter-rouge">~lint_intf</code>)</a>, which runs before any transformations take place. 
In fact, this phase cannot even transform the AST, but only return a list of <code class="language-plaintext highlighter-rouge">Lint_error</code>s.</li> </ol> <p>There are <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/ast-traversal.html#the-different-kinds-of-traversals">various ways in Ppxlib to <strong>traverse the AST</strong></a>, and each linter uses one based on what needs to be returned from the phase and how the output is done. Note that <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Context_free/Rule/index.html">Ppxlib’s context-free (rewriting) rules</a> aren’t suitable for linting as-is: they can only match extension nodes, special functions, custom constants and attribute-annotated nodes. In particular, arbitrary <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/matching-code.html#ast_pattern_intro"><code class="language-plaintext highlighter-rouge">Ast_pattern</code></a>-based matching is not offered by Ppxlib. This is what bene-gesselint tries to provide as a thin wrapper; however, it doesn’t neatly combine multiple <code class="language-plaintext highlighter-rouge">Ast_pattern</code>-matching rules into a single AST pass.</p> <p>There are five means of <strong>output</strong> for Ppxlib-based linters:</p> <ol> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Driver/index.html#val-register_correction"><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></a>, which proposes a code change that can be promoted using dune. 
Since this must propose a change, it cannot simply produce a warning, but multiple changes can also be registered.<sup id="fnref:updated-lint-impl-iter-correction"><a href="#fn:updated-lint-impl-iter-correction" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Driver/Lint_error/index.html"><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></a>, which yields a preprocessor warning. These can only be returned from <code class="language-plaintext highlighter-rouge">~lint_impl</code>, but many warnings can be returned from a single run.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Location/index.html#val-raise_errorf"><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></a>, which crashes the preprocessor with an error. Hence, multiple errors cannot be produced from a single linter run. Ppxlib also discourages the use of exceptions for error handling.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Location/index.html#val-error_extensionf"><code class="language-plaintext highlighter-rouge">Location.error_extensionf</code></a>, which creates a special error extension node to be put into the AST. Hence, this requires a <code class="language-plaintext highlighter-rouge">map</code> traversal, but also allows multiple errors to be returned. Ppxlib recommends this for error handling, at least for usual Ppxlib expanders, derivers and transformers. However, it seems to me that the OCaml compiler will still only print the error from the first error extension node.</li> <li><code class="language-plaintext highlighter-rouge">eprintf</code>, which is just very <em>ad hoc</em>.</li> </ol> <h2 id="ppxlib-techniques">Ppxlib techniques</h2> <p>Many combinations of dune integration, Ppxlib phase, traversal and output exist, but not all of them are compatible and sensible. 
Worse yet, some simply don’t even work, either silently or loudly. The following table gives an overview of the reasonable combinations and which to avoid.</p> <table> <thead> <tr> <th>Dune integration</th> <th>Ppxlib phase</th> <th>Traversal</th> <th>Output</th> <th>Comment</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></td> <td><a href="https://github.com/ocaml-ppx/ppxlib/issues/306"><strong>Doesn’t work</strong> (no output)</a></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></td> <td>Dune-promotable changes</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></td> <td>Multiple preprocessor warnings</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></td> <td>Single error</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code 
class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">eprintf</code></td> <td>Multiple <strong>non-standard</strong> warnings</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">map</code></td> <td><code class="language-plaintext highlighter-rouge">Location.error_extensionf</code></td> <td>Multiple errors</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></td> <td>Single error</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></td> <td>Dune-promotable changes<sup id="fnref:updated-lint-impl-iter-correction:1"><a href="#fn:updated-lint-impl-iter-correction" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></td> </tr> </tbody> </table> <p><strong><a href="https://github.com/sim642/dune-lint-demo">This GitHub repository</a></strong> includes examples of all of these setups in the corresponding subdirectories. 
See the Cram test <code class="language-plaintext highlighter-rouge">run.t</code> files for example outputs.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:updated-lint-impl-iter-correction"> <p>Previously, this post incorrectly claimed that <code class="language-plaintext highlighter-rouge">Driver.register_correction</code> only works during <code class="language-plaintext highlighter-rouge">(lint)</code>. <a href="#fnref:updated-lint-impl-iter-correction" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:updated-lint-impl-iter-correction:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming"/><category term="ocaml"/><category term="open source"/><summary type="html"><![CDATA[Recently (but also 3 years ago), I was interested in finding all catch-all exception handlers in Goblint, which is written in OCaml, in order to prevent “uncatchable” exceptions from being caught and accidentally swallowed. “Uncatchable” exceptions are those which should not be ignored, e.g. Out_of_memory. My first attempt was using a Semgrep rule, but it turned out to be too buggy to reliably do the job. 
Therefore, I sought out code linters for OCaml.]]></summary></entry><entry><title type="html">OCaml dependencies lower-bounds CI</title><link href="https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci/" rel="alternate" type="text/html" title="OCaml dependencies lower-bounds CI"/><published>2022-03-13T00:00:00+00:00</published><updated>2025-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci</id><content type="html" xml:base="https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci/"><![CDATA[<p>When submitting OCaml packages to the <a href="https://github.com/ocaml/opam-repository">opam package repository</a>, opam-ci runs extensive checks on the package submitted by the pull request. Most of these checks are very standard, involving building and testing the package on various OCaml versions, Linux distributions and OCaml compiler variants. In addition to all of that, there are checks called “lower-bounds”.</p> <p>The purpose of these unique jobs is to check whether the lower bounds of the package’s dependencies (i.e. their minimal versions), if declared at all, are right. That is, does the package actually compile when the oldest allowed versions of its dependencies are installed? All the other checks follow the default behavior of the opam package manager and install the newest allowed dependencies, essentially checking the upper bounds (at the time of submission at least).</p> <p>The lower-bounds check works by first installing the dependencies of the package normally. The dependency constraint solver of opam is then reconfigured to instead downgrade and remove as many packages as possible (while still satisfying your package’s lower bounds). 
Having installed the up-to-date versions first, this step nicely shows the version ranges being downgraded, which makes related issues easier to debug.</p> <h2 id="problem">Problem</h2> <p>These lower-bounds jobs can be quite annoying when trying to submit a package to opam, because package developers usually don’t test for that. You only find missing or too relaxed lower bounds after doing a release of the package and submitting a PR to <a href="https://github.com/ocaml/opam-repository">opam-repository</a>, just to find out it fails on their extensive CI.</p> <p>There are two main ways to fix these issues:</p> <ol> <li>Tighten the lower bound for a particular dependency (or add a lower bound if it doesn’t have one).</li> <li>If possible, change the usage of a dependency to not require features it introduced only in newer versions.</li> </ol> <p>If you follow the recommendations of <a href="https://github.com/tarides/dune-release">dune-release</a>, then after fixing the lower bound, instead of re-releasing the exact same version number of the package (and replacing the archive in-place), you release a new patch version of it. This might go on for a while: you release a patched version, submit that to <a href="https://github.com/ocaml/opam-repository">opam-repository</a>, see another lower-bounds failure, fix that – rinse and repeat.</p> <h2 id="solution">Solution</h2> <p>It would be <em>much</em> quicker and less hassle if you could somehow run a similar lower-bounds job on your own GitHub repository’s Actions.</p> <blockquote class="block-tip"> <h5 id="updated">Updated</h5> <p>The post has been updated with the recommended modern approach. For reference, the old version is kept below.</p> </blockquote> <h3 id="with-opam--21-recommended">With opam ≥ 2.1 (recommended)</h3> <p>In fact, you can, by using the <a href="https://github.com/ocaml-opam/opam-0install-cudf">0install solver</a> built into opam 2.1 and above. 
Its dependency solver <code class="language-plaintext highlighter-rouge">--criteria</code> argument allows configuring the preference for old versions (and possibly removing packages). The complete GitHub Actions workflow using <a href="https://github.com/ocaml/setup-ocaml">setup-ocaml</a> is the following<sup id="fnref:diff-test"><a href="#fn:diff-test" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
  <span class="na">pull_request</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">lower-bounds</span><span class="pi">:</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">os</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">ubuntu-latest</span>
        <span class="na">ocaml-compiler</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">4.14.2</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.os }}</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up OCaml ${{ matrix.ocaml-compiler }}</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">ocaml/setup-ocaml@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">ocaml-compiler</span><span class="pi">:</span> <span class="s">${{ matrix.ocaml-compiler }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Downgrade dependencies</span>
        <span class="c1"># Option 1: optimize for removing packages and downgrades (like opam-ci)</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install --solver=builtin-0install --criteria="+removed,+count[version-lag,solution]" . --deps-only --with-test</span>
        <span class="c1"># Option 2: don't optimize for removing packages, only downgrades (unlike opam-ci); will also remove depopts</span>
        <span class="c1"># run: opam install --solver=builtin-0install --criteria="+count[version-lag,solution]" . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune build</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Test</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune runtest</span>
</code></pre></div></div> <h3 id="with-opam-0install-legacy">With opam-0install (legacy)</h3> <p>In fact, you can by using <a href="https://github.com/ocaml-opam/opam-0install-solver">opam-0install</a> and its <code class="language-plaintext highlighter-rouge">--prefer-oldest</code> argument to downgrade the dependencies to their lower bounds<sup id="fnref:diff-remove"><a href="#fn:diff-remove" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>. The complete GitHub Actions workflow using <a href="https://github.com/ocaml/setup-ocaml">setup-ocaml</a> is the following (replace <code class="language-plaintext highlighter-rouge">MY_PACKAGE</code> with your package name)<sup id="fnref:diff-test:1"><a href="#fn:diff-test" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
  <span class="na">pull_request</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">lower-bounds</span><span class="pi">:</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">os</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">ubuntu-latest</span>
        <span class="na">ocaml-compiler</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">4.14.2</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.os }}</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up OCaml ${{ matrix.ocaml-compiler }}</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">ocaml/setup-ocaml@v2</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">ocaml-compiler</span><span class="pi">:</span> <span class="s">${{ matrix.ocaml-compiler }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install opam-0install</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install opam-0install</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Downgrade dependencies</span>
        <span class="c1"># Option 1: allow OCaml version downgrade</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install --unlock-base $(opam exec -- opam-0install --prefer-oldest --with-test MY_PACKAGE)</span>
        <span class="c1"># Option 2: forbid OCaml version downgrade (specify ocaml-base-compiler again to prevent it from being downgraded)</span>
        <span class="c1"># run: opam install $(opam exec -- opam-0install --prefer-oldest --with-test MY_PACKAGE ocaml-base-compiler.${{ matrix.ocaml-compiler }})</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune build</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Test</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune runtest</span>
</code></pre></div></div> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:diff-test"> <p>Unlike opam-ci, which doesn’t actually test with lower bounds, this is stronger and uses <code class="language-plaintext highlighter-rouge">--with-test</code>. <a href="#fnref:diff-test" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:diff-test:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:diff-remove"> <p>Unlike opam-ci, this will not attempt to remove packages, but just downgrade them. <a href="#fnref:diff-remove" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming"/><category term="ocaml"/><category term="github"/><category term="ci"/><category term="tutorial"/><summary type="html"><![CDATA[When submitting OCaml packages to the opam package repository, opam-ci runs extensive checks on the package submitted by the pull request. Most of these checks are very standard, involving building and testing the package on various OCaml versions, Linux distributions and OCaml compiler variants. In addition to all of that, there are checks called “lower-bounds”.]]></summary></entry><entry><title type="html">Training git rerere</title><link href="https://sim642.eu/blog/2021/10/16/training-git-rerere/" rel="alternate" type="text/html" title="Training git rerere"/><published>2021-10-16T00:00:00+00:00</published><updated>2021-10-16T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/10/16/training-git-rerere</id><content type="html" xml:base="https://sim642.eu/blog/2021/10/16/training-git-rerere/"><![CDATA[<p>One advanced git feature is <a href="https://git-scm.com/docs/git-rerere"><code class="language-plaintext highlighter-rouge">git rerere</code></a>, which helps with resolving the same merge conflicts multiple times, e.g. when regularly rebasing a branch. 
In short, when activated, git will record the conflicts before and after you manually resolve them. During future conflicts, it will automatically try to apply the recorded resolutions, so you don’t have to manually resolve the same conflict again, which is error-prone.</p> <p>The feature can be enabled for a repository with</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config rerere.enabled true
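# or, if you want the feature in every repository at once, enable it globally:
git config --global rerere.enabled true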
</code></pre></div></div> <h2 id="training">Training</h2> <p>When you first enable it on an existing repository which contains merge conflict resolutions, git rerere won’t automatically know about them because they haven’t been recorded while the feature is active.</p> <p>Luckily, there is a useful <a href="https://github.com/git/git/blob/master/contrib/rerere-train.sh"><code class="language-plaintext highlighter-rouge">rerere-train.sh</code></a> script to automatically record past merge conflict resolutions into its database. On Ubuntu this is also included in the <code class="language-plaintext highlighter-rouge">git</code> apt package at <code class="language-plaintext highlighter-rouge">/usr/share/doc/git/contrib/rerere-train.sh</code> (although without the execute bit set).</p> <p>After enabling git rerere, you can use the script to learn conflict resolutions from a range of commits <code class="language-plaintext highlighter-rouge">commit1..commit2</code> using the following command:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash /usr/share/doc/git/contrib/rerere-train.sh commit1..commit2
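# for example, to learn from all merges made on a branch since a release
# (v1.0 and main are hypothetical names here):
# bash /usr/share/doc/git/contrib/rerere-train.sh v1.0..main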
</code></pre></div></div> <p>If only one commit is given instead of a range, the script will train git rerere on the entire history back to the initial commit, which is probably undesired and useless.</p>]]></content><author><name></name></author><category term="programming"/><category term="git"/><category term="tutorial"/><summary type="html"><![CDATA[One advanced git feature is git rerere, which helps with resolving the same merge conflicts multiple times, e.g. when regularly rebasing a branch. In short, when activated, git will record the conflicts before and after you manually resolve them. During future conflicts, it will automatically try to apply the recorded resolutions, so you don’t have to manually resolve the same conflict again, which is error-prone.]]></summary></entry><entry><title type="html">Program source code as machine learning data</title><link href="https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data/" rel="alternate" type="text/html" title="Program source code as machine learning data"/><published>2021-06-13T00:00:00+00:00</published><updated>2021-06-14T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data</id><content type="html" xml:base="https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data/"><![CDATA[<p><em>This blog post is part of my project in <a href="https://courses.cs.ut.ee/2021/nn/spring/Main/HomePage">the Neural Networks course at University of Tartu</a>.</em></p> <p>Nowadays machine learning, and neural networks specifically, are used to solve a wide spectrum of tasks. Besides simple real-valued vectors as data, successful techniques and architectures have been developed to also work with images, natural language and audio. 
As a programmer and a programming languages enthusiast, the obvious question to me is how program source code can be input into a neural network to solve tasks which would benefit us, the programmers.</p> <p>First, I will describe what differentiates source code from the other mentioned types of input. Second, I will explain the path-based representation for inputting code into neural networks. Third, I will give a short summary of <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a>, which are based on this representation. Last, I will discuss some of their limitations.</p> <p>This gives an overview of the following articles:</p> <ol> <li>A General Path-Based Representation for Predicting Program Properties<sup id="fnref:path-paper"><a href="#fn:path-paper" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>,</li> <li>Code2Vec: Learning Distributed Representations of Code<sup id="fnref:code2vec-paper"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>,</li> <li>code2seq: Generating Sequences from Structured Representations of Code<sup id="fnref:code2seq-paper"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>.</li> </ol> <h2 id="how-source-code-is-represented">How is source code represented?</h2> <p>Plainly put, source code is just text, so natural language is quite related. The symbols of both form lexical tokens and the token sequences of both contain grammatical structure, which can be represented by syntax trees.
So naturally, one could take the token sequence of source code, as given by an existing <strong>lexer</strong>, and feed it into a recurrent neural network (<abbr title="Recurrent neural network">RNN</abbr>) just like a token (word) sequence of text.</p> <p>Natural language processing (<abbr title="Natural language processing">NLP</abbr>) techniques have been developed for decades and are very successful, so this likely isn’t too bad. Unfortunately, it completely ignores the inherent structure of source code. Thus, to be successful, such a neural network would have to learn the structure from scratch by example from the training data. Not only may this be inaccurate, it also requires a significant amount of training data and time.</p> <p>Unlike natural language, where ambiguities are possible, code has a very strictly defined structure which is enforced by a <strong>parser</strong>. Such parsers are fully precise and very fast at extracting the structure of code in the form of a <strong>syntax tree</strong>. Although a similar thing exists in <abbr title="Natural language processing">NLP</abbr>, it’s a whole separate (machine learning) task on its own.</p> <p>Therefore, instead of requiring a machine learning model to learn the structure of programs, we can just get it using a perfect parser. The result is a syntax tree or, after some additional processing, an <strong>abstract syntax tree</strong> (<abbr title="Abstract syntax tree">AST</abbr>), which omits certain irrelevant details. ASTs are the fundamental representation of source code used by interpreters, compilers, program analyzers, etc.
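</p> <p>For a concrete (if toy) illustration in Python, the standard <code class="language-plaintext highlighter-rouge">ast</code> module gives us this structure for free; a rough analogue of the Java example below would be:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ast

# The parser recovers the full structure; no learning involved.
source = "def f(target):\n    return any(elem == target for elem in elements)"
tree = ast.parse(source)

# Pretty-print the function's AST for inspection (Python 3.9+).
print(ast.dump(tree.body[0], indent=2))
</code></pre></div></div> <p>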
Hence, it makes sense to use the structure provided as an <abbr title="Abstract syntax tree">AST</abbr>.</p> <h3 id="example">Example</h3> <p>For example, consider the following Java method, which checks if a list contains an element:</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">boolean</span> <span class="nf">f</span><span class="o">(</span><span class="nc">Object</span> <span class="n">target</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">for</span> <span class="o">(</span><span class="nc">Object</span> <span class="nl">elem:</span> <span class="k">this</span><span class="o">.</span><span class="na">elements</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">elem</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="n">target</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">return</span> <span class="kc">true</span><span class="o">;</span>
        <span class="o">}</span>
    <span class="o">}</span>
    <span class="k">return</span> <span class="kc">false</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div> <p>It has the following <abbr title="Abstract syntax tree">AST</abbr> (ignore the colors and numbers for now)<sup id="fnref:code2vec-paper:1"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/example-ast-480.webp 480w,/assets/program-source-code-as-machine-learning-data/example-ast-800.webp 800w,/assets/program-source-code-as-machine-learning-data/example-ast-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/example-ast.png" class="img-fluid rounded" width="100%" height="auto" alt="AST of the example Java method" title="AST of the example Java method" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h2 id="how-to-input-trees-into-neural-networks">How to input trees into neural networks?</h2> <p>Given an <abbr title="Abstract syntax tree">AST</abbr> for a program, the problem is far from solved because mainstream techniques don’t take input in the form of trees. RNNs allow a network to consume linear input of varying length, but trees are nonlinear, have varying size and varying number of children nodes. Although techniques for tree input have been proposed, they won’t be considered here.</p> <p>Instead, the people behind <a href="https://code2vec.org/">code2vec</a> suggest the following: don’t input the tree directly, but input paths, which are linear. This works as follows<sup id="fnref:path-paper:1"><a href="#fn:path-paper" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <ol> <li>Pick any two leaves of the <abbr title="Abstract syntax tree">AST</abbr>.</li> <li>Represent them by their tokens including their data (e.g. 
variable name, integer value, etc).</li> <li>Trace the <em>unique</em> path which connects them in the <abbr title="Abstract syntax tree">AST</abbr>.</li> <li>Represent the path as a sequence of <abbr title="Abstract syntax tree">AST</abbr> intermediate node types delimited by up/down arrows, which indicate whether the path moves up or down the tree.</li> <li>Form a <strong>path-context</strong> triple: (first leaf, connecting path, second leaf).</li> </ol> <p>A single path-context represents only one part of the <abbr title="Abstract syntax tree">AST</abbr>, and by extension only one part of the code. To represent it all, just construct the bag/set of all path-contexts between all pairs of leaves.</p> <p>Although a path-context can be constructed for every pair of leaves, it may be too inefficient and not really necessary. A limited number of paths (<a href="https://code2vec.org/">code2vec</a> uses 200) may be randomly sampled or a maximum length limit may be set.</p> <h3 id="example-1">Example</h3> <p>In the previous example, consider the red path connecting the leaves <code class="language-plaintext highlighter-rouge">elements</code> and <code class="language-plaintext highlighter-rouge">true</code>. Its representation as a path-context is the following triple:</p> <p>(<code class="language-plaintext highlighter-rouge">elements</code>, Name ↑ FieldAccess ↑ Foreach ↓ Block ↓ IfStmt ↓ Block ↓ Return ↓ BooleanExpr, <code class="language-plaintext highlighter-rouge">true</code>).</p> <p>Reading this from left to right while following the arrows, it encodes a notable amount of structural information: <code class="language-plaintext highlighter-rouge">elements</code> is a field, which is iterated over using a foreach loop, which checks something for each element and if true, returns <code class="language-plaintext highlighter-rouge">true</code>.</p> <p>This single path-context already captures the key idea of this method. 
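</p> <p>To make the construction concrete, the five steps above can be sketched in a few lines of Python over a hand-rolled toy AST (the <code class="language-plaintext highlighter-rouge">Node</code> class and node kinds here are made up for illustration; real extractors are far more elaborate):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Node:
    def __init__(self, kind, token=None, children=()):
        self.kind, self.token, self.children = kind, token, list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def path_context(a, b):
    # Ancestors of leaf a, from a itself up to the root.
    anc = [a]
    while anc[-1].parent:
        anc.append(anc[-1].parent)
    # Climb from leaf b until hitting a's ancestor chain: that node is the
    # lowest common ancestor, through which the unique connecting path runs.
    down, n = [], b
    while n not in anc:
        down.append(n)
        n = n.parent
    up = anc[:anc.index(n) + 1]  # from a up to and including the LCA
    path = " ↑ ".join(x.kind for x in up)
    for x in reversed(down):     # back down from the LCA to b
        path += " ↓ " + x.kind
    return (a.token, path, b.token)

# Toy fragment of the example AST (intermediate levels omitted).
elements = Node("Name", token="elements")
true_ = Node("BooleanExpr", token="true")
root = Node("Foreach", children=[
    Node("FieldAccess", children=[elements]),
    Node("Return", children=[true_]),
])

print(path_context(elements, true_))
# ('elements', 'Name ↑ FieldAccess ↑ Foreach ↓ Return ↓ BooleanExpr', 'true')
</code></pre></div></div> <p>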
Other important parts of this <abbr title="Abstract syntax tree">AST</abbr> are captured by the blue, green and yellow paths in the previous figure.</p> <h2 id="how-code2vec-works">How does code2vec work?</h2> <p>Based on the bag of path-contexts representation of source code, <a href="https://code2vec.org/">code2vec</a><sup id="fnref:code2vec-paper:2"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> does for code what word2vec etc. do for natural language: represent the input as a single dense fixed-size embedding vector. As with word2vec, the goal for the vectors is to somehow capture their semantic properties, such that close vectors correspond to semantically similar inputs, and far vectors to semantically dissimilar ones. These are also known as <strong>distributed representations</strong>.</p> <p>As word2vec has proven in <abbr title="Natural language processing">NLP</abbr>, such representations of the input are extremely useful for downstream machine learning tasks, which then don’t have to relearn all the semantic properties themselves. Hence <a href="https://code2vec.org/">code2vec</a> does this for source code, so that code can be input into the downstream neural network like any other vector.</p> <h3 id="training">Training</h3> <p>In order to do this, the neural network part of <a href="https://code2vec.org/">code2vec</a> takes as input the bag of path-contexts (which are extracted from the source code, as explained above). To train the embedding and have something to optimize, the downstream task of predicting the method name is used.
This is apparently one of the most challenging tasks on source code, and at the same time the method name should capture the semantics of its code.</p> <p>The following neural network is thus trained<sup id="fnref:code2vec-paper:3"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/code2vec-architecture-480.webp 480w,/assets/program-source-code-as-machine-learning-data/code2vec-architecture-800.webp 800w,/assets/program-source-code-as-machine-learning-data/code2vec-architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/code2vec-architecture.png" class="img-fluid rounded" width="100%" height="auto" alt="code2vec neural network architecture" title="code2vec neural network architecture" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>In order, the layers work as follows:</p> <ol> <li>Each path-context triple consists of two tokens and a path. A simple embedding maps the tokens to token vectors and a simple embedding maps the entire path to a path vector. These simple embeddings are trained simultaneously with the network. The three vectors are then concatenated into a <strong>context vector</strong>.</li> <li>Each context vector is reduced in dimensionality via a fully-connected layer (with <em>tanh</em> activation) into a <strong>combined context vector</strong>.</li> <li>The bag of combined context vectors is aggregated into a single <strong>code vector</strong>, which is the desired distributed representation. A global attention weight vector is used to compute a weight for each combined context vector by dot product. This attention weight vector is trained simultaneously with the network.
Then the combined context vectors are aggregated into a code vector using the (normalized) attention weights. Attention allows the network to choose which of the many incoming path-contexts are more important and which less important.</li> <li>Finally, softmax is used to predict the method name from the code vector.</li> </ol> <p>In the above example program and <abbr title="Abstract syntax tree">AST</abbr>, the latter shows the top four path-contexts by their attention weight:</p> <ol> <li>red path from <code class="language-plaintext highlighter-rouge">elements</code> to <code class="language-plaintext highlighter-rouge">true</code> (weight: 0.23),</li> <li>blue path from <code class="language-plaintext highlighter-rouge">target</code> to <code class="language-plaintext highlighter-rouge">false</code> (weight: 0.14),</li> <li>green path from <code class="language-plaintext highlighter-rouge">boolean</code> to <code class="language-plaintext highlighter-rouge">?</code> (the actual method name is removed to make prediction non-trivial) (weight: 0.09),</li> <li>yellow path from <code class="language-plaintext highlighter-rouge">Object</code> to <code class="language-plaintext highlighter-rouge">target</code> (weight: 0.07).</li> </ol> <p>Many other path-contexts also exist here, but they are already too insignificant according to attention. The use of attention also improves the explainability of the model.</p> <h3 id="results">Results</h3> <p>This <a href="https://code2vec.org/">code2vec</a> network achieves state-of-the-art performance in the method name prediction task. It shows that both the path-context–based representation and the learned distributed representation are very useful for working with source code.</p> <p>The final softmax layer can be chopped off from the above network to simply get the distributed embeddings for other tasks.
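</p> <p>The attention aggregation of the third layer can be sketched in a few lines of numpy (an illustrative sketch with made-up dimensions, not the authors’ implementation):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)

contexts = rng.normal(size=(200, 128))   # bag of combined context vectors
attention = rng.normal(size=128)         # trained global attention vector

scores = contexts @ attention            # dot-product score per context
weights = np.exp(scores - scores.max())  # numerically stable softmax
weights /= weights.sum()
code_vector = weights @ contexts         # weighted sum: the code vector

assert code_vector.shape == (128,)
</code></pre></div></div> <p>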
Using it still requires the preprocessing of parsing the code, constructing the <abbr title="Abstract syntax tree">AST</abbr> and extracting the bag of path-contexts from it.</p> <p>Famously, word2vec embeddings exhibit beautiful relationships between the semantics of words and their vectors. Surprisingly, this is also the case for <a href="https://code2vec.org/">code2vec</a>! For example, adding the embedding vectors for the <code class="language-plaintext highlighter-rouge">equals</code> and <code class="language-plaintext highlighter-rouge">toLowerCase</code> methods gives a vector closest to the embedding of <code class="language-plaintext highlighter-rouge">equalsIgnoreCase</code>. Many other similar examples were noticed by the authors and they can be explored on the <a href="https://code2vec.org/">code2vec website</a>. This empirically demonstrates that the learned distributed embeddings indeed capture some semantic properties of source code, which means that the trained <a href="https://code2vec.org/">code2vec</a> network should also work well for tasks other than predicting method names.</p> <h2 id="how-code2seq-works">How does code2seq work?</h2> <p>The follow-up work <a href="https://code2seq.org/">code2seq</a><sup id="fnref:code2seq-paper:1"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> improves upon <a href="https://code2vec.org/">code2vec</a> in a number of ways.
It is still based on the bag of path-contexts extracted from the <abbr title="Abstract syntax tree">AST</abbr> and uses (combined) context vectors in the middle of the network.</p> <p>The following neural network is used<sup id="fnref:code2seq-paper:2"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/code2seq-architecture-480.webp 480w,/assets/program-source-code-as-machine-learning-data/code2seq-architecture-800.webp 800w,/assets/program-source-code-as-machine-learning-data/code2seq-architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/code2seq-architecture.png" class="img-fluid rounded" width="100%" height="auto" alt="code2seq neural network architecture" title="code2seq neural network architecture" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Differences from <a href="https://code2vec.org/">code2vec</a> are the following:</p> <ol> <li>Tokens like <code class="language-plaintext highlighter-rouge">ArrayList</code> are decomposed into <strong>subtokens</strong> like <code class="language-plaintext highlighter-rouge">Array</code> and <code class="language-plaintext highlighter-rouge">List</code>, which are embedded separately and summed. This allows the network to better exploit naming schemes present in source code.</li> <li>Paths are not embedded directly and monolithically, but instead using a <strong>bidirectional <abbr title="Long short-term memory">LSTM</abbr></strong> layer. A simple embedding is used for the individual path components separately. 
This is highly preferable to embedding the varying-length paths monolithically, because such an embedding would be quite sparse and would not make use of state-of-the-art RNNs.</li> <li>No single code vector is constructed, so <strong>no distributed representation</strong> can be extracted for downstream tasks!</li> <li>The network is still trained to predict method names, but now as a sequence of subtokens, e.g. <code class="language-plaintext highlighter-rouge">equals</code>, <code class="language-plaintext highlighter-rouge">Ignore</code>, <code class="language-plaintext highlighter-rouge">Case</code> for the name <code class="language-plaintext highlighter-rouge">equalsIgnoreCase</code>. It uses a decoder with attention, which attends over all the individual combined context vectors at every step.</li> </ol> <h2 id="how-limited-are-the-models">How limited are the models?</h2> <p>Both <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a> provide their trained models (for Java code) for download. My initial goal was to reuse these (e.g. by extracting distributed representations) and apply them to some other task which has source code as input, for example related to teaching duties of the Laboratory of Software Science. Unfortunately, this turned out to be more complicated than expected.</p> <p><strong>Firstly</strong>, both trained models only handle the source code of a single Java method, not an entire file, which contains a class with likely many methods, among other things. Although a file/class can be split up into methods to pass into a model separately, this would be very limited. The same class may implement all the logic in a single method or have it nicely organized into multiple methods.
They would semantically be the same, but comparing a single distributed representation with multiple is not meaningful and loses the semantic properties of the embedding.</p> <p>This is not a restriction of the path-based representation because the <abbr title="Abstract syntax tree">AST</abbr> of the entire file/class could be used instead. But new models would have to be trained from scratch. Not only that, the training task itself would have to change from method name prediction to maybe class name prediction (which usually doesn’t capture the semantics of all the methods in the class) or something else. And of course the new training task would still need to be such that the learned distributed representations have semantic properties.</p> <p><strong>Secondly</strong>, while <a href="https://code2vec.org/">code2vec</a> provides distributed representations of code suitable for downstream tasks, the more advanced <a href="https://code2seq.org/">code2seq</a> architecture and model do not. Rather, the final output is a sequence and the layer before that is just the entire bag of combined context vectors. Therefore, it is possible to combine the best of both worlds:</p> <ol> <li>Use the first part of <a href="https://code2seq.org/">code2seq</a>, which uses a <abbr title="Recurrent neural network">RNN</abbr> for path embedding.</li> <li>Use the middle part of <a href="https://code2vec.org/">code2vec</a>, which aggregates the combined context vectors into a single distributed representation using attention.</li> </ol> <p><strong>Thirdly</strong>, there is not enough labelled data available for semantic downstream tasks. Although there is a lot of source code data available on GitHub (<a href="https://code2vec.org/">code2vec</a> used 32GB, <a href="https://code2seq.org/">code2seq</a> used 125GB), there are no labels for semantic properties. Both of these models were just trained to predict method names, something which syntactically already exists in the same code. 
It is more likely that the distributed representations could be used in unsupervised tasks instead, e.g. detecting similar but not duplicate code by clustering.</p> <h2 id="how-to-continue-from-here">How to continue from here?</h2> <p>Path-based representations are a promising generic and practical means of inputting source code into neural networks. Although introduced by <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a>, others have started using them as well. Notably, JetBrains Research has developed <a href="https://github.com/JetBrains-Research/astminer">astminer</a> for extracting path-contexts from various languages (and it can be extended further) and used them to build representations of authors’ coding style<sup id="fnref:codestyle-paper"><a href="#fn:codestyle-paper" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</p> <p>Instead of an <abbr title="Abstract syntax tree">AST</abbr>, another idea would be to extract such path-contexts from a control-flow graph (<abbr title="Control-flow graph">CFG</abbr>). The control-flow of a method is more closely related to its runtime behavior and semantics. It would even be possible to inline method calls with CFGs to construct paths spanning across methods, regardless of how well the code is structured. Although parsers exist for all programming languages, control-flow graph generators are almost impossible to come by because most languages have complex features, which significantly impact the flow, e.g. exceptions. Moreover, CFGs can only be constructed for executable code like method bodies, but not class fields or type declarations.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:path-paper"> <p>Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav. A General Path-Based Representation for Predicting Program Properties. PLDI 2018. URL: <a href="https://doi.org/10.1145/3192366.3192412">https://doi.org/10.1145/3192366.3192412</a>.
<a href="#fnref:path-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:path-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:code2vec-paper"> <p>Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav. Code2Vec: Learning Distributed Representations of Code. POPL 2019. URL: <a href="https://doi.org/10.1145/3290353">https://doi.org/10.1145/3290353</a>. <a href="#fnref:code2vec-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:code2vec-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:code2vec-paper:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:code2vec-paper:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:code2seq-paper"> <p>Uri Alon, Shaked Brody, Omer Levy, Eran Yahav. code2seq: Generating Sequences from Structured Representations of Code. ICLR 2019. URL: <a href="https://openreview.net/forum?id=H1gKYo09tX">https://openreview.net/forum?id=H1gKYo09tX</a>. <a href="#fnref:code2seq-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:code2seq-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:code2seq-paper:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:codestyle-paper"> <p>Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin, Alberto Bacchelli. Building Implicit Vector Representations of Individual Coding Style. ICSEW 2020. URL: <a href="https://arxiv.org/abs/2002.03997">https://arxiv.org/abs/2002.03997</a>. 
<a href="#fnref:codestyle-paper" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming languages"/><category term="machine learning"/><category term="neural networks"/><category term="university"/><summary type="html"><![CDATA[This blog post is part of my project in the Neural Networks course at University of Tartu.]]></summary></entry><entry><title type="html">My GitHub Mars 2020 Helicopter Contributor badge</title><link href="https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge/" rel="alternate" type="text/html" title="My GitHub Mars 2020 Helicopter Contributor badge"/><published>2021-04-30T00:00:00+00:00</published><updated>2021-04-30T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge</id><content type="html" xml:base="https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge/"><![CDATA[<p>On 19 April 2021 I saw a link titled <a href="https://daniel.haxx.se/blog/2021/04/19/mars-2020-helicopter-contributor/">“Mars 2020 Helicopter Contributor”</a> on /r/programming. It’s a blog post by Daniel Stenberg, the lead developer of curl, about him getting a special badge on his GitHub profile because curl was used for the Mars 2020 Helicopter Mission.</p> <p>As described on <a href="https://github.blog/2021-04-19-open-source-goes-to-mars/">the GitHub Blog</a>, this isn’t just limited to such high-profile open-source contributors but “nearly 12000” GitHub users got the badge. It is determined by having contributed to particular open-source projects before particular versions, with the full list available in <a href="https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-github-profile/customizing-your-profile/personalizing-your-profile#list-of-qualifying-repositories-for-mars-2020-helicopter-contributor-achievement">GitHub documentation</a>. 
Most of these seem to be Python projects, but one non-Python project caught my eye: <a href="https://github.com/opencv/opencv">OpenCV</a>.</p> <p>Immediately I just had to check <a href="https://github.com/sim642">my GitHub profile</a> and to my great surprise I also got the “Mars 2020 Helicopter Contributor” badge:</p> <p><img src="/assets/my-github-mars-2020-helicopter-contributor-badge.png" alt="Screenshot of my GitHub Mars 2020 Helicopter Contributor badge"/></p> <p>I’m honored to be one of the 12000, but my single contribution is <em>very likely unrelated</em> to the NASA mission: <a href="https://github.com/opencv/opencv/pull/7751">“Allow V4L, V4L2 to be used as preferred capture API”</a>. As far as I remember, I encountered and fixed the issue while working on a Robotex soccer robot called <a href="https://github.com/sim642/Cryptex">Cryptex</a>, which relied on OpenCV for its cameras and vision. So, no, I don’t secretly work for NASA.</p>]]></content><author><name></name></author><category term="programming"/><category term="open source"/><category term="personal"/><category term="github"/><summary type="html"><![CDATA[On 19 April 2021 I saw a link titled “Mars 2020 Helicopter Contributor” on /r/programming. It’s a blog post by Daniel Stenberg, the lead developer of curl, about him getting a special badge on his GitHub profile because curl was used for the Mars 2020 Helicopter Mission.]]></summary></entry></feed>