<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://sim642.eu/feed.xml" rel="self" type="application/atom+xml"/><link href="https://sim642.eu/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-20T15:41:50+00:00</updated><id>https://sim642.eu/feed.xml</id><title type="html">blank</title><subtitle></subtitle><entry><title type="html">Times with math: newtx vs Termes</title><link href="https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes/" rel="alternate" type="text/html" title="Times with math: newtx vs Termes"/><published>2026-01-24T00:00:00+00:00</published><updated>2026-02-18T00:00:00+00:00</updated><id>https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes</id><content type="html" xml:base="https://sim642.eu/blog/2026/01/24/times-with-math-newtx-vs-termes/"><![CDATA[<p>The <a href="https://cs.ut.ee/en/">Institute of Computer Science</a> at the <a href="https://ut.ee/en/">University of Tartu</a> has a <a href="https://www.overleaf.com/latex/templates/unitartucs-phd-template/hmxdhtwgvzvm">LaTeX template for PhD theses</a>. The same template is also suggested by the <a href="https://tyk.ee/en/requirements-and-recommendations">University of Tartu Press</a> for non-Word users. One major requirement from the press is the use of Times New Roman as the text font.</p> <p>And that’s what the LaTeX template does, using the <a href="https://ctan.org/pkg/mathptmx?lang=en"><code class="language-plaintext highlighter-rouge">mathptmx</code></a> LaTeX package. As the package name suggests, it also provides a version of the Times font for math typesetting, which is quite important for most LaTeX users. 
One of the many flaws of <code class="language-plaintext highlighter-rouge">mathptmx</code> is that it’s <strong>extremely obsolete</strong> and everyone strongly advises against it:</p> <ol> <li>Its <a href="https://ctan.org/pkg/mathptmx?lang=en">CTAN page</a> says “reckoned to be obsolete”.</li> <li>The 3rd edition of “The LaTeX Companion” published in 2023 warns against it.</li> <li>The esteemed TeX StackExchange user <a href="https://tex.stackexchange.com/users/4427/egreg">egreg</a> has said “remove <code class="language-plaintext highlighter-rouge">mathptmx</code>, which is a 25-year-old hack” in <a href="https://tex.stackexchange.com/a/731332/383946">2024</a>.</li> <li>etc.</li> </ol> <p>One reason against <code class="language-plaintext highlighter-rouge">mathptmx</code> is that its math font is an inconsistent mess, as described by its <a href="https://ctan.org/pkg/mathptmx?lang=en">CTAN page</a>:</p> <blockquote> <p>[…] provides maths support using glyphs from the Symbol, Chancery and Computer Modern fonts together with letters, etc., from Times Roman.</p> </blockquote> <p>The successor of <code class="language-plaintext highlighter-rouge">mathptmx</code> is <a href="https://ctan.org/pkg/txfonts?lang=en"><code class="language-plaintext highlighter-rouge">txfonts</code></a>, which is by now also obsolete. And its successor is <a href="https://ctan.org/pkg/newtx?lang=en"><code class="language-plaintext highlighter-rouge">newtx</code></a>, which isn’t obsolete yet. Another non-obsolete alternative is to use the TeX Gyre Termes Math font. 
In this post, I will try switching my PhD thesis to both modern alternatives to compare them and find the best one to use instead of the ancient <code class="language-plaintext highlighter-rouge">mathptmx</code>.</p> <h2 id="setup">Setup</h2> <p>Before diving into the comparison, here are the three LaTeX setups I will be comparing (click each tab).</p> <ul id="setup" class="tab" data-tab="d81d76b5-48c5-4019-88cd-236f6717b7a7" data-name="setup"> <li class="active" id="setup-mathptmx"> <a href="#">mathptmx </a> </li> <li id="setup-newtx"> <a href="#">newtx </a> </li> <li id="setup-tex-gyre-termes"> <a href="#">TeX Gyre Termes </a> </li> </ul> <ul class="tab-content" id="d81d76b5-48c5-4019-88cd-236f6717b7a7" data-name="setup"> <li class="active"> <p>Under pdfLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>mathptmx<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amsmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amssymb<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>stmaryrd<span class="p">}</span>
<span class="k">\renewcommand</span><span class="p">{</span><span class="k">\sfdefault</span><span class="p">}{</span>phv<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[mono]</span><span class="p">{</span>inconsolata<span class="p">}</span>
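<span class="c">% (My untested suggestion, not part of the template: instead of the</span>
<span class="c">% \sfdefault line above, a scaled Helvetica clone could be loaded with</span>
<span class="c">% \usepackage[scaled=0.92]{helvet} to reduce the size mismatch with Times.)</span>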
</code></pre></div></div> <p>This setup is the relevant part from the PhD thesis template; only <code class="language-plaintext highlighter-rouge">inconsolata</code> is my addition. It might not even show up in any of the examples below, but its placement in the package loading sequence matters.</p> <p>Helvetica (<code class="language-plaintext highlighter-rouge">phv</code>) loaded like this is visibly larger than Times, both for lower- and uppercase letters, but this doesn’t seem to have bothered anyone using the template?!</p> </li> <li> <p>Under pdfLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="na">[largesc]</span><span class="p">{</span>newtxtext<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[mono]</span><span class="p">{</span>inconsolata<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>newtxmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
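<span class="c">% Note: stmaryrd is not needed just for \llbracket/\rrbracket here,</span>
<span class="c">% since newtxmath provides them itself (other stmaryrd symbols would</span>
<span class="c">% still require loading the package).</span>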
</code></pre></div></div> <p>The newtx packages take care to load a Helvetica clone (TeX Gyre Heros) at a more reasonable scale (0.94 in newer versions).</p> </li> <li> <p>Under LuaLaTeX:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>fontspec<span class="p">}</span>
<span class="k">\setmainfont</span><span class="p">{</span>TeXGyreTermesX<span class="p">}</span>
<span class="k">\setsansfont</span><span class="p">{</span>TeX Gyre Heros<span class="p">}</span>[Scale=0.94]
<span class="k">\setmonofont</span><span class="p">{</span>inconsolata<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amsmath<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>amssymb<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>mathtools<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>unicode-math<span class="p">}</span>
<span class="k">\setmathfont</span><span class="p">{</span>TeX Gyre Termes Math<span class="p">}</span>
<span class="k">\AtBeginDocument</span><span class="p">{</span><span class="c">%</span>
    <span class="k">\NewCommandCopy</span><span class="p">{</span><span class="k">\llbracket</span><span class="p">}{</span><span class="k">\lBrack</span><span class="p">}</span><span class="c">%</span>
    <span class="k">\NewCommandCopy</span><span class="p">{</span><span class="k">\rrbracket</span><span class="p">}{</span><span class="k">\rBrack</span><span class="p">}</span><span class="c">%</span>
<span class="p">}</span>
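<span class="c">% unicode-math also redefines \vdots to be math-only, so in running</span>
<span class="c">% text it needs to be wrapped in math mode, e.g. $\vdots$.</span>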
</code></pre></div></div> <p>Since TeX Gyre Termes Math only exists in the OpenType format, I tried it out under LuaLaTeX. This requires a different way of loading text and math fonts (with <code class="language-plaintext highlighter-rouge">fontspec</code> and <code class="language-plaintext highlighter-rouge">unicode-math</code>, respectively).</p> <p>To minimize the visual differences from the newtx setup, I’m using TeXGyreTermesX from newtx instead of TeX Gyre Termes for the text. Moreover, I’m using the same scaling for Heros and Inconsolata as the newtx setup does implicitly. This scaling might not be optimal, though: neither lowercase nor uppercase letter height is matched.</p> <p>Since <code class="language-plaintext highlighter-rouge">unicode-math</code> doesn’t define <code class="language-plaintext highlighter-rouge">\llbracket</code> and <code class="language-plaintext highlighter-rouge">\rrbracket</code> (like <code class="language-plaintext highlighter-rouge">stmaryrd</code> and <code class="language-plaintext highlighter-rouge">newtxmath</code> do), they’re defined using the corresponding <code class="language-plaintext highlighter-rouge">unicode-math</code> symbols so that the rest of my thesis compiles without changes.</p> </li> </ul> <h2 id="text-mode">Text mode</h2> <p>Although this post isn’t about Times in text mode, there is one difference worth pointing out.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/sc-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/sc-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p><em>(Click on images to zoom.)</em></p> <p>The small caps (<code class="language-plaintext highlighter-rouge">\textsc</code>) in newtx are a bit heavier than in mathptmx, but also a bit shorter. 
And this is already newtx loaded with the <code class="language-plaintext highlighter-rouge">largesc</code> option to get <em>large</em> small caps; by default, it offers <em>petite</em> caps, which are even smaller.</p> <p>Since Termes in the rightmost figure is actually TeXGyreTermesX from newtx but in OpenType format, the small caps look <em>almost</em> the same as in newtx, but not quite! Somehow LuaLaTeX and the OpenType version allow for overly close kerning between certain letter pairs, causing uneven spacing across the whole word:</p> <ul> <li>“ac” in “RacerF”,</li> <li>“ag” in “Deagle”,</li> <li>“Da” in “Dartagnan”,</li> <li>“pa” in “UTaipan”,</li> <li>“Ac” in “CPAchecker”.</li> </ul> <p>On a side note, <code class="language-plaintext highlighter-rouge">unicode-math</code> in LuaLaTeX redefines <code class="language-plaintext highlighter-rouge">\vdots</code> to only work in math mode, which is why it’s missing in the figure (but easily fixed by just using it in math mode). The LaTeX default <code class="language-plaintext highlighter-rouge">\vdots</code> works in both modes because it’s not actually using a symbol from the font.</p> <h2 id="math-mode">Math mode</h2> <p>With text mode out of the way, the rest of the post compares Times in math mode.</p> <h3 id="parenthesis-kerning">Parenthesis kerning</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-mathptmx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> 
</figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-newtx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source 
class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-p-kerning-termes-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/paren-f-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>Kerning between the left parenthesis and certain letters is bad in Termes. The parenthesis outright collides with ‘f’. 
The parenthesis doesn’t collide with ‘p’, but at normal text size it feels suspiciously close, even though the other fonts actually leave a similar gap.</p> <p>The latter comes down to the different parentheses:</p> <ol> <li>In mathptmx they are the lightest but also tallest, extending below the descenders of ‘p’ and ‘f’.</li> <li>In newtx they are the heaviest but of medium height, in line with the descenders.</li> <li>In Termes they are of medium weight but the shortest, not reaching all the way around the descenders.</li> </ol> <p>On a side note, in mathptmx the <code class="language-plaintext highlighter-rouge">\mathsf</code> used for “unique” and “create” doesn’t actually use Helvetica (or its clone) but just Computer Modern Sans. This causes a dissonance with the sans-serif font in text mode (e.g. <code class="language-plaintext highlighter-rouge">\textsf</code>), which does use a version of Helvetica. Thus, the same document mixes two different sans-serif fonts.</p> <h3 id="subscript-kerning">Subscript kerning</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img 
src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/st-f-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/x-j-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-5"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/N-f-kerning-termes.png" 
class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col-7"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/G-w-kerning-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>Ignoring the exact letterforms, subscript kerning is quite similar in mathptmx and newtx. Both have a bit too much space between the ‘x’ and the subscript ‘j’. This issue is acknowledged by newtx, which offers the package option <code class="language-plaintext highlighter-rouge">subscriptcorrection</code>. Using some LaTeX hacking, it implements special behavior to reduce the left kerning of certain letters appearing first in subscripts. 
This fixes the issue with ‘j’ but for some strange reason unnecessarily reduces the spacing in front of other non-problematic subscripts which aren’t even declared in its default <code class="language-plaintext highlighter-rouge">subscriptcorrectionfile</code>.</p> <p>Subscript kerning in Termes is all over the place:</p> <ol> <li>It gets the kerning in <code class="language-plaintext highlighter-rouge">x_j</code> correct out of the box.</li> <li>It leaves too little space in <code class="language-plaintext highlighter-rouge">\mathrm{st}_f</code>.</li> <li>It leaves way too much space for subscripts on <code class="language-plaintext highlighter-rouge">\mathcal</code> letters.</li> </ol> <h3 id="double-brackets">(Double) brackets</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" 
loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/dbl-bracket-weight-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The double brackets (<code class="language-plaintext highlighter-rouge">\llbracket</code> and <code class="language-plaintext highlighter-rouge">\rrbracket</code>) in newtx are noticeably heavier than all other symbols and really stand out on a full page. Like in mathptmx, they consist of two normal (single) brackets close together (duh), but it becomes too much with the thicker single brackets (which match the parentheses) in newtx. Termes solves the weight issue by making the single brackets within a double bracket slightly lighter, giving a more uniform look overall.</p> <p>Furthermore, the double brackets in newtx are slightly taller than the normal brackets, which is not the case for the other two fonts. Given the bracket nesting, it might even be a good thing in this example, but it’s a strange mismatch in general.</p> <p>Additionally, bracket kerning is quite generous in newtx. In mathptmx, letters almost look like they reach <em>into</em> the brackets (they actually don’t, but there’s no additional space either), whereas in newtx there is additional space. 
The gap between the double bracket and the single bracket is particularly visible in newtx.</p> <p>On a side note, in mathptmx the <code class="language-plaintext highlighter-rouge">\mathbb{T}</code> in the subscript is clearly distinguishable from a normal <code class="language-plaintext highlighter-rouge">T</code>, but not so well in the others. In newtx this situation can perhaps be improved by choosing a different <code class="language-plaintext highlighter-rouge">\mathbb</code> variant using package options.</p> <h3 id="leq-vs-preceq"><code class="language-plaintext highlighter-rouge">\leq</code> vs <code class="language-plaintext highlighter-rouge">\preceq</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/leq-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 
mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/leq-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <div class="row"> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/prec-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/prec-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/prec-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/prec-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> <div class="col"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/leq-termes-480.webp 
480w,/assets/times-with-math-newtx-vs-termes/leq-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/leq-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/leq-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\preceq</code> relation in Termes is mistakably similar to <code class="language-plaintext highlighter-rouge">\leq</code>, especially at normal text size, because it doesn’t bend as much as the others. The same probably applies to <code class="language-plaintext highlighter-rouge">\prec</code> vs <code class="language-plaintext highlighter-rouge">&lt;</code>, etc.</p> <p>The <code class="language-plaintext highlighter-rouge">\preceq</code> in newtx has a particularly small gap between <code class="language-plaintext highlighter-rouge">\prec</code> and the bottom equality line, but I can live with that over Termes.</p> <p>On a side note, in mathptmx the ‘i’ appears much closer to the <code class="language-plaintext highlighter-rouge">\leq</code> relation than the ‘j’.</p> <h3 id="setminus"><code class="language-plaintext highlighter-rouge">\setminus</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/setminus-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/setminus-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/setminus-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/setminus-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\setminus</code> operator in Termes is unusually wide: both in terms of its angle and its (left) kerning. I suppose it’s intended to match the width of other set operators (e.g. <code class="language-plaintext highlighter-rouge">\cup</code>, <code class="language-plaintext highlighter-rouge">\cap</code>) but I’m so used to the thinner one from Computer Modern. 
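</p> <p>A possible workaround (a sketch, untried with Termes here) might be the narrower <code class="language-plaintext highlighter-rouge">\smallsetminus</code>: <code class="language-plaintext highlighter-rouge">amssymb</code> provides it for classic engines, and <code class="language-plaintext highlighter-rouge">unicode-math</code> also knows the name, though whether Termes actually distinguishes it from <code class="language-plaintext highlighter-rouge">\setminus</code> is another question:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Sketch: \smallsetminus as a narrower alternative to \setminus.
% With pdfLaTeX it comes from amssymb; with unicode-math it comes
% from the math font itself (if the font distinguishes the two).
\usepackage{amssymb}
% ...
$A \setminus B$ vs. $A \smallsetminus B$
</code></pre></div></div> <p>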
On the other hand, Termes has the most balanced left vs right kerning for it, with mathptmx being particularly uneven.</p> <h3 id="nabla-and-delta"><code class="language-plaintext highlighter-rouge">\nabla</code> and <code class="language-plaintext highlighter-rouge">\Delta</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/nabla-delta-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/nabla-delta-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/nabla-delta-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img 
src="/assets/times-with-math-newtx-vs-termes/nabla-delta-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">\nabla</code> and <code class="language-plaintext highlighter-rouge">\Delta</code> are specifically used as binary operators in abstract interpretation and, thus, properly wrapped in <code class="language-plaintext highlighter-rouge">\mathbin</code> here. Surprisingly, they have different widths (or kerning) in mathptmx, as clearly seen from the misalignment of their right arguments.</p> <p>Annoyingly, <code class="language-plaintext highlighter-rouge">\nabla</code> in newtx is heavier than <code class="language-plaintext highlighter-rouge">\Delta</code> by having two thick sides instead of one. Not only are they asymmetric, but the heaviness of <code class="language-plaintext highlighter-rouge">\nabla</code> also stands out on a full page. Newtx offers <code class="language-plaintext highlighter-rouge">\laplace</code> as a similarly heavier version of <code class="language-plaintext highlighter-rouge">\Delta</code>, but that would make the latter stand out as well. Instead, I would like the opposite: a lighter version of <code class="language-plaintext highlighter-rouge">\nabla</code> that matches the <code class="language-plaintext highlighter-rouge">\Delta</code>, but newtx doesn’t offer that.</p> <p>On a side note, in Termes <code class="language-plaintext highlighter-rouge">\mathrm{\Delta}</code> does not work, despite <code class="language-plaintext highlighter-rouge">\Delta</code> being already upright. 
This is a bit strange since <code class="language-plaintext highlighter-rouge">unicode-math</code> documentation explicitly mentions <code class="language-plaintext highlighter-rouge">\mathup\Delta</code> and that <code class="language-plaintext highlighter-rouge">\mathup</code> is just an alias for <code class="language-plaintext highlighter-rouge">\mathrm</code>. I’m not sure why I had used <code class="language-plaintext highlighter-rouge">\mathrm{\Delta}</code> in the first place, perhaps it’s a relic from using the operator macro with a different font which defaulted to italic uppercase Greek letters.</p> <h3 id="times"><code class="language-plaintext highlighter-rouge">\times</code></h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-mathptmx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-mathptmx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-mathptmx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-mathptmx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-newtx-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-newtx-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-newtx-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-newtx.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div 
class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-termes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-termes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-termes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-termes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>The visual perception may be somewhat due to the not-so-common use of the <code class="language-plaintext highlighter-rouge">\times</code> operator, but:</p> <ol> <li>In mathptmx it is a bit light compared to the rest.</li> <li>In newtx it is heavier but really touches the baseline and doesn’t look as good.</li> <li>In Termes it is smaller (but suitably heavy) and closer to the middle line of the numbers, which looks the best in this context.</li> </ol> <blockquote class="block-tip"> <h5 id="texttimes"><code class="language-plaintext highlighter-rouge">\texttimes</code></h5> <div class="row justify-content-center mt-3"> <div class="col-sm-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-480.webp 480w,/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-800.webp 800w,/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/times-newtx-texttimes.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> </div> <p>After switching to newtx I noticed that 
another similar instance looked different because it used the Unicode <code class="language-plaintext highlighter-rouge">×</code> in the LaTeX source. Turns out this is mapped to <code class="language-plaintext highlighter-rouge">\texttimes</code> which differs from <code class="language-plaintext highlighter-rouge">$\times$</code> in newtx and fixes my problem.</p> </blockquote> <h3 id="miscellany">Miscellany</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-mathptmx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div 
class="caption">mathptmx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-800.webp 
800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-termes-2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">TeX Gyre Termes</div> </div> </div> <p>First, there is no <code class="language-plaintext highlighter-rouge">\coloneqq</code> in Termes. However, there is <code class="language-plaintext highlighter-rouge">\coloneq</code>, so the former omission is odd.</p> <p>Second, there is no <code class="language-plaintext highlighter-rouge">\square</code> in Termes. The one in the Termes figure comes from <code class="language-plaintext highlighter-rouge">amssymb</code> and is thus the same one as in the mathptmx figure. 
It is too light for Times.</p> <p>Third, differences between text mode and math mode parentheses are visible:</p> <ol> <li>In mathptmx the math mode ones are too light.</li> <li>In newtx they appear to be the same, which is great for consistency.</li> <li>In Termes they are quite similar, although the math mode ones are slightly shorter but not too light.</li> </ol> <p>Fourth, the <code class="language-plaintext highlighter-rouge">\bowtie</code> operators are very different in size:</p> <ol> <li>In mathptmx it is the largest and most clearly visible.</li> <li>In newtx it is scaled down and heavier. This kind of matches with the smaller <code class="language-plaintext highlighter-rouge">\square</code> in newtx though.</li> <li>In Termes it is microscopic and horizontally squashed. The holes in the bowtie are hardly visible at normal text size.</li> </ol> <blockquote class="block-tip"> <h5 id="join"><code class="language-plaintext highlighter-rouge">\Join</code></h5> <div class="row justify-content-center mt-3"> <div class="col-sm-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-480.webp 480w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-800.webp 800w,/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/times-with-math-newtx-vs-termes/coloneqq-and-square-and-bowtie-and-paren-and-brace-newtx-2-join.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption">newtx</div> </div> </div> <p>After digging into newtx font files with <a href="https://fontforge.org">FontForge</a> I accidentally noticed that it contains 
two different bowties! Turns out the other one is less squashed and is provided as <code class="language-plaintext highlighter-rouge">\Join</code> in newtx.</p> </blockquote> <h2 id="conclusion">Conclusion</h2> <p>Clearly, mathptmx is inferior to both newtx and Termes. Based on my observations above, I think <strong>newtx is the better one of the two</strong>. Termes needs some improvements to kerning and math operators.</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[The Institute of Computer Science at the University of Tartu has a LaTeX template for PhD theses. The same template is also suggested by the University of Tartu Press for non-Word users. One major requirement from the press is the use of Times New Roman as the text font.]]></summary></entry><entry><title type="html">DOI to Bib(La)TeX – a misery</title><link href="https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery/" rel="alternate" type="text/html" title="DOI to Bib(La)TeX – a misery"/><published>2025-10-06T00:00:00+00:00</published><updated>2025-10-06T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery</id><content type="html" xml:base="https://sim642.eu/blog/2025/10/06/doi-to-biblatex-a-misery/"><![CDATA[<p>My PhD-thesis–to–be combines 8 papers from the last 5 years. Their Bib(La)TeX bibliography entries come in a wide range of quality and style. I would like some consistency but it’s quite an effort to achieve across 234 entries. So I was wondering if there’s any good-quality and consistent source from where I could (hopefully automatically) update their data via their DOI.</p> <h2 id="services">Services</h2> <p>Let’s look at a bunch of services for getting BibTeX entries by DOI. 
I’ll use the DOI <a href="https://doi.org/10.1007/978-3-031-50524-9_4">10.1007/978-3-031-50524-9_4</a> (one of my papers) as the example.</p> <p><em>Click on each tab to see the BibTeX entry from each service and my comments about it.</em></p> <ul id="service" class="tab" data-tab="8a8aefea-d3d4-4567-8a94-7b577db2d1fc" data-name="service"> <li class="active" id="service-doi"> <a href="#">DOI </a> </li> <li id="service-doi-formatter"> <a href="#">DOI formatter </a> </li> <li id="service-doi2bib"> <a href="#">doi2bib </a> </li> <li id="service-springer"> <a href="#">Springer </a> </li> <li id="service-acm"> <a href="#">ACM </a> </li> <li id="service-dblp-condensed"> <a href="#">DBLP condensed </a> </li> <li id="service-dblp-standard"> <a href="#">DBLP standard </a> </li> </ul> <ul class="tab-content" id="8a8aefea-d3d4-4567-8a94-7b577db2d1fc" data-name="service"> <li class="active"> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nc">@inbook</span><span class="p">{</span><span class="nl">Saan_2023</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span> <span class="na">ISBN</span><span class="p">=</span><span class="s">{9783031505249}</span><span class="p">,</span> <span class="na">ISSN</span><span class="p">=</span><span class="s">{1611-3349}</span><span class="p">,</span> <span class="na">url</span><span class="p">=</span><span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">DOI</span><span class="p">=</span><span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">booktitle</span><span class="p">=</span><span class="s">{Verification, Model Checking, and Abstract Interpretation}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span 
class="s">{Springer Nature Switzerland}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span> <span class="na">month</span><span class="p">=</span><span class="nv">dec</span><span class="p">,</span> <span class="na">pages</span><span class="p">=</span><span class="s">{74–97}</span> <span class="p">}</span>
</code></pre></div></div> <p>This is returned by <a href="https://citation.doi.org/docs.html">DOI Content Negotiation</a> which simply means making an HTTP(S) request to the usual DOI URL <a href="https://doi.org/10.1007/978-3-031-50524-9_4">https://doi.org/10.1007/978-3-031-50524-9_4</a> but with the <code class="language-plaintext highlighter-rouge">Accept: application/x-bibtex</code> HTTP header, i.e.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LH</span> <span class="s2">"Accept: application/x-bibtex"</span> https://doi.org/10.1007/978-3-031-50524-9_4
</code></pre></div></div> <p>For this particular DOI, this actually delegates to the <a href="https://www.crossref.org/documentation/retrieve-metadata/rest-api/">Crossref API</a> at</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-L</span> https://api.crossref.org/works/10.1007/978-3-031-50524-9_4/transform/application/x-bibtex
</code></pre></div></div> <h4 id="comments">Comments</h4> <ol> <li>The entry type is <code class="language-plaintext highlighter-rouge">@inbook</code>, although <code class="language-plaintext highlighter-rouge">@inproceedings</code> would be more precise for this work.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field has value <a href="http://dx.doi.org/10.1007/978-3-031-50524-9_4">http://dx.doi.org/10.1007/978-3-031-50524-9_4</a>. There are two things wrong with that: <ol> <li>It’s HTTP, not HTTPS.</li> <li>It uses <a href="https://dx.doi.org">dx.doi.org</a>, not just <a href="https://doi.org">doi.org</a>.</li> </ol> <p>The former options in both points are <a href="https://www.doi.org/the-identifier/resources/factsheets/doi-resolution-documentation">no longer preferred</a>, yet the official DOI metadata service doesn’t follow its own recommendations.</p> </li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field is actually not specified for <code class="language-plaintext highlighter-rouge">@inbook</code> in <a href="https://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf">BibTeX</a>. It is specified for <code class="language-plaintext highlighter-rouge">@inproceedings</code>, so it really should be that. 
In <a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a>, <code class="language-plaintext highlighter-rouge">booktitle</code> is also specified for <code class="language-plaintext highlighter-rouge">@inbook</code> but only because BibLaTeX gives <code class="language-plaintext highlighter-rouge">@inbook</code> a slightly different meaning than BibTeX.</li> <li>The whole result is on one line (fine) and has a spurious single space in the beginning (which is odd).</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nc">@misc</span><span class="p">{</span><span class="nl">Saan_Schwarz_Erhard_Seidl_Tilscher_Vojdani_2023</span><span class="p">,</span> <span class="na">title</span><span class="p">=</span><span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span> <span class="na">url</span><span class="p">=</span><span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">DOI</span><span class="p">=</span><span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span> <span class="na">journal</span><span class="p">=</span><span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span> <span class="na">publisher</span><span class="p">=</span><span class="s">{Springer Nature Switzerland}</span><span class="p">,</span> <span class="na">author</span><span class="p">=</span><span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span> <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span> <span class="na">month</span><span class="p">=</span><span class="nv">dec</span><span class="p">,</span> <span class="na">pages</span><span class="p">=</span><span class="s">{74–97}</span><span 
class="p">,</span> <span class="na">language</span><span class="p">=</span><span class="s">{en}</span> <span class="p">}</span>
</code></pre></div></div> <p>This is returned by the <a href="https://citation.doi.org/">DOI Citation Formatter</a> for the style <code class="language-plaintext highlighter-rouge">bibtex</code>, which can also be accessed through an <a href="https://citation.doi.org/api-docs.html">API</a>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="s1">'https://citation.doi.org/format?doi=10.1007%2F978-3-031-50524-9_4&amp;style=bibtex&amp;lang=en-US'</span>
</code></pre></div></div> <h4 id="comments">Comments</h4> <p>It’s quite similar to the previous one from DOI Content Negotiation, but objectively worse:</p> <ol> <li>The entry type is now just <code class="language-plaintext highlighter-rouge">@misc</code>.</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field is missing (it’s not specified for <code class="language-plaintext highlighter-rouge">@misc</code> anyway), and the title “Verification, Model Checking, and Abstract Interpretation” isn’t in any other field either.</li> <li>The <code class="language-plaintext highlighter-rouge">journal</code> field is now present (it’s not specified for <code class="language-plaintext highlighter-rouge">@misc</code> either!) and has value “Lecture Notes in Computer Science”, which isn’t a journal but a <a href="https://link.springer.com/series/558">book series</a> (which belongs to the <code class="language-plaintext highlighter-rouge">series</code> field, if it wasn’t for <code class="language-plaintext highlighter-rouge">@misc</code>).</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inbook</span><span class="p">{</span><span class="nl">Saan2023</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">ISBN</span> <span class="p">=</span> <span class="s">{9783031505249}</span><span class="p">,</span>
  <span class="na">ISSN</span> <span class="p">=</span> <span class="s">{1611-3349}</span><span class="p">,</span>
  <span class="na">url</span> <span class="p">=</span> <span class="s">{http://dx.doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
  <span class="na">DOI</span> <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
  <span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Verification,  Model Checking,  and Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">publisher</span> <span class="p">=</span> <span class="s">{Springer Nature Switzerland}</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Saan,  Simmo and Schwarz,  Michael and Erhard,  Julian and Seidl,  Helmut and Tilscher,  Sarah and Vojdani,  Vesal}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2023}</span><span class="p">,</span>
  <span class="na">month</span> <span class="p">=</span> <span class="nv">dec</span><span class="p">,</span>
  <span class="na">pages</span> <span class="p">=</span> <span class="s">{74–97}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by <a href="https://www.doi2bib.org">doi2bib</a> at <a href="https://www.doi2bib.org/bib/10.1007/978-3-031-50524-9_4">https://www.doi2bib.org/bib/10.1007/978-3-031-50524-9_4</a>. <a href="https://www.doi2bib.org">doi2bib</a> is just a browser frontend for DOI Content Negotiation and performs client-side reformatting. As far as I have seen, many other tools actually do this under the hood.</p> <h4 id="comments">Comments</h4> <p>It has all the issues of DOI Content Negotiation and only the following differences:</p> <ol> <li>The formatting is generally more human-friendly.</li> <li>The formatting adds double spaces after commas in field values. This shouldn’t affect Bib(La)TeX, but is odd nevertheless.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@InProceedings</span><span class="p">{</span><span class="nl">10.1007/978-3-031-50524-9_4</span><span class="p">,</span>
<span class="na">author</span><span class="p">=</span><span class="s">"Saan, Simmo
and Schwarz, Michael
and Erhard, Julian
and Seidl, Helmut
and Tilscher, Sarah
and Vojdani, Vesal"</span><span class="p">,</span>
<span class="na">editor</span><span class="p">=</span><span class="s">"Dimitrova, Rayna
and Lahav, Ori
and Wolff, Sebastian"</span><span class="p">,</span>
<span class="na">title</span><span class="p">=</span><span class="s">"Correctness Witness Validation by Abstract Interpretation"</span><span class="p">,</span>
<span class="na">booktitle</span><span class="p">=</span><span class="s">"Verification, Model Checking, and Abstract Interpretation"</span><span class="p">,</span>
<span class="na">year</span><span class="p">=</span><span class="s">"2024"</span><span class="p">,</span>
<span class="na">publisher</span><span class="p">=</span><span class="s">"Springer Nature Switzerland"</span><span class="p">,</span>
<span class="na">address</span><span class="p">=</span><span class="s">"Cham"</span><span class="p">,</span>
<span class="na">pages</span><span class="p">=</span><span class="s">"74--97"</span><span class="p">,</span>
<span class="na">abstract</span><span class="p">=</span><span class="s">"Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7{\%} to 47{\%} in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own."</span><span class="p">,</span>
<span class="na">isbn</span><span class="p">=</span><span class="s">"978-3-031-50524-9"</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “<a href="https://citation-needed.springer.com/v2/references/10.1007/978-3-031-50524-9_4?format=bibtex&amp;flavour=citation">Download citation (.BIB)</a>” feature of Springer Link which the particular DOI points to:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="s1">'https://citation-needed.springer.com/v2/references/10.1007/978-3-031-50524-9_4?format=bibtex&amp;flavour=citation'</span>
</code></pre></div></div> <h4 id="comments">Comments</h4> <ol> <li>This is completely different from the previous ones based on DOI Content Negotiation. I guess that’s because those actually come from <a href="https://www.crossref.org/">Crossref</a>’s database, while this one comes from Springer’s own database, but as a user I shouldn’t have to know or care. It’s still Springer submitting data to Crossref, and the DOI URL itself redirects to Springer under normal conditions (i.e. a plain HTTP request without content negotiation).</li> <li>The entry type is <code class="language-plaintext highlighter-rouge">@InProceedings</code>, which is more accurate than all the previous ones.</li> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field is missing. The DOI is in the entry key, although that doesn’t make the DOI show up in a Bib(La)TeX bibliography.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field is also missing. Thus, there would be no digital reference in a rendered bibliography.</li> <li>The formatting is multiline, but not indented.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">10.1007/978-3-031-50524-9_4</span><span class="p">,</span>
<span class="na">author</span> <span class="p">=</span> <span class="s">{Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}</span><span class="p">,</span>
<span class="na">title</span> <span class="p">=</span> <span class="s">{Correctness Witness Validation by&amp;nbsp;Abstract Interpretation}</span><span class="p">,</span>
<span class="na">year</span> <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
<span class="na">isbn</span> <span class="p">=</span> <span class="s">{978-3-031-50523-2}</span><span class="p">,</span>
<span class="na">publisher</span> <span class="p">=</span> <span class="s">{Springer-Verlag}</span><span class="p">,</span>
<span class="na">address</span> <span class="p">=</span> <span class="s">{Berlin, Heidelberg}</span><span class="p">,</span>
<span class="na">url</span> <span class="p">=</span> <span class="s">{https://doi.org/10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
<span class="na">doi</span> <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9_4}</span><span class="p">,</span>
<span class="na">abstract</span> <span class="p">=</span> <span class="s">{Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7\% to 47\% in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own.}</span><span class="p">,</span>
<span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Verification, Model Checking, and Abstract Interpretation: 25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I}</span><span class="p">,</span>
<span class="na">pages</span> <span class="p">=</span> <span class="s">{74–97}</span><span class="p">,</span>
<span class="na">numpages</span> <span class="p">=</span> <span class="s">{24}</span><span class="p">,</span>
<span class="na">keywords</span> <span class="p">=</span> <span class="s">{Correctness Witness, Witness Validation, Software Verification, Program Analysis, Abstract Interpretation}</span><span class="p">,</span>
<span class="na">location</span> <span class="p">=</span> <span class="s">{London, United Kingdom}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “Export Citation” feature of ACM Digital Library at <a href="https://dl.acm.org/doi/10.1007/978-3-031-50524-9_4">https://dl.acm.org/doi/10.1007/978-3-031-50524-9_4</a>. Although the particular work is published by Springer, ACM seems to index it.</p> <h4 id="comments">Comments</h4> <ol> <li>The <code class="language-plaintext highlighter-rouge">title</code> field value includes <code class="language-plaintext highlighter-rouge">&amp;nbsp;</code>, which is inappropriate for Bib(La)TeX.</li> <li>The <code class="language-plaintext highlighter-rouge">publisher</code> and <code class="language-plaintext highlighter-rouge">address</code> field values “Springer-Verlag” and “Berlin, Heidelberg” seem wrong because Springer itself returned “Springer Nature Switzerland” and “Cham”. (Although personally I don’t care: I would drop the <code class="language-plaintext highlighter-rouge">address</code> and simplify <code class="language-plaintext highlighter-rouge">publisher</code> to “Springer”.)</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value includes the book’s subtitle “25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I”. In BibTeX, there’s no other place to put the subtitle (except omitting it, like all the previous services do). 
<a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> specifies the <code class="language-plaintext highlighter-rouge">booksubtitle</code> field, and even more appropriate ones like <code class="language-plaintext highlighter-rouge">eventtitle</code>, <code class="language-plaintext highlighter-rouge">venue</code> and <code class="language-plaintext highlighter-rouge">eventdate</code> (as also pointed out in <a href="https://tex.stackexchange.com/a/697291">this TeX StackExchange answer</a>).</li> <li>The formatting is multiline, but not indented.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">DBLP:conf/vmcai/SaanSESTV24</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Simmo Saan and
                  Michael Schwarz and
                  Julian Erhard and
                  Helmut Seidl and
                  Sarah Tilscher and
                  Vesal Vojdani}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">booktitle</span>    <span class="p">=</span> <span class="s">{{VMCAI} {(1)}}</span><span class="p">,</span>
  <span class="na">series</span>       <span class="p">=</span> <span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span>
  <span class="na">volume</span>       <span class="p">=</span> <span class="s">{14499}</span><span class="p">,</span>
  <span class="na">pages</span>        <span class="p">=</span> <span class="s">{74--97}</span><span class="p">,</span>
  <span class="na">publisher</span>    <span class="p">=</span> <span class="s">{Springer}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “export record (BibTeX)” feature of DBLP at <a href="https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=0">https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=0</a>. DBLP offers multiple BibTeX formats, this being the condensed one.</p> <h4 id="comments">Comments</h4> <ol> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field is missing and, unlike Springer, it’s not in the entry key either.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field is also missing.</li> <li> <p>The <code class="language-plaintext highlighter-rouge">volume</code> field value is “14499” which actually corresponds to the <code class="language-plaintext highlighter-rouge">series</code> “Lecture Notes in Computer Science”. This is wrong in both BibTeX and BibLaTeX: it should instead be the <code class="language-plaintext highlighter-rouge">number</code> field with the value “14499”.</p> <p><a href="https://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf">BibTeX</a> specifies:</p> <blockquote> <dl> <dt>number</dt> <dd>The number of […] a work in a series. […] sometimes books are given numbers in a named series.</dd> <dt>volume</dt> <dd>The volume of a journal or multivolume book.</dd> </dl> </blockquote> <p><a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> specifies:</p> <blockquote> <dl> <dt>number</dt> <dd>[…] the volume/number of a book in a series.</dd> <dt>volume</dt> <dd>The volume of a multi-volume book or a periodical.</dd> </dl> </blockquote> <p>This has also been pointed out in <a href="https://tex.stackexchange.com/a/697291">this TeX StackExchange answer</a>.</p> </li> <li> <p>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value is essentially “VMCAI (1)”, where the 1 refers to the part. 
The latter is what actually should go into the <code class="language-plaintext highlighter-rouge">volume</code> field according to the specifications above.</p> <p>Alternatively, <a href="https://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf">BibLaTeX</a> also specifies:</p> <blockquote> <dl> <dt>part</dt> <dd>The number of a partial volume. This field applies to books only, not to journals. It may be used when a logical volume consists of two or more physical ones. In this case the number of the logical volume goes in the <code class="language-plaintext highlighter-rouge">volume</code> field and the number of the part of that volume in the <code class="language-plaintext highlighter-rouge">part</code> field.</dd> </dl> </blockquote> <p>The distinction between logical and physical is a bit hazy in this case. Even <a href="https://link.springer.com/book/10.1007/978-3-031-50524-9">Springer</a> cannot make up their mind about the terminology:</p> <ol> <li>The subtitle of the book ends with “Part I”.</li> <li>The Springer Link page for the book has the section “<a href="https://link.springer.com/book/10.1007/978-3-031-50524-9#other-volumes">Other volumes</a>”.</li> <li>The “<a href="https://link.springer.com/book/10.1007/978-3-031-50524-9#about-this-book">About this book</a>” section on the same page mentions both, while starting with “The two-volume set LNCS 14499 and 14500 […]”.</li> </ol> </li> <li>The formatting is the nicest of them all. Although, when copying the BibTeX code from the DBLP website, the copied text includes two empty leading and trailing lines for some reason. The empty lines are not present in the downloadable .bib file.</li> </ol> </li> <li> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">DBLP:conf/vmcai/SaanSESTV24</span><span class="p">,</span>
  <span class="na">author</span>       <span class="p">=</span> <span class="s">{Simmo Saan and
                  Michael Schwarz and
                  Julian Erhard and
                  Helmut Seidl and
                  Sarah Tilscher and
                  Vesal Vojdani}</span><span class="p">,</span>
  <span class="na">editor</span>       <span class="p">=</span> <span class="s">{Rayna Dimitrova and
                  Ori Lahav and
                  Sebastian Wolff}</span><span class="p">,</span>
  <span class="na">title</span>        <span class="p">=</span> <span class="s">{Correctness Witness Validation by Abstract Interpretation}</span><span class="p">,</span>
  <span class="na">booktitle</span>    <span class="p">=</span> <span class="s">{Verification, Model Checking, and Abstract Interpretation - 25th International
                  Conference, {VMCAI} 2024, London, United Kingdom, January 15-16, 2024,
                  Proceedings, Part {I}}</span><span class="p">,</span>
  <span class="na">series</span>       <span class="p">=</span> <span class="s">{Lecture Notes in Computer Science}</span><span class="p">,</span>
  <span class="na">volume</span>       <span class="p">=</span> <span class="s">{14499}</span><span class="p">,</span>
  <span class="na">pages</span>        <span class="p">=</span> <span class="s">{74--97}</span><span class="p">,</span>
  <span class="na">publisher</span>    <span class="p">=</span> <span class="s">{Springer}</span><span class="p">,</span>
  <span class="na">year</span>         <span class="p">=</span> <span class="s">{2024}</span><span class="p">,</span>
  <span class="na">url</span>          <span class="p">=</span> <span class="s">{https://doi.org/10.1007/978-3-031-50524-9\_4}</span><span class="p">,</span>
  <span class="na">doi</span>          <span class="p">=</span> <span class="s">{10.1007/978-3-031-50524-9\_4}</span><span class="p">,</span>
  <span class="na">timestamp</span>    <span class="p">=</span> <span class="s">{Sat, 10 Feb 2024 18:04:44 +0100}</span><span class="p">,</span>
  <span class="na">biburl</span>       <span class="p">=</span> <span class="s">{https://dblp.org/rec/conf/vmcai/SaanSESTV24.bib}</span><span class="p">,</span>
  <span class="na">bibsource</span>    <span class="p">=</span> <span class="s">{dblp computer science bibliography, https://dblp.org}</span>
<span class="p">}</span>
</code></pre></div></div> <p>This is returned by the “export record (BibTeX)” feature of DBLP at <a href="https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=1">https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&amp;param=1</a>. DBLP offers multiple BibTeX formats, this being the standard one.</p> <h4 id="comments">Comments</h4> <p>It is mostly an extension of the previous one from DBLP, but with additional fields which can be (and are) treated incorrectly:</p> <ol> <li>The <code class="language-plaintext highlighter-rouge">doi</code> field value has the underscore escaped. This is unnecessary and even wrong: <a href="https://doi.org/10.1007/978-3-031-50524-9\_4">DOI lookup</a> returns “DOI Not Found”.</li> <li>The <code class="language-plaintext highlighter-rouge">url</code> field value also has the underscore escaped. This is again unnecessary and even wrong: the <a href="https://doi.org/10.1007/978-3-031-50524-9\_4"><code class="language-plaintext highlighter-rouge">url</code></a> is broken.</li> <li>The <code class="language-plaintext highlighter-rouge">booktitle</code> field value is uncondensed, but has the same issues as with ACM.</li> </ol> </li> </ul> <p><em>The tab content ends here.</em><sup id="fnref:tabs-css"><a href="#fn:tabs-css" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <hr/> <h3 id="comparison">Comparison</h3> <p>Here’s a table to summarize some aspects of the entries returned by the services. 
The values I consider acceptable are in <strong>bold</strong><sup id="fnref:bold-monospace"><a href="#fn:bold-monospace" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> and the values I prefer are in <em>italic</em>.</p> <table> <thead> <tr> <th>Feature</th> <th>DOI</th> <th>DOI formatter</th> <th>doi2bib</th> <th>Springer</th> <th>ACM</th> <th>DBLP condensed</th> <th>DBLP standard</th> </tr> </thead> <tbody> <tr> <td>Entry type</td> <td><code class="language-plaintext highlighter-rouge">@inbook</code></td> <td><code class="language-plaintext highlighter-rouge">@misc</code></td> <td><code class="language-plaintext highlighter-rouge">@inbook</code></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@InProceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> <td><strong><em><code class="language-plaintext highlighter-rouge">@inproceedings</code></em></strong></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">doi</code></td> <td><strong><em>Yes</em></strong></td> <td><strong><em>Yes</em></strong></td> <td><strong><em>Yes</em></strong></td> <td>No</td> <td><strong><em>Yes</em></strong></td> <td>No</td> <td>Yes<sup id="fnref:underscore-problem"><a href="#fn:underscore-problem" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">url</code></td> <td>dx.doi.org</td> <td>dx.doi.org</td> <td>dx.doi.org</td> <td><strong><em>No</em></strong></td> <td><strong>doi.org</strong></td> <td><strong><em>No</em></strong></td> <td>doi.org<sup id="fnref:underscore-problem:1"><a href="#fn:underscore-problem" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">year</code></td> 
<td>2023</td> <td>2023</td> <td>2023</td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> <td><strong><em>2024</em></strong></td> </tr> <tr> <td>Event info</td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td><strong><em>No</em></strong></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td><strong><em>No</em></strong></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> </tr> <tr> <td>LNCS №</td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td><strong>No</strong></td> <td>In <code class="language-plaintext highlighter-rouge">volume</code></td> <td>In <code class="language-plaintext highlighter-rouge">volume</code></td> </tr> <tr> <td>Book part</td> <td>No</td> <td>No</td> <td>No</td> <td>No</td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> <td>In <code class="language-plaintext highlighter-rouge">booktitle</code></td> </tr> </tbody> </table> <p>The table also compares the <code class="language-plaintext highlighter-rouge">year</code> field values, which weren’t discussed above. Surprisingly, there isn’t even consensus about such a basic fact. It probably has to do with “<a href="https://doi.org/10.1007/978-3-031-50524-9_4">First Online: 30 December 2023</a>”. The Crossref data for the DOI seems to correspond to that, while Springer itself considers the publication to be in 2024, which is also when the conference took place. This just goes to show that the DOI Content Negotiation data, which gets used by many other services, may be inaccurate w.r.t. 
the very basics.</p> <h2 id="conclusion">Conclusion</h2> <p>I learned about <a href="https://citation.doi.org/docs.html">DOI Content Negotiation</a> and how bad it actually is for BibTeX. The databases (Springer, ACM, DBLP) are better, but none is perfect or even good enough, as the comparison table reveals. I guess I’ll end up doing a lot of manual work, although some is semi-automatable using BibLaTeX <em>source maps</em> (which are a story for another time).</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:tabs-css"> <p>The styling of tabs in the website theme I’m using clearly isn’t great if I have to point it out. I should fix that. <a href="#fnref:tabs-css" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:bold-monospace"> <p>The CSS of the website theme I’m using is such that <strong>bold</strong> doesn’t work together with <code class="language-plaintext highlighter-rouge">monospace</code>. I should fix that. But until then, just imagine <code class="language-plaintext highlighter-rouge">@inproceedings</code> (and <code class="language-plaintext highlighter-rouge">@InProceedings</code>) being bold in the table. <a href="#fnref:bold-monospace" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:underscore-problem"> <p>Has problem with escaping underscores. <a href="#fnref:underscore-problem" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:underscore-problem:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[My PhD-thesis–to–be combines 8 papers from the last 5 years. Their Bib(La)TeX bibliography entries come in a wide range of quality and style. I would like some consistency but it’s quite an effort to achieve across 234 entries. 
So I was wondering if there’s any good quality and consistent source from where I could (hopefully automatically) update their data via their DOI.]]></summary></entry><entry><title type="html">Scraping barcodes with suffix trees</title><link href="https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees/" rel="alternate" type="text/html" title="Scraping barcodes with suffix trees"/><published>2025-08-29T00:00:00+00:00</published><updated>2025-08-29T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees</id><content type="html" xml:base="https://sim642.eu/blog/2025/08/29/scraping-barcodes-with-suffix-trees/"><![CDATA[<p>Estonia has a <a href="https://en.wikipedia.org/wiki/Container-deposit_legislation">deposit-refund system for drink bottles and cans</a> (like the German <em>Pfand</em>). It is operated by Eesti Pandipakend whose website includes the <a href="https://eestipandipakend.ee/en/packaging-register">package registry</a><sup id="fnref:packaging-register"><a href="#fn:packaging-register" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. The registry isn’t just a list of all products whose packages are part of the system but rather a search.</p> <h4 id="package-registry-search">Package registry search</h4> <p>One can search for the entire barcode of a product to see whether it’s included in the registry and some extra data about it. But one can also search for a part of the barcode to find products whose barcodes include it as a substring. However, there’s an important restriction: <strong>the search returns at most 10 products</strong> (like SQL’s <code class="language-plaintext highlighter-rouge">LIMIT 10</code>). Additionally, the empty substring cannot really be searched (misleadingly, it returns zero products).</p> <h2 id="scraping-problem">Scraping problem</h2> <p>With the search at hand, let’s consider scraping the entire package registry, i.e. 
finding barcodes of all the registered packages and the extra data about them.<sup id="fnref:motivation"><a href="#fn:motivation" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> The main focus is on the barcodes because the extra data is a simple byproduct: when the barcode of a product is confirmed to be in the registry by seeing it in the results of some search query, then we also get its extra data and can record it on the side.</p> <p>The package registry is essentially <strong>rate-limited to 1 query per second</strong> (technically, 60 queries per minute). Therefore, the goal is to minimize the number of queries needed to completely scrape the registry.</p> <h3 id="naïve-solution">Naïve solution</h3> <p>Since the barcodes are numeric and (mostly) of length 13, there’s an obvious algorithm to scrape the registry: just query each complete barcode to see whether zero or one results are returned. Obviously, this algorithm is completely impractical because it requires \(10^{13}\) queries. (This could be reduced by exploiting the structure of <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-13</a> barcodes but that’s not enough to make it practical.)</p> <p>To further complicate things, the registry also contains small numbers of 12-digit <a href="https://en.wikipedia.org/wiki/Universal_Product_Code">UPC-A</a> and 8-digit <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-8</a> barcodes, which require additional queries to scrape. Furthermore, <em>a priori</em> there’s no indication which lengths of barcodes exist in the registry. The 10 result limit for a query means that an <strong>extra assumption</strong> is needed to make the scraping problem properly solvable at all. Namely, for every complete barcode \(b\) in the registry, the registry contains at most 10 barcodes which have \(b\) as a substring (including \(b\) itself). 
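</p> <p>This assumption can be sanity-checked offline on any set of barcodes. A minimal sketch (the function name, signature and default limit are mine, not part of the registry or the scraper):</p>

```python
def assumption_holds(barcodes, limit=10):
    """Check that every barcode in the set is a substring of at most
    `limit` barcodes in the same set (including itself)."""
    barcodes = list(barcodes)
    for b in barcodes:
        # Number of barcodes that contain b as a substring.
        containing = sum(1 for other in barcodes if b in other)
        if containing > limit:
            return False
    return True
```

<p>On the scraped registry this is a quadratic loop over ~10000 barcodes, which should still be fast enough for a one-off check.</p> <p>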
If this wasn’t the case, then there would be no guarantee that querying \(b\) would necessarily return \(b\). This assumption (probably severely) limits the potential search space.<sup id="fnref:search-space-exercise"><a href="#fn:search-space-exercise" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <h3 id="basic-online-solution">Basic online solution</h3> <p>The naïve algorithm, besides being impractical, has another limitation: it first fixes all the many queries to be made and then performs them all. For a feasible solution, we can use an <a href="https://en.wikipedia.org/wiki/Online_algorithm">online algorithm</a>, which determines new queries to make based on the results of previous queries.</p> <p>My first idea for such an algorithm was the following. First, query <code class="language-plaintext highlighter-rouge">0</code>: if it returns fewer than 10 results (which is unlikely), then we immediately know all barcodes in the registry which include <code class="language-plaintext highlighter-rouge">0</code>; otherwise, recursively query <code class="language-plaintext highlighter-rouge">00</code>, <code class="language-plaintext highlighter-rouge">01</code>, …, <code class="language-plaintext highlighter-rouge">09</code>. For each of those, behave similarly: if there are fewer than 10 results, then stop recursing; otherwise, continue recursing to all 10 extensions of the query. And so on… Once all of this is done, repeat the same process starting with <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, …, <code class="language-plaintext highlighter-rouge">9</code>. (The general approach would be to start one recursion from the empty string, but due to the empty query corner case that wouldn’t actually work.)</p> <blockquote class="block-warning"> <p>I have not proven this algorithm to be correct (nor defined what correctness means, yet). 
So if it isn’t, then this post has already gone off the rails. Comment below if you believe that’s the case!</p> </blockquote> <h3 id="suffix-tree-solution">Suffix tree solution</h3> <p>The basic online algorithm can be slightly improved using a <a href="https://en.wikipedia.org/wiki/Suffix_tree">(generalized) suffix tree</a>. To this end, every barcode returned by any query (including those with at least 10 results) is added to a generalized suffix tree, e.g. efficiently using <a href="https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm">Ukkonen’s algorithm</a>. (For everything related to suffix trees, I would recommend reading <a class="citation" href="#Gusfield_1997">(Gusfield, 1997)</a>.) Before making any actual query, we can efficiently simulate the query on the suffix tree to check if we have already seen at least 10 barcodes with the query as substring. If this is the case, then we can skip the actual query and immediately proceed with recursion, because the actual query would also return 10 results and force us to recurse anyway.</p> <p>I used this approach to scrape the actual registry on 2025-08-22, which yielded <strong>10171 barcodes</strong> (9535 <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-13</a>, 377 <a href="https://en.wikipedia.org/wiki/Universal_Product_Code">UPC-A</a> and 259 <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN-8</a>).<sup id="fnref:github"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> Hoping this algorithm is correct, I have a copy of the entire registry at hand locally, which makes it <em>much</em> easier and faster to experiment with different solutions: the registry can be simulated without any rate limiting.<sup id="fnref:local-registry-large-query"><a href="#fn:local-registry-large-query" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> On this local simulation, the suffix tree algorithm uses <strong>73713 queries</strong> 
to scrape the registry. With the actual rate limiting, this requires almost 20.5 hours. In comparison, the basic algorithm uses 81250 queries, which is ~10% more.</p> <h3 id="better-solutions">Better solutions?</h3> <p>I suspect the suffix tree solution is far from optimal because it discovers each barcode in the registry numerous times, corresponding to each suffix of each barcode. Intuitively, it should be possible to do better. For example, if <code class="language-plaintext highlighter-rouge">0000</code> and its extensions have already been scraped, then later it would be a waste to scrape <code class="language-plaintext highlighter-rouge">10000</code> and its extensions, because all of them have already been found. <strong>Except, that’s not true!</strong> If the <code class="language-plaintext highlighter-rouge">0000</code> query has at least 10 results and <code class="language-plaintext highlighter-rouge">00001</code>, …, <code class="language-plaintext highlighter-rouge">00009</code> have all been scraped, then we might not have seen barcodes that <em>end with</em> <code class="language-plaintext highlighter-rouge">0000</code>. For this tricky reason, my attempts at further optimization (e.g. also using a suffix tree of all performed queries to find such query inclusions) have failed to correctly scrape the entire registry. Such pruning can reduce the number of queries by over 50%, but can also cause the scrape to miss a handful of barcodes, which is no good (<a href="https://web.archive.org/web/20221205165923/https://theprofoundprogrammer.com/post/28974600028/text-it-doesnt-work-but-its-fast">“It doesn’t work, but it’s fast”</a>).</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a better <em>correct</em> solution!</p> </blockquote> <h2 id="verification-problem">Verification problem</h2> <p>With the scraped registry at hand, let’s consider checking the result, i.e. that the scraped dataset has exactly the same barcodes as the actual registry. 
This set equality can be viewed as two set inclusions: all the scraped barcodes exist in the actual registry, but also all the barcodes in the actual registry were scraped. Let’s call these properties <em>soundness</em> and <em>completeness</em>, respectively.</p> <p>In a way, verification is like solving the scraping problem while knowing the answer to begin with. So intuitively, it should be doable with fewer queries.</p> <h3 id="soundness">Soundness</h3> <h4 id="naïve-solution-1">Naïve solution</h4> <p>The obvious algorithm is to just query each scraped barcode from the actual registry and confirm that it is included in the results. (If any of the results contains new barcodes, we’ve inadvertently disproven completeness.) Thus, this requires as many queries as there are barcodes in the registry, which can hardly be optimal. For the registry scraped above, it requires <strong>10171 queries</strong>.</p> <h4 id="suffix-tree-solution-1">Suffix tree solution</h4> <p>Since the answer is already known, a generalized suffix tree of all the barcodes can be constructed to begin with. The verification algorithm is then to traverse the suffix tree and query from the actual registry the substring corresponding to each suffix tree path (from its root) where the subtree under that node has at most 10 barcodes (while its parent has more). Unlike the suffix tree solution to the scraping problem, nodes representing exactly 10 barcodes should also be fine here.</p> <p>As with the suffix tree solution to the scraping problem, this also checks each barcode multiple times, corresponding to each suffix of each barcode. 
Having implemented this approach, for the registry scraped above, it requires <strong>29794 queries</strong>, which is much worse than even the naïve algorithm.<sup id="fnref:github:1"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h4 id="suffix-tree-and-set-cover-solution">Suffix tree and set cover solution</h4> <p>Since we have the answer, we can do some optimization to avoid verifying each barcode numerous times. Like the previous solution, the suffix tree provides all substrings we <em>could</em> query and their expected results. Minimizing the number of queries to cover all the barcodes is an instance of the <a href="https://en.wikipedia.org/wiki/Set_cover_problem">set cover problem</a>, which, somewhat unfortunately, is NP-hard. So curiously, to efficiently verify the solution to the scraping problem, we have to solve (not verify) an instance of the difficult set cover problem.</p> <p>Luckily, the greedy algorithm for the set cover problem isn’t too shabby in this case: having implemented it, for the registry scraped above, it requires <em>just</em> <strong>2159 queries</strong>, roughly a fifth of what the naïve solution requires.<sup id="fnref:github:2"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> Theoretical results about the greedy algorithm show that its approximation ratio here is \(H(10) \approx 2.93\), where \(H\) is the harmonic number and 10 is the size of the largest set (picked from the suffix tree). In other words, the greedy algorithm can be at most ~2.93 times worse than the optimal set cover size. Or put another way, for the registry scraped above, the optimal number of verification queries is <em>at least</em> 738.</p> <p>As far as I managed to find, similar ideas have been proposed to solve a slightly different <strong>string barcoding problem</strong> <a class="citation" href="#10.1145/565196.565229">(Rash &amp; Gusfield, 2002)</a>. 
Coincidentally, that paper also mentions barcodes by name, albeit with a completely different meaning: it’s a problem in computational biology for constructing some kind of probes to distinguish a set of DNA sequences. Nevertheless, their solution also processes nodes of a suffix tree in relation to the strings contained in their corresponding subtrees. However, they use it to construct an <a href="https://en.wikipedia.org/wiki/Integer_programming">integer linear programming (ILP)</a> problem to then solve (i.e. optimize). Also coincidentally, the set cover problem can be stated as an ILP problem. Hence, ILP is also NP-hard, but it can be approximated via relaxation to non-integer linear programming. That approximation doesn’t necessarily ensure as good a solution as the greedy algorithm does for the scraped registry, because here it only has an approximation ratio of 11.</p> <h3 id="completeness">Completeness</h3> <p>Checking completeness is not as easy: we have to check whether every barcode in the actual registry is in our scraped answer, but knowing the former is the scraping problem itself! It seems more promising to check the contrapositive: whether every barcode <em>not</em> in our scraped answer is also <em>not</em> in the actual registry. This doesn’t sound much easier at first: most barcodes are not in the answer and iterating over them would be unrealistic.</p> <h4 id="suffix-tree-solution-2">Suffix tree solution</h4> <p>Although I have not fully worked out the algorithm for this, I believe the suffix tree used for soundness can also be useful for completeness. Namely, every branch which is <em>missing</em> in the suffix tree corresponds to a partial barcode which isn’t a substring of any barcode in the scraped answer. 
The suffix tree path (from its root) to this hypothetical missing branch serves as a query which should return zero results from the actual registry, otherwise our answer is incomplete.</p> <h4 id="better-solutions-1">Better solutions?</h4> <p>As always, such suffix-tree–based checking incurs some redundancy. For example, to prove that the barcode <code class="language-plaintext highlighter-rouge">000008</code> isn’t in the registry, it’s enough that the query <code class="language-plaintext highlighter-rouge">0000</code> returns zero results, making it unnecessary to also query <code class="language-plaintext highlighter-rouge">0008</code>, which might also be missing. However, the latter might still be necessary to rule out other substrings.</p> <p>Removing the redundancy using set cover isn’t as simple for the negative case: each of the missing branches (likely) represents a very large (complete) suffix subtree. Thus, the resulting set cover problem instance would not be realistically solvable.</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a better solution!</p> </blockquote> <h3 id="joint-verification">Joint verification</h3> <p>It is possible to exploit a single query for both soundness and completeness (as already noted in the soundness discussion). If a query returns fewer than 10 barcodes, we can confirm both that our scraped ones are included and that no others containing the queried substring are present. Thus, a joint set cover problem to cover all barcodes in the answer and all missing branches in the suffix tree might yield a not-too-bad solution overall.</p> <h2 id="update-problem">Update problem</h2> <p>I scraped the registry at some point in time (or more precisely, over a duration of time) and hypothetically even verified it to be correct soon after. 
However, the actual registry will inevitably change: new products with new barcodes will be added (and perhaps some old ones removed, although I’m not sure if the registry would ever do that — it still contains some quite old products). Hence, another interesting problem arises: how to update the scraped registry without scraping it from scratch?</p> <p>I haven’t put much thought into this, but I suspect the suffix tree will again come in handy. Maybe one can try verifying the current scraped registry and, when a mismatch is detected, do some scraping from that point, but it’s just a vague idea. I would expect the amount of extra scraping to be somewhat proportional to the size of the registry change. Of course, if the registry is completely replaced with a disjoint set of barcodes (a hypothetical worst case scenario), then it’s probably hopeless to beat scraping from scratch.</p> <blockquote class="block-tip"> <p>Comment below if you have (ideas for) a solution!</p> </blockquote> <h2 id="conclusion">Conclusion</h2> <p>What started as a silly exercise in scraping a registry turned into a set of quite complicated algorithmic problems. And in no way can I claim any of these to be solved. 
Therefore, I challenge the reader to come up with better solutions and quickly try them out on the locally simulated registry.<sup id="fnref:github:3"><a href="#fn:github" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h2 id="references">References</h2> <div class="publications"> <ol class="bibliography"><li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100">Book</abbr> </div> <div id="Gusfield_1997" class="col-sm-8"> <div class="title">Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</div> <div class="author"> Dan Gusfield </div> <div class="periodical"> 1997 </div> <div class="periodical"> </div> <div class="links"> </div> </div> </div> </li> <li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100" style="background-color:#d9d9d9"> <a href="https://dl.acm.org/conference/recomb">RECOMB</a> </abbr> </div> <div id="10.1145/565196.565229" class="col-sm-8"> <div class="title">String barcoding: uncovering optimal virus signatures</div> <div class="author"> Sam Rash and Dan Gusfield </div> <div class="periodical"> <em>In Research in Computational Molecular Biology</em>, 2002 </div> <div class="periodical"> </div> <div class="links"> <a class="abstract btn btn-sm z-depth-0" role="button">Abs</a> <a href="https://doi.org/10.1145/565196.565229" class="btn btn-sm z-depth-0" role="button">DOI</a> </div> <div class="abstract hidden"> <p>There are many critical situations when one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. 
There is a limit to the number of cells on a chip, and a range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens. Previous work addressed this problem, but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substrings lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer-linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.</p> </div> </div> </div> </li></ol> </div> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:packaging-register"> <p>They call it the packaging register, but that’s a bit of an odd translation of the Estonian term <em>pakendiregister</em>. <a href="#fnref:packaging-register" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:motivation"> <p>I don’t actually have a use for all this data, but if you do, then you’re in luck! 
I just got fascinated by the (surprisingly complex) computer science problem of doing so. <a href="#fnref:motivation" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:search-space-exercise"> <p>It might be a fun combinatorics problem to calculate the maximum possible size of a set of barcodes (up to a certain length) that satisfies this assumption. <a href="#fnref:search-space-exercise" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:github"> <p><a href="https://github.com/sim642/pandipakend">This GitHub repository</a> includes my Python code and the scraped dataset. <a href="#fnref:github" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:github:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:github:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:github:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:local-registry-large-query"> <p>Note that for queries with more than 10 results, such local simulation might not return the same 10 results that the actual registry does. In turn, this may cause some algorithms to behave slightly differently. <a href="#fnref:local-registry-large-query" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="algorithms"/><category term="programming"/><summary type="html"><![CDATA[Estonia has a deposit-refund system for drink bottles and cans (like the German Pfand). It is operated by Eesti Pandipakend whose website includes the package registry1. The registry isn’t just a list of all products whose packages are part of the system but rather a search. They call it the packaging register, but that’s a bit of an odd translation of the Estonian term pakendiregister. 
&#8617;]]></summary></entry><entry><title type="html">Securing applications with oauth2-proxy on Synology NAS</title><link href="https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas/" rel="alternate" type="text/html" title="Securing applications with oauth2-proxy on Synology NAS"/><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas</id><content type="html" xml:base="https://sim642.eu/blog/2025/07/25/securing-applications-with-oauth2-proxy-on-synology-nas/"><![CDATA[<p>A <a href="https://www.synology.com/">Synology NAS</a> can be convenient for hosting various Docker-based applications for personal use. Some applications have authentication built in, but others don’t. The latter approach is not unreasonable: it doesn’t make sense for every application to include its own custom user management and authentication.</p> <p>Nevertheless, it may be desirable to make them accessible only to authenticated users. Luckily, <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> is a (reverse) proxy service that lets us put authentication in front of any such application by delegating the authentication part to another service of choice, e.g. Google, GitHub, etc. In order to not depend on any external authentication service, it’s also possible to use the Synology DSM and its user management for authentication, just like it is for Synology’s own applications. And this post will show how to do exactly that.</p> <h2 id="introduction">Introduction</h2> <p>This post assumes that you have the Synology DSM accessible over HTTPS at <a href="https://example.com:5001">https://example.com:5001</a>, i.e. “example.com” stands for your own domain and Synology DSM is already configured with a valid HTTPS certificate for it. 
For example, this can be achieved using <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt</a> or <a href="/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/">Tailscale</a>.</p> <p>The tutorial will set up the following:</p> <ul> <li>For example’s sake, the insecure unauthenticated application will be <a href="https://github.com/postmanlabs/httpbin">httpbin</a>, which will not be exposed directly. This can be replaced with the insecure unauthenticated application of choice.</li> <li>The insecure <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> will be exposed at <a href="http://example.com:5400">http://example.com:5400</a> and will serve as a reverse proxy to <a href="https://github.com/postmanlabs/httpbin">httpbin</a> once the user is authenticated. It is insecure only in the sense of HTTP — it is still authenticated.</li> <li>The secure <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> will be exposed at <a href="https://example.com:5401">https://example.com:5401</a> via a reverse proxy to the insecure version. 
This allows the HTTPS certificate for “example.com” to be reused without having to also set up an HTTPS certificate in <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> itself.</li> <li>The authentication will be provided by Synology DSM at <a href="https://example.com:5001">https://example.com:5001</a>.</li> </ul> <h2 id="step-by-step">Step-by-step</h2> <p>All of the following steps are to be performed in the Synology DSM.</p> <h3 id="reverse-proxy">Reverse Proxy</h3> <ol> <li>Navigate to <strong>Control Panel → Login Portal → Advanced → Reverse Proxy</strong>.</li> <li> <p>Create a new reverse proxy (<strong>Create</strong>) with the following details:</p> <ul> <li>Reverse Proxy Name: “oauth2-proxy”.</li> <li><strong>Source</strong>: <ul> <li>Protocol: HTTPS.</li> <li>Hostname: “example.com”.</li> <li>Port: 5401.</li> </ul> </li> <li><strong>Destination</strong>: <ul> <li>Protocol: HTTP.</li> <li>Hostname: “localhost”.</li> <li>Port: 5400.</li> </ul> </li> </ul> </li> <li>Press “Save” and close the Reverse Proxy window.</li> </ol> <h3 id="sso-server">SSO Server</h3> <ol> <li>Install “SSO Server” from <strong>Package Center</strong>.</li> <li>Open <strong>SSO Server</strong>.</li> <li>Under <strong>General Settings</strong>, set Server URL: “example.com:5001” (the “https://” prefix is implied).</li> <li>Under <strong>Service → OIDC</strong>, Enable OIDC server.</li> <li> <p>Under <strong>Application</strong>, add a new application (<strong>Add</strong>) with the following details:</p> <ol> <li>Select an SSO protocol: OIDC.</li> <li>Application Name: “oauth2-proxy”.</li> <li>Redirect URI: “<a href="https://example.com:5401/oauth2/callback">https://example.com:5401/oauth2/callback</a>”.</li> </ol> </li> <li>Select the just-added application from the list and click “Edit” in the toolbar.</li> <li>Make note of the generated “Application ID” and “Application secret”, which will be needed in the next step. 
You can just keep the Edit window open and come back to copy them when needed.</li> </ol> <h3 id="oauth2-proxy">oauth2-proxy</h3> <h4 id="configuration">Configuration</h4> <ol> <li>In <strong>File Station</strong>, create a folder for <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration files, e.g. <code class="language-plaintext highlighter-rouge">/docker/oauth2-proxy</code>.</li> <li>Using <strong>Text Editor</strong>, create the files <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> and <code class="language-plaintext highlighter-rouge">authenticated_emails</code> with the contents described below, and save them into the previously-created folder.</li> </ol> <h5 id="oauth2-proxycfg"><code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code></h5> <p>Copy the following:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">http_address</span><span class="p">=</span><span class="s">"0.0.0.0:5400"</span>
<span class="py">reverse_proxy</span><span class="p">=</span><span class="s">"true"</span>
<span class="py">upstreams</span><span class="p">=</span><span class="s">"http://httpbin"</span>

<span class="c"># cookies</span>
<span class="py">cookie_secret</span><span class="p">=</span><span class="s">"TODO"</span>
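<span class="c"># e.g. generated (per the oauth2-proxy docs) with:</span>
<span class="c"># python -c 'import os,base64; print(base64.urlsafe_b64encode(os.urandom(32)).decode())'</span>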
<span class="py">cookie_secure</span><span class="p">=</span><span class="s">"true"</span>
<span class="py">cookie_domains</span><span class="p">=[</span><span class="s">"example.com"</span><span class="p">]</span>
<span class="py">whitelist_domains</span><span class="p">=[</span><span class="s">"example.com:5401"</span><span class="p">]</span>

<span class="c"># Synology</span>
<span class="py">provider</span><span class="p">=</span><span class="s">"oidc"</span>
<span class="py">oidc_issuer_url</span><span class="p">=</span><span class="s">"https://example.com:5001/webman/sso"</span>
<span class="py">client_id</span><span class="p">=</span><span class="s">"TODO"</span>
<span class="py">client_secret</span><span class="p">=</span><span class="s">"TODO"</span>
<span class="py">redirect_url</span><span class="p">=</span><span class="s">"https://example.com:5401/oauth2/callback"</span>
<span class="py">code_challenge_method</span><span class="p">=</span><span class="s">"S256"</span>
<span class="py">skip_provider_button</span><span class="p">=</span><span class="s">"true"</span>

<span class="c"># authentication</span>
<span class="py">oidc_email_claim</span><span class="p">=</span><span class="s">"sub"</span>
<span class="py">authenticated_emails_file</span><span class="p">=</span><span class="s">"/authenticated_emails"</span>
</code></pre></div></div> <p>Replace all TODOs as follows:</p> <ul> <li>Replace <code class="language-plaintext highlighter-rouge">cookie_secret</code> with a freshly-generated cookie secret, as described <a href="https://oauth2-proxy.github.io/oauth2-proxy/configuration/overview#generating-a-cookie-secret">here</a>.</li> <li>Replace <code class="language-plaintext highlighter-rouge">client_id</code> with “Application ID” from SSO Server.</li> <li>Replace <code class="language-plaintext highlighter-rouge">client_secret</code> with “Application secret” from SSO Server.</li> </ul> <h5 id="authenticated_emails"><code class="language-plaintext highlighter-rouge">authenticated_emails</code></h5> <p>Write a list of authorized Synology usernames, one on each line. For example:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myuser
</code></pre></div></div> <blockquote class="block-danger"> <p>Despite the file and the <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration calling them emails, we don’t use emails here. Using emails from Synology users would be insecure because users can arbitrarily change their email. Hence, we make <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> use usernames in place of emails using <code class="language-plaintext highlighter-rouge">oidc_email_claim="sub"</code>. I learned about this trick from Okta’s tutorial <a href="https://developer.okta.com/blog/2022/07/14/add-auth-to-any-app-with-oauth2-proxy">Add Auth to Any App with OAuth2 Proxy</a>.</p> </blockquote> <h5 id="permissions">Permissions</h5> <ol> <li>In <strong>File Station</strong>, navigate to the previously-created folder for <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> configuration files.</li> <li>Right-click on <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> and select “Properties”.</li> <li>Switch to “Permission” tab.</li> <li> <p>Press “Create” in the toolbar and enter the following details:</p> <ul> <li>User or group: Everyone.</li> <li>Type: Allow.</li> <li>Permission: Read.</li> </ul> </li> <li>Press “Done” and then “Save”.</li> <li>Repeat for <code class="language-plaintext highlighter-rouge">authenticated_emails</code>.</li> </ol> <p>This is <a href="https://kb.synology.com/en-global/DSM/tutorial/Docker_container_cant_access_the_folder_or_file">recommended by Synology</a> to fix permissions issues with files mapped into Docker containers (in the next step).</p> <h4 id="container">Container</h4> <ol> <li>Install “Container Manager” from <strong>Package Center</strong>.</li> <li>Navigate to <strong>Container Manager → Project</strong>.</li> <li> <p>Create a new project (<strong>Create</strong>) with the following details:</p> <ul> <li>Project name: “oauth2-proxy”.</li> <li>Path: “<code 
class="language-plaintext highlighter-rouge">/docker/oauth2-proxy</code>”.</li> <li>Source: Create docker-compose.yml.</li> </ul> </li> <li>Paste the following Docker Compose file into the text box below: <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3.0'</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">oauth2-proxy</span><span class="pi">:</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">oauth2-proxy</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">quay.io/oauth2-proxy/oauth2-proxy:v7.10.0</span>
    <span class="na">command</span><span class="pi">:</span> <span class="s">--config /oauth2-proxy.cfg</span>
    <span class="na">hostname</span><span class="pi">:</span> <span class="s">oauth2-proxy</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">./oauth2-proxy.cfg:/oauth2-proxy.cfg:ro"</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">./authenticated_emails:/authenticated_emails:ro"</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">5400:5400/tcp</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">httpbin</span>
  <span class="na">httpbin</span><span class="pi">:</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">httpbin</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">kennethreitz/httpbin</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">ports</span><span class="pi">:</span> <span class="pi">[]</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
<span class="na">networks</span><span class="pi">:</span>
  <span class="na">httpbin</span><span class="pi">:</span> <span class="pi">{}</span>
</code></pre></div> </div> </li> <li>Press “Next”, “Next” and “Done”.</li> <li>Wait for the Docker containers to be downloaded and started.</li> </ol> <h2 id="conclusion">Conclusion</h2> <p>If all went well, then you can now navigate to <a href="https://example.com:5401">https://example.com:5401</a> in your browser. This should first redirect to Synology DSM login. However, since you should already be logged into Synology DSM, you will not be prompted to log in again. Finally, you will be redirected back to <a href="https://example.com:5401">https://example.com:5401</a>, but this time you’ll see the <a href="https://github.com/postmanlabs/httpbin">httpbin</a> application instead.</p> <p>You can replace the <a href="https://github.com/postmanlabs/httpbin">httpbin</a> container with your desired Docker-based application. Importantly, it should <em>not</em> expose any <code class="language-plaintext highlighter-rouge">ports</code>. Instead, it will only be accessed via the Docker network (also called httpbin in this case) by <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a>, which has access to the same network. You may need to adapt <code class="language-plaintext highlighter-rouge">upstreams</code> in <code class="language-plaintext highlighter-rouge">oauth2-proxy.cfg</code> for your application and restart the <a href="https://oauth2-proxy.github.io/oauth2-proxy/">oauth2-proxy</a> container for changes to take effect.</p>]]></content><author><name></name></author><category term="synology"/><category term="networking"/><category term="security"/><category term="tutorial"/><summary type="html"><![CDATA[A Synology NAS can be convenient for hosting various Docker-based applications for personal use. Some applications have authentication built in, but others don’t. 
The latter approach is not unreasonable: it doesn’t make sense for every application to include its own custom user management and authentication.]]></summary></entry><entry><title type="html">Highlighting parts of lines in minted</title><link href="https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted/" rel="alternate" type="text/html" title="Highlighting parts of lines in minted"/><published>2025-07-18T00:00:00+00:00</published><updated>2025-07-18T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted</id><content type="html" xml:base="https://sim642.eu/blog/2025/07/18/highlighting-parts-of-lines-in-minted/"><![CDATA[<p><a href="/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/">To repeat</a>, I prefer to use the <a href="https://github.com/gpoore/minted/">minted</a> package for typesetting syntax-highlighted code in LaTeX. For example:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; i &lt; 100; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>This is rendered as:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>For completeness, here’s the preamble used for examples throughout this post:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>minted<span class="p">}</span>
<span class="k">\setminted</span><span class="p">{</span>style=tango<span class="p">}</span>
<span class="k">\usepackage</span><span class="na">[svgnames]</span><span class="p">{</span>xcolor<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>inconsolata<span class="p">}</span>
</code></pre></div></div> <h2 id="full-line-highlight">Full line highlight</h2> <p>Occasionally it is useful to highlight a particular line of code. Conveniently, <a href="https://github.com/gpoore/minted/">minted</a> provides the <code class="language-plaintext highlighter-rouge">highlightlines</code> option to do so by line number:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos,highlightlines=1]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; i &lt; 100; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to no highlight (hover/swipe across the image):<sup id="fnref:highlightlines-offset"><a href="#fn:highlightlines-offset" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/line-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/line-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Full line highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h2 id="partial-line-highlight">Partial line highlight</h2> <p>Occasionally it would be useful to not highlight the entire line of code but only a 
part of it. Unfortunately, <a href="https://github.com/gpoore/minted/">minted</a> does not provide a straightforward way to do so. So let’s try to come up with a custom solution!</p> <h3 id="attempt-1">Attempt 1</h3> <p>The obvious way is to use a <code class="language-plaintext highlighter-rouge">\colorbox</code> within <a href="https://github.com/gpoore/minted/">minted</a>’s <code class="language-plaintext highlighter-rouge">escapeinside</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span>#1<span class="p">}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The color <code class="language-plaintext highlighter-rouge">LightCyan</code> is just what <a href="https://github.com/gpoore/minted/">minted</a> has as default for <code class="language-plaintext highlighter-rouge">highlightcolor</code>.</p> <p>The following image comparison illustrates the change in rendering compared to full line highlight:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/line-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/line-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/line-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Full line highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 1)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>This has multiple issues. 
For one, the characters in the <code class="language-plaintext highlighter-rouge">\colorbox</code> don’t align with the rest of the columns of monospaced text.</p> <h3 id="attempt-2">Attempt 2</h3> <p>This is caused by the default value of <code class="language-plaintext highlighter-rouge">\fboxsep</code>, which is used to pad the contents of the <code class="language-plaintext highlighter-rouge">\colorbox</code> on all sides. To avoid the padding causing misalignment, it can be set to <code class="language-plaintext highlighter-rouge">0pt</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span>#1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to the first attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 1)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 2)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>The columns of monospaced characters now align, but the highlighting is very tight, also vertically (unlike <code class="language-plaintext highlighter-rouge">highlightlines</code>).</p> <h3 id="attempt-3">Attempt 3</h3> <p>We could do some trickery to effectively get different horizontal and vertical <code class="language-plaintext highlighter-rouge">\fboxsep</code>. 
However, it’s much easier to just insert a <code class="language-plaintext highlighter-rouge">\strut</code>, which is an invisible zero-width box that (more-or-less) accounts for the line height:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span><span class="k">\strut</span> #1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{</span>i &lt; 100<span class="p">}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>The following image comparison illustrates the change in rendering compared to the second attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 2)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 3)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>This fixes the alignment and padding issues,<sup id="fnref:strut-offset"><a href="#fn:strut-offset" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> but there’s still a noticeable problem: the text, which we have highlighted for the user using the background color, is not being code highlighted by <a href="https://github.com/gpoore/minted/">minted</a> at all.</p> <h3 id="attempt-4-solution">Attempt 4 (solution)</h3> <p>As far as I 
know, it’s not really possible to put <code class="language-plaintext highlighter-rouge">\mintinline</code> inside the <code class="language-plaintext highlighter-rouge">minted</code> environment to somehow try to fix this. Instead, we can do something strange with how our <code class="language-plaintext highlighter-rouge">\codehighlight</code> macro is used:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\newcommand</span><span class="p">{</span><span class="k">\codehighlight</span><span class="p">}</span>[1]<span class="p">{{</span><span class="k">\setlength</span><span class="p">{</span><span class="k">\fboxsep</span><span class="p">}{</span>0pt<span class="p">}</span><span class="k">\colorbox</span><span class="p">{</span>LightCyan<span class="p">}{</span><span class="k">\strut</span> #1<span class="p">}}}</span>
<span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>c<span class="p">}</span>
for (int i = 0; |<span class="k">\codehighlight</span><span class="p">{{</span>|i &lt; 100|<span class="p">}}</span>|; i++) <span class="p">{</span>
    printf("i = <span class="c">%d\n", i);</span>
<span class="p">}</span>
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>Previously, we had a single <code class="language-plaintext highlighter-rouge">escapeinside</code> that contained the macro applied to its contents. Now, there are two <code class="language-plaintext highlighter-rouge">escapeinside</code>s: one before the contents and one after:</p> <ol> <li>The one before only calls the macro and starts its argument. Importantly, this needs double <code class="language-plaintext highlighter-rouge">{{</code> because <a href="https://github.com/gpoore/minted/">minted</a> puts the <code class="language-plaintext highlighter-rouge">escapeinside</code> contents into a group (i.e. between braces like <code class="language-plaintext highlighter-rouge">{\codehighlight{{}</code>). The first brace starts the macro argument, the second is only there to cancel out the group-closing brace. Without the latter, <code class="language-plaintext highlighter-rouge">\codehighlight</code> would just be given an empty argument.</li> <li>The one after only ends the macro argument. Analogously, this needs double <code class="language-plaintext highlighter-rouge">}}</code> (i.e. grouped like <code class="language-plaintext highlighter-rouge">{}}}</code>). 
The first brace is only there to cancel out the group-opening brace and the second ends the macro argument.</li> </ol> <p>The following image comparison illustrates the change in rendering compared to the third attempt:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 3)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 4)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <p>The strange amalgamation fixes the code highlighting “within” <code class="language-plaintext highlighter-rouge">\codehighlight</code>.</p> <p>Finally, this image comparison illustrates the change in rendering compared to no highlight:</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="56.2"> <figure slot="first"> 
<picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/no-highlight-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/no-highlight-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/no-highlight.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="No highlight" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-480.webp 480w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-800.webp 800w,/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/highlighting-parts-of-lines-in-minted/partial-highlight-v4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Partial line highlight (attempt 4)" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:highlightlines-offset"> <p>Only now, when producing the comparison images for this post, I noticed that <code class="language-plaintext highlighter-rouge">highlightlines</code> seems to slightly reduce line spacing. I’m not sure why, seems like a bug. <a href="#fnref:highlightlines-offset" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:strut-offset"> <p>Although our <code class="language-plaintext highlighter-rouge">\colorbox</code> now also seems to slightly reduce line spacing. I’m not sure why, but this will be irrelevant in the end. 
<a href="#fnref:strut-offset" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[To repeat, I prefer to use the minted package for typesetting syntax-highlighted code in LaTeX. For example: \begin{minted}[linenos]{c} for (int i=0; i &lt; 100; i++) { printf("i = %d\n", i); } \end{minted} This is rendered as:]]></summary></entry><entry><title type="html">Referencing lines in fancyvrb/minted</title><link href="https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/" rel="alternate" type="text/html" title="Referencing lines in fancyvrb/minted"/><published>2025-05-01T00:00:00+00:00</published><updated>2025-06-03T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted</id><content type="html" xml:base="https://sim642.eu/blog/2025/05/01/referencing-lines-in-fancyvrb-minted/"><![CDATA[<p>I prefer to use the <a href="https://github.com/gpoore/minted/">minted</a> package for typesetting syntax-highlighted code in LaTeX. Occasionally it is necessary to refer to a particular line of code by its number in the accompanying text. Of course one can just hard-code the line numbers into the text, but that’s not very TeX-like. This becomes very error-prone when the code needs to be modified and all hard-coded line number references manually synchronized. So I want this to be automatic, just like numbering and referencing of sections, figures, etc.</p> <p>Luckily, <a href="https://github.com/gpoore/minted/">minted</a> builds on the much older <a href="https://ctan.org/pkg/fancyvrb?lang=en">fancyvrb</a> package, which provides basic support for referencing lines. 
For example, the desired line can be marked with <code class="language-plaintext highlighter-rouge">\label</code> via <code class="language-plaintext highlighter-rouge">escapeinside</code>:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">\begin{minted}</span>[linenos,escapeinside=||]<span class="p">{</span>python<span class="p">}</span>
print("foo")
print("bar")|<span class="k">\label</span><span class="p">{</span>ln:bar<span class="p">}</span>|
print("baz")
<span class="nt">\end{minted}</span>
</code></pre></div></div> <p>Then elsewhere <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> expands to “2”.</p> <h2 id="problem">Problem</h2> <p>Unfortunately, that’s also where the support ends. In particular, fancyvrb is <strong>incompatible with</strong> two other often-used packages:</p> <ol> <li> <p>The <strong><a href="https://ctan.org/pkg/hyperref?lang=en">hyperref</a></strong> package makes <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> a hyperlink, but it’s a link to the wrong place! The link doesn’t go to the specific line of code, nor even to the beginning of the <code class="language-plaintext highlighter-rouge">minted</code> environment it is in. Instead, it goes to whatever happens to be the previous hyperref anchor at the point of <code class="language-plaintext highlighter-rouge">\label{ln:bar}</code>, e.g. a previous section heading, figure caption, etc. If you’re lucky, this might at least be on the same page as the code, but it could also be many pages before.</p> </li> <li> <p>The <strong><a href="https://ctan.org/pkg/cleveref?lang=en">cleveref</a></strong> package adds <code class="language-plaintext highlighter-rouge">\cref</code> and friends, which prefix the reference number with its kind, e.g. <code class="language-plaintext highlighter-rouge">\cref{sec:introduction}</code> might expand to “section 1” instead of just “1”. The incompatibility with fancyvrb is similar to the one with hyperref: <code class="language-plaintext highlighter-rouge">\cref{ln:bar}</code> expands to a complete reference to something before the code, e.g. “section 1”. Not only is the “section” prefix wrong, but the number isn’t even the line number “2”. 
(At least it’s self-consistent: the number is for whatever is being referenced instead.)</p> </li> </ol> <h3 id="texnical-details">TeXnical details</h3> <p>Deep inside LaTeX, the problem stems from the fact that both hyperref and cleveref achieve their functionality by redefining <code class="language-plaintext highlighter-rouge">\refstepcounter</code>. However, fancyvrb does not use that standard command and instead defines its own version:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\def\FV</span>@refstepcounter#1<span class="p">{</span>
  <span class="k">\stepcounter</span><span class="p">{</span>#1<span class="p">}</span>
  <span class="k">\protected</span>@edef<span class="k">\@</span>currentlabel<span class="p">{</span><span class="k">\csname</span> p@#1<span class="k">\endcsname\arabic</span><span class="p">{</span>FancyVerbLine<span class="p">}}</span>
<span class="p">}</span>
</code></pre></div></div> <p>It explicitly uses <code class="language-plaintext highlighter-rouge">\arabic{FancyVerbLine}</code> instead of <code class="language-plaintext highlighter-rouge">\theFancyVerbLine</code>, which the standard definition would use. As far as I can tell, it needs to do that because it defines the latter as:</p> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\def\theFancyVerbLine</span><span class="p">{</span><span class="k">\rmfamily\tiny\arabic</span><span class="p">{</span>FancyVerbLine<span class="p">}}</span>
</code></pre></div></div> <p>which is weird because it puts the number formatting (for display in the code environment) into the counter value formatting itself, which no other sensible counter would do. And <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> is explicitly there to <em>not</em> make the number tiny at <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code>.</p> <h2 id="non-solutions">Non-solutions</h2> <p>There have been some attempts at avoiding the incompatibilities:</p> <ol> <li><a href="https://www.reddit.com/r/LaTeX/comments/zaohbs/how_do_i_reference_a_line_of_code_in_minted/">A reddit thread</a> suggests using <code class="language-plaintext highlighter-rouge">|\phantomsection\label{ln:bar}|</code> in the <code class="language-plaintext highlighter-rouge">minted</code> environment instead. The <code class="language-plaintext highlighter-rouge">\phantomsection</code> provided by hyperref inserts a fresh anchor which <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code> will link to. This comes with two problems: <ol> <li>It’s outright annoying to have to <em>manually</em> insert <code class="language-plaintext highlighter-rouge">\phantomsection</code> before every <code class="language-plaintext highlighter-rouge">\label</code> in code.</li> <li>The anchor inserted by <code class="language-plaintext highlighter-rouge">\phantomsection</code> is not at the beginning of the line of code, but at the specific column where the escaped label is placed. 
One could put all code line labels at the beginning of lines, but that makes the code even less readable: at the end of lines the code at least maintains its indentation in the LaTeX sources.</li> </ol> </li> <li><a href="https://github.com/muzimuzhi/latex-examples/blob/78e3e1a55d30c648efba74dd99a94fb17b9da3a7/examples/fancyvrb-improvements.tex">One file on GitHub</a> uses <code class="language-plaintext highlighter-rouge">\let\FV@refstepcounter\refstepcounter</code> to make fancyvrb use the standard command and thus allow hyperref/cleveref to modify it as usual. In order to avoid the tiny numbers at <code class="language-plaintext highlighter-rouge">\ref{ln:bar}</code>, it goes on to patch fancyvrb in various places to move the <code class="language-plaintext highlighter-rouge">\tiny</code> to a more appropriate place by adding <code class="language-plaintext highlighter-rouge">numberstyle</code> customization option. This is morally the right approach, but unfortunately also comes with two problems: <ol> <li>It breaks fancyvrb’s <code class="language-plaintext highlighter-rouge">firstnumber</code> option and causes the first two lines to have the same number. Seems like fancyvrb’s logic is very particular to its oddities.</li> <li>The resulting hyperref anchors are based on the <code class="language-plaintext highlighter-rouge">FancyVerbLine</code> counter, which isn’t globally unique. 
So <code class="language-plaintext highlighter-rouge">pdflatex</code> warns about duplicate destinations and all of them link into the first <code class="language-plaintext highlighter-rouge">minted</code> environment, not the one where the <code class="language-plaintext highlighter-rouge">\label</code> actually is.</li> </ol> </li> <li><a href="https://github.com/wsmoses/Paper-MEng/blob/4272960994d5b43a4959225d33999b59facfb1c3/codehilite.sty#L37-L47">Another file on GitHub</a> only patches <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> to add <code class="language-plaintext highlighter-rouge">\refstepcounter</code> of an additional global counter, avoiding the last issue. But it’s also not well-behaved, at least with minted: extra empty space appears before <code class="language-plaintext highlighter-rouge">minted</code> environments. I believe this is because minted works in two phases: <ol> <li>The contents of a <code class="language-plaintext highlighter-rouge">minted</code> environment are not typeset, but written to a file (to run Pygments on). While doing so, fancyvrb still steps the counter which inserts spurious hyperref anchors but nothing else is typeset yet.</li> <li>After running Pygments, its result is somehow input into some fancyvrb environment for actual typesetting. This is what actually displays the syntax-highlighted code along with the line numbers (which are produced by stepping the line counter again).</li> </ol> </li> <li><a href="https://tex.stackexchange.com/a/410667/383946">A TeX StackExchange</a> answer defines alternative <code class="language-plaintext highlighter-rouge">\label</code> and <code class="language-plaintext highlighter-rouge">\ref</code> commands to use just for lines of code, which isn’t entirely satisfactory (one doesn’t need separate commands for sections, figures, etc.). 
It bypasses the usual hyperref anchor mechanism and uses <code class="language-plaintext highlighter-rouge">\hypertarget</code> and <code class="language-plaintext highlighter-rouge">\hyperlink</code> to directly work with custom PDF destinations. It also tries to provide a hint for cleveref, but admits that it still produces a wrong reference.</li> </ol> <h2 id="solution">Solution</h2> <p>After digging deep into the implementation of fancyvrb, minted, hyperref and cleveref to (try to) understand how they work, I managed to put together a solution which seems to achieve the desired functionality without any of the downsides listed above. I wrapped my solution into <strong>my new <a href="https://github.com/sim642/fancyvrbref">fancyvrbref</a> package</strong>. I haven’t (yet) published it on CTAN, but you can just copy it into your project for the time being.</p> <p>The solution is the following:</p> <ol> <li>For hyperref compatibility, the fancyvrb internal command for actually typesetting a line is patched to insert a globally unique anchor with <code class="language-plaintext highlighter-rouge">FancyVerbLine*</code> prefix. 
Since this isn’t in <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code>, it doesn’t screw up minted.</li> <li>For cleveref compatibility, the counter <code class="language-plaintext highlighter-rouge">FancyVerbRefLine</code> is defined as an alias for <code class="language-plaintext highlighter-rouge">FancyVerbLine</code>, but without tiny font, and the <code class="language-plaintext highlighter-rouge">\FV@refstepcounter</code> command is extended to define <code class="language-plaintext highlighter-rouge">\cref@currentlabel</code> (for cleveref), <code class="language-plaintext highlighter-rouge">\@currentlabel</code> (for consistency with <code class="language-plaintext highlighter-rouge">\ref</code>), and <code class="language-plaintext highlighter-rouge">\@currentcounter</code> (for cleveref with LaTeX2e format since 2024-11-01). This avoids modifying <code class="language-plaintext highlighter-rouge">\theFancyVerbLine</code>.</li> </ol> <p>Let me know if you find this package useful or find any issues with it!</p> <hr/>]]></content><author><name></name></author><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[I prefer to use the minted package for typesetting syntax-highlighted code in LaTeX. Occasionally it is necessary to refer to a particular line of code by its number in the accompanying text. Of course one can just hard-code the line numbers into the text, but that’s not very TeX-like. This becomes very error-prone when the code needs to be modified and all hard-coded line number references manually synchronized. 
So I want this to be automatic, just like numbering and referencing of sections, figures, etc.]]></summary></entry><entry><title type="html">Trends in UniTartuCS theses</title><link href="https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses/" rel="alternate" type="text/html" title="Trends in UniTartuCS theses"/><published>2025-04-13T00:00:00+00:00</published><updated>2025-07-06T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses</id><content type="html" xml:base="https://sim642.eu/blog/2025/04/13/trends-in-unitartucs-theses/"><![CDATA[<p>The <a href="https://cs.ut.ee/en/">Institute of Computer Science</a> at the <a href="https://ut.ee/en/">University of Tartu</a> (UniTartuCS for short) has a (new) <a href="https://thesis.cs.ut.ee/">register for bachelor’s and master’s theses</a>. Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.</p> <p>As of July 6, 2025,<sup id="fnref:updated"><a href="#fn:updated" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> the register contained <strong>2555 theses in total</strong>. I have excluded some from the following analysis:</p> <ul> <li>8 theses from before 2010, because it seems like the data for those years is incomplete.</li> </ul> <p>This leaves <strong>2547 theses</strong> for the analysis. Note that the data for 2025 is not yet final.</p> <h2 id="word-processors-used">Word processors used</h2> <p>The institute provides thesis templates for Microsoft Word and LaTeX. I wanted to find out how much each of them is used by the students.</p> <p>This is complicated by the fact that theses are submitted as PDFs. Luckily, PDF file metadata contains two fields which give a lot of insight: PDF creator and PDF producer. 
Although the content of these fields is not standardized, it’s not as messy as <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Browser_detection_using_the_user_agent">web browser User-Agent headers</a> (yet). Working with the data, I reached the following classification:</p> <ul> <li><strong>Microsoft Word</strong> if the PDF creator matches <code class="language-plaintext highlighter-rouge">Microsoft®? (Office )?Word|Acrobat PDFMaker .* for Word</code>.</li> <li><strong>TeX</strong> if the PDF creator contains <code class="language-plaintext highlighter-rouge">TeX</code>.</li> <li><strong>Google Docs</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">Google Docs|Skia/PDF</code>.</li> <li><strong>LibreOffice</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">(Libre|Open)Office</code>.</li> <li><strong>Quartz</strong> if the PDF creator contains <code class="language-plaintext highlighter-rouge">Quartz PDFContext</code>. These are somehow created by macOS, but it’s unclear to me how.</li> <li><strong>Print</strong> if the PDF producer matches <code class="language-plaintext highlighter-rouge">Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer</code>. These are various PDF printers.</li> <li><strong>Unknown</strong> otherwise.</li> </ul> <h3 id="overall">Overall</h3> <p>First, let’s look at the overall word processor breakdown:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "total_count"
      }],
      "groupby": []
    },
    {
      "calculate": "datum.count / datum.total_count",
      "as": "fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 400,
  "mark": "arc",
  "encoding": {
    "theta": {
      "field": "count",
      "type": "quantitative",
      "aggregate": "sum",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "fraction",
        "format": ".0%",
        "title": " Theses"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This shows that <strong>Microsoft Word is used slightly more than LaTeX</strong>. Notably, Word-like WYSIWYG editors make up the majority.</p> <p>Now, let’s dig a little deeper to see how the breakdown depends on the year and the curriculum.</p> <h3 id="by-year">By year</h3> <p>Second, let’s look at the word processor breakdown across the years:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "defence_year"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "year_count"
      }],
      "groupby": ["defence_year"]
    },
    {
      "calculate": "datum.count / datum.year_count",
      "as": "year_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "year_fraction",
        "format": ".0%",
        "title": "Theses (of year)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This reveals two main trends:</p> <ol> <li>OpenOffice/LibreOffice usage has mostly diminished.</li> <li><strong>Google Docs usage has become widespread.</strong></li> </ol> <p>The latter is worrying because Google Docs is (in my opinion) inadequate for typesetting a thesis. Having supervised and reviewed a number of theses (although relatively few compared to senior staff members), I can often tell from the poor and inconsistent formatting alone that a thesis has been typeset in Google Docs. Importing the Microsoft Word template into Google Docs is a lossy conversion because Docs has limited features and customizability, even compared to Word.</p> <h3 id="by-curriculum">By curriculum</h3> <p>Third, let’s look at the relationship between word processor usage and curricula:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": {
    "step": 50
  },
  "height": {
    "step": 50
  },
  "mark": "rect",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_classification",
      "title": "PDF creator",
      "sort": [
        "TeX",
        "Unknown",
        "Print",
        "Quartz",
        "Google Docs",
        "LibreOffice",
        "Microsoft Word"
      ]
    },
    "color": {
      "aggregate": "count",
      "type": "quantitative",
      "scale": {
        "type": "log"
      },
      "title": "Theses"
    },
    "tooltip": {
      "aggregate": "count",
      "type": "quantitative"
    }
  }
}
</code></pre> <p>Although the heatmap shows some trends, the logarithmic color scale<sup id="fnref:curriculum-log"><a href="#fn:curriculum-log" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> makes exact comparisons difficult. Thus, let’s look at the word processor breakdown across different curricula in a different way:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "curriculum"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "curriculum_count"
      }],
      "groupby": ["curriculum"]
    },
    {
      "calculate": "datum.count / datum.curriculum_count",
      "as": "curriculum_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "curriculum_fraction",
        "format": ".0%",
        "title": "Theses (of curriculum)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}
</code></pre> <p>This reveals the following:</p> <ol> <li>Over 70% of “BSc - Computer Science” students use Word-like WYSIWYG editors and only 20% use LaTeX.</li> <li><strong>LaTeX usage is most popular among “MSc - Computer Science” and “MSc - Data Science” students.</strong> This is probably motivated by the need to typeset more mathematics or do more data visualization.</li> <li>LaTeX usage is (almost) nonexistent among “MSc - Conversion Master in IT” and “MA - Teacher of Mathematics and Informatics” students. This is probably because students in those curricula are less technical.</li> </ol> <h2 id="page-count-by-curriculum">Page count by curriculum</h2> <p>There’s another piece of PDF file metadata that can be analyzed: PDF page count. It only makes sense to consider curricula separately for this because the expected page counts (set by the guidelines) differ:</p> <ul> <li>Bachelor’s theses should be ~20 pages (excluding appendices).</li> <li>Master’s theses should be 40-50 pages (excluding appendices).</li> </ul> <p>Since the PDF files also contain the appendices, the PDF page count can be expected to be higher.</p> <p>From the following plots I’ve additionally excluded the following outliers:</p> <ul> <li>A 373-page thesis, because it would screw with the scale of the plots.</li> <li>All theses with under 11 pages, because these appear to be abstracts for theses with publishing restrictions and would skew the results.</li> </ul> <h3 id="overall-1">Overall</h3> <p>First, let’s look at the thesis page count statistics by curricula:</p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 &amp;&amp; datum.pdf_pages &gt;= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "height": 350,
  "mark": {
    "type": "boxplot",
    "size": 30,
    "ticks": true
  },
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "title": "Pages"
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative"
    }
  }
}
</code></pre> <h3 id="by-year-1">By year</h3> <p>Second, let’s look at the thesis page count average across the years, still by curricula: <em>(click on a curriculum name in the legend for a more focused view)</em></p> <pre><code class="language-vega_lite">{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-07-06-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year &gt;= 2010 &amp;&amp; datum.defence_year &lt;= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 &amp;&amp; datum.pdf_pages &gt;= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 650,
  "height": 350,
  "mark": {
    "type": "line",
    "point": true
  },
  "params": [{
    "name": "curriculum_name",
    "select": {"type": "point", "fields": ["curriculum_name"]},
    "bind": "legend"
  }],
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "title": "Pages (average)"
    },
    "color": {
      "field": "curriculum_name",
      "type": "nominal",
      "title": "Curriculum",
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "format": ".1f"
    },
    "opacity": {
      "condition": {"param": "curriculum_name", "value": 1},
      "value": 0.2
    }
  }
}
</code></pre> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:updated"> <p>The post has been updated with 2025 data: initially the data was as of March 16, 2025 and excluded theses from 2025. None of the findings have changed with the update. <a href="#fnref:updated" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:curriculum-log"> <p>The logarithmic scale is necessary because the thesis counts differ in orders of magnitude. A linear color scale would be dominated by a few most popular combinations. <a href="#fnref:curriculum-log" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="teaching"/><category term="university"/><category term="typesetting"/><category term="latex"/><summary type="html"><![CDATA[The Institute of Computer Science at the University of Tartu (UniTartuCS for short) has a (new) register for bachelor’s and master’s theses. 
Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.]]></summary></entry><entry><title type="html">My (not-so-great) experience with switching Android phones in 2025</title><link href="https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025/" rel="alternate" type="text/html" title="My (not-so-great) experience with switching Android phones in 2025"/><published>2025-03-21T00:00:00+00:00</published><updated>2025-05-01T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025</id><content type="html" xml:base="https://sim642.eu/blog/2025/03/21/my-not-so-great-experience-with-switching-android-phones-in-2025/"><![CDATA[<p>At the beginning of February I switched from my old <a href="https://www.gsmarena.com/samsung_galaxy_a52-10641.php">Samsung Galaxy A52</a> to a new <a href="https://www.gsmarena.com/samsung_galaxy_s25+-13609.php">Samsung Galaxy S25+</a>, which had just been released. One can find endless complaining online about how the S25 series isn’t worth it, which might be the case for someone coming from the S24 series. However, I’m doing a 4-year leap from a mid-range phone that has become genuinely problematic:</p> <ol> <li>There’s a <a href="https://www.reddit.com/r/GalaxyA52/comments/15af16f/a52_back_panel_peeling/">well</a>-<a href="https://www.reddit.com/r/GalaxyA52/comments/15klqaq/a52s_adhesive_failing_after_six_months/">known</a> <a href="https://www.reddit.com/r/GalaxyA52/comments/x6ijmb/anyone_else_have_this_issue_my_a52_4gs_back_is/">issue</a> with the A52’s plastic back panel coming off.
This issue also started to develop on my phone in July 2024, but I alleviated it by getting a phone case (for the first time in my life).<sup id="fnref:a52case"><a href="#fn:a52case" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></li> <li>The A52 got its <a href="https://www.androidupdatetracker.com/p/samsung-galaxy-a52">last Android update (to 14) in January 2024</a>.</li> <li>I’ve increasingly found it frustratingly sluggish.</li> <li>Its <em>virtual</em> proximity sensing is quite inaccurate: the screen often activates in the pocket and even triggers presses. In particular, such presses would often mess with media controls which are directly accessible from the lock screen (e.g. randomly seeking, skipping to the next podcast).</li> </ol> <p>Anyway, I digress. I used Samsung Smart Switch to transfer everything from the old phone to the new one. Unfortunately, <em>everything</em> does not get automatically transferred, and I’m writing this post to document/complain/rant about everything that did not go smoothly. This wasn’t my first time switching Android phones, by the way: in 2021 I went from a <a href="https://www.gsmarena.com/samsung_galaxy_s6_edge+-7467.php">Samsung Galaxy S6 edge+</a> to the A52. I don’t recall that switch being as painful as this one, but I may have blocked out those memories or I simply wasn’t yet using many of the problematic apps at the time.</p> <p>Before getting into the bad, I want to briefly touch on the good. I was particularly glad to see that all of my workout data in <a href="https://www.gymrun.app/">GymRun</a> was smoothly and automatically transferred. I had prepared myself for something worse and had made explicit backups on my old phone, such that I could restore them on the new one.
Luckily that didn’t turn out to be necessary.</p> <p>But now to the bad…</p> <h2 id="meta-apps">Meta apps</h2> <h3 id="whatsapp">WhatsApp</h3> <p>Transferring WhatsApp data, particularly chat history, between phones is notorious for being problematic and I was aware of that. Even Smart Switch has an explicit screen about it. So, on the old phone I made sure to have chat history backed up to a Google Drive account. But that didn’t prepare me for what was going to happen.</p> <p>Following very reasonable advice online, I moved my SIM card to my new phone before ever starting it up, such that I could set it up with everything in place. After logging into WhatsApp with my phone number on the new phone, my chats were there, but all with empty history! Digging into the respective menus (Settings → Chats → Chat backup), I see that there isn’t any backup.</p> <p>Naturally, I go back to WhatsApp on my old phone to check. Surprisingly, this turns into a whole fiasco, although I cannot recall the exact sequence of events and all the details. Basically, because I’ve already logged into WhatsApp on my new phone, the old one doesn’t let me into the app to even check the backup (or do any backup-related things) and wants me to log in again. But at this point my SIM is in the new phone. Somehow the login still works, at least initially.
I guess WhatsApp is fine with not re-verifying the phone number or even having the SIM?</p> <h4 id="chat-backup-view">Chat backup view</h4> <p>Anyway, here’s roughly what the Chat backup view looked like on the old phone:</p> <div class="row justify-content-center mt-3"> <div class="col-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/Screenshot_20250216_102524_WhatsApp-480.webp 480w,/assets/Screenshot_20250216_102524_WhatsApp-800.webp 800w,/assets/Screenshot_20250216_102524_WhatsApp-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/Screenshot_20250216_102524_WhatsApp.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="WhatsApp Chat backup view" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>I had pressed “Back up”, I had selected Google for backup storage, I had selected the Google account and authorized it. <strong>But that’s not enough!</strong> Given the <em>default</em> Frequency setting, I should’ve pressed “Back up” again after setting up Google storage for WhatsApp backups.</p> <p>Perhaps I should’ve noticed the “Last Backup: Never” but I would argue that this view is weird and unintuitive. Following it top-down yields the following order of operations:</p> <ol> <li>It first describes backing up to Google for the purpose of switching phones. Great, that’s exactly what I want!</li> <li>Then there’s a big green button to supposedly do it. But at this point only a local backup would be made. (There’s no description of local backups anywhere, by the way.)</li> <li>Then I can choose and authorize a Google account.</li> </ol> <p>In my opinion, the sensible design would be to have Google account selection before doing the backup because it’s necessary for the use case described at the top.</p> <p>Furthermore, the backup status text above the “Back up” button is confusing. 
The “Last Backup” there seems to refer to the last <em>Google</em> backup, while “Local” seems to refer to the last <em>local</em> backup (again, unexplained).</p> <p>Anyway, I pressed “Back up” again to have chat history backed up to Google for real this time.</p> <h4 id="race-condition">Race condition</h4> <p>Now I want to restore the backup I just made on my new phone. Because I logged into WhatsApp on my old phone again, it logged the new phone out. Since I already got into the old phone, I think I triggered the login process on the new phone before I was done with all the backup stuff on the old one. And this seems to have caused a race condition.</p> <p>The new phone was in the middle of the login process: my phone number was already entered (probably just autofilled from before) and I pressed some button to continue. And then <strong>it just got stuck “Initializing…”</strong>:</p> <ul> <li>Killing WhatsApp didn’t help: it opened right back up in the middle of “Initializing…”.</li> <li>Deleting WhatsApp data didn’t help either: the “Clear data” button in Android’s settings for WhatsApp just took me back to WhatsApp, which was “Initializing…”. I expected to see the usual warning about causing loss of data and unexpected behavior before agreeing, but somehow WhatsApp does something I had never seen before. I guess it tried to send me into WhatsApp itself to do some data clearing, but that didn’t go to the right screen because it was “Initializing…”.</li> </ul> <p>I’m not sure if I did anything more or just waited long enough, but some time later, opening WhatsApp again, it had given up on “Initializing…” and allowed me to do the login from scratch. This time, without a race condition, the login worked instantly.</p> <h4 id="transfer-chats">Transfer chats</h4> <p>But what still didn’t work was restoring the backup. It didn’t happen automatically: maybe because I hadn’t gone into WhatsApp settings on the new phone to authorize it for Google Drive?
(Although it’s impossible to get into those settings without logging in first, which would be a cyclic dependency.) Surprisingly, I couldn’t do it manually either: <strong>there’s no “Restore” button!</strong> So however it’s supposed to work, it didn’t, and I went through all the Google storage backup trouble for nothing.</p> <p>What I ended up doing was using the “Transfer chats” feature instead: it does the transfer without Google via some local wireless connection. And that actually worked! (Although it took surprisingly long given how few messages of chat history I even have on WhatsApp.)</p> <h4 id="no-excuses">No excuses</h4> <p>To be honest, I could’ve moved on fine without having my WhatsApp chat history. I don’t really use WhatsApp and have two small (dead) chats. However, I wanted to do the transfer purely on principle: it’s 2025 and it’s supposed to be possible.</p> <p>Not just possible, it’s not supposed to be this hard. The fact that WhatsApp is end-to-end encrypted and thus doesn’t store chat history on servers but only on client devices is no excuse for the process being so painful. Signal is very similar to WhatsApp (login by phone number, end-to-end encrypted chats, only local chat history), but its transfer process was very smooth. It’s unbelievable that in 2025 Meta cannot get this right.</p> <h2 id="google-apps">Google apps</h2> <h3 id="google-maps-timeline">Google Maps Timeline</h3> <p>Google Maps has had the Timeline feature, constantly tracking location for later viewing, for a long time. Recently, it was changed to no longer store location history on Google servers, but only on the phone itself.
Thus, to not lose all that location data (almost a decade of it in my case), it needs to be <em>manually</em> transferred when switching phones.</p> <p>Conveniently, Google Maps offers a way to (automatically) back that up on Google servers<sup id="fnref:google-maps-timeline-backup"><a href="#fn:google-maps-timeline-backup" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> and I had already set that up. Restoring that backup on the new phone wasn’t as easy as it should’ve been:</p> <ol> <li>On the new phone the “Your Timeline” button just wasn’t there. After some time it just appeared.</li> <li>Once it appeared, I could select “Import” for the backup from my old phone. It instantly said the backup was imported, but no old location data was actually present in the Timeline! Trying again some time later, it actually imported the data. <strong>So, manually check that the import worked!</strong> Some people on <a href="https://us.community.samsung.com/t5/Galaxy-S25/Google-Maps-Timeline-backup-import-not-working/td-p/3127065">Samsung Community forums</a> have reported similar issues with the Timeline on the S25 series, but it’s weird that Google Maps would have a bug that only appears on S25 phones: there shouldn’t be anything specific to these models.</li> </ol> <h3 id="google-calendar">Google Calendar</h3> <p>At least some (if not all) Google Calendar settings failed to transfer:</p> <ol> <li>The “Start of the week” day went back to Sunday from my chosen Monday.</li> <li>The selection of which calendars are synced to and shown on the phone.</li> </ol> <h3 id="gboard">Gboard</h3> <p>I’ve grown accustomed to Google’s Gboard over Samsung’s default keyboard.
Annoyingly, <em>nothing</em> about the keyboard was transferred:</p> <ol> <li>The default keyboard changed to Samsung’s.</li> <li>All the languages and layouts in Gboard reset to a single default one.</li> <li>All personal dictionaries were empty on the new phone.</li> <li>Probably all other Gboard settings went to default as well.</li> </ol> <h3 id="google-accounts">Google accounts</h3> <p>I have two Google accounts set up on my phone: a personal one and a work one. As expected, the personal one was transferred fully automatically, no login even needed if I remember correctly. Oddly enough, the work one did not get transferred at all.</p> <h2 id="authentication-apps">Authentication apps</h2> <h3 id="microsoft-authenticator">Microsoft Authenticator</h3> <p>I use Microsoft Authenticator for 2FA of my work account. Turns out, transferring Microsoft Authenticator data to a new phone, which can only be done via cloud backup, requires a <em>personal</em> Microsoft account. Andrew Wegner has written a blog post about this very issue: <a href="https://andrewwegner.com/ms-authenticator-without-personal-account.html">Moving MS Authenticator to a new phone without a personal account</a>. The only alternative seems to be manually re-adding all the accounts on the new phone…</p> <p>Lucky for me, since I only use this app for my work account, there were only two accounts to re-add. However, the personal account restriction is arbitrary and unnecessary: my workplace already pays Microsoft (probably exorbitant amounts of) money to store emails and OneDrive files in regulatorily-compliant ways. Surely, forcing users to use personal accounts for work-related authentication is <strong>not acceptable for organizational security</strong> and data compliance.</p> <h3 id="smart-id">Smart-ID</h3> <p><a href="https://www.smart-id.com/">Smart-ID</a> is a digital identification and signing service used in Estonia.
It offers a convenient alternative to the same features provided by the national ID card without requiring a computer with an ID card reader, while having equivalent legal binding in almost all cases.<sup id="fnref:e-voting"><a href="#fn:e-voting" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p> <p>Surprisingly, “transferring” a Smart-ID account from one device to another (<a href="https://www.smart-id.com/help/faq/registering/using-your-existing-smart-id-account-to-register-a-new-account/">Using your existing Smart-ID account to register a new account</a> in FAQ) requires that</p> <blockquote> <ol> <li>you registered your previous active account <strong>in a bank office</strong>, with the help of a bank teller</li> </ol> </blockquote> <p>That isn’t the case for me: I just registered for Smart-ID online on my own computer using an ID card.<sup id="fnref:smart-id-bank"><a href="#fn:smart-id-bank" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> If there’s a reason for this weird restriction, then nowhere is that explained.<sup id="fnref:smart-id-strength"><a href="#fn:smart-id-strength" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> The suggested alternative is just registering for a new Smart-ID account for the new phone on a computer, again using an ID card.</p> <h2 id="samsung-apps">Samsung apps</h2> <h3 id="galaxy-watch">Galaxy Watch</h3> <p>I had my Galaxy Watch 6 paired with my old phone. After Samsung Smart Switch, the watch suggested resetting itself before pairing with the new phone. It’s probably possible to skip the reset, however, some <a href="https://www.reddit.com/r/GalaxyS24Ultra/comments/1c6b2oo/camera_controller_missing/">Reddit threads</a> suggest doing so to unlock Galaxy Watch features that are (for no good reason) locked to Samsung S-series phones. 
For example, I couldn’t use my old A52 to access the Camera Controller on my watch, but the new S25+ would allow it.</p> <p>Resetting the watch is supposed to be a seamless experience: its data is backed up<sup id="fnref:galaxy-watch-cloud-backup"><a href="#fn:galaxy-watch-cloud-backup" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> and automatically restored after reset. It’s not perfect though: at least two things failed to back up/restore correctly:</p> <ol> <li>It restored the correct watch face, but not the complication customizations I had made.</li> <li>It did not restore the three favorite exercises on the Samsung Health Exercises tile.</li> </ol> <h2 id="other-apps">Other apps</h2> <h3 id="antennapod">AntennaPod</h3> <p><a href="https://9to5google.com/2023/09/26/google-podcasts-youtube-music/">When Google killed Google Podcasts</a> (or more precisely, announced it would), I switched to <a href="https://antennapod.org/">AntennaPod</a> for all my podcasting needs. It’s decentralized and open source, so I could be reasonably confident that my data is not held hostage. Hence, it’s another one of those apps whose data needs manual transferring to a new phone. Luckily <a href="https://antennapod.org/documentation/general/backup">AntennaPod provides import/export of its internal database</a> for my podcast subscriptions and listening statistics (very important), so I could get those across to my new phone without a problem. Unfortunately, not <em>everything</em> about AntennaPod gets transferred this way.</p> <p>For one, its database does not include app settings (which I don’t understand) nor does it provide any alternative way to transfer them. The only way is a manual transfer: have two phones side-by-side and navigate through all settings menus in parallel, changing settings on the new phone to match the old one. Either the database should include app settings or they should be separately transferable (e.g.
via usual Google app settings backup).</p> <p>The other thing which does not get transferred is downloaded podcast episodes. The argument seems to be that it’s too much data (easily gigabytes) to transfer via Google app backup (which has some size limit). However, there’s also no way to do it offline or manually! By default, it stores downloads in <code class="language-plaintext highlighter-rouge">/Android/data/</code> which on modern Android is inaccessible by file browsers (without root).</p> <h4 id="androiddata"><code class="language-plaintext highlighter-rouge">/Android/data/</code></h4> <p>There is a <a href="https://www.reddit.com/r/AndroidQuestions/comments/192vfjt/is_there_a_way_to_access_the_datadata_folder_in/">workaround</a>: Android’s own file chooser has access to those directories. Thus, there are <a href="https://play.google.com/store/apps/details?id=com.marc.files">apps</a> whose sole purpose is to open that file chooser (as if it’s a file browser). With that, it’s possible to go into the right directory and see all the downloaded podcast audio files, organized into subdirectories. It even seems possible to copy an individual file out of there. Weirdly enough, directories cannot be copied: their copying is non-recursive and only creates an empty directory at the target. There is an option to compress files and directories, but that too only works for individual files: compressing a directory produces just an empty directory in the archive. With dozens of downloads in different subdirectories, it would be tedious to copy them one by one.</p> <p>There’s <a href="https://www.reddit.com/r/Android/comments/wru35i/clearing_up_confusion_about_how_to_access/ikvfe39/">one more workaround</a> described online: when Android’s file chooser is opened in split screen with itself, one can move data by dragging it from one half of the split into the other. Well, I tried that, with <strong>catastrophic consequences</strong>.
Perhaps it would’ve worked with individual files, but I dragged and dropped a directory containing all the podcast downloads, which ended up creating yet another empty directory. Because this <em>moves</em> (not copies), all the original files were actually deleted, even though they weren’t successfully created in the target directory.<sup id="fnref:android-safe-move"><a href="#fn:android-safe-move" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p> <p>At this point I no longer had any downloads to transfer, so the only solution was redownloading them all on my new phone. Luckily all of them were still available, although there’s no guarantee (RSS feeds or individual episode files can easily disappear from the internet). A big reason to keep some (favorite) downloaded episodes is to safeguard against their disappearance. Therefore, AntennaPod should really provide a proper migration path. That is, by default store downloads in a sensible, accessible place where they can be transferred manually, or even automatically by Smart Switch (which transfers all user-accessible files outside <code class="language-plaintext highlighter-rouge">/Android/data</code>, regardless of app).</p> <p>The redownloading isn’t entirely straightforward either: the transferred database contains the list of downloaded episodes and their paths (in the inaccessible directory). Thus, on the new phone, <a href="https://github.com/AntennaPod/AntennaPod/issues/3037">AntennaPod shows them as downloaded, and without a download button</a>. Only when you go to play the episode does it realize that the file is missing and a download button re-appears. Alternatively, you can delete the downloads in AntennaPod (although there’s nothing to actually delete), which then also allows redownloading.
However, this is also very problematic: you have to remember what you had downloaded (or at least favorite those episodes just for this purpose).</p> <h3 id="nonplay-store-apps">Non–Play-Store apps</h3> <p>Although Smart Switch transfers many things directly phone-to-phone, it doesn’t seem to do that with the installed apps themselves. Rather, the new phone just installs them from Google Play Store (and perhaps Galaxy Store). Any apps previously installed on the old phone which are not available from those stores are not installed on the new phone:</p> <ol> <li>A handful of Play Store apps have been removed from the store (for whatever reason). Their names and icons are transferred to the new phone and show up in the apps list but grayed out. Launching them just goes to a Play Store view saying “Item not found”.<sup id="fnref:flappy-bird"><a href="#fn:flappy-bird" class="footnote" rel="footnote" role="doc-noteref">8</a></sup></li> <li>Samsung has a group of apps collectively known as “Good Lock” which allow additional customization of their One UI. For some reason they are not available via the app stores in Estonia, but they can be installed via APKs found online.<sup id="fnref:good-lock-store"><a href="#fn:good-lock-store" class="footnote" rel="footnote" role="doc-noteref">9</a></sup></li> <li>Any other apps installed unofficially from APKs, like YouTube ReVanced.</li> </ol> <p>Sure, Google and Samsung want people to only install apps from official sources to avoid malware, but not transferring others doesn’t help with that. I already installed them on my old phone and can just as well re-install them on the new phone, it’s just unnecessarily annoying. 
Arguably, it would be better to reuse the APKs from the old phone, which I know to be safe, than to force me to go online and download new APKs, risking new malware.</p> <h3 id="app-logins">App logins</h3> <p>A general annoyance is that pretty much all apps are logged out of my accounts on the new phone, even though the same apps are automatically installed. It’s such a tedious process to log in to each one again manually.</p> <h2 id="settings">Settings</h2> <h3 id="time">Time</h3> <p>After the transfer, which I did late enough in the day, I noticed that the new phone used the 12-hour format for the clock. So for some reason the 24-hour clock format setting did not get transferred, which seems like an obvious one.</p> <h3 id="home-screen">Home screen</h3> <p>Smart Switch does transfer home screen and apps screen layouts, although there was a catch. My old phone used a 5×5 grid, but the new one defaults to 5×6, and Smart Switch went with that default instead of what I had before. At least in this direction it’s harmless because only an extra empty row appeared.</p> <p>Although it transfers home screen widgets, their settings do not get transferred. In particular:</p> <ol> <li>For the AntennaPod widget I had configured which buttons it shows. For quite a while I thought the widget looked different because its slightly different size caused it to go with a different layout.</li> <li>For the GitHub widget I had chosen my one and only GitHub account to show. Thus, it only showed “Sign in” which took me to GitHub login (again), even though I had already logged into the GitHub app with the same account. Doing that login didn’t even set it to be shown on the widget. I had to manually choose it in the widget settings again.</li> </ol> <h3 id="wi-fi">Wi-Fi</h3> <p>The names and passwords (or whatever login details) of known Wi-Fi networks are transferred automatically, which is very convenient.
However, eduroam did not work out of the box: all of its (enterprise) login details seemed to have been transferred and authentication to eduroam seemed successful, but I had no network access. If I recall correctly, Android reported that it could not obtain an IP address. Not sure what that was about, so I just ended up reconfiguring eduroam from scratch and succeeded. Perhaps eduroam didn’t like some transferred certificate or whatever suddenly being used with a different MAC address?</p> <h2 id="data-usage">Data usage</h2> <p>Android records data usage both (and separately) for Wi-Fi and mobile data. These monthly statistics were not transferred from the old phone to the new one. Moreover, the related settings (mobile data warning, etc.) also were not. Although this didn’t pose a problem for me, it could be annoying for some people, especially when switching phones mid–billing-cycle.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:a52case"> <p>Luckily some were still available but not many. <a href="#fnref:a52case" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:google-maps-timeline-backup"> <p>So the whole thing just seems like privacy theater with the extra inconvenience of not being able to browse the Timeline via web browser anymore. <a href="#fnref:google-maps-timeline-backup" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:e-voting"> <p>E-voting is the only exception that I can immediately think of. And even there, I’m not sure if the limitation is legal or technical. <a href="#fnref:e-voting" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:smart-id-bank"> <p>At some point bank tellers were really pushing this for some reason (perhaps commission?), although anyone who could do online banking with an ID card could just register for Smart-ID on their own. And a grandma who doesn’t do online banking at all won’t have any use for Smart-ID either. 
<a href="#fnref:smart-id-bank" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:smart-id-strength"> <p>Perhaps Smart-ID is only legally binding if directly registered by a national ID card, not when indirectly registered by a previous Smart-ID account which was registered by a national ID card. This restriction then suggests that Smart-ID accounts registered in a bank are somehow legally “stronger”. <a href="#fnref:smart-id-strength" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:galaxy-watch-cloud-backup"> <p>Backups from the watch onto the phone work reliably, but having those also backed up to Samsung Cloud requires the stars to really align: <a href="https://www.sammobile.com/news/why-your-galaxy-watch-has-not-backed-up-to-samsung-cloud-in-ages/">Why your Galaxy Watch hasn’t backed up to Samsung Cloud in ages</a>. And the stupidest part is: there’s no way to do that <strong>manually</strong>. I believe this wasn’t the problem in my case though. <a href="#fnref:galaxy-watch-cloud-backup" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:android-safe-move"> <p>This means that at least some Android APIs implement a horrendously unsafe filesystem move operation which fails basic validation and atomicity requirements. This is Android itself, not any non-stock app! <a href="#fnref:android-safe-move" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:flappy-bird"> <p>So <a href="https://www.theguardian.com/technology/2014/feb/10/phones-flappy-bird-ebay-app-store">Flappy Bird phones</a> still remain a thing. <a href="#fnref:flappy-bird" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:good-lock-store"> <p>Oddly enough, once installed from APK, Galaxy Store lists them as installed in some places. 
<a href="#fnref:good-lock-store" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="android"/><category term="samsung"/><category term="usability"/><category term="rant"/><summary type="html"><![CDATA[In the beginning of February I switched from my old Samsung Galaxy A52 to a new Samsung Galaxy S25+, which was just released. One can find endless complaining online about how the S25 series isn’t worth it, which might be the case for someone coming from S24 series. However, I’m doing a 4-year leap from a mid-range phone that has become genuinely problematic: There’s a well-known issue with the A52’s plastic back panel coming off. This issue started to develop on my phone also in July 2024, but I alleviated it by getting a phone case (for the first time in my life).1 The A52 got its last Android update (to 14) in January 2024. I’ve increasingly found it frustratingly sluggish. Its virtual proximity sensing is quite inaccurate: the screen often activates in the pocket and even triggers presses. In particular, such presses would often mess with media controls which are directly accessible from the lock screen (e.g. randomly seeking, skipping to next podcast). Luckily some were still available but not many. 
&#8617;]]></summary></entry><entry><title type="html">Shifting dates and times when resetting a Moodle course</title><link href="https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course/" rel="alternate" type="text/html" title="Shifting dates and times when resetting a Moodle course"/><published>2025-02-08T00:00:00+00:00</published><updated>2025-03-16T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course</id><content type="html" xml:base="https://sim642.eu/blog/2025/02/08/shifting-dates-and-times-when-resetting-a-moodle-course/"><![CDATA[<p>When resetting a Moodle course for a new year/semester, it can be very tedious to update the availability and due dates for all activities/assignments/etc in the course. <a href="https://docs.moodle.org/405/en/Reset_course#Course_start_date">Moodle (supposedly) provides a convenient way to do this</a>:</p> <blockquote> <p><em>NOTE: If you set a new course start date, then all course dates will be shifted by the same amount.</em></p> </blockquote> <h2 id="goal">Goal</h2> <p>For example, suppose that the old course start date (and time) is set to <strong>14.02.2024 10:00</strong> and activity’s availability/due date (and time) is <strong>14.02.2024 12:00</strong> (i.e. 2 hours after the course start). Both intuitively and according to the quoted documentation, the new course start date should be <strong>12.02.2025 10:00</strong> if the same activity’s date should become <strong>12.02.2025 12:00</strong> (i.e. shifted forward by 2 days less than 1 year).</p> <h2 id="problem">Problem</h2> <p>Unfortunately (and unsurprisingly), Moodle is not intuitive nor properly documented. If you actually reset the course and choose <strong>12.02.2025 10:00</strong> as the new start date, then the same activity’s date becomes <strong>12.02.2025 22:00</strong> (note the incorrect hour).</p> <p>Good luck figuring out why! 
Also, it’s not so easy to just try again by offsetting the new start date, because resetting changed the old start date. This offers hours of fun trial and error just to reset a Moodle course.</p> <h3 id="code">Code</h3> <p>Alternatively, one can dig into Moodle source code (something every teacher using Moodle definitely can/wants to do) and eventually find the corresponding logic in <a href="https://github.com/moodle/moodle/blob/139a0ad5f0458caaff7506c8b26081eea1c85231/lib/moodlelib.php#L5076-L5077"><code class="language-plaintext highlighter-rouge">moodlelib.php</code></a>:</p> <div class="language-php highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Time part of course startdate should be zero.</span>
<span class="nv">$data</span><span class="o">-&gt;</span><span class="n">timeshift</span> <span class="o">=</span> <span class="nv">$data</span><span class="o">-&gt;</span><span class="n">reset_start_date</span> <span class="o">-</span> <span class="nf">usergetmidnight</span><span class="p">(</span><span class="nv">$data</span><span class="o">-&gt;</span><span class="n">reset_start_date_old</span><span class="p">);</span>
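<span class="c1">// Illustration (not in Moodle source): for old start 14.02.2024 10:00 and new start 12.02.2025 10:00,</span>
<span class="c1">// usergetmidnight() yields 14.02.2024 00:00, making $timeshift 10 hours larger than the</span>
<span class="c1">// intended start-to-start difference, so every activity date lands 10 hours too late.</span>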
</code></pre></div></div> <p>For some reason which is completely beyond me,<sup id="fnref:reason"><a href="#fn:reason" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> the date (and time) shift is not calculated using the old course start date directly. Instead, the <code class="language-plaintext highlighter-rouge">usergetmidnight</code> changes the old <strong>14.02.2024 10:00</strong> into <strong>14.02.2024 00:00</strong> (note the midnight hour) before calculating the difference (with the new course start date that still has user-set hour). In this case, that causes the shift to be 10 hours greater, causing the activity’s date to be shifted 10 hours later.</p> <p>This means that the undesired extra time shift depends on the old course start time instead of being constant. If it were the latter (e.g. because of some timezone issue), then it would be much simpler to compensate for by trial and error.</p> <p>According to <code class="language-plaintext highlighter-rouge">git blame</code>, that midnight calculation has been there for 17 years! It took 12 years for anyone to complain in a Moodle issue (<a href="https://tracker.moodle.org/browse/MDL-65233">MDL-65233</a>), but has been repeatedly reported since (<a href="https://tracker.moodle.org/browse/MDL-76882">MDL-76882</a>, <a href="https://tracker.moodle.org/browse/MDL-82206">MDL-82206</a>, …). I wonder how many more years it will take to get fixed.</p> <blockquote class="block-tip"> <h5 id="fixed">Fixed</h5> <p>The answer is <em>one month</em>! I went ahead and submitted a patch to remove the <code class="language-plaintext highlighter-rouge">usergetmidnight</code>. It was accepted by Moodle developers and the fix should ship in Moodle versions 4.4.7 and 4.5.3. 
Nevertheless, I suggest following the recommended workaround below until you can be sure that your institution has updated.</p> </blockquote> <h2 id="workaround">Workaround</h2> <p>Anyone who has a course to run cannot wait, so here are a few workarounds, illustrated with the example above.</p> <h3 id="workaround-1">Workaround 1</h3> <ol> <li>Reset the course to have new start date <strong>12.02.2025 00:00</strong> (i.e. the intended new start date minus the 10-hour time part of the old start date). The activity’s date will get shifted as intended.</li> <li>The course start date after the reset will have the wrong time, so change it to <strong>12.02.2025 10:00</strong> in course settings (not by resetting the course!).</li> </ol> <h3 id="workaround-2-recommended">Workaround 2 (recommended)</h3> <ol> <li>Change the old course start date to <strong>14.02.2024 00:00</strong> in course settings (not by resetting the course!). Already having the time be midnight cancels the weirdness of <code class="language-plaintext highlighter-rouge">usergetmidnight</code>.</li> <li>Reset the course to have new start date <strong>12.02.2025 00:00</strong> (also midnight). Both the new course start date and the activity’s date will be as intended.</li> </ol> <p>This workaround is a bit more future-proof:</p> <ol> <li>By always having the course start date at midnight, you hopefully avoid the issue during future course resets because the time shift will be intuitive then.</li> <li>If <del>Moodle ever fixes the calculation (to remove <code class="language-plaintext highlighter-rouge">usergetmidnight</code>) and</del> your institution finally updates Moodle, then this does not impact your course resetting workflow.
(With workaround 1 you’d have to know about the change and stop manually compensating on each reset, because then you’d be overcompensating.)</li> </ol> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:reason"> <p>Comment below if you think this makes any sense or have any idea why this would ever be desirable. <a href="#fnref:reason" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="teaching"/><category term="university"/><category term="moodle"/><category term="tutorial"/><summary type="html"><![CDATA[When resetting a Moodle course for a new year/semester, it can be very tedious to update the availability and due dates for all activities/assignments/etc in the course. Moodle (supposedly) provides a convenient way to do this:]]></summary></entry><entry><title type="html">Clickable, breakable, colored &amp;amp; underlined URLs in LaTeX</title><link href="https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex/" rel="alternate" type="text/html" title="Clickable, breakable, colored &amp;amp; underlined URLs in LaTeX"/><published>2025-01-26T00:00:00+00:00</published><updated>2025-01-26T00:00:00+00:00</updated><id>https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex</id><content type="html" xml:base="https://sim642.eu/blog/2025/01/26/clickable-breakable-colored-underlined-urls-in-latex/"><![CDATA[<h2 id="requirements">Requirements</h2> <p>Suppose the goal is to <em>simultaneously</em> achieve all of the following for typesetting URLs with the <code class="language-plaintext highlighter-rouge">\url</code> command:<sup id="fnref:url-command"><a href="#fn:url-command" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <ol> <li>URLs are clickable.</li> <li>URLs are (line-)breakable.</li> <li>URLs are colored.</li> <li>URLs are underlined.<sup 
id="fnref:pdfborderstyle"><a href="#fn:pdfborderstyle" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></li> </ol> <p>Many StackOverflow questions and answers cover a subset of these, but none that I’ve found does all four. Moreover, such answers usually don’t combine to get all of the above.</p> <h3 id="a-rant">A rant</h3> <p>The esteemed typographer Robert Bringhurst states in his “The Elements of Typographic Style”:</p> <blockquote> <p>3.5.1 Change one parameter at a time.</p> </blockquote> <p>Moreover, underlining is shunned in general, e.g. by the <a href="https://texfaq.org/FAQ-underline">TeX FAQ</a> and the <a href="http://mirrors.ctan.org/macros/generic/soul/soul-ori.pdf"><code class="language-plaintext highlighter-rouge">soul-ori</code> package documentation</a> (with references to other typographic texts).</p> <p>Hence, requiring URLs in texts to be underlined (<em>and</em> colored) is madness, but this is what happens when Word users make the rules…</p> <h2 id="lualatex">LuaLaTeX</h2> <div class="language-latex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>hyperref<span class="p">}</span> <span class="c">% provides clickable \url command</span>
<span class="k">\hypersetup</span><span class="p">{</span>
    breaklinks, <span class="c">% allow line breaks in links</span>
    colorlinks, <span class="c">% allow colors for links</span>
    allcolors=black, <span class="c">% disable obnoxious colors for \ref, \cite, etc. (optional)</span>
    urlcolor=blue, <span class="c">% choose color for \url (default)</span>
<span class="p">}</span>
<span class="k">\usepackage</span><span class="p">{</span>xurl<span class="p">}</span> <span class="c">% allows arbitrary line breaks in \url (optional)</span>

<span class="k">\usepackage</span><span class="p">{</span>lua-ul<span class="p">}</span> <span class="c">% provides underlining for LuaLaTeX</span>
<span class="k">\makeatletter</span> <span class="c">% allow accessing \@underLine below</span>
<span class="k">\DeclareUrlCommand</span><span class="p">{</span><span class="k">\Hurl</span><span class="p">}{</span><span class="c">% redefine hyperref's internal \Hurl instead of \url to preserve clickability and color</span>
    <span class="k">\def\UrlLeft</span>##1<span class="k">\UrlRight</span><span class="p">{</span><span class="k">\@</span>underLine##1<span class="p">}</span> <span class="c">% underline \url, use internal command instead of \underLine to preserve breakability</span>
<span class="p">}</span>
<span class="k">\makeatother</span>
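<span class="c">% Hypothetical usage with the setup above: in the document body,</span>
<span class="c">%   \url{https://example.com/some/very/long/path}</span>
<span class="c">% produces a blue, underlined, clickable URL that can break across lines.</span>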
</code></pre></div></div> <h2 id="pdflatex">pdfLaTeX</h2> <p>Currently I am not aware of a way to simultaneously achieve breakability and underlining with pdfLaTeX. Neither the <code class="language-plaintext highlighter-rouge">soul</code> nor the <code class="language-plaintext highlighter-rouge">ulem</code> package for underlining with pdfLaTeX provides a command like <code class="language-plaintext highlighter-rouge">\@underLine</code> which can be used to underline without forcing the argument into an unbreakable box. Both of these packages and the <code class="language-plaintext highlighter-rouge">url</code> package (loaded by <code class="language-plaintext highlighter-rouge">hyperref</code>) for the <code class="language-plaintext highlighter-rouge">\url</code> command do some low-level <code class="language-plaintext highlighter-rouge">\catcode</code> trickery with their arguments, but they don’t seem to play well together. The <code class="language-plaintext highlighter-rouge">href-ul</code> package seems to do nothing at all.</p> <p><strong>Comment below if you have a solution for pdfLaTeX!</strong></p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:url-command"> <p>It has to be the standard <code class="language-plaintext highlighter-rouge">\url</code> command from the <code class="language-plaintext highlighter-rouge">url</code> package, not an alternative custom command. The latter will lead to inconsistencies, e.g. URLs in BibLaTeX bibliography would not use the custom command. <a href="#fnref:url-command" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:pdfborderstyle"> <p><code class="language-plaintext highlighter-rouge">pdfborderstyle</code> from the <code class="language-plaintext highlighter-rouge">hyperref</code> package is not an appropriate and portable way to achieve this.
<a href="#fnref:pdfborderstyle" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="tutorial"/><summary type="html"><![CDATA[Requirements]]></summary></entry><entry><title type="html">TP-Link cannot get IPv6 firewall right</title><link href="https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right/" rel="alternate" type="text/html" title="TP-Link cannot get IPv6 firewall right"/><published>2024-08-24T00:00:00+00:00</published><updated>2025-01-05T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/24/tp-link-cannot-get-ipv6-firewall-right/"><![CDATA[<p>What led to two of my <a href="/blog/2024/08/10/firefox-hsts-bypass/">recent</a> <a href="/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/">posts</a> is TP-Link’s inability to get IPv6 firewall right on their routers.<sup id="fnref:err"><a href="#fn:err" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></p> <h2 id="allowing-all-incoming-ipv6-traffic">Allowing all incoming IPv6 traffic</h2> <p>For a long time my home NAS was publically accessible from the internet under my own subdomain. Although risky, I did my best to secure things. First, in my TP-Link router I had forwarded to my NAS only a handful of ports for the Synology DSM and Home Assistant. 
Second, both of these used HTTPS with <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt certificates via Synology’s built-in functionality</a>.</p> <p>As I was removing all the port forwarding, I noticed something odd: I had never forwarded port 80 to my NAS, yet my Synology had been renewing Let’s Encrypt certificates for years using its HTTP-01 validation<sup id="fnref:dns-01"><a href="#fn:dns-01" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, which requires port 80 to be exposed. Moreover, even after removing all the port forwarding, those ports were still accessible from the internet (e.g., a DigitalOcean VPS). I hadn’t DMZ-ed the NAS in the TP-Link router, so how is this even possible?</p> <p>To my absolute horror, I <em>eventually</em> realized that my <strong>TP-Link Archer C6 v2.0<sup id="fnref:firmware"><a href="#fn:firmware" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> simply allows all incoming IPv6 traffic to all LAN devices</strong>. While for IPv4 the NAT port forwarding (or the lack of it) acts as a de facto firewall, TP-Link must’ve been thinking that IPv6 not needing NAT means that there’s no need for any kind of firewall whatsoever. This is unbelievably insecure because non-expert (but also quite advanced) users would never realize this. Furthermore, there’s no way to change it other than <strong>disabling IPv6 altogether</strong>. So much for IPv6 adoption…</p> <p>Some deep digging reveals that this massive security flaw has been noticed a few times on <a href="https://community.tp-link.com/en/home/forum/topic/160757">TP-Link Community forums</a> and <a href="https://www.reddit.com/r/TpLink/comments/xcgica/how_to_block_network_access_from_my_lan_to/">Reddit</a> before. 
Some other <a href="https://www.reddit.com/r/TpLink/comments/1ek6u2u/ipv6_firewall_rules_on_tplink_routers/">Reddit</a> <a href="https://www.reddit.com/r/TpLink/comments/wibkgp/ipv6_firewall_not_present_in_tplink_archer_c64/">posts</a> don’t mention the allow-all behavior explicitly, but just wonder about an IPv6 firewall of any sort.</p> <h2 id="blocking-all-incoming-ipv6-traffic">Blocking all incoming IPv6 traffic</h2> <p>Luckily (not for me), someone at TP-Link must’ve realized their stupidity at some point, but fixed it only for some newer router models — no firmware updates are available for mine. More often, people on <a href="https://community.tp-link.com/en/home/forum/topic/220864">TP-Link</a> <a href="https://community.tp-link.com/en/home/forum/topic/567744">Community</a> <a href="https://community.tp-link.com/en/home/forum/topic/185422">forums</a> and <a href="https://www.reddit.com/r/HomeNetworking/comments/1als6lu/how_do_i_expose_ipv6_port_to_wan_tp_link_ax_5400/">various</a> <a href="https://www.reddit.com/r/TpLink/comments/1bst93b/ax11000_v116_vs_v2_ipv6_firewall_support/">Reddit</a> <a href="https://www.reddit.com/r/TpLink/comments/15vjnuy/when_will_ax11000_have_proper_ipv6_firewall/">posts</a> have complained about their <strong>TP-Link routers blocking all incoming IPv6 traffic without any configurability</strong> (unlike IPv4). At least this is secure enough to not expose everything to the internet, but it doesn’t help IPv6 adoption either…</p> <h2 id="providing-non-functional-ipv6-firewall-configuration">Providing non-functional IPv6 firewall configuration</h2> <p>For some even newer router models, it seems that TP-Link tried to also solve that problem by making the IPv6 firewall finally configurable. 
Judging by posts on <a href="https://community.tp-link.com/en/home/forum/topic/654622">TP-Link</a> <a href="https://community.tp-link.com/en/home/forum/topic/670230">Community</a> <a href="https://community.tp-link.com/en/home/forum/topic/682276">forums</a>, however, <strong>this configurability doesn’t seem to actually work</strong> and all incoming IPv6 traffic is still blocked.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:err"> <p>The insecurity of TP-Link routers has recently come to broad attention: <a href="https://news.err.ee/1609557511/chinese-routers-to-be-banned-in-the-us-also-widespread-in-estonia">Chinese routers to be banned in the US also widespread in Estonia</a>. <a href="#fnref:err" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:dns-01"> <p>Synology only supports HTTP-01 for custom domains. <a href="#fnref:dns-01" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:firmware"> <p>Using firmware “1.3.6 Build 20200902 rel.65591(4555)” which happens to be the latest for this router. <a href="#fnref:firmware" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="networking"/><category term="security"/><category term="rant"/><summary type="html"><![CDATA[What led to two of my recent posts is TP-Link’s inability to get IPv6 firewall right on their routers.1 The insecurity of TP-Link routers has recently come to broad attention: [Chinese routers to be banned in the US also widespread in Estonia][err-article]. 
&#8617;]]></summary></entry><entry><title type="html">Tailscale HTTPS certificate on Synology NAS</title><link href="https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/" rel="alternate" type="text/html" title="Tailscale HTTPS certificate on Synology NAS"/><published>2024-08-11T00:00:00+00:00</published><updated>2024-11-10T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/11/tailscale-https-certificate-on-synology-nas/"><![CDATA[<p>I recently discovered <a href="https://tailscale.com/">Tailscale</a> for setting up a private VPN. My main goal was to use it for accessing my <a href="https://www.synology.com/">Synology NAS</a> at home from anywhere in the world. So far I had kept my home NAS publically accessible from the internet, which had been fine but risky nevertheless.</p> <p>In order to secure web connections to the Synology DSM and various Docker-based services, I had set up <a href="https://kb.synology.com/vi-vn/DSM/tutorial/How_to_enable_HTTPS_and_create_a_certificate_signing_request_on_your_Synology_NAS">Let’s Encrypt on Synology</a> under my own subdomain. Since my NAS is no longer publically accessible, it cannot obtain new Let’s Encrypt certificates for the subdomain<sup id="fnref:dns-01"><a href="#fn:dns-01" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Instead, I needed HTTPS certificates for the Tailscale full domain of the NAS.</p> <p>Tailscale has <a href="https://tailscale.com/kb/1131/synology">a guide for setting Tailscale itself up on Synology</a> and <a href="https://tailscale.com/kb/1153/enabling-https">a guide for obtaining HTTPS certificates using <code class="language-plaintext highlighter-rouge">tailscale cert</code></a>. 
Surprisingly, neither documents the best solution, which is the <a href="https://tailscale.com/kb/1080/cli#configure-alpha">undocumented</a> command</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tailscale configure synology-cert
</code></pre></div></div> <p>Prior to its introduction, users came up with their own scripts <a href="https://github.com/tailscale/tailscale/issues/4674">under this Tailscale issue</a>, but using the official command is now the easiest way.</p> <h2 id="step-by-step">Step-by-step</h2> <ol> <li><a href="https://tailscale.com/kb/1131/synology">Set up Tailscale on your Synology NAS</a> or update it to at least <strong>version 1.64.0</strong>.</li> <li>Navigate in the Synology DSM to <strong>Control Panel → Task Scheduler</strong>.</li> <li> <p>Create a new scheduled task with a user-defined script (<strong>Create → Scheduled Task → User-defined script</strong>) with the following details:</p> <ul> <li><strong>General</strong>: <ul> <li>Task (name): “Tailscale Certificate” (or whatever you want).</li> <li>User: root (the Tailscale command needs that).</li> </ul> </li> <li><strong>Schedule</strong>: <ul> <li>“Run on the following days”: “Weekly”, “Monday” (“Monthly” does not seem frequent enough to renew the 90-day Let’s Encrypt certificate reliably, because calendar months and the 90-day validity period do not stay nicely in sync).</li> </ul> </li> <li><strong>Task Settings</strong>: <ul> <li>User-defined script: <code class="language-plaintext highlighter-rouge">tailscale configure synology-cert</code> (the magic command).</li> </ul> </li> </ul> </li> <li>Press “OK” and follow the on-screen instructions for setting up the root script.</li> <li>Right click on the created task and select “Run” to get the first certificate immediately.</li> <li>Navigate in the Synology DSM to <strong>Control Panel → Security → Certificate</strong>.</li> <li>You should now see a certificate for your <code class="language-plaintext highlighter-rouge">ts.net</code> subdomain in this list.</li> <li>Use the Tailscale certificate in one of two ways, depending on your use case: <ol> <li>Right click on the certificate and select “Edit”. 
Then tick “Set as default certificate” and press “OK”.</li> <li>Click “Settings” in the toolbar. Change the certificate on a per-service basis.</li> </ol> </li> </ol> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:dns-01"> <p>This would be possible with Let’s Encrypt’s DNS-01 domain validation (as opposed to HTTP-01), but Synology only supports HTTP-01 for custom domains. <a href="#fnref:dns-01" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="synology"/><category term="networking"/><category term="security"/><category term="tutorial"/><summary type="html"><![CDATA[I recently discovered [Tailscale] for setting up a private VPN. My main goal was to use it for accessing my Synology NAS at home from anywhere in the world. So far I had kept my home NAS publically accessible from the internet, which had been fine but risky nevertheless.]]></summary></entry><entry><title type="html">Firefox HSTS bypass</title><link href="https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass/" rel="alternate" type="text/html" title="Firefox HSTS bypass"/><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass</id><content type="html" xml:base="https://sim642.eu/blog/2024/08/10/firefox-hsts-bypass/"><![CDATA[<p>HSTS is a mechanism to force browsers to use HTTPS instead of HTTP to connect to a site. The intention being that an attacker cannot replace it with an insecure version.</p> <p>However, it might be desirable to undo this enforcement for valid and safe reasons, e.g., during web development and testing. 
In my case, I needed to override the protection after disabling “Automatically redirect HTTP connection to HTTPS for DSM desktop” in my Synology NAS settings.</p> <p>While other browsers (Chrome/Edge) provide a way for power users to bypass HSTS for a site, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1528738">Firefox insists on not offering any means to do so due to “No User Recourse” from the HSTS RFC</a>. Yet, the “solutions” presented by Mozilla employees still allow users to do just that, while also deleting other data and being significantly less secure than just bypassing for a single site…</p> <h2 id="non-solutions">Non-solutions</h2> <p>There are a few supposed solutions online; however, I consider each one a non-solution:</p> <ol> <li> <p><strong><a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/27794/highlight/true#M15411">The official solution</a></strong> is to find the site in Firefox History and select “Forget this site” for it.</p> <p>This is a non-solution because it deletes <em>all</em> data related to the site, not just its HSTS state.</p> </li> <li> <p><a href="https://www.thesslstore.com/blog/clear-hsts-settings-chrome-firefox/#h-how-to-delete-hsts-settings-in-firefox">Editing <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.txt</code></a> to remove the HSTS entry for a specific site.</p> <p>While this only deletes the HSTS state, this is a non-solution because it no longer works: <a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/52339/highlight/true#M30458">recent versions of Firefox use a proprietary binary file <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> instead</a>.</p> </li> <li> <p><a href="https://connect.mozilla.org/t5/ideas/allow-firefox-to-bypass-hsts-errors/idc-p/52339/highlight/true#M30458">Deleting <code class="language-plaintext 
highlighter-rouge">SiteSecurityServiceState.bin</code></a> to remove HSTS entries for <em>all</em> sites.</p> <p>This is a non-solution because it deletes HSTS data related to <em>other</em> unrelated sites and unnecessarily gives up the security provided by HSTS. It’s the most insane “solution” of them all.</p> </li> </ol> <h2 id="the-solution">The solution</h2> <ol> <li> <p>Find your Firefox profile path. You can do this as follows:</p> <ol> <li>Navigate to the “URL” <code class="language-plaintext highlighter-rouge">about:profiles</code>.</li> <li>Find your profile from the list. This is likely the one with “This is the profile in use and it cannot be deleted.” under it.</li> <li>Copy the “Root Directory” or click “Open Directory” after it.</li> </ol> </li> <li> <p>Close Firefox.</p> </li> <li> <p>Back up the <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> file in your Firefox profile path, for example, by copying it as <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin.bak</code>. This is in case the binary file somehow ends up corrupted when modifying it in the next step.</p> </li> <li> <p>Open the <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.bin</code> file in a hex editor. I used <a href="https://wiki.gnome.org/Apps/Ghex">GHex</a> on Linux.</p> </li> <li> <p>Use the hex editor’s “Find” feature to find the desired site’s domain in the file.</p> </li> <li> <p>Replace the Unix timestamp in milliseconds (like <code class="language-plaintext highlighter-rouge">1723280965123</code>) after it (there are many NUL/zero bytes in between) with one in the past. 
I changed it to <code class="language-plaintext highlighter-rouge">1696969696969</code>.</p> <p>The file seems to have a similar format to the old <code class="language-plaintext highlighter-rouge">SiteSecurityServiceState.txt</code> file, but since it’s in a proprietary binary format, it’s not as simple as deleting a line from it. So the safest way is to just change the HSTS expiry timestamp in-place.</p> </li> <li> <p>Save the file in the hex editor.</p> </li> <li> <p>Open Firefox.</p> </li> </ol>]]></content><author><name></name></author><category term="firefox"/><category term="open source"/><category term="security"/><category term="rant"/><category term="tutorial"/><summary type="html"><![CDATA[HSTS is a mechanism to force browsers to use HTTPS instead of HTTP to connect to a site. The intention being that an attacker cannot replace it with an insecure version.]]></summary></entry><entry><title type="html">Springer anti-typesetters, part 2</title><link href="https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2/" rel="alternate" type="text/html" title="Springer anti-typesetters, part 2"/><published>2024-07-22T00:00:00+00:00</published><updated>2024-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/22/springer-anti-typesetters-part-2/"><![CDATA[<p>This post continues the <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">Springer typesetting saga from part 1</a> on a pair of papers in March 2024. We had two papers <a class="citation" href="#GOBLINTVALIDATOR-SVCOMP24">(Saan et al., 2024; Saan et al., 2024)</a> accepted into the same conference proceedings and they were edited <em>very</em> differently. The following compares our submitted camera-ready version with the many proofs from Springer typesetters.</p> <h2 id="proof-1">Proof 1</h2> <p>The first pair of proofs for the two papers were near-perfect. 
In both papers they just removed spaces between authors’ email addresses, i.e., <code class="language-plaintext highlighter-rouge">{a, b}@c.d</code> was replaced with <code class="language-plaintext highlighter-rouge">{a,b}@c.d</code>, which is harder to read in typewriter font but whatever. (They also did this for the paper in <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">part 1</a>.)</p> <p>However, Springer still managed to introduce inconsistencies. Both papers are <em>open access</em> and have a paragraph about <a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a> after the references. In the Goblint Validator paper they inserted a dot after the bolded “Open Access” which begins the paragraph and changed the license URL into typewriter font (unlike all other URLs in the paper which were unchanged).</p> <h2 id="proof-2">Proof 2</h2> <p>After accepting the first proofs, Springer emailed a second pair of proofs because they forgot to add artifact evaluation badges to both papers. Apparently their fabulous proof checking system cannot be used a second time so this had to be done completely by email.</p> <p>In the Goblint (verifier) paper, the only change was indeed the addition of the badges. In the Goblint Validator paper, they intentionally screwed everything else up while adding those badges. 
Here’s the worst of what they did.</p> <h3 id="misencoding-editor-names">Misencoding editor names</h3> <p>In the year 2024, Springer still struggles with encodings: the name of the proceedings editor <a href="https://lkovacs.com/">Laura Kovács</a> appeared like this:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-copyright-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Copyright notice in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h3 id="replacing-small-caps-font-in-title">Replacing small caps font in title</h3> <p>It’s common for tool names to be in small caps (<code class="language-plaintext highlighter-rouge">\textsc</code>) in paper titles, especially for these SV-COMP tool papers in TACAS proceedings. For some reason, small caps wasn’t enough for Springer and they made it bold italic small caps. I have never seen this in a paper title (or anywhere for that matter).</p> <p>The following image comparisons illustrate the pessimization from our camera-ready version to Springer’s proof (it’s very easy to tell which side is which version). 
Hover/swipe across the images to fully appreciate the horror.</p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover" value="25"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Title in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-title-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Title in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h3 id="expanding-author-emails">Expanding author emails</h3> <p>In this iteration they went a step further and expanded all author emails, i.e., <code class="language-plaintext highlighter-rouge">{a,b}@c.d</code> was replaced 
with <code class="language-plaintext highlighter-rouge">a@c.d</code>, <code class="language-plaintext highlighter-rouge">b@c.d</code>, which takes up more space. I don’t understand why it was necessary for <em>this</em> version of <em>this</em> paper, but not any other.</p> <h3 id="reformatting-tables-entirely">Reformatting tables entirely</h3> <p>Just like in <a href="/blog/2024/07/21/springer-anti-typesetters-part-1/">part 1</a>, Springer redid <a href="https://ctan.org/pkg/booktabs?lang=en"><code class="language-plaintext highlighter-rouge">booktabs</code></a> tables with the following changes:</p> <ol> <li><em>All</em> columns are left-aligned (again). That is especially bad for numeric data spanning multiple orders of magnitude. Our tables with <a href="https://ctan.org/pkg/siunitx?lang=en"><code class="language-plaintext highlighter-rouge">siunitx</code></a> columns that properly align digits and decimal points were ruined. Columns of centered checkmarks and crosses became awkward. Centered <code class="language-plaintext highlighter-rouge">\multicolumn</code> spans became odd.</li> <li>Line breaks from multi-line column headers were removed. 
This makes some columns overly wide.</li> <li>Row spacing was increased.</li> <li>The font for tables was changed to something not matching the rest of the paper (again).</li> </ol> <p>This time they didn’t add vertical column-separating rules between all columns!</p> <h4 id="table-1">Table 1</h4> <p>Note how they changed “2,015” to “2015” in the bottom right cell, thinking it’s a year or something.</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table1-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h4 id="table-2">Table 2</h4> <p>Yes, they moved this table onto a separate page and rotated it 90° (in addition to all of 
the above).</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-table2-proof2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h2 id="proof-3">Proof 3</h2> <p>After painstakingly listing all the unnecessary changes they made as typesetting errors in the Goblint Validator paper, Springer gave up on (most of) their stupidity and reverted to the first proof (with artifact badges correctly added this time).</p> <p>And yet, they still had to mess up something. In two places, end-of-line punctuation was shifted from the baseline to above the text. 
It is beyond me how one could do this accidentally.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-comma-proof3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Shifted comma in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-480.webp 480w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-800.webp 800w,/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/goblint-validator-svcomp2024-dot-proof3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Shifted dot in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h2 id="conclusion">Conclusion</h2> <p>At least this time Springer listened and properly undid all the ugliness, so I don’t have to feel shame about how the publisher’s versions of my papers look. But clearly they haven’t stopped making arbitrary unnecessary changes, which wasted everyone’s time, as well as unfathomable accidental ones.</p> <p>Stay tuned for part 3! 
(It’s bound to happen…)</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[This post continues the Springer typesetting saga from part 1 on a pair of papers in March 2024. We had two papers (Saan et al., 2024; Saan et al., 2024) accepted into the same conference proceedings and they were edited very differently. The following compares our submitted camera-ready version with the many proofs from Springer typesetters.]]></summary></entry><entry><title type="html">Springer anti-typesetters, part 1</title><link href="https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1/" rel="alternate" type="text/html" title="Springer anti-typesetters, part 1"/><published>2024-07-21T00:00:00+00:00</published><updated>2024-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/21/springer-anti-typesetters-part-1/"><![CDATA[<p>This post describes (only some of) my frustrations with Springer’s typesetting of one paper <a class="citation" href="#10.1007/978-3-031-50524-9_4">(Saan et al., 2024)</a> in December 2023. It was also written around that time, but not published. It compares our submitted camera-ready version (which is very similar to the nicely-formatted <a href="https://arxiv.org/abs/2310.16572">arXiv version</a>) with the proofs from Springer typesetters.</p> <h2 id="discrediting-other-authors">Discrediting other authors</h2> <p>In our paper the citation “Beyer and Strejček [23]” was edited by Springer to simply “Strejček [23]”, discrediting <a href="https://www.sosy-lab.org/people/beyer/">Dirk Beyer</a>. We used <code class="language-plaintext highlighter-rouge">\citet{Beyer2022}</code> in LaTeX and both authors are still listed in the corresponding References entry. 
<a href="https://link.springer.com/chapter/10.1007/978-3-031-22308-2_8">The cited paper</a> is published via Springer, so they should have no doubt about the authorship. What reason would Springer have to replace <code class="language-plaintext highlighter-rouge">\citet</code> with a reduced author list which is no longer consistent with the References? Or how would one do that accidentally?</p> <h2 id="reformatting-tables-entirely">Reformatting tables entirely</h2> <p>We use the <a href="https://ctan.org/pkg/booktabs?lang=en"><code class="language-plaintext highlighter-rouge">booktabs</code></a> LaTeX package to typeset beautiful professional tables. For whatever reason Springer entirely reformats tables:</p> <ol> <li>Vertical column-separating rules are added between all columns.</li> <li><em>All</em> columns are left-aligned. That is especially bad for numeric data spanning multiple orders of magnitude. Our tables with <a href="https://ctan.org/pkg/siunitx?lang=en"><code class="language-plaintext highlighter-rouge">siunitx</code></a> columns that properly align digits and decimal points were ruined. Columns of centered checkmarks and crosses became awkward. Centered <code class="language-plaintext highlighter-rouge">\multicolumn</code> spans became odd.</li> <li>The font for tables was changed to something not matching the rest of the paper.</li> </ol> <p>Nothing in <a href="https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines">the Springer guidelines</a> requires any such changes to tables, instead requiring:</p> <blockquote> <p>It is essential that all illustrations are clear and legible.</p> </blockquote> <p>By unnecessarily reformatting all tables, Springer editors have done the complete opposite of their own guidelines.</p> <h3 id="comparisons">Comparisons</h3> <p>The following image comparisons illustrate the pessimization from our camera-ready version to Springer’s proof (it’s very easy to tell which side is which version). 
Hover/swipe across the images to fully appreciate the horror.</p> <style>.slider-with-shadows{--default-handle-shadow:0 0 5px var(--global-theme-color);--divider-shadow:0 0 5px var(--global-theme-color);--divider-color:var(--global-theme-color);--default-handle-color:var(--global-theme-color)}</style> <h4 id="table-1">Table 1</h4> <p>Extra ugly is how the vertical rules have gaps in them and some cells are not completely colored.</p> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table1-camera-480.webp 480w,/assets/springer-anti-typesetters/unassume-table1-camera-800.webp 800w,/assets/springer-anti-typesetters/unassume-table1-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table1-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table1-proof-480.webp 480w,/assets/springer-anti-typesetters/unassume-table1-proof-800.webp 800w,/assets/springer-anti-typesetters/unassume-table1-proof-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table1-proof.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 1 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h4 id="table-2">Table 2</h4> <img-comparison-slider class="slider-with-shadows z-depth-1" hover="hover"> <figure slot="first"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table2-camera-480.webp 
480w,/assets/springer-anti-typesetters/unassume-table2-camera-800.webp 800w,/assets/springer-anti-typesetters/unassume-table2-camera-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table2-camera.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in camera-ready version" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <figure slot="second"> <picture> <source class="responsive-img-srcset" srcset="/assets/springer-anti-typesetters/unassume-table2-proof-480.webp 480w,/assets/springer-anti-typesetters/unassume-table2-proof-800.webp 800w,/assets/springer-anti-typesetters/unassume-table2-proof-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/springer-anti-typesetters/unassume-table2-proof.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Table 2 in Springer's proof" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </img-comparison-slider> <h3 id="history">History</h3> <p>Apparently, this behavior is far from new: <a href="https://twitter.com/ducha_aiki/status/1059444234711977984?lang=en">Dmytro Mishkin already complained about it in 2018</a>. Meanwhile in the first half of 2023, <code class="language-plaintext highlighter-rouge">booktabs</code> tables were fine again, as evidenced by some of <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_34">our</a> <a href="https://link.springer.com/chapter/10.1007/978-3-031-30044-8_2">papers</a>. Clearly, Springer lacks any kind of policy on this and their typesetters are rulers of a wild west, each making up their own rules. 
</p> <h2 id="making-references-inconsistent">Making references inconsistent</h2> <h3 id="abbreviating-partially">Abbreviating partially</h3> <p>We explicitly edited our BibTeX bibliography to be consistent across all entries, in particular about <code class="language-plaintext highlighter-rouge">booktitle</code>s/<code class="language-plaintext highlighter-rouge">journal</code>s. As provided by Springer Link, we used unabbreviated names like “Static Analysis” and “Tools and Algorithms for the Construction and Analysis of Systems”. Springer says:</p> <blockquote> <p>References are also modified to make them compatible with CrossRef, which will permit cross referencing within SpringerLink […]</p> </blockquote> <p>Fair enough, they abbreviated such names to, e.g., SAS and TACAS. However, their editing is completely inconsistent: some entries use abbreviations, while others still have full names for the very same proceedings/journals.</p> <h3 id="editing-titles">Editing titles</h3> <p>We cite <a href="https://link.springer.com/chapter/10.1007/978-3-030-72013-1_28">two</a> <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_34">papers</a> which have the tool name Goblint set in small caps in their title. As provided by Springer Link, we did not have small caps in our bibliography. Springer edited one to use small caps, but left the other (right after the first) in normal font. 
If there were any policy, then it should be enforced consistently by Springer Link and Springer editors, or at minimum be consistent within a single References section.</p> <p>Furthermore, Springer typesetters like to remove all crucial capitalization from cited titles:</p> <ul> <li>replacing <a href="https://dl.acm.org/doi/10.1145/3470569">“C programs”</a> with “c programs”,</li> <li>replacing <a href="https://dl.acm.org/doi/10.1145/3470569">“Frama-C”</a> with “frama-c”,</li> <li>replacing <a href="https://link.springer.com/chapter/10.1007/978-3-031-30820-8_39">“CommuHash”</a> with “commuhash”.</li> </ul> <h2 id="adding-random-dots">Adding random dots</h2> <p>We reference items in a prior <code class="language-plaintext highlighter-rouge">enumerate</code> like “Items 2 and 4 illustrate”, which was edited by Springer to “Items 2 and 4. illustrate”. The additional dot was consistently (!) added after item number 4 throughout the paper (multiple instances of “Item 4 from Example 4” were changed to “Item 4. from Example 4”). But all other item references into the same list did not get a dot — Springer <a href="https://xkcd.com/221/">randomly</a> did it to Item 4.</p> <h2 id="conclusion">Conclusion</h2> <p>After listing every occurrence of all of these issues explicitly in the response to typesetting proofs, Springer did manage to fix (i.e., undo) most, but not all, of these issues. In the publisher’s version the tables still aren’t as nice as our original version.</p> <p><del>Stay tuned for</del> Check out <a href="/blog/2024/07/22/springer-anti-typesetters-part-2/">part 2</a>!</p>]]></content><author><name></name></author><category term="academia"/><category term="typesetting"/><category term="latex"/><category term="rant"/><summary type="html"><![CDATA[This post describes (only some) of my frustrations with Springer’s typesetting of one paper (Saan et al., 2024) in December 2023 and was also written around that time, but not published. 
It compares our submitted camera-ready version (which is very similar to the nicely-formatted arXiv version) with the proofs from Springer typesetters.]]></summary></entry><entry><title type="html">Automata-theoretic approach to regex crosswords</title><link href="https://sim642.eu/blog/2024/07/20/regex-crossword-automata/" rel="alternate" type="text/html" title="Automata-theoretic approach to regex crosswords"/><published>2024-07-20T00:00:00+00:00</published><updated>2024-07-20T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/20/regex-crossword-automata</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/20/regex-crossword-automata/"><![CDATA[<p>This post documents my automata-theoretic approach to solving <a href="https://regexcrossword.com/">regex crosswords</a>, which is unlike other approaches out there (see <a href="#related-work">Related work</a> below). The biggest limitation of this computer science theory approach is that it only works for truly regular regexes (so no capture groups, look-aheads, etc). 
I have <a href="https://github.com/sim642/regex-crossword/">implemented it in Java</a> using the <a href="https://www.brics.dk/automaton/"><code class="language-plaintext highlighter-rouge">dk.brics.automaton</code></a> library.</p> <h2 id="automata-theoretic-approach">Automata-theoretic approach</h2> <p>Let a <em>rectangular regex crossword</em> with \(h\) rows and \(w\) columns be defined by:</p> <ol> <li>row regexes \(R_0, R_1, \dots, R_{h-1}\) (matching left-to-right),</li> <li>column regexes \(C_0, C_1, \dots, C_{w-1}\) (matching top-to-bottom).</li> </ol> <p>Let’s view the \(w \times h\) rectangle as a string of length \(w \times h\) where the rows have been concatenated (i.e., <a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order">row-major order</a>).</p> <h4 id="example">Example</h4> <p>Consider <a href="https://regexcrossword.com/challenges/intermediate/puzzles/48f25c7e-0416-410e-b96a-c7ee19dfa110">Intermediate: Always remember from regexcrossword.com</a> as an example:</p> \[\begin{array} {|r|c|c|c|}\hline &amp; \texttt{UB|IE|AW} &amp; \texttt{[TUBE]*} &amp; \texttt{[BORF].} \\ \hline \texttt{[NOTAD]*} &amp; ? &amp; ? &amp; ? \\ \hline \texttt{WEL|BAL|EAR} &amp; ? &amp; ? &amp; ? \\ \hline \end{array}\] <h3 id="row-automata">Row automata</h3> <p>First, construct a <em>width automaton</em> \(W\) which accepts all strings of length \(w\). 
This automaton corresponds to the regex <code class="language-plaintext highlighter-rouge">.{w}</code>.</p> <p>Then, for each row regex \(R_i\) construct a <em>row automaton</em> \(R_i'\) as follows:</p> \[R_i' = W^{i} \circ (R_i \cap W) \circ W^{h - i - 1},\] <p>where</p> <ul> <li>\(\circ\) is the binary operator for concatenating two automata,</li> <li>exponentiation self-concatenates the indicated number of copies of the automaton (power 0 gives the automaton whose language contains only the empty string),</li> <li>\(\cap\) is the binary operator for intersection of two automata.</li> </ul> <h4 id="example-1">Example</h4> <p>For the above example, the automata are the following.</p> <p>\(W\) is the <em>width automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">.{3}</code>:</p> <pre><code class="language-mermaid">graph LR
    start:::hidden
    w0(($$w_0$$))
    w1(($$w_1$$))
    w2(($$w_2$$))
    w3((($$w_3$$)))
    start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w3

    classDef hidden display: none;
</code></pre> <p>For brevity, a single character regex is used to describe the possible transitions, as opposed to parallel transitions for each character in the alphabet. Dead/trap states and transitions are also omitted.</p> <ol> <li>\(R_0\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[NOTAD]*</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     r0((($$r_0$$)))
     start--&gt;r0--&gt;|"[NOTAD]"|r0

     classDef hidden display: none;
</code></pre> <p>For brevity, a single character regex is used to describe the possible transitions, as opposed to 5 self-loops.</p> <p>\(R_0' = (R_0 \cap W) \circ W\) is the <em>row automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">[NOTAD]{3}.{3}</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     q0(( ))
     q1(( ))
     q2(( ))
     q3(( ))
     q4(( ))
     q5(( ))
     q6((( )))
     start--&gt;q0--&gt;|"[NOTAD]"|q1--&gt;|"[NOTAD]"|q2--&gt;|"[NOTAD]"|q3--&gt;|.|q4--&gt;|.|q5--&gt;|.|q6

     classDef hidden display: none;
</code></pre> </li> <li>\(R_1\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">WEL|BAL|EAR</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     r0(($$r_0$$))
     r1(($$r_1$$))
     r2(($$r_2$$))
     r3(($$r_3$$))
     r4(($$r_4$$))
     r5(($$r_5$$))
     r6(($$r_6$$))
     r7((($$r_7$$)))
     start--&gt;r0--&gt;|W|r1--&gt;|E|r2--&gt;|L|r7
     r0--&gt;|B|r3--&gt;|A|r4--&gt;|L|r7
     r0--&gt;|E|r5--&gt;|A|r6--&gt;|R|r7

     classDef hidden display: none;
</code></pre> <p>(For symmetry, this has not been minimized.)</p> <p>\(R_1' = W \circ (R_1 \cap W)\) is the <em>row automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">.{3}(WEL|BAL|EAR)</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     w0(( ))
     w1(( ))
     w2(( ))
     w3(( ))
     start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w3
     r1(( ))
     r2(( ))
     r3(( ))
     r4(( ))
     r5(( ))
     r6(( ))
     r7((( )))
     w3--&gt;|W|r1--&gt;|E|r2--&gt;|L|r7
     w3--&gt;|B|r3--&gt;|A|r4--&gt;|L|r7
     w3--&gt;|E|r5--&gt;|A|r6--&gt;|R|r7

     classDef hidden display: none;
</code></pre> </li> </ol> <h3 id="column-automata">Column automata</h3> <p>While the construction of automata for each row regex is rather intuitive, it’s significantly more involved for column regexes. That is because the characters in the row-major string that make up the <em>subsequence</em> which needs to match the column regex are not consecutive. Nevertheless, it is possible using a novel<sup id="fnref:novel"><a href="#fn:novel" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> construction.</p> <p>First, for each column regex \(C_i\) construct a <em>guard automaton</em> \(O^i \circ W^*\) which is in an accepting state at every position in the row-major string that belongs to column \(i\). Here, \(O\) (for offset) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">.</code>. The guard automaton is a repetition of the <em>width automaton</em> \(W\), because every \(w\)-th position in the row-major string is in the same column, prefixed by an offset automaton \(O^i\) to start the repetition from the corresponding column. This automaton corresponds to the regex <code class="language-plaintext highlighter-rouge">.{i}(.{w})*</code>.</p> <p>Then, for each column regex \(C_i\) construct a <em>column automaton</em> \(C_i'\) as follows:</p> \[C_i' = C_i \triangleleft (O^i \circ W^*),\] <p>where \(A \triangleleft G\) is a special product-like <em>guarded automaton</em> (\(A\) guarded by \(G\)), where \(A\) transitions only if \(G\) is in an accepting state. 
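</p> <p>Before the formal definition, here is a minimal Python sketch of this guarded stepping rule (a toy DFA encoding with hypothetical names, which simulates the product on an input word instead of materializing its state set — not the actual <code class="language-plaintext highlighter-rouge">dk.brics.automaton</code>-based Java implementation):</p>

```python
from collections import namedtuple

# Toy DFA: initial state, accepting-state predicate, and transition function
# delta(state, char) -> state, or None for the (omitted) dead/trap state.
DFA = namedtuple("DFA", ["init", "accepting", "delta"])

def guarded_accepts(a, g, word):
    """Simulate the guarded automaton A ◁ G on word:
    A steps on a character only when G is currently in an accepting state,
    otherwise A stays put; G steps on every character.
    The run accepts iff the A-component ends in an accepting state."""
    qa, qg = a.init, g.init
    for c in word:
        if g.accepting(qg):
            qa = a.delta(qa, c)  # A and G step together
            if qa is None:
                return False     # A fell into its dead state
        qg = g.delta(qg, c)      # G always steps
        if qg is None:
            return False
    return a.accepting(qa)

# Third column of the running example: C_2 for regex "[BORF]." guarded by
# the guard automaton for "..(.{3})*" (i.e. i = 2, w = 3).
c2 = DFA(0, lambda q: q == 2,
         lambda q, c: 1 if q == 0 and c in "BORF" else (2 if q == 1 else None))
guard = DFA(0, lambda q: q == 2,  # states 0,1: offset; 2,3,4: width-3 cycle
            lambda q, c: {0: 1, 1: 2, 2: 3, 3: 4, 4: 2}[q])

assert guarded_accepts(c2, guard, "ATOWEL")      # column 2 reads "OL"
assert not guarded_accepts(c2, guard, "ATXWEL")  # "X" is not in [BORF]
```

<p>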
In general \(A \triangleleft G\) is defined as:</p> <ol> <li>Its states \((a, g)\) are from the product set of states \(A \times G\).</li> <li>Its initial state consists of the initial states of \(A\) and \(G\).</li> <li>Its state \((a, g)\) is accepting if \(a\) is an accepting state of \(A\).</li> <li>In state \((a, g)\) with input character \(c\) the automaton steps to <ol> <li>\((a', g')\) if \(g\) is an accepting state of \(G\), \(A\) steps from \(a\) to \(a'\) with \(c\) and \(G\) steps from \(g\) to \(g'\) with \(c\),</li> <li>\((a, g')\) if \(g\) is <em>not</em> an accepting state of \(G\) and \(G\) steps from \(g\) to \(g'\) with \(c\).</li> </ol> </li> </ol> <p>(This has some similarities to <a href="https://dl.acm.org/doi/abs/10.5555/954014.954036">stretching of automata</a>.)</p> <h4 id="example-2">Example</h4> <p>For the above example, the automata are the following.</p> <ol> <li>\(C_0\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">UB|IE|AW</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0(($$c_0$$))
     c1(($$c_1$$))
     c2(($$c_2$$))
     c3(($$c_3$$))
     c4((($$c_4$$)))
     start--&gt;c0--&gt;|U|c1--&gt;|B|c4
     c0--&gt;|I|c2--&gt;|E|c4
     c0--&gt;|A|c3--&gt;|W|c4

     classDef hidden display: none;
</code></pre> <p>\(W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_0' = C_0 \triangleleft W^*\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">(U..B|I..E|A..W)..</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0w0(($$c_0,w_0$$))
     c1w1(($$c_1,w_1$$))
     c1w2(($$c_1,w_2$$))
     c1w0(($$c_1,w_0$$))
     c4w1((($$c_4,w_1$$)))
     c2w1(($$c_2,w_1$$))
     c2w2(($$c_2,w_2$$))
     c2w0(($$c_2,w_0$$))
     c3w1(($$c_3,w_1$$))
     c3w2(($$c_3,w_2$$))
     c3w0(($$c_3,w_0$$))
     c4w2((($$c_4,w_2$$)))
     c4w0((($$c_4,w_0$$)))
     start--&gt;c0w0--&gt;|U|c1w1--&gt;|.|c1w2--&gt;|.|c1w0--&gt;|B|c4w1
     c0w0--&gt;|I|c2w1--&gt;|.|c2w2--&gt;|.|c2w0--&gt;|E|c4w1
     c0w0--&gt;|A|c3w1--&gt;|.|c3w2--&gt;|.|c3w0--&gt;|W|c4w1
     c4w1--&gt;|.|c4w2--&gt;|.|c4w0

     classDef hidden display: none;
</code></pre> </li> <li>\(C_1\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[TUBE]*</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0((($$c_0$$)))
     start--&gt;c0--&gt;|"[TUBE]"|c0

     classDef hidden display: none;
</code></pre> <p>\(O^1 \circ W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">.(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     o0(($$o_0$$))
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;o0--&gt;|.|w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_1' = C_1 \triangleleft (O^1 \circ W^*)\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">.(([TUBE]..)*([TUBE].?)?)?</code> – this is uglier than the automaton):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0o0((($$c_0,o_0$$)))
     c0w0((($$c_0,w_0$$)))
     c0w1((($$c_0,w_1$$)))
     c0w2((($$c_0,w_2$$)))
     start--&gt;c0o0--&gt;|.|c0w0--&gt;|"[TUBE]"|c0w1--&gt;|.|c0w2--&gt;|.|c0w0

     classDef hidden display: none;
</code></pre> <p>(Note that although all states shown are accepting, \(C_1'\) does not accept all strings – the dead state for mismatches at <code class="language-plaintext highlighter-rouge">[TUBE]</code> is not shown.)</p> </li> <li>\(C_2\) is the automaton corresponding to the regex <code class="language-plaintext highlighter-rouge">[BORF].</code>: <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0(($$c_0$$))
     c1(($$c_1$$))
     c2((($$c_2$$)))
     start--&gt;c0--&gt;|"[BORF]"|c1--&gt;|.|c2

     classDef hidden display: none;
</code></pre> <p>\(O^2 \circ W^*\) is the <em>guard automaton</em> corresponding to the regex <code class="language-plaintext highlighter-rouge">..(.{3})*</code>:</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     o0(($$o_0$$))
     o1(($$o_1$$))
     w0((($$w_0$$)))
     w1(($$w_1$$))
     w2(($$w_2$$))
     start--&gt;o0--&gt;|.|o1--&gt;|.|w0--&gt;|.|w1--&gt;|.|w2--&gt;|.|w0

     classDef hidden display: none;
</code></pre> <p>\(C_2' = C_2 \triangleleft (O^2 \circ W^*)\) is the <em>column automaton</em> (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">..[BORF]...</code>):</p> <pre><code class="language-mermaid"> graph LR
     start:::hidden
     c0o0(($$c_0,o_0$$))
     c0o1(($$c_0,o_1$$))
     c0w0(($$c_0,w_0$$))
     c1w1(($$c_1,w_1$$))
     c1w2(($$c_1,w_2$$))
     c1w0(($$c_1,w_0$$))
     c2w1((($$c_2,w_1$$)))
     start--&gt;c0o0--&gt;|.|c0o1--&gt;|.|c0w0--&gt;|"[BORF]"|c1w1--&gt;|.|c1w2--&gt;|.|c1w0--&gt;|.|c2w1

     classDef hidden display: none;
</code></pre> </li> </ol> <h3 id="solution">Solution</h3> <p>Finally, construct the <em>solution automaton</em> \(S\) as an intersection of all row and column automata:</p> \[S = \left(\bigcap_{i=0}^{h-1} R_i'\right) \cap \left(\bigcap_{i=0}^{w-1} C_i'\right).\] <p>This automaton describes <em>all</em> solutions to the regex crossword. If the regex crossword has a unique solution, this automaton is linear and describes exactly one accepted string.</p> <h4 id="example-3">Example</h4> <p>For the above example, the solution automaton is \(S = R_0' \cap R_1' \cap C_0' \cap C_1' \cap C_2'\) (which happens to correspond to the regex <code class="language-plaintext highlighter-rouge">ATOWEL</code>):</p> <pre><code class="language-mermaid">graph LR
    start:::hidden
    q0(( ))
    q1(( ))
    q2(( ))
    q3(( ))
    q4(( ))
    q5(( ))
    q6((( )))
    start--&gt;q0--&gt;|A|q1--&gt;|T|q2--&gt;|O|q3--&gt;|W|q4--&gt;|E|q5--&gt;|L|q6

    classDef hidden display: none;
</code></pre> <p>The solution to the regex crossword can be read off the automaton: the only accepted string is <code class="language-plaintext highlighter-rouge">ATOWEL</code>.</p> <h3 id="performance">Performance</h3> <p>Although I have <a href="https://github.com/sim642/regex-crossword/">implemented it in Java</a>, I don’t have useful information about its performance (especially compared to other approaches). The runtimes on small test cases are negligible.</p> <p>This approach involves a lot of product automata constructions (for intersections and guarded products) which, at least in theory, can yield quite large automata. My hunch is that the intermediate automata, when minimized, are relatively small compared to the theoretical bounds (as also seen in the example). Conventionally, regex crosswords have unique solutions, so as more automata are intersected more restrictions are combined, converging towards a smaller language with a smaller automaton.</p> <h2 id="related-work">Related work</h2> <p>The following table gives an overview of various approaches to the regex crossword problem. 
Most seem to use more brute force (backtracking, search, SMT), but also target non-regular regexes which cannot be expressed as finite automata.</p> <table> <thead> <tr> <th>Approach</th> <th>Implementation</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>Logic programming</td> <td><a href="https://github.com/lvh/regex-crossword">Clojure</a></td> <td><a href="https://www.lvh.io/posts/solving-regex-crosswords/">Blog</a></td> </tr> <tr> <td>Search (“heuristic”)</td> <td><a href="https://github.com/antoine-trux/regex-crossword-solver">C++</a></td> <td><a href="https://solving-regular-expression-crosswords.blogspot.com/2016/05/blog-post.html?m=1">Blog</a></td> </tr> <tr> <td>SMT (“string constraint solving”)</td> <td><a href="https://github.com/blukat29/regex-crossword-solver">Python</a></td> <td><a href="https://blukat.me/2016/01/regex-crossword-solver/">Blog</a></td> </tr> <tr> <td>Custom regex engine</td> <td><a href="https://github.com/almost/regex-crossword-solver">Haskell</a></td> <td><a href="https://almostobsolete.net/regex-crossword/part1.html">Blog part 1</a>, <a href="https://almostobsolete.net/regex-crossword/part2.html">part 2</a></td> </tr> <tr> <td>Go regex DFA inspection (“backtracking”)</td> <td><a href="https://github.com/hermanschaaf/regex-crossword-solver">Go</a></td> <td><a href="https://web.archive.org/web/20190111061731/http://herman.asia/solving-regex-crosswords-using-go">Blog (archived)</a></td> </tr> <tr> <td>Evolutionary algorithm (“heuristic”)</td> <td><a href="https://github.com/maxymczech/gp-regex-crossword">JavaScript</a></td> <td>—</td> </tr> <tr> <td>Search (“backtracking”)</td> <td><a href="https://github.com/purple4reina/regex-crossword-solver">Python</a></td> <td>—</td> </tr> <tr> <td>SMT</td> <td>—</td> <td><a href="https://link.springer.com/chapter/10.1007/978-981-99-8664-4_12">Paper</a></td> </tr> <tr> <td><em>This (automata-theoretic)</em></td> <td><a href="https://github.com/sim642/regex-crossword/">Java</a></td> 
<td><em>Above</em></td> </tr> </tbody> </table> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:novel"> <p>As far as I am aware. Please let me know otherwise. <a href="#fnref:novel" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="regular expressions"/><category term="programming"/><category term="java"/><summary type="html"><![CDATA[This post documents my automata-theoretic approach to solving regex crosswords, which is unlike other approaches out there (see Related work below). The biggest limitation of this computer science theory approach is that it only works for truly regular regexes (so no capture groups, look-aheads, etc). I have implemented it in Java using the dk.brics.automaton library.]]></summary></entry><entry><title type="html">Error in Conway’s “Regular Algebra and Finite Machines”</title><link href="https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines/" rel="alternate" type="text/html" title="Error in Conway’s “Regular Algebra and Finite Machines”"/><published>2024-07-15T00:00:00+00:00</published><updated>2024-07-15T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines</id><content type="html" xml:base="https://sim642.eu/blog/2024/07/15/error-in-conways-regular-algebra-and-finite-machines/"><![CDATA[<p>I was following “Proof Pearl: Regular Expression Equivalence and Relation Algebra” <a class="citation" href="#krauss12">(Krauss &amp; Nipkow, 2012)</a> to implement a <strong>regular expression equivalence checker</strong> in OCaml to validate it as a project idea for my “Advanced Topics in Automata, Languages and Compilers” course.</p> <p>Once the neat implementation was done, I wanted to test it, especially with pairs of regular expressions that aren’t trivially equivalent. 
I scoured the internet (mostly Stack Overflow) and the literature for such examples. Eventually I stumbled upon <strong>“Regular Algebra and Finite Machines”</strong> <a class="citation" href="#conway71">(Conway, 1971)</a>.</p> <h2 id="the-error">The error</h2> <p>Deep in the book, page 121 contains the following exercises:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/conway-regular-algebra-and-finite-machines-exercises-480.webp 480w,/assets/conway-regular-algebra-and-finite-machines-exercises-800.webp 800w,/assets/conway-regular-algebra-and-finite-machines-exercises-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/conway-regular-algebra-and-finite-machines-exercises.png" class="img-fluid rounded" width="100%" height="auto" alt="Exercises from Conway's &quot;Regular Algebra and Finite Machines&quot;" title="Exercises from Conway's &quot;Regular Algebra and Finite Machines&quot;" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>(Here \(+\) indicates the regex operator <code class="language-plaintext highlighter-rouge">|</code>, \(1\) is the empty string aka \(\varepsilon\) and \([]\) are just nested parentheses for grouping, not a regex character class.)</p> <p>Exercise 3 asks us to prove that <code class="language-plaintext highlighter-rouge">(xy)*(x|xy(yy*x)*)*</code> is equivalent to <code class="language-plaintext highlighter-rouge">((xy*y)*yx|x)*(xy)*</code>. However, my checker didn’t agree and spat out the <strong>counterexample <code class="language-plaintext highlighter-rouge">yx</code></strong>. Indeed, it’s not too hard to verify by hand that <a href="https://regex101.com/r/DIYvXh/1">the first regex does not match <code class="language-plaintext highlighter-rouge">yx</code></a> while <a href="https://regex101.com/r/Yp9Tzv/1">the second regex does</a>. 
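</p> <p>This mismatch is also easy to reproduce mechanically, e.g. with Python’s <code class="language-plaintext highlighter-rouge">re</code> module (the book’s expressions transcribed into standard regex syntax; this check is independent of my OCaml checker):</p>

```python
import re

# Exercise 3 as printed in the book, in standard regex syntax:
lhs = r"(xy)*(x|xy(yy*x)*)*"   # left-hand side
rhs = r"((xy*y)*yx|x)*(xy)*"   # right-hand side

# The counterexample found by the equivalence checker:
assert re.fullmatch(lhs, "yx") is None        # first regex rejects "yx"
assert re.fullmatch(rhs, "yx") is not None    # second regex accepts "yx"
```

<p>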
Hence, they aren’t equivalent!</p> <h2 id="the-fix">The fix</h2> <p>The correct formulation of exercise 3 would be:</p> \[(\boldsymbol{yx})^*[x+xy(yy^*x)^*]^* = [(xy^*y)^*yx+x]^*(xy)^*.\] <p>The difference compared to the book is shown in <strong>bold</strong>: beginning of the left-hand side should have <code class="language-plaintext highlighter-rouge">yx</code> instead of <code class="language-plaintext highlighter-rouge">xy</code>. My checker agrees that the two are now equivalent.</p> <p>This is indirectly corroborated later in the book as well. Solutions to the exercises say:</p> <blockquote> <p>A proof of 3 (not from C1-14) is implicit in later exercises.</p> </blockquote> <p>And exercise 9 includes the fixed left-hand side regular expression.</p> <h2 id="references">References</h2> <div class="publications"> <ol class="bibliography"><li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100" style="background-color:#007f4a"> <a href="https://link.springer.com/journal/10817">J. Autom. Reason.</a> </abbr> </div> <div id="krauss12" class="col-sm-8"> <div class="title">Proof Pearl: Regular Expression Equivalence and Relation Algebra</div> <div class="author"> Alexander Krauss and Tobias Nipkow </div> <div class="periodical"> <em>Journal of Automated Reasoning</em>, 2012 </div> <div class="periodical"> </div> <div class="links"> <a class="abstract btn btn-sm z-depth-0" role="button">Abs</a> <a href="https://doi.org/10.1007/s10817-011-9223-4" class="btn btn-sm z-depth-0" role="button">HTML</a> <a href="https://www21.in.tum.de/~krauss/papers/rexp.pdf" class="btn btn-sm z-depth-0" role="button">PDF</a> </div> <div class="abstract hidden"> <p>We describe and verify an elegant equivalence checker for regular expressions. It works by constructing a bisimulation relation between (derivatives of) regular expressions. 
By mapping regular expressions to binary relations, an automatic and complete proof method for (in)equalities of binary relations over union, composition and (reflexive) transitive closure is obtained. The verification is carried out in the theorem prover Isabelle/HOL, yielding a practically useful decision procedure.</p> </div> </div> </div> </li> <li><div class="row"> <div class="col col-sm-2 abbr"> <abbr class="badge rounded w-100">Book</abbr> </div> <div id="conway71" class="col-sm-8"> <div class="title">Regular Algebra and Finite Machines</div> <div class="author"> John Horton Conway </div> <div class="periodical"> 1971 </div> <div class="periodical"> </div> <div class="links"> </div> </div> </div> </li></ol> </div>]]></content><author><name></name></author><category term="computer science"/><category term="regular expressions"/><category term="academia"/><summary type="html"><![CDATA[I was following “Proof Pearl: Regular Expression Equivalence and Relation Algebra” (Krauss &amp; Nipkow, 2012) to implement a regular expression equivalence checker in OCaml to validate it as a project idea for my “Advanced Topics in Automata, Languages and Compilers” course.]]></summary></entry><entry><title type="html">OCaml linting tools and techniques</title><link href="https://sim642.eu/blog/2024/05/01/ocaml-linting/" rel="alternate" type="text/html" title="OCaml linting tools and techniques"/><published>2024-05-01T00:00:00+00:00</published><updated>2025-08-27T00:00:00+00:00</updated><id>https://sim642.eu/blog/2024/05/01/ocaml-linting</id><content type="html" xml:base="https://sim642.eu/blog/2024/05/01/ocaml-linting/"><![CDATA[<p>Recently (but also 3 years ago), I was interested in <a href="https://github.com/goblint/analyzer/pull/1435">finding all catch-all exception handlers in Goblint</a>, which is written in OCaml, in order to prevent “uncatchable” exceptions from being caught and accidentally swallowed. 
“Uncatchable” exceptions are those which should not be ignored, e.g. <code class="language-plaintext highlighter-rouge">Out_of_memory</code>. My first attempt was using a <a href="https://semgrep.dev/">Semgrep</a> rule, but it turned out to be <a href="https://github.com/semgrep/semgrep/issues/10193">too buggy</a> <a href="https://github.com/semgrep/semgrep/issues/3822">to reliably</a> <a href="https://github.com/semgrep/semgrep/issues/3821">do the job</a>. Therefore, I sought out code linters for OCaml.</p> <h2 id="tools">Tools</h2> <p>The following table summarizes all OCaml linting tools I managed to find: active or dead, general or special-purpose, standalone or Ppx, monolithic or modular. In this post I focus on linting (based on syntax and possibly types) and exclude program analyzers like <a href="https://github.com/rescript-association/reanalyze">reanalyze</a> and <a href="https://salto.gitlabpages.inria.fr/">Salto</a>. Ocamllint and ocp-lint are the most universal attempts at OCaml linting; however, they’re long dead and no replacement seems to have emerged.</p> <table> <thead> <tr> <th>Tool</th> <th>Status</th> <th>Use case</th> <th>Mode</th> <th>Structure</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/cryptosense/ocamllint">ocamllint</a></td> <td>Archived</td> <td>General</td> <td>Ppx</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/OCamlPro/typerex-lint/">ocp-lint/typerex-lint</a></td> <td>Inactive</td> <td>General</td> <td>Standalone</td> <td>Modular &amp; extensible</td> </tr> <tr> <td><a href="https://github.com/upenn-cis1xx/camelot">camelot</a></td> <td>Semiactive</td> <td>General/teaching</td> <td>Standalone</td> <td>Modular</td> </tr> <tr> <td><a href="https://github.com/Kakadu/zanuda">zanuda</a></td> <td>Active</td> <td>General</td> <td>Standalone</td> <td>Modular</td> </tr> <tr> <td><a href="https://github.com/NathanReb/bene-gesselint">bene-gesselint</a></td> <td>Inactive</td> <td>Framework</td> 
<td>Ppxlib</td> <td>Modular &amp; extensible</td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a></td> <td>Active</td> <td>Company</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/janestreet/base/tree/3ce90cf26c60eea1f965f2358111e7cabc924953/lint">base ppx_base_lint</a></td> <td>Active</td> <td>Project</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/tree/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version">mina ppx_version</a></td> <td>Active</td> <td>Project</td> <td>Ppxlib</td> <td>Monolithic</td> </tr> <tr> <td><a href="https://github.com/just-max/less-power/tree/da8ad093c5fb593917c140f955e7820e08f9231c/src/ast-check">less-power ast-check</a></td> <td>Active</td> <td>Teaching</td> <td>Standalone/Ppxlib</td> <td>Monolithic</td> </tr> </tbody> </table> <p>There are two general <strong>execution modes</strong>:</p> <ol> <li>Ppx, which are OCaml AST preprocessors, executed by the build system similarly to other Ppx-es like <code class="language-plaintext highlighter-rouge">@@deriving</code> features. These are relatively easy to integrate into modern dune-based workflows.</li> <li>Standalone, which are to be executed outside of the usual compilation process. These don’t integrate into modern dune-based workflows due to <a href="https://github.com/ocaml/dune/issues/3471">very limited linting support in dune</a>.</li> </ol> <p>There are three general <strong>structures</strong> to these linters:</p> <ol> <li>Monolithic, where new rules would have to be implemented intertwined with already existing rules. This is reasonable for special-purpose linters and allows all checks to be performed in a single AST pass.</li> <li>Modular (but not extensible), where rules are implemented independently from others but form a fixed ruleset. These can be more difficult to combine into a single AST pass and might mean multiple passes in some cases. 
They are non-extensible because new rules must be integrated into the core tool itself.</li> <li>Modular and extensible, which has the benefits from the previous point, but also allows custom rules to be added without modifying the tool itself. Thus, they feature some sort of plugin system.</li> </ol> <h3 id="non-ppxlib-tools">Non-Ppxlib tools</h3> <p>The following table provides more details about the non-Ppxlib tools. Notably, some support type information in rules, which allows more expressive and accurate checks, but also means that they cannot be part of the usual Ppx preprocessing step.</p> <table> <thead> <tr> <th>Tool</th> <th><code class="language-plaintext highlighter-rouge">Parsetree</code> traversal</th> <th>Type support</th> <th><code class="language-plaintext highlighter-rouge">Typedtree</code> traversal</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/cryptosense/ocamllint">ocamllint</a></td> <td><code class="language-plaintext highlighter-rouge">Ast_mapper</code></td> <td>No</td> <td>-</td> </tr> <tr> <td><a href="https://github.com/OCamlPro/typerex-lint/">ocp-lint/typerex-lint</a></td> <td>Recursion</td> <td>Yes</td> <td><code class="language-plaintext highlighter-rouge">TypedtreeIter</code></td> </tr> <tr> <td><a href="https://github.com/upenn-cis1xx/camelot">camelot</a></td> <td>Copy of <code class="language-plaintext highlighter-rouge">Ast_iterator</code></td> <td>No</td> <td>-</td> </tr> <tr> <td><a href="https://github.com/Kakadu/zanuda">zanuda</a></td> <td><code class="language-plaintext highlighter-rouge">Ast_iterator</code></td> <td>Yes</td> <td><code class="language-plaintext highlighter-rouge">Tast_iterator</code></td> </tr> </tbody> </table> <h3 id="ppxlib-tools">Ppxlib tools</h3> <p>The following table provides more details about the Ppxlib-based tools. All of these integrate with dune in one way or another. 
In some cases, different parts of the same linter from the first table work by slightly different means.</p> <table> <thead> <tr> <th>Tool</th> <th>Dune integration</th> <th>Ppxlib phase</th> <th>Traversal</th> <th>Output</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/NathanReb/bene-gesselint">bene-gesselint</a></td> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code>&amp;<code class="language-plaintext highlighter-rouge">Ast_pattern</code></td> <td><code class="language-plaintext highlighter-rouge">register_correction</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a> (<code class="language-plaintext highlighter-rouge">enforce_cold</code>)</td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/ppx_js_style">ppx_js_style</a> (other)</td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/janestreet/base/tree/3ce90cf26c60eea1f965f2358111e7cabc924953/lint">base ppx_base_lint</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext 
highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/blob/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version/lint_primitive_uses.ml">mina ppx_version (lint_primitive_uses)</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code>/<code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">raise_errorf</code></td> </tr> <tr> <td><a href="https://github.com/MinaProtocol/mina/blob/5f668b06164e1951d4bce0594ec40f77c5cdd102/src/lib/ppx_version/lint_version_syntax.ml">mina ppx_version (lint_version_syntax)</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error</code>/<code class="language-plaintext highlighter-rouge">eprintf</code></td> </tr> <tr> <td><a href="https://github.com/just-max/less-power/tree/da8ad093c5fb593917c140f955e7820e08f9231c/src/ast-check">less-power ast-check</a></td> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">map_with_context</code></td> <td><code class="language-plaintext highlighter-rouge">error_extensionf</code></td> </tr> </tbody> </table> <p>There are two possible ways of <strong>dune integration</strong>:</p> <ol> <li><code class="language-plaintext highlighter-rouge">(preprocess)</code> stanza, which is the usual way to add Ppx preprocessors to the build of a library/executable. 
This runs unconditionally during the normal build process.</li> <li><code class="language-plaintext highlighter-rouge">(lint)</code> stanza, which has similar syntax but is <em>undocumented</em>. This doesn’t run by default in dune, but rather requires <code class="language-plaintext highlighter-rouge">dune build @lint</code> to be executed, which is very easy to forget. A rare example of this exists in <a href="https://github.com/ocaml/dune/blob/69a24a41e993306d4f1335f3106436b4cdf3f535/test/blackbox-tests/test-cases/lint.t/correct/dune">dune’s test suite</a>.</li> </ol> <p>Either way, a major inconvenience is that the linter has to be added to <em>every</em> dune library and executable. There’s no way right now to define entire-project linters, which is error-prone, as one may simply forget to add the linter to a new library.</p> <p>There are two main <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#driver_execution"><strong>Ppxlib phases</strong></a> used for such linters:</p> <ol> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#global-transfo-phase"><code class="language-plaintext highlighter-rouge">~impl</code> (or <code class="language-plaintext highlighter-rouge">~intf</code>)</a>, which is usually used for defining AST transformations. However, linters wouldn’t actually transform the program, but just output warnings during such a pass.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/driver.html#the-linter-phase"><code class="language-plaintext highlighter-rouge">~lint_impl</code> (or <code class="language-plaintext highlighter-rouge">~lint_intf</code>)</a>, which runs before any transformations take place. 
In fact, this phase cannot even transform the AST, but only return a list of <code class="language-plaintext highlighter-rouge">Lint_error</code>s.</li> </ol> <p>There are <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/ast-traversal.html#the-different-kinds-of-traversals">various ways in Ppxlib to <strong>traverse the AST</strong></a>, and each linter uses one based on what needs to be returned from the phase and how the output is done. Note that <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Context_free/Rule/index.html">Ppxlib’s context-free (rewriting) rules</a> aren’t suitable for linting as-is: they can only match extension nodes, special functions, custom constants and attribute-annotated nodes. In particular, arbitrary <a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/matching-code.html#ast_pattern_intro"><code class="language-plaintext highlighter-rouge">Ast_pattern</code></a>-based matching is not offered by Ppxlib. This is what bene-gesselint tries to provide as a thin wrapper; however, it doesn’t neatly combine multiple <code class="language-plaintext highlighter-rouge">Ast_pattern</code>-matching rules into a single AST pass.</p> <p>There are five means of <strong>output</strong> for Ppxlib-based linters:</p> <ol> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Driver/index.html#val-register_correction"><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></a>, which proposes a code change that can be promoted using dune. 
Since this must propose a change, it cannot simply produce a warning, but multiple changes can also be registered.<sup id="fnref:updated-lint-impl-iter-correction"><a href="#fn:updated-lint-impl-iter-correction" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Driver/Lint_error/index.html"><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></a>, which yields a preprocessor warning. These can only be returned from <code class="language-plaintext highlighter-rouge">~lint_impl</code>, but many warnings can be returned from a single run.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Location/index.html#val-raise_errorf"><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></a>, which crashes the preprocessor with an error. Hence, multiple errors cannot be produced from a single linter run. Ppxlib also discourages the use of exceptions for error handling.</li> <li><a href="https://ocaml-ppx.github.io/ppxlib/ppxlib/Ppxlib/Location/index.html#val-error_extensionf"><code class="language-plaintext highlighter-rouge">Location.error_extensionf</code></a>, which creates a special error extension node to be put into the AST. Hence, this requires a <code class="language-plaintext highlighter-rouge">map</code> traversal, but also allows multiple errors to be returned. Ppxlib recommends this for error handling, at least for usual Ppxlib expanders, derivers and transformers. However, it seems to me that the OCaml compiler will still only print the error from the first error extension node.</li> <li><code class="language-plaintext highlighter-rouge">eprintf</code>, which is just very <em>ad hoc</em>.</li> </ol> <h2 id="ppxlib-techniques">Ppxlib techniques</h2> <p>Many combinations of dune integration, Ppxlib phase, traversal and output exist, but not all of them are compatible and sensible. 
Worse yet, some simply don’t even work, either silently or loudly. The following table gives an overview of the reasonable combinations and which to avoid.</p> <table> <thead> <tr> <th>Dune integration</th> <th>Ppxlib phase</th> <th>Traversal</th> <th>Output</th> <th>Comment</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></td> <td><a href="https://github.com/ocaml-ppx/ppxlib/issues/306"><strong>Doesn’t work</strong> (no output)</a></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(lint)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></td> <td>Dune-promotable changes</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">fold</code></td> <td><code class="language-plaintext highlighter-rouge">Lint_error.of_string</code></td> <td>Multiple preprocessor warnings</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></td> <td>Single error</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~lint_impl</code></td> <td><code 
class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">eprintf</code></td> <td>Multiple <strong>non-standard</strong> warnings</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">map</code></td> <td><code class="language-plaintext highlighter-rouge">Location.error_extensionf</code></td> <td>Multiple errors</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Location.raise_errorf</code></td> <td>Single error</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">(preprocess)</code></td> <td><code class="language-plaintext highlighter-rouge">~impl</code></td> <td><code class="language-plaintext highlighter-rouge">iter</code></td> <td><code class="language-plaintext highlighter-rouge">Driver.register_correction</code></td> <td>Dune-promotable changes<sup id="fnref:updated-lint-impl-iter-correction:1"><a href="#fn:updated-lint-impl-iter-correction" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></td> </tr> </tbody> </table> <p><strong><a href="https://github.com/sim642/dune-lint-demo">This GitHub repository</a></strong> includes examples of all of these setups in the corresponding subdirectories. 
See the Cram test <code class="language-plaintext highlighter-rouge">run.t</code> files for example outputs.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:updated-lint-impl-iter-correction"> <p>Previously, this post incorrectly claimed that <code class="language-plaintext highlighter-rouge">Driver.register_correction</code> only works during <code class="language-plaintext highlighter-rouge">(lint)</code>. <a href="#fnref:updated-lint-impl-iter-correction" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:updated-lint-impl-iter-correction:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming"/><category term="ocaml"/><category term="open source"/><summary type="html"><![CDATA[Recently (but also 3 years ago), I was interested in finding all catch-all exception handlers in Goblint, which is written in OCaml, in order to prevent “uncatchable” exceptions from being caught and accidentally swallowed. “Uncatchable” exceptions are those which should not be ignored, e.g. Out_of_memory. My first attempt was using a Semgrep rule, but it turned out to be too buggy to reliably do the job. 
Therefore, I sought out code linters for OCaml.]]></summary></entry><entry><title type="html">OCaml dependencies lower-bounds CI</title><link href="https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci/" rel="alternate" type="text/html" title="OCaml dependencies lower-bounds CI"/><published>2022-03-13T00:00:00+00:00</published><updated>2025-07-22T00:00:00+00:00</updated><id>https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci</id><content type="html" xml:base="https://sim642.eu/blog/2022/03/13/ocaml-dependencies-lower-bounds-ci/"><![CDATA[<p>When submitting OCaml packages to the <a href="https://github.com/ocaml/opam-repository">opam package repository</a>, opam-ci runs extensive checks on the package submitted by the pull request. Most of these checks are very standard, involving building and testing the package on various OCaml versions, Linux distributions and OCaml compiler variants. In addition to all of that, there are checks called “lower-bounds”.</p> <p>The purpose of these unique jobs is to check whether the lower bounds of the package’s dependencies (i.e. their minimal versions), if declared at all, are right. That is, does the package actually compile when the oldest allowed versions of its dependencies are installed? All the other checks follow the default behavior of the opam package manager and install the newest allowed dependencies, essentially checking the upper bounds (at the time of submission at least).</p> <p>The lower-bounds check works by first installing the dependencies of the package normally. The dependency constraint solver of opam is then reconfigured to instead downgrade and remove as many packages as possible (while still satisfying your package’s lower bounds). 
Having installed the up-to-date versions first, this step nicely shows the version ranges being downgraded, which makes related issues easier to debug.</p> <h2 id="problem">Problem</h2> <p>These lower-bounds jobs can be quite annoying when trying to submit a package to opam, because package developers usually don’t test for that. You only find missing or too relaxed lower bounds after doing a release of the package and submitting a PR to <a href="https://github.com/ocaml/opam-repository">opam-repository</a>, just to find out it fails on their extensive CI.</p> <p>There are two main ways to fix these issues:</p> <ol> <li>Tighten the lower bound for a particular dependency (or add a lower bound if it doesn’t have one).</li> <li>If possible, change the usage of a dependency to not require features it introduced only in newer versions.</li> </ol> <p>If you follow the recommendations of <a href="https://github.com/tarides/dune-release">dune-release</a>, then after fixing the lower bound, instead of re-releasing the exact same version number of the package (and replacing the archive in-place), you release a new patch version of it. This might go on for a while: you release a patched version, submit that to <a href="https://github.com/ocaml/opam-repository">opam-repository</a>, see another lower-bounds failure, fix that – rinse and repeat.</p> <h2 id="solution">Solution</h2> <p>It would be <em>much</em> quicker and less hassle if you could somehow run a similar lower-bounds job on your own GitHub repository’s Actions.</p> <blockquote class="block-tip"> <h5 id="updated">Updated</h5> <p>The post has been updated with the recommended modern approach. For reference, the old version is kept below.</p> </blockquote> <h3 id="with-opam--21-recommended">With opam ≥ 2.1 (recommended)</h3> <p>In fact, you can, by using the <a href="https://github.com/ocaml-opam/opam-0install-cudf">0install solver</a> built into opam 2.1 and above. 
Its dependency solver <code class="language-plaintext highlighter-rouge">--criteria</code> argument allows configuring the preference for old versions (and possibly removing packages). The complete GitHub Actions workflow using <a href="https://github.com/ocaml/setup-ocaml">setup-ocaml</a> is the following<sup id="fnref:diff-test"><a href="#fn:diff-test" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
  <span class="na">pull_request</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">lower-bounds</span><span class="pi">:</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">os</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">ubuntu-latest</span>
        <span class="na">ocaml-compiler</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">4.14.2</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.os }}</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up OCaml ${{ matrix.ocaml-compiler }}</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">ocaml/setup-ocaml@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">ocaml-compiler</span><span class="pi">:</span> <span class="s">${{ matrix.ocaml-compiler }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Downgrade dependencies</span>
        <span class="c1"># Option 1: optimize for removing packages and downgrades (like opam-ci)</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install --solver=builtin-0install --criteria="+removed,+count[version-lag,solution]" . --deps-only --with-test</span>
        <span class="c1"># Option 2: don't optimize for removing packages, only downgrades (unlike opam-ci); will also remove depopts</span>
        <span class="c1"># run: opam install --solver=builtin-0install --criteria="+count[version-lag,solution]" . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune build</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Test</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune runtest</span>
</code></pre></div></div> <h3 id="with-opam-0install-legacy">With opam-0install (legacy)</h3> <p>In fact, you can by using <a href="https://github.com/ocaml-opam/opam-0install-solver">opam-0install</a> and its <code class="language-plaintext highlighter-rouge">--prefer-oldest</code> argument to downgrade the dependencies to their lower bounds<sup id="fnref:diff-remove"><a href="#fn:diff-remove" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>. The complete GitHub Actions workflow using <a href="https://github.com/ocaml/setup-ocaml">setup-ocaml</a> is the following (replace <code class="language-plaintext highlighter-rouge">MY_PACKAGE</code> with your package name)<sup id="fnref:diff-test:1"><a href="#fn:diff-test" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
  <span class="na">pull_request</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">lower-bounds</span><span class="pi">:</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">os</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">ubuntu-latest</span>
        <span class="na">ocaml-compiler</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">4.14.2</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.os }}</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up OCaml ${{ matrix.ocaml-compiler }}</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">ocaml/setup-ocaml@v2</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">ocaml-compiler</span><span class="pi">:</span> <span class="s">${{ matrix.ocaml-compiler }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install . --deps-only --with-test</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install opam-0install</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install opam-0install</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Downgrade dependencies</span>
        <span class="c1"># Option 1: allow OCaml version downgrade</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam install --unlock-base $(opam exec -- opam-0install --prefer-oldest --with-test MY_PACKAGE)</span>
        <span class="c1"># Option 2: forbid OCaml version downgrade (specify ocaml-base-compiler again to prevent it from being downgraded)</span>
        <span class="c1"># run: opam install $(opam exec -- opam-0install --prefer-oldest --with-test MY_PACKAGE ocaml-base-compiler.${{ matrix.ocaml-compiler }})</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune build</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Test</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">opam exec -- dune runtest</span>
</code></pre></div></div> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:diff-test"> <p>Unlike opam-ci, which doesn’t actually test with lower bounds, this is stronger and uses <code class="language-plaintext highlighter-rouge">--with-test</code>. <a href="#fnref:diff-test" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:diff-test:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:diff-remove"> <p>Unlike opam-ci, this will not attempt to remove packages, but just downgrade them. <a href="#fnref:diff-remove" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming"/><category term="ocaml"/><category term="github"/><category term="ci"/><category term="tutorial"/><summary type="html"><![CDATA[When submitting OCaml packages to the opam package repository, opam-ci runs extensive checks on the package submitted by the pull request. Most of these checks are very standard, involving building and testing the package on various OCaml versions, Linux distributions and OCaml compiler variants. In addition to all of that, there are checks called “lower-bounds”.]]></summary></entry><entry><title type="html">Training git rerere</title><link href="https://sim642.eu/blog/2021/10/16/training-git-rerere/" rel="alternate" type="text/html" title="Training git rerere"/><published>2021-10-16T00:00:00+00:00</published><updated>2021-10-16T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/10/16/training-git-rerere</id><content type="html" xml:base="https://sim642.eu/blog/2021/10/16/training-git-rerere/"><![CDATA[<p>One advanced git feature is <a href="https://git-scm.com/docs/git-rerere"><code class="language-plaintext highlighter-rouge">git rerere</code></a>, which helps with resolving the same merge conflicts multiple times, e.g. when regularly rebasing a branch. 
In short, when activated, git will record the conflicts before and after you manually resolve them. During future conflicts, it will automatically try to apply the recorded resolutions, so you don’t have to manually resolve the same conflict again, which is error-prone.</p> <p>The feature can be enabled for a repository with</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config rerere.enabled true
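# or, if you want the feature in every repository at once, enable it globally:
git config --global rerere.enabled true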
</code></pre></div></div> <h2 id="training">Training</h2> <p>When you first enable it on an existing repository which contains merge conflict resolutions, git rerere won’t automatically know about them because they haven’t been recorded while the feature is active.</p> <p>Luckily, there is a useful <a href="https://github.com/git/git/blob/master/contrib/rerere-train.sh"><code class="language-plaintext highlighter-rouge">rerere-train.sh</code></a> script to automatically record past merge conflict resolutions into its database. On Ubuntu this is also included in the <code class="language-plaintext highlighter-rouge">git</code> apt package at <code class="language-plaintext highlighter-rouge">/usr/share/doc/git/contrib/rerere-train.sh</code> (although without the execute bit set).</p> <p>After enabling git rerere, you can use the script to learn conflict resolutions from a range of commits <code class="language-plaintext highlighter-rouge">commit1..commit2</code> using the following command:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash /usr/share/doc/git/contrib/rerere-train.sh commit1..commit2
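# for example, to learn from all merges made on a branch since a release
# (v1.0 and main are hypothetical names here):
# bash /usr/share/doc/git/contrib/rerere-train.sh v1.0..main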
</code></pre></div></div> <p>If only one commit is given instead of a range, the script will train git rerere on the entire history back to the initial commit, which is probably undesired and useless.</p>]]></content><author><name></name></author><category term="programming"/><category term="git"/><category term="tutorial"/><summary type="html"><![CDATA[One advanced git feature is git rerere, which helps with resolving the same merge conflicts multiple times, e.g. when regularly rebasing a branch. In short, when activated, git will record the conflicts before and after you manually resolve them. During future conflicts, it will automatically try to apply the recorded resolutions, so you don’t have to manually resolve the same conflict again, which is error-prone.]]></summary></entry><entry><title type="html">Program source code as machine learning data</title><link href="https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data/" rel="alternate" type="text/html" title="Program source code as machine learning data"/><published>2021-06-13T00:00:00+00:00</published><updated>2021-06-14T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data</id><content type="html" xml:base="https://sim642.eu/blog/2021/06/13/program-source-code-as-machine-learning-data/"><![CDATA[<p><em>This blog post is part of my project in <a href="https://courses.cs.ut.ee/2021/nn/spring/Main/HomePage">the Neural Networks course at University of Tartu</a>.</em></p> <p>Nowadays machine learning, and neural networks specifically, are used to solve a wide spectrum of tasks. Besides simple real-valued vectors as data, successful techniques and architectures have been developed to also work with images, natural language and audio. 
As a programmer and a programming languages enthusiast, the obvious question to me is how program source code can be input into a neural network to solve tasks which would benefit us, the programmers.</p> <p>First, I will describe what differentiates source code from the other mentioned types of input. Second, I will explain the path-based representation for inputting code into neural networks. Third, I will give a short summary of <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a>, which are based on this representation. Last, I will discuss some of their limitations.</p> <p>This gives an overview of the following articles:</p> <ol> <li>A General Path-Based Representation for Predicting Program Properties<sup id="fnref:path-paper"><a href="#fn:path-paper" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>,</li> <li>Code2Vec: Learning Distributed Representations of Code<sup id="fnref:code2vec-paper"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>,</li> <li>code2seq: Generating Sequences from Structured Representations of Code<sup id="fnref:code2seq-paper"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>.</li> </ol> <h2 id="how-source-code-is-represented">How is source code represented?</h2> <p>Plainly put, source code is just text, so natural language is quite related. The symbols of both form lexical tokens and the token sequences of both contain grammatical structure, which can be represented by syntax trees.
So naturally, one could take the token sequence of source code, as given by an existing <strong>lexer</strong>, and feed it into a recurrent neural network (<abbr title="Recurrent neural network">RNN</abbr>) just like a token (word) sequence of text.</p> <p>Natural language processing (<abbr title="Natural language processing">NLP</abbr>) techniques have been developed for decades and are very successful, so this likely isn’t too bad. Unfortunately, it completely ignores the inherent structure of source code. Thus, to be successful, such a neural network would have to learn the structure from scratch by example from the training data. Not only may this be inaccurate, it also requires a significant amount of training data and time.</p> <p>Unlike natural language, where ambiguities are possible, code has a very strictly defined structure which is enforced by a <strong>parser</strong>. Such parsers are fully precise and very fast at extracting the structure of code in the form of a <strong>syntax tree</strong>. Although a similar thing exists in <abbr title="Natural language processing">NLP</abbr>, it’s a whole separate (machine learning) task on its own.</p> <p>Therefore, instead of requiring a machine learning model to learn the structure of programs, we can just get it using a perfect parser. The result is a syntax tree or, after some additional processing, an <strong>abstract syntax tree</strong> (<abbr title="Abstract syntax tree">AST</abbr>), which omits certain irrelevant details. ASTs are the fundamental representation of source code used by interpreters, compilers, program analyzers, etc.
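</p> <p>For a concrete (if toy) illustration in Python, the standard <code class="language-plaintext highlighter-rouge">ast</code> module gives us this structure for free; a rough analogue of the Java example below would be:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ast

# The parser recovers the full structure; no learning involved.
source = "def f(target):\n    return any(elem == target for elem in elements)"
tree = ast.parse(source)

# Pretty-print the function's AST for inspection (Python 3.9+).
print(ast.dump(tree.body[0], indent=2))
</code></pre></div></div> <p>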
Hence, it makes sense to use the structure provided as an <abbr title="Abstract syntax tree">AST</abbr>.</p> <h3 id="example">Example</h3> <p>For example, consider the following Java method, which checks if a list contains an element:</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">boolean</span> <span class="nf">f</span><span class="o">(</span><span class="nc">Object</span> <span class="n">target</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">for</span> <span class="o">(</span><span class="nc">Object</span> <span class="nl">elem:</span> <span class="k">this</span><span class="o">.</span><span class="na">elements</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">elem</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="n">target</span><span class="o">))</span> <span class="o">{</span>
            <span class="k">return</span> <span class="kc">true</span><span class="o">;</span>
        <span class="o">}</span>
    <span class="o">}</span>
    <span class="k">return</span> <span class="kc">false</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div> <p>It has the following <abbr title="Abstract syntax tree">AST</abbr> (ignore the colors and numbers for now)<sup id="fnref:code2vec-paper:1"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/example-ast-480.webp 480w,/assets/program-source-code-as-machine-learning-data/example-ast-800.webp 800w,/assets/program-source-code-as-machine-learning-data/example-ast-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/example-ast.png" class="img-fluid rounded" width="100%" height="auto" alt="AST of the example Java method" title="AST of the example Java method" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h2 id="how-to-input-trees-into-neural-networks">How to input trees into neural networks?</h2> <p>Given an <abbr title="Abstract syntax tree">AST</abbr> for a program, the problem is far from solved because mainstream techniques don’t take input in the form of trees. RNNs allow a network to consume linear input of varying length, but trees are nonlinear, have varying size and varying number of children nodes. Although techniques for tree input have been proposed, they won’t be considered here.</p> <p>Instead, the people behind <a href="https://code2vec.org/">code2vec</a> suggest the following: don’t input the tree directly, but input paths, which are linear. This works as follows<sup id="fnref:path-paper:1"><a href="#fn:path-paper" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>:</p> <ol> <li>Pick any two leaves of the <abbr title="Abstract syntax tree">AST</abbr>.</li> <li>Represent them by their tokens including their data (e.g. 
variable name, integer value, etc).</li> <li>Trace the <em>unique</em> path which connects them in the <abbr title="Abstract syntax tree">AST</abbr>.</li> <li>Represent the path as a sequence of <abbr title="Abstract syntax tree">AST</abbr> intermediate node types delimited by up/down arrows, which indicate whether the path moves up or down the tree.</li> <li>Form a <strong>path-context</strong> triple: (first leaf, connecting path, second leaf).</li> </ol> <p>A single path-context represents only one part of the <abbr title="Abstract syntax tree">AST</abbr>, and by extension only one part of the code. To represent it all, just construct the bag/set of all path-contexts between all pairs of leaves.</p> <p>Although a path-context can be constructed for every pair of leaves, it may be too inefficient and not really necessary. A limited number of paths (<a href="https://code2vec.org/">code2vec</a> uses 200) may be randomly sampled or a maximum length limit may be set.</p> <h3 id="example-1">Example</h3> <p>In the previous example, consider the red path connecting the leaves <code class="language-plaintext highlighter-rouge">elements</code> and <code class="language-plaintext highlighter-rouge">true</code>. Its representation as a path-context is the following triple:</p> <p>(<code class="language-plaintext highlighter-rouge">elements</code>, Name ↑ FieldAccess ↑ Foreach ↓ Block ↓ IfStmt ↓ Block ↓ Return ↓ BooleanExpr, <code class="language-plaintext highlighter-rouge">true</code>).</p> <p>Reading this from left to right while following the arrows, it encodes a notable amount of structural information: <code class="language-plaintext highlighter-rouge">elements</code> is a field, which is iterated over using a foreach loop, which checks something for each element and if true, returns <code class="language-plaintext highlighter-rouge">true</code>.</p> <p>This single path-context already captures the key idea of this method. 
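</p> <p>To make the construction concrete, the five steps above can be sketched in a few lines of Python over a hand-rolled toy AST (the <code class="language-plaintext highlighter-rouge">Node</code> class and node kinds here are made up for illustration; real extractors are far more elaborate):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Node:
    def __init__(self, kind, token=None, children=()):
        self.kind, self.token, self.children = kind, token, list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def path_context(a, b):
    # Ancestors of leaf a, from a itself up to the root.
    anc = [a]
    while anc[-1].parent:
        anc.append(anc[-1].parent)
    # Climb from leaf b until hitting a's ancestor chain: that node is the
    # lowest common ancestor, through which the unique connecting path runs.
    down, n = [], b
    while n not in anc:
        down.append(n)
        n = n.parent
    up = anc[:anc.index(n) + 1]  # from a up to and including the LCA
    path = " ↑ ".join(x.kind for x in up)
    for x in reversed(down):     # back down from the LCA to b
        path += " ↓ " + x.kind
    return (a.token, path, b.token)

# Toy fragment of the example AST (intermediate levels omitted).
elements = Node("Name", token="elements")
true_ = Node("BooleanExpr", token="true")
root = Node("Foreach", children=[
    Node("FieldAccess", children=[elements]),
    Node("Return", children=[true_]),
])

print(path_context(elements, true_))
# ('elements', 'Name ↑ FieldAccess ↑ Foreach ↓ Return ↓ BooleanExpr', 'true')
</code></pre></div></div> <p>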
Other important parts of this <abbr title="Abstract syntax tree">AST</abbr> are captured by the blue, green and yellow paths in the previous figure.</p> <h2 id="how-code2vec-works">How does code2vec work?</h2> <p>Based on the bag of path-contexts representation of source code, <a href="https://code2vec.org/">code2vec</a><sup id="fnref:code2vec-paper:2"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> does for code what word2vec etc. do for natural language: represent the input as a single dense fixed-size embedding vector. As with word2vec, the goal for the vectors is to somehow capture their semantic properties, such that close vectors correspond to semantically similar inputs, and far vectors to semantically dissimilar ones. These are also known as <strong>distributed representations</strong>.</p> <p>As word2vec has proven in <abbr title="Natural language processing">NLP</abbr>, such representations of the input are extremely useful for downstream machine learning tasks, which then don’t have to relearn all the semantic properties themselves. Hence <a href="https://code2vec.org/">code2vec</a> does this for source code, so that code can be input into the downstream neural network like any other vector.</p> <h3 id="training">Training</h3> <p>In order to do this, the neural network part of <a href="https://code2vec.org/">code2vec</a> takes as input the bag of path-contexts (which are extracted from the source code, as explained above). To train the embedding and have something to optimize, the downstream task of predicting the method name is used.
This is apparently one of the most challenging tasks on source code, and at the same time the method name should capture the semantics of its code.</p> <p>The following neural network is thus trained<sup id="fnref:code2vec-paper:3"><a href="#fn:code2vec-paper" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/code2vec-architecture-480.webp 480w,/assets/program-source-code-as-machine-learning-data/code2vec-architecture-800.webp 800w,/assets/program-source-code-as-machine-learning-data/code2vec-architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/code2vec-architecture.png" class="img-fluid rounded" width="100%" height="auto" alt="code2vec neural network architecture" title="code2vec neural network architecture" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>In order, the layers work as follows:</p> <ol> <li>Each path-context triple consists of two tokens and a path. A simple embedding maps the tokens to token vectors and a simple embedding maps the entire path to a path vector. These simple embeddings are trained simultaneously with the network. The three vectors are then concatenated into a <strong>context vector</strong>.</li> <li>Each context vector is reduced in dimensionality via a fully-connected layer (with <em>tanh</em> activation) into a <strong>combined context vector</strong>.</li> <li>The bag of combined context vectors is aggregated into a single <strong>code vector</strong>, which is the desired distributed representation. A global attention weight vector is used to compute a weight for each combined context vector by dot product. This attention weight vector is trained simultaneously with the network.
Then the combined context vectors are aggregated into a code vector using the (normalized) attention weights. Attention allows the network to choose which of the many incoming path-contexts are more important and which less important.</li> <li>Finally, softmax is used to predict the method name from the code vector.</li> </ol> <p>In the above example program and <abbr title="Abstract syntax tree">AST</abbr>, the latter shows the top four path-contexts by their attention weight:</p> <ol> <li>red path from <code class="language-plaintext highlighter-rouge">elements</code> to <code class="language-plaintext highlighter-rouge">true</code> (weight: 0.23),</li> <li>blue path from <code class="language-plaintext highlighter-rouge">target</code> to <code class="language-plaintext highlighter-rouge">false</code> (weight: 0.14),</li> <li>green path from <code class="language-plaintext highlighter-rouge">boolean</code> to <code class="language-plaintext highlighter-rouge">?</code> (the actual method name is removed to make prediction non-trivial) (weight: 0.09),</li> <li>yellow path from <code class="language-plaintext highlighter-rouge">Object</code> to <code class="language-plaintext highlighter-rouge">target</code> (weight: 0.07).</li> </ol> <p>Many other path-contexts also exist here, but they are already too insignificant according to attention. The use of attention also improves the explainability of the model.</p> <h3 id="results">Results</h3> <p>This <a href="https://code2vec.org/">code2vec</a> network achieves state-of-the-art performance in the method name prediction task. It shows that both the path-context–based representation and the learned distributed representation are very useful for working with source code.</p> <p>The final softmax layer can be chopped off from the above network to simply get the distributed embeddings for other tasks.
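</p> <p>The attention aggregation of the third layer can be sketched in a few lines of numpy (an illustrative sketch with made-up dimensions, not the authors’ implementation):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)

contexts = rng.normal(size=(200, 128))   # bag of combined context vectors
attention = rng.normal(size=128)         # trained global attention vector

scores = contexts @ attention            # dot-product score per context
weights = np.exp(scores - scores.max())  # numerically stable softmax
weights /= weights.sum()
code_vector = weights @ contexts         # weighted sum: the code vector

assert code_vector.shape == (128,)
</code></pre></div></div> <p>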
Using it still requires the preprocessing of parsing the code, constructing the <abbr title="Abstract syntax tree">AST</abbr> and extracting the bag of path-contexts from it.</p> <p>Famously, word2vec embeddings exhibit beautiful relationships between the semantics of words and their vectors. Surprisingly, this is also the case for <a href="https://code2vec.org/">code2vec</a>! For example, adding the embedding vectors for the <code class="language-plaintext highlighter-rouge">equals</code> and <code class="language-plaintext highlighter-rouge">toLowerCase</code> methods gives a vector closest to the embedding of <code class="language-plaintext highlighter-rouge">equalsIgnoreCase</code>. Many other similar examples were noticed by the authors and they can be explored on the <a href="https://code2vec.org/">code2vec website</a>. This empirically demonstrates that the learned distributed embeddings indeed capture some semantic properties of source code, which means that the trained <a href="https://code2vec.org/">code2vec</a> network should also work well for tasks other than predicting method names.</p> <h2 id="how-code2seq-works">How does code2seq work?</h2> <p>The follow-up work <a href="https://code2seq.org/">code2seq</a><sup id="fnref:code2seq-paper:1"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> improves upon <a href="https://code2vec.org/">code2vec</a> in a number of ways.
It is still based on the bag of path-contexts extracted from the <abbr title="Abstract syntax tree">AST</abbr> and uses (combined) context vectors in the middle of the network.</p> <p>The following neural network is used<sup id="fnref:code2seq-paper:2"><a href="#fn:code2seq-paper" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/program-source-code-as-machine-learning-data/code2seq-architecture-480.webp 480w,/assets/program-source-code-as-machine-learning-data/code2seq-architecture-800.webp 800w,/assets/program-source-code-as-machine-learning-data/code2seq-architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/program-source-code-as-machine-learning-data/code2seq-architecture.png" class="img-fluid rounded" width="100%" height="auto" alt="code2seq neural network architecture" title="code2seq neural network architecture" data-zoomable="" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Differences from <a href="https://code2vec.org/">code2vec</a> are the following:</p> <ol> <li>Tokens like <code class="language-plaintext highlighter-rouge">ArrayList</code> are decomposed into <strong>subtokens</strong> like <code class="language-plaintext highlighter-rouge">Array</code> and <code class="language-plaintext highlighter-rouge">List</code>, which are embedded separately and summed. This allows the network to better exploit naming schemes present in source code.</li> <li>Paths are not embedded directly and monolithically, but instead using a <strong>bidirectional <abbr title="Long short-term memory">LSTM</abbr></strong> layer. A simple embedding is used for the individual path components separately. 
This is highly preferable to embedding the varying-length paths monolithically, because such an embedding would be quite sparse and would not make use of state-of-the-art RNNs.</li> <li>No single code vector is constructed, so <strong>no distributed representation</strong> can be extracted for downstream tasks!</li> <li>The network is still trained to predict method names, but now as a sequence of subtokens, e.g. <code class="language-plaintext highlighter-rouge">equals</code>, <code class="language-plaintext highlighter-rouge">Ignore</code>, <code class="language-plaintext highlighter-rouge">Case</code> for the name <code class="language-plaintext highlighter-rouge">equalsIgnoreCase</code>. It uses a decoder with attention, which attends over all the individual combined context vectors at every step.</li> </ol> <h2 id="how-limited-are-the-models">How limited are the models?</h2> <p>Both <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a> provide their trained models (for Java code) for download. My initial goal was to reuse these (e.g. by extracting distributed representations) and apply them to some other task which has source code as input, for example related to teaching duties of the Laboratory of Software Science. Unfortunately, this turned out to be more complicated than expected.</p> <p><strong>Firstly</strong>, both trained models only handle the source code of a single Java method, not an entire file, which contains a class with likely many methods, among other things. Although a file/class can be split up into methods to pass into a model separately, this would be very limited. The same class may implement all the logic in a single method or have it nicely organized into multiple methods.
They would semantically be the same, but comparing a single distributed representation with multiple is not meaningful and loses the semantic properties of the embedding.</p> <p>This is not a restriction of the path-based representation because the <abbr title="Abstract syntax tree">AST</abbr> of the entire file/class could be used instead. But new models would have to be trained from scratch. Not only that, the training task itself would have to change from method name prediction to maybe class name prediction (which usually doesn’t capture the semantics of all the methods in the class) or something else. And of course the new training task would still need to be such that the learned distributed representations have semantic properties.</p> <p><strong>Secondly</strong>, while <a href="https://code2vec.org/">code2vec</a> provides distributed representations of code suitable for downstream tasks, the more advanced <a href="https://code2seq.org/">code2seq</a> architecture and model do not. Rather, the final output is a sequence and the layer before that is just the entire bag of combined context vectors. Therefore, it is possible to combine the best of both worlds:</p> <ol> <li>Use the first part of <a href="https://code2seq.org/">code2seq</a>, which uses a <abbr title="Recurrent neural network">RNN</abbr> for path embedding.</li> <li>Use the middle part of <a href="https://code2vec.org/">code2vec</a>, which aggregates the combined context vectors into a single distributed representation using attention.</li> </ol> <p><strong>Thirdly</strong>, there is not enough labelled data available for semantic downstream tasks. Although there is a lot of source code data available on GitHub (<a href="https://code2vec.org/">code2vec</a> used 32GB, <a href="https://code2seq.org/">code2seq</a> used 125GB), there are no labels for semantic properties. Both of these models were just trained to predict method names, something which syntactically already exists in the same code. 
It is more likely that the distributed representations could be used in unsupervised tasks instead, e.g. detecting similar but not duplicate code by clustering.</p> <h2 id="how-to-continue-from-here">How to continue from here?</h2> <p>Path-based representations are a promising generic and practical means of inputting source code into neural networks. Although introduced by <a href="https://code2vec.org/">code2vec</a> and <a href="https://code2seq.org/">code2seq</a>, others have started using them as well. Notably, JetBrains Research has developed <a href="https://github.com/JetBrains-Research/astminer">astminer</a> for extracting path-contexts from various languages (and it can be extended further) and used them to build representations of authors’ coding style<sup id="fnref:codestyle-paper"><a href="#fn:codestyle-paper" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</p> <p>Instead of an <abbr title="Abstract syntax tree">AST</abbr>, another idea would be to extract such path-contexts from a control-flow graph (<abbr title="Control-flow graph">CFG</abbr>). The control-flow of a method is more closely related to its runtime behavior and semantics. It would even be possible to inline method calls with CFGs to construct paths spanning across methods, regardless of how well the code is structured. Although parsers exist for all programming languages, control-flow graph generators are almost impossible to come by because most languages have complex features, which significantly impact the flow, e.g. exceptions. Moreover, CFGs can only be constructed for executable code like method bodies, but not class fields or type declarations.</p> <hr/> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:path-paper"> <p>Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav. A General Path-Based Representation for Predicting Program Properties. PLDI 2018. URL: <a href="https://doi.org/10.1145/3192366.3192412">https://doi.org/10.1145/3192366.3192412</a>.
<a href="#fnref:path-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:path-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:code2vec-paper"> <p>Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav. Code2Vec: Learning Distributed Representations of Code. POPL 2019. URL: <a href="https://doi.org/10.1145/3290353">https://doi.org/10.1145/3290353</a>. <a href="#fnref:code2vec-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:code2vec-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:code2vec-paper:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:code2vec-paper:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p> </li> <li id="fn:code2seq-paper"> <p>Uri Alon, Shaked Brody, Omer Levy, Eran Yahav. code2seq: Generating Sequences from Structured Representations of Code. ICLR 2019. URL: <a href="https://openreview.net/forum?id=H1gKYo09tX">https://openreview.net/forum?id=H1gKYo09tX</a>. <a href="#fnref:code2seq-paper" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:code2seq-paper:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:code2seq-paper:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:codestyle-paper"> <p>Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin, Alberto Bacchelli. Building Implicit Vector Representations of Individual Coding Style. ICSEW 2020. URL: <a href="https://arxiv.org/abs/2002.03997">https://arxiv.org/abs/2002.03997</a>. 
<a href="#fnref:codestyle-paper" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="programming languages"/><category term="machine learning"/><category term="neural networks"/><category term="university"/><summary type="html"><![CDATA[This blog post is part of my project in the Neural Networks course at University of Tartu.]]></summary></entry><entry><title type="html">My GitHub Mars 2020 Helicopter Contributor badge</title><link href="https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge/" rel="alternate" type="text/html" title="My GitHub Mars 2020 Helicopter Contributor badge"/><published>2021-04-30T00:00:00+00:00</published><updated>2021-04-30T00:00:00+00:00</updated><id>https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge</id><content type="html" xml:base="https://sim642.eu/blog/2021/04/30/my-github-mars-2020-helicopter-contributor-badge/"><![CDATA[<p>On 19 April 2021 I saw a link titled <a href="https://daniel.haxx.se/blog/2021/04/19/mars-2020-helicopter-contributor/">“Mars 2020 Helicopter Contributor”</a> on /r/programming. It’s a blog post by Daniel Stenberg, the lead developer of curl, about him getting a special badge on his GitHub profile because curl was used for the Mars 2020 Helicopter Mission.</p> <p>As described on <a href="https://github.blog/2021-04-19-open-source-goes-to-mars/">the GitHub Blog</a>, this isn’t just limited to such high-profile open-source contributors but “nearly 12000” GitHub users got the badge. It is determined by having contributed to particular open-source projects before particular versions, with the full list available in <a href="https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-github-profile/customizing-your-profile/personalizing-your-profile#list-of-qualifying-repositories-for-mars-2020-helicopter-contributor-achievement">GitHub documentation</a>. 
Most of these seem to be Python projects, but one non-Python project caught my eye: <a href="https://github.com/opencv/opencv">OpenCV</a>.</p> <p>Immediately I just had to check <a href="https://github.com/sim642">my GitHub profile</a> and to my great surprise I also got the “Mars 2020 Helicopter Contributor” badge:</p> <p><img src="/assets/my-github-mars-2020-helicopter-contributor-badge.png" alt="Screenshot of my GitHub Mars 2020 Helicopter Contributor badge"/></p> <p>I’m honored to be one of the 12000, but my single contribution is <em>very likely unrelated</em> to the NASA mission: <a href="https://github.com/opencv/opencv/pull/7751">“Allow V4L, V4L2 to be used as preferred capture API”</a>. As far as I remember, I encountered and fixed the issue while working on a Robotex soccer robot called <a href="https://github.com/sim642/Cryptex">Cryptex</a>, which relied on OpenCV for its cameras and vision. So, no, I don’t secretly work for NASA.</p>]]></content><author><name></name></author><category term="programming"/><category term="open source"/><category term="personal"/><category term="github"/><summary type="html"><![CDATA[On 19 April 2021 I saw a link titled “Mars 2020 Helicopter Contributor” on /r/programming. It’s a blog post by Daniel Stenberg, the lead developer of curl, about him getting a special badge on his GitHub profile because curl was used for the Mars 2020 Helicopter Mission.]]></summary></entry></feed>