DOI to Bib(La)TeX – a misery

My PhD-thesis–to–be combines 8 papers from the last 5 years. Their Bib(La)TeX bibliography entries vary widely in quality and style. I would like some consistency, but that’s quite an effort to achieve across 234 entries. So I was wondering whether there’s a good-quality, consistent source from which I could (ideally automatically) update the entries via their DOIs.

Services

Let’s look at a bunch of services for getting BibTeX entries by DOI. I’ll use the DOI 10.1007/978-3-031-50524-9_4 (one of my papers) as the example.

Click on each tab to see the BibTeX entry from each service and my comments about it.

  •  @inbook{Saan_2023, title={Correctness Witness Validation by Abstract Interpretation}, ISBN={9783031505249}, ISSN={1611-3349}, url={http://dx.doi.org/10.1007/978-3-031-50524-9_4}, DOI={10.1007/978-3-031-50524-9_4}, booktitle={Verification, Model Checking, and Abstract Interpretation}, publisher={Springer Nature Switzerland}, author={Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}, year={2023}, month=dec, pages={74–97} }
    

    This is returned by DOI Content Negotiation, which simply means making an HTTP(S) request to the usual DOI URL https://doi.org/10.1007/978-3-031-50524-9_4 but with the Accept: application/x-bibtex HTTP header, i.e.

    curl -LH "Accept: application/x-bibtex" https://doi.org/10.1007/978-3-031-50524-9_4
    

    For this particular DOI, this actually delegates to the Crossref API at

    curl -L https://api.crossref.org/works/10.1007/978-3-031-50524-9_4/transform/application/x-bibtex
    

    Comments

    1. The entry type is @inbook, although @inproceedings would be more precise for this work.
    2. The url field has value http://dx.doi.org/10.1007/978-3-031-50524-9_4. There are two things wrong with that:
      1. It’s HTTP, not HTTPS.
      2. It uses dx.doi.org, not just doi.org.

      Neither plain HTTP nor the dx.doi.org host is the preferred form any more, yet the official DOI metadata service doesn’t follow its own recommendations.

    3. The booktitle field is actually not specified for @inbook in BibTeX. It is specified for @inproceedings, so it really should be that. In BibLaTeX, booktitle is also specified for @inbook but only because BibLaTeX gives @inbook a slightly different meaning than BibTeX.
    4. The whole result is on one line (fine) and has a spurious single space at the beginning (which is odd).
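
The request and the two URL issues from comment 2 can be sketched in Python using only the standard library. The function names `build_bibtex_request`, `fetch_bibtex` and `normalize_doi_url` are made up for this illustration, and the rewrite rule simply assumes that https://doi.org is the form to normalize to, as argued above.

```python
import re
import urllib.request


def build_bibtex_request(doi: str) -> urllib.request.Request:
    """Build the same request as the curl command: DOI URL plus Accept header."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi: str) -> str:
    """Perform the request (needs network access); urlopen follows redirects."""
    with urllib.request.urlopen(build_bibtex_request(doi)) as response:
        # strip() also drops the spurious leading space mentioned in comment 4.
        return response.read().decode("utf-8").strip()


def normalize_doi_url(bibtex: str) -> str:
    """Rewrite http://dx.doi.org/... to https://doi.org/... anywhere in the entry."""
    return re.sub(r"https?://(?:dx\.)?doi\.org/", "https://doi.org/", bibtex)
```

Calling `normalize_doi_url(fetch_bibtex("10.1007/978-3-031-50524-9_4"))` would then fix the url field, but none of the other issues listed above.
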
  •  @misc{Saan_Schwarz_Erhard_Seidl_Tilscher_Vojdani_2023, title={Correctness Witness Validation by Abstract Interpretation}, url={http://dx.doi.org/10.1007/978-3-031-50524-9_4}, DOI={10.1007/978-3-031-50524-9_4}, journal={Lecture Notes in Computer Science}, publisher={Springer Nature Switzerland}, author={Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal}, year={2023}, month=dec, pages={74–97}, language={en} }
    

    This is returned by the DOI Citation Formatter for the style bibtex, which can also be accessed through an API:

    curl 'https://citation.doi.org/format?doi=10.1007%2F978-3-031-50524-9_4&style=bibtex&lang=en-US'
    

    Comments

    It’s quite similar to the previous one from DOI Content Negotiation, but objectively worse:

    1. The entry type is now just @misc.
    2. The booktitle field is missing (it’s not specified for @misc anyway), and the title “Verification, Model Checking, and Abstract Interpretation” isn’t in any other field either.
    3. The journal field is now present (it’s not specified for @misc either!) and has value “Lecture Notes in Computer Science”, which isn’t a journal but a book series (which would belong in the series field, were the entry type not @misc).
  • @inbook{Saan2023,
      title = {Correctness Witness Validation by Abstract Interpretation},
      ISBN = {9783031505249},
      ISSN = {1611-3349},
      url = {http://dx.doi.org/10.1007/978-3-031-50524-9_4},
      DOI = {10.1007/978-3-031-50524-9_4},
      booktitle = {Verification,  Model Checking,  and Abstract Interpretation},
      publisher = {Springer Nature Switzerland},
      author = {Saan,  Simmo and Schwarz,  Michael and Erhard,  Julian and Seidl,  Helmut and Tilscher,  Sarah and Vojdani,  Vesal},
      year = {2023},
      month = dec,
      pages = {74–97}
    }
    

    This is returned by doi2bib at https://www.doi2bib.org/bib/10.1007/978-3-031-50524-9_4. doi2bib is just a browser frontend for DOI Content Negotiation and performs client-side reformatting. As far as I have seen, many other tools also rely on DOI Content Negotiation under the hood.

    Comments

    It has all the issues of DOI Content Negotiation and only the following differences:

    1. The formatting is generally more human-friendly.
    2. The formatting adds double spaces after commas in field values. This shouldn’t affect Bib(La)TeX, but is odd nevertheless.
  • @InProceedings{10.1007/978-3-031-50524-9_4,
    author="Saan, Simmo
    and Schwarz, Michael
    and Erhard, Julian
    and Seidl, Helmut
    and Tilscher, Sarah
    and Vojdani, Vesal",
    editor="Dimitrova, Rayna
    and Lahav, Ori
    and Wolff, Sebastian",
    title="Correctness Witness Validation by Abstract Interpretation",
    booktitle="Verification, Model Checking, and Abstract Interpretation",
    year="2024",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="74--97",
    abstract="Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7{\%} to 47{\%} in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own.",
    isbn="978-3-031-50524-9"
    }
    

    This is returned by the “Download citation (.BIB)” feature of Springer Link which the particular DOI points to:

    curl 'https://citation-needed.springer.com/v2/references/10.1007/978-3-031-50524-9_4?format=bibtex&flavour=citation'
    

    Comments

    1. This is completely different from the previous ones based on DOI Content Negotiation. I guess that’s because those actually come from Crossref’s database, while this one comes from Springer’s own database, but as a user I shouldn’t have to know or care. It’s still Springer that submits the data to Crossref, and the DOI URL itself redirects to Springer under normal conditions (a standard HTTP request).
    2. The entry type is @InProceedings, which is more accurate than all the previous ones.
    3. The doi field is missing. The DOI is in the entry key, but that doesn’t make the DOI show up in a Bib(La)TeX bibliography.
    4. The url field is also missing. Thus, there would be no digital reference in a rendered bibliography.
    5. The formatting is multiline, but not indented.
  • @inproceedings{10.1007/978-3-031-50524-9_4,
    author = {Saan, Simmo and Schwarz, Michael and Erhard, Julian and Seidl, Helmut and Tilscher, Sarah and Vojdani, Vesal},
    title = {Correctness Witness Validation by Abstract Interpretation},
    year = {2024},
    isbn = {978-3-031-50523-2},
    publisher = {Springer-Verlag},
    address = {Berlin, Heidelberg},
    url = {https://doi.org/10.1007/978-3-031-50524-9_4},
    doi = {10.1007/978-3-031-50524-9_4},
    abstract = {Witnesses record automated program analysis results and make them exchangeable. To validate correctness witnesses through abstract interpretation, we introduce a novel abstract operation unassume. This operator incorporates witness invariants into the abstract program state. Given suitable invariants, the unassume operation can accelerate fixpoint convergence and yield more precise results. We demonstrate the feasibility of this approach by augmenting an abstract interpreter with unassume operators and evaluating the impact of incorporating witnesses on performance and precision. Using manually crafted witnesses, we can confirm verification results for multi-threaded programs with a reduction in effort ranging from 7\% to 47\% in CPU time. More intriguingly, we discover that using witnesses from model checkers can guide our analyzer to verify program properties that it could not verify on its own.},
    booktitle = {Verification, Model Checking, and Abstract Interpretation: 25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I},
    pages = {74–97},
    numpages = {24},
    keywords = {Correctness Witness, Witness Validation, Software Verification, Program Analysis, Abstract Interpretation},
    location = {London, United Kingdom}
    }
    

    This is returned by the “Export Citation” feature of ACM Digital Library at https://dl.acm.org/doi/10.1007/978-3-031-50524-9_4. Although the particular work is published by Springer, ACM seems to index it.

    Comments

    1. The title field value includes  , which is inappropriate for Bib(La)TeX.
    2. The publisher and address field values “Springer-Verlag” and “Berlin, Heidelberg” seem wrong because Springer itself returned “Springer Nature Switzerland” and “Cham”. (Although personally I don’t care: I would drop the address and simplify publisher to “Springer”.)
    3. The booktitle field value includes the book’s subtitle “25th International Conference, VMCAI 2024, London, United Kingdom, January 15–16, 2024, Proceedings, Part I”. In BibTeX, there’s no other way (except omitting it like in all previous services). BibLaTeX specifies the booksubtitle field, and even more appropriate ones like eventtitle, venue and eventdate (as also pointed out in this TeX StackExchange answer).
    4. The formatting is multiline, but not indented.
  • @inproceedings{DBLP:conf/vmcai/SaanSESTV24,
      author       = {Simmo Saan and
                      Michael Schwarz and
                      Julian Erhard and
                      Helmut Seidl and
                      Sarah Tilscher and
                      Vesal Vojdani},
      title        = {Correctness Witness Validation by Abstract Interpretation},
      booktitle    = {{VMCAI} {(1)}},
      series       = {Lecture Notes in Computer Science},
      volume       = {14499},
      pages        = {74--97},
      publisher    = {Springer},
      year         = {2024}
    }
    

    This is returned by the “export record (BibTeX)” feature of DBLP at https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&param=0. DBLP offers multiple BibTeX formats, this being the condensed one.

    Comments

    1. The doi field is missing and, unlike Springer, it’s not in the entry key either.
    2. The url field is also missing.
    3. The volume field value is “14499” which actually corresponds to the series “Lecture Notes in Computer Science”. This is wrong in both BibTeX and BibLaTeX: it should instead be the number field with the value “14499”.

      BibTeX specifies:

      number
      The number of […] a work in a series. […] sometimes books are given numbers in a named series.
      volume
      The volume of a journal or multivolume book.

      BibLaTeX specifies:

      number
      […] the volume/number of a book in a series.
      volume
      The volume of a multi-volume book or a periodical.

      This has also been pointed out in this TeX StackExchange answer.

    4. The booktitle field value is essentially “VMCAI (1)”, where the 1 refers to the part. The latter is what actually should go into the volume field according to the specifications above.

      Alternatively, BibLaTeX also specifies:

      part
      The number of a partial volume. This field applies to books only, not to journals. It may be used when a logical volume consists of two or more physical ones. In this case the number of the logical volume goes in the volume field and the number of the part of that volume in the part field.

      The distinction between logical and physical is a bit hazy in this case. Even Springer cannot make up their mind about the terminology:

      1. The subtitle of the book ends with “Part I”.
      2. The Springer Link page for the book has the section “Other volumes”.
      3. The “About this book” section on the same page mentions both, while starting with “The two-volume set LNCS 14499 and 14500 […]”.
    5. The formatting is the nicest of them all. Although, when copying the BibTeX code from the DBLP website, the copied text includes two empty leading and trailing lines for some reason. The empty lines are not present in the downloadable .bib file.
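
The field mix-up from comment 3 is mechanical enough to patch. Here is a minimal sketch, assuming the entry uses DBLP’s one-field-per-line layout and that a volume field next to a series field always means the number within that series. That holds for this LNCS entry but is only an assumption in general, and `series_volume_to_number` is a hypothetical helper name.

```python
import re


def series_volume_to_number(entry: str) -> str:
    """Rename the volume field to number, but only if the entry also has a series field."""
    flags = re.MULTILINE | re.IGNORECASE
    if not re.search(r"^[ \t]*series[ \t]*=", entry, flags=flags):
        # Without a series, volume likely means a journal or multivolume book: keep it.
        return entry
    # "volume" and "number" have the same length, so the column alignment is preserved.
    return re.sub(r"^([ \t]*)volume([ \t]*=)", r"\1number\2", entry, flags=flags)
```

It deliberately does not touch the part number hidden in booktitle (comment 4), which needs a human decision.
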
  • @inproceedings{DBLP:conf/vmcai/SaanSESTV24,
      author       = {Simmo Saan and
                      Michael Schwarz and
                      Julian Erhard and
                      Helmut Seidl and
                      Sarah Tilscher and
                      Vesal Vojdani},
      editor       = {Rayna Dimitrova and
                      Ori Lahav and
                      Sebastian Wolff},
      title        = {Correctness Witness Validation by Abstract Interpretation},
      booktitle    = {Verification, Model Checking, and Abstract Interpretation - 25th International
                      Conference, {VMCAI} 2024, London, United Kingdom, January 15-16, 2024,
                      Proceedings, Part {I}},
      series       = {Lecture Notes in Computer Science},
      volume       = {14499},
      pages        = {74--97},
      publisher    = {Springer},
      year         = {2024},
      url          = {https://doi.org/10.1007/978-3-031-50524-9\_4},
      doi          = {10.1007/978-3-031-50524-9\_4},
      timestamp    = {Sat, 10 Feb 2024 18:04:44 +0100},
      biburl       = {https://dblp.org/rec/conf/vmcai/SaanSESTV24.bib},
      bibsource    = {dblp computer science bibliography, https://dblp.org}
    }
    

    This is returned by the “export record (BibTeX)” feature of DBLP at https://dblp.org/rec/conf/vmcai/SaanSESTV24.html?view=bibtex&param=1. DBLP offers multiple BibTeX formats, this being the standard one.

    Comments

    It is mostly an extension of the previous one from DBLP, but with additional fields which can be (and are) treated incorrectly:

    1. The doi field value has the underscore escaped. This is unnecessary and even wrong: DOI lookup returns “DOI Not Found”.
    2. The url field value also has the underscore escaped. This is again unnecessary and even wrong: the url is broken.
    3. The booktitle field value is uncondensed, but has the same issues as with ACM.
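
Undoing the escaping from comments 1 and 2 is a simple text substitution. The sketch below assumes one field per line with a brace-delimited value, as in the DBLP output above, and only touches the doi and url fields, since an escaped underscore may be intentional in a title. `unescape_doi_and_url` is a hypothetical helper name.

```python
import re


def unescape_doi_and_url(entry: str) -> str:
    """Remove the backslash before underscores inside doi and url field values only."""

    def fix(match):
        prefix, value, suffix = match.groups()
        return prefix + value.replace("\\_", "_") + suffix

    # One field per line: "  doi = {...}," with an optional trailing comma.
    pattern = r"^([ \t]*(?:doi|url)[ \t]*=[ \t]*\{)(.*?)(\},?[ \t]*)$"
    return re.sub(pattern, fix, entry, flags=re.MULTILINE | re.IGNORECASE)
```
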

The tab content ends here. [1]


Comparison

Here’s a table to summarize some aspects of the entries returned by the services. The values I consider acceptable are in bold [2] and the values I prefer are in italics.

| Feature | DOI | DOI formatter | doi2bib | Springer | ACM | DBLP condensed | DBLP standard |
|---|---|---|---|---|---|---|---|
| Entry type | @inbook | @misc | @inbook | @InProceedings | @inproceedings | @inproceedings | @inproceedings |
| doi | Yes | Yes | Yes | No | Yes | No | Yes [3] |
| url | dx.doi.org | dx.doi.org | dx.doi.org | No | doi.org | No | doi.org [3] |
| year | 2023 | 2023 | 2023 | 2024 | 2024 | 2024 | 2024 |
| Event info | No | No | No | No | In booktitle | No | In booktitle |
| LNCS № | No | No | No | No | No | In volume | In volume |
| Book part | No | No | No | No | In booktitle | In booktitle | In booktitle |

The table also compares the year field values, which weren’t discussed above. Surprisingly, there isn’t even consensus about such a basic fact. It probably has to do with “First Online: 30 December 2023”. The Crossref data for the DOI seems to correspond to that, while Springer itself considers the publication year to be 2024, which is also when the conference took place. This just goes to show that the DOI Content Negotiation data, which gets used by many other services, may be inaccurate w.r.t. the very basics.
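
Disagreements like this are easy to surface automatically once the entries for one DOI have been collected. A small sketch follows, where `extract_year` and `group_by_year` are hypothetical helpers and the input is trimmed to the year lines of three of the entries shown above:

```python
import re
from collections import defaultdict


def extract_year(entry: str):
    """Return the four-digit value of the year field, or None if there is none."""
    match = re.search(r'\byear\s*=\s*[{"]?(\d{4})', entry, flags=re.IGNORECASE)
    return match.group(1) if match else None


def group_by_year(entries: dict) -> dict:
    """Map each reported year to the list of services that report it."""
    groups = defaultdict(list)
    for service, entry in entries.items():
        groups[extract_year(entry)].append(service)
    return dict(groups)


entries = {
    "DOI Content Negotiation": "year={2023}, month=dec,",
    "Springer": 'year="2024",',
    "DBLP": "year         = {2024}",
}
print(group_by_year(entries))
# {'2023': ['DOI Content Negotiation'], '2024': ['Springer', 'DBLP']}
```

More than one key in the result means the sources disagree and the entry needs a manual look.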

Conclusion

I learned about DOI Content Negotiation and how bad it actually is for BibTeX. The databases (Springer, ACM, DBLP) are better, but none is perfect or even good enough, as the comparison table reveals. I guess I’ll end up doing a lot of manual work, although some is semi-automatable using BibLaTeX source maps (which are a story for another time).


  1. The styling of tabs in the website theme I’m using clearly isn’t great if I have to point it out. I should fix that. 

  2. The CSS of the website theme I’m using is such that bold doesn’t work together with monospace. I should fix that. But until then, just imagine @inproceedings (and @InProceedings) being bold in the table. 

  3. Has a problem with escaping underscores.