Scraping barcodes with suffix trees

Estonia has a deposit-refund system for drink bottles and cans (like the German Pfand). It is operated by Eesti Pandipakend whose website includes the package registry¹. The registry isn’t just a list of all products whose packages are part of the system but rather a search.

Package registry search

One can search for the entire barcode of a product to see whether it’s included in the registry and some extra data about it. But one can also search for a part of the barcode to find products whose barcodes include it as a substring. However, there’s an important restriction: the search returns at most 10 products (like SQL’s LIMIT 10). Additionally, the empty substring cannot really be searched (misleadingly, it returns zero products).

Scraping problem

With the search at hand, let’s consider scraping the entire package registry, i.e. finding barcodes of all the registered packages and the extra data about them.² The main focus is on the barcodes because the extra data is a simple byproduct: when the barcode of a product is confirmed to be in the registry by seeing it in the results of some search query, then we also get its extra data and can record it on the side.

The package registry is essentially rate-limited to 1 query per second (technically, 60 queries per minute). Therefore, the goal is to minimize the number of queries needed to completely scrape the registry.

Naïve solution

Since the barcodes are numeric and (mostly) of length 13, there’s an obvious algorithm to scrape the registry: just query each complete barcode to see whether zero or one results are returned. Obviously, this algorithm is completely impractical because it requires \(10^{13}\) queries. (This could be reduced by exploiting the structure of EAN-13 barcodes but that’s not enough to make it practical.)

To further complicate things, the registry also contains small numbers of 12-digit UPC-A and 8-digit EAN-8 barcodes, which require additional queries to scrape. Furthermore, a priori there’s no indication which lengths of barcodes exist in the registry. The 10 result limit for a query means that an extra assumption is needed to make the scraping problem properly solvable at all. Namely, for every complete barcode \(b\) in the registry, the registry contains at most 10 barcodes which have \(b\) as a substring (including \(b\) itself). If this wasn’t the case, then there would be no guarantee that querying \(b\) would necessarily return \(b\). This assumption (probably severely) limits the potential search space.³

Basic online solution

The naïve algorithm, besides being impractical, has another limitation: it first fixes all the many queries to be made and then performs them all. For a feasible solution, we can use an online algorithm, which determines new queries to make based on the results of previous queries.

My first idea for such an algorithm was the following. First, query 0: if it returns less than 10 results (which is unlikely), then we immediately know all barcodes in the registry which include 0; otherwise, recursively query 00, 01, …, 09. For each of those, behave similarly: if there are less than 10 results, then stop recursing; otherwise, continue recursing to all 10 extensions of the query. And so on…. Once all of this is done, repeat the same process starting with 1, 2, …, 9. (The general approach would be to start one recursion from the empty string, but due to the empty query corner case that wouldn’t actually work.)

I have not proven this algorithm to be correct (nor defined what correctness means, yet). So if it isn’t, then this post has already gone off the rails. Comment below if you believe that’s the case!

Suffix tree solution

The basic online algorithm can be slightly improved using a (generalized) suffix tree. To this end, every barcode returned by any query (including those with at least 10 results) is added to a generalized suffix tree, e.g. efficiently using Ukkonen’s algorithm. (For everything related to suffix trees, I would recommend reading (Gusfield, 1997).) Before making any actual query, we can efficiently simulate the query on the suffix tree to check if we have already seen at least 10 barcodes with the query as substring. If this is the case, then we can skip the actual query and immediately proceed with recursion, because the actual query would also return 10 results and force us to recurse anyway.

I used this approach to scrape the actual registry on 2025-08-22, which yielded 10171 barcodes (9535 EAN-13, 377 UPC-A and 259 EAN-8).⁴ Hoping this algorithm is correct, I have a copy of the entire registry at hand locally, which makes it much easier and faster to experiment with different solutions: the registry can be simulated without any rate limiting.⁵ On this local simulation, the suffix tree algorithm uses 73713 queries to scrape the registry. With the actual rate limiting, this requires almost ~20.5 hours. In comparison, the basic algorithm uses 81250 queries, which is ~10% more.

Better solutions?

I suspect the suffix tree solution is far from optimal because it discovers each barcode in the registry numerous times, corresponding to each suffix of each barcode. Intuitively, it should be possible to do better. For example, if 0000 and its extensions have already been scraped, then later it would be a waste to scrape 10000 and its extensions, because all of them have already been found. Except, that’s not true! If the 0000 query has at least 10 results and 00001, …, 00009 have all been scraped, then we might not have seen barcodes that end with 0000. For this tricky reason, my attempts at further optimization (e.g. also using a suffix tree of all performed queries to find such query inclusions) have failed to correctly scrape the entire registry. Such pruning can reduce the number of queries by over 50%, but also cause it to miss a handful of barcodes, which is no good (“It doesn’t work, but it’s fast”).

Comment below if you have (ideas for) a better correct solution!

Verification problem

With the scraped registry at hand, let’s consider checking the result, i.e. the scraped dataset has exactly the same barcodes as the actual registry. This set equality can be viewed as two set inclusions: all the scraped barcodes exist in the actual registry, but also all the barcodes in the actual registry were scraped. Let’s call these properties soundness and completeness, respectively.

In a way, verification is like solving the scraping problem while knowing the answer to begin with. So intuitively, it should be doable with fewer queries.

Soundness

Naïve solution

The obvious algorithm is to just query each scraped barcode from the actual registry and confirm that it is included in the results. (If any of the results contains new barcodes, we’ve inadvertently disproven completeness.) Thus, this requires as many queries as there are barcodes in the registry, which can’t be the most optimal. For the registry scraped above, it requires 10171 queries.

Suffix tree solution

Since the answer is already known, a generalized suffix tree of all the barcodes can be constructed to begin with. The verification algorithm is then to traverse the suffix tree and query from the actual registry the substring corresponding to each suffix tree path (from its root) where the subtree under that node has at most 10 barcodes (while its parent has more). Unlike suffix tree solution to the scraping problem, nodes representing exactly 10 barcodes should also be fine here.

As with the suffix tree solution to the scraping problem, this also checks each barcode multiple times, corresponding to each suffix of each barcode. Having implemented this approach, for the registry scraped above, it requires 29794 queries, which is much worse than even the naïve algorithm.⁴

Suffix tree and set cover solution

Since we have the answer, we can do some optimization to avoid verifying each barcode numerous times. Like the previous solution, the suffix tree provides all substrings we could query and their expected results. Minimizing the number of queries to cover all the barcodes is an instance of the set cover problem, which, somewhat unfortunately, is NP-hard. So curiously, to efficiently verify the solution to the scraping problem, we have to solve (not verify) an instance of the difficult set cover problem.

Luckily, the greedy algorithm for the set cover problem isn’t too shabby in this case: having implemented it, for the registry scraped above, it requires just 2159 queries, which is almost 5 times less than the naïve solution.⁴ Theoretical results about the greedy algorithm show that its over-approximation ratio here is \(H(10) \approx 2.93\), where 10 is the size of the largest set (picked from the suffix tree). In other words, the greedy algorithm can be at most ~2.93 times worse than the optimal set cover size. Or put another way, for the registry scraped above, the optimal number of verification queries is at least 738.

As far as I managed to find, similar ideas have been proposed to solve a slightly different string barcoding problem (Rash & Gusfield, 2002). Coincidentally, it also mentions barcodes by name, albeit with a completely different meaning: it’s a problem in computational biology for constructing some kind of probes to distinguish a set of DNA sequences. Nevertheless, their solution also processes nodes of a suffix tree in relation to the strings contained in their corresponding subtrees. However, they use it to construct an integer linear programming (ILP) problem to then solve (i.e. optimize). Also coincidentally, the set cover problem can be stated as an ILP problem. So, obviously, ILP is also NP-hard, but can be approximated via relaxation to non-integer linear programming. That approximation doesn’t necessarily ensure as good of a solution than the greedy algorithm does for the scraped database because here it has approximation ratio of 11.

Completeness

Checking completeness is not as easy: we have to check whether every barcode in the actual registry is in our scraped answer, but knowing the former is the scraping problem itself! It seems more promising to check the contrapositive: whether every barcode not in our scraped answer also is not in the actual registry. This doesn’t sound much easier at first: most barcodes are not in the answer and iterating over them would be unrealistic.

Suffix tree solution

Although I have not fully worked out the algorithm for this, I believe the suffix tree used for soundness can also be useful for completeness. Namely, every branch which is missing in the suffix tree corresponds to a partial barcode which isn’t a substring of any barcode in the scraped answer. The suffix tree path (from its root) to this hypothetical missing branch serves as a query which should return zero results from the actual registry, otherwise our answer is incomplete.

Better solutions?

As always, such suffix-tree–based checking incurs some redundancy. For example, to prove that the barcode 000008 isn’t in the registry, it’s enough that the query 0000 returns zero results, making it unnecessary to also query 0008, which might also be missing. However, the latter might still be necessary to rule out other substrings.

Removing the redundancy using set cover isn’t as simple for the negative case: each of the missing branches (likely) represents a very large (complete) suffix subtree. Thus, the resulting set cover problem instance would not be realistically solvable.

Comment below if you have (ideas for) a better solution!

Joint verification

It is possible to exploit a single query for both soundness and completeness (as mentioned during soundness). If a query returns less than 10 barcodes, we both can confirm that our scraped ones are included, but also that no others containing the queried substring is present. Thus, a joint set cover problem to cover all barcodes in the answer and all missing branches in the suffix tree might yield a not-too-bad solution overall.

Update problem

I scraped the registry at some point in time (or more precisely, over a duration of time) and hypothetically even verified it to be correct soon after. However, the actual registry will inevitably change: new products with new barcodes will be added (and perhaps some old ones removed, although I’m not sure if the registry would ever do that — it still contains some quite old products). Hence, another interesting problem arises: how to update the scraped registry without scraping it from scratch?

I haven’t put much thought into this, but I suspect the suffix tree will again come in handy. Maybe one can try verifying the current scraped registry and, when a mismatch is detected, do some scraping from that point, but it’s just a vague idea. I would expect the amount of extra scraping to be somewhat proportional to the size of the registry change. Of course, if the registry is completely replaced with a disjoint set of barcodes (a hypothetical worst case scenario), then it’s probably hopeless to beat from scratch scraping.

Comment below if you have (ideas for) a solution!

Conclusion

What started as a silly exercise in scraping a registry turned into a set of quite complicated algorithmic problems. And in no way can I claim any of these to be solved. Therefore, I challenge the reader to come up with better solutions and quickly try them out on the locally simulated registry.⁴

References

Book

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology

Dan Gusfield

1997
RECOMB

String barcoding: uncovering optimal virus signatures

Sam Rash and Dan Gusfield

In Research in Computational Molecular Biology, 2002

Abs DOI

There are many critical situations when one needs to rapidly identify an unidentified pathogen from among a given set of previously sequenced pathogens. DNA or RNA hybridization chips can be designed for such identifications. Each cell in the chip can report the presence or absence of a specific substring of DNA in the unidentified pathogen. Properly designed, the collection of reports obtained from the cells can uniquely identify any pathogen in the set, or determine that the unidentified pathogen is not in the set. There is a limit to the number of cells on a chip, and a range of substring lengths that a cell can handle. So, given the full sequences of a set of pathogens, the problem is to design the chip by selecting the smallest set of substrings of the appropriate lengths, so that each pathogen in the set has a unique set of cells that report a substring. For any given pathogen, the set of reporting cells is its signature, and hence the entire system is a "barcode" system for the pathogens. Previous work addressed this problem, but focused on pathogens of bacterial size, and hence had to make many compromises for the sake of efficiency. The substrings lengths were severely restricted, and no optimality or near-optimality was guaranteed. In this paper, we focus on viral-size pathogens. We show that for genomes of this size, it is practical to solve the barcode design problem optimally, or near-optimally, without artificially constraining the problem. We also efficiently find barcodes that provide a level of redundancy, tolerating a number of errors or mutations. The key technical ideas are the use of suffix trees to identify the critical substrings, integer-linear programming (ILP) to express the minimization problem, and a simple idea that dramatically reduces the size of the ILP, allowing it to be solved efficiently by the commercial ILP solver CPLEX. We report extensive tests of our approach on various collections of virus DNA and RNA sequences.

They call it the packaging register, but that’s a bit of an odd translation of the Estonian term pakendiregister. ↩
I don’t actually have a use for all this data, but if you do, then you’re in luck! I just got fascinated by the (surprisingly complex) computer science problem of doing so. ↩
It might be a fun combinatorics problem to calculate the maximum possible size of a set of barcodes (up to a certain length) that satisfies this assumption. ↩
This GitHub repository includes my Python code and the scraped dataset. ↩ ↩² ↩³ ↩⁴
Note that for queries with more than 10 results, such local simulation might not return the same 10 results that the actual registry does. In turn, this may cause some algorithms to behave slightly differently. ↩