Tue, 09 Jun 2026 06:46:23 -0500

Innovation Through Constraint

mr's Preposter.us Blog

This being The Month of Zig, I started thinking about how I would port my Personal Search Engine (PSE) prototype to Zig.  As I worked through the prototype code, one challenge stood out: the Go version depends on Go's built-in HTML parser, but there's no HTML parser in Zig's standard library (yet?).  The PSE's web crawler needs an HTML parser to read the HTML documents it crawls and get things like the page title, body and links that go into the PSE index.

I found some third-party HTML parsers, but the ones I found are incomplete or have limited testing.  Crawling is critical to the functionality of the PSE (probably the second most important thing) and it's also where most of the work happens, so it needs to be accurate, reliable and efficient.

I could have written an HTML parser (or at least enough of one to get the things the PSE needs out of an HTML page), but that's a nontrivial amount of work and I'm already chomping at the bit to get this prototype into other people's hands for testing.  HTML is also a moving target, which means that a home-grown parser would require ongoing maintenance.  It would also have all the downsides of the current third-party HTML parsers.

But being forced to think about HTML parsing gave me a new idea: what if I don't parse the HTML pages to extract what I need for the index?  What if instead I just read through the data and grab the things I need?  

This "streaming" approach gets me everything I need without the complexity and overhead of parsing each entire HTML document into an object.  Aside from eliminating all the problems above, this approach requires significantly fewer resources than processing all that data and then throwing away most of it.  It also makes it possible to predict how much memory will be needed, because you control how much data you work with at a time.  In retrospect, it's a much better solution to the specific problems I need to solve than a full-blown parser.

I took what I learned working on the Zig port and "backported" it to the Go prototype, and the results are really amazing.  Not only does it work, but it's faster and much more memory-efficient while producing exactly the same output I was using from the parser.  I was also able to implement it generically enough to re-use this "streaming extractor" module to eliminate parsers elsewhere (notably in some of the import code), further reducing dependencies in the code.

I might even re-use it in other projects...

I don't think it ever would have crossed my mind to question the use of an HTML parser in a web crawler used to generate a search index of web pages.  That might only have happened if a performance problem had emerged while the system was in production, at which point solving it would have been an emergency.  Instead, by attempting to port the prototype to a different, more constrictive programming language, I not only reconsidered this but came up with a superior implementation.

Even if I never complete the Zig port of the PSE prototype, the porting effort will have paid major dividends.  This is a very practical example of the value of learning something new, even when there isn't an immediate "practical" reason to do so.



Jason J. Gullickson, 2026