Why do I do anything?

This started when I was trying to search for a cool old speaker and DuckDuckGo wasn’t responding. It made me think about what I’d use to search in situations where my primary search engine is inaccessible.
I asked the fediverse for advice and got a few ideas but the discussion evolved into talking about self-hosted or personal search engines.

A few interesting projects came up, but they all seemed to have hang-ups, and the overall assessment was that exactly what I had in mind didn’t exist.
What I had in mind was something easy to set up, simple to use and nothing too fancy. My mind started to imagine how I might build something to meet these needs, and in the winter tradition of hacking something useful out in about 100 lines of Python, I decided to take a whack at it.
So far it’s a couple of incomplete programs with a lot of TODOs. The source can be found on Codeberg:
Why Paxton? Because names are hard, and for some reason it was the first name that popped into my head.
There are currently three programs that make up the system: a crawler, an indexer and the search server itself. The crawler does (or will do) what you’d expect: fetch the content of a website and store it in a file for indexing. The current crawler doesn’t do much crawling (it just reads a single page it’s pointed at), but that’s just to keep things simple and fast for the moment.
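To give a feel for what that fetch-and-store step looks like, here’s a minimal sketch using only the standard library (urllib and hashlib). The names and file layout are my own illustration, not necessarily how Paxton actually does it:

    import hashlib
    import os
    import urllib.request

    def fetch(url, out_dir="crawl"):
        # Grab one page and stash it in a file the indexer can read later.
        os.makedirs(out_dir, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        # Hash the URL to get a stable, filesystem-safe filename.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        with open(os.path.join(out_dir, name + ".txt"), "w", encoding="utf-8") as f:
            f.write(url + "\n")  # keep the URL alongside the content
            f.write(html)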
The indexer reads the files stored by the crawler and generates an index file containing the distilled contents of the crawler output. This is where most of my noodling is happening at the moment. My current idea is to create a dictionary where the keys are important words and the values point to lists of URLs that contain those keywords. These nested lists contain further dictionaries which include the URL of the related page, the occurrence count of the word and possibly other metadata that might be useful for scoring results.
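Roughly, the index shape I have in mind looks like the sketch below. This is my reading of the idea rather than the project’s exact format, and again it sticks to built-in Python:

    import re

    def build_index(pages):
        # `pages` is a list of (url, text) pairs produced by the crawler.
        index = {}
        for url, text in pages:
            words = re.findall(r"[a-z0-9]+", text.lower())
            counts = {}
            for word in words:
                counts[word] = counts.get(word, 0) + 1
            for word, count in counts.items():
                # Each word maps to a list of per-page dictionaries.
                index.setdefault(word, []).append({"url": url, "count": count})
        return index

    # The result might look like:
    # {"speaker": [{"url": "https://example.com/vintage", "count": 7}], ...}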
The search server listens for search requests (GET /?search="foo") and tries to find the search terms in the index. It does this by splitting the terms into words, matching those words to the dictionary keys in the index and then building a list of results from the URLs the index dictionary points to. The results are sorted by the occurrence count of the word in the page the URL points to, accumulated when the same URL appears under multiple words in the search terms.
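The lookup-and-score step could be as simple as the sketch below: split the query into words, pull each word’s list from the index, and add up the counts so URLs that match several words float to the top. The function name and details are illustrative, not Paxton’s actual code:

    def search(index, query):
        scores = {}
        for word in query.lower().split():
            for entry in index.get(word, []):
                # Accumulate counts across all matching words for this URL.
                scores[entry["url"]] = scores.get(entry["url"], 0) + entry["count"]
        # Highest accumulated count first.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)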
Of course this is a very simple system and there is loads of room for improvement. I have a lot of ideas about how to improve the indexing and searching steps to provide relevant results without resorting to anything too cute or clever. I want the results to be deterministic and not influenced by a bunch of hidden weights and balances. I’m also constraining the implementation to Python 3’s built-in functionality so if I can’t build it out of basic Python, it’s not going in there.
Will this become a usable piece of software? Who’s to say? Maybe I’ll get frustrated/bored and switch back to something else, or maybe, like Preposter.us, it will become good enough to be a staple of my workflow.