A search engine in 250 lines of Scheme

28 Sep, 2025

During a regular dive into deep water lobsters' news, I hop over a post about a search engine made of haskell, referencing another post about a search engine in Python, again, in 150 lines but this time Python. So, here is my take.

Caveat Emptor

Simple: I was wondering how I could achieve that with Chez Scheme, and binink. And possibly maybe help someone get an idea or two about how Scheme programming can be done. That is it.

This is in no way a realisitc search engine, not even realisitic search engine for local archive. Indeed, analyzing, storing in memory, and searching several megabytes, or gigabytes with the following code is slow.

It is interesting because it does involve a small set standard parts, and couple of libraries html, sxpath, match. To keep it simple scheme, I will rely on lambda, car, cdr, box, unbox, set-box!, assoc, define, etc... no hash-tables, or bags.

What this post has to offer is glimpsing over recent Scheme, and comparing the code with Python, and Haskell. And prolly, getting you in the mood to do the same kind of post filling the gaps of our explanations on how to build a search engine in your favorite language.

Data

I can not find wikipedia dump of abstracts, that's why i will rely on a dump of lobste.rs filtered on lisp. By the way, it is more on more difficult to scrape the web. I recommend you backup whatever you can! Checkout https://archive.org/download/com.ycombinator.news.jsonl or https://kiwix.org/.

To bootstrap the dataset, I filter the file called rs.lobste-stories.jsonl.xz with xzgrep to keep only stories that mention lisp:

~/Data # /usr/bin/time xzgrep -F 'lisp' rs.lobste-stories.jsonl.xz > ~/Data/rs.lobste.lisp.jsonl
real	0m 6.44s

The file is 257M with 1744 lines. Each line, is a json object such as:

~/Data # head --lines 1 rs.lobste.lisp.jsonl

["https://nullprogram.com/blog/2017/01/30/", "http://nullprogram.com/blog/2017/01/30/", "<!DOCTYPE html>\n<title>How to Write Fast(er) Emacs Lisp</title>\n...

It is an array, where the first object is the url of the destination of the redirection, the second is the source, and the last object is html code of the page.

What

Let's start with a test to ground what I am looking to do:

(define ~check-seti-000
  (lambda ()
    (define seti (seti-new))
    (for-each (lambda (url html) (seti-index! seti url html)) dataset)
    (assert (< 0 (length (seti-search seti "lisp"))))))

There is a constructor seti-new that will produce a handle, that I reference to read, and write the search index. seti-index! write (index) html associated with the identifier url, and seti-search to read, in other words look up document that contain a key word.

How

`html->sxml`

html->sxml is a procedure I picked from guile-lib by Neil W. Van Dyke, modified by Andy Wingo. It will turn html, into sxml.

The name sxml is used to name the precise nesting of list, symbols, and strings that make up the construction produced by html->sxml. Glimpsing over:

(html
 (head (title "Hello schemer!!"))
 (body
  (div (@ (id "root"))
       (p "Password ruse grand scheme plan machination"))))

Mind the arobase @ sign, that is a control char, that can not be a tag in neither html, nor xml, and is used to tag the list to describe it as a list of sxml attribute. Yes, It is the same structure used for deserializing xml, with beautiful namespaces when necessary.

`sxml->text`

The procedure sxml->text turn an sxml file into string without markup, that is a sequence of words without structuring annotations (except natural language punctuation, and those will go away too later).

This will put to great use the mighty match with catamorphism.

You can achieve something similar with Andy Wingo foldts. It is more readable with match from SRFI 241 by Marc Nieper-Wißkirchen.

An example^WW A test will help understand what I am looking for

(define ~check-seti-001
  (lambda ()
    (assert
     ;; expected
     (string=? "Hello schemer!! Password ruse grand scheme plan machination"
               ;; given
               (sxml->text
                '(html (head (title "Hello schemer!!"))
                       (body
                        (div (@ (id "root"))
                             (p "Password ruse grand scheme plan machination")))))))

The following catamorphic match with the help of scheme xpath ie. sxpath achieve something nearly perfect:

(define html->text
  (lambda (html)
    ;; return #f in case of failure
    (guard (ex (else (display-condition ex) #f))
      ;; retrieve the title
      (define title (match (let ((title ((sxpath '(// title *text*)) html)))
                             ;; TODO: FIXME: I forgot how to match
                             ;; html/head/title with sxpath in a way
                             ;; that is absolute hence this dance
                             (if (null? title)
                                 '()
                                 ;; retrieve the first title, not
                                 ;; titles found in body. hopefully
                                 ;; the first title is the
                                 ;; /html/head/title
                                 (car title)))
                      ;; That exercise the feature of catamorphism, it
                      ;; will require the match to recurse with the
                      ;; syntax ,(fooobarqux) here ,(s*).  The
                      ;; recursing happens as if every single s*
                      ;; matched by s* ... are feed one by one as
                      ;; argument of the current match. The convention
                      ;; I learned in nanopass is to call variable
                      ;; that have ellipsis with a star suffix. The
                      ;; variable name is not parsed by the match
                      ;; macro :)

                      ;; do no recurse at*, and do not include them in
                      ;; output. it happens that <title abc="oops">
                      (((@ ,at* ...) ,(s*) ...)
                       ;; s* is bound to the result of the match
                       ;; catamorphism that is: a list of strings
                       (string-join s* " "))
                      ;; Ah that should be better. Because the sxpath
                      ;; is // title *text*, sxpath returns only the
                      ;; content of the node title, not the node title
                      ;; that would include the tagging symbol (title
                      ;; "hello world") vs. here "hello world"
                      ((,(s*) ...) (string-join s* " "))
                      ;; For some cases title include some non-string garbage,
                      ;; ignore it. The code reach that because of recursion
                      ;; of the catamorphism.

                      ;; if s is string return as-is, otherwise
                      ;; zero-object aka empty string.
                      (,s (if (string? s) s ""))))

      ;; need to transform the body into text... (sxpath '(// body //
      ;; *text*)) does not do what I expect, it will also match
      ;; attribute values, so, fetch body, and then match with
      ;; catamorphism.
      (define body (let ((body ((sxpath '(// body)) html)))
                     ;; XXX: there might be no body, bug in htmlprag?
                     (if (null? body) '() (car body))))
      (define text (string-join (match body ;; body looks like '(body (@ at* ...) e* ...)
                                  ;; we deal with empty list e* ... such as br.
                                  ;; return the zero object

                                  ;; in the match will: a) never
                                  ;; reverse on at*, and never pick it
                                  ;; up in output construction. match
                                  ;; need to return a string of text
                                  ;; from the body *properly*
                                  ;; separated with a single space.
                                  (() (list))
                                  ;; ignore script, zero object is list
                                  ((script (@ ,at* ...) ,e* ...) (list))
                                  ;; ignore no script :')
                                  ((noscript (@ ,at* ...) ,e* ...) (list))
                                  ;; match any tag with attributes, recurse children
                                  ((,tag (@ ,at* ...) ,(e*) ...) (apply append e*))
                                  ;; same without attributes
                                  ((,tag ,(e*) ...) (apply append e*))
                                  ;; if a string wrap in list, otherwise it is an error \w/
                                  (,e (if (string=? e)
                                          (list e)
                                          (error 'html->text "unsupported html"))))
                                " "))
      ;; XXX: some space is inserted, i do not know why, anyway it
      ;; does not matter for what comes next.
      (string-append title " " text))))

`string-clean`

The procedure string-clean will do two things at the same time. There is a small optimization. With the library (scheme char) I would write it with something like:

(define string-clean-r7rs
  (lambda (string)

    (define alphanumberic-or-space
      (lambda (char)
        (if (char-set:contains char-set-alphanumeric char) char #\space)))

    ;; to follow the clojure convention, because map's lambda is not a
    ;; one-liner: create a procedure
    (list->string (map alphanumberic-or-space (string->list (string-downcase string))))))

The "problem" with that procedure is that it iterates at least twice over the input string, once when string-downcase is called, and another time with string->list is called: BAD! Or so I think ;)

So here is another implementation that does not require (scheme char), and iterates less times over string:

(define string-clean
  (lambda (string)
    (list->string (map (lambda (c) (let ((c* (char-downcase)))
                                     (case c*
                                       ((#\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9
                                         #\q #\w #\e #\r #\t #\y #\u #\i
                                         #\o #\p #\a #\s #\d #\f #\g #\h
                                         #\j #\k #\l #\z #\x #\c #\v #\b
                                         #\m #\n)
                                        c*)
                                       (else #\space))))
                       (string->list string)))))

`bag-new`, and `bag-add!`

The procedure bag-new will turn the text into a counter. I will not use the R7RS bag data structure, instead I build my own counter datastructure with an association list, and a box:

(define bag-new
  (lambda (tokens)
    (define bag (cons 'bag (box (list))))
    (for-each (lambda (token) (bag-add! bag token)) tokens)
    bag))

(define bag-add!
  (lambda (bag token)
    (assert (eq? (car bag) 'bag))
    (let* ((counts (unbox (cdr bag)))
           (count (assoc token counts)))
      (if (not count)
          ;; there is no count for this token, it is new
          (set-box! (cdr bag) (cons (cons token (box 1)) counts))
          ;; there a count for this token, increment
          (set-box! (cdr count) (+ 1 (unbox (cdr count))))))))

Analyze

The above procedure are each passing their output as input to the next procedure to build a "pipeline". Some Scheme have compose, or threading operator, I will implement a procedure ==> that make the pattern explicit:

(define ==>
  (lambda procedures
    (lambda (object)
      (fold (lambda (input procedure) (procedure input)) object procedures))))

It could be a macro, but I macro-fu is weak.

Then the phase of analysis, is just a thread of procedures. When called, the output is a bag made of the words that are not stop words in the input html:

(define seti-analyze (==> html->sxml
                         sxml->text
                         string-clean
                         string-split
                         (lambda (tokens)
                           (filter (lambda (token) (not (member token stop-words))) tokens))
                         bag-new))

Let's try a REPL^W test and look at what we have:

(define ~check-set-004
  (lambda ()
    (assert (set-analyze "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
            `(bag (("hello" . ,(box 2))
                   ("schemers" . ,(box 1)))))))

`(seti-new)`, and `(seti-index! seti url html)`

The procedure seti-new is:

(define seti-new
  (lambda ()
    (cons 'seti (box (list)))))

Now, seti-index! and precisely (seti-index! seti url html). So, we have the above seti-analyze that returns a bag representing the number of occurence of words within an html and that bag is associated to an url by the way of the html page. In ascii art we can represent an abstract traversal of the data as follow:

url -> word -> count

Or an abstract tuple:

(url word count)

The input query is made of a word, to retrieve all urls that contain words it is necessary to iterate over all tuples and keep only the tuples that have word equal to query that is called a algorithm complexity of O(n) where n the number of tuple, that is approximatly O(m) where m is the number of document if we consider that every document has the same number of words.

Long story made short, when there is a lot of document it helps to use another datastructure, a datastructure where it is easier to lookup words.

Instead of considering:

A document is made of words

The inverted index, posting, also known as index reverse the relationship and says:

words are in document

How? There is url -> html hence eventually a relation url -> word. I need to construct a relation word -> url, an obtain the abstract view, sometime it is called "reversing the dictionary". An abstract view gives:

(word url count)

So here is one way to implement seti-index!:

(define seti-new
  (lambda ()
    (cons 'seti (box (list)))))

(define ~check-seti-005
  (lambda ()
    (define seti (seti-new))
    (seti-index! seti "hello-schemers"
                 "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
    (assert
     (equal? seti
             (cons 'seti (box
                          (list
                           (cons "hello" (box (list "hello-schemers")))
                           (cons "world" (box (list "hello-schemers")))
                           (cons "schemers" (box (list "hello-schemers")))
                           )))))))

(define seti-add!
  (lambda (seti token url)
    (define entry (assoc token (unbox (cdr seti))))
    (if (not entry)
        (set-box! (cdr seti) (cons (cons token (box (list url)))
                                   (unbox (cdr seti))))
        (set-box! (cdr entry) (cons url (unbox (cdr entry)))))))

(define seti-index!
  (lambda (seti url html)
    (assert (eq? (car seti) 'seti))
    (let* ((bag (seti-analyze html))
           (tokens (map car (unbox (cdr bag)))))
      (for-each (lambda (token) (seti-add! seti token url)) tokens))))

Final words

There is already 236 lines of tested code. I hope you enjoyed going through the code, and the writing, and learned a few tricks. I know first hand it is not enough but if it helped you either get into search engines, or get into scheme then all the better.

The astute reader will recognize a missing seti-search, that is left as an exercise to the reader. And so does the scoring functionality.

Listing

#!chezscheme
(library (seti)

  (export #;main
          ~check-seti-001
          ~check-seti-002
          ~check-seti-003
          ~check-seti-004
          ~check-seti-005
          )
  
  (import (chezscheme)
          (binink match base)
          (binink sxpath)          
          (binink html base)          
          )

  (define pk
    (lambda args
      (display ";; ")(write args)
      (newline)
      (flush-output-port)
      (car (reverse args))))

  (define ~check-seti-001
    (lambda ()
      (assert
       ;; expected
       (string=? "Hello schemer!! Password ruse grand scheme plan machination"
                 ;; given
                 (sxml->text
                  '(html (head (title "Hello schemer!!"))
                         (body
                          (div (@ (id "root"))
                               (p "Password ruse grand scheme plan machination")))))))))

  (define (string-join lst delimiter)
    (if (null? lst) ""
        (fold-left (lambda (item result) (string-append result delimiter item))
                   (car lst)
                   (cdr lst))))
  
  (define sxml->text
    (lambda (html)
      ;; return #f in case of failure
      (guard (ex (else (display-condition ex) #f))
        ;; retrieve the title
        (define title (match (let ((title ((sxpath '(// title *text*)) html)))
                               ;; TODO: FIXME: I forgot how to match
                               ;; html/head/title with sxpath in a way
                               ;; that is absolute hence this dance
                               (if (null? title)
                                   '()
                                   ;; retrieve the first title, not
                                   ;; titles found in body. hopefully
                                   ;; the first title is the
                                   ;; /html/head/title
                                   (car title)))
                        ;; That exercise the feature of catamorphism, it
                        ;; will require the match to recurse with the
                        ;; syntax ,(fooobarqux) here ,(s*).  The
                        ;; recursing happens as if every single s*
                        ;; matched by s* ... are feed one by one as
                        ;; argument of the current match. The convention
                        ;; I learned in nanopass is to call variable
                        ;; that have ellipsis with a star suffix. The
                        ;; variable name is not parsed by the match
                        ;; macro :)

                        ;; do no recurse at*, and do not include them in
                        ;; output. it happens that <title abc="oops">
                        (((@ ,at* ...) ,(s*) ...)
                         ;; s* is bound to the result of the match
                         ;; catamorphism that is: a list of strings
                         (string-join s* " "))
                        ;; Ah that should be better. Because the sxpath
                        ;; is // title *text*, sxpath returns only the
                        ;; content of the node title, not the node title
                        ;; that would include the tagging symbol (title
                        ;; "hello world") vs. here "hello world"
                        ((,(s*) ...) (string-join s* " "))
                        ;; For some cases title include some non-string garbage,
                        ;; ignore it. The code reach that because of recursion
                        ;; of the catamorphism.

                        ;; if s is string return as-is, otherwise
                        ;; zero-object aka empty string.
                        (,s (if (string? s) s ""))))

        ;; need to transform the body into text... (sxpath '(// body //
        ;; *text*)) does not do what I expect, it will also match
        ;; attribute values, so, fetch body, and then match with
        ;; catamorphism.
        (define body (let ((body ((sxpath '(// body)) html)))
                       ;; XXX: there might be no body, bug in htmlprag?
                       (if (null? body) '() (car body))))
        (define text (string-join (match body ;; body looks like '(body (@ at* ...) e* ...)
                                    ;; we deal with empty list e* ... such as br.
                                    ;; return the zero object

                                    ;; in the match will: a) never
                                    ;; reverse on at*, and never pick it
                                    ;; up in output construction. match
                                    ;; need to return a string of text
                                    ;; from the body *properly*
                                    ;; separated with a single space.
                                    (() (list))
                                    ;; ignore script, zero object is list
                                    ((script (@ ,at* ...) ,e* ...) (list))
                                    ;; ignore no script :')
                                    ((noscript (@ ,at* ...) ,e* ...) (list))
                                    ;; match any tag with attributes, recurse children
                                    ((,tag (@ ,at* ...) ,(e*) ...) (apply append e*))
                                    ;; same without attributes
                                    ((,tag ,(e*) ...) (apply append e*))
                                    ;; if a string wrap in list, otherwise it is an error \w/
                                    (,e (if (string=? e)
                                            (list e)
                                            (error 'html->text "unsupported html"))))
                                  " "))
        ;; XXX: some space is inserted, i do not know why, anyway it
        ;; does not matter for what comes next.
        (string-append title " " text))))

  (define string-clean
    (lambda (string)
      (list->string (map (lambda (c) (let ((c* (char-downcase c)))
                                       (case c*
                                         ((#\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9
                                           #\q #\w #\e #\r #\t #\y #\u #\i
                                           #\o #\p #\a #\s #\d #\f #\g #\h
                                           #\j #\k #\l #\z #\x #\c #\v #\b
                                           #\m #\n)
                                          c*)
                                         (else #\space))))
                         (string->list string)))))

  (define ~check-seti-002
    (lambda ()
      (assert (string=? (string-clean "ABC def i->k L;") "abc def i  k l "))))

  (define bag-new
    (lambda (tokens)
      (define bag (cons 'bag (box (list))))
      (for-each (lambda (token) (bag-add! bag token)) tokens)
      bag))

  (define bag-add!
    (lambda (bag token)
      (assert (eq? (car bag) 'bag))
      (let* ((counts (unbox (cdr bag)))
             (count (assoc token counts)))
        (if (not count)
            ;; there is no count for this token, it is new
            (set-box! (cdr bag) (cons (cons token (box 1)) counts))
            ;; there a count for this token, increment
            (set-box! (cdr count) (+ 1 (unbox (cdr count))))))))


  (define ~check-seti-003
    (lambda ()
      (assert (equal?
               (cons 'bag (box (list
                                (cons "you" (box 1))
                                (cons "world" (box 1))
                                (cons "hello" (box 2)))))
               (bag-new (list "hello" "world" "hello" "you"))))))

  (define ==>
    (lambda procedures
      (lambda (object)
        (fold-left (lambda (input procedure) (procedure input)) object procedures))))


(define (string-split char-delimiter? string)
  (define (maybe-add a b parts)
    (if (= a b) parts (cons (substring string a b) parts)))
  (let ((n (string-length string)))
    (let loop ((a 0) (b 0) (parts '()))
      (if (< b n)
          (if (not (char-delimiter? (string-ref string b)))
              (loop a (+ b 1) parts)
              (loop (+ b 1) (+ b 1) (maybe-add a b parts)))
          (reverse (maybe-add a b parts))))))
  
  (define stop-words (list "not" "has" "is" "not" ""))
  (define seti-analyze (==> html-read
                            sxml->text
                            string-clean
                            (lambda (s) (string-split char-whitespace? s))
                            (lambda (tokens)
                              (filter (lambda (token) (not (member token stop-words))) tokens))
                            bag-new))

  (define ~check-seti-004
    (lambda ()
      (assert (equal?
               (seti-analyze "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
               (cons 'bag (box `(("schemers" . ,(box 1))
                                 ("world" . ,(box 1))
                                 ("hello" . ,(box 2)))))))))

  (define seti-new
    (lambda ()
      (cons 'seti (box (list)))))

  (define ~check-seti-005
    (lambda ()
      (define seti (seti-new))
      (seti-index! seti "hello-schemers"
                   "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
      (assert
       (equal? seti
               (cons 'seti (box
                            (list
                             (cons "hello" (box (list "hello-schemers")))
                             (cons "world" (box (list "hello-schemers")))
                             (cons "schemers" (box (list "hello-schemers")))
                             )))))))

  (define seti-add!
    (lambda (seti token url)
      (define entry (assoc token (unbox (cdr seti))))
      (if (not entry)
          (set-box! (cdr seti) (cons (cons token (box (list url)))
                                     (unbox (cdr seti))))
          (set-box! (cdr entry) (cons url (unbox (cdr entry)))))))
  
  (define seti-index!
    (lambda (seti url html)
      (assert (eq? (car seti) 'seti))
      (let* ((bag (seti-analyze html))
             (tokens (map car (unbox (cdr bag)))))
        (for-each (lambda (token) (seti-add! seti token url)) tokens))))
        

  )