A search engine in 250 lines of Scheme
During a regular dive into deep water lobsters' news, I hop over a post about a search engine made of haskell, referencing another post about a search engine in Python, again, in 150 lines but this time Python. So, here is my take.
Caveat Emptor
Simple: I was wondering how I could achieve that with Chez Scheme, and binink. And possibly maybe help someone get an idea or two about how Scheme programming can be done. That is it.
This is in no way a realisitc search engine, not even realisitic search engine for local archive. Indeed, analyzing, storing in memory, and searching several megabytes, or gigabytes with the following code is slow.
It is interesting because it does involve a small set standard parts,
and couple of libraries html
, sxpath
, match
. To keep it simple
scheme, I will rely on lambda
, car
, cdr
, box
, unbox
,
set-box!
, assoc
, define
, etc... no hash-tables, or bags.
What this post has to offer is glimpsing over recent Scheme, and comparing the code with Python, and Haskell. And prolly, getting you in the mood to do the same kind of post filling the gaps of our explanations on how to build a search engine in your favorite language.
Data
I can not find wikipedia dump of abstracts, that's why i will rely on a dump of lobste.rs filtered on lisp. By the way, it is more on more difficult to scrape the web. I recommend you backup whatever you can! Checkout https://archive.org/download/com.ycombinator.news.jsonl or https://kiwix.org/.
To bootstrap the dataset, I filter the file called
rs.lobste-stories.jsonl.xz
with xzgrep
to keep only stories that mention lisp:
~/Data # /usr/bin/time xzgrep -F 'lisp' rs.lobste-stories.jsonl.xz > ~/Data/rs.lobste.lisp.jsonl
real 0m 6.44s
The file is 257M with 1744 lines. Each line, is a json object such as:
~/Data # head --lines 1 rs.lobste.lisp.jsonl
["https://nullprogram.com/blog/2017/01/30/", "http://nullprogram.com/blog/2017/01/30/", "<!DOCTYPE html>\n<title>How to Write Fast(er) Emacs Lisp</title>\n...
It is an array, where the first object is the url of the destination of the redirection, the second is the source, and the last object is html code of the page.
What
Let's start with a test to ground what I am looking to do:
(define ~check-seti-000
(lambda ()
(define seti (seti-new))
(for-each (lambda (url html) (seti-index! seti url html)) dataset)
(assert (< 0 (length (seti-search seti "lisp"))))))
There is a constructor seti-new
that will produce a handle, that I
reference to read, and write the search index. seti-index!
write
(index) html
associated with the identifier url
, and seti-search
to read, in other words look up document that contain a key word.
How
html->sxml
html->sxml
is a procedure I picked from guile-lib by Neil W. Van Dyke, modified
by Andy Wingo. It will turn html, into sxml.
The name sxml is used to name the precise nesting of list, symbols,
and strings that make up the construction produced by
html->sxml
. Glimpsing over:
(html
(head (title "Hello schemer!!"))
(body
(div (@ (id "root"))
(p "Password ruse grand scheme plan machination"))))
Mind the arobase @
sign, that is a control char, that can not be a
tag in neither html, nor xml, and is used to tag the list to describe
it as a list of sxml attribute. Yes, It is the same structure used
for deserializing xml, with beautiful namespaces when necessary.
sxml->text
The procedure sxml->text
turn an sxml file into string without
markup, that is a sequence of words without structuring annotations
(except natural language punctuation, and those will go away too
later).
This will put to great use the mighty match with catamorphism.
You can achieve something similar with Andy Wingo
foldts
. It
is more readable with match
from SRFI 241 by Marc
Nieper-Wißkirchen.
An exampleWW A test will help understand what I am looking for
(define ~check-seti-001
(lambda ()
(assert
;; expected
(string=? "Hello schemer!! Password ruse grand scheme plan machination"
;; given
(sxml->text
'(html (head (title "Hello schemer!!"))
(body
(div (@ (id "root"))
(p "Password ruse grand scheme plan machination")))))))
The following catamorphic match with the help of scheme xpath
ie. sxpath
achieve something nearly perfect:
(define html->text
(lambda (html)
;; return #f in case of failure
(guard (ex (else (display-condition ex) #f))
;; retrieve the title
(define title (match (let ((title ((sxpath '(// title *text*)) html)))
;; TODO: FIXME: I forgot how to match
;; html/head/title with sxpath in a way
;; that is absolute hence this dance
(if (null? title)
'()
;; retrieve the first title, not
;; titles found in body. hopefully
;; the first title is the
;; /html/head/title
(car title)))
;; That exercise the feature of catamorphism, it
;; will require the match to recurse with the
;; syntax ,(fooobarqux) here ,(s*). The
;; recursing happens as if every single s*
;; matched by s* ... are feed one by one as
;; argument of the current match. The convention
;; I learned in nanopass is to call variable
;; that have ellipsis with a star suffix. The
;; variable name is not parsed by the match
;; macro :)
;; do no recurse at*, and do not include them in
;; output. it happens that <title abc="oops">
(((@ ,at* ...) ,(s*) ...)
;; s* is bound to the result of the match
;; catamorphism that is: a list of strings
(string-join s* " "))
;; Ah that should be better. Because the sxpath
;; is // title *text*, sxpath returns only the
;; content of the node title, not the node title
;; that would include the tagging symbol (title
;; "hello world") vs. here "hello world"
((,(s*) ...) (string-join s* " "))
;; For some cases title include some non-string garbage,
;; ignore it. The code reach that because of recursion
;; of the catamorphism.
;; if s is string return as-is, otherwise
;; zero-object aka empty string.
(,s (if (string? s) s ""))))
;; need to transform the body into text... (sxpath '(// body //
;; *text*)) does not do what I expect, it will also match
;; attribute values, so, fetch body, and then match with
;; catamorphism.
(define body (let ((body ((sxpath '(// body)) html)))
;; XXX: there might be no body, bug in htmlprag?
(if (null? body) '() (car body))))
(define text (string-join (match body ;; body looks like '(body (@ at* ...) e* ...)
;; we deal with empty list e* ... such as br.
;; return the zero object
;; in the match will: a) never
;; reverse on at*, and never pick it
;; up in output construction. match
;; need to return a string of text
;; from the body *properly*
;; separated with a single space.
(() (list))
;; ignore script, zero object is list
((script (@ ,at* ...) ,e* ...) (list))
;; ignore no script :')
((noscript (@ ,at* ...) ,e* ...) (list))
;; match any tag with attributes, recurse children
((,tag (@ ,at* ...) ,(e*) ...) (apply append e*))
;; same without attributes
((,tag ,(e*) ...) (apply append e*))
;; if a string wrap in list, otherwise it is an error \w/
(,e (if (string=? e)
(list e)
(error 'html->text "unsupported html"))))
" "))
;; XXX: some space is inserted, i do not know why, anyway it
;; does not matter for what comes next.
(string-append title " " text))))
string-clean
The procedure string-clean
will do two things at the same
time. There is a small optimization. With the library (scheme char)
I would write it with something like:
(define string-clean-r7rs
(lambda (string)
(define alphanumberic-or-space
(lambda (char)
(if (char-set:contains char-set-alphanumeric char) char #\space)))
;; to follow the clojure convention, because map's lambda is not a
;; one-liner: create a procedure
(list->string (map alphanumberic-or-space (string->list (string-downcase string))))))
The "problem" with that procedure is that it iterates at least twice
over the input string
, once when string-downcase
is called, and
another time with string->list
is called: BAD! Or so I think ;)
So here is another implementation that does not require (scheme char)
, and iterates less times over string
:
(define string-clean
(lambda (string)
(list->string (map (lambda (c) (let ((c* (char-downcase)))
(case c*
((#\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9
#\q #\w #\e #\r #\t #\y #\u #\i
#\o #\p #\a #\s #\d #\f #\g #\h
#\j #\k #\l #\z #\x #\c #\v #\b
#\m #\n)
c*)
(else #\space))))
(string->list string)))))
bag-new
, and bag-add!
The procedure bag-new
will turn the text into a counter. I will not
use the R7RS bag data structure, instead I build my own counter
datastructure with an association list, and a box:
(define bag-new
(lambda (tokens)
(define bag (cons 'bag (box (list))))
(for-each (lambda (token) (bag-add! bag token)) tokens)
bag))
(define bag-add!
(lambda (bag token)
(assert (eq? (car bag) 'bag))
(let* ((counts (unbox (cdr bag)))
(count (assoc token counts)))
(if (not count)
;; there is no count for this token, it is new
(set-box! (cdr bag) (cons (cons token (box 1)) counts))
;; there a count for this token, increment
(set-box! (cdr count) (+ 1 (unbox (cdr count))))))))
Analyze
The above procedure are each passing their output as input to the next
procedure to build a "pipeline". Some Scheme have compose
, or
threading operator, I will implement a procedure ==>
that make the
pattern explicit:
(define ==>
(lambda procedures
(lambda (object)
(fold (lambda (input procedure) (procedure input)) object procedures))))
It could be a macro, but I macro-fu is weak.
Then the phase of analysis, is just a thread of procedures. When called, the output is a bag made of the words that are not stop words in the input html:
(define seti-analyze (==> html->sxml
sxml->text
string-clean
string-split
(lambda (tokens)
(filter (lambda (token) (not (member token stop-words))) tokens))
bag-new))
Let's try a REPL^W test and look at what we have:
(define ~check-set-004
(lambda ()
(assert (set-analyze "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
`(bag (("hello" . ,(box 2))
("schemers" . ,(box 1)))))))
(seti-new)
, and (seti-index! seti url html)
The procedure seti-new
is:
(define seti-new
(lambda ()
(cons 'seti (box (list)))))
Now, seti-index!
and precisely (seti-index! seti url html)
. So, we
have the above seti-analyze
that returns a bag representing the
number of occurence of words within an html and that bag is associated
to an url by the way of the html page. In ascii art we can represent
an abstract traversal of the data as follow:
url -> word -> count
Or an abstract tuple:
(url word count)
The input query
is made of a word
, to retrieve all urls that
contain words it is necessary to iterate over all tuples and keep only
the tuples that have word
equal to query
that is called a
algorithm complexity of O(n) where n the number of tuple, that is
approximatly O(m) where m is the number of document if we consider
that every document has the same number of words.
Long story made short, when there is a lot of document it helps to use another datastructure, a datastructure where it is easier to lookup words.
Instead of considering:
A document is made of words
The inverted index, posting, also known as index reverse the relationship and says:
words are in document
How? There is url -> html
hence eventually a relation url -> word
. I need to construct a relation word -> url
, an obtain the
abstract view, sometime it is called "reversing the dictionary". An
abstract view gives:
(word url count)
So here is one way to implement seti-index!
:
(define seti-new
(lambda ()
(cons 'seti (box (list)))))
(define ~check-seti-005
(lambda ()
(define seti (seti-new))
(seti-index! seti "hello-schemers"
"<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
(assert
(equal? seti
(cons 'seti (box
(list
(cons "hello" (box (list "hello-schemers")))
(cons "world" (box (list "hello-schemers")))
(cons "schemers" (box (list "hello-schemers")))
)))))))
(define seti-add!
(lambda (seti token url)
(define entry (assoc token (unbox (cdr seti))))
(if (not entry)
(set-box! (cdr seti) (cons (cons token (box (list url)))
(unbox (cdr seti))))
(set-box! (cdr entry) (cons url (unbox (cdr entry)))))))
(define seti-index!
(lambda (seti url html)
(assert (eq? (car seti) 'seti))
(let* ((bag (seti-analyze html))
(tokens (map car (unbox (cdr bag)))))
(for-each (lambda (token) (seti-add! seti token url)) tokens))))
Final words
There is already 236 lines of tested code. I hope you enjoyed going through the code, and the writing, and learned a few tricks. I know first hand it is not enough but if it helped you either get into search engines, or get into scheme then all the better.
The astute reader will recognize a missing seti-search
, that is left
as an exercise to the reader. And so does the scoring functionality.
Listing
#!chezscheme
(library (seti)
(export #;main
~check-seti-001
~check-seti-002
~check-seti-003
~check-seti-004
~check-seti-005
)
(import (chezscheme)
(binink match base)
(binink sxpath)
(binink html base)
)
(define pk
(lambda args
(display ";; ")(write args)
(newline)
(flush-output-port)
(car (reverse args))))
(define ~check-seti-001
(lambda ()
(assert
;; expected
(string=? "Hello schemer!! Password ruse grand scheme plan machination"
;; given
(sxml->text
'(html (head (title "Hello schemer!!"))
(body
(div (@ (id "root"))
(p "Password ruse grand scheme plan machination")))))))))
(define (string-join lst delimiter)
(if (null? lst) ""
(fold-left (lambda (item result) (string-append result delimiter item))
(car lst)
(cdr lst))))
(define sxml->text
(lambda (html)
;; return #f in case of failure
(guard (ex (else (display-condition ex) #f))
;; retrieve the title
(define title (match (let ((title ((sxpath '(// title *text*)) html)))
;; TODO: FIXME: I forgot how to match
;; html/head/title with sxpath in a way
;; that is absolute hence this dance
(if (null? title)
'()
;; retrieve the first title, not
;; titles found in body. hopefully
;; the first title is the
;; /html/head/title
(car title)))
;; That exercise the feature of catamorphism, it
;; will require the match to recurse with the
;; syntax ,(fooobarqux) here ,(s*). The
;; recursing happens as if every single s*
;; matched by s* ... are feed one by one as
;; argument of the current match. The convention
;; I learned in nanopass is to call variable
;; that have ellipsis with a star suffix. The
;; variable name is not parsed by the match
;; macro :)
;; do no recurse at*, and do not include them in
;; output. it happens that <title abc="oops">
(((@ ,at* ...) ,(s*) ...)
;; s* is bound to the result of the match
;; catamorphism that is: a list of strings
(string-join s* " "))
;; Ah that should be better. Because the sxpath
;; is // title *text*, sxpath returns only the
;; content of the node title, not the node title
;; that would include the tagging symbol (title
;; "hello world") vs. here "hello world"
((,(s*) ...) (string-join s* " "))
;; For some cases title include some non-string garbage,
;; ignore it. The code reach that because of recursion
;; of the catamorphism.
;; if s is string return as-is, otherwise
;; zero-object aka empty string.
(,s (if (string? s) s ""))))
;; need to transform the body into text... (sxpath '(// body //
;; *text*)) does not do what I expect, it will also match
;; attribute values, so, fetch body, and then match with
;; catamorphism.
(define body (let ((body ((sxpath '(// body)) html)))
;; XXX: there might be no body, bug in htmlprag?
(if (null? body) '() (car body))))
(define text (string-join (match body ;; body looks like '(body (@ at* ...) e* ...)
;; we deal with empty list e* ... such as br.
;; return the zero object
;; in the match will: a) never
;; reverse on at*, and never pick it
;; up in output construction. match
;; need to return a string of text
;; from the body *properly*
;; separated with a single space.
(() (list))
;; ignore script, zero object is list
((script (@ ,at* ...) ,e* ...) (list))
;; ignore no script :')
((noscript (@ ,at* ...) ,e* ...) (list))
;; match any tag with attributes, recurse children
((,tag (@ ,at* ...) ,(e*) ...) (apply append e*))
;; same without attributes
((,tag ,(e*) ...) (apply append e*))
;; if a string wrap in list, otherwise it is an error \w/
(,e (if (string=? e)
(list e)
(error 'html->text "unsupported html"))))
" "))
;; XXX: some space is inserted, i do not know why, anyway it
;; does not matter for what comes next.
(string-append title " " text))))
(define string-clean
(lambda (string)
(list->string (map (lambda (c) (let ((c* (char-downcase c)))
(case c*
((#\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9
#\q #\w #\e #\r #\t #\y #\u #\i
#\o #\p #\a #\s #\d #\f #\g #\h
#\j #\k #\l #\z #\x #\c #\v #\b
#\m #\n)
c*)
(else #\space))))
(string->list string)))))
(define ~check-seti-002
(lambda ()
(assert (string=? (string-clean "ABC def i->k L;") "abc def i k l "))))
(define bag-new
(lambda (tokens)
(define bag (cons 'bag (box (list))))
(for-each (lambda (token) (bag-add! bag token)) tokens)
bag))
(define bag-add!
(lambda (bag token)
(assert (eq? (car bag) 'bag))
(let* ((counts (unbox (cdr bag)))
(count (assoc token counts)))
(if (not count)
;; there is no count for this token, it is new
(set-box! (cdr bag) (cons (cons token (box 1)) counts))
;; there a count for this token, increment
(set-box! (cdr count) (+ 1 (unbox (cdr count))))))))
(define ~check-seti-003
(lambda ()
(assert (equal?
(cons 'bag (box (list
(cons "you" (box 1))
(cons "world" (box 1))
(cons "hello" (box 2)))))
(bag-new (list "hello" "world" "hello" "you"))))))
(define ==>
(lambda procedures
(lambda (object)
(fold-left (lambda (input procedure) (procedure input)) object procedures))))
(define (string-split char-delimiter? string)
(define (maybe-add a b parts)
(if (= a b) parts (cons (substring string a b) parts)))
(let ((n (string-length string)))
(let loop ((a 0) (b 0) (parts '()))
(if (< b n)
(if (not (char-delimiter? (string-ref string b)))
(loop a (+ b 1) parts)
(loop (+ b 1) (+ b 1) (maybe-add a b parts)))
(reverse (maybe-add a b parts))))))
(define stop-words (list "not" "has" "is" "not" ""))
(define seti-analyze (==> html-read
sxml->text
string-clean
(lambda (s) (string-split char-whitespace? s))
(lambda (tokens)
(filter (lambda (token) (not (member token stop-words))) tokens))
bag-new))
(define ~check-seti-004
(lambda ()
(assert (equal?
(seti-analyze "<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
(cons 'bag (box `(("schemers" . ,(box 1))
("world" . ,(box 1))
("hello" . ,(box 2)))))))))
(define seti-new
(lambda ()
(cons 'seti (box (list)))))
(define ~check-seti-005
(lambda ()
(define seti (seti-new))
(seti-index! seti "hello-schemers"
"<html><head><title>hello world</title><head><body><h2>hello schemers</h2></body><html>")
(assert
(equal? seti
(cons 'seti (box
(list
(cons "hello" (box (list "hello-schemers")))
(cons "world" (box (list "hello-schemers")))
(cons "schemers" (box (list "hello-schemers")))
)))))))
(define seti-add!
(lambda (seti token url)
(define entry (assoc token (unbox (cdr seti))))
(if (not entry)
(set-box! (cdr seti) (cons (cons token (box (list url)))
(unbox (cdr seti))))
(set-box! (cdr entry) (cons url (unbox (cdr entry)))))))
(define seti-index!
(lambda (seti url html)
(assert (eq? (car seti) 'seti))
(let* ((bag (seti-analyze html))
(tokens (map car (unbox (cdr bag)))))
(for-each (lambda (token) (seti-add! seti token url)) tokens))))
)