Adding Search to My Pelican Blog with Datasette

Last summer I migrated my blog from Wordpress to Pelican. I did this for a couple of reasons (see my post here), but one thing that I was a bit worried about when I migrated was that Pelican's offering for site search didn't look promising.

There was an outdated plugin called tipue-search but when I was looking at it I could tell it was on it's last legs.

I thought about it, and since my blag isn't super high trafficked AND you can use google to search a specific site, I could wait a bit and see what options came up.

After waiting a few months, I decided it would be interesting to see if I could write a SQLite utility to get the data from my blog, add it to a SQLite database and then use datasette to serve it up.

I wrote the beginning scaffolding for it last August in a utility called pelican-to-sqlite, but I ran into several technical issues I just couldn't overcome. I thought about giving up, but sometimes you just need to take a step away from a thing, right?

After the first of the year I decided to revisit my idea, but first looked to see if there was anything new for Pelican search. I found a tool plugin called search that was released last November and is actively being developed, but as I read through the documentation there was just A LOT of stuff:

  • stork
  • requirements for the structure of your page html
  • static asset hosting
  • deployment requires updating your nginx settings

These all looked a bit scary to me, and since I've done some work using datasette I thought I'd revisit my initial idea.

My First Attempt

As I mentioned above, I wrote the beginning scaffolding late last summer. In my first attempt I tried to use a few tools to read the md files and parse their yaml structure and it just didn't work out. I also realized that Pelican can have reStructured Text and that any attempt to parse just the md file would never work for those file types.

My Second Attempt

The Plugin

During the holiday I thought a bit about approaching the problem from a different perspective. My initial idea was to try and write a datasette style package to read the data from pelican. I decided instead to see if I could write a pelican plugin to get the data and then add it to a SQLite database. It turns out, I can, and it's not that hard.

Pelican uses signals to make plugin in creation a pretty easy thing. I read a post and the documentation and was able to start my effort to refactor pelican-to-sqlite.

From The missing Pelican plugins guide I saw lots of different options, but realized that the signal article_generator_write_article is what I needed to get the article content that I needed.

I then also used sqlite_utils to insert the data into a database table.

def save_items(record: dict, table: str, db: sqlite_utils.Database) -> None:  # pragma: no cover
    db[table].insert(record, pk="slug", alter=True, replace=True)

Below is the method I wrote to take the content and turn it into a dictionary which can be used in the save_items method above.

def create_record(content) -> dict:
    record = {}
    author = content.author.name
    category = content.category.name
    post_content = html2text.html2text(content.content)
    published_date = content.date.strftime("%Y-%m-%d")
    slug = content.slug
    summary = html2text.html2text(content.summary)
    title = content.title
    url = "https://www.ryancheley.com/" + content.url
    status = content.status
    if status == "published":
        record = {
            "author": author,
            "category": category,
            "content": post_content,
            "published_date": published_date,
            "slug": slug,
            "summary": summary,
            "title": title,
            "url": url,
        }
    return record

Putting these together I get a method used by the Pelican Plugin system that will generate the data I need for the site AND insert it into a SQLite database

def run(_, content):
    record = create_record(content)
    save_items(record, "content", db)

def register():
    signals.article_generator_write_article.connect(run)

The html template update

I use a custom implementation of Smashing Magazine. This allows me to do some edits, though I mostly keep it pretty stock. However, this allowed me to make a small edit to the base.html template to include a search form.

In order to add the search form I added the following code to base.html below the nav tag:

    <section class="relative h-8">
    <section class="absolute inset-y-0 right-10 w-128">
    <form
    class = "pl-4"
    <
    action="https://search-ryancheley.vercel.app/pelican/article_search?text=name"
    method="get">
            <label for="site-search">Search the site:</label>
            <input type="search" id="site-search" name="text"
                    aria-label="Search through site content">
            <button class="rounded-full w-16 hover:bg-blue-300">Search</button>
    </form>
    </section>

Putting it all together with datasette and Vercel

Here's where the magic starts. Publishing data to Vercel with datasette is extremely easy with the datasette plugin datasette-publish-vercel.

You do need to have the Vercel cli installed, but once you do, the steps for publishing your SQLite database is really well explained in the datasette-publish-vercel documentation.

One final step to do was to add a MAKE command so I could just type a quick command which would create my content, generate the SQLite database AND publish the SQLite database to Vercel. I added the below to my Makefile:

vercel:
    { \
    echo "Generate content and database"; \
    make html; \
    echo "Content generation complete"; \
    echo "Publish data to vercel"; \
    datasette publish vercel pelican.db --project=search-ryancheley --metadata metadata.json; \
    echo "Publishing complete"; \
    }

The line

datasette publish vercel pelican.db --project=search-ryancheley --metadata metadata.json; \

has an extra flag passed to it (--metadata) which allows me to use metadata.json to create a saved query which I call article_search. The contents of that saved query are:

select summary as 'Summary', url as 'URL', published_date as 'Published Data' from content where content like '%' || :text || '%' order by published_date

This is what allows the action in the form above to have a URL to link to in datasette and return data!

With just a few tweaks I'm able to include a search tool, powered by datasette for my pelican blog. Needless to say, I'm pretty pumped.

Next Steps

There are still a few things to do:

  1. separate search form html file (for my site)
  2. formatting datasette to match site (for my vercel powered instance of datasette)
  3. update the README for pelican-to-sqlite package to better explain how to fully implement
  4. Get pelican-to-sqlite added to the pelican-plugins page

Publishing content to Pelican site

There are a lot of different ways to get the content for your Pelican site onto the internet. The Docs show an example using rsync.

For automation they talk about the use of either Invoke or Make (although you could also use Just instead of Make which is my preferred command runner.)

I didn't go with any of these options, instead opting to use GitHub Actions instead.

I have two GitHub Actions that will publish updated content. One action publishes to a UAT version of the site, and the other to the Production version of the site.

Why two actions you might ask?

Right now it's so that I can work through making my own theme and deploying it without disrupting the content on my production site. Also, it's a workflow that I'm pretty used to:

  1. Local Development
  2. Push to Development Branch on GitHub
  3. Pull Request into Main on GitHub

It kind of complicates things right now, but I feel waaay more comfortable with having a UAT version of my site that I can just undo if I need to.

Below is the code for the Prod Deployment

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
name: Pelican Publish

on:
push:
    branches:
    - main

jobs:
deploy:
    runs-on: ubuntu-18.04
    steps:
    - name: deploy code
        uses: appleboy/ssh-action@v0.1.2
        with:
        host: ${{ secrets.SSH_HOST }}
        key: ${{ secrets.SSH_KEY }}
        username: ${{ secrets.SSH_USERNAME }}

        script: |
            rm -rf ryancheley.com
            git clone git@github.com:ryancheley/ryancheley.com.git

            source /home/ryancheley/venv/bin/activate

            cp -r ryancheley.com/* /home/ryancheley/

            cd /home/ryancheley

            pip install -r requirements.txt

            pelican content -s publishconf.py

Let's break it down a bit

Lines 3 - 6 are just indicating when the actually perform the actions in the lines below.

In line 13 I invoke the appleboy/ssh-action@v0.1.2 which allows me to ssh into my server and then run some command line functions.

On line 20 I remove the folder where the code was previously cloned from, and in line 21 I run the git clone command to download the code

Line 23 I activate my virtual environment

Line 25 I copy the code from the cloned repo into the directory of my site

Line 27 I change directory into the source for the site

Line 29 I make any updates to requirements with pip install

Finally, in line 31 I run the command to publish the content (which takes my .md files and turns them into HTML files to be seen on the internet)

Setting up the Server to host my Pelican Site

Creating the user on the server

Each site on my server has it's own user. This is a security consideration, more than anything else. For this site, I used the steps from some of my scripts for setting up a Django site. In particular, I ran the following code from the shell on the server:

adduser --disabled-password --gecos "" ryancheley

adduser ryancheley www-data

The first command above creates the user with no password so that they can't actually log in. It also creates the home directory /home/ryancheley. This is where the site will be server from.

The second commands adds the user to the www-data group. I don't think that's strictly necessary here, but in order to keep this user consistent with the other web site users, I ran it to add it to the group.

Creating the nginx config file

For the most part I cribbed the nginx config files from this blog post.

There were some changes that were required though. As I indicated in part 1, I had several requirements I was trying to fulfill, most notably not breaking historic links.

Here is the config file for my UAT site (the only difference between this and the prod site is the server name on line 3):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
server {

    server_name uat.ryancheley.com;
    root /home/ryancheley/output;

    location / {
        # Serve a .gz version if it exists
        gzip_static on;
        error_page 404 /404.html;
        rewrite ^/index.php/(.*) /$1  permanent;
    }

    location = /favicon.ico {
        # This never changes, so don't let it expire
        expires max;
    }


    location ^~ /theme {
        # This content should very rarely, if ever, change
        expires 1y;
    }

    listen [::]:443 ssl ipv6only=on; # managed by Certbot
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/uat.ryancheley.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/uat.ryancheley.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

}

server {
    if ($host = uat.ryancheley.com) {
        return 301 https://$host$request_uri;
    } # managed by Certbot



    listen [::]:80;
    listen 80;

    server_name uat.ryancheley.com;
    return 404; # managed by Certbot


}

The most interesting part of the code above is the location block from lines 6 - 11.

    location / {
        # Serve a .gz version if it exists
        gzip_static on;
        error_page 404 /404.html;
        rewrite ^/index.php/(.*) /$1  permanent;
    }

Custom 404 Page

    error_page 404 /404.html;

This line is what allows me to have a custom 404 error page. If a page is not found nginx will serve up the html page 404.html which is generated by a markdown file in my pages directory and looks like this:

    Title: Not Found
    Status: hidden
    Save_as: 404.html

    The requested item could not be located.

I got this implementation idea from the Pelican docs.

Rewrite rule for index.php in the URL

    rewrite ^/index.php/(.*) /$1  permanent;

The rewrite line fixes the index.php challenge I mentioned in the previous post

It took me a really long time to figure this out because the initial config file had a location block that looked like this:

1
2
3
4
5
    location = / {
        # Instead of handling the index, just
        # rewrite / to /index.html
        rewrite ^ /index.html;
    }

I didn't recognize the location = / { on line 1 as being different than the location block above starting at line 6. So I added

    rewrite ^/index.php/(.*) /$1  permanent;

to that block and it NEVER worked because it never could.

The = in the location block indicates a literal exact match, which the regular expression couldn't do because it's trying to be dynamic, but the = indicates static 🤦🏻‍♂️

OK, we've got a user, and we've got a configuration file, now all we need is a way to get the files to the server.

I'll go over that in the next post.

Migrating to Pelican from Wordpress

A little back story

In October of 2017 I wrote about how I migrated from SquareSpace to Wordpress. After almost 4 years I’ve decided to migrate again, this time to Pelican. I did a bit of work with Pelican during my 100 Days of Web Code back in 2019.

A good question to ask is, “why migrate to a new platform” The answer, is that while writing my post Debugging Setting up a Django Project I had to go back and make a change. It was the first time I’d ever had to use the WordPress Admin to write anything ... and it was awful.

My writing and posting workflow involves Ulysses where I write everything in MarkDown. Having to use the WYSIWIG interface and the ‘blocks’ in WordPress just broke my brain. That meant what should have been a slight tweak ended up taking me like 45 minutes.

I decided to give Pelican a shot in a local environment to see how it worked. And it turned out to work very well for my brain and my writing style.

Setting it up

I set up a local instance of Pelican using the Quick Start guide in the docs.

Pelican has a CLI utility that converts the xml into Markdown files. This allowed me to export my Wordpress blog content to it’s XML output and save it in the Pelican directory I created.

I then ran the command:

pelican-import --wp-attach -o ./content ./wordpress.xml

This created about 140 .md files

Next, I ran a few Pelican commands to generate the output:

pelican content

and then the local web server:

pelican --listen

I reviewed the page and realized there was a bit of clean up that needed to be done. I had categories of Blog posts that only had 1 article, and were really just a different category that needed to be tagged appropriately. So, I made some updates to the categorization and tagging of the posts.

I also had some broken links I wanted to clean up so I took the opportunity to check the links on all of the pages and make fixes where needed. I used the library LinkChecker which made the process super easy. It is a CLI that generates HTML that you can then review. Pretty neat.

Deploying to a test server

The first thing to do was to update my DNS for a new subdomain to point to my UAT server. I use Hover and so it was pretty easy to add the new entry.

I set uat.ryancheley.com to the IP Address 178.128.188.134

Next, in order to have UAT serve requests for my new site I need to have a configuration file for Nginx. This post gave me what I needed as a starting point for the config file. Specifically it gave me the location blocks I needed:

    location = / {
        # Instead of handling the index, just
        # rewrite / to /index.html
        rewrite ^ /index.html;
    }

    location / {
        # Serve a .gz version if it exists
        gzip_static on;
        # Try to serve the clean url version first
        try_files $uri.htm $uri.html $uri =404;
    }

With that in hand I deployed my pelican site to the server

The first thing I noticed was that the URLs still had index.php in them. This is a hold over from how my WordPress URL schemes were set up initially that I never got around to fixing but it’s always something that’s bothered me.

My blog may not be something that is linked to a ton (or at all?), but I didn’t want to break any links if I didn’t have to, so I decided to investigate Nginx rewrite rules.

I spent a bit of time trying to get my url to from this:

https://www.ryancheley.com/index.php/2017/10/01/migrating-from-square-space-to-word-press/

to this:

https://www.ryancheley.com/migrating-from-square-space-to-word-press/

using rewrite rules.

I gave up after several hours of trying different things. This did lead me to some awesome settings for Pelican that would allow me to retain the legacy Wordpress linking structure, so I updated the settings file to include this line:

ARTICLE_URL = 'index.php/{date:%Y}/{date:%m}/{date:%d}/{slug}/'
ARTICLE_SAVE_AS = 'index.php/{date:%Y}/{date:%m}/{date:%d}/{slug}/index.html'

OK. I still have the index.php issue, but at least my links won’t break.

404 Not Found

I starting testing the links on the site just kind of clicking here and there and discovered a couple of things:

  1. The menu links didn’t always work
  2. The 404 page wasn’t styled like I wanted it to me styled

The pelican documentation has an example for creating your own 404 pages which also includes what to update the Nginx config file location block.

And this is what lead me to discover what I had been doing wrong for the rewrites earlier!

There are two location blocks in the example code I took, but I didn’t see how they were different.

The first location block is:

    location = / {
        # Instead of handling the index, just
        # rewrite / to /index.html
        rewrite ^ /index.html;
    }

Per the Nginx documentation the =

If an equal sign is used, this block will be considered a match if the request URI exactly matches the location given.

BUT since I was trying to use a regular expression, it wasn’t matching exactly and so it wasn’t ‘working’

The second location block was not an exact match (notice there is no = in the first line:

location / {
        # Serve a .gz version if it exists
        gzip_static on;
        # Try to serve the clean url version first
        try_files $uri.htm $uri.html $uri =404;
    }

When I added the error page setting for Pelican I also added the URL rewrite rules to remove the index.php and suddenly my dream of having the redirect rules worked!

Additionally, I didn’t need the first location block at all. The final location block looks like this:

    location / {
        # Serve a .gz version if it exists
        gzip_static on;
        # Try to serve the clean url version first
        # try_files $uri.htm $uri.html $uri =404;
        error_page 404 /404.html;
        rewrite ^/index.php/(.*) /$1  permanent;
    }

I was also able to update my Pelican settings to this:

ARTICLE_URL = '{date:%Y}/{date:%m}/{date:%d}/{slug}/'
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{date:%d}/{slug}/index.html'

Victory!

What I hope to gain from moving

In my post outlining the move from SquareSpace to Wordpress I said,

As I wrote earlier my main reason for leaving Square Space was the difficulty I had getting content in. So, now that I’m on a WordPress site, what am I hoping to gain from it?

  1. Easier to post my writing
  2. See Item 1

Writing is already really hard for me. I struggle with it and making it difficult to get my stuff out into the world makes it that much harder. My hope is that not only will I write more, but that my writing will get better because I’m writing more.

So, what am I hoping to gain from this move:

  1. Just as easy to write my posts
  2. Easier to edit my posts

Writing is still hard for me (nearly 4 years later) and while moving to a new shiny tool won’t make the thinking about writing any easier, maybe it will make the process of writing a little more fun and that may lead to more words!

Addendum

There are already a lot of words here and I have more to say on this. I plan on writing a couple of more posts about the migration:

  1. Setting up the server to host Pelican
  2. The writing workflow used