her.esy.fun/src/posts/rss-gen.org
Yann Esposito (Yogsototh) 22a2d2bd6c
Improved RSS generation
2019-09-30 15:10:39 +02:00

RSS Generation

TL;DR: Generating an RSS file requires a lot of metadata, and not all of it is present in every HTML file. So generating RSS from a tree of HTML files is not straightforward. Here is the script I use.

RSS Problem

An RSS feed is meant to declare updates and new articles for a website. Each RSS entry must therefore have a date, a unique id, a title, maybe some categories, etc…
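To make those fields concrete, here is a minimal sketch of a single RSS entry printed from its metadata. The helper name `mkitem` and the example.com URL are hypothetical, used only for illustration; the element names (`title`, `guid`, `pubDate`, `category`) are the standard RSS 2.0 ones.

```shell
# Hypothetical helper: print one RSS <item> from its metadata.
# Usage: mkitem TITLE URL PUBDATE [CATEGORY...]
mkitem() {
    title=$1
    url=$2
    pubdate=$3
    shift 3
    printf '<item>\n'
    printf '  <title>%s</title>\n' "$title"
    printf '  <guid>%s</guid>\n' "$url"
    printf '  <pubDate>%s</pubDate>\n' "$pubdate"
    # every remaining argument becomes one <category> element
    for category in "$@"; do
        printf '  <category>%s</category>\n' "$category"
    done
    printf '</item>\n'
}

mkitem "Hello" "https://example.com/posts/hello.html" \
       "Mon, 30 Sep 2019 00:00:00 +0000" blog rss
```

Each of these values has to come from somewhere, which is the whole problem when the only input is rendered HTML.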

For most blog platforms, or even static website generators, this metadata is clearly stated in the sources or in some DB.

I use org-mode to generate my website, and ox-rss is quite slow when generating an RSS feed with the full content of each item. The way to get the full content of my articles inside an RSS feed with ox-rss is to first create a very big org file containing all the articles, and then transform it into RSS. This is very slow (many minutes).

So a simpler idea, inspired by lb, is to generate the RSS directly from the generated HTML files. The only difficulty is finding the metadata inside that HTML. Unfortunately, there is no real standard for all this metadata.

As there is no standard place for all this metadata inside an HTML file, using the HTML as the source means you need to "parse" it. For that purpose I use html-xml-utils.
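As a rough sketch of that parsing step, assuming html-xml-utils is installed: `hxclean` repairs the HTML so that `hxselect` can then pull out elements with CSS-like selectors. The sample page and the variable names below are made up for illustration; the `.article-date` class is the one the script relies on.

```shell
# Assumes hxclean and hxselect (html-xml-utils) are on the PATH.
if command -v hxclean >/dev/null && command -v hxselect >/dev/null; then
    workdir=$(mktemp -d)
    # A made-up page with the two pieces of metadata we want
    cat > "$workdir/page.html" <<'EOF'
<html><head><title>Hello</title></head>
<body>
<span class="article-date">2019-09-30</span>
<div id="content"><p>Some text</p></div>
</body></html>
EOF
    # Repair the HTML first, then select by CSS-like selector;
    # -c prints only the content of the matched element
    hxclean "$workdir/page.html" > "$workdir/page.xml"
    title=$(hxselect -c 'title' < "$workdir/page.xml")
    pubdate=$(hxselect -c '.article-date' < "$workdir/page.xml")
    printf 'title=%s date=%s\n' "$title" "$pubdate"
    rm -rf "$workdir"
else
    echo "html-xml-utils not found; skipping" >&2
fi
```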

Here is the full script I use:

#!/usr/bin/env nix-shell
#!nix-shell -i zsh

# Directory
webdir="_site"
postsdir="$webdir/posts"
rssfile="$webdir/rss.xml"

# maximal number of articles to put in the RSS file
maxarticles=10

# RSS Metas
rsstitle="her.esy.fun"
rssurl="https://her.esy.fun/rss.xml"
websiteurl="https://her.esy.fun"
rssdescription="her.esy.fun articles, mostly random personal thoughts"
rsslang="en"
rssauthor="yann@esposito.host (Yann Esposito)"
rssimgtitle="yogsototh"
rssimgurl="https://her.esy.fun/img/FlatAvatar.png"

# HTML Accessors (similar to CSS accessors)
dateaccessor='.article-date'
contentaccessor='#content'
# title and keyword shouldn't be changed
titleaccessor='title'
keywordsaccessor='meta[name=keywords]::attr(content)'

formatdate() {
    # format the date for RSS
    local d=$1
    LC_TIME=en_US date --date $d +'%a, %d %b %Y %H:%M:%S %z'
}

finddate(){ < $1 hxselect -c $dateaccessor }
findtitle(){ < $1 hxselect -c $titleaccessor }
getcontent(){ < $1 hxselect $contentaccessor }
findkeywords(){ < $1 hxselect -c $keywordsaccessor | sed 's/,//g' }
mkcategories(){
    for keyword in $*; do
        printf "\\n<category>%s</category>" $keyword
    done
}

autoload -U colors && colors

tmpdir=$(mktemp -d)
typeset -a dates
dates=( )
for fic in $postsdir/**/*.html; do
    blogfile="$(echo "$fic"|sed 's#^'$postsdir'/##')"
    printf "%-30s" $blogfile
    xfic="$tmpdir/$fic.xml"
    mkdir -p $(dirname $xfic)
    hxclean $fic > $xfic
    d=$(finddate $xfic)
    echo -n " [$d]"
    rssdate=$(formatdate $d)
    title=$(findtitle $xfic)
    keywords=( $(findkeywords $xfic) )
    printf ": %-55s" "$title ($keywords)"
    categories=$(mkcategories $keywords)
    { printf "\\n<item>"
      printf "\\n<title>%s</title>" "$title"
      printf "\\n<guid>%s</guid>" "${websiteurl}/${blogfile}"
      printf "\\n<pubDate>%s</pubDate>" "$rssdate"
      printf "%s" "$categories"
      printf "\\n<description><![CDATA[\\n%s\\n]]></description>" "$(getcontent "$xfic")"
      printf "\\n</item>\\n\\n"
    } >>  "$tmpdir/${d}-$(basename $fic).rss"
    dates=( $d $dates )
    echo " [${fg[green]}OK${reset_color}]"
done
echo "Publishing"
for fic in $(ls $tmpdir/*.rss | sort -r | head -n $maxarticles ); do
    echo "${fic:t}"
    cat $fic >> $tmpdir/rss
done

rssmaxdate=$(formatdate $(for d in $dates; do echo $d; done | sort -r | head -n 1))
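# "now" avoids splitting the output of $(date) into several arguments,
# of which formatdate would only read the first
rssbuilddate=$(formatdate now)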
{
cat <<END
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
     xmlns:georss="http://www.georss.org/georss"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
     xmlns:media="http://search.yahoo.com/mrss/"><channel>
  <title>${rsstitle}</title>
  <atom:link href="${rssurl}" rel="self" type="application/rss+xml" />
  <link>${websiteurl}</link>
  <description><![CDATA[${rssdescription}]]></description>
  <language>${rsslang}</language>
  <pubDate>${rssmaxdate}</pubDate>
  <lastBuildDate>$rssbuilddate</lastBuildDate>
  <generator>mkrss.sh</generator>
  <webMaster>${rssauthor}</webMaster>
  <image>
    <url>${rssimgurl}</url>
    <title>${rssimgtitle}</title>
    <link>${websiteurl}</link>
  </image>
END
cat $tmpdir/rss
cat <<END
</channel>
</rss>
END
} > "$rssfile"

rm -rf $tmpdir
echo "RSS Generated"
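
Two parts of the script are easy to try in isolation: the date conversion and the selection of the most recent items. The sketch below assumes GNU date (the `--date` flag is a GNU extension) and assumes the article date is written in ISO form (`YYYY-MM-DD`), which is what makes a reverse lexical sort equal to newest-first. The names `rfc822_date` and `newest_items` are hypothetical, and `-u` and `LC_ALL=C` are additions that keep the output reproducible.

```shell
# Convert a date to the format RSS expects for <pubDate>.
# LC_ALL=C forces English day and month names.
rfc822_date() {
    LC_ALL=C date -u --date "$1" +'%a, %d %b %Y %H:%M:%S %z'
}

# List the N newest "<date>-<post>.rss" fragments in a directory.
newest_items() {
    ls "$1"/*.rss | sort -r | head -n "$2"
}

rfc822_date "2019-09-30"   # Mon, 30 Sep 2019 00:00:00 +0000

# Made-up fragments, named the way the script names them
demo=$(mktemp -d)
touch "$demo/2019-08-01-a.html.rss" \
      "$demo/2019-09-30-b.html.rss" \
      "$demo/2019-07-15-c.html.rss"
newest_items "$demo" 2
```

If the date shown on the page used another format (for example `30/09/2019`), the file names would no longer sort chronologically and the selection would need an explicit sort key.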

The nix-shell shebang pattern is a neat trick to get all the dependencies I need when running my script. I could have added zsh too, but my main concern was html-xml-utils.

Alongside my script I keep a shell.nix file containing:

{ pkgs ? import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/19.09-beta.tar.gz") {} }:
  pkgs.mkShell {
    buildInputs = [ pkgs.html-xml-utils ];
  }