Pinboard bookmark to PDF with NodeJS and Puppeteer

I often bookmark articles, explanations, tutorials, etc. with the plan to one day come back and read them. Sadly, by the time I do, there are usually a few dead links, lost forever.

Recently I found myself frustrated at the sheer number of bookmarks in my browser and decided a bookmarking service was needed. In the past, I have used Delicious and even rolled my own, but I had also previously been a paying customer of Pinboard, and after some searching decided it best suited my needs: tags, a browser plugin, and an API.

After reactivating my account I went through the ~100 bookmarks I had added back in 2015: I removed any dead links, rediscovered some forgotten gems, and deleted anything no longer relevant. Finally, I imported all ~650 of my Firefox bookmarks and set about sorting those. I hope to keep the tags to a minimum, favouring search over an abundance of tags.

To mitigate the loss of anything I found particularly interesting or important, such as this list of Undocumented DOS Commands, I wanted to be able to tag a link with topdf and have the site automatically printed to a PDF file and stored in Dropbox or Google Drive.

The first port of call was IFTTT (If This Then That) where I set up a pathway from my Pinboard account to a service called PDFmyURL. This worked great, but it turned out they don't like free users printing to PDF using automation, and I wasn't about to pay $19 a month for the privilege.

But I'm a professional software engineer! It shouldn't be too hard to roll my own solution...

Looking up the Pinboard API I found an endpoint that would return the most recent pins (that's what they call bookmarks) for a given tag, in my case topdf:

https://api.pinboard.in/v1/posts/recent
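
The same data can be pulled without any libraries. As a quick check from the command line, assuming your API token (the username:TOKEN value from the Pinboard settings page) is in a PINBOARD_TOKEN environment variable:

curl "https://api.pinboard.in/v1/posts/recent?auth_token=$PINBOARD_TOKEN&format=json&tag=topdf"

Without format=json the API returns XML.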

I had already used Puppeteer in a solution for a client looking to generate and print labels from a web interface, so I knew it could handle converting a URL to a full PDF document. Puppeteer is a Node library which settled the question of which language to use.

Turns out there's a neat library on NPM for Pinboard called node-pinboard. I also chose to use slugify to convert the page titles to clean filenames, and something called scroll-to-bottomjs, which I'll get to later:

npm i --save puppeteer node-pinboard slugify scroll-to-bottomjs

Since Puppeteer is a pretty heavyweight tool, featuring a headless but otherwise fully functioning Chromium browser and V8 JavaScript runtime, I had to install some dependencies to get things working:

sudo apt install ca-certificates fonts-liberation gconf-service libappindicator1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils

Getting the bookmarks

The first thing to do is grab our bookmarks:

const Pinboard = require(`node-pinboard`).default;

const pinboard  = new Pinboard(<YOUR_PINBOARD_API_TOKEN>);

function getFromPinboard() {
  pinboard.recent({ tag : `topdf` })
    .then((data) => {
      if(data.posts.length === 0){ return process.exit(0); }  // If nothing, exit
      const promises = [];
      data.posts.forEach((p) => {
        promises.push(generatePDF(p.href, p.description));
      });
      if(promises.length === 0){ process.exit(0); }  // If nothing, exit
      return Promise.all(promises);
    })
    .then((result) => {
      console.log(result);  // Array of filenames returned by generatePDF (null for any that failed)
    })
    .catch((error) => {
      console.error(error);
    });
}

The above code grabs all the recent pins tagged topdf. Looping through each one, we send both its description (the title of the pin, which I may have altered from the default page title) and its href, or URL, to our generatePDF function. By pushing these async calls onto an array and using Promise.all we can run them all in parallel, eventually getting back an array of stored filenames.

Generating PDFs

const Puppeteer      = require(`puppeteer`);
const ScrollToBottom = require(`scroll-to-bottomjs`);
const Slugify        = require(`slugify`);

async function generatePDF(url, title) {
  try {
    const browser = await Puppeteer.launch({ headless : true, args : ['--no-sandbox'] }); // Puppeteer can only generate pdf in headless mode.
    const page = await browser.newPage();
    await page.goto(url, { waitUntil : 'networkidle2', timeout : 0 }); // timeout : 0 disables the network timeout
    const filename = `${Slugify(title)}.pdf`.toLowerCase();
    const pdfConfig = {
      path            : `/tmp/${filename}`, 
      format          : 'A4',
      printBackground : true
    };
    await page.emulateMedia('screen'); // Use the screen stylesheet rather than print; newer Puppeteer versions call this page.emulateMediaType('screen')
    await page.evaluate(ScrollToBottom);
    await page.pdf(pdfConfig);
    await browser.close();
    return filename;
  } catch(e) {
    console.error(e);
    return null;
  }
}

The above function launches an instance of Puppeteer, loads the URL, uses Slugify to create a clean filename, then generates a PDF before returning the filename. The line await page.evaluate(ScrollToBottom); scrolls the page to the bottom before rendering the PDF, which ensures lazy-loaded images are rendered. I've chosen to store the PDF files in /tmp to ensure they will be deleted at some point. I did end up creating a cron job to clear any .pdf files once a week.
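
The cleanup can be a single crontab entry. This is just a sketch of the idea rather than my exact job, assuming any week-old PDFs in /tmp can safely go:

0 3 * * 0 find /tmp -maxdepth 1 -name "*.pdf" -mtime +7 -delete

That runs at 03:00 every Sunday and deletes any PDF in /tmp older than seven days.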

Preventing duplicates

Running this daily means that pins I processed yesterday could be processed again today. Since the local PDF files are deleted periodically, I can't simply check for an existing file; I need a different way to avoid duplicates.

The Pinboard API returns a fair amount of information on each pin:

{
  href: 'https://developer.nvidia.com/video-encode-decode-gpu-support-matrix',
  description: 'Video Encode and Decode GPU Support Matrix | NVIDIA Developer',
  extended: '',
  meta: '39ca3d2e5174865610eff3f2f1b83970',
  hash: 'c31630a3d99256a99a3718f10e7ecf37',
  time: '2020-05-31T11:23:43Z',
  shared: 'no',
  toread: 'no',
  tags: 'ffmpeg encoding'
},

Most notable here is hash. A hash is a unique fingerprint, or signature, of the URL, so no two different URLs will give the same hash. Each time the program runs, the previous day's hashes are loaded as an array from a JSON file. The new pins pulled from the API are checked against this array: anything not yet processed has a PDF generated and uploaded, while anything already in the array is simply added to a new array, which we'll call new_hashes.

If a URL gets all the way through and is successfully uploaded, its hash too is added to new_hashes, which is then converted to JSON and overwrites the previous day's file. If a URL is not successful, that is, it fails at either the PDF or upload stage, its hash is not added to the new_hashes array and so it will be processed again the following day.
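
To make the bookkeeping concrete, here's a minimal sketch of how it could look. The hashes file location, the helper names, and the processPins wrapper are my own illustrations rather than the exact code, and the upload step is left out since it's covered separately:

const fs = require(`fs`);

const HASH_FILE = `/tmp/pinboard-hashes.json`;  // Hypothetical location for the previous run's hashes

function loadHashes() {
  try {
    return JSON.parse(fs.readFileSync(HASH_FILE, `utf8`));  // Array of hash strings from the last run
  } catch(e) {
    return [];  // First run, or the file has been cleaned up
  }
}

function saveHashes(hashes) {
  fs.writeFileSync(HASH_FILE, JSON.stringify(hashes));
}

// Roughly how it slots into the loop from earlier
function processPins(posts) {
  const oldHashes = loadHashes();
  const newHashes = [];
  const promises  = [];
  posts.forEach((p) => {
    if(oldHashes.includes(p.hash)) {
      newHashes.push(p.hash);  // Already processed on a previous run, just carry the hash forward
    } else {
      promises.push(
        generatePDF(p.href, p.description)
          .then((filename) => { if(filename) { newHashes.push(p.hash); } })  // Only record on success
      );
    }
  });
  return Promise.all(promises).then(() => saveHashes(newHashes));
}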

Uploading

As an early adopter I managed to snag a free 50 GB account on box.com which I've never really found a use for, so this was ideal for me. I'll go into uploading the files there in a different post as it won't apply to everyone, but you can find tutorials elsewhere online for uploading to Dropbox, OneDrive, Google Drive, S3, etc.