Orchard Lab

A place to share my thoughts and learnings

Playground

Posted at — Sep 9, 2023

I love tmp folder driven development. It gives me the freedom to experiment without worrying about polluting anything.

tmp folder driven development:

Start in a tmp folder and mess around with libraries or scripts before putting them into the actual project.

Recently I’ve had a great time using bun:

Develop, test, run, and bundle JavaScript & TypeScript projects—all with Bun. Bun is an all-in-one JavaScript runtime & toolkit designed for speed, complete with a bundler, test runner, and Node.js-compatible package manager.

All that means to me is that I can write straight-up TypeScript and use any npm package without messing around with endless configuration and build tools.

Idea

While building a desktop app, one feature I want is to get the cover image of a PDF file.

So, instead of immediately adding npm packages to my project, I want to quickly try them out first.

Playground

Playground is such an interesting and powerful concept if you think about it. It’s safe and sandboxed.

To set up a playground for my experiment, I did the following (thanks to bun):

cd /tmp
mkdir pdf-cover && cd pdf-cover
bun init -y

That’s it. Now I can just start writing TypeScript and start adding libraries.
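The same throwaway folder can also be created from TypeScript itself. This is only a minimal sketch using Node's built-in modules; the helper name `createPlayground`, the `pdf-cover-` prefix, and the `notes.txt` file are illustrative assumptions and not part of the original setup.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Create a uniquely named scratch directory under the OS temp folder.
// mkdtempSync appends random characters to the prefix, so reruns never collide.
function createPlayground(prefix: string): string {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

const playgroundDir = createPlayground("pdf-cover-");

// Drop a scratch file in it to confirm the folder is writable.
fs.writeFileSync(path.join(playgroundDir, "notes.txt"), "scratch\n");
console.log("Playground ready at", playgroundDir);
```

Because the folder lives under the temp directory, it can be deleted at any time without touching the real project.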

Assumption

The assumption is that reading the first page of the PDF and converting it into an image is enough, since that’s where most PDFs (especially books) place their cover image.

Experiment

To extract the first page of the pdf, we can use pdf-lib.

bun add pdf-lib

Here is the code:

import fs from "fs";
import { PDFDocument } from "pdf-lib";

const pdfFilePath = "example.pdf";

// Read the existing PDF into a buffer
const existingPdfBytes: Buffer = fs.readFileSync(pdfFilePath);

// Load the PDF with pdf-lib
const pdfDoc = await PDFDocument.load(existingPdfBytes);

// Create a new PDF document
const newPdfDoc = await PDFDocument.create();

// Copy the cover page from the existing PDF into the new PDF
const [coverPage] = await newPdfDoc.copyPages(pdfDoc, [0]);
newPdfDoc.addPage(coverPage);

// Serialize the PDF to bytes and write it to the file system
const newPdfBytes: Uint8Array = await newPdfDoc.save();
const coverPdfPath: string = "cover_page.pdf";
fs.writeFileSync(coverPdfPath, newPdfBytes);

Running our script is dead simple:

bun run index.ts

Now we have the cover page PDF. Next, let’s try to convert it into an image.

To do that, we add the pdf2pic package:

bun add pdf2pic
# Because it calls graphicsmagick underneath
brew install graphicsmagick

pdf2pic can convert an existing PDF to an image given some options.

import { fromPath } from "pdf2pic";

const options = {
  density: 200,
  saveFilename: "cover-image",
  savePath: "./",
  format: "png", // extension without the leading dot
  width: 600,
  height: 600,
};
const convert = fromPath("cover_page.pdf", options);
const pageToConvertAsImage = 1;

await convert(pageToConvertAsImage, { responseType: "image" });

Then let’s run our script again using bun run index.ts. Cool, now we get the image we want.

However, the image is not proportioned like the original PDF page, because we hard-coded a 600 × 600 output. So we need to figure out the dimensions of the original PDF.

We can then go ahead and write a function to figure that out:

async function getCoverPageDimensions(pdfFile: string): Promise<{
  width: number;
  height: number;
}> {
  // Read the existing PDF into a buffer
  const existingPdfBytes: Buffer = fs.readFileSync(pdfFile);

  // Load the PDF with pdf-lib
  const pdfDoc = await PDFDocument.load(existingPdfBytes);

  // Get the cover page
  const coverPage = pdfDoc.getPages()[0];

  // Get the dimensions
  return coverPage.getSize();
}
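One way to use those dimensions, sketched under a few assumptions: the page size is expressed in PDF points (1/72 inch by default), so it can be converted to pixels at a chosen DPI, or scaled to a fixed width while keeping the aspect ratio. The helper names `pointsToPixels` and `fitToWidth` and the US Letter example page are illustrative only and not part of the original script.

```typescript
interface PageSize {
  width: number;
  height: number;
}

// Convert a size in PDF points (assumed 72 per inch) to pixels at the given DPI.
function pointsToPixels(size: PageSize, dpi: number): PageSize {
  const scale = dpi / 72;
  return {
    width: Math.round(size.width * scale),
    height: Math.round(size.height * scale),
  };
}

// Scale a size to a target width while keeping the original aspect ratio.
function fitToWidth(size: PageSize, targetWidth: number): PageSize {
  const aspectRatio = size.height / size.width;
  return {
    width: targetWidth,
    height: Math.round(targetWidth * aspectRatio),
  };
}

// Example page: US Letter is 612 x 792 points.
const letterPage: PageSize = { width: 612, height: 792 };
console.log(pointsToPixels(letterPage, 200));
console.log(fitToWidth(letterPage, 600));
```

Either result could stand in for the hard-coded `width` and `height` in the pdf2pic options, depending on whether a fixed output width or a DPI-faithful size is preferred.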

Modularize

Now we have a proof of concept, without any fighting with modules, TypeScript setup, build configuration, or ts-node.

To consume this, let’s turn it into a module (a file with exported functions) so that it can act as a well-isolated unit.

Let’s define the interface we want:

export async function extractPDFCoverImage(
  pdfFilePath: string,
  coverOutputPath: string
) {}

Also, as you may have noticed, the previous code is quite messy and inefficient. That is totally fine, because that’s the point of a playground. We want to get to a working version first and validate our ideas, then move into the optimization phase.

The beauty of the interface is that we can hide all the inefficiencies behind the scenes. So we can improve the implementation without impacting the callers.

After some optimization, here is the final code:

import fs from "fs";
import { PDFDocument } from "pdf-lib";
import { fromBuffer } from "pdf2pic";
import path from "path";

const OUTPUT_IMAGE_DPI = 200;

export async function extractPDFCoverImage(
  pdfFilePath: string,
  coverOutputPath: string
) {
  try {
    // Read the existing PDF into a buffer
    const existingPdfBytes: Buffer = fs.readFileSync(pdfFilePath);

    // Load the PDF with pdf-lib
    const pdfDoc = await PDFDocument.load(existingPdfBytes);
    const pdfDimensions = await getCoverPageDimensions(pdfDoc);

    // we need a new pdf doc to get buffer which will be used by pdf2pic
    const newPdfDoc = await PDFDocument.create();

    // Copy the cover page from the existing PDF into the new PDF
    const [coverPage] = await newPdfDoc.copyPages(pdfDoc, [0]);
    newPdfDoc.addPage(coverPage);

    // Serialize the single-page PDF to bytes; it stays in memory for pdf2pic
    const newPdfBytes: Uint8Array = await newPdfDoc.save();

    const newPdfBuffer = Buffer.from(newPdfBytes);

    // now we can start converting the pdf to image
    const coverOutputPathParts = parseFilePath(coverOutputPath);
    const options = {
      density: OUTPUT_IMAGE_DPI,
      ...coverOutputPathParts,
      ...pdfDimensions,
    };
    const convert = fromBuffer(newPdfBuffer, options);
    const pageToConvertAsImage = 1;

    await convert(pageToConvertAsImage, { responseType: "image" });
  } catch (err) {
    console.error("Error:", err);
  }
}

function parseFilePath(absolutePath: string) {
  const ext = path.extname(absolutePath).substring(1); // Get extension without the dot
  const filename = path.basename(absolutePath, `.${ext}`); // Get filename without extension
  const directory = path.dirname(absolutePath); // Get directory path

  return {
    saveFilename: filename,
    savePath: directory,
    format: ext,
  };
}

async function getCoverPageDimensions(pdfDoc: PDFDocument): Promise<{
  width: number;
  height: number;
}> {
  const coverPagePdf = pdfDoc.getPages()[0];
  // Get the dimensions
  return coverPagePdf.getSize();
}
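As a side note, the path splitting done by `parseFilePath` could also be expressed with Node's `path.parse`. The sketch below is an alternative under that assumption, not the version used above; the function name `parseFilePathWithParse` and the example path are hypothetical.

```typescript
import * as path from "path";

// Alternative sketch of parseFilePath built on path.parse.
function parseFilePathWithParse(outputPath: string) {
  const { dir, name, ext } = path.parse(outputPath);
  return {
    saveFilename: name, // file name without extension
    savePath: dir, // containing directory
    format: ext.replace(/^\./, ""), // extension without the leading dot
  };
}

console.log(parseFilePathWithParse("/tmp/pdf-cover/cover.png"));
```

The returned keys match the ones spread into the pdf2pic options, so the rest of the module would not need to change.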

Gist link

Learnings

All-in-one tools are great, fast tools are great. Not just because they are more efficient, which of course they are, but more importantly because they don’t get in the way between you and your ideas.

Ideas are fragile.

Don’t get lost before you even start experimenting and cultivating your ideas.

Let’s do more tmp folder driven development, let’s do more playgrounds, let’s do more bun 🚀!