How to Split a Multi-Document PDF Using JavaScript and Google Cloud Document AI

Software Development

Oct 11, 2024

6-7 min

Introduction

In this tutorial, I will walk you through splitting a PDF that contains multiple documents using JavaScript, Google Cloud’s Document AI, and the pdf-lib library. This is useful when a single PDF bundles several documents, each carrying its own page labels (e.g., “Page 1 of 2” for the first document, “Page 1 of 3” for the second, and so on). Document AI extracts the page-number data, and we then split the PDF accordingly.

Step 1: Understanding the Problem

Consider a PDF with multiple documents, each identified by page numbers:

The first document has 2 pages, labeled “Page 1 of 2”, “Page 2 of 2”.

The second document has 3 pages, labeled “Page 1 of 3”, “Page 2 of 3”, “Page 3 of 3”.

We’ll use OCR (Optical Character Recognition) to extract these page numbers and split the PDF into separate files for each document.

Step 2: Setting Up Google Cloud Document AI

To OCR the page numbers, we will use Google Cloud Document AI’s Custom Extractor.

1. Create a Google Cloud account if you don’t have one.

2. Set up Document AI by searching for it in the GCP Console.

3. Create a custom processor, selecting Custom Extractor as the processor type.

4. Upload Training Documents: Upload sample PDFs to train the processor.

5. Create Labels: Annotate the page number and total page count fields, creating two labels: page_no and page_total. For best accuracy, label at least 100 pages across 20 documents.

6. Train and Deploy the model.

Step 3: Extracting Page Numbers from the PDF Using Document AI

Once the processor is trained and deployed, you can extract labeled data like page numbers and total pages from the PDF. Here’s how we do it in JavaScript:

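javascript

const { DocumentProcessorServiceClient } = require('@google-cloud/documentai').v1;

const client = new DocumentProcessorServiceClient();

// Full resource name of the trained and deployed processor.
const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

// Fetch the PDF and base64-encode it for the request (run inside an async function).
// getTheArrayBufferFromPdfUrl is a project helper that is not shown here.
const buffer = await getTheArrayBufferFromPdfUrl(s3Url);
const encodedImage = Buffer.from(buffer).toString('base64');

const request = {
  name,
  rawDocument: {
    content: encodedImage,
    mimeType: 'application/pdf',
  },
};

// Run the processor; the labeled fields come back as entities.
const [result] = await client.processDocument(request);
const { document } = result;
const { entities } = document;
const pages = formatData(entities);
const pagesToSplit = getPdfPagesToSplit(pages);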

Here, formatData organizes the extracted entities into a structured array containing each page’s number and total page count. getPdfPagesToSplit, covered in the next step, turns that array into page ranges.
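
The formatData helper called above isn’t shown. As a rough sketch of what such a helper might look like, the version below groups entities by page. It assumes each entity carries its label name in type, the recognized text in mentionText, and a zero-based page index (a number or numeric string, possibly omitted for the first page) in pageAnchor.pageRefs[0].page. The function body and the output shape are assumptions for illustration, not the original implementation:

```javascript
// Hypothetical sketch of formatData: one object per page, keyed by label name.
// Assumes entity.type is 'page_no' or 'page_total', entity.mentionText holds
// the recognized text, and pageAnchor.pageRefs[0].page is the zero-based page
// index (treated as 0 when it is missing).
function formatData(entities) {
  const byPage = new Map();

  for (const entity of entities ?? []) {
    const ref = entity.pageAnchor?.pageRefs?.[0];
    const number = Number(ref?.page ?? 0);

    if (!byPage.has(number)) {
      byPage.set(number, { number });
    }
    byPage.get(number)[entity.type] = entity.mentionText;
  }

  // Return the pages in document order.
  return [...byPage.values()].sort((a, b) => a.number - b.number);
}
```

Note that in this sketch a page with no recognized labels has no entry at all, so it is worth checking that the array length matches the PDF’s page count before relying on array positions.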

Step 4: Identifying Document Boundaries

We then determine the starting and ending pages for each document inside the PDF:

javascript

// Turn the per-page data into one { number, start, end } range per document.
// start and end are 1-based, inclusive page positions within the source PDF.
const getPdfPagesToSplit = (pages) => {
  const pdfPages = [];
  let count = 0; // 1-based position of the current page
  let skipCount = 0; // remaining pages of the document already recorded

  for (const page of pages) {
    count++;
    if (skipCount) {
      skipCount--;
      continue;
    }

    // The first page of each document determines its length via page_total.
    const total = Number(page.page_total);
    if (total === 1) {
      pdfPages.push({ number: +page.number + 1, start: count, end: count });
    } else if (total > 1) {
      skipCount = total - 1;
      pdfPages.push({ number: +page.number + 1, start: count, end: count + total - 1 });
    }
  }

  return pdfPages;
};
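
To make the boundary logic concrete, here is a small worked example on the five-page PDF from Step 1. It uses a compact variant of the same idea (jumping ahead by page_total instead of counting skipped pages) under a hypothetical name, documentRanges, and leaves out the number field. The sample data shape is an assumption:

```javascript
// Compact variant of the boundary logic, assuming whole-number page_total
// values. Positions are 1-based and inclusive.
function documentRanges(pages) {
  const ranges = [];
  let index = 0;

  while (index < pages.length) {
    const total = Number(pages[index].page_total);
    if (!(total >= 1)) {
      index += 1; // no usable page_total on this page: move on
      continue;
    }
    ranges.push({ start: index + 1, end: index + total });
    index += total; // jump to the first page of the next document
  }

  return ranges;
}

// The PDF from Step 1: a 2-page document followed by a 3-page document.
const samplePages = [
  { number: 0, page_no: '1', page_total: '2' },
  { number: 1, page_no: '2', page_total: '2' },
  { number: 2, page_no: '1', page_total: '3' },
  { number: 3, page_no: '2', page_total: '3' },
  { number: 4, page_no: '3', page_total: '3' },
];

// Yields two ranges: pages 1-2 and pages 3-5.
console.log(documentRanges(samplePages));
```

Only the first page of each document is consulted, so a misread page_total on that page shifts every boundary after it.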

Step 5: Splitting the PDF Using pdf-lib

Once we have the start and end pages, we can split the PDF using pdf-lib:

javascript

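const { PDFDocument } = require('pdf-lib');

// Copy pages start..end (1-based, inclusive) of the source PDF into a new PDF.
const extractPdfPage = async (arrayBuff, pageToSplit) => {
  const pdfSrcDoc = await PDFDocument.load(arrayBuff);
  const pdfNewDoc = await PDFDocument.create();

  // copyPages expects zero-based page indices, so shift each position by one.
  const indices = [];
  for (let i = pageToSplit.start; i <= pageToSplit.end; i++) indices.push(i - 1);

  const pages = await pdfNewDoc.copyPages(pdfSrcDoc, indices);
  pages.forEach((page) => pdfNewDoc.addPage(page));

  // save() resolves to a Uint8Array containing the new PDF's bytes.
  return pdfNewDoc.save();
};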

Here, pdf-lib copies and saves the pages of each document as a new PDF.
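
One detail worth spelling out: pdf-lib’s copyPages takes zero-based page indices, while the start and end values from Step 4 are one-based and inclusive. The sketch below shows that conversion with a hypothetical helper, toPageIndices, which also clamps the range to the source PDF’s page count so that a misread page_total cannot request pages that don’t exist. The helper name and the clamping behavior are assumptions, not part of the original code:

```javascript
// Hypothetical helper: convert a 1-based inclusive range into zero-based
// indices, clamped to the number of pages actually in the source PDF.
function toPageIndices(start, end, pageCount) {
  const first = Math.max(1, start);
  const last = Math.min(end, pageCount);
  const indices = [];

  for (let position = first; position <= last; position++) {
    indices.push(position - 1);
  }

  return indices;
}
```

With pdf-lib, the page count can be read from the loaded source document via getPageCount() and passed in as pageCount.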

Step 6: Upload or Download the Split PDFs

Now, we can take the split PDFs from SplittedPdfs and either upload them to a cloud service or download them to the user’s machine:

javascript

const SplittedPdfs = [];
for (const pageToSplit of pagesToSplit) {
  // buffer is the source PDF's ArrayBuffer fetched in Step 3.
  const splittedPdf = await extractPdfPage(buffer, pageToSplit);
  SplittedPdfs.push(splittedPdf);
}
// SplittedPdfs now holds the bytes of one PDF per document; use it as needed.
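
As one possible way to persist the results in Node.js, the sketch below writes each split PDF to a local directory. It assumes each entry in the array is the byte array returned by pdf-lib’s save(); the helper name saveSplitPdfs and the document-N.pdf naming scheme are hypothetical:

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: write each split PDF (a Uint8Array) to outputDir as
// document-1.pdf, document-2.pdf, ... and return the paths written.
function saveSplitPdfs(splitPdfs, outputDir) {
  fs.mkdirSync(outputDir, { recursive: true });

  return splitPdfs.map((pdfBytes, index) => {
    const filePath = path.join(outputDir, `document-${index + 1}.pdf`);
    fs.writeFileSync(filePath, pdfBytes);
    return filePath;
  });
}
```

Calling saveSplitPdfs(SplittedPdfs, './output') would then leave one file per document in ./output; uploading to a cloud bucket would replace the writeFileSync call with the storage client of your choice.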

Conclusion

This tutorial demonstrates how to split a multi-document PDF using JavaScript, Document AI, and pdf-lib. We covered setting up Document AI, extracting page numbers, and splitting the PDF based on those page numbers. With these steps, you can easily implement this feature in your own applications.
