Omax Tech

How to Split a Multi-Document PDF Using JavaScript and Google Cloud Document AI

Software Development
Oct 11, 2024
6-7 min

Introduction

In this tutorial, I will guide you through splitting a PDF that contains multiple documents using JavaScript, Google Cloud’s Document AI, and the pdf-lib library. This is useful when a single PDF holds several documents, each with its own page numbering (e.g., “Page 1 of 2” for the first document, “Page 1 of 3” for the second). Document AI extracts the page number data, and we then split the PDF accordingly.

Step 1: Understanding the Problem

Consider a PDF with multiple documents, each identified by page numbers:

  • The first document has 2 pages, labeled “Page 1 of 2”, “Page 2 of 2”.

  • The second document has 3 pages, labeled “Page 1 of 3”, “Page 2 of 3”, “Page 3 of 3”.

We’ll use OCR (Optical Character Recognition) to extract these page numbers and split the PDF into a separate file for each document.
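To make the target data concrete, the labels above can be parsed with a small regular expression. This is only an illustration of the values we are after, assuming the text contains the literal pattern “Page X of Y”; the helper name parsePageLabel is hypothetical, and the rest of this tutorial relies on a trained Document AI processor rather than a regex.

```javascript
// Hypothetical helper: parse a label such as "Page 2 of 3" into numbers.
// Returns null when the text does not match the expected pattern.
function parsePageLabel(text) {
  const match = /page\s+(\d+)\s+of\s+(\d+)/i.exec(text);
  if (!match) return null;
  return { page_no: Number(match[1]), page_total: Number(match[2]) };
}

// Labels from the example PDF: a 2-page document followed by a 3-page one.
const labels = ['Page 1 of 2', 'Page 2 of 2', 'Page 1 of 3', 'Page 2 of 3', 'Page 3 of 3'];
const parsed = labels.map(parsePageLabel);

// A new document starts wherever page_no is 1 (positions are 1-based).
const starts = parsed
  .map((p, index) => (p && p.page_no === 1 ? index + 1 : null))
  .filter((n) => n !== null);

console.log(starts); // [ 1, 3 ]
```

The two start positions, 1 and 3, are the document boundaries the later steps need to recover.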

Step 2: Setting Up Google Cloud Document AI

To OCR the page numbers, we will use Google Cloud Document AI’s Custom Extractor.

1. Create a Google Cloud Account if you don’t have one.

2. Set up Document AI by searching for it in the GCP Console.

3. Create a Custom Processor by selecting the Custom Extractor model.

4. Select Custom Extractor as the processor type.

5. Upload Training Documents: Upload sample PDFs to train the processor.

6. Create Labels: Annotate the page numbers and total page count fields, creating two labels: page_no and page_total. For optimal accuracy, label at least 100 pages across 20 documents.

7. Train and Deploy the model.

Step 3: Extracting Page Numbers from the PDF Using Document AI

Once the processor is trained and deployed, you can extract labeled data like page numbers and total pages from the PDF. Here’s how we do it in JavaScript:

typescript
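import { DocumentProcessorServiceClient } from '@google-cloud/documentai';

// The client uses Application Default Credentials. projectId, location,
// processorId, and s3Url are assumed to be defined elsewhere.
const client = new DocumentProcessorServiceClient();

// Full resource name of the trained processor.
const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

// getTheArrayBufferFromPdfUrl is a helper (not shown) that downloads the PDF as an ArrayBuffer.
const buffer = await getTheArrayBufferFromPdfUrl(s3Url);
// Document AI expects the file content as a base64 string.
const encodedImage = Buffer.from(buffer).toString('base64');

const request = {
  name,
  rawDocument: {
    content: encodedImage,
    mimeType: 'application/pdf',
  },
};

const [result] = await client.processDocument(request);
const { document } = result;
const { entities } = document;
// formatData structures the entities per page; getPdfPagesToSplit is defined in Step 4.
const pages = formatData(entities);
const pagesToSplit = getPdfPagesToSplit(pages);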

The formatData helper (not shown here) organizes the extracted entities into a structured array containing each page’s number and total page count.
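Since formatData is not shown, here is a minimal sketch of what it might look like. It assumes each entity exposes type ('page_no' or 'page_total'), mentionText, and a zero-based page index at pageAnchor.pageRefs[0].page, following the Document AI entity schema. The output shape ({ number, page_no, page_total }) is inferred from how Step 4 reads page.number and page.page_total, so treat it as an assumption rather than the original implementation.

```javascript
// Hypothetical sketch of formatData: group labeled entities by the page they appear on.
function formatData(entities = []) {
  const byPage = new Map();
  for (const entity of entities) {
    // int64 fields may arrive as a number or a string; a missing value is treated as page 0.
    const pageIndex = Number(entity.pageAnchor?.pageRefs?.[0]?.page ?? 0);
    const value = parseInt(entity.mentionText, 10);
    // Ignore entities with other labels or with text that is not a number.
    if (!['page_no', 'page_total'].includes(entity.type) || Number.isNaN(value)) continue;
    if (!byPage.has(pageIndex)) byPage.set(pageIndex, { number: pageIndex });
    byPage.get(pageIndex)[entity.type] = value;
  }
  // Sort by page index so the result follows the PDF's page order.
  return [...byPage.values()].sort((a, b) => a.number - b.number);
}

// Sample entities for a 2-page document (made-up data for illustration).
const sample = [
  { type: 'page_total', mentionText: '2', pageAnchor: { pageRefs: [{ page: '1' }] } },
  { type: 'page_no', mentionText: '2', pageAnchor: { pageRefs: [{ page: '1' }] } },
  { type: 'page_no', mentionText: '1', pageAnchor: { pageRefs: [{ page: 0 }] } },
  { type: 'page_total', mentionText: '2', pageAnchor: { pageRefs: [{ page: 0 }] } },
];
const formatted = formatData(sample);
console.log(formatted);
```

Note that in this sketch a page with no recognized label is simply absent from the result, which would shift any position counter based on array order, so a production version may want to emit a placeholder entry for such pages.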

Step 4: Identifying Document Boundaries

We then determine the starting and ending pages for each document inside the PDF:

javascript
// Returns one { number, start, end } entry per document.
// start and end are 1-based page positions, both inclusive.
const getPdfPagesToSplit = (pages) => {
  const pdfPages = [];
  let count = 0; // 1-based position of the current page in the PDF
  let skipCount = 0; // pages still belonging to the current document

  for (const page of pages) {
    count++;
    if (skipCount) {
      skipCount--;
      continue;
    }

    // Extracted values may be strings, so convert before comparing or adding.
    const total = Number(page.page_total);
    if (total === 1) {
      pdfPages.push({ number: +page.number + 1, start: count, end: count });
    } else if (total > 1) {
      skipCount = total - 1;
      pdfPages.push({ number: +page.number + 1, start: count, end: count + total - 1 });
    }
  }

  return pdfPages;
};
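Because the boundaries are derived from OCR output, it can help to sanity-check them before splitting. The sketch below assumes ranges in the { start, end } shape produced above (1-based, inclusive) and a known total page count, which could come from pdf-lib’s getPageCount(). The name validateRanges is hypothetical and not part of the original code.

```javascript
// Hypothetical check: ranges must be ordered, contiguous, and cover every page.
// Returns a list of human-readable problems; an empty list means the ranges look consistent.
function validateRanges(ranges, pageCount) {
  const problems = [];
  let expectedStart = 1;
  for (const { start, end } of ranges) {
    if (start !== expectedStart) {
      problems.push(`expected a document to start at page ${expectedStart}, got ${start}`);
    }
    if (end < start) {
      problems.push(`range ${start}-${end} is empty`);
    }
    expectedStart = end + 1;
  }
  if (expectedStart !== pageCount + 1) {
    problems.push(`ranges end at page ${expectedStart - 1}, but the PDF has ${pageCount} pages`);
  }
  return problems;
}

// The example PDF from Step 1: a 2-page document followed by a 3-page one.
const ok = validateRanges([{ start: 1, end: 2 }, { start: 3, end: 5 }], 5);
// A misread page_total (4 instead of 3) pushes the last range past the end of the file.
const bad = validateRanges([{ start: 1, end: 2 }, { start: 3, end: 6 }], 5);
console.log(ok);  // []
console.log(bad); // [ 'ranges end at page 6, but the PDF has 5 pages' ]
```

If the list of problems is not empty, it is probably safer to flag the file for manual review than to split it on possibly misread page numbers.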

Step 5: Splitting the PDF Using pdf-lib

Once we have the start and end pages, we can split the PDF using pdf-lib:

javascript
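import { PDFDocument } from 'pdf-lib';

// Copies pages start..end of the source PDF into a new PDF and returns its bytes.
const extractPdfPage = async (arrayBuff, pageToSplit) => {
  const pdfSrcDoc = await PDFDocument.load(arrayBuff);
  const pdfNewDoc = await PDFDocument.create();

  // copyPages expects 0-based indices, while start and end are 1-based and inclusive.
  const { start, end } = pageToSplit;
  const indices = Array.from({ length: end - start + 1 }, (_, i) => start - 1 + i);
  const pages = await pdfNewDoc.copyPages(pdfSrcDoc, indices);
  pages.forEach((page) => pdfNewDoc.addPage(page));

  // save() resolves to a Uint8Array containing the new PDF.
  return pdfNewDoc.save();
};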

Here, pdf-lib copies and saves the pages of each document as a new PDF.

Step 6: Upload or Download the Split PDFs

Now we can collect the split PDFs in SplittedPdfs and either upload them to a cloud service or download them to the user’s machine:

javascript
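const SplittedPdfs = [];
for (const pageToSplit of pagesToSplit) {
  // imageFile holds the bytes of the source PDF (assumed to be the same data as buffer in Step 3).
  const splittedPdf = await extractPdfPage(imageFile, pageToSplit);
  SplittedPdfs.push(splittedPdf);
}
// SplittedPdfs now holds one Uint8Array per document; upload or save each one as needed.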
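Whether you upload or download the results, each split PDF needs a name. The sketch below pairs every output with a descriptive file name, assuming the ranges and the split PDFs are in the same order. The helper name nameSplitPdfs and the naming scheme are hypothetical, and the byte arrays are placeholders standing in for real pdf-lib output.

```javascript
// Hypothetical helper: pair each split PDF with a file name built from its page range.
function nameSplitPdfs(baseName, ranges, pdfs) {
  return ranges.map((range, i) => ({
    fileName: `${baseName}-doc${i + 1}-pages${range.start}-${range.end}.pdf`,
    bytes: pdfs[i],
  }));
}

// Placeholder bytes; in the tutorial these would be the Uint8Arrays in SplittedPdfs.
const exampleRanges = [{ start: 1, end: 2 }, { start: 3, end: 5 }];
const examplePdfs = [new Uint8Array([1, 2]), new Uint8Array([3, 4, 5])];

const outputs = nameSplitPdfs('scan', exampleRanges, examplePdfs);
console.log(outputs.map((o) => o.fileName));
// [ 'scan-doc1-pages1-2.pdf', 'scan-doc2-pages3-5.pdf' ]
```

In Node.js, each entry could then be written with fs.promises.writeFile(fileName, bytes), or the bytes could be passed as the body of an upload request to your storage service.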

Conclusion

This tutorial demonstrates how to split a multi-document PDF using JavaScript, Document AI, and pdf-lib. We covered setting up Document AI, extracting page numbers, and splitting the PDF based on those page numbers. With these steps, you can easily implement this feature in your own applications.
