Large PDFs to summarize? No problem! Try GPT with LangChain

Anirudh @ krysins.com
4 min read · Apr 18, 2023

--

Background

In the previous article, Automated Summarisation of PDFs with GPT API in Python, the most common feedback was the question “This is good, but how would it be done for longer PDFs?” It is a very valid question and a real problem in day-to-day use. This article describes one way to address it.

Though it is a very simple approach, it has a few caveats. Before we get into the downsides, let us see how to generate the summary.

Illustration of the issue with OpenAI API

For this article, the reference blog post was ‘How I Make $27,000 Weekly in Passive Income’. Calling the OpenAI API function ‘openai.Completion.create’ directly on its full text to generate a summary gives the following error:

openai.error.InvalidRequestError: This model’s maximum context length is 4097 tokens, however you requested 9352 tokens (8152 in your prompt; 1200 for the completion). Please reduce your prompt; or completion length.

In other words, the article text of approximately 1500 unique words results in a request of roughly 9K tokens, far above the allowed limit of 4097 tokens.

A few things to remember:

  • 75 words approximately equal 100 tokens
  • The limit of 4097 tokens applies to the sum of input tokens and output tokens. Here the input alone (8152 tokens for the input text plus the prompt) already far exceeds it
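
The two rules of thumb above can be turned into a quick pre-flight check. The following is a minimal sketch and not part of the original code: the helper names `estimate_tokens` and `fits_in_context` are made up for illustration, and the 75-words-per-100-tokens ratio is only an approximation (an exact count would need a real tokenizer).

```python
# Rough pre-flight check based on the rule of thumb above.
# Both helper names are hypothetical; the ratio is an approximation.
CONTEXT_LIMIT = 4097        # limit quoted in the error message above
TOKENS_PER_WORD = 100 / 75  # roughly 75 words per 100 tokens


def estimate_tokens(text):
    """Estimate the token count of text from its whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_in_context(prompt_tokens, completion_tokens, limit=CONTEXT_LIMIT):
    """The limit applies to prompt tokens plus requested completion tokens."""
    return prompt_tokens + completion_tokens <= limit


# Numbers from the error message: 8152 prompt tokens + 1200 completion tokens
print(fits_in_context(8152, 1200))    # False
print(estimate_tokens("word " * 75))  # 100
```

If the check fails, the text has to be shortened or split before it is sent, which is the problem the rest of this article works around.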

High Level Approach

Only three simple high-level steps:

  1. Fetch a sample document from the internet, or create one by saving a Word document as PDF.
  2. Use Python's PyPDF2 library to extract the text.
  3. Instantiate the LangChain library's ‘AnalyzeDocumentChain’ class with a summarize chain of chain_type = ‘map_reduce’ and run it on the extracted text to get the summary.

Step-by-Step

1.0 Downloading a sample PDF

As mentioned above, the reference blog post was ‘How I Make $27,000 Weekly in Passive Income’. The article was converted to PDF by simply copying and pasting its content into Google Docs and exporting the document as PDF. The article is easy to read but relatively long.

Refer to the section “Illustration of the issue with OpenAI API” above for the problem with using the OpenAI API directly.

2.0 Extract the text using PyPDF2 library

2.1 Install PyPDF2

The “free and open source pure python” PDF library PyPDF2 can be installed by simply calling pip install.

$pip install PyPDF2

2.2 Write function to extract the text from PDF

from PyPDF2 import PdfReader

# Read the PDF from start_page to final_page (1-based, inclusive).
# If the document has fewer pages, reading stops at its last page.
def get_pdf_text(document_path, start_page=1, final_page=999):
    reader = PdfReader(document_path)
    number_of_pages = len(reader.pages)

    # Accumulate the text of all requested pages in one string
    page = ''
    for page_num in range(start_page - 1, min(number_of_pages, final_page)):
        # Fall back to '' in case a page yields no extractable text
        page += reader.pages[page_num].extract_text() or ''
    return page
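
To make the page arithmetic in the loop concrete, the sketch below reproduces just the range computation without touching a PDF. The helper `page_indices` is hypothetical and exists only for illustration. PyPDF2 indexes `reader.pages` from 0, which is why the 1-based `start_page` is shifted by one.

```python
def page_indices(number_of_pages, start_page=1, final_page=999):
    """Return the 0-based page indices read for a 1-based, inclusive page range."""
    return list(range(start_page - 1, min(number_of_pages, final_page)))


print(page_indices(5))                               # [0, 1, 2, 3, 4]
print(page_indices(5, start_page=2, final_page=3))   # [1, 2]
print(page_indices(5, start_page=4, final_page=10))  # [3, 4]
```

With the default arguments the whole document is read, as long as it has at most 999 pages.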

3.0 Summarize using LangChain API

3.1 Install langchain

The “free and open source” LangChain library can be installed by simply calling pip install. The OpenAI wrapper used below additionally needs the openai package.

$pip install langchain

3.2 Invoke langchain functions

from langchain.chains import AnalyzeDocumentChain
from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI

# The OpenAI wrapper reads the API key from the OPENAI_API_KEY environment variable.
# pages is the text extracted in step 2.2; 'sample.pdf' is a placeholder path.
pages = get_pdf_text('sample.pdf')

model = OpenAI(temperature=0)
summary_chain = load_summarize_chain(llm=model, chain_type='map_reduce')
summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)
print('output(AnalyzeDocumentChain):', summarize_document_chain.run(pages))

And the coding part is done….

A quick introduction to a couple of lines from the LangChain code:

  • temperature=0: The OpenAI API accepts values from 0 to 2. Lower values make the output more focused and repeatable, so 0 means “don't be creative”, while higher values make the output more random and imaginative.
  • chain_type=map_reduce: LangChain's document chains come in four types: ‘stuff’, ‘map_reduce’, ‘refine’, and ‘map_rerank’. ‘stuff’ is recommended for smaller documents, while ‘map_reduce’ and ‘refine’ work for a large corpus. ‘map_reduce’ summarizes each chunk independently and then combines the partial summaries. ‘refine’ calls are not independent: each call builds on the previous result, so the order of the information plays a role. ‘map_rerank’ scores an answer per chunk and is meant for question answering rather than summarization.
  • AnalyzeDocumentChain: an end-to-end chain that takes a single document, splits it up, and combines the results again.
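
To illustrate the idea behind ‘map_reduce’, here is a minimal, library-free sketch. It is not LangChain's implementation: `split_into_chunks`, `map_reduce_summary` and the stand-in `fake_summarize` are hypothetical names, chunking is done by word count rather than by tokens, and a real run would call the LLM wherever `fake_summarize` is used.

```python
def split_into_chunks(text, max_words=1000):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summary(text, summarize, max_words=1000):
    """Summarize each chunk independently (map), then summarize the joined summaries (reduce)."""
    chunks = split_into_chunks(text, max_words)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize('\n'.join(partial_summaries))              # reduce step


def fake_summarize(text):
    """Stand-in for an LLM call: keeps only the first three words."""
    return ' '.join(text.split()[:3])


document = 'one two three four five six seven eight nine ten'
print(split_into_chunks(document, max_words=4))
# ['one two three four', 'five six seven eight', 'nine ten']
print(map_reduce_summary(document, fake_summarize, max_words=4))
# one two three
```

Because each map call sees only its own chunk, the calls do not depend on one another. A ‘refine’ style loop would instead pass the running summary into every following call, which is why the order of the chunks matters there.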

Now, let’s check the output

Run-1

This article explores different ways to generate passive income, 
including investing in stocks, real estate, businesses, and
creating digital products and services.

Run-2

This article discusses various methods of generating passive income, 
such as investing in stocks, real estate, and online businesses.
It also outlines the advantages and disadvantages of each option.

Summary

  • The output is not as exciting as what ChatGPT generates with an appropriate prompt. However, if you have no idea what the document is about, it gets you up and running.
  • The output can be followed by Q&A to extract more relevant information from the text.
  • GPT and LangChain APIs are powerful tools for summarizing long PDF documents quickly and efficiently. By following the steps outlined in this article, you can streamline your summarization process, saving you time and effort.

Limitations

  • You have to create a paid account at OpenAI to get access to the API. You get $5 of credit, but as you can imagine, that amount gets used up very quickly.
  • There are data privacy concerns when using OpenAI (though they state that data from paid users is retained for 30 days and not used for training).

Outro

Watch this space: in subsequent articles we will build upon this one to

  • Answer questions one may have based on long text (like this PDF)
  • Make the summary better (like ChatGPT)

The code link will be available at the original site where this article was published.
