In this post, we will extract text from a PDF file using the PyPDF2 library.
PyPDF2 is a free, open-source, pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.
There are several ways to install PyPDF2. The most common option is to use pip.
pip install PyPDF2
To install the latest development version directly from GitHub instead:
pip install git+https://github.com/py-pdf/PyPDF2.git
Make sure PyPDF2 is installed on your system before running the cells that follow.
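Whether the install step is needed can be checked from Python itself. The following is a minimal sketch using only the standard library (Python 3.8 or later); `installed_version` is a helper name made up for this example and is not part of PyPDF2.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed; run: pip install PyPDF2")
else:
    print(f"PyPDF2 {version} is already installed")
```

The same check works for any distribution name known to pip, so it can be reused for other dependencies.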
In [17]:
import PyPDF2
In [18]:
PyPDF2.__version__
Out[18]:
In [19]:
inputFile = "input/extracting-text-from-pdf-file-in-python/sample.pdf"
In [20]:
pdf = open(inputFile, "rb")
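Before handing a file to PyPDF2, it can be worth confirming that the path exists and that the file looks like a PDF. The sketch below assumes the file starts with the usual `%PDF-` header, which holds for most PDFs but not necessarily all of them; `looks_like_pdf` is a hypothetical helper, not a PyPDF2 function.

```python
from pathlib import Path


def looks_like_pdf(path) -> bool:
    """Return True if `path` is a file whose first bytes are the PDF header.

    Hypothetical helper: assumes the header sits at the very start of the file.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


# Same path as used in this post; prints a hint instead of failing later.
inputFile = "input/extracting-text-from-pdf-file-in-python/sample.pdf"
if not looks_like_pdf(inputFile):
    print(f"{inputFile} is missing or does not look like a PDF")
```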
In [21]:
dir(PyPDF2)
Out[21]:
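The output lists the classes and functions that the PyPDF2 package exposes. The one we need here is PdfFileReader; calling help(PyPDF2.PdfFileReader) prints its documentation: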
class PdfFileReader(builtins.object)
| PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
|
| Initializes a PdfFileReader object. This operation can take some time, as
| the PDF stream's cross-reference tables are read into memory.
|
| :param stream: A File object or an object that supports the standard read
| and seek methods similar to a File object. Could also be a
| string representing a path to a PDF file.
| :param bool strict: Determines whether user should be warned of all
| problems and also causes some correctable problems to be fatal.
| Defaults to ``True``.
| :param warndest: Destination for logging warnings (defaults to
| ``sys.stderr``).
| :param bool overwriteWarnings: Determines whether to override Python's
| ``warnings.py`` module with a custom implementation (defaults to
| ``True``).
|
| Methods defined here:
|
| __init__(self, stream, strict=True, warndest=None, overwriteWarnings=True)
| Initialize self. See help(type(self)) for accurate signature.
|
| cacheGetIndirectObject(self, generation, idnum)
|
| cacheIndirectObject(self, generation, idnum, obj)
|
| decrypt(self, password)
| When using an encrypted / secured PDF file with the PDF Standard
| encryption handler, this function will allow the file to be decrypted.
| It checks the given password against the document's user password and
| owner password, and then stores the resulting decryption key if either
| password is correct.
|
| It does not matter which password was matched. Both passwords provide
| the correct decryption key that will allow the document to be used with
| this library.
|
| :param str password: The password to match.
| :return: ``0`` if the password failed, ``1`` if the password matched the user
| password, and ``2`` if the password matched the owner password.
| :rtype: int
| :raises NotImplementedError: if document uses an unsupported encryption
| method.
|
| getDestinationPageNumber(self, destination)
| Retrieve page number of a given Destination object
|
| :param Destination destination: The destination to get page number.
| Should be an instance of
| :class:`Destination<PyPDF2.pdf.Destination>`
| :return: the page number or -1 if page not found
| :rtype: int
|
| getDocumentInfo(self)
| Retrieves the PDF file's document information dictionary, if it exists.
| Note that some PDF files use metadata streams instead of docinfo
| dictionaries, and these metadata streams will not be accessed by this
| function.
|
| :return: the document information of this PDF file
| :rtype: :class:`DocumentInformation<pdf.DocumentInformation>` or ``None`` if none exists.
|
| getFields(self, tree=None, retval=None, fileobj=None)
| Extracts field data if this PDF contains interactive form fields.
| The *tree* and *retval* parameters are for recursive use.
|
| :param fileobj: A file object (usually a text file) to write
| a report to on all interactive form fields found.
| :return: A dictionary where each key is a field name, and each
| value is a :class:`Field<PyPDF2.generic.Field>` object. By
| default, the mapping name is used for keys.
| :rtype: dict, or ``None`` if form data could not be located.
|
| getFormTextFields(self)
| Retrieves form fields from the document with textual data (inputs, dropdowns)
|
| getIsEncrypted(self)
|
| getNamedDestinations(self, tree=None, retval=None)
| Retrieves the named destinations present in the document.
|
| :return: a dictionary which maps names to
| :class:`Destinations<PyPDF2.generic.Destination>`.
| :rtype: dict
|
| getNumPages(self)
| Calculates the number of pages in this PDF file.
|
| :return: number of pages
| :rtype: int
| :raises PdfReadError: if file is encrypted and restrictions prevent
| this action.
|
| getObject(self, indirectReference)
|
| getOutlines(self, node=None, outlines=None)
| Retrieves the document outline present in the document.
|
| :return: a nested list of :class:`Destinations<PyPDF2.generic.Destination>`.
|
| getPage(self, pageNumber)
| Retrieves a page by number from this PDF file.
|
| :param int pageNumber: The page number to retrieve
| (pages begin at zero)
| :return: a :class:`PageObject<pdf.PageObject>` instance.
| :rtype: :class:`PageObject<pdf.PageObject>`
|
| getPageLayout(self)
| Get the page layout.
| See :meth:`setPageLayout()<PdfFileWriter.setPageLayout>`
| for a description of valid layouts.
|
| :return: Page layout currently being used.
| :rtype: ``str``, ``None`` if not specified
|
| getPageMode(self)
| Get the page mode.
| See :meth:`setPageMode()<PdfFileWriter.setPageMode>`
| for a description of valid modes.
|
| :return: Page mode currently being used.
| :rtype: ``str``, ``None`` if not specified
|
| getPageNumber(self, page)
| Retrieve page number of a given PageObject
|
| :param PageObject page: The page to get page number. Should be
| an instance of :class:`PageObject<PyPDF2.pdf.PageObject>`
| :return: the page number or -1 if page not found
| :rtype: int
|
| getXmpMetadata(self)
| Retrieves XMP (Extensible Metadata Platform) data from the PDF document
| root.
|
| :return: a :class:`XmpInformation<xmp.XmpInformation>`
| instance that can be used to access XMP metadata from the document.
| :rtype: :class:`XmpInformation<xmp.XmpInformation>` or
| ``None`` if no metadata was found on the document root.
|
| read(self, stream)
|
| readNextEndLine(self, stream)
|
| readObjectHeader(self, stream)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| documentInfo
|
| isEncrypted
|
| namedDestinations
|
| numPages
|
| outlines
|
| pageLayout
| Get the page layout.
| See :meth:`setPageLayout()<PdfFileWriter.setPageLayout>`
| for a description of valid layouts.
|
| :return: Page layout currently being used.
| :rtype: ``str``, ``None`` if not specified
|
| pageMode
| Get the page mode.
| See :meth:`setPageMode()<PdfFileWriter.setPageMode>`
| for a description of valid modes.
|
| :return: Page mode currently being used.
| :rtype: ``str``, ``None`` if not specified
|
| pages
|
| xmpMetadata
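The `decrypt()` entry in the help text above describes three return codes (0, 1, and 2). As a rough sketch of how those codes might be handled, the helper below takes any reader object that offers `isEncrypted` and `decrypt()` as documented in the listing; the name `unlock` is made up for this example.

```python
def unlock(reader, password: str) -> bool:
    """Return True if `reader` can be read, decrypting it first when needed.

    Hypothetical helper; relies only on `isEncrypted` and `decrypt()` as
    described in the PdfFileReader help text.
    """
    if not reader.isEncrypted:
        return True  # nothing to decrypt
    result = reader.decrypt(password)
    # Per the help text: 0 = password failed,
    # 1 = matched the user password, 2 = matched the owner password.
    return result != 0
```

With the reader created later in this post, a call would look like `unlock(pdf_reader, "secret")`, where `"secret"` is a placeholder password.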
In [22]:
pdf_reader = PyPDF2.PdfFileReader(pdf)
In [23]:
pdf_reader
Out[23]:
In [24]:
dir(pdf_reader)
Out[24]:
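`dir()` returns every attribute, including the double-underscore names that are rarely of interest. A small sketch for narrowing the listing to public names follows; `public_names` is a hypothetical helper and works on any Python object, not just a PyPDF2 reader.

```python
def public_names(obj):
    """Return the attribute names of `obj` that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Demonstrated on the built-in str type; the same call works on pdf_reader.
print(public_names(str)[:5])
```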
In [25]:
print(pdf_reader.getNumPages())
In [26]:
totalPages = pdf_reader.numPages
print(totalPages)
In [27]:
metadata = pdf_reader.getDocumentInfo()
metadata
Out[27]:
In [28]:
metadata['/Author']
Out[28]:
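Indexing with `metadata['/Author']` raises a `KeyError` when the PDF has no author entry, and the help text notes that `getDocumentInfo()` can return `None`. One possible way to guard against both cases is sketched below; `metadata_to_dict` is a made-up helper that only assumes the metadata object behaves like a dictionary.

```python
def metadata_to_dict(info):
    """Copy a document-info mapping into a plain dict without the leading '/'.

    Hypothetical helper; `info` may be None when the PDF has no info dictionary.
    """
    if info is None:
        return {}
    return {str(key).lstrip("/"): value for key, value in info.items()}


# Stand-in data for illustration; in the post this would be `metadata`.
example = metadata_to_dict({"/Title": "Sample document"})
print(example.get("Author", "unknown"))  # falls back when the key is absent
```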
In [29]:
page = pdf_reader.getPage(0)
In [30]:
print(page.extractText())
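The camelCase names used in this post (`PdfFileReader`, `getPage`, `extractText`) belong to the older PyPDF2 API. Newer releases renamed them to `PdfReader`, `reader.pages[i]`, and `page.extract_text()`, and PyPDF2 3.x raises an error when the old names are called. The sketch below shows one way to write code that tolerates either naming; `page_count` and `page_text` are helper names invented for this example.

```python
def page_count(reader) -> int:
    """Return the number of pages via the `pages` sequence.

    `pages` appears in the PdfFileReader help listing and also exists on the
    newer PdfReader class, so this avoids the renamed getNumPages().
    """
    return len(reader.pages)


def page_text(page) -> str:
    """Extract text from a page object using whichever method name it offers."""
    # Prefer the newer snake_case method; fall back to the legacy camelCase one.
    extract = getattr(page, "extract_text", None) or page.extractText
    return extract()
```

With the reader from this post, `page_text(pdf_reader.pages[0])` would stand in for `pdf_reader.getPage(0).extractText()`.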
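This PDF has only 2 pages. For a longer document, we can loop over the pages and extract the text from each one; the loop below prints at most the first six pages (indices 0 to 5).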
In [31]:
for i in range(totalPages):
    page = pdf_reader.getPage(i)
    # Print only the first six pages (indices 0 to 5)
    if i <= 5:
        print(page.extractText())
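The page loop above can also be wrapped in a function that returns the text instead of printing it, which makes the result easier to save or process. This is only a sketch built on the `getNumPages()`, `getPage()`, and `extractText()` calls already used in this post; `extract_pages` and its `max_pages` parameter are names made up for the example.

```python
def extract_pages(reader, max_pages=None):
    """Return a list with the extracted text of each page.

    Hypothetical helper: `max_pages` limits how many pages are read;
    None means all pages.
    """
    total = reader.getNumPages()
    limit = total if max_pages is None else min(total, max_pages)
    return [reader.getPage(i).extractText() for i in range(limit)]
```

For instance, `"\n".join(extract_pages(pdf_reader, max_pages=6))` would give the same text as the loop, joined into one string.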
In [32]:
pdf.close()
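Calling `pdf.close()` by hand is easy to forget, and it is skipped if an exception occurs earlier. A `with` block closes the file automatically. The sketch below assumes the text is extracted while the file is still open, since the reader pulls data from the stream on demand; `read_first_page` and its `reader_factory` parameter are hypothetical names, and `PyPDF2.PdfFileReader` would be passed as the factory to match this post.

```python
def read_first_page(path, reader_factory):
    """Open `path`, build a reader with `reader_factory`, return page 0's text.

    Hypothetical helper: `reader_factory` is any callable that accepts a binary
    file object and returns a reader with getPage(), e.g. PyPDF2.PdfFileReader.
    """
    with open(path, "rb") as fh:
        reader = reader_factory(fh)
        # Extract inside the with block, while the file is still open.
        return reader.getPage(0).extractText()
```

A call matching the rest of the post would be `read_first_page(inputFile, PyPDF2.PdfFileReader)`.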