Copying Text from Encrypted PDF Files

PDF encryption is sort of silly.  If you really want to grab the text, you can screen-capture each page and then OCR it.  So clearly the true intention of the encryption is to deter the 99% of users who wouldn’t go to such lengths to copy text.

Recently, I received a PDF of a text document that I needed as plain text (it contained the database DDL commands of a system I was analyzing).  For some reason the author had encryption turned on so that I couldn’t copy and paste the text, and I was a bit impatient and didn’t want to wait for the author to send me the original text file.  However, this PDF still allowed printing.

Since I had Acrobat, I tried to print the file to a PDF to remove the encryption.  Nope.  Acrobat is smart enough to keep you from doing that.  But then I installed the Windows “Generic Text” printer driver and set it to print to a file.  By printing my “encrypted” PDF to the Generic Text printer, the text of the whole document was nicely saved for me.  After removing the page breaks and the margin, I had my original database text document.
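
The cleanup step at the end can be scripted.  The following is a minimal Python sketch, not the method actually used above.  It assumes the driver marks page breaks with form-feed characters and pads every line with the same number of leading spaces; the function name clean_printed_text is hypothetical.

```python
import re


def clean_printed_text(raw: str) -> str:
    """Drop form-feed page breaks and the common left margin.

    Assumptions (not verified against the Generic Text driver):
    pages are separated by form feeds, and the margin is made of spaces.
    """
    # Turn each form feed (and the newline just before it, if any)
    # into a single line break so pages run together.
    text = re.sub(r"\n?\f", "\n", raw)
    lines = text.splitlines()

    # The margin is the smallest indentation found on any non-blank line.
    indents = [len(line) - len(line.lstrip(" ")) for line in lines if line.strip()]
    margin = min(indents, default=0)

    # Cut the margin off every line; blank lines stay blank.
    cleaned = [line[margin:].rstrip() if line.strip() else "" for line in lines]
    return "\n".join(cleaned).strip("\n") + "\n"
```

Reading the printed file, passing its contents through this function, and writing the result back out would be enough for a quick one-off.  Indentation inside the document, such as indented column definitions in DDL, survives because only the shared margin is removed.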

The punchline is: If you don’t want users to easily copy text from your encrypted PDF files, you need to turn off not only the text copy capability but also the print capability.  Why?  Because the copy restriction can easily be foiled using the Windows Generic Text printer driver.
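
To make the two capabilities concrete: as far as I understand the PDF specification, the standard security handler stores them as separate bits of one permission integer (the /P entry of the encryption dictionary), with bit 3 covering printing and bit 5 covering copying or extracting text.  The Python sketch below is only an illustration under that assumption; the helper names are hypothetical and it does not parse a PDF file.

```python
# Bit numbers as described for the standard security handler's
# permission flags (bit 1 is the least significant bit). Assumed, not
# read from any particular file.
PRINT_BIT = 3  # printing the document
COPY_BIT = 5   # copying or extracting text and graphics


def is_allowed(p_value: int, bit: int) -> bool:
    """Return True if the given permission bit is set in a /P value."""
    # /P is usually written as a signed 32-bit integer, so mask it first.
    return bool((p_value & 0xFFFFFFFF) >> (bit - 1) & 1)


def text_is_exposed(p_value: int) -> bool:
    """Text can leak if copying is allowed OR printing is allowed,
    since printing to a text-only driver recovers the text."""
    return is_allowed(p_value, COPY_BIT) or is_allowed(p_value, PRINT_BIT)
```

With these helpers, a permission value that clears only the copy bit still counts as exposed, which is exactly the situation described above; only clearing both bits closes the easy routes.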

5 responses to “Copying Text from Encrypted PDF Files”

  1. I’ve tried this “Text printer driver” trick in the past with quite a few different PDF files that didn’t allow you to copy and paste text from their pages. The “didn’t allow you” part is not always caused by the author having *forbidden* it; it can also happen because the file contains an embedded font that uses a custom “encoding vector”.
    Note, ‘encoding’ in the context of PostScript or PDF fonts has a different meaning from encrypting. Encoding vectors for fonts basically are lists saying “glyph for ‘a’ is on position 1, glyph for ‘adieresis’ is on position 2,…”.
    How encoding vectors work is described in the public PostScript and PDF specifications. Adobe defined a few standard encoding vectors, and also how to create and use “custom” encoding vectors. Custom encoding vectors are in common use in many PDF files, and they have nothing to do with “encryption”. They are a necessary evil, because due to computing’s 8-bit legacy, non-Unicode fonts by default only have room for 256 glyphs (character shapes).
    You can check the details about your PDF’s fonts by looking at the document properties dialog of Acrobat on the “fonts” tab.
    Or use the “pdffonts.exe” command-line utility from the XPDF suite of utilities…
    However, the “Text printer driver” trick does not work in these cases.
    And it is pretty annoying that the Acrobat Reader “Save as text…” menu item doesn’t work either if fonts use a custom encoding vector. Acrobat seems to have no problem rendering the job to the screen or to the printer, but it fails utterly when trying to extract text…

  2. I tried this, and I was able to print from the PDF, but the file I got had a lot of random symbols and split-up words, and was basically unusable. I tried another driver, the “Microsoft Office Document Image Writer” driver, and it worked. I got a file that looked like a screenshot, but one that I could copy text from into Word.

  3. SOLVED:
    (worked for me on Windows 8, Acrobat XI, Office 2010)
    Option 1:
    1. Print from Acrobat using “Microsoft XPS Document Writer”. The output is “your file name.oxps”.
    2. Open “…oxps” with XPS Viewer. *(see download link in comments below)
    3. Print to PDF (Acrobat PDF, or CutePDF), using the highest resolution (600 DPI).
    4. Open with Acrobat and use OCR (Searchable Image (Exact)) option.
    BINGO!
    Comments:
    — Using the highest resolution and Searchable Image (Exact) will save your text without losing its clean appearance. Low resolution will make your text readable, but crappy looking.
    — Download Microsoft XPS (files):
    http://www.microsoft.com/en-us/download/details.aspx?id=11816
    — If you don’t know what OCR is, or where to find Searchable Image (exact), or How to print using “Microsoft XPS Document Writer”, PLEASE, Google it on your own, for your own best experiences.
    *Download only if you do not have XPS installed.
    Option 2:
    Do the same, but save as an image (png, tiff, …); you will then have to combine all the pages back into one “PDF” file.
