Thursday, April 16, 2009

Making ProQuest newspaper scans full-text searchable

I love using ProQuest newspaper databases for my dissertation research, but one thing bothers me about them. Even though the documents are full-text searchable within the database, once you download a document, a simple Find command or a search through Spotlight or Google Desktop finds nothing except the bibliographic information that ProQuest adds to the top of the document. If you open the document in Acrobat and try to make it full-text searchable by clicking on Document->OCR Text Recognition->Recognize Text using OCR..., Acrobat produces an error.

To solve this, I created an Apple Automator script that takes the PDF, turns each page into an image, and then turns the images back into a PDF. Acrobat's OCR will then run on the new file without the error and make the text searchable. You can get the script here.
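If you are not on a Mac, the same PDF-to-image-to-PDF round trip can be sketched in Python. This is not the Automator script itself, only a rough equivalent under some assumptions: it relies on the third-party pdf2image package (which needs Poppler installed) and on Pillow, and the function names and the "_raster" suffix are made up for illustration.

```python
from pathlib import Path


def rasterized_name(src):
    """Hypothetical naming rule: 'article.pdf' -> 'article_raster.pdf'."""
    src = Path(src)
    return src.with_name(src.stem + "_raster.pdf")


def rasterize_pdf(src, dpi=300):
    """Render every page of src to an image, then save the images as a new PDF."""
    # Imported here so the naming helper above works without the extra packages.
    # Assumption: pdf2image and Pillow are installed, and Poppler is on the PATH.
    from pdf2image import convert_from_path

    # One Pillow image per page; convert to RGB so Pillow can write it as PDF.
    pages = [page.convert("RGB") for page in convert_from_path(str(src), dpi=dpi)]
    if not pages:
        raise ValueError(f"no pages found in {src}")

    dst = rasterized_name(src)
    # save_all + append_images writes a multi-page PDF made only of images.
    pages[0].save(dst, "PDF", save_all=True,
                  append_images=pages[1:], resolution=dpi)
    return dst
```

Calling `rasterize_pdf("article.pdf")` should leave an `article_raster.pdf` next to the original. The new file contains only page images, so it still needs a pass through Acrobat's OCR before it becomes searchable. A resolution around 300 dpi is a common choice for OCR, but treat it as a starting point rather than a tested setting for these scans.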
