Why you need to split your big PDF books

I use search tools a lot in my classes and in my own research.

When my classmates and I discuss an issue, for example, we sometimes come up with wild ideas. We ponder them and ask whether anybody else has thought of them before us. We google the idea to see if anybody has explained it on the web. But the fact of the matter is that Google returns a lot of noise for the same keywords and has little to offer on very specific information in our field.

That is where an internal database comes to the rescue. I collect as many books and articles as I can on my disk so that I can dig through them whenever I want to learn about a specific idea. The concept is known as “text mining” in a different camp of linguistics.

Right now, I have over 2000 books and articles on my disk, all of which deal with Theoretical Linguistics.

If you have a collection of books and articles like mine and have tried to search it for a specific phrase using Alfred, Spotlight or Devonthink, you will immediately learn that the big books always rank at the top. The reason is the word count: the larger the book, the more likely it is to contain the queried word multiple times. The huge encyclopedias are especially like a virus in my database because they are the ones that always top the search results.
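
To see why raw counts favor big books, here is a toy sketch in Python. It assumes a ranker that scores purely by the number of hits, which is a simplification; I do not know the actual ranking formulas of Alfred, Spotlight or Devonthink. The file names, the texts and the `density` score (hits divided by word count, shown only for contrast) are all made up for illustration.

```python
def raw_hits(text: str, term: str) -> int:
    """Count case-insensitive occurrences of `term` in `text`."""
    return text.lower().count(term.lower())


def density(text: str, term: str) -> float:
    """Hits divided by the document's word count (0.0 for an empty text)."""
    words = len(text.split())
    return raw_hits(text, term) / words if words else 0.0


# Made-up miniature library: one huge reference work, one focused article.
library = {
    "encyclopedia.pdf": "ergativity " * 5 + "filler " * 9995,   # 10,000 words, 5 hits
    "short_article.pdf": "ergativity " * 3 + "filler " * 97,    # 100 words, 3 hits
}
term = "ergativity"

# Assumed ranker 1: more hits means a higher rank.
by_raw_hits = sorted(library, key=lambda name: raw_hits(library[name], term), reverse=True)

# Assumed ranker 2: hits relative to document length.
by_density = sorted(library, key=lambda name: density(library[name], term), reverse=True)

print(by_raw_hits)   # ['encyclopedia.pdf', 'short_article.pdf']
print(by_density)    # ['short_article.pdf', 'encyclopedia.pdf']
```

The encyclopedia wins on raw hits even though the short article is far more concentrated on the term.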

Therefore, to balance the results and make the smaller documents as visible as any other material in the database, I split the large documents into smaller chunks.

I have experimented with different splitting tools, from Apple’s own Automator to a number of Python and shell scripts. Most of them work by bursting a document into single pages.

Bursting a book into single pages can be feasible when you have a small number of books. As the number of books grows, bursting floods your drive with documents. In addition, single pages won’t contain enough material to read from the search result itself (in the FoxTrot preview window, for example). That is where splitting into ranges of around 50 pages turns out to be very useful.

Right now, I have a shell script that breaks my books down into 50-page ranges.

Any book or article I add to my Bookends library is directly copied to another folder (using Hazel) and split into 50-page ranges. All the chunks finally move to another folder for archival, which is the folder my search tools such as Devonthink and FoxTrot index.
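
For readers who want to try the splitting step, here is a minimal Python sketch of the idea. It is not the shell script I use. It assumes the third-party `pypdf` package is installed; the function names `page_ranges` and `split_pdf`, the output file naming and the default chunk size of 50 are my own choices for illustration.

```python
from pathlib import Path


def page_ranges(total_pages: int, chunk_size: int = 50) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def split_pdf(source: Path, out_dir: Path, chunk_size: int = 50) -> list[Path]:
    """Write one PDF per page range into out_dir and return the new paths."""
    # Imported here so that page_ranges() still works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for first, last in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        # pypdf pages are 0-based, so shift the 1-based range by one.
        for index in range(first - 1, last):
            writer.add_page(reader.pages[index])
        target = out_dir / f"{source.stem}_p{first:04d}-{last:04d}.pdf"
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


# A 120-page book yields two full chunks and one shorter final chunk.
print(page_ranges(120))   # [(1, 50), (51, 100), (101, 120)]
```

Calling `split_pdf(Path("book.pdf"), Path("chunks"))` with a hypothetical `book.pdf` would then write files such as `book_p0001-0050.pdf` into the `chunks` folder, which a tool like Hazel could move on to the archive folder.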


Once you have done the splitting and indexing, every small article is as visible as the large documents. You will be able to find the most relevant article for the search term regardless of its size.

5 thoughts on “Why you need to split your big PDF books”

  1. It’s strange that none of the on-disk searching algorithms have yet implemented the standard data-mining ranking method. It measures the frequency of your target words, divided by the total length of the document. (Approximately.) Thus big books are not favored.
    Your technique has some advantages, but of course it also has weaknesses. If you are searching for several words or phrases that all occur in a book but end up split among the smaller documents, you won’t get any hits.

  2. To strike a balance, what about splitting by articles and chapters within a book? Adobe Acrobat 11 can split by single page (not useful) and by any number of pages you choose, but as far as chapters go, the best it can do is split by top bookmarks, which can involve several chapters. Which app could target single chapters?

    What about PDFs we have that have no bookmarks? That’s a big problem.

  3. I always use a very manual, time-consuming method, splitting large PDFs with an application called Combine PDFs. (As the name suggests, it can also be used for combining, but I almost never do that.) That’s probably not a good use of my time.
    I agree with a later article of yours that Oxford Scholarship Online’s offering of individual chapters is very useful. Many Cambridge University Press and Springer books also use the same system. It would be great if they could also come up with a way of confirming previous purchases of the physical books and offering discounted electronic access in such cases. (One can dream!)
