Why you need to split your big PDF books

I have one secret tool with which I beat all my classmates when it comes to digging up the nitty-gritty of small pieces of information.

When we discuss an issue with professors or classmates, we sometimes come up with wild ideas. We ponder them and ask whether anybody else has thought of them before us. What most people do is google. I also sometimes google an idea to see whether anybody thought of it before me. But the fact of the matter is, Google returns a lot of noise that matches the same keywords while offering little of the very specific information I am looking for.

That is where an internal database comes to the rescue. I collect as many books and articles as I can on my disk so that I can dig through them whenever I want to track down a specific idea. In a different camp of linguistics, the concept is known as “text mining”.

Right now, I have over 2,000 books and articles on my disk, all of which deal with Theoretical Linguistics.

If you have a collection of books and articles like mine and have tried searching it for a specific phrase using Alfred, Spotlight, or DEVONthink, you will immediately learn that the biggest book always comes out on top, regardless of the quality of the material in it. The reason is word count: the larger the book, the more likely it is to contain the queried word multiple times. If your collection contains gigantic encyclopedias in particular, a short article has no chance of coming out on top of your search results, however relevant it may be.

Therefore, to make each small article compete on equal terms with any other material, so that your search tools can pick it up whenever it is relevant, you need to split the books into article-sized pieces.

I have experimented with different tools for splitting my books, from Apple’s own Automator to a number of Python and shell scripts. Most of them work by bursting the book into pages.

Bursting a book into single pages may be feasible when you have fewer than 1,000 books. As your collection grows, bursting creates far too many files to manage. In addition, a single page does not contain enough material to read within the search result (FoxTrot, in my case). That is where splitting into 10-15 page (article-sized) chunks becomes crucial.

Right now, I have a shell script that breaks my books down into 10-page ranges, a script I adapted from a South African guy (I forget his name; I met him on Academia.edu).
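To make the idea concrete, here is a rough sketch of the same 10-page splitting in Python, using the pypdf library. It is only an illustration of the approach, not the script I actually use, and the file and folder names are made up.

    # Rough sketch: split a PDF into 10-page chunks ("article size").
    # Requires the pypdf library (pip install pypdf); file names are examples.
    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    CHUNK_SIZE = 10  # pages per output file

    def split_pdf(source: Path, out_dir: Path, chunk_size: int = CHUNK_SIZE) -> None:
        reader = PdfReader(str(source))
        total = len(reader.pages)
        out_dir.mkdir(parents=True, exist_ok=True)
        for start in range(0, total, chunk_size):
            end = min(start + chunk_size, total)
            writer = PdfWriter()
            for i in range(start, end):
                writer.add_page(reader.pages[i])
            out_file = out_dir / f"{source.stem}_p{start + 1}-{end}.pdf"
            with out_file.open("wb") as fh:
                writer.write(fh)

    if __name__ == "__main__":
        split_pdf(Path("BigBook.pdf"), Path("chunks"))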

Any book or article I add to my Sente library gets copied directly to another folder (using Hazel) and split into article-sized chunks. All the chunks finally move to another folder for long-term archival, where my search tools such as DEVONthink and FoxTrot index them.
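The shape of that pipeline is simple enough to sketch. Again, this is only an illustration: the folder names are placeholders, and split_chunks.py is a hypothetical file holding the splitting sketch above.

    # Sketch of the folder pipeline: Hazel drops copies into "incoming/",
    # and the chunks land in "archive/" where DEVONthink and FoxTrot index them.
    from pathlib import Path
    from split_chunks import split_pdf  # hypothetical module containing the sketch above

    INCOMING = Path("incoming")  # Hazel copies new Sente attachments here
    ARCHIVE = Path("archive")    # indexed by DEVONthink and FoxTrot

    def process_incoming() -> None:
        for pdf in INCOMING.glob("*.pdf"):
            split_pdf(pdf, ARCHIVE)  # write 10-page chunks into the archive
            pdf.unlink()             # drop the copy once it has been split

    if __name__ == "__main__":
        process_incoming()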

I will come back to the full workflow and the scripts I use to achieve the task in another post.


3 thoughts on “Why you need to split your big PDF books”

  1. It’s strange that none of the on-disk searching algorithms have yet implemented the standard data-mining ranking method. It measures the frequency of your target words, divided by the total length of the document. (Approximately.) Thus big books are not favored.
    Your technique has some advantages, but of course it also has weaknesses. If you are searching for several words/phrases that occur in the book but end up split across the smaller documents, you won’t get any hits.

  2. To strike a balance, what about splitting by articles and chapters within a book? Adobe Acrobat 11 can split by single pages (not useful) or by any number of pages you choose, but as far as chapters go, the best it can do is split at top-level bookmarks, which can span several chapters. Which app could target single chapters?

    What about PDFs that have no bookmarks? That’s a big problem.
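As a toy illustration of the length-normalized scoring described in the first comment (hits divided by total word count), with made-up numbers and no claim about how any real search tool actually ranks:

    # Toy example of length-normalized scoring: term hits / total word count.
    # A short, dense article outranks a huge book with more raw hits.
    def normalized_score(text: str, term: str) -> float:
        words = text.lower().split()
        return words.count(term.lower()) / len(words) if words else 0.0

    article = "ergativity " * 5 + "in Basque " * 5            # 15 words, 5 hits
    encyclopedia = "ergativity " * 8 + "filler word " * 496   # 1000 words, 8 hits

    print(normalized_score(article, "ergativity"))       # 0.333... -> short article wins
    print(normalized_score(encyclopedia, "ergativity"))   # 0.008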
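As for targeting chapters, top-level bookmarks are also reachable from a script. Here is a rough pypdf sketch, again only an illustration, and of no help for PDFs that carry no bookmarks at all:

    # Rough sketch: split a PDF at its top-level bookmarks, one file per bookmark.
    # Requires pypdf; assumes the bookmarks appear in page order.
    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    def split_at_bookmarks(source: Path, out_dir: Path) -> None:
        reader = PdfReader(str(source))
        out_dir.mkdir(parents=True, exist_ok=True)
        # Keep top-level outline entries only; nested lists hold sub-bookmarks.
        tops = [item for item in reader.outline if not isinstance(item, list)]
        starts = [reader.get_destination_page_number(dest) for dest in tops]
        bounds = starts + [len(reader.pages)]
        for i in range(len(tops)):
            writer = PdfWriter()
            for page_index in range(bounds[i], bounds[i + 1]):
                writer.add_page(reader.pages[page_index])
            with (out_dir / f"{source.stem}_{i + 1:02d}.pdf").open("wb") as fh:
                writer.write(fh)

    if __name__ == "__main__":
        split_at_bookmarks(Path("BookWithBookmarks.pdf"), Path("chapters"))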
