How to convert scanned images to perfect ebooks

ScanTailor is better than Acrobat and ABBYY FineReader

One major challenge in converting physical books to ebooks is cleaning up the black marks that the scanning process leaves on the margins.

There are many tools that can reduce the problem, but almost all of them fail to remove the shadows and black marks completely. That is where ScanTailor, a free program for Mac and Windows, excels.

ScanTailor is one of the most valuable pieces of software for turning dirty, junk-filled scanned pages into neat, readable PDF files. Even expensive software such as Adobe Acrobat and ABBYY FineReader cannot compare with it. It has some magic that cleans out the shadows and black marks on the margins of the pages.

Once you process the images in ScanTailor and OCR them with Acrobat or ABBYY FineReader, you get industry-standard PDF files. If you have an old book that you want to convert to ebook format and sell, this is the right process to follow.

Scan it with any scanner to image files such as TIFF or PNG -> clean it up in ScanTailor -> OCR it with Acrobat or ABBYY FineReader. FineReader can also export to EPUB and Mobi if you prefer these formats.

At the end of the process, you will have a neat ebook that you can even sell online.

 

DJVU to PDF

If you have a DjVu file that you want to convert to a searchable PDF on the Mac, you need Cisdem Document Reader to extract all the pages to images (PNG or TIFF) so that ScanTailor can import them. I used to follow a convoluted process of exporting the pages to PDF and then converting them to images, because my DjVu reader (DjVuLibre) cannot extract all the pages to PNG or TIFF images.

This is one of the lessons I learned in the process of experimenting with different tools.

 

How to install ScanTailor

It is hard to find the Mac version of ScanTailor. I have spent many hours trying to compile it from source and to install it from Homebrew. Homebrew fails to download some dependencies on the latest version of macOS. If you have trouble compiling from source or installing from Homebrew, the best option is to download the Universal version, which the authors have thankfully compiled for us. This version seems to lack some features compared with the Advanced version, but it works fine for the main functions I described here.


Time-lining software to manage projects

I generally don’t like mind-mapping software. I rarely have pre-made, hierarchical data that I can list in the nodes of a mind map. My data is usually messy and unstructured, and it is my job to collect and structure it. The rigid structure that mind-mapping software imposes puts me off. It is ironic that these are called “mind mapping” tools, as I find them the least mind-friendly of my tools. My brain just doesn’t work with this kind of rigid structure. It works with connections and fuzzy boundaries. That is the reality of the human mind as we know it.

I am a linguist. I know this as a matter of fact, because the boundary between the words ‘like’ and ‘love’, ‘meet’ and ‘gather’, or even ‘map’ and ‘structure’ is as fuzzy as it gets. The human brain works by fuzzy connections, or associations, not by rigid hierarchical structures.

I have considered mind-mapping software irrelevant to my workflow, and I gave up on it very early on.

There is one exception, though: I tried these tools for managing projects. Very few of them are capable of managing projects. The first one I tried was Xmind. Xmind has a Gantt chart. I like the Gantt chart and many of the other views, such as the Matrix.

But it didn’t stick with me, because the application is a bit cumbersome. It is very hard to open an application every day if it puts a burden on my processor.

I also used Tinderbox for this kind of task, and for longer than Xmind. I used it especially for an Agile system.

But Tinderbox swamped me with a large number of notes, because every task has to be dropped into its own note. A simple task-management file quickly grew to hundreds of notes.

I recently tried MindView. It is a lot better than Xmind because it supports both Gantt charts and timelines. I especially like the timeline. It also has dedicated project-management features, such as managing resources and tasks. I found MindView a much more potent system for managing projects than any other mind-mapping application out there.

Because I like the timeline so much, I went on to investigate other applications that have this feature. In the process, I discovered Aeon Timeline. It is a very neat application: fast and efficient.

Timeline and Gantt are unified into a single system. Best of all, zooming into the details and out to the general overview has never been easier. This is the first software I have found that shows me the big picture of my project while still letting me get into the details.

In all other software, the choice is either-or. You either get to the details and lose the big picture, or see the big picture and lose the details. Even Tinderbox has this weakness: you have to either go to the top of the map or zoom into one corner of it. You cannot have both. In Aeon Timeline, flying between the details and the overall structure is just a matter of scrolling with the mouse or pinching on the touchpad. It is very beautiful. I love this easy way of zooming in and out of the details.

Extract Bibtex references from table of contents

I have been trying different tools to extract bibliographic references from the table of contents of a PDF book.

Assume you have the PDF of an edited book that contains 20 articles. If you want reference data for all of the entries, you have to go to Google Scholar and extract the reference data for each of them. The process is tedious. Furthermore, Google Scholar usually offers incomplete data, so you need to go and edit each of these references. It is a lot of work.

Wouldn’t it be easier if you could just pick the reference data directly from the table of contents of the PDF book?

Yes, in principle.

But in practice, you need to know a lot of programming and understand how PDF files work under the hood. I know none of that. Therefore, I came up with a simpler but equally workable solution: Keyboard Maestro and Jabref.

The process is a bit complex, but the output is much better and faster than Google Scholar or any of the other reference extraction methods.

  1. Fill in the reference data of the main book in Jabref (from WorldCat).
  2. Copy the BibTeX of the edited book to a specific clipboard inside Keyboard Maestro. (If you are importing my macro, simply hit CTRL+C; that copies the BibTeX and makes some calculations to get the publication year.)
  3. Copy the title, author and page number of each article in the PDF book. Each reference must be copied in that order.
  4. Hit a shortcut (CMD+ALT+9) that opens a Keyboard Maestro window asking for the number of copied references. I count the references I copied and answer the question. I typically copy 8 references at a time.
  5. Click OK. KM turns the clipboards into references, calculates the page numbers for each entry, and crossrefs them with the parent book.

I magically get perfectly formatted references from the copied clipboards. Once you get how it works, it is a very powerful script.
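For readers who prefer a script, the idea behind steps 3 to 5 can be sketched in Python. This is not the Keyboard Maestro macro, only a minimal illustration under stated assumptions: each chapter is given as a (title, author, start page) triple, a chapter is assumed to end on the page before the next one starts, and the function name and the key scheme (`-ch1`, `-ch2`, ...) are hypothetical.

```python
def chapter_entries(book_key, chapters, last_page):
    """Build BibTeX @incollection entries that crossref a parent book.

    chapters: list of (title, author, start_page) in table-of-contents order.
    last_page: final page of the last chapter (assumed to be known).
    """
    entries = []
    for i, (title, author, start) in enumerate(chapters):
        # Assumption: a chapter ends on the page before the next one starts.
        end = chapters[i + 1][2] - 1 if i + 1 < len(chapters) else last_page
        # Hypothetical key scheme: parent key plus the chapter's position.
        key = f"{book_key}-ch{i + 1}"
        entries.append(
            f"@incollection{{{key},\n"
            f"  title = {{{title}}},\n"
            f"  author = {{{author}}},\n"
            f"  pages = {{{start}--{end}}},\n"
            f"  crossref = {{{book_key}}}\n"
            f"}}"
        )
    return entries
```

Calling `chapter_entries("Smith2010", [("Intro", "A. Author", 1), ("Syntax", "B. Writer", 25)], 60)` would give two entries, both pointing at the parent entry `Smith2010` through the `crossref` field, which is how BibTeX lets chapters inherit the book's details.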

Keyboard maestro script

You can ask if you are interested in the script.

Bookends vs Zotero vs Mendeley vs Jabref

I have been very dismissive of Mendeley for many years, for one good reason: the data is always extracted from Google Scholar, so I get the worst, most incomplete references from Mendeley. As an early adopter (starting from its beta stage, around 2008), I was left frustrated with it. Now it is time to appreciate one great quality of Mendeley that no other reference manager can emulate: its attempt to do the seemingly impossible. That is, Mendeley tries to get the reference information by reading the PDF file directly. This technology is unique to Mendeley, as far as I can tell. While both Bookends and Zotero can extract identifiers such as the DOI and ISBN, they never try to get the title, the author and the date by reading the PDF file directly. Mendeley does that. As a result, it is a lifesaver when you have a lot of junk to clean up.

I recently downloaded more than 3400 PDF files from a linguistic archive. Importing them into the other reference managers did not produce a single relevant reference: both Zotero and Bookends gave me zero results. I also tried Papers3. Interestingly, Papers was able to pick up some of them, but its success rate was less than 20%.

Then I dragged them into Mendeley, and the majority of them got their references filled in. Most of them got junk references, of course, as usual. But hey, this is technology; we have to do a lot of trial and error. Cleaning up the junk library was much better than entering references one by one for 3400 items. For that, I am now grateful to Mendeley.

But ultimately I cannot live with Mendeley, because it gets data only from Google Scholar, and that data is always junk. That is why I have to move these partially filled references back to either Bookends or Zotero.

Completing the incomplete references in Zotero is a nightmare, as I learned the hard way. Zotero excels at getting data from the browser and then attaching the PDF to the reference. With PDFs that have incomplete references or no DOI, however, Zotero is a huge pain. I am a PDF guy. I rarely pick up the reference from the web page. I always go to the PDF and attempt to fill in the reference later, when I have some extra time. For that, Bookends is much better. BE has a feature called Autocomplete (similar to the Targeted Browsing feature in Sente) that lets me highlight the title of the book or article in the PDF and search for it in web engines (Google Scholar, WorldCat, or my local library website). That way, I don’t have to write the reference manually.

In Zotero, if you failed to get the data from the website first, or the PDF contains no DOI, the only option is to write the reference manually. Jabref is even better in this respect, because you can copy and paste references from Google Scholar onto the existing PDF.

But I find Zotero better than Bookends in these two respects:

  1. Direct syncing of the BibTeX file, using the Better BibTeX plugin.
  2. Automatically getting the ISBN of books. Zotero picks up the ISBN almost always correctly. This feature is coming to Bookends as well, but BE is not really effective at it yet.

 

How about Jabref?

It excels at manipulating references in the BibTeX format. Furthermore, it has one unique feature that no other reference manager has implemented yet: embedding the reference data as XMP metadata (an XML format) in the PDF files. There are two good reasons to write metadata into the PDF files.

  • It improves searching: you can search Spotlight by the author or the title of the book, or order the books by their date of publication. This is especially useful if you use more advanced search tools such as FoxTrot (or dtSearch on Windows).
  • You can lose the reference, or share the PDF, without losing its reference data. If your library is lost or corrupted, you don’t have to fill in the reference data again. You can just drag in the PDF and Jabref will populate the reference data for you.

The conclusion is that every reference manager has its own strengths and weaknesses. Each of them has its own niche users and niche features. Bookends and Jabref are my all-time favorite reference managers. I think I will keep all 3 reference managers for now. Some people own three cars just for the sheer fun of it, even if one car is usually enough. I don’t have to choose among these great programs. I will use Zotero for books, Bookends for everything else, and Jabref for BibTeX.

Mirroring ‘Oxford Scholarship Online’ on your mac using Bookends

 

Oxford’s publications have started a sort of revolution in how printed books are accessed online.

Most other publishers put the exact same book online for sale as a single PDF or EPUB file. Oxford has made an important change that seems to be starting a revolution in how we consume books online. In Oxford’s system, each chapter of a book stands by itself as a published article. Each chapter gets its own DOI, so that it can be read, downloaded or cited independently of the main book.

Look at this video to see how it works.

Now we are in a position to have our cake and eat it too. The old conflict over whether to keep books broken into their chapters and sections (since that facilitates searching and discovery within each chapter) or to keep them together in order to see the work as a whole disappears immediately. We have both the whole and the parts at the same time.

So I have been wondering how to mirror that system in Bookends, and I experimented with how each of these independent chapters could be managed. Bookends has a mechanism for managing chapters separately while keeping them linked to the main book. Bookends is such a well-thought-out application: each chapter is able to live by itself, exactly as it is presented in Oxford Scholarship Online. I am so impressed. I haven’t seen any other offline reference manager that can manage chapters the way Bookends does (and the way Oxford does online).

The more I use each of its features, the more I appreciate how well designed Bookends is.

I used to split PDF books into specific page ranges (pages 1-50, 51-100, etc.) just to make large PDF books easier for my search tools, such as FoxTrot, to handle. I have already explained that system before.
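The page ranges for such a split are simple arithmetic. Here is a minimal Python sketch that only computes the ranges; it does not touch the PDF itself, and the function name and the default chunk size of 50 pages are assumptions chosen to match the example above.

```python
def page_ranges(total_pages, chunk=50):
    """Return 1-based, inclusive (first, last) page ranges of at most `chunk` pages."""
    if total_pages < 0 or chunk < 1:
        raise ValueError("total_pages must be >= 0 and chunk must be >= 1")
    return [
        # The last range is cut short so it never runs past the final page.
        (first, min(first + chunk - 1, total_pages))
        for first in range(1, total_pages + 1, chunk)
    ]
```

For a 120-page book, `page_ranges(120)` gives `[(1, 50), (51, 100), (101, 120)]`, which can then be fed to whatever tool does the actual splitting.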

Now, with Oxford Scholarship Online and Bookends, I don’t need to split. I simply download each chapter and attach it as a chapter of a single reference entry in Bookends. The PDF file of each chapter is attached to its corresponding entry. It is a much better system because I won’t have duplicated information. I only hope that all other publishers offer books the same way Oxford Scholarship Online does.

 

Jotting applications on the Mac

Jotting applications are my name for the class of note-writing apps that focus specifically on fast jotting of notes as ideas strike my mind. The notes don’t have to be complete. I measure these apps by how efficiently they feed into full-blown writing applications and how easy it is to insert notes. Here is a comparison of my favorite apps that I tried recently.

Apps compared: Curiota, NoteAway, Tab Notes, Unclutter, NvALT, TaskCard, Devonthink Sorter

Criteria:

  • Transparent file storage in Finder (library in Documents folder): x x
  • Supports RTF and RTFD: it is a sort of rtfd; but it is ntRTFD extension; x x x
  • Permits direct assignment of Finder tags: x x x x
  • Transparent file naming (the title of the note is the file name): adds further junk; x x x
  • Menu bar icon for quick jotting
  • Quick note inserting shortcut

 

Curiota and Devonthink Sorter come out on top. If Curiota permitted direct assignment of tags, it would be the perfect jotting app.

I would like to hear if there is an app that satisfies all the tests.

 

Switching from Day One to MacJournal

Day One has been a great application for writing daily journals. Now they are moving to a subscription system. I don’t like subscription-based software. One reason is that I want to be able to spend some months completely offline. In addition, thinking about a payment every month makes me feel miserable.

So I am now trying to move to the old MacJournal for my journaling needs. I haven’t found any better application. Many of the note-writing applications do not support encryption.

As for exporting my journals, Day One can export to a handful of formats: HTML, plain text and JSON. The JSON keeps the most complete information, but MacJournal doesn’t accept JSON. I found an intermediate application that natively imports the JSON: Bear.

I use Bear as a means of transit to MacJournal.

Day One -> export in JSON format -> import the JSON into Bear -> export it in RTF format -> import the RTF into MacJournal.

There are still two losses in this process:

The tags: the tags are transferred as in-text tags, not true Finder tags. As such, MacJournal cannot recognize them as tags.

The images: they are lost, because RTF cannot keep the images.

For the images, if you have many of them, a better strategy would be to get the pro version of Bear and export in Word format, which MacJournal can import. For the tags, however, the Word format is still not a solution. We need some mechanism for converting the in-text tags (marked as #tag) to Finder tags so that MacJournal, or any other application for that matter, will recognize them as tags. After trying different methods, I now have this [Hazel rule](https://www.dropbox.com/s/w5w597lz3emwyp8/ExportFolder.hazelrules?dl=0) to convert those hashed texts to Finder tags.

Export the notes from Bear to a Finder folder -> run the Hazel rule on the folder. The rule assumes that each file has no more than 5 tags. If your notes contain more tags, you might need to modify it.
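The first half of that conversion, finding the #tags in a note, can be sketched in a few lines of Python. This is not the Hazel rule, only an illustration of the idea under assumptions: tags are single words made of letters, digits, `_`, `-` or `/`, the function name is hypothetical, and the default limit of 5 simply mirrors the rule's assumption. Writing the result as Finder tags is left out.

```python
import re

# A tag is "#" at the start of a word, followed directly by tag characters.
# Markdown headings ("# Title", "## Title") do not match, because the "#"
# is followed by a space or by another "#".
TAG_PATTERN = re.compile(r"(?<!\S)#([A-Za-z0-9_/-]+)")


def extract_tags(text, limit=5):
    """Return up to `limit` unique #tags from text, in order of first appearance."""
    tags = []
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1)
        if tag not in tags:
            tags.append(tag)
    return tags[:limit]
```

Multi-word tags are not handled by this pattern, so notes that use them would need a different expression.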

Chronosync is Time Machine plus Git

Writing a very important document needs some care. A reliable backup is crucial.

In addition, a versioning system is very helpful. It is different from a backup because you can go back in time and revert some of the changes you made. Not all the changes we make to a document are useful. We make mistakes, and then we wish we had the old version of the file. Versioning is a great strategy for carefree editing: you can get the old version anyway, so why worry about making changes? It improves your productivity, as you are relieved of the fear of making wrong changes.

Subversion was the dominant system for ages. Now Git has replaced it. But even though these tools are as useful for writers as for developers, they are less popular among writers, probably because of their technical nature. I have been using Git for a while to keep versions of my LaTeX files. My LaTeX editor, TeXstudio, even supports committing with Git.

After a while, however, I realized that I often forget to commit my changes. Sometimes I want to revert, only to learn that I have no version of that particular edit. I tried to supplement Git with Keyboard Maestro to commit automatically. It was working fine. Still, things became too hectic when I made a lot of changes distributed across many folders. The files are also not all LaTeX, so I need to commit them with a separate program (via the Terminal). So I looked around for another solution. One strategy is to rely on Time Machine, as many people do. The problem with Time Machine is that it is less configurable. I want more versions of some files and fewer of others. Some files are crucial: I want them versioned every 20-30 minutes because I often want to refer back to the old versions.
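The automatic commit mentioned above can also be done with a small script run on a timer. The following Python sketch is not the Keyboard Maestro setup; it is a minimal illustration that assumes `git` is installed and that the folder is already a Git repository. The function names and the commit message format are hypothetical.

```python
import subprocess
from datetime import datetime


def snapshot_commands(message):
    """Return the git commands for one snapshot: stage everything, then commit."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "--message", message],
    ]


def has_changes(repo):
    """True if `git status --porcelain` reports anything in `repo`."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def snapshot(repo, now=None):
    """Commit all pending changes in `repo`; return False if nothing changed."""
    if not has_changes(repo):
        return False
    # Assumed message format: a fixed prefix plus a timestamp.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    for command in snapshot_commands(f"Auto-snapshot {stamp}"):
        subprocess.run(command, cwd=repo, check=True)
    return True
```

Running `snapshot` on each working folder every 20-30 minutes would give the kind of frequent versions described here, though it still has to be set up separately for every repository.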

In addition, because it tries to copy all the files on the disk, Time Machine is a huge resource hog. When I was using it, it topped the list of applications consuming the most energy on my machine. It sucks the battery juice out of my machine. Furthermore, Time Machine doesn’t support bootable backups, which are very useful in a crisis. That is when I started to check out Chronosync.

 

I have been using Carbon Copy Cloner for keeping bootable backups. CCC also keeps versions of files, to be fair, but its versioning system is not really useful for keeping track of changes in a file. That is when I decided to migrate to Chronosync. This beast does both the syncing and the versioning like a pro.

 

Chronosync permits a more fine-tuned backup and versioning scheme. You can tell it to back up some folders just once a day (say, the Downloads folder) while versioning your most active working folder (say, a Projects folder) every 30 minutes. Best of all, you will never feel the pressure on your Mac. Since you can divide up your backups to your liking, Chronosync doesn’t eat up your RAM or heat up your machine.
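The idea of a per-folder schedule can be illustrated with a short Python sketch. This is not how Chronosync is configured or implemented; it only shows the logic of giving each folder its own interval. The function name, the folder names and the intervals are assumptions taken from the example above.

```python
from datetime import datetime, timedelta


def due_folders(intervals, last_run, now):
    """Return the folders whose backup interval has elapsed since their last run."""
    due = []
    for folder, interval in intervals.items():
        previous = last_run.get(folder)
        # A folder that has never been backed up is always due.
        if previous is None or now - previous >= interval:
            due.append(folder)
    return due


# Example schedule: the active folder every 30 minutes, Downloads once a day.
intervals = {
    "Projects": timedelta(minutes=30),
    "Downloads": timedelta(days=1),
}
```

With this schedule, a check at noon after a Projects backup at 11:20 and a Downloads backup at 08:00 would report only Projects as due, so the rarely changing folder costs nothing until its day is up.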

Chronosync is like a Swiss Army knife for both versioning and backing up. It is money well spent. The saving I make on the battery pays back the price of Chronosync.

Comparing the 3 best PDF readers on the Mac: Skim vs Highlights vs PDF Expert

PDF files are the life of the academic. All information comes in them. More than 95% of the time I spend reading goes to PDF files. So a good PDF reading application is very important: much more important than any other application I use.

I have been trying different tools for reading PDF files.

Every one of them has strengths and weaknesses:

1. Skim

Skim: designed for the academic community: free and open source.

Strengths:

  • The ranked search is amazing. Of the 3, Skim has the best search capabilities. PDF Expert comes next.
  • The annotations are more powerful and flexible. The anchored note is an especially wonderful tool. You can literally draft your next book using anchored notes. Reading triggers ideas; ideas breed ideas. Ideas do not come out of the blue: they emerge during reading. The best part of anchored notes is that you can give titles to the notes. You can manipulate them so that the exported notes will be much cooler.

I usually put ## at the start of the title of the anchored note so that the title comes out as a true heading when I export the annotations in Markdown format.

  • The keep on top feature is very useful to compare ideas: any of the annotations can be kept on top.
  • Supports scripting
  • Export templates: you can modify these templates to your need. This is extremely useful.

But for all these strengths, there are some weaknesses.

  • The non-standard format: if you want to read or see Skim’s annotations in another application, you have to export them. You cannot just open the PDF elsewhere and keep on annotating. This is a deal breaker. Really. I am tied to my local system; it is like a prison. I like PDF Expert because it adheres to the standard PDF specification. I think it is the best Mac reader for standard formats next to Adobe’s own products. macOS has this weird system that goes by the name **PDFKit**: it gets broken and keeps screwing us all the time. Corporate greed seems to be the reason we are suffering. Why doesn’t Apple adhere to the standard Adobe specification? This same crap kit also seems to be the reason that saving PDF files in Preview and the other native Mac applications bloats the size of the PDF.
  • The separate .SKIM file is a pain in the ass. It gets lost. If you export and import the PDF, all the annotations get duplicated. It is a whole mess.

If Skim followed the standard PDF specification and wrote the annotations directly into the PDF itself, I would never look elsewhere.

2. Highlights

Strengths:

  • It follows the standard annotation system: annotations made in Highlights can be viewed and edited in other editors (on both Mac and Windows).
  • The annotations are powerful. The annotation panel can be made wide, so a long text can be inserted directly. Even if it is not as convenient as the anchored notes in Skim, the panel is generally convenient for dropping in longish texts.
  • It works great with other applications, such as Devonthink, Bookends and Evernote. This is one of its best features.
  • The exported notes are in Markdown format. This can be a strength or a weakness, depending on your interests.
  • Splitting annotations into distinct notes. This is the most interesting feature for me, because I can keep single ideas as separate notes. I have been using Sente annotations for this purpose.

Weaknesses:

  • General clunkiness: the app contains a lot of bugs.
  • The splitting feature is not well worked out. I would have bought this app if I had been able to assign titles to each of the annotations. Titles are very useful for summarizing the concept of each individual annotation. This is the most debilitating problem I have with Highlights. The split notes have no meaning and no life, because they cannot be customized with titles or tags.

3. PDF Expert

PDF Expert is a very fast and fluid application. I use it every day. The developers are generally very responsive and fast. The code they write is amazing, and the programming talent at Readdle tends to be very high. I have participated in their betas for a long time now. I can tell you, their betas are more mature and reliable than the final releases that Apple sends out. There are some small details, especially its speed, that make this app worth trying. I like it so much. It is the first app I open in the morning. The best part of the app is that it follows the standard Adobe system. The annotations made in PDF Expert are visible in any other PDF reader. That is why I use it as my default reader.

Strengths:

  • I also like the new search tool. It searches all the open files.
  • It automatically detects the true page numbers of the PDF.
  • Blazing fast.
  • Follows the standard Acrobat annotation format.
  • The annotation tools are generally OK.

Unfortunately, PDF Expert is also the least creative of the 3 apps I am trying. The features it contains are mostly already in Acrobat or other PDF readers. The export features are very weak, even terrible. I tried to export in Markdown format. It doesn’t let me customize which types of text I want to export. Generally, the exported text turns out to be very bad, containing unwanted stuff (like the date, the author…). Even if there is wondrous programming talent, the direction they are taking is mundane and uncreative. There is barely a feature in this reader that other readers, like PDFpen, Acrobat, etc., do not have. They don’t understand the areas of need. The annotation tools could be better. The developers of Highlights have truly understood the needs of the scientific community; it is only their implementation that is lacking.

My take:

After trying it a couple of times, I have given up on Highlights.

I am now using PDF Expert and Skim. I use PDF Expert for fast reading. When I just have to scan a file and take a few points, I open it with PDF Expert. I highlight a bit, clip a few lines to Curiota and close it.

When I have to read a book or an article from beginning to end, for intensive reading, no reader offers the comfort that Skim offers. The exports are also much more robust. Therefore, for in-depth reading, I rely on Skim.

By the way, there is another candidate that could offer similar comfort for reading: MarginNote. It seems to have some great annotation tools similar to the anchored note and, even better, mind mapping within the reader. I tried it for a couple of minutes, but I dropped the app immediately because the annotations are in a proprietary format: they will be a big lock-in. While the annotation summary can be exported, the annotations and the annotated PDF are divorced forever. I am skeptical of apps that rely heavily on proprietary file formats.
