PDF and OCR

From SuperMemopedia
Revision as of 18:31, 22 October 2022 by SuperMemoUser (talk | contribs)

Summary

PDF is not natively supported by SuperMemo. You can use OCR to convert a PDF to an importable format.

OCR

Optical Character Recognition (OCR) software can be used to extract HTML text from PDF files. However, it is a lengthy process, and most available software is commercial. Even then, errors can occur that need to be corrected by hand, and formatting for math formulas and other non-standard symbols generally requires hours of "training" the software. Here are some commercial OCR programs:

  • Adobe's own converter is very unreliable: http://www.adobe.com/products/acrobat/access_simple_form.html (some converted pages are full of squares)
  • Other commercial software has been discussed in the Yahoo Group Forum, notably ABBYY FineReader Pro 7.0, OmniPage, and Solid PDF Converter.
  • InftyReader is specifically designed for converting PDF math files, but is expensive at $900.
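Since converted pages can come out "full of squares", it may help to flag the worst pages automatically before correcting anything by hand. The following is a minimal sketch, not part of any tool named above: the function names, the list of suspect characters, and the 5% threshold are all assumptions made for illustration.

```python
import unicodedata

# Assumed list of characters that typically signal a failed conversion:
# the Unicode replacement character and a few "square" glyphs.
SUSPECT_CHARS = {"\ufffd", "\u25a1", "\u25a0", "\u25af"}


def garbled_ratio(text):
    """Return the share of non-whitespace characters that look like debris."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1
        for ch in visible
        # Cc = control, Co = private use, Cn = unassigned code points.
        if ch in SUSPECT_CHARS or unicodedata.category(ch) in ("Cc", "Co", "Cn")
    )
    return bad / len(visible)


def pages_to_review(pages, threshold=0.05):
    """Return 1-based numbers of pages whose garbled ratio exceeds the threshold."""
    return [
        number
        for number, page in enumerate(pages, start=1)
        if garbled_ratio(page) > threshold
    ]
```

Pages returned by `pages_to_review` are candidates for re-scanning or for a different converter; the remaining pages still need the usual proofreading.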

Suggestion

I use a few workarounds to import PDF files into SuperMemo.

If the PDF is plain text, the first thing I try is to save it as a text file. Foxit Reader allows you to do that (File -> Save as...). Then I open the file in Notepad and copy/paste the text into a new article.
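Text saved from a PDF usually keeps the hard line breaks of the printed page. As a rough sketch (the function name `text_to_html` and the paragraph rules are assumptions, not a SuperMemo feature), the saved text can be turned into simple HTML paragraphs before pasting or importing:

```python
import html
import re


def text_to_html(raw):
    """Convert hard-wrapped plain text into simple <p> paragraphs."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    # Caveat: this also joins genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    paragraphs = []
    # Treat blank lines as paragraph separators.
    for block in re.split(r"\n\s*\n", text):
        # Collapse line wraps and runs of spaces inside a paragraph.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line\nwraps here.\n\nSecond para-\ngraph with <tag>."
    print(text_to_html(sample))
```

The result can be saved as an .htm file or pasted into a new article, depending on which is more convenient.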

Alternatively, you can use Calibre: import the PDF as an ebook, then convert it to HTML. The conversion may not be perfect, but you can always make adjustments incrementally or use an HTML editor (Kompozer is free and more than adequate for most jobs of this kind) to clean it up a bit beforehand. You can then copy/paste the text, or open the HTML file in Internet Explorer and import it into SuperMemo. (It's easier and faster than it sounds.)
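The same conversion can also be scripted with Calibre's command-line tool, `ebook-convert`, which infers the output format from the file extension. The sketch below assumes that `ebook-convert` is on the PATH and that the HTMLZ (zipped HTML) output suits your import workflow; the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(pdf_path: str, out_path: str) -> List[str]:
    """Build the command line: ebook-convert <input> <output>."""
    return ["ebook-convert", pdf_path, out_path]


def convert_pdf(pdf_path: str) -> Optional[Path]:
    """Convert a PDF to HTMLZ; return the output path, or None if Calibre is missing."""
    out_path = Path(pdf_path).with_suffix(".htmlz")
    # Assumption: Calibre is installed and its tools are on the PATH.
    if shutil.which("ebook-convert") is None:
        return None
    subprocess.run(build_convert_command(pdf_path, str(out_path)), check=True)
    return out_path
```

The .htmlz file is a zip archive, so it must be unpacked before the HTML inside can be opened and cleaned up.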

If that fails for whatever reason, I use OCR, specifically a program called ABBYY FineReader. Unfortunately, it's not free. A free alternative is Free OCR, which supports scanned PDFs but not (quote) "complicated PDF's that contain text and images".

User-suggested method

  1. Scan textbook pages.
  2. OCR textbook pages.
  3. Import textbook pages into SuperMemo, with one Topic per 2-page spread. Subheadings are emboldened, and the main title is referenced.
  4. Incrementally read the texts, highlighting almost every sentence in the article.
  5. When the article is done, dismiss it.
  6. Go through converting all the Remember Cloze items to Cloze Deletions.
  7. When a Remember Cloze item is complete, dismiss it.
  8. Go through all the cloze deletion questions.
  9. Use SuperMemo Assistant.

Taken from: Learning from a textbook
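Step 3 of the method above (one Topic per 2-page spread) can be prepared before import by grouping the OCR output. This is only a sketch under stated assumptions: each page's OCR text is already available as a string, pages are paired starting from the first scanned page, and the function name and title format are made up for illustration.

```python
def group_into_spreads(pages, pages_per_spread=2):
    """Group per-page OCR text into (title, text) pairs, one per spread."""
    spreads = []
    for start in range(0, len(pages), pages_per_spread):
        chunk = pages[start:start + pages_per_spread]
        first = start + 1          # 1-based number of the first page in the spread
        last = start + len(chunk)  # 1-based number of the last page in the spread
        title = f"Pages {first}-{last}" if last > first else f"Page {first}"
        # Separate the pages with a blank line so they stay distinct paragraphs.
        spreads.append((title, "\n\n".join(chunk)))
    return spreads


if __name__ == "__main__":
    for title, text in group_into_spreads(["page one", "page two", "page three"]):
        print(title, "->", len(text), "characters")
```

Each `(title, text)` pair can then be saved as a separate file or pasted as a separate Topic.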