Converting a PDF to HTML using Adobe Acrobat

From SuperMemopedia

Summary

Tips for using Adobe Acrobat Pro DC and ABBYY FineReader to import PDFs into SuperMemo.

Using Adobe Acrobat

I was just messing around with Adobe Acrobat Pro DC and noticed that it has a "Save As HTML" feature. It keeps all or most of the formatting (I still need to inspect the output more carefully, but at a glance it looks just fine). The one drawback is that the figures are lost. Also, make sure that you extract only the pages of interest rather than converting the whole document, to save time. I tried it with a 300 MB file, extracted about 40 pages (10 MB), and the conversion took around 30 seconds!

I am not sure whether this is available in Adobe Acrobat Reader DC (the free version of Acrobat). If it is, then that's great!

Well, that's all. I hope that this helps someone.

Notes

I noticed that saving the extracted pages as XML (which takes longer) also extracts the pictures and figures. I'm not really good with XML, but if someone knows how to use this output to import the pages into SuperMemo, please feel free to share!

Using ABBYY FineReader

I have used ABBYY FineReader (AF) for this for the last 2 years or so, and I have converted and imported many PDFs into SuperMemo this way. AF scans the entire PDF on import and uses an algorithm to determine what is text, images, tables, and so on. It is next to flawless, so most of the time you just verify all the pages in AF, delete any pages you do not need, and then export the whole thing to a single HTML file, which you then import into SuperMemo. The PDF formatting is preserved, and you can choose among several modes in AF (exact copy, flexible, text, etc.). It works like a charm and is the only way I do this. PS: I have also done this for multiple 100+ page ebooks, so it also works for really large and complex PDFs.

A side note on HTML import in SuperMemo: SuperMemo has some problems retaining the paths to image files when it imports HTML pages. If you have a local HTML page that links to images in the HTML file's folder, SuperMemo messes those links up on import. To fix this, before you import the HTML file into SuperMemo, replace the image paths in the HTML file with an absolute path, e.g. "C:/KS/LocalHtml/MyHtmlFile_Images". The easiest way to do this is to open the HTML file in an HTML or text editor and use search and replace.
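For anyone who would rather script the search and replace, here is a minimal Python sketch of the same idea. It is only an illustration under a few assumptions: the function name absolutize_image_paths is made up for this example, the images are referenced through ordinary quoted src attributes on img tags, and "C:/KS/LocalHtml" stands in for whatever folder your exported HTML file lives in. It has not been checked against what ABBYY FineReader or Acrobat actually emit, so inspect the result before importing.

```python
import re

# Matches the src attribute of an <img> tag. Group 1 is everything up to the
# opening quote, group 2 is the quote character, group 3 is the path itself.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2',
                     re.IGNORECASE | re.DOTALL)

# Paths that are already absolute: a URL scheme or drive letter ("http:",
# "file:", "C:"), or a leading slash or backslash.
ABSOLUTE = re.compile(r'^(?:[a-zA-Z][a-zA-Z0-9+.-]*:|/|\\)')


def absolutize_image_paths(html: str, base_dir: str) -> str:
    """Prefix every relative <img src> in html with base_dir.

    base_dir is assumed to be the folder that contains the HTML file,
    for example "C:/KS/LocalHtml" (a placeholder, not a required location).
    """
    base = base_dir.replace("\\", "/").rstrip("/")

    def fix(match: re.Match) -> str:
        prefix, quote, src = match.groups()
        if ABSOLUTE.match(src):
            return match.group(0)  # already absolute, leave it untouched
        relative = src.replace("\\", "/")
        if relative.startswith("./"):
            relative = relative[2:]
        return f"{prefix}{quote}{base}/{relative}{quote}"

    return IMG_SRC.sub(fix, html)


if __name__ == "__main__":
    sample = '<p><img src="MyHtmlFile_Images/fig1.png" alt="Figure 1"></p>'
    print(absolutize_image_paths(sample, "C:/KS/LocalHtml"))
```

To use it on a real export, read the HTML file into a string, pass it through the function together with the folder that holds the file, and write the result back out before importing into SuperMemo. Keep a copy of the original file, because a regular expression like this one can miss unusual markup such as unquoted attributes.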

Reply

To fix this, before you import your HTML file into SuperMemo, replace the path of files for the HTML file with an absolute path

AAAHH I used this in the good ol' days when I transferred my collection from Anki to SuperMemo. You reminded me of sweet times my friend.

Thank you for sharing!