Chinese characters in UTF8 XML Q&A export are encoded wrong

From SuperMemopedia
Revision as of 20:43, 24 November 2005 by SuperMemoHelp (talk | contribs)

Environment:

  • SuperMemo 2004, build 12.05 (Jan 27 2005), on Windows XP SP2
  • SuperMemo CE 3.22 (2004-12-27), on Pocket PC 2003


After exporting some items from SuperMemo 2004 and importing them on the Pocket PC, they are not displayed correctly. I used the UTF8 XML Q&A export for this. Robert answered my initial question in the SM Yahoo group:

There is another problem with importing into Pocket PC using XML. I seem to remember that the format that either the desktop or the Pocket PC uses in its export won't import properly into Pocket PC. It uses a format with &amp;#xxxx; to represent a Unicode character, and whereas that works OK with Russian, with Chinese it causes a problem, and it needs to be in the format &#xxxx; where the x's are the Unicode number. In the files section there is a utility I wrote which makes the replacement, but you could do it in any text editor.
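
The replacement Robert describes can be sketched in a few lines of Python. This is not his utility, only an illustration. It assumes the exported file is UTF-8 and that the problematic codes are double-escaped numeric references of the form &amp;#xxxx;, which should become &#xxxx;. The names unescape_numeric_refs and fix_file are hypothetical.

```python
import re

# Assumption: the export contains double-escaped numeric references such as
# "&amp;#20013;" (decimal) or "&amp;#x4E2D;" (hexadecimal).
DOUBLE_ESCAPED = re.compile(r"&amp;(#(?:[0-9]+|[xX][0-9A-Fa-f]+);)")


def unescape_numeric_refs(text: str) -> str:
    """Turn "&amp;#NNNN;" into "&#NNNN;" and leave every other "&amp;" alone."""
    return DOUBLE_ESCAPED.sub(r"&\1", text)


def fix_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 export, fix the references, and write the result."""
    with open(src_path, encoding="utf-8") as src:
        fixed = unescape_numeric_refs(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(fixed)
```

Restricting the pattern to numeric references keeps a genuine escaped ampersand (for example in "A &amp; B") intact, which a blind search and replace of &amp; with & would not.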

But for me there is still a weird thing happening: after each Chinese word, a random ASCII character appears. This happens during the export and is not produced by Robert's utility. (My test data was in RTF items, which also seem to produce other problems with Chinese characters.) However, if I use XML export with the native charset instead of UTF8, this random character is not there. But in that case I cannot use the data in SM CE.


The newest version of SuperMemo for Pocket PC converts all &#xxxx; codes on import

The update is available from here:

Inside SuperMemo 2004

SuperMemo 2004 does not convert Unicode sequences stored in HTML during XML export (assuming conversion options are unchecked). It exports HTML 'as is'. The code &#xxxx; is probably present in the Chinese HTML source code as stored in HTML components.

SuperMemo DOES convert & to &amp; only when working with plain text that is to be displayed in HTML. This is to prevent wrong interpretation of & as an HTML code. This conversion occurs in:

  • displaying plain text in HTML
  • some options used in sending text via e-mail
  • Search&Replace options in HTML code
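
The effect of converting & to &amp; on plain text can be illustrated with Python's standard html module. This is only a sketch of the general technique, not SuperMemo's own code, and plain_text_to_html is a hypothetical name. It also shows how a numeric reference that passes through such a conversion ends up double-escaped.

```python
import html


def plain_text_to_html(text: str) -> str:
    """Escape &, < and > so plain text is shown literally in HTML."""
    return html.escape(text, quote=False)


# A literal ampersand is protected from being read as the start of an HTML code.
print(plain_text_to_html("Search&Replace"))

# A numeric reference treated as plain text becomes double-escaped.
print(plain_text_to_html("&#20013;"))
```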