Limited unicode support for internal viewer ASAP

SvA · Post by **SvA** » 10 Apr 2012, 17:25

The discussion in ".reg files in internal viewer", the request for Unicode search in "Re: Which features?", and the EBCDIC conversion issue in "Viewer does not show eol properly for EBCDIC" made me think about content search in the search window and search in the viewer: how each relates to the codepage and the applied codepage conversion, and how that in turn would relate to Unicode/UTF-8.

My findings were the following:

In the search window, no conversion is applied automatically and there is no provision to select one. You can think of it this way: the text/hex given is converted to a binary pattern (for text, according to Salamander's system ANSI codepage), and then a binary search is performed. Regular expression search is handled accordingly. Upper/lowercase conversion appears to be done for ASCII characters only (i.e. case is not ignored for accented or foreign characters, even in a case-insensitive search).
Search in the viewer appears to work accordingly. However, the search does not run on the byte pattern of the original file but on the bytes produced by the currently applied conversion.
Hex view does not show the values stored in the file, but the ones resulting from the selected conversion, in both the hex values and the textual representation.
You might have to ignore the BOM if present.
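
To illustrate the search behaviour described in these findings, here is a minimal Python sketch. It is not Salamander's code: `ascii_lower` and `find_text` are hypothetical names, and cp1252 merely stands in for the system ANSI codepage.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z; every other byte is left unchanged."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


def find_text(haystack: bytes, needle: str, codepage: str = "cp1252",
              ignore_case: bool = False) -> int:
    """Encode the search text with the codepage, then do a plain byte search.

    Returns the byte offset of the first match, or -1 if there is none.
    """
    pattern = needle.encode(codepage)
    if ignore_case:
        # ASCII-only folding: accented letters still have to match exactly.
        haystack = ascii_lower(haystack)
        pattern = ascii_lower(pattern)
    return haystack.find(pattern)
```

Under this model, searching for "é" in a UTF-8 file finds nothing, because the pattern is the single byte 0xE9 while the file holds the two bytes 0xC3 0xA9.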

Taken together, this makes me think that providing basic Unicode support in the viewer might be a relatively small change:

You need detection code for UTF-8/UTF-16LE/UTF-16BE.
You need a converter to convert text from those to the (ANSI) codepage used by the textbox control, converting unknown/invalid codepoints to some invalid-character marker (just as you already use '?' for characters that have no representation in the destination codepage).
You might need some changes to map character positions to byte positions in the file, as this is no longer a 1:1 relationship.
You might have to clear selections when switching between conversions that do not share the same character/byte mapping (i.e. to/from UTF-8 and between a 1-byte code and a 2-byte code).
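
A rough Python sketch of the first three steps (detection, conversion, position mapping) might look like the following. All names (`BOMS`, `detect_encoding`, `to_ansi`) are made up for illustration, cp1252 is assumed as a single-byte ANSI codepage, and the BOM-less UTF-8 check is only a heuristic. The sketch accepts well-formed input only; replacing invalid byte sequences with a marker would need a decoder that reports how many bytes each replacement consumed, which is left out here.

```python
import codecs

# BOM signatures for the three encodings mentioned above.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_encoding(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if nothing is recognised."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # Heuristic for BOM-less files: accept UTF-8 only if it decodes cleanly.
    try:
        data.decode("utf-8")
        return "utf-8", 0
    except UnicodeDecodeError:
        return None, 0


def to_ansi(data: bytes, codepage: str = "cp1252"):
    """Convert Unicode file bytes to ANSI bytes plus a position map.

    Returns (ansi, offsets), where offsets[i] is the byte offset in the
    original file at which character i (shown as ansi[i]) starts.
    Raises ValueError if the data is not well-formed Unicode text.
    """
    encoding, skip = detect_encoding(data)
    if encoding is None:
        raise ValueError("not a recognised Unicode text file")
    # Strict decoding; UnicodeDecodeError is a subclass of ValueError.
    text = data[skip:].decode(encoding)

    offsets = []
    pos = skip  # the BOM is skipped but still counts toward file offsets
    for ch in text:
        offsets.append(pos)
        pos += len(ch.encode(encoding))  # 1-4 bytes in UTF-8, 2 or 4 in UTF-16

    # Characters missing from the codepage become '?', one byte per character.
    ansi = text.encode(codepage, errors="replace")
    return ansi, offsets
```

For example, a UTF-8 file with a BOM containing "aé€Ω" would be displayed as the four ANSI bytes `a`, `é`, `€`, `?`, and a selection starting at the third character would map back to byte offset 6 in the file.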

This is not full-fledged Unicode support, as it still supports only one codepage at a time, but most Unicode text files probably contain only codepoints from the ANSI codepage anyway (e.g. log files, config files, .reg files ...).
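
Whether a given file really stays within the ANSI codepage is easy to check. The following Python sketch (the name `lossy_characters` is hypothetical, and cp1252 is again only an assumed codepage) lists the characters that would end up as '?':

```python
def lossy_characters(text: str, codepage: str = "cp1252") -> list:
    """Return the distinct characters of text that the codepage cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)
```

An empty result means the file can be shown in the ANSI textbox without any loss.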

For files that cannot be converted (e.g. non-Unicode files, binary files or files in a different text encoding), you might choose to force the user to switch to a different conversion, or just display some error indication.

If possible, you might even consider using a different codepage in the textbox, e.g. to show Cyrillic or Greek text on a Western system, but then you might have to disable text search or switch that input control accordingly.

What do you think about it?

therube · Post by **therube** » 10 Apr 2012, 18:02

(Since I didn't know: byte order mark (BOM) & UTF-8, UTF-16, UTF-32 & BOM. And since I happened to run into it here too, AkelPad, I figured it was time to know.)
