Limited unicode support for internal viewer ASAP

SvA
Posts: 483
Joined: 29 Mar 2006, 02:41
Location: DE

Post by SvA »

The discussion in .reg files in internal viewer, the request for unicode search in Re: Which features?, and the issue with EBCDIC conversion in Viewer does not show eol properly for EBCDIC viewer made me think about content search in the search window and search in the viewer: how each relates to the codepage and the applied codepage conversion, and how that in turn would relate to Unicode/UTF-8.

My findings were the following:
  • In the search window, no conversion is applied automatically and there is no provision to select one. This can be looked at as if the text/hex given is converted to a binary pattern (in the case of text, according to Salamander's system ANSI codepage) and then a binary search is performed. Regular expression search is handled accordingly. Upper/lowercase conversion appears to be done for ASCII characters only (i.e. case is not ignored for accented or other non-ASCII characters, even in a case-insensitive search).
  • Search in the viewer appears to work the same way. However, it is not the byte pattern of the original file that is searched, but the one that results from applying the current conversion.
  • Hex view does not show the values stored in the file but the ones resulting from the selected conversion, in both the hex values and the textual representation.
  • You might have to ignore the BOM if present.
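To make the first finding concrete, here is a minimal Python sketch of the behaviour as I read it from the description above. It is only a model, not Salamander's actual code: the function names and the choice of cp1252 as the "system ANSI codepage" are assumptions for illustration.

```python
def ascii_fold(data: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z; every other byte is left alone."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


def find_text(haystack: bytes, needle: str,
              codepage: str = "cp1252", ignore_case: bool = False) -> int:
    """Encode the search text with the ANSI codepage, then search the raw bytes.

    Returns the byte offset of the first match, or -1 if there is none.
    """
    pattern = needle.encode(codepage, errors="replace")
    if ignore_case:
        # Case is ignored for ASCII only, so accented letters must match exactly.
        return ascii_fold(haystack).find(ascii_fold(pattern))
    return haystack.find(pattern)


ansi_file = "Größe".encode("cp1252")
utf8_file = "Größe".encode("utf-8")
hit = find_text(ansi_file, "grö", ignore_case=True)       # ASCII 'G' is folded
miss_case = find_text(ansi_file, "GRÖ", ignore_case=True)  # 'Ö' is not folded to 'ö'
miss_utf8 = find_text(utf8_file, "grö", ignore_case=True)  # byte pattern differs in UTF-8
```

The last call shows why a UTF-8 file cannot be searched for non-ASCII text this way: the ANSI byte pattern simply does not occur in the file.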
Taken together, this makes me think that providing basic Unicode support in the viewer might be a relatively small change:
  • You need detection code for UTF8/UTF16LE/UTF16BE.
  • You need a converter to convert text from those to the (ANSI) codepage used by the textbox control, converting unknown/invalid codepoints to some invalid-character marker (just as you already use '?' for characters that have no representation in the destination codepage).
  • You might need some changes to map character positions to byte positions in the file, as this is no longer a 1:1-relationship.
  • You might have to clear selections when switching between conversions that do not share the same character/byte mapping (i.e. to/from UTF8 and between a 1-byte code and a 2-byte code).
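The detection and conversion steps in the list above could look roughly like the following Python sketch. It assumes BOM-based detection with a strict UTF-8 validity check as the only fallback, and cp1252 as the target ANSI codepage; `detect_encoding` and `to_ansi` are hypothetical names, and UTF-16 without a BOM is deliberately not handled here.

```python
import codecs

# BOMs checked in this order; UTF-32 is out of scope for this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if nothing is recognised."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    try:
        data.decode("utf-8")  # strict check: any invalid sequence raises
        return "utf-8", 0
    except UnicodeDecodeError:
        return None, 0


def to_ansi(data: bytes, ansi: str = "cp1252") -> bytes:
    """Convert a UTF-8/UTF-16 file image to the ANSI codepage of the text control.

    The BOM is skipped, and anything invalid or not representable becomes '?'.
    """
    encoding, skip = detect_encoding(data)
    if encoding is None:
        raise ValueError("not recognised as UTF-8/UTF-16")
    text = data[skip:].decode(encoding, errors="replace")
    return text.encode(ansi, errors="replace")


western = to_ansi(codecs.BOM_UTF8 + "Größe".encode("utf-8"))
cyrillic = to_ansi(codecs.BOM_UTF16_LE + "Привет".encode("utf-16-le"))
```

Here `western` comes out as the plain cp1252 bytes, while `cyrillic` degrades to six '?' characters, which is exactly the "one codepage at a time" limitation.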
This is not full-fledged Unicode support, as it still supports only one codepage at a time, but most Unicode text files will probably contain only codepoints from the ANSI codepage anyway (e.g. log files, config files, .reg files ...).
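The character-to-byte mapping mentioned in the list above is the part that stops being trivial. One possible approach, sketched in Python under the assumption that the file decodes cleanly, is to build a table of byte offsets once per file; `char_to_byte_offsets` is a hypothetical helper, and a real viewer would more likely compute this per visible block than for the whole file.

```python
def char_to_byte_offsets(data: bytes, encoding: str, bom_length: int = 0) -> list:
    """Return a list where entry i is the file byte offset of character i.

    One extra entry at the end holds the offset just past the last character,
    so a selection [start, end) in characters maps to
    offsets[start]..offsets[end] in bytes.
    """
    text = data[bom_length:].decode(encoding)  # strict: raises on invalid input
    offsets = []
    pos = bom_length
    for ch in text:
        offsets.append(pos)
        pos += len(ch.encode(encoding))  # 1-4 bytes in UTF-8, 2 or 4 in UTF-16
    offsets.append(pos)
    return offsets


utf8_offsets = char_to_byte_offsets("aöb".encode("utf-8"), "utf-8")
utf16_offsets = char_to_byte_offsets("ab".encode("utf-16-le"), "utf-16-le")
```

With a single-byte codepage the table would just be 0, 1, 2, ..., which is why the current viewer never needed it.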

For files that cannot be converted (e.g. non-Unicode files, binary files, or files in a different text encoding) you might either force the user to switch to a different conversion or just display some error indication.

If possible, you might even consider using a different codepage in the textbox, e.g. to show Cyrillic or Greek text on a Western system, but then you might have to disable text search or switch that input control accordingly.

What do you think about it?
therube
Posts: 674
Joined: 14 Dec 2006, 06:22

Re: Limited unicode support for internal viewer ASAP

Post by therube »

(Since I didn't know: byte order mark (BOM) & UTF-8, UTF-16, UTF-32 & BOM. And since I happened to run into it here too, in AkelPad, I figured it was time to know :-).)
WinXP Pro SP3 or Win7 x86 | SS 2.54