Code Conversion

This chapter describes the features DirX Identity provides to perform code conversion. These features are based on the built-in capabilities of the Tcl 8.3 scripting language.

This chapter provides:

  • Basic usage of the code conversion features

  • Expert usage of the code conversion features

  • A systematic overview of the Tcl features for code conversion

  • Character sets

Please refer to the original Tcl documentation for more details.

Basic Usage

DirX Identity supports all Tcl capabilities described in the character sets section below.

The main functionality of the meta controller is controlled via the Tcl variable _localcode (as it was in previous versions). This chapter describes the common usage of this variable and the related behavior to

  • transfer information between LDAP and utf-8 coded files

  • transfer information between LDAP and files coded in another character set

  • control DirX Identity interactively from a terminal window

More detailed information (especially when you intend to program with Tcl) is provided in the following chapters.

Transfer of information between LDAP and utf-8 coded files

Set the _localcode variable to utf-8.

This setting lets you either read utf-8 coded files into LDAP or write LDAP data to utf-8 coded files.

This mode is the fastest mode available. If you need high-performance synchronizations, keep all data in utf-8 format.

Note that the setting of the _localcode variable influences the code conversion for all files (that is, you cannot handle different character sets for different files at the same time). If you want to handle files with different character sets, see the Expert Usage section.

Transfer of information between LDAP and files coded in another character set

Set the _localcode variable to the required character set (for example Latin1).

You can list the available character sets with the command encoding names. You can also extend the available character sets with additional ones. See the original Tcl 8.3 documentation.

This setting lets you either read files in that character set into LDAP or write LDAP data to files in that character set.

Note that the setting of the _localcode variable influences the code conversion for all files (that is, you cannot handle different character sets for different files at the same time). If you want to handle files with different character sets, see the Expert Usage section.

The DirX Identity default applications are delivered with _localcode set to Latin1 for compatibility with previous versions.
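As a small sketch of the steps above, a script can confirm that Tcl knows a character set before assigning it to _localcode. The helper name set_localcode_checked is hypothetical and not part of DirX Identity, and the sketch assumes that _localcode accepts Tcl encoding names as described in the Character Sets section. In a plain tclsh the variable has no effect.

```tcl
# Hypothetical helper (not part of DirX Identity): set _localcode only
# if Tcl reports the requested encoding as available.
proc set_localcode_checked {name} {
    global _localcode
    if {[lsearch -exact [encoding names] $name] < 0} {
        error "unknown character set: $name"
    }
    set _localcode $name
}

# iso8859-1 is the Tcl encoding name for Latin1.
set_localcode_checked iso8859-1
```

The legacy values UTF8, LATIN1 and PC850 are not Tcl encoding names, so a check like this one would reject them.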

Interactive operation from a terminal window

In this mode, you can define operations for DirX Identity interactively in a terminal window. The necessary code conversion is done automatically.

Note: for compatibility reasons you can still set the _localcode variable to PC850 if you work in a DOS box, but this setting is ignored. Tcl converts the terminal input and output automatically from and to utf-8 format.

You can define Unicode characters in strings in the form "\udddd" (for example, "\u592a" represents a Chinese glyph and "\u0041" represents the glyph "A"). To input utf-8 characters, you must use this Unicode notation.

For single-byte characters you can use the representation "\xdd", where Tcl assumes the first byte to be zero (for example, "\x41" represents the glyph "A"). Note that in Tcl 8.3 the \x escape ends only at the first character that is not a hexadecimal digit (0 to 9, a to f). Therefore, any immediately following characters that are hexadecimal digits must also be entered in \x notation.

Examples:

\xfc\x62\x65l results in the German word übel (b and e are hexadecimal digits, l is not).

f\xfcndig results in the German word fündig (n is not a hexadecimal digit).

Output conversion to the screen can result in the message ‘Not convertible’ if the data contains characters that cannot be represented in the terminal character set.

You can redirect the output to a file for which you can define any character set you want (use the fconfigure channel -encoding encoding command for this purpose; it is described in detail in the next chapters).

Output conversion to a file can result in the message ‘Not convertible’ if the data contains characters that cannot be represented in the encoding defined for that file.

If you want to use the source command with command files whose encoding differs from that of the terminal window, you have to read the full file content into a variable and evaluate it afterwards (see the Tcl documentation for details).

Expert Usage

This chapter provides detailed knowledge of the internal structure of the meta controller, which is necessary to understand its behavior and to program specific code conversions in Tcl scripts directly.

Metacp Architecture

To understand the capabilities of the code conversion mechanisms, you need to understand the internal architecture of metacp (see the next figure).

Figure 1. meta controller architecture for code conversions

Storage Area (Tcl variables and Tcl arrays)

Contains all Tcl variables and the Tcl arrays that are used for intermediate storage of the entries to be synchronized. Tcl assumes all internal strings to be in utf-8 format.

Tcl Interpreter

All Tcl functionality is available plus DirX specific extensions (for example to access the translation API or the File Handler). It handles the stdin (per default the keyboard) and the stdout (per default the screen) channels (1). These channels can be redirected to files (2).

It also contains commands to handle file input and output via other channels (3).

Via Tcl commands the storage area can be accessed directly. Variables and arrays can be processed and converted.

Translation API

Receives requests via its API functions and translates them into LDAP or DAP specific calls. Returns the results (for example search results) to the API functions. It is called by the Tcl extensions or from external programs (for example DirXmanage).

File Handler

Reads and writes files (for example LDIF, CSV or XML files) as specified in the attribute configuration to and from the storage area. It is called from the Tcl extensions. It can convert data between the internally used character set utf-8 and any of the provided character sets.

Code conversions

Code conversions are controlled by:

  • The metacp specific Tcl variable _localcode that can be used to set a global character set for all file handling (a default value). As value you can choose any of the provided character sets. For compatibility reasons the values UTF8, LATIN1 and PC850 are still supported.

  • The metacp specific settings of the -encoding parameter of the meta openconn and meta readattrconf commands for individual setting of character sets for specific files.

  • The Tcl specific settings of the system encoding (encoding system).

  • The Tcl specific settings of the different channels (stdin, stdout or individual channel definitions). These settings can be set or retrieved via the fconfigure command.

The next paragraphs describe typical applications of these settings.

Transfer of information between LDAP and utf-8 coded files

Set the _localcode variable to utf-8.

When reading files (4), the File Handler reads the file into the storage area of the Tcl arrays while no conversion is done. Internal Tcl routines can now work on these arrays (for example mapping functions (5)). The Tcl arrays can be accessed by the Translation API (6) to move the data to the LDAP API (7) and from there to the LDAP server (not shown in the picture). Because the data is already in utf-8 format, no conversion is performed.

Note: This mode is the fastest mode available. If you need high performance synchronizations, you should keep all data always in utf-8 format.

When writing files, the data is retrieved from the LDAP server ((7) - in utf-8 format) via the Translation API into the storage area (6). Internal Tcl routines can now work on these arrays ((5) - for example mapping functions). The File Handler can access the Tcl arrays and write them to files ((4) - no conversion is performed).

Transfer of information between LDAP and files coded in another character set

Set the _localcode variable to the required character set (for example Latin1).

When reading files, the File Handler reads the file (4) into the storage area of the Tcl arrays and converts it from the defined character set to utf-8. Internal Tcl routines can now work on these arrays ((5) - for example mapping functions). The Tcl arrays can be accessed by the Translation API (6) to move the data to the LDAP server (7). Because the data is already in utf-8 format, no further conversion is performed.

When writing files, the data is retrieved from the LDAP server ((7) in utf-8 format) via the Translation API into the storage area (6). Internal Tcl routines can now work on these arrays ((5) - for example mapping functions). The File Handler can access the Tcl arrays (4) and write them to files while converting them from utf-8 to the required character set.

Converting files from one character set to another

If you want to convert a file coded in character set A to a file in character set B you have to set the individual -encoding parameters for the meta openconn and meta readattrconf commands.

For example:

  • Open the tagged source file mysourcefile.data with character set iso-8859-6 with the command:

    • meta openconn -type FILE \
      -file mysourcefile.data \
      -mode READ \
      -format TAGGED \
      -encoding iso-8859-6 \
      -attrconf ah \
      -conn ch

  • You can read the data from the file to the metacp storage area. It is automatically converted from iso-8859-6 to utf-8 characters.

  • Perform all necessary handling of the Tcl variables and fields (all in utf-8 format) like mapping routines and other issues.

  • Open the untagged target file mytargetfile.data with character set unicode with the command:

    • meta openconn -type FILE \
      -file mytargetfile.data \
      -mode WRITE \
      -format NON-TAGGED \
      -encoding unicode \
      -attribute {o ou sn givenName telephoneNumber facsimileTelephoneNumber} \
      -attrconf ah \
      -conn ch

  • Write the information to the output file. The utf-8 characters are automatically converted to Unicode characters (in this example).

  • Close the files as usual.
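Outside metacp, the same read, convert and write pattern can be sketched in plain Tcl with two channels. This is only an illustration of the conversion path through utf-8: the helper name convert_file is hypothetical, it does not use the meta openconn mechanism, and it knows nothing about tagged or untagged file formats.

```tcl
# Hypothetical helper: copy a text file while converting its character set.
proc convert_file {srcFile srcEnc dstFile dstEnc} {
    set in [open $srcFile r]
    fconfigure $in -encoding $srcEnc
    set out [open $dstFile w]
    fconfigure $out -encoding $dstEnc
    # read converts from srcEnc to the internal utf-8 format;
    # puts converts from utf-8 to dstEnc.
    puts -nonewline $out [read $in]
    close $in
    close $out
}
```

With the file names and character sets of the example above, a call might look like convert_file mysourcefile.data iso8859-6 mytargetfile.data unicode.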

Interactive operation from a terminal window

For compatibility reasons you can set the _localcode variable to PC850 in a DOS box (this is no longer required because Tcl converts the terminal input and output automatically from and to utf-8 format).

You can use the escape sequences described in the next chapter to input Unicode or Hex characters in your input strings.

The commands, which are input at the terminal window, are sent to the Translation API and processed there. For example you can enter a search request to the LDAP server. After conversion from the terminal character set (1), all parameters are handled in utf-8 format and transferred to the LDAP server (7). The search result is returned also in utf-8 format (7) and converted to the terminal character set (1). Output conversion to the screen can result in the message ‘Not convertible’ if utf-8 characters are contained which cannot be represented with the terminal character set.

You can redirect the output to a file (2) for which you can define any character set you want (use the fconfigure channel -encoding encoding command for this purpose; see the next chapter for details).

If you want to use the source command with command files not in the format of the terminal window, you have to read the file content into a variable and evaluate it afterwards (see the Tcl documentation for details). A sample routine dxm_source to perform this task is contained in the DirX Identity Connectivity Configuration under Configuration → Tcl → Other Scripts → Common Script.
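The read-then-evaluate approach described above can be sketched as follows. The procedure source_encoded is a hypothetical illustration and is not the dxm_source routine shipped with DirX Identity; it assumes that the caller knows the encoding of the command file.

```tcl
# Hypothetical helper: evaluate a command file that is stored in an
# encoding other than the one the source command would assume.
proc source_encoded {fileName encodingName} {
    set fd [open $fileName r]
    fconfigure $fd -encoding $encodingName
    set script [read $fd]
    close $fd
    # Evaluate the script in the caller's scope, as source would.
    uplevel 1 $script
}
```

A call such as source_encoded myscript.tcl utf-8 (the file name is a placeholder) then evaluates the file regardless of the terminal or system encoding.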

Directly programmed code conversions

If you intend to create files directly (3) from the content in the storage area ((5) or (7)), you can open channels to those files and set the character set with the fconfigure -encoding command for each of them (see the next chapter for details).

You can also access the storage area via Tcl commands and perform character set conversions directly in memory ((5) or (7)). Be careful not to send anything other than utf-8 coded variables to the operating system (stdin, stdout), because this leads to unpredictable results.
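The following sketch illustrates why data that has already been converted in memory should not be passed to a channel that performs its own conversion. It uses only standard Tcl commands; the file name is a placeholder. The string is converted to utf-8 bytes by hand and then written to a channel that converts to utf-8 again, so the character ends up encoded twice.

```tcl
set text "\u00fc"                            ;# one character: ü
set bytes [encoding convertto utf-8 $text]   ;# two bytes: C3 BC

set fileName [file join [pwd] double_encoded.txt]
set fd [open $fileName w]
fconfigure $fd -encoding utf-8 -translation lf
# The channel treats each of the two bytes as a character and
# encodes it again.
puts -nonewline $fd $bytes
close $fd

set size [file size $fileName]   ;# 4 bytes instead of the intended 2
file delete $fileName
```

Writing $text instead of $bytes to the same channel produces the expected two bytes.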

External access of the Translation API

External programs can access the translation API directly (B). In this case automatic code conversions based on a special interface switch (_localstrings - not used by metacp) can be performed. If _localstrings is set to true, no conversion is performed (the strings are assumed to be in utf-8 format); otherwise a transformation from Latin1 to utf-8 is done.

The DAP functionality at the Translation API is only available for compatibility reasons.

Tcl Features

Tcl has changed its internal representation of characters to utf-8/Unicode; that is, it assumes that all strings are in this format whenever it must convert data for the operating system.

Unicode is a two byte (16 bit) representation used at the operating / user interface level.

utf-8 is a representation of Unicode that is identical to the ASCII character set for all characters from 00 to 7F. All other Unicode characters are represented by a 2- to 3-byte sequence in which the most significant bit of every byte is set. Tcl handles each of these sequences as one character in string operations (for example, in the string length command).
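As a brief illustration of the difference between characters and bytes, the following sketch counts both for a word that contains one non-ASCII character. It uses only standard Tcl commands; the byte count is obtained by converting the string to its utf-8 byte sequence first.

```tcl
set word "f\u00fcndig"                                       ;# fündig
set chars [string length $word]                              ;# 6 characters
set bytes [string length [encoding convertto utf-8 $word]]   ;# 7 bytes
puts "$chars characters, $bytes bytes"
```

The character ü occupies two bytes in utf-8, while the five ASCII letters occupy one byte each.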

Therefore, utf-8 is fully compatible to all scripts and strings that only use the ASCII character set.

Defining Unicode Characters

You can define Unicode characters in strings in the form "\udddd" (for example, "\u592a" represents a Chinese glyph and "\u0041" represents the glyph "A"). To input utf-8 characters, you must use this Unicode notation.

For single-byte characters you can use the representation "\xdd", where Tcl assumes the first byte to be zero (for example, "\x41" represents the glyph "A"). Note that in Tcl 8.3 the \x escape ends only at the first character that is not a hexadecimal digit (0 to 9, a to f). Therefore, any immediately following characters that are hexadecimal digits must also be entered in \x notation.

Examples:

\xfc\x62\x65l results in the German word übel (b and e are hexadecimal digits, l is not).

f\xfcndig results in the German word fündig (n is not a hexadecimal digit).
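The escape forms above can be compared in a short script. This is a minimal sketch that uses only standard Tcl commands; the variable names are arbitrary.

```tcl
set a1 "\u0041"          ;# "A" written as a Unicode escape
set a2 "\x41"            ;# "A" written as a single-byte escape
set w1 "\xfc\x62\x65l"   ;# übel, with b and e escaped as well
set w2 "f\xfcndig"       ;# fündig, the escape ends at n

puts [string equal $a1 $a2]             ;# both forms yield the same string
puts [string equal $w1 "\u00fcbel"]
puts [string equal $w2 "f\u00fcndig"]
```

Writing the \u form throughout avoids the question of where a \x escape ends.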

Tcl has built-in support for about 50 common character encodings. You can display the available encodings with the command:

encoding names

Important encodings are:

  • ascii - pure ascii (single-byte)

  • utf-8 - the internal character set of Tcl (multi-byte)

  • unicode - the Unicode character set (two-byte)

  • cp850 - the PC DOS character set for DOS shells (single-byte)

  • iso8859-1 - the Latin-1 (Windows) character set (single-byte)

Channel Based Conversion

You can define a character encoding for each input / output channel:

fconfigure channelid -encoding encoding

Example:

set fd [open $file r]
fconfigure $fd -encoding shiftjis

Tcl now automatically converts all characters read from the file from shiftjis (a widely used Japanese character set) to the internal utf-8 format.

The Tcl source command always reads files using the system encoding.

You can check the encoding for a specific channel with the command:

fconfigure channelid

Example:

fconfigure stdin
fconfigure $fd

In a DOS box, the encoding of stdin is set to cp850 (which is the PC850 character set).

System Encoding

The system encoding is the character encoding used by the operating system. Tcl automatically handles conversions between utf-8 and the system encoding when interacting with the operating system.

Tcl usually can determine a reasonable default system encoding based on the platform and locale settings. If not, it uses ISO8859-1 (Latin1) as default setting.

You can check the actual system encoding with

encoding system

You can redefine the system encoding with

encoding system encoding

This is not recommended because interaction with the operating system may then no longer work correctly. Use the fconfigure channelid -encoding encoding command instead.
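The recommendation above can be sketched as follows: the script only queries the system encoding and sets the character set on the individual channel. The file name is a placeholder and iso8859-1 is just an example value.

```tcl
set before [encoding system]

set fileName [file join [pwd] channel_demo.txt]
set fd [open $fileName w]
# Set the character set for this channel only.
fconfigure $fd -encoding iso8859-1
set channelEnc [fconfigure $fd -encoding]
close $fd
file delete $fileName

# The system encoding is not affected by the channel setting.
set after [encoding system]
puts "channel: $channelEnc, system: $after"
```

Other channels, including stdin and stdout, keep their own settings.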

String Conversion

Strings can be converted with the functions:

encoding convertfrom
encoding convertto

Example:

set ha [encoding convertfrom utf-8 "\xc3\xbc"]

On a DOS terminal window this should result in the echoed output "ü".
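A round trip through both functions might look like the following sketch. It uses only standard Tcl commands; iso8859-1 is an example target character set.

```tcl
set text "f\u00fcndig"                               ;# fündig
# Convert the internal string to Latin-1 bytes (one byte per character here).
set latin1 [encoding convertto iso8859-1 $text]
# Convert the bytes back to an internal string.
set back [encoding convertfrom iso8859-1 $latin1]

puts [string length $latin1]      ;# 6
puts [string equal $text $back]   ;# 1
```

The result of encoding convertto is a byte sequence intended for storage or transfer, so it should be converted back with encoding convertfrom before it is processed as text again.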

Character Sets

Tcl 8.3 supports a variety of encodings (see install_path\lib\tcl8.3\encoding on Windows or install_path/lib/tcl8.3/encoding on UNIX). The encodings listed in this directory can be used in the _localcode variable by dropping the file suffix ".enc" from the encoding file name.

Example:

  • Encoding file name: cp850.enc
    _localcode must be set to "cp850".

The following character sets or code sets are available:

  • ascii

    big5

    cp437
    cp737
    cp775
    cp850 (also PC850 for use in _localcode variable)
    cp852
    cp855
    cp857
    cp860
    cp861
    cp862
    cp863
    cp864
    cp865
    cp866
    cp869
    cp874
    cp932
    cp936
    cp949
    cp950
    cp1250
    cp1251
    cp1252
    cp1253
    cp1254
    cp1255
    cp1256
    cp1257
    cp1258

    dingbats

    euc-cn
    euc-jp
    euc-kr

    gb12345
    gb1988
    gb2312

    identity

    iso2022
    iso2022-jp
    iso2022-kr
    iso8859-1 (also Latin1 for use in _localcode variable)
    iso8859-2
    iso8859-3
    iso8859-4
    iso8859-5
    iso8859-6
    iso8859-7
    iso8859-8
    iso8859-9

    jis0201
    jis0208
    jis0212

    koi8-r

    ksc5601

    macCentEuro
    macCroatian
    macCyrillic
    macDingbats
    macGreek
    macIceland
    macJapan
    macRoman
    macRomania
    macThai
    macTurkish
    macUkraine

    shiftjis

    symbol

    unicode

    utf-8 (also UTF8 for use in _localcode variable)

For compatibility reasons the values UTF8, LATIN1 and PC850 are also supported for the _localcode variable.

You can extend the available character sets by writing your own character conversion tables. See the original Tcl 8.3 documentation for details.