Code Conversion
This chapter describes the features DirX Identity provides to perform code conversion. These features are based on the built-in capabilities of the Tcl 8.3 scripting language.
This chapter provides:
-
Basic usage of the code conversion features
-
Expert usage of the code conversion features
-
A systematic overview of the Tcl features for code conversion
-
Character sets
Please refer to the original Tcl documentation for more details.
Basic Usage
DirX Identity supports all Tcl capabilities described in the character sets section below.
The main code conversion functionality of the meta controller is controlled by the Tcl variable _localcode (as in previous versions). This chapter describes the common usage of this variable and the related behavior when you:
-
transfer information between LDAP and utf-8 coded files
-
transfer information between LDAP and files coded in another character set
-
control DirX Identity interactively from a terminal window
More detailed information (especially when you intend to program with Tcl) is provided in the following chapters.
Transfer of information between LDAP and utf-8 coded files
Set the _localcode variable to utf-8.
This lets you either read utf-8 coded files into LDAP or write from LDAP to utf-8 coded files.
| This mode is the fastest mode available. If you need high performance synchronizations, you should keep all data always in utf-8 format. |
| The setting of the _localcode variable influences the code conversion for all files (that is, you cannot handle different character sets for different files at the same time). If you want to handle files with different character sets, refer to the Expert Usage section. |
Transfer of information between LDAP and files coded in another character set
Set the _localcode variable to the required character set (for example Latin1).
You can list the available character sets with the command encoding names. It is also possible to extend the available character sets with additional ones. See the original Tcl 8.3 documentation.
Setting _localcode in this way lets you either read files in that character set into LDAP or write from LDAP to files in that character set.
Note that the setting of the _localcode variable influences the code conversion for all files (that is, you cannot handle different character sets for different files at the same time). If you want to handle files with different character sets, refer to the Expert Usage section.
The DirX Identity default applications are delivered with _localcode set to Latin1 for compatibility with previous versions.
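Before assigning a name to _localcode, it can help to confirm that the Tcl interpreter actually knows the encoding. The following is a minimal sketch in plain Tcl. The helper name encoding_available is hypothetical and not part of DirX Identity, and the final assignment only has an effect when the script runs inside metacp.

```tcl
# Hypothetical helper (not a DirX Identity command): returns 1 if Tcl
# knows an encoding with exactly this name, 0 otherwise.
proc encoding_available {name} {
    expr {[lsearch -exact [encoding names] $name] >= 0}
}

# Choose the character set for file handling only if Tcl can convert it.
set wanted iso8859-1
if {[encoding_available $wanted]} {
    set _localcode $wanted
} else {
    puts stderr "encoding $wanted is not installed"
}
```

The sketch checks the Tcl name of the encoding (for example iso8859-1). The compatibility values Latin1, UTF8 and PC850 are specific to _localcode and are not standard Tcl encoding names, so they should not be expected in the output of encoding names.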
Interactive operation from a terminal window
In this mode, you can define operations for DirX Identity interactively in a terminal window. The necessary code conversion is done automatically.
Note: For compatibility reasons, you can still set the _localcode variable to PC850 when you work in a DOS box, but the setting is ignored because Tcl automatically converts the terminal input and output from and to utf-8 format.
You can define Unicode characters in strings in the form "\udddd" (for example "\u592a" represents a Chinese glyph and "\u0041" represents the glyph "A"). To enter a character that is stored as a multi-byte utf-8 sequence, use this Unicode notation.
For single-byte characters you can use the representation "\xdd", where Tcl assumes the high-order byte to be zero (for example "\x41" represents the glyph "A"). Note that Tcl ends a \x sequence only at the first character that is not a hexadecimal digit (0 to 9, a to f). If the characters that follow are themselves hexadecimal digits, you must therefore also enter them in \x notation.
Examples:
\xfc\x62\x65l results in the German word übel (b and e are in the range a to f, l is not).
f\xfcndig results in the German word fündig (n is not in the range a to f).
| Output conversion to the screen can result in the message ‘Not convertible’ if utf-8 characters are contained which cannot be represented with the terminal character set. |
You can redirect the output to a file for which you can define any character set you want (use the fconfigure channel -encoding encoding command for this purpose; it is described in detail in the following chapters).
| Output conversion to a file can result in the message ‘Not convertible’ if utf-8 characters are contained which cannot be represented with the defined encoding set. |
If you want to use the source command with command files not in the format of the terminal window, you have to read the full file content into a variable and evaluate it afterwards (see the Tcl documentation for details).
Expert Usage
This chapter provides detailed knowledge of the internal structure of the meta controller, which is necessary to understand its behavior and to program specific code conversions in Tcl scripts directly.
Metacp Architecture
To understand the capabilities of the code conversion mechanisms, you need to understand the internal architecture of metacp (see next figure).
Storage Area (Tcl variables and Tcl arrays)
Contains all Tcl variables and the Tcl arrays that are used for intermediate storage of the entries to be synchronized. Tcl assumes all internal strings to be in utf-8 format.
Tcl Interpreter
All Tcl functionality is available plus DirX specific extensions (for example to access the translation API or the File Handler). It handles the stdin (per default the keyboard) and the stdout (per default the screen) channels (1). These channels can be redirected to files (2).
It also contains commands to handle file input and output via other channels (3).
Via Tcl commands the storage area can be accessed directly. Variables and arrays can be processed and converted.
Translation API
Receives requests via its API functions and translates them into LDAP or DAP specific calls. It returns the results (for example search results) to the API functions. It is called by the Tcl extensions or from external programs (for example DirXmanage).
File Handler
Reads and writes files (for example LDIF, CSV or XML files) as specified in the attribute configuration to and from the storage area. It is called from the Tcl extensions. It can convert data between the internally used character set utf-8 and any of the provided character sets.
Code conversions
Code conversions are controlled by:
-
The metacp specific Tcl variable _localcode, which sets a global character set for all file handling (a default value). You can choose any of the provided character sets as its value. For compatibility reasons the values UTF8, LATIN1 and PC850 are still supported.
-
The metacp specific settings of the -encoding parameter of the meta openconn and meta readattrconf commands for individual setting of character sets for specific files.
-
The Tcl specific settings of the system encoding (encoding system).
-
The Tcl specific settings of the different channels (stdin, stdout or individual channel definitions). These settings can be set or retrieved via the fconfigure command.
The next paragraphs describe typical applications of these settings.
Transfer of information between LDAP and utf-8 coded files
Set the _localcode variable to utf-8.
When reading files (4), the File Handler reads the file into the storage area of the Tcl arrays while no conversion is done. Internal Tcl routines can now work on these arrays (for example mapping functions (5)). The Tcl arrays can be accessed by the Translation API (6) to move the data to the LDAP API (7) and from there to the LDAP server (not shown in the picture). Because the data is already in utf-8 format, no conversion is performed.
Note: This mode is the fastest mode available. If you need high performance synchronizations, you should keep all data always in utf-8 format.
When writing files, the data is retrieved from the LDAP server ((7) - in utf-8 format) via the Translation API into the storage area (6). Internal Tcl routines can now work on these arrays (for example mapping functions (5)). The File Handler can access the Tcl arrays and write them to files ((4) - no conversion is performed).
Transfer of information between LDAP and files coded in another character set
Set the _localcode variable to the required character set (for example Latin1).
When reading files, the File Handler reads the file (4) into the storage area of the Tcl arrays while a conversion from the defined character set to utf-8 is done. Internal Tcl routines can now work on these arrays ((5) - for example mapping functions). The Tcl arrays can be accessed by the Translation API (6) to move the data to the LDAP server (7). Because the data is already in utf-8 format, no conversion is performed.
When writing files, the data is retrieved from the LDAP server ((7) in utf-8 format) via the Translation API into the storage area (6). Internal Tcl routines can now work on these arrays ((5) - for example mapping functions). The File Handler can access the Tcl arrays (4) and write it to files while converting it from utf-8 to the required character set.
Converting files from one character set to another
If you want to convert a file coded in character set A to a file in character set B you have to set the individual -encoding parameters for the meta openconn and meta readattrconf commands.
For example:
-
Open the tagged source file mysourcefile.data with character set iso-8859-6 with the command:
meta openconn -type FILE \
    -file mysourcefile.data \
    -mode READ \
    -format TAGGED \
    -encoding iso-8859-6 \
    -attrconf ah \
    -conn ch
-
Read the data from the file into the metacp storage area. It is automatically converted from iso-8859-6 to utf-8 characters.
-
Perform all necessary handling of the Tcl variables and fields (all in utf-8 format), such as mapping routines.
-
Open the untagged target file mytargetfile.data with character set unicode with the command:
meta openconn -type FILE \
    -file mytargetfile.data \
    -mode WRITE \
    -format NON-TAGGED \
    -encoding unicode \
    -attribute {o ou sn givenName telephoneNumber facsimileTelephoneNumber} \
    -attrconf ah \
    -conn ch
-
Write the information to the output file. The utf-8 characters are automatically converted to the unicode character set (in this example).
-
Close the files as usual.
Interactive operation from a terminal window
For compatibility reasons you can set the _localcode variable to PC850 in a DOS box (this is no longer required because Tcl converts the terminal input and output automatically from and to utf-8 format).
You can use the escape sequences described in the next chapter to input Unicode or Hex characters in your input strings.
The commands, which are input at the terminal window, are sent to the Translation API and processed there. For example you can enter a search request to the LDAP server. After conversion from the terminal character set (1), all parameters are handled in utf-8 format and transferred to the LDAP server (7). The search result is returned also in utf-8 format (7) and converted to the terminal character set (1). Output conversion to the screen can result in the message ‘Not convertible’ if utf-8 characters are contained which cannot be represented with the terminal character set.
You can redirect the output to a file (2) for which you can define any character set you want (use the fconfigure channel -encoding encoding command for this purpose; for details, see the next chapter).
If you want to use the source command with command files not in the format of the terminal window, you have to read the file content into a variable and evaluate it afterwards (see the Tcl documentation for details). A sample routine dxm_source to perform this task is contained in the DirX Identity Connectivity Configuration under Configuration → Tcl → Other Scripts → Common Script.
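The read-then-evaluate approach can be sketched as follows. This is not the dxm_source routine delivered with DirX Identity. The procedure name source_with_encoding and its arguments are hypothetical, and the sketch uses only standard Tcl commands.

```tcl
# Hypothetical stand-in for a source command that honours a file encoding.
# Reads the whole script using the given encoding (converting it to Tcl's
# internal format), then evaluates it in the caller's scope.
proc source_with_encoding {fileName encodingName} {
    set fd [open $fileName r]
    fconfigure $fd -encoding $encodingName
    set script [read $fd]
    close $fd
    uplevel 1 $script
}
```

A command file written in the cp850 character set could then be evaluated with source_with_encoding commands.tcl cp850 (the file name is only an example), regardless of the character set of the terminal window.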
Directly programmed code conversions
If you intend to create files directly (3) from the content in the storage area ((5) or (7)), you can open channels to those files and set the character set with the fconfigure -encoding command for each of them (see the next chapter for details).
You can also access the storage area via Tcl commands and perform character set conversions directly in memory ((5) or (7)). Take care to send only utf-8 coded variables to the operating system channels (stdin, stdout). Anything else leads to unpredictable results.
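As an illustration of channel-based conversion in a script, the following minimal sketch copies a text file from one character set to another using only standard Tcl commands. The procedure name convert_file is hypothetical, and the sketch reads the whole file into memory, so it assumes files of moderate size.

```tcl
# Hypothetical helper: copy a text file while changing its character set.
proc convert_file {inName inEnc outName outEnc} {
    set in [open $inName r]
    fconfigure $in -encoding $inEnc
    set out [open $outName w]
    fconfigure $out -encoding $outEnc
    # read converts from inEnc to Tcl's internal format;
    # puts converts from the internal format to outEnc.
    puts -nonewline $out [read $in]
    close $in
    close $out
}
```

For example, convert_file latin.txt iso8859-1 utf8.txt utf-8 (file names chosen for illustration) would rewrite a Latin-1 file as utf-8. The data passes through Tcl's internal format in between, so any in-memory processing can be placed between the read and the puts.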
External access of the Translation API
External programs can access the translation API directly (B). In this case automatic code conversions based on a special interface switch (_localstrings - not used by metacp) can be performed. If _localstrings is set to true, no conversion is performed (the strings are assumed to be in utf-8 format), otherwise a transformation from Latin1 to utf-8 is done.
The DAP functionality at the Translation API is only available for compatibility reasons.
Tcl Features
Tcl has changed the internal representation of characters to utf-8/Unicode, that is, it assumes all strings to be in this format when conversions to or from the operating system have to be done.
Unicode is a two byte (16 bit) representation used at the operating / user interface level.
utf-8 is a representation of Unicode that is identical to the ASCII character set for all characters from 00 to 7F. All other Unicode characters are represented by a 2- to 3-byte sequence in which the most significant bit of the first byte is set. Tcl handles each of these byte sequences as one character when processing strings (for example in the string length command).
Therefore, utf-8 is fully compatible to all scripts and strings that only use the ASCII character set.
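The relationship between characters and utf-8 bytes can be checked with standard Tcl commands. The following is a minimal sketch. The helper name utf8_byte_count is hypothetical.

```tcl
# Hypothetical helper: number of bytes a string occupies when it is
# encoded as utf-8. encoding convertto returns the encoded bytes, and
# string length on that result counts bytes.
proc utf8_byte_count {text} {
    string length [encoding convertto utf-8 $text]
}

# Each sample is a single character, but the utf-8 size differs.
foreach {label text} [list ascii "A" latin "\u00fc" cjk "\u592a"] {
    puts "$label: [string length $text] character, [utf8_byte_count $text] byte(s)"
}
```

The ASCII character needs one byte, "\u00fc" needs two and "\u592a" needs three, while string length reports one character in each case.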
Defining Unicode Characters
You can define Unicode characters in strings in the form "\udddd" (for example "\u592a" represents a Chinese glyph and "\u0041" represents the glyph "A"). To enter a character that is stored as a multi-byte utf-8 sequence, use this Unicode notation.
For single-byte characters you can use the representation "\xdd", where Tcl assumes the high-order byte to be zero (for example "\x41" represents the glyph "A"). Note that Tcl ends a \x sequence only at the first character that is not a hexadecimal digit (0 to 9, a to f). If the characters that follow are themselves hexadecimal digits, you must therefore also enter them in \x notation.
Examples:
\xfc\x62\x65l results in the German word übel (b and e are in the range a to f, l is not).
f\xfcndig results in the German word fündig (n is not in the range a to f).
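The two notations can be compared directly in a script. This minimal sketch uses only standard Tcl commands, and the variable names are chosen for illustration.

```tcl
# The same word written with \x escapes and with a \u escape.
set viaHex     "\xfc\x62\x65l"
set viaUnicode "\u00fcbel"

# Both spellings produce the identical four-character string.
puts [string equal $viaHex $viaUnicode]
puts [string length $viaHex]

# \x41 and \u0041 both denote the glyph "A".
puts [string equal "\x41" "\u0041"]
```

Because a \u sequence reads at most four hexadecimal digits, "\u00fcbel" can be followed by the letters b and e without further escaping, which makes it the less error-prone spelling when hexadecimal digits follow the special character.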
Tcl has built-in functionality for about 50 common character encodings. You can display the available encodings with the command:
encoding names
Important encodings are:
-
ascii - pure ascii (single-byte)
-
utf-8 - the internal character set of Tcl (multi-byte)
-
unicode - the Unicode character set (two-byte)
-
cp850 - the PC DOS character set for DOS shells (single-byte)
-
iso8859-1 - the Latin-1 (Windows) character set (single-byte)
Channel Based Conversion
You can define a character encoding for each input / output channel:
fconfigure channelid -encoding encoding
Example:
set fd [open $file r]
fconfigure $fd -encoding shiftjis
Tcl now automatically converts all shiftjis-coded characters from the file to the internal utf-8 format (shiftjis is a widely used Japanese character set).
| The Tcl source command always reads files using the system encoding. |
You can check the encoding for a specific channel with the command:
fconfigure channelid
Example:
fconfigure stdin
fconfigure $fd
| In a DOS box the encoding is set to cp850 (which is the PC850 character set). |
System Encoding
The system encoding is the character encoding used by the operating system. Tcl automatically handles conversions between utf-8 and the system encoding when interacting with the operating system.
Tcl can usually determine a reasonable default system encoding based on the platform and locale settings. If not, it uses ISO8859-1 (Latin1) as the default setting.
You can check the actual system encoding with
encoding system
You can redefine the system encoding with
encoding system encoding
This is not recommended because interaction with the operating system might then no longer work correctly. Use the fconfigure channelid -encoding encoding command instead.
String Conversion
Strings can be converted with the functions:
encoding convertfrom
encoding convertto
Example:
set ha [encoding convertfrom utf-8 "\xc3\xbc"]
On a DOS terminal window this should result in the echoed output "ü".
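Both commands can be combined into a round trip. The following minimal sketch converts a string to the cp850 character set and back, and exposes the intermediate bytes as hexadecimal text. It assumes the cp850 encoding is installed (see encoding names), and the variable names are chosen for illustration.

```tcl
# Internal (Unicode) string containing one non-ASCII character.
set text "\u00fcbel"

# convertto: internal format -> bytes in the target character set.
set bytes [encoding convertto cp850 $text]

# Show the byte values as hexadecimal text for inspection.
binary scan $bytes H* hex
puts $hex

# convertfrom: bytes in the given character set -> internal format.
set back [encoding convertfrom cp850 $bytes]
puts [string equal $text $back]
```

In cp850 the character "\u00fc" occupies the single byte 81, so the word is four bytes long, whereas its utf-8 form would need five.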
Character Sets
Tcl 8.3 supports a variety of encodings (see install_path\lib\tcl8.3\encoding on Windows or install_path/lib/tcl8.3/encoding on UNIX). The encodings provided as files in this directory can be used in the _localcode variable by dropping the file suffix ".enc".
Example:
-
Encoding file name: cp850.enc
_localcode must be set to "cp850".
The following character sets or code sets are available:
-
ascii
big5
cp437
cp737
cp775
cp850 (also PC850 for use in _localcode variable)
cp852
cp855
cp857
cp860
cp861
cp862
cp863
cp864
cp865
cp866
cp869
cp874
cp932
cp936
cp949
cp950
cp1250
cp1251
cp1252
cp1253
cp1254
cp1255
cp1256
cp1257
cp1258
dingbats
euc-cn
euc-jp
euc-kr
gb12345
gb1988
gb2312
identity
iso2022
iso2022-jp
iso2022-kr
iso8859-1 (also Latin1 for use in _localcode variable)
iso8859-2
iso8859-3
iso8859-4
iso8859-5
iso8859-6
iso8859-7
iso8859-8
iso8859-9
jis0201
jis0208
jis0212
koi8-r
ksc5601
macCentEuro
macCroatian
macCyrillic
macDingbats
macGreek
macIceland
macJapan
macRoman
macRomania
macThai
macTurkish
macUkraine
shiftjis
symbol
unicode
utf-8 (also UTF8 for use in _localcode variable)
| For compatibility reasons the values UTF8, LATIN1 and PC850 are also supported for the _localcode variable. |
| You can extend the available character sets by writing your own character conversion tables. See the original Tcl 8.3 documentation for details. |