
Transcript
the other hand, the user might want to develop a set of programs for the Arabic
language or the Chinese language, for instance. So a Unicode-enabled lexical
analyzer generator would be valuable. Consequently, two builds would be made
available: an ASCII release and a Unicode release.
During development, the program was compiled and tested under the ASCII build.
When the application was complete, it was time to try the Unicode build. We
assumed that defining the aforementioned macro would make things work as
expected. Unicode-encoded test files were prepared, and all that remained was to
build the application under Unicode. The application did compile successfully,
but it failed to read the input files of all test cases. The failures ranged
from detecting an invalid sequence of characters at the beginning of the file to
reading a spurious null byte before or after each character. When the input
included Arabic letters, nothing related to Arabic was processed. We tried the
same files with simple programs developed in C# and faced no problems.
We started writing simple “Hello World” text files under different Unicode
encodings and using binary editors to inspect the contents of these files. We
found that the Unicode files always began with a fixed sequence of bytes
independent of the actual text stored in the file, and that the bytes in that
sequence differed according to the encoding in use. We correctly concluded that
this sequence (the byte order mark, or BOM) helps an application determine the
specific encoding used in the file. But this alone was not enough to tell us
how to solve the problem.
Indeed, this problem consumed an excessive amount of our time; we had never
expected such a problem. See references [40] – [48] for more about it. We made
a research plan for the whole matter, organized as a set of questions to be
answered:
- What are the different encodings used to represent Unicode?
- How does IOStream work internally, and how does it deal with wide
  characters and different file encodings?
- How did other tools deal with Unicode?
The answer to the first question is quite long and beyond the scope of this
document. The answer to the second question is the topic of many books
dedicated solely to the IOStream library. For the third question, we were not
surprised by the number of forums and websites that tackled the topic. However,
we briefly summarize the findings for all three questions, and the solution to
the problem, in the following outline [35].
•
C++ IOStream classes use encoder/decoder classes (codecvt facets) to
convert between the internal representation of characters and their
external representation. If the characters are externally encoded using
some encoding scheme, then an appropriate encoder/decoder object must be
‘imbued’ into the stream object.
•
The most common Unicode encodings are UTF-8, UTF-16 BE, and UTF-16 LE;
UTF-32 BE and LE are less widely used. UTF-8 is a variable-width
encoding, backward-compatible with ASCII, that uses one to four bytes per
character.