
Transcript
the other hand, the user might want to develop a set of programs for the Arabic
language or the Chinese language, for instance. So a Unicode-enabled lexical
analyzer generator would be valuable. Consequently, two builds would be made
available: an ASCII release and a Unicode release.
During development, the program was compiled and tested under the ASCII build.
When the application was complete, it was time to try the Unicode build. We
assumed that defining the aforementioned macro would make things work as
expected. Unicode-encoded test files were prepared, and all that remained was to
build the application under Unicode. The application did compile successfully,
but it failed to read the input files of all test cases. The failures ranged
from detecting an invalid sequence of characters at the beginning of the file to
reading a spurious null byte before or after each character. When the input
included Arabic letters, nothing related to Arabic was processed. We tried the
same files with simple programs developed in C# and faced no problems.
We started writing simple “Hello World” text files under different Unicode
encodings and using binary editors to inspect the contents of these files. We
found that the Unicode files always began with a fixed sequence of bytes
independent of the actual text stored in the file, and that the bytes in that
sequence differed according to the encoding in use. We correctly concluded that
this sequence (the byte order mark, or BOM) helps an application determine the
specific encoding used in the file. But this alone was not enough to tell us
how to solve the problem.
Indeed, this problem consumed an excessive amount of our time; we had never
expected such a problem. See references [40] – [48] for more about it. We made
a research plan for the whole matter, organized as a set of questions to be
answered:
- What are the different encodings used to represent Unicode?
- How does IOStream work internally, and how does it deal with wide
  characters and different file encodings?
- How did other tools deal with Unicode?
The answer to the first question is quite long and beyond the scope of this
document. The answer to the second question is the topic of many books
dedicated solely to the IOStream library. For the third question, we were not
surprised by the number of forums and websites that tackled the topic. However,
we briefly summarize the findings for all three questions, and the solution to
the problem, in the following outline [35].
•
C++ IOStream classes use encoder/decoder classes (codecvt facets) to
convert between the internal representation of characters and their
external representation. If the characters are externally encoded using
some encoding scheme, then an appropriate encoder/decoder object must be
‘imbued’ into the stream object.
•
The most common Unicode encodings are UTF-8, UTF-16 BE, and UTF-16 LE;
UTF-32 BE and LE are less widely used. UTF-8 is a variable-width
encoding, backward-compatible with ASCII, that uses one to four bytes per
character.