the other hand, the user might want to develop a set of programs for the Arabic language or the Chinese language, for instance, so a Unicode-enabled lexical analyzer generator would be of great value. Consequently, two builds are available: an ASCII release and a Unicode release. During development, the program was compiled and tested under the ASCII build. When the application was complete, it was time to try the Unicode build. We thought that defining the aforementioned macro would make things work as expected. Unicode-encoded test files were prepared, and all that remained was to build the application under Unicode. The application did compile successfully, but it failed to read the input files of all test cases. The failures ranged from detecting an invalid sequence of characters at the beginning of the file to reading a spurious null byte before or after each character. When the input included Arabic letters, nothing related to Arabic was processed. We tried the same files with simple programs developed in C# and faced no problems.

We then wrote simple “Hello World” text files under different Unicode encodings and used binary editors to view the contents of these files. We found that the Unicode files always began with a fixed sequence of bytes that is independent of the actual text stored in the file, and that the bytes in this sequence differ according to the encoding in use. We correctly concluded that this sequence (the byte order mark, or BOM) helps an application determine the specific encoding used in the file. But this was not enough to tell us how to solve the problem. Indeed, this problem consumed an excessive amount of our time; such a problem was never expected. See references [40] – [48] for more about this problem.

We made a research plan for the whole matter, organized as a set of questions to be answered:
- What are the different encodings used to represent Unicode?
- How does IOStream work internally, and how does it deal with wide characters and different file encodings?
- How did other tools deal with Unicode?

The answer to the first question is quite long and is beyond the scope of this document. The answer to the second question is the topic of many books dedicated entirely to the IOStream library. For the third question, we were not surprised by the number of forums and websites that tackle the topic. However, we shall briefly and collectively present the results of the three questions and the solution of the problem in the following outline [35].

• C++ IOStream classes use encoder/decoder (code-conversion) facets to convert between the internal representation of characters and their external representation. If the characters are externally encoded using some encoding scheme, then an appropriate encoder/decoder object should be ‘imbued’ into the stream object.

• The most common Unicode encodings are UTF-8, UTF-16 BE, and UTF-16 LE; UTF-32 BE and UTF-32 LE are less widely used. UTF-8 is an 8-bit, variable-width encoding, backward compatible with ASCII, that uses one to four bytes per character.
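The fixed byte sequences described above can be recognized directly. The following is a minimal sketch of such a check; the function name and the returned labels are our own illustration, not part of any library. The BOM byte values themselves are the standard ones defined by the Unicode Standard:

```cpp
#include <cstddef>
#include <string>

// Identify a Unicode encoding from the byte order mark (BOM) at the
// start of a buffer holding the first bytes of a file.
// Note: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM
// (FF FE), so the longer sequences must be tested first.
std::string detect_bom(const unsigned char* buf, std::size_t n) {
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (n >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return "UTF-32 LE";
    if (n >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32 BE";
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16 LE";
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16 BE";
    return "unknown (no BOM)";
}
```

Note that a file without a BOM is not necessarily ASCII; UTF-8 files, in particular, are often written without one, so "unknown" here only means the encoding must be determined by other means.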
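The ‘imbue’ mechanism mentioned in the first bullet can be sketched as follows for a UTF-8 input file. This is a minimal illustration, not the tool's actual code: it uses the standard `std::codecvt_utf8` facet (deprecated since C++17 but still available) with `std::consume_header` so the facet skips a leading BOM; the function and file names are hypothetical. On Windows, where `wchar_t` is 16 bits, `std::codecvt_utf8_utf16` would be the appropriate facet instead.

```cpp
#include <codecvt>   // std::codecvt_utf8, std::consume_header
#include <fstream>
#include <locale>
#include <string>

// Read a UTF-8 encoded file into a wide string by imbuing the stream
// with a conversion facet before any characters are read.
std::wstring read_utf8_file(const char* path) {
    std::wifstream in(path);
    // consume_header: silently skip the EF BB BF BOM if present.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>));
    std::wstring text, line;
    while (std::getline(in, line))
        text += line + L'\n';
    return text;
}
```

Without the imbued facet, a wide stream decodes bytes using the default locale, which is exactly what produced the invalid sequences and spurious nulls described above.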