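gbk_欧元符号的兼容性问题 (GBK euro sign compatibility issue)
Differences
This shows the differences between the revision you selected and the current version.
Next revision | Previous revision | ||
gbk_欧元符号的兼容性问题 [2023/07/14 14:17] – created MNBVC项目组 (MNBVC project team) | gbk_欧元符号的兼容性问题 [2025/06/02 15:17] (current) – external edit 127.0.0.1 | ||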
---|---|---|---|
Line 10: | Line 10: | ||
Below is Microsoft's detailed explanation of this issue: | Below is Microsoft's detailed explanation of this issue: | ||
- | ``` | + | |
What is GB18030? | What is GB18030? | ||
GB18030–2000 is a new Chinese character encoding standard. The standard contains many characters and has some tough new conformance requirements. GB18030-2000 encodes characters in sequences of one, two, or four bytes. These sequences are defined as follows: | GB18030–2000 is a new Chinese character encoding standard. The standard contains many characters and has some tough new conformance requirements. GB18030-2000 encodes characters in sequences of one, two, or four bytes. These sequences are defined as follows: | ||
Line 25: | Line 25: | ||
Is GB18030 replacing the Windows Simplified Chinese code page (CP936)? | Is GB18030 replacing the Windows Simplified Chinese code page (CP936)? | ||
No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four–byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows. | No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four–byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows. | ||
- | ``` | ||
+ | [[mailto: | ||
+ | uconv -f gbk -t utf-8 224.txt > test.txt | ||
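- | The simplest way to solve this is to process such problematic files as MS936 (Windows code page 936, abbreviated MS936, Windows-936 or (ambiguously) CP936). Unfortunately, Python does not yet support MS936 or CP936 as separate codecs; it treats both as GBK. https:// | + | After running it, open test.txt and you can see that the sentence “100~120RMB/80~90$/65~70€左右” has been decoded correctly. |
- | As a result, simply calling decode on the file contents in Python cannot currently handle these special symbols correctly. | + | |
+ | In Python, the following code calls the libicu library (through the PyICU wrapper) to do the decoding: |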
+ | <code python> | ||
+ | from icu import UnicodeString | ||
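- | Finally, I wrote a piece of Java code to verify this: with MS936 it parses the euro sign in the file without trouble. The current plan is to port Java's MS936 support to Python as part of this project. |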
+ | def convert_encoding(input_file, | ||
+ | # Open the file in binary mode for reading |
+ | with open(input_file, | ||
+ | with open(output_file, | ||
+ | data = f_input.read() | ||
+ | # Convert the data that was read to UTF-8 |
+ | utf8_data = UnicodeString(data, | ||
+ | # Write the converted UTF-8 data to the output file |
+ | f_output.write(str(utf8_data)) | ||
- | ``` | + | |
+ | if __name__ == " | ||
+ | input_file = " | ||
+ | output_file = " | ||
+ | |||
+ | convert_encoding(input_file, | ||
+ | print(" | ||
+ | |||
+ | </ | ||
+ | |||
+ | Java, likewise building on the libicu library, can also handle this problem: |
+ | |||
+ | <code java> | ||
import java.io.BufferedReader; | import java.io.BufferedReader; | ||
import java.io.File; | import java.io.File; | ||
Line 68: | Line 91: | ||
} | } | ||
} | } | ||
- | ``` | + | </ |
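For readers who cannot install PyICU, the Python limitation described above (the built-in `gbk` codec rejects the single byte 0x80, which Windows code page 936 maps to the euro sign U+20AC) can be worked around with a custom decode error handler. The following is only a minimal sketch and not the approach used on this page: the handler name `ms936_euro` and the helper `decode_ms936` are hypothetical names introduced for illustration, and the sketch assumes that 0x80 is the only byte where the file deviates from what the `gbk` codec accepts.

```python
import codecs

EURO_BYTE = 0x80  # Windows code page 936 maps this single byte to U+20AC


def _euro_fallback(err):
    """Hypothetical handler: map an undecodable 0x80 byte to the euro sign."""
    if isinstance(err, UnicodeDecodeError) and err.object[err.start] == EURO_BYTE:
        # Emit the euro sign and resume decoding right after the 0x80 byte.
        return ("\u20ac", err.start + 1)
    # Any other undecodable byte is still reported as an error.
    raise err


# Register the handler under an illustrative (assumed) name.
codecs.register_error("ms936_euro", _euro_fallback)


def decode_ms936(data: bytes) -> str:
    """Decode GBK bytes, treating a stray 0x80 as the euro sign."""
    return data.decode("gbk", errors="ms936_euro")


if __name__ == "__main__":
    sample = b"65~70\x80" + "左右".encode("gbk")
    print(decode_ms936(sample))
```

The handler only runs when the codec fails on 0x80 in the lead-byte position, so valid GBK two-byte characters whose trail byte happens to be 0x80 are decoded as usual.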
gbk_欧元符号的兼容性问题.1689315457.txt.gz · Last modified: (external edit)