GBK euro sign compatibility issue (gbk_欧元符号的兼容性问题)
Differences
This page shows the differences between the revision you selected and the current version.
gbk_欧元符号的兼容性问题 [2023/07/14 14:22] – MNBVC project team | gbk_欧元符号的兼容性问题 [2025/06/02 15:17] (current version) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 26: | Line 26: | ||
No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four–byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows. | No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four–byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows. | ||
+ | [[mailto: | ||
+ | uconv -f gbk -t utf-8 224.txt > test.txt | ||
+ | After running it, open test.txt and you will see “100~120RMB/ | ||
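- | The simplest way to solve this problem is to process such problematic files as MS936 (Windows Code page 936 (abbreviated MS936, Windows-936 or (ambiguously) CP936)). Unfortunately, Python does not currently provide a separate MS936 or CP936 codec; it treats both as GBK. https:// | + | In Python, then, the following code can call the libicu library (through the PyICU wrapper) to decode the file: |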
- | As a result, decoding the file contents with Python alone cannot currently handle these special symbols correctly. | + | < | ||
+ | from icu import UnicodeString | ||
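- | Finally, I wrote a piece of Java code to verify this: with MS936 it parses the euro sign in the file without problems. The current plan is to port Java's MS936 support to Python code as part of this project. | + | def convert_encoding(input_file, | ||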
+ | # Open the file in binary mode for reading | ||
+ | with open(input_file, | ||
+ | with open(output_file, | ||
+ | data = f_input.read() | ||
+ | # Convert the data that was read to UTF-8 | ||
+ | utf8_data = UnicodeString(data, | ||
+ | # Write the converted UTF-8 data to the output file | ||
+ | f_output.write(str(utf8_data)) | ||
- | < | + | if __name__ == " |
+ | input_file = " | ||
+ | output_file = " | ||
+ | |||
+ | convert_encoding(input_file, | ||
+ | print(" | ||
+ | |||
+ | </code> | ||
+ | |||
+ | Java, likewise based on the libicu library, can also handle this problem: | ||
+ | |||
+ | < | ||
import java.io.BufferedReader; | import java.io.BufferedReader; | ||
import java.io.File; | import java.io.File; | ||
Line 68: | Line 91: | ||
} | } | ||
} | } | ||
- | </java> | + | </code> |
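For readers without PyICU, the limitation described above can be reproduced with the standard library alone. The sketch below is not part of the original page and is not a full MS936 implementation. It assumes the only problematic byte is the single-byte euro sign 0x80 from Windows code page 936, and it maps that byte to U+20AC through a custom codec error handler. The names `euro_fallback`, `ms936_euro` and `decode_cp936_with_euro` are hypothetical and were chosen only for this illustration.

```python
import codecs

# Python registers "cp936" and "ms936" only as aliases of its GBK codec,
# so there is no separate codec that knows about the 0x80 euro byte.
print(codecs.lookup("cp936").name)
print(codecs.lookup("ms936").name)


def euro_fallback(err):
    """Hypothetical error handler: decode a stray 0x80 byte as the euro sign.

    Assumption: the input follows Windows code page 936, where 0x80 is U+20AC.
    Any other undecodable byte is re-raised unchanged.
    """
    if (isinstance(err, UnicodeDecodeError)
            and err.object[err.start:err.start + 1] == b"\x80"):
        # Return the replacement text and the position at which to resume.
        return ("\u20ac", err.start + 1)
    raise err


# "ms936_euro" is an illustrative handler name, not a built-in one.
codecs.register_error("ms936_euro", euro_fallback)


def decode_cp936_with_euro(data: bytes) -> str:
    """Decode GBK bytes, treating a lone 0x80 byte as the euro sign."""
    return data.decode("gbk", errors="ms936_euro")


if __name__ == "__main__":
    sample = "\u4e2d\u6587".encode("gbk") + b"\x80"
    print(decode_cp936_with_euro(sample))
```

This only patches the one byte discussed on this page. Other differences between GBK and Windows code page 936, if a file depends on them, would still need a converter such as libicu.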
gbk_欧元符号的兼容性问题.1689315749.txt.gz · Last modified: (external edit)