gbk_欧元符号的兼容性问题 (GBK compatibility issue with the euro sign)

Differences

This shows the differences between the revision you selected and the current version of the page.

gbk_欧元符号的兼容性问题 [2023/07/14 14:22] MNBVC project team
gbk_欧元符号的兼容性问题 [2025/06/02 15:17] (current version) - external edit 127.0.0.1
Line 26: Line 26:
 No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four-byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows.
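
The "one, two or four bytes" point in the quote above can be checked from Python's standard library. This is only a minimal sketch: the three sample characters and the helper name `encoded_lengths` are chosen here for illustration and do not come from the page.

```python
def encoded_lengths(text, encoding):
    """Return the number of bytes each character of `text` occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


# Illustrative samples: an ASCII letter, a common Chinese character, and an emoji
# from outside the Basic Multilingual Plane.
gb18030_lengths = encoded_lengths("A中😀", "gb18030")
print(gb18030_lengths)

# GBK is at most two bytes per character, so it has no encoding for the emoji.
try:
    "😀".encode("gbk")
    gbk_can_encode_emoji = True
except UnicodeEncodeError:
    gbk_can_encode_emoji = False
print(gbk_can_encode_emoji)
```

The first print should show one, two and four bytes respectively, which is the variable-width layout that a double byte code page cannot accommodate.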
  
 +[[mailto:zhangxu@apusai.com|zhangxu]]: The libicu library currently handles this problem well. We can verify this with uconv, a Linux command built on libicu:
 +uconv -f gbk -t utf-8 224.txt > test.txt
  
 +After running it, open test.txt and you can see that the sentence “100~120RMB/80~90$/65~70€左右” has been decoded correctly.
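
Before converting, it can help to confirm that a file really contains the Windows-style euro byte. The sketch below assumes the problematic euro sign is stored as the single byte 0x80, as in Windows code page 936. It walks the data the way a GBK decoder would and reports every 0x80 found in lead-byte position. The function name `find_stray_0x80` is hypothetical, and the sample bytes are built by hand to mimic the tail of the sentence above.

```python
def find_stray_0x80(data):
    """Return the offsets of 0x80 bytes that sit where a GBK lead byte is expected."""
    offsets = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Plain ASCII: one byte per character.
            i += 1
        elif byte == 0x80:
            # Not a valid GBK lead byte; CP936 uses it for the euro sign.
            offsets.append(i)
            i += 1
        else:
            # Lead byte of a two-byte GBK character: skip the trail byte as well.
            i += 2
    return offsets


# Hand-built sample: ASCII text, a single 0x80 byte, then two GBK characters.
sample = b"65~70\x80" + "左右".encode("gbk")
print(find_stray_0x80(sample))
```

Skipping two bytes for every other lead byte matters because 0x80 is also a legal GBK trail byte, and those occurrences should not be reported.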
  
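-The simplest way to solve this problem is to use MS936 to process such problematic files (Windows Code page 936 (abbreviated MS936, Windows-936 or (ambiguously) CP936)). Unfortunately, Python does not yet support MS936 or CP936 as separate codecs; it treats them all as GBK. https://docs.python.org/3/library/codecs.html
-Therefore, decoding the file contents with Python alone currently cannot handle these special symbols correctly.
 +In Python, then, the following code can call the libicu library (through the PyICU wrapper) to decode the file:
 +<code python>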
 +from icu import UnicodeString
  
  
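-I wrote a piece of Java code to verify this: it parses the euro sign in the file without trouble using MS936. The current plan is to port Java's MS936 support to Python code as part of this project.
 +def convert_encoding(input_file, output_file):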
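 +    # Open the input file in binary mode for reading
 +    with open(input_file, "rb") as f_input:
 +        # Write the output explicitly as UTF-8 rather than the platform default
 +        with open(output_file, "w", encoding="utf-8") as f_output:
 +            data = f_input.read()
 +            # Decode the GBK bytes into an ICU UnicodeString
 +            decoded = UnicodeString(data, "GBK")
 +            # Write the decoded text to the output file
 +            f_output.write(str(decoded))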
  
  
-<java>
 +if __name__ == "__main__":
 +    input_file = "224.txt" 
 +    output_file = "224_utf8.txt" 
 + 
 +    convert_encoding(input_file, output_file) 
 +    print("Conversion completed.")
 + 
 +</code> 
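
Where PyICU is not available, a standard-library fallback is conceivable. This sketch is not the project's method and does not implement full MS936 support. It assumes the only incompatibility is the single byte 0x80 standing for the euro sign, and it registers a custom decode error handler that substitutes U+20AC for that byte. The handler name `cp936_euro` and the function `decode_cp936` are hypothetical names introduced here.

```python
import codecs

EURO_BYTE = 0x80  # assumption: CP936 stores the euro sign as this single byte


def _euro_fallback(exc):
    """Replace a stray 0x80 byte with the euro sign; re-raise any other error."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == EURO_BYTE:
        # Emit U+20AC and resume decoding right after the 0x80 byte.
        return ("\u20ac", exc.start + 1)
    raise exc


# Register under a hypothetical name so it can be passed as errors=...
codecs.register_error("cp936_euro", _euro_fallback)


def decode_cp936(data):
    """Decode GBK bytes, treating a lone 0x80 byte as the euro sign."""
    return data.decode("gbk", errors="cp936_euro")


# Hand-built sample mimicking the tail of the sentence in 224.txt.
sample = b"65~70\x80" + "左右".encode("gbk")
print(decode_cp936(sample))
```

Any other undecodable byte still raises `UnicodeDecodeError`, so the fallback does not silently hide unrelated corruption.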
 + 
 +Java, which is likewise based on the libicu library, can also handle this problem:
 + 
 +<code java>
 import java.io.BufferedReader;
 import java.io.File;
Line 68: Line 91:
     }
 }
-</java>
 +</code>
gbk_欧元符号的兼容性问题.1689315749.txt.gz · Last modified: (external edit)