在SHIFT上处理Perl文件

2018-06-23 16:12:51

我有一套来自Windows的SHIFT_JIS（日文）编码的csv文件，我正尝试在运行Perl v5.10.1的Linux服务器上使用正则表达式进行字符串替换。

这是我的要求：我希望Perl脚本的正则表达式是人类可读的（至少对于日本人来说）即，像这样：s /北/ 0 / g; 而不是十六进制码s / 4/0 / g;

现在，我在Windows上的Notepad ++中编辑Perl脚本，并将需要从csv数据文件搜索到的字符串粘贴到Perl脚本中。

我有以下工作测试脚本：

use strict;
use warnings;
use utf8;

open (IN1,  "<:encoding(shift_jis)", "${work_dir}/tmp00.csv") or die "Error: tmp00.csvn";
open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/tmp01.csv") or die "Error: tmp01.csvn";

while (<IN1>)
{
    print $_ . "n";
    chomp;
    s/北/0/g;
    s/10:00/9:00/g;     
    print OUT1 "$_n";
}    

close IN1;
close OUT1;

这将在csv文件中以9:00成功替换10:00，但问题是我无法用0替换北（即北），除非顶部还包含使用utf8。

问题：

1）在打开的文档中，http://perldoc.perl.org/functions/open.html，我没有看到使用utf8作为要求，除非它是隐含的？

a）如果我只使用utf8，那么循环中的第一个打印语句会将垃圾字符打印到我的xterm屏幕。

b）如果我只打开了：encoding（shift_jis），那么循环中的第一个打印语句会将日语字符打印到我的xterm屏幕，但替换不会发生。没有警告说没有指定使用utf8。

c）如果我同时使用a）和b），那么这个例子就可以工作。

“使用utf8”如何修改在此Perl脚本中使用enoding（shift_jis）调用open的行为？

2）我也尝试在没有指定任何编码的情况下打开文件，Perl不会将文件字符串视为原始字节，并且如果我在脚本中粘贴的字符串使用相同的编码，则可以以这种方式执行正则表达式匹配作为原始数据文件中的文本？我可以用这种方式更新文件名，而无需指定任何编码（请参考我的相关帖子：Perl日文到英文文件名替换）。

谢谢。

更新1

在Perl中测试一个简单的本地化样本，用于日语文件名和文件替换

在Windows XP中，将南字符从.csv数据文件中复制并复制到剪贴板，然后将其用作文件名（例如.txt）和文件内容（南）。在Notepad ++中，读取编码为UTF-8的文件显示x93xEC，在SHIFT_JIS下读取它显示南。

脚本：

使用以下Perl脚本south.pl，该脚本将使用Perl 5.10在Linux服务器上运行

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename    could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# pass an array reference of files that will be modified
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to nt" . $new_file . "n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txtn";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txtn";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txtn";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txtn";   

    while (<IN1>)
    {
        print $_ . "n";
        chomp;

        s/南/south/g;


        print OUT1 "$_n";
    }

    close IN1;
    close OUT1; 
}

结果：

（BAD）取消注释选项1和3（注释选项2和4）设置：Readdir编码，SHIFT_JIS; 文件打开编码SHIFT_JIS结果：文件名称替换失败..错误：utf8“ x93”没有映射到Unicode ////south.pl第68行。 x93

（BAD）取消注释选项2和4（注释选项1和3）设置：Readdir编码，utf8; 文件打开编码utf8结果：文件名替换工作，south.txt生成但是south1.txt文件内容替换失败，它的内容为 x93（）。错误：“ x {fffd}”没有映射到位于////south.pl第25行的shiftjis。... -Ao？=（Bx {fffd} .txt

（GOOD）取消注释选项2和3（注释选项1和4）设置：Readdir编码，utf8; 文件打开编码SHIFT_JIS结果：文件名替换工作，south.txt生成的South1.txt文件内容替换工作，它的内容为南。

结论：

这个例子必须使用不同的编码方案才能正常工作。 Readdir utf8和文件处理SHIFT_JIS，因为csv文件的内容是SHIFT_JIS编码的。

开始的一个好地方是阅读utf8模块的文档。其中说：

use utf8 pragma告诉Perl解析器在当前词法范围的程序文本中允许UTF-8（允许基于EBCDIC的平台上的UTF-EBCDIC）。 no utf8 pragma告诉Perl切换回在当前词法范围内将源文本视为字面字节。

如果您的代码中没有use utf8 ，那么Perl编译器会假定您的源代码位于您系统的本地单字节编码中。而'北'这个字就没有意义了。添加杂注告诉Perl，你的代码包含Unicode字符，并且所有东西都开始工作。

链接地址: http://www.djcxy.com/p/66365.html

上一篇: Perl file processing on SHIFT

下一篇: Why is this program valid? I was trying to create a syntax error