Open/read command in Tcl 8.5 for large files

Sorry if the title doesn't match my question well, I'm still unsure as to how I should put it.

Anyway, I've been using Tcl/Tk on Windows (wish) for a while now and hadn't run into any problems with the script I wrote until recently. The script is supposed to break a large text file down into smaller files that can be imported into Excel (I'm talking about a file with maybe 25M lines, which comes to around 2.55 GB).

My current script is something like this:

set data [open "file.txt" r]
set data1 [open "File Part1.txt" w]
set data2 [open "File Part2.txt" w]
set data3 [open "File Part3.txt" w]
set data4 [open "File Part4.txt" w]
set data5 [open "File Part5.txt" w]


set count 0
while {[gets $data line] != -1} {
    if {$count > 4000000} {
        puts $data5 $line
    } elseif {$count > 3000000} {
        puts $data4 $line
    } elseif {$count > 2000000} {
        puts $data3 $line
    } elseif {$count > 1000000} {
        puts $data2 $line
    } else {
        puts $data1 $line
    }
    incr count
}

close $data
close $data1
close $data2
close $data3
close $data4
close $data5

I alter the numbers in the if conditions to get the desired number of lines per file, and add or remove elseif branches where required.

The problem is, with the latest file I got, I end up with only about half the data (1.22 GB instead of 2.55 GB), and I was wondering whether there is a setting that tells Tcl to ignore whatever limit it has on how much it can read. I tried to look for one, but I didn't find anything (or anything that I could understand well; I'm still quite the amateur at Tcl ^^;). Can anyone help me?

EDIT (update): I found a program to open large text files and managed to get a preview of the contents of the file directly. There are actually 16,756,263 lines. I changed the script to:

set data [open "file.txt" r]
set data1 [open "File Part1.txt" w]

set count 0
while {[gets $data line] != -1} {
    incr count
}
puts $data1 $count
close $data
close $data1

to find out where the script was stopping, and it stopped here: [screenshot]

There's a character in the middle line that the text editor doesn't recognise; it shows up as a little square. I tried to use fconfigure like evil otto suggested, but I'm afraid I don't quite understand how the channelID, name or value arguments work to get past that character. Um... help?

reEDIT: I managed to find out how fconfigure worked! Thanks evil otto! Um, I'm not sure how I can 'choose' your answer since it's a comment instead of a proper answer...


Is it possible that there is binary data in "file.txt"? On Windows, Tcl will flag EOF if it reads a ^Z (the default eofchar) in a file. You can turn this off with fconfigure:

fconfigure $data -eofchar {}

See the docs for full details.
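To make the arguments concrete: fconfigure takes the channel handle returned by open, followed by option/value pairs, so in the line above $data is the channel, -eofchar is the option name, and {} (empty) is the value. The following is a minimal sketch of the effect, not the asker's actual data. The file name eof-demo.txt and the helper countLines are made up for illustration, and the sample file deliberately contains a ^Z byte in its second line. Passing \u001a mimics the Windows default, so the difference also shows up on other platforms.

```tcl
# Count the lines in a file. The eofchar argument is handed straight to
# fconfigure: \u001a (^Z) mimics the Windows default, {} disables the check.
proc countLines {path eofchar} {
    set in [open $path r]
    fconfigure $in -eofchar $eofchar
    set count 0
    while {[gets $in line] != -1} {
        incr count
    }
    close $in
    return $count
}

# Hypothetical sample file: three lines, with a ^Z byte inside the second.
set out [open "eof-demo.txt" w]
fconfigure $out -translation binary
puts -nonewline $out "first\nsec\u001aond\nthird\n"
close $out

puts "with ^Z as eofchar: [countLines "eof-demo.txt" \u001a] lines"
puts "with eofchar off:   [countLines "eof-demo.txt" {}] lines"
```

With ^Z acting as the end-of-file character, reading stops partway through the second line and the first count comes out lower than the second. That matches the symptom in the question, where the output stopped well short of the full file.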


I ran your script on a Mac, which is Unix-based, and noticed the following:

  • The incr count should be at the beginning of the loop, so that count is 1-based and the first file gets exactly 1M lines instead of 1,000,001. This is a minor point.
  • More importantly, file.txt contains 25M lines, yet you divided it unevenly: the first four files each get about 1M lines, and the rest goes into File Part5.txt. If you want to divide the file evenly, the break points should be 20M, 15M, 10M and 5M.
  • Other than that, I did not notice any data loss. I don't have a Windows machine to try it out.
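As a rough sketch of the even split suggested above, the chain of elseif branches can be replaced by a counter that opens a new output file every N lines. The proc name splitFile and its parameters are made up for illustration; the "File Part" prefix follows the naming in the question. The fconfigure line is borrowed from the other answer and is only an assumption about what the input needs.

```tcl
# Copy lines from inPath into numbered part files, linesPerFile lines each.
# A new part file is opened only when the previous one is full.
# Returns the number of part files written.
proc splitFile {inPath linesPerFile {prefix "File Part"}} {
    set in [open $inPath r]
    fconfigure $in -eofchar {}   ;# do not stop early at a ^Z byte on Windows
    set count 0
    set part 0
    set out ""
    while {[gets $in line] != -1} {
        if {$count % $linesPerFile == 0} {
            if {$out ne ""} { close $out }
            incr part
            set out [open "$prefix$part.txt" w]
        }
        puts $out $line
        incr count
    }
    if {$out ne ""} { close $out }
    close $in
    return $part
}

# Example call (commented out because it needs the real input file):
# 5,000,000 lines per part would split a 25M-line file into five parts.
# splitFile "file.txt" 5000000
```

Changing the number of lines per file then means changing one argument rather than editing several conditions, and the number of output files follows from the length of the input.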