Efficient way to alter 100GB table

We have a number of databases which store tens to hundreds of gigabytes of data in one of their tables. It contains image data. The problem is that a lot of these databases were created improperly. Basically, the primary key isn't actually a primary key: they were created with a unique index on a nullable column. And some of them have an int as the primary key instead of a bigint.

So we've been slowly going through and fixing these databases. They run on SQL Server 2000 through SQL Server 2008, although most of the ones with primary key issues are on SQL Server 2000. The problem is, we don't want to lock down the database for a whole day while it converts the table. We've gone through several strategies:

  • Tell SQL Server to directly change the column type (a plain ALTER TABLE ... ALTER COLUMN; see the sketch after this list). This locks the table until it's complete, and in many cases it still wasn't finished after being left to run overnight.

  • Insert all the images into a new table in one go. This was more easily interrupted, but the whole table basically gets written to the log file in the process.

  • Insert 100 rows at a time where the rows don't exist in the target table. The upside is that users can continue using the database while this is going on (with a big performance hit), that it can be stopped and restarted arbitrarily at any point, and that it avoids the 100GB+ log files. This is what we're currently doing, but finding the top 100 rows that don't exist gets really, really slow as the target table gets bigger and bigger. UPDATE STATISTICS and DBCC INDEXDEFRAG help considerably, but in the most recent attempt we got to the point where even 100 images at a time would sit there not responding.

    INSERT INTO %s  
      SELECT TOP 100 Source.*  
      FROM %s AS Source WITH (NOLOCK)  
      LEFT OUTER JOIN %s AS Target WITH (NOLOCK) ON Source.DocumentID = Target.DocumentID  
      WHERE Target.DocumentID IS NULL  
      ORDER BY Source.DocumentID  
    
  • So the question is: is there an option which can copy bulk data in an efficient and resumable manner? It doesn't have to be 100% accurate; we can always go back and fix any discrepancies at the end, as long as it does 99% of the work.
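
For reference, the direct approach from the first bullet is a single statement along these lines (a minimal sketch; the table and column names are placeholders, and on a nullable column any NULLs would have to be dealt with before NOT NULL could be applied):

    -- Sketch only: dbo.Documents and DocumentID are placeholder names.
    -- This runs as one operation and holds a schema-modification lock on the
    -- table for the duration, which is why it never finished overnight.
    ALTER TABLE dbo.Documents
        ALTER COLUMN DocumentID bigint NOT NULL;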


    The join is the problem. Don't do that. Just loop through your current table in some reasonable interval, based on the current clustered index. Something like:

    DECLARE @idrange int;
    SET @idrange = 1;

    WHILE @idrange < 10000000
    BEGIN

        -- Copy one fixed-size key range per pass; no join against the destination.
        INSERT INTO Destination
        SELECT *
        FROM Source
        WHERE DocumentID BETWEEN @idrange AND @idrange + 999
        ORDER BY Source.DocumentID;

        SET @idrange = @idrange + 1000;

    END
    

    Note that for best speed, remove all indexes (including the clustered index) from the destination table, then add the indexes once all rows have been inserted.
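
    For what it's worth, the drop/recreate step might look like this (a sketch; the index names are placeholders, and the DROP INDEX syntax shown is the SQL Server 2005+ form):

    -- Before the load: drop the indexes on the destination (placeholder names).
    DROP INDEX IX_Destination_SomeColumn ON Destination;

    -- After the load: build the clustered index first, then the nonclustered
    -- ones, so the nonclustered indexes only get built once.
    CREATE UNIQUE CLUSTERED INDEX PK_Destination_DocumentID
        ON Destination (DocumentID);
    CREATE NONCLUSTERED INDEX IX_Destination_SomeColumn
        ON Destination (SomeColumn);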

    EDIT: changed the range interval to prevent an overlap (since BETWEEN includes the endpoints).

    One final clarification: the overall point of my sample script is that you simply want to walk through your current records in some reasonable order, and put them all into the new table in batches. There's no reason to continue checking the destination table each time, since you should already know what you've put there, and what is still left. Most of the time, it makes sense to use the clustered index (if there is one), since this means it can walk through the physical order of the table without doing bookmark lookups. If the table has no cluster, then just use whatever makes the most sense (your PK, probably).
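
    A resumable variant of the same idea might look like this (a sketch; it assumes DocumentID is the ordering key and that ranges are always copied in ascending order, so a single MAX() on the destination is enough to find where the previous run stopped):

    DECLARE @lastid bigint, @maxid bigint;

    -- One lookup against the destination per run, instead of a join per batch.
    SELECT @lastid = ISNULL(MAX(DocumentID), 0) FROM Destination;
    SELECT @maxid  = MAX(DocumentID) FROM Source;

    WHILE @lastid < @maxid
    BEGIN
        INSERT INTO Destination
        SELECT *
        FROM Source
        WHERE DocumentID > @lastid AND DocumentID <= @lastid + 1000
        ORDER BY DocumentID;

        SET @lastid = @lastid + 1000;
    END

    Each INSERT commits as its own statement, so if the loop is killed partway through, rerunning the script simply picks up after the last completed batch.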
