Hello All.
I have this script that I need to run every Saturday evening and it takes more than 5 hours to complete. Is there a better way to structure my script? Please advise. Thank you.
update Table_A set financial_yr = t2.Master_financial_year,
financial_period = t2.Master_financial_period
from Master_financial_Table t2
where frst_requested_date >= t2.Start_date and frst_requested_date <= t2.End_date
and financial_year>2000
There are a total of 4.6 million records (and growing weekly) in Table_A and 42 records in Master_financial_table (standard)
Indexes have been created for these tables.|||Maybe this will be faster:
This way you don't read the Master_financial_Table 4 million times:
Declare @Master_financial_year DateTime
Declare @Master_financial_period DateTime
Declare @Start_date DateTime
Declare @End_date DateTime
Select @Master_financial_year=Master_financial_year,
@Master_financial_period=Master_financial_period,
@Start_date=Start_date,
@End_date=End_date
From Master_financial_Table
Update Table_A
Set financial_yr=@Master_financial_year,
financial_period = @Master_financial_period
Where frst_requested_date between @Start_date and @End_date and
financial_year>2000|||I suspect that Table_A will have been given a non-clustered index on frst_requested_date. Given that where frst_requested_date >= t2.Start_date and frst_requested_date <= t2.End_date is going to select a very large number of records, if Table_A has a clustered index, you should consider removing it.
In SQL Server 2000, if a table has a clustered index, the leaf nodes of all non-clustered indexes contain the key values of the clustered index corresponding to the index match. So, when a non-clustered index match is found it is followed by a bookmark lookup on the table's clustered index. On *very* large tables even a highly specific index can return many thousands of records, thereby inducing thousands of clustered index seeks in order to resolve the secondary bookmark lookup. SQL Server should be smart enough to realize a table scan is going to be better, but often it doesn't, and so very large clustered tables exhibit unpleasant performance when a query is conducted through a non-clustered index.
If a table is non-clustered (or heaped), the leaf nodes of all its indexes consist of offset pointers directly into the data blob. This has a significant performance risk in that any index page splitting becomes expensive to fix up on insert or update, but all indexes recover their data by jumping directly to the data row without thousands or millions of bookmark lookups or full-blown table scans.
If you can generate a query plan and discover many bookmark lookups are taking place, it's worth considering unclustering Table_A.|||If a table is non-clustered (or heaped), the leaf nodes of all its indexes consist of offset pointers directly into the data blob. This has a significant performance risk in that any index page splitting becomes expensive to fix up on insert or update, but all indexes recover their data by jumping directly to the data row without thousands or millions of bookmark lookups or full-blown table scans.
I disagree. If the table is a heap the pointers point to the ROWID, which takes into account dbname, table, page, etc. Not sure what you mean by data blob.
and so very large clustered tables exhibit unpleasant performance when a query is conducted through a non-clustered index.
I disagree again. SQL will choose to do a clustered index scan if a table scan is more efficient; a clustered index seek is actually using the index, and SQL will usually choose to ignore the NCI and just scan the table (clustered index scan). Also, while you mention page splitting, a heap reclaims empty space, which is horrible as it takes time for SQL to find the empty space. Also, choosing the right clustered key will keep your inserts fast and your NCI lookups fast, as your key should be small (smaller than the ROWID lookup).
I have found there are very rare occasions when a clustered index should not be on a table.
For the poster - mess with your indexes and play with the "set statistics IO on" command; this will show you how many page reads your query costs and let you tell whether removing or adding an index will really benefit you or hurt you.
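For example, a rough sketch using the poster's own update (the statement itself is unchanged; only the two SET lines are added, and you can substitute an equivalent SELECT with the same WHERE clause if you don't want to modify data while testing):
set statistics IO on

-- run the statement you want to measure and note the "logical reads" per table
update Table_A set financial_yr = t2.Master_financial_year,
financial_period = t2.Master_financial_period
from Master_financial_Table t2
where frst_requested_date >= t2.Start_date and frst_requested_date <= t2.End_date
and financial_year>2000

set statistics IO off
Compare the logical reads before and after adding or dropping an index; whichever arrangement reads fewer pages wins.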
HTH|||Originally posted by rhigdon
I disagree. If the table is a heap the pointers point to the ROWID, which takes into account dbname, table, page, etc. Not sure what you mean by data blob.
I disagree again. SQL will choose to do a clustered index scan if a table scan is more efficient; a clustered index seek is actually using the index, and SQL will usually choose to ignore the NCI and just scan the table (clustered index scan). Also, while you mention page splitting, a heap reclaims empty space, which is horrible as it takes time for SQL to find the empty space. Also, choosing the right clustered key will keep your inserts fast and your NCI lookups fast, as your key should be small (smaller than the ROWID lookup).
I have found there are very rare occasions when a clustered index should not be on a table.
For the poster - mess with your indexes and play with the "set statistics IO on" command; this will show you how many page reads your query costs and let you tell whether removing or adding an index will really benefit you or hurt you.
HTH
You are quite welcome to disagree. However, very large data tables behave differently from more modest deployments.
I quote from http://www.sql-server-performance.com/jc_sql_server_quantative_analysis5d.asp
"The key observation for multi-row select queries is that there can be a very wide discrepancy between the point where query optimizer switches the execution plan to a Table Scan and the actually observed cross-over point."
"Other important points include the following. Bookmark Lookups are less expensive for heap organized tables than tables with a clustered index. It is frequently recommended that tables have a clustered index. If clustering only benefits a small fraction of the queries (weighted by the number of rows involved), then it may be better to leave the table a heap."|||"Other important points include the following. Bookmark Lookups are less expensive for heap organized tables than tables with a clustered index. It is frequently recommended that tables have a clustered index. If clustering only benefits a small fraction of the queries (weighted by the number of rows involved), then it may be better to leave the table a heap."
I agree if you will not be having a lot of inserts or deletes. The reclaiming of empty space is non-optimal if either is occurring. The other benefit of more efficient key locks with a clustered index than row locks should be taken into consideration if you will be joining to the table. I would be really interested to hear the poster's logical IO when using or not using a clustered index. There are very few tables I work with that are read-only or used solely for querying; the only place I really have that is in a warehouse that I use cubes for anyway, rather than SQL, to query.
Looks like an interesting article, will have a read.|||Karolyn, limteckboon's query is not going to make four million passes through the table. That's ridiculous. Your solution is functionally equivalent to his, but with extra coding.
limteckboon,
It's unclear whether the fields frst_requested_date and financial_year are in Table_A or Master_financial_Table. It makes a difference in how the query should be written. Please clarify. If they are in Table_A then possibly a JOIN instead of a WHERE clause would improve efficiency:
update Table_A
set financial_yr = t2.Master_financial_year,
financial_period = t2.Master_financial_period
from Table_A
inner join Master_financial_Table t2
on Table_A.frst_requested_date between t2.Start_date and t2.End_date
and Table_A.financial_year>2000
In the meantime, you may get better performance if you DROP the indexes on Table_A prior to your update, especially any indexes involving columns financial_yr and financial_period. Add the indexes back in at the end of the process.
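Something like this, for instance - the index name IX_TableA_financial is only a placeholder here, so substitute whatever indexes really exist on those columns:
drop index Table_A.IX_TableA_financial    -- SQL 2000 syntax: table.index

-- ... run the big update here ...

create index IX_TableA_financial on Table_A (financial_yr, financial_period)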
blindman|||my query is not equivalent...
there are no JOINs in my query
so the UPDATE query will be executed faster|||reindexing a 4-million-row table must cost a lot...|||It costs less than continuously reshuffling the index pages.|||Hello All.
Thank you very much for all your most valuable advice. Greatly appreciated. Now, I need time to digest it. I will give the 2 suggested sets of code a try to see which is more suitable for me, but I am very sure both will definitely give better performance than mine.
Once again Thanks a million to all.
Best regards|||i don't see a difference between Karolyn and blindy's code, except for "extra coding" which actually eliminates a whole table out of the update (that's actually a plus, blindy)
limteckboon:
consider non-conventional data modifications. i guarantee that we can shrink the execution time down to under 60 minutes, maybe even less than 30 if you stop listening to blabbering about indexing and inner joins.
did i get your attention?|||ms_sql_dba posted in another thread
"i've done this type of update on larger number of rows without using update. the trick? use queryout on bcp with values that you want to have, then truncate and bcp data back in."
to add to his idea:
--> using Bulk Insert to put the data back in is faster than Bcp
--> and set the option 'select into/bulkcopy' to True with this command:
exec sp_dboption 'DataBaseName', 'select into/bulkcopy', 'True'
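For what it's worth, a rough sketch of that whole bcp-out / truncate / bulk-insert-back cycle. The server name, database name, file path, and Table_A's column list below are guesses for illustration only - the real export query has to return every column of every row, with the new financial_yr and financial_period values already filled in, or rows will be lost at the truncate step:
-- 1) from a command prompt (or via xp_cmdshell), export the corrected rows
bcp "select a.frst_requested_date, coalesce(t2.Master_financial_year, a.financial_yr), coalesce(t2.Master_financial_period, a.financial_period), a.financial_year from MyDB.dbo.Table_A a left join MyDB.dbo.Master_financial_Table t2 on a.frst_requested_date between t2.Start_date and t2.End_date and a.financial_year > 2000" queryout C:\table_a.txt -c -T -S MyServer

-- 2) back in Query Analyzer: empty the table and reload the file
truncate table Table_A

bulk insert Table_A from 'C:\table_a.txt'
with (fieldterminator = '\t', rowterminator = '\n', tablock)
With 'select into/bulkcopy' set to True and TABLOCK, the reload is minimally logged, which is where the time saving comes from.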
|||simple... *** admiration ***|||If you're going to examine 4 million rows, a table scan is (almost certainly) going to be done, but I doubt that the scanning time is going to be an issue. What will certainly be an issue is the fact that you are updating every single one of them, every single time!
Once you've processed 4 million records this week, how many of them are really going to change by next week? Your query should select, and only update the status of, those records which have changed (or that do not have a known status yet). It seems to me that if you are examining the "first-requested" date, not too many records will ever change status a second time! Take full advantage of that!
If you must make the change to all the records, write a script that processes 1,000 records at a time, then commits. Otherwise, the server is prepared to roll back every one of those changes!
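A minimal sketch of that kind of batching, using SET ROWCOUNT. The "financial_yr is null" test is only an assumption about how an unprocessed row can be recognised - use whatever condition really identifies them:
set rowcount 1000                        -- limit each UPDATE to 1,000 rows

while 1 = 1
begin
update Table_A set financial_yr = t2.Master_financial_year,
financial_period = t2.Master_financial_period
from Master_financial_Table t2
where frst_requested_date >= t2.Start_date and frst_requested_date <= t2.End_date
and financial_year > 2000
and financial_yr is null                 -- only rows not processed yet

if @@rowcount = 0 break                  -- nothing left to do, stop looping
end

set rowcount 0                           -- back to normal
Each pass commits on its own (there is no explicit transaction), so the server never has to be ready to roll back more than 1,000 rows at a time.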
Any change to an indexed column will cause the index to be updated every time. This should be avoided.
Just as an afterthought: can these be calculated fields? I mean, with only twenty-something date ranges, total ...
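If that afterthought is worth pursuing, one way to do it - a sketch only, since in SQL Server 2000 a computed column can't look at another table, so a view that does the lookup at query time is the nearest thing, and the view name here is made up:
create view Table_A_with_financials
as
select a.*,
t2.Master_financial_year as financial_yr_calc,
t2.Master_financial_period as financial_period_calc
from Table_A a
left join Master_financial_Table t2
on a.frst_requested_date between t2.Start_date and t2.End_date
go
Queries then read the view instead of Table_A, and the whole weekly update disappears, at the price of a 42-row join at query time.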
No matter how efficient a computer or a piece of software may be, the best way to get good performance out of it is to ask it to do absolutely no more than it has to. I think that the root cause of the problem is that you are making the server do far too much work.