
Thursday, March 22, 2012

Anyone else have problems with Sort/Merge in For Each loop?

Just wondering if I'm an isolated case or just doing stuff other people don't. I've created around a dozen packages for importing various types of data into a data warehouse, and some of the packages crash at random. Over time I have been trying to tie together what the crashing ones have in common, and I now believe I have completely isolated it: Sort + Merge inside a data flow in a For Each loop. The more sorts and merges in the flow, the more likely the problem is to occur.

This week I created two nearly identical packages to import some XML. One used some gnarly, complicated substrings to extract data and add it as derived columns, while the other used the XML source properly, producing two streams (one metadata, the other detail data), which it then sorted and merged. The gnarly substring package could run all night (it processes about one file every 2 or 3 seconds, so you can do the math), while the package with the merged data flows crashes somewhere between 1 and 4 hours of running, usually right around 3 hours (it also processes about one file every 2 seconds). The file it crashes on is completely random and unrelated to file size: I've had crashes on files with as little as 1 row of data and 1 row of metadata, and as big as 1 row of metadata and 100,000+ rows of data. Another, more complicated package I created has 4 merges (and thus 8 sorts), and it will crash anywhere between 5 and 20 minutes into a run.
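For anyone unfamiliar with the pattern, here is a rough, non-SSIS sketch of what the Sort + Merge data flow is doing logically: two streams keyed on the same column are each sorted and then combined into a single ordered output. This is plain Python for illustration only; the row values and key names are made up and nothing here corresponds to the actual packages.

```python
import heapq

# Hypothetical rows: (sort_key, payload). In SSIS terms, think of one
# stream carrying metadata rows and the other carrying detail rows,
# both keyed on the same file/document id.
metadata_rows = [(2, "meta-B"), (1, "meta-A")]
detail_rows = [(1, "detail-A1"), (2, "detail-B1"), (1, "detail-A2")]

# Sort transformation: each stream must be ordered on the merge key.
metadata_sorted = sorted(metadata_rows, key=lambda r: r[0])
detail_sorted = sorted(detail_rows, key=lambda r: r[0])

# Merge transformation: interleave the two sorted streams into one
# output, preserving key order.
merged = list(heapq.merge(metadata_sorted, detail_sorted, key=lambda r: r[0]))

for key, value in merged:
    print(key, value)
```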

The crash is usually one where the whole process just shuts down with a memory dump, but occasionally I have gotten out-of-memory errors or warnings about threads leaking buffers.

I am trying to do some "research" to see if others have this problem. I've spent 1.5 months writing all of our ETL jobs in SSIS, and it's gotten to the point where management has asked me to explore other platforms. I do have a case open with Microsoft on the package with 4 merges, but we've spent weeks now just trying to get debugging tools in place to give them some useful info when the process crashes. I am running SP1 and kb91822.

Anyone else?

FYI to anyone who comes across this thread: Microsoft identified a bug in the Sort object (stack heap corruption) and now has a hotfix. If you run into this same situation, please contact them (until the patch is released to the public).

Tuesday, March 6, 2012

Any Problems with Large Packages?

I've been suffering with a variety of symptoms lately, and I don't know why:

    Can only debug once in BIDS. After that, it can never attach to DtsDebugHost.exe. DtsDebugHost.exe never receives any CPU time, and never grows past its initial memory allocation.

    Save takes a long time with 100% CPU usage. This can be as simple as opening a SQL Server Destination component to correct the metadata due to an added column in the underlying table, then saving. It takes several minutes at 100% CPU.

    I've had several Visual Studio crashes today (DW has reported these)

I've read here about people with 10MB or 17MB packages, so I feel bad complaining about my 3.2MB package! However, something is going on, and I'd like to know what it is so I can debug again.

Thanks for any help.

Rebooting can sometimes work wonders. :)

What SP level are you running? I highly recommend SP2.

As a "good" practice, I try to keep packages small, for a variety of reasons. Smaller packages are easier to debug, seem to be more stable in BIDS, and are easier to manage in a team environment with multiple developers. My packages normally are under 1mb, and most often are 300-700kb.

|||

I'm running SP2.

Once I get it out to QA, I may look into breaking it up into smaller packages. I'm sure the size is due to the 17 outputs from the XML Source all being processed in the same data flow task. At the cost of reading the XML more than once, and of creating subset schemas (so I don't get hundreds of "column not used" warnings), I can break some pieces off and execute them in parallel. There are performance reasons as well, since the buffers are made wide by large string and NTEXT fields.
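As a rough illustration of that split (plain Python, not SSIS; the file name and element names are hypothetical), the idea is to walk the same source file more than once, each pass pulling only the subset of elements one downstream flow needs, so the passes can run independently of each other:

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

SOURCE_FILE = "orders.xml"  # hypothetical input file

def process_subset(tag):
    """One 'package': re-read the source and handle only one element type."""
    tree = ET.parse(SOURCE_FILE)
    rows = tree.getroot().iter(tag)
    return tag, sum(1 for _ in rows)  # stand-in for real per-row processing

# Each subset gets its own pass over the XML, and the passes can execute
# in parallel instead of sharing one wide data flow.
with ThreadPoolExecutor() as pool:
    for tag, count in pool.map(process_subset, ["Header", "LineItem", "Note"]):
        print(tag, count)
```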

In the meantime, I'm wondering about how to avoid that reboot issue...

|||

A couple of other suggestions:

Try setting Work Offline (under the SSIS menu) to true before saving, and see if that makes a difference (it shouldn't, but I've seen stranger things).

How complex are the transformations in the data flow? You might want to use one package to read the file once, and save all the outputs to RAW files. Then have other packages that pick up the RAW files to do the rest of the processing. RAW files are extremely fast, so they don't introduce as much overhead as you might expect. On the other hand, if the transformations are simple, you won't get a lot of benefit from that approach.
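To make the staging idea concrete, here is a minimal sketch of the pattern in plain Python (not SSIS, and not the actual RAW file format): one "reader" stage parses the source once and dumps each output stream to an intermediate file, and separate "consumer" stages pick those files up later.

```python
import pickle

# Stage 1: read the source once and persist each output stream to an
# intermediate file (playing the role of an SSIS RAW file).
def stage_extract(rows_by_output):
    for name, rows in rows_by_output.items():
        with open(f"{name}.stage", "wb") as f:
            pickle.dump(rows, f)

# Stage 2: a later "package" picks up one intermediate file and does
# the heavier transformations on its own schedule.
def stage_transform(name):
    with open(f"{name}.stage", "rb") as f:
        rows = pickle.load(f)
    return [row.upper() for row in rows]  # stand-in for real transforms

stage_extract({"header": ["h1", "h2"], "detail": ["d1", "d2", "d3"]})
print(stage_transform("detail"))
```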

|||

Another wonderful behavior is that data viewers aren't working reliably. I've just given up on one pair: one each on the default output and on the error output of a derived column transformation. The final step in that execution path writes to a SQL Server Destination, and indeed, the rows get there. But they don't stop at either data viewer, and there's no path to the destination that does not go through the derived column transformation.