Staging errors occurring on UNIFYBroker MA after failing full import
Historically we had this issue prior to upgrading to the latest version of UNIFYBroker (we were on v5.0.0). We are now on version 5.3.2 and hadn't seen the errors for some time, though they now appear to be occurring again. So far the errors have only been seen on every DIDS after a failing full import operation. As mentioned before, UNIFYBroker is on the latest version and FIM is using the latest version of the Unify extension (v5.3.0).
The FIM operation currently has a page size of 1000 and an operation timeout of 1500 seconds. So it appears to get stuck for quite some time.
The UNIFYBroker log doesn't have anything at the exact time of failure, but it does appear to be surrounded by timeout errors:
System.ApplicationException: This operation returned because the timeout period expired. (Exception from HRESULT: 0x800705B4)
at System.Threading.ReaderWriterLock.AcquireReaderLockInternal(Int32 millisecondsTimeout)
at System.Threading.ReaderWriterLock.AcquireReaderLock(TimeSpan timeout)
at Unify.Framework.Data.LinqQueryConversionProvider`5.ExecuteMethodCallExpression[TResult](MethodCallExpression methodCallExpression)
at Unify.Framework.Data.LinqQueryConversionProvider`5.Execute[TResult](Expression expression)
at System.Linq.Queryable.Count[TSource](IQueryable`1 source)
at Unify.Product.IdentityBroker.AdapterStatisticsEngine.GetEntityCount(Guid adapterId)
Let me know what other information you need, and I'll be happy to provide it.
I have captured the change log for the last couple of days if it helps. If you wanted to take a look, I can upload it in a CSV if you like?
Just as another note, we are trying to run an FIDS to rectify the errors, though it appears it has timed out too. Looking at what else was running in UNIFYBroker, there appeared to be an FI running on a separate connector at the time.
Since first appearing, do all entity-related operations now time out, including ones on other connectors/adapters that should complete quickly, or is it more localized to specific places/operations?
Have you restarted the Broker service since these errors started occurring?
How long has it been since the recommended maintenance tasks were performed on the database?
Also, can you provide the full log, including the first occurrence of exceptions like the example you provided?
It looks to me like the database is being overloaded. The first exception is caused by an entity-related pending database operation being blocked by another entity-related write operation for over 15 minutes. This is a protection implemented in Broker, not the database. The second exception is due to a connection to the SQL server being unable to be opened, either because the connection pool limit was exceeded as the error message stated, or because the server was unable to open the connection before the request timed out.
Does this environment meet the combined resource requirements for SQL server and any other processes running on it? Does Broker share the SQL server instances with any other applications?
To start with, I'd recommend identifying any connector or MA import operations that process a high number of entities and adjusting the exclusion group configuration and scheduling so that other operations are not competing with them for resources.
You can also use a query such as this one to list the open connections. Run this while the errors are occurring, or as soon as possible afterwards, for best results.
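The original query isn't reproduced in this thread, but as a sketch, something along these lines using SQL Server's standard dynamic management views would list the open connections grouped by application and login (view and column names here are the documented DMVs; the grouping is illustrative, not necessarily the exact query referenced above):

```sql
-- Sketch: count open connections to the server, grouped by the
-- connecting application, login and host, busiest first.
SELECT s.program_name,
       s.login_name,
       s.host_name,
       COUNT(*) AS open_connections
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = c.session_id
GROUP BY s.program_name, s.login_name, s.host_name
ORDER BY open_connections DESC;
```

Connections from the Broker service should show up under a single program name, which makes it easy to see whether that count is approaching the pool limit.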
There's also the option of increasing the threadpool limit and connection timeout in the connection string (
<installDir>\Services\Extensibility\Unify.Framework.Data.DataEnginePlugInKey.extensibility.config.xml). This isn't a great solution as it can hide other issues and impact performance in other ways, so avoid increasing these values too much. See this page for defaults and other details.
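For illustration only (the exact element layout inside that extensibility config file may differ): in a standard SQL Server connection string, the pool limit and connection timeout are controlled by the `Max Pool Size` and `Connect Timeout` keywords, which default to 100 connections and 15 seconds respectively. The server and database names below are placeholders:

```
Data Source=SQLSERVER01;Initial Catalog=UnifyBroker;Integrated Security=True;Max Pool Size=200;Connect Timeout=30
```

As noted above, raising these values masks contention rather than removing it, so treat any increase as a temporary mitigation.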
Thanks Beau, I'll be able to take a closer look at the environment tomorrow. I'll let you know what I can find and will probably get further advice before making recommendations.
So the staging errors we are seeing are a result of the timeouts. Is there something we could be doing to stop them from occurring when this happens?
You could change your MA timeout to be greater than 15 minutes, so that the MA import wouldn't time out before Broker does. This would just change how the import fails and make it take longer to do so, but the Broker timeout exception message would probably be included in the event log entry.
The timeout for the particular operation has been set to 25 minutes for all the past errors. So I believe the error message should be in what I've included. I'll also collect the information now and post it soon.
At initial glance there are 2 separate servers, both with 8 processors and 128GB of memory. The UNIFYBroker DB sits on a separate instance to FIMSync and FIMServer. I am not sure how replication works or on which servers the instances are hosted, but we are working on finding this out.
Also, I have noticed that all the failures so far have failed at around the same number of records. The timing varies by about 10-15 minutes, but they all time out at around ~475,800 records, varying by about ~100 records.
Just adding here, but I have also noticed staging errors at another site recently. These errors didn't seem to start from any failure or after any full import; they look to have just started during one delta sync and have grown from there. It should also be noted that this site is one version behind in Broker (v5.3.1), so I'm not sure if that makes a difference.
We implemented a few workarounds at the first site. We haven't seen any staging errors since, but additionally we have also been running the daily full imports.
Along with the below list, we also did some assessment of UNIFYBroker database connections and believe at initial glance that this could be causing issues. While performing 1 Full Import operation and about 4 Delta Import operations on UNIFYBroker connectors, we saw ~70 connections to the UNIFYBroker database, all from the UNIFYBroker service. So we deduced that it wouldn't be too far a stretch to think that at busy times these connections could reach ~100.
Please see the list below of workarounds:
We currently have the issue occurring in UAT, so much so that the error is produced even on Full Imports, where historically this used to clean up all staging errors. And much like the normal staging errors, these also grow when DIDS operations are executed. I believe cleaning the adapter and re-importing will fix this, but I didn't want to clean it up in case you may be able to use this state for a more reliable failure.
A full import on the connector in IDB and clearing the adapter changes cleared up the staging errors in UAT. Unfortunately this means that troubleshooting cannot be continued. Will update the ticket when this error reoccurs.
After clearing the change log we have noted that the staging errors reappear on connectors that are yet to run a full import in FIM. We have also noted that the LDAP gateway has multiple connections open, and that adapters have deadlocked on resources and are also trying to insert duplicate entities into the Entity table.
The current debug steps we are running are refreshing the LDAP gateway connection and running full imports in FIM to ensure the system has a clean slate.
Can you try following these instructions to save a deadlock graph? Run it long enough to capture a few occurrences of the deadlock errors, where
Transaction (Process ID xx) was deadlocked on lock resources ...
appears in the Broker logs.
It would also be helpful if you could run a second trace using the Standard (default) template, allowing it to run while
Violation of PRIMARY KEY constraint 'PK_Entity'
errors are occurring.
These traces can be run over a longer period of time by manually starting the trace configured with an automatic stop time, if that helps in capturing these error events.
Upload the output files from the traces, along with the Broker logs for while the traces were running.
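As an alternative to a Profiler trace for the deadlock capture, an Extended Events session can be left running over a longer window with less overhead. This is a hedged sketch, not the instructions referenced above; the session name and output path are placeholders:

```sql
-- Sketch: capture deadlock graphs via Extended Events.
-- [BrokerDeadlocks] and the .xel path are placeholder names.
CREATE EVENT SESSION [BrokerDeadlocks] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file (SET filename = N'C:\Temp\BrokerDeadlocks.xel');
GO
ALTER EVENT SESSION [BrokerDeadlocks] ON SERVER STATE = START;
-- ...leave running while the deadlock errors occur in the Broker logs, then:
ALTER EVENT SESSION [BrokerDeadlocks] ON SERVER STATE = STOP;
```

The resulting .xel file can be opened in SQL Server Management Studio and uploaded alongside the trace output and Broker logs.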
We have received word that a vendor for the customer, undertaking a cloud migration of the production environment, had turned on the UNIFY Broker and Now services in the cloud instance. This cloud instance was still configured to communicate with the current on-premise database used by the Identity Broker service. We believe this is what has caused the deadlock and duplicate issues in Broker.
We're monitoring the situation currently and it looks like the problem is resolved for the moment, but we will keep you informed if the issues arise again.