Aurion connector time out "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond"

Adrian Corston 1 year ago in UNIFYBroker/Aurion updated by Matthew Davis (Technical Product Manager) 7 months ago

One of my Aurion connectors is failing its Import All operation with the following error.  Two other Aurion connectors for the same agent do not return this error.  Test Connection for the agent is successful.  I can't find a client-side timeout parameter on the configuration screen.  The error occurs around 5m24s after the import starts.  There were around 7,200 records the last time the import was working in this environment (I don't know how long ago that was).  The other two working connectors have similar entity counts and each takes around 90 seconds to run to successful completion.

Could you please investigate?  If this is a server-side timeout please let me know and I'll escalate it to Aurion.


Customer identifying details have been redacted from the following log entry:

20230127,02:25:20,UNIFYBroker,Change detection engine,Error,"Change detection engine import all items failed.
Change detection engine import all items for connector Aurion Employee Connector failed with reason Unable to connect to the remote server. Duration: 00:05:24.5919187
Error details:
System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond XX.XX.XX.XX:443
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket,IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
at System.Net.HttpWebRequest.GetRequestStream()
at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
at Unify.Communicators.AurionAPI.EV397_AURION_WSService.LOGOFF(String P_TOKEN)
at Unify.Communicators.AurionWSCommunicator.Logout()
at Unify.Communicators.AurionAgent.Close()
at Unify.Connectors.AurionApiReadingConnector.d__5.System.IDisposable.Dispose()
at Unify.Connectors.AurionApiReadingConnector.d__5.MoveNext()
at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
at Unify.Product.IdentityBroker.AuditReadingConnectorDecorator.GetAllEntities(IStoredValueCollection storedValues, CancellationToken cancellationToken)
at Unify.Product.IdentityBroker.EventNotifierReadingConnectorDecoratorBase`1.GetAllEntities(IStoredValueCollection storedValues, CancellationToken cancellationToken)
at Unify.Product.IdentityBroker.ChangeDetectionImportAllJob.ImportAllChangeProcess()
at Unify.Product.IdentityBroker.ChangeDetectionImportAllJob.RunBase()
at Unify.Framework.DefinedScopeJobAuditTrailJobDecorator.Run()
at Unify.Product.IdentityBroker.ConnectorJobExecutor.<>c__DisplayClass30_0.b__0()
at Unify.Framework.AsynchronousJobExecutor.PerformJobCallback(Object state)",Normal
Under review

Hi Adrian,

You can set the timeout on the Aurion agent. If a custom timeout is not set, the default is 100 seconds, I believe. Note that this is a per-request timeout, however, not for an import as a whole.

I'm not sure this looks like a timeout issue, however; at least not directly. From the error's stack trace, I can see it occurred when it was trying to call the LOGOFF method, which is the last request made in the GetAllEntities process. Basically, that means Broker was able to authenticate with the server, make one or more query requests for entity data, and only failed when trying to log off, which I presume is to end an authenticated session.
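The stack-trace reading described above can be sketched in a few lines. This is only an illustration, not part of UNIFYBroker: the helper name `first_app_frame` is hypothetical, and the sample text is an excerpt of the log entry earlier in this thread. .NET lists the innermost frame first, so the first frame in the `Unify.` namespace is the product call that issued the failing request.

```python
import re
from typing import Optional

# Matches a .NET stack frame line such as "at Namespace.Type.Method(args)".
FRAME = re.compile(r"^\s*at\s+(?P<method>[^\s(]+)\(")

# Excerpt of the stack trace posted in this thread.
SAMPLE_TRACE = """\
System.Net.WebException: Unable to connect to the remote server
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.GetRequestStream()
at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
at Unify.Communicators.AurionAPI.EV397_AURION_WSService.LOGOFF(String P_TOKEN)
at Unify.Communicators.AurionWSCommunicator.Logout()
at Unify.Communicators.AurionAgent.Close()
"""


def first_app_frame(trace: str, prefix: str = "Unify.") -> Optional[str]:
    """Return the first (innermost) frame whose method name starts with prefix."""
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and match.group("method").startswith(prefix):
            return match.group("method")
    return None


if __name__ == "__main__":
    print(first_app_frame(SAMPLE_TRACE))
```

Run against the excerpt, the helper points at the LOGOFF call rather than at a query call, which is the basis for the observation above.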

Thanks Beau, I should have thought of looking on the agent instead of the connector.  It was set to 5 minutes but after raising it to 15 minutes the error now occurs at the 15:25 mark instead.  Maybe the agent attempts a logoff operation after the timeout, and that's what's logging the error.  Nevertheless this does suggest that the problem is on the server side (since it isn't succeeding, even after 15 minutes) so I'll escalate it to Aurion.  Thanks!
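As a rough check of the two data points above, the sketch below computes how far each failure overshoots the configured agent timeout. The durations come from this thread (the logged `00:05:24.5919187` and the reported 15:25 mark); the function names are hypothetical. .NET prints seven fractional digits, so the value is parsed by hand rather than with `strptime`.

```python
def timespan_seconds(text: str) -> float:
    """Convert a .NET-style hh:mm:ss[.fffffff] duration to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def overshoot_seconds(timeout_s: float, duration: str) -> float:
    """Seconds between the configured timeout and the logged failure."""
    return timespan_seconds(duration) - timeout_s


# (configured agent timeout in seconds, observed failure duration)
OBSERVATIONS = [
    (5 * 60, "00:05:24.5919187"),   # from the log entry in this thread
    (15 * 60, "00:15:25"),          # reported only to the nearest second
]

if __name__ == "__main__":
    for timeout_s, duration in OBSERVATIONS:
        print(f"timeout={timeout_s}s overshoot={overshoot_seconds(timeout_s, duration):.1f}s")
```

Both failures land roughly 25 seconds after the timeout. That is consistent with the idea that the request timeout expires first and a follow-up LOGOFF attempt then fails to connect, although two data points do not prove it.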

I tried a number of combinations of different timeouts and turning diagnostic logging on and off, and after going back to the original settings it's now importing successfully every time (with an elapsed duration of just over 4 minutes).  I'm not sure what the root cause is, but I still think it's the server side.

I have undertaken more investigation into this.  The issue continues to occur only for that one report in the one Aurion instance.  I don't have access to another Aurion instance with a similar data set to try it out on, but the problem doesn't occur in UNIFY's Aurion instance with about one twentieth of the number of users.

However, I did build a PowerShell test harness to invoke the Aurion API QUERY_TO_XML call, in order to try to debug what's going wrong, and despite running it dozens of times I cannot replicate the timeout error from the test harness.  Every time I call QUERY_TO_XML I get back the report content successfully after 4 minutes, give or take 10 seconds.

The test harness even ran successfully when it was started at around the same time as a UNIFYBroker connector import which failed with the timeout error.

So - as unlikely as it may seem - I now think that the root cause for this problem is a combination of (a) the report, (b) the Aurion data that it is retrieving, and (c) something unidentified about the way that the UNIFYBroker connector is retrieving the report content.  Changing any one of those variables eliminates the problem.

Could you please look into whether there is something you can do to investigate further on the UNIFYBroker side?

Test harness code attached (no passwords or credentials): Aurion QUERY_TO_XML test harness

Hi Adrian,

Is there ever a time when this report completes successfully through UNIFYBroker? 

The logoff operation is a fairly simple operation which just terminates that particular token - as you've seen in your script. Given that the operation through UNIFYBroker isn't having problems elsewhere (either on the same or other Aurion instances) and the only difference between one logoff call and another is the token for that run, I'm not sure (at this stage) what changes we could make on the UNIFYBroker side.

The logoff operation is called once all the data has been received as the final step before the data is returned to the connector engine for processing and change detection. 

Are you running your script from the same network as the broker testing? Can the problem be replicated outside of that specific environment? 

Hi Matt.  Yes, it often completes successfully and only fails about half of the time.  I extracted all attempts from the UNIFYBroker log yesterday, see this file: Aurion report connection summary.xlsx

I ran my test harness on a different server, so that may be relevant.  I will move it into the UNIFYConnect UNIFYBroker service and run it as a scheduled job, to see what the outcome is when it's on exactly the same compute/IP.

The LOGOFF request hasn't ever failed (i.e., non-2xx HTTP status which would cause Invoke-WebRequest to throw an exception, or a long-lived client session that eventually times out) in my test harness executions.  But note that the test harness ignores the XML content returned by the call.

If the report itself returns successfully but the LOGOFF fails, does UNIFYBroker abort the connector import?  If so then maybe it could be changed to report the LOGOFF failure as a warning, rather than aborting the import altogether.
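The behaviour suggested above could look something like the following sketch. It is not UNIFYBroker code and makes no claim about how the connector is actually written: `import_all`, `query` and `logoff` are hypothetical stand-ins. A failed logoff after a successful query is logged as a warning, and a failed logoff after a failed query never hides the original query error.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("aurion_sketch")


def import_all(query: Callable[[], Iterable[dict]],
               logoff: Callable[[], None]) -> List[dict]:
    """Fetch all entities, treating logoff as best-effort cleanup."""
    try:
        entities = list(query())
    except Exception:
        # The query failed: still try to release the session, but make
        # sure the original query error is the one that propagates.
        try:
            logoff()
        except Exception as cleanup_error:
            log.warning("LOGOFF also failed: %s", cleanup_error)
        raise
    try:
        logoff()
    except Exception as cleanup_error:
        # The data was already retrieved, so only warn.
        log.warning("LOGOFF failed after a successful query: %s", cleanup_error)
    return entities


if __name__ == "__main__":
    def fake_query():
        return [{"id": 1}, {"id": 2}]

    def failing_logoff():
        raise ConnectionError("connected party did not respond")

    print(len(import_all(fake_query, failing_logoff)))
```

The first branch matters for diagnosis: if the query itself is what timed out, reporting the LOGOFF failure instead would point the investigation at the wrong request.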

Hi Adrian,

Have you had a chance to re-test your harness on the same box to see if the problem reproduces?

The scheduled job shows behaviour consistent with UNIFYBroker - the QUERY_TO_XML call hangs (in my case there is no client timeout so it hangs seemingly forever, still running after over an hour).  So the problem connecting with the Aurion API call is isolated to the UNIFYConnect pod.

I have been working with Aurion's techs and they have the following assessment:

Hi Adrian
Our WAF will terminate the connection at 30 minutes, I suspect there may be something up with your routing as you should not still be waiting for a response at 1 hour.
Even if Aurion takes longer than 30 minutes to generate a response the TCP tunnel will be closed at 30 minutes.
At this stage I can only recommend trying from another internet connection as it sounds like a routing issue where you are not receiving responses.
I am not able to say the issue is from our side as a large number of customers use SOAP and this applies to all clients.
We would have certainly had more reports of this if it was the case.
Robert O'Donoghue
Technical Consultant

In light of this (and not being able to replicate the issue outside of UNIFYConnect) I am increasingly confident that the issue lies in some aspect of the UNIFYConnect network/server infrastructure.