Office Connector Import fails with System.Net.WebException: The operation has timed out
Bob Bradley 7 years ago in UNIFYBroker/Microsoft Office Enterprise • updated by Curtis Lusmore 5 years ago • 33
QBE reported this week that they are continuing to have long periods (several hours) where licenses are not being assigned, and are having to manually restart the IdB service multiple times a day (the service is already restarted each night at 4 am). The timeout error continues to be reported in the logs.
See JIRA ticket QBE-59 for more details.
Change detection engine import changes failed.
Could you please let me know what the polling timeout value is set to on the agent? If it's only 1m40s, please try increasing it and see whether that resolves the issue.
After changing the timeout to 5:00 and rerunning, the full import completed successfully on the first attempt, as expected. However, the subsequent run produced the following exception:
I'm having trouble seeing how that could suddenly start happening when it just worked successfully. All of that logic is abstracted away by the API. Are you able to capture the request (e.g. with Fiddler)?
Adam - this being Production, no, I can't run Fiddler. I might be able to run something more lightweight, but it will have to wait until tomorrow. I'm not sure what the best state is to leave this in for now. It has always worked only on the first import and ALWAYS failed on the subsequent one - the only difference is that it now fails with an error instead of seemingly hanging forever until the service is restarted.
Oh, I wasn't aware this was the behaviour all along - it has never behaved that way for me!
I'll have to do some looking into it.
Currently unable to reproduce. Subsequent imports are working fine (I've done two sets of runs: three and then two). Each takes ~10 minutes for ~45,000 objects.
From your last set of logs, there is only one full import (and it completes in ~9 minutes):
Adam - when I had both adapters disabled I was also unable to reproduce the problem in PROD (after restarting the service). I have now re-enabled the adapters to see whether the result is the same, noting that the significant change-processing backlog will take some time to clear.
Testing this morning identified that the "Test Connection" action for the AAD agent was failing (no response) until the IdB service was restarted. At the same time the polling connector had been consistently timing out (after 5 minutes), and it too began working immediately after the service was recycled. Logging indicated that login.windows.net was the only one of the three Graph endpoints being queried. This points to an ongoing authentication problem, and suggests a stale authentication token may be persisted beyond its expiry period.
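If the stale-token hypothesis is right, a defensive workaround is to refresh the token proactively rather than trusting the cache until a call times out. A minimal Python sketch of the idea (the `TokenCache` name, the `acquire` callback, and the 5-minute buffer are all assumptions for illustration, not IdB internals):

```python
import time

# Refresh this many seconds before the token's stated expiry, so a
# long-running import never starts with a nearly-expired token.
REFRESH_BUFFER_SECS = 300

class TokenCache:
    """Caches an access token and refreshes it before it goes stale."""

    def __init__(self, acquire):
        self._acquire = acquire      # callable returning (token, expires_on)
        self._token = None
        self._expires_on = 0.0

    def get_token(self, now=None):
        now = time.time() if now is None else now
        # Treat the cached token as stale once it enters the buffer window.
        if self._token is None or now >= self._expires_on - REFRESH_BUFFER_SECS:
            self._token, self._expires_on = self._acquire()
        return self._token
```

The point of the buffer is that an expiry check done only at acquisition time can still hand a token to a request that outlives it; refreshing early sidesteps that race.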
Still having trouble reproducing, and I've confirmed that the auth code follows the sample code exactly; I've also read up on how the token cache works and what the recommendations are for it.
What I was able to do was update the auth libraries (which were quite outdated). I’ve uploaded a new version that essentially just makes use of the new libraries. Please try it out and let me know. If that fails the next step is to try and capture some more information (possibly asking for permission to run Fiddler on the box).
Installed version 220.127.116.11 from https://unifysolutions.jira.com/wiki/display/SUBIDBMOE/Downloads in DEV - will monitor over next few days.
Latest occurrence of problem logged in JIRA here with this latest version.
Is that with or without the restarts? Is that 2 weeks that it has been improved for?
Adam - the restarts are still in place (every 4 hours), and my point was that despite the restarts and the upgrade we are still getting timeouts - albeit yesterday's is the only one in the logs for the last 5 days. I expect the restarts are reducing the frequency, but we only need one to see that there is still a problem.
I'm having trouble getting the build to work. I'll take another look tomorrow.
If you recall, the changes that I'm making are essentially guesses to work around the bug in the API that is causing this to fail. Any more information that you could capture would be greatly appreciated (e.g. traces).
Adam - the best I can do right now is give you the VERBOSE log from the day the last timeout occurred - here. If I were to run a trace for, say, a 12-hour period and turn off the restart operation to attempt to reproduce the problem (as we have already tried), we would disrupt production - something I can't do. I think we need to run some sort of independent test harness from within the PROD environment, talking to the same endpoints, and trace that. Thoughts?
Is there a spare box that could have IdB running the imports continuously, with a .NET network trace on? It will likely consume a fair bit of disk space, however.
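For reference, .NET's built-in System.Net tracing can be switched on in the service's app.config without code changes. A sketch using the standard System.Diagnostics schema (the listener name and log path are placeholders; as noted, Verbose output will chew through disk quickly):

```xml
<system.diagnostics>
  <sources>
    <source name="System.Net" tracemode="includehex" maxdatasize="1024">
      <listeners>
        <add name="NetTrace" />
      </listeners>
    </source>
  </sources>
  <switches>
    <add name="System.Net" value="Verbose" />
  </switches>
  <sharedListeners>
    <add name="NetTrace"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="network.log" />
  </sharedListeners>
  <trace autoflush="true" />
</system.diagnostics>
```

This captures connection setup, TLS negotiation, and request/response activity at the System.Net layer, which is exactly the layer throwing the WebException here.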
Another possibility: through research I came across this - worth giving it a shot?
I read the article, then rang Scott and asked whether there were any limits on the firewall. He is going to try to find out, but suggested that regardless of the firewall there will be a limit on the NAT (all internal IPs are published here). There will be a limit on the number of concurrent connections per IP, and he will find out what it is if he can. So what you are suggesting is that we set a value in the IdB service config file that is lower than the known limit?
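Assuming the IdB service honours the standard System.Net settings, the usual mechanism for capping outbound connections per endpoint is the connectionManagement element in its config file. A sketch (the value 16 is a placeholder until the real NAT limit is known):

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- "*" applies the cap to all remote endpoints; set the number
           below the confirmed per-IP NAT/firewall connection limit. -->
      <add address="*" maxconnection="16" />
    </connectionManagement>
  </system.net>
</configuration>
```

The default for a .NET service is only two connections per endpoint unless raised, so a cap here is mainly useful if something else in the process has lifted ServicePointManager.DefaultConnectionLimit and connections are piling up past the NAT's limit.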
Yes - there are suggestions that the connections are not cleaned up in time and bunch up past the limit, giving us the (unhelpfully vague) timeout error.
So without yet knowing what the limit might be, is there any point adding a guess to the config now? I would have to stop recycling the service to test the effect.
As for the question of a spare box, the answer is no. However, we can get more disk assigned to the existing server and run an EXE on the same box as the IdB service, although being PROD and in a change-freeze period, I suspect I will have trouble getting this idea approved.
It would also require a completely new program to be written, or a console used instead of a Windows service.
Timely post from Jorge here overnight on the AAD connector for AADConnect - I suspect something similar is happening with the REST API call for IdB.
There's very little detail in that post - just a bunch of errors (which I can't tell whether or not are the same as what we're seeing) and a lucky guess at the fix. What's actually happening, though? Why did whatever he did fix the problem?
What caught my attention was this:
The rest is neither here nor there - just a pointer to say that the OOTB hybrid identity bridge is going out in sympathy with IdB. The resolution Jorge found is indeed a guess, but exactly what I would have done myself, given that it's about the only option we have to reset it (short of bouncing AADConnect) - akin to following a stopped-server error with a full import (FI) instead of a delta import (DI).
So no idea if it is related - just thought it would be of interest in building out the picture of the problem space.
While I am not able to progress this issue further without a support agreement in place, I thought this post might provide a possible clue, given that the problem appears to occur after an extended period of successful operation and, from the investigation so far, looks most likely to be an authentication issue.
While possibly not directly related, I wondered if the suggested FBA option could be a clue. I also wondered whether some sort of dialog might be raised in response to an HTTP request from IdB, which of course could never be seen or responded to ...
Any update Bob?