Answered

FIM MA run profiles invoked unexpectedly when scheduler stopped

Bob Bradley 8 years ago • updated by anonymous 3 years ago 28

The fact that the scheduler service is not running does not prevent the FIM custom workflow from firing the nominated Event Broker operation list.
While this arguably makes sense when you consider that this is the only non-polling interface for EvB, it was still unexpected because it would be fair to say that most FIM administrators would expect that stopping the scheduler should disable all interaction with FIM.
This issue probably falls into the "traps for young players" category, as there is an easy work-around as long as you remember to ALSO disable the corresponding FIM MA (inbound) operation list. My point is that I expect 9 out of 10 admins will fall victim to this oversight unless this "feature" is spelled out clearly somehow.
Unless this feature was unintentional, then we probably need to talk this one through to decide what's best to do here ... agreed?

Update - the implications are actually more serious, as disabling the FIM MA inbound operation list causes the FIM workflow to throw the following exception:

Error connecting to Event Broker, please review the inner exception: Operation list "FIM Incoming" with id 0d85f25b-42aa-4058-8c41-f2e33b21cd33 cannot be run because it is disabled.

This is not a good thing ... I was expecting that because I had "queue missed" set ON for this operation that the call would be queued for later processing. Obviously not.

We therefore appear to have a design issue which needs to be resolved. Maybe there are actually 2 issues here?

Just thinking about this ... perhaps one option would be to catch the above exception and not throw it, logging a warning instead? It's not really an error if the intent was not to run the operation, although if you code around this error then you are effectively giving the operation no chance of implementing "queue missed" ...

Curiously I am noticing that if the FIM MA is running an export, causing updates to the FIM db which in turn fire the Event Broker workflow, then there is no exception thrown but the operation doesn't run after the export has completed. In other words, a return status of "no-start-ma-already-running" appears to be detected OK as a success status, but there is no queued "try later" because we don't have a polling model here. Again ... a new dilemma ...

Hi Bob,

I'll tackle this one comment at a time as you seem to be jumping around the place a bit:

  • Regarding your initial report, I agree with you entirely - I thought I'd already changed it to work that way, but if this is not the case it's a fairly easy fix and I can have an RC2 to you today.
  • Your second point (first comment) touches on a completely different issue. A disabled operation list does not queue, and it was never intended to work that way - queueing applies to non-disabled lists that are asked to run while the scheduler is on but cannot run due to group configuration. For example, there's no point queuing something that polls every 10 seconds, but something that's triggered from an external source, or on a timing that only runs once a day, you would want to queue. Have we not made this clear?
  • Not sure how to appropriately address your third point (second comment), but my understanding from what Matt has told me is that an exception is the only way of informing the workflow itself that it did not complete, i.e. if we do not throw an exception, the workflow assumes the request worked correctly even though the list is disabled and was therefore not fired. What would you change here?
  • Not sure what you mean by your final point, Matthew may have to lend assistance here.

I believe the final comment relates to the following. Please confirm:

  • The scheduler and operation list are enabled
  • The FIM MA is already running a run profile
  • The Portal Workflow attempts to run the baselining operation list, but since the management agent is already running, the run profile is not fired.
  • Even if the operation list is marked as queue missed, it does not attempt to fire when the management agent frees up again

Currently, the Portal Workflow is only concerned with whether or not it is able to fire the operation list successfully (as reflected in the error messages that can be received). It does not sit and wait for an operation list result from Event Broker. The alternative is that objects could be left in the PostProcessing state for hours while waiting for long running operation lists to complete execution. I am not fully aware of how the FIM Portal handles threading in its post processing, but I imagine there is potential here to have impact on the performance of the Portal.

I believe you're asking that, in this case, the operation list should go on the queue and retry until it can complete execution. This functionality does exist if you do the following:

  • Remove the no-start-ma-already-running from the success statuses. If you don't want this to affect the other operations for the agent, you could create an identical agent with this success status removed, and make the FIM MA baselining operation list use this one.
  • Change the retry settings on the first operation in the list, which is presumably the full import full sync on the FIM MA.
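The effect of the two steps above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not Event Broker code: the `Agent` class, `run_with_retry`, `make_busy_ma` and the plain "success" status string are all hypothetical names, and only "no-start-ma-already-running" is taken from this thread. The assumption being shown is that a status counts as a failure (and so triggers the retry settings) only when it is absent from the agent's success statuses.

```python
from dataclasses import dataclass

# The only status string taken from the thread; "success" below is an assumption.
ALREADY_RUNNING = "no-start-ma-already-running"


@dataclass
class Agent:
    """Hypothetical stand-in for an agent: success statuses live at agent level."""
    name: str
    success_statuses: set


def run_with_retry(agent, run_profile, max_retries, wait=lambda: None):
    """Call run_profile() until it returns a status the agent treats as success.

    Returns (succeeded, attempts). One initial attempt plus up to max_retries retries.
    """
    for attempt in range(1, max_retries + 2):
        status = run_profile()
        if status in agent.success_statuses:
            return True, attempt
        if attempt <= max_retries:
            wait()  # placeholder for the configured retry interval
    return False, max_retries + 1


def make_busy_ma(busy_calls):
    """Simulate a management agent that is busy for the first busy_calls requests."""
    calls = {"n": 0}

    def run_profile():
        calls["n"] += 1
        return ALREADY_RUNNING if calls["n"] <= busy_calls else "success"

    return run_profile


# Default agent: "already running" counts as success, so nothing is retried.
default_agent = Agent("FIM Agent", {"success", ALREADY_RUNNING})
# Identical agent with the status removed: the operation retries until the MA is free.
retry_agent = Agent("FIM Agent (retry)", {"success"})
```

With the default agent, the first "already running" response is accepted and the run profile never actually executes; with the second agent, the same response is treated as a failure and the operation is retried until the management agent frees up or the retries run out.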

If the baselining operation list is a trigger member in an exclusion group, it can stop any other delta processing from firing while it is running.

Thanks Matt ... no, I wasn't talking about the baseline operation, just the ongoing delta import/delta sync run profile operation. Your answer does give me something to go on in terms of trying to get around the no-start-ma-already-running issue. However, as there is no agent for the FIM MA (and yes, I don't want this to affect the other operations), I don't see how I can tell just the FIM MA operation to ignore the "no-start-ma-already-running" status. The point I was making is that the operation list is active and set to queue missed, and yet it wasn't firing a delta import/delta sync after the export had finished.

Can you have a look at my comment and call me if you still have any questions, Matt? Not online for the next 30 mins.

I don't have any questions on that. A separate FIM agent can be used for operations just using that FIM MA, with the no-start-ma-already-running removed from its success statuses. This is because the success status is configured at the agent level and cannot be configured on a per-operation basis.

Bob, I need to complete this work today so can you please read over the comments I made on Friday - the first point I addressed will be completed as discussed unless you object now. Can you also acknowledge the expected behaviour I have outlined in points 2 and 3 and post any objections you have for this version (keeping in mind that we're happy to sit down and discuss design changes with you in the next version).

Patrick - I spoke to Matt about this on Friday.
1. Great - fixing this will mean that the main concern I had when I raised this issue has been addressed (i.e. people will think that turning off the scheduler should be enough to prevent operations from running, regardless of the way they are triggered).
2. Yes you made that clear enough ... the issue is really all about working with a style of plug-in (non-polling) that is new to Event Broker, and as a consequence a number of these scenarios have not been seen before, and are going to generate a lot of questions for our clients (i.e. "what is best practice here?"). I'll comment more after I've addressed the other 2 points ...
3. I understand the dilemma, and yes you might expect that throwing an exception is appropriate. However, I believe that there are really now at least 3 scenarios we need to consider - and I think that I have a workable model to deal with them (see below).
4. Matt and I have discussed this last issue ... and I am experimenting now with a second FIM agent to allow for a different set of success status values to apply under different scenarios. Not sure if this is going to make it into the "best practice" model going forward, as we're still experimenting with what makes best sense ...

OK ... given that you agree we should reject external requests to fire operations under certain conditions, I believe we need to consider several questions, i.e. What should happen to external operation requests when
(a) the EvB service is running but the scheduler is disabled,
(b) the EvB service is running and the scheduler is enabled but the operation is disabled,
(c) the EvB service is stopped

We need to make sure that the EvB product "plays nice" in the FIM operational world, and that means that it should behave consistently. Operators tend to panic BIG TIME if a service or scheduled process starts throwing 100s or 1000s of exceptions, so we've got to be VERY careful about how to strike a balance between getting important info back to operators and not causing them to go into melt-down.

I have been working on the demo environment over the weekend, and with the latest Event Broker bug where the main EvB service couldn't be restarted, the result was just under 4000 "PostProcessingError" failures in the FIM request history. This number of errors was due to the large amount of changes I was pumping into the FIM portal at the time, and I have tied the Event Broker workflow to quite a number of policies as they all trigger changes the sync engine needs to read back ... it is this sort of scenario that will freak people out ... unfortunately even if we do explain the rules loud and clear up front.

I need to post this comment quickly as I know you are looking at this now ... but I have more to say here ...

Thank you Bob, I appreciate that you can spare time to answer our questions - I knew you'd talk to Matt about the final point but wasn't sure how much depth you went in to the others, and rather than lose something in translation I wanted to ask you directly.

I'll proceed on the first point as planned, but it sounds like there needs to be further thought put in to ensuring we degrade gracefully without a large number of errors. Not quite sure how we'll handle it at this stage, we may have to sit down and talk it through shortly, I'll see what time we can allocate.

Thursday is still RTM day no matter what happens, so worst case scenario is a v3.0.1 patch to improve this later.

OK - so how do we give a consistent experience, keep operators informed, but not overload them with 1000s of errors?

My first thoughts on the 3 scenarios above are that we should probably do this:
(a) ignore the request and return a success status, but queue the request
(b) ignore the request and return a success status, but queue the request
(c) throw an exception (current behaviour) ... but make it clear in the product doco the VERY SIGNIFICANT impact of the EvB service not running

Now I don't know what is feasible now in regards to (a) or (b), but my thinking is that both of these scenarios are akin to putting your car in neutral while the engine is running. If your car didn't have neutral you'd have to always turn the engine off, and that would be a pain ... so if we don't expect them to turn off the engine (scheduler as a whole or just the operation) then we don't want to hear the sound of the gears grinding either.

If the "queue missed" idea was all about exclusive groups, then I wasn't aware of this, because there is at least this other scenario where you would want to do the same. If I think of how a paired plugin works (like the one for AD) vs. a non-paired plugin (like the one for Identity Broker) - i.e. paired plugins need to be told to stop retrying whereas non-paired ones don't care, then this is kind of like a paired plug-in scenario. However, in this case the workflow cannot store state anywhere ... unlike when we were using the file changes plug-in as a stop-gap.

I am thinking that the quickest short-term solution might be simply adding 3 checkboxes for each of the above scenarios, and asking the implementer to decide on what to do. The checkbox labels could be as follows:

X report failures when the EvB service is running but the scheduler is disabled,
X report failures when the EvB service is running and the scheduler is enabled but the operation is disabled,
X report failures when the EvB service is stopped

By default they could all be OFF and that way we are handballing the decision to the operators ... and the next version could perhaps queue requests for the first 2 of them (if we can't do this now).
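As a rough sketch of the three-checkbox idea, assuming the options live on the workflow activity and simply decide whether a given Event Broker state is surfaced as a failure: the Python below is illustrative only, and `BrokerState`, `ReportFailureOptions` and `should_report_failure` are hypothetical names, not part of Event Broker or FIM.

```python
from dataclasses import dataclass
from enum import Enum


class BrokerState(Enum):
    """The three scenarios (a), (b) and (c) from the discussion above."""
    SCHEDULER_DISABLED = "service running, scheduler disabled"
    OPERATION_DISABLED = "service running, scheduler enabled, operation disabled"
    SERVICE_STOPPED = "service stopped"


@dataclass
class ReportFailureOptions:
    """One flag per checkbox; all OFF by default, as proposed."""
    scheduler_disabled: bool = False
    operation_disabled: bool = False
    service_stopped: bool = False


def should_report_failure(state: BrokerState, options: ReportFailureOptions) -> bool:
    """Return True if the workflow should surface this state as a failure."""
    return {
        BrokerState.SCHEDULER_DISABLED: options.scheduler_disabled,
        BrokerState.OPERATION_DISABLED: options.operation_disabled,
        BrokerState.SERVICE_STOPPED: options.service_stopped,
    }[state]
```

With the defaults, no state produces a reported failure; an implementer who wants to hear about a stopped service would tick only that box, for example `ReportFailureOptions(service_stopped=True)`.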

I quite like your latest suggestion for letting the operator choose and I'd definitely aim for that in the future. However, your assessment of it being the quickest short-term solution is actually well off the mark - it's the most difficult of the suggestions you've outlined so far. It would require changes to every single layer of Event Broker, including the UI, the WCF endpoint definition and the engine, which unfortunately means it's out of the question for this week.

However, my counter-suggestions are as follows:

(a) and (b) - The "ignore the request and return a success status" part can be done, and we should also be able to log an internal Event Broker warning that it was ignored - however, queuing the request may have to wait until the first patch. I'll need to review what changes would be necessary in order to get this to work. Please understand that I agree with you entirely, but I'd rather not have it than introduce problems by attempting such a change at this stage - if we don't do it, it will be with the full intention of making it top of the list for v3.0.1 (which will not be "months" away).

(c) - Agree with you there as well - we actually have to throw an error here because there's nowhere else to log and as WMI is the only real requirement of keeping the Event Broker service running it shouldn't ever really be switched off.

I just read your response Patrick, and the 3 checkboxes I was thinking of are on the UI for Matt's workflow activity - not anywhere within the Event Broker config itself - since this is all about interpreting what the Event Broker engine is returning to the FIM workflow instance ... does that change your thinking???

Potentially - it would depend on our capability of catching and suppressing errors as most of it's occurring on the service. I'll sit down with Matthew later and make whatever changes are possible and let you know.

Hi Matt,

I've made some minor changes to the RunOperationList method of the OperationEngine. However, it was correct before and shouldn't have been running operation lists when the scheduler was off, despite what Bob indicated. Can you please test with the existing RC1 to see what happens when you try to run a list while the scheduler is off?

Furthermore, is there a reason the PortalWorkflow assembly doesn't know about the custom Event Broker exceptions? And why is a "CommunicationException" thrown when an exception is caught?

Operation lists did not fire in RC1 while the scheduler was off. I can also confirm the below appears in the Event Broker log in the most recent build.

28/06/2011 3:41:31 PM Warning FIM Event Broker Operation Engine The operation list cannot be executed as the scheduler has not been started.

The FIM Portal Workflow needed to be updated slightly to reference the correct assembly version (still requires the 4 digit version number even though the 2 digit number is being used).

The Portal Workflow is also now throwing the standard exception type.

Following on from Matt's comment, Bob, what gave you the impression that it was firing?

Patrick - the FIM workflow history was full of exceptions whenever the workflow fired and the operation was disabled. Fact.

Throwing an exception under that circumstance was intentional prior to your (very helpful) suggestions about not filling the log up with exceptions - but I thought the point of this issue was that Event Broker was running operation lists the workflow had requested despite the fact the scheduler was off?

Patrick - that is true, that WAS the original point of the issue. Now that it's going to reject them when the scheduler is off, the overtone switched to one of how to best handle this scenario (i.e. how to NOT fill up the request history). I'm hearing that we're reaching convergence here ...

Sorry Bob, I think our team has completely misinterpreted your original description here - we thought you were reporting that the operation list was actually executing in Event Broker when requested by the workflow even when the scheduler was switched off. If all you're suggesting is that the workflow is just firing and making the request even when the list is disabled or the scheduler is off, then that makes a LOT more sense - although that's the only way it could ever work, as the workflow can't know until after the request is made what state Event Broker and the operation list are in.

So, if that is what you meant and your desire is to not pollute the FIM Portal with thousands of errors, then I completely agree, and we've made changes that have resulted in the following:

  • If the operation list does not exist or the Event Broker license has expired, an internal Event Broker error is logged and no exception is thrown to the workflow
  • If the operation list is disabled or the scheduler is not running, an internal Event Broker warning is logged and no exception is thrown to the workflow
  • If the Event Broker service is not running at all, the workflow will error
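The three outcomes listed above can be summarised as a small decision function. This is a sketch in Python for illustration only, not the actual RunOperationList implementation: `Condition`, `Outcome`, `ServiceUnavailable` and `handle_request` are hypothetical names, and the real engine logs to the Event Broker log rather than returning a log level.

```python
from enum import Enum, auto
from typing import NamedTuple, Optional


class Condition(Enum):
    """Hypothetical states an external run request can encounter."""
    OK = auto()
    LIST_MISSING = auto()
    LICENSE_EXPIRED = auto()
    LIST_DISABLED = auto()
    SCHEDULER_STOPPED = auto()
    SERVICE_STOPPED = auto()


class Outcome(NamedTuple):
    ran: bool                 # whether the operation list was started
    log_level: Optional[str]  # level of the internal log entry, if any


class ServiceUnavailable(Exception):
    """Stands in for the error the workflow sees when the service is down."""


def handle_request(condition: Condition) -> Outcome:
    # Service not running at all: nowhere to log, so the workflow errors.
    if condition is Condition.SERVICE_STOPPED:
        raise ServiceUnavailable("Event Broker service is not running")
    # Missing list or expired license: internal error, no exception to the workflow.
    if condition in (Condition.LIST_MISSING, Condition.LICENSE_EXPIRED):
        return Outcome(ran=False, log_level="error")
    # Disabled list or stopped scheduler: internal warning, no exception to the workflow.
    if condition in (Condition.LIST_DISABLED, Condition.SCHEDULER_STOPPED):
        return Outcome(ran=False, log_level="warning")
    return Outcome(ran=True, log_level=None)
```

Only the stopped-service case reaches the FIM request history as a failure; the other cases are visible solely in the Event Broker log.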

In all circumstances I would be very eager to work with you in the next version to provide more flexibility to give administrators more choice about what will happen in these scenarios.

Patrick - I'm sorry for the confusion, but you didn't misinterpret the situation at all. Yes the Event Broker operation ran when the scheduler was OFF, and as a consequence no exceptions were raised when this happened. It wasn't until I was forced to disable the operation (the scheduler was already OFF) in order to stop the operations from firing that the workflows started failing, and it was at this point that I realized there was more to it than met the eye. I am happy with the resolution dot points you list above because it is largely consistent with what I had arrived at further above, and I appreciate this is a "first cut" at something for which there is absolutely no precedent ... hence my concern that we don't have operators reject the product on something trivial like filling up logs.
Did I read Matt correctly in him saying that the operation couldn't have fired if the scheduler was off? If that is the case (and it definitely did happen this way), then I'm wondering if the scheduler wasn't really shut down properly ... i.e. there was a lingering process component responding to the request that was being held open by a previous call ...

Also, please understand that at the time of night I was logging this info, I wasn't able to rationalise the various different issues as separate entities in their own right ... I was just trying to capture as much as I could in a chronological sequence so that we could join the dots later. If that means that part of the initial response is to break down issues like EB-381 into sub issues then so be it ... but when I'm trying to capture what I'm seeing in front of me, it is unlikely that the info will be recorded in the optimal way for you to consume. Maybe we need to change the way we log our initial observations as just that ... observations ... and then from these deduce what issues need to be raised? Anyhow, with timing being of the essence in this case I was just concerned with logging SOMETHING asap.

In response to an email from Bob on outgoing operations failing to fire, I have updated the EB300:Groups page, and added a new item to the EB300:Troubleshooting section

Thanks Matt ... my exclusion group definitions had been changed, presumably by the upgrade to the latest RC. Fixing these resulted in the FIM MA outgoing firing correctly.

The run profiles are no longer firing unexpectedly, but some of the issues discussed in this thread are to be the subject of a new feature request to better handle queuing of externally requested run profiles which returned a "no-start-ma-already-running" status.