Relieving backlog situations after maintenance work and system outages

After server maintenance and system outages, backlogs and resource bottlenecks inevitably build up in workload automation and need to be resolved as quickly as possible. In a backlog situation, the operator team has to take the necessary measures to maintain production operations and return to normal operation as quickly as possible. Key or critical batches and jobs have to be given a higher priority, while unimportant or non-critical jobs have to be run with a lower priority or stopped completely. These interventions also have to be applied to newly submitted batches and jobs. Without the support of the scheduling system, this requires a lot of manual work.

Backlog situations that occur after downtimes create jams on the data highway that require an emergency lane

Added to this is the fact that the important questions concerning these decisions – such as “What is important?”, “What needs to run?”, “What can be stopped?” – have to be clarified beforehand with the management so that the operator can initiate the correct measures. Whereas this is still quite simple in the case of scheduled maintenance work where the operator team can prepare itself accordingly, Murphy’s Law dictates that system failures frequently occur just when things are getting really hectic anyway.

If the scheduling system offers no effective support in such a situation, the operator has to intervene under time pressure in a way that both meets the requirements set by the management and fits the actual, possibly chaotic situation. The pressure of responsibility and the stress rise to very high levels, since the system has to be brought back up and the backlog resolved at the same time. Above all, however, some extremely critical and time-sensitive batches and jobs require an emergency lane in order to bypass the less critical and possibly lengthy jobs that are pending or still running.

In a complex heterogeneous system environment with hundreds of servers, thousands of jobs and countless dependencies, this can be extremely nerve-racking for the operator. For this reason, we have developed Nice Profiles for the ENTERPRISE Edition of BICsuite, which was designed specifically for extremely large and highly complex system environments. Nice Profiles allow measures previously agreed upon with the management to be captured in a structured form and applied with a minimum of effort, even before the backlog actually occurs.

The ENTERPRISE Edition of the BICsuite Enterprise Scheduling System thus supports two effective mechanisms for the controlled handling of backlogs: the suspend timeout for time-scheduled submits and Nice Profiles (from BICsuite R2.6.1). The following applies to all editions of BICsuite: after a downtime of the BICsuite Scheduling Server, ‘missed’ time-scheduled submits are submitted in the ‘suspended’ state if too much time has elapsed since the planned submit (suspend timeout). The suspend timeout can be configured for each scheduled batch or job. Following a longer downtime, the operator can therefore decide whether workflows are to be executed, and which ones.
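The suspend timeout rule can be illustrated with a minimal sketch. This is not BICsuite code: the function name `submit_state`, the state labels and the strict "greater than" comparison are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def submit_state(planned: datetime, now: datetime, suspend_timeout: timedelta) -> str:
    """Decide how a missed time-scheduled submit is handled after a restart.

    Hypothetical rule: if more time than the configured suspend timeout has
    passed since the planned submit, the workflow is submitted suspended so
    that the operator can decide whether it should still run.
    """
    overdue = now - planned
    return "suspended" if overdue > suspend_timeout else "submitted"


# The server was down from before 02:00 until 06:30.
planned = datetime(2024, 1, 1, 2, 0)
restart = datetime(2024, 1, 1, 6, 30)

# A batch with a one-hour timeout waits for the operator ...
print(submit_state(planned, restart, timedelta(hours=1)))
# ... while a batch with a twelve-hour timeout is simply caught up.
print(submit_state(planned, restart, timedelta(hours=12)))
```

Because the timeout is set per batch or job, time-sensitive workflows can be held back for a decision while tolerant ones catch up on their own.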

A Nice Profile adds the infrastructure and organisation needed to define beforehand what is to be done in the event of a backlog. Staying with the motorway example: ambulances and emergency doctors are given a built-in right of way, while particularly slow and unimportant transports have to wait on the hard shoulder until the normal traffic is running smoothly again. To do this, the operator defines a change in the priority (Nice Value) and/or the suspend status for a list of workflows or folders. A folder entry applies to all the workflows that are defined under this folder. When the profile is activated, these changes are automatically applied to all already submitted (running) and newly submitted workflows. Multiple Nice Profiles can be activated simultaneously. When a Nice Profile is deactivated, the changes made by the activation are undone for all the workflows that are still running.
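The activation and deactivation behaviour described above can be modelled in a few lines. The sketch below is a simplified illustration, not the BICsuite implementation: the classes `Entry`, `Workflow` and `Scheduler`, the folder and workflow paths, and the convention that a lower Nice Value runs earlier are all assumptions. Effective values are computed on demand from the currently active profiles, which is one simple way to cover both already submitted and newly submitted workflows and to make deactivation undo the changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str              # folder or workflow path (hypothetical layout)
    nice_delta: int = 0    # change to the Nice Value; assumed: lower runs earlier
    suspend: bool = False  # whether matching workflows are suspended


@dataclass
class Workflow:
    path: str
    base_nice: int = 0


def covers(entry_path: str, workflow_path: str) -> bool:
    """A folder entry covers every workflow defined beneath that folder."""
    prefix = entry_path.rstrip("/") + "/"
    return workflow_path == entry_path or workflow_path.startswith(prefix)


class Scheduler:
    def __init__(self):
        self.active = {}      # profile name -> list of Entry
        self.workflows = []   # submitted workflows

    def activate(self, name, entries):
        self.active[name] = list(entries)

    def deactivate(self, name):
        self.active.pop(name, None)

    def submit(self, workflow):
        self.workflows.append(workflow)

    def effective(self, workflow):
        """Return (nice value, suspended) under all currently active profiles."""
        nice, suspended = workflow.base_nice, False
        for entries in self.active.values():
            for entry in entries:
                if covers(entry.path, workflow.path):
                    nice += entry.nice_delta
                    suspended = suspended or entry.suspend
        return nice, suspended


sched = Scheduler()
billing = Workflow("/Critical/Billing")           # submitted before activation
sched.submit(billing)
sched.activate("Backlog", [Entry("/Critical", nice_delta=-10),
                           Entry("/Optional", suspend=True)])
cleanup = Workflow("/Optional/Cleanup Logfiles")  # submitted after activation
sched.submit(cleanup)

print(sched.effective(billing), sched.effective(cleanup))
sched.deactivate("Backlog")
print(sched.effective(billing), sched.effective(cleanup))
```

In this model several profiles can be active at once; their Nice Value changes simply add up, which is again a choice made for the sketch rather than a statement about how BICsuite combines them.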

The batches and jobs are sorted into the folders “Critical”, “Normal” and “Optional” for which a Nice Profile is now defined for processing the backlog

Folder names or scheduling entity names (batches or jobs) are entered in a Nice Profile, together with the action that is to be executed for each entry. If a folder name is given, the action applies to all the scheduling entities under this folder.

The Nice Profile for processing a backlog

In the “Backlog” Nice Profile, the priority of the “Critical” folder is raised and the priority of the “Normal” folder is lowered. All the batches in the “Optional” folder are stopped. When this Nice Profile is activated, the following details are shown in the monitoring window:

The monitoring window shows that the “Cleanup Logfiles” batch has been stopped

The “Cleanup Logfiles” batch has the status “active” because jobs had already been started in the batch before the Nice Profile was activated. However, no further jobs are started in this batch until the Nice Profile has been deactivated or the batch is resumed by an administrator.
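Why a stopped batch can still be shown as "active" becomes clearer with a small model. The sketch below is an assumption-laden illustration, not BICsuite behaviour taken from its documentation: the classes `Job` and `Batch`, the job names and the exact status rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    state: str = "pending"  # pending -> running -> finished


@dataclass
class Batch:
    name: str
    jobs: list
    suspended: bool = False

    def start_next(self):
        """Start the next pending job, unless the batch is suspended."""
        if self.suspended:
            return None
        for job in self.jobs:
            if job.state == "pending":
                job.state = "running"
                return job
        return None

    def status(self):
        """Assumed rule: a batch is active once any of its jobs has started."""
        if all(job.state == "finished" for job in self.jobs):
            return "finished"
        if any(job.state != "pending" for job in self.jobs):
            return "active"
        return "pending"


batch = Batch("Cleanup Logfiles", [Job("rotate"), Job("compress"), Job("purge")])
batch.start_next()        # "rotate" starts before the profile is activated
batch.suspended = True    # the Nice Profile suspends the batch

print(batch.status())     # the running job keeps the batch active
print(batch.start_next()) # no further job is started while suspended

batch.suspended = False   # profile deactivated or batch resumed
print(batch.start_next().name)
```

The point of the model is that suspension only blocks new job starts; a job that was already running is left alone, so the batch itself remains active.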

With Nice Profiles, BICsuite offers an effective tool for handling backlog situations by creating an emergency lane through the data jam. This allows operational problems that arise following scheduled and unplanned downtimes to be remedied quickly and systematically, and it reduces the stress that system failures cause for the administrators. Please get in touch with us if you have any questions about Nice Profiles or the BICsuite Enterprise Scheduling System in general.