Overview and Statement of Purpose
The purpose of this document is to define OpenJCS (Job Control System), a
state-of-the-art batch scheduling system for use on Unix systems. Through the
intelligent use of load balancing, dependency control, scheduling, runbooks,
queueing and resource allocation subsystems, the objective is to create a
thoroughly integrated system of unmatched scheduling flexibility that still
gives good visibility into system usage, while providing scalability and ease
of use wherever possible.

Scheduling Features
OpenJCS has the features you have come to
expect in enterprise-class schedulers, such as:
* Full dependency control between jobs and across job groups.
* Calendar scheduling, including repetitive intervals and custom calendars.
* Job queueing.
* Resource control and allocation.
* Ad hoc job scheduling and control.
* Job monitoring and completion verification.
* Automatic schedule recovery and restart in the event of scheduling system
  shutdown.
* Custom notifications for failed jobs.

But OpenJCS also provides some features you
might not expect in a commercial scheduler, much less an Open Source
scheduler:
* The ability to aggregate hosts in ways that make sense to you.
* The ability to see why a job is on hold, not just the fact that it is on
  hold.
* The ability to share special OpenJCS variables between jobs, or even
  between systems.
* The ability to look into a job as it executes.
* An API for hooking your programs into OpenJCS.
* The ability to integrate new dependencies and commands into OpenJCS.
* Virtual Host definition so you can cluster hosts that share a given
  workload.
* The ability to prompt for runtime parameters for jobs.
* Security that lets you specify who can run what, when it can run, and under
  whose ID it can run.
* Job Runbooks that allow greater control over how jobs run, and how to verify
  if they ran correctly.
* Job Runtime Analysis.
* Transparent dependency control across and between disparate systems, and
  the ability to synchronize jobs across and between these systems.
* Job Interrupts, Resumption, and Restart.
* Specification of special pre-processors for jobs.
* Transparent transfer of job files to target hosts.

Conceptual Basics

Batch Jobs and Events
OpenJCS is at its heart an
event scheduler. So, what is an event in OpenJCS? An event can be any of the
following:
* A single command.
* A file of commands to be executed.
* A batch job.
* Another dependency or schedule.
* A file containing any combination of the above.

A batch job is a set of
commands that are executed as a unit in OpenJCS’ own batch control system.
The commands can be read from a file, or submitted a line at a time from the
user’s standard input. The commands usually execute under the same user name
as the user who submitted the job, but if the user is given sufficient
permission, the job can be executed as another user as well. Once the job is
successfully submitted, the user will be given a Job Number that can be used
to track the status of the job when it begins to execute. From that point
forward, the job is completely separate from the user’s process tree, and
executes under the control and supervision of OpenJCS.

Dependencies
In OpenJCS, a dependency is
anything that puts a condition on when an event can or cannot execute. For
example, you might not want a job on server A to execute until another job on
server B has successfully finished. The batch control system contains a
large variety of dependencies that the user may use in scheduling. Some of
the dependencies available are: time and date scheduling (including drop dead
date and repetitive interval scheduling), job dependency scheduling,
scheduling by arbitrary logical expression, manual operation confirmation,
file dependencies, resource dependencies, and job concurrency dependencies.
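As a rough illustration of the idea that a dependency is a condition on when an event may execute, the sketch below models each dependency as a predicate and releases an event only when every attached predicate holds. It is written in Python purely for illustration: the names Event, can_run, after_time, job_finished and file_exists, and the shape of the state dictionary, are hypothetical and are not part of OpenJCS.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Callable, List

# Hypothetical model: a dependency is any predicate over the current state.
Dependency = Callable[[dict], bool]


@dataclass
class Event:
    name: str
    dependencies: List[Dependency] = field(default_factory=list)

    def can_run(self, state: dict) -> bool:
        # The event is released only when every attached dependency holds.
        return all(dep(state) for dep in self.dependencies)


def after_time(earliest: time) -> Dependency:
    """Time dependency: do not run before the given time of day."""
    return lambda state: state["now"].time() >= earliest


def job_finished(job_name: str) -> Dependency:
    """Job dependency: another job must have reached the FINISHED state."""
    return lambda state: state["jobs"].get(job_name) == "FINISHED"


def file_exists(path: str) -> Dependency:
    """File dependency: a file must be present."""
    return lambda state: path in state["files"]


report = Event(
    "nightly_report",
    [after_time(time(2, 0)), job_finished("extract"), file_exists("/data/extract.csv")],
)

state = {
    "now": datetime(2024, 1, 1, 3, 0),
    "jobs": {"extract": "RUNNING"},
    "files": {"/data/extract.csv"},
}
print(report.can_run(state))  # False: the extract job has not finished
state["jobs"]["extract"] = "FINISHED"
print(report.can_run(state))  # True: all three dependencies now hold
```

Modelling each dependency as an independent predicate is one simple way to let different kinds of conditions be attached to the same event.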
These dependencies can be mixed and matched as desired.

Time and Date Scheduling – The ability to schedule events according
to time and date dependencies is the most basic requirement for any
scheduler. At the very least, a scheduler should be able to schedule by time
of day, day of week, day of ... But OpenJCS can do more than
just these. It also lets you define your own date patterns for things like
business day calendars and fiscal calendars. This way, you can tailor your
scheduling according to your calendars. OpenJCS even lets you define your own
calendars and temporal units so that you can make your calendars as flexible
as you want.

Logical Dependencies – OpenJCS allows you to schedule an event
for when a condition becomes true, for whenever a condition becomes true,
repeat the event while the condition is true, or repeat the event until a
condition becomes true. So what makes up a logical condition? A logical
condition results from the logical comparison between two items chosen from
the following:
* A constant or literal.
* An environment variable.
* A special JCS variable (see the JCS Variables section or the variable
  command for further information).
* A piece of information about a current job as returned by jobinfo. (See
  library routine jobinfo)
* A piece of information about a file as returned by finfo. (See library
  routine finfo)
* A piece of information about a pending reply as returned by replyinfo. (See
  library routine replyinfo)
* A piece of information about a jobqueue as returned by jobqinfo. (See
  library routine jobqinfo)
* A piece of information about a resource as returned by resinfo. (See
  library routine resinfo)
* A piece of information from an SNMP MIB.
* The output of a programmatically executable command.

Logical conditions can also be
compounded by the use of AND and OR, and can be negated with NOT. Perhaps a
few examples will make this clearer.

Example 1:

    job -launch -file myfile -when finfo datafile,eof > 0 AND jobinfo joba,state = "FINISHED"

This will launch the job file
myfile when the file datafile exists and is nonempty, and job joba has
finished.

Example 2:

    schedule -whenever `ps -eafl|grep sshd|wc -l` < 1 -cmd /etc/init.d/sshd start

This will restart sshd
whenever it is not found on the system.

Manual Confirmation – Sometimes, there are events that just
can’t be accounted for automatically. For example, you may have to wait for a
tape to arrive before a job can be launched. Or you may have to wait for a
manual analysis of a report before continuing on a schedule. In cases such as
these, it would be nice to be able to set the event up, and put it on hold
until you are ready to release it for processing. OpenJCS allows for just
such a mechanism. It allows you to specify that an event must receive manual
confirmation from system management before it can be processed. It also
allows you to tag the hold with a ...

File Dependencies – Files are the ...

Job Dependencies – When you have a job processing system,
the relationships between jobs become extremely important. A job may depend
on one or more other jobs completing before it can be launched. Fortunately,
OpenJCS provides multiple ways to define job dependencies. It even lets you
set up dependencies based on the current state of another job, even if that
job is on another system.

Job Concurrency – Sometimes, you may only want a job to
execute if one or more other jobs are currently running. OpenJCS allows you
to specify job concurrencies that must (or must not) exist before a job can
be launched, and these concurrencies can exist within a system, or across
several systems. Job concurrencies are maintained via Job
Runbooks.

Resource Dependencies – No matter how robust your systems are,
they have their limits. Only so much of a resource is available at any time.
There is only so much CPU available, only so much memory, so much bandwidth,
and so on. The ability to distribute your workload according to resource
availability is absolutely critical. Overloaded systems cause response time ...

Other Dependencies – We have demonstrated lots of
dependencies that dictate things that have to happen before a job can begin
to execute, but what about if we want to start synchronizing jobs that are
already executing? Well, OpenJCS provides mechanisms for doing just that.
First off, OpenJCS provides special variables that can be shared between
processes, jobs, users, groups and even systems in the domain. You can
specify at what level the sharing takes place, and the owner of the variable
can specify exactly what kind of access to the variable is allowed, and by
whom. This creates unmatched opportunities for information sharing and job
synchronization.

The second method of job
synchronization available is in the job control shell (jcsh) itself. It
provides a when command that releases a block for processing when a
logical condition becomes true. This means you can effectively create
dependencies that are checked while the job is executing, as opposed to being
checked before the job can be released for processing.

Batch Environment Definitions
Job – A job is the central unit of work and
entity management in this scheduler. A job is a batch process that can be
formed by a single command, a series of commands or a file of commands. Once
a job is successfully introduced, it is completely separate from the user’s
process tree, and executes under OpenJCS’ batch environment. A job will be
executed as a process that by default runs under the ID of the user that
launched the job. The ID can be changed via the job’s job card or by commands
that alter the job before it begins execution. A job is launched via the
job -launch command.

Serial Queue – A series of jobs that execute sequentially. A serial queue can
be created either via the job -launch command or by attaching an -after
dependency to a job.

Jobqueue – A jobqueue is a way to restrict the
parallel execution of groups of jobs, and to enforce consistent standards on
the jobs in the queue. The maximum number of jobs that can execute at one
time in a jobqueue can be changed as needed. Some of the restrictions and
standards that can be placed on a queue are: minimum job priority a job must
have to be admitted to the queue, maximum number of concurrent jobs in the
queue, maximum load on the hosts serving the queue, job numbers that will be
assigned to the queue, resources demanded by each job in the queue, queue
hierarchy, conditions placed on each job in the queue before it is permitted
to execute, and process execution priority profiles. Jobqueues can also be
nested as desired, such that the jobs in a given queue are also counted
toward the maximum number of simultaneous jobs in the parent of that queue.
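The nesting rule can be pictured with a small sketch. The Python below is only an illustration of the counting behaviour described above, under the assumption that a job may start only if neither its own queue nor any ancestor queue is at its limit. The class JobQueue and its methods are hypothetical names, not the OpenJCS implementation or the jobq command's interface.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class JobQueue:
    name: str
    max_jobs: int                      # maximum simultaneous jobs in this queue
    parent: Optional["JobQueue"] = None
    running: int = 0                   # jobs running here and in nested queues

    def _chain(self) -> Iterator["JobQueue"]:
        # Walk from this queue up through all of its ancestors.
        queue: Optional[JobQueue] = self
        while queue is not None:
            yield queue
            queue = queue.parent

    def can_start(self) -> bool:
        # A job counts toward this queue and every parent queue,
        # so all of them must have a free slot.
        return all(q.running < q.max_jobs for q in self._chain())

    def start_job(self) -> bool:
        if not self.can_start():
            return False               # the job would have to wait
        for q in self._chain():
            q.running += 1
        return True

    def finish_job(self) -> None:
        for q in self._chain():
            q.running -= 1


batch = JobQueue("batch", max_jobs=3)
reports = JobQueue("reports", max_jobs=2, parent=batch)
loads = JobQueue("loads", max_jobs=2, parent=batch)

print(reports.start_job())  # True
print(reports.start_job())  # True
print(reports.start_job())  # False: reports is at its own limit of 2
print(loads.start_job())    # True: batch now holds 3 jobs
print(loads.start_job())    # False: loads has room, but its parent is full
```

In this sketch the parent limit acts as a cap on the combined activity of its child queues, which is the effect the nesting description implies.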
Jobqueues are managed via the jobq command.

Job Number Family – A family of jobs that will receive
similarly formatted job numbers, and will have similar dependencies placed on
them. This permits jobs that are related to be viewed together easily. Job
number families are managed via the jobnum command.

Job Runbooks – A job
runbook is a set of constraints, permissions and instructions to be performed
at the introduction and conclusion of a job. The runbook is actually a file
that contains the list of instructions and permissions, and it is associated
with one or more jobs via the job -runbook command.
In the file, some of the constraints that can be placed on the job are: who
can launch the job, what its priority is, when it can run, what jobs it can
run with (and which ones it can’t), how the job is permitted to log on, what
machines are permitted to run the job, what job queue and job family it will
use, dependencies that will be attached automatically to the job, and what
shell it will use to execute. See the Runbook
section for more information about Job Runbooks.

Preprocessors – Preprocessors are processes that read
job files as input and use various types of substitutions to create the
finished job file as output. The preprocessor could search the job for
special reserved words, for example, and substitute in content specified by
those reserved words.

Resources
Resources provide a way to restrict
concurrency of jobs and simulate various aspects of system load management.
For example, a resource can be created to limit the number of jobs accessing
a given database at any given point in time, or a resource could be created
to keep jobs from being launched when system load is too high, or when high
user traffic is expected. A resource is created via the resource
-add command. Once a resource is created, its use is not automatic.
It has to be explicitly requested by the commands that support resource
dependencies (job -launch, jobq
and schedule, for example). An exception to
this is when a resource requirement is set up in a job’s runbook (see Job
Runbooks).

There are two different allocation methods
for a resource. The first applies when the resource has a
fixed maximum count attached to it. In this case, when a request is made for
that resource, if there are enough unallocated units of that resource
available, the request is granted, and the number of units requested is
deducted from the number available. If there are not enough units of the
resource available, the request is queued until there are enough units
available.

The other allocation method for a resource
occurs when the resource does not have a maximum count attached to it.
Instead, it has an allocation rule attached to it. In this case, whenever the
resource is requested, the attached rule is executed, and the results
returned by the rule dictate whether the request is granted, denied or queued
and retried later. This resource type is very useful for resources that are
more aligned with performance metrics, such as system load, CPU ...

JCS Variables
JCS Variables are similar to
environment variables, but JCS variables can be shared between processes,
jobs, users, groups, systems or even regions as desired. JCS variables can
also be made permanent if desired. This creates opportunities for
unparalleled information sharing. For example, a job can set a given JCS
variable to a value, and another job that has appropriate access to the
variable can read that value, without the jobs having to have any knowledge
of each other. Or a job can wait for a given JCS variable to be set to a
particular value, without having to know which job or process will be
responsible for setting the variable to the desired value. JCS variables can
be used to synchronize processes, share information, and set execution
values, among other things. JCS variables make cooperative processing easier,
even when that cooperation needs to extend between dissimilar systems. JCS
variables also provide access security, so that the owner of a variable can
specify who has access to the variable, and what access is available (ability
to set a variable value, read a variable value or administer permissions for
the variable). JCS Variables are managed via the variable
command.

Regions
A region is a collection of related
hosts. Jobs within a region can share JCS variables and resources. Regions
can also be used as assignment targets for virtual hosts. A region can be a
collection of ...

Virtual Hosts
A virtual host is a load
balancing device used to distribute jobs and commands among several target
machines. This provides additional scalability in the scheduling mix. The
user can choose from several load balancing algorithms: round robin, load
based, CPU usage based, network traffic based, fewest concurrent jobs, job
response time, resource based and based on an arbitrary rule provided by the
user. For ...

Once a virtual host is defined, jobqueues,
resources and JCS variables can be defined and shared among the target
members of that virtual host.

Ask/Reply Facility
The ask/reply facility permits a process to
ask system management for information. When a process requests information
via the ask command, the process is stopped until
the request is answered via the reply command.
Also, the recall command will show all
outstanding requests that are still awaiting a reply. Finally, the
automatic reply facility provides a way to specify answers that should be
provided automatically to selected ask commands.
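The interplay between asking, replying, recalling and automatic replies can be sketched as a small in-memory model. The Python below is an illustration only: the class AskReplyDesk and its method names are hypothetical and simply mirror the command names mentioned above. It does not reflect how OpenJCS stores requests, and a stopped process is represented simply by a request that has no answer yet.

```python
import itertools
from typing import Dict, Optional


class AskReplyDesk:
    """Toy model of an ask/reply facility with automatic replies."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[int, str] = {}   # request id -> question awaiting a reply
        self._answers: Dict[int, str] = {}   # request id -> answer received
        self._auto: Dict[str, str] = {}      # question -> automatic answer

    def autoreply(self, question: str, answer: str) -> None:
        # Register an answer to be supplied automatically for this question.
        self._auto[question] = answer

    def ask(self, question: str) -> int:
        request_id = next(self._ids)
        if question in self._auto:
            # Answered at once, so the asking process would not have to wait.
            self._answers[request_id] = self._auto[question]
        else:
            # The asking process stays stopped until reply() is called.
            self._pending[request_id] = question
        return request_id

    def reply(self, request_id: int, answer: str) -> None:
        del self._pending[request_id]        # KeyError if nothing is outstanding
        self._answers[request_id] = answer

    def recall(self) -> Dict[int, str]:
        # Show every request that is still waiting for a reply.
        return dict(self._pending)

    def answer_for(self, request_id: int) -> Optional[str]:
        # None means the request is still outstanding.
        return self._answers.get(request_id)


desk = AskReplyDesk()
desk.autoreply("Continue after tape error?", "no")
tape = desk.ask("Is the tape mounted?")
retry = desk.ask("Continue after tape error?")
print(desk.recall())           # only the tape question is outstanding
desk.reply(tape, "yes")
print(desk.answer_for(tape), desk.answer_for(retry))
```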
Automatic replies are specified via the autoreply
command.

Rulesets
Rulesets are scripts or collections of
commands used for various purposes by the scheduling system. Rulesets can,
for example, open and close queues, determine how resources should be
allocated, determine which host should launch a job or select the queue for a
job. Rulesets are managed via the rule command.
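This document does not show the ruleset syntax, so the following is only a sketch, in Python rather than in OpenJCS's own scripting, of the kind of decision a host-selection rule might make. It implements two of the load balancing policies named for virtual hosts, round robin and fewest concurrent jobs. The function names and the running_jobs mapping are assumptions made for the example.

```python
import itertools
from typing import Callable, Dict, List

# A hypothetical rule: given the number of jobs running on each host,
# return the name of the host that should launch the next job.
HostRule = Callable[[Dict[str, int]], str]


def round_robin(hosts: List[str]) -> HostRule:
    """Hand out hosts in a fixed rotation, ignoring current load."""
    rotation = itertools.cycle(hosts)
    return lambda running_jobs: next(rotation)


def fewest_concurrent_jobs(hosts: List[str]) -> HostRule:
    """Pick the host with the fewest running jobs (first host wins ties)."""
    def choose(running_jobs: Dict[str, int]) -> str:
        return min(hosts, key=lambda host: running_jobs.get(host, 0))
    return choose


hosts = ["alpha", "beta", "gamma"]
running = {"alpha": 2, "beta": 0, "gamma": 1}

rr = round_robin(hosts)
print([rr(running) for _ in range(4)])           # ['alpha', 'beta', 'gamma', 'alpha']
print(fewest_concurrent_jobs(hosts)(running))    # beta
```

Expressing each policy as a function with the same signature makes it straightforward to swap one rule for another, which is the flexibility a rule-driven scheduler aims for.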