Red Hat Linux 8.0: The Official Red Hat Linux System Administration Primer

Chapter 8. Planning for Disaster

Disaster planning is a subject that is easy for a system administrator
to forget — it is not pleasant and it always seems that there is
something else more pressing to do. However, letting disaster planning
slide is one of the worst things you can do.

Although it is often the dramatic disasters (such as fire, flood, or
storm) that first come to mind, the more mundane problems (such as
construction workers cutting cables) can be just as disruptive.
Therefore, the definition of a disaster that a system administrator should
keep in mind is any unplanned event that disrupts the normal operation of
your organization.

While it would be impossible to list all the different types of
disasters that could strike, we will look at the leading factors that are
part of each type of disaster. In this way, you can start looking at any
possible exposure not in terms of its likelihood, but in terms of the
factors that could cause that disaster.

Types of Disasters

In general, there are four different factors that can trigger a
disaster. These factors are:

- Hardware failures
- Software failures
- Environmental failures
- Human errors

We will now look at each factor in more detail.

Hardware Failures

Hardware failures are easy to understand: the hardware
fails, and work grinds to a halt. What is more difficult to
understand is the nature of the failures, and how your exposure to
them can be minimized. Here are some approaches that you can
use:

Keeping Spare Hardware

At its simplest, exposure due to hardware failures can be
reduced by having spare hardware available. Of course, this
approach assumes two things:

- Someone on-site has the necessary skills to diagnose the
  problem, identify the failing hardware, and replace it.
- A replacement for the failing hardware is available.

These issues are covered in more detail below.

Having the Skills

Depending on your past experience and the hardware involved,
having the necessary skills might be a non-issue. However, if you
have not worked with hardware before, you might consider looking
into local community colleges for introductory courses on PC
repair. While such a course is not in and of itself sufficient to
prepare you for tackling problems with an enterprise-level server,
you will learn the basics (proper handling of tools and
components, basic diagnostic procedures, and so on).

| Tip |
|---|
| | Before taking the approach of first fixing it yourself, make
sure that the hardware in question is not still under warranty and
is not covered by a service contract. If you attempt repairs on hardware that is covered by a
warranty and/or service contract, you are likely violating the
terms of these agreements, and jeopardizing your continued
coverage. |
However, even with minimal skills, it might be possible to
effectively diagnose and replace failing hardware — if you
choose your stock of replacement hardware properly.

What to Stock?

This question illustrates the multi-faceted nature of anything
related to disaster recovery. When considering what hardware to
stock, here are some of the issues you should keep in mind:

- Maximum allowable downtime
- The skill required to effect a repair
- Budget available for spares
- Storage space required for spares
- Other hardware that could utilize the same spares

Each of these issues has a bearing on the types of spares that
should be stocked. For example, stocking complete systems would
tend to minimize downtime and require minimal skills to install,
but would be much more expensive than having a spare CPU and RAM
module on a shelf. However, this expense might be worthwhile if
your organization has several dozen identical servers that could
benefit from a single spare system.

No matter what the final decision, the following question is
inevitable, and is discussed next.

How Much to Stock?

The question of spare stock levels is also multi-faceted.
Here the main issues are:

- Maximum allowable downtime
- Projected rate of failure
- Estimated time to replenish stock
- Budget available for spares
- Storage space required for spares
- Other hardware that could utilize the same spares

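As a rough illustration of how the projected rate of failure and the estimated time to replenish stock interact, the following Python sketch estimates a spare count. It is only a sketch under stated assumptions: failures are modeled as a Poisson process, `acceptable_risk` is a hypothetical tolerance for needing more spares than are on the shelf during one replenishment window (a stand-in for the maximum allowable downtime), and the function names are illustrative. Budget and storage space are not modeled.

```python
import math


def poisson_tail(mean: float, n: int) -> float:
    """Probability that a Poisson(mean) count is greater than n."""
    term = math.exp(-mean)  # P(X = 0)
    cdf = term
    for k in range(1, n + 1):
        term *= mean / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def spares_needed(failures_per_year: float,
                  replenish_days: float,
                  acceptable_risk: float) -> int:
    """Smallest spare count whose chance of being exceeded during one
    replenishment window is at most acceptable_risk (an assumed tolerance)."""
    if not 0.0 < acceptable_risk < 1.0:
        raise ValueError("acceptable_risk must be between 0 and 1")
    # Expected number of failures while waiting for replacement stock.
    mean = failures_per_year * replenish_days / 365.0
    for spares in range(1000):
        if poisson_tail(mean, spares) <= acceptable_risk:
            return spares
    raise ValueError("no spare count below 1000 meets the requested risk")


# A part used about once a year that can be replenished in a day.
print(spares_needed(failures_per_year=1, replenish_days=1, acceptable_risk=0.01))

# A part used about once a month that takes four weeks to replenish.
print(spares_needed(failures_per_year=12, replenish_days=28, acceptable_risk=0.001))
```

A system that can tolerate very little downtime would be given a smaller `acceptable_risk`, which pushes the spare count up; a longer replenishment time has the same effect.
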
At one extreme, for a system that can afford to be down a
maximum of two days, and a spare that might be used once a year
and could be replenished in a day, it would make sense to carry
only one spare (and maybe even none, if you were confident of
your ability to secure a spare within 24 hours).

At the other end of the spectrum, a system that could afford
to be down no more than a few minutes, and a spare that might be
used once a month (and could take several weeks to replenish)
might mean that a half dozen spares (or more) should be on the
shelf.

Spares That Are Not Spares

When is a spare not a spare? When it is hardware that is in
day-to-day use, but is also available to serve as a spare for a
higher-priority system should the need arise. This approach has
some benefits: There are, however, downsides to this approach: Given these constraints, the use of another production system
as a spare may work, but the success of this approach will hinge
on the system's specific workload, and how the system's absence
will impact overall data center operations.

Service Contracts

Service contracts make the issue of hardware failures someone
else's problem. All that is necessary for you to do is to confirm
that a failure has, in fact, occurred and that it does not appear to
have a software-related cause. You then make a telephone call, and
someone shows up to make things right again.

It seems so simple. But as with most things in life, there is
more to it than meets the eye. Here are some things that you will
need to consider when looking at a service contract:

- Hours of coverage
- Response time
- Parts availability
- Available budget
- Hardware to be covered

We will explore each of these details more closely below.

Hours of Coverage

Different service contracts are available to meet different
needs; one of the big variables between different contracts
relates to the hours of coverage. Unless you are willing to pay a
premium for the privilege, you cannot call just any time and
expect to see a technician at your door a short time later.

Instead, depending on your contract, you might find that you
cannot even phone the service company until a specific day/time,
or if you can, they will not dispatch a technician until the
day/time specified for your contract.

Most hours of coverage are defined in terms of the hours and
the days during which a technician may be dispatched. Some of the
more common coverage hours are:

- Monday through Friday, 9:00 to 17:00
- Monday through Friday, 12/18/24 hours each day (with the
  start and stop times mutually agreed upon)
- Monday through Saturday (or Monday through Sunday), same
  times as above

As you might expect, the cost of a contract increases with the
hours of coverage. In general, extending the daily hours of Monday
through Friday coverage will cost less than adding Saturday and
Sunday coverage.

But even here there is a possibility to reduce costs if you
are willing to do some of the work.

Depot Service

If your situation does not require anything more than the
availability of a technician during standard business hours and
you have sufficient experience to be able to determine what is
broken, you might consider looking at depot
service. Known by many names (including
walk-in service and drop-off
service), manufacturers may have service depots
where technicians will work on hardware brought in by
customers.

Depot service has the benefit of being as fast as you are.
You do not have to wait for a technician to become available
and show up at your facility. Depot technicians normally work
at the depot full-time, meaning that there will be someone to
work on your hardware as soon as you can get it to the
depot.

Because depot service is done at a central location, there
is a better chance that any parts that are required will be
available. This can eliminate the need for an overnight
shipment, or waiting for a part to be driven several hundred
miles from another office that just happens to have that part in
stock.

There are some trade-offs, however. The most obvious is that
you cannot choose the hours of service — you get service
when the depot is open. Another aspect to this is that the
technicians will not work past their quitting time, so if your
system failed at 16:30 on a Friday and you got the system to the
depot by 17:00, it will likely not be worked on until the
technicians arrive at work the following Monday morning.

Another trade-off is that depot service depends on having a
depot nearby. If your organization is located in a metropolitan
area, this is likely not going to be a problem. However,
organizations in more rural locations will find that a depot may
be a long drive away.

| Tip |
|---|
| | If considering depot service, take a moment and consider
the mechanics of actually getting the hardware to the depot.
Will you be using a company vehicle or your own? If your own,
does your vehicle have the necessary space and load capacity?
What about insurance? Will more than one person be necessary
to load and unload the hardware? Although these are rather mundane concerns, they should be
addressed before making the decision to use depot
service. |

Response Time

In addition to the hours of coverage, many service agreements
specify a level of response time. In other words, when you call
requesting service, how long will it be before a technician
arrives? As you might imagine, a faster response time equates to
a more expensive service agreement.

There are limits to the response times that are available.
For instance, the travel time from the manufacturer's office to
your facility has a large bearing on the response times that are
possible[1]. Response times in the four
hour range are usually considered among the quicker offerings.
Slower response times can range from eight hours (which becomes
effectively "next day" service for a standard business hours
agreement), to 24 hours. As with every other aspect of a service
agreement, even these times are negotiable — for the right
price.

| Note |
|---|
| | Although it is not a common occurrence, you should be aware
that service agreements with response time clauses can sometimes
stretch a manufacturer's service organization beyond its ability
to respond. It is not unheard of for a very busy service
organization to send somebody —
anybody — on a short response-time
service call just to meet their response time commitment. This
person can then appear to diagnose the problem, calling "the
office" to have someone bring in "the right part". In fact, they are just waiting for a technician that is
actually capable of handling the call to arrive. While it might be understandable to see this happen under
extraordinary circumstances (such as power problems that have
damaged systems throughout their service area), if this is a
consistent method of operation you should contact the service
manager, and demand an explanation. |
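
To see how hours of coverage and response time combine, the following Python sketch estimates the latest time a technician should arrive for a given service call. It rests on assumptions that a real contract may not share: the response clock is assumed to run only during covered hours, coverage is assumed to be the same window on each covered day, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def technician_due(call: datetime,
                   response_hours: float,
                   start_hour: int = 9,
                   end_hour: int = 17,
                   covered_days=(0, 1, 2, 3, 4)) -> datetime:
    """Latest arrival time, counting response time only within coverage.

    covered_days uses datetime.weekday() numbering (0 = Monday).
    """
    remaining = timedelta(hours=response_hours)
    t = call
    while True:
        day_start = t.replace(hour=0, minute=0, second=0, microsecond=0)
        window_open = day_start + timedelta(hours=start_hour)
        window_close = day_start + timedelta(hours=end_hour)
        if t.weekday() in covered_days and t < window_close:
            t = max(t, window_open)          # the clock starts when coverage does
            available = window_close - t     # covered time left today
            if remaining <= available:
                return t + remaining
            remaining -= available
        t = day_start + timedelta(days=1)    # try the next calendar day


# An eight-hour response on a business-hours contract, called in on a
# Monday morning, carries over into the next business day.
print(technician_due(datetime(2024, 1, 1, 10, 0), response_hours=8))

# A four-hour response called in late on a Friday afternoon.
print(technician_due(datetime(2024, 1, 5, 16, 30), response_hours=4))
```

Running the same call time against different coverage windows and response times is one way to compare what competing agreements would actually deliver for a failure at an inconvenient hour.
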
If your response time needs are stringent (and your budget
correspondingly large), there is one approach that can cut your
response times even further: to zero.

Zero Response Time: Having an On-Site Technician

Given the appropriate situation (you are one of the biggest
customers in the area), sufficient need (downtime of
any magnitude is unacceptable), and
financial resources (if you have to ask for the price, you
probably cannot afford it), you might be a candidate for a
full-time, on-site technician. The benefits of having a
technician always standing by are obvious.

As you might expect, this option can be
very expensive, particularly if you require
an on-site technician 24X7. But if this approach is appropriate
for your organization, you should keep a number of points in
mind in order to gain the most benefit.

First, an on-site technician will need many of the resources
of a regular employee, such as a workspace, telephone,
appropriate access cards and/or keys, and so on.

On-site technicians are not very helpful if they do not have
the proper parts. Therefore, make sure that secure storage is
set aside for the technician's spare parts. In addition, make
sure that the technician keeps a stock of parts appropriate for
your configuration, and that those parts are not "cannibalized"
by other technicians for their customers.

Parts Availability

Obviously, the availability of parts plays a large role in
limiting your organization's exposure to hardware failures. In
the context of a service agreement, the availability of parts
takes on another dimension, as the availability of parts applies
not only to your organization, but to any other customer in the
manufacturer's territory that might need those parts as well.
Another organization that has purchased more of the manufacturer's
hardware than you might get preferential treatment when it comes
to getting parts (and technicians, for that matter).

Unfortunately, there is little that can be done in such
circumstances, short of working out the problem with the service
manager.

Available Budget

As outlined above, service contracts vary in price according
to the nature of the services being provided. Keep in mind that
the costs associated with a service contract are a recurring
expense; each time the contract is due to expire you will need to
negotiate a new contract and pay again.

Hardware to be Covered

Here is an area where you might be able to help keep costs to
a minimum. Consider for a moment that you have negotiated a
service agreement that has an on-site technician 24X7, on-site
spares — you name it. Every single piece of hardware you
have purchased from this vendor is covered, including the PC that
the company receptionist uses to surf the Web while answering
phones and handing out visitor badges.

Does that PC really need to have someone
on-site 24X7? Even if the PC were vital to the receptionist's
job, the receptionist only works from 9:00 to 17:00; it is highly
unlikely that:

- The PC will be in use from 17:00 to 09:00 the next morning
  (not to mention weekends)
- A failure of this PC will be noticed, except between 09:00
  and 17:00

Therefore, paying for the ability to have this PC serviced in the
middle of a Saturday night is simply a waste of money.

The thing to do is to split up the service agreement such that
non-critical hardware is grouped separately from the more critical
hardware. In this way, costs can be kept as low as
possible.

| Note |
|---|
| | If you have twenty identically-configured servers that are
critical to your organization, you might be tempted to have a
high-level service agreement written for only one or two, with
the rest covered by a much less expensive agreement. Then, the
reasoning goes, no matter which one of the servers fails on a
weekend, you will say that it is the one
eligible for high-level service. Do not do this. Not only is it
dishonest, most manufacturers keep track of such things by using
serial numbers. Even if you figure out a way around such
checks, you will spend far more after being discovered than you
will by simply being honest and paying for the service you
really need. |

Software Failures

Software failures can result in extended downtimes. For example,
customers of a highly-available computer system recently experienced
this firsthand when a bug in the time handling code of the computer's
operating system resulted in every customer's system crashing at a
certain time of a certain day. While this particular situation is a
more spectacular example of a software failure in action, other
software-related failures may be less dramatic, but still as
devastating.

Software failures can strike in one of two areas:

- In the operating system
- In applications

Each type of failure has its own specific impact; we will explore
each in more detail below.

Operating System Failures

In this type of failure, the operating system is responsible for
the failure. These failures take two main forms:

- Crashes
- Hangs

The main thing to keep in mind about operating system failures
is that they take out everything that the computer was running at
the time of the failure. As such, operating system failures can be
devastating to production.

Crashes

Crashes occur when the operating system experiences an error
condition from which it cannot recover. The reasons for crashes
can range from an inability to handle an underlying hardware
problem, to a bug in kernel-level code. When an operating system
crashes, the system must be rebooted in order to continue
production.

Hangs

When the operating system stops handling system events, the
system grinds to a halt. This is known as a
hang. Hangs can be caused by
deadlocks (two resource consumers
contending for resources the other has) and
livelocks (two or more processes responding
to each other's activities, but doing no useful work), but the end
result is the same: a complete lack of productivity.

Application Failures

Unlike operating system failures, application failures can be
more limited in the scope of their damage. Depending on the
specific application, a single application failing might impact only
one person. On the other hand, if it is a server application
servicing a large population of client applications, the failure
could be much more widespread.

Environmental Failures

Even though the hardware is running perfectly, and even though the
software is configured properly and is working as it should, problems
can still occur. The most common problems that occur outside of the
system itself have to do with the physical environment in which the
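system is running. Environmental issues can be broken into four major
categories:

- Building integrity
- Electricity
- Heating, ventilation, and air conditioning
- Weather and the outside world

Building Integrity

For such a seemingly simple structure, a building performs a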
great many functions. It provides shelter from the elements. It
provides the proper micro-climate for the building's contents. It
has mechanisms to provide power and to protect against fire and
theft/vandalism. Performing all these functions, it is not
surprising that there is a great deal that can go wrong with a
building. Here are some possibilities to consider:

- Roofs can leak into data centers.
- Various building systems (such as water, sewer, or air
  handling) can fail, rendering the building uninhabitable.
- Floors may have insufficient load-bearing capacity to hold
  everything you want to put in the data center.

It is important to have a creative mind when it comes to
thinking about the different ways buildings can fail. The list
above is only meant to start you thinking along the proper
lines.

Electricity

Because electricity is the lifeblood of any computer system,
power-related issues are paramount in the mind of system
administrators everywhere. There are several different aspects to
power; we will cover them in more detail below.

The Security of Your Power

First, it is necessary to determine how secure your normal
power supply may be. Just like nearly every other data center,
you probably obtain your power from a local power company via
power transmission lines. Because of this, there are limits to
what you can do to make sure that your primary power supply is as
secure as possible.

| Tip |
|---|
| | Organizations located near the boundaries of a power company
might be able to negotiate connections to two different power
grids: the one servicing their area and the one operated by the
neighboring power company. The costs involved in running power lines from the
neighboring grid are sizable, making this an option only for
larger organizations. However, such organizations find that the
redundancy gained outweighs the costs in many cases. |
The main things to check are the methods by which the power is
brought onto your organization's property and into the building.
Are the transmission lines above ground or below? Above-ground
lines are susceptible to:

- Damage from extreme weather conditions (ice, wind,
  lightning)
- Traffic accidents that damage the poles and/or
  transformers
- Animals straying into the wrong place and shorting out the
  lines

Below-ground lines have their own unique shortcomings.

Continue to trace the power lines into your building. Do they
first go to an outside transformer? Is that transformer protected
from vehicles backing into it or trees falling on it? Are all
exposed shutoff switches locked?

Once inside your building, could the power lines (or the
panels to which they attach) be subject to other problems? For
instance, could a plumbing problem flood the electrical
room?

Continue tracing the power into the data center; is there
anything else that could unexpectedly interrupt your power supply?
For example, is the data center sharing a circuit with non-data
center loads? If so, the external load might one day trip the
circuit's overload protection, taking down the data center as
well.

Power Quality

It is not enough to ensure that the data center's power source
is as secure as possible. You must also be concerned with the
quality of the power being distributed throughout the data center.
There are several factors that must be considered:

- Voltage
  The voltage of the incoming power must be stable, with
  no voltage reductions (often called sags, droops,
  or brownouts) or voltage increases (often known as
  spikes and surges).
- Waveform
  The waveform must be a clean sine wave, with minimal
  THD (Total Harmonic Distortion).
- Frequency
  The frequency must be stable (most countries use a power
  frequency of either 50Hz or 60Hz).
- Noise
  The power must not include any RFI (Radio Frequency
  Interference) or EMI (Electro-Magnetic Interference) noise.
- Current
  The power must be supplied at a current rating
  sufficient to run the data center.

Power supplied directly from the power company will not
normally meet the standards necessary for a data center.
Therefore, some level of power conditioning is usually required.
There are several different approaches possible:

- Surge Protectors
  Surge protectors do just what their name implies:
  they filter surges from the power supply. Most do nothing
  else, leaving equipment vulnerable to damage from other
  power-related problems.
- Power Conditioners
  Power conditioners attempt a more comprehensive
  approach; depending on the sophistication of the unit, power
  conditioners often can take care of most of the types of
  problems outlined above.
- Motor-Generator Sets
  A motor-generator set is essentially a large electric
  motor powered by your normal power supply. The motor is
  attached to a large flywheel, which is, in turn, attached to
  a generator. The motor turns the flywheel and generator,
  which generates electricity in sufficient quantities to run
  the data center. In this way, the data center power is
  electrically isolated from outside power, meaning that most
  power-related problems are eliminated. The flywheel also
  provides the ability to maintain power through short
  outages, as it takes several seconds for the flywheel to
  slow to the point at which it can no longer generate
  power.
- Uninterruptible Power Supplies
  Some types of Uninterruptible Power Supplies (more
  commonly known as UPSs) include most
  (if not all) of the protection features of a power
  conditioner[2].

With the last two technologies listed above, we have started
in on the topic most people think of when they think about power
— backup power. In the next section, we will look at
different approaches to providing backup power.

Backup Power

One power-related term that nearly everyone has heard is the
term blackout. A blackout is a complete
loss of electrical power, and may last from a fraction of a second
to weeks.

Because the length of blackouts can vary so greatly, it is
necessary to approach the task of providing backup power using
different technologies for different lengths of blackouts.

| Tip |
|---|
| | The most frequent blackouts last, on average, no more than a
few seconds; longer outages are much less frequent. Therefore,
concentrate first on protecting against blackouts of only a few
minutes in length, then work out methods of reducing your
exposure to longer outages. |

Providing Power For the Next Few Seconds

Since the majority of outages last only a few seconds, your
backup power solution must have two primary
characteristics: The backup power solutions that match these characteristics
are motor-generator sets and UPSs. The flywheel in the
motor-generator set allows the generator to continue producing
electricity for enough time to ride out outages of a few
seconds.

Motor-generator sets tend to be quite large and
expensive, making them a practical solution only for mid-sized and
larger data centers.

However, another technology can fill in for those situations
where a motor-generator set is too expensive, as well as
handling longer outages.

Providing Power For the Next Few Minutes

UPSs can be purchased in a variety of sizes: small
enough to run a single low-end PC for five minutes, or large
enough to power an entire data center for an hour or
more.

UPSs are made up of the following parts:

- A transfer switch for switching
  from the primary power supply to the backup power supply
- A battery, for providing backup power
- An inverter, which converts the
  DC power from the battery into the AC power required by
  the data center hardware

Apart from the size and battery capacity of the unit, UPSs
come in two basic types:

- The offline UPS uses its
  inverter to generate power only when the primary power
  supply fails.
- The online UPS uses its inverter
  to generate power all the time, powering the inverter via
  its battery only when the primary power supply fails.

Each type has its advantages and disadvantages. The
offline UPS is usually less expensive, because the inverter
does not have to be constructed for full-time operation.
However, a problem in the inverter of an offline UPS will go
unnoticed (until the next power outage, that is).

Online UPSs tend to be better at providing clean power to
your data center; after all, an online UPS is essentially
generating power for you full time.

But no matter what type of UPS you choose, you must properly
size the UPS to your anticipated load (thereby ensuring that the
UPS has sufficient capacity to produce electricity at the
required voltage and current), and you must
determine how long you would like to be able to run your data
center on battery power.

To determine this information, you must first identify those
loads that will be serviced by the UPS. Go to each piece of
equipment and determine how much power it draws (this is
normally listed near the unit's power cord). Write down the
voltage, watts, and/or amps. Once you have these figures for
all of the hardware, you will need to convert them to
VA (Volt-Amps). If you have a wattage
number, you can simply use the listed wattage as the VA; if you
have amps, multiply it by volts to get VA. By adding the VA
figures you can arrive at the approximate VA rating required for
the UPS.

| Note |
|---|
| | Strictly speaking, this approach to calculating VA is not
entirely correct; however, to get the true VA you would need to
know the power factor for each unit, and this information is
rarely, if ever, provided. In any case, the VA numbers you will
obtain from this approach will reflect worst-case values,
leaving a large margin of error for safety. |
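
The sizing procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the equipment names and nameplate figures in `loads` are made up, the function names are hypothetical, and it follows the text's simplification of ignoring power factor. When a unit lists both amps and watts, the sketch uses volts multiplied by amps.

```python
def nameplate_va(volts=None, amps=None, watts=None):
    """Approximate VA for one piece of equipment from its nameplate."""
    if volts is not None and amps is not None:
        return volts * amps        # amps multiplied by volts gives VA
    if watts is not None:
        return watts               # simplification: treat listed watts as VA
    raise ValueError("need volts and amps, or watts")


def required_va(loads):
    """Sum the per-unit VA figures to approximate the UPS rating needed."""
    return sum(
        nameplate_va(load.get("volts"), load.get("amps"), load.get("watts"))
        for load in loads
    )


# Hypothetical nameplate readings, for illustration only.
loads = [
    {"name": "file server", "volts": 120, "amps": 4.0},
    {"name": "network switch", "watts": 150},
    {"name": "monitor", "volts": 120, "amps": 1.5, "watts": 160},
]

print(required_va(loads))
```

The total is only a starting point for choosing a UPS; the runtime requirement discussed next determines how much battery capacity is needed on top of it.
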
Determining runtime is more of a business question than a
technical question — what sorts of outages are you willing
to protect against, and how much money are you prepared to spend
to do so? Most sites select runtimes that are less than an hour
or two at most, as battery-backed power becomes very expensive
beyond this point.

Providing Power For the Next Few Hours (and Beyond)

Once we get into power outages that are measured in days,
the choices get even more expensive. At this point the
technologies are limited to generators powered by some type of
engine, primarily diesel and gas turbine.

Within that category, your options are wide open, assuming your
organization has sufficient funds. This is also an area where
experts should help you determine the best solution for your
organization. Very few system administrators will have the
specialized knowledge to plan the acquisition and deployment of
these kinds of power generation systems.

| Tip |
|---|
| | Portable generators of all sizes can be rented, making it
possible to have the benefits of generator power without the
initial outlay of money necessary to purchase one. However,
keep in mind that in disasters affecting your general
vicinity, rented generators will be in very short supply and
very expensive. |

Planning for Extended Outages

While a blackout of five minutes is little more than an
inconvenience to the personnel in a darkened office, what about an
outage that lasts an hour? Five hours? A day? A week?

The fact is, even if the data center is
operating normally, an extended outage will eventually affect your
organization. Consider the following points:

- What if there is no power to maintain environmental
  control in the data center?
- What if there is no power to maintain environmental
  control in the entire building?
- What if there is no power to operate personal
  workstations, the telephone system, the lights?

The point here is that your organization will need to
determine at what point an extended outage will just have to be
tolerated. Or if that is not an option, your organization will
need to reconsider its ability to function completely
independently of on-site power for extended periods, meaning that
very large generators will be needed to power the entire
building.

Of course, even this level of planning cannot take place in a
vacuum. It is very likely that whatever caused the extended
outage is also affecting the world outside of your organization,
and that the outside world will start having an effect on your
organization's ability to continue operations, even given
unlimited power generation capacity.

Heating, Ventilation, and Air Conditioning

Heating, Ventilation, and Air Conditioning
(HVAC) systems used in today's office
buildings are incredibly sophisticated. Often computer controlled,
the HVAC system is vital to providing a comfortable work
environment.

Data centers usually have additional air handling equipment,
primarily to remove the heat generated by the many computers and
associated equipment. Failures in an HVAC system can be devastating
to the continued operation of a data center. And given their
complexity and electro-mechanical nature, the possibilities for
failure are many and varied. Here are a few examples:

- The air handling units (essentially large fans driven by
  large electric motors) can fail due to electrical overload,
  bearing failure, belt/pulley failure, etc.
- The cooling units (often called chillers) can lose their
  refrigerant due to leaks, or can have their compressors
  and/or motors seize.

Again, HVAC repair and maintenance is a very specialized field
that the average system administrator should leave to the experts.
If anything, a system administrator should make sure that the HVAC
equipment serving the data center is checked for normal operation on
a daily basis (if not more frequently), and is maintained according
to the manufacturer's guidelines.

Weather and the Outside World

There are some types of weather that will obviously cause
problems for a system administrator:

- Heavy snow and ice can prevent personnel from getting to the
  data center, and can even clog air conditioning condensers,
  resulting in elevated data center temperatures just when no one
  is able to get to the data center to take corrective
  action.
- High winds can disrupt power and communications, with
  extremely high winds actually doing damage to the building
  itself.

There are other types of weather that can still cause problems,
even if they are not as well known. For example, exceedingly high
temperatures can result in overburdened cooling systems, and
brownouts or blackouts as the local power grid becomes
overloaded.

Although there is little that can be done about the weather,
knowing the way that it can affect your data center operations can
help you to keep running even when the weather turns bad.

Human Errors

It has been said that computers really are
perfect. The reasoning is that if you dig deeply enough, behind every
computer error you will find the human error that caused it. In this
section, we will explore the more common types of human errors and
their impacts.

System Administrator Errors

System administrators sometimes make unnecessary work for
themselves when they are not careful about what they are doing.
During the course of carrying out day-to-day responsibilities,
system administrators have more than sufficient access to the
computer systems (not to mention their super-user access privileges)
to mistakenly bring systems down.

System administrators either make errors of misconfiguration, or
errors during maintenance.

Misconfiguration Errors

System administrators must often configure various aspects of
a computer system. This configuration might include:

- Email
- User accounts
- Network
- Applications

The list could go on quite a bit longer. The actual task of
configuration varies greatly; some tasks require editing a text
file (using any one of a hundred different configuration file
syntaxes), while other tasks require running a configuration
utility. The fact that these tasks are all handled differently is
merely an additional challenge to the basic fact that each
configuration task itself requires different knowledge. For
example, the knowledge required to configure a mail transport
agent is fundamentally different from the knowledge required to
configure a new network connection.

Given all this, perhaps it should be surprising that so
few mistakes are actually made. In any case,
configuration is, and will continue to be, a challenge for system
administrators. Is there anything that can be done to make the
process less error-prone?

Change Control

The common thread of every configuration change is that some
sort of a change is being made. The change may be large, or it
may be small. But it is still a change, and should be treated
in a particular way.

Many organizations implement some type of a change control
process. The intent is to help system administrators (and all
parties affected by the change) to manage the process of change,
and to reduce the organization's exposure to any errors that may
occur.

A change control process normally breaks the change into
different steps. Here is an example:

- Preliminary research
Preliminary research attempts to clearly
define:

  - The nature of the change to take place.
  - Its impact, should the change succeed.
  - A fallback position, should the change fail.
  - An assessment of what types of failures are possible.

Preliminary research might include testing the
proposed change during a scheduled downtime, or it may go
so far as to include implementing the change first on a
special test environment run on dedicated test
hardware.

- Scheduling
Here, the change is examined with an eye toward the
actual mechanics of implementation. The scheduling being
done here includes outlining the sequencing and timing of
the change (along with the sequencing and timing of any
steps necessary to back the change out should a problem
arise), as well as ensuring that the time allotted for the
change is sufficient and does not conflict with any other
system-level activity.

The product of this process is often a checklist of
steps for the system administrator to use while making the
change. Included with each step are instructions to
perform in order to back out the change should the step
fail. Estimated times are often included, making it
easier for the system administrator to determine whether
the work is on schedule or not.

- Execution
At this point, the actual execution of the steps
necessary to implement the change should be
straightforward and anti-climactic. The change is either
implemented, or (if trouble crops up) it is backed
out.

- Monitoring
Whether the change is implemented or not, the
environment is monitored to make sure that everything is
operating as it should.

- Documenting
If the change has been implemented, all existing
documentation is updated to reflect the changed
configuration.
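
The checklist described under Scheduling, where each step carries its own back-out instructions, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Step` and `run_change` are hypothetical, each step is assumed to signal failure by raising an exception, and backing out the failed step before the completed ones is a design choice of this sketch rather than something the process above prescribes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist entry: an action plus the instructions to undo it."""
    name: str
    apply: Callable[[], None]     # performs the step; raises on failure
    back_out: Callable[[], None]  # undoes the step
    est_minutes: int = 0          # estimate, useful for tracking the schedule


def run_change(steps: List[Step], log: List[str]) -> bool:
    """Run the steps in order; if one fails, back the change out.

    Returns True if the change was implemented, False if it was backed out.
    Every action is appended to `log` so the outcome can be documented.
    """
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            # Undo the failed step first (it may have partly applied),
            # then the completed steps, most recent first.
            for prior in [step, *reversed(done)]:
                prior.back_out()
                log.append(f"backed out {prior.name}")
            return False
        done.append(step)
        log.append(f"applied {step.name}")
    return True
```

Keeping the back-out action next to the action it reverses means the fallback position is worked out before execution begins, and the log gives the Monitoring and Documenting steps a record to work from.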
Obviously, not all configuration changes require this level
of detail. Creating a new user account should not require any
preliminary research, and scheduling would likely consist of
determining whether the system administrator has a spare moment
to create the account. Execution would be similarly quick;
monitoring might consist of ensuring that the account was
usable, and documenting would probably entail sending an email
to the user's manager.

But as the configuration changes become more complex, a more
formal change control process becomes necessary.

Mistakes Made During Maintenance

This type of error can be insidious because there is usually
so little planning and tracking done during day-to-day
maintenance. System administrators see the results of this kind
of error every day, especially from the many users that swear they
did not change a thing — the computer just broke. The user
that says this usually does not remember what they did, and when
the same thing happens to you, you will probably not remember what
you did, either.

The key thing to keep in mind is that you must be able to
remember what changes you made during maintenance if you are to be
able to resolve any problems quickly. A full-blown change control
process is simply not realistic for the hundreds of small things
done over the course of a day. What can be done to keep track of
the 101 small things a system administrator does every day?

The answer is simple: take notes. Whether it is done
in a notebook, a PDA, or as comments in the affected files, take
notes. By tracking what you have done, you will stand a better
chance of seeing a failure as being related to a change you
recently made.

Operations Personnel Errors

Operators have a different relationship with an organization's
computers than system administrators. Operators tend to have a more
formal tie to the computers, using them in ways that have been
dictated by others. Therefore, the types of errors that an operator
might make differ from those a system administrator might
make.

Failure to Follow Procedures

Operators should have sets of procedures documented and
available for nearly every action they perform[3]. It might be that an operator does
not follow the procedures as they are laid out. There might be
several reasons for this:

- The environment was changed at some time in the past, and
  the procedures were never updated. Now the environment
  changes again, rendering the operator's memorized procedure
  invalid. At this point, even if the procedures were updated
  (which is unlikely, given the fact that they were not updated
  before) this operator will not be aware of it.
- The environment was changed, and no procedures exist.
  This is just a more out-of-control version of the previous
  situation.
- The procedures exist and are correct, but the operator
  will not (or cannot) follow them.

Depending on the management structure of your organization,
you might not be able to do much more than communicate your
concerns to the appropriate manager. In any case, making yourself
available to do what you can to help resolve the problem is the
best approach.

Mistakes Made During Procedures

Even if the operator follows the procedures, and even if the
procedures are correct, it is still possible for mistakes to be
made. If this happens, the possibility exists that the operator
is careless (in which case the operator's management should become
involved).

Another explanation is that it was just a mistake. In these
cases, the best operators will realize that something is wrong and
seek assistance. Always encourage the operators you work with to
contact the appropriate people immediately if they suspect
something is wrong. Although many operators are highly skilled
and able to resolve many problems independently, the fact of the
matter is that this is not their job. A problem that is made
worse by a well-meaning operator will harm both that person's
career and your ability to quickly resolve what might originally
have been a small problem.

Service Technician Errors

Sometimes the very people who are supposed to help you keep
your systems reliably running can actually make things worse. This
is not due to any conspiracy; it is simply that anyone working on
any technology for any reason risks rendering that technology
inoperable. The same effect is at work when programmers fix one
bug, but end up creating another.

Improperly-Repaired Hardware

In this case, the technician either failed to correctly
diagnose the problem and made an unnecessary (and useless) repair,
or the diagnosis was correct, but the repair was not carried out
properly. It may be that the replacement part was itself
defective, or that the proper procedure was not followed when the
repair was carried out.

This is why it is important to be aware of what the technician
is doing at all times. By doing this, you can keep an eye out for
failures that seem to be related to the original problem in some
way. This will keep the technician on track should there be a
problem; otherwise there is a chance that the technician will view
this fault as being new and unrelated to the one that was
supposedly fixed. In this way, time will not be wasted chasing
the wrong problem.

Fixing One Thing and Breaking Another

Sometimes, even though a problem was diagnosed and repaired
successfully, another problem pops up to take its place. For example, the CPU
module was replaced, but the anti-static bag it came in was left
in the cabinet, blocking the fan and causing an over-temperature
shutdown. Or the failing disk drive in the RAID array was
replaced, but because a connector on another drive was bumped and
accidentally disconnected, the array is still down.

These things might be the result of chronic carelessness, or
an honest mistake. It does not matter. What you should always do
is to carefully review the repairs made by the technician and
ensure that the system is working properly before letting the
technician leave.

End-User Errors

The users of a computer can also make mistakes that can have
serious impacts. However, due to their normally unprivileged
operating environment, user errors tend to be more
localized.

Improper Use of Applications

When applications are used improperly, various problems can
occur:

- Files inadvertently overwritten
- Wrong data used as input to an application
- Files not clearly named and organized
- Files accidentally deleted

The list could go on, but this is enough to illustrate the
point. Due to users not having super-user privileges, the
mistakes they make are usually limited to their own files. As
such, the best approach is two-pronged:

- Educate users in the proper use of their applications and
  in proper file management techniques
- Make sure backups of users' files are made regularly, and
  that the restoration process is as streamlined and quick as
  possible

Beyond this, there is little that can be done to keep user
errors to a minimum.
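
The second prong above (regular backups, with a restoration process that is quick) can be illustrated with a minimal Python sketch built on the standard `tarfile` module. The function names `back_up` and `restore_file` are hypothetical, and the sketch assumes one compressed archive per user directory; a production backup scheme would also need scheduling, rotation, and off-system storage, none of which is shown here.

```python
import tarfile
from pathlib import Path


def back_up(home: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of everything under `home`."""
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, e.g. "alice/report.txt".
        tar.add(home, arcname=home.name)


def restore_file(archive: Path, member: str, dest: Path) -> Path:
    """Copy one regular file out of the archive into the directory `dest`.

    `member` is the path as stored in the archive. The file is restored
    under its base name so that nothing in the user's directory is
    overwritten until the user has checked the restored copy.
    """
    with tarfile.open(archive, "r:gz") as tar:
        source = tar.extractfile(member)  # raises KeyError if not present
        if source is None:
            raise ValueError(f"{member} is not a regular file in {archive}")
        dest.mkdir(parents=True, exist_ok=True)
        target = dest / Path(member).name
        with source:
            target.write_bytes(source.read())
    return target
```

Restoring a single named file, rather than unpacking the whole archive, keeps the common case (one accidentally deleted or overwritten file) fast.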
Disclaimer: For authoritative source or latest update to this
documentation, please refer to http://www.redhat.com/docs/manuals/linux/