
Unplanned faults that impact application availability

Server:
- Application server hardware failure
- DB server hardware failure
- Server HBA/IO card failure
- Network interface card (NIC) failure
- Power failure on server
- High temperature (fan failure or AC)

System and application software:
- Operating system malfunction
- Middleware software failure
- DBMS failure
- Application malfunction
- Software aging
- Resource exhaustion
- Memory errors
- Logical and physical data errors (data loss, data corruption, etc.)

Network:
- Router failure
- Switch failure
- Network segments
- Other network components, e.g. inter-switch links, ports

Storage subsystem:
- Storage HBA failure
- Disk failure
- SAN switch failure
- RAID disk groups / cache failure
- Storage director / disk director / disk adapter failure
- Data corruption / data loss
- Fibre links
- Power failure on storage

Human errors

Scheduled activities that impact availability:
- Backup process
- Data maintenance, data defragmentation
- Data migration
- Preventive maintenance, microcode updates and patches (on server and storage)
- Configuration changes (server and storage)

Levels of protection:
- Point-in-time copies?
- Business continuity copies?
- Backup from copies?
- Recovery process: remote boot or local?
- Data recovery: from backup or copy?

Software events: when a system stops, it can be difficult to determine whether the outage was caused by a hardware or a software failure, especially if a subsequent reboot succeeds. Hardware is frequently blamed for the outage, perhaps because the technician responsible for the fix is unfamiliar with the more complex diagnostic procedures and has to do something that is perceived as working toward a solution. As a result, hardware with no defect is scheduled for maintenance, triggering a series of activities to replace it, some of them complex, such as transferring services to another server. Hardware events can provide evidence for determining whether or not the outage was due to a hardware failure, reducing improper maintenance actions like the one just described and directing resolution efforts toward the software.

In fact, unplanned outages are more likely to be caused by software failures than by hardware failures. Software failures become more likely over time, and several studies and other evidence point to "software aging" as a common phenomenon in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and accumulation of numerical errors are the primary symptoms of this degradation, which may eventually lead to degraded software performance, crash/hang failures, or other undesirable effects. Some typical causes of this degradation are memory bloating, memory leaks, unterminated threads, unreleased data locks, data corruption, fragmentation of disk allocations, and accumulation of errors. Software aging has been observed not only in specialized applications but also in widely used software. Aging occurs because software tends to be extremely complex and is never free of errors. It is almost impossible to test software completely and guarantee that it is free of bugs.
This situation is made worse by the fact that software development is strongly driven by the market and by time pressure, which results in software that meets short-term market needs but does not deal very well with long-term ramifications such as reliability. Residual faults therefore have to be tolerated in the operational phase. These faults can take various forms, but the ones we are concerned with are those capable of causing service downtime: the ones that deplete system resources such as memory, threads, and kernel tables over the medium term. Since software is not free of bugs, we have to address this reality because of the availability problems that arise from this prevailing approach to development.

Proactive strategy for avoiding software failures: the proactive strategy for software aims to prevent the conditions that lead to a system crash by taking preventive actions that avoid those conditions or reduce their effects, through the correction of deficiencies and potential error conditions. Among the most widely used proactive techniques are the following.

Software rejuvenation: some performance and configuration metrics give indications that resource exhaustion, which can result in failure, is imminent. If these indicators are monitored, correlated, and compared with thresholds, they can generate alert events, since a process will have such indicators close to their safety limits before it fails. Alerts about approaching resource exhaustion may call for a proactive software rejuvenation action, carried out at a more suitable time when there is no significant load on the system, resulting in less downtime and lower cost than the reactive approach. Another source of failure related to software aging is a server running close to the saturation point of system resources such as memory, internal OS resources, processor, I/O capacity, and network capacity, which creates conditions that can lead to crashes. In the same way, alert events about such conditions can be generated so that preventive measures can be taken.
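The threshold comparison described above can be sketched as follows. This is a minimal illustration, not a description of any particular monitoring product: the metric names, the warning and critical fractions, and the "quiet load" cutoff are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # hypothetical metric name
    warn: float      # fraction of capacity that raises a warning
    critical: float  # fraction of capacity that calls for rejuvenation


def evaluate(samples: dict[str, float], thresholds: list[Threshold]) -> list[tuple[str, str]]:
    """Return (metric, level) alerts for every metric at or above a threshold."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            continue  # metric not sampled in this cycle
        if value >= t.critical:
            alerts.append((t.metric, "critical"))
        elif value >= t.warn:
            alerts.append((t.metric, "warning"))
    return alerts


def rejuvenate_now(alerts: list[tuple[str, str]], current_load: float,
                   quiet_load: float = 0.2) -> bool:
    """Rejuvenate only when something is critical AND the system is lightly loaded."""
    has_critical = any(level == "critical" for _, level in alerts)
    return has_critical and current_load <= quiet_load


# Assumed thresholds and one assumed sample of resource usage (fractions of capacity).
thresholds = [
    Threshold("memory_used", warn=0.80, critical=0.95),
    Threshold("thread_table_used", warn=0.70, critical=0.90),
]
samples = {"memory_used": 0.97, "thread_table_used": 0.75, "cpu": 0.30}

alerts = evaluate(samples, thresholds)
print(alerts)
print(rejuvenate_now(alerts, current_load=0.6))   # busy period: defer
print(rejuvenate_now(alerts, current_load=0.1))   # quiet period: proceed
```

Separating the alert from the decision to act reflects the point made in the text: the alert can fire at any time, while the rejuvenation itself is deferred to a period of low load.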
Preventive application of fixes, patches, and updates: deploying new releases or versions and applying the service packs, patches, and updates made available by the software vendor is the best-known form of preventive action, although in most organizations it is used reactively. Applying fixes and updates can occasionally cause unexpected problems in the processing environment. Using a staging and test environment, together with rigorous selection of the patches to be applied, can reduce these risks, although not entirely. This concern may limit its use as a proactive measure.

Change control: another traditional source of disturbance to system stability is change, above all software changes, whether in application code, operating system settings, or middleware. The change control process, long emphasized in IT management disciplines, also takes a preventive approach: its goal is to avoid the impact that changes can have on service availability. As in the previous case, applying changes in the test/staging environment before production, whenever possible, greatly helps to reduce risk.

Specialized testing of application code: this is another important preventive approach, based on specialized tools for testing or analyzing application code to uncover potential deficiencies that could lead to downtime. Some tools detect, for example, run-time problems such as memory errors, memory corruption, memory leaks, and resource locks, which can result in software aging. Others detect coding deficiencies that can drive excessive use of one or more system resources and make the system partially or totally unavailable. These deficiencies produce what is called offending code, which also compromises the performance and availability of other processes. The strategy is to submit suspect application code to analysis by such tools in order to identify potential risks, and to do the same for all code that is changed, whether by the user or by the vendor, in a permanent process known as continuous improvement.

The Event Management process, built on the three types of monitoring mentioned and on corrective and preventive actions, aims to give the organization a satisfactory return in terms of service quality and reduced downtime. The benefits include:
- Immediate identification of the component that has failed or is about to fail, which speeds up recovery and reduces downtime.
- Predictive maintenance of hardware components (servers, storage, links, and network), which reduces the number of permanent failures and the resulting downtime.
- Predictive analysis of events to anticipate software failures, analogous to the previous item.
- Optimization of system performance, which improves performance service levels and the company's visibility.
- Prevention of resource unavailability caused by applications, offending processes, and inadequate parameters, which likewise improves service levels.

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to software aging as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. Software rejuvenation is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it.
Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director.

Software aging

Unplanned computer system outages are more likely to be the result of software failures than of hardware failures. Moreover, software often exhibits an increasing failure rate over time, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation. This constitutes a phenomenon called software aging, and may be caused by errors in the application, middleware, or operating system. Under aging conditions, the state of the software degrades gradually with time, inevitably resulting in undesirable consequences. Some typical causes of this degradation are memory bloating and leaking, unterminated threads, unreleased file locks, data corruption, storage-space fragmentation, and accumulation of round-off errors.

This phenomenon has been reported in telecommunications billing applications, where over time the application experiences a crash or a hang failure. Avritzer and Weyuker discuss aging in telecommunication switching software, in which the effect manifests itself as gradual performance degradation. Software aging has been observed not only in specialized software, but also in widely used software, where rebooting to clear a problem is a common practice.

Aging occurs because software is extremely complex and never wholly free of errors. It is almost impossible to fully test and verify that a piece of software is bug-free. This situation is further exacerbated by the fact that software development tends to be extremely time-to-market-driven, which results in applications that meet the short-term market needs, yet do not account very well for long-term ramifications such as reliability. Hence, residual faults have to be tolerated in the operational phase. These residual faults can take various forms, but the ones that we are concerned with cause long-term depletion of system resources such as memory, threads, and kernel tables. The essentially economic problem of developing and producing bug-free code is not the problem at hand; instead we address one of the problems that arises from the prevailing approach to developing software, and one approach to attacking that problem is software rejuvenation.

Software rejuvenation

To counteract software aging, a proactive technique called software rejuvenation has been devised. It involves stopping the running software occasionally, cleaning its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restarting it. An extreme but well-known example of rejuvenation is a system reboot. There are numerous examples in real-life systems where software rejuvenation is being used.
For example, it has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States. Software capacity restoration, a technique similar to rejuvenation, has been used by Avritzer and Weyuker in a large telecommunications-switching software application. In this case, the switching computer is rebooted occasionally, which restores its service rate to the peak value. The necessity of performing preventive maintenance in a safety-critical environment is evident from the example of aging in Patriot missile software [8]. The failure, which resulted in loss of human lives, might have been prevented had the operators heeded the advice that the system had to be restarted after every eight hours of running time.

The Apache Web Server (from The Apache Software Foundation) provides a means to prevent itself from becoming too much of a resource burden on a system. Apache has a controlling process and a handler process. The controlling process watches the handler process to ensure that it is running up to standard. The handler process, on the other hand, handles requests from the clients. When the handler process is deemed to be in a bad state, the controlling process stops it and starts another process.

Most current fault-tolerance techniques are reactive in nature. Proactive fault management, on the other hand, takes suitable corrective action to prevent a failure before the system experiences a fault. Although this technique has long been used on an ad hoc basis in physical systems, it has only recently gained recognition and importance for computer systems. Software rejuvenation is a specific form of proactive fault management, which can be performed at suitable times, such as when there is no load on the system, and thus typically results in less downtime and cost than the reactive approach.
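The controlling/handler arrangement described above for Apache can be illustrated with a small in-process simulation. This is only a sketch of the general pattern and not Apache's actual implementation: the `Handler` and `Controller` classes, the simulated 64 KB leak per request, and the 256 KB health limit are all invented for the example.

```python
class Handler:
    """Stand-in for a request handler whose state degrades as it serves requests."""

    def __init__(self) -> None:
        self.served = 0
        self.leaked_kb = 0

    def handle(self, request: str) -> str:
        self.served += 1
        self.leaked_kb += 64  # simulated leak per request (assumed figure)
        return f"ok:{request}"


class Controller:
    """Watches the handler and replaces it once it is deemed to be in a bad state."""

    def __init__(self, max_leaked_kb: int = 256) -> None:
        self.max_leaked_kb = max_leaked_kb
        self.handler = Handler()
        self.restarts = 0

    def healthy(self) -> bool:
        return self.handler.leaked_kb < self.max_leaked_kb

    def dispatch(self, request: str) -> str:
        if not self.healthy():
            # Rejuvenation step: discard the aged handler and start a fresh one.
            self.handler = Handler()
            self.restarts += 1
        return self.handler.handle(request)


controller = Controller(max_leaked_kb=256)
replies = [controller.dispatch(f"r{i}") for i in range(10)]
print(controller.restarts, controller.handler.leaked_kb)
```

In a real server the handler would be a separate operating system process and the health check would read actual resource counters; the point of the sketch is only that clients keep being served while aged state is periodically thrown away.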
Since proactive fault management incurs some overhead, an important research issue is to determine the optimal times to invoke it in operational software systems. Proactive fault management can be greatly enhanced by the ability to predict the fault far enough in advance that one can take action to avoid or mitigate its effects. Resource exhaustion by its very nature offers clues that failure is imminent, in the form of parameters that can be monitored, extrapolated, and compared to thresholds via suitable algorithms.
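To make the idea of monitoring, extrapolating, and comparing to a threshold concrete, the sketch below fits a straight line to recent samples of a resource and estimates how long it will take to reach a limit. An ordinary least-squares line is used here purely as the simplest stand-in; the text does not say which algorithms are suitable, and the sample data, units, and limit are assumptions.

```python
def fit_line(times: list[float], values: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_exhaustion(times: list[float], values: list[float], limit: float):
    """Estimated time from the last sample until the fitted trend reaches `limit`.

    Returns None when usage is flat or falling, i.e. no exhaustion is predicted.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    t_cross = (limit - intercept) / slope
    return max(0.0, t_cross - times[-1])


# Assumed samples: memory in use (MB) every 10 minutes, with a 1000 MB limit.
minutes = [0, 10, 20, 30, 40]
used_mb = [500, 520, 540, 560, 580]

remaining = time_to_exhaustion(minutes, used_mb, limit=1000)
print(remaining)  # minutes left at the current growth rate
```

The estimate can then be compared with the lead time needed to schedule a rejuvenation; a real predictor would also need to cope with noisy or non-linear growth, which a single straight line does not capture.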

Discrimination between hardware and software faults


When a system crashes, it can be difficult to determine whether the crash was due to hardware or software, especially if a subsequent system reboot is successful. Frequently, hardware is identified as the culprit when it is associated with a certain number of outages, perhaps because the service technician has to do something that is perceived as solving the customer's problem. Consequently, non-faulty hardware is often returned to IBM under a service contract. We think that a variant of the software-monitoring agent we have developed can provide valuable clues to the user or technician as to whether the crash was due to a software problem it is capable of detecting, possibly reducing the number of no-fault-found hardware returns. It could also be expanded to monitor hardware errors to improve its diagnostic resolution, since it is fairly well known that permanent hardware failures are often preceded by an increasing rate of occurrence of transient hardware failures.

Adaptive and multi-parameter predictive capabilities

The current version of SRA was developed on the basis of a preconceived notion of the types of resources that can be exhausted and their exhaustion thresholds, for a given set of operating systems. Consequently, the SRA is capable of monitoring and predicting the exhaustion of any single parameter that is on a fairly static initialization list, using a flexible curve-fitting methodology. We think this capability is adequate for a sufficiently large number of applications to make the product useful. However, it is not known in general which parameters may be exhausted or enter critical regions in all scenarios, nor what their exhaustion thresholds are. An outage may occur only when a combination of parameters reaches critical regions, or it may be the case that a given parameter or combination of parameters does not have to be at an extreme to constitute a hazardous situation.
For example, we noticed during our testing of a web-serving application that just prior to the outage, the committed bytes, nonpaged-pool bytes, and available memory approached known exhaustion limits, as expected. However, other parameters also repeatedly exhibited unusual and indicative behavior just prior to the outage: the variance of the paging rate and the number of nonpaged-pool allocations skyrocketed, although these do not seem to be outage precursors on their own.
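One way to picture a multi-parameter indicator of this kind is to combine "near a known limit" flags with a check for a sudden jump in the variance of a secondary parameter. The sketch below is only an illustration of that idea, not the method used in the SRA: the window size, the factor of 10, the rule of requiring three signals, and the sample paging-rate series are all assumptions.

```python
from statistics import pvariance


def variance_spike(series: list[float], window: int = 5, factor: float = 10.0) -> bool:
    """True if the variance of the latest window exceeds `factor` times
    the variance of the window immediately before it."""
    if len(series) < 2 * window:
        return False  # not enough history to compare two windows
    recent = pvariance(series[-window:])
    baseline = pvariance(series[-2 * window:-window])
    return recent > factor * max(baseline, 1e-9)  # guard against a zero baseline


def combined_alert(near_limit: dict[str, bool], spikes: dict[str, bool],
                   min_signals: int = 3) -> bool:
    """Raise an alert only when enough independent signals agree."""
    signals = sum(near_limit.values()) + sum(spikes.values())
    return signals >= min_signals


# Assumed paging-rate samples: steady at first, then erratic.
paging_rate = [10, 11, 10, 12, 11, 10, 90, 15, 140, 20]
spike = variance_spike(paging_rate)

alert = combined_alert(
    near_limit={"committed_bytes": True, "nonpaged_pool_bytes": True,
                "available_memory": False},
    spikes={"paging_rate": spike},
)
print(spike, alert)
```

Requiring agreement among several signals matches the observation that the variance jump is not an outage precursor on its own, but becomes informative when it coincides with primary parameters approaching their limits.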