Thursday, January 17, 2013

Software Guidelines: Operations, Availability, Fault management, Maintainability

This is part of a blog series about (SOA) software guidelines. For the complete list of guidelines (among others about design, security, performance, operations, database, coding, versioning), please refer to:


  • Have you consulted the operations & infrastructure teams in the design of this service? Verify your design assumptions & constraints early with the production/infrastructure team.
  • Design & plan for rollback in case the deployment of the new version fails or turns out to be buggy. Sometimes fixing forward costs more or is harder to perform, e.g. if data corruption has spread to multiple places and is not detected until long after the new version was deployed.
  • Use centralized configuration management (e.g. Weblogic management console, PAM configuration in Linux).
  • Identify deployment constraints in production (e.g. protocol restrictions, firewalls, deployment topology, security policies, operational requirements). Identify the restrictions imposed by external services / cloud resources (protocol, security, throughput, response time, number of open sessions, message size).
  • Write a short "getting started" administration & configuration document (e.g. artifacts, how to install, servers & resources addresses, dependencies, configurations, service request/response examples, troubleshooting / error scenarios). This document is especially useful when you act as an external consultant / temporary project developer or if you need to hand the project over to colleagues (perhaps you leave the company, have to take on another project, or get promoted :).
  • Minimize deviations from standard configuration for easy maintenance. Avoid esoteric solutions, use standard hardware & software.
  • Apply the principle of least privilege (e.g. fine-grained privileges: the sendOrder process can only read the customer address table and has no admin/write privileges over the entire database).
  • Modules and configurations can be changed, loaded and unloaded without restarting the server/OS (e.g. using OSGi or Spring DM).
  • Keep a standard operating procedures document for efficiency and consistency (e.g. how to start/shutdown, backup/restore, database creation/purge, virtual machine creation/purge, user management, incident resolution flow). Automate the procedures where possible (e.g. with shell scripts).

  • Visibility: how are availability and application states monitored (e.g. using JMX, Oracle BAM)? At any point during processing, the admin should be able to inspect the data & state information of any individual unit of work.
  • Design to be monitored (e.g. logging, JMX for Java applications).
  • Continue to measure performance in production. Gather production information (e.g. number of users, usage patterns, data volume) and use it as design input for the next iteration.
  • Monitor the system continuously to prevent faults before they happen (e.g. disk quota, network congestion, memory leaks, unclosed connections/resources, endless process forking/spawning). Use a monitoring & alert system (e.g. Nagios) and its alert features (e.g. an alert email to the admin when a threshold is violated). You can build automatic scripts that clean up in case of alerts (e.g. clean up disk space, move old files, close idle connections, kill processes, restart servers).
  • In case of failure, how does the support team receive alerts regarding the health of the service?
  • Alert users who are approaching their quota limit.
  • Log & audit admin operations (e.g. new users/grants, configuration changes, server restarts).
  • Provide health check test scripts to diagnose problems during contingencies (e.g. check connectivity to other systems/resources, check the number of open connections, statistics of transaction errors, response time of critical operations).
  • Review the logs for incomplete information, too much information, mistaken severity levels, and unreadable/bad formats.
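
As an illustration of the threshold alerts mentioned above, here is a minimal Python sketch of a disk-usage check. The function name `disk_usage_alert`, the 90% default threshold and the message format are assumptions made for this example; in practice such a check would usually be a plugin of your monitoring system (e.g. Nagios) rather than a standalone script.

```python
import shutil


def disk_usage_alert(path="/", threshold=0.9, usage_fn=shutil.disk_usage):
    """Return an alert message if the disk holding `path` is fuller than
    `threshold` (a fraction between 0 and 1), otherwise return None.

    `usage_fn` is injectable so that the check can be tested without a real disk.
    """
    usage = usage_fn(path)  # has the attributes total, used and free (bytes)
    used_fraction = usage.used / usage.total
    if used_fraction >= threshold:
        return f"ALERT: {path} is {used_fraction:.0%} full (threshold {threshold:.0%})"
    return None


if __name__ == "__main__":
    message = disk_usage_alert("/")
    # A real script would send an email or notify the monitoring system here.
    print(message or "disk usage OK")
```

The same pattern (measure, compare with a threshold, alert) applies to open connections, queue depth or memory usage.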


  • Logging operation is centralized (e.g. LoggingService)
  • Use a centralized file location / database for logging
  • Enforce standardization for logging (e.g. which information to log, how to log it)
  • Decide how to manage enormous growth of log files (e.g. throttling, alerts, circular files, automatic backup) in case of a critical component failure or a DoS attack. Decide how to prevent the whole server from failing when the disk fills up with logs.
  • Prevent people from logging in with default accounts (e.g. sa, weblogic), because then you cannot identify who performed an action. Give people distinct accounts.
  • Audit the logs regularly for early detection of anomalies
  • Log the following information (e.g. as database fields or XML elements in log files):
    • service /application name
    • proxy / class  name
    • server name
    • operation / method name
    • fault message
    • timestamp
    • messageID / requestID
    • message/request being processed (e.g. soap:body)
    • contextual information (e.g. userID, important application states, configuration/environment variables, jms topic/queue name, version of external software, IP-address of the external cloud service)
  • Beware that logs are read by people (often in stressful crisis situations), so readability is important.
  • Trade off too little against too much information. Logging is overhead and consumes storage. You might throttle the logs to keep the system from being overwhelmed by large requests.
  • Decide how long you will keep the log files and how you archive them to other (lower grade / cheaper) storage.
  • Write log files to a different server than the application server: better performance (parallel) and more robust (damage to the application server's file system will not damage the log files, so hopefully we can trace back what happened).
  • Synchronize the time between components (e.g. using NTP) and convert all timestamps to one time zone (e.g. GMT).
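
To make the list of log fields more concrete, here is a minimal sketch using Python's standard `logging` module. The class name `JsonFormatter`, the JSON layout and the names `OrderService`, `OrderProxy`, `request_id` and `user_id` are assumptions chosen for this example, not a standard. Timestamps are written in UTC so that entries from different servers can be compared.

```python
import json
import logging
import socket
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON line with a fixed set of fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name
        self.server_name = socket.gethostname()

    def format(self, record):
        entry = {
            # UTC timestamp, so that logs of different servers line up
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service_name,
            "server": self.server_name,
            "class": record.name,            # logger name, used here as proxy/class name
            "operation": record.funcName,    # function that issued the log call
            "severity": record.levelname,
            # Contextual fields passed through the `extra` argument (may be absent)
            "requestID": getattr(record, "request_id", None),
            "userID": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("OrderService"))
    logger = logging.getLogger("OrderProxy")
    logger.addHandler(handler)
    logger.error("address lookup failed", extra={"request_id": "req-42", "user_id": "u-7"})
```

One JSON object per line stays readable for people and is still easy to load into a central log database.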

Availability, Robustness

  • What are the availability requirements? What is the impact if the service is down for seconds or hours?
  • Availability is a function of inputs (e.g. validate the inputs), application robustness (e.g. defensive programming, recovery), infrastructure availability (e.g. resource redundancy) and operational best practice (e.g. monitoring). Availability also depends on security (e.g. a DoS attack can cripple availability) and performance (e.g. locking problems can degrade performance so that the application is less available to accept requests).
  • Recovery: the user/client should see minimum disruption (e.g. recover session data / last known states, re-authenticate with a reused SSO token without user intervention).
  • Ideally the modules can be deployed/undeployed without restarting thus the availability of the services is not interrupted.
  • Simple capacity planning:
  • Minimize the number of servers or components (e.g. storage) to reduce the complexity of maintenance (e.g. backup). Fewer servers mean fewer chances of failure, less need for reboots and a smaller attack surface. Roughly speaking, if you use 10 storage units, each with an MTBF of 20 years, you can expect to deal with a storage failure about once every 2 years (a rough calculation). You can minimize the number of servers/storage units by reducing the data, prioritizing the services (e.g. removing old services) and scaling up (bigger CPU, memory, storage).
  • Clean up temp files regularly. Delete/backup old log files.
  • Trade off your timeout (the most common suggestion is about 3 min). Too long a timeout means your system detects the failure late, which slows down the failover process. Too short a timeout (below the application/network latency) can make your system mark a working connection as failed.
  • Beware that a fault-tolerance mechanism can hide the root problem. For example, if the load balancer restarts failed servers without alerting the admin, the root problem (e.g. a memory leak) is never detected and corrected. The result is inconsistent performance: initially good, then degrading, then recovering after a restart, then degrading again.
  • Trade off robustness (e.g. being permissive towards mild violations for the sake of user experience, substituting default values instead of aborting/crashing) against correctness (e.g. aborting the process in a radiation therapy machine).
  • Provide an alternative way to perform the same tasks in case of outage / unavailability of the primary systems. Train the operators to be familiar with the procedures. For example, if the CustomerRegistrationWebservice breaks down, customer service can still serve new customers, perhaps using an application that communicates directly with the database over a JDBC connection. Or if everything breaks down, at least the operators know the manual procedures using pen and paper.
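
The rough MTBF calculation in the capacity planning item can be written out as follows. This is a back-of-the-envelope sketch that assumes the units fail independently and at a constant rate; the function names are made up for this example.

```python
import math


def expected_years_between_failures(mtbf_years, units):
    """With `units` identical devices the combined failure rate is
    units / mtbf_years, so on average a failure occurs somewhere in the
    fleet every mtbf_years / units years."""
    return mtbf_years / units


def probability_of_failure_within(years, mtbf_years, units):
    """Probability of at least one failure in the fleet within `years`,
    assuming exponentially distributed times to failure."""
    return 1 - math.exp(-units * years / mtbf_years)


if __name__ == "__main__":
    # 10 storage units with an MTBF of 20 years each
    print(expected_years_between_failures(20, 10))             # 2.0
    print(round(probability_of_failure_within(1, 20, 10), 2))  # 0.39
```

So with these numbers there is roughly a 39% chance of having to deal with a storage failure in any given year, which is why reducing the number of components pays off.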

Error handling
  • Errors are logged and reported in a consistent fashion. What information is needed in the log/report? Which error statistics will be useful to improve your system?
  • Do you use company standard error codes? For security you might translate framework-specific exceptions to standard error codes to hide the details from attackers.
  • How do you handle errors? Do you have redundancy clusters / failover? Do you need failure recovery (e.g. self restart)?
  • What runtime exceptions are likely to be generated? How can this be mitigated?
  • Check for errors locally in the client (e.g. using JavaScript to validate the phone field) to prevent the client from submitting wrong requests to the backend. Still revalidate the request in the backend to stop attacks that bypass the client.
  • Release resources (e.g. files, database connections, threads, processes): close connections and transactions after an error.
  • Centralized error handling.
  • Avoid empty catch blocks.
  • Make sure that your code/platform handles all possible errors/failures gracefully, informs the users/administrators and preserves data/transaction integrity (e.g. compensation, clean-up), for more robustness and a better user experience.
  • To make sure that your application handles all exceptions gracefully (including system exceptions) you might try to catch all exceptions in your code. But this is not always possible (because of the sheer number of possible exceptions and the performance & maintenance burden). Also, if you catch an exception yourself, your framework (e.g. Weblogic) may never see it, so higher-level critical mechanisms (e.g. centralized logging/monitoring, global transaction management) will fail to act.
  • If you handle an exception, it is better to handle it locally (close to the context, close to the resources involved) than to throw it to a higher level (except for cross-cutting concerns such as logging/monitoring and global transaction management). In big software projects the developers are highly specialized; when the exception is handled locally, it is handled within the domain of the developers who have the expertise.
  • Catching and rethrowing is expensive and makes debugging harder, so do it only if you can add some value.
  • Don't use exceptions for flow control, because throwing and catching exceptions is expensive.
  • Define clear exception handling boundaries to reduce redundancy & inconsistency.
  • Methods should throw exceptions instead of returning int return codes (e.g. 0 is success, 1 is input error, etc.), since an exception is more expressive.
  • Decide how to tolerate nonfatal failures (e.g. disk quota, read/write failure), for example: move the requests to an error queue to be rerun automatically once the failure condition is resolved.
  • Introduce software "fuses": stop the processing when an unrecoverable error happens or when a maximum number of exceptions has occurred. Make sure that the job can be rerun without risk of duplication/inconsistency by compensating for / cleaning up the partial results.
  • Apply software "valves" when the system fails: stop processing further inputs (from the GUI or consumer services), e.g. by moving the requests to an error queue. The valves can also be used in conjunction with a scheduler to prioritize work (e.g. to keep offline OLAP processing from hindering online OLTP performance during working hours).
  • Test the behavior of the system under all possible errors. What is the expected behavior if one of the consumers erroneously sends a surge of (erroneous) inputs? Test whether the exception impacts other processes (e.g. the security service).
  • Consider alternatives to exceptions, e.g. substituting safe default values.
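
The software "fuse" idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `process_with_fuse` is made up, recoverable errors are represented by `ValueError`, and the error queue is a plain list instead of a persistent queue.

```python
def process_with_fuse(items, handler, max_errors=3):
    """Process items one by one and stop ("blow the fuse") once
    `max_errors` recoverable errors have occurred.

    Returns a tuple (results, error_queue, remaining):
      results     - outputs of the successfully handled items
      error_queue - (item, error message) pairs to be rerun later
      remaining   - items not attempted because the fuse blew
    """
    results, error_queue = [], []
    for index, item in enumerate(items):
        try:
            results.append(handler(item))
        except ValueError as exc:  # only the errors we know how to tolerate
            error_queue.append((item, str(exc)))
            if len(error_queue) >= max_errors:
                # Fuse blown: stop and hand back what is left untouched
                return results, error_queue, list(items[index + 1:])
    return results, error_queue, []


if __name__ == "__main__":
    done, failed, left = process_with_fuse(["1", "x", "2", "y", "3"], int, max_errors=2)
    print(done)                          # [1, 2]
    print([item for item, _ in failed])  # ['x', 'y']
    print(left)                          # ['3']
```

Only the expected error type is caught, so unexpected exceptions still reach the higher-level handlers (logging, transaction management). The failed and remaining items are returned rather than dropped, so that the job can be rerun after the cause is fixed.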

Fault prevention
  • Use service throttling (to keep the server from being overwhelmed by requests)
  • Try to process data during off-hours in large bulk (e.g. scheduled ETL, precompute aggregation functions in an intermediate table)
  • Make sure that long-running transactions / processing stay within the timeout limits of servers and load balancers. You might need to apply the splitter-aggregator pattern.
  • Prevent further inputs when faults exist in the pipeline (e.g. disable the submit button in the GUI, inform the user "sorry, the service is temporarily unavailable, please try again later").
  • Use hostnames that are easy for humans (e.g. Baan4_FinanceDept server instead of b4f556 server).
  • Automate manual tasks to reduce human error (e.g. using scripts for build, deployment, test)
  • Avoid processes that run too long by means of:
    • simplified data & processes
    • incremental updates (e.g. update using JMS topics instead of a daily database bulk copy)
    • divide & conquer
    • asynchronous processing
    • a dedicated server for long processes
    • rescheduling to a nightly batch job (e.g. ETL for OLAP)
  • Avoid long transactions; you can use compensation instead of relying on transactional rollback.
  • Minimize the amount of hardware (e.g. some servers don't need a monitor screen & mouse), minimize the services/applications running on the server, minimize connections (e.g. to file servers).
  • Validate data to filter out erroneous/malicious inputs. For example, for web service inputs check the data type, range/permitted values and size limit against the XSD.
  • Apply business rule validation (e.g. compare the request with the average size to detect request anomalies, verify time/sequence constraints). You can apply these checks in the application or by using triggers or database constraints. Convert the values to a standard unit before checking (e.g. USD for money, GMT for time comparison).
  • Use RAID storage. Generally the performance of parity-based RAID (3, 4, 5) is poorer than mirroring (RAID 1, 0+1, 1+0), but parity RAID reduces the number of hard disks needed. Buy fast RAID hardware if you decide to use parity-based RAID. RAID 5 is better than RAID 3 and 4 because it avoids the parity disk bottleneck. RAID 1+0 is better than RAID 0+1 in terms of survival rate and recovery time.
  • Use assertions to detect conditions that should never occur. Assertions are useful during development & test to detect bugs; in production you can log mild assertion violations or abort the process for serious violations (e.g. in a radiation therapy machine).
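
Service throttling can be implemented in many ways. A token bucket is one common technique, chosen here only as an example; the class name `TokenBucket` and its parameters are assumptions for this sketch. The bucket refills at a fixed rate and each accepted request consumes one token, so short bursts are allowed but the sustained rate is capped.

```python
import time


class TokenBucket:
    """Allow at most `rate_per_sec` requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock        # injectable so that tests can control time
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill according to the elapsed time, never above the capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=5, capacity=10)
    accepted = sum(bucket.allow() for _ in range(100))
    print(f"accepted {accepted} of 100 requests in a burst")
```

A throttled request can be rejected with a "please try again later" message or parked in a queue, in line with the items above.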

Fault tolerant system:
  • Data and transactions are not lost or corrupted; data integrity / consistency across systems is maintained.
  • Use redundancy / partitioning for storage (e.g. RAID) for better reliability and performance.
  • If you really need high availability and can afford it: maintain an idle copy of the server at another physical site for fast disaster recovery. Use a load balancer to switch the requests automatically in case of disaster.
  • Propagate the fault status along the (service) consumer chain while maintaining graceful degradation, e.g. replace the failed service with a dummy response that conveys status=fault information, or use automatic markdown with a timeout.
  • Use a persistent queue to reduce the impact when faults exist in the pipeline; to avoid a bottleneck in this queue, limit the inputs while the faults exist.
  • Design for N+1 to allow for failure, e.g. if the capacity planning suggests 3 servers, add 1 more in case of failure.
  • Provide redundancy for application servers, databases and networks (including routers, gateways/firewalls). Use a load balancer to route the load away from failed servers and to distribute the work efficiently. The redundant components should be isolated from each other, e.g. don't run them as virtual machines on the same hardware; use dedicated infrastructure if high availability is demanded.
  • For robustness, avoid single points of failure (SPOF) & minimize components in series: remove unnecessary components, add parallel redundancy. Beware of second-order SPOFs, e.g. dependencies on networks, the DNS server, the firewall router, a central security service, a single backup system, the cooling system of the server room. When one of these second-order SPOFs breaks down, the whole infrastructure will fail.
  • Can you isolate the failure (e.g. the security services will not be compromised)? Consider "swim lanes" fault isolation: divide the process into domains. The domains don't share their database, servers, disks, host-hardware. The domains don't share services. No synchronous calls between swim lanes, and only limited async calls between swim lanes (calls between swim lanes increase the chance of failure propagation).
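
The advice to minimize components in series and to add parallel redundancy can be backed by a small calculation. The sketch below assumes that components fail independently, which is exactly what the isolation advice above tries to achieve; the function names are made up for this example.

```python
from math import prod


def series_availability(availabilities):
    """A chain works only if every component works, so the
    availabilities are multiplied."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """A redundant group fails only if every replica fails at the same time."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Two components in series, each 99% available
    print(round(series_availability([0.99, 0.99]), 4))    # 0.9801
    # Two redundant replicas, each 99% available
    print(round(parallel_availability([0.99, 0.99]), 4))  # 0.9999
```

If the replicas share hardware, power or a network path, they do not fail independently and the real availability is lower than the parallel figure suggests.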

Automated recovery techniques
  • Prefer to fail gracefully: the users don't notice that a failure happened and uncommitted work is preserved.
  • If possible, remove/recover the failure condition before retrying/restarting.
  • Provide tools to trigger the recovery manually (in case the automatic mechanism/detection doesn't work).
  • Provide transparency about the number of transactions in each state (e.g. done, pending, failed, retry). Some software persists the current transactions in a file/database so that they can be recovered after a crash (e.g. Weblogic transaction log or Oracle redo log).
  • How easily does the system recover from wrong data (e.g. do I have to reboot the whole system or rebuild the database schema, or does it have self-recovery/failover capabilities)? Can the system recognize and clean the wrong data or move it to the error queue? If I restart the system, will it repeat the same problems because the erroneous data hasn't been cleaned up from the inputs/database?
  • Failover (can be on the server or the client side): self clean-up and restart
  • If the client/server needs to restart: how to reconnect & log in again, preferably without requiring intervention from the user (e.g. using an SSO token).
  • If the other side of the conversation is restarted: how to detect that a restart happened, how to restart/redo the transactions (while maintaining data integrity, e.g. avoiding double requests), how to reestablish the conversation, how to recover session data.
  • Know which states should be cleaned up and which states can be preserved to speed up the restart process & to preserve uncommitted work.
  • Gather as much information as possible about the failure condition (e.g. states that cause the failure, resources that are not responding); this is especially useful for memory-related problems. Save it in protected log files instead of displaying it in the GUI / blue screen (to avoid information gathering by attackers).
  • Enable the resource to reuse the port/connection to avoid "address already in use" errors when reconnecting.
  • Check the application heartbeat: the process might be stuck if it hasn't returned to the main loop / listening state, in which case you might need to kill or restart the process.
  • Audit the automated recovery progress. Raise an exception/alert if the recovery fails.
  • After recovery, is it possible to continue the process with minimal rework (e.g. if the fault happens in the 10000th iteration, you'd better continue from the state of the last iteration instead of having to redo the previous 9999 iterations)?
  • Some automated recovery techniques:
    • wait & retry.
    • clean up (e.g. clean up disk space, move old files, close idle connections, kill processes, restart server instances) without restarting the OS
    • save the entire memory and restart the OS
    • save the last good states (not the entire memory) and restart: this works for memory leak problems, but you might end up with the same problem given the same initialization states & a deterministic algorithm, so you might have to randomize.
    • restart from a checkpoint (using periodically saved configurations/states): useful to avoid long startup/recovery.
    • crash and restart (last resort, you might lose uncommitted data)
  • If your failover routine includes retry:
    • make sure that the target systems can handle duplicate messages.
    • you might need to compensate for / clean up the partial results before retrying.
    • decide when the system stops retrying & escalates the problem (e.g. alerts the operator for manual intervention). For example, if the business requirement demands a response within 3 hours and a manual intervention needs 1 hour, then your system has 3 - 1 = 2 hours to retry.
    • limit the number of retry attempts. Ideally the number of retries is configurable (such as in Weblogic JMS).
    • decide the retry interval (too short an interval may overwhelm the system)
    • you can use a queue to persist unprocessed requests. The queue can provide information such as how many requests are in the queue, the last time a request was successfully processed, and the last failed request.
    • provide a mechanism to disable the retry mechanism
    • provide manual control to reset and resume the system
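
The retry rules above (limited attempts, a retry interval, and a time budget after which the problem is escalated) can be combined as in the following sketch. The names `retry_until_deadline`, `TransientError` and `EscalationNeeded` are hypothetical and chosen for this example. In the 3 - 1 = 2 hours example the retry budget would be 7200 seconds.

```python
import time


class TransientError(Exception):
    """An error that is worth retrying (e.g. a temporarily unreachable service)."""


class EscalationNeeded(Exception):
    """Retrying has been given up; an operator has to intervene."""


def retry_until_deadline(operation, max_attempts, interval_s, retry_budget_s,
                         clock=time.monotonic, sleep=time.sleep):
    """Call `operation` until it succeeds, the attempts are used up, or the
    next retry would fall outside the retry budget. Then escalate."""
    start = clock()
    attempt = 0
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:  # other exceptions propagate immediately
            last_error = exc
        out_of_time = clock() - start + interval_s > retry_budget_s
        if attempt == max_attempts or out_of_time:
            break
        sleep(interval_s)
    raise EscalationNeeded(f"giving up after {attempt} attempt(s)") from last_error


if __name__ == "__main__":
    def unreachable_service():
        raise TransientError("service down")

    try:
        retry_until_deadline(unreachable_service, max_attempts=3,
                             interval_s=0.1, retry_budget_s=1.0)
    except EscalationNeeded as exc:
        print(f"alert the operator: {exc}")
```

Because an operation may be executed more than once, it has to be safe to repeat, which is the point of the first two items in the list above.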

Backup strategies
  • Use partial restore (restore only the parts that are damaged) to speed up the recovery
  • If the data set is too big, use partial backup or incremental backup (e.g. level 0 every weekend, level 1 every day)
  • Schedule the backup process so that it doesn't interfere with daily operational performance (e.g. at night / on weekends)
  • For data consistency, prefer cold backup (prevent updates during backup)
  • Read-only data (e.g. department codes) can be backed up separately and less often
  • Test your backup files (e.g. Oracle RMAN validate / restore ... validate) and restore/recovery procedures (by exercising the procedures)


  • Document the code, document installation/configuration procedures.
  • For easier interoperability & maintenance: use homogeneous networks (e.g. all switches & routers are from Novell) and software platforms (e.g. service bus & BPEL engines both from Weblogic/BEA, all databases from Oracle). But beware of vendor lock-in.
  • Avoid using many (& complex) third-party solutions; each third-party product introduces costs for hardware & personnel, and vendor dependency. For example, I know a company which uses Baan, PeopleSoft and SAP/BO; they are kept busy integrating that software with other systems (LDAP, etc.), spending a lot of money on hiring specialists for custom work during upgrades & integration.
  • Use text-based formats for data exchange (e.g. XML for JMS messages, JSON), since text is more readable than binary data and therefore easier to debug.
  • For traceability to the original request, include request (& sender) identifiers (e.g. requestID, clientID, sequenceID, wsa:MessageID, wsa:RelatesTo) in the reply, in intermediate processing messages, in (error) logs and in messages to/from external systems. The wsa prefix here is the WS-Addressing namespace. This traceability is useful not only for debugging but also for request status monitoring, e.g. when you need to intervene manually in a stuck process within a series of business processes, you know which processing steps have or have not been done for a particular request.
  • Register defects using bug tracking system e.g. Trac, Bugzilla.

Source: Steve's blogs

Any comments are welcome :)



