Batch processing or data integration is mostly about loosely coupled parallel processing and scheduling. As I mentioned before, we need to understand the business problem to make the project successful.
Segregation of data
This analysis should start with a high-level data flow diagram. Understanding where the data comes from and when each source becomes available will help us make critical design decisions. Ideally, data from each source should be usable as soon as it becomes available, without waiting for the others. We can design our data flow architecture using parallel processing across different servers, databases, etc. Database replication can be used to move data. A simple extraction layer can also be built to get data out of the transaction-oriented database for subsequent processing.
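As a rough illustration of using data "as it becomes available", here is a minimal Python sketch. The source names, the delays, and the extract function are all hypothetical stand-ins; a real extract would read from a replica or an extraction layer instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sources and the delay (in seconds) before each one is ready.
SOURCES = {"orders": 0.03, "customers": 0.01, "inventory": 0.02}

def extract(name, delay):
    """Stand-in for a real extract: wait until the source is ready, then return rows."""
    time.sleep(delay)
    return name, [f"{name}-row-{i}" for i in range(3)]

def run_extracts(sources):
    """Start all extracts in parallel and handle each result as it arrives."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(extract, name, delay) for name, delay in sources.items()]
        for future in as_completed(futures):
            name, rows = future.result()
            # Downstream processing for this source could start right here,
            # without waiting for the slower sources.
            results[name] = rows
    return results

loaded = run_extracts(SOURCES)
print(sorted(loaded))
```

The point of the sketch is only that no source waits on another; the same shape applies whether the workers are threads, processes, or separate servers.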
Critical path analysis
Once we have decided on the data flow, we should understand the critical paths of our processing. There will always be a few steps that make us nervous. These are the areas that we should focus on. We need to tune their performance, add instrumentation to track growth trends, and explore relevant new technologies to improve processing time or reliability.
Optimal scheduling is also critical. Unnecessary dependencies can have a much bigger impact on availability than a poorly optimized database query. In my experience, some jobs could start half an hour earlier once all unnecessary dependencies were removed. No one can tune a 10-minute database query to save 30 minutes. On the other hand, missing dependencies also occur, and they can sometimes cause data corruption. We should always review our scheduling dependencies and strive to keep them optimal.
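To make the effect of an unnecessary dependency concrete, the sketch below computes the earliest start time of each job from its predecessors. The job names and run times are invented for illustration, and the calculation assumes every job starts as soon as its last predecessor finishes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def earliest_starts(durations, deps):
    """Return the earliest start minute of each job, given its predecessors."""
    graph = {job: set(deps.get(job, ())) for job in durations}
    start = {}
    for job in TopologicalSorter(graph).static_order():
        # A job can start once its slowest predecessor has finished.
        start[job] = max((start[p] + durations[p] for p in graph[job]), default=0)
    return start

# Hypothetical jobs and run times in minutes.
durations = {"load_sales": 20, "archive_logs": 50, "build_report": 10}

# Original schedule: the report waits for an archive job it does not need.
with_extra = {"build_report": ["load_sales", "archive_logs"]}
# Reviewed schedule: only the true dependency remains.
reviewed = {"build_report": ["load_sales"]}

before = earliest_starts(durations, with_extra)["build_report"]
after = earliest_starts(durations, reviewed)["build_report"]
print(before, after)
```

With these made-up numbers the report starts at minute 50 under the original schedule and at minute 20 after the review, a 30-minute gain that no amount of tuning on the 10-minute report job itself could deliver.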
Rerunnability of an individual job or group of jobs
Each step in the processing flow should ideally be rerunnable. This means that if a job fails, we can simply restart it and continue. During a production outage it is very difficult to decide whether it is safe to rerun one or more jobs. Worse, if they are not safe to rerun, we have to come up quickly with an ad hoc solution for handling the failure.
For example, if we have a script inserting data into a database, the script should have a cleanup step that deletes any data left by a previous run before the insert. Whether we run the job once or ten times, the same data should end up in the database.
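A minimal sketch of such a rerunnable load, using Python's built-in sqlite3 module. The table name, columns, and batch date key are assumptions made for the example; the idea is only that the delete and the insert share one transaction and are scoped to the same batch.

```python
import sqlite3

def load_daily_sales(conn, batch_date, rows):
    """Delete whatever an earlier run left for batch_date, then insert again."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE batch_date = ?", (batch_date,))
        conn.executemany(
            "INSERT INTO daily_sales (batch_date, item, amount) VALUES (?, ?, ?)",
            [(batch_date, item, amount) for item, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (batch_date TEXT, item TEXT, amount REAL)")

rows = [("widget", 10.0), ("gadget", 25.5)]
for _ in range(10):  # run the job ten times
    load_daily_sales(conn, "2009-05-13", rows)

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)
```

Because the cleanup and the insert commit together, a failure halfway through leaves the table as it was before the run, so restarting the job is always safe.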
Use of technology
Fault tolerance or self-recovery is important in some cases. Consider whether our infrastructure can automatically disconnect from a faulty server and retry that piece of work on another server in a retry loop. This saves support staff from having to respond to the failure and restart the job manually.
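The retry loop described above might look roughly like the following. The server names and the demo task are hypothetical, and a real implementation would likely add a delay between attempts and catch whatever exceptions its own client library raises.

```python
def run_with_failover(task, servers, attempts_per_server=2):
    """Try task on each server in turn; return (server, result) from the first success."""
    errors = []
    for server in servers:
        for attempt in range(1, attempts_per_server + 1):
            try:
                return server, task(server)
            except ConnectionError as exc:
                # Record the failure and fall through to the next attempt or server.
                errors.append(f"{server} attempt {attempt}: {exc}")
    raise RuntimeError("all servers failed: " + "; ".join(errors))

def demo_task(server):
    """Hypothetical unit of work; 'node-a' simulates a faulty server."""
    if server == "node-a":
        raise ConnectionError("connection refused")
    return f"processed on {server}"

used, result = run_with_failover(demo_task, ["node-a", "node-b"])
print(used, result)
```

Note that automatic retries are only safe when the unit of work is itself rerunnable, which ties this back to the previous section.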
Also, in-memory databases and file-based processing should be used when appropriate. A relational database is a very powerful, simple and general solution, but it might not be the best tool for a very specific problem. For example, if you need to process all the data sequentially, a file-based solution will usually be faster than a database solution, because it avoids the overhead of indexing and transaction management that a general-purpose DBMS carries.
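As a small sketch of the file-based approach, the function below makes a single sequential pass over CSV data and keeps only the running totals in memory. The column names are assumptions, and an in-memory string stands in for a real file so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

def total_by_account(lines):
    """Sum amounts per account in one sequential pass over CSV lines."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["account"]] += float(row["amount"])
    return dict(totals)

# In production this would be an open file; StringIO keeps the sketch self-contained.
sample = io.StringIO("account,amount\nA,10.5\nB,4.0\nA,2.5\n")
totals = total_by_account(sample)
print(totals)
```

Passing an open file object instead of the StringIO works the same way, since the reader only needs something it can iterate line by line.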
If multithreading is the right technology, we should always review the important considerations that come with it.
Wednesday, May 13, 2009