Technology and Business: May 2009

Wednesday, May 20, 2009

Java data structures

The following are random thoughts on some of the key Java data structures and their importance.

Primitive Types
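What are the sizes of byte, short, int, long, float, double, boolean and char? In Java, these sizes are fixed by the language rather than by the platform: byte is 8 bits, short 16, int 32, long 64, float 32 and double 64. An int is always 32 bits, even on a 64-bit JVM. char is 16 bits and holds one UTF-16 code unit, so a Unicode code point outside the Basic Multilingual Plane takes two chars. The size of a boolean is not specified by the language.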
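
As a quick check, here is a minimal sketch that prints the bit widths using the SIZE constants on the standard wrapper classes. The class name PrimitiveSizes and the sample character are just my own choices for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrimitiveSizes {
    // Bit widths as reported by the wrapper-class SIZE constants.
    static Map<String, Integer> sizes() {
        Map<String, Integer> bits = new LinkedHashMap<>();
        bits.put("byte", Byte.SIZE);
        bits.put("short", Short.SIZE);
        bits.put("char", Character.SIZE);
        bits.put("int", Integer.SIZE);
        bits.put("long", Long.SIZE);
        bits.put("float", Float.SIZE);
        bits.put("double", Double.SIZE);
        return bits;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : sizes().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " bits");
        }
        // U+1F600 lies outside the Basic Multilingual Plane, so it needs two chars.
        String emoji = "\uD83D\uDE00";
        System.out.println("chars: " + emoji.length()
                + ", code points: " + emoji.codePointCount(0, emoji.length()));
    }
}
```

There is no Boolean.SIZE constant, which matches the fact that the language leaves the storage size of a boolean unspecified.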

Encoding
EBCDIC (mainly on mainframes) uses a different encoding scheme from ASCII (UNIX, PC, etc). When you FTP a file from an EBCDIC machine to an ASCII machine, the characters must be converted. FTP does this automatically in ASCII (text) transfer mode, but not in binary mode.

Byte ordering
Byte ordering can be either big-endian or little-endian. If you migrate an application to a different processor family (e.g. from SPARC to x86) and the application code has logic that depends on byte locations, a transformation is needed. If you store numbers in binary format (e.g. a 4-byte int), or text in a multi-byte Unicode encoding such as UTF-16, in a file and transfer the file to a machine with a different byte ordering, a transformation is needed again.
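
To make the 4-byte int case concrete, here is a small sketch using java.nio.ByteBuffer, which lets you pick the byte order explicitly. The class and method names (ByteOrderDemo, encode, decode) are my own, not from any particular application.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class ByteOrderDemo {
    // Writes a 4-byte int using the requested byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Reads a 4-byte int, interpreting the bytes in the requested byte order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        int value = 0x01020304;
        System.out.println(Arrays.toString(encode(value, ByteOrder.BIG_ENDIAN)));    // [1, 2, 3, 4]
        System.out.println(Arrays.toString(encode(value, ByteOrder.LITTLE_ENDIAN))); // [4, 3, 2, 1]
        // Reading little-endian bytes as big-endian silently yields a different number.
        int misread = decode(encode(value, ByteOrder.LITTLE_ENDIAN), ByteOrder.BIG_ENDIAN);
        System.out.println(Integer.toHexString(misread)); // 4030201
    }
}
```

The misread value is exactly what Integer.reverseBytes would produce, which is the transformation needed when the file format and the reader disagree.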

String
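String and StringBuffer. A String is immutable, so every modification creates a new object. StringBuffer should be used if the value of the string will change repeatedly. When only one thread is involved, the unsynchronized StringBuilder (Java 5 and later) is usually the better choice.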

Collections Framework
There are 3 major interfaces in the Collections framework: List, Map and Set. There are specific implementations for each of the 3 major interfaces. For example, there are HashSet, TreeSet and LinkedHashSet implementations for the Set interface.

The general-purpose implementations in the Collections framework are designed to be unsynchronized for performance reasons. However, we can use the synchronization wrappers (e.g. Collections.synchronizedList) to get a synchronized (thread-safe) view of a collection. You can also refer to the external link in my comment for an interesting article on how JVMs do synchronization optimization.
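
Here is a minimal sketch of the wrapper approach, using Collections.synchronizedList around an ArrayList. The thread and element counts are arbitrary numbers I picked for the demonstration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SyncWrapperDemo {
    // Starts several threads that all append to the same list, then returns its final size.
    static int fillConcurrently(final List<Integer> target, int threads, final int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    target.add(i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return target.size();
    }

    // With the synchronized wrapper every add is thread-safe, so no element is lost.
    static int demo() {
        List<Integer> safe = Collections.synchronizedList(new ArrayList<Integer>());
        return fillConcurrently(safe, 4, 10000);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 40000
    }
}
```

Passing a plain ArrayList to the same method may lose elements or throw an exception. Note that the wrapper only protects individual calls; iterating over the list still needs an explicit synchronized block on the list.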

Vector and Hashtable
These are legacy implementations that are synchronized. ArrayList and HashMap are the corresponding unsynchronized versions of Vector and Hashtable.

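hashCode() and equals()
The default implementations in Object are based on object identity: equals() compares references, and hashCode() typically returns a value derived from the object's identity. Two distinct key objects with the same contents are therefore treated as different keys. If keys should be compared by value, or if hashing performance is important, a custom implementation can be evaluated. The two methods must be overridden together so that equal objects have equal hash codes.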
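
A minimal sketch of a value-based key follows. AccountKey and its fields are hypothetical names I made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class: two keys are equal when branch and number match.
final class AccountKey {
    private final String branch;
    private final int number;

    AccountKey(String branch, int number) {
        this.branch = Objects.requireNonNull(branch);
        this.number = number;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AccountKey)) return false;
        AccountKey other = (AccountKey) o;
        return number == other.number && branch.equals(other.branch);
    }

    @Override
    public int hashCode() {
        // Built from the same fields as equals(), so equal keys hash alike.
        return Objects.hash(branch, number);
    }
}

class KeyDemo {
    // Looks up a value with a second key object that has the same contents.
    static String lookup() {
        Map<AccountKey, String> owners = new HashMap<>();
        owners.put(new AccountKey("NY", 42), "Alice");
        return owners.get(new AccountKey("NY", 42));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // Alice
    }
}
```

Without the two overrides, the lookup would return null, because the second AccountKey is a different object from the one used in put().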

Friday, May 15, 2009

Under the hoods of sorting

We all use library routines for sorting data. I suspect that for more than 99% of applications, there is no good reason to justify a custom implementation. I did some reading to refresh my memory on some important points about the popular sorting algorithms.

Popular sorting algorithms
The most popular sorting algorithms used now are still the same as 20 years ago. Merge sort, quick sort, heap sort and insertion sort have been used for a very long time. What has changed drastically during the last 20 years is the amount of physical memory we have in computers (we are spoiled now). The abundance of memory makes these internal sorting algorithms even more popular. When memory was limited, we had to study external sorting algorithms, which rely on disk I/O for large data sets.

Time performance
It is important to understand the average and worst case performance of an algorithm. For example, merge sort, quick sort and heap sort all have average O(n log n) runtime. However, the performance of quick sort degrades to O(n²) for an already sorted list when the 1st element is selected as the pivot. There are workarounds (e.g. choosing the middle element, a random element or the median of three as the pivot) that reduce the chance of hitting the worst case scenario.
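
To see the degradation, here is a rough sketch of a textbook quick sort that counts comparisons on already sorted input. It is my own toy implementation, not what any library uses, and the middle-element pivot is just one of several possible workarounds.

```java
class QuickSortWorstCase {
    private static long comparisons;

    // Sorts an already sorted array of n ints and returns the number of comparisons made.
    static long countOnSortedInput(int n, boolean middlePivot) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        comparisons = 0;
        sort(a, 0, n - 1, middlePivot);
        return comparisons;
    }

    private static void sort(int[] a, int lo, int hi, boolean middlePivot) {
        if (lo >= hi) return;
        if (middlePivot) {
            swap(a, lo, lo + (hi - lo) / 2); // use the middle element as the pivot
        }
        int pivot = a[lo];
        int last = lo; // a[lo+1..last] holds the elements smaller than the pivot
        for (int j = lo + 1; j <= hi; j++) {
            comparisons++;
            if (a[j] < pivot) {
                swap(a, ++last, j);
            }
        }
        swap(a, lo, last); // place the pivot between the two partitions
        sort(a, lo, last - 1, middlePivot);
        sort(a, last + 1, hi, middlePivot);
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        // First-element pivot on sorted input: n(n-1)/2 comparisons.
        System.out.println(countOnSortedInput(1000, false)); // 499500
        // Middle-element pivot on the same input stays close to n log n.
        System.out.println(countOnSortedInput(1000, true));
    }
}
```

With the first element as pivot, every partition step peels off only one element, so 1000 sorted items cost 1000 x 999 / 2 comparisons.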

If the time consumed in sorting is relatively small (e.g. 5% of total), it might not be worthwhile for us to tune the performance. We should review the other 95% first. Only if there is no room for improvement there (highly unlikely) should we determine whether it would still be worth our money to tune the remaining 5%.

We should also be aware of whether a stable sort is required. A stable sort keeps records with equal keys in their original relative order. Some implementations gain performance by not guaranteeing stability. Merge sort is a stable sort. Quick sort and heap sort, in their typical implementations, are not.
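
Here is a small illustration of stability using Arrays.sort on an object array, which the Java documentation guarantees to be stable. The sample strings and the sort-by-first-letter rule are made up for the demonstration.

```java
import java.util.Arrays;

class StableSortDemo {
    // Sorts by the first character only; the rest of each string plays no part in the comparison.
    static String[] sortByFirstLetter(String[] items) {
        String[] copy = items.clone();
        Arrays.sort(copy, (x, y) -> Character.compare(x.charAt(0), y.charAt(0)));
        return copy;
    }

    public static void main(String[] args) {
        String[] input = {"b2", "a1", "b1", "a2"};
        // Items with the same first letter keep their original relative order.
        System.out.println(Arrays.toString(sortByFirstLetter(input))); // [a1, a2, b2, b1]
    }
}
```

Notice that "b2" still comes before "b1" in the output, because that is the order in which they appeared in the input.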

Memory usage
Nowadays, in most situations, memory will not be the limiting factor. However, it is still important to know the memory requirement. For example, heap sort requires O(1) extra memory, quick sort requires O(log n) on average (for the recursion stack) and merge sort requires O(n). If memory is the primary factor, your decision might change, given that all three algorithms have similar average time complexity.

Language support
Java uses merge sort, quick sort or insertion sort by default, depending on the data type and array size. Perl recently moved from quick sort to merge sort (Perl 5.8).

Most of us also do a lot of sorting in relational databases using the SQL ORDER BY construct. The obvious question would be how the database sort compares to the in-memory sorts. We can easily do some research on the internet and verify by running benchmarks on production level hardware. The application server hardware available to perform in-memory sorts could be very different from the database servers.

Takeaways
There is always a benefit in understanding some of the details of a library function with respect to the problem we are trying to solve. This helps us better understand what we are doing.

Sorting (or other data transformation and enrichment) using parallel CPUs and computers has gained popularity in recent years. There are a lot of parallel processing algorithms and tools available. MapReduce and Hadoop are two examples.

Please let me know if you have any thoughts or different views on this.

Wednesday, May 13, 2009

Loosely coupled parallel processing and scheduling

Batch processing or data integration is mostly about loosely coupled parallel processing and scheduling. As I mentioned before, we need to understand the business problem to make the project successful.

Segregation of data
This analysis should start with a high level data flow diagram. Understanding the source of data and their availability will help us make critical design decisions. Ideally, we should have data from different sources, usable as they become available. We can design our data flow architecture using parallel processing with different servers, databases, etc. Database replication can be used to move data. A simple extraction layer can also be built to get data out from the transaction oriented database for subsequent processing.

Critical path analysis
Once we have decided on the data flow, we should understand the critical paths of our processing. There will always be a few steps that make us nervous. These are the areas that we should focus on. We need to tune their performance, add instrumentation to track growth trends, and explore relevant new technologies to improve processing time or reliability.

Optimal scheduling is also critical. Unnecessary dependencies can potentially cause a much bigger impact on availability than a poorly optimized database query. In my past experience, some jobs could actually start half an hour earlier after all unnecessary dependencies were removed. No one can tune a 10-minute database query to save 30 minutes. On the other hand, missing dependencies can also happen, and they can sometimes cause data corruption. We should always review and strive to get the optimal scheduling dependencies.

Rerunnability of an individual job or group of jobs
Each step in the processing flow should ideally be rerunnable. This means that if the job fails, we can just restart it and continue. It is very difficult to decide during a production outage whether it is safe to rerun one or more jobs. Worse, if they are not safe to rerun, we need to come up quickly with some ad-hoc solutions for handling the failure.

For example, if we have a script inserting data into a database, this script should have a cleanup step to delete the data left by any previous run before the insert. We can run the job one or ten times and end up with the same data in the database.
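
The delete-then-insert idea can be sketched as follows. To keep the example self-contained, an in-memory list stands in for the database table; the class name, the batch date key and the row layout are all assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RerunnableLoad {
    // Stand-in for a database table: each row is {batchDate, payload}.
    private final List<String[]> table = new ArrayList<>();

    // Loads one batch. Safe to rerun: rows from an earlier attempt are removed first.
    void load(String batchDate, List<String> payloads) {
        table.removeIf(row -> row[0].equals(batchDate)); // cleanup step
        for (String payload : payloads) {
            table.add(new String[] {batchDate, payload}); // insert step
        }
    }

    int rowCount() {
        return table.size();
    }

    // Runs the same load three times and reports the resulting row count.
    static int demo() {
        RerunnableLoad job = new RerunnableLoad();
        List<String> batch = Arrays.asList("trade-1", "trade-2");
        job.load("2009-05-13", batch);
        job.load("2009-05-13", batch); // rerun after a simulated failure
        job.load("2009-05-13", batch);
        return job.rowCount();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```

In a real database the delete and the insert would normally go in one transaction, so that a failure between the two steps does not leave the table empty.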

Use of technology
Fault tolerance or self-recovery is important in some cases. Consider whether our infrastructure can automatically disconnect from a faulty server and retry that piece of work on another server in a retry loop. This saves the manual effort of responding to a failure and restarting a job.
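
A retry loop of this kind might look like the sketch below. The server names and the simulated failure are invented; a real implementation would also need timeouts, logging and a limit on how long it keeps trying.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class FailoverRunner {
    // Tries the work on each server in turn and returns the first successful result.
    static <R> R runWithFailover(List<String> servers, Function<String, R> work) {
        RuntimeException lastFailure = null;
        for (String server : servers) {
            try {
                return work.apply(server);
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and move on to the next server
            }
        }
        throw new IllegalStateException("all servers failed", lastFailure);
    }

    // Simulated job: the server named "primary" is treated as faulty.
    static String demo() {
        return runWithFailover(Arrays.asList("primary", "backup"), server -> {
            if (server.equals("primary")) {
                throw new RuntimeException(server + " is down");
            }
            return "done on " + server;
        });
    }

    public static void main(String[] args) {
        System.out.println(demo()); // done on backup
    }
}
```

This only works safely when the piece of work is rerunnable, which ties back to the previous point.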

Also, in-memory databases and file-based processing should be used when appropriate. A relational database is a very powerful, simple and general solution, but it might not be the best tool for a very specific problem. For example, if you need to sequentially process all the data, a file-based solution will usually be faster than a database solution. There is none of the overhead of indexing and transaction management that a general-purpose DBMS carries.

If multithreading is the right technology, we should always refresh the important considerations.

Tuesday, May 12, 2009

Concurrency basics

Please refer to multi-core computing for a very basic overview of the topic.

Multi-processing or multi-threading
It is easy to scale using multi-processing when there is no need to share data. For example, we can have a splitter for an input file and have multiple formatting jobs running. At the end, a merge function needs to consolidate the data back into one unit. We need a scheduler to manage the dependencies and scheduling, either a script file or a tool.

On the other hand, we can have a program that has different threads for different purposes and controls all the dependencies and scheduling of the threads within the program. Starting a thread is also much faster and less resource intensive than starting a new process.

Threading considerations
Let's review some key areas.

Thread safety
This is the most important aspect of concurrent processing. We should not have race conditions or data corruption. The classic example of a race condition is the simultaneous deposit/withdrawal used in many textbooks. Moreover, we cannot have thread A, handling the data for client A, overlap with and corrupt the data of thread B for client B. Think clearly about global, class, instance and local variables and their scopes and uses for each thread.
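
The textbook deposit/withdrawal case can be sketched like this. The Account class, the amounts and the thread counts are all made up for illustration.

```java
class Account {
    private long balance;

    // Each method holds the account's monitor, so the read-modify-write cannot interleave.
    synchronized void deposit(long amount) {
        balance += amount;
    }

    synchronized void withdraw(long amount) {
        balance -= amount;
    }

    synchronized long balance() {
        return balance;
    }
}

class AccountDemo {
    // Every thread repeatedly deposits 2 and withdraws 1 on the same shared account.
    static long run(int threads, final int opsPerThread) {
        final Account account = new Account();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    account.deposit(2);
                    account.withdraw(1);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return account.balance();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000)); // 40000
    }
}
```

If the synchronized keywords are removed, the final balance will usually differ from run to run, because "balance += amount" is really a read, an add and a write that two threads can interleave.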

Locking mechanism
Semaphores, mutexes, monitors and locks: each one of them has different uses, with its own pros and cons in concurrency. When we are using a library to implement concurrent data access, we should understand which locking mechanism it uses and why.

Deadlock and livelock
We need to avoid both. We can use prevention (e.g. by ordering) or detection (and kill to recover) mechanisms.
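
Prevention by ordering can be sketched as follows: every thread acquires the two locks in the same global order (here, by a numeric id), so a circular wait cannot form. The Acct class, the ids and the amounts are hypothetical.

```java
class OrderedTransfer {
    static final class Acct {
        final int id;
        long balance;

        Acct(int id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    // Always locks the account with the smaller id first, whatever the transfer direction.
    static void transfer(Acct from, Acct to, long amount) {
        Acct first = from.id < to.id ? from : to;
        Acct second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // Two threads transfer in opposite directions at the same time.
    static long[] demo() {
        final Acct a = new Acct(1, 100000);
        final Acct b = new Acct(2, 100000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 10000; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 10000; i++) transfer(b, a, 2);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {a.balance, b.balance};
    }

    public static void main(String[] args) {
        long[] balances = demo();
        System.out.println(balances[0] + " " + balances[1]); // 110000 90000
    }
}
```

If each thread instead locked "from" first and "to" second, the two threads could each hold one lock and wait forever for the other.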

How many threads to run
In general, we should keep the number of threads less than or equal to the number of CPUs. However, in some cases we can increase the number of threads somewhat if we know some threads are not CPU bound and will be blocked on I/O or other activities. Do some testing and performance benchmarking to determine the optimal number.
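
As a starting point, the pool size can be derived from the CPU count at runtime. In this sketch the factor of 2 for I/O-bound work is purely my own guess to be replaced by benchmarking, and the sum-of-squares task is only a placeholder workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSizing {
    // One thread per CPU for CPU-bound work; more when threads spend time blocked on I/O.
    static int poolSize(boolean ioBound) {
        int cpus = Runtime.getRuntime().availableProcessors();
        return ioBound ? cpus * 2 : cpus; // the factor 2 is only a starting guess
    }

    // Placeholder workload: computes 1*1 + 2*2 + ... + n*n on a fixed-size pool.
    static int sumOfSquares(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize(false));
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int k = i;
                results.add(pool.submit(() -> k * k));
            }
            int sum = 0;
            for (Future<Integer> result : results) {
                sum += result.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("pool size: " + poolSize(false));
        System.out.println(sumOfSquares(10)); // 385
    }
}
```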

Operating system support
It is important for the kernel to have native support for multi-threading. Sometimes, the language or tool opts for user level threads instead of kernel level threads. User level threads in general cannot utilize multiple CPUs, as the kernel treats the whole program as one process. There may also be different threading library implementations for the same operating system. In Linux, there are LinuxThreads and NPTL. In LinuxThreads (found in older Linux versions), each thread actually has a unique PID, so we need to take that into consideration in coding.

Code review, testing and logging
It is not easy to spot and catch every possible threading issue. Reviewing code carefully, focusing on the business logic, variable scope and locking mechanism, can help. Performance testing may detect some bugs by luck. Since it is almost impossible to reproduce production issues that involve the timing of execution of different threads, detailed logging may be desirable for important critical sections.

Java
The most widely used concurrency control in Java is the "synchronized" keyword, which is basically a monitor. However, we should only synchronize the critical sections of the code. In the extreme case, if you synchronize one big function, all threads running that function will be serialized. You can also increase concurrency if you use wait() and notify() correctly.
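
Here is a small sketch of wait() and notification using a one-item hand-off slot between a producer and a consumer. The Slot class is my own toy example, and I use notifyAll() rather than notify() because it is the safer default when more than one kind of waiter may be blocked on the same monitor.

```java
class Slot {
    private Integer value; // null means the slot is empty

    // Blocks while the slot is full, then stores the value and wakes up waiters.
    synchronized void put(int v) throws InterruptedException {
        while (value != null) {
            wait();
        }
        value = v;
        notifyAll();
    }

    // Blocks while the slot is empty, then removes the value and wakes up waiters.
    synchronized int take() throws InterruptedException {
        while (value == null) {
            wait();
        }
        int v = value;
        value = null;
        notifyAll();
        return v;
    }
}

class SlotDemo {
    // A producer thread hands the numbers 1..count to the calling thread, which sums them.
    static int transfer(final int count) {
        final Slot slot = new Slot();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    slot.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += slot.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100)); // 5050
    }
}
```

The wait() calls sit inside while loops, not if statements, so a thread rechecks the condition every time it wakes up.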

Monday, May 11, 2009

Bank stress tests results

We have all heard about the headline numbers of additional capital required for 10 of the 19 bank holding companies. It is important to understand the basic assumptions and the evaluation process to put these numbers in perspective: the $75 billion of additional capital required, and the potential additional losses of $600 billion through the end of 2010.

Assumptions
I am a little bit surprised that the conditions of the stress tests were not clearly listed together in headlines with the results. The macroeconomic scenarios (considered to be worse than expected) used for the stress tests were:

Unemployment: 8.9% in 2009 and 10.3% in 2010
GDP: -3.3% in 2009 and 0.5% in 2010
House Price: -22% in 2009 and -7% in 2010

With these macroeconomic scenarios, the loss rates of the 12 categories of loans were projected.

Process and Methodology
With the above-mentioned assumptions, each bank used its own risk models and calculated its potential loss. The same was done to project revenue, profits and cash flows to determine available capital. The 180-person federal team then audited each bank's models and results, and requested additional data for clarity. One important observation: the final published results of some banks differed drastically from those reported just 2 weeks earlier. This showed how uncertain these "estimates" were.

I also read in an article that the trading exposure of only 5 banks was included in the stress tests. $100 billion of trading assets seemed to be the threshold.

Questions
I only spent a couple of hours following the news on the stress test results. I am interested to know some basics of "how" these numbers were established. If I had a real need to know (which I don't now), I could drill down and analyze the following in more depth.

Are the worst case scenarios really bad enough? A peak unemployment rate of 10.3% and GDP growth of 0.5% in 2010 do not seem like an extremely dire scenario now. The April 2009 unemployment rate already equaled the more adverse scenario for 2009 in the stress test.
How was the potential loss of each loan category for each bank estimated and approved by the federal team? This is the heart of the problem: how much of each type of asset each bank has on its books, and how to value them.
I am also assuming the stress tests did not include any future trading risk of the 5 companies. The potential loss was all based on their existing exposure. Each company can and will change its strategy in the future. Also, is it true that trading exposure was included only for banks with at least $100 billion of trading assets? If a bank had $80 billion of trading assets and (hypothetically) lost them all next year, was that risk not considered in the stress tests? I must be wrong here.
What assumptions were used to evaluate the revenue, market share, profitability and growth of each company? If we look back at the last few quarters, there were far too many surprises in the sector.

Friday, May 8, 2009

Apple: from iPod to iPhone

I have only used 2 types of smart phones, Blackberry and iPhone, so I may not have a complete view on this.

iPhone 3G
I now use my iPhone for cell phone calls, VoIP phone calls, email, iPod music player, GPS, news reader (Bloomberg, NY Times, Wall Street Journal), checking weather, internet radio, calendar appointments, some occasional casual games (with good motion sensing controls), and entertaining my 2 daughters when needed.

The web browser is nice, but I use it less frequently now as most websites have a native application available. I also use the camera; it is a decent backup. Then there is a voice recorder, a Chinese/English translator, and a Wi-Fi keyboard/mouse to control the Mac mini hooked up to my TV. I also started reading some very basic Amazon Kindle books on the iPhone (but the screen is just too small for serious reading).

Vision by Apple
An extremely user-friendly multi-touch screen is the heart of the iPhone and iPod touch. Add an always-on 3G Internet connection and the well-thought-out App Store, and the possibilities are endless. The included iPod was an early hit, and it will continue to be.

Technologies
There are now many similar devices that rely solely on the touch screen for input, but it was Apple that pioneered this niche. Who would have thought that there would be more than 17 million iPhones (plus ~15 million iPod Touches) worldwide in less than 2 years? The user interface is the key technology differentiator for the iPhone. The business model for the App Store is amazing. It attracts many individual developers to publish very low cost applications (many are free). Recently, Apple has been heavily promoting the milestone of 1 billion application downloads from about 40,000 applications.

Wish List
Adding turn-by-turn GPS with voice prompts and a video camcorder to the iPhone would make it the "one and only" device you need to carry. Both features should be feasible as software-only updates. Improved battery life would also help, as all of us are using the device more and more. Background apps, if done right without impacting battery life, could also open up a lot of opportunities for the iPhone to act as an agent that reminds you of everything.

Of course, with Google (Android), Palm (Pre) and Research in Motion (Blackberry) also targeting the same market, more innovations will come. We as consumers will benefit from this healthy competition.

Thursday, May 7, 2009

How to lead a team

I worked for a company that, like many other companies, has an annual anonymous 360-degree performance review process. I have received input from my direct reports, peers and managers throughout the years. I have been managing different teams for the past 9 years. I have worked with independent consultants, outsourced consultants (both on-site and offshore), full time employees and summer interns.

Flexibility and Integrity
A manager, in many ways, should be called a coach. Like a professional sports coach, a good manager uses his/her experience to guide and motivate everyone and to get the best out of the team. It is a little bit more complex in the business world, as we are generally dealing with people with diverse cultural backgrounds, different personal goals and values, and different experience and skill levels. There are also differences by age group in how people perceive work and family. There is no simple answer on what works best. The coach needs to adapt based on the team structure, the company culture and the style of upper management. Consultants and employees generally have different objectives and motivations and need different coaching styles.

These days, every company has very aggressive schedules and deadlines. Every level of management feels the same pressure and is trying to do more with less. We have to lead the team as fairly as possible while balancing the company's needs.

Skills and knowledge
Let's start with good observation and listening skills. It is so important to know your team well: each person's strengths, interests and longer term ambitions.

Motivation is also at the top of the list. It is the best of both worlds when an employee is motivated. He/she will automatically line up his/her interests with yours and the company's. Like a "winner" in professional sports, this is the type of person who performs best when needed. Giving credit and not taking credit away from anyone is a key principle that I use. I find it very useful in building up trust, which will in turn motivate the whole team and create success and recognition for everyone.

A quick note on remote and global teams. From my past experiences, it took commitments and personal sacrifices to motivate the remote team. You had to show the remote team members that you genuinely treated them as part of the team. Many companies have offshore teams, but the real productivity and success is in the hand of the day to day execution of the direct manager.

I am a true believer in "lead by example" and delegation. You should not preach something you would not do yourself if you were in their roles. If you are a team lead or manager, it is impractical for you to know the details of everyone's work. You have your own responsibilities. So, having a big picture view, knowing the status of the team's work, and having the ability and interest to drill down into the details as needed may be a winning combination.

As a manager, you need to be able to explain the vision to the team and to help them make good decisions. In career development, you should help them grow by pointing out the important skills they need to acquire to succeed. Open and honest feedback, if done right, will help the employee grow.

Matching up assignments to available team members requires careful planning. There are assignments that most people would like to do, and there are always some areas that are perceived as grunt work but are critical and have to be done.

Be a good team member
Besides working with your team, remember you are also part of your manager's team. All the points we discussed here also apply to your manager. You have to play the role of a good team member too. You need to be flexible and support the goals of your management. Put yourself in their shoes and you will probably understand and appreciate their actions more.

Wednesday, May 6, 2009

OTC derivatives

Nowadays, when you turn on CNN or CNBC, or read the news on the internet, chances are you will hear something about TARP and the bank stress tests. You may also hear about CDO, CDS etc. These financial products certainly get some bad press. Let's try to cover some basics of what they are and some of their properties.

OTC and Derivatives
The common exchange-traded financial products, like stocks, bonds, futures and currencies, are traded through a broker via an exchange. An exchange can be physical (e.g. part of NYSE) or electronic (e.g. NASDAQ). One of the main advantages of using an exchange is that it eliminates counter-party risk. It also improves market liquidity.

When we have an over-the-counter (OTC) contract, it is purely between 2 parties. Broker A can sell a contract tracking the performance of one or more financial products to Client B. They are bound by the terms of the legal agreement. Because of the resulting counter-party risk, there is always the need for collateral. Periodic mark-to-market and contract resets are also used to determine whether additional collateral needs to be posted.

Derivative is a general term for a product that is traded based on some other underlying product(s), i.e. its value is derived from those product(s). A call option on IBM is a derivative whose value is derived from IBM stock. Futures and swaps are other examples of derivatives.

Similar to exchange-traded derivatives, OTC derivatives can be used to speculate or hedge. The leverage ratio of these contracts can also vary. The buyer may not need to come up with all the principal for the contract. The seller can determine what risk it is willing to take and only require a percentage to be posted as collateral.

It is how a company uses these financial instruments that makes them risky. For example, if a company buys a Credit Default Swap (CDS) to hedge against a default by the issuer of one of its junk bond holdings, it will stabilize its portfolio. Of course, it now has to think about whether to hedge the default risk of the counter-party writing the CDS.

Integrated Reporting of OTC Derivatives
There are lots of non-standard attributes for different OTC derivatives products. If we understand what some of the key attributes are and what they mean, we can design our processing flows accordingly. I am not going to cover all the different attributes here, but would like to point out that there are usually 3 main components: the "exposure" part, the "interest" part, and the "collateral" part. In contrast, the exchange-traded products usually only contain the "exposure" attributes.

For example, if we do not need the collateral and interest rate information on the OTC derivatives for some custody or accounting reports, they can be optional attributes for those applications.
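
One way to model this in code is to make exposure mandatory and leave the other two components optional. The sketch below is only an illustration; OtcPosition and all of its field names are hypothetical and much simpler than a real position record.

```java
import java.math.BigDecimal;

class OtcPosition {
    final String contractId;
    final BigDecimal exposure;     // always present
    final BigDecimal interestRate; // null when the consuming report does not need it
    final BigDecimal collateral;   // null when the consuming report does not need it

    OtcPosition(String contractId, BigDecimal exposure,
                BigDecimal interestRate, BigDecimal collateral) {
        if (contractId == null || exposure == null) {
            throw new IllegalArgumentException("contractId and exposure are mandatory");
        }
        this.contractId = contractId;
        this.exposure = exposure;
        this.interestRate = interestRate;
        this.collateral = collateral;
    }

    // Convenience factory for applications that only care about exposure.
    static OtcPosition exposureOnly(String contractId, BigDecimal exposure) {
        return new OtcPosition(contractId, exposure, null, null);
    }

    boolean hasInterest() {
        return interestRate != null;
    }

    boolean hasCollateral() {
        return collateral != null;
    }

    public static void main(String[] args) {
        OtcPosition forCustody = exposureOnly("SWAP-1", new BigDecimal("1000000"));
        System.out.println(forCustody.hasCollateral()); // false
    }
}
```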

Tuesday, May 5, 2009

From HTTP/HTML to Web 2.0

The internet was revolutionary. It changed our lives as individuals. Companies changed strategies to adapt to this new channel. Let's look at how the technologies evolved since the 1990s.

Early Days
Netscape released Navigator, one of the first widely used web browsers, in late 1994. It started the internet revolution. The worldwide network used for email and other electronic file exchanges served as the internet backbone. This universal connectivity was critical for the success of the internet.

The browsers and the world wide web servers exchange information through the HTTP protocol. HTTP is a stateless request/response protocol. The transport is mostly over TCP/IP. The rendering was done in a very simple markup language, HTML. Besides rendering text and images, HTML was mostly about hyperlinks and forms. The beauty of HTML was its simplicity.

The security model (i.e. sandbox) of the popular browsers was also critical to the exponential growth of the internet. What you did from your browser was designed to be safe: the code in a page was not supposed to be able to do anything harmful to your computer or secretly retrieve information from it.

Mass adoption
In the meantime, consumers started to subscribe to internet connections via dial-up providers such as AOL or via cable companies, hooking up directly to the internet for the first time from their homes. Everyone gradually caught on to the internet. Companies recognized that they could use the platform to launch new businesses and extend their existing businesses. Then came the dot-com boom, and eventually the dot-com crash. Amazon, eBay and Yahoo are a few of the successful companies born during that period that still remain dominant forces now.

Extensions like applets, ActiveX controls, Netscape plug-ins, Netscape Communicator channels and Adobe Flash were introduced to achieve a richer and more user friendly environment. There was also a lot of hype during the dot-com boom that the browser would take over from the operating system. The main driver for these new technologies was to gain market share for their respective companies. You can make your own judgment on how well most of these non-standard technologies have held up over time.

Web 2.0
Nowadays, the internet is used mostly as an interactive and collaborative tool. The request/response model and the simplistic HTML rendering that worked very well in the early days are showing signs of aging. AJAX, or Asynchronous JavaScript And XML, is the buzzword now, with some credibility. It enables web pages to have more dynamic content without the need to hit "refresh" (e.g. type-ahead form suggestions and continuous updates to stock quotes). It also allows layers of information on top of each other (e.g. Google Maps). Mashups also seem to be gaining good momentum.

Tools also played a big role in Web 2.0. On the server and content side, we now have many frameworks to choose from. The Spring framework combines inversion of control (dependency injection) with aspect-oriented programming. The details will be in another post. Microsoft has the ASP.NET and C# framework, mainly to make work easier for people in the Microsoft camp. Then there is the high level Ruby on Rails. Lastly, don't forget the still popular Struts and servlet-based MVC frameworks. There are also a lot of web servers, database access and caching tools and technologies to choose from. There is no one size fits all in selecting the right technologies. Each company will need to assess its unique requirements, its existing infrastructure and its IT staff's skill sets to determine the best technologies to use.

On the presentation side, information sharing is becoming bigger and better. New blogs, tweets and social networking sites are springing up very quickly. With the availability of these user friendly sites, a wider group of users can effectively share knowledge and information. This is certainly a good use of technology. I am very interested to see what we will have 5 or 10 years from now.

Monday, May 4, 2009

Data integration challenges

Data is everywhere and is expanding at a very fast pace. Companies are dealing with gigabytes and terabytes of data.

Business Needs
Every business wants the final processed data to be accurate and delivered on time. As a technologist working very closely with the business to serve our clients, I fully understand that.

It is the responsibility of the technology department to point out the complexity (i.e. cost) and potential problems (i.e. cost again) so that contingency plans can be agreed upon and established. This will take some serious effort and trust to achieve, as each team has different domain knowledge and perspectives. If the business team does not understand what is technically feasible, they will naturally try to ask for more. If the technology team does not understand the business drivers, they will focus on the wrong problem to solve. Ultimately, only if the business and technology staff can truly work together as one team can the results be substantially better and create a huge competitive advantage for the company.

We absolutely do not want to over-engineer for the not-so-critical functions or exceptional cases. However, we want to make sure we use the best and most effective technology to handle the most critical scenarios.

Technologies
We need to think about data formats, processing methods, hardware and network bandwidth, scalability, data integrity checks, contingency sources, etc. Let me touch upon a little bit of each.

Formats - XML, flat files, relational database, other industry standards (e.g. FIX, SWIFT, FpML)
Processing methods - when we should use pre-processors, in-memory databases, replications, DOM vs SAX parsers for XML. Archiving, compression, purging schemes. Design from transaction processing to data warehouse. Database normalization and performance tuning.
Hardware/Network bandwidth - is CPU, memory or I/O the bottleneck? NAS/SAN or local disk implications.
Scalability - how can the infrastructure scale to the anticipated growth (2x, 10x, or 100x)? Based on realistic projections, we can make significantly different design decisions. Can we just horizontally scale the application by adding more instances or hardware? Some designs will NOT allow us to do that.
Data Integrity checks - sanity checks, row count checks, mandatory vs optional fields.
Contingency - critical path analysis, checkpoints and how to partially rerun a batch, alternative data sources or algorithms.
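
To make the data integrity item a bit more concrete, here is a rough sketch of a row count check and a mandatory field check over an already parsed batch. The row representation (a map of field name to value), the field names and the expected count are all assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IntegrityChecks {
    // Returns a list of human-readable problems; an empty list means the batch passed.
    static List<String> validate(List<Map<String, String>> rows,
                                 int expectedRowCount,
                                 List<String> mandatoryFields) {
        List<String> problems = new ArrayList<>();
        if (rows.size() != expectedRowCount) {
            problems.add("row count " + rows.size() + " != expected " + expectedRowCount);
        }
        for (int i = 0; i < rows.size(); i++) {
            for (String field : mandatoryFields) {
                String value = rows.get(i).get(field);
                if (value == null || value.trim().isEmpty()) {
                    problems.add("row " + i + ": missing mandatory field " + field);
                }
            }
        }
        return problems;
    }

    // Sample batch: the trailer promised 3 rows, and the second row has no price.
    static List<String> demo() {
        Map<String, String> good = new HashMap<>();
        good.put("id", "T1");
        good.put("price", "101.5");
        Map<String, String> bad = new HashMap<>();
        bad.put("id", "T2");
        bad.put("price", "");
        return validate(Arrays.asList(good, bad), 3, Arrays.asList("id", "price"));
    }

    public static void main(String[] args) {
        for (String problem : demo()) {
            System.out.println(problem);
        }
    }
}
```

Optional fields are simply left out of the mandatory list, so they never produce a problem entry.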

It takes a lot of planning to get things right, and it takes only one unanticipated problem to create a critical failure. Any data integration process is also an on-going project. Volume growth, new data sources and new use cases will test your initial design and show how flexible and extensible it really is.

I will drill down into more details for some of the technical considerations listed above in future posts.

Friday, May 1, 2009

Netbook has a winning formula

I have been using a 1.66GHz, 10 inch Netbook since March. Besides using it on the road, the Netbook is also my favorite computer at home for most simple tasks.

Understanding the basics
Battery life and weight should rank at the top of most people's priority list for a simple on-the-road computing device. So, what about a "laptop" with 7 hours of battery life (with Wi-Fi on) that weighs only about 3 pounds?

Creative packaging
My netbook is an ASUS Eee PC 1000HE. It uses a low power consumption, hyper-threading Intel processor. It runs Windows XP instead of Vista, and has a 1024-pixel-wide screen (wide enough for most web pages). The chiclet (Apple style) keyboard is very nice too at 92% of full size. The performance button can overclock or underclock the CPU as needed. The 2-finger scrolling on the touch pad is very user friendly. To round out a nice netbook, ASUS includes a 160GB hard drive, 1 GB of memory, 802.11n Wi-Fi and Bluetooth.

Drawbacks and workarounds
I am slightly disappointed with the 600-pixel vertical resolution. It is acceptable once I switch to full screen mode for most programs. To work around the missing DVD drive, I shared the DVD drive of my Vista laptop with this netbook over Wi-Fi, for the occasional program installation.

Conclusion
This new invention is quickly gaining more market share over traditional laptops. It will continue to get better and be more powerful.
