Tuesday, 2 June 2009

Fast Track Data Warehouse Reference Architectures

Introduction
In my previous entries, I’ve covered some of the important new features of SQL Server 2008, how they work, and how they can improve efficiency and performance as well as save money. In this entry, I’m going to take a slight diversion and introduce you to the recently published Fast Track Data Warehouse Reference Architectures, which are essentially a set of guidelines that will help you to plan and implement data warehousing solutions.
Building a data warehouse represents a major investment for any organization, and it requires a significant development effort. Hardware and software can be complex to install, configure, test, and tune, and because the design requirements of a data warehouse are very different to those of OLTP databases, specialist skills are needed that your average DBA is unlikely to possess. That’s not to say DBAs can’t learn these skills, of course, but training them up will add to the project’s cost and potentially delay its progress. As a result, development can be a long, expensive process, and because of the complexities involved, there is no guarantee that the finished data warehouse will deliver the required levels of performance or the business insight needed to drive revenue.

Fast Track Data Warehouse Reference Architectures
The new Fast Track Data Warehouse Reference Architectures are designed to address these issues and to ensure that organizations can quickly and efficiently create high-performance, highly scalable, cost-effective solutions that meet the needs of the business.
Specifically, the aims of the Fast Track Reference Architectures are:
• To speed up deployment and provide a faster time to value
• To lower TCO for the customer
• To provide scalability up to tens of terabytes
• To provide excellent performance out of the box
• To provide a choice of hardware vendors

The Fast Track reference architectures deliver on these aims through a combination of factors:
• Firstly, they provide a set of pre-tested hardware configurations based on servers from trusted leading vendors, including HP, Dell, and Bull. This drastically reduces the time to value and TCO because it removes the need for customers to source, configure, and test hardware and software themselves, and it provides a reliable, high-performance platform. The hardware configurations include two-, four-, and eight-processor options so that differing performance, scalability, and pricing needs can be met, and extensive supporting technical documentation and best practice guides ensure that customers can fine-tune systems to their specific requirements. The available documentation and support files also make it much more straightforward and less risky for organizations to create their own custom configurations, should they choose to go down that route. The choice of vendors gives organizations the flexibility to make best use of their existing in-house skill base, and reduces the need for retraining.
• Secondly, they leverage the features of SQL Server 2008 Enterprise Edition to help deliver performance, flexibility, and scalability, and to drive down TCO. These include data and backup compression, partitioning, Resource Governor, and star join query optimization; a brief sketch of some of these appears after this list.
• Finally, the reference architecture configurations are optimized for sequential I/O and use a balanced approach to hardware that avoids performance bottlenecks in the system.
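To make the second point more concrete, here is a minimal T-SQL sketch of a few of those features. The SalesDW database, the FactSales table, and the backup path are hypothetical names used purely for illustration:

-- Hypothetical names throughout; adjust to your own schema.
-- Page compression (new in SQL Server 2008) reduces storage and I/O:
ALTER TABLE FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Backup compression shrinks backup sizes and backup times:
BACKUP DATABASE SalesDW TO DISK = N'E:\Backups\SalesDW.bak'
WITH COMPRESSION;

-- Partitioning a fact table by date keeps related rows together,
-- which supports the sequential access patterns discussed below:
CREATE PARTITION FUNCTION pfOrderDate (DATE)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01');
CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

Let’s explore the last two concepts, sequential I/O and the balanced hardware approach, in a little more detail.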

Sequential I/O
The Fast Track reference architectures are based on the concept of sequential I/O as the primary method of data access. Data warehouses have a usage pattern that is very different to that of OLTP systems. A business intelligence query will usually involve selecting, summarizing, grouping, and filtering data from tables that consist of millions or even billions of rows, and will return results for a range of data. For example, a query may return a summary of sales for a particular product from date A to date B. Rows in fact tables are often stored ordered by date, so SQL Server can process queries like this by accessing data sequentially from disk, which, assuming minimal data fragmentation, is very efficient. Sequential I/O and predominantly read-based activity are key characteristics of data warehouse workloads, in contrast to OLTP workloads, which more commonly involve random I/O and extensive read/write activity as rows are inserted, updated, and deleted.
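As an illustration of that pattern, the query below summarizes sales for a single product over a date range; the FactSales table and its columns are hypothetical names. With the rows stored in date order, SQL Server can satisfy the scan with large sequential reads rather than scattered random I/O:

-- Hypothetical schema: FactSales(OrderDate, ProductKey, SalesAmount).
SELECT ProductKey,
       SUM(SalesAmount) AS TotalSales
FROM FactSales
WHERE ProductKey = 42
  AND OrderDate BETWEEN '20090101' AND '20090331'
GROUP BY ProductKey;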

Balanced Approach
The second key concept underlying the Fast Track Reference Architectures involves optimizing throughput by taking a balanced approach to hardware. Rather than looking at factors such as CPUs, I/O channels, and the I/O capacity of the storage system in isolation, a balanced approach assesses the collective impact of these components on total throughput. This helps to avoid the accidental creation of bottlenecks, which can occur if the throughput of any one component is not balanced against the others. For example, if the storage system does not have enough drives, or the drives are not fast enough, data cannot be read from them quickly enough to match the capacity of the other hardware components (primarily the CPUs and the system bus), and performance will suffer. This can be confusing to administrators because monitoring may reveal that, for example, the CPUs have spare capacity, and yet response times are still poor. Adding more CPUs would have no effect in this scenario, because the problem is that the hardware is not balanced correctly. Solving the problem involves improving the throughput of the limiting factor, which in this case is the storage system. A balanced approach starts with the CPUs, evaluating the amount of data that each core can process as it is fed in, and balances the other components against this.
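The sizing arithmetic implied here can be sketched in a few lines of T-SQL. The per-core consumption rate and per-drive throughput below are assumed, illustrative figures, not official Fast Track or vendor numbers:

-- Illustrative only: both rates below are assumptions.
DECLARE @CoreCount        INT = 8;    -- CPU cores in the server
DECLARE @MBPerSecPerCore  INT = 200;  -- assumed MB/s each core can consume
DECLARE @MBPerSecPerDrive INT = 100;  -- assumed sequential MB/s per drive

-- Storage throughput must match what the CPUs can consume:
SELECT @CoreCount * @MBPerSecPerCore                       AS RequiredMBPerSec,
       (@CoreCount * @MBPerSecPerCore) / @MBPerSecPerDrive AS MinimumDrives;
-- 8 cores x 200 MB/s = 1,600 MB/s, which needs at least 16 such drives.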

Project Madison – Scaling to Petabyte Levels
The Fast Track reference architectures we’ve discussed here are all based on a symmetric multiprocessing (SMP) ‘shared everything’ model, in which your database is hosted on a single powerful server with dedicated CPU and disk resources. These configurations offer excellent value, scalability, and performance for databases in the 4 to 32 terabyte range, but they are unsuitable for larger implementations because the resulting increase in resource contention erodes the performance benefits. An extended set of reference architectures, codenamed ‘Project Madison’, is due for release in the near future. Madison provides a scale-out, shared nothing architecture based on the concept of ‘massively parallel processing’ (MPP), in which multiple servers work together, coordinated by an MPP query optimizer. ‘Shared nothing’ refers to the fact that each server has its own set of resources, which it does not share with any of the other servers. Madison enables growth from terabyte to petabyte levels through scaling out, providing a growth path for businesses that meets their requirements now and in the future.
