The Art of Capacity Planning
Scaling Web Resources

First Edition October 2008
ISBN 978-0-596-51857-8
152 pages
EUR37.00

Table of Contents

Chapter 1: Goals, Issues, and Processes in Capacity Planning
Content preview
presented in the following chapters. If you do not grasp the concepts introduced in this chapter, reading the remainder of this book will be like setting out on the open ocean without knowing how to use a compass, sextant, or GPS device—you can go around in circles forever.
When you break them down, capacity planning and management—the steps taken to organize the resources your site needs to run properly—are, in fact, simple processes. You begin by asking the question: what performance do you need from your website?
First, define the application's overall load and capacity requirements using specific metrics, such as response times, consumable capacity, and peak-driven processing. Peak-driven processing is the workload experienced by your application's resources (web servers, databases, etc.) during peak usage. The process, illustrated in the figure below, involves answering these questions:
  1. How well is the current infrastructure working?
    Measure the characteristics of the workload for each piece of the architecture that comprises your applications—web server, database server, network, and so on—and compare them to what you came up with for your performance requirements mentioned above.
  2. What do you need in the future to maintain acceptable performance?
    Predict the future based on what you know about past system performance, then marry that prediction with what you can afford and a realistic timeline. Determine what you'll need and when you'll need it.
  3. How can you install and manage resources after you gather what you need?
    Deploy this new capacity with industry-proven tools and techniques.
  4. Rinse, repeat.
    Iterate and calibrate your capacity plan over time.
Figure: The process for determining the capacity you need
Your ultimate goal lies between not buying enough hardware and wasting your money on too much hardware.
Let's suppose you're a supermarket manager. One of your tasks is to manage the schedule of cashiers. Your challenge is picking the right number of cashiers working at any moment. Assign too few, and the checkout lines will become long, and the customers irate. Schedule too many working at once, and you're spending more money than necessary. The trick is finding the right balance.
End of preview. The rest of this section is not available here.
Quick and Dirty Math
The ideas I've just presented are hardly new, innovative, or complex. Engineering disciplines have always employed back-of-the-envelope calculations; the field of web operations is no different.
Because we're looking to make judgments and predictions on a quickly changing landscape, approximations will be necessary, and it's important to realize what that means in terms of limitations in the process. Being aware of when detail is needed and when it's not is crucial to forecasting budgets and cost models. Unnecessary detail means wasted time. Lacking the proper detail can be fatal.
End of preview. The rest of this section is not available here.
Predicting When Your Systems Will Fail
Knowing when each piece of your infrastructure will fail (gracefully or not) is crucial to capacity planning. Capacity planning for the web, more often than one would like to admit, looks like the approach shown in the figure below.
Figure: Finding failure points
Including this information as part of your calculations is mandatory, not optional. However, determining the limits of each portion of your site's backend can be tricky. An easily segmented architecture helps you find the limits of your current hardware configurations. You can then use those capacity ceilings as a basis for predicting future growth.
For example, let's assume you have a database server that responds to queries from your frontend web servers. Planning for capacity means knowing the answers to questions such as these:
  • Taking into account the specific hardware configuration, how many queries per second (QPS) can the database server manage?
  • How many QPS can it serve before performance degradation affects end user experience?
Adjusting for periodic spikes and subtracting some comfortable percentage of headroom (or safety factor, which we'll talk about later) will render a single number with which you can characterize that database configuration vis-à-vis the specific role. Once you find that "red line" metric, you'll know:
  • The load that will cause the database to fail, which will allow you to set alert thresholds accordingly.
  • What to expect from adding (or removing) similar database servers to the backend.
  • When to start sizing another order of new database capacity.
We'll talk more about these last points in the coming chapters. One thing to note is the entire capacity planning process is going to be architecture-specific. This means the calculations you make to predict increasing capacity may have other constraints specific to your particular application.
For example, to spread out the load, a LAMP application might use a MySQL server as a master database to which all live data is written, and use a second, replicated slave database for read-only operations. Adding more slave databases to scale read-only traffic is generally an appropriate technique, but many large websites (including Flickr) have been forthright about their experiences with this approach and the limits they've encountered. There is a limit to how many read-only slave databases you can add before you begin to see diminishing returns, because the rate and volume of changes to data on the master database may be more than the replicated slaves can sustain, no matter how many you add. This is just one example of how your architecture can have a large effect on your ability to add capacity.
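The "red line" arithmetic described earlier in this section can be sketched in a few lines. This is only an illustration: the function names, the 2,000 QPS ceiling, and the 25 percent headroom are assumptions chosen for the example, not figures from the book.

```python
import math


def red_line_qps(ceiling_qps: float, headroom_fraction: float) -> float:
    """Usable QPS after subtracting a safety margin from the measured ceiling.

    ceiling_qps is the load at which degradation becomes visible to users;
    headroom_fraction (0 <= value < 1) is the share held back as a safety factor.
    """
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    return ceiling_qps * (1 - headroom_fraction)


def servers_needed(peak_qps: float, ceiling_qps: float, headroom_fraction: float) -> int:
    """Number of identical servers needed to carry peak_qps below the red line.

    Assumes load spreads evenly and capacity scales linearly with server count,
    which is an idealization.
    """
    return math.ceil(peak_qps / red_line_qps(ceiling_qps, headroom_fraction))


if __name__ == "__main__":
    # Hypothetical numbers: a 2,000 QPS ceiling with 25% headroom.
    print(red_line_qps(2000, 0.25))
    print(servers_needed(4000, 2000, 0.25))
```

The linear-scaling assumption in `servers_needed` is exactly what the replication example above warns about: for read-only slaves, write volume on the master can cap the benefit of each additional server.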
End of preview. The rest of this section is not available here.
Make Your System Stats Tell Stories
Server statistics paint only part of the picture of your system's health. Unless they can be tied to actual site metrics, server statistics don't mean very much in terms of characterizing your usage. And this is something you'll need to know in order to track how capacity will change over time.
For example, knowing your web servers are processing X requests per second is handy, but it's also good to know what those X requests per second actually mean in terms of your users. Maybe X requests per second represents Y number of users employing the site simultaneously.
It would be even better to know that of those Y simultaneous users, A percent are uploading photos, B percent are making comments on a heated forum topic, and C percent are poking randomly around the site while waiting for the pizza guy to arrive. Measuring those user metrics over time is a first step. Comparing and graphing the web server hits-per-second against those user interaction metrics will ultimately yield some of the cost of providing service to the users. In the examples above, the ability to generate a comment within the application might consume more resources than simply browsing the site, but it consumes less when compared to uploading a photo. Having some idea of which features tax your capacity more than others gives you context in which to decide where you'll want to focus priority attention in your capacity planning process. These observations can also help drive any technology procurement justifications.
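As a rough sketch of the idea above, the share of load attributable to each kind of user activity can be estimated by weighting the activity mix by a relative cost per action. The activity names, percentages, and cost weights below are invented for illustration; real values would come from your own measurements.

```python
def load_share_by_activity(user_share: dict, cost_per_action: dict) -> dict:
    """Estimate what fraction of total load each activity is responsible for.

    user_share maps activity -> fraction of simultaneous users doing it.
    cost_per_action maps activity -> relative resource cost (any consistent unit).
    """
    weighted = {name: user_share[name] * cost_per_action[name] for name in user_share}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


if __name__ == "__main__":
    # Hypothetical mix: few users upload, but uploads are the most expensive action.
    shares = load_share_by_activity(
        {"upload": 0.10, "comment": 0.20, "browse": 0.70},
        {"upload": 10.0, "comment": 3.0, "browse": 1.0},
    )
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name}: {share:.0%} of load")
```

Even with made-up weights, the exercise shows how a small group of users can account for the largest share of resource consumption, which is the kind of context the paragraph above recommends bringing to planning discussions.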
Quite often, the person approving expensive hardware and software requests is not the same person making the requests. Finance and business leaders must sometimes trust implicitly that their engineers are providing accurate information when they request capital for resources. Tying system statistics to business metrics helps bring the technology closer to the business units, and can help engineers understand what the growth means in terms of business success. Marrying these two sets of metrics can therefore help raise awareness that technology shouldn't automatically be considered a cost center, but rather a significant driver of revenue. It also means that future capital expenditure costs have some real context, so even non-technical folks will understand the value technology investment brings.
End of preview. The rest of this section is not available here.
Buying Stuff: Procurement Is a Process
After you've completed all your measurements, made snap judgments about usage, and sketched out future predictions, you'll need to actually buy things: bandwidth, storage appliances, servers, maybe even instances of virtual servers. In each case, you'll need to explain to the people with the checkbooks why you need what you think you need, and why you need it when you think you need it. (We'll talk more about predicting the future and presenting those findings later in the book.)
Procurement is a process, and should be treated as yet another part of capacity planning. Whether it's a call to a hosting provider to bring new capacity online, a request for quotes from a vendor, or a trip to your local computer store, you need to take this important segment of time into account.
Smaller companies, while usually a lot less "liquid" than their larger brethren, can really shine in this arena. Being small often goes hand-in-hand with being nimble. So while you might not be offered the same price on equipment as the big companies that buy in massive bulk, you'll likely be able to get it faster, owing to a less cumbersome approval process.
Quite often the person you might need to persuade is the CFO, who sits across the hall from you. In the early days of Flickr, we used to be able to get quotes from a vendor and simply walk over to the founder of the company (seated 20 feet away), who could cut and send a check. The servers would arrive in about a week, and we'd rack them in the data center the day they came out of the box. Easy!
Yahoo! has a more involved cycle of vetting hardware requests that includes obtaining many levels of approval and coordinating delivery to various data centers around the world. Purchases having been made, the local site operation teams in each data center then must assemble, rack, cable, and install operating systems on each of the boxes. This all takes more time than when we were a startup. Of course, the flip side is, with such a large company we can leverage buying power. By buying in bulk, we can afford a larger amount of hardware for a better price.
In either case, the concern is the same: the procurement process should be baked into your larger planning exercise. It takes time and effort, just like all the other steps. There is more about this later in the book.
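One simple way to bake procurement time into a plan is to work backward from the date you expect to run out of capacity. The sketch below assumes you already have such a forecast date; the function name and the lead times in the example are hypothetical.

```python
from datetime import date, timedelta


def latest_order_date(capacity_exhausted_on: date,
                      procurement_days: int,
                      install_days: int) -> date:
    """Last day an order can be placed and still be in service in time.

    procurement_days covers approval, ordering, and delivery;
    install_days covers racking, cabling, and OS installation.
    """
    return capacity_exhausted_on - timedelta(days=procurement_days + install_days)


if __name__ == "__main__":
    forecast = date(2009, 3, 1)  # hypothetical date when capacity runs out
    # A nimble startup: about a week to deliver, racked the day it arrives.
    print("startup:", latest_order_date(forecast, procurement_days=7, install_days=1))
    # A larger company with a longer approval and delivery cycle (made-up figures).
    print("large company:", latest_order_date(forecast, procurement_days=45, install_days=10))
```

The point of the comparison is that the same forecast produces very different deadlines depending on how long procurement takes in your organization.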
End of preview. The rest of this section is not available here.
Performance and Capacity: Two Different Animals
The relationship between performance tuning and capacity planning is often misunderstood. While they affect each other, they have different goals. Performance tuning optimizes your existing system for better performance. Capacity planning determines what your system needs and when it needs it, using your current performance as a baseline.
Let's face it: tuning is fun, and it's addictive. But after you spend some time tweaking values, testing, and tweaking some more, it can become an endless hole, sucking away time and energy for little or no gain. There are those rare and beautiful times when you stumble upon some obvious and simple parameter that can make everything faster: you find the one MySQL configuration parameter that doubles the cache size, or realize after some testing that those TCP window sizes set in the kernel can really make a difference. Great! But as illustrated in the figure below, for each of those rare gems you discover, the number of obvious optimizations you find thereafter dwindles pretty rapidly.
Figure: Decreasing returns from performance tuning
Capacity planning must happen without regard to what you might optimize. The first real step in the process is to accept the system's current performance, in order to estimate what you'll need in the future. If at some point down the road you discover some tweak that brings about more resources, that's a bonus.
Here's a quick example of the difference between performance and capacity. Suppose there is a butcher in San Francisco who prepares the most delectable bacon in the state of California. Let's assume the butcher shop has an arrangement with a store in San Jose to sell their great bacon there. Every day, the butcher needs to transport the bacon from San Francisco to San Jose using some number of trucks—and the bacon has to get there within an hour. The butcher needs to determine what type of trucks, and how many of them he'll need to get the bacon to San Jose. The demand for the bacon in San Jose is increasing with time. It's hard having the best bacon in the state, but it's a good problem to have.
End of preview. The rest of this section is not available here.
The Effects of Social Websites and Open APIs
As more and more websites adopt Web 2.0 characteristics, web operations are becoming increasingly important, especially capacity management. If your site contains content generated by your users, utilization and growth aren't completely under the control of the site's creators; a large portion of that control is in the hands of the user community, as shown by my example in the Preface concerning the London subway bombing. This can be scary for people accustomed to building sites with very predictable growth patterns, because it means capacity is hard to predict and needs to be on the radar of all those invested, both the business and the technology staff. The challenge for development and operations staff of a social website is to stay ahead of the growing usage by collecting enough data from that upward spiral to drive informed planning for the future.
Providing web services via open APIs introduces another ball of wax altogether, as your application's data will be accessed by yet more applications, each with its own usage and growth patterns. It also means users have a convenient way to abuse the system, which puts more uncertainty into the capacity equation. API usage needs to be monitored to watch for emerging patterns, usage edge cases, and rogue application developers bent on crawling the entire database tree. Controls need to be in place to enforce the guidelines or
End of preview. The rest of this section is not available here.
Chapter 2: Setting Goals for Capacity
, you shouldn't begin planning for capacity before you determine your site's requirements. Capacity planning involves a lot of assumptions related to why you need the capacity. Some of those assumptions are obvious, others are not.
For example, if you don't know that you should be serving your pages in less than three seconds, you're going to have a tough time determining how many servers you'll need to satisfy that requirement. More important, it will be even tougher to determine how many servers you'll need to add as your traffic grows.
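A requirement like "pages in less than three seconds" only becomes useful once it can be checked against measurements. The sketch below is one way to do that, assuming you have a list of measured response times in seconds; judging the target at the 95th percentile (rather than the average) is an assumption made for this example, as are the sample values.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(samples: list, target_seconds: float = 3.0, pct: int = 95) -> bool:
    """True if the chosen percentile of response times is within the target."""
    return percentile(samples, pct) <= target_seconds


if __name__ == "__main__":
    # Hypothetical response times, in seconds, for ten page requests.
    times = [0.8, 1.2, 1.1, 2.9, 0.9, 3.5, 1.0, 1.4, 1.3, 2.0]
    print("95th percentile:", percentile(times, 95))
    print("meets 3-second target:", meets_target(times))
```

With an explicit check like this in place, the question "how many servers do we need?" has a defined pass or fail condition to plan against.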
Common sense, right? Yes, but it's amazing how many organizations don't take the time to assemble a rudimentary list of operational requirements. Waiting until users complain about slow responses or time-outs isn't a good strategy.
Establishing the acceptable speed or reliability of each part of your site can be a considerable undertaking, but it will pay off when you're planning for growth and need to know what standard you should maintain. This chapter shows you how to understand the different types of requirements your management and customers will force you to deal with, and how architectural design helps you with this planning.
End of preview. The rest of this section is not available here.
Different Kinds of Requirements and Measurements
Now that we're talking about requirements—which might be set by others, external to your group—we can look at the different types you'll need to deal with. Your managers, your end-users, and your clients running websites with you, all have varying objectives and measure success in different ways. Ultimately, these requirements, or capacity goals, are interrelated and can be distilled into the following:
  • Performance
    — External service monitoring
    — Business requirements
    — User expectations
  • Capacity
    — System metrics
    — Resource ceilings
Your site should be available not only to your colleagues performing tests on your website from a facility down the road, but also to real visitors who may be located on other continents with slow connections. Some large companies choose to have site performance (and availability) constantly monitored by services such as Keynote (http://keynote.com) or Gomez (http://gomez.com). These commercial services deploy worldwide networks of machines that constantly ping your web pages to record the return time. Servers then keep track of all these metrics and build you a handy-dandy dashboard to evaluate how your site performance and uptime appear from many locations around the world. Because Keynote and Gomez are deemed "objective" third parties, those statistics can be used to enforce or guide Service Level Agreements (SLAs) arranged with partner companies or sites (we'll talk more about SLAs later). Keynote and Gomez can be considered enterprise-level services. There are also plenty of low-cost alternatives, including PingDom (http://pingdom.com), SiteUptime (http://siteuptime.com), and Alertra (http://alertra.com).
It's important to understand exactly what these services measure, and how to interpret the numbers they generate. Since most of them are networks of machines rather than people, it's essential to be aware of how those web pages are being requested. Some things to consider when you're looking at service monitoring systems include:
  • Are they simulating human users?
  • Are they caching objects like a normal web browser would? Why or why not?
End of preview. The rest of this section is not available here.
Architecture Decisions
Your architecture is the basic layout of how all of the backend pieces—both hardware and software—are joined. Its design plays a crucial role in your ability to plan and manage capacity. Designing the architecture can be a complex undertaking, but there are a couple of great books available to help you: Cal Henderson's Building Scalable Web Sites (O'Reilly) and Theo Schlossnagle's Scalable Internet Architectures (Pearson).
Your architecture affects nearly every part of performance, reliability, and management. Establishing good architecture almost always translates to easier effort when planning for capacity.
Both for measurement purposes and for rapid response to changing conditions, you want your architecture to be designed so you can easily split it into parts that perform discrete tasks. In an ideal world, each component of the backend should have a single job to do, but it could still do multiple jobs well, if needed. At the same time, its effectiveness on each job should be easy to measure.
For instance, let's look at a simple, database-driven web application just starting on its path toward world domination. To get the most bang for our buck, we have our web server and our database residing on the same hardware server. This means all the moving parts share the same hardware resources, as shown in the figure below.
Figure: A simple, single-server web application architecture
Let's suppose you've already read the chapter on measurement (cheating, are we?) and you have configured measurements for both system and application-level statistics for your server. You can measure the system statistics of this server via sar or rrdtool, and maybe even application-level measurements such as web resource requests or database queries-per-second.
The difficulty with this setup is that you can't easily distinguish which system statistics correspond with the different pieces of the architecture. Therefore, you can't answer basic questions that are likely to arise, such as:
  • Is the disk utilization the result of the web server sending out a lot of static content from the disk, or rather, the database's queries being disk-bound?
End of preview. The rest of this section is not available here.
Chapter 3: Measurement: Units of Capacity
.
George Bernard Shaw
capacity planning—you'll only be guessing. Fortunately, a seemingly endless range of tools is available for measuring computer performance and usage. I'm willing to bet that moments after the first computer program was written, another one was written to measure and record how fast the first one performed.
Most operating systems come with some basic built-in utilities that can measure various performance and consumption metrics. Most of these utilities usually provide a way to record results as well. Additional popular open source tools are easy to download and run on virtually any modern system. For capacity planning, your measurement tools should provide, at minimum, an easy way to:
  • Record and store data over time
  • Build custom metrics
  • Compare metrics from various sources
  • Import and export metrics
As long as you choose tools that can in some way satisfy these criteria, you don't need to spend much time pondering which to use. What is more important is which metrics you choose to measure, and which of them you pay particular attention to.
In this chapter, I'll discuss the specific statistics you'll want to measure for different purposes, and show the results in graphs to help you better interpret them. There are plenty of other sources of information on how to set up particular tools to generate the measurements; most professional system administrators already have such tools installed.
End of preview. The rest of this section is not available here.
Aspects of Capacity Tracking Tools
This chapter is about automatically and routinely measuring server behavior over a predefined amount of time. By monitoring normal behavior over days, weeks, and months, you'll be able to see both patterns that recur regularly, and trends over time that help you predict when you need to increase capacity.
We'll also discuss deliberately increasing the load through artificial scaling using methods that closely simulate what will happen to your site in the future. This will also help you predict the need to increase capacity.
For the tasks in this chapter, you need tools that collect, store, and display (usually on a graph) metrics over time. They can be used to drive capacity predictions as well as problem resolution.
Examples of these tools include:
The tools don't need to be fancy. In fact, for some metrics, I still simply load them into Excel and plot them there. A more comprehensive list of capacity planning tools appears elsewhere in the book.
It's important to start out by understanding the types of monitoring to which this chapter refers. Companies in the web operations field use the term monitoring to describe all sorts of operations—generating alerts concerning system availability, data collection and its analysis, real-world and artificial end user interaction measurement—the list goes on and on. Quite often this causes confusion. I suspect many commercial vendors who align on any one of those areas exploit this confusion to further their own goals, much to our detriment as end users.
This chapter is not concerned with system availability, the health of your servers, or notification management—the sorts of activities offered by Nagios, Zenoss, OpenNMS, and other popular network monitoring systems. Some of these tools do offer some of the features we need for our monitoring purposes, such as the ability to display and store metrics. But they exist mostly to help you recognize urgent problems and avoid imminent disasters. For the most part, they function a lot like extremely complex alarm clocks and smoke detectors.
End of preview. The rest of this section is not available here.
Applications of Monitoring
The remainder of this chapter uses examples to demonstrate some of the important monitoring techniques you need to know and perform.
As mentioned earlier, server statistics paint only a part of the capacity picture. You should also measure and record higher-level metrics specific to your application—not specific to one server, but to the whole system. CPU and server disk usage on a web server doesn't tell the whole tale of what's happening to each web request, and a stream of web requests can involve multiple pieces of hardware.
At Flickr, we have a dashboard that collects these application-level metrics. They are collected on both a daily and cumulative basis. Some of the metrics can be drawn from a database, such as the number of photos uploaded. Others can come from aggregating some of the server statistics, such as total disk space consumed across disparate machines. Data collection techniques can be as simple as running a script from a cron job and putting results into its own database for future mining.
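The "script from a cron job" approach mentioned above can be as small as the following sketch. It is not Flickr's implementation; it assumes a SQLite file as the metrics store, and the table layout, function names, and metric name are all made up for illustration.

```python
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if necessary create) a tiny metrics database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "ts INTEGER NOT NULL, name TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, name: str, value: float, ts: int = None) -> None:
    """Store one data point; ts defaults to the current Unix time."""
    when = int(time.time()) if ts is None else ts
    conn.execute("INSERT INTO metrics (ts, name, value) VALUES (?, ?, ?)", (when, name, value))
    conn.commit()


def history(conn: sqlite3.Connection, name: str) -> list:
    """Return all (timestamp, value) pairs for a metric, oldest first."""
    rows = conn.execute(
        "SELECT ts, value FROM metrics WHERE name = ? ORDER BY ts", (name,)
    )
    return rows.fetchall()


if __name__ == "__main__":
    store = open_store()  # in a cron job this would be a file path, not ":memory:"
    record(store, "photos_uploaded_daily", 1200)
    print(history(store, "photos_uploaded_daily"))
```

Run from cron once per interval with a real file path, a script like this accumulates the history that later trend analysis depends on.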
Some of the metrics currently tracked at Flickr are:
  • Photos uploaded (daily, cumulative)
  • Photos uploaded per hour
  • Average photo size (daily, cumulative)
  • Processing time to segregate photos based on their different sizes (hourly)
  • User registrations (daily, cumulative)
  • Pro account signups (daily, cumulative)
  • Number of photos tagged (daily, cumulative)
  • API traffic (API keys in use, requests made per second, per key)
  • Number of unique tags (daily, cumulative)
  • Number of geotagged photos (daily, cumulative)
We also track certain financial metrics, such as payments received (which lie outside the scope of this book). For your particular application, a good exercise would be to spend some time correlating business and financial data to the system and application metrics you're tracking.
For example, a Total Cost of Ownership (TCO) calculation would be incomplete without some indication of how much these system and application metrics cost the business. Imagine being able to correlate the real costs to serve a single web page with your application. Having these calculations would not only put the architecture into a different context from web operations (business metrics instead of availability, or performance metrics), but they can also provide context for the more finance-obsessed, non-technical upper management who might have access to these tools.
End of preview. The rest of this section is not available here.
API Usage and Its Effect on Capacity
As more and more websites use open APIs to open up their services to external developers, capacity planning for the use of those services must follow.
You may have guessed by now I'm a strong advocate of application-level metrics as well as system metrics, and API usage is the area where application-level metrics can matter the most. When you allow others access to your data via an open API, you're essentially allowing a much more focused and routine use of your website.
One of the advantages of having an open API is it allows for more efficient use of your application. If external developers wanted to gain access to your data and no API methods existed, they might screen scrape your site's pages to get at the data, which is extremely inefficient for a number of reasons. If they're only interested in a specific piece of data on the page, they'd still have to request the entire page and everything that entails, such as downloading CSS markup, JavaScript, and other components necessary for a client's browser to render the page, but of no interest to the developer. Although APIs allow more efficient use of your application, if not tracked properly, they also expose your web service to potential abuse, as they enable other applications to ask for those specific pieces of data.
Having some way to measure and record the usage of your open API on a per-user, or per-request-method basis, should be considered mandatory in capacity tracking on a site offering a web API. This is commonly done through the use of unique API keys, or other unique credentials. Upon each call to the API, the key identifies the application and the developer responsible for building the application.
Because it's much easier to issue an enormous volume of calls to an API than to use a regular client browser, you should keep track of what API calls are being made by what application, and at what rate.
At Flickr, we automatically invalidate any key that appears to be abusing the API, according to provisions outlined in the Terms of Service. Every hour, we maintain a running total for every API key that makes a call: how many calls were made, and the details of each call. See the figure "API key details and history" for the basic idea of API call metrics.
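A per-key hourly tally of this kind can be sketched as follows. This is an illustrative in-memory version only: the class name, the hourly limit, and the key names are hypothetical, and the text does not say what thresholds or storage Flickr actually uses.

```python
from collections import Counter


class ApiUsageTracker:
    """Counts API calls per key per clock hour and flags keys above a limit."""

    def __init__(self, hourly_limit: int):
        self.hourly_limit = hourly_limit
        self.calls = Counter()  # (api_key, hour_bucket) -> number of calls

    @staticmethod
    def _hour_bucket(timestamp: float) -> int:
        # Unix timestamps divided into whole hours.
        return int(timestamp // 3600)

    def record_call(self, api_key: str, timestamp: float) -> None:
        self.calls[(api_key, self._hour_bucket(timestamp))] += 1

    def keys_over_limit(self, timestamp: float) -> list:
        """Keys whose call count in the hour containing timestamp exceeds the limit."""
        bucket = self._hour_bucket(timestamp)
        return sorted(
            key
            for (key, hour), count in self.calls.items()
            if hour == bucket and count > self.hourly_limit
        )


if __name__ == "__main__":
    tracker = ApiUsageTracker(hourly_limit=3)  # deliberately tiny limit for the demo
    for second in range(5):
        tracker.record_call("key-a", 7200 + second)
    tracker.record_call("key-b", 7200)
    print(tracker.keys_over_limit(7300))
```

A production version would persist the counts and record per-method details as well, but the hourly bucket per key is the core of the bookkeeping.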
End of preview. The rest of this section is not available here.
Examples and Reality
Will your web servers, databases, storage, and caches exhibit the same behavior as these? It's almost guaranteed they won't because each application and type of data affects system resources differently. The examples in this chapter simply illustrate the methods and thought processes by which you can investigate and form a better understanding of how increased load can affect each part of your infrastructure.
The important lesson to retain is each segment of your architecture will spend system resources to serve your website, and you should make sure you're measuring those resources appropriately. However, recording the right measurements isn't enough. You need to have some idea of when those resources will run out, and that's why you periodically need to probe to establish those ceilings.
Running through the exercise of finding your architecture's upper limits can reveal bottlenecks you didn't even know existed. As a result, you might make changes to your application, your hardware, your network, or any other component responsible for the problem. Every time you make a change to your architecture, you'll need to check ceilings again, because they're likely to change. This shouldn't be a surprise, because by now you know that capacity planning is a process, not a one-time event.
Figure: API key details and history
End of preview. The rest of this section is not available here.
Summary
Measurement is a necessity, not an option. It should be viewed as the eyes and ears of your infrastructure. It can inform all parts of your organization: finance, customer care, engineering, and product management.
Capacity planning can't exist without the measurement and history of your system and application-level metrics. Planning is also ineffective without knowing your system's upper performance boundaries so you can avoid approaching them. Finding the ceilings of each part of your architecture involves the same process:
  1. Measure and record the server's primary function.
    Examples: Apache hits, database queries
  2. Measure and record the server's fundamental hardware resources.
    Examples: CPU, memory, disk, network usage
  3. Determine how the server's primary function relates to its hardware resources.
    Examples: n database queries result in m percent CPU usage
  4. Find the maximum acceptable resource usage (or ceiling) based on both the server's primary function and hardware resources by one of the following:
    • Artificially (and carefully) increasing real production load on the server through manipulated load balancing or application techniques.
    • Simulating a real-world production load as closely as possible.
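Steps 3 and 4 can be sketched with a simple least-squares fit: relate the primary function (queries per second) to a hardware resource (CPU percentage), then solve for the load at which the resource reaches the maximum you consider acceptable. The sample measurements and the 80 percent CPU limit below are invented for illustration, and a straight line is only an assumption; real servers often stop behaving linearly as they approach their limits.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimated_ceiling(primary, resource, max_resource):
    """Primary-function rate at which the fitted resource usage hits max_resource."""
    slope, intercept = fit_line(primary, resource)
    return (max_resource - intercept) / slope


# Hypothetical measurements: database queries per second vs. CPU usage.
queries_per_sec = [100, 200, 300, 400]
cpu_percent = [12.0, 22.0, 32.0, 42.0]

# Assumed policy: we do not want the database above 80% CPU.
ceiling_qps = estimated_ceiling(queries_per_sec, cpu_percent, max_resource=80.0)
```

The extrapolated figure is only a starting estimate; the load-increase or simulation techniques in step 4 are what confirm where the ceiling actually is.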
End of preview. The rest of this section is not available here.
Chapter 4: Predicting Trends
Riding Your Waves
A good capacity plan depends on knowing your needs for your most important resources, and how those needs change over time. Once you have gathered historical data on capacity, you can begin analyzing it with an eye toward recognizing any trends and recurring patterns.
For example, in the last chapter I recounted how at Flickr, we discovered Sunday has been historically the highest photo upload day of the week. This is interesting for many reasons. It may also lead us to other questions: has that Sunday peak changed over time, and if so, how has it changed with respect to the other days of the week? Has the highest upload day always been Sunday? Does that change as we add new members residing on the other side of the International Date Line? Is Sunday still the highest upload day on holiday weekends? These questions can all be answered once you have the data, and the answers in turn could provide a wealth of insight with respect to planning new feature launches, operational outages, or maintenance windows.
Recognizing trends is valuable for many reasons, not just for capacity planning. When we looked at disk space consumption earlier, we stumbled upon some weekly upload patterns. Being aware of any recurring patterns can be invaluable when making decisions later on. Trends can also inform community management, customer care and support, product management, and finance. Some examples of how metrics measurement can be useful include:
  • Your operations group can avoid scheduling maintenance that could affect image processing machines on a Sunday, opting for a Friday instead, to minimize any adverse effects on users.
  • If you deploy any new code that touches the upload processing infrastructure, you might want to pay particular attention the following Sunday to see whether everything is holding up well when the system experiences its highest load.
  • Making customer support aware of these peak patterns allows them to gauge the effect of any user feedback regarding uploads.
  • Product management might want to launch new features based on the low or high traffic periods of the day.
A good practice is to make sure everyone on your team knows where these metrics are located and what they mean.
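Spotting a weekly pattern such as the Sunday upload peak amounts to grouping daily counts by day of the week. The sketch below uses made-up upload counts for two weeks in August 2005; the function name and the numbers are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_by_weekday(daily_counts):
    """Map weekday name -> mean count, given {date: count}."""
    totals = defaultdict(int)
    days_seen = defaultdict(int)
    for day, count in daily_counts.items():
        name = WEEKDAYS[day.weekday()]
        totals[name] += count
        days_seen[name] += 1
    return {name: totals[name] / days_seen[name] for name in totals}


# Hypothetical daily upload counts (in thousands), starting Monday 2005-08-01.
counts = [100, 110, 105, 120, 130, 150, 200,
          102, 108, 107, 118, 135, 155, 210]
start = date(2005, 8, 1)
uploads = {start + timedelta(days=i): c for i, c in enumerate(counts)}

averages = average_by_weekday(uploads)
busiest_day = max(averages, key=averages.get)
```

Running the same grouping over different date ranges (holiday weekends, or before and after an international launch) is one way to answer the follow-up questions posed above.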
End of preview. The rest of this section is not available here.
Procurement
As we've demonstrated, with our resource ceilings pinpointed, we can predict when we'll need more of a particular resource. When we complete the task of predicting when we'll need more, we can use that timeline to gauge when to trigger the procurement process.
Your procurement pipeline is the process by which you obtain new capacity. Its length is usually the time it takes to justify, order, purchase, install, test, and deploy any new capacity. The figure "Typical procurement pipeline" illustrates it.
The tasks outlined in that figure vary from one organization to another. In some large organizations, it can take a long time to gain approvals to buy hardware, but delivery can happen quickly. In a startup, approvals may come quickly, but the installation likely proceeds more slowly. Each situation will be different, but the challenge will remain the same: estimate how long the entire process will take, and add some amount of comfortable buffer to account for unforeseen problems. Once you have an idea of what that buffer timeline is, you can then work backward to plan capacity.
Figure : Typical procurement pipeline
In our disk storage consumption example, we have current data on our disk consumption up to 8/15/05, and we estimate we'll run out of space on 8/30/05. That leaves about two weeks to justify, order, receive, install, and deploy new storage. If we miss that deadline, we'll run out of space and be forced to trim consumption in some way. Ideally, two weeks will be long enough to bring new capacity online.
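Working backward from the predicted exhaustion date is plain date arithmetic. In the sketch below, the per-stage durations and the three-day buffer are invented for illustration; only the 8/15/05 and 8/30/05 dates come from the example above.

```python
from datetime import date, timedelta

# Hypothetical duration, in days, of each procurement stage.
PIPELINE_DAYS = {
    "justify": 2,
    "order": 1,
    "delivery": 5,
    "install": 2,
    "test_and_deploy": 2,
}


def order_deadline(exhaustion_date, pipeline_days, buffer_days):
    """Latest date to start procurement and still deploy before exhaustion."""
    lead_time = sum(pipeline_days.values()) + buffer_days
    return exhaustion_date - timedelta(days=lead_time)


today = date(2005, 8, 15)
runs_out = date(2005, 8, 30)
deadline = order_deadline(runs_out, PIPELINE_DAYS, buffer_days=3)
slack_days = (deadline - today).days  # negative means we are already late
```

With these assumed durations the deadline falls on the day the forecast was made, which is the situation the text describes: there is no slack left to absorb a late delivery.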
Obviously, the when of ordering equipment is just as important as the what and how much. The procurement timelines outlined above hint at how critical it is to keep your eye on how long it will take to get what you need into production. Sometimes external influences, such as vendor delivery times and physical installation at the data center, can ruin what started out as a perfectly timed integration of new capacity.
Startups routinely order servers purely out of the fear they'll be needed. Most newly launched companies have developers to work on the product and don't need to waste money on operations-focused engineers. The developers writing the code are most likely the same people setting up network switches, managing user accounts, installing software, and wearing whatever other hats are necessary to get their company rolling. The last thing they want to worry about is running out of servers when they launch their new, awesome website. Ordering more servers as needed can be rightly justified in these cases, because the hardware costs are more than offset by the costs of preparing a more streamlined and detailed capacity plan.
End of preview. The rest of this section is not available here.
The Effects of Increasing Capacity
All of the segments within your infrastructure interact in various ways. Clients make requests to the web servers, which in turn make requests to databases, caching servers, storage, and all sorts of miscellaneous components. Layers of infrastructure work together to respond to users by providing web pages, pieces of web pages, or confirmations that they've performed some action, such as uploading a photo.
When one or more of those layers encounters a bottleneck, you bring your attention to bear, figure out how much more capacity you need, and then deploy it. Depending on how bottlenecked that layer or cluster is, you may find you'll see second-order effects of that new deployment, and end up simply moving the traffic jam to yet another part of your architecture.
For example, let's assume your website involves a web server and a database. One of the ways organizations can help scale their application is to cache computationally expensive database results. Deploying something like memcached can allow you to do this. In a nutshell, it means for certain database queries you choose, you can consult an in-memory cache before hitting the database. This is done primarily for the dual purpose of speeding up the query and reducing load on the database server for results that are frequently returned.
The most noticeable benefit is queries that used to take seconds to process might take as little as a few milliseconds, which means your web server will be able to send the response to the client more quickly. Ironically, there's a side effect to this; when users are not waiting for pages as long, they have a tendency to click on links faster, causing more load on the web servers. It's not uncommon to see memcached deployments turn into web server capacity issues rather quickly.
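The "consult the cache before hitting the database" pattern can be sketched as follows. To stay self-contained, the example uses a plain dictionary with expiry times as a stand-in for memcached, and `slow_query` is a hypothetical placeholder for an expensive database call; none of these names come from memcached's own client libraries.

```python
import time


class CacheAside:
    """Check an in-memory cache first; fall back to the database on a miss."""

    def __init__(self, query_fn, ttl_seconds=60.0, clock=time.monotonic):
        self._query_fn = query_fn  # the expensive database lookup
        self._ttl = ttl_seconds    # how long a cached result stays valid
        self._clock = clock
        self._store = {}           # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: query the database and remember the result.
        self.misses += 1
        value = self._query_fn(key)
        self._store[key] = (now + self._ttl, value)
        return value


db_calls = []


def slow_query(user_id):
    # Hypothetical stand-in for a computationally expensive database query.
    db_calls.append(user_id)
    return {"user_id": user_id, "photo_count": 42}


cache = CacheAside(slow_query, ttl_seconds=300.0)
first = cache.get(7)
second = cache.get(7)
```

The hit and miss counters are worth graphing alongside web server load: a rising hit rate is exactly the change that can shift the bottleneck from the database to the web tier.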
End of preview. The rest of this section is not available here.
Long-Term Trends
Now you know how to apply the statistics you have collected to immediate needs. But you may also want to view your site from a more global perspective, both in the literal sense (as your site becomes popular internationally) and in a figurative sense, as you look at the issues surrounding the product and the site's strategy.
As mentioned earlier, getting to know the peaks and valleys of your various resources and application usage is paramount to predicting the future. As you gain more and more history with your metrics, you may be able to perceive more subtle trends that will inform your long-term decisions.
For example, let's take a look at the figure "Typical daily web server traffic pattern."
It shows a pretty typical U.S. daily traffic pattern. The load rises slowly in the morning, East Coast time, as users begin browsing. These users go to lunch as West Coast users come online, keeping up the load, which finally drops off as people leave work. At this point, the load drops to only those users browsing overnight.
As your usage grows, you can expect this graph to grow vertically as more users visit your site during the same peaks and valleys. But if your audience grows more internationally, the bump you see every day will widen as the number of active user time zones increases. As seen in the figure "Daily traffic patterns grow wider with increasing international usage," you may even see distinct bumps after the U.S. drop-off if your site's popularity grows in a region further away than Europe.
That figure displays two daily traffic patterns, taken one year apart, and superimposed one on top of the other. What once was a smooth bump and decline has become a two-peak bump, due to the global effect of popularity.
Figure : Typical daily web server traffic pattern
Figure : Daily traffic patterns grow wider with increasing international usage
Of course, your product and marketing people are probably very aware of the demographics and geographic distribution of your audience, but tying this data to your system's resources can help you predict your capacity needs.
The widened traffic pattern also shows that your web servers must sustain their peak traffic for longer periods of time. This will indicate when you should schedule any maintenance windows to minimize the effect of downtime or degraded service to the users. Notice the ratio between your peak and your low period has changed as well. This will affect how many servers you can stand to lose to failure during those periods, which is effectively the ceiling of your cluster.
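How many servers you can stand to lose at a given time of day follows from the cluster size, the per-server ceiling, and the load at that time. The numbers below (20 servers, 500 requests per second each, and the peak and low loads) are hypothetical values chosen to illustrate the arithmetic.

```python
import math


def servers_needed(load, per_server_ceiling):
    """Minimum number of servers required to carry `load`."""
    return math.ceil(load / per_server_ceiling)


def servers_to_spare(cluster_size, load, per_server_ceiling):
    """Servers that could fail (or be taken out for maintenance) at this load."""
    return cluster_size - servers_needed(load, per_server_ceiling)


CLUSTER_SIZE = 20          # hypothetical web server count
PER_SERVER_CEILING = 500   # hypothetical ceiling, requests per second

spare_at_peak = servers_to_spare(CLUSTER_SIZE, 8000, PER_SERVER_CEILING)
spare_at_low = servers_to_spare(CLUSTER_SIZE, 3000, PER_SERVER_CEILING)
```

As the daily low rises with international traffic, the spare count during the quiet period shrinks, which narrows the window in which maintenance is safe.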
End of preview. The rest of this section is not available here.
Iteration and Calibration
Producing forecasts by curve-fitting your system and application data isn't the end of your capacity planning. In order to make it accurate, you need to revisit your plan, re-fit the data, and adjust accordingly.
Ideally, you should have periodic reviews of your forecasts. You should check how your capacity is doing against your predictions on a weekly, or even daily, basis. If you know you're nearing capacity on one of your resources and are awaiting delivery of new hardware, you might keep a much closer eye on it. The important thing to remember is your plan is going to be accurate only if you consistently re-examine your trends and question your past predictions.
As an example, we can revisit our simple storage consumption data. We made a forecast based on data we gleaned for a 15-day period, from 7/26/05 to 8/09/05. We also discovered that on 8/30/2005 (roughly two weeks later), we expected to run out of space if we didn't deploy more storage. More accurately, we were slated to reach 20,446.81 GB of space, which would have all but exhausted our total available space of 20,480 GB.
How accurate was that prediction? The figure "Disk consumption: predicted trend versus actual" shows what actually happened.
Figure : Disk consumption: predicted trend versus actual
As it turned out, we had a little more time than we thought—about four days more. We made a guess based on the trend at the time, which ended up being inaccurate but at least in favor of allowing more time to integrate new capacity. Sometimes, forecasts can either widen the window of time (as in this case) or tighten that window.
This is why the process of revisiting your forecasts is critical; it's the only way to adjust your capacity plan over time. Every time you update your capacity plan, you should go back and evaluate how your previous forecasts fared.
Since your curve-fitting and trending results tend to improve as you add more data points, you should have a moving window with which you make your forecasts. The width of that forecasting window will vary depending on how long your procurement process takes.
For example, if you know that it's going to take three months on average to order, install, and deploy capacity, then you'd want your forecast goal to be three months out, each time. As the months pass, you'll want to add the influence of most recent events to your past data and recalculate your predictions, as is illustrated in .
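A moving forecast window can be sketched by refitting a trend line to only the most recent measurements and projecting it forward by the procurement lead time. The usage series, the five-point window, and the 30-day horizon below are assumptions for illustration, and a straight-line fit stands in for whatever curve actually suits your data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def windowed_forecast(days, usage, window, horizon):
    """Fit the last `window` points and project `horizon` days past the last one."""
    recent_days = days[-window:]
    recent_usage = usage[-window:]
    slope, intercept = fit_line(recent_days, recent_usage)
    return slope * (recent_days[-1] + horizon) + intercept


# Hypothetical usage (GB) whose growth rate doubles halfway through.
days = list(range(10))
usage = [100, 110, 120, 130, 140, 160, 180, 200, 220, 240]

recent_view = windowed_forecast(days, usage, window=5, horizon=30)
full_history_view = windowed_forecast(days, usage, window=len(days), horizon=30)
```

In this made-up series the recent window picks up the faster growth that the full history averages away. Which window is appropriate is a judgment call; a short window reacts quickly to change but is also more easily misled by noise.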
End of preview. The rest of this section is not available here.
Summary
Predicting capacity is an ongoing process that requires as much intuition as it does math to help you make accurate forecasts. Even simple web applications need to be attended, and some of this crystal ball work can be tedious. Automating as much of the process as you can will help you stay ahead of the procurement process. Taking the time to connect your metric collection systems to trending software, such as cfityk, will prove to be invaluable as you develop a capacity plan that is easily adaptable. Ideally, you'll want some sort of a capacity dashboard that can be referred to at any point in time to inform purchasing, development, and operational decisions.
The overall process in making capacity forecasts is pretty simple:
  1. Determine, measure, and graph your defining metric for each of your resources.
    Example: disk consumption
  2. Apply the constraints you have for those resources.
    Example: total available disk space
  3. Use trending analysis (curve fitting) to illustrate when your usage will exceed your constraint.
    Example: find the day you'll run out of disk space
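The three steps above can be sketched end to end for disk consumption. The daily readings below are synthetic (a steady 40 GB per day starting at 19,000 GB) and are not the data from the book's example; only the 20,480 GB constraint and the 7/26/05 start date are borrowed from it. A straight-line fit is used as the simplest form of curve fitting.

```python
import math
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def exhaustion_date(start, usage_gb, total_gb):
    """First date on which the fitted trend reaches total_gb."""
    day_index = list(range(len(usage_gb)))
    slope, intercept = fit_line(day_index, usage_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking; no exhaustion predicted
    days_until_full = math.ceil((total_gb - intercept) / slope)
    return start + timedelta(days=days_until_full)


# Step 1: the defining metric (synthetic daily disk usage, in GB).
start = date(2005, 7, 26)
usage_gb = [19000 + 40 * i for i in range(15)]

# Step 2: the constraint (total available disk space, in GB).
TOTAL_GB = 20480

# Step 3: trend analysis to find the day the usage meets the constraint.
full_on = exhaustion_date(start, usage_gb, TOTAL_GB)
```

Feeding the result into a dashboard, and rerunning it as each new day of data arrives, is one way to automate the iteration the chapter recommends.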
End of preview. The rest of this section is not available here.
Chapter 5: Deployment
the hardware, you'll need to physically install it and deploy it into production.
Historically, deployment has been viewed as a headache. Installing the operating system and application software, making sure all of the right settings are in place, and loading your website's data—all these tedious steps must be done in order to integrate new hardware that's fresh out of the crate. Fortunately, the pain of repeating these steps over and over has inspired an entire category of software: automated installation and configuration tools.
End of preview. The rest of this section is not available here.
Automated Deployment Philosophies
Although various automatic installation and configuration tools differ in their implementation and execution, most of them share the same design philosophy. Just as with monitoring and metric-collection tools, many of these concepts and designs originated in the high-performance computing (HPC) field. Because HPC and web operations have similarities in their infrastructure, the web operations community has adopted many of these tools and approaches.
The time needed to acquire, install, and provision new hardware must be factored into your calculations as you determine when you're going to run out of capacity. If your capacity will be exhausted in six weeks, and it takes you three weeks to add new hardware, you only have three weeks of breathing room. Automated deployment and configuration minimizes the time spent on the phase of the process over which you have the most control—integrating machines onto your network and beginning operations.
When making changes to hosts, it's preferable to have a central location from which to push changes appropriate to the servers you're affecting. Having a central location provides a "control tower" from which to manage all aspects of your infrastructure. Unlike server architectures, in which distributed resources help with horizontal scaling, centralized configuration and management environments yield several advantages:
  • Version control can be used for all configurations: OS, application, or otherwise. RCS/CVS/Subversion and others are used to track the "who, what, when, and why" of each change to the infrastructure.
  • Replication and backup of installation and configuration files is easier to manage.
  • An aggregated configuration and management logging system is an ideal troubleshooting resource.
  • This centralized management environment makes an ideal place to keep hardware inventory, particularly if you want to have different configuration settings for different hardware.
This is not to suggest that your configuration, installation, monitoring, and management setup should be kept on a single server. Each of these deployment components demands specific resources. Growth over time would simply overwhelm a single machine, rendering it a potential single point of failure. Separate these components from the rest of your infrastructure. Monitoring and metric collection can reside on one server; configuration management and log aggregation on another. See for an example of a typical installation, configuration, and management architecture.
End of preview. The rest of this section is not available here.
Automated Installation Tools
Before you can even begin to worry about configuration management, you need to get your servers to a state in which they can be configured. You want a system that can automatically (and repetitively) install your OS of choice. Many such systems have been developed over the years, all employing similar techniques.
There are two basic approaches to the task of imaging new machines. Most OS vendors offer a package-based installer option, which performs the normal installation process in a non-interactive fashion. It provides the installer with a configuration file that specifies the packages to be installed. Examples include Solaris Jumpstart, Red Hat Kickstart, and Debian FAI.
Many third-party products take a disk-image approach. A gold client image is prepared on one machine and replicated byte-for-byte onto newly imaged hosts. Often, a single image is used for every server in the infrastructure, with hosts only differing in the services that are configured and running. SystemImager is a product that uses this approach.
Each method has advantages. Package-based systems provide accountability; every file installed is guaranteed to belong to a package, and package management tools make it easy to quickly see what's installed. You can get the same result with disk image systems by installing only packaged files. The temptation to muck about with the gold client filesystem directly can lead to confusion down the road.
On the other hand, image-based systems tend to be faster to install. The installer merely has to create a filesystem and dump the image onto it, rather than download many packages, calculate dependencies, and install them one by one. Some products, such as SystemImager, even support parallel installs to multiple clients by streaming the disk images via multicast or BitTorrent.
Most organizations aren't happy with the operating system their vendor installs. Default OS installs are notoriously inappropriate for production environments because they are designed to run on as many different hardware platforms as possible. They usually contain packages you don't need and typically are missing those that you do. As a result, most companies create custom OS images suitable for their specific needs.
End of preview. The rest of this section is not available here.
Automated Configuration
Now that your machines are up on the network, it's time to configure them to do their jobs. Configuration management systems help with this task in the following ways:
  • They let you organize your configuration files into useful subsystems, which you can combine in various ways to build production systems.
  • They put all the information about your running systems in one place, from which it can easily be backed up or replicated to another site.
  • They extract institutional knowledge out of your administrator's head and place it into a form that can be documented and reused.
A typical configuration management system consists of a server in which configurations are stored, and a client process, which runs on each host and requests a configuration from the server. In an infrastructure with automated deployment, the client is run as part of the initial install or immediately after the initial boot into the new OS.
After the initial configuration, a scheduled task on the client host periodically polls the server to see if any new configuration is available. Automating these checks ensures every machine in your infrastructure is always running the latest configuration.
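The client/server polling arrangement can be sketched as follows. Both classes are hypothetical in-memory stand-ins written for this illustration: they do not model any particular configuration management product, and a real client would fetch over the network and write files rather than hold text in memory.

```python
import hashlib


class ConfigServer:
    """In-memory stand-in for the central configuration store."""

    def __init__(self):
        self._configs = {}

    def publish(self, role, text):
        self._configs[role] = text

    def fetch(self, role):
        return self._configs[role]


class ConfigClient:
    """Runs on each host; applies a configuration only when it has changed."""

    def __init__(self, server, role):
        self.server = server
        self.role = role
        self.applied_text = None
        self.applied_digest = None

    def poll_once(self):
        """Fetch the role's configuration; return True if it was (re)applied."""
        text = self.server.fetch(self.role)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest == self.applied_digest:
            return False  # already running this configuration
        self.applied_text = text
        self.applied_digest = digest
        return True


server = ConfigServer()
server.publish("webserver", "MaxClients 150\n")
client = ConfigClient(server, role="webserver")

changed_on_install = client.poll_once()    # initial configuration
changed_on_recheck = client.poll_once()    # nothing new on the server
server.publish("webserver", "MaxClients 250\n")
changed_after_update = client.poll_once()  # picks up the new version
```

In practice `poll_once` would be invoked by the scheduled task mentioned above, so how quickly a change reaches every host depends on the polling interval.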
End of preview. The rest of this section is not available here.
Summary
Knowing how much hardware you need does little good if you can't get that hardware into service quickly. Automating your infrastructure with tools like configuration management and automated installation ensures your deployment processes are efficient and repeatable. Automation converts system administration tasks from one-off efforts into reusable building blocks.
End of preview. The rest of this section is not available here.
Appendix : Virtualization and Cloud Computing
most efficient manner, and to predict future needs based on the patterns of current use. For those well-defined workloads, you can get pretty close to utilizing most of the hardware resources for each class of server you have, such as databases, web servers, and storage devices. Unfortunately, web application workloads are rarely (if ever) perfectly aligned with the available hardware resources.
In those circumstances, you end up with inefficiencies in your capacity. For example, if you know your database's specific ceiling (limit) is determined by its memory or disk usage, but meanwhile it uses very little CPU, then there's no reason to buy servers with two quad-core CPUs. That resource (and investment) will simply be wasted unless you direct the server to work on other CPU-intensive tasks. Even buying a single CPU may be overkill. But often, that's all that's available, so you end up with idle resources.
It's the continual need to balance correct resources to workload demand that makes capacity planning so important, and in recent years some technologies and approaches have emerged that render this balance easier to manage, with ever-finer granularity.
Server virtualization and cloud computing are two such approaches, and it's worth exploring what they mean in the context of capacity planning.
End of preview. The rest of this section is not available here.
Virtualization
There are many definitions of virtualization. In general, virtualization is the abstraction of computing resources at various levels of a computer. Hardware, application, and operating system levels are some of the few places in which this abstraction can take place, but in the context of growing web operations, virtualization is generally used to describe OS abstraction, otherwise known as server virtualization.
Examples of this are the Xen virtual machine monitor and VMware's ESX Server, where a bottom-level OS functions with guest operating systems running on top of it. The bottom-level OS, known as the hypervisor, can be thought of as the brains of the virtualization. It allows the guest operating systems to share resources and easily be created, destroyed, or migrated to other hosts.
Entire books are written on the topic of virtualization. As it relates to capacity planning, virtualization allows for more granular control of how resources are used at the bare metal level. The figure "Virtual servers running on bare-metal hardware" illustrates this concept.
Figure : Virtual servers running on bare-metal hardware
The figure shows multiple guest operating systems running on the same server. There are many advantages to employing this abstraction:
Efficient use of resources
There's no reason to waste an entire server to run small tasks like corporate email. If there are spare CPU, memory, or disk resources, you can pile other services onto that server to make better use of it. Because of this, organizations use virtualization to consolidate many servers to run on a single piece of hardware.
Portability and fault tolerance
When a physical host is reaching known (or perhaps unknown) limits, or suffers a hardware failure, a guest OS (and its associated load) can be safely migrated to another host.
Development sandboxes
Because entire operating systems can be created and destroyed without harming the underlying host environment, virtualization is ideal for building multiple development environments that demand different operating systems, kernels, or system configurations. If there's a major bug that causes the entire test-bed to explode, no problem—it can be easily recreated.
End of preview. The rest of this section is not available here.
Cloud Computing
The concept of packaging up computing resources (computation, storage) into rentable units, like power and telephone utilities, isn't a new one. Virtualization technologies have spawned an entire industry of computing "utility" providers, who leverage the efficiencies inherent in virtualization to build what are known as clouds. Cloud service providers then make those resources available on a cost-per-usage basis via an API, or other means. Since cloud computing and storage essentially take some of the infrastructure deployment and management out of the hands of the developer, using cloud infrastructure can be an attractive alternative to running your own servers. But as with virtualization, you lose some of the ability to monitor and precisely measure your usage.
No matter how you look at it, your website needs computing and storage resources. Somewhere—whether in your direct and total control or not—a server will need to respond to client requests for data, and those requests may need some amount of computation and data retrieval.
Virtualization has been around almost as long as computing. At one time, computers were seen as equipment only managed by large financial, educational, or research institutions. Since computers were extremely expensive, IBM and other manufacturers built large-scale minicomputers and mainframes to handle processing for multiple users at once, utilizing many of the virtualization concepts still in use today. Users would be granted slices of computation time from mainframe machines, accessing them from thin, or dumb, terminals. Users submitted jobs whose computation contended for resources. The centralized system was managed via queues, virtual operating systems, and system accounting that governed resource allocation. All of the heavy lifting of computation was handled by the mainframe and its operators, and was largely invisible to the end users. The design of these systems was largely driven by security and reliability, so considerable effort was applied to containing user environments and data redundancy. illustrates the client-server structure in a mainframe environment.
End of preview. The rest of this section is not available here.
Summary
Deploying your site to cloud infrastructure can change how you view deploying capacity, and largely depends on how you intend to make efficient use of it. In the use cases above, we see both non-technical and technical considerations, as outlined in the lists that follow.
Non-technical considerations:
  • Legal concerns surrounding privacy, security, and ownership of data by a third-party.
  • Confidence in the availability and performance of cloud infrastructure.
  • The effect of SLAs (or lack thereof) in the context of pieces of your infrastructure.
  • Levels of comfort with a still-emerging technology platform.
Technical considerations:
  • Redesigning your own application to make the most efficient use of cloud resources. Designing architectures that avoid transfer costs when possible and deploying compute instances only when you need them are very common practices.
  • Not knowing where your data physically resides. This forces developers to think about their application (and the management of their application) at a higher level. Expecting that compute instances can stall, disappear, or migrate requires redundancy to be built in.
Regardless of how organizations decide to use cloud infrastructure, its effect on capacity planning can be significant. WordPress.com is paying more for its storage than it did prior to migrating its data storage, but is comfortable with that. SmugMug.com is paying less for Amazon S3 than it would if it were managing its own storage. Ultimately, there is no one-case-fits-all situation with respect to cloud infrastructure; each decision is dependent on the application and organization involved, just as with so many other technologies.
Clouds can shrink deployment timelines and provide more granular control over how you're using your capacity. These are facets of capacity management that we've discussed in the previous chapters, and should be applied to cloud infrastructure as well:
  • Put capacity measurement into place—both metric collection and event notification systems—to collect and record systems and application statistics.
  • Discover the current limits of your resources (utilization on compute nodes, for example) and determine how close you are to those limits.
End of the content preview. The rest of this section cannot be viewed here.
Appendix : Dealing with Instantaneous Growth
unexpected incident—technological or otherwise—can wipe out all your future projections. There are no magic theories or formulas to banish your capacity woes in these situations, but you may be able to lessen the pain.
Besides catastrophes—like a tornado destroying your data center—the biggest problem you're likely to face is too much traffic. Ironically, becoming more popular than you can handle could be the worst web operations nightmare you've ever experienced. You might be fortunate enough to have a popular piece of content that is the target of links from all over the planet, or launch a new, killer feature that draws more attention than you ever planned. This can be as exciting as having your name in lights, but you might not feel so fortunate at the time it's all happening.
From a capacity point of view, not much can be done instantaneously. If your site is hosted on utility computing or virtualized infrastructure, it's possible to add capacity relatively quickly, depending on how it will be used, but this approach has limits. Adding servers can only solve the "I need more servers" problem. It can't solve the harder architectural problems that can pop up when you least expect them.
At Flickr, we have found that edge use cases arise (probably more often than routine capacity issues!) that tax the infrastructure in ways we hadn't expected. For example, some years ago we had a user who automated his webcam to take a photo of his backyard, upload it to Flickr, and tag it with the Unix timestamp every minute of every day. This made for interesting database side effects, since we weren't expecting to have that many unique tags for so many photos. We've also seen users with very few photos but many thousands of tags on each one. Each of these cases showed us where our limits were, as we were forced to adapt to it.


End of the content preview. The rest of this section cannot be viewed here.
Mitigating Failure
The following tips and tricks are for worst-case scenarios, when other options for increasing capacity are exhausted and substantially changing the infrastructure itself is impossible for the moment. Firefighting of this kind is largely what capacity planning aims to avoid, yet sometimes it's simply unavoidable.
The list isn't meant to be exhaustive. It covers just a few things that can help when the torrent of traffic comes and your servers are dying under load.
One contingency is to disable some of the site's heavier features. Building in the ability to turn certain features on or off gives capacity and operations staff a fast way to respond, even in the absence of a massive traffic event. A one-line configuration parameter in your application with values of on or off can be of enormous value, particularly when the feature is either the cause of a problem or contributing to unacceptable performance.
For example, the web servers at Flickr perform geographic (country) lookups based on client IP addresses for logged-out users in an effort to deduce their language preferences. It's an operation that enhances the user experience, but it is yet another function the application must handle. When we launched the localized Flickr in seven different languages, we had this feature turned on with the launch. It almost immediately placed too much load on the mechanisms that carried out the country lookups, so we simply turned it off until we could learn what the issue was. The problem turned out to be an artificial throttle on the rate of requests the geo server could handle, which had been tuned too conservatively. We isolated and fixed the issue by lifting the throttle to a more acceptable level, then turned the feature (which is mostly transparent) back on. Had we not implemented the quick on/off switch for that feature, and had it instead been hardcoded within the application, it would have taken more time to troubleshoot, disable, and fix. During this time the site would have been in a degraded state, or possibly even down.
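The on/off switch described above can be sketched as follows. This is a minimal illustration, not Flickr's implementation: the `FEATURE_FLAGS` dictionary, the flag name, the function names, and the stubbed lookup table are all hypothetical, and the IP address comes from a range reserved for documentation. It assumes the flag values would normally be read from a configuration file that can be changed without a code deploy.

```python
# Hypothetical flag store; in practice this might be loaded from a config file
# that operations staff can edit without deploying new code.
FEATURE_FLAGS = {"geo_language_lookup": True}

DEFAULT_LANGUAGE = "en"


def feature_enabled(name: str) -> bool:
    """Unknown flags count as off, so a typo never enables a feature."""
    return FEATURE_FLAGS.get(name, False)


def lookup_language_by_ip(ip_address: str) -> str:
    """Stand-in for an expensive call to a geo lookup service."""
    # Illustrative mapping only; a real lookup would query a geo service.
    return {"203.0.113.7": "fr"}.get(ip_address, DEFAULT_LANGUAGE)


def language_for_request(ip_address: str) -> str:
    """Pick a language, skipping the costly lookup when the flag is off."""
    if not feature_enabled("geo_language_lookup"):
        return DEFAULT_LANGUAGE
    return lookup_language_by_ip(ip_address)


print(language_for_request("203.0.113.7"))    # flag on: the lookup runs
FEATURE_FLAGS["geo_language_lookup"] = False  # the one-line switch
print(language_for_request("203.0.113.7"))    # flag off: the default is used
```

The point of the design is that the fallback path does no work against the overloaded service, so turning the flag off removes the load immediately while the site keeps serving pages in the default language.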
End of the content preview. The rest of this section cannot be viewed here.
Handling Outages
When failure comes knocking at your door (and sadly, it will at some point), there are a number of steps you can take to minimize the pain for users as well. Good customer service requires strong and effective communication between the operations and customer care groups, so users are promptly informed about site outages and problems, such as bugs, capacity, and performance issues. I thought I'd share some of the lessons we've learned when serving such a strong and vocal online community during emergencies or outages.
If your kitchen is flooded, but a plumber is underneath your sink, you at least have the feeling that someone has recognized the problem and is trying to resolve it. A good plumber will give you updates on the cause of the problem and what must be done to fix it.
Web applications are different: you can't see someone working on a problem, and users can sometimes feel left in the dark. Our experience at Flickr is that users are much more forgiving of problems when you keep them in the information loop. We have forums in which users can report bugs and issues, and a blog (hosted outside our own data center so it can't be affected by outages there) where we provide updates on what's going on if the site is down.
An entire book can be written on the topic of customer care for online communities. Unfortunately, this isn't that book. But from a web operations perspective, site outages can—and do—happen. How you handle them is just as important as how long it takes to get back up and running.
End of the content preview. The rest of this section cannot be viewed here.
Appendix : Capacity Tools
appendix, I've compiled a list of some of the more popular tools and utilities for your reference. We use a good many of these tools at Flickr, and some of them are simply open-source equivalents of software that has been written within Yahoo! to achieve the same goal.


End of the content preview. The rest of this section cannot be viewed here.
Monitoring
As we discussed earlier, there can be a lot of overlap between event notification software (tools that alert on resources based on thresholds) and metric collection and display tools. Some of the following tools have alerting abilities, some are more focused on graphing and collection, and some have both.
Ganglia, http://ganglia.info
Born out of the HPC community, Ganglia has a very active community of users and developers. We use Ganglia extensively at Flickr, as do Wikipedia and other large-scale social networking sites.
Nagios, http://nagios.org
We use a modified version of Nagios at Yahoo! to monitor services across thousands of machines.
Hyperic HQ, http://hyperic.com
GroundWork, http://www.groundworkopensource.com/
GroundWork is a hybrid of Nagios and Ganglia.
Reconnoiter, https://labs.omniti.com/trac/reconnoiter
Still in early development.
RRDTool, http://oss.oetiker.ch/rrdtool/
Mature graphing and metric storage tool.
Collectd, http://collectd.org/
Scalable system stats collection daemon. Uses multicast, like Ganglia.
Rrd2csv, http://www.opennms.org/index.php?title=Rrd2csv
RRD-to-CSV converter.
Dstat, http://dag.wieers.com/home-made/dstat/
System statistics tool, modular.
GraphClick, http://www.arizona-software.ch/graphclick/
Digitizer that constructs data from an image of a graph—handy when you have the image but not the raw data.
End of the content preview. The rest of this section cannot be viewed here.
Deployment Tools
SystemImager, http://wiki.systemimager.org/
SystemImager comes from the HPC community and is used to install thousand-node computer clusters. It is used by many large-scale web operations as well. Interesting work has been done to use BitTorrent as the transfer mechanism.
FAI, http://www.informatik.uni-koeln.de/fai
A Debian auto-installation tool with a healthy community.
Cobbler, http://cobbler.et.redhat.com
Cobbler is a relatively new project from Red Hat, supporting Red Hat, Fedora, and CentOS.
Puppet, http://reductivelabs.com/trac/puppet
Fast becoming a very popular configuration tool, Puppet has some very passionate developers and a very involved community of users. Written in Ruby.
Cfengine, http://www.cfengine.org/
Written in C, it's been around for many years and has a large installed base and active community.
Lcfg (Large-scale Unix configuration system), http://www.lcfg.org/
Capistrano, http://www.capify.org/
Written in Ruby, Capistrano is becoming popular in Rails environments.
Func, https://fedorahosted.org/func/
Func is the Fedora Unified Network Controller, and can replace ad-hoc cluster-wide ssh commands with an authenticated client/server architecture.
iClassify, https://wiki.hjksolutions.com/display/IC/Home
iClassify is a relatively new asset management system, which supports auto-registration and provides hooks for Puppet and Capistrano.
Fityk, http://www.unipress.waw.pl/fityk/
Excellent GUI and command-line curve fitting tool.
SciPy, http://www.scipy.org
Scientific and analysis tools and libraries for Python, includes some curve-fitting routines.
R, http://www.r-project.org
Statistical computing package, includes curve-fitting utilities.
Gunther, Neil. Guerrilla Capacity Planning (Springer, 2006)
Menascé, Daniel A. and Virgilio A. F. Almeida. Capacity Planning For Web Services: Metrics, Models, and Methods
End of the content preview. The rest of this section cannot be viewed here.
	
