Twice a year, Microsoft Chairman Bill Gates was known for going off on sequestered “Think Weeks” to read papers submitted by Microsoft employees with ideas for new products and technologies they believed Microsoft should be pursuing.
In the early Microsoft days, these papers were secret. But in the middle of this decade, Microsoft began sharing these Think Week papers publicly inside the company, allowing employees to comment on them and to see Gates’ and other key Microsoft executives’ comments on them.
One of these papers, shared with me by a source who requested anonymity, provided a good sense of some of the “cloud-computing” infrastructural issues with which Microsoft has been — and needs to be — grappling.
Because Microsoft is spending so much on building out its datacenters and staffing its online business in order to gird for the Web 2.0 and 3.0 battles, the issues described in this paper are especially interesting. And there are some hints about the still-under-wraps Microsoft CloudDB and Blue technologies that are also rather intriguing.
An Edge Computing Network for MSN and Windows Live Services
Author: Jason Zions
Microsoft’s online properties create and monetize rich content and innovative end-user experiences. To meet their business objectives while providing the quality of service needed to attract and delight an audience, they must overcome a variety of technical and operational challenges. Many of these challenges arise from the current architecture of the properties themselves and from limitations of the infrastructure which supports them.
The Edge Computing Network (ECN) extends Microsoft’s existing core network and data center infrastructure with intelligent computing nodes at the “edge” of the network cloud. This distributed computing network provides a set of network, computing, storage and management resources and services closer to end users. The ECN goes beyond traditional Content Distribution Networks to enable a wider range of application architectures that offer improvements to performance and robustness and reduction or elimination of some operational challenges.
This document identifies the challenges faced by Microsoft’s on-line properties (as well as some problems created by current implementations), lays out the vision for the Edge Computing Network, describes the progress already made towards achieving the vision, and works through two scenarios showing how the ECN could be exploited by a property.
Challenges Faced by On-Line Properties
Microsoft has roughly 150 on-line properties covering a tremendous range of scale, customer base, and functionality. Despite that huge variation, they each face a fairly consistent set of challenges that fall into a few broad categories.
Any internet-facing service has to deal with network latency, which degrades the usability of the service. The larger the latency, the worse the problem; the usability impact is multiplied by the number of round trips. Common web site development can result in tens or even hundreds of round-trips to display a fairly complex page; each separate graphical element gets retrieved independently. Various technologies have been created to deal with this. Content Distribution Networks (CDNs) like Akamai and Savvis sell distributed small-object caching services, relying on a global network of points-of-presence (POPs) housing web servers which serve static graphical elements. These POPs are located so as to have much lower latency (with respect to the end user’s browser) than that of the owner of the actual on-line service.
Similar problems arise for streaming video, although the problem is due more to packet loss and variation in packet delivery times than to pure latency. Streaming, as well as straight downloading of large files, raises issues of bandwidth provisioning; serving many downloads and streams from a single origin would require very large (and very expensive) egress from that origin to the Internet, and the Internet’s backbone itself isn’t growing in capacity as fast as Internet traffic overall is growing. CDNs sell services which solve those issues as well, serving large files and streams from many distributed POPs around the world.
The past few years have seen the rise of Distributed Denial of Service (DDoS) attacks, in which a large army of zombie attacking nodes attempt to overwhelm a single service. While usage of CDNs can protect static content from DDoS attacks, a single origin hosting dynamic content is still vulnerable.
Properties have to acquire and provision server hardware to host their services, and they must size their resources to match their expected peak load. Any time the load on a service is below the servable peak, money is being wasted, both in the form of underutilized capital equipment and in the form of power to run and cool unneeded servers. Worse yet, properties must continue to accurately predict their peak usage as they grow; insufficient capacity for peak load results in slow or interrupted service, reducing customer satisfaction and leading to defection of users to competing services.
Since properties acquire their own servers, they typically attempt to optimize their choice of equipment for their specific application. This means that, across all of Microsoft’s on-line properties, there is only limited commonality of hardware. Excess servers cannot be easily repurposed to other properties. Also, the cost of maintaining large numbers of servers scales linearly with the number of SKUs; we can’t fully leverage the sublinear scale of cost versus the total number of servers. Finally, OS deployment onto such a large variety of servers is quite complex, introducing still more cost as well as greater risk of misconfiguring some systems or missing a patch. Rigorous standardization of servers is of only limited help; specific hardware models eventually leave production and lose support from vendors, and new models must therefore be introduced.
Each property is responsible for its own monitoring, reporting, and logging systems. Most early-stage or small properties can’t afford to invest significant time or money in automation; these tasks are handled in a more expensive ad hoc manner. There are some projects under way to build common tools in this space (e.g. MAGIK), but these tools are designed for the needs of the largest properties (particularly Live Search and Hotmail) and are too inflexible to meet the varied needs perceived by the great majority of properties.
Many existing properties have centralized architectures that cannot be easily geo-distributed. As a result, there are limits to their growth based on the pace at which Microsoft can build sufficiently large datacenters to hold their infrastructure. These limited architectures made sense early in the life of the property when servers were hosted in only one datacenter; the simplifying assumptions made possible by that design are pervasive throughout the code, though, making the overall implementation inflexible. Some assumptions render the implementation highly fragile, e.g. those related to multicast architectures or round-trip time to access various elements of the property. For some properties, capacity growth can be achieved only by adding new racks of servers physically adjacent to already deployed servers. These kinds of constraints make consolidated management of datacenter space extremely difficult; smaller properties are often “bumped” from datacenter to datacenter, sometimes repeatedly, to allow larger, monolithic/fragile properties to expand.
CDN as Partial Solution
CDNs can be used to solve some, but not all, of the network challenges described above. Some properties use one or more services from various CDN providers to address the specific challenges they face. Very few properties leverage the full suite of available CDN services, and many properties use none at all. Various factors play into these choices.
CDN services are not inexpensive; Microsoft spent about $40 million on CDN services in FY06. Projections of future growth (based on expected growth in the number of properties, amount of traffic, and usage of CDN services) show this growing to more than $130 million in FY11.
CDNs can only be used to handle static content. While there is some limited capability for hosting application code remotely via a CDN (e.g. Hotmail’s “AATE” component on Savvis), there are significant drawbacks: cost is significant, capacity is limited, management and deployment tools are primitive, and Personally Identifiable Information (PII) cannot be used there.
All CDNs provide a “traffic manager” or “global load balancer” (GLB) service which directs user requests to the location most appropriate for serving the request. The GLB services provided by the various CDNs are limited in sophistication; because of their general-purpose nature, they cannot take into account application-specific quality-of-service needs, and they cannot route traffic based on Microsoft-specific business logic (e.g. from which locations can Microsoft serve this traffic without paying for bandwidth, based on our current peering relationships, time of day, link utilization, etc.).
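As a rough illustration of the kind of business-aware routing the paper says a general-purpose GLB cannot perform, consider this sketch. The rule set, latency bound, peering costs, and location names are all hypothetical, not an actual Microsoft traffic manager:

```python
# Hypothetical sketch: a traffic manager that picks a serving location by
# combining proximity with business rules a third-party GLB cannot see,
# such as peering cost and current link utilization.

def pick_location(locations, user_latency_ms, max_utilization=0.8):
    """Choose the cheapest location among those that are close enough to
    the user and not overloaded; fall back to the lowest-latency location."""
    candidates = [
        loc for loc in locations
        if loc["utilization"] < max_utilization
        and user_latency_ms[loc["name"]] <= 50  # QoS bound (assumed)
    ]
    if not candidates:
        return min(locations, key=lambda loc: user_latency_ms[loc["name"]])
    # Settlement-free peering (cost 0) beats paid transit.
    return min(candidates, key=lambda loc: loc["cost_per_gb"])

locations = [
    {"name": "sea", "utilization": 0.9, "cost_per_gb": 0.00},
    {"name": "lax", "utilization": 0.5, "cost_per_gb": 0.00},
    {"name": "dfw", "utilization": 0.4, "cost_per_gb": 0.02},
]
latency = {"sea": 12, "lax": 28, "dfw": 45}
best = pick_location(locations, latency)  # skips the overloaded "sea" node
```

The point of the sketch is simply that the selection function can consult inputs (utilization, peering cost) that Microsoft would not divulge to a CDN vendor.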
The logs created by CDNs are difficult to access for specific purposes. Raw billing data is expensive to retrieve and process. While Microsoft can meet its regulatory and forensic needs in cooperation with CDN vendors, the process is sometimes quite complex.
Edge Computing Network Vision
Partial solutions can only get us so far; Microsoft’s properties need an integrated and complete solution to their technical and business challenges. The Edge Computing Network is one such complete solution.
The ECN vision is based on the following four assertions:
1. Quality of Service (QoS) and Scalability are critical to the success of Microsoft’s online properties.
2. When delivering global online services/applications, centralized architectures are inadequate because they lead to poor QoS and do not scale well.
3. For delivering global online services, distributed architectures can achieve superior QoS and scalability.
4. Centralized technologies and the people who are familiar with them are plentiful whereas distributed technologies and the people who know them are few.
The Edge Computing Network will contribute to the success of Microsoft’s various online properties by enabling properties to make use of a comprehensive set of optimized and easy-to-use distributed computing services and resources. Properties can focus on delivering compelling end-user experiences and addressing their business priorities without having to make the daunting choice between spending unsustainable amounts of money to compensate for architectural limitations of the centralized approach or developing and operating a proprietary distributed computing network.
The Edge Computing Network comprises roughly 24 nodes distributed worldwide. Most Internet users will be no more than 20 msec roundtrip time away from at least one node. Each node provides traditional CDN services and also provides distributed computation and storage services capable of hosting elements of Microsoft’s own on-line properties. Each node would have egress capacity to reach Internet end-users in the region. The nodes would be connected to each other and to existing Microsoft datacenters by a network overlaid on the Internet and on private links leased or owned by Microsoft.
Based on the service needs of properties desiring access to the customers within a region, the ECN node serving that region would be provisioned so as to provide an appropriate subset (and capacity) of these services:
• Small object caching
• Large file downloads
• Media streaming
• Peer-to-peer file transfer
• Smart (business-rule and load aware) traffic management and load balancing
• Traffic and user analytic data
• Logging to support billing, regulatory compliance and forensic demands
• Monitoring and management of services
By providing these services in-house, Microsoft can extend and enhance these services beyond what is possible through external CDNs. Properties can more effectively use these services when developers have visibility into the details of their actual functioning. For example, we can do a much better job of distributing downloadable content around the world in response to sudden changes in demand patterns. Traffic management can take into account information which Microsoft is unwilling to divulge to third parties. Our control of the software we ship can permit us to build P2P distribution mechanisms so they improve performance for end users while reducing costs to Internet Service Providers and to Microsoft.
More importantly, Microsoft can control the exact “footprint” of our network, siting nodes in locations which are most important to our business and using our bargaining power and relationships to do so in the most cost-effective manner.
Many of these goals could be achieved in partnership with a third-party CDN; however, that CDN would be free to sell those same enhanced services to others, including our competitors. Our intellectual property would be used to the benefit of the very companies over which we seek to build competitive advantage.
Microsoft already has considerable intellectual property in this space; we hold a substantial portfolio of patents across these technologies. While we could implement all of these services from scratch, the most cost-effective way to get these capabilities deployed and working to our advantage is to acquire existing, functioning technology from one or more CDN providers.
Distributed Computation and Storage
Based on the service needs of properties desiring access to the customers within a region, the ECN node serving that region would be provisioned so as to provide an appropriate capacity of these services:
• Application containers
• BLOB storage (replicated and local)
• Transaction-oriented (database) storage (replicated and local)
• Distributed file system
• Logging to support billing and diagnostics
• Monitoring and management of services
An application container is an isolated environment for running application code. Containers come in a single “size”; that is, each container exposes to the code it hosts a specific amount of physical and virtual memory, a single processor of defined speed, and a maximum amount of internet egress. Property developers construct elements of their applications to fit those containers. Scale of service for a property is achieved through running code in the desired number of containers, rather than through increasing the capacity of a single container; that is, capacity is provided in a scale-out, rather than scale-up, fashion.
Each container is isolated from other containers running on a given physical system and from the host OS which runs directly on the physical system. Applications are unaware of the precise mechanism used to provide this abstraction or virtualization. The provisioning infrastructure is capable of starting or stopping instances of a particular computational element based on a variety of criteria:
• Scheduled by time-of-day or time-of-year
• On demand of operations staff
• Automatically in response to measured load on already-running instances
• Automatically in response to load on other nodes
Scheduled instance provisioning is the easiest to implement but still gives property owners solid tools to control quality of service versus the cost to provide that service. Financial properties (e.g. MS Money) would schedule increased capacity during end-of-month, end-of-quarter, and end-of-year periods as well as tax-preparation periods; this could easily be adjusted to match regional fiscal reporting requirements (e.g. April 1-15 in the US but April 16-30 in Canada). Operations staff could easily override these schedules to meet unusual demand.
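Such a schedule might be expressed as a simple lookup table; in this sketch the dates, instance counts, and function names are purely illustrative:

```python
# Hypothetical sketch of a scheduled-provisioning table for a financial
# property: capacity is raised during tax-preparation windows and falls
# back to a baseline otherwise. All numbers are illustrative.
from datetime import date

SCHEDULE = [
    # (start, end, container instances)
    (date(2007, 4, 1), date(2007, 4, 15), 40),   # US tax deadline period
    (date(2007, 4, 16), date(2007, 4, 30), 25),  # Canadian tax deadline period
]
BASELINE = 10

def instances_for(day):
    """Return the number of application containers to run on a given day."""
    for start, end, count in SCHEDULE:
        if start <= day <= end:
            return count
    return BASELINE
```

Operations staff overriding the schedule would amount to editing this table (or bypassing it) at run time.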
More interestingly, the provisioning infrastructure could monitor some property-defined load measure and dynamically adjust the number of instances of distributed elements of that property to keep the load within a specified range. Code would compute that load factor so that it accurately reflected the quality of service provided to the end user. Depending on the nature of the property, live code could be run on the end-user’s system that measured actual elapsed time for various operations rather than simply relying on brute-force measures of round-trip times; these real-time values would be combined with server queue lengths, memory pressure measures, etc. to help determine whether additional instances would improve subpar response times or to decide that the property is over-provisioned and can release one or more instances.
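A minimal sketch of such a provisioning loop follows. The blend of signals matches the paragraph above, but the weights, normalization constants, and thresholds are entirely illustrative assumptions:

```python
# Hypothetical sketch of the dynamic-provisioning loop: a property-defined
# load factor blends client-measured elapsed time with server-side signals,
# and the infrastructure adds or removes instances to keep that factor
# inside a target band.

def load_factor(client_elapsed_ms, queue_length, memory_pressure):
    """Blend end-user experience with server-side pressure.
    Roughly 1.0 means 'at target'; weights are illustrative."""
    return (0.5 * client_elapsed_ms / 100.0
            + 0.3 * queue_length / 10.0
            + 0.2 * memory_pressure)

def adjust_instances(current, factor, low=0.5, high=1.5):
    """Scale out when load is high; release an instance when over-provisioned."""
    if factor > high:
        return current + 1
    if factor < low and current > 1:
        return current - 1
    return current

# A node seeing 300 ms client-side operations and deep queues scales out:
n = adjust_instances(4, load_factor(300, 20, 0.7))
```

The key design point is that the load factor is computed by property code, not by the infrastructure, so it can reflect whatever quality-of-service definition that property cares about.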
Automating the balancing of service between ECN nodes is a key element in providing protection against DDoS attacks. By integrating this automated balancing with the routing capability provided by the Traffic Manager, the ECN will be capable of providing service at “nearby” nodes in the event a single node is under attack. The provision of additional servers and addresses for a service will dilute the impact of the attack; the fact that the additional addresses are more distant (network-wise) from the attackers will decrease the rate of the attack as well.
Storage in the ECN node comes in a variety of types reflecting the varying needs of applications. All storage usage would be limited by per-property quotas and tracked for billing. Application code (exe and dll files, etc.) needs to reside someplace visible to various application containers within the node; a distributed, replicated file store would address code distribution needs, and Microsoft has several such file systems either in release, in development, or in research.
Applications need to store chunks of data for a variety of uses; some are purely local to an instance, others are intended to be shared amongst all instances of the application running anywhere in the Edge Computing Network. An application would tell the blob storage infrastructure about the replication needs for each category of blob data: local to this instance, local to the node or a set of nodes, and/or replicated to a backing store in one of Microsoft’s large-scale datacenters. The blob store would be built on top of the Cheap File Store (used by Windows Live Spaces), Blue (used by Windows Live Mail), or some variation on those which accommodates the extended replication needs of ECN.
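The per-category replication declaration might look something like the following sketch. The scope names and class shape are assumptions for illustration, not the actual Cheap File Store or Blue interfaces:

```python
# Hypothetical sketch of a per-category blob replication policy: an
# application declares, for each category of blob data, how widely it
# must be replicated and whether it needs a datacenter backing store.
from enum import Enum

class Scope(Enum):
    INSTANCE = "instance"      # private to one running instance
    NODE = "node"              # shared across one ECN node
    NODE_SET = "node_set"      # replicated to a set of ECN nodes
    DATACENTER = "datacenter"  # backed by a large-scale datacenter store

class BlobPolicy:
    def __init__(self):
        self.categories = {}

    def declare(self, category, scope, backing_store=False):
        self.categories[category] = {"scope": scope, "backed": backing_store}

    def replicas_needed(self, category):
        """Where must a write to this category be propagated?"""
        info = self.categories[category]
        targets = [info["scope"].value]
        if info["backed"]:
            targets.append(Scope.DATACENTER.value)
        return targets

policy = BlobPolicy()
policy.declare("session-temp", Scope.INSTANCE)
policy.declare("match-archive", Scope.NODE_SET, backing_store=True)
```

The infrastructure, not the application, would then decide which physical stores satisfy each declared scope.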
Some applications need transactional storage; again, replication needs would vary by application. For properties needing distributed database semantics, a service like CloudDB would be provided. Properties needing a database purely local to a node would use SQL Server on top of local non-replicated storage; a property which needed off-node replication of data for business-continuity and backup could build such a mechanism using existing and forthcoming services provided by the major datacenters (e.g. Blue).
Overlap of CDN and Distributed Computation Infrastructure
While the features and implementation of the CDN and Distributed Computation infrastructures have been described separately, there’s no reason they need to remain separate in practice. Over time, the various systems which provide CDN functionality would become applications running within Application Containers in the ECN node. Storage previously dedicated to small-object caching, large file download, and streaming would be integrated into the blob storage infrastructure of the distributed computation environment. Logging and monitoring systems would converge.
No CDN provider today does things this way; none has a general-purpose distributed computation environment. The ability to dynamically reallocate resources to match needs across the spectrum of CDN and computation services is a unique advantage of the Edge Computing Network we envision.
Status of the ECN Program
An attempt was made to acquire a CDN vendor; changes in the economics and valuation of the various players in that market space drove the price of the acquisition outside the acceptable range. As a result, negotiations have taken place to license technology from one or more CDN vendors; this technology would serve to jumpstart the ECN implementation. The agreement will include consulting services and operational assistance for a period of time, enabling Microsoft to acquire a set of best practices from a successful CDN.
A list of 24 broad locations for ECN nodes has been roughed out; selection of the first three and second three specific sites is under way. Site selection is strongly influenced by the experience of the Windows Live Infrastructure team responsible for siting and managing Microsoft’s major data centers. The current plan has the first three sites built out by the end of FY07, the second three sites built by mid-FY08, and the remaining sites from the full list of 24 coming on line by the end of FY09.
Selection of technologies for the various components of the distributed computing environment is under way. Of particular interest is the technology used to support the application container itself. An appropriate balance needs to be struck between isolation and performance; the selected technology needs to provide enough management tools and hooks that beta-quality services can be provided in three sites at the end of FY07. Microsoft has a variety of virtualization and isolation technologies entering the product stream over the next two years (e.g. Silos); ideally, deployed properties would be abstracted from the app container technology so as to allow the ECN implementation to change over that time without requiring major rework.
Scenarios and Examples
A wide variety of scenarios were used to drive the development of requirements for the Edge Computing Network. A small set of scenarios are presented here to give a flavor of the ways properties can leverage the ECN to change the way they do things.
Cricket Match Play-by-Play
The population of nations which play international-level cricket exceeds one billion. During the course of a five-day test match involving his national team, a cricket fan is likely to be viewing a small web service which displays end-by-end descriptions of play and the current score; users typically check the window every 4-8 minutes for at least six hours.
This type of property is ideal for deployment within ECN. Minimal state information needs to be saved, and none of it is per-user. The dynamic content (play-by-play data and current scores) is generated near the edge and can easily be injected there. Service instances can be started to meet current demand and can be shut down at the end of each day of play and at the end of a match. No servers need to be provisioned in any central datacenter; computation resources are only consumed (and thus only paid for) during a match. Archival information (play-by-play of the previous day’s play and previous matches) can be stored using replication services; access to archival information wouldn’t require caching the data at an edge node, but the application could detect broad interest in particular archival data and elect to cache that data.
More interestingly, a peer-to-peer network could be built dynamically which would reduce the amount of network egress bandwidth required from the ECN node while increasing the speed at which updates could be pushed to the entire network of users. By taking advantage of knowledge about ISP network structure (i.e. which IP address ranges belong to which ISPs) and structuring the P2P network accordingly, ISPs would also see reduced internet downhaul costs; information would flow from the ECN-deployed application into a small subset of an ISP’s total user base, and those users would pass the information amongst themselves within the P2P network.
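The ISP-aware structuring described above can be sketched as follows; the prefix table and one-seed-per-ISP rule are illustrative assumptions, not a real peer-selection protocol:

```python
# Hypothetical sketch of ISP-aware peer grouping: cluster peers by the
# ISP that owns their address range, then push each update to one seed
# peer per ISP so it fans out inside that ISP's network rather than
# crossing the ISP boundary once per user.
import ipaddress

# Illustrative prefix table (documentation address ranges, not real ISPs).
ISP_PREFIXES = {
    "isp-a": ipaddress.ip_network("203.0.113.0/24"),
    "isp-b": ipaddress.ip_network("198.51.100.0/24"),
}

def group_by_isp(peer_ips):
    """Bucket peer addresses by owning ISP ('unknown' if unmatched)."""
    groups = {}
    for ip in peer_ips:
        addr = ipaddress.ip_address(ip)
        isp = next((name for name, net in ISP_PREFIXES.items() if addr in net),
                   "unknown")
        groups.setdefault(isp, []).append(ip)
    return groups

def seed_peers(groups):
    """Pick one peer per ISP to receive the update directly from the ECN node."""
    return {isp: peers[0] for isp, peers in groups.items()}

groups = group_by_isp(["203.0.113.5", "203.0.113.9", "198.51.100.7"])
seeds = seed_peers(groups)
```

In a real deployment the prefix table would come from BGP/registry data and seeding would pick multiple peers for redundancy; the sketch only shows the grouping idea.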
Knowledge about a user’s location and interest in cricket and in their nation would be used to drive the delivery of appropriate ad content. Since update frequencies are predictable, the AdCenter property could spend longer than the normal 1-2 ms time budget allocated for selecting the ad which generates the highest income to Microsoft. Ad content would be dynamically cached at the ECN node; some ads (e.g. those tied very specifically to a particular series of matches) would automatically expire and be flushed from the cache, while others would remain to be reused by AdCenter through other properties.
Hotmail at the Edge
Today’s Hotmail service is monolithic; it isn’t easily geo-distributed, it has huge storage requirements, and it relies upon other services (e.g. address book, passport) with which it expects to be collocated. Given the capabilities of the ECN, some parts of this can change.
The Passport service itself could be deployed into the ECN. Each user’s passport profile would be stored in the ECN node closest to that user’s normal location with backing store in a primary datacenter. User sign-on would be much faster, since long round-trips to a single datacenter in the US would be eliminated. The secure nature of an ECN node allows us to store passwords and personally identifiable information right on the edge nodes.
Hotmail itself could cache mailbox header information in the ECN node closest to the user. By caching just headers, the page-load time for the initial mailbox view would be dramatically reduced, addressing one of Hotmail’s biggest competitive disadvantages against Yahoo Mail and Gmail. The cache could be pruned by code written by the Hotmail team and deployed to the edge independently from the Hotmail service code.
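A minimal sketch of such an edge header cache, assuming a simple keep-the-newest-N pruning rule (the class shape and rule are illustrative, not Hotmail’s actual code):

```python
# Hypothetical sketch of edge-side mailbox-header caching: the ECN node
# keeps only recent headers so the initial mailbox view renders without
# a round trip to the home datacenter.
from collections import OrderedDict

class HeaderCache:
    def __init__(self, max_headers=50):
        self.max_headers = max_headers
        self.headers = OrderedDict()  # message_id -> (sender, subject, date)

    def add(self, message_id, sender, subject, date):
        self.headers[message_id] = (sender, subject, date)
        self.headers.move_to_end(message_id)  # newest at the end
        while len(self.headers) > self.max_headers:
            self.headers.popitem(last=False)  # prune oldest first

    def first_page(self, n=25):
        """Headers for the initial mailbox view, newest first."""
        return list(self.headers.items())[::-1][:n]

cache = HeaderCache(max_headers=2)
cache.add("m1", "a@example.com", "hi", "2007-01-01")
cache.add("m2", "b@example.com", "re: hi", "2007-01-02")
cache.add("m3", "c@example.com", "news", "2007-01-03")
```

Because the pruning policy lives in a small, separate component like this, the Hotmail team could redeploy it to the edge without touching the main service code, as the paper suggests.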
Intelligent pre-fetching of individual email messages could be performed; the Hotmail application code can predict which message is likely to be the next one the user wants to view. With the cooperation of an AJAX-style client, the ECN-resident component of Hotmail could use various forms of HTML and TCP compression to more efficiently send content to the client browser.
Several people have contributed to the vision and ideas described in this paper: Jeff Cohen, Vidyanand Rajpathak, Hang Tan, Ben Black, and Robyn Jerard. Thank you for your knowledge, review, and spirited argument.