At Red Hat, we have been involved in the creation of many of the core industry standards that will be used in building future 64-bit ARM powered servers. Over the past few years, we have assisted in the creation of such standards as the Server Base System Architecture (SBSA), the UEFI 2.4 and ACPI 5.1 bindings for the 64-bit ARM Architecture, and other standards and specifications that have yet to be announced. I believe that each of these standards forms an essential component in the creation of a general purpose computing platform suited to widespread enterprise adoption, as opposed to an embedded or appliance-like device that is tailored to one specific purpose (but for which the Operating System and platform are welded together). Such general purpose platforms are important because customers and end users have many expectations around interoperability and system behavior, formed over decades of working with highly reliable (and highly standardized) platforms. And while it is important to foster true innovation, gratuitous differentiation only serves to harm everyone involved. It might be fun to build an embedded appliance for a specific application, but using that approach in designing a server is a surefire way to ensure a lack of commercial success.
This is why Red Hat has been leading development of a number of standards that will allow 64-bit ARM servers to run general purpose Operating System software (such as our own). I personally believe that ARM servers may become extremely popular in the market, and if this happens, it is necessary that it be possible to support such systems with a standard Operating System, rather than building a custom one for each different server system that comes along (something that simply cannot scale into the enterprise). Our customers and end users demand that they have a general purpose Operating System capable of supporting a range of standard hardware, which in turn affords those customers the ability to choose from among the available options based upon features and capabilities that extend that standard platform to deliver the highest overall value, rather than worrying about whether the system can be made to run at all. Standards we have been involved in creating include those such as SBSA relating to the underlying ARM Architecture and the realization of that architecture in physical devices (ensuring a common basic set of platform devices necessary to boot an Operating System), as well as those pertaining to system initialization (UEFI) and the discovery and operation of platform devices throughout the course of normal system use (ACPI). Each of these standards has been carefully crafted or adapted to facilitate a common server platform more readily supported by a range of OS choices.
Why do standards matter?
We of course take standards entirely for granted in our everyday lives. From the moment we get out of bed in the morning, until the moment we retire in the evening – and at all stages both before and after – our lives benefit tremendously from the efforts of standardization. Standards enable us to consume our morning coffee. That coffee we enjoy is farmed across several regions of the globe (subject to standards of measurement quality defined by various international organizations), shipped in standardized shipping containers at low cost (which enable our modern world to transport goods at economical prices), roasted according to defined processes guaranteeing consistency, and served in a range of standardized cup sizes for a mere few dollars. And that is to say nothing of the many sugars, sweeteners, milks, creams, flavors, and other additions that each have their own standards for production, distribution, measurement, and quality. More generally, standards allow us to never think twice about many of life’s everyday adventures. We can generally safely plug the latest home appliance or gadget into a wall electrical socket without wondering whether the voltage is being delivered at the correct frequency for safe operation, or (and it is unlikely many of us seriously consider this upon waking in the morning) whether today is the day that our shower will suddenly become incompatible with our indoor plumbing installation. Yet just such incompatibilities have occurred often in the technology landscape. This is in large part because it often takes a certain amount of painful experience before the cost-benefit analysis of introducing a standard (which intentionally serves to limit flexibility) begins to pay off.
Imagine living one day of your life devoid of many of these standards. First, there would of course be no notion of what “a day” means, nor of its start, or of its duration. There would be no agreed upon notion of time, of time zones, of globally synchronized clocks. Nor would (and this might actually be a good thing) there be a million notifications or emails awaiting us upon our awakening (these require complex networks built upon thousands of standards), or indeed electrical power (which requires standardized transmission) to power the devices in our homes. For that matter, there would be no standardized bedding to sleep on, or potable water to drink upon awakening (which requires standardized components for infrastructure, as well as for measurement of safety), no fire hydrants (which have a universal set of fixtures for hookup), and no sewage treatment. And since there would be no general building codes to define the rules for assessing construction, there would be no way to know whether our home was even structurally sound to begin the day with either. There would of course be no food codes, and no food, since the absence of standardized transportation infrastructure and processes for delivering fuel would mean there was no mass transportation of food, even if it were somehow possible to grow it. All in all, a day without standards would likely be a very brief, and a very lonely one.
Standards are key to the success of modern civilization. They allow many different components, from many different suppliers, to work together (more) seamlessly. We take them totally for granted without so much as a second thought. This is a good thing of course, because people are generally more interested in the forward progress of society than in pondering how everything they use somehow manages to work together. In the server space, we have built many industry standards over the past few years, each forged from bitter experiences that have generally taught us what works well, and what works less well for the overall experience of customers and end users. Standards cover such areas as hardware platform design, platform discovery and enumeration, system configuration and (power) management, and much more besides. One of the key benefits of platform standardization is that the user is able to make a choice of system from a range of compatible options (each built from components that adhere to various standard specifications). Each of these systems performs certain core functions in an abstracted way so that the same Operating System is able to run without modification for many years at a time. The Operating System does not need to know about many aspects of platform initialization and configuration that were performed by firmware and can instead leverage standardized interfaces to ascertain information about the platform that is critical and necessary to runtime operation.
A brief history
Early ARM servers were not nearly so standardized as will be the case over the coming few years. Many of the early proof of concept designs used non-standardized software and interfaces, such as “U-Boot” for platform firmware, and “DeviceTree” for platform hardware description. Neither of these is fully standardized. In the case of the former (U-Boot), there is an upstream project but no formal specification from a standards body detailing which features should be present on a system, nor what the defaults and behaviors should be. Consequently, nearly every vendor has added incompatible “value add”, often using a U-Boot release that is many years out of date as a starting point. In the Fedora Project, developers have worked on cunning hacks (and even mini-specifications of a kind) to facilitate working around many of these limitations, but differences between systems and the lack of an overall industry standard for U-Boot in server systems are such that additional hacks are required for each new system that appears. This is, in my personal opinion, unsupportable at a commercial scale. Finally, U-Boot is further unsupportable because it does not standardize the location and behavior of the system boot files. This means that an Operating System must have magical knowledge about where to store its boot files and how to configure these specifically for the chosen system.
Fortunately, we have an industry standard firmware platform with a formal specification that both details its behavior and allows for many compatible Operating Systems (and the Operating System bootloaders used to boot them). The Unified Extensible Firmware Interface (UEFI) is not only fully standardized and follows a formal process for modification over time, but it also provides a great many services to Operating Systems early in the boot process (diagnostic, informational, and so forth) and at runtime (Real Time Clock, Boot Device Selection and Configuration). UEFI still further standardizes the process of installing an Operating System and Operating System bootloader such that the OS can use a standard interface to control the UEFI System Volume, which also supports multi-boot of a variety of different installed Operating Systems that do not needlessly interfere with one another.
The many problems with U-Boot are even less significant perhaps than those of the DeviceTree. DeviceTree is a platform description mechanism that is used to enumerate the many different platform hardware devices installed within a system, their topology, and their configuration. It is true that DeviceTree originally descends from an industry standard: that of OpenFirmware (IEEE1275). Indeed, this specification has been used effectively by other architectures, such as IBM POWER, and Sun SPARC. The problems with DeviceTree use in ARM based server systems stem from a lack of development of the IEEE1275 specification specific to its use in System-on-Chip and ARM based designs. DeviceTrees have a formalized structure (described on the devicetree.org website), but the content of those trees – the so-called “bindings” that describe individual devices and their configuration – is not formally standardized or controlled by any single industry body. Instead, bindings are created, modified, and destroyed by sending patches to the Linux Kernel Mailing List (LKML) along with changes to drivers that are impacted by those modifications. This affords an extremely flexible and fluid development process, but unfortunately also affords an extremely fluid set of incompatible DeviceTree bindings over time. It is very advantageous towards embedded development that there is now a mechanism within the kernel (DeviceTree) that can be used in place of hard coding values into kernel code, and the rapid iteration of DeviceTree bindings is certainly far less of a problem – and more of a development benefit – when a device ships with the Operating System as part of a complete solution (such as is the case with an Android cellphone).
Server systems running general purpose Operating Systems (that cannot simply be rebuilt on a whim) need to be formed from platform descriptions that are set in stone, and which do not change for many years at a time. Their description needs to be specified using device bindings that cannot easily be changed, and when changes are necessary, they must be carefully managed by an industry standards body in collaboration with the different Operating System vendors that will be affected by those changes. This is not the case with DeviceTree based designs today. Still further complicating matters is the fact that DeviceTree provides no support for runtime device abstraction. The Operating System device drivers must instead be updated whenever even small changes occur in system design. Contrast the use of DeviceTree with the use of ACPI (Advanced Configuration and Power Interface) on Intel Architecture. ACPI is both a formalized standard with rigorous specification and change management process, and it features an interpreted language that can be used to abstract the control of certain platform functions – to work around a hardware bug in the platform code (for example), or to abstract certain differences between one OEM/ODM platform and the next without needlessly modifying the Linux kernel on a continual basis for each emerging platform. Finally, ACPI intentionally limits the kinds of system that can be expressed and described so as to encourage certain system design philosophies necessary for successful server designs. It’s suited for industry standard server platforms and (arguably) less suited for embedded use cases, but those can easily be addressed by using embedded technologies (such as U-Boot and DeviceTree, which will continue to be used in millions of non-server devices).
Of course, standards are not perfect. Everyone can point to an example of a time where an ACPI issue meant that their laptop did not suspend correctly, or that their backlight didn’t work properly on a newer platform, or a time where they were suitably excited to point at the perceived evils of Microsoft Windows being used against them. Alas, in the frenzy of finger-pointing and hating, it can be easy to forget the many times that ACPI functioned entirely as intended, that systems did operate correctly, and the many times when an older Linux kernel was able to be installed on a new laptop and functioned without specific modification. This is how (for better or worse) most end users expect their computing experience to be. They are not willing to “just change this U-Boot variable” or “try this new DeviceTree with the added registers for the updated binding from last Thursday”. They expect their systems to adhere to industry standards that allow those same systems to boot a variety of different Operating Systems, and they have the reasonable expectation that there will be a certain level of stability to the underlying platform they have in use.
As part of the process of adopting ACPI (and even UEFI) for ARM server systems, it was important to realize that there is a world beyond our own preference for Operating System. In the broader marketplace, other options exist beyond Linux, and these will also need to coexist with our own on ARM server systems. With this consideration in mind, many different organizations have collaborated in order to ensure a standardized base platform has been defined that can be leveraged by the full variety of Operating System choices. Linux will, of course, be a first class citizen. To begin with, the entire governance model for ACPI was changed as part of the process of standardizing the ARM server platform. Several years’ worth of effort from many talented individuals, lawyers, and others resulted in the reformation of ACPI under the UEFI Forum, which is a neutral body of which anyone can become a member. Operating System companies, such as Red Hat, as well as many of our competitors, are members of the UEFI Forum and can (and do) actively participate in the process of governing and defining both the UEFI and ACPI specifications. Beyond this, core engineering activities around the UEFI and ACPI specifications have already implemented support for both on the ARM Architecture under Linux. UEFI support is upstream as of Linux 3.16, while support for the (newly published) ACPI 5.1 specification has been posted and will take a little more time to integrate into the upstream Linux kernel.
What’s in a server?
Modern servers are complex systems formed from very many parts. Of course there are the System-on-Chips (SoCs) that contain the CPUs (“Processing Elements”, in ARM parlance), and these CPUs are connected together in a cache coherent on-chip interconnect that also attaches to (external) memory, as well as on-chip devices such as network and storage controllers, IO bus interfaces (PCIe), and acceleration engines intended to enhance system performance. Then there are the many other supporting components that are less obvious to those not involved in system design. Each of the different devices on the SoC has its own control interfaces for powering it on or off, and each has a number of inputs and outputs for clocks, muxes, and other configuration. A modern SoC may feature hundreds of different clock networks that provide different frequency clock signals to each of the different devices on-chip, and each of these different devices may be connected to additional logic blocks both on or off chip with interdependencies. For example, a network (SGMII, RGMII, XFI, or similar MAC) and a disk (SATA) interface on the chip may be connected to an external series of PHYs (high speed serial links known as SerDes controllers) through a set of muxes (multiplexers) that allow those devices to share the external PHY resources, allowing for only certain combinations of configurations. Somewhere along the line, it is necessary to configure all of these devices to work together, along with their interdependent logic, and to program the various clocks and configuration interfaces to realize the intended overall system design.
This is where a fundamental divergence exists between the worlds of embedded and server. In the embedded Linux world, where devices are built from components that are tightly coupled, it is seen as a design benefit that the kernel is as educated as possible about the underlying platform hardware, how it is wired together, and how it is configured. This allows the kernel to drive the hardware into the lowest possible power states necessary for cellphones, where even microwatts of power are critical to battery life. ARM devices built using the Android operating system (and using DeviceTree) include kernels that are tightly coupled to the underlying device hardware. These Linux kernels contain support for configuring underlying platform devices, as well as their on-chip interconnectivity (through muxes and associated phys that are often shared between completely different devices), and their many (sometimes many hundreds) clocks, voltage regulators, and other low-level platform specifics. All of this information must be carefully communicated to the Linux kernel through a cacophony of drivers, frameworks, and DeviceTree bindings that have many interdependencies. When it works, it works extremely well, but changing one component can have impact upon other unrelated components. This becomes a problem when building more general purpose systems.
General purpose systems must exist as a stable platform for many years of successful operation. They ship with firmware that contains integrated platform descriptions (rather than coupling this data to a specific Operating System, or even a specific release of the Linux kernel included within Linux Operating Systems), and they must describe platform devices in a suitably generic way that can be supported for the long-term. Early proof of concept ARM servers built using U-Boot and DeviceTree embed too much platform knowledge in the kernel. While this allows a large degree of flexibility, it also introduces a very strong design dependency between specific versions of the platform firmware and Linux kernel. DeviceTree bindings have frequently changed upstream, forcing users to upgrade both their firmware and kernel at exactly the same time (lest their system fail to even boot), and firmware built upon U-Boot has lacked the capability to configure the base platform to bring up devices in a working configuration. In the U-Boot world, it is anticipated that the Linux kernel will know all about unrelated device dependencies, clocks, voltage regulators, and hundreds of other system components simply in order to boot the system to a login prompt. This has resulted in special Operating System releases for each ARM server platform (or a subset of supported platforms).
Contrast the embedded Linux approach used on some early ARM server systems with that used in the enterprise space, and in all future ARM server platforms built upon the industry standards here discussed. In the Enterprise space, we expect system firmware to correctly initialize underlying platform devices, along with their associated clock inputs, voltage regulators, muxes, phys, and other pieces necessary to boot. We expect server grade hardware to be designed without interdependencies between hardware devices such that the kernel does not have to have special knowledge of the hidden relation between a network device and a disk interface, and we expect it to be possible to boot an existing Linux kernel compliant with a base architecture specification on newer emerging platforms (compliant with that specification) as they come to market. Such expectations, which are long-established in the customer and end userbase that will be deploying ARM server systems, are fundamentally incompatible with an embedded approach to enterprise server design, and are a strong impetus for our collaboration with industry partners to build a stable foundational series of architectural platform specifications that bring the traditional notions of platform stability into the ARM server space. Using the SBSA, along with UEFI and ACPI, as well as other appropriate platform standards, we will help the newly emerging ARM server market to support general purpose computing (rather than embedded designs) and general purpose Operating Systems that do not require special knowledge of the underlying computing platform. These will be more easily supportable, more general purpose, and more useful computing systems for those seeking to embrace ARM.
In the Linux space, standards are sometimes seen as a double-edged sword. We like to innovate and rapidly progress, and standards can be seen as hindrances to such rapid progress. Which is of course what they can be. If you’re a Linux kernel engineer, with a deep knowledge of how the hardware in your computer operates at the component level, you probably don’t need a general purpose abstraction such as ACPI to describe it (since you know exactly how the hardware behaves and perhaps wrote the DeviceTree describing it in your sleep), and you will see this as a needless hindrance to your development. Which can of course certainly be true. Yet, at the same time, few end users are kernel engineers, and even fewer relish the notion of having to know precisely which components are installed within their systems and how they are wired together for successful operation. Those users want a rigorously standardized process for designing, building, and enumerating the devices installed within their server systems. The purpose of the various efforts underway in the ARM server community is to build an ability to have a general purpose Operating System option for the many millions of servers that we expect users will want to deploy over the next few years.
Photo: Jon Masters pedal-powering an ARM server (source: Red Hat flickr stream)
At this year’s Red Hat Summit, I gave a talk entitled “Hyperscale Cloud Computing with ARM processors” (video coming soon). In the talk, I introduced and gave a live demo of the world’s first bicycle powered ARM server (HP Redstone server powered by Calxeda EnergyCore quad-core ARM processors). I wanted to make a point that the (hyperscale) future will be all about energy efficient technology. The quad-core Calxeda EnergyCore ARM-based chips used in my demo (powering the HP Redstone server system) use only 5W of power at full load, including the RAM, fabric interconnect, and management controller. The (pre-production) test system had 8 of these installed, for a total of 32 ARM cores. At 5W per quad-core, that’s still only 40W to run all of the compute within the server system.
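The power budget above is easy to check by hand, but a minimal sketch makes the arithmetic explicit. The node count, per-node power, and cores per node are the figures quoted for the demo system; the function and key names are illustrative only and not part of any tool used in the demo.

```python
def compute_budget(nodes: int, watts_per_node: float, cores_per_node: int) -> dict:
    """Summarize total power and core count for a set of identical nodes."""
    total_watts = nodes * watts_per_node
    total_cores = nodes * cores_per_node
    return {
        "total_watts": total_watts,
        "total_cores": total_cores,
        "watts_per_core": total_watts / total_cores,
    }


# Figures quoted for the pre-production demo system:
# 8 quad-core nodes at 5W each (RAM, fabric, and management included).
budget = compute_budget(nodes=8, watts_per_node=5.0, cores_per_node=4)
print(budget)
```

The per-core figure is simply the node power divided by four, so it counts each node's RAM, fabric interconnect, and management controller against its cores.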
We wanted to visualize the power (pun intended) of low energy computing in some way that would be memorable and would also connect with the audience at a personal level. The idea of using a bicycle was suggested, and I took this very much to heart. Over a period of several weeks (on and off), I designed and built a modified solar power rig, replacing the solar panels with a bicycle generator system based on the Pedal-A-Watt (which was used by the “Amp” energy drink manufacturer during a Superbowl pre-game event a few years back, along with many riders and batteries, to power the entire pre-game show). The (single speed) bicycle was attached to a (reverse diode protected) generator that connected to a solar charge controller. The charge controller handles keeping a 12V (35AH) deep-cycle AGM (Absorbent Glass Mat – safe against leaks and for use in a public environment) battery trickle fed while diverting excess load (power that needs to go somewhere other than into heating and destroying the generator) to a “diversion load” – in this case a convenient trucker’s fan (cooling the rider in the process). The battery feeds an inverter of the kind found in trucks and larger automobiles, which is then connected to a smoothing circuit. For the demo, we used a (smallish) UPS as the smoothing circuitry because this provided for a guaranteed regulated sine wave output, a buffer against pedal startup/shutdown, and helped to avoid continually cycling the (expensive, pre-production prototype) server on and off. If you’re just powering an embedded board or some home electronics, you can skip the UPS part.
Video: The initial proof of concept (source: Jon Masters)
In line with the generator, I installed two multi-meters. One captured instantaneous current flowing into the charge controller circuitry, the other captured voltage across the generator. Using a simple (and not entirely ideal, but we can work on that) Power = Voltage x Current type of calculation, it was possible to display a measure of instantaneous power being generated. I used a model (the TekPower TP4000ZC) of multi-meter that was inexpensive and yet had an RS232 output (this is in fact the cheapest multi-meter with such a feature that I could find). Two of these provided the necessary data, which I read using a custom utility I wrote (the QtDMM, “multimeter”, and other Linux applications not being adequate for my console-driven needs). I considered graphing the results with gnuplot (and in fact, I did do this) but the visualization wasn’t as straightforward for an audience as a single large power reading. So I wrote a small pygtk (GTK+) application to display the instantaneous power calculation returned by my “multi” software. This is what the audience saw during the demo. It in fact was a single GTK window wherein I had hacked the main loop horribly to read the output of my “multi” utility as a pipe on the command line (since it’s my demo, I can violate all the rules of modern graphical programming if I want to).
Using the rig, we were able to generate instantaneous power readings of up to several hundred Watts, while 100W was quite reasonable with little effort. The bicycle used was a single speed (for the aesthetic) and the lack of gears meant that we didn’t approach the 300-400W maximum that the generator can theoretically put out (good thing too, because realizing this, I put the current measuring multi-meter in line with the generator and it has a 10A fuse rating – for a bigger rig, some kind of current sensing coil might be needed, etc.). During the live on-stage demo as part of my Red Hat Summit talk, I appeared to generate much less than 100W at times. This is because the jury-rigged wire attaching the inverter came loose during the demo and we were periodically dumping load out to the fan (there’s a reminder there about the dangers of doing live demos). Since the fan offers little electrical resistance compared with charging the battery, the bike becomes much easier to pedal and you start pedaling very fast, very quickly. In a permanent rig, a better dump load would be a second battery or other resistive load offering similar characteristics to the battery. I fixed the wiring after the demo and subsequent riders at the booth were generating up to 200W of power once again.
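The Power = Voltage x Current calculation can be sketched in a few lines of Python. This is not the “multi” utility or the pygtk display from the demo, whose code and output format are not shown here. It assumes a hypothetical text format of one whitespace-separated "volts amps" pair per line, and the function names are invented for illustration.

```python
from typing import Iterable, Iterator


def instantaneous_power(volts: float, amps: float) -> float:
    """Return P = V x I in watts for one pair of simultaneous readings."""
    return volts * amps


def power_stream(lines: Iterable[str]) -> Iterator[float]:
    """Yield one power value per well-formed 'volts amps' line.

    The two-column format is an assumption made for this sketch.
    Malformed lines are skipped rather than treated as fatal errors.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            volts, amps = float(fields[0]), float(fields[1])
        except ValueError:
            continue
        yield instantaneous_power(volts, amps)


# Made-up sample readings; any iterable of lines works, including sys.stdin.
sample = ["13.8 7.5", "not a reading", "14.1 10.0"]
for watts in power_stream(sample):
    print(f"{watts:.1f} W")
```

Because `power_stream` accepts any iterable of lines, the same function could be fed from a pipe on standard input or from a list of test readings.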
Photo: Jon Masters pedal-powering an ARM server (source: Jon Masters)
Here’s the full component list:
- 30A digital Solar Charge Controller with diversion load support
- UB12350 35AH 12V AGM battery
- Cobra 400W inverter
- 2 x TekPower TP4000ZC multi-meters
- 2 x Pluggable RS232 to USB adapters
- 20A ATC fuses
- 2 x ATC fuse holders
- Marine grade 12V DC electrical socket
- Road Pro 10″ 12V fan
- Miscellaneous wire, terminal blocks, electrical tape, etc.
The all-important Red Hat cycle jersey is available in the Red Hat “Cool Stuff” store.
Those who attended (or read about) Red Hat Summit last week might have noticed my talk, entitled “Hyperscale Cloud Computing with ARM Processors”. In the talk (slides coming soon, video coming soon), I introduced the concept of Hyperscale Computing as an inflexion point in the industry that will disrupt the very concept of a server in future systems. Modern servers have come a long way, but they are nonetheless fundamentally based around designs originally created decades ago. Racks of individually connected, high-power, low density servers (and blades) are installed in modern data centers thousands at a time. Each of these server systems requires its own networking infrastructure, (high) power distribution, HVAC, and maintenance engineers to take care of it when things go wrong. But that’s all so last century.
In the future, we may still use racks, but we will design and integrate at the rack (and datacenter) level. We won’t have individual network ports with spaghetti wiring between servers; we’ll use a fabric interconnect to expose a single (or several) network ports for a whole rack (including built-in resiliency, high availability, and fault tolerance through multi-path fabric connections). We will use System-on-Chip (Server-on-Chip) technology to integrate all but the system RAM and (flash) storage onto the server chip – including all of the IO, offload, GPGPU, etc. – and further down the line Package-on-Package will allow us to integrate some (or all) of the rest. System-on-Chip technology, which began life as an embedded systems technology (but is primed to storm the datacenter in the next few years) allows for mass levels of integration at high density. Combine this with a choice of redesigned or alternative computer architectures (such as ARM, or low energy x86 designs) requiring little active cooling and you will see 1,000 (possible right now) or even (eventually) 10,000 server nodes in a single rack.
Performance in the Hyperscale world will be gained in aggregate, at low energy, not necessarily through having a small number of beefy and energy inefficient servers. Although you will still see servers featuring dozens of fully coherent chips with elaborate interconnects and hundreds of cores, hyperscale will be more about having thousands of individual servers each having a smaller number of cores. This is consistent with a general trend in the industry away from single-core scalability. In the past few decades since the end of the 1970s, computer architecture and silicon process enhancements have allowed a level of unparalleled performance growth. Year-on-year, we saw 52% growth vs. 25% growth in single core performance in the years prior. Now, once again, we have returned to an average performance growth rate of 22% per year (see Computer Architecture, 5th edition for the statistics). Everyone has given up on linear single core growth as the strategy, and it’s time to give up on a strategy of single-system coherent designs that are energy inefficient and complex to design. Instead, application-level support for scale-out will allow for the use of much simpler designs. We’ll still have big, beefy servers, but they will be the exception, and not the rule, at least in this space.
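To get a feel for what those annual growth rates mean when compounded, here is a minimal sketch. The three rates are the ones quoted above; the helper names are illustrative, and the calculation is ordinary compound growth rather than anything specific to the cited textbook.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_over(years: int, annual_growth: float) -> float:
    """Cumulative performance multiple after compounding for `years` years."""
    return (1 + annual_growth) ** years


# Rates quoted above: 25% (earlier years), 52% (the boom), 22% (recent).
for rate in (0.25, 0.52, 0.22):
    print(
        f"{rate:.0%} per year: doubles about every "
        f"{doubling_time_years(rate):.1f} years, "
        f"roughly {growth_over(10, rate):.0f}x over a decade"
    )
```

The gap between the rates looks modest year to year but compounds into a very large difference over a decade, which is why the loss of the 52% era matters so much.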
A single Hyperscale server node will be formed from at most three pieces:
- Server-on-Chip (SoC) – CPU, GPU, IO, fabric interconnect, management controller
- RAM – stacked in the longer term once PoP technology allows for this
- Storage – individual flash will combine with virtualized fabric-distributed I/O
We won’t have maintenance engineers scurrying around datacenters ripping out servers and replacing parts. Instead, we’ll use Failure-in-Place (also Fail-in-Place, FiP) to allow server nodes to fail and be marked as “bad”, much as we disregard a few dead pixels on a laptop display, or a few bad blocks (or flash erase blocks) on a storage device. When you have 1,000+ server nodes in a single rack (and ultimately many more), the last thing you will see is an SLA that calls for individual node replacement. Instead, a certain number of bad servers will be tolerated before the whole system (or parts of it) is replaced as a scheduled event.
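The Fail-in-Place idea can be sketched as simple bookkeeping: failed nodes are recorded and left where they are, and replacement is triggered only once a tolerated count is exceeded. The `Rack` class, its fields, and the threshold of 50 nodes are all hypothetical, chosen purely for illustration; no real management API is implied.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Rack:
    """Tracks failed nodes in a rack under a Fail-in-Place policy."""

    total_nodes: int
    tolerated_failures: int  # hypothetical policy value, not from any SLA
    failed: Set[int] = field(default_factory=set)

    def mark_failed(self, node_id: int) -> None:
        """Record a node as bad; it stays in the rack and is not repaired."""
        if not 0 <= node_id < self.total_nodes:
            raise ValueError(f"no such node: {node_id}")
        self.failed.add(node_id)

    @property
    def healthy_nodes(self) -> int:
        return self.total_nodes - len(self.failed)

    def needs_scheduled_replacement(self) -> bool:
        """True once failures exceed what the policy tolerates."""
        return len(self.failed) > self.tolerated_failures


rack = Rack(total_nodes=1000, tolerated_failures=50)
for node_id in (3, 17, 512):
    rack.mark_failed(node_id)
print(rack.healthy_nodes, rack.needs_scheduled_replacement())
```

Using a set means that reporting the same node twice does not inflate the failure count.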
For more about Hyperscale Computing, follow this blog, and check out the videos from my Red Hat Summit talk.