SMTP[1] is close to being the perfect application protocol: it solves a large, important problem in a minimalist way. It's simple enough for an entry-level implementation to fit on one or two screens of code, and flexible enough to form the basis of very powerful product offerings in a robust and competitive market. Modulo a few oddities (e.g., SAML), the design is well conceived and the resulting specification is well-written and largely self-contained. There is very little about good application protocol design that you can't learn by reading the SMTP specification.
Unfortunately, there's one little problem: SMTP was originally published in 1981 and since that time, a lot of application protocols have been designed for the Internet, but there hasn't been a lot of reuse going on. You might expect this if the application protocols were all radically different, but this isn't the case: most are surprisingly similar in their functional behavior, even though the actual details vary considerably.
In late 1998, as Carl Malamud and I were sitting down to review the Blocks architecture[2], we realized that we needed to have a protocol for exchanging Blocks. The conventional wisdom is that when you need an application protocol, there are three ways to proceed:

- find an existing protocol that (more or less) does what you want;
- define an exchange model on top of an existing infrastructure, such as HTTP, that (more or less) does what you want; or,
- define a new protocol from scratch that does exactly what you want.
An engineer can make reasoned arguments about the merits of each of the three approaches. Here's the process we followed...
The most appealing option is to find an existing protocol and use that. (In other words, we'd rather "buy" than "make".) So, we did a survey of many existing application protocols and found that none of them were a good match for the semantics of the protocol we needed.
For example, most application protocols are oriented toward client-server behavior, and emphasize the client pulling data from the server; in contrast, with Blocks, a client usually pulls data from the server, but it may also ask the server to asynchronously push (new) data to it. Clearly, we could mutate a protocol such as FTP[3] or SMTP into what we wanted, but by the time we did all that, the base protocol and our protocol would have more differences than similarities. In other words, the cost of modifying an off-the-shelf implementation becomes comparable with starting from scratch.
Another approach is to use HTTP[4] as the exchange protocol and define the rules for data exchange over that. For example, the IPP[5] (the Internet Printing Protocol) uses this approach. The basic idea is that HTTP defines the rules for exchanging data and then you define the data's syntax and semantics. Because you inherit the entire HTTP infrastructure (e.g., HTTP's authentication mechanisms, caching proxies, and so on), there's less for you to have to invent (and code!). Or, conversely, you might view the HTTP infrastructure as too helpful. As an added bonus, if you decide that your protocol runs over port 80, you may be able to sneak your traffic past older firewalls, at the cost of port 80 saturation.
HTTP has a lot of strengths, for example, it uses MIME[6] for encoding data and is ubiquitously implemented. Unfortunately for us, even with HTTP 1.1[7], there still wasn't a good fit. As a consequence of the highly-desirable goal of maintaining compatibility with the original HTTP, HTTP's framing mechanism isn't flexible enough to support server-side asynchronous behavior and its authentication model isn't similar to other Internet applications. In addition, we weren't of a mind to play games with port 80.
So, this left us the final alternative: defining a protocol from scratch. However, we figured that our requirements, while a little more stringent than most, could fit inside a framework suitable for a large number of future application protocols. The trick is to avoid the kitchen-sink approach. (Dave Clark has a saying: "One of the roles of architecture is to tell you what you can't do.")
...if you're willing to make the problem small enough.
Our most important step is to limit the problem to application protocols that exhibit certain features:
First, we're only going to consider connection-oriented application protocols (those that work on top of TCP[8]). The other branch in the taxonomy, connectionless protocols, covers those that don't want the delay or overhead of establishing and maintaining a reliable stream. For example, most DNS[9] traffic is characterized by a single request and response, both of which fit within a single IP datagram. In this case, it makes sense to implement a basic reliability service above the transport layer in the application protocol itself.
Second, we're only going to consider message-oriented application protocols. A "message" in our lexicon is simply structured data exchanged between loosely-coupled systems. The other branch in the taxonomy, tightly-coupled systems, uses remote procedure calls as the exchange paradigm. Unlike the connection-oriented/connectionless dichotomy, the distinction between loosely- and tightly-coupled systems is more of a continuous spectrum. Fortunately, the edges are fairly sharp.
For example, NFS[10] is a tightly-coupled system using RPCs. When running in a properly-configured LAN, a remote disk accessible via NFS is virtually indistinguishable from a local disk. To achieve this, tightly-coupled systems are highly concerned with issues of latency. Hence, most (but not all) tightly-coupled systems use connection-less RPC mechanisms; further, most tend to be implemented as operating system functions rather than user-level programs. (In some environments, the tightly-coupled systems are implemented as single-purpose servers, on hardware specifically optimized for that one function.)
Finally, we're going to consider the needs of application protocols that exchange messages asynchronously. The classic client-server model is that the client sends a request and the server sends a response. If you think of requests as "questions" and responses as "answers", then the server answers only those questions that it's asked and it never asks any questions of its own. We'll need to support a more general model, peer-to-peer. In this model, for a given transaction one peer might be the "client" and the other the "server", but for the next transaction, the two peers might switch roles.
It turns out that the client-server model is a proper subset of the peer-to-peer model: it's acceptable for a particular application protocol to dictate that the peer that establishes the connection always acts as the client (initiates requests), and that the peer that listens for incoming connections always acts as the server (issuing responses to requests).
There are quite a few existing application domains that don't fit our requirements, e.g., nameservice (via the DNS), fileservice (via NFS), multicast-enabled applications such as distributed video conferencing, and so on. However, there are a lot of application domains that do fit these requirements, e.g., electronic mail, file transfer, remote shell, and the world-wide web. So, the bet we are placing in going forward is that there will continue to be reasons for defining protocols that fit within our framework.
The next step is to look at the tasks that an application protocol must perform and how it goes about performing them. Although an exhaustive exposition might identify a dozen (or so) areas, the ones we're interested in are:

- framing, which tells how the beginning and end of each message is delimited;
- encoding, which tells how a message is represented when exchanged;
- error reporting, which tells how errors are described;
- multiplexing, which tells how independent exchanges are handled;
- user authentication, which tells how the peers at each end of the connection are identified and verified; and,
- transport security, which tells how the exchanges are protected against third-party interception or modification.
There are three commonly used approaches to delimiting messages: octet-stuffing, octet-counting, and connection-blasting.
An example of a protocol that uses octet-stuffing is SMTP. Commands in SMTP are line-oriented (each command ends in a CR-LF pair). When an SMTP peer sends a message, it first transmits the "DATA" command, then it transmits the message, then it transmits a "." (dot) followed by a CR-LF. If the message contains any lines that begin with a dot, the sending SMTP peer sends two dots; similarly, when the other SMTP peer receives a line that begins with a dot, it discards the dot, and, if the line is empty, then it knows it's received the entire message. Octet-stuffing has the property that you don't need the entire message in front of you before you start sending it. Unfortunately, it's slow because both the sender and receiver must scan each line of the message to see if they need to transform it.
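To make this concrete, here's a minimal sketch of SMTP-style dot-stuffing in Python (the function names are mine, not from any library):

    def stuff(message: str) -> str:
        # Sender side: double any leading dot, then append the "." terminator.
        lines = message.split("\r\n")
        stuffed = ["." + ln if ln.startswith(".") else ln for ln in lines]
        return "\r\n".join(stuffed) + "\r\n.\r\n"

    def unstuff(lines):
        # Receiver side: discard a leading dot; if the line is then empty,
        # the entire message has arrived.
        body = []
        for ln in lines:
            if ln.startswith("."):
                ln = ln[1:]
                if ln == "":
                    break
            body.append(ln)
        return "\r\n".join(body)

    assert unstuff(stuff("hi\r\n.plan").split("\r\n")) == "hi\r\n.plan"

Note that both sides touch every line, which is exactly the scanning cost described above.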
An example of a protocol that uses octet-counting is HTTP. Commands in HTTP consist of a request line followed by headers and a body. The headers contain an octet count indicating how large the body is. The properties of octet-counting are the inverse of octet-stuffing: before you can start sending a message you need to know the length of the whole message, but you don't need to look at the content of the message once you start sending or receiving.
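A sketch of the receiving side of octet-counting (simplified; real HTTP framing has additional cases, such as chunked transfer):

    import socket

    def read_counted_body(sock: socket.socket, count: int) -> bytes:
        # Read exactly 'count' octets; the content itself is never inspected.
        chunks, received = [], 0
        while received < count:
            chunk = sock.recv(min(4096, count - received))
            if not chunk:
                raise ConnectionError("peer closed before the full body arrived")
            chunks.append(chunk)
            received += len(chunk)
        return b"".join(chunks)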
An example of a protocol that uses connection-blasting is FTP. Commands in FTP are line-oriented, and when it's time to exchange a message, a new TCP connection is established to transmit the message. Both octet-counting and connection-blasting have the property that the messages can be arbitrary binary data; however, the drawback of the connection-blasting approach is that the peers need to communicate IP addresses and TCP port numbers, which may be "transparently" altered by NATs[11] and network bugs. In addition, if the messages being exchanged are small (say, less than 32k), then the overhead of establishing a connection for each message contributes significant latency during data exchange.
There are many schemes used for encoding data (and many more encoding schemes have been proposed than are actually in use). Fortunately, only a few are burning brightly on the radar.
The messages exchanged using SMTP are encoded using the 822-style[12]. The 822-style divides a message into textual headers and an unstructured body. Each header consists of a name and a value and is terminated with a CR-LF pair. An additional CR-LF separates the headers from the body.
It is this structure that HTTP uses to indicate the length of the body for framing purposes. More formally, HTTP uses MIME, an application of the 822-style to encode both the data itself (the body) and information about the data (the headers). That is, although HTTP is commonly viewed as a retrieval mechanism for HTML[13], it is really a retrieval mechanism for objects encoded using MIME, most of which are either HTML pages or referenced objects such as GIFs.
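As an illustration, splitting an 822-style message into headers and body takes only a few lines (this sketch ignores refinements such as header continuation lines):

    def parse_822(raw: bytes):
        # Headers are "Name: value" lines ended by CR-LF; a blank line
        # separates them from the unstructured body.
        head, _, body = raw.partition(b"\r\n\r\n")
        headers = {}
        for line in head.split(b"\r\n"):
            if not line:
                continue
            name, _, value = line.partition(b":")
            headers[name.decode().strip()] = value.decode().strip()
        return headers, body

    headers, body = parse_822(b"Content-Type: text/xml\r\n\r\n<blocks/>")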
An application protocol needs a mechanism for conveying error information between peers. The first formal method for doing this was defined by SMTP's "theory of reply codes". The basic idea is that an error is identified by a three-digit string, with each position having a different significance:
- the first digit indicating success or failure, either permanent or transient;
- the second digit indicating the part of the system reporting the situation (e.g., the syntax analyzer); and,
- the third digit identifying the actual situation.

Operational experience with SMTP suggests that the range of error conditions is larger than can be comfortably encoded using a three-digit string (i.e., you can report on only 10 different things going wrong for any given part of the system). So, [14] provides a convenient mechanism for extending the number of values that can occur in the second and third positions.
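In code, the theory of reply codes makes triage trivial. A sketch, using SMTP's conventions for the first digit:

    def classify(reply: str):
        # e.g., SMTP's "250" (success) or "550" (permanent failure).
        outcome = {"2": "success",
                   "4": "transient failure",   # worth retrying later
                   "5": "permanent failure"}.get(reply[0], "other")
        subsystem = reply[1]   # the part of the system reporting the situation
        detail = reply[2]      # the actual situation: only 10 values available
        return outcome, subsystem, detail

    print(classify("550"))    # ('permanent failure', '5', '0')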
Virtually all of the application protocols we've discussed thus far use the three-digit reply codes, although there is less coordination between the designers of different application protocols than most would care to admit.
Finally, in addition to conveying a reply code, most protocols also send a textual diagnostic suitable for human, not machine, consumption. (More accurately, the textual diagnostic is suitable for people who can read a widely-used variant of the English language.) Since reply codes reflect both positive and negative outcomes, there have been some innovative uses made for the text accompanying positive responses, e.g., prayer wheels.
Few application protocols today allow independent parallel exchanges over the same connection. In fact, the more widely-implemented approach is to allow pipelining, e.g., command pipelining[15] in SMTP or persistent connections in HTTP 1.1. Pipelining allows a client to make multiple requests of a server, but requires the requests to be processed serially. (Note that a protocol needs to explicitly provide support for pipelining, since, without explicit guidance, many implementors produce systems that don't handle pipelining properly; typically, an error in a request causes subsequent requests in the pipeline to be discarded).
Pipelining is a powerful method for reducing network latency. For example, without persistent connections, HTTP's framing mechanism is really closer to connection-blasting than octet-counting, and it enjoys the same latency and efficiency problems.
In addition to reducing network latency (the pipelining effect), parallelism also reduces server latency by allowing multiple requests to be processed by multi-threaded implementations. Note that if you allow any form of asynchronous exchange, then support for parallelism is also required, because exchanges aren't necessarily occurring under the synchronous direction of a single peer.
Unfortunately, when you allow parallelism, you also need a flow control mechanism to avoid starvation and deadlock. Otherwise, a single set of exchanges can monopolize the bandwidth provided by the transport layer. Further, if the peer is resource-starved, then it may not have enough buffers to receive a message and deadlock results.
The flow control mechanism used by TCP is based on sequence numbers and a sliding window: each receiver manages a sliding window that indicates the number of data octets that may be transmitted before receiving further permission. However, it's now time for the third shoe of multiplexing to drop: segmentation. If you do flow control then you also need a segmentation mechanism to fragment messages into smaller pieces before sending and then re-assemble them as they're received.
All three of the multiplexing issues (parallelism, flow control, and segmentation) have an impact on how the protocol does framing. Earlier, we defined framing as "how to tell the beginning and end of each message"; now, in addition, we need to be able to identify independent messages, send messages only when flow control allows us to, and segment them if they're larger than the available window (or too large for comfort).
Perhaps for historical (or hysterical) reasons, most application protocols don't do authentication. That is, they don't authenticate the identity of the peers on the connection or the authenticity of the messages being exchanged. Or, if authentication is done, it is domain-specific for each protocol. For example, FTP and HTTP use entirely different models and mechanisms for authenticating the initiator of a connection. (Independent of mainstream HTTP, there is a little-used variant[16] that authenticates the messages it exchanges.)
A few years ago, SASL[17] (the Simple Authentication and Security Layer) was developed to provide a framework for authenticating protocol peers. SASL lets you describe how an authentication mechanism works, e.g., an OTP[18] (One-Time Password) exchange. It's then up to each protocol designer to specify how SASL exchanges are conveyed by the protocol. For example, [19] explains how SASL works with SMTP.
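For flavor, here's a much-simplified sketch of the idea behind one-time passwords: a Lamport-style hash chain. (The real OTP specification adds seeds, dictionary encodings, and output folding; this is the concept, not the wire protocol.)

    import hashlib

    def h(x: bytes) -> bytes:
        return hashlib.sha1(x).digest()

    def chain(secret: bytes, n: int) -> bytes:
        # Apply the hash n times; the server stores h^n(secret).
        for _ in range(n):
            secret = h(secret)
        return secret

    # Provisioning: the server stores the 100th hash of the user's secret.
    stored, remaining = chain(b"correct horse", 100), 100

    # One authentication: the client reveals h^99(secret); the server checks
    # that hashing it once yields the stored value, then moves the mark down.
    candidate = chain(b"correct horse", remaining - 1)
    assert h(candidate) == stored
    stored, remaining = candidate, remaining - 1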
A notable exception to the SASL bandwagon is HTTP, which defines its own authentication mechanisms[20]. There is little reason why SASL couldn't be introduced to HTTP, although to avoid race-conditions with the use of OTP, the persistent connection mechanism of HTTP 1.1 must be used.
SASL has an interesting feature in that in addition to explicit protocol exchanges to authenticate identity, it can also use implicit information provided from the layer below. For example, if the connection is running over IPsec[21], then the credentials of each peer are known and verified when the TCP connection is established.
HTTP is the first widely used protocol to make use of transport security to encrypt the data sent on the connection. The current version of this mechanism, TLS[22], is also available for SMTP and other application protocols such as ACAP[23] (the Application Configuration Access Protocol).
The key difference between the original mechanism and TLS is one of provisioning. In the initial approach, a world-wide web server would listen on two ports, one for plaintext traffic and the other for secured traffic; in contrast, a server implementing an application protocol that is TLS-enabled listens on a single port for plaintext traffic; once a connection is established, the use of TLS is negotiated by the peers.
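The single-port, negotiate-then-upgrade style can be sketched with Python's standard ssl module. The negotiation step itself is protocol-specific, so it's reduced here to a hypothetical agree_to_start_tls() placeholder:

    import socket
    import ssl

    def agree_to_start_tls(sock: socket.socket) -> None:
        # Hypothetical placeholder: each protocol defines its own exchange
        # by which the peers agree to begin TLS (cf. STARTTLS in SMTP).
        ...

    def secure_client(host: str, port: int) -> ssl.SSLSocket:
        sock = socket.create_connection((host, port))   # plaintext at first
        agree_to_start_tls(sock)                        # negotiate the upgrade
        ctx = ssl.create_default_context()
        return ctx.wrap_socket(sock, server_hostname=host)   # now secured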
Let's briefly compare the properties of the three main connection-oriented application protocols in use today:
    Mechanism            SMTP        FTP         HTTP
    -------------------  ----------  ----------  -------------
    Framing              Stuffing    Blasting    Counting
    Encoding             822         Binary      MIME
    Error Reporting      3-digit     3-digit     3-digit
    Multiplexing         pipelining  no          pipelining
    User Authentication  SASL        user/pass   user/pass
    Transport Security   TLS         no          TLS (nee SSL)
Note that the username/password mechanisms used by FTP and HTTP are entirely different with one exception: both can be termed a "username/password" mechanism.
These three choices are broadly representative: as more protocols are considered, the patterns are reinforced. For example, POP[24] uses octet-stuffing, but IMAP[25] uses octet-counting, and so on.
When we design an application protocol, there are a few properties that we should keep an eye on.
A well-designed protocol scales well when deployed.
Because few application protocols support multiplexing, a common trick is for a program to open multiple simultaneous connections to a single destination. The theory is that this reduces latency and increases throughput. The reality is that both the transport layer and the server view each connection as an independent instance of the application protocol, and this causes problems.
In terms of the transport layer, TCP uses adaptive algorithms to efficiently transmit data as network conditions change. But what TCP learns is limited to each connection. So, if you have multiple TCP connections, you have to go through the same learning process multiple times -- even if you're going to the same host. Not only does this introduce unnecessary traffic spikes into the network; because TCP uses a slow-start algorithm when establishing a connection, the program also sees additional latency. To deal with the fact that a lack of multiplexing in application protocols causes implementors to make sloppy use of the transport layer, network protocols are now provisioned with increasing sophistication, e.g., RED[26].
In terms of the server, each incoming connection must be dispatched and (probably) authenticated against the same resources. Consequently, server overhead increases based on the number of connections established, rather than the number of remote users. The same issues of fairness arise: it's much harder for servers to allocate resources on a per-user basis, when a user can cause an arbitrary number of connections to pound on the server.
Another important aspect of scalability is to consider the relative numbers of clients and servers. (This is true even in the peer-to-peer model, where a peer can act both in the client and server role.) Typically, there are many more client peers than server peers. In this case, functional requirements should be shifted from the servers onto the clients. The reason is that a server is likely to be interacting with multiple clients and this functional shift makes it easier to scale.
A well-designed protocol is efficient.
For example, although a compelling argument can be made that octet-stuffing leads to more elegant implementations than octet-counting, experience shows that octet-counting consumes far fewer cycles.
Regrettably, we sometimes have to compromise efficiency in order to satisfy other properties. For example, 822 (and MIME) use textual headers. We could certainly define a more efficient representation for the headers if we were willing to limit the header names and values that could be used. In this case, extensibility is viewed as more important than efficiency. Of course, if we were designing a network protocol instead of an application protocol, then we'd make the trade-offs using a razor with a different edge.
A well-designed protocol is simple.
Here's a good rule of thumb: a poorly-designed application protocol is one in which it is equally as "challenging" to do something basic as it is to do something complex. Easy things should be easy to do and hard things should be harder to do. The reason is simple: the pain should be equal to the gain.
Another rule of thumb is that if an application protocol has two ways of doing the exact same thing, then there's a problem somewhere in the architecture underlying the design of the application protocol.
Hopefully, simple doesn't mean simple-minded: something that's well-designed accommodates everything in the problem domain, even the troublesome things at the edges. What makes the design simple is that it does this in a consistent fashion. Typically, this leads to an elegant design.
A well-designed protocol is extensible.
As clever as application protocol designers are, there are likely to be unforeseen problems that the application protocol will be asked to solve. So, it's important to provide the hooks that can be used to add functionality or customize behavior. This means that the protocol is evolutionary, and there must be a way for implementations reflecting different steps in the evolutionary path to negotiate which extensions will be used.
But, it's important to avoid falling into the extensibility trap: the hooks provided should not be targeted at half-baked future requirements. Above all, the hooks should be simple.
Of course good design goes a long way towards minimizing the need for extensibility. For example, although SMTP initially didn't have an extension framework, it was only after ten years of experience that its excellent design was altered. In contrast, a poorly-designed protocol such as Telnet[27] can't function without being built around the notion of extensions.
Finally, we get to the money shot: here's what we did.
We defined an application protocol framework called BXXP (the Blocks eXtensible eXchange Protocol). The reason it's a "framework" instead of an application protocol is that we provide all the mechanisms discussed earlier without actually specifying the kind of messages that get exchanged. So, when someone else needs an application protocol that requires connection-oriented, asynchronous request-response interactions, they can start with BXXP. It's then their responsibility to define the last 10% of the application protocol, the part that does, as we say, "the useful work".
So, what does BXXP look like?
Framing looks a lot like SMTP or HTTP: there's a command line that identifies the beginning of the frame, then there's a MIME object (headers and body). Unlike SMTP, BXXP uses octet-counting, but unlike HTTP, the command line is where you find the size of the payload.
Actually, the command line for BXXP carries a lot of information; it tells you:

- the type of the message (e.g., request or response);
- whether there's more of the message to come (the continuation flag);
- the sequence number and the size of the payload (used for flow control, as discussed below); and,
- the channel number that identifies which "part of the system" the message belongs to.
Since you need to know all this stuff to process a frame, we put it all in one easy-to-parse location. You could probably devise a more efficient encoding, but the command line is a very small part of the frame, so you wouldn't get much bounce from optimizing it. Further, because framing is at the heart of BXXP, the frame format has some self-consistency checks that catch the majority of programming errors.
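To make the framing discussion concrete, here's a minimal sketch of parsing a command line that carries the fields listed above. The field names, ordering, and delimiters are invented for illustration; the normative syntax is in the BXXP specification.

    from dataclasses import dataclass

    @dataclass
    class FrameHeader:
        msg_type: str   # e.g., request or response
        more: bool      # continuation flag: more of this message to come?
        seqno: int      # sequence number, used for flow control
        size: int       # octet count of the payload that follows
        channel: int    # which "part of the system" gets the message

    def parse_header(line: bytes) -> FrameHeader:
        # One easy-to-parse location: a single space-separated command line.
        msg_type, more, seqno, size, channel = line.decode("ascii").split()
        return FrameHeader(msg_type, more == "*", int(seqno), int(size),
                           int(channel))

    # A hypothetical request frame: 120 payload octets on channel 1, with
    # more frames of the same message still to come.
    hdr = parse_header(b"REQ * 0 120 1")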
Another trick is in the headers: because the command line contains all the framing information, the headers may contain minimal MIME information (such as Content-Type). Usually, however, the headers are empty. That's because the BXXP default payload is XML[28]. (Actually, a "Content-Type: text/xml" with 8-bit encoding).
We chose XML as the default because it provides a simple mechanism for nested, textual representations. (Alas, the 822-style encoding doesn't easily support nesting.) By design, XML isn't optimized for compact representations. That's okay, because we're focusing on loosely-coupled systems, and besides, there are efficient XML parsers available.
We use 3-digit error codes.
In addition, the response message to a request is flagged as either positive or negative. This makes it easy to signal success or failure and allow the receiving peer some freedom in the amount of parsing it wants to do on failure.
Despite the lessons of SMTP and HTTP, there isn't a lot of field experience to rely on when designing the multiplexing features of BXXP. Here's what we did: frames are exchanged in the context of a "channel". Each channel has an associated "profile" that defines the syntax and semantics of the messages exchanged over a channel.
Channels provide both an extensibility mechanism for BXXP and the basis for multiplexing. Remember the last parameter in the command line of a BXXP frame? The "part of the system" that gets the message is identified by a channel number.
A profile is defined according to a "Profile Registration" template. The template defines how the profile is identified (using an XML DTD), what kind of messages get exchanged during channel creation, what kind of messages get sent in requests and responses, along with the syntax and semantics of those messages. When you create a channel, you identify a profile and provide some arguments. If the channel is successfully created, you get back a positive response; otherwise, you get back a negative response explaining why.
Perhaps the easiest way to see how channels provide an extensibility mechanism is to consider what happens when a connection is established. The BXXP peer that accepted the connection sends a greeting on channel zero identifying the profiles that it supports. (Channel 0 is used for channel management; it's automatically created when a connection is opened.) If you want transport security, the very first thing you do is to create a channel that negotiates transport security, and, once the channel is created, you tell it to do its thing. Next, if you want to authenticate, you create a channel that performs user authentication, and, once the channel is created, you tell it to get busy. At this point, you create one or more channels for data exchange. This process is called "tuning"; once you've tuned the connection, you start using the data exchange channels to do "the useful work".
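The tuning sequence can be summarized with a small client-side sketch. Everything here is hypothetical stand-in code (the actual exchanges are messages on channel zero, as described above), but it shows the ordering: security, then authentication, then the working channels.

    class Channel:
        # Hypothetical stand-in for a BXXP channel bound to a profile.
        def __init__(self, profile: str):
            self.profile = profile

    class Connection:
        # Hypothetical stand-in for a BXXP connection; channel zero is
        # created automatically and carries the greeting.
        def __init__(self, advertised_profiles):
            self.profiles = advertised_profiles
        def start_channel(self, profile: str) -> Channel:
            if profile not in self.profiles:
                raise RuntimeError("peer sent a negative response")
            return Channel(profile)

    # Tuning: secure the connection, authenticate, then get to work.
    conn = Connection(["TLS", "SASL/OTP", "my-app"])
    conn.start_channel("TLS")             # negotiate transport security first
    conn.start_channel("SASL/OTP")        # then authenticate the user
    work = conn.start_channel("my-app")   # then do "the useful work"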
The first channel that's successfully started has a trick associated with it: when you ask to start the channel, you're allowed to specify a "service name" that goes with it. This allows a server with multiple configurations to select one based on the client's suggestion. (A useful analogy is HTTP 1.1's "Host:" header.) If the server accepts the "service name", then this configuration is used for the rest of the connection.
To allow parallelism, BXXP allows you to use multiple channels simultaneously. Each channel processes requests serially, but there are no constraints on the processing order for different channels. So, in a multi-threaded implementation, each channel maps to its own thread.
This is the most general case, of course. For one reason or another, an implementor may not be able to support it, so BXXP allows for both positive and negative responses when a request is made. Thus, if you want the classic client-server model, the client program should simply reject any requests made by the server. This effectively throttles any asynchronous messages from the server.
Of course, we now need to provide a flow control mechanism and segmentation. For the former, we just took the mechanism used by TCP (sequence numbers and a sliding window) and used that. It's proven, and can be trivially implemented by a minimal implementation of BXXP. For the latter, we just put a "continuation" or "more to come" flag in the command line for the frame.
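A sketch of the sender's side of this scheme: the receiver's advertised window governs how much may be sent, and oversized messages are segmented into frames, all but the last carrying the continuation flag. (The names are invented for illustration.)

    def segment_and_send(payload, seqno, available_window, max_frame, send_frame):
        # Send as much of 'payload' as the window allows, in frames no
        # larger than max_frame; any remainder must wait until the peer
        # opens the window again.
        sent = 0
        while sent < len(payload) and available_window > 0:
            size = min(max_frame, available_window, len(payload) - sent)
            more = (sent + size) < len(payload)   # "more to come" flag
            send_frame(seqno + sent, more, payload[sent:sent + size])
            sent += size
            available_window -= size
        return seqno + sent, payload[sent:]   # next seqno, unsent remainder

    def show(seq, more, piece):
        print(seq, "more" if more else "last", piece)

    next_seq, rest = segment_and_send(b"0123456789", 0, available_window=8,
                                      max_frame=4, send_frame=show)
    # Frames sent: (0, more, b'0123') and (4, more, b'4567'); b'89' waits
    # for the window to reopen.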
We use SASL. If you successfully authenticate using a channel, then there is a single user identity for each peer on that connection (i.e., authentication is per-connection, not per-channel). This design decision mandates that each connection correspond to a single user regardless of how many channels are open on that connection. One reason why this is important is that it allows service provisioning, such as quality of service (e.g., as in [29]) to be done on a per-user granularity.
We defined BXXP profiles for the most commonly used SASL mechanisms: OTP and Anonymous.
We use TLS. If you successfully complete a TLS negotiation using a channel, then all traffic on that connection is secured (i.e., confidentiality is per-connection, not per-channel, just like authentication).
We defined a BXXP profile that's used to start the TLS engine.
We purposefully excluded two things that are common to most application protocols: naming and authorization.
Naming was excluded from the framework because, outside of URIs[30], there isn't a commonly accepted framework for naming things. To our view, this remains a domain-specific problem for each application protocol. So, when an application protocol designer defines their own profile to do "the useful work", they'll have to deal with naming issues themselves. BXXP provides a mechanism for identifying profiles and binding them to channels. It's up to you to define the profile and use the channel.
Similarly, authorization was explicitly excluded from the framework. Every approach to authorization we've seen uses names to identify principals (i.e., targets and subjects), so if your framework doesn't include naming, it can't very well include authorization.
Of course, application protocols do have to deal with naming and authorization; those are two of the issues addressed by the application protocol designer when defining a profile for use with BXXP.
So, how do you go about using BXXP?
First, get the specification[32] and read it. Next, define your own profile. Finally, get a TCP port number for your protocol and start implementing.
The BXXP specification defines five profiles itself: a channel management profile, three user authentication profiles, and a transport layer security profile. These provide good examples. Of course, we've been using BXXP internally for a year now, so if you want to look at a rather detailed profile definition, check out the Blocks Simple Exchange[31] profile. It addresses the issue of naming for its application domain, and, in doing so, opens the door for authorization.
Since we published BXXP as an Internet-draft for comments we've gotten some pretty good feedback. Prior to the next meeting of the IETF, we'll be publishing an updated specification along with a couple of open source implementations. So, you might want to wait a few weeks before you start implementation.
Marshall T. Rose
Invisible Worlds, Inc.
1179 North McDowell Boulevard
Petaluma, CA 94954-6559
US

Phone: +1 707 789 3700
EMail: mrose@invisible.net