Fast Cryptoserver
We have designed and built a fast "cryptoserver" for public-key operations. Our cryptoserver makes an expensive cryptographic operation – modular exponentiation – fast and cheap. The cryptoserver is equipped with a number of hardware cryptographic accelerators, which can be shared among a large number of clients. This sharing allows the cost of the acceleration hardware to be amortized over a large number of client machines, and allows even clients who perform very few cryptographic operations to benefit from hardware speedups before such time as accelerators are cheap enough to be present in every machine. We have built interfaces to the cryptoserver and made its services available on our network.
This approach allows the service to operate securely and efficiently is to trade expensive cryptographic operations for cheap ones, and to bootstrap many expensive cryptographic operations performed frequently on the cryptoserver by means of few expensive operations performed rarely on the client. Clients secure their communications with the cryptoserver using symmetric encryption (3DES). The keys used to provide this secure channel are set up as part of a long-lived security association between client and cryptoserver -- for the cost of one possibly expensive key exchange on the client and (inexpensive) symmetric encryption, the client gets back many cryptographic operations performed cheaply on the server. Under reasonable assumptions, the amortized cost (including network bandwidth, hardware, development, etc. of a 1024 bit modular exponentiation is .0004 cents.
Network-attached cryptographic acceleration has been used only in special cases so far. Rainbow Technologies has sold products in their CryptoSwift EN line for several years, but its network connectivity is not secured. This makes it only appropriate for use on trustworthy networks. In contrast, our approach is suitable for deployment on any network with suitable availability -- 1000 1024-bit RSA operations per second requires approximately 4 Mbit/s of bandwidth; hardly a problem with common 100 Mbit/s Ethernet infrastructure.
Applications
We can identify three types of applications for fast cryptoservers: simple decryption/signature, bulk decryption/signature, and sophisticated cryptographic protocols.
In the first type of application, the client needs to perform a single cryptographic operation. Instead of performing it locally, the client sends a request to the cryptoserver to have it done remotely. Note that the client must trust the cryptoserver with knowledge of his private key. If the client is unwilling to trust the cryptoserver to perform the computation correctly, then there are a few precautions that could be taken. If the client’s public key is low-exponent RSA, then verifying the computation is much less costly than performing it directly. This verification could be performed by the client himself, or shipped off as a request to a second cryptoserver. Alternatively, the original request could be shipped off to two or more independent cryptoservers as a way to perform quality checks on the output. A client unwilling to trust a single cryptoserver with knowledge of its private key may use standard secret-sharing techniques to separate its key into multiple parts, and send each part to a different cryptoserver. The client then (relatively inexpensively) can recombine the results returned by all the servers to produce a valid decryption. Only if all the cryptoservers so used combine their information will they be able to recover the client’s private key. Lastly, the client may be unwilling to trust the cryptoserver with knowledge of the data, i.e., the message to be signed or the ciphertext to be decrypted. This can be accommodated by a straightforward application of "blinding." Note that blinding is particularly efficient for RSA signing operations, such as those used in the SSL/TLS key exchange. Clients unwilling to trust a "public" cryptoserver might still benefit by sharing such a server among machines who already mutually trust one another, thus placing the server within their effective security perimeter -- e.g., a number of web servers servicing the same web site might effectively share one cryptoserver.
The second type of application is similar to the first, except that many requests are sent to the cryptoserver simultaneously. These can certainly be treated as independent simple requests. However, there are a number of reasons to treat this type of application separately. One is that there are often amortized efficiency gains that can be realized by the cryptoserver when processing a number of related cryptographic operations, i.e., different keys processing identical data, or identical keys processing different data. Second, we believe that bulk encryption and signature using one or more fast cryptoservers will turn out to be quite effective in certain (bandwidth-unconstrained) multicast scenarios, such as secure communication for dynamic coalitions.
The third type of application exploits the same underlying cryptographic operations that are performed for encryption and signature, but in order to achieve more sophisticated security goals. For example, there are a number of security protocols for which the degree of privacy scales nearly linearly with the computational burden. A fast cryptoserver can perform the necessary computations on the user’s behalf. The cryptoserver could be owned by a security service, or it could be an independent service that sub-contracted the privacy-preserving computation. The cryptoserver needs to perform only ordinary public key operations computations that are exactly what it computes when it signs and decrypts. An example is private information retrieval where the querier to a database wishes to hide his query from the database by, in effect, hiding it among a large number of other queries. One solution for this [Cachin, Micali, Stadler, "Computationally private information retrieval with polylogarithmic communication", Eurocrypt ‘99] requires the querier and the database to each perform about m modular exponentiations, and reduces the communication to one round of messages of size log m. This example has many practical applications, such as allowing a user to get a real-time stock quote without revealing which stock he is interested in, or query a patent database without leaking the query to a competitor. It is possible to extend this approach to hide all information about the queries from the fast cryptoserver as well. In fact, the calls to the fast cryptoserver can be made independent of the actual queries that will be made. When the computations are data-independent pre-computations, we say that the computations are "generic."
A second example is group authentication. A participant proves to be one of a plausible set of m possible participants without revealing which, where m is a tunable parameter. There is also a non-interactive version of this called a group signature. Existing cryptographic solutions require the participant and the verifier to each perform about m modular exponentiations, and exchange one round of communication of size log m. The group authentication and group signature schemes can be computed in a way that is data-dependent and blindable. That is, sensitive information can be concealed from the fast cryptoserver, but cannot be performed as a generic pre-computation
Implementation
The implementation challenge is to efficiently export the computational resources of a cryptographic accelerator to the network in a secure fashion.
We built our cryptoserver out of generally available hardware, restricting our development efforts to custom software. Our initial prototype uses a Sun Ultra-10 workstation running Solaris 7 with an Atalla AXL200 accelerator. The AXL200 is rated at 236 1024-bit private key RSA operations per second. Our design cleanly factors the interface to the cryptographic accelerator from the rest of the software, so that multiple heterogeneous accelerator boards can be supported concurrently. Our software hides, as much as possible, the differences in the various accelerator hardware. We chose the Sun workstation because several cryptographic accelerator vendors support Solaris on SPARC hardware.

Figure 1: Cryptoserver Specification
A production cryptoserver would differ from the present configuration in three major aspects:Multiple public key accelerators would be used, for a dramatic throughput increase. While these enhancements would undoubtably increase performance, our present system, with very modest hardware, already performs acceptably well. Our software has been built with scalability in mind.
Software Architecture
Our architecture was designed to meet a number of important goals:
There are many possible choices for the software to meet these goals. We explain our choice of a communications substrate for the implementation. We describe the interface that the cryptoserver presents to its clients. Finally, we explain our choice of software architecture.
Middleware
We implemented our system on top of Sun’s Transport Independent Remote Procedure Call (TI-RPC) middleware. We chose TI-RPC because:
We desired to use UDP because datagrams are a much more natural match for the RPC paradigm than connection-oriented transports (such as TCP), and because a UDP-based solution obviated the need for complicated connection pool management logic, as we desire to scale the number of clients past the number of socket descriptors available in a single UNIX process (typically on the order of 1024). UDP also minimizes transport-related overhead for clients who may make infrequent calls to the cryptoserver. Other middleware choices, most notably CORBA, had one or more substantial gaps in supporting multithreaded applications over encrypted datagram transports.
A major advantage of this choice of middleware is the availability of RPCSEC GSS, an interface to the GSSAPI. GSSAPI is a pluggable security API, allowing a consistent interface to a variety of different authentication and encryption technologies. It is one of the few security technologies to explicitly support negotiation of security associations over connectionless transports (though we could negotiate security associations out-of-band over a connected transport), and which is capable of securing communications over such transports. Most importantly, RPCSEC GSS naturally supports the idea of leveraging long-term security associations to secure RPC-based requests with a minimum of per-request overhead.
Another advantage of TI-RPC is that it is compatible on the wire with Sun’s ONC RPC, a widely deployed RPC protocol that is at the heart of NFS. As such, ONC RPC implementations are available on a wide variety of platforms. This is due, in part, to Sun’s making source code to a reference implementation available under liberal licensing terms. Sun recently made TI-RPC source code available under a liberal license, and as NFS v4 will be built on TI-RPC, we expect it to widely ported over the next few years.
Security Association Negotiation
There are a variety of approaches to generating security associations between client and cryptoserver. The very simplest is a pure key exchange (e.g. Diffie-Hellman) in order to produce a shared symmetric key used to encrypt further communication between client and server. This is the approach we took in our initial implementation, as we are not measuring key exchange performance. In a production server, we anticipate that session keys will be generated every 1—24 hours per client in actual use. If one uses a little care to make sure that keys expire uniformly across an hour, even with 10,000 clients and 1 hour session expirations, this implies 3 key agreements (i.e., modular exponentiations) per second, or 1% of the capacity of a single AXL200.
To produce the benchmark numbers given below, we used the 192-bit Diffie-Hellman key agreement mechanism available with the distribution of TI-RPC. While this provides woefully inadequate security for production use, it is sufficient for the demonstration presented here.
For our symmetric cipher, we are using triple-DES. The performance of our overall system depends on the choice of symmetric cipher. Using triple DES is a conservative choice as almost all other ciphers offer better performance. We anticipate moving towards AES when the final selection has been made.
A variety of interesting approaches to negotiating security associations are open to us. In the minimal case, a client would like to have assurance that the machine it is communicating with is indeed a trustworthy cryptoserver. Therefore, the server must be able to authenticate itself to the client. If the service is freely available, the client need not authenticate itself to the server (and indeed may want to remain anonymous). In order to provide this base level of functionality, we intend to use a public-key based GSS mechanism (such as SPKM) and use a Certification Authority trusted by all clients to certify cryptoservers as such.
Further variants on authentication mechanisms would use client authentication to control access to a cryptoserver. PKI-based or Kerberos-based authentication mechanisms could be used to identify clients authorized to use the cryptoserver. Forms of digital cash could be used to allow clients to pay for cryptographic operations by both number of operations or quality of service (e.g., speed, latency, etc) -- clients could set up an account as part of security association negotiation, or could include payment tokens on a per-request basis.
Client Interface to the Cryptoserver
The client interface to the cryptoserver is designed to both allow sophisticated clients to most effectively use one or more such servers while minimizing network-related overhead, and to make it easy to incorporate cryptoserver support into legacy client packages such as OpenSSL and Microsoft’s CryptoAPI with no changes required by programs which then in turn use those packages. The interface is also designed to allow requests to be passed through to the cryptographic hardware with a minimum of copying, and to be broken up in a variety of ways to efficiently use any sort of accelerator hardware we may incorporate in the cryptoserver.
The interface to the cryptoserver is written in Sun’s rpcgen RPC specification language. In the figure below, we show a slimmed down version of specification; non-essential details about data types and benchmarking support have been removed. RPCMODEXP is a simple modular exponentiation. RPCMODEXPARRAY provides a more efficient way to encrypt multiple values with the same key. RPCRSAPCRTEXP and RPCRSACRTARRAY are the corresponding calls, but using Chinese Remaindered exponentiation. Finally, RPCMULTIMODEXPARRAY provides a more communication efficient mechanism for raising multiple bases to multiple powers (modulo the corresponding moduli).
program QCS_RPC_PROG {
version QCS_RPC_VERS {
QCS_value_res RPCMODEXP(QCS_mod_exp_coef, QCS_bignum) = 1;
QCS_val_array_res RPCMODEXPARRAY(QCS_mod_exp_coef, QCS_bignum_array) = 2;
QCS_value_res RPCRSAPCRTEXP(QCS_rsa_private_key, QCS_bignum) = 3;
QCS_val_array_res RPCRSACRTARRAY(QCS_rsa_private_key,
QCS_bignum_array) = 4;
QCS_val_array_res RPCMULTIMODEXPARRAY(QCS_mod_exp_coef_array,
QCS_bignum_array) = 5;
int RPCGETMAXMODULUSLEN(void) = 7;
} = 1;
} = 0x20000105;
Figure 2: Cryptoserver Architecture
Server Program
We implemented the cryptoserver in C++. From a security point of view, we would prefer to implement the server in a safe language, such as Java. Unfortunately, there are four problems with this:
- Lack of suitable middleware;
- Poor performance of present Java compilers;
- Difficulty of efficient interfacing to vendor libraries written in C;
- Lack of a complete Unix system call interface.
These restrictions narrowed the choice of language to C or C++.
In order to simplify the server, we use "shim" libraries to normalize the interface the server sees to each type of cryptographic hardware accelerator present in the system. This allows the server to transparently support a heterogeneous collection of hardware accelerators. It also lets us remove from the server any task-specific logic, and to compensate for any differences between the features supported by the accelerator drivers and the interface we present to cryptoserver clients. For instance, these shim libraries where necessary normalize byte order, handle support for negative numbers, exponents larger than the modulus, etc. Many hardware accelerators (and indeed cryptographic software packages) have intermittent or no support for such inputs, as they don’t occur when using standard RSA. However, as technologies like the cryptoserver reduce the cost of modular exponentiations, cryptographic algorithms and protocols considered too costly and complex for practical use will be used, and these algorithms do not respect these narrow limits. Each individual hardware "shim" is responsible for any initialization required by the hardware it manages, and can provide information to the server about the capabilities of that hardware (e.g., supported modulus lengths, whether the hardware driver supports features like negative numbers directly or the shim is providing that feature, etc.). Such information can be used by the server for more sophisticated scheduling of work items on particular accelerators.
In order to make effective use of the hardware, and maximize scalability, we can configure the number of threads the server uses to manage each cryptographic accelerator. This depends both on the inherent parallelism of each accelerator board, and on the architecture of the driver and any vendor libraries we use to interface with it. In the current implementation, to best support the Atalla board, the server is designed to allocate one thread per cryptographic processing unit. The server also allocates a set number of threads for the RPC subsystem. We preallocate a number of work items, which are placed into the idle queue. As cryptographic requests come in, they are decrypted and placed in a work item, which is moved onto the work queue. After the cryptographic accelerator has processed the request, the work item is moved onto the reply queue. The reply is then encrypted before being sent. We use multiple threads to process requests and replies to prevent either stage from being a bottleneck.
An additional important optimization is that the RPCs that take array arguments are broken into multiple work items as they are placed on the work queue. This enables the separate operations to occur in parallel given our current cryptographic hardware. Of course, the RPC cannot return until all the results are available. This is implemented by making one work item be canonical per RPC request, and not moving the canonical work item onto the reply queue until all the operations are finished.
Benchmarks
Our initial implementation of the cryptoserver is built around a Sun Ultra-10 workstation, containing a 440 MHz UltraSparc Iii processor, with an Atalla AXL200 cryptographic accelerator. The Ultra-10 scores 18.1 on SPECint 95, and the AXL200 is rated at 236 1024-bit RSA operations per second. For benchmarking purposes, we ran cryptoserver clients on an equivalently configured Sun (sans the Atalla board), connected via switched 100 Mbit/s Ethernet.
We wrote a microbenchmark that repeatedly performs a modular exponentiation using the Atalla-supplied libraries, and compares the result to the correct value. We then executed the same computation on a client of the cryptoserver. First, we report the single threaded case. We report numbers both with and without securing the wire with triple DES. The numbers we report are averages; in all cases the variance was negligible. Here are the benchmark data:
Machine Threads Latency (ms) Throughput (ops/s)
Local accelerator 1 91.5 10.9
Remote cryptoserver (insecure) 1 92.0 10.9
Remote cryptoserver (secure) 1 93.4 10.7
Without encryption, our software delivers the full throughput of the accelerator. Adding triple-DES incurs a 2% reduction in throughput.
While the single threaded numbers show that the AXL200 has a large latency, it has 24 modular exponentiators that can run in parallel. Many applications of the cryptoserver (e.g., supporting SSL) are more concerned with throughput than latency. For the cross machine cases, the client is configured to make requests across 50 threads. Here are the benchmark data:
Machine Threads Throughput (ops/s)
Local accelerator 24 261
Local accelerator 40 265
Remote cryptoserver (insecure) 24 259
Remote cryptoserver (secure) 24 259
Remote cryptoserver (insecure) 40 265
Remote cryptoserver (secure) 40 265
We note that although 24 threads should be enough to saturate the AXL200, it practice it is not. Adding threads above 40 does not result in increased performance. We also note that we are exceeding the rated speed of the board, however, we do not know what system the board was measured in. We also note that our exponent has a weight of 511 (i.e., there are 511 ‘1’ bits in the exponent), and the time per operation depends on the weight of the exponent. Again, we do not know the weight of the exponent used to in the official measurement. The important point is that the penalty for using a cryptoserver is minimal to nonexistent.
Future Work
Viewing cryptography as a network service changes our perspective about the costs of cryptography. Cryptography is no longer computationally prohibitive; it is only an RPC away. We are building applications using abundant public key operations, including secure communication services for dynamic coalitions, private database retrieval, and others. We also plan to build cryptoserver clients implementing standard cryptographic APIs such as PKCS 11, Microsoft’s CryptoAPI, and the Java Cryptography Environment. This will allow legacy applications to seamlessly take advantage of the cryptoserver.
A remaining challenge is to see how well our software architecture scales. While the software was designed with scalability in mind (e.g., using a fixed thread pool, accommodating inter- and intra-request parallelism, multiple request and reply handler threads to spread the symmetric cryptographic load, etc.), the proof will be actually running thousands of RSA operations per second on a suitable machine.
In our implementation, each request includes the client’s private key. Alternatively, if the cryptoserver already knows the client’s private key, then the request may include an authentication token that demonstrates who the request is coming from and that the request is fresh. While this would require a more trustworthy cryptoserver, it would reduce the network bandwidth required by nearly half.
The cryptoserver offers interesting options for those paranoid about their cryptographic operations. As our server supports a heterogeneous collection of hardware accelerators running concurrently, it would be a fairly simple modification to use one accelerator to double check the results delivered by another. By using different accelerators, a single accelerator could not produce a doctored result along with a doctored "inverse." The tradeoff between paranoia and throughput could be easily managed by checking a user-selectable fraction of results. By selecting hardware accelerators designed and manufactured in disjoint countries, no single government would be in a position to compromise a RSA operation. Such a system would be highly resistant to many attacks.
A similar level of paranoia is available to clients: it is easy for the client code to issue RPCs to more than one server. One might, for example, use servers operated by different organizations, or servers physically in multiple countries.
Summary
We have demonstrated that public key cryptography can be provided as a service over untrusted networks. This architecture has many advantages: it offloads work from clients, it allows greater utilization of cryptographic accelerators by sharing them among many clients, and it has acceptably small performance overhead. In addition, it enables new security applications that were previously considered too costly. Our implementation consists of customized software on top of generally available hardware. Benchmark data indicate that our approach is fast and effective. Hardware trends and other factors indicate that our approach will be increasingly attractive over time.