XML Computation Trees

Every computer science undergraduate program in the world has two important foundation courses: data structures and algorithms. Open any book on these subjects and you'll see immediately that almost a third of it is devoted to graphs. Graphs are used to model a very large number of real-world problems: the traveling salesman problem, efficient routing of a package, network flows, and more - all are modeled as graphs and often solved by graph-based algorithms.

A common use of a graph-based representation is the computation graph. Simply put, it's a graph that models a set of functions F1, F2, ..., Fk that take a set of inputs I1, I2, ..., In and compute a set of outputs O1, O2, ..., Om. As an example, Figure 1 shows a simple computation graph with a small number of nodes. This simple example represents the following equations:

O1 = F5(F1(I1, I2), F4(F1(I1, I2), F2(I2, I3)))
O2 = F6(F4(F1(I1, I2), F2(I2, I3)), I4)
O3 = F3(I3, I4)

Note that by managing the computation as a graph traversal you can eliminate computing F4(F1(I1, I2), F2(I2, I3)) twice. The nodes modeling both F5 and F6 access the value computed by the node representing F4.
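
To make that reuse concrete, here is a minimal sketch of how a node in such a graph might cache its value so that a shared node like F4 is evaluated only once per recomputation; the Node and Function types are hypothetical illustrations, not part of any particular framework.

import java.util.List;

// Hypothetical sketch: each node computes its function over the values of its
// input nodes and caches the result, so a node shared by two parents (like F4,
// which feeds both F5 and F6) is computed only once per recomputation cycle.
interface Function {
    double apply(double[] args);
}

class Node {
    private final Function function;   // null for pure input nodes (I1, I2, ...)
    private final List<Node> inputs;
    private final double inputValue;   // used only by input nodes
    private Double cached;             // cleared when new inputs ripple in

    Node(double inputValue) {                      // input node
        this.function = null;
        this.inputs = List.of();
        this.inputValue = inputValue;
    }

    Node(Function function, List<Node> inputs) {   // computation node (F1, F2, ...)
        this.function = function;
        this.inputs = inputs;
        this.inputValue = 0.0;
    }

    double value() {
        if (function == null) return inputValue;
        if (cached == null) {                      // compute once, then reuse
            double[] args = new double[inputs.size()];
            for (int i = 0; i < args.length; i++) args[i] = inputs.get(i).value();
            cached = function.apply(args);
        }
        return cached;
    }

    void invalidate() { cached = null; }           // call when an input changes
}

Wiring O1 from the equations above then means giving F5 the same F1 and F4 node instances that F6 and F4 refer to, so the sharing happens by reference rather than by recomputation.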

Computation graphs arise in many environments and tend to be the preferred way to model very complex computations and dependencies. In fact, most computationally intensive environments use graph-based computation structures. The most prominent example of such environments occurs in financial services, including computer-based portfolio management, computer trading, pricing systems, risk management systems, hedge systems, and insurance quoting systems. These structures also arise in contract management systems, constraint-based optimization, impact analysis, root cause determination, and simulation systems. The reason is that a graph is the perfect way to model the dependencies between computation elements as well as the perfect metaphor to program to.

Most current systems managing computation graphs make use of proprietary environments to model and manage such structures. In this article I'll walk you through a different approach, one that makes use of XML to represent computation graphs. The approach is based on an XML representation that can be handed to computation engines that need almost no shared context and that cooperate to compute very large computation structures. While using XML instead of proprietary methods would seem to introduce overhead and inefficiency, this is offset by the fact that the full computation graph is partitioned so that many CPUs can work in tandem to compute the full result set. The approach has additional advantages in the maintainability of the system and the total cost of ownership.

Since the full framework involves some very advanced, complex, and sometimes confidential graph-partitioning methods, this article focuses on the general framework and stays away from the mathematical aspects of the approach.

Graphs and Trees
Data structures representing graphs (and hierarchies, which are also graphs) have been in use since the dawn of computers. For example, the major database technologies before relational databases were network and hierarchical databases. Efficient algorithms on graphs are certainly one of the most researched areas within computer science, and the need for such algorithms stems from every business vertical that uses computing systems. This isn't a passing trend; on the contrary, graph-based representations have become more and more prominent over the years. The current object-oriented development paradigm is based on objects, and it's most natural to represent them as a graph in which the nodes are the objects and the edges are the object references. In fact, object-oriented databases use a native graph-based representation, and hierarchies (both type and instance hierarchies) are best managed as graphs and sometimes as trees (a special family of graphs).

XML also follows this pattern. It organizes data according to a containment hierarchy, and any XML document can be viewed as a containment tree. More important in the context of this article, any tree can be encoded as an XML document, and graphs can be encoded in XML by adding attributes that represent references.
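
As a minimal illustration (the element names and the ref attribute are hypothetical, not a fixed vocabulary), containment expresses the tree edges directly, and a reference attribute is enough to express the extra edges that make the structure a graph:

<!-- Containment models tree edges; the ref attribute models a graph edge
     pointing back to a node that is already defined elsewhere. -->
<F5>
  <F1 id="f1"> ... </F1>
  <F4>
    <F1 ref="f1"/>   <!-- reuses the F1 node defined above -->
    <F2> ... </F2>
  </F4>
</F5>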

Formally, a graph G = (V, E) consists of a set of nodes (or vertices) {V1, V2, ..., Vn} and a set of edges {E1, E2, ..., Em}, where each edge Ei = {Vk, Vl} connects two nodes Vk and Vl in the node set. A directed graph is one in which each edge has a source node and a target node; in this case an edge is sometimes written as Ei = <Vk, Vl>. Computation graphs are always represented by directed graphs in which the edges represent the computation dependencies (i.e., the value computed by the source node is an input to the computation of the target node).

Graphs can have cycles. A cycle within a directed graph is a set of edges E1, E2, ..., Ek where for each i the target node of Ei is the same as the source node of Ei+1, and the target node of Ek is the same as the source node of E1. Cycles are clearly a bad thing for a program traversing the computation graph since the cycle implies an infinite loop. Therefore, most graphs managed within computer systems, and all computation graphs, tend to be directed acyclic graphs (DAGs), meaning that there's no directed cycle within the graph.
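
As an illustration only (not part of the framework described later), an engine loading a computation graph might verify the DAG property with a standard depth-first search before attempting any traversal; the adjacency-map representation here is a hypothetical stand-in for however the graph is actually stored.

import java.util.*;

// Minimal DFS cycle check over a directed graph held as an adjacency map:
// returns true if the graph contains a directed cycle and therefore cannot
// serve as a computation DAG.
class CycleCheck {
    static boolean hasCycle(Map<String, List<String>> edges) {
        Set<String> done = new HashSet<>();     // fully explored nodes
        Set<String> onPath = new HashSet<>();   // nodes on the current DFS path
        for (String node : edges.keySet()) {
            if (visit(node, edges, done, onPath)) return true;
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> edges,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) return true;   // back edge: a cycle
        if (done.contains(node)) return false;
        onPath.add(node);
        for (String next : edges.getOrDefault(node, List.of())) {
            if (visit(next, edges, done, onPath)) return true;
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }
}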

A tree is a graph in which every node except the root has exactly one parent node. Trees are also heavily used as computation structures; because the computation dependencies are more limited, the propagation of values through the structure is simpler and faster. In many cases a computation DAG can be reduced to a computation tree by creating duplicate nodes, a common practice in efficient computation solutions.

Applications of Computation Graphs
Computation graphs are heavily used in many domains. If you dig deep enough, you'll probably find a computation graph at the heart of any system that includes complex mathematical calculations and computational dependencies. Some of these domains have already been mentioned, but I'll elaborate a bit on two that I've seen.

The first involves constraint optimization. Imagine that you're building a system that calls for optimal assignment of resources (e.g., workers or crews) to tasks. Each task has various attributes, such as where it has to be performed, in what time frame, and what skills are required. Given a set of tasks and resources, it's your job to create the best possible schedule for your workforce to perform all tasks within the defined constraints.

There are many approaches for performing such optimizations, and the use of computation graphs may not be immediately obvious. Computation graphs come into play because optimization problems such as that described above need constant recalculation. This is often called continuous optimization. The problem is that as time goes by reality changes and constraints need to be reapplied. New tasks are created, tasks may be completed ahead of schedule, resources can become unavailable.

All this creates a reality in which it's not good enough to compute a good schedule once a day. The schedule needs continuous updating in real time as new information comes in. To do this efficiently, a computation graph has to be maintained to enable continuous reoptimization. New inputs ripple through the graph, updating the schedule, thus avoiding recomputing from scratch.

The key is real-time recomputation, the common thread in many of today's advanced uses of computation graphs. The second example is also heavily biased toward real-time recomputation. The domain is real-time pricing and real-time risk calculations for financial products and portfolios. Today's investment banks create and manage highly customized and very complex financial instruments. This is not about stocks, and not even about simple derivatives such as options and swaps. It sometimes involves hybrid derivatives, or synthetics. Investment banks take positions in one instrument or portfolio in order to make a commission or a margin, but don't want to carry the risk. In such a scenario they will hedge the portfolio with another instrument or portfolio that can be correlated with the original one. All of these instruments and portfolios can be represented as computation graphs managed for both pricing and risk management purposes. The inputs to these computation graphs include, for example, spot interest rates, exchange rates, and stock prices.

There are two important characteristics of these graphs: (1) they need to be very flexible and dynamic to allow traders to create the best instruments for the bank, and (2) they're very sensitive to constantly changing conditions. As an example, risk is sensitive to many factors, one of which is time itself. Because time is continuously changing, risk calculations can potentially require continuous recomputations that can trigger adjustments to a portfolio.

It's not unusual to see computation graphs with thousands and even tens of thousands of nodes that need to be recomputed in real time. Such computation graphs place a huge requirement on computation power. It's typical to see such computation engines run on very powerful (and expensive) Unix machines with a large number of CPUs. It's still never enough. These systems also tend to be huge memory hogs, requiring 2 and even 4GB of memory to hold the graph structure in memory and perform the computations.

The nature of such environments is that they always require more CPUs and more memory than is readily available. Some sample causes follow.

  • The computation graphs are truly complex. There's no way to get around that. The functions in the nodes are elaborate and often require complex computations including solving partial differential equations and performing iterative "guessing."

  • Nodes in the computation graph often involve Monte Carlo simulation. Computing a value is often not a single deterministic evaluation; it requires sampling from a probability distribution, running many (up to many thousands of) trials, and averaging the results to produce a single value (a minimal sketch appears after this list).

  • Traders and quantitative analysts (quants) often need to work off a single computation graph but create permutations of it to simulate a certain environment or instrument. In effect, they create copies of subgraphs and can quickly multiply the computation burden.
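
To make the Monte Carlo point concrete, a node's single reported value may itself be the average of thousands of randomized runs, as in this hypothetical sketch (the payoff function stands in for whatever the node actually prices):

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of a Monte Carlo node: draw many samples from an input
// distribution, evaluate a payoff for each, and report the average as the
// node's single computed value.
class MonteCarloNode {
    static double estimate(DoubleUnaryOperator payoff, int runs, long seed) {
        Random rng = new Random(seed);
        double sum = 0.0;
        for (int i = 0; i < runs; i++) {
            double sample = rng.nextGaussian();        // e.g., a normal shock
            sum += payoff.applyAsDouble(sample);
        }
        return sum / runs;
    }
}

Because the individual runs are independent of one another, such nodes are natural places to split work across machines, which is why the partitioning discussion below singles them out.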

One way to overcome the problem is to use a large set of computers on a network for real-time recomputation of such graphs. While this has been recognized by many, it hasn't been easy to do. The main reason is that most environments doing such calculations are highly proprietary and often include a home-grown object database and a proprietary communication and distribution layer. While such proprietary environments have proved themselves in operation, they've also proved to be very expensive to maintain, and the total cost of ownership (including the need to continuously maintain and improve the proprietary environment) tends to be so high that only investment banks have been able to afford it.

Computation Graphs in XML
A new paradigm aimed at lowering this total cost of ownership is emerging. It's based on the use of standard tools and techniques as well as partitioning the work and distributing it to a set of low-cost computers (running Linux or Windows). Much of the savings results from the fact that the bulk of the work is delegated to XML tools and less proprietary code needs to be written. The framework is based on representing the computation graph in XML, partitioning the graph into a large number of subgraphs (each also represented in XML), and distributing this computation to many computers on the network.

To use XML easily (and not require HREFs, XPath expressions, or other types of references), the computation graph may be converted into a tree. This is done by duplicating nodes and maintaining multiple nodes that subscribe to an underlying value. While not the most elegant solution, it simplifies the algorithms for partitioning and managing the computations. Alternatively, the graph can be partitioned first; only then is each subgraph converted to a tree.
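
A minimal sketch of that duplication idea follows; it is illustrative only (the class names are hypothetical), but it shows the key invariant: every copy of a shared node keeps a reference to the same underlying value, so duplicating structure does not duplicate computation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: expand a computation DAG into a tree by cloning shared
// nodes. Each clone of a DAG node points at the same ValueHolder, so all
// copies "subscribe" to one underlying value.
class ValueHolder { Double value; }

class DagNode {
    String name;
    List<DagNode> inputs = new ArrayList<>();
    ValueHolder holder = new ValueHolder();
}

class TreeNode {
    String name;
    ValueHolder holder;                        // shared with every copy of the same DAG node
    List<TreeNode> children = new ArrayList<>();
}

class TreeExpander {
    // Recursively copy the DAG rooted at 'node'. A node reachable through two
    // parents (such as F4 in Figure 1) ends up with two TreeNode copies that
    // share one ValueHolder. Assumes the input graph is acyclic.
    static TreeNode expand(DagNode node) {
        TreeNode copy = new TreeNode();
        copy.name = node.name;
        copy.holder = node.holder;
        for (DagNode input : node.inputs) {
            copy.children.add(expand(input));
        }
        return copy;
    }
}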

The computation graphs are built using standard XML editors. This means that you can use off-the-shelf tools for building the graphs instead of writing proprietary editors, which helps reduce maintenance costs. To make sure that the computation trees make sense in terms of their inputs and outputs, XML schemas must be employed to represent typing information.

Assume for example that F is a function that takes as input a value of type Ta and a value of type Tb and computes a value of type Tc (i.e., Tc F(Ta, Tb)). Assume that G is a function that takes as input a value of type Tc and calculates a value of type Td (i.e., Td G(Tc)). Also assume that ta is an input value of type Ta, tb is an input value of type Tb, and tc is an input value of type Tc. The schema needs to be built so as to easily represent G(F(ta, tb)) as well as G(tc) in a way that can be type-checked by the XML editor.

The "trick" in defining the schema is the introduction of elements that can be either data or return values from nested function applications. Such elements are defined as groups whose choice compositor includes the basic type as well as any function node that returns that type. As an example, the TcVal group that becomes the type of the parameter of the function G can be defined using the following compositor:

<xsd:group name="TcVal">
  <xsd:choice>
    <xsd:element ref="Tc"/>
    <xsd:element ref="F"/>
    <!-- ...and any other function that returns a Tc -->
  </xsd:choice>
</xsd:group>
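
Under a schema of this shape, a nested function application and a plain input value validate against the same TcVal group. The instance fragments below are only a sketch of what such documents might look like, reusing the hypothetical F, G, Ta, Tb, and Tc names from above:

<!-- G applied to the result of F(ta, tb) -->
<G>
  <F>
    <Ta>ta</Ta>
    <Tb>tb</Tb>
  </F>
</G>

<!-- G applied directly to an input value of type Tc -->
<G>
  <Tc>tc</Tc>
</G>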

Partitioning the Computation Tree
The most important phase in the process is the partitioning phase. In this phase you need to partition either the computation tree or the graph, depending on whether conversion to trees happens before or after the partitioning. Good partitions mean that parallel computations can be performed efficiently. A good partitioning is also the key to continuous real-time recomputation.

It's impossible here to survey all the methods used for partitioning. Partitioning can be proved to be NP-hard (meaning that there is probably no fast deterministic algorithm for general partitioning), but many approaches have been used in the past, some producing very good results. These range from simple greedy algorithms to complex spectral methods using eigenvectors of the Laplacian of the graph. Partitioning is also often based on domain-specific knowledge about the nodes; the simplest example involves Monte Carlo simulation nodes, since it's clear that their multiple runs can be done in parallel.
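
As one illustration of the simplest end of that spectrum (a naive greedy heuristic, not one of the production methods the article alludes to), top-level subtrees can be packed into partitions up to a rough node budget; the TreeNode type is the hypothetical one from the earlier sketch.

import java.util.ArrayList;
import java.util.List;

// Naive greedy partitioning sketch: walk the top-level subtrees of a
// computation tree and pack them into partitions of roughly 'budget' nodes.
// Real partitioners also weigh edge cuts and per-node cost; this ignores both.
class GreedyPartitioner {
    static List<List<TreeNode>> partition(List<TreeNode> subtrees, int budget) {
        List<List<TreeNode>> partitions = new ArrayList<>();
        List<TreeNode> current = new ArrayList<>();
        int size = 0;
        for (TreeNode subtree : subtrees) {
            int nodes = count(subtree);
            if (size + nodes > budget && !current.isEmpty()) {   // close this partition
                partitions.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(subtree);
            size += nodes;
        }
        if (!current.isEmpty()) partitions.add(current);
        return partitions;
    }

    private static int count(TreeNode subtree) {                 // nodes in a subtree
        int total = 1;
        for (TreeNode child : subtree.children) total += count(child);
        return total;
    }
}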

The partition is normally kept static, at least for a while. Although the dynamic nature of the computation tree is such that continuous repartitioning may seem to be required, in practice this is overkill and takes much too long. Therefore, new partitions are created at most daily, and the subtrees are maintained throughout a relatively long period of time.

Representing the computation graph in XML complicates the partitioning phase, and this is the main area in which more research is required. If partitioning is done while looking at the entire graph, then the graph needs to be built in memory, which means that the partitioning process will be a memory superhog. In fact, building the tree using DOM is impossible for large graphs and shouldn't be attempted. Instead, current approaches for very large computation graphs use partitioning algorithms that can be implemented using a SAX parser with one or more passes over the graph. Multilevel bisection algorithms (the general concept of these algorithms is illustrated in Figure 2) lend themselves to such an implementation.

It's possible that JAXB, the new Java Architecture for XML Binding, may prove to be a good alternative to SAX processing for the partitioning phase. JAXB (formerly called Adelard) provides a framework for a two-way mapping between schema-based XML documents and Java objects. Schemas are compiled into Java classes that form a specialized parser based on the schema. The result is a parser that often operates as fast as a SAX parser but builds objects in memory while maintaining a small memory footprint. The JAXB documentation claims that early prototypes have shown JAXB can be faster than SAX because the generated Java classes are precompiled and contain the schema logic, avoiding the dynamic interpretation that a SAX parser performs. Using a JAXB-based approach, it may be possible to create better partitions, but whether such an approach is feasible from a memory footprint perspective has yet to be demonstrated.
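
If the JAXB route does pan out, the loading side of an engine could be as small as the sketch below; the generated package name is a hypothetical stand-in for whatever the JAXB schema compiler produces from the computation-tree schema.

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;

// Hypothetical sketch: unmarshal one partitioned computation subtree into
// schema-generated Java objects instead of hand-writing a SAX handler.
class SubtreeLoader {
    static Object load(File subtreeXml) throws JAXBException {
        // "generated.computation" stands in for the package the JAXB schema
        // compiler generates from the computation-tree schema.
        JAXBContext context = JAXBContext.newInstance("generated.computation");
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return unmarshaller.unmarshal(subtreeXml);   // an object tree mirroring the XML
    }
}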

Computing Subtrees
The fact that computation subtrees are represented by XML trees means that the computation engines can remain simple. Indeed, all they need to do is parse the XML and compute it based on the functions defined in the XML. The computation engines are therefore little more than tree traversal programs.

The simplest approach to the actual computation is one that parses the computation tree using SAX and maintains a stack of values that reflects a depth-first (DFS) traversal of the tree. If the graph isn't converted to a tree, the XML includes references and two passes may be required. In this case the partitioning method needs to reduce (and hopefully eliminate) dependencies on other subgraphs; a family of partitioning methods called bounded contractions treats this as its primary objective.
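
A minimal sketch of that stack-based approach follows. The element convention (one element per function node, leaf elements carrying literal input values) and the trivial function lookup are assumptions for illustration, not the article's actual document format.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a SAX-based evaluator: leaf element text is parsed as
// a literal value; when a function element ends, its children's values are
// popped, the named function is applied, and the result replaces them.
class TreeEvaluator extends DefaultHandler {
    private final Deque<List<Double>> frames = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private double result;

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        frames.push(new ArrayList<>());   // collect this element's child values
        text.setLength(0);
    }

    @Override public void characters(char[] ch, int start, int len) {
        text.append(ch, start, len);
    }

    @Override public void endElement(String uri, String local, String qName) {
        List<Double> args = frames.pop();
        double value = args.isEmpty()
                ? Double.parseDouble(text.toString().trim())  // leaf: literal input value
                : apply(qName, args);                         // inner node: named function
        if (frames.isEmpty()) result = value;                 // root of the subtree
        else frames.peek().add(value);
        text.setLength(0);
    }

    // Stand-in function table; a real engine would dispatch to its function library.
    private double apply(String name, List<Double> args) {
        double sum = 0;
        for (double a : args) sum += a;
        return sum;
    }

    double result() { return result; }
}

Feeding a subtree file to a standard parser, for example via SAXParserFactory.newInstance().newSAXParser().parse(file, handler), yields the subtree's value in a single pass while holding only the value stack in memory.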

Summary
XML parsing is overhead, pure and simple. It's thus a bit surprising that an XML-based scheme for managing computation graphs is workable at all. But XML tools (parsers and other tools) are constantly being made more efficient, and the inherent simplicity of the approach from an architectural and tool perspective leads to significant savings. Benefits include:

  • The small amount of code that needs to be written and maintained

  • The use of XML editors rather than self-built editors

  • The flexibility afforded by schemas

  • The inherent attributes of XML that make the computation graph descriptions less fragile

  • The fact that graph serialization is implicitly taken care of by the very nature of XML-based communication

There's one additional advantage that I believe is the most important one:

  • Because the framework is based on XML documents, it's inherently traceable.

It's easy to debug, tune, and improve the process because you can see what's going on within this complex network of computation engines. Dependencies are implicitly described in a form that is readable by both humans and programs. These descriptions can be easily used to generate reports and perform queries using tools such as XSL and XQL - again saving a lot of complex code that would otherwise have to be written. Current approaches based on in-memory proprietary structures are very hard to debug and trace. If you're into computation graphs, this advantage alone should make you at least curious about whether XML computation graphs are appropriate for you.

Finally, most of the mainstream database vendors seem to be moving toward native support for XML. As an example, the next major release of Microsoft's SQL Server supports XML as a "first-class citizen" (i.e., the database internals optimize XML handling and don't rely on a "bolt-on"). This means that mainstream databases can be used for these advanced environments instead of continuous use of proprietary or niche-vendor environments.

More Stories By Ron Ben-Natan

Ron Ben-Natan is Chief Technology Officer at ViryaNet Inc. Prior to that he worked for companies such as Intel, AT&T Bell Laboratories, Merrill Lynch and as a consultant at J.P. Morgan. He has a Ph.D. in Computer Science in the field of distributed computing and has been architecting and developing distributed applications for over 15 years. His hobby is writing about how technology is used to solve real problems and he has authored numerous books including “IBM WebSphere Application Server: The Complete Reference,” published by Osborne/McGraw-Hill.

More Stories By Doron Sherman

Doron Sherman is the CTO of Collaxa, Inc., a Web Service Orchestration Server vendor and a BEA Partner located in Redwood Shores, California. He has been involved with Java since its early days and pioneered application server technology while being a founder and chief scientist at NetDynamics.
