This is the hottest time yet to be in financial markets: nanosecond response times, the ability to affect global markets in real time, and lucrative spot deals in dark pools are all the rage. For companies that do business in these times, it is a technical arms race worthy of a Reagan-era analogy.
With High-Frequency Trading firms locked into an effective "Space Race", the challenges for these firms are now far reaching, extending beyond traditional regulatory, compliance, and government boundaries.
Firms still need to meet regulatory requirements, but serious fines for non-compliance, and even enforceable undertakings by third parties to halt trading activities on markets, are outweighed by the potential upside for the firms competing in the race.
Increasingly, the most marginal of technical errors can spell doom for market participants. In a market where risk is ever-present and often measured in millions of dollars, glitches are a regular occurrence, resulting in lost revenue, disappointed customers, and the rapid destruction of once high-profile market leaders.
Recently, this was brought to the public's attention by the spectacular failure of Knight Capital: in August 2012, erroneous trades were sent to the New York Stock Exchange, leading to the obliteration of nearly 60% of the firm's value in under one hour.
The firm's catastrophe has forced an attitude change among investors and corporate technology leadership, with a focus on compliance controls and board-level accountability. Tiny lapses in controls are expensive mistakes that disrupt markets. Add the immense losses and liability suits that often trail such events, and the stakes are higher than ever to develop software in a controlled way and get it to market in the shortest time possible.
With regulatory changes imminent, the need for clear, actionable reporting at all levels of technology organizations calls for a better approach than those traditionally taken.
The Landscape of Failure:
In the last two years alone, there have been numerous incidents in which technology misconfiguration led markets awry. Institutional investors aside, weaknesses in the mechanisms that govern software development for brokerage firms and markets have far-reaching and damaging consequences. From ill-prepared recovery protocols to poorly governed front, middle, and back offices, several noteworthy recent incidents have led to greater scrutiny of trading companies.
November 2012 - NYSE Euronext
A newly implemented market matching engine, UTP (Universal Trading Platform, the core trading platform employed by the NYSE), caused a day-long disruption and forced the Big Board operator to establish closing prices for more than 200 stocks by falling back to its old system, Super Display Book (sDBK). Trading never resumed during the day for the 216 stocks affected. The exchange determined the official closing price for each affected security from a consolidated reading of last-sale prices instead of the auction normally used to close stocks, and manual intervention was required to revalidate positions for venues and participants.
Overview of the Root Cause: Poor Testing/Quality Assurance/Release Management Failure.
2007 - 2010 London Stock Exchange (LSE) Multiple Outages & the Move to Linux
Over a four-year period, the London Stock Exchange earned a reputation as the most unreliable exchange in the market, as a string of technology problems showed up as regular outages. Ultimately, the LSE had to move its entire operating stack to a new platform and institute a raft of new, more mature processes to achieve the reliability it needed.
August 2012 - Knight Capital
In the span of 45 minutes, a little over four hundred million dollars was lost when an algorithmic trading program designed for test environments was released into the production environment connected to the live market. The blunder led to a seventy-five percent dip in the stock price in a 30-minute period before attempts to salvage the situation could be initiated. The error involved high-frequency trading (HFT) in up to 140 stocks, and is just the latest in a string of such errors.
The Root Cause: Poor Configuration Management, Inconsistent Testing Approach, Poor Release Management
Most brokerages apply several layers of risk mitigation when developing and deploying software. I'll give a high-level overview of a traditional approach in another post (I won't go into the details of settlement, vetting, market matching, etc.). Trading firms are complex beasts, with multiple market participants, multiple exchanges, and a plethora of investment instruments, and going into detail on the actual technologies would detract from the message. What is apparent is that the process life-cycles used to achieve releases are governed by mechanisms from a different time and place, with inconsistent controls that were not designed for rapid release schedules, leaving gaps in organizational capabilities that are open to failure.
The "New Old Ways" to Manage These Problems:
Application Lifecycle Management (ALM), a recent play, is typically a means of ensuring that software remains relevant. A vital aspect of the Software Development Life-Cycle (SDLC), ALM is integral to helping firms overcome the challenges of developing top-notch software at a fast pace. The new-wave premise of ALM follows a design, build, run mentality and pushes to bring all activities in the development cycle under one roof, whereas previous approaches often stitched together different best-of-breed solutions.
The benefits of this for trading systems are clear. Greater visibility and consistency between tools means more bugs get fixed and, ultimately, fewer glitches. The unfortunate reality is that underlying configurations are still not maintained well in this approach, and configuration errors would not necessarily have been caught by the technology from traditional ALM vendors.
ITIL is a widely accepted approach to IT service management in these organizations. An ITIL-enabled process centers on what is called a Configuration Management Database (CMDB), which contains all information pertaining to an information system. It helps the organization identify and comprehend the relationships between system-level components and applications, and it is designed to track relationships between technology services and, at a micro level, items called CIs (Configuration Items). This process is known as configuration management, but as it typically lives in the operational part of the equation (Application Support, Infrastructure Operations & Service Management), it usually only gets invoked at a high level in pre-production environments. There is another discipline, Software Configuration Management, which has applicable components in both ITIL and ALM. However, the tools and processes rarely meet, as the disciplines are very much either software or infrastructure oriented.
The conceptual CMDB enables configuration items to be specified and controlled in a systematic and detailed manner, reducing configuration drift. As mentioned previously, problems with this approach manifest in the ITIL world: the CMDB typically does not converge with the version control repositories used in the development life-cycle and, more often than not, is not version controlled itself, leaving further inconsistencies.
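To make the CMDB idea a little more concrete, here is a minimal sketch in Python. It is not how any particular ITIL product stores its data. The `ConfigurationItem` class, its fields, the `impacted_by` helper, and the example item names are all assumptions made up for illustration. It shows two things the paragraphs above describe: tracking relationships between CIs, and producing a fingerprint of a CI's attributes that could be committed to version control alongside the code.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class ConfigurationItem:
    """One CI in a toy CMDB; every field name here is illustrative."""
    name: str
    kind: str                                        # e.g. "host", "application"
    attributes: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)   # names of other CIs

    def fingerprint(self) -> str:
        """Stable hash of the attributes, suitable for storing in version control."""
        canonical = repr(sorted(self.attributes.items()))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def impacted_by(cmdb: dict, changed: str) -> set:
    """Return every CI that depends, directly or transitively, on `changed`."""
    impacted = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for ci in cmdb.values():
            if current in ci.depends_on and ci.name not in impacted:
                impacted.add(ci.name)
                frontier.append(ci.name)
    return impacted


# Hypothetical items; names and versions are invented for this example.
cmdb = {
    ci.name: ci
    for ci in [
        ConfigurationItem("prod-host-01", "host", {"os": "linux", "kernel": "3.2"}),
        ConfigurationItem("order-router", "application", {"version": "1.4.2"},
                          depends_on=["prod-host-01"]),
        ConfigurationItem("market-making", "service", {"version": "2.0.0"},
                          depends_on=["order-router"]),
    ]
}
```

Asking `impacted_by(cmdb, "prod-host-01")` returns both the order router and the market-making service, which is the kind of question a change reviewer wants answered before a release. Storing each `fingerprint()` next to the release tag is one possible way to bring the CMDB and the source repository closer together.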
Okay, Okay, We Get That. So What Went Wrong at Knight?
Basically, Knight accidentally released simulation software, which it used to verify that its market-making software functioned properly, into the NYSE's live system.
Within Knight Capital's development environments lived a software program called a "market simulator", designed to send patterns of buy and sell orders to its counterpart, the market-making software, called RLP in this case. The trade executions were recorded and potentially used for performance validation prior to new releases of the market-making software. This is probably how they stress-tested the new market-making software under load before deploying it to the production system connected to the NYSE.
Prior to August 1st, a number of teams would have progressively migrated software between environments for release into the live environment. Potentially, a manual step in the deployment process picked up a copy of the simulation software and pushed it into live. Most companies do not run baseline configuration tests in the later environment stages, so (probably late in the process) someone added the program to the release package and deployed it.
This is exacerbated in large teams, and stems from the fact that typically no one team owns the configuration state of both the applications and the operating systems and platforms they run on. The closest is usually the systems administration team, but as they have a production environment to manage, these "lesser" environments get sidelined in favor of more important problems. Add the fact that very few tools actually focus on configuration testing, so people rely on collections of scripts or home-brew solutions, and it is easy to see where this went wrong.
The likely cause of the problem (well, from an outsider's perspective) is the lack of a well-defined configuration baseline and of a set of configuration tests that cover the differences between environments.
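A baseline test of this kind does not need to be elaborate. The following is a minimal sketch in Python, assuming that each environment's state can be flattened into key/value pairs. The `diff_config` function, the key naming scheme, and all package names and versions are hypothetical and are not taken from Knight's systems.

```python
def diff_config(baseline: dict, actual: dict, ignore: frozenset = frozenset()) -> dict:
    """Compare a flat key/value baseline against an environment's actual state.

    Returns three sorted lists: keys missing from `actual`, keys present only
    in `actual`, and keys whose values differ. Keys in `ignore` are expected
    to differ between environments (hostnames, for example).
    """
    base_keys = set(baseline) - ignore
    actual_keys = set(actual) - ignore
    return {
        "missing": sorted(base_keys - actual_keys),
        "unexpected": sorted(actual_keys - base_keys),
        "changed": sorted(
            k for k in base_keys & actual_keys if baseline[k] != actual[k]
        ),
    }


# Hypothetical data: names and versions are invented for illustration.
baseline = {
    "pkg/market-making": "2.0.0",
    "pkg/order-router": "1.4.2",
    "sysctl/net.core.rmem_max": "16777216",
    "hostname": "uat-01",
}
production = {
    "pkg/market-making": "2.0.0",
    "pkg/order-router": "1.4.1",
    "pkg/market-simulator": "0.9.0",   # a test-only tool that should not be here
    "sysctl/net.core.rmem_max": "16777216",
    "hostname": "prod-01",
}

report = diff_config(baseline, production, ignore=frozenset({"hostname"}))
drifted = any(report.values())   # True when any of the three lists is non-empty
```

In this made-up example the report flags `pkg/market-simulator` as unexpected and `pkg/order-router` as changed, while the hostname difference is ignored because it is a legitimate difference between environments. Run as a gate before each promotion, a check like this turns "what is actually on that box?" into a question with a recorded answer.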
On the morning of August 1st, the release was successfully deployed and the simulator inadvertently bundled with it was ready to do its job: exercise the market-making software.
This time, however, it was no longer in one of the test environments. It was executing live trades on the market, with real orders and real dollars.
For stocks where Knight was the only firm running market-making software as an RLP, and the simulator was the only algorithmic trader crossing the bid/ask spread, we saw consistent buy and sell patterns of trade executions, all marked regular, all from the NYSE, and all occurring at prices just above the bid or just below the ask.
Examples include EXC and NOK, and you can see these patterns in charts here. The simulator was functioning just as it did in the test environments, and Knight's market-making software was intercepting these orders and executing them. In these cases Knight's net loss was minor, since its own software was filling the orders, but it was generating a lot of wash sales.
For stocks where Knight was not the only market-maker, or where other algorithmic trading software was actively trading (and crossing the bid/ask spread), some or all of the orders sent by the simulator were executed by someone other than Knight, and Knight now had a position in the stock, meaning it could have been making or losing money. The patterns generated for these stocks depended greatly on the activity of the other players.
Because the simulator was buying indiscriminately at the ask and selling at the bid, and because bid/ask spreads were very wide during the open, we now understand why many stocks moved violently at that time. The simulator was simply hitting the bid or offer, and the side it hit first determined whether the stock opened sharply up or down.
Since the simulator didn't think it was dealing with real dollars, it didn't have to keep track of its net position. Its job was to send buy and sell orders in waves across pre-defined positions.
This explains why Knight didn't know right away that it was losing a lot of money.
They didn't even know the simulator was running.
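To see what "keeping track of its net position" would have meant in practice, here is a small Python sketch. It is an assumption about what a minimal position tracker looks like, not a description of Knight's software. The `PositionTracker` class, the `max_abs_position` limit, the symbol `XYZ`, and the prices are all invented for the example.

```python
from collections import defaultdict


class PositionTracker:
    """Tracks signed share position and net cash flow per symbol from fills."""

    def __init__(self, max_abs_position: int):
        self.max_abs_position = max_abs_position   # hypothetical risk limit
        self.position = defaultdict(int)           # symbol -> signed shares
        self.cash = defaultdict(float)             # symbol -> net cash flow

    def on_fill(self, symbol: str, side: str, qty: int, price: float) -> bool:
        """Record a fill; return False once the position limit is breached."""
        signed = qty if side == "buy" else -qty
        self.position[symbol] += signed
        self.cash[symbol] -= signed * price        # buying spends cash
        return abs(self.position[symbol]) <= self.max_abs_position

    def marked_pnl(self, symbol: str, mark_price: float) -> float:
        """Net cash flow plus the open position valued at `mark_price`."""
        return self.cash[symbol] + self.position[symbol] * mark_price


# Hypothetical quotes: bid 10.00, ask 10.25.
tracker = PositionTracker(max_abs_position=150)
tracker.on_fill("XYZ", "buy", 100, 10.25)    # buy at the ask
tracker.on_fill("XYZ", "sell", 100, 10.00)   # sell at the bid
round_trip_pnl = tracker.marked_pnl("XYZ", 10.00)
```

Buying 100 shares at the ask and selling them at the bid leaves the position flat and `round_trip_pnl` at -25.0, which is the spread paid on 100 shares. Repeat that in waves across many stocks with wide opening spreads and the losses add up quickly, yet a program that never records its fills has no way to notice.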
When they realized they had a problem, the first suspect was likely the new market-making software. We think the two periods when trading suddenly dropped (9:48 and 9:52 AM) are when they restarted the system. Each time it came back, the simulator, being part of the package, fired up and continued trading. Finally, just moments before a news release at 10 AM, someone found and killed the simulator.
We can fully appreciate the nightmare their team must have experienced that morning: a lack of visibility, inconsistent sources for what was actually running in production, and poor visibility over a release that appeared to have succeeded.
Regulated Controls Against Flash Crashes
Like others that came before it, Knight Capital was once THE retail market-maker in the US; its reputation has now been irreparably damaged. It's prudent to note that the error was entirely avoidable had the relevant controls been in place.
Several factors played into this scenario, namely:
- Poor configuration management
- A set of loose controls around the release management process within the firm
- A lack of visibility into the makeup of the changes that were being introduced into the market
- An inability to isolate the configurations that were deployed
- A lack of configuration testing
- A lack of operational acceptance testing
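Several of the factors above come down to not knowing exactly what is inside a release. One possible control, sketched here in Python, is to compare the package about to be deployed against an approved manifest of artifact checksums. The `verify_release` function, the manifest format, and the artifact names are assumptions for illustration. A real pipeline would read files from disk and would likely sign the manifest as well.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_release(manifest: dict, package: dict) -> list:
    """Return a list of violations; an empty list means the package matches.

    `manifest` maps artifact name -> expected SHA-256 hex digest (what was approved).
    `package` maps artifact name -> artifact bytes (what is about to be deployed).
    """
    violations = []
    for name in sorted(set(package) - set(manifest)):
        violations.append(f"unapproved artifact: {name}")
    for name in sorted(set(manifest) - set(package)):
        violations.append(f"missing artifact: {name}")
    for name in sorted(set(manifest) & set(package)):
        if sha256_of(package[name]) != manifest[name]:
            violations.append(f"checksum mismatch: {name}")
    return violations


# Hypothetical release: the artifact names and contents are invented.
approved = {"market-making.jar": sha256_of(b"mm-2.0.0")}
package = {
    "market-making.jar": b"mm-2.0.0",
    "market-simulator.jar": b"sim-0.9.0",   # slipped in during a manual step
}
violations = verify_release(approved, package)
```

Here `violations` contains a single entry naming the simulator as an unapproved artifact, and a deployment script could refuse to continue whenever the list is non-empty. The point is not the hashing itself but that the approved makeup of a release is written down and checked by a machine rather than assumed.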
Automated Governance is the Way Forward
DevOps, a recent answer to the challenges of collaboration across the release cycle, stresses the seamless integration of software development and collaboration between IT teams, with a view towards enabling rapid rollout of products via automated release mechanisms. It recognizes the existing gap between activities considered part of the development life-cycle and those characterized as operational activities. Historically, the separation of development and operations has manifested itself as a form of conflict, as can be seen from the sheer number of frameworks developed to address it, and that conflict ultimately predisposes entire systems to errors.
What's currently lacking in each approach is a mechanism to gather systems knowledge in environments where skills and capabilities vary significantly between teams.
For orchestration and deployment, Puppet, Chef, BladeLogic and Electric Cloud go a long way towards improving upon the existing configuration components of ALM models, but often neglect the interaction with ITIL. Puppet has been making strides in recent months with integrations into tools of this nature. Yet the existing suites of tools require specific knowledge of declarative domain-specific languages to describe system resources and their state. In the case of Puppet, discovering system information and compiling it into a usable format is possible, but it is a daunting task for a novice user in these fast-paced corporate environments.
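For readers who have not used a declarative tool, the underlying idea can be shown without any tool's syntax. The sketch below is plain Python and deliberately does not use Puppet's or Chef's languages or APIs. The `plan` and `apply` functions, the `"absent"` convention, and the resource names are assumptions chosen for illustration. You describe the state you want, and the tool works out the actions needed to get there.

```python
def plan(desired: dict, actual: dict) -> list:
    """Compute the actions needed to move `actual` to `desired`.

    Both dicts map a resource name to its state. A desired state of "absent"
    means the resource must not exist. Resources that appear only in `actual`
    are left unmanaged.
    """
    actions = []
    for name, state in sorted(desired.items()):
        current = actual.get(name, "absent")
        if current == state:
            continue
        if state == "absent":
            actions.append(("remove", name))
        elif current == "absent":
            actions.append(("install", name, state))
        else:
            actions.append(("change", name, current, state))
    return actions


def apply(actions: list, actual: dict) -> dict:
    """Return a new state dict with `actions` applied (no real side effects)."""
    result = dict(actual)
    for action in actions:
        if action[0] == "remove":
            result.pop(action[1], None)
        else:
            result[action[1]] = action[-1]   # last element is the target state
    return result


# Hypothetical resources and versions.
desired = {"market-making": "2.0.0", "market-simulator": "absent"}
actual = {"market-making": "1.9.0", "market-simulator": "0.9.0"}

actions = plan(desired, actual)
converged = apply(actions, actual)
```

Running `plan` a second time against `converged` yields no actions, which is the idempotence these tools rely on. The hard part in practice, as noted above, is not this loop but discovering the actual state and expressing the desired one in a language the whole team can read.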
Over time, heavily regulated environments governed by strict auditing requirements will need a validation mechanism that can be maintained and used across the varying capability levels of an organization, so that configuration drift between environments is caught early and reported back.
Increasingly smart automation will be deployed to ensure that state is forcefully maintained: tested, recorded, and safely auto-provisioned. This is a unique means of peer-based systems configuration, and a measure of prevention before configuration errors affect running systems, that very few companies are experimenting with (aka configuration-aware systems).
Our own tool, UpGuard, complements the existing workflow tools and offers the simplest way for Developers/Configuration Managers & Systems Administrators to gain real-time validation of configuration state to great effect. It enables the creation and running of configuration tests, collaborative configurations for teams, and a robust community option in coming months, as well as the creation of detailed documents that act as reports to satisfy audit standards. Applying UpGuard to these environments ensures fast process maturity for developing seamless system configurations and requires no new syntax or code; everything is available as a version-controlled test that can be executed under strict security contexts on the target system.
The Knight crisis is not an isolated event. However, it has been looked at as a rallying call for greater visibility into the processes and compliance measures implemented within trading participants.
With the increasing complexity of trading algorithms, which are the backbone of trading procedures, the necessity of controls to govern these technology organizations is becoming more apparent each day.
Mary Schapiro, the outgoing Chair of the SEC, called for a review of the SEC’s automation review policies, which were put in place with exchanges after the 1987 market crash and require venues to notify the regulator of trading failures or security lapses. Portions of those policies will serve as the basis for the new rules.
The implementation of a powerful trading platform rests on many pillars. The remarkable effectiveness of these platforms has led firms to rely on legacy solutions to deal with the rapid release schedules they now face to stay recognized as leaders. This comes at a cost: the increased pressure to deliver innovation has exposed these systems and, more importantly, the processes and tools that govern them, to the risk of failure when glitches occur. As a result, the concepts outlined in DevOps are clearly needed in order to continue delivering key features and key components of financial markets. Proper execution will help avert crises such as the Knight fiasco in future.
Comprehension and adoption of the various frameworks, the integration of IT automation, and clear governance of development and operational environments will go a long way toward ensuring that a fiasco such as the Knight crisis remains a problem of the past, never to be replicated. Unfortunately, we still have a long way to go in this journey.