Cronkite Story Engine (un)Optimization

One of the primary web services that we provide via our Legion framework is Cronkite, our story engine. Basically, Cronkite is a centralized repository of all the content that our Public Affairs team wishes to make available on our websites and various internal applications. Each item or article of content in Cronkite is called a “story”. On the surface this seems like a simple application, but given how the company is structured it almost immediately balloons into a monster of complexity.

The primary issue is that while Dartmouth-Hitchcock bills itself as a single institution (we even have a campaign titled “One DH”), it’s really a network of about a hundred different clinics, labs, and hospitals. These entities first coalesced into five larger groups, which then joined to form the Dartmouth-Hitchcock Network; it’s these five groups which are referred to internally as “Locations”.

So, we have one repository of stories which are targeted to five different locations and filtered according to department, topic, and page. This leads to myriad ways in which stories can be pulled from the central database depending on where you are, what you’re doing, and how you’re doing it. This in turn leads to a situation where some fairly wacky queries get run.

During testing these queries performed fine and did nothing to raise any red flags. We had normalized our data, built indices on the columns in our WHERE clauses, and everything seemed fine. However, once we went live we immediately began to notice queries taking in excess of 30 seconds. For those of you new to MS SQL Server, 30 seconds is the default command timeout for client connections, meaning that any query that takes longer simply fails. Now, one might ask: why not just increase the timeout? Well, I’m glad you asked: BECAUSE THAT’S BAD AND YOU SHOULD FEEL BAD FOR THINKING IT! This query runs on every page load of our Intranet, and having either a) the page take 30+ seconds to load, or b) the page load but its call-to-action take 30+ seconds to fill in asynchronously, kind of defeats the whole purpose.

After profiling the code it rapidly became apparent that the issue lay not in the application, platform, or service, but rather in the data layer. Here is a (rather abbreviated) snippet of one of the offending queries:

CREATE PROCEDURE [dbo].[xspGetFeaturedStories]
    @ChannelId INT ,
    @ChannelName VARCHAR(255) ,
    @LocationId INT ,
    @LocationName VARCHAR(255) ,
    @featureType VARCHAR(50) ,
    @NumFeatures INT
AS
    -- snip, input validation and default values
 
    SELECT TOP ( @NumFeatures )
            id_Story AS StoryId ,
            id_Type AS StoryTypeId ,
            Type AS StoryTypeName ,
            ChannelGroup ,
            id_ChannelGroup AS ChannelGroupId ,
            XMLText AS StoryXml
    FROM    vwFeaturedStories
    WHERE   id_Story IN (
                -- snip, concerns story publish and expiration dates
            )
            AND id_Story IN (
                SELECT  id_Story
                FROM    dbo.tblStoryLocationMap
                WHERE   id_Location = @LocationId
                        AND id_Channel = @ChannelId
            )
            AND tfFeatured = 1
            AND dbo.vwFeaturedStories.id_Channel = @ChannelId
    ORDER BY StoryDate DESC
 
    -- snip, other stuff

[Image: “Would you like me to optimize that?”]

If you don’t see anything wrong with that code, that’s because there isn’t. The issue arises when a feature of MS SQL Server called parameter sniffing starts trying to be helpful. When the procedure is compiled, SQL Server looks at the actual parameter values being passed in, matches them against the queries in which those parameters are used, and uses that information to construct (and cache) an execution plan optimized for those values.

In our case this one query is in effect querying against five subsets of data: the five locations mentioned earlier. Two of the locations each account for about 40% of the stories in the database, with the remaining three combining to make up the last 20%. For execution plan optimization to work, the parameters being passed in must be “typical” for the call, and unfortunately for us we’ve just defined three separate flavors of “typical”. This means that at any given time roughly half of the calls to the procedure are executing with a sub-optimal plan, resulting in long query times and dragging down the performance of every other query running alongside them.
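
As an aside, if you want to watch this play out, sp_recompile flags the procedure’s cached plan as stale, and whichever location happens to call the procedure next gets to define “typical” for everyone else:

-- Discard the cached plan; the next caller's parameter values get sniffed,
-- and the new plan is optimized for that caller's location.
EXEC sp_recompile N'dbo.xspGetFeaturedStories';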

The ultimate solution was to fool MS SQL Server into thinking that there was nothing in the procedure it could sniff. You see, MS SQL Server does not trace variable assignments: if an input parameter is not used directly in a query, then as far as the optimizer is concerned it may as well not exist. Disabling the optimization was therefore as simple as copying the input parameters into locally declared variables and refactoring the queries to use those rather than the actual parameters.

CREATE PROCEDURE [dbo].[xspGetFeaturedStories]
    @ChannelId INT ,
    @ChannelName VARCHAR(255) ,
    @LocationId INT ,
    @LocationName VARCHAR(255) ,
    @featureType VARCHAR(50) ,
    @NumFeatures INT
AS
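    -- Shadow the input parameters with local variables; the optimizer cannot see
    -- the values of local variables at compile time, so there is nothing for it
    -- to sniff and it falls back to a generic plan.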
    DECLARE @_ChannelId INT = @ChannelId
    DECLARE @_ChannelName VARCHAR(255) = @ChannelName
    DECLARE @_LocationId INT = @LocationId
    DECLARE @_LocationName VARCHAR(255) = @LocationName
    DECLARE @_featureType VARCHAR(50) = @featureType
    DECLARE @_NumFeatures INT = @NumFeatures
 
    -- snip, input validation and default values
 
    SELECT TOP ( @_NumFeatures )
            id_Story AS StoryId ,
            id_Type AS StoryTypeId ,
            Type AS StoryTypeName ,
            ChannelGroup ,
            id_ChannelGroup AS ChannelGroupId ,
            XMLText AS StoryXml
    FROM    vwFeaturedStories
    WHERE   id_Story IN (
                -- snip, concerns story publish and expiration dates
            )
            AND id_Story IN (
                SELECT  id_Story
                FROM    dbo.tblStoryLocationMap
                WHERE   id_Location = @_LocationId
                        AND id_Channel = @_ChannelId
            )
            AND tfFeatured = 1
            AND dbo.vwFeaturedStories.id_Channel = @_ChannelId
    ORDER BY StoryDate DESC
 
    -- snip, other stuff

It’s worth noting that, according to the documentation, you should be able to control this behavior with the OPTIMIZE FOR and RECOMPILE query hints (or by creating the procedure WITH RECOMPILE), but for some reason this seemed to have exactly zero effect on performance in our case.
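
For reference, here is a sketch of what the hint-based approach looks like on the abbreviated query above (the parameter and column names are the ones from the procedure). The OPTIMIZE FOR hint with UNKNOWN asks the optimizer to plan for generic statistics rather than the sniffed values, while OPTION (RECOMPILE) builds a fresh plan on every execution; as noted above, this route didn’t help in our case.

    SELECT TOP ( @NumFeatures )
            id_Story AS StoryId ,
            XMLText AS StoryXml
    FROM    vwFeaturedStories
    WHERE   id_Story IN (
                SELECT  id_Story
                FROM    dbo.tblStoryLocationMap
                WHERE   id_Location = @LocationId
                        AND id_Channel = @ChannelId
            )
            AND tfFeatured = 1
            AND dbo.vwFeaturedStories.id_Channel = @ChannelId
    ORDER BY StoryDate DESC
    -- plan for generic parameter values instead of the sniffed ones...
    OPTION  ( OPTIMIZE FOR ( @LocationId UNKNOWN, @ChannelId UNKNOWN ) )
    -- ...or, alternatively, recompile a fresh plan on every execution:
    -- OPTION ( RECOMPILE )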

So that solved the problem with the SQL timeouts and the 30+ second query times, and we were able to get an average response time from the Cronkite service of about 1-2 seconds. This was much better, but not good enough, so we went back to the drawing board. What we realized is that while the queries are different, there’s a relatively small and finite number of them, so why not just save the result set and return it as needed? And thus, result caching in Legion was born. It was actually a relatively simple addition to the framework, as every call to a service ultimately passes through a single method in the framework. All we had to do was intercept the call before it reached the service and return the cached result (or, if there is no cached result or it’s stale, pass the call through, then save and return the result). This brought the average response time down to a much more reasonable 1.5ms.

We later discovered that the 1-2 second queries were caused by our Public Affairs team having added additional filters to the stories being displayed, which resulted in some tables being pivoted in unexpected ways. We had already solved (a.k.a. hidden) the problem, but we went back and added the appropriate indices to eke out that little bit of extra performance. In the end, we developed a very stable and scalable service and added some cool functionality that can be, and is being, leveraged by other services.
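
As a rough illustration of the kind of index involved (the tables behind the new Public Affairs filters aren’t shown in this post, so this uses the story/location map from the earlier query as a stand-in, and the index name is invented):

-- Hypothetical covering index for the location/channel lookup in the earlier
-- subquery; the actual indices were added to the tables involved in the new
-- filters, which aren't shown here.
CREATE NONCLUSTERED INDEX IX_StoryLocationMap_Location_Channel
    ON dbo.tblStoryLocationMap ( id_Location, id_Channel )
    INCLUDE ( id_Story );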

All this being said, if we had the time to do actual analysis on the queries being run against the Cronkite database, we could probably do away with a lot of these shenanigans, which are really just band-aiding the problem. But given that we are currently getting an average response time of 1.5ms, with a max of 31ms for the really squirrelly queries, it’s a problem that doesn’t really need to be fixed any time soon.