Performance Issues SharePoint 2013 – Things to look at

Hi,

Just thought I would share some experiences from dealing with intermittent performance issues I was working on recently in SharePoint 2013.

Currently I’m involved in migrating a very large SharePoint platform (10 TB of data) from SharePoint 2010 to SharePoint 2013. The issue I encountered arose from problems being reported in our performance test environment, which we use to sign off the system for production use from a performance perspective.

I spent a few days diagnosing this problem before solving it. The main symptom was that, periodically and with no obvious pattern, HTTP requests to SharePoint that would normally be served in under a second were taking 10-30 seconds to respond. Most strange.

To add some perspective: the platform has many customisations, such as HTTP modules, custom site definitions, feature event handlers, list and document library event handlers, custom search, workflows, all sorts :-), so I had a lot of ground to cover and rule out.

I would also add that before diving into software as the cause, I had ruled out a hardware bottleneck: the profiles from our SQL Servers showed no long-running transactions or SQL performance issues, memory/CPU usage on all servers was nominal, and there was no unexpected network or SAN utilisation.

So in my mind it had to be a software or configuration related issue.

Initially I investigated all our code customisations, gradually excluding them one by one. Still no joy: even with the environment stripped down to bog-standard SharePoint 2013 using a team site with a standard document library, the issue still occurred when accessing HTTP views, provided we left the perf tests running long enough with enough concurrent users being simulated.

So, customisations ruled out, it’s got to be a config issue, right?

We were seeing occasional errors in the ULS logs around the Distributed Cache (AppFabric Cache), and a bit of googling led me to quite a few people blogging about known issues with the AppFabric Cache version supplied in the prerequisites for SP2013. None of these tied in exactly with the timings of the performance issues we were experiencing, but after reading the articles and blogs I decided it was prudent to update our AppFabric Cache to the latest version.

Microsoft provides CU1 of AppFabric Cache 1.1 on the installation media for SharePoint 2013, so this is what most of you will likely have installed.

I would strongly recommend that if you are deploying SharePoint 2013 to a large-scale, multi-server production environment, you update your AppFabric Cache to the latest version available from Microsoft, as the earlier versions do have issues. As of writing this article the latest version is CU5; see KB2932678.
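If you are unsure which AppFabric build you are running, one quick way to check is to inspect the file version of the cache service executable. A sketch, assuming the default install path (the same one used later in this article); adjust the path if your servers differ:

# Check the installed AppFabric Cache build (default install path assumed)
(Get-Item "C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe").VersionInfo.ProductVersion

Compare the build number reported against the one listed in the CU5 KB article to confirm whether the update has been applied.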

I don’t know why Microsoft chose not to ship updates to AppFabric Cache with SP2013 CUs or service packs.

One of the other problems our SharePoint farms were experiencing under performance test load, even after the AppFabric patching, was occasional timeouts being recorded in the ULS logs when dealing with the Distributed Cache. Again, google to the rescue :-).

The default timeout values for operations within AppFabric are a mere 20 ms, so my next step was to raise those values: I moved the timeouts to 10 s and increased the max buffer sizes from the defaults to 32 MB. I found this script elsewhere on the net, but am adding it here for reference, and have updated it with other finds from the URLs below. Thanks to those sources for helping me out.

# Raise the timeouts and buffer sizes for the logon token cache
$settings = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
$settings.MaxBufferPoolSize = 1073741824   # 1 GB
$settings.MaxBufferSize = 33554432         # 32 MB
$settings.RequestTimeout = 10000           # 10 s, in milliseconds
$settings.ChannelOpenTimeOut = 10000       # 10 s, in milliseconds
$settings.MaxConnectionsToServer = 100
Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache -DistributedCacheClientSettings $settings

# Verify the new settings took effect
Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

# Do the same for the view state cache
$settingsvsc = Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache
$settingsvsc.ChannelOpenTimeOut = 10000
$settingsvsc.RequestTimeout = 10000
$settingsvsc.MaxBufferSize = 33554432
Set-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache -DistributedCacheClientSettings $settingsvsc

# Verify the new settings took effect
Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache

# Cap the security token service caches
$sts = Get-SPSecurityTokenServiceConfig
$sts.MaxServiceTokenCacheItems = 1500
$sts.MaxLogonTokenCacheItems = 1500
$sts.Update()

This resolved all the ULS log errors I was seeing, and general load test performance was better, with no more errors in the ULS or event logs. But we still had periodic requests that were taking upwards of 20-30 s to respond (the IIS logs confirmed these times).

You can also check out these two sites for further information on the subject:

http://habaneroconsulting.com/insights/sharepoint-2013-distributed-cache-bug#.VjiNhZVi-70
http://www.wictorwilen.se/how-to-patch-the-distributed-cache-in-sharepoint-2013

The next thing to do was to stop the distributed cache from blocking on garbage collection. I strongly suggest you do this. You need to change the config file for the Distributed Cache service, which under normal circumstances can be found here: C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe.config

Add the following section to the file (note that .NET config element names are case-sensitive):

<appSettings>
<add key="backgroundGC" value="true"></add>
</appSettings>

So it looks like:

.....
</configSections>
<appSettings>
<add key="backgroundGC" value="true"></add>
</appSettings>
<dataCacheConfig cacheHostName="AppFabricCachingService">
.....

To apply the change, first gracefully stop the cache service:

Stop-SPDistributedCacheServiceInstance -Graceful

Save the .config file, then re-provision the service instance:

$instance = Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache" -and $_.Server.Name -eq $env:computername}
$instance.Provision()
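To confirm the instance came back online after provisioning, a quick check (a sketch using the same cmdlets as above):

# Should report Status "Online" once provisioning completes
Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache" -and $_.Server.Name -eq $env:computername} | Select TypeName, Status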

The updates above can also be used to resolve issues where you see periodic re-authentication requests in SharePoint, combined with ULS errors around the authentication token cache.

Alas for me the performance problem still persisted…

So where next? I started to suspect that the application pool running SharePoint was recycling at random intervals, but I was seeing nothing in the ULS logs or event logs to confirm this. I then used Perfmon to monitor the ASP.NET counter for application restarts, and lo and behold, whenever we hit a performance issue an application restart was occurring. So I started to investigate what was causing it. I had already ruled out all our custom code by this point, and I saw nothing in the logs to explain what was going on, even after pushing the ULS logging level to VerboseEx, an undocumented level of detail even greater than Verbose.
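If you prefer to watch for this from the command line rather than the Perfmon UI, here is a sketch using the built-in Get-Counter cmdlet against the same ASP.NET counter:

# Sample the application restarts counter every 5 seconds;
# the value is cumulative, so any increase means a restart occurred
Get-Counter -Counter '\ASP.NET\Application Restarts' -SampleInterval 5 -Continuous

Run this on each web front end while the load test is going, and correlate any increase with the slow requests in the IIS logs.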

Then I checked the IIS settings for the app pools running SharePoint 2013, to make sure they were configured to report all recycle events to the event log. They were…

At that point, though, a value in the advanced section of the app pool configuration caught my eye, and I noticed the smoking gun: the private memory limit…

It seems that, for reasons I’ve yet to get to the bottom of, this value was different from our previous SharePoint 2010 platforms. On SharePoint 2010 the Private Memory Limit set on the app pools in IIS is 0 (i.e. no limit); for some reason on our SharePoint 2013 kit, when the app pools are created via PowerShell scripts, the limit was set to 2 GB, and if your app pool attempts to exceed this memory allocation it gets silently recycled. Bingo: I changed the value to 0 on SP2013 to match SP2010, and there were no more app pool recycles in Perfmon and no more requests taking ages to respond.
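For reference, the limit can also be cleared from PowerShell via the WebAdministration module rather than the IIS UI. A sketch; the app pool name "SharePoint - 80" is a placeholder, so substitute your own SharePoint app pool names:

Import-Module WebAdministration

# Clear the private memory recycle limit (0 = no limit) on a SharePoint app pool;
# "SharePoint - 80" is a placeholder name
Set-ItemProperty "IIS:\AppPools\SharePoint - 80" -Name recycling.periodicRestart.privateMemory -Value 0

# Verify the change
(Get-ItemProperty "IIS:\AppPools\SharePoint - 80" -Name recycling.periodicRestart.privateMemory).Value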

Is this new limit something to do with SP2013 being more cloud-focused and configuring itself out of the box with a memory limit more suitable to a multi-tenanted, cloud-hosted environment? Or is it due to some other change somewhere in our server provisioning process? I’ll never know 🙂

Of course, you may want to work out a suitable Private Memory Limit for your production platforms. If you have loads of RAM on your servers, then just setting it to no limit should be OK; if your servers have limited RAM and you are hitting the problem I encountered, you could try doubling the limit to 4 GB rather than removing it entirely.

Hope this proves useful to others.

Thanks
