From looking at the mailing lists and discussion boards, you’ll see that a reasonable number of people seem to be having problems with their AFP servers under OS X Server 10.4.x, particularly under heavy load with network home directories, and particularly in terms of stability. I used to be one of those people, but the solutions presented in this article have (touch wood) resolved the vast majority of my issues.
Read on for all your AFP tuning needs…
[Ed.] See part #2 of this article
Whereas under 10.3 I used to get a symptom where the whole server would lock up and not even be pingable, the symptoms I’ve seen under 10.4.x have been entirely related to the AFP server. Login times will get longer and longer, access speed becomes slower and slower, and eventually the AFP server enters a death spiral, where the only recourse seems to be restarting the service.
I’m going to quickly cover some of the more commonly known fixes and workarounds that I’d tried before these major solutions.
1) Make sure DNS is working with proper forward and reverse entries for your servers at the bare minimum, and preferably all your client machines as well.
2) Redirect ~/Library/Caches to a local folder for network home directory users.
This makes quite a big difference. If you start watching your AFP server logs, you’ll notice that your network home directory users hit the cache a lot, particularly for apps like Safari. We do this with a login hook that redirects ~/Library/Caches to /Library/Caches/username/. I posted to the macos-x-server list about this a while ago.
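A minimal sketch of the symlink step behind that redirect, for illustration only. The function name redirect_caches and its two arguments are hypothetical, and this is not the hook that was posted to the list; a real login hook would also have to look up the user's home directory and fix ownership of the local folder.

```shell
#!/bin/sh
# Sketch: point <home>/Library/Caches at a local per-user folder.
# redirect_caches is a hypothetical helper name, not part of OS X.
# usage: redirect_caches <home directory> <local cache directory>
redirect_caches() {
    home_dir=$1
    local_dir=$2
    net_caches="$home_dir/Library/Caches"

    # Create the local folder that will actually hold the caches.
    mkdir -p "$local_dir" || return 1

    # Nothing to do if a symlink is already in place.
    if [ -L "$net_caches" ]; then
        return 0
    fi

    # Remove the existing network Caches folder; its contents are disposable.
    rm -rf "$net_caches"
    mkdir -p "$home_dir/Library"
    ln -s "$local_dir" "$net_caches"
}
```

In a login hook, where the short username arrives as $1, a call might look like: redirect_caches "$home_loc" "/Library/Caches/$1", with home_loc being whatever your directory lookup returned.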
3) Disable creation of .DS_Store files for network mounts.
There is an Apple techinfo article up about this. From debugging, it looks like a corrupt .DS_Store file can create all sorts of problems, particularly if you have a setup like mine, where students have read-only access to certain network folders, and lecturers have write-access. I’ve used MCX to push out the above setting, and then run a command like this to delete all existing .DS_Store files on the sharepoint:
sudo find /Volumes/someshare -name ".DS_Store" -print0 | sudo xargs -0 rm
The problem with this solution is that your users will get a bit narky about no longer being able to have window settings ‘stick’ across sessions…. which my secretaries are still unhappy about…
4) Check the integrity of your filesystem.
5) Put more RAM in your server.
I was previously looking at swapfile creation and vm_stat to work out whether my AFP servers needed more RAM, however the problem seems to be that the AFP server tries to avoid swapping to disk, and so these aren’t good metrics to work from. As soon as I stuck a couple more gig into my servers, they started using it…
6) Move heavy users to Mobile Accounts with Portable Home Directories.
This isn’t applicable in all situations, and there’s no way I could move all my students to Mobile Accounts, but for my staff members who use Office and Mail heavily, performance on a network home directory isn’t that great, even on a fast connection. If you have users who primarily use a single machine, Mobile Accounts where you control the synchronisation settings via MCX give you the best of both worlds.
7) Increase the sysctl parameters for max files and max files per process.
This is a more general tuning tip for OS X Server, but as Rob Middleton pointed out to me, without it, if you have AFP error logging on, it helpfully logs several hundred lines per second to inform you that it can’t open files… filling up your primary partition rather quickly…
Create /etc/sysctl.conf with the following parameters:
kern.maxfiles=200000
kern.maxfilesperproc=50000
If you want to apply those settings immediately, rather than waiting for a reboot you can do:
for i in $(cat /etc/sysctl.conf); do sysctl -w $i; done
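The one-liner above assumes the file contains nothing but name=value pairs with no spaces. If your sysctl.conf has comments or blank lines, something like the following sketch may be safer. The helper name sysctl_commands is hypothetical, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: turn a sysctl.conf-style file into "sysctl -w" commands.
# sysctl_commands is a hypothetical helper; it only prints, it changes nothing.
sysctl_commands() {
    conf_file=$1
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop anything after a '#', then remove all spaces and tabs.
        setting=$(printf '%s\n' "${line%%#*}" | tr -d ' \t')
        # Skip blank lines and anything that is not name=value.
        case $setting in
            *=*) printf 'sysctl -w %s\n' "$setting" ;;
        esac
    done < "$conf_file"
}
```

Review the output of sysctl_commands /etc/sysctl.conf first, then pipe it to sudo sh once you are happy with it.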
Now onto the more interesting fixes…
1) Tweaking the WAN threshold and packet size on the clients
If you run this command on a client machine, you can see various AFP client tuning parameters. Some of your settings may be slightly different to these.
nigelkersten@zombie: ~ $ defaults read -g com.apple.AppleShareClientCore
{
    "afp_active_timeout" = 0;
    "afp_authtype_show" = 0;
    "afp_cleartext_allow" = 1;
    "afp_cleartext_warn" = 1;
    "afp_debug_level" = 6;
    "afp_debug_syslog" = 0;
    "afp_default_name" = "";
    "afp_idle_timeout" = 0;
    "afp_keychain_add" = 1;
    "afp_keychain_search" = 1;
    "afp_login_displayGreeting" = 1;
    "afp_maxDirCache" = 60;
    "afp_maxFileCache" = 60;
    "afp_minDirCache" = 5;
    "afp_minFileCache" = 5;
    "afp_mount_defaultFlags" = 0;
    "afp_no_kQueues" = 0;
    "afp_no_volChange_caching" = 1;
    "afp_prefs_version" = 2;
    "afp_reconnect_allow" = 1;
    "afp_reconnect_interval" = 10;
    "afp_reconnect_retries" = 12;
    "afp_ssh_allow" = 0;
    "afp_ssh_force" = 0;
    "afp_ssh_require" = 0;
    "afp_ssh_warn" = 1;
    "afp_use_default_name" = 0;
    "afp_use_short_name" = 0;
    "afp_voldlog_skipIfOnly" = 0;
    "afp_wan_quantum" = 8192;
    "afp_wan_threshold" = 30;
}
Of particular interest are these two settings, ‘afp_wan_quantum’ and ‘afp_wan_threshold’. The values I’ve shown here should be the defaults, but you may have a setting of 0 for both of them. These are used by the client to work out whether a particular AFP connection is over a LAN or WAN connection. If the latency of a given connection is higher than the value in afp_wan_threshold, then the data chunk size drops from 128KiB (the default for a LAN connection) to 8KiB, the setting shown here.
The problem seems to be that this default threshold setting is way too low, and once the AFP server starts experiencing moderate load, your LAN clients start using the WAN data chunk size. Although smaller chunks are desirable for slow connections, they induce an overhead on the server in terms of processing as the server is dealing with 16x the number of chunks, and reduce overall throughput. A symptom of this is high CPU usage.
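To make the 16x figure concrete, here is a rough sketch of the arithmetic. It assumes, as a simplification, that a transfer is just the file size divided into chunks of the negotiated size; the helper name chunks_needed and the 10 MiB example file are made up for illustration.

```shell
#!/bin/sh
# Sketch: how many chunks does a transfer need at a given chunk size?
# chunks_needed is a hypothetical helper; sizes are in bytes.
chunks_needed() {
    total=$1
    quantum=$2
    # Integer division rounded up, so a partial final chunk still counts.
    echo $(( (total + quantum - 1) / quantum ))
}

# A 10 MiB file at the LAN size (128 KiB) versus the WAN size (8 KiB).
file_size=$((10 * 1024 * 1024))
lan_chunks=$(chunks_needed "$file_size" 131072)
wan_chunks=$(chunks_needed "$file_size" 8192)
echo "LAN: $lan_chunks chunks, WAN: $wan_chunks chunks"
```

For that example file this works out to 80 chunks at the LAN size against 1280 at the WAN size, which is the same 16x ratio.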
The good news is that we can change these settings. The values you choose are up to you, and depend largely upon whether you actually do have WAN clients to support, so I’m going to suggest two scenarios:
First scenario: No WAN clients (cause AFP sucks over slow connections anyway 🙂 )
afp_wan_quantum = 131072
afp_wan_threshold = 1000
This is a bit of a shotgun approach. Bring the latency threshold way up, and even if the clients reach this threshold, force their chunk size to be the same as that of a LAN client. This is what I've done in my environment, mainly because I wanted to make sure that I could rule out this issue. As time goes on, and once I work out the correct way to measure latency for a client (see the footnote at the end), I'll come up with more sane values, but at this stage, I just needed to stop my lecturers from revolting en masse due to AFP stability issues.
Second scenario: Some WAN clients who use AFP (I feel their pain…)
afp_wan_quantum = 8192
afp_wan_threshold = 200
This is very much a guesstimate on my part. I’ve done a few tests with the threshold set to 100, and found that LAN clients were still getting the WAN chunk size. Again, see the footnote at the end, as I’d like to find a way of getting an accurate picture of the latency a client is experiencing.
So as far as this setting goes, the good news (for me) is that it has almost entirely resolved performance issues with the AFP server under moderate load. I’ve also been experimenting with the maxFileCache and maxDirCache settings, and even when I’ve been increasing them from 60 to 6000, I’ve been unable to replicate any stale cache issues. If you so desire, try playing around with those settings.
Applying these settings to a client machine.
Well, there are a couple of options you have here.
1) Apply the settings manually.
One way would be to use a LoginHook to set these parameters using the defaults command. ie, the first scenario above could be done with the following two commands:
defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_quantum -int 131072
defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_threshold -int 1000
The “-g” refers to the fact that we’re writing these settings into the Global Domain, (for a user, at ~/Library/Preferences/.GlobalPreferences.plist), -dict-add means you’re adding a key value to a dictionary (named “com.apple.AppleShareClientCore”), and -int means you’re adding an integer value.
This may be the best solution for your environment, and is a good place to start testing before you move onto the next method…
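If you do go the manual route, a small wrapper like this sketch lets you preview the commands before anything is written. DEFAULTS_CMD and apply_afp_wan_settings are hypothetical names introduced for illustration; only the defaults invocation itself comes from the article.

```shell
#!/bin/sh
# Sketch: write the two AFP WAN settings into the global domain.
# DEFAULTS_CMD and apply_afp_wan_settings are hypothetical names; set
# DEFAULTS_CMD="echo defaults" to preview the commands without writing.
DEFAULTS_CMD=${DEFAULTS_CMD:-defaults}

# usage: apply_afp_wan_settings <quantum in bytes> <threshold>
apply_afp_wan_settings() {
    quantum=$1
    threshold=$2
    $DEFAULTS_CMD write -g com.apple.AppleShareClientCore \
        -dict-add afp_wan_quantum -int "$quantum" &&
    $DEFAULTS_CMD write -g com.apple.AppleShareClientCore \
        -dict-add afp_wan_threshold -int "$threshold"
}
```

For the first scenario you would call apply_afp_wan_settings 131072 1000, and for the second apply_afp_wan_settings 8192 200.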
2) Use MCX to manage the settings.
This is the way I’ve done it, as it has some nice side effects, and is more The Apple Way™. As with all such settings, you can choose to do this at the user or group level. I’ve chosen to do it to my two main groups that cover all my staff and all my students.
- Open up Workgroup Manager, and choose the group or user you wish to apply these settings to. Click on the toolbar item ‘Preferences’ and then the tab ‘Details’ to the right.
- Click on the ‘Add’ button, and navigate to your own home folder. Choose “.GlobalPreferences.plist”. You’ll now see it appear in the Preferences pane.
- Double click on the .GlobalPreferences item to edit it. You’ll notice that the settings have been put into ‘Often’ rather than ‘Always’ or ‘Once’. I’ve had some really odd behaviour with trying to get the global defaults domain to be managed ‘Always’, and ‘Often’ has been working happily for me, so let’s leave it set that way.
- If you have any other items other than the ‘com.apple.AppleShareClientCore’ dictionary, delete them, unless of course you wish to manage those settings for your users as well. Click on the triangle next to the AppleShareClientCore dictionary to expand it.
- Here you can see the relevant settings. Change the afp_wan_threshold and afp_wan_quantum values to the ones you’ve decided to use. Click on ‘Apply Now’.
Done! A really nice side effect of doing things this way is that you can now centrally manage a bunch of other afp connection settings, and perhaps most usefully, you can turn on AFP client side debugging using MCX for all your users. The AFP server itself isn’t particularly useful at giving debugging info, but the client is actually quite good. The footnote at the bottom will explain how to turn on client-side debugging.
There are a few oddities here though. It seems like global preferences shouldn’t take effect until the home directory is actually mounted, but perhaps by putting the prefs into MCX, we’re having them take effect before the actual home directory has mounted… ?
If anyone goes down the manual path, I’d like to hear whether you’re finding that these settings aren’t taking effect.
Since I applied these changes, I’ve seen a drastic change in stability and performance under load. However… resolving this issue has pointed to another problem with Apple’s default settings, and I’ve noticed that the server is still slowing down under much heavier load, and still not using all the resources available to it, which brings us to…
2) Tweaking the maximum # of threads on the server
The AFP server saw a change in 10.4, where permissions are now calculated by spawning a new thread for each connection with the effective uid of the connecting user. The problem is….
root@server: ~ $ serveradmin settings afp:maxThreads
afp:maxThreads = 40
That seems kind of low given the above, right? Apparently we should have at least one thread per client session, and this would include all your automounts…
I’ve set mine to 600 here like this:
sudo serveradmin settings afp:maxThreads=600
This is something that again will depend upon what else your AFP server is doing (ideally not much…) and how many clients and automounts you have.
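A small sketch for sanity-checking the limit against the number of connections the server is actually handling. The helper names parse_max_threads and threads_exhausted are hypothetical, and the comparison leans on the "one thread per client session" rule of thumb above, which is hearsay rather than documented behaviour.

```shell
#!/bin/sh
# Sketch: compare a connection count with the configured thread limit.
# Both helper names are hypothetical.

# Extract the number from a line such as "afp:maxThreads = 40" on stdin.
parse_max_threads() {
    sed -n 's/^afp:maxThreads *= *\([0-9][0-9]*\).*$/\1/p'
}

# Succeeds when the connection count exceeds the thread limit.
# usage: threads_exhausted <connections> <max threads>
threads_exhausted() {
    connections=$1
    max_threads=$2
    [ "$connections" -gt "$max_threads" ]
}
```

You could feed it with max=$(sudo serveradmin settings afp:maxThreads | parse_max_threads) and then test threads_exhausted against whatever connection count you measure on your own server.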
Footnote: Working out the latency of a client connection and AFP client side debugging.
You may have noticed two debug settings in the AppleShareClientCore dictionary as you were applying those settings.
afp_debug_level = 6
afp_debug_syslog = 0
If you set afp_debug_syslog to 1 (true), and add the following line to /etc/syslogd.conf,
*.debug /var/log/debug.log
then if you do:
sudo killall -HUP syslogd
then you’ll see a wealth of debugging info go to /var/log/debug.log (how much depends upon the value, from 1 to 8, that you’ve set for afp_debug_level).
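If you end up scripting the syslogd.conf change, this sketch adds the rule only when it is missing, so running it twice does no harm. The helper name add_debug_rule is hypothetical; the rule text is the one given above.

```shell
#!/bin/sh
# Sketch: append the debug rule to a syslogd.conf-style file exactly once.
# add_debug_rule is a hypothetical helper name.
add_debug_rule() {
    conf_file=$1
    rule='*.debug /var/log/debug.log'
    # -x matches the whole line, -F treats the rule as a fixed string.
    if grep -qxF "$rule" "$conf_file"; then
        return 0
    fi
    printf '%s\n' "$rule" >> "$conf_file"
}
```

On a client this would be run as root against /etc/syslogd.conf, followed by the killall -HUP syslogd step.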
This isn’t something you should just willy-nilly turn on for all your clients, as tempting as it may be to finally have debug info for AFP… Rather use it to have a look at what goes on when a specific client connects to an AFP mount. You might be surprised, the debugging info is quite good, and much more useful than the kinds of stuff we get server-side.
You’ll notice if you have the debug level up quite high that you’ll see a bunch of times reported to the debug log. I have yet to work out how to interpret those times to get the latency of a given client connection, and this is something I’d like to know so that I can tune ‘real’ values for my afp_wan_threshold.
So go forth afp548’ers! and work out how to do it. 🙂
As always, I’m not responsible for your server spontaneously combusting, your mileage may vary and this article may contain traces of nuts.
Many thanks must go to the AFP548.com posse, Joel ‘n’ Josh for extensively talking through these issues with me, as well as bringing them to my attention…
These are good tips, and some are applicable for NFS and SMB home directories
as well. We use NFS home dirs, and I have also redirected ~/Library/Caches to
/Library/Caches/<username>. I’d recommend the same for SMB home dirs –
there is no reason to be storing caches on the network store.
I’d like to see Apple actually have a checkbox in Workgroup Manager for "Store
Caches on Network Home Directory", somewhere in the preferences
management.
Apple have all the bits and pieces to do this far more elegantly than with a
LoginHook… and as you say, there is absolutely no reason why you want Caches
to be on the network… the only reason I can possibly think of is for security
purposes… which is why they should flush more often. 🙂
I was informed by an Apple representative that if we move from an Xserve
G4 to the newer dual processor G5, that my AFP problems will be corrected.
Will a new G5 Xserve correct the AFP issues, or am I just throwing $$$ at them? We have G5
servers in our two middle schools and don’t seem to have these AFP problems
there. Trying to make an informed decision whether to spend the money or
not.
It depends. 🙂 Throwing more CPU power at the AFP server will probably help,
but it really depends upon the load you’re seeing on those servers. Is it the
same for both?
It depends on the issues you’re seeing and the load, as well as the capabilities
of the server in terms of CPU, RAM and disk speed.
I was thinking about posting some metrics, but you know, it’s almost
pointless in some ways.
My student usage patterns fluctuate wildly, and it’s really tough to use figures
like CPU usage or anything to do this stuff.
I’ll sketch out what my student server supports:
Xserve G5 2x2GHz G5.
4GB RAM
Direct FC connection to an Xserve RAID.
2 network home directory automounts for student home directories.
1 equivalent of a ‘group folder’ sharepoint where Classwork data is stored.
1 ‘transit’ sharepoint that wipes itself out each night and is used for
transferring large files for one-off occasions.
GigE connection to our Catalyst switch.
All clients are on 100Mbit and range from eMacs to G5 towers and everything in
between.
I have approximately 2000 student accounts, and the max is about 150
students logging in with network home directories at any one time. However,
as these are automounts, the other 300 or so machines I have on campus
also mount them, and many staff regularly use the Classwork sharepoint from
their machines.
The best metric I came up with really was login times.
Before the afp_wan_threshold fix, login times would be normal at the
beginning of the day (normal for us is about a minute at most, due to
reasonably complex login hooks and quite a few MCX settings), but then
would slowly increase over the course of the day.
About once every two days, (and sometimes more often), AFP would crawl to
a halt, and login times would stretch out to maybe 10 minutes. Once more
than a handful of students started experiencing login times of that length, the
next few logins could take literally forever. I saw two students take 50 minutes
to log in at one point.
Restarting the AFP service would instantly fix it, but often I had assessments
that I couldn’t interrupt. 🙁
After I applied the afp_wan_threshold fix, things drastically improved, until I
hit a point of over 100 students logging in at once with network homes and
copying data from Classwork to the local Storage partitions.
After I applied the maxThreads fix, I have seen no issues whatsoever
(frantically touches wood), and have had upwards of 400 simultaneous
connections chugging along without problems.
(goes to check the server right now…)
root@server: ~ $ netstat -a |grep -c afp
268
so that’s 268 simultaneous connections.
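Note that grepping for "afp" also counts listening sockets and anything else with afp in the line. A slightly stricter sketch, assuming netstat shows the service name afpovertcp and a state column containing ESTABLISHED; the helper name count_established_afp is hypothetical.

```shell
#!/bin/sh
# Sketch: count established AFP connections from netstat output on stdin.
# count_established_afp is a hypothetical helper name.
count_established_afp() {
    # Keep only AFP-over-TCP lines, then count those in ESTABLISHED state.
    grep afpovertcp | grep -c ESTABLISHED
}
```

Usage would be netstat -a | count_established_afp on the server.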
Actually, I’ll have a quick squiz at my cacti install (article coming soon, I promise…) and get some throughput figures
for you…
The student server tends to have peak outbound traffic of about 60 Mbits/sec,
and peak inbound traffic of about 30 Mbits/sec.
Current load averages:
load averages: 0.85 0.97 0.85
To get the Library redirect loginhook to work I had to replace the sed command
with
sed 's/homeDirectory: //'
otherwise, great article, thank you!
I think the mail archives put a space in there where there shouldn’t be, as it’s not
there on my original sent mail, but you can use either “|” or “/” with sed.
I often use “|” instead as it makes life easier when you’re working with paths, so
you don’t have to get confused escaping out ‘real’ “/” characters.
I can try everything, I get an "Invalid path" error every time for this line:
home_loc=$(dscl /Search -read /Users/$1 homeDirectory | sed ‘s|
homeDirectory: ||g’)
(all in one line)
trying on 10.3.9
any ideas?
Thanks in advance
svenc
awesome. truly awesome.
I should add that further testing seems to be showing another limit of just
over 400 simultaneous connections in my environment. If I go over this, I start
to get the death spiral slowdown issues again…. I haven’t had time to work this
out exhaustively, but I have a feeling there’s a threading issue… sc_usage
seems to show some odd results when sampling an unhappy AppleFileServer process.
I really need to cut down on the number of automounts I have on this box, but
that’s primarily been dictated by the Xserve RAID… I haven’t moved to XSan,
and physical disks mean that I need to have a couple more automounts than are
really desirable.
If I redirect the ~/Library/Caches folder of a network user to the local hard drive, MS Word has a problem saving files. When saving an open file for the second time, I always get an error that Word can’t save the file because of a naming or permissions error. We run 10.4.6 on server and client, and Word 10.2.3. If I delete the symlink, saving works fine again.
I haven’t made much sense of this yet, but a quick work around is save again.
The first time you attempt a save after changes are made, it fails with the error
you mentioned, but a second save before you make any changes will save
correctly with the correct file name.
Any other true fixes?
When using MCX to manage the .GlobalPreferences.plist, to just modify those
two afp_wan settings I would just delete the rest of the keys in WGM, correct?
Also, are there any known issues with pushing prefs when some clients are still
10.3.x? Those clients should just ignore this part of MCX settings, right?
That will be fine just running those two settings.
I haven’t tested with 10.3.x much, as we’ve moved completely, but at worst it
would just ignore those settings. It doesn’t appear to show up in the defaults.
I’ve been getting quite a few emails from people, so just wanted to post that I’m
working on a followup for this article, as resolving these particular issues caused
some more to appear… mainly related to idle connections.
Very useful article.
Are there also debug settings for the Apple Filing Protocol server log?
afp:TCPQuantum seemed to make almost no difference from testing, so I didn’t
mention it in the article.
It’s possible that you could get better performance with a smaller number of
clients all on Gig-E to the server if you increased it though.
I think the values for scenario 1 are aggressive too… That’s why I qualified
them 🙂
In the case of not supporting WAN clients, you can have aggressive values, as
you’re trying to enforce LAN settings for everyone.
If you have some WAN clients who are using modem connections, I would
suggest that 8k datagrams are desirable, not 32k as Apple have suggested.
That setting would seem most appropriate for remote DSL/cable connections,
or anything faster than a modem. I would say that 32k is somewhat
aggressive for modem users.
I find it curious that you’re being told they don’t have a value for the
threshold. That is the critical setting as far as I can tell. I just wish I had a
better way of measuring the real latency reported by a client.
If Apple think that 500-600 threads is too high, and yet the server is
supposed to support up to 1000 simultaneous connections (not logged in
users, *connections*), then I suggest that points to something being wrong
with the AFP server.
I’m not having issues running at 600. If you don’t support that many
connections at once, don’t set it that high.
I need to get going on part #2 of this article, which will be much shorter, but
will essentially talk about how I’ve been seeing issues with session caching and
idle user connections. Disabling session caching and disconnecting idle users
for my servers where students are logging in and out every couple of hours
seems to have finally gotten me to a point where I no longer have to
pre-emptively restart that AFP server every week or two.
thanks for the detailed comment.
Hi Nigel,
very nice article. I will try some of these tips…
We have an Xserve DP 2.3 3GB RAM, 10.3.9 Server, Xserve RAID (RAID-Level
5) and about 25 Clients (G5 mixed up with 10.3.9 and 10.4.6).
The network is completely on Gigabit Ethernet (only 2 iMacs are running on
100Mbit). We use Linksys switches with Ethernet link aggregation, so the
connection between the two switches is 2Gbit. After changing to 10.4.x Server,
the Xserve will also be connected with 2 x Gigabit. During the upgrade we will
also change the RAID from Level 5 to Level 50.
All users have home directories on the server, because the Macs are switched
quite often and a lot of freelancers are working here. We don’t want any
critical files to be saved on the clients (all files related to the company have to
be stored exclusively on the RAID).
Mobile homedirectories are not an option. The sync does not work properly
(we tested this in a test-environment). Also the "company-rules" (see above)
do not allow local storage of business files.
E-Mail is distributed via IMAP and all Clients are working with Apples Mail.app.
Sometimes the Xserve also gets extremely slow (AFP load about 90%).
Looking in the AFP access log, I see a lot of activity from 1 or 2 clients. Very
often the clients are loading many fonts from the server, or Mail seems to be
indexing a lot of mail (we keep a copy of the mail, without attachments, in the
home directories, because without the local copy, searches of the mail content
do not work). It looks like the Xserve RAID (which also shows a huge load at
these times) has a problem with many small files if they’re accessed in large
numbers, especially if a file is read from the RAID and written back to the RAID
again (indexing IMAP folders from the RAID and writing the index to the
home directory, which is on the same RAID).
The clients with Tiger are also indexing the network volumes (shared via
AFP) with Spotlight (although a lot of manuals say that Spotlight wouldn’t do this,
it does!). This keeps the load really high. I don’t understand why Apple
does not provide a solution for Spotlight in a network environment….
I tried to redirect the ~/Library/Caches to /Library/Caches/User but I can not
set up the command "defaults write com.apple.loginwindow ……" on the
10.3.9-clients (error: "Domain com.apple.loginwindow does not exist")
I will try this again with the manual provided from Apple KB:
http://docs.info.apple.com/article.html?artnum=301446
I hope we can solve this annoying slow-down-problem with one of Your
tricks… the changes with afp_wan and AppleShareClientCore didn´t make any
difference…
Thanks
svenc
UPDATE for 10.5
These items are no longer in com.apple.AppleShareClientCore but are now in /Library/Preferences/com.apple.AppleShareClient
So to update the commands from the article:
defaults write /Library/Preferences/com.apple.AppleShareClient -dict-add afp_wan_threshold -int 1000
defaults write /Library/Preferences/com.apple.AppleShareClient -dict-add afp_wan_quantum -int 131072
You may still be able to put them in the global domain? ie defaults write -g com.apple.AppleShareClient but I am not sure…..they certainly aren’t there to begin with…
HELLO! Great article!
Do you know how I can tune the automount timeout for failed mounts, from 120s to 10s? (Try it! on your afp server setup a firewall to prevent a particular afp client from using the service for user home directories.. and then ssh into that user on the afp client.. you’ll see that it takes 120s before you’re allowed to log in homelessly but successfully).
I’ve naturally tried looking at /etc/autofs.conf and /Library/Preferences/com.apple.AppleFileServer.plist in addition to /System/Library/LaunchDaemons/com.apple.automountd.plist all with no success 🙁
Any ideas?
Running 10.6 here and was wondering if any of the concepts above continue to be necessary? Could someone explain the maxThreads issue in more detail? For instance when I use top, I see that my AppleFileServer is spawning 203 threads of which only 2-5 seem to be active at any time. If only 2-5 are active, do I need to spawn more threads?