Articles March 29, 2006 at 10:30 pm

AFP. It ain’t so bad….

From looking at the mailing lists and discussion boards, you’ll see that a reasonable number of people seem to be having problems with their AFP servers under OS X Server 10.4.x, particularly under heavy load with network home directories, and particularly in terms of stability. I used to be one of those people, but the solutions presented in this article have (touch wood) resolved the vast majority of my issues.

Read on for all your AFP tuning needs…

[Ed.] See part #2 of this article

Whereas under 10.3 I used to get a symptom where the whole server would lock up and not even be pingable, the symptoms I’ve seen under 10.4.x have been entirely related to the AFP server. Login times will get longer and longer, access speed becomes slower and slower, and eventually the AFP server enters a death spiral, where the only recourse seems to be restarting the service.

I’m going to quickly cover some of the more commonly known fixes and workarounds that I’d tried before these major solutions.

1) Make sure DNS is working with proper forward and reverse entries for your servers at the bare minimum, and preferably all your client machines as well.

2) Redirect ~/Library/Caches to a local folder for network home directory users.
This makes quite a big difference. If you start watching your AFP server logs, you’ll notice that your network home directory users hit the cache a lot, particularly for apps like Safari. We do this with a login hook that redirects ~/Library/Caches to /Library/Caches/username/. I posted to the macos-x-server list about this a while ago.
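As a rough sketch only (this is not the script that was posted to the list), a hook along these lines could do the redirect. The helper name `redirect_caches` and its three-argument form are assumptions made for illustration. A real LoginHook runs as root, receives the short username as `$1`, would look the home directory up itself, and would also `chown` the local folder to the user.

```shell
#!/bin/sh
# Sketch: point a user's ~/Library/Caches at a per-user local folder.
# redirect_caches is a hypothetical helper name, not part of OS X.
#   $1 = short username, $2 = home directory, $3 = local cache root
redirect_caches() {
    user=$1
    home=$2
    root=${3:-/Library/Caches}

    # Refuse to run with an empty username or a missing home directory,
    # since the function removes a directory further down.
    [ -n "$user" ] && [ -d "$home" ] || return 1

    local_dir="$root/$user"
    link="$home/Library/Caches"

    # Create the local cache folder, private to its owner.
    mkdir -p "$local_dir" || return 1
    chmod 700 "$local_dir"

    # Already redirected: nothing to do.
    if [ -L "$link" ]; then
        return 0
    fi

    # Throw away the network copy of the cache, then link to the local one.
    mkdir -p "$home/Library" || return 1
    rm -rf "$link"
    ln -s "$local_dir" "$link"
}
```

Caches are disposable by design, which is why the sketch deletes the network copy rather than migrating it.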

3) Disable creation of .DS_Store files for network mounts.
There is an Apple techinfo article up about this. From debugging, it looks like a corrupt .DS_Store file can create all sorts of problems, particularly if you have a setup like mine, where students have read-only access to certain network folders, and lecturers have write-access. I’ve used MCX to push out the above setting, and then run a command like this to delete all existing .DS_Store files on the sharepoint:

sudo -s
find /Volumes/someshare -name ".DS_Store" -print0 | xargs -0 rm

The problem with this solution is that your users will get a bit narky about no longer being able to have window settings ‘stick’ across sessions…. which my secretaries are still unhappy about…

4) Check the integrity of your filesystem.

5) Put more RAM in your server.
I was previously looking at swapfile creation and vm_stat to work out whether my AFP servers needed more RAM, however the problem seems to be that the AFP server tries to avoid swapping to disk, and so these aren’t good metrics to work from. As soon as I stuck a couple more gig into my servers, they started using it…

6) Move heavy users to Mobile Accounts with Portable Home Directories.
This isn’t applicable in all situations, and there’s no way I could move all my students to Mobile Accounts, but for my staff members who use Office and Mail heavily, performance on a network home directory isn’t that great, even on a fast connection. If you have users who primarily use a single machine, Mobile Accounts where you control the synchronisation settings via MCX give you the best of both worlds.

7) Increase the sysctl parameters for max files and max files per process.
This is a more general tuning tip for OS X Server, but as Rob Middleton pointed out to me, without it, if you have AFP error logging on, it helpfully logs several hundred lines per second to inform you that it can’t open files… filling up your primary partition rather quickly…
Create /etc/sysctl.conf with the following parameters:


kern.maxfiles=200000

kern.maxfilesperproc=50000

If you want to apply those settings immediately, rather than waiting for a reboot you can do:


for i in $(cat /etc/sysctl.conf); do sysctl -w $i; done
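That one-liner assumes the file contains nothing but bare key=value lines. If your /etc/sysctl.conf might pick up comments or stray spaces, a slightly more tolerant sketch could look like this. The helper names `sysctl_pairs` and `apply_sysctl_conf` are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the key=value settings found in a sysctl.conf-style file,
# dropping "#" comments, blank lines and any whitespace around the "=".
# sysctl_pairs is a hypothetical helper name.
sysctl_pairs() {
    sed -e 's/#.*$//' -e 's/[[:space:]]//g' "$1" | grep '='
}

# Apply every setting in the file. Meant to be run as root.
apply_sysctl_conf() {
    sysctl_pairs "$1" | while read -r pair; do
        sysctl -w "$pair"
    done
}
```

Running `sysctl_pairs /etc/sysctl.conf` on its own first is a cheap way to check what would be applied before calling `apply_sysctl_conf /etc/sysctl.conf`.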

Now onto the more interesting fixes…

1) Tweaking the WAN threshold and packet size on the clients

If you run this command on a client machine, you can see various AFP client tuning parameters. Some of your settings may be slightly different to these.

  $ defaults read -g com.apple.AppleShareClientCore
  {
      "afp_active_timeout" = 0; 
      "afp_authtype_show" = 0; 
      "afp_cleartext_allow" = 1; 
      "afp_cleartext_warn" = 1; 
      "afp_debug_level" = 6; 
      "afp_debug_syslog" = 0; 
      "afp_default_name" = ""; 
      "afp_idle_timeout" = 0; 
      "afp_keychain_add" = 1; 
      "afp_keychain_search" = 1; 
      "afp_login_displayGreeting" = 1; 
      "afp_maxDirCache" = 60; 
      "afp_maxFileCache" = 60; 
      "afp_minDirCache" = 5; 
      "afp_minFileCache" = 5; 
      "afp_mount_defaultFlags" = 0; 
      "afp_no_kQueues" = 0; 
      "afp_no_volChange_caching" = 1; 
      "afp_prefs_version" = 2; 
      "afp_reconnect_allow" = 1; 
      "afp_reconnect_interval" = 10; 
      "afp_reconnect_retries" = 12; 
      "afp_ssh_allow" = 0; 
      "afp_ssh_force" = 0; 
      "afp_ssh_require" = 0; 
      "afp_ssh_warn" = 1; 
      "afp_use_default_name" = 0; 
      "afp_use_short_name" = 0; 
      "afp_voldlog_skipIfOnly" = 0; 
      "afp_wan_quantum" = 8192; 
      "afp_wan_threshold" = 30; 
  }
  

Of particular interest are these two settings, ‘afp_wan_quantum’ and ‘afp_wan_threshold’. The values I’ve shown here should be the defaults, but you may have a setting of 0 for both of them. These are used by the client to work out whether a particular AFP connection is over a LAN or WAN connection. If the latency of a given connection is higher than the value in afp_wan_threshold, then the data chunk size drops from 128KiB (the default for a LAN connection) to 8KiB, the setting shown here.

The problem seems to be that this default threshold setting is way too low, and once the AFP server starts experiencing moderate load, your LAN clients start using the WAN data chunk size. Although smaller chunks are desirable for slow connections, they induce an overhead on the server in terms of processing as the server is dealing with 16x the number of chunks, and reduce overall throughput. A symptom of this is high CPU usage.
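To put a number on that overhead, here is a small worked example. The helper name `chunks_for` and the 10 MiB file size are purely illustrative; the two chunk sizes are the 128KiB and 8KiB figures quoted above.

```shell
#!/bin/sh
# Sketch: how many chunks does a transfer need at a given chunk size?
# chunks_for is a hypothetical helper; both sizes are in bytes.
chunks_for() {
    size=$1
    quantum=$2
    # Round up so that a partial final chunk is counted.
    echo $(( ($size + $quantum - 1) / $quantum ))
}

# A 10 MiB file at the LAN and WAN chunk sizes.
file_size=$((10 * 1024 * 1024))
lan=$(chunks_for "$file_size" $((128 * 1024)))
wan=$(chunks_for "$file_size" $((8 * 1024)))
echo "LAN chunks: $lan"
echo "WAN chunks: $wan"
echo "ratio: $((wan / lan))x"
```

This prints 80 chunks for the LAN size, 1280 for the WAN size, and a ratio of 16x, which is the extra per-chunk work the server takes on for every client that has been pushed onto the WAN setting.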

The good news is that we can change these settings. The values you choose are up to you, and depend largely upon whether you actually do have WAN clients to support, so I’m going to suggest two scenarios:

First scenario: No WAN clients (cause AFP sucks over slow connections anyway 🙂 )

    afp_wan_quantum = 131072
    afp_wan_threshold = 1000
    

This is a bit of a shotgun approach. Bring the latency threshold way up, and even if the clients reach this threshold, force their chunk size to be the same as that of a LAN client. This is what I’ve done in my environment, mainly because I wanted to make sure that I could rule out this issue. As time goes on, and once I work out the correct way to measure latency for a client (see the footnote at the end), I’ll come up with more sane values, but at this stage, I just needed to stop my lecturers from revolting en masse due to AFP stability issues.

Second scenario: Some WAN clients who use AFP (I feel their pain…)

  afp_wan_quantum = 8192
  afp_wan_threshold = 200
  

This is very much a guesstimate on my part. I’ve done a few tests at a threshold of 100, and I’ve found that LAN clients were still getting the WAN chunk size when I set the threshold at 100. Again, see the footnote at the end, as I’d like to find out a way of getting an accurate picture of the latency a client is experiencing.

So as far as this setting goes, the good news (for me) is that it has almost entirely resolved performance issues with the AFP server under moderate load. I’ve also been experimenting with the maxFileCache and maxDirCache settings, and even when I’ve been increasing them from 60 to 6000, I’ve been unable to replicate any stale cache issues. If you so desire, try playing around with those settings.

Applying these settings to a client machine.

Well, there are a couple of options you have here.

1) Apply the settings manually.

One way would be to use a LoginHook to set these parameters using the defaults command. ie, the first scenario above could be done with the following two commands:

    defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_quantum -int 131072
    defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_threshold -int 1000
  

The “-g” refers to the fact that we’re writing these settings into the Global Domain, (for a user, at ~/Library/Preferences/.GlobalPreferences.plist), -dict-add means you’re adding a key value to a dictionary (named “com.apple.AppleShareClientCore”), and -int means you’re adding an integer value.

This may be the best solution for your environment, and is a good place to start testing before you move onto the next method…
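If you end up scripting this, one possible approach is to generate the two commands from a chunk size given in KiB, so the byte value (131072 for 128KiB) is computed rather than typed. The helper name `afp_wan_commands` is an assumption for illustration; the domain and key names are the ones shown above.

```shell
#!/bin/sh
# Sketch: print the two "defaults write" commands for a chosen chunk size
# and latency threshold. afp_wan_commands is a hypothetical helper name.
#   $1 = chunk size in KiB, $2 = value for afp_wan_threshold
afp_wan_commands() {
    # Compute the byte value instead of typing it by hand.
    quantum=$(( $1 * 1024 ))
    threshold=$2
    domain=com.apple.AppleShareClientCore
    echo "defaults write -g $domain -dict-add afp_wan_quantum -int $quantum"
    echo "defaults write -g $domain -dict-add afp_wan_threshold -int $threshold"
}

# First scenario (no WAN clients): 128 KiB chunks, threshold of 1000.
afp_wan_commands 128 1000
```

The function only prints the commands, so you can read them over first. On a client, piping the output to `sh` as the target user would actually apply them.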

2) Use MCX to manage the settings.

This is the way I’ve done it, as it has some nice side effects, and is more The Apple Way™. As with all such settings, you can choose to do this at the user or group level. I’ve chosen to do it to my two main groups that cover all my staff and all my students.

  • Open up Workgroup Manager, and choose the group or user you wish to apply these settings to. Click on the toolbar item ‘Preferences’ and then the tab ‘Details’ to the right.
  • Click on the ‘Add’ button, and navigate to your own home folder. Choose “.GlobalPreferences.plist”. You’ll now see it appear in the Preferences pane.
  • Double click on the .GlobalPreferences item to edit it. You’ll notice that the settings have been put into ‘Often’ rather than ‘Always’ or ‘Once’. I’ve had some really odd behaviour with trying to get the global defaults domain to be managed ‘Always’, and ‘Often’ has been working happily for me, so let’s leave it set that way.
  • If you have any other items other than the ‘com.apple.AppleShareClientCore’ dictionary, delete them, unless of course you wish to manage those settings for your users as well. Click on the triangle next to the AppleShareClientCore dictionary to expand it.
  • Here you can see the relevant settings. Change the afp_wan_threshold and afp_wan_quantum values to the ones you’ve decided to use. Click on ‘Apply Now’.

Done! A really nice side effect of doing things this way is that you can now centrally manage a bunch of other afp connection settings, and perhaps most usefully, you can turn on AFP client side debugging using MCX for all your users. The AFP server itself isn’t particularly useful at giving debugging info, but the client is actually quite good. The footnote at the bottom will explain how to turn on client-side debugging.

There are a few oddities here though. It seems like global preferences shouldn’t take effect until the home directory is actually mounted, but perhaps by putting the prefs into MCX, we’re having them take effect before the actual home directory has mounted… ?

If anyone goes down the manual path, I’d like to hear if you’re finding that these settings aren’t taking effect.

Since I applied these changes, I’ve seen a drastic change in stability and performance under load. However… resolving this issue has pointed to another problem with Apple’s default settings, and I’ve noticed that the server is still slowing down under much heavier load, and still not using all the resources available to it, which brings us to…

2) Tweaking the maximum # of threads on the server

The AFP server saw a change in 10.4, where permissions are now calculated by spawning a new thread for each connection with the effective uid of the connecting user. The problem is….

    $ serveradmin settings afp:maxThreads
    afp:maxThreads = 40
  

That seems kind of low given the above, right? Apparently we should have at least one thread per client session, and this would include all your automounts…

I’ve set mine to 600 here like this:

    sudo serveradmin settings afp:maxThreads=600
  

This is something that again will depend upon what else your AFP server is doing (ideally not much…) and how many clients and automounts you have.
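For a very rough starting figure, you could work from the one-thread-per-session rule of thumb. The sketch below assumes, pessimistically, that every automount on every machine counts as its own session, and adds an arbitrary headroom percentage. Neither assumption comes from Apple, and `estimate_threads` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: rough lower bound for afp:maxThreads, using the rule of thumb of
# one thread per client session. estimate_threads is a hypothetical helper
# and the headroom percentage is an arbitrary assumption.
#   $1 = client machines, $2 = automounts each machine holds open,
#   $3 = headroom in percent
estimate_threads() {
    sessions=$(( $1 * $2 ))
    echo $(( $sessions + $sessions * $3 / 100 ))
}

# Hypothetical example: 150 machines, 3 automounts each, 25% headroom.
estimate_threads 150 3 25
```

With those made-up inputs the result is 562. Treat it as a sanity check on whatever value you pick, not as a formula to trust.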

Footnote: Working out the latency of a client connection and AFP client side debugging.

You may have noticed two debug settings in the AppleShareClientCore dictionary as you were applying those settings.

    afp_debug_level = 6
    afp_debug_syslog = 0
  

If you set afp_debug_syslog to 1 (true), and add the following line to /etc/syslog.conf,

  *.debug                                                 /var/log/debug.log
  

then if you do:


sudo killall -HUP syslogd

then you’ll see a wealth of debugging info (depending upon the value from 1 to 8 you’ve set for afp_debug_level) go to /var/log/debug.log
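If you script the client setup, it helps to add the syslog rule only when it is missing, so that re-running the script doesn't pile up duplicate lines. A minimal sketch, with `ensure_line` as a made-up helper name and `$conf` standing in for the path to your syslog configuration file:

```shell
#!/bin/sh
# Sketch: append a line to a config file only if it is not already there.
# ensure_line is a hypothetical helper name.
#   $1 = file to edit, $2 = exact line to add
ensure_line() {
    file=$1
    line=$2
    # -F: fixed string, -x: the whole line must match, -q: no output.
    if ! grep -Fxq -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# On a client, as root, with $conf set to the syslog configuration file:
#   ensure_line "$conf" "$(printf '*.debug\t/var/log/debug.log')"
#   killall -HUP syslogd
```

The match is on the exact line, so a rule that differs only in spacing would be treated as missing and appended again.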

This isn’t something you should just willy-nilly turn on for all your clients, as tempting as it may be to finally have debug info for AFP… Rather use it to have a look at what goes on when a specific client connects to an AFP mount. You might be surprised, the debugging info is quite good, and much more useful than the kinds of stuff we get server-side.

You’ll notice if you have the debug level up quite high that you’ll see a bunch of times reported to the debug log. I have yet to work out how to interpret those times to get the latency of a given client connection, and this is something I’d like to know so that I can tune ‘real’ values for my afp_wan_threshold.

So go forth afp548’ers! and work out how to do it. 🙂

As always, I’m not responsible for your server spontaneously combusting, your mileage may vary and this article may contain traces of nuts.

Many thanks must go to the AFP548.com posse, Joel ‘n’ Josh for extensively talking through these issues with me, as well as bringing them to my attention…

1 Comment

  • These are good tips, and some are applicable for NFS and SMB home directories
    as well. We use NFS home dirs, and I have also redirected ~/Library/Caches to
    /Library/Caches/<username>. I’d recommend the same for SMB home dirs –
    there is no reason to be storing caches on the network store.

    • I’d like to see Apple actually have a checkbox in Workgroup Manager for "Store
      Caches on Network Home Directory", somewhere in the preferences
      management.

      Apple have all the bits and pieces to do this far more elegantly than with a
      LoginHook… and as you say, there is absolutely no reason why you want Caches
      to be on the network… the only reason I can possibly think of is for security
      purposes… which is why they should flush more often. 🙂

  • I was informed by an Apple representative that if we move from an Xserve
    G4 to the newer dual processor G5, that my AFP problems will be corrected.
    Will a new G5 Xserver correct AFP issues??? Am I throwing $$$. We have G5
    servers in our two middle schools and don’t seem to have these AFP problems
    there. Trying to make an informed decision whether to spend the money or
    not.

    It depends. 🙂 Throwing more CPU power at the AFP server will probably help,
    but it really depends upon the load you’re seeing on those servers. Is it the
    same for both?

    It depends on the issues you’re seeing and the load, as well as the capabilities
    of the server in terms of CPU, RAM and disk speed.

  • I was thinking about posting some metrics, but you know, it’s almost
    pointless in some ways.

    My student usage patterns fluctuate wildly, and it’s really tough to use figures
    like CPU usage or anything to do this stuff.

    I’ll sketch out what my student server supports:

    Xserve G5 2x2Ghz G5.
    4Gb RAM
    Direct FC connection to an Xserve RAID.
    2 network home directory automounts for student home directories.
    1 equivalent of a ‘group folder’ sharepoint where Classwork data is stored.
    1 ‘transit’ sharepoint that wipes itself out each night and is used for
    transferring large files for one-off occasions.
    GigE connection to our Catalyst switch.

    All clients are on 100Mb and range from eMacs to G5 towers and everything in
    between.

    I have approximately 2000 student accounts, and the max is about 150
    students logging in with network home directories at any one time. However,
    as these are automounts, the other 300 or so machines I have on campus
    also mount them, and many staff regularly use the Classwork sharepoint from
    their machines.

    The best metric I came up with really was login times.

    Before the afp_wan_threshold fix, login times would be normal at the
    beginning of the day (normal for us is about a minute at most, due to
    reasonably complex login hooks and quite a few MCX settings), but then
    would slowly increase over the course of the day.

    About once every two days, (and sometimes more often), AFP would crawl to
    a halt, and login times would stretch out to maybe 10 minutes. Once more
    than a handful of students started experiencing login times of that length, the
    next few logins could take literally forever. I saw two students take 50 minutes
    to login at one point.

    Restarting the AFP service would instantly fix it, but often I had assessments
    that I couldn’t interrupt. 🙁

    After I applied the afp_wan_threshold fix, things drastically improved, until I
    hit a point of over 100 students logging in at once with network homes and
    copying data from Classwork to the local Storage partitions.

    After I applied the maxThreads fix, I have seen no issues whatsoever
    (frantically touches wood), and have had upwards of 400 simultaneous
    connections chugging along without problems.

    (goes to check the server right now…)


    $ netstat -a |grep -c afp
    268

    so that’s 268 simultaneous connections.

    Actually, I’ll have a quick squiz at my cacti install (article coming soon, I promise…) and get some throughput figures
    for you…

    The student server tends to have peak outbound traffic of about 60 Mbits/sec,
    and peak inbound traffic of about 30 Mbits/sec.

    Current load averages:


    load averages: 0.85 0.97 0.85

  • To get the Library redirect loginhook to work I had to replace the sed command
    with

    sed 's/homeDirectory: //'

    otherwise, great article, thank you!

    • I think the mail archives put a space in there where there shouldn’t be, as it’s not
      there on my original sent mail, but you can use either “|” or “/” with sed.

      I often use “|” instead as it makes life easier when you’re working with paths, so
      you don’t have to get confused escaping out ‘real’ “/” characters.

    • I can try everything, I get an "Invalid path" error every time for this line:

      home_loc=$(dscl /Search -read /Users/$1 homeDirectory | sed ‘s|
      homeDirectory: ||g’)

      (all in one line)

      trying on 10.3.9

      any ideas?

      Thanks in advance

      svenc

  • awesome. truly awesome.

  • I should add that further testing seems to be showing another limit of just over
    400 simultaneous connections in my environment. If I go over this, I start to
    get the death spiral slow down issues again…. I haven’t had time to work this
    out exhaustively, but I have a feeling there’s a threading issue… sc_usage
    seems to show some odd results when sampling an unhappy AppleFileServer
    process.

    I really need to cut down on the number of automounts I have on this box, but
    that’s primarily been dictated by the Xserve RAID… I haven’t moved to XSan,
    and physical disks mean that I need to have a couple more automounts than are
    really desirable.

  • If I redirect the ~/Caches Folder of a network user to the local harddrive MS Word has a problem to save files. When saving an open file for the second time I always get an error that Word can’t save to file because of a naming or permissions error. We work with 10.4.6 on server and client and Word 10.2.3. If I delete the Symlink the saving works fine again.

    • I haven’t made much sense of this yet, but a quick work around is save again.
      The first time you attempt a save after changes are made, it fails with the error
      you mentioned, but a second save before you make any changes will save
      correctly with the correct file name.

      Any other true fixes?

  • When using MCX to manage the .GlobalPreferences.plist, to just modify those
    two afp_wan settings I would just delete the rest of the keys in WGM, correct?

    Also, are there any known issues with pushing prefs when some clients are still
    10.3.x? Those clients should just ignore this part of MCX settings, right?

    • That will be fine just running those two settings.

      I haven’t tested with 10.3.x much, as we’ve moved completely, but at worst it
      would just ignore those settings. It doesn’t appear to show up in the defaults.

  • I’ve been getting quite a few emails from people, so just wanted to post that I’m
    working on a followup for this article, as resolving these particular issues caused
    some more to appear… mainly related to idle connections.

  • Very useful article.
    Are there also debug settings of Apple Filing Protocol Server log?

  • afp:TCPQuantum seemed to make almost no difference from testing, so I didn’t
    mention it in the article.

    It’s possible that you could get better performance with a smaller number of
    clients all on Gig-E to the server if you increased it though.

    • I think the values for scenario 1 are aggressive too… That’s why I qualified
      them 🙂

      In the case of not supporting WAN clients, you can have aggressive values, as
      you’re trying to enforce LAN settings for everyone.

      If you have some WAN clients who are using modem connections, I would
      suggest that 8k datagrams are desirable, not 32k as Apple have suggested.
      That setting would seem most appropriate for remote DSL/cable connections,
      or anything faster than a modem. I would say that 32k is somewhat
      aggressive for modem users.

      I find it curious that you’re being told they don’t have a value for the
      threshold. That is the critical setting as far as I can tell. I just wish I had a
      better way of measuring the real latency reported by a client.

      If Apple think that 500-600 threads is too high, and yet the server is
      supposed to support up to 1000 simultaneous connections (not logged in
      users, *connections), then I suggest that points to something being wrong
      with the AFP server.

      I’m not having issues running at 600. If you don’t support that many
      connections at once, don’t set it that high.

      I need to get going on part #2 of this article, which will be much shorter, but
      will essentially talk about how I’ve been seeing issues with session caching and
      idle user connections. Disabling session caching and disconnecting idle users
      for my servers where students are logging in and out every couple of hours
      seems to have finally gotten me to a point where I no longer have to pre-
      emptively restart that AFP server every week or two.

      thanks for the detailed comment.

      • Hi Nigel,

        very nice article. I will try some of these tips…
        We have an Xserve DP 2.3 3GB RAM, 10.3.9 Server, Xserve RAID (RAID-Level
        5) and about 25 Clients (G5 mixed up with 10.3.9 and 10.4.6).
        The network is completely on GigaBit-Ethernet (only 2 iMacs are running on
        100Mbit). We use Linksys-Switches with EtherNet-Link-Aggregation, so the
        connection between the two switches is 2Gb. After changing to 10.4.x-Server,
        the Xserve will also be connected with 2 x GigaBit. During the upgrade we will
        also change the RAID from Level 5 to Level 50.

        All users have home directories on the server, because the Macs are switched
        quite often and a lot of freelancers are working here. We don’t want any
        critical file to be saved on the Clients (all files related to the company have to
        be stored exclusively on the RAID).
        Mobile homedirectories are not an option. The sync does not work properly
        (we tested this in a test-environment). Also the "company-rules" (see above)
        do not allow local storage of business files.
        E-Mail is distributed via IMAP and all Clients are working with Apples Mail.app.

        Sometimes the Xserve also gets extremely slow (AFP-load about 90%).
        Looking in the afp-access-log, I see a lot of activity from 1 or 2 clients. Very
        often the clients are loading many fonts from the server or Mail seems to
        index a lot of Mails (we keep a copy of the mail (without attachments) in the
        homedirectories, because without the local copy, searches in the mail-content
        do not work). It looks like the Xserve RAID (which also shows a huge load at
        these times) has a problem with many small files, if they’re accessed in huge
        numbers. Especially if the file is read from the RAID and written back to the RAID
        again (indexing IMAP-folders from the RAID and writing the Index in the
        Homedirectory, which is also on the same RAID).
        The Clients with Tiger are also indexing the Network-Volumes (distributed via
        AFP) with Spotlight (although a lot of manuals say that Spotlight wouldn’t do this
        -> it does!). This keeps the load really high. I don’t understand why Apple
        does not distribute a solution for Spotlight in a network-environment….

        I tried to redirect the ~/Library/Caches to /Library/Caches/User but I can not
        set up the command "defaults write com.apple.loginwindow ……" on the
        10.3.9-clients (error: "Domain com.apple.loginwindow does not exist")

        I will try this again with the manual provided from Apple KB:
        http://docs.info.apple.com/article.html?artnum=301446

        I hope we can solve this annoying slow-down-problem with one of your
        tricks… the changes with afp_wan and AppleShareClientCore didn’t make any
        difference…

        Thanks

        svenc

  • UPDATE for 10.5

    These items are no longer in com.apple.AppleShareClientCore but are now in /Library/Preferences/com.apple.AppleShareClient

    So to update the commands from the article

    defaults write /Library/Preferences/com.apple.AppleShareClient -dict-add afp_wan_threshold -int 1000
    defaults write /Library/Preferences/com.apple.AppleShareClient -dict-add afp_wan_quantum -int 131072

    You may still be able to put them in the global domain? ie defaults write -g com.apple.AppleShareClient but I am not sure…..they certainly aren’t there to begin with…

  • HELLO! Great article!

    Do you know how I can tune the automount timeout for failed mounts, from 120s to 10s? (Try it! On your AFP server, set up a firewall to prevent a particular AFP client from using the service for user home directories, then ssh into that user on the AFP client. You’ll see that it takes 120s before you’re allowed to log in homelessly but successfully.)

    I’ve naturally tried looking at /etc/autofs.conf and /Library/Preferences/com.apple.AppleFileServer.plist in addition to /System/Library/LaunchDaemons/com.apple.automountd.plist all with no success 🙁

    Any ideas?

  • Running 10.6 here and was wondering if any of the concepts above continue to be necessary? Could someone explain the maxThreads issue in more detail? For instance when I use top, I see that my AppleFileServer is spawning 203 threads of which only 2-5 seem to be active at any time. If only 2-5 are active, do I need to spawn more threads?
