SpamAssassin 4.0.x Setup and Hardening on Debian/Ubuntu

Future Foundation — Public Documentation

Author: Jeff Brown | March 2026

This document covers a clean installation of SpamAssassin 4.0.x from CPAN on Debian-based mail servers running Exim4 with SA-Exim integration. It assumes a working Exim4 MTA and basic familiarity with Perl, systemd, and git.

The Debian apt package for SpamAssassin typically lags well behind upstream. At the time of writing, apt on Debian 12 ships 3.4.6 while the current stable release is 4.0.2 (August 2025). The CPAN install gives you access to newer plugins, improved Bayes classification, and the DMARC/FromNameSpoof/Phishing plugins that are absent or disabled in the packaged version. This guide assumes CPAN as the installation method for that reason.

1. Prerequisites

Ensure the following packages are installed. These provide the build toolchain, the Perl module installer, and the runtime dependencies SA needs:

apt-get update
apt-get install build-essential libssl-dev libexpat1-dev \
  libhtml-parser-perl libnet-dns-perl libnetaddr-ip-perl \
  libio-socket-inet6-perl libmail-dkim-perl libgeoip2-perl \
  cpanminus razor pyzor re2c

The razor and pyzor packages install the collaborative filtering clients. We will configure them in section 5.

Note: the re2c package is needed for sa-compile, which compiles SA rules into optimised C code for faster scanning.

2. Installing SpamAssassin from CPAN

cpanm Mail::SpamAssassin

This installs the SA binaries under /usr/local/bin/ and the Perl modules under /usr/local/share/perl/. Once complete, verify:

/usr/local/bin/spamassassin --version

You should see something like:

SpamAssassin version 4.0.2
  running on Perl version 5.36.0

The apt-installed binaries remain at /usr/sbin/spamd and /usr/bin/spamassassin. They are now superseded but are not removed automatically. Keep them in place as a fallback, but ensure all systemd units and cron jobs point to /usr/local/bin/.

3. Switching spamd to the CPAN Binary

The default systemd unit for spamassassin still points to the apt binary at /usr/sbin/spamd. Override it without editing the packaged unit file:

systemctl edit spamassassin

This opens an override file. Add:

[Service]
ExecStart=
ExecStart=/usr/local/bin/spamd -d --pidfile=/run/spamd.pid \
  --syslog=/var/log/spamd.log --create-prefs --max-children=5 \
  --min-children=2 --min-spare=2 --max-spare=4 \
  --max-conn-per-child=50 --timeout-child=240 \
  --helper-home-dir -D learn
Nice=15

The blank ExecStart= line is required to clear the inherited value before setting the new one. Adjust --max-children to suit available RAM (each child consumes roughly 80-120MB).
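The sizing advice above comes down to simple arithmetic. A minimal sketch, assuming you choose a memory budget for spamd yourself; the function name and the 1024 MB budget are illustrative and not part of SpamAssassin:

```shell
# Hypothetical helper: how many spamd children fit in a memory budget?
#   $1 = memory you are willing to give spamd, in MB (your own choice)
#   $2 = assumed per-child footprint in MB (80-120 per the text above)
max_children_for() {
  budget_mb=$1
  per_child_mb=$2
  # Integer division rounds down, which errs on the side of fewer children.
  echo $(( budget_mb / per_child_mb ))
}

# Size against the pessimistic 120 MB figure for a 1024 MB budget.
max_children_for 1024 120
```

Sizing against the upper end of the range leaves headroom for ruleset growth; measure real child sizes on your own server before settling on a value.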

Then reload and restart:

systemctl daemon-reload
systemctl restart spamassassin

Send a test message and confirm the X-Spam-Checker-Version header now shows version 4.0.2 rather than 3.4.6.
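To check the header without reading the message by eye, something like the following can pull the version out of a saved copy. It is a sketch that assumes the header has the usual "SpamAssassin <version> ..." shape; the function name and the message path are placeholders:

```shell
# Print the SpamAssassin version recorded in a message's
# X-Spam-Checker-Version header. Reads the message on stdin and
# reports only the first matching header.
sa_header_version() {
  sed -n 's/^X-Spam-Checker-Version: SpamAssassin \([0-9][0-9.]*\).*/\1/p' |
    head -n 1
}

# Usage (the path is a placeholder for a message you received):
#   sa_header_version < /path/to/test-message.eml
```

An empty result means the header is missing, which usually indicates the message never passed through spamd at all.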

4. Plugin Management: the local.pre Convention

This is an important best practice that is easy to get wrong.

SpamAssassin reads all .pre files before local.cf. The .pre files are intended exclusively for loadplugin and loadobject directives. Everything else — scores, whitelist entries, dns_query_restriction, Bayes settings, trusted_networks — belongs in local.cf.

The stock installation ships with version-specific .pre files (init.pre, v310.pre, v320.pre ... v402.pre) that load the bundled plugins. Do not edit these. Your local customisations belong in a separate file called local.pre, which SA reads automatically because it ends in .pre.

Create /etc/spamassassin/local.pre containing only your additional plugin loads. A representative example:

loadplugin Mail::SpamAssassin::Plugin::DMARC
loadplugin Mail::SpamAssassin::Plugin::AttachmentPresent
loadplugin Mail::SpamAssassin::Plugin::FromNameSpoof
loadplugin Mail::SpamAssassin::Plugin::Phishing
loadplugin Mail::SpamAssassin::Plugin::Razor2
loadplugin Mail::SpamAssassin::Plugin::Pyzor

That is it. No scores, no configuration, no conditionals. Just loadplugin lines. The configuration for these plugins (scores, thresholds, dns settings) goes into local.cf.

Why this matters: loadplugin directives must be processed before the rules and scores that reference them. Placing them in local.cf can cause ordering issues where SA tries to apply a score to a test that has not yet been defined because its plugin was loaded too late in the parsing sequence. The .pre files are parsed first by design, so plugins loaded there are guaranteed to be available when local.cf is read.
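The convention is easy to enforce mechanically. The following is a minimal sketch of a pre-commit style check, assuming the strict reading used in this guide (local.pre holds loadplugin lines, comments, and blank lines, nothing else); the function name is hypothetical:

```shell
# List every line in a .pre file that is not blank, a comment, or a
# loadplugin directive. Returns 0 when the file is clean, 1 when
# offending lines were printed, 2 when the file cannot be read.
check_local_pre() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  # grep -v selects the lines that do NOT match the allowed shapes;
  # -n prefixes each with its line number.
  if grep -vnE '^[[:space:]]*(#.*|loadplugin[[:space:]].*)?$' "$1"; then
    return 1
  fi
  return 0
}

# Usage:
#   check_local_pre /etc/spamassassin/local.pre || echo "move these to local.cf"
```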

To verify which plugins are loaded:

spamassassin --lint -D 2>&1 | grep -i "plugin: load"

To check that a specific plugin's tests are available:

spamassassin --lint -D 2>&1 | grep -i "dmarc\|fromnamespoof\|phishing"

5. Configuring Razor2 and Pyzor

Razor2 and Pyzor are collaborative spam signature databases. When someone reports a spam message to the Razor or Pyzor network, your server can query that network and benefit immediately without any local Bayes training. They are particularly effective against phishing campaigns where the same message body hits many recipients simultaneously.

5a. Pyzor

If installed via apt (section 1), test connectivity:

pyzor ping

Expected output:

public.pyzor.org:24441    (200, 'OK')

That is all the setup Pyzor needs. The SA plugin queries it automatically once loaded via local.pre.

Older documentation refers to pyzor discover but this command no longer exists in current versions. pyzor ping is the correct connectivity test.

5b. Razor2

Initialise the Razor2 client and register with the network:

razor-admin -create
razor-admin -register

This creates configuration files under /root/.razor/ (or the home directory of whichever user runs spamd). Verify it works:

echo "test" | razor-check

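The exit code is what matters here, not the output: razor-check prints nothing, exits 0 when the message is listed as spam, and exits 1 when it is not. For this test string, expect a silent exit code of 1. An error message instead points to a configuration or connectivity problem.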

5c. Verifying SA Integration

Run a debug scan and check that both backends are found:

spamassassin --test-mode -D razor2,pyzor < /dev/null 2>&1 | \
  grep -E "razor|pyzor"

You should see lines indicating both are available, something like:

dbg: pyzor: pyzor is available: /usr/bin/pyzor
dbg: pyzor: got response: public.pyzor.org:24441 (200, 'OK')

The "exceeded hardcoded limits" message that appears when testing with empty input is expected and harmless. SA sensibly ignores trivial matches. Real mail will score normally.

5d. A Note on DCC

DCC (Distributed Checksum Clearinghouse) is the third major collaborative filtering system. It is not packaged in Debian due to its non-standard licence and must be compiled from source from https://www.dcc-servers.net/dcc/. Razor2 and Pyzor together cover most of the collaborative filtering benefit; add DCC only if you have a specific need and the appetite for maintaining a source build.

6. The KAM Ruleset Channel

The default sa-update channel (updates.spamassassin.org) provides the core ruleset. Kevin McGrail's KAM channel is an actively maintained supplementary ruleset with aggressive phishing URL rules and patterns targeting current spam campaigns. It is probably the single highest-value addition to any SA installation.

6a. Import the GPG Signing Key

Download and import the key, running as the debian-spamd user to match the ownership of the GPG keyring:

wget https://mcgrail.com/downloads/kam.sa-channels.mcgrail.com.key \
  -O /tmp/kam.key
chmod 644 /tmp/kam.key
sudo -u debian-spamd sa-update \
  --import /tmp/kam.key \
  --gpghomedir /var/lib/spamassassin/sa-update-keys

The key file must be readable by debian-spamd. Downloading it to a restricted directory (like /root or a mail spool) and then running the import as debian-spamd will fail with a permission error.

6b. Pull the Channel

sudo -u debian-spamd sa-update \
  --gpgkey 24C063D8 \
  --channel kam.sa-channels.mcgrail.com \
  --gpghomedir /var/lib/spamassassin/sa-update-keys \
  --verbose

A successful first run downloads the ruleset. Subsequent runs that report "no fresh updates" with exit code 1 mean the channel is working and simply has nothing new since the last pull.
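Because exit code 1 is a normal outcome, any wrapper around sa-update has to treat it differently from a real failure. A minimal sketch, assuming only the two codes described above (0 for new rules, 1 for nothing new) and lumping everything else together as a failure; the function name is hypothetical:

```shell
# Map an sa-update exit code to a short status word.
#   0 -> updated    (new rules were installed; lint, then reload spamd)
#   1 -> unchanged  (no fresh updates; nothing to do)
#   * -> failed     (anything else: read the output before retrying)
sa_update_status() {
  case "$1" in
    0) echo updated ;;
    1) echo unchanged ;;
    *) echo failed ;;
  esac
}

# Usage sketch, reusing the command from 6b:
#   sudo -u debian-spamd sa-update --gpgkey 24C063D8 \
#     --channel kam.sa-channels.mcgrail.com \
#     --gpghomedir /var/lib/spamassassin/sa-update-keys
#   sa_update_status "$?"
```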

The rules land under:

/var/lib/spamassassin/4.000002/kam_sa-channels_mcgrail_com/

Note the directory name uses underscores, not dots.
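Both parts of that path follow a predictable pattern, which helps when scripting checks across servers on different SA versions. The sketch below assumes the naming seen above (minor and patch numbers zero-padded to three digits, dots in the channel name replaced by underscores); the function names are hypothetical:

```shell
# Version directory: major, then minor and patch padded to three
# digits each (4.0.2 -> 4.000002).
sa_version_dir() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split "4.0.2" into $1=4 $2=0 $3=2
  IFS=$old_ifs
  printf '%d.%03d%03d\n' "$1" "$2" "$3"
}

# Channel directory: dots become underscores, hyphens are kept.
sa_channel_dir() {
  printf '%s\n' "$1" | tr '.' '_'
}

echo "/var/lib/spamassassin/$(sa_version_dir 4.0.2)/$(sa_channel_dir kam.sa-channels.mcgrail.com)/"
```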

6c. Adding KAM to the Daily Cron

The stock Debian cron job at /etc/cron.daily/spamassassin handles sa-update and spamd reload. Two things to watch for:

First, the stock cron script may hardcode /usr/bin/sa-update (the apt version). If you installed SA from CPAN, the correct binary is /usr/local/bin/sa-update. Either update the path in the cron script or ensure /usr/local/bin precedes /usr/bin in the cron PATH.

Second, add the KAM channel pull after the existing sa-update block and before the # Local variables: comment at the bottom:

# KAM ruleset channel update
env -i LANG="$LANG" PATH="$PATH" http_proxy="$http_proxy" \
  start-stop-daemon --chuid debian-spamd:debian-spamd --start \
  --exec /usr/local/bin/sa-update -- \
  --gpgkey 24C063D8 \
  --channel kam.sa-channels.mcgrail.com \
  --gpghomedir /var/lib/spamassassin/sa-update-keys 2>&1

This mirrors the existing sa-update block in style, running as the same user with the same GPG home directory.

Also confirm that CRON=1 is set in /etc/default/spamassassin, otherwise the entire cron script exits immediately without doing anything.
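Both cron pitfalls can be checked before waiting a day to see whether the job ran. A minimal sketch; the function names are hypothetical, and the pattern for sa-update paths assumes the binary lives in a directory whose name ends in "bin":

```shell
# Return 0 if the defaults file enables the cron job with CRON=1.
cron_enabled() {
  grep -Eq '^[[:space:]]*CRON=1[[:space:]]*(#.*)?$' "$1"
}

# Print every sa-update binary path hardcoded in a script, one per line.
# Matching on "bin/sa-update" avoids false hits on the sa-update-keys
# directory passed to --gpghomedir.
sa_update_paths() {
  grep -oE '/[A-Za-z0-9/_.-]*bin/sa-update' "$1" | sort -u
}

# Usage:
#   cron_enabled /etc/default/spamassassin || echo "cron job is disabled"
#   sa_update_paths /etc/cron.daily/spamassassin
```

If the second command prints /usr/bin/sa-update anywhere, the cron job is still updating rules for the apt version.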

6d. Defunct Channels

Older documentation references channels at kam.sa.net.au and sought.rules.yerp.org. Both are defunct as of 2026. The correct current channel is kam.sa-channels.mcgrail.com as documented above. If you find stale directories from previous attempts under /var/lib/spamassassin/4.000002/ (kam_sa_net_au, sought_rules_yerp_org) they can be safely removed.

7. Bayesian Classifier Health and Training

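SA's Bayesian classifier is one of its most powerful components, but it stays inactive until it has learned at least 200 spam and 200 ham messages (the defaults for bayes_min_spam_num and bayes_min_ham_num). Below that threshold no BAYES_* rules fire at all, so their absence from the headers says nothing about a message.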

7a. Checking Corpus Health

sa-learn --dump magic

A healthy output looks like:

0.000  0  3       0  non-token data: bayes db version
0.000  0  100498  0  non-token data: nspam
0.000  0  912746  0  non-token data: nham
0.000  0  259630  0  non-token data: ntokens
...

The critical numbers are nspam and nham. Both should be well above 200 for Bayes to function. Further down the output (elided above), a recent newest atime value confirms the database is actively learning from live traffic.

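Watch the ham-to-spam ratio. A corpus heavily skewed toward ham (say 9:1 or worse) makes Bayes conservative about flagging spam; something closer to 2:1 or 3:1 gives better sensitivity. The sample above sits at roughly 9:1 (912746 ham to 100498 spam), so it clears the activation threshold comfortably but would benefit from the extra spam training described in 7b.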
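The ratio is quick to compute from the dump itself. A minimal sketch, assuming the column layout shown in the sample output (count in the third field, name in the last); the function name is hypothetical:

```shell
# Read `sa-learn --dump magic` output on stdin and summarise the corpus.
bayes_ratio() {
  awk '
    $NF == "nspam" { nspam = $3 }
    $NF == "nham"  { nham  = $3 }
    END {
      if (nspam > 0)
        printf "nspam=%d nham=%d ham:spam=%.1f:1\n", nspam, nham, nham / nspam
      else
        print "nspam is zero or missing; no ratio available"
    }'
}

# Usage:
#   sa-learn --dump magic | bayes_ratio
```

Fed the sample output from 7a, this prints nspam=100498 nham=912746 ham:spam=9.1:1.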

7b. Training from SA-Exim Reject Spools

If you run SA-Exim, rejected messages accumulate in /var/spool/sa-exim/SApermreject/new/. These are high-confidence spam that SA scored above the reject threshold. They are excellent training material but Bayes does not learn from them automatically, because SA-Exim rejects them at SMTP time before the full auto-learn pipeline completes.

Feed them explicitly:

sa-learn --progress --spam /var/spool/sa-exim/SApermreject/new/

The --progress flag shows a running count. On a corpus of several thousand messages this takes a few minutes.

Do NOT train from SAspamaccept directories without careful review. These contain messages that SA flagged as spam but delivered anyway, and if your scoring has had any period of misconfiguration they will contain false positives that will contaminate the corpus.

7c. Auto-learning Configuration

The following settings in local.cf control auto-learning thresholds:

bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.001
bayes_auto_learn_threshold_spam 8.0

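This means SA automatically learns messages scoring below -0.001 as ham and above 8.0 as spam. The score used for this decision is computed without the Bayes rules themselves, so the classifier does not simply reinforce its own verdicts. The spam threshold is deliberately conservative to avoid learning from borderline messages that might be misclassified.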
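The threshold logic can be pictured as a three-way decision. The sketch below is a simplification that follows the wording above (strictly below the ham threshold, strictly above the spam threshold); SA applies further conditions before it actually learns a message, and the function name is hypothetical:

```shell
# Classify an auto-learn score against the thresholds from local.cf.
# Prints ham, spam, or none.
#   $1 = auto-learn score
#   $2 = ham threshold  (defaults to -0.001 as configured above)
#   $3 = spam threshold (defaults to 8.0 as configured above)
# Simplification: behaviour exactly at a threshold follows the prose
# ("below" / "above") and has not been checked against SA's source.
autolearn_class() {
  awk -v score="$1" -v ham_max="${2:--0.001}" -v spam_min="${3:-8.0}" 'BEGIN {
    if (score + 0 < ham_max + 0)       print "ham"
    else if (score + 0 > spam_min + 0) print "spam"
    else                               print "none"
  }'
}

# Usage:
#   autolearn_class -0.5    # a message with only negative-scoring hits
#   autolearn_class 3.0     # borderline: not learned either way
```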

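Confirm auto-learning is active by checking that the journal file is being updated:

ls -lh /etc/spamassassin/bayes/

A bayes_journal file with a recent modification time confirms the auto-learn loop is firing. Its size rises and falls rather than growing without limit, because SA periodically merges the journal into bayes_toks.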

8. DNS Query Restrictions

Some RBL providers (notably Validity/SenderScore) block queries from certain resolver IPs and return misleading positive results rather than useful signal. These false hits add phantom score to legitimate mail.

Suppress queries to known-broken lists in local.cf:

dns_query_restriction deny bl.score.senderscore.com
dns_query_restriction deny sa-accredit.habeas.com
dns_query_restriction deny sa-trusted.bondedsender.org

Similarly, rules that fire only because a list refused your query, such as URIBL_BLOCKED, say nothing about the message itself and should be scored to zero:

score URIBL_BLOCKED 0.0

This prevents blocked query results from contributing to scores.

9. Trusted Networks

The trusted_networks directive tells SA which relay IPs are under your control. Mail received from these IPs is not subjected to relay-based checks (RDNS, PBL, etc). This must match your SPF record. If they diverge, SA will penalise mail relayed through your own infrastructure.

trusted_networks 129.232.230.120/29 197.189.206.80/29 41.203.26.232/29 41.72.147.64/27
trusted_networks 2c0f:fce8:4000:801::/64 2c0f:fce8:0:40c::/64

This guide keeps IPv4 and IPv6 ranges on separate trusted_networks lines for readability; multiple trusted_networks lines are cumulative. Review and update these whenever relay IPs change.

10. Shortcircuiting and Priority Hints

Shortcircuiting allows SA to skip expensive downstream checks (RBL lookups, network tests) when a definitive early result is already available. This improves throughput without sacrificing accuracy.

In local.cf, within an ifplugin block:

ifplugin Mail::SpamAssassin::Plugin::Shortcircuit

shortcircuit USER_IN_WHITELIST       on
shortcircuit USER_IN_DEF_WHITELIST   on
shortcircuit USER_IN_ALL_SPAM_TO     on
shortcircuit USER_IN_BLACKLIST       on
shortcircuit USER_IN_BLACKLIST_TO    on
shortcircuit SUBJECT_IN_BLACKLIST    on
shortcircuit ALL_TRUSTED             on

endif

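Priority hints tell SA to run certain tests early. Rules run in ascending priority order, so a negative priority puts a test ahead of the default batch (priority 0). If Bayes returns a high-confidence spam result early, the shortcircuit plugin can skip the remaining network tests entirely, provided BAYES_99 and BAYES_999 also carry a shortcircuit directive of their own (for example, shortcircuit BAYES_99 spam); the block above does not include one.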

priority BAYES_99  -850
priority BAYES_999 -850

11. Phishing Plugin Feeds

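Loading the Phishing plugin (section 4) is necessary but not sufficient. The plugin needs feed data from OpenPhish or PhishTank to be useful, and the feed file locations are set with phishing_openphish_feed and phishing_phishtank_feed in local.cf. Without the feed files the plugin loads but has nothing to match against, so its rule never fires.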

Check whether feed data exists:

find /var/lib/spamassassin -name "*phish*" 2>/dev/null

If nothing is returned, the feeds are not being downloaded. Consult the SA documentation for the Phishing plugin on how to configure the feed download cron. This is a separate mechanism from sa-update.
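A feed that exists but stopped updating is almost as useless as a missing one, so a freshness check is worth adding alongside the existence check. A minimal sketch; the function name is hypothetical and the path in the usage line is a placeholder for wherever your feed download writes its file:

```shell
# Return 0 if the file exists and was modified within the last N days.
#   $1 = feed file path
#   $2 = maximum acceptable age in days
feed_is_fresh() {
  [ -f "$1" ] || return 1
  # find prints the path only when the modification-time test passes.
  [ -n "$(find "$1" -mtime "-$2" 2>/dev/null)" ]
}

# Usage (placeholder path):
#   feed_is_fresh /path/to/openphish-feed.txt 2 || echo "phishing feed is stale"
```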

12. Lint Testing and Validation

After any configuration change, always lint before restarting:

spamassassin --lint && echo "Config OK"

This catches syntax errors, unknown test names, and plugin loading failures. Only restart spamd after a clean lint:

systemctl restart spamassassin

For a reload without dropping active connections (spamd supports SIGHUP):

systemctl reload spamassassin
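The lint-then-apply habit is worth wrapping so that a failed lint can never be followed by a restart. A minimal sketch; the function name is hypothetical, and both steps are passed as command strings so the logic can be tried without touching a live server:

```shell
# Run the lint command; run the apply command only if lint succeeds.
#   $1 = lint command string
#   $2 = apply command string (restart or reload)
apply_if_clean() {
  if sh -c "$1"; then
    sh -c "$2"
  else
    echo "lint failed; configuration not applied" >&2
    return 1
  fi
}

# Usage:
#   apply_if_clean "spamassassin --lint" "systemctl reload spamassassin"
```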

13. Git-Based Configuration Management

Managing local.cf and local.pre in a git repository makes configuration changes auditable, reversible, and deployable across multiple servers.

The recommended repository structure:

/etc/spamassassin/
  .git/
  .gitignore
  local.cf          -- all scoring, settings, custom rules
  local.pre         -- loadplugin directives only

The .gitignore should exclude everything except the files you manage:

65_debian.cf
bayes
local.cf.bak
local.cf.dpkg-dist
local.cf.ispa-original
sa-update-hooks.d
sa-update-keys
*.pre
!local.pre

The *.pre wildcard excludes all the version-specific .pre files (init.pre, v310.pre, v342.pre, v400.pre, etc) which are managed by the SA package and should not be committed. The negation !local.pre must come after the wildcard. Gitignore processes rules top to bottom, so if !local.pre appears before *.pre, the wildcard simply re-ignores it.
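The ordering rule can be confirmed with git check-ignore rather than taken on trust. The sketch below builds a throwaway repository for each test; the function name is hypothetical, and it assumes git and mktemp are available:

```shell
# Report whether a path would be ignored under a given .gitignore.
# Prints "ignored" or "kept".
#   $1 = .gitignore contents
#   $2 = path to test (does not need to exist)
# Note: any git error is also reported as "kept"; this is a sketch.
ignore_verdict() {
  scratch=$(mktemp -d) || return 2
  (
    cd "$scratch" &&
    git init -q . &&
    printf '%s\n' "$1" > .gitignore &&
    if git check-ignore -q "$2"; then echo ignored; else echo kept; fi
  )
  rm -rf "$scratch"
}

# Usage:
#   ignore_verdict "$(printf '*.pre\n!local.pre')" local.pre   # wildcard first
#   ignore_verdict "$(printf '!local.pre\n*.pre')" local.pre   # negation first
```

With the wildcard first, local.pre is kept; with the negation first, it is ignored along with every other .pre file.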

On secondary servers, use git sparse-checkout to pull only the files that should be deployed:

git init
git remote add origin https://git.example.com/org/spamassassin.git
git config core.sparseCheckout true

echo "local.cf" >> .git/info/sparse-checkout
echo "local.pre" >> .git/info/sparse-checkout
echo ".gitignore" >> .git/info/sparse-checkout

git pull origin master

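Historical snapshots (like a pre-migration copy of another server's config) can live in the repo without being deployed, provided they are left out of the sparse-checkout list on secondary servers. Snapshots that should stay on a single machine and never be committed belong in .gitignore instead, as local.cf.ispa-original does above. The two mechanisms are separate: .gitignore controls what gets committed, while sparse-checkout controls what gets checked out.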

When pulling on a server for the first time, if git reports "There is no tracking information for the current branch", point the current branch at its upstream with:

git branch --set-upstream-to=origin/master

Subsequent pulls then work normally:

git pull --ff-only

14. New Server Checklist

On a fresh Debian 12 server with Exim4 and SA-Exim already working:

Install prerequisites (section 1).
Install SA from CPAN (section 2).
Override the systemd unit to use /usr/local/bin/spamd (section 3).
Clone the git repository to /etc/spamassassin (section 13). Set up sparse-checkout for local.cf, local.pre, .gitignore.

Create the Bayes database directory:

mkdir -p /etc/spamassassin/bayes
chown Debian-exim:Debian-exim /etc/spamassassin/bayes
chmod 2770 /etc/spamassassin/bayes

Initialise Razor2 and test Pyzor (section 5):

razor-admin -create
razor-admin -register
pyzor ping

Import the KAM GPG key and pull the channel (section 6).
Update the daily cron job with the KAM channel block and ensure it references /usr/local/bin/sa-update.

Lint and restart:

spamassassin --lint && systemctl restart spamassassin

Send a test message and verify the X-Spam-Checker-Version header shows 4.0.2, and DMARC_PASS or other plugin-specific tests appear in X-Spam-Status.

After a few days of live traffic, check Bayes health:

sa-learn --dump magic

Confirm nspam and nham are both growing.

Appendix A: Recommended spamd Flags

Flag	Purpose
`--max-children=5`	Maximum concurrent scanning processes
`--min-children=2`	Keep at least 2 children warm
`--min-spare=2`	Minimum idle children
`--max-spare=4`	Maximum idle children
`--max-conn-per-child=50`	Recycle children after 50 connections (limits the impact of memory leaks)
`--timeout-child=240`	Abort a child that takes longer than 240 seconds (4 minutes) to process a message
`--helper-home-dir`	Use the helper user's home for .razor etc
`-D learn`	Debug logging for Bayes auto-learn events
`Nice=15`	Run at reduced priority (a systemd [Service] directive, not a spamd flag)

Adjust --max-children based on available RAM. Each child uses roughly 80-120MB depending on ruleset size.

Appendix B: Useful Diagnostic Commands

Task	Command
Check SA version	`spamassassin --version`
Lint the configuration	`spamassassin --lint`
List loaded plugins	`spamassassin --lint -D 2>&1 | grep -i "plugin: load"`
Test a message interactively	`spamassassin --test-mode < /path/to/message.eml`
Check Bayes database health	`sa-learn --dump magic`
Test Pyzor connectivity	`pyzor ping`
Test Razor2/Pyzor integration	`spamassassin --test-mode -D razor2,pyzor < /dev/null 2>&1 | grep -E "razor|pyzor"`
Check KAM ruleset	`ls -la /var/lib/spamassassin/4.000002/kam_sa-channels_mcgrail_com/`
Check which sa-update is in PATH	`which sa-update`

Appendix C: File Separation Reference

local.pre contains ONLY

loadplugin directives

local.cf contains EVERYTHING ELSE

required_score, trusted_networks, dns_query_restriction, score overrides, whitelist_from_rcvd, Bayes settings, shortcircuit configuration, priority hints, custom header rules, custom keyword blocking rules

This separation is not optional. It is the SA project's own convention and prevents plugin loading order issues that can cause rules to silently fail.