- Accidental DBA
- Posts
- SQL Server Agent Jobs: Set It and Forget It?
SQL Server Agent Jobs: Set It and Forget It?
Useful tips to make your jobs more stable

Too many IT teams treat SQL Server Agent jobs like a coffee timer "Set it and forget it!"
Unfortunately, that mindset only works if everything else is perfect forever. Whether its backup jobs failing silently, index maintenance running on the wrong replica, or nobody getting alerts when things break, unattended SQL Agent jobs are one of the sneakiest ways to rack up technical debt. Let’s dig into what DBAs and non-DBAs alike need to keep an eye on to avoid job-related headaches.
SQL Agent Jobs Are Not “Fire and Forget”
Just because a job is scheduled doesn’t mean it’s working.
Agent jobs can fail, silently skip steps (especially when new steps are added), hang indefinitely, or run at the wrong time (local vs. UTC) and nobody notices until something downstream explodes.
Why Job Monitoring Matters
An ETL job that fails silently doesn’t just impact reports - it can compromise business decisions. Visibility is key.
CheckDB job failing due to corruption found? That needs IMMEDIATE attention.
Common Signs of Neglect:
Look for jobs with old failure messages, mismatched schedules, or jobs that haven’t run successfully in days or weeks.
Real-world Examples:
One client had a nightly report load job fail every third Wednesday because of a bad parameter. Nobody noticed for months until the CFO asked why sales numbers looked off.
Differential backup jobs were failing due to running on an AG secondary. DAYS of potential data loss
Job Ownership and Permissions
SQL Agent jobs need a stable, valid owner. If that account gets disabled or removed, some job steps might fail, as SQL Server checks for valid ownership every time a job runs. The job owner or team name should be in the job description field.
Default Ownership Pitfalls
Too often, developers or DBAs create jobs under their own accounts (this is default behavior). When those folks leave the company, chaos follows.Best Practice:
Assign ownership to a dedicated Agent service account or ‘sa’, depending on your security policy.What Can Go Wrong?
A DBA leaves the company (even on the best of terms), sysadmin disables her account, BLAM! SQL Agent jobs start failing all over the place the next time they run. Probably while you are at your kid’s soccer game.
Logging and Notifications: Built-In but Underused
SQL Agent offers good logging and alerting options. You just have to turn them on.
Enable Job History and Step Logging
Make sure each step writes output so you can diagnose problems quickly. Even the bare minimum “Log to Table” in the job step Advanced menu is better than nothing. Watch out for the “Append output to existing entry in table” option – that can blow out msdb in a hurry.Set Up Alerts for Failures
Database Mail + an Operator = Quick alerts when something breaks. Useless if the emails go to the wrong person or team though.Pro Tip
Use job steps that retry on failure with brief wait times. And yes, that includes your t-log backup jobs. Check the job logic so you don’t accidentally repeat something that worked.
AG-Aware Jobs: Run Only on the Primary
In an Availability Group, not all jobs belong on all nodes.
Why This Matters
Some jobs, like index rebuilds, update stats, or differential backups must run on the Primary replica.Implementation Tip
Add a check step at the beginning of critical jobs. If not on the Primary, gracefully exit.How to Check If You’re on the Primary
Use sys.fn_hadr_is_primary_replica in a preliminary job step to verify before proceeding.
-- Check if the local replica is the Primary
IF NOT EXISTS (
SELECT 1
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id
WHERE ars.role_desc = 'PRIMARY'
AND ar.replica_server_name = @@SERVERNAME
)
BEGIN
PRINT 'This is not the primary replica. Skipping job step.';
RETURN;
END
-- If we get here, we're on the Primary
PRINT 'This is the Primary replica. Continuing job logic...';
BEGIN
-- Your actual job logic goes here
-- Example:
Select getdate();
END
Don’t Ignore Long-Running or Hanging Jobs
A job that’s still running isn’t necessarily still working.
Symptoms of a Hung Job:
No errors, but the runtime is wildly longer than average.
Sitting as an active or sleeping SPID, but resource utilization isn’t changing
Root Causes:
Blocking, deadlocks, external API hangs, or excessive waits can all cause jobs to stall.Solutions
Implement a max run time policy per job or use third-party monitoring to alert on outliers.
Could Your SQL Server Survive a Failure Without Major Downtime?
Find out how bullet-proof your DR plan really is!
SQL tidBITs: Long running job code
-- This is code from the internet, test before using
-- Source: https://www.sqlservercentral.com/forums/topic/script-to-find-long-running-job
SELECT jobs.name AS [Job_Name]
, CONVERT(VARCHAR(30),ja.start_execution_date,121)
AS [Start_execution_date]
, ja.stop_execution_date
, [Duration_secs]
FROM msdb.dbo.sysjobs jobs
LEFT JOIN (
SELECT *
FROM msdb.dbo.sysjobactivity
WHERE session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions) AND
start_execution_date IS NOT NULL AND
stop_execution_date IS NULL
) AS ja ON ja.job_id = jobs.job_id
CROSS APPLY (
SELECT DATEDIFF(ss,ja.start_execution_date,GETDATE()) AS [Duration_secs]
) AS ca1
WHERE 1=1
and jobs.name Not in ('test') -- set this up to include/exclude jobs that should/should not be alerted on
AND Duration_secs > 300 -- 5 minutes
Reply