CPU and Max RSS Analysis tools#6663

Open
ChrisPaulBennett wants to merge 53 commits into
cylc:masterfrom
ChrisPaulBennett:cylc_profiler

Conversation

@ChrisPaulBennett
Contributor

@ChrisPaulBennett ChrisPaulBennett commented Mar 12, 2025

This is one of three pull requests adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a Python profiler script. Cylc will run the profiler in the same cgroup as the Cylc task. It will periodically poll cgroups and save the data to a file. Cylc will then store these values in the SQL database file.

Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675
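
As a rough illustration of the polling approach described above, here is a minimal sketch. It is not the code in this PR. It assumes a cgroup v2 layout in which `cpu.stat` contains a `usage_usec` line and `memory.peak` holds the peak memory in bytes; the function names, the output file names and the `max_polls` parameter are hypothetical.

```python
import time
from pathlib import Path


def read_cpu_usec(cgroup_dir: Path) -> int:
    """Return cumulative CPU time (microseconds) from a cgroup v2 cpu.stat."""
    for line in (cgroup_dir / "cpu.stat").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat")


def read_peak_bytes(cgroup_dir: Path) -> int:
    """Return peak memory (bytes) from a cgroup v2 memory.peak file."""
    return int((cgroup_dir / "memory.peak").read_text().strip())


def poll(cgroup_dir: Path, out_dir: Path, delay: float, max_polls: int) -> None:
    """Sample the cgroup max_polls times, overwriting the output files.

    The output file names (cpu_time, max_rss) are assumptions made for
    this sketch.
    """
    for i in range(max_polls):
        (out_dir / "cpu_time").write_text(str(read_cpu_usec(cgroup_dir)))
        (out_dir / "max_rss").write_text(str(read_peak_bytes(cgroup_dir)))
        if i < max_polls - 1:
            time.sleep(delay)
```

Each poll overwrites the previous values, so whatever reads the files afterwards only sees the most recent sample.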

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@ChrisPaulBennett ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19
Member

@oliver-sanders oliver-sanders left a comment

🎉

Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/job_file.py Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/scripts/profile.py Outdated
Comment thread cylc/flow/scripts/profile.py Outdated
Comment thread cylc/flow/scripts/profile.py Outdated
Comment thread cylc/flow/scripts/profile.py Outdated
Comment thread tests/functional/jobscript/02-profiler.t Outdated
@oliver-sanders oliver-sanders added this to the 8.x milestone Mar 12, 2025
@ChrisPaulBennett ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from fb1b12b to c5d30b3 Compare March 21, 2025 11:37
@ChrisPaulBennett ChrisPaulBennett force-pushed the cylc_profiler branch 3 times, most recently from 30a7bb0 to 7091711 Compare April 2, 2025 08:35
@ChrisPaulBennett ChrisPaulBennett marked this pull request as ready for review April 2, 2025 14:20
Member

@oliver-sanders oliver-sanders left a comment

👍

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/scripts/profiler.py Outdated
Comment thread cylc/flow/scripts/profiler.py Outdated
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#-------------------------------------------------------------------------------
# cylc profile test
Member

This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.

I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.

We should test the job's stderr for the line(s) written by the profiler script complaining of the fault.

Member

@ChrisPaulBennett

The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.

I've had a crack at a test here: ChrisPaulBennett#1

A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.

Comment thread tests/functional/jobscript/02-profiler/flow.cylc Outdated
Comment thread tests/functional/jobscript/02-profiler/flow.cylc Outdated
Comment thread cylc/flow/scripts/profiler.py Outdated
@oliver-sanders
Member

(please ignore the manylinux test failures, we'll be removing this test on master shortly)

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
@wxtim
Member

wxtim commented Apr 16, 2025

I'm getting lots of failures with this (admittedly nasty) workflow on localhost:

[task parameters]
    time = 1..10
    reps = 1..5
[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task<time><reps>
[runtime]
    [[task<time><reps>]]
        script = sleep $CYLC_TASK_PARAM_time

About 2/3 of tasks have FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time' - It looks to me like the profiler fails if the task exits too fast?

Full Traceback
Traceback (most recent call last):
  File "/home/users/tim.pillinger/conda-envs/cylc39/bin/cylc", line 8, in <module>
    sys.exit(main())
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 702, in main
    execute_cmd(command, *cmd_args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 333, in execute_cmd
    entry_point.load()(*args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/terminal.py", line 298, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 62, in main
    get_config(options)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 180, in get_config
    profile(process, cgroup_version, args.delay)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 159, in profile
    write_data(str(cpu_time), "cpu_time")
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 103, in write_data
    with open(filename, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time'

@oliver-sanders
Member

oliver-sanders commented Apr 16, 2025

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval are an edge case that we should handle.
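
One way to handle that edge case is to take a sample before the first sleep and to treat a failed write as non-fatal. The sketch below is illustrative only and is not the fix adopted in this PR: `write_data_safely`, `poll_until_done`, `sample` and `job_running` are hypothetical names, and it assumes the error arises because the output location is no longer available when the profiler writes.

```python
import sys
import time
from typing import Callable


def write_data_safely(data: str, filename: str) -> bool:
    """Write data to filename, reporting (not raising) on failure."""
    try:
        with open(filename, "w") as f:
            f.write(data + "\n")
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        print(f"profiler: could not write {filename}: {exc}", file=sys.stderr)
        return False
    return True


def poll_until_done(
    sample: Callable[[], str],
    job_running: Callable[[], bool],
    filename: str,
    delay: float,
) -> int:
    """Sample first, then sleep, so short jobs still get one data point.

    Returns the number of samples taken.
    """
    polls = 0
    while True:
        write_data_safely(sample(), filename)
        polls += 1
        if not job_running():
            return polls
        time.sleep(delay)
```

Sampling before sleeping means a job that finishes within one poll interval still produces a value, and a write failure ends up as a line in stderr rather than a traceback.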

@wxtim
Member

wxtim commented Apr 17, 2025

> Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup

Probably need some user safety rails/warnings about that

@oliver-sanders
Member

> Probably need some user safety rails/warnings about that

It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it.

Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/etc/job.sh Outdated
Comment thread cylc/flow/etc/job.sh
@ChrisPaulBennett
Contributor Author

I'm not sure how to deal with the linting failure. My Perl is rusty, at best.
If I add "export", as the error code recommends, the test fails. If I remove it, the test also fails.
Dave Matthews' recommendations have been implemented.

@oliver-sanders
Member

oliver-sanders commented May 7, 2025

Works fine for me:

$ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok    20179 ms ( 0.01 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.29 CPU)
Result: PASS

$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #-------------------------------------------------------------------------------
 # cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
 . "$(dirname "$0")/test_header"
 #-------------------------------------------------------------------------------
 set_test_number 2

$ etc/bin/shellchecker 
$ echo $?
0

Comment thread cylc/flow/scripts/profiler.py
Comment thread tests/functional/cylc-cat-log/12-delete-kill.t Outdated
Comment thread tests/functional/jobscript/03-profiler-e2e/flow.cylc Outdated
Comment thread tests/functional/jobscript/04-profiler/flow.cylc Outdated
Comment thread tests/functional/jobscript/04-profiler.t Outdated
@oliver-sanders
Member

Just some minor comments above.

I've repeated my test from #6663 (comment) and confirmed everything is looking good.

Comment thread changes.d/6663.feat.md Outdated
Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
Comment thread cylc/flow/scripts/message.py Outdated
Comment thread cylc/flow/job_file.py Outdated
@ChrisPaulBennett
Contributor Author

I've tested the CPU times as a sanity check that the numbers are correct, and it looks good to me.
I've got two flow.cylc files: one serial and one parallel.
FOO, FOOT and FOOL do some amount of compute.
BAR, BOOL and PUB do twice the amount of compute.
In the serial workflow you should see both wall clock time and CPU time scale together (roughly double). In parallel you should see the CPU time double (the same amount of work still), but the wall clock time should stay roughly the same (twice as many cores doing the work).
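
The expectation above can be written as a toy model. This sketch assumes every process needs the same CPU time and that, in the parallel case, each process gets its own core; whether the real jobs do depends on the platform. The function name and the 10-second figure are made up for illustration and are not measurements from these workflows.

```python
import math


def expected_times(n_procs: int, cpu_per_proc: float, cores: int):
    """Return (total CPU time, wall clock time) for identical processes.

    CPU time is the sum over all processes; wall clock time is the number
    of "rounds" needed to run them on the available cores.
    """
    cpu_total = n_procs * cpu_per_proc
    wall = math.ceil(n_procs / cores) * cpu_per_proc
    return cpu_total, wall


# Serial: one process at a time, so CPU and wall clock both double.
foo_serial = expected_times(3, 10.0, cores=1)    # (30.0, 30.0)
bar_serial = expected_times(6, 10.0, cores=1)    # (60.0, 60.0)

# Parallel: one core per process, so CPU doubles but wall clock does not.
foo_parallel = expected_times(3, 10.0, cores=3)  # (30.0, 10.0)
bar_parallel = expected_times(6, 10.0, cores=6)  # (60.0, 10.0)
```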

Serial
image

#!Jinja2
#

[scheduler]
    UTC mode = True
    allow implicit tasks = True

[scheduling]
    initial cycle point = 2019-12-09T09:00Z
    [[graph]]
        R1 = foo_cold => foo_start
        R1/T00 = foo_start[^] => FOO
        T00, T12 = """
            cycle_end[-PT12H] => FOO
            FOO:succeed-all => BAR
            BAR:succeed-any => wipe_bar
            BAR:succeed-all & wipe_bar => cycle_end
        """

[runtime]

    [[root]]
        platform = spice

    [[FOO]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789"; done'
        [[[directives]]]
            --mem=1000
            --ntasks=2
    [[BAR]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3 4 5 6; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789"; done'
        [[[directives]]]
            --mem=500
            --ntasks=2

Parallel
image

#!Jinja2
#

[scheduler]
    UTC mode = True
    allow implicit tasks = True

[scheduling]
    initial cycle point = 2019-12-09T09:00Z
    [[graph]]
        R1 = foo_cold => foo_start
        R1/T00 = foo_start[^] => FOO
        T00, T12 = """
            cycle_end[-PT12H] => FOO
            FOO:succeed-all => BAR
            BAR:succeed-any => wipe_bar
            BAR:succeed-all & wipe_bar => cycle_end
        """

[runtime]

    [[root]]
        platform = spice

    [[FOO]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789" & done; wait'
        [[[directives]]]
            --mem=1000
            --ntasks=2
    [[BAR]]
        script = /usr/bin/time -v bash -c 'for x in 1 2 3 4 5 6; do python -c "for x in range(100000000): (x / 1.234567) ** 2.3456789" & done; wait'
        [[[directives]]]
            --mem=500
            --ntasks=2

Comment thread cylc/flow/cfgspec/globalcfg.py Outdated
Comment thread cylc/flow/job_file.py Outdated
@MetRonnie MetRonnie self-requested a review May 11, 2026 16:10
ChrisPaulBennett and others added 3 commits May 14, 2026 17:35
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
Comment thread cylc/flow/scripts/profiler.py Outdated
Comment thread tests/functional/jobscript/04-profiler.t Outdated
Comment thread cylc/flow/scripts/profiler.py Outdated
Comment thread cylc/flow/scripts/profiler.py
Comment thread cylc/flow/scripts/profiler.py Outdated
Comment thread cylc/flow/scripts/profiler.py
Comment thread tests/unit/scripts/test_profiler.py Outdated
Comment thread cylc/flow/scripts/profiler.py Outdated
Comment thread tests/unit/scripts/test_profiler.py Outdated
most circumstances
''')
Conf('polling interval', VDR.V_INTEGER,
default=10,
Member

@dpmatthews, should we consider reducing this to 1?

Member

@MetRonnie MetRonnie left a comment

ChrisPaulBennett#4

@oliver-sanders might want to cast your eye over this too

5 participants