This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

Test T2_CH_CERN_P5 for production #1101

Open
haozturk opened this issue Dec 1, 2022 · 6 comments

Comments

@haozturk
Collaborator

haozturk commented Dec 1, 2022

HLT resources are being moved under this site name, and SI has informed us that the site is ready. I submitted the following workflow as a test before enabling the site for production:

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221201_141613_9953

Let's see how it goes.

@haozturk
Collaborator Author

haozturk commented Dec 7, 2022

Saqib let me know that the testbed will not work, because the testbed agent isn't connected to the global pool. I submitted the same workflow in prod which should work: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291

@haozturk
Collaborator Author

haozturk commented Dec 8, 2022

The request was picked up by cmsgwms-submit8.fnal.gov, but I don't see any jobs created or injected, and I'm not sure why. Can this agent work with T2_CH_CERN_P5 in principle?

@haozturk
Collaborator Author

Hi @amaltaro @todor-ivanov, can you please check why this workflow isn't moving on submit8?

@todor-ivanov
Contributor

Hi @haozturk
As discussed during the meeting, this workflow has already landed on submit8, which is having problems. Still, the agent managed to materialize 350 jobs from the WMCore queue into condor jobs:

[cmsdataops@cmsgwms-submit8 current]$ condor_q  -const 'WMAgent_RequestName == "haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291"'
Total for query: 350 jobs; 0 completed, 0 removed, 350 idle, 0 running, 0 held, 0 suspended 

Taking one of them and analyzing it:

[cmsdataops@cmsgwms-submit8 current]$ condor_q -better 366382.56
...
Job 366382.056 defines the following attributes:
    ExtraMemory = 500
    JobCpus = ((JobStatus =!= 1) && (JobStatus =!= 5) &&  !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus
    JobStatus = 1
    MaxCores = 4
    MinCores = 2.0
    OriginalCpus = 4
    OriginalMemory = 7900
    RequestCpus = WMCore_ResizeJob ? ( !isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus
    RequestDisk = 5000000
    RequestMemory = OriginalMemory + ExtraMemory * (WMCore_ResizeJob ? (RequestCpus - OriginalCpus) : 0)
    RequestResizedCpus = (Cpus > MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)
    REQUIRED_ARCH = "X86_64"
    WMCore_ResizeJob = false

The Requirements expression for job 366382.056 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]      142139  stringListMember(TARGET.Arch,REQUIRED_ARCH)
[1]      142139  TARGET.OpSys == "LINUX"
[3]       70839  TARGET.Disk >= RequestDisk
[5]       69910  TARGET.Memory >= RequestMemory
[6]       34009  [3] && [5]
[7]       69517  TARGET.Cpus >= RequestCpus
[8]       26381  [6] && [7]


366382.056:  Run analysis summary ignoring user priority.  Of 33408 machines,
    343 are rejected by your job's requirements

Just a guess here: those 343 sound like slots at the correct destination that fail to match the job requirements. Maybe SI needs to check how those slots are built for this workflow.
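To make the comparison against those slots concrete, here is a minimal Python sketch that re-evaluates the job's resource expressions using the attribute values printed by condor_q above. It is only an illustration, not WMCore or HTCondor code: the function names are hypothetical, and JobCpus is simplified to OriginalCpus, which is what the expression reduces to while the job is idle (JobStatus = 1).

```python
# Attribute values copied from the condor_q analysis output above.
job = {
    "ExtraMemory": 500,
    "MaxCores": 4,
    "MinCores": 2.0,
    "OriginalCpus": 4,
    "OriginalMemory": 7900,
    "WMCore_ResizeJob": False,
}


def request_resized_cpus(job, slot_cpus):
    """Mirror RequestResizedCpus: clamp the slot's Cpus to [MinCores, MaxCores]."""
    if slot_cpus > job["MaxCores"]:
        return job["MaxCores"]
    if slot_cpus < job["MinCores"]:
        return job["MinCores"]
    return slot_cpus


def request_cpus(job, slot_cpus=None):
    """Mirror RequestCpus; slot_cpus=None stands in for an undefined Cpus."""
    if not job["WMCore_ResizeJob"]:
        return job["OriginalCpus"]
    if slot_cpus is not None:
        return request_resized_cpus(job, slot_cpus)
    # Simplification: for an idle job, JobCpus evaluates to OriginalCpus.
    return job["OriginalCpus"]


def request_memory(job, slot_cpus=None):
    """Mirror RequestMemory: extra memory is only added when resizing is on."""
    if job["WMCore_ResizeJob"]:
        extra_cores = request_cpus(job, slot_cpus) - job["OriginalCpus"]
    else:
        extra_cores = 0
    return job["OriginalMemory"] + job["ExtraMemory"] * extra_cores


cpus = request_cpus(job)
memory = request_memory(job)
print(cpus, memory)  # 4 7900
```

With WMCore_ResizeJob = false, the job therefore asks for 4 CPUs, 7900 MB of memory, and a RequestDisk of 5000000, so a slot at the site has to offer at least that much to pass steps [3], [5], and [7] of the analysis.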

@saqibhaleem

Hi @haozturk

MoniT is now showing successful completion of production jobs on this new site name. The workflows were queued on cmsgwms-submit8.fnal.gov. Can you please verify, and resume any further tests if needed? Thanks.

@haozturk
Copy link
Collaborator Author

Hi @saqibhaleem thanks for the update. Things look good from our side. I don't see a reason for further testing. Feel free to scale up.
