Invalid job definition: expected a dictionary
This is a strange issue: with the same job, it sometimes occurs with high probability and sometimes does not occur at all. It only affects multinode jobs; normal jobs are fine.
When the issue happens, the current job run is OK, but we can't resubmit that job because the resubmit text area is empty.
I checked the database and found that every affected multinode job has an empty value for sub_id and multinode_definition. However, I added some logging to the code below, and it shows that sub_id and multinode_definition do have values at that point:
job.sub_id = "%d.%d" % (
    parent,
    node_data["protocols"]["lava-multinode"]["sub_id"],
)
logger.fatal(job.sub_id)
logger.fatal("???assign subid")
job.multinode_definition = (
    yaml_data  # store complete submission, inc. comments
)
logger.fatal(job.multinode_definition)
logger.fatal("???assign multinode definition")
try:
    job.save()
except Exception as e:
    logger.fatal("error now")
    logger.fatal(e)
I could not find the cause. Could you give me some ideas on how to debug this issue, or do you know of a possible reason? Thanks.
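One hypothesis that may be worth ruling out is a lost update. This is only an assumption, not something confirmed in the LAVA code: another process could load the same job row before the save above, and later write its stale copy back. In Django, save() without update_fields writes every field of the model, so a stale copy would overwrite sub_id and multinode_definition with empty values. The sketch below reproduces that pattern with plain sqlite3 instead of Django; the table layout, the helper names, and the sample values are all hypothetical and only illustrate the mechanism.

```python
import sqlite3

FIELDS = ("id", "status", "sub_id", "multinode_definition")


def make_db():
    # Hypothetical, simplified stand-in for the real job table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, "
        "sub_id TEXT, multinode_definition TEXT)"
    )
    conn.execute("INSERT INTO job VALUES (1, 'submitted', '', '')")
    return conn


def load(conn, job_id):
    row = conn.execute(
        "SELECT id, status, sub_id, multinode_definition FROM job WHERE id = ?",
        (job_id,),
    ).fetchone()
    return dict(zip(FIELDS, row))


def save_all_fields(conn, job):
    # Mimics a save that writes every column back, changed or not.
    conn.execute(
        "UPDATE job SET status = ?, sub_id = ?, multinode_definition = ? "
        "WHERE id = ?",
        (job["status"], job["sub_id"], job["multinode_definition"], job["id"]),
    )


def save_only(conn, job, fields):
    # Mimics a save restricted to the named columns.
    assignments = ", ".join(f"{name} = ?" for name in fields)
    params = [job[name] for name in fields] + [job["id"]]
    conn.execute(f"UPDATE job SET {assignments} WHERE id = ?", params)


def simulate(second_writer_save):
    conn = make_db()
    stale = load(conn, 1)  # second writer reads before the first one saves
    fresh = load(conn, 1)
    fresh["sub_id"] = "100.0"
    fresh["multinode_definition"] = "job_name: sample"
    save_all_fields(conn, fresh)  # first writer stores the multinode data
    stale["status"] = "running"
    second_writer_save(conn, stale)  # second writer saves its stale copy
    return load(conn, 1)


lost = simulate(save_all_fields)
kept = simulate(lambda conn, job: save_only(conn, job, ["status"]))
print("full save by stale writer:", repr(lost["sub_id"]))
print("partial save by stale writer:", repr(kept["sub_id"]))
```

If this turns out to be the cause, the values logged just before job.save() would be correct while the row ends up empty afterwards, which matches the symptom. Logging the row again shortly after the save, and checking which other code paths save the same job without update_fields, would help confirm or reject the idea.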
It happens randomly regardless of the particular multinode job, but I am still attaching one sample: