September 5, 2018 at 7:22 pm #25633
I think I know the answer to this already, but I thought I’d best check, as I can’t get a straight answer from our CFD tech, or at least not an answer I think is correct.
At work, he’s set our modelling machines up into a cluster. It’s nothing like a Beowulf cluster (that was tried initially and suffered a slowdown, as it was communicating over gigabit networks rather than anything more esoteric), but more like the attached diagram, where each node is sent a task from the central file server (which serves only the modelling machines). That server also hosts the NFS share the software writes the modelling results to.
My query relates to NFS. The queuing software sometimes requires that the server be rebooted to clear the queue. When this happens, obviously, the NFS share will restart as well.
During this period, what would happen to the data being written? In my mind, it would fail, as the server cannot be reached. However, the CFD tech says that as each node has a folder in the same location, nothing adverse will happen. To me, that just means it’ll write the file locally and then, when the NFS share is back up, continue writing to the NFS share. In the meantime, we’ve potentially lost data that’s now stored on a local drive, so a chunk of data could be missing from the model.
Would that be people’s understanding? I guess it comes down to how the modelling software reads and writes its results.
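For what it’s worth, what actually happens to an in-flight write mostly depends on how the clients mount the share. With the default hard option, a Linux NFS client blocks the write and retries until the server comes back (it does not silently fall back to the local disk); with soft, the write can eventually give up and return an I/O error to the application. A sketch of the two styles of /etc/fstab entry (server name and paths are made up, not the actual cluster config):

```
# hard (the default): writes to an unreachable server block and retry until it returns
fileserver:/export/results  /mnt/results  nfs  hard  0 0

# soft: writes give up after 'retrans' attempts and return EIO to the application
fileserver:/export/results  /mnt/results  nfs  soft,timeo=100,retrans=3  0 0
```

So if the shares are hard-mounted, the likely behaviour during the reboot is a stalled job that resumes, rather than silent writes to a local drive.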
September 5, 2018 at 7:33 pm #25635
I would imagine the client would begin writing where it left off when the connection was lost.

September 5, 2018 at 8:01 pm #25640
Depends what you mean by ‘clear the queue’. If the queue is a pointer to a local buffer on each machine, then it could start all over. If the ‘queue’ is the actual circular buffer, then everything is gone. If it is only a pointer to a place in a circular buffer, then it ought to start at the head of the buffer, but it is hard to say.

September 6, 2018 at 8:18 am #25654
Wheels-Of-Fire (@grahamdearsley)
Well, if the server is down, the only place left to store data would be a local drive.

September 6, 2018 at 9:04 am #25655
Yes, but the question is: does it play catch-up when the server comes back online? Given all the fail-safes in Unix file systems, like journalling, my guess would be yes. But that is a guess.
Like Ed, I’m wondering more what this clearing of the queue is all about.

September 6, 2018 at 9:14 am #25656
The queuing system is Torque. We submit the modelling jobs to the Torque queue and then, as machines free up, the next models get run. For whatever reason, it sometimes fails to pass on the jobs, and a reboot solves the issue.
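For context, a Torque job is normally just a shell script with #PBS directives. A minimal sketch (the job name, resource list, paths and solver command are all made up for illustration):

```
#!/bin/sh
#PBS -N cfd_model                  # job name (hypothetical)
#PBS -l nodes=1:ppn=8              # one node, eight cores
#PBS -o /mnt/results/cfd_model.log # stdout to the shared results area

cd "$PBS_O_WORKDIR"                # directory qsub was run from
./run_model input.cfg              # hypothetical solver invocation
```

It’s submitted with `qsub job.sh` and monitored with `qstat`. If it’s only the scheduler that wedges, restarting the pbs_server daemon is often enough; a full reboot (and hence an NFS restart) shouldn’t be needed just to clear the queue.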
In terms of data loss, I don’t think we’ve seen any errors from missing data, so perhaps it is playing catch-up. I guess the main way to make sure is to try it and find out. I’ve got NFS set up at home for my NAS, so I may well set a model running, restart the NAS and see what happens.

September 6, 2018 at 10:01 am #25658
Ah, I see. I think the techie may have been answering a different question, aimed at the jobs rather than the output. Torque uses NFS server and client shares to co-ordinate the nodes. It may be that he’s using the same share as the local target for the job output (I would, unless there’s a very good reason not to).
So my guess is it works something like Synology Drive (formerly Cloud Station).

September 6, 2018 at 1:15 pm #25659
I believe it is the same share. I hadn’t looked into Torque much, other than when I’ve had to bypass it and run files on the nodes individually while he was on holiday, when it went wrong and a restart wouldn’t fix it.
I thought it just issued commands via SSH, but maybe not.

September 6, 2018 at 4:56 pm #25660
“Restart the server and the new options are sourced”
my guess would be that it starts running the original script all over again. It is, however, possible that there is some sort of flag file which stores a job-completion flag; if so, it would rapidly jump to the failure point of the previous run.

September 7, 2018 at 7:49 am #25683
Given: “Restart the server and the new options are sourced — my guess would be that it starts running the original script all over again. It is, however, possible that there is some sort of flag file which stores a job-completion flag; if so, it would rapidly jump to the failure point of the previous run.”
Well, whatever is running continues to run with no issues, and after the restart it does actually start processing the queue again, so it does seem to work.

September 7, 2018 at 9:46 am #25688
From what I’ve seen, each client goes off independently doing what it’s doing, then when it’s finished it flags that it’s free (I believe this is written to a log file on the NFS share). So the absence of the server makes no difference to the client at that stage; it’s got its job and gets on with it.
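The “flags that it’s free” idea can be sketched with a flag-file pattern. This is only an illustration of the mechanism described above, not Torque’s actual implementation, and all the names are made up: each node writes a temp file and atomically renames it into a shared directory, so a half-written flag is never visible to readers.

```python
import os
import tempfile

def flag_free(shared_dir, node_name):
    """Atomically drop a '<node>.free' flag into shared_dir."""
    fd, tmp = tempfile.mkstemp(dir=shared_dir)
    with os.fdopen(fd, "w") as f:
        f.write("free\n")
    # rename within one directory is atomic, including over NFS,
    # so other nodes either see the complete flag or no flag at all
    os.rename(tmp, os.path.join(shared_dir, node_name + ".free"))

def free_nodes(shared_dir):
    """List the nodes that have flagged themselves free."""
    return sorted(name[:-5] for name in os.listdir(shared_dir)
                  if name.endswith(".free"))
```

Pointing shared_dir at the NFS mount (e.g. flag_free("/mnt/shared", "node3")) would let any machine see which nodes are idle without talking to a central daemon.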
Whether it then picks the next job off the list or the server assigns it, I don’t know. My guess would be that it’s server-assigned, to stop conflicts. There is a daemon involved, and NFS is not necessary but apparently makes things much simpler.

September 7, 2018 at 11:10 am #25693
From the comments and description, it is analogous to multiprocessing in Python. I use this if I’m lashing up a real-time Pi program (e.g. clock, web process 1, web process 2, multimedia, etc.). I’m not using a cluster (though I could); instead the Pi is doing all the hard graft of allocating multicore resources.
The script in this specific instance is a loop, and individual jobs are written in a circular list. Some of the jobs (such as an RSS display screen) have their own circular lists.
If one job crashes in this case, the others just carry on to completion and then stop. For your case it would depend on how NFS handles crashes on one processor in a simple cluster, i.e. stop the lot, or carry on but spit out a job error for the one that failed.
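That behaviour (one job failing while the rest run to completion) can be sketched with Python’s multiprocessing; the job contents here are made up, with job 2 rigged to fail:

```python
from multiprocessing import Pool

def run_job(job_id):
    # Hypothetical workload: job 2 is rigged to crash, the rest succeed.
    if job_id == 2:
        raise RuntimeError("job %d crashed" % job_id)
    return "job %d done" % job_id

def run_all(n_jobs=4):
    """Run jobs in parallel, recording per-job outcomes instead of aborting."""
    outcomes = {}
    with Pool(2) as pool:
        pending = {i: pool.apply_async(run_job, (i,)) for i in range(n_jobs)}
        for i, result in pending.items():
            try:
                outcomes[i] = result.get(timeout=30)
            except RuntimeError as exc:
                # The failed job is reported, but the others still complete.
                outcomes[i] = "failed: %s" % exc
    return outcomes

if __name__ == "__main__":
    for i, outcome in sorted(run_all().items()):
        print(i, outcome)
```

Jobs 0, 1 and 3 finish normally while job 2 reports its error, which is the “carry on but spit out a job error” case described above.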