[ale] shared research server help

Todor Fassl fassl.tod at gmail.com
Thu Oct 5 09:27:00 EDT 2017


Right, Jim, another aspect of this problem is that most of the students 
don't even realize they need to be careful, much less how to be careful. 
"What? Is there a problem with me asking for 500 gigabytes of ram?" 
Well, the machine has only 256. But I'm just the IT guy and it's not my 
place to demand that these students demonstrate a basic understanding of 
sharing resources before getting started. The instructors would never go 
for that. I am pretty much stuck providing that informally on a 
one-to-one basis. But I think it would be valuable for me to work on 
automating that somehow. Pointers to the wiki, stuff like that.
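
For a first pass, I might do something as simple as a login banner, say 
a script in /etc/profile.d that points people at the wiki. Something 
along these lines (the wiki URL here is just a placeholder):

   # /etc/profile.d/research-etiquette.sh
   # Printed at login on the research servers.
   echo "Shared machine: check free ram with 'free -g' before starting a big job."
   echo "Etiquette and how-to: https://wiki.example.edu/research-servers"

Not a fix by itself, but at least nobody can say they were never told.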

Somebody emailed me off list and made a really good point. The key, I 
think, is information. Well, that and peer pressure. I know nagios can
trigger an alert when a machine runs low on ram or cpu cycles. It might 
even be able to determine who is running the procs that are causing it. 
I can at least put all the users in a nagios group and send them alerts 
when a research server is near an OOM event. I'll have to see what kind 
of granularity I can get out of nagios and experiment with who gets 
notified. I can do things like keep widening the group that gets 
notified of an event if the original setup turns out to be ineffective.
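
Roughly what I'm picturing, just as a sketch. The host and group names 
here are made up, and check_mem is one of the community memory plugins 
run through NRPE, not something in the stock plugin set:

   define contactgroup {
       contactgroup_name    research-users
       alias                Research server users
       members              student1,student2,student3
   }

   define service {
       use                    generic-service   ; template from the sample configs
       host_name              research01
       service_description    Memory usage
       check_command          check_nrpe!check_mem
       contact_groups         research-users
       notification_options   w,c,r
   }

   ; Pull in a bigger group if the first few alerts get ignored.
   define serviceescalation {
       host_name              research01
       service_description    Memory usage
       first_notification     3
       last_notification      0
       contact_groups         research-users,research-admins
   }

The serviceescalation definition is the knob for widening who gets 
notified; first_notification and last_notification control when the 
bigger group gets pulled in.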

This list has really come through for me again just with ideas I can 
bounce around. I'll have to tread lightly though. About a year ago, I 
configured the machines in our shared labs to log someone off after 15 
minutes of inactivity. Believe it or not, that was controversial. Not 
with the faculty but with the students using the labs. It was an easy 
win for me but some of the students went to the faculty with complaints. 
Wait, you're actually defending your right to walk away from a 
workstation in a public place still logged in? In a way that's not such 
a bad thing. This is a university and the students should run the place. 
But they need a referee.




On 10/05/2017 06:52 AM, Jim Kinney wrote:
> Back to the original issue:
> 
> A tool like torque or slurm is really your best solution for intensive 
> shared resources. It prevents 2 big jobs from eating the same machine 
> and can also encourage users to write code that manages resources 
> better so they can run more jobs.
> 
> I have the same problem. One heavy gpu machine (4 tesla P100) only has 
> 64 G ram. A student tried to load 200+G of data into ram.
> 
> A few crashes later, he can run 2 jobs at once; each only eats 30G of 
> ram and one P100.
> 
> On October 4, 2017 6:32:32 PM EDT, Todor Fassl <fassl.tod at gmail.com> wrote:
> 
>     I manage a group of research servers for grad students at a university.
>     The grad students use these machines to do the research for their Ph.D
>     theses. The problem is that they pretty regularly kill off each other's
>     programs by using up all the ram. Most of the machines have 256G of ram.
>     One kid uses 200G and another 100G and one or the other, often both, 
>     die. Sometimes they bring the machines down by hogging the cpu or using
>     up all the ram. Well, the machines never crash but they might as well be
>     down.
> 
>     We really, really don't want to force them to use a scheduling system
>     like slurm. They are just learning and they might run the same piece of
>     code 20 times in an hour.
> 
>     Is there a way to set a limit on the amount of ram all of a user's
>     processes can use? If so, we were thinking of setting it at 50% of the
>     on-board ram. Then it would take 3 students together to trash a machine.
>     It might still happen but it would be a lot less frequent.
> 
>     Any other suggestions? Anything at all? Just keep in mind that we really
>     want to keep it easy for the students to play around.
> 
> 
> -- 
> Sent from my Android device with K-9 Mail. All tyopes are thumb related 
> and reflect authenticity.

-- 
Todd

