[ale] shared research server help

Jim Kinney jim.kinney at gmail.com
Thu Oct 5 10:50:21 EDT 2017


The politics can get messy. Jeffry's later post about providing data on the
hog issue is spot on.

I use Ganglia to provide a real-time display of cluster usage (RAM, CPU,
networking; adding GPU now).

I guess I'm pretty lucky as I'm also "just the IT guy" but I get to make it
plain that my job is to help them graduate. Yes, I do spend time
individually helping each student learn how to not break things. I also
make it very plain that a system crash is an extreme failure on their part.
Sure, I have to reboot a machine (YAY addressable PDUs and IPMI!) but it
breaks _their_ work worse. My current quest is to beat them all with the
clue-by-four of parallelism. LEARN how to think in parallel processes.
LEARN how to write code that can support multiple threads. LEARN how to
split large data sets into chunks that can be processed by multiple
systems/cores/nodes/gpus, etc.
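The chunking step can be sketched in a few lines of Python. This is a minimal illustration, not anything from the actual workload: the chunk count and the per-chunk work function are placeholders.

```python
from multiprocessing import Pool

def chunks(data, n):
    """Split a sequence into n roughly equal contiguous chunks."""
    k, r = divmod(len(data), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(data[start:end])
        start = end
    return out

def work(chunk):
    """Hypothetical per-chunk job; here it just sums the numbers."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(processes=8) as pool:   # 8 workers, roughly one per core
        partials = pool.map(work, chunks(data, 8))
    print(sum(partials))              # same answer as sum(data), computed in parallel
```

The same shape works whether the "workers" are cores on one box or nodes in a cluster; only the dispatch mechanism changes.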

Latest fun: machine learning on image analysis for eye tracking from a
video for ADHD work. It generates a video with 15K frames; each frame has a
data set of eye position in pixel coordinates per eye. The process was
trained with the worst design of all: each frame is cropped to generate an
enlarged image of each individual eye, and that output is run again to
determine gaze direction. 15,000 frames -> 30,000 images => all single
threaded. <sigh>
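That per-frame pipeline is embarrassingly parallel, so it's a natural fit for a process pool. A sketch with Python's multiprocessing follows; crop_eyes here is a hypothetical stand-in for the real cropping code.

```python
from multiprocessing import Pool

def crop_eyes(frame_path):
    """Hypothetical stand-in: crop one frame into two per-eye image files."""
    # Real code would load the frame and crop around each detected eye;
    # here we just return the two output paths it would produce.
    return [frame_path + ".left.png", frame_path + ".right.png"]

if __name__ == "__main__":
    frames = ["frame_%05d.png" % i for i in range(15000)]
    with Pool() as pool:              # one worker per core by default
        pairs = pool.map(crop_eyes, frames, chunksize=256)
    eye_images = [img for pair in pairs for img in pair]
    print(len(eye_images))            # 15,000 frames -> 30,000 eye images
```

Since each frame is independent, the speedup is close to linear in core count; chunksize just keeps the inter-process overhead down.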

<rant> I don't come from a comp-sci background so I've had to figure out a
lot on my own. It seems the younger programmers are more and more
disconnected from the reality of the hardware they use. "Load this data set
and start my algorithm" is the mindset. The engineering mentality of HOW to
do the process using both hardware and software is missing. </rant>

On Thu, Oct 5, 2017 at 9:27 AM, Todor Fassl <fassl.tod at gmail.com> wrote:

> Right, Jim, another aspect of this problem is that most of the students
> don't even realize they need to be careful, much less how to be careful.
> "What? Is there a problem with me asking for 500 gigabytes of ram?" Well,
> the machine has only 256. But I'm just the IT guy and it's not my place to
> demand that these students demonstrate a basic understanding of sharing
> resources before getting started. The instructors would never go for that.
> I am pretty much stuck providing that informally on a one-to-one basis. But
> I think it would be valuable for me to work on automating that somehow.
> Pointers to the wiki, stuff like that.
>
> Somebody emailed me off list and made a really good point. The key, I
> think, is information. Well, that and peer pressure. I know nagios can
> trigger an alert when a machine runs low on ram or cpu cycles. It might
> even be able to determine who is running the procs that are causing it. I
> can at least put all the users in a nagios group and send them alerts when
> a research server is near an OOM event. I'll have to see what kind of
> granularity I can get out of nagios and experiment with who gets notified.
> I can do things like keep widening the group that gets notified of an event
> if the original setup turns out to be ineffective.
>
> This list has really come through for me again just with ideas I can
> bounce around. I'll have to tread lightly though. About a year ago, I
> configured the machines in our shared labs to log someone off after 15
> minutes of inactivity. Believe it or not, that was controversial. Not with
> the faculty but with the students using the labs. It was an easy win for me
> but some of the students went to the faculty with complaints. Wait, you're
> actually defending your right to walk away from a workstation in a public
> place still logged in? In a way that's not such a bad thing. This is a
> university and the students should run the place. But they need a referee.
>
>
>
>
>
> On 10/05/2017 06:52 AM, Jim Kinney wrote:
>
>> Back to the original issue:
>>
>> A tool like torque or slurm is really your best solution to intensive
>> shared resources. It prevents 2 big jobs from eating the same machine and
>> can also encourage users to code better to manage resources better so they
>> can run more jobs.
>>
>> I have the same problem. One heavy gpu machine (4 Tesla P100s) only has
>> 64G of ram. A student tried to load 200+G of data into ram.
>>
>> A few crashes later, he can run 2 jobs at once; each eats only 30G of ram
>> and one P100.
>>
>> On October 4, 2017 6:32:32 PM EDT, Todor Fassl <fassl.tod at gmail.com>
>> wrote:
>>
>>     I manage a group of research servers for grad students at a university.
>>     The grad students use these machines to do the research for their Ph.D.
>>     theses. The problem is that they pretty regularly kill off each other's
>>     programs by using up all the ram. Most of the machines have 256G of ram.
>>     One kid uses 200Gb and another 100Gb and one or the other, often both,
>>     die. Sometimes they bring the machines down by hogging the cpu or using
>>     up all the ram. Well, the machines never crash but they might as well be
>>     down.
>>
>>     We really, really don't want to force them to use a scheduling system
>>     like slurm. They are just learning and they might run the same piece of
>>     code 20 times in an hour.
>>
>>     Is there a way to set a limit on the amount of ram all of a user's
>>     processes can use? If so, we were thinking of setting it at 50% of the
>>     on-board ram. Then it would take 3 students together to trash a machine.
>>     It might still happen but it would be a lot more infrequent.

>>     Any other suggestions? Anything at all? Just keep in mind that we really
>>     want to keep it easy for the students to play around.
>>
>>
>> --
>> Sent from my Android device with K-9 Mail. All tyopes are thumb related
>> and reflect authenticity.
>>
>
> --
> Todd
>
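On the quoted question about capping the ram all of one user's processes
can use: a systemd slice drop-in is one scheduler-free option. This is a
sketch, assuming a systemd-based distro with cgroups v2 managing user
sessions; the 128G figure is just 50% of a 256G box.

```ini
# /etc/systemd/system/user-.slice.d/50-memory.conf
# Applies to every user-UID.slice, i.e. all of one user's processes combined.
[Slice]
MemoryHigh=120G
MemoryMax=128G
```

After `systemctl daemon-reload`, MemoryHigh throttles and reclaims before
the hard MemoryMax ceiling triggers the OOM killer, so a runaway job slows
down before it kills anyone else's work. On cgroups v1 the equivalent knob
is MemoryLimit=.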



-- 
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you gain
at one end you lose at the other. It's like feeding a dog on his own tail.
It won't fatten the dog.
- Speech 11/23/1900 Mark Twain


http://heretothereideas.blogspot.com/

