hwloc is segfaulting

Issue #1271 closed
Steven R. Brandt created an issue

Running the ET test suite fails when system_topology.cc tries to bind threads to CPUs, because hwloc_get_obj_by_depth(topology, pu_depth, pu_num) returns NULL. The attached patch checks for a NULL return value, prints a warning, and then skips setting the affinity.
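
The guard described in the report can be sketched roughly as follows. To keep the sketch compilable without hwloc installed, `FakePu` and `lookup_pu` are hypothetical stand-ins for an hwloc object and for hwloc_get_obj_by_depth. `try_bind_thread` is also a made-up name, not the function in system_topology.cc or the attached patch.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an hwloc processing-unit (PU) object.
struct FakePu {
  int os_index;
};

// Hypothetical stand-in for hwloc_get_obj_by_depth: returns a null pointer
// when no object exists at the requested index.
const FakePu* lookup_pu(const std::vector<FakePu>& pus, int pu_num) {
  if (pu_num < 0 || pu_num >= static_cast<int>(pus.size())) return nullptr;
  return &pus[pu_num];
}

// Returns true if the affinity step would run, false if it was skipped.
bool try_bind_thread(const std::vector<FakePu>& pus, int pu_num) {
  const FakePu* pu = lookup_pu(pus, pu_num);
  if (pu == nullptr) {
    // Warn and skip instead of dereferencing a null pointer.
    std::fprintf(stderr,
                 "warning: no PU object at index %d; not setting affinity\n",
                 pu_num);
    return false;
  }
  // Real code would bind the calling thread to this object's CPU set here.
  return true;
}
```

With two PUs in the list, `try_bind_thread(pus, 1)` proceeds, while `try_bind_thread(pus, 2)` prints the warning and returns false.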

Comments (11)

  1. Erik Schnetter

    What system is this? What version of hwloc is this?

    Is a NULL return value documented as error return, or did we find an error in hwloc here? (I'm asking because the hwloc developers are very keen on receiving bug reports.)

  2. Steven R. Brandt reporter

    This is my IBM W520 laptop. I used the hwloc that is built into Cactus. NULL is a documented error value indicating that either zero or two or more objects of the requested type exist. I confess I didn't look into which case it was.

  3. Erik Schnetter

    I don't see NULL as a documented error value. The documentation I find for hwloc_get_obj_by_depth is "Returns the topology object at logical index idx from depth depth."

    What are the values of "pu_depth" and "pu_num"? Can you post the output of "lstopo" for your system? I assume you are running this under Linux? Is this a virtual machine?

  4. Steven R. Brandt reporter

    When I said documented error value, I meant the condition that Frank just quoted.

  5. Erik Schnetter

    This page documents two functions, as many man pages do. The error code is for the second function only.

  6. Erik Schnetter

    I just looked at this error again. It occurs when hwloc_get_obj_by_depth is called for the second time, so I assume that one of its arguments is wrong. Can you output its arguments pu_depth and pu_num? If pu_num<0 or pu_num>=8, can you also output core_num, smt_multiplier, and pu_offset? The values of thread_offset, thread_num_in_proc, and thread_num would also be interesting.

  7. Steven R. Brandt reporter

    So I modified my code to look for these conditions, print a more useful error message, and then abort. Since doing that, the error has not recurred, and I am not sure why. Maybe we can close the ticket.
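
The range check and diagnostic output requested in comment 6 might look roughly like the sketch below. `BindingInputs` and `describe_bad_pu_num` are hypothetical names. The thread does not show how pu_num is computed from the other values, so they are only reported, not recomputed. The upper bound `num_pus` is passed in by the caller; the value 8 used in the usage note comes from the range mentioned in comment 6.

```cpp
#include <sstream>
#include <string>

// Values named in the thread, collected for reporting only.
struct BindingInputs {
  int pu_depth;
  int pu_num;
  int core_num;
  int smt_multiplier;
  int pu_offset;
  int thread_offset;
  int thread_num_in_proc;
  int thread_num;
};

// Returns an empty string if pu_num lies in [0, num_pus);
// otherwise returns a one-line diagnostic listing all inputs.
std::string describe_bad_pu_num(const BindingInputs& in, int num_pus) {
  if (in.pu_num >= 0 && in.pu_num < num_pus) return "";
  std::ostringstream msg;
  msg << "pu_num=" << in.pu_num << " outside [0," << num_pus << ")"
      << " pu_depth=" << in.pu_depth
      << " core_num=" << in.core_num
      << " smt_multiplier=" << in.smt_multiplier
      << " pu_offset=" << in.pu_offset
      << " thread_offset=" << in.thread_offset
      << " thread_num_in_proc=" << in.thread_num_in_proc
      << " thread_num=" << in.thread_num;
  return msg.str();
}
```

A caller could print the returned message and abort when it is non-empty, for example after `describe_bad_pu_num(inputs, 8)`, and otherwise go on to the hwloc lookup.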
