Cyborg Agent-Driver API

Cyborg agent interacts with each Cyborg driver in the compute node to discover available devices. This spec defines how the agent-driver API is structured.

No change is proposed to the way the agent discovers the drivers on start or restart.

This spec is common to all accelerators, including GPUs, High Precision Time Synhronization (HPTS) cards, etc. Since FPGAs have more aspects to be considered than other devices, some sections may focus on FPGA-specific factors. The spec calls out the FPGA-specific aspects.

The scope of this spec is Rocky release, but the API has been designed to be extensible for future releases. Accordingly, the spec calls out the Rocky-specific aspects.

Problem description

The 1 specifies that devices are represented using Resource Providers (RPs), Resource Classes (RCs) and traits. The information needed to create them has to come from the Cyborg driver to the Cyborg agent, which in turn needs to push it to the Cyborg Conductor.

The main challenge is discovering the device topology for FPGAs. An FPGA may have one or more Partial Reconfiguration regions, and those regions may have one or more accelerators nested inside them. Further, it may have local memory that is either partitioned or shared among the regions.

Use Cases

  • Devices of different types (GPUs, FPGAs, HPTS cards, Quick Assist) are present in the same host.

  • FPGAs of different types, possibly from different vendors, are present in the same host.

  • An FPGA may have one or more regions. Each region may have one or more accelerators.

    • In Rocky, we may support only one region per FPGA, and only one accelerator per region.

  • For Rocky, it is proposed that local memory need not be exposed as a resource to orchestration. That is because, since there is only one region per FPGA, an instance attached to that region will be able to access all the memory, no matter how much there is. For non-FPGA devices like GPUs, there does not seem to be a requirement to expose video RAM.

Cyborg will assume and handle the following component relationships:

  • One product (e.g. Intel PAC Arria 10) may correspond to multiple PCI vendor/device IDs.

  • One PCI vendor/device ID may correspond to different region type IDs. This could be either because there are multiple regions in the same device or because there are different versions/revisions of the same device.

  • But the same region type ID will never show up in products with different PCI IDs.

Proposed change

Today, the Cyborg agent invokes the discover() API for each driver that it finds. The discover() API returns a dictionary indexed by the PCI BDF of a device. The value element in the key-value pair of the dictionary contains the components and characteristics of the device with that BDF.

We propose to retain the same model, but enhance the dictionary to include enough information to create the resource providers and traits needed to populate Placement. Here are the additional proposed keys in the device dictionary for each PF:

"type": <enum-string> # One of GPU, FPGA, etc.
"vendor": <string>
"product": <string>

Also, in the regions entry for each PF, it is proposed to add the following keys:

"region-type-uuid": <uuid> # Optional, default: NULL
"bitstream-id": <uuid> # Glance/other UUID, optional, default: NULL
"function-uuid": <uuid> # Optional, default: NULL

When the agent receives this dictionary for a device, it will do the following:

  • If there is nested RP support, create an RP for the device and each region within.

  • Create a device type trait: CUSTOM_<type>_<vendor>_<product>. Apply it to the device RP (if nRP support exists) or the compute node RP.

    • E.g. CUSTOM_FPGA_INTEL_PAC_ARRIA10.

    • NOTE: The agent will convert all characters to upper case, replace spaces with underscores, and check for conformance to custom trait syntax (see 2)

  • Create region type traits for each region, of the form: CUSTOM_<type>_<vendor>_REGION_<type-uuid>. Apply them to the corresponding region RP (if nRP support exists) or the compute node RP.

    • E.g. CUSTOM_FPGA_INTEL_REGION_<type-uuid>

    • NOTE: For UUIDs, the agent will convert all hexadecimal digits to upper case, replace hyphens with underscores and validate all characters.

  • Create function type traits for each function in each region, of the form: CUSTOM_<type>_<vendor>_FUNCTION_<function-uuid>. Apply them to the corresponding region RP (if nRP support exists) or the compute node RP.

    • E.g. CUSTOM_FPGA_INTEL_FUNCTION_<function-uuid>

Alternatives

N/A

Data model impact

Add the new fields to the database under Deployables and Attributes.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Assignee(s)

None

Work Items

Dependencies

None

Testing

Need to update unit tests to check for the newly added fields.

Documentation Impact

None

References

1

Cyborg/Nova Scheduling spec

2

Custom Traits

History

Optional section intended to be used each time the spec is updated to describe new design, API or any database schema updated. Useful to let reader understand what’s happened along the time.

Revisions

Release Name

Description

Rocky

Introduced