
icl-utk-edu / papi


License: Other

HTML 0.14% Shell 0.15% Python 0.24% Makefile 0.72% MATLAB 0.03% C 93.78% Assembly 0.01% Cuda 0.69% C++ 1.10% M4 0.01% Io 0.01% Fortran 0.65% Roff 2.41% SWIG 0.05% Perl 0.04%

papi's Introduction

PAPI: The Performance Application Programming Interface

Innovative Computing Laboratory (ICL)

University of Tennessee, Knoxville (UTK)




About

The Performance Application Programming Interface (PAPI) provides tool designers and application engineers with a consistent interface and methodology for the use of low-level performance counter hardware found across the entire compute system (e.g., CPUs, GPUs, on/off-chip memory, interconnects, I/O system, energy/power). PAPI enables users to see, in near real time, the relations between software performance and hardware events across the entire computer system.

The ECP Exa-PAPI project builds on the latest PAPI project and extends it with:

  • Performance counter monitoring capabilities for new and advanced ECP hardware and software technologies.
  • Fine-grained power management support.
  • Functionality for performance counter analysis at "task granularity" for task-based runtime systems.
  • "Software-defined Events" that originate from the ECP software stack and are currently treated as black boxes (e.g., communication libraries, math libraries, task-based runtime systems).

The objective is to enable monitoring of both types of performance events---hardware- and software-related events---in a uniform way, through one consistent PAPI interface. Third-party tools and application developers will have to handle only a single hook to PAPI in order to access all hardware performance counters in a system, including the new software-defined events.


Documentation


Getting Help


Contributing

The PAPI project welcomes contributions from new developers. Contributions can be offered through the standard GitHub pull request model. We strongly encourage you to coordinate large contributions with the PAPI development team early in the process.

For timely pull request reviews and feedback, it is important to submit one (1) pull request per feature / bug fix.

In order to create a pull request on a public read-only repo, you will need to do the following:

  1. Fork the PAPI repo (click "+" on the left and "Fork this repository").

  2. Clone it.

  3. Make your changes and push them.

  4. Click "create pull request" from your repo (not the PAPI repo).


Resources


License

Copyright (c) 2019, University of Tennessee
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of the University of Tennessee nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL UNIVERSITY OF TENNESSEE BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

papi's People

Contributors

adanalis, anustuvicl, ayarkhan, coctic, dbarry9, deater, fweimer-rh, g-ragghianti, gcongiu, gvnn3, hanumanth0004, iskra-anl, jagode, jhenryicl, jlinford, jrodgers-github, peinanz, plvlnisse, pmucci, redcrash, sragate, swarup-sahoo, tcojean, tony-icl, treece-burgess, tushar-mohan, wcohen, willschm, winklerf-zih, yamada-masahiko


papi's Issues

Remove the unused bundled libpfm-3.y, perfctr-2.6.x, and perfctr-2.7.x source from papi

PAPI bundles four different performance counter access libraries: libpfm4, libpfm-3.y, perfctr-2.6.x, and perfctr-2.7.x. Of those, the only one that has been actively updated in the past decade is libpfm4. The other three don't appear to be currently used and have not been updated since 2012. Eliminating the unused performance counter access libraries would remove about 11 MB of source, about 30% of the total source code in papi.

`PAPI_ipc` fails with `Event does not exist` on 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

A simple call to PAPI_ipc fails on my ThinkPad P1 Gen 4 laptop, whereas it worked fine on my older laptop and also on my workstation with an AMD CPU. The CPU is not yet one of the hybrid Intel CPUs, and I have elevated my perf privileges (perf stat works just fine).

#include <papi.h>

#include <iostream>

struct Ipc
{
    static Ipc measure()
    {
        Ipc data;
        int ret = PAPI_ipc(&data.realTime, &data.processTime,
                           &data.instructions, &data.ipc);
        if (ret != 0) {
            std::cerr << "IPC measurement failed with code " << ret << ": "
                      << PAPI_strerror(ret) << std::endl;
        }

        return data;
    }

    void print(const char* label) const
    {
        std::cout << label
                  << "\n\trealtime elapsed: " << realTime
                  << ", process time elapsed: " << processTime
                  << "\n\tinstructions executed: " << instructions
                  << ", cycles: " << (instructions / ipc)
                  << ", IPC: " << ipc
                  << "\n";
    }

    float realTime = 0;
    float processTime = 0;
    long long instructions = 0;
    float ipc = 0;
};

int main()
{
    Ipc::measure().print("test");
    return 0;
}

Compiled with:

$ g++ -g -O2 test.cpp -lpapi -o test_papi
$ ./test_papi 
IPC measurement failed with code -7: Event does not exist
test
        realtime elapsed: 0, process time elapsed: 0
        instructions executed: 0, cycles: -nan, IPC: 0

With strace I can see:

perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_INSTRUCTIONS, sample_period=0, sample_type=0, read_format=0, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, 0) = 3
close(3)                                = 0
perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_INSTRUCTIONS, sample_period=0, sample_type=0, read_format=0, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 0, -1, -1, 0) = 3
close(3)                                = 0

When I instead run strace perf stat -e instructions on some binary I see:

perf_event_open({type=PERF_TYPE_HARDWARE, size=0x88 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_INSTRUCTIONS, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 4762, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
...
read(3, "\202T \0\0\0\0\0\250\306\6\0\0\0\0\0\250\306\6\0\0\0\0\0", 24) = 24
close(3)                                = 0

My system:

inxi -GSC -xx
System:
  Host: agathemoarbauer Kernel: 6.6.2-arch1-1 arch: x86_64 bits: 64
    compiler: gcc v: 13.2.1 Desktop: KDE Plasma v: 5.27.9 tk: Qt v: 5.15.11
    wm: kwin_x11 dm: SDDM Distro: Arch Linux
CPU:
  Info: 8-core model: 11th Gen Intel Core i7-11850H bits: 64 type: MT MCP
    arch: Tiger Lake rev: 1 cache: L1: 640 KiB L2: 10 MiB L3: 24 MiB
  Speed (MHz): avg: 969 high: 3506 min/max: 800/4800 cores: 1: 800 2: 800
    3: 800 4: 800 5: 800 6: 800 7: 800 8: 3506 9: 800 10: 800 11: 800 12: 800
    13: 800 14: 800 15: 800 16: 800 bogomips: 79888
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx

PAPI ROCm: Confusion with `HIP_VISIBLE_DEVICES`

When setting HIP_VISIBLE_DEVICES the id in the :device=%d event name suffix is still the hardware device index, not the HIP device index.

The ./sample_multi_kernel_monitoring test always uses :device=0, so starting it with different HIP_VISIBLE_DEVICES values will result in 0-value results:

$ HIP_VISIBLE_DEVICES=0 ./sample_multi_kernel_monitoring
rocm:::SQ_INSTS_VALU:device=0 : 191459309210
rocm:::SQ_INSTS_SALU:device=0 : 73288502349
rocm:::SQ_WAVES:device=0 : 526324
rocm:::SQ_WAVES_RESTORED:device=0 : 2032
$ HIP_VISIBLE_DEVICES=1 ./sample_multi_kernel_monitoring 
rocm:::SQ_INSTS_VALU:device=0 : 0
rocm:::SQ_INSTS_SALU:device=0 : 0
rocm:::SQ_WAVES:device=0 : 0
rocm:::SQ_WAVES_RESTORED:device=0 : 0

ROCm: Memory Access Fault in Sampling Mode for Various Events

Some events cause a memory fault, as shown below, when using the papi_command_line utility:

${PAPIDIR}/bin/papi_command_line  "rocm:::TCP_TCC_NC_ATOMIC_REQ_sum:device=0"

This utility lets you add events from the command line interface to see if they work.

Successfully added: rocm:::TCP_TCC_NC_ATOMIC_REQ_sum:device=0

Memory access fault by GPU node-2 (Agent handle: 0x4b18cb0) on address 0x7f38e2247000. Reason: Unknown.
Aborted (core dumped)

and

${PAPIDIR}/bin/papi_command_line  "rocm:::TA_BUSY_avr:device=0"

This utility lets you add events from the command line interface to see if they work.

Successfully added: rocm:::TA_BUSY_avr:device=0

Memory access fault by GPU node-2 (Agent handle: 0x55a0ca0) on address 0x7fdaf23d3000. Reason: Unknown.
Aborted (core dumped)

These memory faults do not occur if intercept mode is enabled via the following:
export ROCP_HSA_INTERCEPT=1

PAPI CUDA: `PAPI_read` performance degradation

Finding a considerable performance degradation for PAPI_read operations when using the updated PAPI CUDA component.

Time required by simple benchmark performing 1000 PAPI_read operations from 10 executions (by component version):

version   avg (ms)   min (ms)   max (ms)
old          57428      56444      58584
new         326458     325660     327451

Average performance penalty: 5.68x

Reproducer/benchmark will be attached shortly (please run on y'alls end to confirm behavior).

Analyze the cache behavior using PAPI

Hi, I want to use the PAPI profiling tool to analyze the cache behavior of my code. I have used some native events on a Xeon platform because some preset events are not supported. My code fails with an error (invalid argument) when I add the event PAPI_L2_DCA, even though the call PAPI_query_event(PAPI_L2_DCA) returns PAPI_OK. I need some help to solve the problem. Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <omp.h>
#include <pthread.h>
#include <cblas.h>
#include <lapacke.h>
#include <papi.h>

#define REPEAT       3
#define NUM_THREADS  10
#define DIM 	     1024
#define ERROR_RETURN(retval) { fprintf(stderr, "[Error] file: %s, line: %d, retval: %s \n", __FILE__, __LINE__, PAPI_strerror(retval));  exit(retval); }

int add_papi_events() {
	int retval;
	int EventSet = PAPI_NULL, native_code;
	
	char *native_names[] = {
		"PERF_COUNT_HW_CACHE_L1D:MISS", "PERF_COUNT_HW_CACHE_L1D:ACCESS", "PERF_COUNT_HW_CACHE_LL:MISS", "PERF_COUNT_HW_CACHE_LL:ACCESS"
	};

	if((retval = PAPI_create_eventset(&EventSet)) != PAPI_OK)
		ERROR_RETURN(retval);

	// L1D
	if((retval = PAPI_event_name_to_code(native_names[0], &native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_event_name_to_code(native_names[1], &native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, native_code)) != PAPI_OK)
		ERROR_RETURN(retval);

	// L2D
	if((retval = PAPI_query_event(PAPI_L2_DCM)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, PAPI_L2_DCM)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_query_event(PAPI_L2_DCA)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, PAPI_L2_DCA)) != PAPI_OK)
		ERROR_RETURN(retval);

	// // L3
	if((retval = PAPI_event_name_to_code(native_names[2], &native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_event_name_to_code(native_names[3], &native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	if((retval = PAPI_add_event(EventSet, native_code)) != PAPI_OK)
		ERROR_RETURN(retval);
	
	return EventSet;
}

// ---------------------------------------------------------------------------------------------------------------------------------------------------
// Events:
// PERF_COUNT_HW_CACHE_L1D:ACCESS/MISS; PAPI_L2_DCM/PAPI_L2_DCA; PERF_COUNT_HW_CACHE_LL:ACCESS/MISS

// Compile:
// gcc -fopenmp test_cache.c -o test_cache.x -I/home/xx/lib/icl-papi/include/ -I/home/xx/lib/openblas_s/include /home/xx/lib/icl-papi/lib/libpapi.a /home/xx/lib/openblas_s/lib/libopenblas.a -lm

// ---------------------------------------------------------------------------------------------------------------------------------------------------


int main(int argc, const char* argv[])
{
    int m, n, k, lda, ldb, ldc;
    m = n = k = DIM;
    lda = ldb = ldc = DIM;
	double alpha = 0.001, beta = 0.001;

	int seed[] = {0, 0, 0, 1};
	double **a = NULL, **b = NULL, **c = NULL;
	a = (double **) malloc (sizeof(double *) * NUM_THREADS);
	b = (double **) malloc (sizeof(double *) * NUM_THREADS);
	c = (double **) malloc (sizeof(double *) * NUM_THREADS);
	for(int i = 0; i < NUM_THREADS; i ++) {
		a[i] = (double *) malloc (sizeof(double) * m * k);
		LAPACKE_dlarnv(1, seed, m*k, a[i]);
		b[i] = (double *) malloc (sizeof(double) * k * n);
		LAPACKE_dlarnv(1, seed, k*n, b[i]);
		c[i] = (double *) malloc (sizeof(double) * m * n);
		LAPACKE_dlarnv(1, seed, m*n, c[i]);
	}

	int retval;
	if ((retval=PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT)
        ERROR_RETURN(retval);

	long long values[6];
	int EventSet = PAPI_NULL;

	EventSet = add_papi_events();
    if ((retval = PAPI_start(EventSet)) != PAPI_OK)
            ERROR_RETURN(retval);
    for(int i = 0; i < NUM_THREADS; i ++)
    {
		cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a[i], lda, b[i], ldb, beta, c[i], ldc);
    }
    if ((retval=PAPI_stop(EventSet, values)) != PAPI_OK)
        ERROR_RETURN(retval);

    if ((retval=PAPI_cleanup_eventset(EventSet)) != PAPI_OK)
        ERROR_RETURN(retval);
    if ((retval=PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
        ERROR_RETURN(retval);

    printf("L2_DCM: %lld\t L2_DCA: %lld\t L2 Data Cache Miss Rate: %.2f%% \n", values[2], values[3], 100.0 * values[2] / values[3]);

	PAPI_shutdown();

	for(int i = 0; i < NUM_THREADS; i ++) {
		free(a[i]); free(b[i]); free(c[i]);
	}
	free(a); free(b); free(c);

    return 0;
}

PAPI CUDA: CUDA test build failure

Encountering the following PAPI CUDA test build failure on SLES15 platforms using GCC 12:

cudaOpenMP.o:(.eh_frame+0x227): undefined reference to `__gxx_personality_v0'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:53: cudaOpenMP] Error 1

To prevent this failure, -lstdc++ can be added to CUDALIBS on line 26 of src/components/cuda/tests/Makefile, example:

CUDALIBS = -L$(PAPI_CUDA_ROOT)/lib64 -lcudart -lcuda -lstdc++
  • Note: this same option is applied to CUDALIBS in src/components/nvml/tests/Makefile.

PAPI lmsensors: libsensors.so fails to load

This behaviour is observed on ICL Leconte:

$ module load lm-sensors
$ export PAPI_LMSENSORS_ROOT=$ICL_LM_SENSORS_ROOT
$ ./configure --with-components="lmsensors" && make && make install
$ utils/papi_component_avail

Shows the output

...
Name:   lmsensors               Linux LMsensor statistics
   \-> Disabled: libsensors.so not found.

SDE Created Counter

When working with SDEs, specifically created counters, there appears to be a bug when collecting the short_descr from the description provided to papi_sde_describe_counter(...). The code below reproduces the problem; in its output, short_descr is empty.

Code to reproduce:

#include "papi.h"
#include "sde_lib.h"

#include <stdio.h>
#include <stdlib.h>

int main() {
    int retval, evt_code;
    long long total_iter_cnt = 0; /* type must match PAPI_SDE_long_long below */
    papi_handle_t handle;
    PAPI_event_info_t info;
   
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        printf("Error initializing the PAPI library.\n");
        exit(1);
    }

    /* initialize PAPI SDEs */
    handle = papi_sde_init("SDE");
    /* register created counter */
    retval = papi_sde_register_counter(handle, "TOTAL_ITERATIONS", PAPI_SDE_RO, PAPI_SDE_long_long, &total_iter_cnt);
    if (retval != PAPI_OK) {
        printf("Error code: %d\n", retval);
        exit(1);
    }  
    /* describe created counter */
    retval = papi_sde_describe_counter(handle, "TOTAL_ITERATIONS", "Total iterations.");
    if (retval != PAPI_OK) {
        printf("Error code: %d\n", retval);
        exit(1);
    }  

    /* convert sde event name to code */
    retval = PAPI_event_name_to_code("sde:::SDE::TOTAL_ITERATIONS", &evt_code);
    if (retval != PAPI_OK) {
        printf("Error code: %d\n", retval);
        exit(1);
    }
    printf("Event Code: %d\n", evt_code);

    /* get sde event code info */
    retval = PAPI_get_event_info(evt_code, &info);
    if (retval != PAPI_OK) {
        printf("Error code: %d\n", retval);
        exit(1);
    } 
    
    /* print event description */
    printf("%s\n", info.short_descr);

}
/* actual output */
Event Code: 1073741841

/* desired output */
Event Code: 1073741841
Total iterations.

The desired output will occur if you use info.long_descr instead. However, short_descr has 64 characters available and therefore should contain output as well.

Remove use of Time Stamp Counter (rdtsc)

The Time Stamp Counter was once an excellent high-resolution, low-overhead way for a program to get CPU timing
information. With the advent of multi-core CPUs, systems with multiple CPUs, and hibernating operating systems, the
TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: 
rate of tick and whether all cores (processors) have identical values in their time-keeping registers.

-- https://en.wikipedia.org/wiki/Time_Stamp_Counter

TLS support problem

_papi_hwi_my_thread is a TLS global variable. The variable is currently guarded in both threads.c and threads.h and is declared only if the PAPI configure script can detect TLS support (either by default or as selected by the user through the --with-tls=<keyword> configure option). If TLS support is not detected, the variable is undeclared and the build fails, as reported by static analysis in the following log:

/spack/opt/spack/linux-rocky8-x86_64/gcc-9.5.0/llvm-16.0.6-gweege3utpobgkimb4fej76zih53yc6i/bin/../libexec/ccc-analyzer -fPIC -DPIC -shared -Wl,-soname -Wl,libpapi.so.7.0 -Xlinker "-rpath" -Xlinker "/usr/local/lib" -DPAPI_NO_MEMORY_MANAGEMENT -DSTATIC_PAPI_EVENTS_TABLE  -DUSE_PERFEVENT_RDPMC=1 -DPEINCLUDE=\"libpfm4/include/perfmon/perf_event.h\" -D_REENTRANT -D_GNU_SOURCE -DNO_TLS -Ilibpfm4/include -fvisibility=hidden -I. -g -Wextra  -Wall -DPAPI_NUM_COMP=3 -DOSLOCK=\"linux-lock.h\" -DOSCONTEXT=\"linux-context.h\" -O2 x86_cpuid_info.c papi_libpfm4_events.c papi.c papi_internal.c high-level/papi_hl.c extras.c sw_multiplex.c upper_PAPI_FWRAPPERS.c papi_fwrappers_.c papi_fwrappers__.c threads.c cpus.c linux-memory.c linux-timer.c linux-common.c  papi_preset.c papi_vector.c papi_memory.c components/perf_event/perf_event.c components/perf_event/pe_libpfm4_events.c components/perf_event_uncore/perf_event_uncore.c components/sysdetect/sysdetect.c components/sysdetect/nvidia_gpu.c components/sysdetect/amd_gpu.c components/sysdetect/cpu.c components/sysdetect/cpu_utils.c components/sysdetect/os_cpu_utils.c components/sysdetect/linux_cpu_utils.c components/sysdetect/x86_cpu_utils.c  -o libpapi.so.7.0.1.0 -Bdynamic -Llibpfm4/lib -lpfm 
papi_internal.c: In function ‘_papi_hwi_set_papi_event_code’:
papi_internal.c:124:3: error: ‘_papi_hwi_my_thread’ undeclared (first use in this function); did you mean ‘_papi_hwi_read’?
   _papi_hwi_my_thread->tls_papi_event_code_changed = -1;
   ^~~~~~~~~~~~~~~~~~~
   _papi_hwi_read
papi_internal.c:124:3: note: each undeclared identifier is reported only once for each function it appears in
papi_internal.c: In function ‘_papi_hwi_get_papi_event_code’:
papi_internal.c:138:9: error: ‘_papi_hwi_my_thread’ undeclared (first use in this function); did you mean ‘_papi_hwi_read’?
  return _papi_hwi_my_thread->tls_papi_event_code;
         ^~~~~~~~~~~~~~~~~~~
         _papi_hwi_read
papi_internal.c: In function ‘_papi_hwi_native_to_eventcode’:
papi_internal.c:565:7: error: ‘_papi_hwi_my_thread’ undeclared (first use in this function); did you mean ‘_papi_hwi_read’?
   if (_papi_hwi_my_thread->tls_papi_event_code_changed > 0) {
       ^~~~~~~~~~~~~~~~~~~
       _papi_hwi_read
papi_internal.c: In function ‘_papi_hwi_get_papi_event_code’:
papi_internal.c:139:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^

ROCm: Questioning the `:device=` and `:instance=` order

Event names for ROCm have the form rocm:::<name>:device=<device>[:instance=<instance>]. Here is an example:

--------------------------------------------------------------------------------
| rocm:::TCP_TCP_TA_DATA_STALL_CYCLES:device=0:instance=0                      |
|            TCP stalls TA data interface. Now Windowed.                       |
--------------------------------------------------------------------------------
| rocm:::TCP_TCP_TA_DATA_STALL_CYCLES:device=0:instance=1                      |
|            TCP stalls TA data interface. Now Windowed.                       |
--------------------------------------------------------------------------------

In Score-P we want to make it convenient for the user to simply tell the measurement system which metrics to record on all devices. Hence, the user should configure Score-P with only the metric name (e.g., rocm:::TCP_TCP_TA_DATA_STALL_CYCLES), and Score-P will generate the proper event name depending on which device it is running on. This is important in a multi-GPU/multi-node experiment, where the user does not know in advance which device is allocated on the node. The user is only interested in the name of the metric to record.

But what to do with the instances? As the order is fixed, Score-P would need to split the metric name on :instance= and insert the :device=. This seems very fragile.

Hence I question the usefulness of the current :device= and :instance= ordering and propose changing it to rocm:::<name>[:instance=<instance>]:device=<device>. This way Score-P can always just append :device= and it works.

A different approach would be to know which metric has instances at all and how many and also append the :instance= to the base metric name. But I doubt there is such API in PAPI available for this.

Variation between Fortran files

If I read right, the files under src/testlib are F77 compliant, while the

   src/components/sysdetect/tests/query_device_simple_f.F

is written in F90 format. You special-case this

ifeq ($(notdir $(F77)),gfortran)
FFLAGS +=-ffree-form -ffree-line-length-none
else ifeq ($(notdir $(F77)),flang)
FFLAGS +=-ffree-form
else ifneq ($(findstring $(notdir $(F77)), $(intel_compilers)),)
FFLAGS +=-free
else ifneq ($(findstring $(notdir $(F77)), $(cray_compilers)),)
FFLAGS +=-ffree
endif

in the file

   src/components/sysdetect/tests/Makefile

I'm testing the "nvfortran" compiler, so it's not triggering the corresponding "-Mfree" flag.
And if I set it in general

    export FFLAGS="-noswitcherror -Mfree"

it will try to treat the F77 files as F90 and break the compile.
As I understand the Fortran standard, if you change the name of the latter file to

    src/components/sysdetect/tests/query_device_simple_f.f90

the compilers will automatically treat the file as free-form and you won't have to use the special-casing in the Makefile.

rpmbuild failure for fedora

I am looking into building RPMs of what is currently in the papi repository. However, when I use the new papi-7.0.1.tar.gz from "make dist-targz", the rpmbuild configure step doesn't work because of the following git commit, 0cc7bf1:

Author: Giuseppe Congiu <[email protected]> 2023-09-29 09:33:00
Committer: Giuseppe Congiu <[email protected]> 2023-10-24 08:44:56
Parent: d9ffaf2 (Merge pull request #101 from anustuvicl/2023.10.19-cuda_fix_papi_command_line_segfault)
Child: 8d80048 (sde_lib: do not build with debug symbols by default)
Branches: coverity202311, master, remotes/origin/master
Follows:
Precedes:

configure: do not build with debug symbols by default

Remove -g being added by default in configure.

I end up getting the following error message in config.log for "rpmbuild -ba papi.spec":

configure:3180: checking whether we are cross compiling
configure:3188: gcc -o conftest -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -g -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1 -specs=/usr/lib/rpm/redhat/redhat-package-notes conftest.c >&5
configure:3192: $? = 0
configure:3199: ./conftest
configure:3203: $? = 0
configure:3218: result: no
configure:3223: checking for suffix of object files
configure:3246: gcc -c -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -g conftest.c >&5
configure:3250: $? = 0
configure:3272: result: o

For the time being I have patched that change out so that I can build things on Fedora.

Segmentation fault

Here are my commands:
git clone --depth 1 --branch papi-6-0-0-1-t https://github.com/icl-utk-edu/papi.git
cd papi/src
./configure
make -j16
make test

Then an error appears:
make[1]: Leaving directory '/home/lvhang/papi/src/ctests'
ctests/zero
make: *** [Makefile.inc:227: test] Segmentation fault (core dumped)

Define PAPI presets for Intel Sapphire Rapids

[opened this in the bitbucket repository so just repeating to ensure it exists]

The papi_events.csv file does not define the PAPI presets for the Sapphire Rapids (spr) processor.

[question] Is PAPI OS-dependent?

Does PAPI rely on OS-specific internals? Can I be sure that if PAPI works well on Linux on one machine, it will also work on the same machine under both macOS and Windows?

Environment variables not honored in some Makefiles

There's a handful of CC = lines in Makefiles that should be CC ?= so it may be properly overridden by a user's CC environment variable. Examples:

src/components/host_micpower/utils/Makefile:1:CC = gcc
src/components/powercap/utils/Makefile:1:CC = gcc
src/components/rapl/utils/Makefile:1:CC = gcc
src/examples/Makefile:3:CC = gcc
src/examples/Makefile:17:   CC = xlc
src/examples/Makefile.AIX:3:CC = xlc
src/examples/Makefile.IRIX64:3:CC = gcc
src/examples/Makefile.OSF1:3:CC = gcc
src/libperfnec/config.mk:77:#CC=icc
src/libperfnec/config.mk:103:CC=craynv-cray-linux-gnu-gcc
src/libpfm-3.y/config.mk:164:#CC=icc
src/libpfm-3.y/config.mk:190:CC=craynv-cray-linux-gnu-gcc
src/libpfm4/config.mk:189:#CC=icc
src/libpfm4/config.mk:206:CC=clang
src/perfctr-2.6.x/examples/global/Makefile:5:CC=gcc
src/perfctr-2.6.x/examples/perfex/Makefile:5:CC=gcc
src/perfctr-2.6.x/examples/self/Makefile:5:CC=gcc
src/perfctr-2.6.x/examples/signal/Makefile:5:CC=gcc
src/perfctr-2.6.x/usr.lib/Makefile:5:CC=gcc
src/perfctr-2.7.x/examples/global/Makefile:5:CC=gcc
src/perfctr-2.7.x/examples/perfex/Makefile:5:CC=$(CROSS_COMPILE)gcc
src/perfctr-2.7.x/examples/self/Makefile:5:CC=$(CROSS_COMPILE)gcc
src/perfctr-2.7.x/examples/signal/Makefile:5:CC=$(CROSS_COMPILE)gcc
src/perfctr-2.7.x/usr.lib/Makefile:5:CC=$(CROSS_COMPILE)gcc

In addition, fortran tests are not honoring the FFLAGS environment variable. Adding FFLAGS = @FFLAGS@ to src/components/Makefile_comp_tests.target.in appears to fix this issue.

Somewhat related, these two conditional blocks in src/components/sde/tests/Makefile and src/components/sysdetect/tests/Makefile could be made more robust by adding an else ifeq check for gfortran-12:

ifeq ($(notdir $(F77)),gfortran)
    FFLAGS +=-ffree-form -ffree-line-length-none
else ifeq ($(notdir $(F77)),flang)
    FFLAGS +=-ffree-form
else ifeq ($(findstring $(notdir $(F77)), $(intel_compilers)),)
    FFLAGS +=-free
else ifeq ($(findstring $(notdir $(F77)), $(cray_compilers)),)
    FFLAGS +=-ffree
endif

PAPI_TOT_INS randomly off on AMD EPYC 7352

We test the installation with the usual make test, which runs ctests/zero. That test fails seemingly at random, but very often.

Passing output is:

PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test
Test case 0: start, stop.
-----------------------------------------------
Default domain is: 1 (PAPI_DOM_USER)
Default granularity is: 1 (PAPI_GRN_THR)
Using 200 iterations 1 million instructions
-------------------------------------------------------------------------
Test type    : 	           1
PAPI_TOT_CYC : 	    140513836
PAPI_TOT_INS : 	    200001620
IPC          : 	         1.42
Real usec    : 	       359225
Real cycles  : 	    826204390
Virt usec    : 	       352658
Virt cycles  : 	    811113400
-------------------------------------------------------------------------
Verification: PAPI_TOT_INS should be roughly 200000000
PASSED

Failing output:

PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test
Test case 0: start, stop.
-----------------------------------------------
Default domain is: 1 (PAPI_DOM_USER)
Default granularity is: 1 (PAPI_GRN_THR)
Using 200 iterations 1 million instructions
-------------------------------------------------------------------------
Test type    : 	           1
PAPI_TOT_CYC : 	    140556202
PAPI_TOT_INS : 	   3200025872
IPC          : 	        22.77
Real usec    : 	       359115
Real cycles  : 	    825951988
Virt usec    : 	       352894
Virt cycles  : 	    811653900
-------------------------------------------------------------------------
Verification: PAPI_TOT_INS should be roughly 200000000
PAPI_TOT_INS Error of 1500.01%
FAILED!!!
Line # 161 Error: Instruction validation

This is with PAPI 7.1.0 on an "AMD EPYC 7352 24-Core Processor" system (4 CPUs).

It looks like it randomly picks up an additional 3000000000 instructions, which looks like an error because the remainder makes sense.

Any ideas?
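
The 1500.01% figure in the failing output follows from the two instruction counts shown above. A minimal sketch of that arithmetic, assuming the check is a plain relative error against the expected 200000000 instructions (the helper names are hypothetical and not taken from ctests/zero):

```c
#include <stdio.h>

/* Hypothetical helper: relative deviation of a measured count from the
 * expected count, in percent. */
static double relative_error_percent(long long measured, long long expected)
{
    return 100.0 * (double)(measured - expected) / (double)expected;
}

/* Print the error for one run against the expected value from the log. */
static void report_run(const char *label, long long measured)
{
    const long long expected = 200000000LL;
    printf("%s: %.2f%%\n", label, relative_error_percent(measured, expected));
}
```

Calling report_run("failing", 3200025872LL) reproduces the 1500.01% from the log, while the passing count of 200001620 deviates by well under 0.01%.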

PAPI might use a CUDA version that is not the one intended by the application

The PAPI CUDA component provides users with access to hardware performance counters on NVIDIA GPUs through the PAPI interface. The component is enabled by configuring PAPI with the --with-components=cuda option. In order to properly build the CUDA component, the PAPI build system also needs the path to the CUDA root directory containing the headers. These are located through the PAPI_CUDA_ROOT environment variable. At runtime, the CUDA component uses the same environment variable to find the CUDA libraries to dlopen. However, this can cause a problem if the system administrator sets the PAPI_CUDA_ROOT variable to a specific CUDA toolkit version while the user program is linked against a different version of the CUDA toolkit.

The solution to the problem could be that at PAPI_library_init time (or whenever the CUDA component loads the CUDA symbols), the CUDA component detects (e.g. using dl_iterate_phdr) the user-linked version of the CUDA library and, if possible (meaning, compatible with the version of the CUDA headers used for building PAPI), dlopens the same version.
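
The detection step suggested above could look roughly like the following. This is only a sketch under stated assumptions: it uses the glibc dl_iterate_phdr() interface, the library name prefix to search for (for example "libcudart.so") is an assumption, the struct and helper names are hypothetical, and the version-compatibility check against the build-time headers is not covered.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical search state: library name prefix to look for and the
 * path of the first loaded object that matches. */
struct lib_search {
    const char *prefix;      /* e.g. "libcudart.so" (assumed name) */
    char path[4096];         /* filled in on a match */
    int found;
};

/* Return 1 if the file-name part of `path` starts with `prefix`. */
static int basename_has_prefix(const char *path, const char *prefix)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    return strncmp(base, prefix, strlen(prefix)) == 0;
}

/* Callback invoked by dl_iterate_phdr() once per loaded object. */
static int find_lib_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct lib_search *s = data;
    (void)size;
    if (info->dlpi_name && basename_has_prefix(info->dlpi_name, s->prefix)) {
        snprintf(s->path, sizeof(s->path), "%s", info->dlpi_name);
        s->found = 1;
        return 1;            /* non-zero stops the iteration */
    }
    return 0;
}

/* Returns 1 and fills s->path if an object matching s->prefix is loaded. */
static int find_loaded_lib(struct lib_search *s)
{
    s->found = 0;
    s->path[0] = '\0';
    dl_iterate_phdr(find_lib_cb, s);
    return s->found;
}
```

If find_loaded_lib() reports a match, the path it returns is a candidate for dlopen instead of the one derived from PAPI_CUDA_ROOT. Note that the callback copies dlpi_name rather than modifying it.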

/usr/share/papi/components/pcp/tests/testPCP FAILED

When reviewing the papi test results, I noticed that the testPCP test failed even though PCP is set up on the machine. I can see a large number of PCP events listed by "pminfo". However, when running testPCP on Fedora rawhide (papi-7.0.1), I see the following:

$ /usr/share/papi/components/pcp/tests/testPCP 
PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test
PAPI_Library_init run time =    5301.0 uS
Testing PCP Component with PAPI 7.0.1
Found PCP Component at id 11
PAPI_enum_cmp_event returned -17 [Component Index isn't set].
FAILED!!!
Line # 108 Error in PAPI_enum_cmp_event failed.
: Component Index isn't set
Some tests require special hardware, permissions, OS, compilers
or library versions. PAPI may still function perfectly on your 
system without the particular feature being tested here.   

I took a look at an earlier version of papi, 6.0.0, and see the same result. Are others able to successfully run testPCP? Is there something I need to change to run it successfully?

Fixing RAPL for AMD Milan CPU

"_rapl_init_component (cidx=) at components/rapl/linux-rapl.c:506
506 strCpy=strncpy(_rapl_vector.cmp_info.disabled_reason,
(gdb) list
501 msr_pkg_energy_status=MSR_AMD_PKG_ENERGY_STATUS;
502 msr_pp0_energy_status=MSR_AMD_PP0_ENERGY_STATUS;
503
504 if (hw_info->cpuid_family!=0x17) {
505 /* Not a family 17h machine */
506 strCpy=strncpy(_rapl_vector.cmp_info.disabled_reason,
507 "CPU family not supported",PAPI_MAX_STR_LEN);
508 _rapl_vector.cmp_info.disabled_reason[PAPI_MAX_STR_LEN-1]=0;
509 if (strCpy == NULL) HANDLE_STRING_ERROR;
510 retval = PAPI_ENOIMPL;"

The 0x17 check on line 504 does not work for Milan. After disabling this check, RAPL worked.
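
Rather than removing the check entirely, the family test could accept more than one family. The sketch below assumes that Milan (Zen 3) reports CPUID family 0x19 and that the AMD energy MSRs used for family 17h apply there as well; the latter is suggested only by the report that RAPL worked once the check was disabled. The helper name is hypothetical and not part of linux-rapl.c.

```c
/* AMD CPUID family values. 0x17 is the value checked in the quoted code;
 * 0x19 is the family reported by Zen 3 parts such as EPYC Milan. */
#define AMD_FAMILY_17H 0x17
#define AMD_FAMILY_19H 0x19

/* Hypothetical helper: return 1 if the RAPL component should attempt to
 * use the AMD energy MSRs on this CPU family, 0 otherwise. */
static int amd_rapl_family_supported(int cpuid_family)
{
    switch (cpuid_family) {
    case AMD_FAMILY_17H:
    case AMD_FAMILY_19H:   /* assumption: same MSR layout as family 17h */
        return 1;
    default:
        return 0;
    }
}
```

With such a helper, the condition on line 504 would become a call like !amd_rapl_family_supported(hw_info->cpuid_family), and any other family would still be rejected.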

Problem using libsupc++ in intel_gpu component

I'm working on supporting use of the intel_gpu component in the papi spack package. Depending on the compiler that I use, I may or may not get an error linking with libsupc++. In particular, this happens on Red Hat OSes with the OS-installed compiler because the package containing libsupc++.a isn't included by default and is in a non-default repository. Is it possible to remove this dependency? I removed references to it, and the build seemed to work with papi_component_avail also working.

The papi-7.1.0.tar.gz download isn't gzip'd

The tarball here

    https://icl.utk.edu/projects/papi/downloads/papi-7.1.0.tar.gz

posted on this page

    https://icl.utk.edu/papi/

has not been gzip'd

    --> gunzip papi-7.1.0.tar.gz
    gzip: papi-7.1.0.tar.gz: not in gzip format

It's a plain tar file. I noticed this when my install script broke.

sysdetect tests and fortran arm compiler

Trying to build papi with armflang, I get to:

make[1]: Entering directory `/home/ec2-user/tmp/1698778753/papi/src/components/sysdetect/tests'
armclang -DUSE_PTHREAD_MUTEXES -DPAPI_NUM_COMP=3 -O2 -I. -I../../.. -I../../../testlib -I../../../validation_tests -I/bm/ashterenli/install/u_arm/23.10/papi-ea3cc0041/include -c -o query_device_simple.o query_device_simple.c
armclang -DUSE_PTHREAD_MUTEXES -DPAPI_NUM_COMP=3 -I. -I../../.. -I../../../testlib -I../../../validation_tests -I/bm/ashterenli/install/u_arm/23.10/papi-ea3cc0041/include -o query_device_simple query_device_simple.o ../../../testlib/libtestlib.a ../../../libpapi.a -ldl
armflang -I../../.. -o query_device_simple_f query_device_simple_f.F ../../../libpapi.a -ldl
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 14)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 15)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 16)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 17)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 18)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 19)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 20)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 21)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 22)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 23)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 24)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 25)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 26)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 27)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 28)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 29)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 30)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 31)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 32)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 33)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 34)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 35)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 36)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 37)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 38)
F90-F-0008-Error limit exceeded (../../../f90papi.h: 3)
F90/aarch64 Linux FlangArm F90 - 1.5 2017-05-01: compilation aborted
make[1]: *** [query_device_simple_f] Error 1
make[1]: Leaving directory `/home/ec2-user/tmp/1698778753/papi/src/components/sysdetect/tests'

Seems components/sysdetect/tests/Makefile needs to be updated.

In contrast, the flags seem correct in ftests:

make[1]: Entering directory `/home/ec2-user/tmp/1698778753/papi/src/ftests'
armflang -I../testlib -I. -I.. -DUSE_PTHREAD_MUTEXES -DPAPI_NUM_COMP=3 -ffixed-line-length-132 -O1 strtest.F ../testlib/libtestlib.a ../libpapi.a -ldl -o strtest

PAPI ROCm: Missed Reads Intercept Mode

I am finding evidence that PAPI ROCm PAPI_read operations miss results when executed in intercept mode. The following sample workflows highlight what I'm seeing:

  • Sample 1:
              PAPI_start
              PAPI_read              <-  Counter values zero as we’d expect
              <kernel launch>
              PAPI_read              <-  Counter values still zero
              PAPI_stop              <-  Meaningful counter values collected
  • Sample 2:
              PAPI_start
              PAPI_read              <-  Counter values zero as we’d expect
              <kernel launch A>
              PAPI_read              <-  Counter values still zero
              PAPI_read              <-  Counter values from <kernel launch A> collected
              <kernel launch B>
              PAPI_read              <-  No changes in values, still reports values from <kernel launch A>
              PAPI_read              <-  Counter values from <kernel launch A> + <kernel launch B> collected
              <kernel launch C>
              PAPI_read              <-  No changes in values, still reports values from <kernel launch A> + <kernel launch B>
              PAPI_stop              <-  Counter values from <kernel launch A> + <kernel launch B> + <kernel launch C> collected

Consulting with @gcongiu, this may be expected behavior as:

In intercept mode, PAPI_read(s) that happen before a kernel has finished running and/or before rocprofiler has fetched the kernel counters return whatever value was present until that point in the eventset counters (the component does not synchronize the GPU stream internally like old cuda component used to do). Otherwise, it reads the new counters (get_context_counters).

In your example above the behavior looks consistent with the ROCm component's code. If you wish to read counters for a kernel, in intercept mode, you should synchronize the stream first to make sure the kernel has finished running and the counters are collected.

However, it does not look like synchronizing the streams alone is enough to prevent the undesirable behavior, as I’m still detecting the issue after calling hipDeviceSynchronize before & after each read (after is overkill, but I wanted to be sure). Additionally, I am finding that pairing the device/stream synchronization with any of the following does not help either:

  • Stopping/re-starting the queues with each read
  • Flushing the context pools prior to checking the dispatch queue results
  • Changing the context pools' properties/configurations
    • E.g. smaller number of entries, alternate profiling modes, etc.

If possible, it would be ideal if we could find a means of enforcing a synchronization such that the counters could be resolved with each PAPI_read.

Updated PAPI CUDA component returns default for number of counters

Noticed that the updated PAPI CUDA component is returning the default number of counters for <papi_vector>.cmp_info.num_cntrs, specifically PAPI_CUDA_MAX_COUNTERS (30).

Example papi_component_avail behavior:

  • New component:
Name:   cuda                    CUDA profiling via NVIDIA CuPTI interfaces
                                Native: 1052928, Preset: 0, Counters: 30
  • Old component:
Name:   cuda                    CUDA events and metrics via NVIDIA CuPTI interfaces
                                Native: 1052928, Preset: 0, Counters: 1052928

Using the old component as a guide, I think this could be fixed by capturing the counter count alongside the native counter count.

  • Old component:
int _cuda_init_private(void)
....
    /* Export some information */
    _cuda_vector.cmp_info.num_native_events = global_cuda_context->availEventSize;
    _cuda_vector.cmp_info.num_cntrs = _cuda_vector.cmp_info.num_native_events;
....
  • New component:
static int cuda_ntv_enum_events(unsigned int *event_code, int modifier)
...
    _cuda_vector.cmp_info.num_native_events = global_event_names->count;
    _cuda_vector.cmp_info.num_cntrs = global_event_names->count;     /* <-- what needs to be added */
....

GCC 12.2.0 issues some compilation warnings

Using gcc 12.2.1 the following compilation issues occur when building PAPI:

gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
In file included from components/rocm_smi/rocs.c:7:
components/rocm_smi/rocs.c: In function 'rocs_init':
./papi_memory.h:23:24: warning: 'ntv_events' may be used uninitialized in this function [-Wmaybe-uninitialized]
 #define papi_free(a)   free(a)
                        ^~~~
components/rocm_smi/rocs.c:916:18: note: 'ntv_events' was declared here
     ntv_event_t *ntv_events = papi_calloc(ntv_events_count, sizeof(*ntv_events));
                  ^~~~~~~~~~
papi_component_avail.c: In function 'main':
papi_component_avail.c:144:48: warning: format '%d' expects a matching 'int' argument [-Wformat=]
    printf( "        %-23s Native: %d, Preset: %d, Counters: ", " ", cmpinfo->num_native_events);
                                               ~^
gfortran -I../testlib -I. -I.. -D_CRAY -O2 -fno-omit-frame-pointer -DPE_SAMPLE_ADDR_EXTENSION -DASSUME_KERNEL=\"4.0.0\" -g -Wextra  -Wall -DHAVE_CUDA -DHAVE_NVML -DHAVE_ROCM -DHAVE_ROCM_SMI -DPAPI_NUM_COMP=12 -ffixed-line-length-132 -O1 fmultiplex2.F ../testlib/libtestlib.a ../testlib/do_loops.o ../libpapi.a -o fmultiplex2 -ldl
fmultiplex2.F:85:0:

          do while ((avail_flag.EQ.0).AND.
 
Warning: 'papi_max_preset_events' is used uninitialized in this function [-Wuninitialized]
gfortran -I../testlib -I. -I.. -D_CRAY -O2 -fno-omit-frame-pointer -DPE_SAMPLE_ADDR_EXTENSION -DASSUME_KERNEL=\"4.0.0\" -g -Wextra  -Wall -DHAVE_CUDA -DHAVE_NVML -DHAVE_ROCM -DHAVE_ROCM_SMI -DPAPI_NUM_COMP=12 -ffixed-line-length-132 -O1 avail.F  ../testlib/libtestlib.a ../libpapi.a -ldl -o avail
avail.F:55:0:

       do i=0, PAPI_MAX_PRESET_EVENTS-1
 
Warning: 'papi_max_preset_events' is used uninitialized in this function [-Wuninitialized]

There may be more depending on which components are configured.
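
The -Wformat warning for papi_component_avail.c indicates that the format string has two %d conversions but only one int argument. A minimal sketch of a matched call is shown below. Which value the second %d was meant to print is not stated in the log, so `num_preset` is a placeholder, and the helper itself is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper mirroring the flagged printf: each %d conversion
 * has a matching int argument. Returns the snprintf() result. */
static int format_counts(char *buf, size_t len, int num_native, int num_preset)
{
    return snprintf(buf, len,
                    "        %-23s Native: %d, Preset: %d, Counters: ",
                    " ", num_native, num_preset);
}
```

Writing into a buffer instead of calling printf directly is only done here so the result can be inspected; the point is that the argument list matches the conversions.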

PAPI CUDA: Dangerous `dl_iterate_phdr_cb` operation

The PAPI CUDA function src/components/cuda/cupti_common.c::dl_iterate_phdr_cb performs a dangerous operation on the resolved info->dlpi_name: the call to dirname() may modify the contents of dlpi_name.

To prevent undesired behavior, the function should be updated to make use of a temporary buffer/copy instead of operating on the dl_phdr_info pointer.

Reference: https://linux.die.net/man/3/dirname

Both dirname() and basename() may modify the contents of path, so it may be desirable to pass a copy when calling one of these functions.
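
The temporary-copy approach could be sketched as follows. The helper name and the fixed 4096-byte scratch buffer are assumptions made for illustration; this is not the component's actual code.

```c
#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the directory part of `path` into `out`
 * without letting dirname() touch the caller's string.
 * Returns 0 on success, -1 if `path` or the result does not fit. */
static int dirname_copy(const char *path, char *out, size_t out_len)
{
    char tmp[4096];
    size_t n = strlen(path);

    if (n >= sizeof(tmp))
        return -1;
    memcpy(tmp, path, n + 1);          /* private, writable copy */

    const char *dir = dirname(tmp);    /* may modify tmp, never `path` */
    if (strlen(dir) >= out_len)
        return -1;
    strcpy(out, dir);                  /* copy out before tmp goes away */
    return 0;
}
```

In the callback, info->dlpi_name would be passed as `path`, so the string owned by the dynamic loader is only ever read.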

PAPI CUDA: `cupti_profiler.c::cuptip_event_enum` bug

The current implementation of cuda/cupti_profiler.c::cuptip_event_enum will incorrectly assign event codes when called with a non-empty global_event_names table. The culprit is the logic that offsets the event code by the current global event count.

A reproducer highlighting the issue will be attached shortly. Potential fix to prevent undesired behavior:

@@ -1245,13 +1245,13 @@ int cuptip_event_enum(cuptiu_event_table_t *all_evt_names)
                 goto fn_exit;
             }
             if (cuptiu_event_table_find_name(all_evt_names, evt_name, &find) == PAPI_ENOEVNT) {
-                papi_errno = cuptiu_event_table_insert_record(all_evt_names, evt_name, curr + i, 0);
+                papi_errno = cuptiu_event_table_insert_record(all_evt_names, evt_name, curr, 0);
                 if (papi_errno != PAPI_OK) {
                     goto fn_exit;
                 }
+                curr++;
             }
         }
-        curr += i;
     }
 fn_exit:
     return papi_errno;
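
The idea behind the patch, advancing the next code only when a record is actually inserted, can be illustrated with a toy table. Everything below is a hypothetical stand-in: the struct, the function names, and the assumption that codes continue from the current table count are not taken from the component.

```c
#include <string.h>

#define MAX_EVENTS 64

/* Toy stand-in for the event table; not the component's real type. */
struct toy_table {
    const char *names[MAX_EVENTS];
    unsigned int codes[MAX_EVENTS];
    unsigned int count;
};

/* Return the index of `name`, or -1 if it is not in the table. */
static int toy_find(const struct toy_table *t, const char *name)
{
    for (unsigned int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return (int)i;
    return -1;
}

/* Insert every name not already present. The next code (`curr`) advances
 * only when a record is inserted, as in the proposed patch, so codes stay
 * contiguous even if the table is non-empty or names repeat. */
static int toy_enum(struct toy_table *t, const char **names, unsigned int n)
{
    unsigned int curr = t->count;   /* assumption: codes continue from count */
    for (unsigned int i = 0; i < n; i++) {
        if (toy_find(t, names[i]) >= 0)
            continue;               /* already known: no code consumed */
        if (t->count == MAX_EVENTS)
            return -1;
        t->names[t->count] = names[i];
        t->codes[t->count] = curr;
        t->count++;
        curr++;
    }
    return 0;
}
```

Calling toy_enum() twice with overlapping name lists leaves codes 0, 1, 2, ... with no gaps, whereas using the loop index as the offset would skip codes for names that were already present.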

Error building component tests for sysdetect,sde with Cray compiler

As seen here: https://gitlab.spack.io/spack/spack/-/jobs/7405712

And discussed here: spack/spack#38443 (comment)

When building with the Cray Fortran compiler, the error is:

/home/gitlab-runner-1/builds/Q2MuKA89/0/spack/spack/lib/spack/env/cce/ftn -free -I../../.. -o query_device_simple_f query_device_simple_f.F ../../../libpapi.a 
ftn-78 ftn: ERROR in command line
  The -f option has an invalid argument, "ree".

The Makefile sysdetect/tests/Makefile falls back to using "FFLAGS = -free", which is not a valid command-line option for the Cray Fortran compiler. This is also present in the sde component tests.

problem to make

OS: CentOS 7
PAPI: 7.1.0

Context:

(work) [user@localhost src]$ ./configure --prefix=/home/user/Downloads/tar_hub/papi/src/install
checking for architecture... x86_64
checking for OS... linux
checking for OS version... 3.10.0-1160.66.1.el7.x86_64
checking for perf_event workaround level... autodetect
checking for if MIC should be used... no
checking for if NEC should be used... no
checking for xlc... no
checking for icc... no
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for xlf... no
checking for ifort... no
checking for gfortran... gfortran
checking whether we are using the GNU Fortran 77 compiler... yes
checking whether gfortran accepts -g... yes
checking for mpicc... mpicc
checking for gawk... gawk
checking how to run the C preprocessor... gcc -E
checking whether ln -s works... yes
checking whether make sets $(MAKE)... yes
checking for ranlib... ranlib
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking minix/config.h usability... no
checking minix/config.h presence... no
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking for ANSI C header files... (cached) yes
checking for inline... inline
checking whether time.h and sys/time.h may both be included... yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking c_asm.h usability... no
checking c_asm.h presence... no
checking for c_asm.h... no
checking intrinsics.h usability... no
checking intrinsics.h presence... no
checking for intrinsics.h... no
checking mach/mach_time.h usability... no
checking mach/mach_time.h presence... no
checking for mach/mach_time.h... no
checking sched.h usability... yes
checking sched.h presence... yes
checking for sched.h... yes
checking for gethrtime... no
checking for read_real_time... no
checking for time_base_to_time... no
checking for clock_gettime... no
checking for mach_absolute_time... no
checking for sched_getcpu... yes
checking for timer_create and timer_*ettime symbols in base system... not found
checking for timer_create and timer_*ettime symbols in -lrt... found
checking for dlopen and dlerror symbols in base system... not found
checking for dlopen and dlerror symbols in -ldl... found
checking for tests... ctests ftests  mpitests
checking for debug build... 
checking for -Wno-override-init... 1
checking for CPU type... x86
checking for ffsll... yes
checking for working gettid... no
checking for working syscall(SYS_gettid)... yes
checking for working MMTIMER... no
checking for working CLOCK_REALTIME_HR POSIX 1b timer... no
checking for working CLOCK_REALTIME POSIX 1b timer... yes
checking for which real time clock to use... clock_realtime
checking for working __thread... yes
checking for high performance thread local storage... __thread
checking for working CLOCK_THREAD_CPUTIME_ID POSIX 1b timer... yes
checking for which virtual timer to use... clock_thread_cputime_id
checking for static user preset events... no
checking for static PAPI preset events... yes
checking for whether to build static library... yes
checking for whether to build shared library... yes
checking for static compile of tests and utilities... no
checking for linking with papi shared library of tests and utilities... no
checking for building libsde... shared and static
checking for /sys/class/perfctr... no
checking for /dev/perfctr... no
checking for /sys/kernel/perfmon/version... no
checking for /proc/perfmon... no
checking for /proc/sys/kernel/perf_event_paranoid... yes
checking for libpfm4/include/perfmon/perf_event.h... yes
checking platform... linux-pe
checking for components to build... perf_event perf_event_uncore sysdetect
checking for PAPI event CSV filename to use... papi_events.csv
configure: Rules.pfm4_pe will be included in the generated Makefile
configure: creating ./config.status
config.status: creating Makefile
config.status: creating papi.pc
config.status: creating components/Makefile_comp_tests.target
config.status: creating testlib/Makefile.target
config.status: creating utils/Makefile.target
config.status: creating ctests/Makefile.target
config.status: creating ftests/Makefile.target
config.status: creating validation_tests/Makefile.target
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing genpapifdef commands

problem:

In file included from components/perf_event/perf_event.c:56:
components/perf_event/perf_helpers.h:43:9: error: unknown type name '__u32'
 typedef __u32 u32;
         ^~~~~
components/perf_event/perf_helpers.h:44:9: error: unknown type name '__s32'
 typedef __s32 s32;
         ^~~~~
components/perf_event/perf_helpers.h:46:9: error: unknown type name '__u16'
 typedef __u16 u16;
         ^~~~~
components/perf_event/perf_helpers.h:47:9: error: unknown type name '__s16'
 typedef __s16 s16;
         ^~~~~
components/perf_event/perf_helpers.h:49:9: error: unknown type name '__u8'
 typedef __u8  u8;
         ^~~~
components/perf_event/perf_helpers.h:50:9: error: unknown type name '__s8'
 typedef __s8  s8;
         ^~~~
make[1]: *** [libpapi.so.7.1.0.0] Error 1
make[1]: Leaving directory `/home/user/Downloads/tar_hub/papi/src'
make: *** [libpfm4/lib/libpfm.a] Error 2

armflang build errors

Trying to build papi with armflang, I get to:

make[1]: Entering directory `/home/ec2-user/tmp/1698778753/papi/src/components/sysdetect/tests'
armclang -DUSE_PTHREAD_MUTEXES  -DPAPI_NUM_COMP=3 -O2 -I. -I../../.. -I../../../testlib -I../../../validation_tests -I/bm/ashterenli/install/u_arm/23.10/papi-ea3cc0041/include -c -o query_device_simple.o query_device_simple.c
armclang -DUSE_PTHREAD_MUTEXES  -DPAPI_NUM_COMP=3 -I. -I../../.. -I../../../testlib -I../../../validation_tests -I/bm/ashterenli/install/u_arm/23.10/papi-ea3cc0041/include -o query_device_simple query_device_simple.o ../../../testlib/libtestlib.a ../../../libpapi.a -ldl
armflang  -I../../.. -o query_device_simple_f query_device_simple_f.F ../../../libpapi.a -ldl
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 14)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 15)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 16)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 17)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 18)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 19)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 20)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 21)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 22)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 23)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 24)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 25)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 26)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 27)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 28)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 29)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 30)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 31)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 32)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 33)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 34)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 35)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 36)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 37)
F90-S-0021-Label field of continuation line is not blank (../../../f90papi.h: 38)
F90-F-0008-Error limit exceeded (../../../f90papi.h: 3)
F90/aarch64 Linux FlangArm F90  - 1.5 2017-05-01: compilation aborted
make[1]: *** [query_device_simple_f] Error 1
make[1]: Leaving directory `/home/ec2-user/tmp/1698778753/papi/src/components/sysdetect/tests'

Seems components/sysdetect/tests/Makefile needs to be updated.

In contrast, the flags seem correct in ftests:

make[1]: Entering directory `/home/ec2-user/tmp/1698778753/papi/src/ftests'
armflang -I../testlib -I. -I.. -DUSE_PTHREAD_MUTEXES  -DPAPI_NUM_COMP=3 -ffixed-line-length-132 -O1 strtest.F ../testlib/libtestlib.a ../libpapi.a -ldl -o strtest

cuda component overloads field for utility

Pull request 64 overloads the num_cntrs field of PAPI_component_info_t to inform the cuda_component_avail utility that the cuda component does not support max counters.

This implementation introduces a potential risk of memory problems in PAPI_write and other places. This issue tracks the problem and the need for a different implementation of this feature.

How to add preset events for RaptorLake/AlderLake

I previously commented on #131 but it's a closed issue so I thought it would be worth opening a new one.

I have a RaptorLake CPU and am willing to contribute preset events. If anyone can point me in the right direction, it would be appreciated.

papi_component_avail also shows:

Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
-> Disabled: No uncore PMUs or events found

But I believe this might be a libpfm4 support issue.

Thank you

PAPI_enum_event Incorrect Modifier Behavior for Preset Events

PAPI_enum_event(int *EventCode, int modifier) behaves incorrectly when given a preset modifier. Below is a list of the available modifiers for PAPI_enum_event and the behavior observed with each.

  • PAPI_PRESET_ENUM_AVAIL - Enumerates a preset event that is not actually available
  • PAPI_PRESET_ENUM_MSC - Enumerates events that are not miscellaneous and could have been classified under another modifier, e.g. PAPI_BR_TKN or PAPI_TOT_INS
  • PAPI_PRESET_ENUM_INS - Enumerates events that are not instruction related preset events
  • PAPI_PRESET_ENUM_BR - Enumerates events that are not branch related preset events
  • PAPI_PRESET_ENUM_CND - Enumerates events that are not conditional preset events
  • PAPI_PRESET_ENUM_MEM - Enumerates all preset events, rather than just memory related events such as PAPI_MEM_SCY or PAPI_MEM_RCY
  • PAPI_PRESET_ENUM_CACH - Enumerates all preset events, rather than just cache related events such as PAPI_L1_DCM or PAPI_L2_ICM
  • PAPI_PRESET_ENUM_L1 - Enumerates events that are not L1 cache related preset events
  • PAPI_PRESET_ENUM_L2 - Enumerates events that are not L2 cache related preset events
  • PAPI_PRESET_ENUM_L3 - Enumerates all preset events, rather than just L3 cache related events such as PAPI_L3_TCA or PAPI_L3_ICR
  • PAPI_PRESET_ENUM_TLB - Enumerates all preset events, rather than just TLB related events such as PAPI_TLB_SD or PAPI_TLB_IM
  • PAPI_PRESET_ENUM_FP - Enumerates events that are not floating point related preset events

Code to reproduce:

#include "papi.h"

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int EventCode, retval, modifier;
    char EventCodeStr[PAPI_MAX_STR_LEN];

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        printf("PAPI library is not initialized properly.\n");
        exit(1);
    }

    /* start from the first preset event */
    EventCode = 0 | PAPI_PRESET_MASK;
    modifier = PAPI_PRESET_ENUM_L1;
    do {
        /* translate the integer code to a string */
        retval = PAPI_event_code_to_name(EventCode, EventCodeStr);
        /* print each preset event enumerated for the modifier if PAPI_OK */
        if (retval == PAPI_OK)
            printf("Name: %s\nCode: %x\n", EventCodeStr, EventCode);
    } while (PAPI_enum_event(&EventCode, modifier) == PAPI_OK);

    return 0;
}

`PAPI_event_code_to_name` potential memory risk

For native events, PAPI_event_code_to_name(int EventCode, char *out) calls _papi_hwi_native_code_to_name( EventCode, out, PAPI_MAX_STR_LEN). Thus, it assumes that an event name is at most 128 characters long. There are several potential problems with this:

  • CUDA Perfworks has event names of up to 133 characters, so calling this function returns PAPI_ENOMEM, which is not documented behavior. These long cuda events are all multipass (and cannot be measured), but the PAPI event name<->code API should work for them too.
  • If the user defines a string as char evt_name[64] and calls PAPI_event_code_to_name(EventCode, evt_name), the call can overflow the buffer. The example in the docs shows char EventCodeStr[PAPI_MAX_STR_LEN], but this is not mentioned as a requirement.

The PAPI API needs to account for possibly longer event names and properly truncate at the user's string length. The function should be defined as PAPI_event_code_to_name(int EventCode, char *out, int len).

Alternatively, we can have a function that returns a pointer to a dynamically allocated string that the user needs to free.
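
The truncating behavior of the length-aware variant proposed above might look like the sketch below. It only illustrates the copy step; `copy_event_name` and its return convention are hypothetical, and the lookup from event code to name is left out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical length-aware copy: writes at most len-1 characters of
 * `name` plus a terminating NUL into `out`.
 * Returns 0 if the name fit, 1 if it was truncated, -1 on bad arguments. */
static int copy_event_name(const char *name, char *out, int len)
{
    if (name == NULL || out == NULL || len <= 0)
        return -1;

    /* snprintf() never writes more than len bytes and reports how many
     * characters the full name would have needed. */
    int needed = snprintf(out, (size_t)len, "%s", name);
    if (needed < 0)
        return -1;
    return needed >= len ? 1 : 0;
}
```

A caller with char evt_name[64] would pass sizeof(evt_name) as `len` and could tell from the return value whether the name was cut short, instead of overflowing the buffer.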

large number of unsupported papi counters on 13th Gen Intel Core i7-13800H

I want to use papi on my laptop for profiling, but I can only see a small number of papi counters (and to get that I had to install the latest version, v7).

Initially I tried v6, but it does not work for me at all. I moved to version 7, and here is what I get:

$ ./papi_avail -a
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 6.2.0-37-generic
Vendor string and code   : GenuineIntel (1, 0x1)
Model string and code    : 13th Gen Intel(R) Core(TM) i7-13800H (186, 0xba)
CPU revision             : 2.000000
CPUID                    : Family/Model/Stepping 6/186/2, 0x06/0xba/0x02
CPU Max MHz              : 5000
CPU Min MHz              : 400
Total cores              : 14
SMT threads per core     : 1
Cores per socket         : 14
Sockets                  : 1
Cores per NUMA region    : 14
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 0
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

================================================================================
  PAPI Preset Events
================================================================================
    Name        Code    Deriv Description (Note)

--------------------------------------------------------------------------------
Of 0 available events, 0 are derived.

No events detected!  Check papi_component_avail to find out why.

If I run papi_component_avail I get:

$ ./papi_component_avail
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 6.2.0-37-generic
Vendor string and code   : GenuineIntel (1, 0x1)
Model string and code    : 13th Gen Intel(R) Core(TM) i7-13800H (186, 0xba)
CPU revision             : 2.000000
CPUID                    : Family/Model/Stepping 6/186/2, 0x06/0xba/0x02
CPU Max MHz              : 5000
CPU Min MHz              : 400
Total cores              : 14
SMT threads per core     : 1
Cores per socket         : 14
Sockets                  : 1
Cores per NUMA region    : 14
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 0
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
   \-> Disabled: Error libpfm4 no default PMU found
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
Name:   sysdetect               System info detection component

Active components:
Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0

Finally, if I set LIBPFM_FORCE_PMU:

$ LIBPFM_FORCE_PMU=amd64 ./papi_component_avail 
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 6.2.0-37-generic
Vendor string and code   : GenuineIntel (1, 0x1)
Model string and code    : 13th Gen Intel(R) Core(TM) i7-13800H (186, 0xba)
CPU revision             : 2.000000
CPUID                    : Family/Model/Stepping 6/186/2, 0x06/0xba/0x02
CPU Max MHz              : 5000
CPU Min MHz              : 400
Total cores              : 14
SMT threads per core     : 1
Cores per socket         : 14
Sockets                  : 1
Cores per NUMA region    : 14
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 4
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
Name:   sysdetect               System info detection component

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 24, Preset: 18, Counters: 4
                                PMUs supported: amd64_k7

Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0

Now running papi_avail, I get:

$ LIBPFM_FORCE_PMU=amd64 ./papi_avail -a
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 6.2.0-37-generic
Vendor string and code   : GenuineIntel (1, 0x1)
Model string and code    : 13th Gen Intel(R) Core(TM) i7-13800H (186, 0xba)
CPU revision             : 2.000000
CPUID                    : Family/Model/Stepping 6/186/2, 0x06/0xba/0x02
CPU Max MHz              : 5000
CPU Min MHz              : 400
Total cores              : 14
SMT threads per core     : 1
Cores per socket         : 14
Sockets                  : 1
Cores per NUMA region    : 14
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 4
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

================================================================================
  PAPI Preset Events
================================================================================
    Name        Code    Deriv Description (Note)
PAPI_L1_DCM  0x80000000  No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  No   Level 1 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes  Level 1 cache misses
PAPI_TLB_DM  0x80000014  No   Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  No   Instruction translation lookaside buffer misses
PAPI_TLB_TL  0x80000016  Yes  Total translation lookaside buffer misses
PAPI_HW_INT  0x80000029  No   Hardware interrupts
PAPI_BR_TKN  0x8000002c  No   Conditional branch instructions taken
PAPI_BR_MSP  0x8000002e  No   Conditional branch instructions mispredicted
PAPI_TOT_INS 0x80000032  No   Instructions completed
PAPI_BR_INS  0x80000037  No   Branch instructions
PAPI_TOT_CYC 0x8000003b  No   Total cycles
PAPI_L1_DCH  0x8000003e  Yes  Level 1 data cache hits
PAPI_L1_DCA  0x80000040  No   Level 1 data cache accesses
PAPI_L1_ICA  0x8000004c  No   Level 1 instruction cache accesses
PAPI_L1_ICR  0x8000004f  No   Level 1 instruction cache reads
PAPI_L1_TCH  0x80000055  Yes  Level 1 total cache hits
PAPI_L1_TCA  0x80000058  Yes  Level 1 total cache accesses
--------------------------------------------------------------------------------
Of 18 available events, 5 are derived.

But I want to get PAPI_FP_INS, for example, and I can only see 18 out of about 108 possible events.

Can anyone explain how I can find the valid options for LIBPFM_FORCE_PMU? I got this one from Stack Overflow, but I don't know how to find them out myself.

Is it possible to configure PAPI without needing to export LIBPFM_FORCE_PMU?

Is my Intel i7 currently not fully supported, or am I doing something wrong (e.g. missing configure options)?

(FYI I have set /proc/sys/kernel/perf_event_paranoid equal to 0 and I have sudo access on my laptop if necessary -- although commands above were run as user)

Trouble adding PAPI_TOT_CYC: Component containing event is disabled after running make test

I was trying to install PAPI on my i7-12700H MSI computer running Ubuntu. I came across this error and can't get PAPI to run. Can anyone help? It doesn't detect any events, so I can't get past the make test command.

Available PAPI preset and user defined events plus hardware information.

PAPI version : 7.1.0.0
Operating system : Linux 6.5.0-21-generic
Vendor string and code : GenuineIntel (1, 0x1)
Model string and code : 12th Gen Intel(R) Core(TM) i7-12700H (154, 0x9a)
CPU revision : 3.000000
CPUID : Family/Model/Stepping 6/154/3, 0x06/0x9a/0x03
CPU Max MHz : 4600
CPU Min MHz : 400
Total cores : 20
SMT threads per core : 2
Cores per socket : 10
Sockets : 1
Cores per NUMA region : 20
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): yes

================================================================================
PAPI Preset Events

Name        Code    Deriv Description (Note)

Of 0 available events, 0 are derived.

No events detected! Check papi_component_avail to find out why.

Available components and hardware information.

PAPI version : 7.1.0.0
Operating system : Linux 6.5.0-21-generic
Vendor string and code : GenuineIntel (1, 0x1)
Model string and code : 12th Gen Intel(R) Core(TM) i7-12700H (154, 0x9a)
CPU revision : 3.000000
CPUID : Family/Model/Stepping 6/154/3, 0x06/0x9a/0x03
CPU Max MHz : 4600
CPU Min MHz : 400
Total cores : 20
SMT threads per core : 2
Cores per socket : 10
Sockets : 1
Cores per NUMA region : 20
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): yes

Compiled-in components:
Name: perf_event Linux perf_event CPU counters
-> Disabled: Error libpfm4 too many default PMUs found
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
-> Disabled: No uncore PMUs or events found
Name: sysdetect System info detection component

Active components:
Name: sysdetect System info detection component
Native: 0, Preset: 0, Counters: 0


papi 7.0.1 -- building with binutils/2.41 -- multiple definition of `_peu_update_control_state

I'm trying to build PAPI 7.0.1 on AWS Graviton3 (Armv8).

I'm using binutils/2.41

I can build papi fine when I just use

--with-perf-events \

However, if I try

--with-components="perf_event_uncore" \

I get:

/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525365/cc69uRoe.o: in function `_peu_update_control_state':
/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:706: multiple definition of `_peu_update_control_state'; /home/ec2-user/tmp/1696525365/cclMuroC.o:/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:706: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525365/cc69uRoe.o:/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:52: multiple definition of `uncore_native_event_table'; /home/ec2-user/tmp/1696525365/cclMuroC.o:/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:52: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525365/cc69uRoe.o:/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:49: multiple definition of `_perf_event_uncore_vector'; /home/ec2-user/tmp/1696525365/cclMuroC.o:/home/ec2-user/tmp/1696525365/papi-7.0.1/src/components/perf_event_uncore/perf_event_uncore.c:49: first defined here
collect2: error: ld returned 1 exit status
make[1]: *** [libpapi.so.7.0.1.0] Error 1

If I try

--with-components="perf_event" \

I get:

/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/ccAQRk2S.o: in function `check_exclude_guest':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:279: multiple definition of `check_exclude_guest'; /home/ec2-user/tmp/1696525546/ccDkH0GF.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:279: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/ccAQRk2S.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:76: multiple definition of `perf_native_event_table'; /home/ec2-user/tmp/1696525546/ccDkH0GF.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:76: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/ccAQRk2S.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:73: multiple definition of `_perf_event_vector'; /home/ec2-user/tmp/1696525546/ccDkH0GF.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/perf_event.c:73: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_ntv_name_to_code':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:582: multiple definition of `_pe_libpfm4_ntv_name_to_code'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:582: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_ntv_code_to_name':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:632: multiple definition of `_pe_libpfm4_ntv_code_to_name'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:632: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_ntv_code_to_descr':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:725: multiple definition of `_pe_libpfm4_ntv_code_to_descr'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:725: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_ntv_code_to_info':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:791: multiple definition of `_pe_libpfm4_ntv_code_to_info'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:791: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_ntv_enum_events':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:831: multiple definition of `_pe_libpfm4_ntv_enum_events'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:831: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_shutdown':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1093: multiple definition of `_pe_libpfm4_shutdown'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1093: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_pe_libpfm4_init':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1145: multiple definition of `_pe_libpfm4_init'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1145: first defined here
/bm/ashterenli/install/binutils-2.41/bin/ld: /home/ec2-user/tmp/1696525546/cc3OTCSE.o: in function `_peu_libpfm4_init':
/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1317: multiple definition of `_peu_libpfm4_init'; /home/ec2-user/tmp/1696525546/cclwRKSp.o:/home/ec2-user/tmp/1696525546/papi-7.0.1/src/components/perf_event/pe_libpfm4_events.c:1317: first defined here
collect2: error: ld returned 1 exit status

I also tried an earlier binutils (2.37) and still got the same errors.

Am I doing something wrong?

Failing to detect Intel ifx by Makefile logic.

Compilation with Intel's ifx results in the following warning:
ifx -ffree -I../../.. -o query_device_simple_f query_device_simple_f.F ../../../libpapi.a -lrt
ifx: command line warning #10006: ignoring unknown option '-ffree'
This happens although components/sysdetect/tests/Makefile has logic to test for "ifx" and use the correct flag "-free". Because the proper flag is not used, the Fortran codes that are in free form fail to compile.

This issue occurred on ALCF machines.

HL-API: Problems with negative counter values

I am using PAPI's high-level API to collect energy data using RAPL and a couple of performance counters for a large number of simple, memory-intensive applications. These applications have been auto-generated and have execution times from about 50 ms up to 400 s.
An example of such an auto-generated application can be found in [1].

Unfortunately, I am sometimes observing negative counter values for some performance counters [2], especially on workloads with very short execution times (~50ms). I first thought this could be caused by an overflow of the counters but about 40% of the 44000 runs have at least one negative counter value.

My question now is whether this could be caused by counter overflow or by something else. If it is caused by an overflow, is there any option to handle this in the high-level API? I can't find any information on this in either the whitepaper or the documentation on the HL API.

Thank you in advance!

Here is some information on my system and the PAPI configuration:

  • Operating System: RockyLinux 8.8
  • Kernel: 6.4.6-1.el8.elrepo.x86_64
  • PAPI version: 7.0.0 (including #54)
  • PAPI_EVENTS="rapl:::DRAM_ENERGY:PACKAGE0,rapl:::DRAM_ENERGY:PACKAGE1,rapl:::PACKAGE_ENERGY:PACKAGE0,rapl:::PACKAGE_ENERGY:PACKAGE1,perf::CACHE-REFERENCES,perf::CACHE-MISSES,perf::INSTRUCTIONS,perf::CYCLES,PAPI_LD_INS,PAPI_SR_INS,PAPI_DP_OPS,PAPI_SP_OPSFP_ARITH:SCALAR,FP_ARITH:VECTOR,OCR:READS_TO_CORE_LOCAL_DRAM,OCR:READS_TO_CORE_LOCAL_PMM"
  • PAPI_MULTIPLEX="1"

[1] https://gist.github.com/lukalt/eabfa79b2dd672f7ab08488ae18dd8e1
[2] perf::CACHE-REFERENCES,perf::CACHE-MISSES,PAPI_LD_INS, PAPI_SR_INS, PAPI_DP_OPS, OCR:READS_TO_CORE_LOCAL_DRAM, OCR:READS_TO_CORE_LOCAL_PMM
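
If wraparound were the cause, the effect could be reproduced in isolation. The sketch below is generic C and says nothing about how PAPI computes its values; naive_delta, wrap_delta, and the 32-bit width in the usage note are assumptions chosen for illustration. It shows how a signed subtraction of two raw readings goes negative once a counter wraps, and how masking an unsigned difference to the counter width recovers the increment, provided the counter wrapped at most once between readings.

```c
#include <assert.h>
#include <stdint.h>

/* Signed difference of two raw readings: negative if the counter wrapped.
 * Assumes both readings are below 2^63. */
static int64_t naive_delta(uint64_t before, uint64_t after)
{
    return (int64_t)after - (int64_t)before;
}

/* Wrap-aware difference for a counter that is `bits` wide (1..64).
 * Correct as long as the counter wrapped at most once. */
static uint64_t wrap_delta(uint64_t before, uint64_t after, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    uint64_t mask = (bits == 64) ? UINT64_MAX
                                 : ((UINT64_C(1) << bits) - 1);
    /* Unsigned subtraction is modular, so masking yields the distance. */
    return (after - before) & mask;
}
```

For a 32-bit counter read as 0xFFFFFFF0 and then 0x10, naive_delta gives a large negative number, while wrap_delta(0xFFFFFFF0, 0x10, 32) gives 32.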

Wrong documentation for `PAPI_add_named_event`

Bad info in papi.c on line 2386:

PAPI_add_named_event( int EventSet, const char *EventName ) is documented as taking the params EventSet and "EventCode A defined event such as PAPI_TOT_INS".

The second param should be changed to "EventName An event name string for any configured component".

rocm_smi component failure with rocm 5.7.3+

When using rocm version 5.7.3 or newer on a MI210 GPU, the rocm_smi component isn't enabled with papi_component_avail.

Build on dopamine.icl.utk.edu (MI210 GPU):

#!/bin/bash -e

module load gcc@11
module load [email protected]
git clone https://github.com/icl-utk-edu/papi
cd papi/src

export PAPI_ROCM_ROOT=$ROCM_PATH
export PAPI_ROCMSMI_ROOT=$PAPI_ROCM_ROOT/rocm_smi

./configure --with-debug=yes --enable-warnings --with-components=rocm_smi
make -j4

utils/papi_component_avail

Result:

Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.1.0.0
Operating system         : Linux 6.1.62-1.el9.elrepo.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7413 24-Core Processor (1, 0x1)
CPU revision             : 1.000000
CPUID                    : Family/Model/Stepping 25/1/1, 0x19/0x01/0x01
CPU Max MHz              : 3630
CPU Min MHz              : 1500
Total cores              : 96
SMT threads per core     : 2
Cores per socket         : 24
Sockets                  : 2
Cores per NUMA region    : 48
NUMA regions             : 2
Running in a VM          : no
Number Hardware Counters : 5
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   rocm_smi                AMD GPU System Management Interface via rocm_smi_lib
   \-> Disabled: Error while initializing the native event table.
Name:   sysdetect               System info detection component

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 145, Preset: 27, Counters: 5
                                PMUs supported: perf, perf_raw, amd64_fam19h_zen3

Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
                                Native: 4, Preset: 0, Counters: 7
                                PMUs supported: amd64_rapl, amd64_fam19h_zen3_l3

Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0

The component works correctly on all GPUs with ROCm 5.5 and with the Radeon VII GPUs on histamine0 with ROCm 5.7.1+.

PAPI_attach() + rdpmc bug in linux kernel

Good morning,

I'm having strange problems doing periodic sampling of even the most basic
performance counters when it comes to attached processes, to the point that
I can't help wondering if I'm not using the correct API?  Or is this a use
case that PAPI simply wasn't designed for (because it seems to be too
simple a problem to be a bug that wasn't caught).

Anyway, I won't bore you with our actual source code, but I modified one of
the PAPI testcases (ctests/attach2.c) to demonstrate the same issue.  The
complete source code is attached to this email but here's the diff:

--- attach2.c.orig	2023-06-22 09:59:05.220181155 -0500
+++ attach2.c	2023-06-22 09:59:05.206847812 -0500
@@ -48,7 +48,7 @@ wait_for_attach_and_loop( void )
  putenv(newpath);

  if (ptrace(PTRACE_TRACEME, 0, 0, 0) == 0) {
-    execlp("attach_target","attach_target","100000000",NULL);
+    execlp("attach_target","attach_target","1000000000",NULL);
    perror("execl(attach_target) failed");
  }
  perror("PTRACE_TRACEME");
@@ -176,6 +176,16 @@ main( int argc, char **argv )
	  return 1;
	}

+	sleep(1);
+	retval = PAPI_read(EventSet1, values[0]);
+//	retval = PAPI_stop(EventSet1, values[0]);
+        if ( retval != PAPI_OK )
+          test_fail( __FILE__, __LINE__, "PAPI_read", retval );
+        printf( TAB1, "PAPI_TOT_CYC : \t", ( values[0] )[0] );
+	printf( "%s : \t %12lld\n",event_name, ( values[0] )[1]);
+//	retval = PAPI_start(EventSet1);
+//        if ( retval != PAPI_OK )
+//          test_fail( __FILE__, __LINE__, "PAPI_read", retval );

	do {
	  child = wait( &status );

Basically, I'm invoking PAPI_read() while the attached process is still
running (I increased its runtime to be a couple of seconds on my system so
that the sleep(1) makes sense).  And I'm getting garbage results:

must_ptrace is 1
Debugger exited wait() with 27085
Child has stopped due to signal 5 (Trace/breakpoint trap)
After 0
Continuing
PAPI_TOT_CYC :   140737488355327
PAPI_TOT_INS :   140737488355327
Debugger exited wait() with 27085
Child exited with value 0
Test case: 3rd party attach start, stop.
-----------------------------------------------
Default domain is: 1 (PAPI_DOM_USER)
Default granularity is: 1 (PAPI_GRN_THR)
Using 20000000 iterations of c += a*b
-------------------------------------------------------------------------
Test type    :             1
PAPI_TOT_CYC :     9145700936
PAPI_TOT_INS :    10000141883
Real usec    :        2659234
Real cycles  :     7722401191
Virt usec    :            242
Virt cycles  :         836500
-------------------------------------------------------------------------
Verification: none
PASSED

As you can see from the two added lines immediately following "Continuing",
instead of valid counts I get 2^47-1.  I tried doing
PAPI_stop()/PAPI_start() instead of PAPI_read() (see the commented out
code) but that doesn't seem to help.
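
The value printed after "Continuing" is easy to check: 140737488355327 is 2^47 - 1, i.e. the 47 low bits are all set. The following sketch assumes nothing about PAPI internals and only tests whether a reading has that all-ones shape; all_ones_width is a hypothetical helper name.

```c
#include <assert.h>

/* Return k if v == 2^k - 1 for some k >= 1, otherwise 0. */
static int all_ones_width(long long v)
{
    if (v <= 0)
        return 0;

    unsigned long long u = (unsigned long long)v;
    if ((u & (u + 1)) != 0)      /* not a contiguous run of low 1 bits */
        return 0;

    int width = 0;
    while (u != 0) {             /* count the set bits */
        width++;
        u >>= 1;
    }
    assert(width >= 1 && width <= 63);
    return width;
}
```

Here all_ones_width(140737488355327LL) returns 47, while the final PAPI_TOT_CYC count from the same run, 9145700936, returns 0.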

This is on my old x86 laptop (Intel Core i7-7500U, Gentoo stable, Linux
kernel version 6.3.3) with the latest release of PAPI (7.0.1) but to check
that I'm not crazy I also compiled it on a login node of the ALCF sunspot
(Intel Xeon Gold 5320, SLES 15-SP3, Linux kernel
5.3.18-150300.59.115-default), with the same outcome.

Could you shed some light on what might be going on here and how to fix it?

Thank you,

Kamil

Some tests of the snprintf results appear to be off by one in the papi code

I was working on addressing some of the issues found by Coverity in PAPI (https://github.com/wcohen/papi/tree/coverity202307). In particular, there were some possibly non-null-terminated strings caused by the string being larger than the buffer (wcohen@83ae483). When developing that particular fix, I looked at how other places in PAPI had addressed the issue. Other places were using snprintf and checking the value returned by the function. I noticed that some source files, such as src/components/pcp/linux-pcp.c, flag an error only when the return value is larger than the size passed in, while others, such as src/components/sysdetect/nvidia_gpu.c and src/components/sysdetect/amd_gpu.c, flag an error if the returned size is EQUAL to or larger than the size passed in. Both of the following links mention that the terminating NUL at the end of the string is not counted in the total returned by snprintf:

https://www.geeksforgeeks.org/snprintf-c-library/
https://cplusplus.com/reference/cstdio/snprintf/

That would mean that a number of the tests checking the return value of snprintf are off by one. The tests should be flagging an error if snprintf returns a value equal to or greater than the buffer size rather than only greater than the buffer size.
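
The check described above can be sketched as follows. This is a standalone illustration rather than code from the PAPI tree; join_path and its 0 / -1 return convention are hypothetical. Because snprintf returns the length the full string would have had, excluding the terminating NUL, a return value equal to the buffer size already means the output was truncated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "dir/file" into buf, which holds `size` bytes.
 * Returns 0 if the whole string fit, -1 on truncation or error. */
static int join_path(char *buf, size_t size, const char *dir, const char *file)
{
    int n = snprintf(buf, size, "%s/%s", dir, file);

    /* n excludes the NUL, so n == size means one byte did not fit. */
    if (n < 0 || (size_t)n >= size)
        return -1;

    assert(strlen(buf) == (size_t)n);   /* full string was written */
    return 0;
}
```

For example, "abc/def" needs 8 bytes including the NUL: with an 8-byte buffer the call succeeds, while with a 7-byte buffer snprintf returns 7 and a check using only "greater than" would miss the truncation.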
