gtcasl / gpuocelot

GPUOCelot: A dynamic compilation framework for PTX

License: BSD 3-Clause "New" or "Revised" License

gpuocelot's Introduction

#Current Status

The GPU Ocelot project is no longer actively maintained. The latest system requirements and installation guide are available at https://github.com/gtcasl/gpuocelot/wiki/Installation

#Overview

Ocelot is a modular dynamic compilation framework for heterogeneous systems, providing various backend targets for CUDA programs and analysis modules for the PTX virtual instruction set. Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86 CPUs at full speed without recompilation. For more information, check http://gpuocelot.gatech.edu/.

#Installation

Please check the installation guide at https://github.com/gtcasl/gpuocelot/wiki/Installation.

#Resources

Source code documentation http://gpuocelot.gatech.edu/doxygen
Mailing list http://groups.google.com/group/gpuocelot

#News

March, 2013 - Actively seeking developers for AMD and Intel GPU backends. Please post on the mailing list if interested.

June 10, 2012 - GPU Ocelot tutorial to be presented at ISCA 2012.

May 14, 2012 - GPU Ocelot poster to be presented at GPU Technology Conference.

March 5, 2012 - Call for developers for the NVIDIA GPU device and the AMD GPU device. Initial implementations are in-place but we could greatly benefit from owners willing to test the code and add new features as new hardware (Kepler, GCN) comes out. Post on the mailing list if you are interested.

October 10, 2011 - Ocelot tutorial at PACT 2011.

#Contributing

##Documentation

Ocelot currently lacks good documentation for installation and common usage. If you are interested in writing tutorials or how-tos, please post on the mailing list.

##Complete a Feature Request

If you would like to contribute to this project and help with any of the directions on our roadmap, you can do the following:

Pull a task from the list of issues
Implement it on your own
Post a patch to the relevant issue

See the requirements for contributing a feature
If it is accepted we will merge it into the main codebase

Ask us about becoming a registered developer

##Branch Our Code

If you want to work on something not on our roadmap, but want to host your code on this site, contact us about becoming a developer and creating a branch.

##Start A New Project

If you want to work independently using Ocelot as a starting point, feel free to copy our most current release and use it internally.

#Special Thanks

We would like to thank the following people, who have contributed novel ideas, software, and tests to the project:

Nathan Bell
Sylvain Collange
David Luebke
Diogo Sampaio
Ryuta Suzuki
Steve Worley
Ignacio Llamas
James Bigler
Greg Humphreys

gpuocelot's Issues

error: 'ir::PTXOperand::AddressMode' is not a class or namespace

From [email protected] on March 31, 2011 00:28:09

What steps will reproduce the problem?
1. ./build.py --install

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
2.0

Please provide any additional information below.
In ocelot/analysis/implementation/DivergenceAnalysis.cpp:

ir::PTXOperand::AddressMode::Special
should be
ir::PTXOperand::Special

also similar error in ocelot/analysis/implementation/SyncEliminationPass.cpp
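
The shorter spelling compiles because, before C++11, the enumerators of an unscoped enum belong only to the enclosing scope, so the enum's own name cannot be used as a qualifier. The sketch below illustrates this with a hypothetical stand-in for `ir::PTXOperand`; the enumerator list is an assumption, not Ocelot's actual definition.

```cpp
// Hypothetical stand-in for ir::PTXOperand; the enumerators are illustrative.
namespace ir {
struct PTXOperand {
    enum AddressMode { Register, Immediate, Special };
    AddressMode addressMode;
};
}

// Portable spelling: reach the enumerator through the enclosing class.
// ir::PTXOperand::AddressMode::Special is only accepted from C++11 onward.
inline bool isSpecial(const ir::PTXOperand& operand) {
    return operand.addressMode == ir::PTXOperand::Special;
}
```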

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=49

Compilation fails on x86_64

From [email protected] on October 01, 2009 14:51:23

Hi,
compilation of r150 fails on
cc1plus: warnings being treated as errors
ocelot/ir/implementation/LLVMInstruction.cpp: In member function
'std::string ir::LLVMInstruction::Operand::toString() const':
ocelot/ir/implementation/LLVMInstruction.cpp:147: error: dereferencing
type-punned pointer will break strict-aliasing rules
ocelot/ir/implementation/LLVMInstruction.cpp:206: error: dereferencing
type-punned pointer will break strict-aliasing rules

It seems that the 'classical' solution to this problem is to introduce a union of
uint32/64 and a float/double.
There seems to be something called std::hexfloat (like std::hex), but
unfortunately the compiler can't find it.
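
One way to read a float's bit pattern without triggering the strict-aliasing diagnostic is to copy the bytes instead of casting the pointer. This is only a sketch and not necessarily the fix applied in Ocelot: `floatBits` is a hypothetical helper, it uses `std::memcpy` in place of the union mentioned above, and it assumes a 32-bit `float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw bits of a 32-bit float.
// Copying bytes avoids dereferencing a type-punned pointer.
inline std::uint32_t floatBits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}
```

On a platform with IEEE 754 single-precision floats, `floatBits(1.0f)` yields `0x3f800000`, which can then be printed with `std::hex`.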

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=29

Implement Dead Code Elimination

From [email protected] on June 30, 2009 15:30:29

Describe the New Feature:
1. After the data flow graph has been used to compute live sets, remove instructions that produce registers with no consumers.
2. Recursively do this until no more instructions can be removed.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk
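
The two steps above can be sketched on a simplified, straight-line instruction list. Everything here is an assumption for illustration: `Instr`, `eliminateDeadCode`, and the `sideEffect` flag are hypothetical names, and a real pass would work per basic block on the live sets from the data flow graph.

```cpp
#include <set>
#include <vector>

// Hypothetical, simplified instruction: one destination, several sources.
struct Instr {
    int dest;               // register written, or -1 if none
    std::vector<int> srcs;  // registers read
    bool sideEffect;        // stores, branches, and barriers must be kept
};

// Repeatedly drops instructions whose result has no consumer.
// liveOut lists registers that are still needed after the sequence.
inline std::vector<Instr> eliminateDeadCode(std::vector<Instr> code,
                                            const std::set<int>& liveOut) {
    bool changed = true;
    while (changed) {
        changed = false;
        // Collect every register that some remaining instruction reads.
        std::set<int> used(liveOut);
        for (const Instr& instr : code)
            used.insert(instr.srcs.begin(), instr.srcs.end());
        std::vector<Instr> kept;
        for (const Instr& instr : code) {
            bool dead = !instr.sideEffect && instr.dest >= 0 &&
                        used.count(instr.dest) == 0;
            if (dead) changed = true;
            else kept.push_back(instr);
        }
        code.swap(kept);
    }
    return code;
}
```

Removing one instruction can leave its operands without consumers, which is why the loop runs until a full pass removes nothing.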

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=7

Add a driver level API implementation

From [email protected] on July 14, 2009 08:28:41

Describe the New Feature:
Having a driver level API implementation would allow applications that use it exclusively to use Ocelot...

Which milestone does the feature belong to? 1.0.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=12

Add SSA Analysis Module

From [email protected] on June 22, 2009 15:36:50

Describe the New Feature:
1. Add an SSA control flow graph to the code analysis modules.
a. The SSA graph should be composed of blocks of instructions.
b. Registers should be represented by integers.
c. Instructions should be represented by either pointers to instruction objects or indices into a vector of all instructions.
d. There should be three types of instructions: generic, branch, and phi.
2. Write a unit test that builds the SSA CFG for all of the PTX test files.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=1

Add an AMD IL backend

From [email protected] on October 08, 2009 13:47:27

Describe the New Feature:
At a high level this would be an IR, translator, and executive kernel for AMD's IL language.

Which milestone does the feature belong to? 2.0.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=30

Add Dataflow Graph Analysis Module

From [email protected] on June 22, 2009 15:52:25

Describe the New Feature:
1. A dataflow graph is an augmented control flow graph where the live registers going into and out of each block are annotated in the CFG.
2. Create a separate DFG class where each block is augmented with a list of live registers in and a list of live registers out.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk
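
As a rough illustration of the annotation described above, the following sketch computes live-in and live-out sets with the standard backward fixed-point iteration. `DfgBlock` and `computeLiveness` are hypothetical names, not Ocelot's classes, and each block is assumed to already know which registers it defines and which it reads before defining.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical block record: a CFG node augmented with live register sets.
struct DfgBlock {
    std::set<int> defs;                   // registers written in the block
    std::set<int> uses;                   // registers read before any write
    std::vector<std::size_t> successors;  // indices of successor blocks
    std::set<int> liveIn;
    std::set<int> liveOut;
};

// liveOut(b) = union of liveIn over successors
// liveIn(b)  = uses(b) + (liveOut(b) - defs(b))
inline void computeLiveness(std::vector<DfgBlock>& blocks) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t b = blocks.size(); b-- > 0;) {
            DfgBlock& block = blocks[b];
            std::set<int> out;
            for (std::size_t s : block.successors)
                out.insert(blocks[s].liveIn.begin(), blocks[s].liveIn.end());
            std::set<int> in(block.uses);
            for (int reg : out)
                if (block.defs.count(reg) == 0) in.insert(reg);
            if (in != block.liveIn || out != block.liveOut) {
                block.liveIn.swap(in);
                block.liveOut.swap(out);
                changed = true;
            }
        }
    }
}
```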

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=2

Define LLVM IR

From [email protected] on June 22, 2009 16:24:26

Describe the New Feature:
1. Create an LLVMInstruction class that inherits from Instruction and is able to represent any valid LLVM instruction.
2. Create a toString function that converts an instruction to a parsable assembly language representation.

Which milestone does the feature belong to? Milestone-Release0.6
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=4

Ocelot does not build on Windows

From [email protected] on April 21, 2011 14:52:58

The title says it all: we need someone to actually step through the entire build process on Windows.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=51

Bug when type-checking the operand to bar and pmevent instructions

From [email protected] on November 04, 2009 20:44:14

In PTXInstruction.cpp:

case Bar: {
if( !a.addressMode == PTXOperand::Immediate ) {...}

should be:

case Bar: {
if( a.addressMode != PTXOperand::Immediate ) {...}

The same occurs for Pmevent.
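
The two conditions differ because `!` binds more tightly than `==`, so the original expression negates the operand first and then compares the resulting bool with the enumerator. The sketch below makes that explicit; the enum and its values are hypothetical and chosen only to show a case where the results diverge.

```cpp
// Hypothetical enumerator values; the real ir::PTXOperand enum may differ.
enum AddressMode { Register = 0, Immediate = 1, Indirect = 2 };

// How the compiler parses `!mode == Immediate`: the negation is applied
// first, so a bool (0 or 1) is compared against the enumerator.
inline bool rejectsAsWritten(AddressMode mode) {
    return (!mode) == Immediate;
}

// Intended check: reject every operand that is not an immediate.
inline bool rejectsAsIntended(AddressMode mode) {
    return mode != Immediate;
}
```

With these assumed values the two checks agree for `Register` and `Immediate` but disagree for `Indirect`, which the original form silently accepts.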

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=33

barrier deadlock on __syncthreads?

From [email protected] on January 15, 2010 09:04:05

What steps will reproduce the problem?
The error below occurs when encountering the first __syncthreads(); in my CUDA kernel code.

What is the expected output? What do you see instead?
==Ocelot== Emulator failed to run kernel "_Z18chiSquaredDistancePfS_S_i" with exception:
==Ocelot== [PC 91] [thread 0] [cta 0] bar.sync 0 - barrier deadlock at: precomputeMatrix_chikernel.cu:53:0
terminate called after throwing an instance of 'executive::RuntimeException'
Aborted

What version of the product are you using? On what operating system?
OpenSuse 11.1/GCC4.3.2/CUDA Toolkit 2.3/Ocelot SVN r271 (2009-12-22)

Please provide any additional information below.
__global__ void
chiSquaredDistance(float* C, float* A, float* B, int slabSizeA)
{
// Thread index
int tx = threadIdx.x;

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;

float temp = 0.0f;

for(int vectorChunk = 0; vectorChunk < VECTORCHUNK_COUNT; vectorChunk++)
{
temp += 1.0f; // compute some value for temp, not removed
}

__shared__ float sum[VECTORCHUNK_SIZE];
sum[tx] = temp;
__syncthreads();
for(int bit = VECTORCHUNK_SIZE / 2; bit > 0; bit /= 2)
{
    float t = sum[tx] + sum[tx ^ bit];
    __syncthreads();
    sum[tx] = t;
    __syncthreads();
}

// write to global memory
if(tx == 0)
    C[by * slabSizeA + bx] = sum[tx] / 2;

}

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=37

PTX-Emulator: Warp Scheduler

From [email protected] on May 18, 2010 10:36:05

Describe the New Feature:
Add the concept of a warp to the PTX emulator.

1. The warp size should be configurable on a per-cta basis.
2. The CooperativeThreadArray class should be extended with a set of Warps, each containing a stack of CTAContexts.
3. The CooperativeThreadArray class should include a callback interface to a warp scheduler function object that picks the next warp to execute out of a pool of ready warps.
4. The branch divergence mechanism should be refined to operate on a per-warp basis rather than a per-cta basis.
5. Each eval_* function should be modified to only execute instructions for the currently selected warp.

Which milestone does the feature belong to? 2.0.0
Which branch does the new feature go in? Trunk
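
The scheduler callback described above could look roughly like the following. This is a sketch under assumptions: `Warp`, `WarpScheduler`, and `roundRobin` are hypothetical names, and round-robin is just one possible policy for choosing among ready warps.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical warp record kept by the CTA.
struct Warp {
    unsigned firstThread;  // id of the first thread in the warp
    bool ready;            // false while blocked, e.g. waiting at a barrier
};

// Callback type: given all warps and the index that ran last, return the
// index of the next warp to run, or warps.size() if none is ready.
typedef std::function<std::size_t(const std::vector<Warp>&, std::size_t)>
    WarpScheduler;

// One possible policy: the first ready warp after the one that ran last.
inline std::size_t roundRobin(const std::vector<Warp>& warps,
                              std::size_t last) {
    for (std::size_t step = 1; step <= warps.size(); ++step) {
        std::size_t candidate = (last + step) % warps.size();
        if (warps[candidate].ready) return candidate;
    }
    return warps.size();
}
```

Passing the policy as a function object keeps the CTA independent of any particular scheduling strategy, so a different policy can be swapped in without touching the eval_* functions.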

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=41

config.ocelot has mistakes

From [email protected] on August 16, 2009 15:11:12

Hi,
when reading config.ocelot, these two lines are executed:
fi.descend("ocelot");
fi.descend("OcelotRuntime");

but the example config.ocelot in svn has no OcelotRuntime tags;
adding them causes the Ocelot runtime to load its settings.

Also, there is an unneeded 0 in between tags.

Attaching corrected file.

Attachment: config.ocelot

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=22

ocelot segfaults

From [email protected] on September 08, 2009 17:45:32

Hi, I got the newest ocelot, and running memoryErrors from the wiki crashes:

[Thread debugging using libthread_db enabled]
[New Thread 0x7f3822d42760 (LWP 21763)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3822d42760 (LWP 21763)]
0x00007f382269066b in executive::Executive::getSelectedISA (this=0xdec9d8)
at ocelot/executive/implementation/Executive.cpp:358
358 }
(gdb) bt
#0 0x00007f382269066b in executive::Executive::getSelectedISA
(this=0xdec9d8) at ocelot/executive/implementation/Executive.cpp:358
#1 0x00007f38226921c2 in executive::Executive::loadModule (this=0xdec9d8,
path=, translateToSelected=true,
stream=0x7fff2ad6d7d0) at ocelot/executive/implementation/
Executive.cpp:209
#2 0x00007f38217010f8 in cuda::CudaRuntime::registerFatBinary
(this=0xdec4f8, binary=@0x6151a0)
at ocelot/cuda/implementation/CudaRuntime.cpp:604
#3 0x00007f382170f5ab in cuda::CudaRuntimeBase::cudaRegisterFatBinary
(this=, fatCubin=0x6151a0)
at ocelot/cuda/implementation/CudaRuntimeBase.cpp:1482
#4 0x000000000040a485 in
__sti____cudaRegisterAll_47_tmpxft_0000112c_00000000_4_memoryErrors_cpp1_ii_41d29f55
()
at /tmp/tmpxft_0000112c_00000000-1_memoryErrors.cudafe1.stub.c:29
#5 0x000000000040f8c6 in ?? ()
#6 0x00007f3822d6e000 in ?? ()
#7 0x000000000040f7f0 in ?? ()
#8 0x0000000000000000 in ?? ()

Additionally, compiling LLVMExecutableKernel.cpp fails with

g++ -DHAVE_CONFIG_H -I. -Wall -ansi -pedantic -Werror -std=c++0x -g -O2 -
MT libOcelotExecutive_la-LLVMExecutableKernel.lo -MD -MP -MF .deps/
libOcelotExecutive_la-LLVMExecutableKernel.Tpo -c ocelot/executive/
implementation/LLVMExecutableKernel.cpp -fPIC -DPIC -o .libs/
libOcelotExecutive_la-LLVMExecutableKernel.o
cc1plus: warnings being treated as errors
ocelot/executive/implementation/LLVMExecutableKernel.cpp: In destructor
'virtual executive::LLVMExecutableKernel::~LLVMExecutableKernel()':
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: error:
possible problem detected in invocation of delete operator:
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: error:
invalid use of incomplete type 'struct llvm::Module'
./ocelot/executive/interface/LLVMExecutableKernel.h:16: error: forward
declaration of 'struct llvm::Module'
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: note: neither
the destructor nor the class-specific operator delete will be called, even
if they are declared when the class is defined.
make[1]: *** [libOcelotExecutive_la-LLVMExecutableKernel.lo] Error 1

Commenting out this line allowed me to compile, although obviously this is
not a fix :).
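
The diagnostic above appears when `delete` is applied to a pointer whose type is only forward-declared at that point, so the compiler cannot call the destructor. A common remedy is to make the full class definition visible before the destructor body. The sketch below shows the pattern with hypothetical `Module` and `Kernel` types standing in for `llvm::Module` and `LLVMExecutableKernel`.

```cpp
// Forward declaration only: Module is an incomplete type here, which is
// enough to declare a pointer member.
struct Module;

class Kernel {
public:
    explicit Kernel(Module* module) : _module(module) {}
    ~Kernel();  // declared here, defined once Module is complete

private:
    Module* _module;
};

// Full definition; in a real project this would come from including the
// header that defines the class in the .cpp file.
static int destroyedModules = 0;
struct Module {
    ~Module() { ++destroyedModules; }
};

// Module is complete at this point, so delete runs its destructor.
Kernel::~Kernel() { delete _module; }
```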

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=26

Change the Pass interface to include analysis and manager concepts

From [email protected] on August 26, 2010 08:39:25

Describe the New Feature:
See this post http://groups.google.com/group/gpuocelot/browse_thread/thread/45c67ce970284954?hl=en

Which milestone does the feature belong to? 2.0.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=44

memoryChecker reports bad global memory access

From [email protected] on August 04, 2010 00:21:23

What steps will reproduce the problem?
1. allocate device memory
2. attempt to access the allocated memory within a kernel

What is the expected output? What do you see instead?
I expect the program to succeed with no output. Instead, the memory checker claims one of my kernels is accessing memory that is not allocated or mapped. However, it clearly is allocated, as its own list of device allocations shows it is within the fifth allocation:

terminate called after throwing an instance of 'hydrazine::Exception'
what(): [PC 26] [thread 0] [cta 0] ld.global.u8 % r24 , [% r23 + 0] - Global memory access 0xb79d0f is not within any allocated or mapped range.

Nearby Device Allocations
[0xa66fc0] - [0xa67020](96 bytes)
[0xa670a0] - [0xa674d8](1080 bytes)
[0xa67620] - [0xa67680](96 bytes)
[0xa67700] - [0xa67b38](1080 bytes)
[0xb79d00] - [0xb7bd00](8192 bytes)
[0xb7be20] - [0xb7de20](8192 bytes)
[0xb7df40] - [0xb7e740](2048 bytes)
[0xb7e860] - [0xb7f060](2048 bytes)
[0xb7f180] - [0xb7f980](2048 bytes)
[0xb7faa0] - [0xb802a0](2048 bytes)
[0xb803c0] - [0xb80bc0](2048 bytes)
[0xb80ce0] - [0xb814e0](2048 bytes)
[0xb83520] - [0xb85520](8192 bytes)
[0xb87660] - [0xb89660](8192 bytes)
[0xb89780] - [0xb8e780](20480 bytes)
[0xb8e8a0] - [0xb938a0](20480 bytes)
[0xb939c0] - [0xb989c0](20480 bytes)
[0xb98ae0] - [0xb9dae0](20480 bytes)
[0xb9dc00] - [0xba2c00](20480 bytes)
[0xba2d20] - [0xba7d20](20480 bytes)
[0xba7e40] - [0xbace40](20480 bytes)
[0xbacf60] - [0xbb1f60](20480 bytes)
[0xbb2080] - [0xbb2880](2048 bytes)
[0xbb29a0] - [0xbb31a0](2048 bytes)
[0xbb32c0] - [0xbb3ac0](2048 bytes)
[0xbb3be0] - [0xbb43e0](2048 bytes)

What version of the product are you using? On what operating system?
- Ocelot version: SVN r634
- compiled with gcc 4.5 with the macro to disable c++0x feature in KernelEntry.cpp as described in http://groups.google.com/group/gpuocelot/msg/55339e218bc5bdaa?pli=1
- modifications to type checker described in http://groups.google.com/group/gpuocelot/browse_thread/thread/186dcb0bee10ed8b
- Cuda version: 2.3

Please provide any additional information below.
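
The reporter's claim can be checked with a simple range test over the listed allocations. The sketch below is not Ocelot's memory checker; `Allocation` and `findAllocation` are hypothetical, and the upper bound of each printed range is treated as exclusive, which matches the printed sizes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one device allocation.
struct Allocation {
    std::uintptr_t base;
    std::size_t size;
};

// Returns the index of the allocation that fully contains the access
// [address, address + bytes), or allocations.size() if there is none.
inline std::size_t findAllocation(const std::vector<Allocation>& allocations,
                                  std::uintptr_t address, std::size_t bytes) {
    for (std::size_t i = 0; i < allocations.size(); ++i) {
        const Allocation& a = allocations[i];
        if (address >= a.base && address - a.base + bytes <= a.size)
            return i;
    }
    return allocations.size();
}
```

For the one-byte `ld.global.u8` at `0xb79d0f`, the offset into the range starting at `0xb79d00` is 15, well inside its 8192 bytes, so a check of this form would accept the access.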

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=43

Add a CUDA debugger

From [email protected] on June 08, 2010 14:38:56

Describe the New Feature:
See this thread: http://groups.google.com/group/gpuocelot/browse_thread/thread/e4964a46d419623d

Which milestone does the feature belong to? 3.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=42

Add an OpenCL front-end

From [email protected] on August 05, 2009 12:04:41

Describe the New Feature:
1. Add an OpenCL front-end to ocelot.
2. Completely re-implement the OpenCL 1.0 specification using the open source Nvidia Open64 compiler to generate PTX code.
3. Use as much of the existing Executive class as possible to implement OpenCL functionality.

Which milestone does the feature belong to? 1.0.0
Which branch does the new feature go in? Branch

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=17

Virtualize the CUDA runtime API

From [email protected] on September 05, 2009 15:11:59

Describe the New Feature:
We want support for easily swapping between multiple implementations of the CUDA runtime. OcelotRuntimeAPI should be replaced by a pure virtual class with one pure virtual method per CUDA API call.

Which milestone does the feature belong to? 0.8.0
Which branch does the new feature go in? Trunk
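
A minimal sketch of such an interface is shown below, limited to two calls. The class names are hypothetical and the signatures are simplified to return `int` (0 for success) instead of `cudaError_t`, so this is an illustration of the design rather than Ocelot's actual interface.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical abstract runtime: one pure virtual method per API call.
class CudaRuntimeInterface {
public:
    virtual ~CudaRuntimeInterface() {}
    virtual int cudaMalloc(void** devPtr, std::size_t size) = 0;
    virtual int cudaFree(void* devPtr) = 0;
};

// One interchangeable implementation that backs "device" memory with
// host memory, as an emulator might.
class HostBackedRuntime : public CudaRuntimeInterface {
public:
    virtual int cudaMalloc(void** devPtr, std::size_t size) {
        *devPtr = std::malloc(size);
        return *devPtr != 0 ? 0 : 1;
    }
    virtual int cudaFree(void* devPtr) {
        std::free(devPtr);
        return 0;
    }
};
```

Callers that hold only a `CudaRuntimeInterface&` can be pointed at a different implementation, such as one that records a trace of calls, without being recompiled against it.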

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=23

error--- no matching function for call to 'std::basic_ofstream

From [email protected] on March 31, 2011 01:35:23

What steps will reproduce the problem?
1. build.py --install

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
2.0

Please provide any additional information below.
In ocelot/executive/implementation/PassThroughDevice.cpp line 604
std::ofstream file(stream.str());
should be
std::ofstream file((stream.str()).c_str());

the same in ocelot/graphs/implementation/DivergenceDrawer.cpp:187
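
The change is needed because the `std::ofstream` constructor taking a `std::string` was only added in C++11; older standard libraries accept a `const char*` only. The sketch below shows the portable form in context; `writeReport` and the file naming are hypothetical.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: builds a file name with a stringstream and opens it
// through c_str(), which works both before and after C++11.
inline bool writeReport(const std::string& baseName, int index,
                        const std::string& text) {
    std::stringstream stream;
    stream << baseName << index << ".txt";
    std::ofstream file(stream.str().c_str());
    if (!file.is_open()) return false;
    file << text;
    return true;
}
```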

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=50

Review LLVM IR

From [email protected] on July 29, 2009 17:59:52

Purpose of code changes on this branch:
Make sure that the proposed IR is suitable as a translation target for the PTX IR.

When reviewing my code changes, please focus on:
1) Make sure that the IR is capable of expressing all LLVM instructions.
2) Suggest ways to modify the IR to make it easier to use/understand.
3) Check to make sure that the code generation functions and error checking functions are complete and correct.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=14

failure to build against current hydrazine

From dankamongmen on April 13, 2010 05:44:06

What steps will reproduce the problem?
1. check out a fresh hydrazine using:

svn checkout http://hydrazine.googlecode.com/svn/trunk/ hydrazine-read-only

as instructed in the Installation Guide

2. check out a fresh copy of gpuocelot
3. build and install hydrazine. only libhydrazine.a is installed using the standard "make install" target
4. attempt to build gpuocelot. the build fails looking for hydrazine/implementation/debug.h:

make all-am
make[1]: Entering directory `/home/dank/local/gpuocelot-read-only/ocelot'
/bin/bash ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I ./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-DataflowGraph.lo -MD -MP -MF .deps/libocelot_la-DataflowGraph.Tpo -c -o libocelot_la-DataflowGraph.lo `test -f 'ocelot/analysis/implementation/DataflowGraph.cpp' || echo './'`ocelot/analysis/implementation/DataflowGraph.cpp
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I ./ocelot/cuda/include -Wall
-ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-DataflowGraph.lo -MD -MP
-MF .deps/libocelot_la-DataflowGraph.Tpo -c
ocelot/analysis/implementation/DataflowGraph.cpp -fPIC -DPIC -o
.libs/libocelot_la-DataflowGraph.o
In file included from ocelot/analysis/implementation/DataflowGraph.cpp:10:
./ocelot/analysis/interface/DataflowGraph.h:15:44: error:
hydrazine/implementation/debug.h: No such file or directory
ocelot/analysis/implementation/DataflowGraph.cpp: In constructor
‘analysis::DataflowGraph::NoProducerException::NoProducerException(unsigned
int)’:
ocelot/analysis/implementation/DataflowGraph.cpp:56: error: aggregate
‘std::stringstream message’ has incomplete type and cannot be defined
ocelot/analysis/implementation/DataflowGraph.cpp: In member function
‘analysis::DataflowGraph::Instruction
analysis::DataflowGraph::convert(ir::PTXInstruction&)’:

What is the expected output? What do you see instead?
A successful build.

What version of the product are you using? On what operating system?
SVN as of 2010-04-13 on Debian Linux Unstable, using the llvm-snapshot LLVM packages.

Please provide any additional information below.
Hit me on aim or gtalk (nickblackandmild, [email protected]), if you'd like, and we ought be able to track in on it pretty quickly. Thanks.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=40

Multithread the emulator!

From [email protected] on July 29, 2009 23:27:16

Describe the New Feature:
1) Devise a work queue approach where the executive class spawns one thread per CPU core and assigns CTAs to threads as they complete.
2) For atomic ops, rather than locking, asynchronously push data into a local queue; when it overflows, lock and do a bulk update. Also do a bulk update when the CTA completes to eliminate stragglers and when a fence instruction is called.

Which milestone does the feature belong to? 1.0.0
Which branch does the new feature go in? Branch
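
Point 1 above can be sketched with a shared atomic counter acting as the work queue: each worker thread claims the next CTA index as soon as it finishes the previous one. `runCtas` is a hypothetical helper, not part of Ocelot, and the batched atomic updates from point 2 are not covered here.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Runs runCta(i) for every i in [0, ctaCount) on one thread per CPU core.
// The callback must be safe to call concurrently for different indices.
template <typename RunCta>
void runCtas(std::size_t ctaCount, RunCta runCta) {
    std::atomic<std::size_t> next(0);
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 1;  // the core count may be unknown

    auto worker = [&]() {
        for (;;) {
            // Claim the next unprocessed CTA; stop when none remain.
            std::size_t cta = next.fetch_add(1);
            if (cta >= ctaCount) return;
            runCta(cta);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (std::thread& thread : pool) thread.join();
}
```

Because threads pull work instead of receiving a fixed share up front, a CTA that runs long does not leave the other cores idle.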

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=16

Atomic CAS has incorrect semantics when interleaved with non-atomic stores

From [email protected] on December 10, 2009 12:44:26

What steps will reproduce the problem? Consider the example:

a = 25

(thread 1) st a, 5
(thread 2) atomic cas a, 25, 0

possible outcomes:

case 0:
(thread 1)
(thread 2)
(thread 2)
a = 0

case 1:
(thread 2)
(thread 2)
(thread 1)
a = 5

case 2:
(thread 2)
(thread 1)
(thread 2)
a = 25

What is the expected output? What do you see instead?
Case 2 is a possible ordering in our implementation, but would not be possible if the operation was actually performed atomically rather than using locks. It is very unlikely that a program would ever rely on this behavior, but it is one more reason to abandon the current locking implementation and move to an entirely atomic implementation.

Please use labels and text to provide additional information.
We should consider replacing the current implementation with the upcoming cstdatomics library.
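
For comparison, a compare-and-swap built on a hardware-backed atomic performs the read, compare, and write as one indivisible step, so no store can land in the middle. The sketch below uses C++11 `std::atomic`; treating that as the replacement the issue has in mind is an assumption, and `atomicCas` is a hypothetical helper.

```cpp
#include <atomic>

// Hypothetical helper: if the location holds `compare`, replace it with
// `swap`. Returns the value that was observed at the location.
inline int atomicCas(std::atomic<int>& location, int compare, int swap) {
    int observed = compare;
    // On failure, compare_exchange_strong writes the current value into
    // `observed`; on success, `observed` still equals the old value.
    location.compare_exchange_strong(observed, swap);
    return observed;
}
```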

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=35

on 32-bit platforms gcc and nvcc disagree about the size of tuples of pairs

From [email protected] on February 06, 2010 03:31:14

What steps will reproduce the problem?
This test program:

#include <iostream>
#include <tr1/tuple>
#include <utility>

int main()
{
std::cerr << "sizeof(long long): " << sizeof(long long) << std::endl;

using namespace std;
using namespace std::tr1;

typedef pair<long long, long long> p;
typedef tuple<p, unsigned int> t;

std::cerr << "sizeof(tuple<pair<long long, long long>, unsigned int>): " << sizeof(t) << std::endl;

return 0;
}

What is the expected output? What do you see instead?
The size is 20 on gcc4.4.1 32-bit, 24 on gcc4.4.1 64-bit, and 24 on nvcc3.0b

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=38

LLVM Translator

From [email protected] on July 29, 2009 18:14:11

Describe the New Feature:
1. Implement a high level translator interface for moving between different Instruction classes.
2. Implement a specific translator that examines a vector of PTX instructions and produces an equivalent vector of LLVM instructions.

Note: This first version will use naive translation where each PTX instruction maps to one or more LLVM instructions. It should not pay attention to automatic vectorization at all.

Which milestone does the feature belong to? 0.7.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=15

CUDA API Trace Generator

From [email protected] on September 05, 2009 15:13:20

Describe the New Feature:
Add an implementation of the CUDA runtime that records a trace of every call made.

Which milestone does the feature belong to? 0.9.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=24

Add UIUC Parboil Benchmark Suite to Ocelot Regression Tests

From [email protected] on July 08, 2009 10:00:04

Describe the New Feature:
Create a regression test based on the UIUC Parboil benchmark suite.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=9

LLVM Runtime

From [email protected] on September 07, 2009 16:03:13

Describe the New Feature:
Add an LLVM device for which kernels are launched using the LLVM JIT.

Which milestone does the feature belong to? 0.8.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=25

Error during make install

From [email protected] on March 08, 2011 13:11:59

What steps will reproduce the problem?
1. Download the ocelot-1.3.967 package from https://code.google.com/p/gpuocelot/downloads/detail?name=ocelot-1.3.967.tar.bz2&can=2&q=
2. Run ./configure; make; sudo make install

What is the expected output? What do you see instead?
Below is the error I get. It looks like the TestLLVMKernels.h file is included twice in the list.

/usr/bin/install -c -m 644 ocelot/executive/test/TestGPUKernel.h ocelot/executive/test/TestLLVMKernels.h ocelot/executive/test/TestEmulator.h ocelot/executive/test/TestLLVMKernels.h ocelot/executive/test/sequence.ptx ocelot/executive/test/kernels.ptx '/usr/local/include/ocelot/executive/test'
/usr/bin/install: will not overwrite just-created `/usr/local/include/ocelot/executive/test/TestLLVMKernels.h' with `ocelot/executive/test/TestLLVMKernels.h'
make[2]: *** [install-nobase_includeHEADERS] Error 1
make[2]: Leaving directory `/home/animus/Work/simulators/ocelot-1.3.967'
make[1]: *** [install-am] Error 2
make[1]: Leaving directory `/home/animus/Work/simulators/ocelot-1.3.967'
make: *** [install] Error 2

What version of the product are you using? On what operating system?
I am using ocelot-1.3.967 on Ubuntu 10.04. Make version is 3.81. Install version is 8.5.

Please provide any additional information below.
This doesn't seem to be a fatal error for NVIDIA GPU emulation. The libraries are installed and they work. I am just not sure whether this error has any effect on the working of the simulator.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=48

Add Support for CUDA 2.3

From [email protected] on June 22, 2009 16:21:24

Describe the New Feature:
1. Download the new toolkit and SDK.
2. Create a new parser if there is a new version of PTX.
3. Dump the .ptx files from each sdk sample into the test directory.
4. Create a test suite for the 2.3 sdk examples. Make sure that it passes.

Which milestone does the feature belong to? 0.4.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=3

Install Compile Failure on ArchLinux 64 bit Boost 1.43

From [email protected] on October 14, 2010 01:51:25

What steps will reproduce the problem?
1. install boost
2. install glew
3. ocelot make fails

What is the expected output? What do you see instead?
compiler error

What version of the product are you using? On what operating system?
Archlinux circa Oct. 2010
boost library 1.43.0-1

Please provide any additional information below.
During make the following error from a boost include file is observed:
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I ./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-KernelEntry.lo -MD -MP -MF .deps/libocelot_la-KernelEntry.Tpo -c ocelot/trace/implementation/KernelEntry.cpp -fPIC -DPIC -o .libs/libocelot_la-KernelEntry.o
In file included from /usr/include/boost/interprocess/sync/file_lock.hpp:24:0,
from ocelot/trace/implementation/KernelEntry.cpp:22:
/usr/include/boost/interprocess/detail/move.hpp: In function ‘typename boost::remove_reference<T>::type&& boost::interprocess::move(T&&) [with T = boost::interprocess::file_lock&, typename boost::remove_reference<T>::type = boost::interprocess::file_lock]’:
/usr/include/boost/interprocess/sync/file_lock.hpp:68:52: instantiated from here
/usr/include/boost/interprocess/detail/move.hpp:342:11: error: invalid initialization of reference of type ‘boost::remove_reference<boost::interprocess::file_lock&>::type&&’ from expression of type ‘boost::interprocess::file_lock’
make[1]: *** [libocelot_la-KernelEntry.lo] Error 1
make[1]: Leaving directory `/opt/ocelot-1.1.560'
make: *** [all] Error 2

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=45

Incorrect results in unrolled loops

From [email protected] on August 15, 2009 14:57:03

What steps will reproduce the problem?
1. Run the mri-fhd parboil benchmark with and without loop unrolling in the inner loop.

What is the expected output? What do you see instead?
1. With unrolling, the kernel produces incorrect outputs starting with the first ld.const compared to the same kernel without unrolling.

What version of the product are you using? On what operating system?
Ubuntu 9.04, r107

Please provide any additional information below.
This bug is also causing incorrect results in the mandelbrot 2.2 sdk example without manual rolling of loop bodies. We need a simple test case to reproduce this before we can start diagnosing the problem in detail. It is not obvious from examining the dataflow traces from either example.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=20

Triage LLVM Failures

From [email protected] on September 22, 2009 07:01:30

What steps will reproduce the problem?
1. Compile ocelot with llvm support enabled.
2. Select the LLVM JIT device.
3. Run the SDK samples.

What is the expected output? What do you see instead?
Very few of the SDK samples will execute and produce correct results. We need to identify exactly which programs fail and which complete. Then we need to begin stepping through individual examples and modifying the LLVMEmulatedKernel, translator, and PTX optimization passes to fix any problems.

Please use labels and text to provide additional information.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=28

Lazy Evaluation of PTX Kernels in the CUDA Runtime

From [email protected] on August 13, 2009 16:15:55

Describe the New Feature:
The current implementation loads and parses all PTX kernels declared within a program upon kernel registration. Make this lazily evaluated instead.

Ideally, registering a kernel should add an entry with a flag saying that it has not yet been parsed. Upon the first execution, it should be translated and then executed and the flag should be updated.

Which milestone does the feature belong to? 1.0.0
Which branch does the new feature go in? Trunk
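
The flag-based scheme described above might be sketched as follows. `LazyKernelRegistry` is a hypothetical class, and the parse and translate work is replaced by a counter so that the deferred behaviour is easy to observe.

```cpp
#include <map>
#include <string>

// Hypothetical registry: registration only stores the PTX text; parsing
// is deferred until the first launch of each kernel.
class LazyKernelRegistry {
public:
    LazyKernelRegistry() : parseCount(0) {}

    void registerKernel(const std::string& name, const std::string& ptx) {
        Entry entry = {ptx, false};
        _kernels[name] = entry;
    }

    // Returns false if the kernel was never registered.
    bool launch(const std::string& name) {
        std::map<std::string, Entry>::iterator it = _kernels.find(name);
        if (it == _kernels.end()) return false;
        if (!it->second.parsed) {
            ++parseCount;  // placeholder for parsing and translating the PTX
            it->second.parsed = true;
        }
        return true;  // placeholder for executing the kernel
    }

    int parseCount;  // number of kernels parsed so far

private:
    struct Entry {
        std::string ptx;
        bool parsed;
    };
    std::map<std::string, Entry> _kernels;
};
```

Kernels that are registered but never launched are never parsed, which is the saving the feature request is after.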

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=19

Add Shared Memory Race Detection

From [email protected] on October 29, 2009 11:03:39

Describe the New Feature: From SPWorley:

Just had a little brainstorm... not a feature request, but something
that may inspire some more memory debugging/error checking for
Ocelot's emulator.

It could be straightforward for Ocelot to detect all shared memory
thread race conditions. These happen when one thread writes to shared
memory and a different thread reads, but with unspecified thread
ordering, that read may have occurred either before or after the write
so your results become uncertain. These bugs are usually hard to track
down since they often work and break only later when some other
innocuous change is made.

Ocelot could detect these shared memory ordering races pretty easily
with a little overhead in memory and time.

Allocate two 16K int (or short) buffers.. these are basically two
tracking variables per shared memory location to track which thread
has read or written to a particular byte of shared memory. At the
start of a block and every threadsync(), initialize the two buffers to
0xFFFF, meaning "no access yet."
If a thread reads from a shared memory location, mark the "read"
buffer with the thread ID. If a thread writes to a shared memory
location, mark the "write" buffer with that thread ID.

Typical race errors will be detected by checking that the read and
write arrays never have two different thread IDs in them. It's OK if
one thread reads and writes. It's OK if lots of threads read and
nobody writes. It's even OK if lots of threads write and nobody
reads. But once you detect that you have a reader and writer with
different thread IDs, you fire off the "warning, potential race
condition detected."

If multiple threads read, then you could mark the array with a
"multiple readers" flag instead of the threadID and then ANY write
before or after would be a race. Same for multiple writers.

This idea may not be too practical or important but I thought I'd
share it while it was still fresh in my mind.

Which milestone does the feature belong to? 2.0.x
Which branch does the new feature go in? Trunk
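
The two-buffer scheme proposed above can be sketched as follows. This is only an illustration under stated assumptions, not emulator code: the class name `SharedRaceDetector`, the `0xFFFE` value used for the "multiple threads" flag, and the requirement that thread IDs stay below `0xFFFE` are all choices made here. `barrier()` stands in for the reset at block start and at every synchronization point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t kNoAccess = 0xFFFF;  // "no access yet" marker from the proposal
constexpr std::uint16_t kMultiple = 0xFFFE;  // assumed flag for "several threads"

// Tracks, per byte of shared memory, which thread last read and last wrote it.
class SharedRaceDetector {
public:
    explicit SharedRaceDetector(std::size_t sharedBytes)
        : lastReader_(sharedBytes, kNoAccess),
          lastWriter_(sharedBytes, kNoAccess) {}

    // Call at the start of a block and at every barrier.
    void barrier() {
        std::fill(lastReader_.begin(), lastReader_.end(), kNoAccess);
        std::fill(lastWriter_.begin(), lastWriter_.end(), kNoAccess);
    }

    // Returns true when the read conflicts with a write by another thread.
    bool recordRead(std::uint16_t thread, std::size_t address) {
        assert(address < lastReader_.size());
        const bool race = conflicts(lastWriter_[address], thread);
        mark(lastReader_[address], thread);
        return race;
    }

    // Returns true when the write conflicts with a read by another thread.
    bool recordWrite(std::uint16_t thread, std::size_t address) {
        assert(address < lastWriter_.size());
        const bool race = conflicts(lastReader_[address], thread);
        mark(lastWriter_[address], thread);
        return race;
    }

private:
    // A slot conflicts if some thread other than `thread` already touched it.
    static bool conflicts(std::uint16_t owner, std::uint16_t thread) {
        return owner != kNoAccess && owner != thread;
    }

    // Remember a single owner, or collapse to the "multiple" flag.
    static void mark(std::uint16_t& slot, std::uint16_t thread) {
        if (slot == kNoAccess) {
            slot = thread;
        } else if (slot != thread) {
            slot = kMultiple;
        }
    }

    std::vector<std::uint16_t> lastReader_;
    std::vector<std::uint16_t> lastWriter_;
};
```

Many readers with no writer, or many writers with no reader, never report a race, which matches the cases the proposal lists as acceptable.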

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=31

Implement Register Allocation

From [email protected] on June 28, 2009 20:17:53

Describe the New Feature:
1. The current register allocation scheme simply maps PTX register
variables to unique identifiers.
2. Implement a graph coloring register allocator.
3. Implement a linear scan register allocator.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=5

Function Calls are not handled correctly in the emulator or parser

From [email protected] on July 07, 2009 10:08:23

What steps will reproduce the problem?
1. Compile any CUDA device function with the directive noinline

What is the expected output? What do you see instead?
Before we were under the impression that the compiler never generated any
function calls. It turns out that it does. So we should provide support
for this in the parser and the emulator.

We need to add unit tests for the lexer/parser that contain device function
calls. We also need to add functionality to the code generator for the
emulator to add code of all referenced function calls to a kernel binary.
Finally, we need to add support for recursive function calls since they are
likely to come out in future releases...

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=8

Add a code generator for PTX and a GPU target

From [email protected] on September 16, 2009 17:48:48

Describe the New Feature:
We can already create a kernel of PTX instructions. However, we do not
currently have a way of launching it on a GPU. We need the following new
features:

1. A new executable kernel class that launches a PTX kernel on a GPU device
using the CUDA driver level API.
2. Additions to the Executive class to support detection of NVIDIA GPUs,
translation to executable GPU kernels, and memory allocation and copies
into the GPU address space.
3. Unit tests to make sure that this functionality works.

Which milestone does the feature belong to? 0.9.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=27

Add Windows and MACOS support..

From [email protected] on October 31, 2009 18:13:33

Now that you have LLVM multicore support working (I don't know if you are
aware of Nvidia's similar efforts on CUDA for CPUs:
llvm.org/devmtg/2009-10/Grover_PLANG.pdf
and they also have a video.
I hope they have it ready for CUDA 3.0, which seems to be getting a beta by
SC09, and I hope they release PTX spec v1.5),
and assuming CUDA multicore doesn't get released in this beta, I'm interested in
porting your project to Windows and MacOS, assuming it can be done with no
major changes: perhaps getting rid of the configure/automake stuff
and perhaps fixing some system-specific code.
It would be good if you could share your thoughts and point me to any big
issues you expect me to run into that I have not already planned for.

So perhaps the plan is:
I will first try to test on MacOS, which would allow catching very
Linux-specific issues.
Then from there try to use CMake as the build system.
Assuming all goes well on Linux and MacOS, attempt the Windows port.
This will first expose two kinds of portability issues:
*GCC vs Visual Studio issues:
are you using some C99 or other GCC-specific code that isn't supported?
*OS-specific API stuff.
I expect I will have to learn to build:
Boost
Pthreads
LLVM

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=32

Add External Trace Generator API

From [email protected] on July 15, 2009 12:01:06

Describe the New Feature:
There should be a high level API call available to CUDA programs that
specifies a trace generator to be attached to the next launched kernel.

For example:

{{{
BranchTraceGenerator generator;
ocelotAddTraceGenerator( generator );
somekernel<<< ctas, threads, memory >>>(parameter);
}}}

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=13

2-Element Vectors of floats are broken in the llvm backend on 32-bit platforms

From [email protected] on February 20, 2010 15:15:08

What steps will reproduce the problem?
See this bug report from llvm: http://hlvm.llvm.org/bugs/show_bug.cgi?id=3287

What is the expected output? What do you see instead?
Loads to 2-element vectors of floats randomly produce nan values.

What version of the product are you using? On what operating system?
32-bit platforms using LLVM.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=39

Linear Texture Interpolation on 32-bit installs

From [email protected] on July 12, 2009 12:40:38

What steps will reproduce the problem?
1. Install Ocelot on Ubuntu 8.10-32-bit
2. Run the Dxt8x8 regression test.

What is the expected output? What do you see instead?
Note that the first example does not match the reference.

Please use labels and text to provide additional information.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=11

Mistake in lexer

From [email protected] on August 10, 2009 17:49:59

In the newest release available here, ptx.lpp has this definition for
octal values:
OCT_CONSTANT (0[0123456]*)
but 07 is a proper octal value.
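
The gap can be reproduced outside the lexer. The sketch below uses `std::regex` as a stand-in for the flex rule (the character-class syntax is the same for this pattern); the function names and the widened class `0[0-7]*` are assumptions made here, not the project's actual fix.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Pattern as quoted in the report: the character class stops at 6.
inline bool matchesReportedOctal(const std::string& token) {
    static const std::regex pattern("0[0123456]*");
    return std::regex_match(token, pattern);
}

// Assumed fix: widen the class to cover every octal digit, 0 through 7.
inline bool matchesFixedOctal(const std::string& token) {
    static const std::regex pattern("0[0-7]*");
    return std::regex_match(token, pattern);
}

// Checks the case from the report: "07" is rejected by the quoted rule
// and accepted once the class includes 7.
inline void demonstrateLexerGap() {
    assert(!matchesReportedOctal("07"));
    assert(matchesFixedOctal("07"));
}
```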

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=18

1.1.560 doesn't build

From [email protected] on January 15, 2011 14:11:15

I downloaded the 1.1.560 release, but:

/bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I/usr/include -DNDEBUG -D_GNU_SOURCE -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -I ./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-ControlFlowGraph.lo -MD -MP -MF .deps/libocelot_la-ControlFlowGraph.Tpo -c -o libocelot_la-ControlFlowGraph.lo test -f 'ocelot/ir/implementation/ControlFlowGraph.cpp' || echo './'ocelot/ir/implementation/ControlFlowGraph.cpp
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I/usr/include -DNDEBUG -D_GNU_SOURCE -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -I ./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-ControlFlowGraph.lo -MD -MP -MF .deps/libocelot_la-ControlFlowGraph.Tpo -c ocelot/ir/implementation/ControlFlowGraph.cpp -fPIC -DPIC -o .libs/libocelot_la-ControlFlowGraph.o
cc1plus: warnings being treated as errors
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::const_iterator ir::ControlFlowGraph::BasicBlock::get_edge(std::listir::ControlFlowGraph::BasicBlock::const_iterator) const':
ocelot/ir/implementation/ControlFlowGraph.cpp:94:1: error: control reaches end of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::iterator ir::ControlFlowGraph::BasicBlock::get_edge(std::listir::ControlFlowGraph::BasicBlock::iterator)':
ocelot/ir/implementation/ControlFlowGraph.cpp:85:1: error: control reaches end of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::const_iterator ir::ControlFlowGraph::BasicBlock::get_branch_edge() const':
ocelot/ir/implementation/ControlFlowGraph.cpp:76:1: error: control reaches end of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::iterator ir::ControlFlowGraph::BasicBlock::get_branch_edge()':
ocelot/ir/implementation/ControlFlowGraph.cpp:67:1: error: control reaches end of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::const_iterator ir::ControlFlowGraph::BasicBlock::get_fallthrough_edge() const':
ocelot/ir/implementation/ControlFlowGraph.cpp:58:1: error: control reaches end of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 'std::listir::ControlFlowGraph::BasicBlock::Edge::iterator ir::ControlFlowGraph::BasicBlock::get_fallthrough_edge()':
ocelot/ir/implementation/ControlFlowGraph.cpp:49:1: error: control reaches end of non-void function
make[1]: *** [libocelot_la-ControlFlowGraph.lo] Error 1
make[1]: Leaving directory `/home/realnc/tmp/ocelot-1.1.560'
make: *** [all] Error 2

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=47

Memory checker is not running.

From [email protected] on August 16, 2009 08:58:31

Hi,
I downloaded ocelot from svn today, ran
libtoolize
aclocal
autoconf
automake
./configure --prefix=SOMEPATH
make
make install

and then used your example from wiki, for memory checker
Compiled it with
g++ -o mem mem.cu.cpp -L /usr/local/cuda/lib64/ -L SOME_PATH/lib -lcudart -lOcelotIr -lOcelotParser -lOcelotExecutive -lOcelotTrace -lOcelotAnalysis -lhydrazine
and running it by
./mem 1
doesn't print anything.

For a simple test of whether ocelot is running at all, I did:
export CUDA_PROFILE=1
and ran
./mem 3
which did not produce cuda_profile.log, which might suggest that ocelot
took over normal execution.

Also, running ./mem 1 doesn't produce any files.

CHECK_GLOBAL_ACCESSES is defined to 1.

I am using CUDA 2.3, but as far as I understand that shouldn't be an
issue, as this kernel shouldn't produce unknown ptx instructions.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=21

Software texture sampling off by one pixel

From [email protected] on November 05, 2009 14:06:39

To reproduce problem:

1. Execute 'SimpleTexture' and 'BicubicTexture' applications compiled using
the native CUDA toolchain executing on the GPU.
2. Copy their outputs into the 'data/' directory to be used as reference
inputs for GPU Ocelot.
3. Execute SimpleTexture and BicubicTexture with emulated and LLVM devices.

The expected output should match the reference inputs. Instead, the output
consists of images shifted by approximately one pixel.

The current reference inputs were produced by GPU Ocelot and not provided
by the CUDA toolchain. We should see this as a defect in the texture
sampling procedures used by GPU Ocelot's emulated and translated devices.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=34

Detect Global Memory Access Violations

From [email protected] on July 09, 2009 16:06:24

Describe the New Feature:
Instrument all LD/ST/TEX instructions to verify that there is a valid
device memory region allocated before doing the access.

Which milestone does the feature belong to? 0.5.0
Which branch does the new feature go in? Trunk
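
The check that such instrumentation would call before each access might be sketched as below. `AllocationTable` and its methods are hypothetical names used for illustration, and the sketch assumes allocations do not overlap; it is not how Ocelot tracks device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Records live device allocations as base address -> size in bytes.
class AllocationTable {
public:
    void allocate(std::uintptr_t base, std::size_t bytes) {
        assert(bytes > 0);
        regions_[base] = bytes;
    }

    void release(std::uintptr_t base) { regions_.erase(base); }

    // True when [address, address + bytes) lies inside one live allocation.
    bool isValidAccess(std::uintptr_t address, std::size_t bytes) const {
        // Find the first region starting after `address`, then step back to
        // the only region that could contain it.
        auto next = regions_.upper_bound(address);
        if (next == regions_.begin()) {
            return false;
        }
        auto region = std::prev(next);
        const std::uintptr_t offset = address - region->first;
        return offset < region->second && bytes <= region->second - offset;
    }

private:
    std::map<std::uintptr_t, std::size_t> regions_;
};
```

Keeping the regions in an ordered map makes each lookup logarithmic in the number of allocations, which matters when every load and store is instrumented.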

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=10

Problems Linking to Ocelot While Using CUBLAS and CUFFT

From [email protected] on October 20, 2010 18:53:40

What steps will reproduce the problem?
1. Create a program that uses CUBLAS
2. Replace -lcudart with -locelot
3. Run program

What is the expected output? What do you see instead?
I expect the emulator to be invoked. Instead cudart gets linked in.

What version of the product are you using? On what operating system?
1.1.56 well actually a subversion copy from around Oct. 18 2010

Please provide any additional information below.
I deliberately introduced a memory read error, but it wasn't caught by ocelot. Looks like my linker still found cudart instead of ocelot.

I found these instructions from a third party site:
"
To use Ocelot with any pre-compiled CUDA libraries (such as CUFFT or CUBLAS), the libraries must be compiled as shared objects and must be linked in the correct order. The order is PRECOMPILED_LIBRARIES OCELOT_LIBRARIES YOUR_PROGRAM . This is REQUIRED to ensure that global constructors are called in the correct order. It is an artifact of how CUDA is designed and impossible for us to change. "

I tried that build order in the g++ command, but it seemed to make no difference. An example of linking in CUFFT or CUBLAS while using the ocelot emulator would be very helpful. I'm trying to use the memory checker.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=46

Can't check-out svn repos using the command line provided in wiki

From [email protected] on December 22, 2009 04:24:48

What steps will reproduce the problem?
1. svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot

It failed with below error information
A gpuocelot/tests/ptx/sdk/MonteCarlo_SM10.ptx
A gpuocelot/tests/ptx/sdk/bandwidthTest.ptx
svn: In directory 'gpuocelot/tests/ptx/sdk'
svn: Can't open file
'gpuocelot/tests/ptx/sdk/.svn/tmp/text-base/particleSystem.ptx.svn-base':
No such file or directory

I tried svn 1.6.5 on Mac OS X 10.6.2 and svn 1.6.6 on Windows XP.

Seems like something wrong with this file on server. Any idea? Thanks.

Original issue: http://code.google.com/p/gpuocelot/issues/detail?id=36
