
TauFW's Issues

Parallel processing in python 3: `RDataFrame`?

Unfortunately, the current parallel processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to python 3. Segmentation faults seem to be caused by a conflict between how python and ROOT handle their objects in memory. (The parallel processing is done by (ab)using python's multithreading.)

As a consequence, I am starting to look into completely redesigning the SampleSet.gethist/MergedSample.gethist/Sample.gethist routines using RDataFrame, which has been native to ROOT since v6.14. I will probably make this the default routine, replacing the old one based on python's multithreading in Plotter/python/plot/MultiThread.py and on MultiDraw.py/MultiDraw.cxx. The latter also has some unexpected behavior for array branches of variable length.

Besides solving the memory issues, this should be more performant because we can string together multiple instances of RDataFrame (see this section of the class reference):

from ROOT import RDataFrame, RDF, EnableImplicitMT
EnableImplicitMT() # use all available cores within each event loop
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
df1_sel = df1.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight")
df2_sel = df2.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight")
res1 = df1_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight") # weight must be a defined column
res2 = df2_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight")
RDF.RunGraphs([res1,res2]) # runs the df1 and df2 event loops concurrently

and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.

Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup processed multiple variables and samples in parallel, but only sequentially over selections.)
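For illustration, a minimal sketch of how multiple selections and variables could be booked lazily on a single dataframe before triggering one combined event loop (the selection and variable definitions below are just placeholders):

from ROOT import RDataFrame, RDF
df = RDataFrame("tree", "DY.root")
selections = { # placeholder selections
  'OS': "q_1*q_2<0",
  'SS': "q_1*q_2>0",
}
variables = [ # placeholder (variable, nbins, xmin, xmax)
  ('pt_1', 50, 0, 250),
  ('pt_2', 50, 0, 250),
]
results = [ ]
for selname, cut in selections.items():
  df_sel = df.Filter(cut) # lazy: nothing runs yet
  for varname, nbins, xmin, xmax in variables:
    hname = "%s_%s"%(varname,selname)
    results.append(df_sel.Histo1D((hname,hname,nbins,xmin,xmax),varname,"genweight"))
RDF.RunGraphs(results) # all histograms are filled in a single pass over the tree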

NanoAOD producer

It would be interesting to add a simple setup to process miniAOD into nanoAOD (via CRAB, or using the existing batch tools), so we no longer have to rely on the official production. We could adapt a setup like this one: https://github.com/IzaakWN/CRAB, and create a NanoProducer package in the TauFW.

HTCondor configuration & singularity for SLC7/CentOS7 compatibility

Issue: Environment not set

Since March, HTCondor jobs on lxplus have not had the CMSSW environment set correctly, nor the JOBID and TASKID variables defined in submit_HTCondor.sub. This causes the following error and subsequent job failure:

Traceback (most recent call last):
  File "/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8_g-2/src/TauFW/PicoProducer/python/processors/picojob.py", line 8, in <module>
    from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
  File "/usr/lib64/python3.6/site-packages/ROOT/_facade.py", line 150, in _importhook
    return _orig_ihook(name, *args, **kwds)
ModuleNotFoundError: No module named 'PhysicsTools'

Our hacky workaround was to hardcode our individual CMSSW_BASE path in the executable submit_HTCondor.sh script and do cmsenv...

The cause appears to be that newer HTCondor versions use a "new syntax" (documented here), so we simply have to change

getenv                = true
environment           = JOBID=$(ClusterId);TASKID=$(ProcId)

to

getenv                = true
environment           = "JOBID=$(ClusterId) TASKID=$(ProcId)"

I'll make a PR with a patch asap.

Issue: SLC7/CC7/CentOS7 compatibility on lxplus

CERN's lxplus is phasing out CentOS7 by the end of June 2024 (see this announcement and this page).

If we want to keep using CMSSW 11 or 12 on an SLC7 architecture, we have to use a Singularity container on the lxplus user nodes and in HTCondor jobs, see this page:

CMSSW_BASE="/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8" # the release area, without /src
cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE" # setup singularity & pass environment variable
cd $CMSSW_BASE/src
cmsenv
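For the HTCondor jobs themselves, my understanding is that the CERN batch system lets one request an EL7 container directly in the submit file (to be confirmed against the batch documentation):

# in submit_HTCondor.sub: run the job inside an EL7 (CentOS7) container
MY.WantOS = "el7"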

I'll add this in a future PR as well, and update the instructions in the documentation...

Global tags to make analysis modules more release-independent?

At the moment, things become messy when we want to analyze new nanoAOD versions that have a different set of tau IDs.

It would be desirable to have the same analysis code tackle different nanoAOD versions that contain different tau IDs (e.g. DeepTau2017v2p1VSjet vs. DeepTau2017v2p5VSjet, etc.), and to keep track of this with something like a global tag that selects the branches we are interested in. In this way, we can retain compatibility with different nanoAOD versions without code duplication.
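To sketch the idea, a global tag could simply key a lookup table of version-dependent branch names (all tags and branch names below are hypothetical placeholders):

# hypothetical lookup of version-dependent tau ID branches, keyed by a global tag
branchmap = {
  'nanoV9':  { 'idvsjet': 'Tau_idDeepTau2017v2p1VSjet' }, # placeholder
  'nanoV10': { 'idvsjet': 'Tau_idDeepTau2017v2p5VSjet' }, # placeholder
}
tag = 'nanoV10' # set once per sample or campaign
idvsjet = branchmap[tag]['idvsjet'] # analysis modules only refer to the alias 'idvsjet'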

Event-based splitting of jobs

Right now, jobs are split by number of files. However, the number of entries varies wildly between nanoAOD files. If the submission routine in pico.py allowed for event-based splitting of jobs, it would be possible to create jobs and output files that are more uniform in length and size, and to fine-tune batch submission parameters such as the maximum run time more easily. With event-based splitting, smaller files can be combined into one job, or a single large file can be split over several jobs.

It would not be too hard to implement, I think.

The post-processor already allows one to define a start event index and a maximum number of events, so "all" one needs to do is add these as options to the job argument list.
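In nanoAOD-tools these correspond, if I remember the interface correctly, to the firstEntry and maxEntries arguments of the PostProcessor (a minimal sketch; paths and options are placeholders):

from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
# process entries [1000,2000) of one input file (placeholder path, no modules)
processor = PostProcessor(".",["nano_1.root"],firstEntry=1000,maxEntries=1000,postfix="_skim")
processor.run()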

But first one needs to split the files into chunks that may or may not overlap. Right now the chunks are made here:

fchunks = chunkify(infiles,nfilesperjob_) # file chunks
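A rough sketch of what an event-based alternative could look like (chunkify_by_evts and getnevents are hypothetical helpers, not existing TauFW functions):

def chunkify_by_evts(fnames,maxevts):
  """Split files into (fname,first,last) chunks of at most maxevts events each."""
  chunks = [ ]
  for fname in fnames:
    nevts = getnevents(fname) # hypothetical: look up the tree's number of entries
    if nevts<=maxevts:
      chunks.append([(fname,0,nevts)]) # small file: one chunk (could be merged with others)
    else:
      for first in range(0,nevts,maxevts): # large file: split into several chunks
        chunks.append([(fname,first,min(first+maxevts,nevts))])
  return chunks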

Currently, the chunks are saved as a dictionary in the JSON job config file for bookkeeping during resubmission, e.g.

"chunkdict": {
  "0": [ "nano_1.root",  "nano_2.root" ]
  "1": [ "nano_2.root",  "nano_3.root" ]
  ...
}

The trickiest part is saving this in the config format for bookkeeping in the resubmission and status routines; this is where a lot of bugs might creep in if the information is not stored and retrieved correctly. The simplest and most compact solution would be to append the event range to the usual filename in the chunk dictionary of the config JSON file,

"chunkdict": {
  "0": [ "nano_1.root:0:1000" ]
  "1": [ "nano_1.root:1000:2000" ]
  ...
}

and parse it in checkchunks.
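Parsing such entries back could be as simple as (a sketch, assuming the fname:first:last format proposed above):

def parsechunk(entry):
  """Split 'fname:first:last' into (fname,first,last); plain filenames keep the full range."""
  parts = entry.rsplit(':',2) # rsplit, so colons in xrootd URLs are left alone
  if len(parts)==3 and parts[1].isdigit() and parts[2].isdigit():
    return parts[0], int(parts[1]), int(parts[2])
  return entry, 0, -1 # -1 meaning "process all events" (hypothetical convention)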

It should be possible. I plan to implement it in the near future.

python3 compatibility & coffea

Since coffea is very similar to nanoAOD-tools, it would be nice to have this option in the future to speed up processing of nanoAOD files.

I have looked at it, but the main difficulty at the moment is that I have not found a neat way to use python3 with ROOT inside CMSSW. Maybe it is first necessary to divorce TauFW from CMSSW (see issue #5), and/or some tricks are needed to set the ROOT version, for example with

source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh

In any case, code like pico.py and the common helper functions should be made python3-compatible in preparation, by using

from __future__ import print_function

and fixing all print statements without parentheses, as well as replacing iteritems() with items(), etc.
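For example, a minimal py2/py3-compatible pattern:

from __future__ import print_function # make print a function in python 2 as well
cuts = { 'pt_1': 50, 'pt_2': 50 } # placeholder dictionary
for branch, cut in cuts.items(): # items() works in both; iteritems() is python2-only
  print("%s > %s GeV"%(branch,cut))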
