cms-tau-pog / TauFW
Analysis framework for tau analysis at CMS using NanoAOD
At some point it should be made possible to have the TauFW independent of CMSSW, as CMSSW is not strictly necessary; one could add a standalone setup script like the one nanoAOD-tools has.
Some work is needed to make this possible. In several places a CMSSW_BASE environment variable is assumed, as in RecoilCorrectionTool.py, and the b-tag tool uses the CMSSW calibrator and flavor enumeration.
Unfortunately, the current parallel-processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to python 3. The segmentation faults seem to be caused by a conflict between how python and ROOT manage their objects in memory. (The parallel processing is done by (ab)using python's multithreading.)
As a consequence, I am starting to look into completely redesigning the SampleSet.gethist / MergedSample.gethist / Sample.gethist routines using RDataFrame, which has been native to ROOT since v6.14. I will probably make this the default routine, replacing the old one based on python's multithreading in Plotter/python/plot/MultiThread.py and MultiDraw.py / MultiDraw.cxx. The latter also shows some unexpected behavior for array branches of variable length.
Besides solving the memory issues, this should be more performant, because we can string together multiple instances of RDataFrame (see this section of the class reference):
from ROOT import RDataFrame, RDF
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
df1_sel = df1.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight")
df2_sel = df2.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight") # note: df2, not df1
res1 = df1_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight")
res2 = df2_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight")
RDF.RunGraphs([res1,res2]) # runs the df1 and df2 event loops concurrently
and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.
Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
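The combinatorics of such a "book everything, run once" approach can be sketched without ROOT: the sample names, selections, and variables below are purely illustrative, and each returned tuple would correspond to one lazily booked histogram that RDF.RunGraphs could then process in a single event loop per sample.

```python
from itertools import product

def book_histograms(samples, selections, variables):
    """Build the full list of histogram jobs (sample x selection x variable)
    that could be booked lazily on RDataFrames and run concurrently.
    Returns a list of (sample, selection, variable) tuples."""
    return list(product(samples, selections, variables))

# Hypothetical inputs: two samples, two selections, two variables
jobs = book_histograms(
    samples=["DY", "TT"],
    selections=["q_1*q_2<0", "q_1*q_2<0 && pt_1>50"],
    variables=["pt_1", "m_vis"],
)
print(len(jobs))  # 2 x 2 x 2 = 8 histograms booked, instead of 4 sequential passes
```

The point of the sketch is that all selections and variables become part of one booking plan, rather than one sequential pass per selection as in the previous setup.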
It would be interesting to add a simple setup to process miniAOD to nanoAOD (via CRAB, or using the existing batch tools), so we do not have to rely on the official production anymore. We can adapt a setup like this one, https://github.com/IzaakWN/CRAB, and create a NanoProducer package in the TauFW.
Since March, HTCondor jobs on lxplus do not have the CMSSW environment set correctly, nor JOBID or TASKID as defined in submit_HTCondor.sub. This causes the following error and subsequent job failure:
Traceback (most recent call last):
File "/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8_g-2/src/TauFW/PicoProducer/python/processors/picojob.py", line 8, in <module>
from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
File "/usr/lib64/python3.6/site-packages/ROOT/_facade.py", line 150, in _importhook
return _orig_ihook(name, *args, **kwds)
ModuleNotFoundError: No module named 'PhysicsTools'
Our hacky workaround was to hardcode our individual CMSSW_BASE path in the executable submit_HTCondor.sh script and do cmsenv...
The cause appears to be that newer HTCondor versions use a "new syntax" (documented here), and we simply have to change
getenv = true
environment = JOBID=$(ClusterId);TASKID=$(ProcId)
to
getenv = true
environment = "JOBID=$(ClusterId) TASKID=$(ProcId)"
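For context, a minimal submit file using the new environment syntax could look like the following sketch; the executable and log file names are placeholders, not necessarily those used in TauFW:

```
universe    = vanilla
executable  = submit_HTCondor.sh
getenv      = true
# New syntax: the whole value is quoted, entries separated by whitespace
environment = "JOBID=$(ClusterId) TASKID=$(ProcId)"
output      = job_$(ClusterId)_$(ProcId).out
error       = job_$(ClusterId)_$(ProcId).err
log         = job_$(ClusterId).log
queue
```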
I'll make a PR with a patch ASAP.
CERN's lxplus is phasing out CentOS7 by end of June 2024 (see this announcement and this page).
If we want to keep using CMSSW 11 or 12 on an SLC7 architecture, we have to use a Singularity container on lxplus user nodes and in HTCondor jobs; see this page:
CMSSW_BASE="/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8/src/TauFW/"
cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE" # setup singularity & pass environment variable
cd $CMSSW_BASE/src
cmsenv
I'll add this in a future PR as well, and update the instructions in the documentation...
At the moment, things become messy when we want to analyze new nanoAOD versions that have a different set of tau IDs.
It would be desirable to have the same analysis code tackle different nanoAOD versions that contain different tau IDs (e.g. DeepTau2017v2p1VSjet vs. DeepTau2017v2p5VSjet, etc.), and to keep track of them with something like a global tag that selects the branches we are interested in. In this way, we can retain compatibility with different nanoAOD versions without code duplication.
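As a sketch of the "global tag" idea, a simple mapping from a campaign tag to the available tau ID branches could be consulted by the analysis modules. The tag names and branch lists below are purely illustrative assumptions, not an agreed convention:

```python
# Hypothetical mapping from a nanoAOD "global tag" to the tau ID branches it provides
TAUID_BRANCHES = {
    "nanoV9":  {"vsjet": "idDeepTau2017v2p1VSjet", "vse": "idDeepTau2017v2p1VSe"},
    "nanoV12": {"vsjet": "idDeepTau2017v2p5VSjet", "vse": "idDeepTau2017v2p5VSe"},
}

def tauid_branch(tag, disc):
    """Look up the branch name for a discriminator ('vsjet', 'vse', ...) under a tag."""
    try:
        return TAUID_BRANCHES[tag][disc]
    except KeyError:
        raise KeyError("No tau ID branch for tag=%r, discriminator=%r" % (tag, disc))
```

Analysis modules would then ask for `tauid_branch(tag, "vsjet")` instead of hardcoding a branch name, so switching nanoAOD versions only changes the tag.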
Right now, jobs are split by number of files. However, the number of entries varies wildly between nanoAOD files. If the submission routine in pico.py allowed for event-based splitting of jobs, it would be possible to create jobs and output files that are more uniform in length and size, and to fine-tune batch submission parameters such as the maximum run time more easily. With event-based splitting, smaller files can be combined into one job, or a single large file can be split over several jobs.
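A greedy way to build such event-based chunks could look like this sketch; the file names and entry counts are made up, and in practice the entry counts would be read from the files themselves:

```python
def make_event_chunks(files, max_events):
    """Split/combine files into chunks of at most max_events entries.
    'files' maps file name -> number of entries. Each chunk is a list of
    (filename, first_entry, n_entries) tuples, so a large file can be split
    across several chunks and small files can share one chunk."""
    chunks, current, left = [], [], max_events
    for fname, nevts in files.items():
        first = 0
        while nevts > 0:
            take = min(nevts, left)           # fill the current chunk as far as possible
            current.append((fname, first, take))
            first += take
            nevts -= take
            left -= take
            if left == 0:                     # chunk full: start a new one
                chunks.append(current)
                current, left = [], max_events
    if current:                               # flush the last, partially filled chunk
        chunks.append(current)
    return chunks

# Hypothetical entry counts: one small and one large file, chunks of <= 1000 events
chunks = make_event_chunks({"nano_1.root": 400, "nano_2.root": 2100}, 1000)
```

Here the small file shares its chunk with the start of the large one, and the large file is split over three chunks in total.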
I think it would not be too hard to implement. The post-processor already allows one to define a start event index and a maximum number of events, so "all" one needs to do is add this as an option to the job argument list. But first one needs to split the files into chunks that may or may not overlap. Right now, chunks are made here:
TauFW/PicoProducer/scripts/pico.py
Line 671 in 4a6311c
"chunkdict": {
  "0": [ "nano_1.root", "nano_2.root" ],
  "1": [ "nano_2.root", "nano_3.root" ],
  ...
}
The trickiest part is to save it in this config format for bookkeeping in the resubmission and status routines. This is where a lot of bugs might creep in if the information is not stored and retrieved correctly. The simplest and most compact would be to simply add it to the end of the usual filename in the chunk dictionary of the config JSON file,
"chunkdict": {
  "0": [ "nano_1.root:0:1000" ],
  "1": [ "nano_1.root:1000:2000" ],
  ...
}
and parse it in checkchunks.
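Parsing such "file:first:last" strings back in checkchunks could be as simple as the following sketch. Note the assumptions: the two numbers are interpreted as an event range [first, last), converted to a post-processor-style (first entry, maximum number of events) pair, and -1 is used here as a made-up convention for "process all events":

```python
def parse_chunk(entry):
    """Parse 'nano_1.root:1000:2000' into (filename, first_entry, max_entries).
    Plain file names without an event range are passed through unchanged."""
    parts = entry.rsplit(":", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        fname, first, last = parts[0], int(parts[1]), int(parts[2])
        return fname, first, last - first  # convert [first, last) to a max-events count
    return entry, 0, -1  # hypothetical convention: -1 means "all events"
```

Using rsplit with at most two splits keeps file paths that themselves contain colons intact.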
It should be possible. I plan to implement it in the near future.
Since coffea is very similar to nanoAOD-tools, it would be nice to have this option in the future to speed up the processing of nanoAOD files.
I have looked at it, but the main difficulty at the moment is that I have not found a neat way to use python3 with ROOT in CMSSW. Maybe it's first necessary to divorce TauFW from CMSSW (see issue #5), and/or some tricks are needed to set the ROOT version, for example with
source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
In any case, code like pico.py and the common helper functions should be made python3-compatible in preparation, by using
from __future__ import print_function
and fixing all print statements without parentheses, as well as replacing iteritems() with items(), etc.
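The typical mechanical fixes are of this kind; this is a generic sketch, not actual TauFW code:

```python
from __future__ import print_function  # no-op in python3, enables print() in python2

d = {"pt_1": 50, "pt_2": 30}

# python2-only idioms and their python2/3-compatible replacements:
#   print "pt_1 =", d["pt_1"]        ->  print("pt_1 =", d["pt_1"])
#   for k, v in d.iteritems(): ...   ->  for k, v in d.items(): ...
for key, val in d.items():
    print(key, val)
```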
For skimming MC nanoAOD with a pre-selection cut, it would be good to keep track of the sum of PDF/scale weights before any pre-selection cut.
This could perhaps be implemented in Bookkeeper as separate histograms, or as a new module.
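A sketch of what such bookkeeping could look like; the class name and interface are hypothetical (the real Bookkeeper module may differ), and in practice the accumulated sums would be written out as histograms:

```python
class WeightBookkeeper:
    """Accumulate sums of PDF/scale weights for all events *before* any
    pre-selection cut, so the normalization of skimmed MC stays correct."""
    def __init__(self, nweights):
        self.sumw = [0.0] * nweights  # one counter per PDF/scale weight

    def fill(self, genweight, lheweights):
        # called for every event, before the skim selection is applied
        for i, w in enumerate(lheweights):
            self.sumw[i] += genweight * w

# Hypothetical per-event generator and LHE weights
book = WeightBookkeeper(nweights=3)
book.fill(1.0, [1.0, 0.5, 2.0])
book.fill(-1.0, [1.0, 1.0, 1.0])  # negative genweight, as in NLO samples
```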