cms-tau-pog / TauFW
Analysis framework for tau analysis at CMS using NanoAOD
At some point it should be made possible to have the TauFW independent of CMSSW, as CMSSW is not strictly necessary; one could add a standalone setup script like the one nanoAOD-tools has.
Some work is needed to make this possible. In several places a CMSSW_BASE environment variable is assumed, as in RecoilCorrectionTool.py, and the b-tag tool uses the CMSSW calibrator and flavor enumeration.
Unfortunately, the current parallel-processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to python 3. The segmentation faults seem to be caused by a conflict between how python and ROOT manage their objects in memory. (The parallel processing is done by (ab)using python's multithreading.)
As a consequence, I am starting to look into completely redesigning the SampleSet.gethist / MergedSample.gethist / Sample.gethist routines using RDataFrame, which has been native to ROOT since v6.14. I will probably make this the default routine, replacing the old one based on python's multithreading in Plotter/python/plot/MultiThread.py and MultiDraw.py / MultiDraw.cxx. The latter also shows some unexpected behavior for array branches of variable length.
Besides solving the memory issues, this should be more performant, because we can string together multiple instances of RDataFrame (see this section of the class reference):
from ROOT import RDataFrame, RDF
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
df1_sel = df1.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight")
df2_sel = df2.Filter("q_1*q_2<0 && pt_1>50 && pt_2>50").Define("weight","genweight*idisoweight") # note: df2, not df1
res1 = df1_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight")
res2 = df2_sel.Histo1D(("pt_1","pt_1",50,0,250),"pt_1","weight")
RDF.RunGraphs([res1,res2]) # runs the df1 and df2 event loops concurrently
and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.
Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
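The combinatorics of such a "book everything, run once" approach can be sketched without ROOT: the sample names, selections, and variables below are purely illustrative, and each returned tuple would correspond to one lazily booked histogram that RDF.RunGraphs could then process in a single event loop per sample.

```python
from itertools import product

def book_histograms(samples, selections, variables):
    """Build the full list of histogram jobs (sample x selection x variable)
    that could be booked lazily on RDataFrames and run concurrently.
    Returns a list of (sample, selection, variable) tuples."""
    return list(product(samples, selections, variables))

# Hypothetical inputs: two samples, two selections, two variables
jobs = book_histograms(
    samples=["DY", "TT"],
    selections=["q_1*q_2<0", "q_1*q_2<0 && pt_1>50"],
    variables=["pt_1", "m_vis"],
)
print(len(jobs))  # 2 x 2 x 2 = 8 histograms booked, instead of 4 sequential passes
```

The point of the sketch is that all selections and variables become part of one booking plan, rather than one sequential pass per selection as in the previous setup.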
It would be interesting to add a simple setup to process miniAOD to nanoAOD (via CRAB, or using the existing batch tools), so we do not have to rely on the official production anymore. We can adapt a setup like this one, https://github.com/IzaakWN/CRAB, and create a NanoProducer package in the TauFW.
Since March, HTCondor jobs on lxplus do not have the CMSSW environment set correctly, nor JOBID or TASKID as defined in submit_HTCondor.sub. This causes the following error and subsequent job failure:
Traceback (most recent call last):
File "/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8_g-2/src/TauFW/PicoProducer/python/processors/picojob.py", line 8, in <module>
from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
File "/usr/lib64/python3.6/site-packages/ROOT/_facade.py", line 150, in _importhook
return _orig_ihook(name, *args, **kwds)
ModuleNotFoundError: No module named 'PhysicsTools'
Our hacky workaround was to hardcode our individual CMSSW_BASE path in the executable submit_HTCondor.sh script and do cmsenv...
The cause appears to be that newer HTCondor versions use a "new syntax" (documented here), and we simply have to change
getenv = true
environment = JOBID=$(ClusterId);TASKID=$(ProcId)
to
getenv = true
environment = "JOBID=$(ClusterId) TASKID=$(ProcId)"
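For context, a minimal submit file using the new environment syntax could look like the following sketch; the executable and log file names are placeholders, not necessarily those used in TauFW:

```
universe    = vanilla
executable  = submit_HTCondor.sh
getenv      = true
# New syntax: the whole value is quoted, entries separated by whitespace
environment = "JOBID=$(ClusterId) TASKID=$(ProcId)"
output      = job_$(ClusterId)_$(ProcId).out
error       = job_$(ClusterId)_$(ProcId).err
log         = job_$(ClusterId).log
queue
```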
I'll make a PR with a patch ASAP.
CERN's lxplus is phasing out CentOS7 by end of June 2024 (see this announcement and this page).
If we want to keep using CMSSW 11 or 12 on an SLC7 architecture, we have to use a Singularity container on lxplus user nodes and in HTCondor jobs; see this page:
CMSSW_BASE="/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8/src/TauFW/"
cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE" # setup singularity & pass environment variable
cd $CMSSW_BASE/src
cmsenv
I'll add this in a future PR as well, and update the instructions in the documentation...
At the moment, things become messy when we want to analyze new nanoAOD versions that have a different set of tau IDs.
It would be desirable to have the same analysis code tackle different nanoAOD versions that contain different tau IDs (e.g. DeepTau2017v2p1VSjet vs. DeepTau2017v2p5VSjet, etc.), and to keep track of them with something like a global tag that selects the branches we are interested in. In this way, we can retain compatibility with different nanoAOD versions without code duplication.
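As a sketch of the "global tag" idea, a simple mapping from a campaign tag to the available tau ID branches could be consulted by the analysis modules. The tag names and branch lists below are purely illustrative assumptions, not an agreed convention:

```python
# Hypothetical mapping from a nanoAOD "global tag" to the tau ID branches it provides
TAUID_BRANCHES = {
    "nanoV9":  {"vsjet": "idDeepTau2017v2p1VSjet", "vse": "idDeepTau2017v2p1VSe"},
    "nanoV12": {"vsjet": "idDeepTau2017v2p5VSjet", "vse": "idDeepTau2017v2p5VSe"},
}

def tauid_branch(tag, disc):
    """Look up the branch name for a discriminator ('vsjet', 'vse', ...) under a tag."""
    try:
        return TAUID_BRANCHES[tag][disc]
    except KeyError:
        raise KeyError("No tau ID branch for tag=%r, discriminator=%r" % (tag, disc))
```

Analysis modules would then ask for `tauid_branch(tag, "vsjet")` instead of hardcoding a branch name, so switching nanoAOD versions only changes the tag.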
Right now, jobs are split by number of files. However, the number of entries varies wildly between nanoAOD files. If the submission routine in pico.py allowed for event-based splitting of jobs, it would be possible to create jobs and output files that are more uniform in length and size, and to fine-tune batch submission parameters such as the maximum run time more easily. With event-based splitting, smaller files can be combined into one job, or a single large file can be split over several jobs.
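A greedy way to build such event-based chunks could look like this sketch; the file names and entry counts are made up, and in practice the entry counts would be read from the files themselves:

```python
def make_event_chunks(files, max_events):
    """Split/combine files into chunks of at most max_events entries.
    'files' maps file name -> number of entries. Each chunk is a list of
    (filename, first_entry, n_entries) tuples, so a large file can be split
    across several chunks and small files can share one chunk."""
    chunks, current, left = [], [], max_events
    for fname, nevts in files.items():
        first = 0
        while nevts > 0:
            take = min(nevts, left)           # fill the current chunk as far as possible
            current.append((fname, first, take))
            first += take
            nevts -= take
            left -= take
            if left == 0:                     # chunk full: start a new one
                chunks.append(current)
                current, left = [], max_events
    if current:                               # flush the last, partially filled chunk
        chunks.append(current)
    return chunks

# Hypothetical entry counts: one small and one large file, chunks of <= 1000 events
chunks = make_event_chunks({"nano_1.root": 400, "nano_2.root": 2100}, 1000)
```

Here the small file shares its chunk with the start of the large one, and the large file is split over three chunks in total.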
I think it would not be too hard to implement. The post-processor already allows one to define a start event index and a maximum number of events, so "all" one needs to do is add this as an option to the job argument list. But first one needs to split the files into chunks that may or may not overlap. Right now, chunks are made here:
TauFW/PicoProducer/scripts/pico.py
Line 671 in 4a6311c
"chunkdict": {
  "0": [ "nano_1.root", "nano_2.root" ],
  "1": [ "nano_2.root", "nano_3.root" ],
  ...
}
The trickiest part is to save it in this config format for bookkeeping in the resubmission and status routines. This is where a lot of bugs might creep in if the information is not stored and retrieved correctly. The simplest and most compact would be to simply add it to the end of the usual filename in the chunk dictionary of the config JSON file,
"chunkdict": {
  "0": [ "nano_1.root:0:1000" ],
  "1": [ "nano_1.root:1000:2000" ],
  ...
}
and parse it in checkchunks.
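Parsing such "file:first:last" strings back in checkchunks could be as simple as the following sketch. Note the assumptions: the two numbers are interpreted as an event range [first, last), converted to a post-processor-style (first entry, maximum number of events) pair, and -1 is used here as a made-up convention for "process all events":

```python
def parse_chunk(entry):
    """Parse 'nano_1.root:1000:2000' into (filename, first_entry, max_entries).
    Plain file names without an event range are passed through unchanged."""
    parts = entry.rsplit(":", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        fname, first, last = parts[0], int(parts[1]), int(parts[2])
        return fname, first, last - first  # convert [first, last) to a max-events count
    return entry, 0, -1  # hypothetical convention: -1 means "all events"
```

Using rsplit with at most two splits keeps file paths that themselves contain colons intact.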
It should be possible. I plan to implement it in the near future.
Since coffea is very similar to nanoAOD-tools, it would be nice to have this option in the future to speed up the processing of nanoAOD files.
I have looked at it, but the main difficulty at the moment is that I have not found a neat way to use python3 with ROOT in CMSSW. Maybe it's first necessary to divorce TauFW from CMSSW (see issue #5), and/or some tricks are needed to set the ROOT version, for example with
source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
In any case, code like pico.py and the common helper functions should be made python3-compatible in preparation, by using
from __future__ import print_function
and fixing all print statements without parentheses, as well as replacing iteritems() with items(), etc.
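The typical mechanical fixes are of this kind; this is a generic sketch, not actual TauFW code:

```python
from __future__ import print_function  # no-op in python3, enables print() in python2

d = {"pt_1": 50, "pt_2": 30}

# python2-only idioms and their python2/3-compatible replacements:
#   print "pt_1 =", d["pt_1"]        ->  print("pt_1 =", d["pt_1"])
#   for k, v in d.iteritems(): ...   ->  for k, v in d.items(): ...
for key, val in d.items():
    print(key, val)
```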
For skimming MC nanoAOD with a pre-selection cut, it would be good to keep track of the sum of PDF/scale weights before any pre-selection cut.
This could perhaps be implemented in Bookkeeper as separate histograms, or as a new module.
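A sketch of what such bookkeeping could look like; the class name and interface are hypothetical (the real Bookkeeper module may differ), and in practice the accumulated sums would be written out as histograms:

```python
class WeightBookkeeper:
    """Accumulate sums of PDF/scale weights for all events *before* any
    pre-selection cut, so the normalization of skimmed MC stays correct."""
    def __init__(self, nweights):
        self.sumw = [0.0] * nweights  # one counter per PDF/scale weight

    def fill(self, genweight, lheweights):
        # called for every event, before the skim selection is applied
        for i, w in enumerate(lheweights):
            self.sumw[i] += genweight * w

# Hypothetical per-event generator and LHE weights
book = WeightBookkeeper(nweights=3)
book.fill(1.0, [1.0, 0.5, 2.0])
book.fill(-1.0, [1.0, 1.0, 1.0])  # negative genweight, as in NLO samples
```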