
dmwm / ddm

Dynamic Data Management - Cache release and auto-replication of hot data

Shell 1.07% PLSQL 4.94% Python 31.72% HTML 15.08% JavaScript 37.40% CSS 3.66% SQLPL 0.33% PigLatin 4.88% Scala 0.89% TSQL 0.03%

ddm's Introduction

DDM

Dynamic Data Management - Cache release and auto-replication of hot data

ddm's People

Contributors

cvuosalo, domenicogiordano, giffels, kdziedzi, mmeoni, nikmagini, rcaspart, tonywildish

ddm's Issues

The schema name is hardcoded

The schema name is hardcoded to CMS_CLEANING_AGENT in dq2.victor.cms/lib/dq2/victor/victorDao.py and victor.monitoring.cms/lib/victorDao.py; it should be configurable instead.

extend the file-to-dataset association agent to cover all sites in addition to EOS

The current implementation of the file-to-dataset association agent assumes that only XRootD EOS popularity is available.
However, the XRootD popularity DB now also includes data from other sites.
The agent needs to be extended to cover associations for files other than the EOS ones.

In the meantime it would be useful to evaluate a redesign of the association agent, based on different queries to PhEDEx (and different query frequencies).

CMSSW Popularity: request a production DB instance

A production DB instance should be requested to collect the CMSSW Popularity data.
Currently the data are collected in the int11r DB, but this DB has no backup.

Please also evaluate whether a migration of the current data from int11r to the new production DB schema is needed, and if so coordinate with the DBA to perform the data migration.
I would strongly suggest moving the data to the production DB.

When the production DB schema is ready, redirect the collection workflow to this DB.

XRootD CMS EOS collection suffering from T0 tests

For the record:

On 2014-11-13 at ~21h, a stress test of the CMS T0 [1] caused a peak of XRootD monitoring messages from EOS, as can be seen in the following graph.
https://mig-graphite.cern.ch/graphlot/?from=20:00_20141113&until=08:59_20141114&target=scaleToSeconds(sumSeries(derivative(msgclt.received_messages.mb*.dashb.xrootd-cms-eos)),1)
The huge load, with peaks of 2.8 kHz of messages, caused trouble for GLED, which was restarted up to 6 times between Nov 13 at 22:21 and Nov 14 at 5:11.
In addition, the services running on dashb-ai-530 (consuming from /queue/Consumer.popularity.xrootd.cms.eos and populating INT2R)
could not sustain the huge load. The local disk immediately filled up, blocking the services and causing a pileup of messages in the message brokers.
A watchdog running on dashb-ai-530 to stop stompclt when disk inodes are above 80% was not able to mitigate the problem, because the message rate increase of three orders of magnitude was almost instantaneous.
In addition, the insertion rate of messages in INT2R is far smaller than in LCGR. Comparing the performance of the consumers inserting data in LCGR and INT2R, it can be seen that for LCGR it takes <1s to insert 1000 records in bulk, whereas it takes ~8s in INT2R.
The INT2R DBA has been contacted. Investigation is still ongoing.
The pileup of messages in the virtual queue can be seen here
https://mig-graphite.cern.ch/graphlot/?from=20:00_20141113&until=23:59_20141114&target=msgclt.stored_messages.mb*.dashb.xrootd-cms-eos-pop
Another scale test on CMS T0 is foreseen for today.
Domenico
[1]
From CMS Computing ELOG
Total of about 110k jobs, we were filling all the cores available to the project during some periods.
The number of to be submitted jobs went up to almost 40k temporarily.
[2]
[Thu Nov 13 22:20:10 CET 2014] gled-xrdmon-check: some collectors supposed to be running are stopped (crashed?), will be restarted
Collector 'prod2' is probably crashed, restarting
Stopping Gled XRootD transfers monitoring ('prod2'): [FAILED]
Starting Gled XRootD transfers monitoring ('prod2'): [ OK ]

Victorinterface API configuration for PhEDEx DATASERVICE_HOST hardcoded to cmsweb.cern.ch

Hi,

the configuration of the victorinterface API in config/popdbweb/conf.ini is currently hardcoded to query cmsweb.cern.ch for the PhEDEx datasvc, and cms-popularity.cern.ch for the PopDB APIs:

https://github.com/dmwm/DDM/blob/master/DataPopularity/popdb.web/config/conf.ini#L10-L15

It should be made configurable, pointing both DATASERVICE_HOST and
POPULARITY_HOST by default to the same cluster on which it is running (i.e. https://cmsweb-testbed.cern.ch or https://cmsweb.cern.ch or a private devvm for preprod/prod/dev respectively).

This comes with an additional problem: POPULARITY_HOST will require authentication if we set it to cmsweb(-testbed).

Cheers
N

Why does the popularity API return milliseconds?

The current popularity APIs return data in milliseconds since epoch. This makes the output very confusing and wastes resources. I can't imagine that milliseconds are required, and so far I have not seen the millisecond part of the bins actually filled. They could easily be changed to normal Unix seconds since epoch, making the output smaller in size and more easily parseable on the client side.
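Until the APIs change, a client-side workaround is straightforward; a minimal sketch, assuming each bin is a [timestamp_ms, value] pair (the exact JSON layout may differ):

```python
def ms_bins_to_seconds(bins):
    """Convert [ms_since_epoch, value] bins to Unix seconds.

    Assumes each bin is a [timestamp_ms, value] pair; the (apparently
    always empty) millisecond part is dropped by integer division.
    """
    return [[ts_ms // 1000, value] for ts_ms, value in bins]

# Two daily bins as the API reports them, in milliseconds since epoch:
bins = [[1340755200000, 42], [1340841600000, 17]]
print(ms_bins_to_seconds(bins))  # [[1340755200, 42], [1340841600, 17]]
```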

update link to support tickets

The following web pages [*] include a link for support requests from users.
The link points to the ticket system at https://svnweb.cern.ch/trac/popdb/,
which is now obsolete and no longer maintained.
It should be replaced by new links to the GitHub project (or something else t.b.d.).

Domenico

[*]

  1. submenu Support-> Wiki and BugReport for the urls
    https://cms-popularity.cern.ch/popdb
    and
    https://cms-popularity.cern.ch/popdb/xrdpopularity/

  2. submenu Bug&Requests at
    https://cms-popularity.cern.ch/victor/accounting

Long standing issue with DB merge among XRootD popularity and dashboard

Migration of XRootD popularity workflow from the integration DB instance in int2r to the production instance in LCGR shared with dashboard has two advantages:

  1. the data collection workflow is maintained by dashboard team
  2. data include not only CMS EOS at CERN but also all the XRootD traffic among sites.

The migration is still stuck in its final phase, because the new raw table in the production DB LCGR has virtual columns (start_date and end_date) that are incompatible with the fast refresh of materialized views. The LCGR DBA is investigating whether there are workarounds.
In the meantime an additional data collection workflow is running, to consume from the ActiveMQ virtual queue /queue/Consumer.popularity.xrootd.cms.eos and upload the EOS data into the int2r DB.

This workflow is currently maintained by IT-SDC. If the problem is not solved in a couple of weeks, it would be useful to move the workflow to 2 CMS VOboxes.

add synonyms in DB schema

and get rid of the DB schema name specified in the web application code, both for CMS Popularity and Victor

Provide description of output key/values

Right now the popularity APIs only provide a short description of the input parameters, while leaving the output key/values totally unexplained. For instance, how can a user figure out that the output returned by the getDSdata API is in milliseconds? How can a user find out that its output consists of bins, what the values of those bins are, and how they are assigned? The same applies to the other APIs. All input and output parameters should be clearly documented.

Run popularity agents as non-root

Currently the following agents are running as root:

  • CRAB popularity on vocms041
  • xrootd popularity on vocms041
  • Victor on vocms041
  • CMSSW popularity on vocms044

Can they be run as non-root (e.g. with the cmspopdb service account) instead?

upgrade to Django >= 1.5

The CMS Data Popularity and Victor web applications are based on Django versions <1.5.
In order to upgrade to the latest Django releases, a few changes are needed due to changes in the Django APIs.

Future deployment of DDM on cmsweb

Hi,

I have added the necessary user accounts and groups (_victor and _cmspopdb) to the system/deploy script. giffels/deployment@cab5d36

I have added forwarding rules for victor and cmspopdb to the appropriate frontend configurations. giffels/deployment@d6c24dc

I have added a CMS Popularity box with links to victor and cmspopdb to the cmsweb welcome page. giffels/deployment@c851e80

A comparison of the feature branch to the dmwm:master can be found here https://github.com/giffels/deployment/compare/cmspopdb-configuration

Things to do:

  • Request a port range for victor and cmspopdb from the http group. Currently the forwarding rules are pointing to the dbs port.

Cheers,
Manuel

change CMSSW popularity DB schema

The CMSSW popularity DB schema contains a raw table with virtual columns start_date and end_date.
This schema is inherited from a similar structure in the XRootD DB.
There are problems with virtual columns and materialized views.
In particular, fast refresh is not possible if the MV includes a virtual column.
The reason is that an MV log cannot include a virtual column.

This issue is well documented in Oracle documentation:
BUG:10053393 - FAST REFRESH ISSUE WITH AGGREGATE MATERIALIZED VIEW AND VIRTUAL COLUMN
BUG:10092363 - VIRTUAL COLUMNS ARE NOT SUPPORTED IN MATERIALIZED VIEWS LOGS
BUG:9907479 - ORA-947 RUNNING FAST REFRESH ON AN MVIEW THAT HAS VIRTUAL COLUMNS

Need to reorganize DB structure and collector definition.

getSingleDSstat API should not accept/return tstart/tstop

The getSingleDSstat API output is independent of the tstart/tstop parameters, but they're returned anyhow. This makes the output very confusing for the client, since it contains all dates when the particular dataset was popular. For instance, requesting
https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat?orderby=totcpu&name=/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/Summer12-PU_S7_START52_V9-v2/AODSIM
will return data in the range 2012-2013, while the output also has tstart/tstop referring to the current time (2014). Originally I assumed that I could pass tstart/tstop parameters into the API, but doing so does not change the output; they are ignored. It would be nice to either eliminate them from the API output and/or validate the API parameters from the client.

getSingleDSstat API reporting 'invalid dataset name' for datasets that have never been accessed

Hello,

currently the getSingleDSstat API reports an error message "Error occurred during dataset validation, cause: param must be a valid dataset name" when the dataset name is valid but the dataset hasn't been recorded in PopDB yet. The API could return an empty result set in this case instead.

Example on Apr 21st:

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat?sitename=summary&orderby=naccess&name=/VBF_HToZZTo4L_M-125_14TeV-powheg-pythia6/TP2023SHCALDR-SHCALMar26_PU140BX25_PH2_1K_FB_V6-v3/GEN-SIM-RECO&aggr=day

Google data analytics

Google Analytics is included to monitor the web accesses, both for CMS Popularity and Victor. Evaluate whether it is still needed or should be removed.

Dataset name validation in Popularity getSingleDSstat API

When a valid dataset has never been accessed, the getSingleDSstat API returns this error message:

"Error occurred during dataset validation, cause: param must be a valid dataset name"

e.g.

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat/?&sitename=summary&orderby=naccess&name=%2FVBFHiggs0PToWWTo2L2NU_M-125p6_8TeV-JHUGenV4%2FSummer12_DR53X-PU_S10_START53_V19-v1%2FAODSIM&aggr=day

Same as when the dataset name is actually invalid according to the CMS naming conventions, e.g.:

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat/?&sitename=summary&orderby=naccess&name=invalidcmsdatasetname&aggr=day

The two cases could return different error messages.
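A sketch of the suggested behaviour: the regex for the /Primary/Processed/Tier naming convention and the response shapes below are illustrative assumptions, not the actual PopDB code.

```python
import re

# Illustrative pattern for the /Primary/Processed/Tier convention
DATASET_RE = re.compile(r'^/[\w.-]+/[\w.-]+/[A-Z-]+$')

def single_ds_stat(name, lookup):
    """Return stats, an empty result, or a validation error.

    `lookup` stands in for the DB query; it returns None when the
    dataset has never been recorded in PopDB.
    """
    if not DATASET_RE.match(name):
        # Malformed name: a genuine validation error
        return {'error': 'param must be a valid dataset name'}
    rows = lookup(name)
    if rows is None:
        # Valid name that was never accessed: empty result, not an error
        return {'data': []}
    return {'data': rows}

# Valid but never-accessed dataset -> empty result set
print(single_ds_stat('/VBF_H/Some-Proc-v1/AODSIM', lambda n: None))
# Malformed name -> validation error
print(single_ds_stat('invalidcmsdatasetname', lambda n: None))
```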

extend popularity api: add number of distinct tasks per dataset/block

The popularity APIs report the number of accesses, the cumulative processing time, and the number of distinct users per block and/or dataset.
The APIs need to be extended to also include the number of distinct job tasks accessing a dataset/block in a given time window.
This implies:

  1. modification of Materialized Views
  2. modification of Django views reporting the information
  3. extend rendering of tables and plots to use this information.

request production db schema for cmssw popularity

CMSSW popularity data are currently collected in an int11r schema.
Those data should be copied to a production DB, in order to guarantee backup of the currently collected data.

Please coordinate with the CMS DBA to:

  1. copy the current schema from int11r to the CMS production DB

  2. point the insertion agents to the new db schema

Change Victorinterface APIs to query DB directly instead of calling PopDB APIs

Suggestion from Diego to avoid having to put authentication in place for the Victorinterface APIs to query the PopDB APIs:


Every service interaction would then pass through the frontends, using the proxy/renewal if the target service requires authentication (which is the case for most of the services). However, if the service queries itself, it is usually preferable to solve that at the code level: call the code routine directly and avoid completely the cycle of generating an HTTP request. Wouldn't it be possible here? Since both APIs can access the same (DB) resources, there's no need to put a dependency on the other API and trigger the extra delay generated by the HTTP processing cycle.

cmssw popularity collector often fails bulk insertions

The CMSSW popularity collector procedure often fails bulk insertions.

CMSSW reports are sent as UDP messages, in a key-value format, and organized in a dictionary.
In general the collector tries to insert into the DB a list of dictionaries in bulk, and this operation frequently fails.
The problem is due to reports that do not contain the full list of dictionary keys.
If the "smaller" dictionary is the first of the list of dictionaries to insert, the bulk insertion fails.
Probably cx_Oracle infers the attributes to insert from the first dictionary in the list, and this will not match the larger dictionaries of the subsequent messages.

Messages are not lost, because after the bulk insertion fails, an insertion of the messages one by one is executed, and this succeeds.

It would be good to fix this issue.

The current dictionary must contain 27 key values, as listed in [*]

[*]
['client_domain',
'read_vector_bytes',
'site_name',
'server_domain',
'read_single_operations',
'app_info',
'user_dn',
'start_time',
'read_bytes',
'file_lfn',
'read_single_bytes',
'server_host',
'read_bytes_at_close',
'client_host',
'read_vector_operations',
'file_size',
'end_date',
'fallback',
'start_date',
'unique_id',
'end_time']
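A possible fix, under the assumption stated above that cx_Oracle derives the bind variables from the first row: pad every report to the full key set before calling executemany. The table name and SQL below are placeholders, not the collector's actual code.

```python
# Key set taken from the issue's list [*]; the real table may differ.
EXPECTED_KEYS = ['client_domain', 'read_vector_bytes', 'site_name',
                 'server_domain', 'read_single_operations', 'app_info',
                 'user_dn', 'start_time', 'read_bytes', 'file_lfn',
                 'read_single_bytes', 'server_host', 'read_bytes_at_close',
                 'client_host', 'read_vector_operations', 'file_size',
                 'end_date', 'fallback', 'start_date', 'unique_id',
                 'end_time']

def normalize(report, keys=EXPECTED_KEYS):
    """Pad a UDP report with None for every missing key, so all rows
    passed to cursor.executemany() bind the same set of variables."""
    return {k: report.get(k) for k in keys}

# A "smaller" first report no longer breaks the bulk insertion:
reports = [{'site_name': 'T2_US_Example'},     # incomplete report
           {k: 0 for k in EXPECTED_KEYS}]      # complete report
rows = [normalize(r) for r in reports]
# sql = 'INSERT INTO raw_table (...) VALUES (%s)' % ', '.join(
#     ':' + k for k in EXPECTED_KEYS)
# cursor.executemany(sql, rows)  # every row now has identical keys
```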

Enable monitoring and alarms for CMSSW popularity service

Hello,

the CMSSW popularity collector running on vocms044 doesn't come back up automatically after a reboot. I suggest adding:

  • automatic restart of the service; could be done using the simplevisor like other popularity services
  • monitoring and alarms in case the automatic restart fails; could be done with lemon metrics like other cmsweb services

Cheers
Nicolo'

Verify the request headers coming from the cmsweb frontend

From Diego:

One thing that is missing in the setup, but is not critical for now, is that both popdbweb and victorweb are not verifying the headers coming from the cmsweb frontends. Every back-end service must check them to make sure requests have not been tampered with or crafted locally by exploiting some failure in some other service running on the same back-end. Once that protection is in place, we cannot generate requests locally anymore.

On a separate note, it is also important that all requests pass through the frontends, for proper accountability and identification of all the clients.

CMS Popularity web: development instance alias

There is an alias usable to point to a development/integration instance of the CMS popularity system.
This alias is CMS-POP-DEV--LOAD-xxx-
cms-popularity-dev (the redirector URL) points to this alias, to expose the dev/integration instance.
Currently CMS-POP-DEV points to a dashboard machine.
When a development (integration) instance is ready on a CMS vobox, please ask to remove the alias from the dashboard machine.

Note from Andreas P. on accessing the machine through the web redirector:
open port 443 (https), either for any machine, for a subset (CERN hosts), or for a single machine (likely by IP address). If it's by host, you'll need to change 137.138.162.187 (vocms147, the old -dev front-end) to 128.142.243.211 (vocms0148, the new -dev front-end).

Align all popularity APIs with common notations

Right now the popularity APIs use different notation schemas: e.g. DSStatInTimeWindow uses upper-case keys in the returned dictionary, while getDSdata uses lower case. When programming against the service it is desirable to have a consistent schema.
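Until the notations are aligned, a client can normalize the keys itself; a minimal sketch:

```python
def lower_keys(record):
    """Normalize one API record's keys to lower case, so results from
    DSStatInTimeWindow and getDSdata can be handled uniformly."""
    return {k.lower(): v for k, v in record.items()}

# An upper-case DSStatInTimeWindow-style record, normalized:
print(lower_keys({'TOTCPU': 3600, 'NACC': 5}))
# {'totcpu': 3600, 'nacc': 5}
```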

XRootD inconsistency of t-stream monitoring data

A large discrepancy was found between the GLED-reported metrics for the t-stream (a.k.a. detailed stream) and the ML summary metrics.
The discrepancy was studied in depth during the summer, in collaboration with the Notre Dame site admin, by measuring real site traffic.
- Identified inconsistent close-time and read/write-bytes reports
- Reported to the AAA project leaders
- Matevz's preliminary investigation identifies a possible reason:
 “xrootd servers not handling the client disconnects properly (or at least not reporting them to monitoring)”

Proposed actions:
- move reports from the t-stream to the f-stream; the f-stream doesn't suffer from the same problem
- ATLAS is using the f-stream, and its traffic metrics are consistent w.r.t. the ML summary

CRAB Popularity Workflow: Validation Crab3

Validate the data collection workflow already running for CRAB3, and compare with CRAB2.
It would be useful to do this under controlled conditions, submitting the same kind of job, by the same user, to the same site, accessing the same files, with both CRAB2 and CRAB3, in order to compare the entries in the DB.

Monitoring validity of popularity metrics

Hello,

posting report by Domenico about recent incidents causing incomplete popularity record:

  1. For CRAB:
    the job monitoring outage, due to the MonALISA issue, caused an interruption of the data popularity collection for all CMS sites for a few days.
    This can be seen in [e1], where I report the number of records uploaded to the DB per day.
    From Sep 27th to Oct 2nd, and again on Oct 4th, the reduction is evident, as also shown in the first snapshot.

  2. For EOS:
    the migration of the EOS disk servers to puppet propagated a wrong configuration for the xrootd monitoring.
    As a consequence, UDP messages from most of the disk servers were sent to a wrong port, where nothing was running.
    The EOS popularity is therefore in large part wrong for a period of time going from Aug 28 to Oct 3rd (5 weeks), when the issue was discovered and fixed (see the second snapshot attached).

Need to figure out metrics to spot this kind of incidents earlier, and send alarms accordingly.

Cheers
n

Default value when no value is passed in popdb

For some parameters there are default values if the parameter is not specified. However, if the parameter is given in the query but no value is passed (an empty string), an error message is returned.

Would it be possible to apply the default value when an empty string is passed?

Example: https://cms-popularity.cern.ch/popdb/popularity/getDSdata?&tstart=2012-6-27&tstop=2012-7-5&sitename=&n=5&aggr=day&orderby=totcpu

Here sitename is specified in the query but has no value defined. I would like the API to apply the default "summary" in this case.
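A minimal sketch of the requested fallback behaviour; the helper name is illustrative, and in a Django view this reduces to roughly `request.GET.get('sitename') or 'summary'`:

```python
def get_param(params, name, default):
    """Fall back to the default both when the parameter is absent and
    when it is present but empty (e.g. '...&sitename=&...')."""
    value = params.get(name, default)
    return value if value != '' else default

query = {'tstart': '2012-6-27', 'sitename': ''}  # sitename given but empty
print(get_param(query, 'sitename', 'summary'))   # summary
print(get_param(query, 'aggr', 'day'))           # day
```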

CMSSW popularity aggregations

The CMSSW popularity DB is currently collecting the CMSSW UDP monitoring data, but nothing is aggregated and exposed through the web UI or JSON APIs.
Steps needed:

  1. create MVs in the DB schema
  2. build a file-to-dataset association agent (such as the one used for XRootD popularity) in order to resolve the file-to-dataset association
  3. extend the popularity web UI to access the DB information

Xrootd Popularity getSingleDSstat doesn't work when ranking by naccess

Hi,

orderby=naccess doesn't work in xrdpopularity/getSingleDSstat, the result is an ORA error message, see e.g.:

https://cms-popularity.cern.ch/popdb/xrdpopularity/getSingleDSstat?orderby=naccess&name=/DoubleElectron/Run2012B-PromptReco-v1/AOD

Database error occurred: ORA-00904: "TA"."NACCESS": invalid identifier (SQL: select round(( TDay-to_date('19700101','YYYYMMDD') )*86400)*1000 as millisecondsSinceEpoch, ta.naccess from ( select trunc(TDay,'DDD') as TDay, sum(numAccesses) as nAcc, round(sum(totCPU)/3600,0) as totCPU, sum(numUsers) as nUsers from CMS_EOS_POPULARITY_SYSTEM.MV_XRD_DS_STAT0_AGGR1 where collName = '/DoubleElectron/Run2012B-PromptReco-v1/AOD' and isUserCMS = 0 group by trunc(TDay,'DDD'), collName ) ta order by TDay )

It looks like a wrong column reference in the API query: the inner select aliases the aggregate as nAcc, but the outer select refers to ta.naccess.

Config file name is hardcoded in victor executable

In the main executable /opt/dq2/lib/dq2/victor/run.py the config file name is hardcoded to /opt/dq2/etc/dq2.cfg.

It would be useful to make it configurable through an argument or environment variable, so we can deploy multiple instances of the Victor agents (e.g. pointing to the prod vs the int DB) on the same machine.
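A sketch of one way run.py could resolve the path; the `DQ2_CONFIG` environment variable and the `--config` flag are invented examples, not existing options:

```python
import argparse
import os

def config_path(argv=None):
    """Resolve the Victor config file: the --config flag wins, then the
    DQ2_CONFIG environment variable, then the currently hardcoded path."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--config',
                        default=os.environ.get('DQ2_CONFIG',
                                               '/opt/dq2/etc/dq2.cfg'))
    return parser.parse_args(argv).config

print(config_path([]))  # /opt/dq2/etc/dq2.cfg, unless DQ2_CONFIG is set
print(config_path(['--config', '/opt/dq2-int/etc/dq2.cfg']))
```

Each deployed instance could then point at its own config (e.g. a prod and an int agent on the same machine) without editing the executable.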
