
dmwm / ddm

Dynamic Data Management - Cache release and auto-replication of hot data

Shell 1.07% PLSQL 4.94% Python 31.72% HTML 15.08% JavaScript 37.40% CSS 3.66% SQLPL 0.33% PigLatin 4.88% Scala 0.89% TSQL 0.03%

ddm's Introduction

DDM

Dynamic Data Management - Cache release and auto-replication of hot data

ddm's People

Contributors

cvuosalo, domenicogiordano, giffels, kdziedzi, mmeoni, nikmagini, rcaspart, tonywildish

ddm's Issues

The schema name is hardcoded

The schema name is hardcoded to CMS_CLEANING_AGENT in dq2.victor.cms/lib/dq2/victor/victorDao.py and victor.monitoring.cms/lib/victorDao.py; it should be configurable instead.

extend the file-to-dataset association agent to cover all sites in addition to EOS

The current implementation of the file-to-dataset association agent assumes that only XRootD EOS popularity is available.
However, the XRootD popularity DB now also includes data from other sites.
The agent needs to be extended to cover associations for files other than the EOS ones.

In the meantime it would be useful to evaluate a redesign of the association agent, based on different queries to PhEDEx (and different query frequencies).

CMSSW Popularity: request a production DB instance

A production DB instance should be requested to collect the CMSSW Popularity data.
Currently the data are collected in the int11r DB, but this DB has no backup.

Please also evaluate whether a migration of the current data from int11r to the new production DB schema is needed, and if so coordinate with the DBA to perform the data migration.
I would strongly suggest moving the data to the production DB.

When the production DB schema is ready, redirect the collection workflow to this DB.

XRootD CMS EOS collection suffering from T0 tests

For the record:

On 2014-11-13 at ~21h, a stress test of the CMS T0 [1] caused a peak of XRootD monitoring messages from EOS, as can be seen in the following graph.
https://mig-graphite.cern.ch/graphlot/?from=20:00_20141113&until=08:59_20141114&target=scaleToSeconds(sumSeries(derivative(msgclt.received_messages.mb*.dashb.xrootd-cms-eos)),1)
The huge load, with peaks of 2.8 kHz of messages, caused trouble for GLED, which was restarted up to 6 times between Nov 13 at 22:21 and Nov 14 at 5:11.
In addition, the services running on dashb-ai-530 (consuming from /queue/Consumer.popularity.xrootd.cms.eos and populating INT2R)
could not sustain the huge load. The local disk immediately filled up, blocking the services and causing a pileup of messages in the message brokers.
A watchdog running on dashb-ai-530 to stop stompclt when disk inodes are above 80% was not able to mitigate the problem, because the message rate increase of three orders of magnitude was almost instantaneous.
In addition, the insertion rate of messages in INT2R is far smaller than in LCGR. Comparing the performance of the consumers inserting data in LCGR and INT2R, it can be seen that for LCGR it takes <1s to insert 1000 records in bulk, whereas it takes ~8s in INT2R.
The INT2R DBA has been contacted. Investigation is still ongoing.
The pileup of messages in the virtual queue can be seen here
https://mig-graphite.cern.ch/graphlot/?from=20:00_20141113&until=23:59_20141114&target=msgclt.stored_messages.mb*.dashb.xrootd-cms-eos-pop
Another scale test on CMS T0 is foreseen for today.
Domenico
[1]
From CMS Computing ELOG
Total of about 110k jobs, we were filling all the cores available to the project during some periods.
The number of to be submitted jobs went up to almost 40k temporarily.
[2]
[Thu Nov 13 22:20:10 CET 2014] gled-xrdmon-check: some collectors supposed to be running are stopped (crashed?), will be restarted
Collector 'prod2' is probably crashed, restarting
Stopping Gled XRootD transfers monitoring ('prod2'): [FAILED]
Starting Gled XRootD transfers monitoring ('prod2'): [ OK ]

Victorinterface API configuration for PhEDEx DATASERVICE_HOST hardcoded to cmsweb.cern.ch

Hi,

the configuration of the victorinterface API in config/popdbweb/conf.ini is currently hardcoded to query cmsweb.cern.ch for the PhEDEx datasvc, and cms-popularity.cern.ch for the PopDB APIs:

https://github.com/dmwm/DDM/blob/master/DataPopularity/popdb.web/config/conf.ini#L10-L15

It should be made configurable, pointing both DATASERVICE_HOST and
POPULARITY_HOST by default to the same cluster on which it is running (i.e. https://cmsweb-testbed.cern.ch or https://cmsweb.cern.ch or a private devvm for preprod/prod/dev respectively).

This comes with an additional problem: POPULARITY_HOST will require authentication if we set it to cmsweb(-testbed).

Cheers
N

Why does the popularity API return milliseconds?

The current popularity APIs return data in milliseconds since epoch. This makes the output very confusing and wastes resources. I can't imagine that milliseconds are required, and so far I have not seen the millisecond part of the bins actually filled. They could easily be changed to normal Unix seconds since epoch, making the output smaller in size and more easily parseable on the client side.
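Until the APIs change, a client-side workaround is straightforward; a minimal sketch, assuming each bin is a [timestamp_ms, value] pair (the exact JSON layout may differ):

```python
def ms_bins_to_seconds(bins):
    """Convert [ms_since_epoch, value] bins to Unix seconds.

    Assumes each bin is a [timestamp_ms, value] pair; the (apparently
    always empty) millisecond part is dropped by integer division.
    """
    return [[ts_ms // 1000, value] for ts_ms, value in bins]

# Two daily bins as the API reports them, in milliseconds since epoch:
bins = [[1340755200000, 42], [1340841600000, 17]]
print(ms_bins_to_seconds(bins))  # [[1340755200, 42], [1340841600, 17]]
```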

update link to support tickets

The following web pages [*] include a link for support requests from users.
The link points to the ticket system at https://svnweb.cern.ch/trac/popdb/,
which is now obsolete and no longer maintained.
It should be replaced by new links to the GitHub project (or something else t.b.d.).

Domenico

[*]

  1. submenu Support-> Wiki and BugReport for the urls
    https://cms-popularity.cern.ch/popdb
    and
    https://cms-popularity.cern.ch/popdb/xrdpopularity/

  2. submenu Bug&Requests at
    https://cms-popularity.cern.ch/victor/accounting

Long standing issue with DB merge among XRootD popularity and dashboard

Migration of XRootD popularity workflow from the integration DB instance in int2r to the production instance in LCGR shared with dashboard has two advantages:

  1. the data collection workflow is maintained by dashboard team
  2. data include not only CMS EOS at CERN but also all the XRootD traffic among sites.

The migration is still stuck in its final phase, because the new raw table in the production DB LCGR has virtual columns (start_date and end_date) that are incompatible with the fast refresh of materialized views. The LCGR DBA is investigating whether there are workarounds.
In the meantime an additional data collection workflow is running, to consume from the ActiveMQ virtual queue /queue/Consumer.popularity.xrootd.cms.eos and upload the EOS data into the int2r DB.

This workflow is currently maintained by IT-SDC. If the problem is not solved in a couple of weeks, it would be useful to move the workflow to 2 CMS VOboxes.

add synonyms in DB schema

and get rid of the DB schema name specified in the web application code, both for CMS Popularity and Victor

Provide description of output key/values

Right now the popularity APIs only provide a short description of the input parameters, while leaving the output key/values totally unexplained. For instance, how can a user figure out that the output returned by the getDSdata API is in milliseconds? How can a user find out that its output consists of bins, what the values of those bins are, and how they are assigned? The same applies to the other APIs. All input and output parameters should be clearly documented.

Run popularity agents as non-root

Currently the following agents are running as root:

  • CRAB popularity on vocms041
  • xrootd popularity on vocms041
  • Victor on vocms041
  • CMSSW popularity on vocms044

Can they be run as non-root (e.g. with the cmspopdb service account) instead?

upgrade to Django >= 1.5

The CMS Data Popularity and Victor web applications are based on Django versions <1.5.
In order to upgrade to the latest Django releases, a few changes are needed due to changes in the Django APIs.

Future deployment of DDM on cmsweb

Hi,

I have added the necessary user accounts and groups (_victor and _cmspopdb) to the system/deploy script. giffels/deployment@cab5d36

I have added forwarding rules for victor and cmspopdb to the appropriate frontend configurations. giffels/deployment@d6c24dc

I have added a CMS Popularity box with links to victor and cmspopdb to the cmsweb welcome page. giffels/deployment@c851e80

A comparison of the feature branch to the dmwm:master can be found here https://github.com/giffels/deployment/compare/cmspopdb-configuration

Things to do:

  • Request a port range for victor and cmspopdb from the http group. Currently the forwarding rules are pointing to the dbs port.

Cheers,
Manuel

change CMSSW popularity DB schema

The CMSSW popularity DB schema contains a raw table with virtual columns start_date and end_date.
This schema is inherited from a similar structure in the XRootD DB.
There are problems with virtual columns and materialized views.
In particular, fast refresh is not possible if the MV includes a virtual column.
The reason is that an MV log cannot include a virtual column.

This issue is well documented in Oracle documentation:
BUG:10053393 - FAST REFRESH ISSUE WITH AGGREGATE MATERIALIZED VIEW AND VIRTUAL COLUMN
BUG:10092363 - VIRTUAL COLUMNS ARE NOT SUPPORTED IN MATERIALIZED VIEWS LOGS
BUG:9907479 - ORA-947 RUNNING FAST REFRESH ON AN MVIEW THAT HAS VIRTUAL COLUMNS

Need to reorganize DB structure and collector definition.

getSingleDSstat API should not accept/return tstart/tstop

The getSingleDSstat API output is independent of the tstart/tstop parameters, but they're returned anyhow. This makes the output very confusing for the client, since it contains all dates when the particular dataset was popular. For instance, requesting
https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat?orderby=totcpu&name=/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/Summer12-PU_S7_START52_V9-v2/AODSIM
will return data in the range 2012-2013, while the output also has tstart/tstop referring to the current time (2014). Originally I assumed that I could pass tstart/tstop parameters into the API, but doing so does not change the output; they are ignored. It would be nice to either eliminate them from the API output and/or validate the API parameters from the client.

getSingleDSstat API reporting 'invalid dataset name' for datasets that have never been accessed

Hello,

currently the getSingleDSstat API reports an error message "Error occurred during dataset validation, cause: param must be a valid dataset name" when the dataset name is valid but the dataset hasn't been recorded in PopDB yet. The API could return an empty result set in this case instead.

Example on Apr 21st:

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat?sitename=summary&orderby=naccess&name=/VBF_HToZZTo4L_M-125_14TeV-powheg-pythia6/TP2023SHCALDR-SHCALMar26_PU140BX25_PH2_1K_FB_V6-v3/GEN-SIM-RECO&aggr=day

Google data analytics

Google Analytics is included to monitor the web accesses, both for CMS Popularity and Victor. Evaluate whether it is still needed or should be removed.

Dataset name validation in Popularity getSingleDSstat API

When a valid dataset has never been accessed, the getSingleDSstat API returns this error message:

"Error occurred during dataset validation, cause: param must be a valid dataset name"

e.g.

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat/?&sitename=summary&orderby=naccess&name=%2FVBFHiggs0PToWWTo2L2NU_M-125p6_8TeV-JHUGenV4%2FSummer12_DR53X-PU_S10_START53_V19-v1%2FAODSIM&aggr=day

Same as when the dataset name is actually invalid according to the CMS naming conventions, e.g.:

https://cms-popularity.cern.ch/popdb/popularity/getSingleDSstat/?&sitename=summary&orderby=naccess&name=invalidcmsdatasetname&aggr=day

The two cases could return different error messages.
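A sketch of the suggested behaviour: the regex for the /Primary/Processed/Tier naming convention and the response shapes below are illustrative assumptions, not the actual PopDB code.

```python
import re

# Illustrative pattern for the /Primary/Processed/Tier convention
DATASET_RE = re.compile(r'^/[\w.-]+/[\w.-]+/[A-Z-]+$')

def single_ds_stat(name, lookup):
    """Return stats, an empty result, or a validation error.

    `lookup` stands in for the DB query; it returns None when the
    dataset has never been recorded in PopDB.
    """
    if not DATASET_RE.match(name):
        # Malformed name: a genuine validation error
        return {'error': 'param must be a valid dataset name'}
    rows = lookup(name)
    if rows is None:
        # Valid name that was never accessed: empty result, not an error
        return {'data': []}
    return {'data': rows}

# Valid but never-accessed dataset -> empty result set
print(single_ds_stat('/VBF_H/Some-Proc-v1/AODSIM', lambda n: None))
# Malformed name -> validation error
print(single_ds_stat('invalidcmsdatasetname', lambda n: None))
```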

extend popularity api: add number of distinct tasks per dataset/block

The popularity APIs report the number of accesses, the cumulative processing time, and the number of distinct users per block and/or dataset.
The APIs need to be extended to also include the number of distinct job tasks accessing a dataset/block in a given time window.
This implies:

  1. modification of Materialized Views
  2. modification of Django views reporting the information
  3. extend rendering of tables and plots to use this information.

request production db schema for cmssw popularity

CMSSW popularity data are currently collected in an int11r schema.
Those data should be copied to a production DB, in order to guarantee backup of the currently collected data.

Please coordinate with the CMS DBA to:

  1. copy the current schema from int11r to the CMS production DB

  2. point the insertion agents to the new db schema

Change Victorinterface APIs to query DB directly instead of calling PopDB APIs

Suggestion from Diego to avoid having to put authentication in place for the Victorinterface APIs to query the PopDB APIs:


Every service interaction would then pass through the frontends, using the proxy/renewal if the target service requires authentication (which is the case for most of the services). However, if the service queries itself, it is usually preferable to solve that at the code level: call the code routine directly and avoid completely the cycle of generating an HTTP request. Wouldn't it be possible here? Since both APIs can access the same (DB) resources, there's no need to put a dependency on the other API and trigger the extra delay generated by the HTTP processing cycle.

cmssw popularity collector often fails bulk insertions

The CMSSW popularity collector procedure often fails bulk insertions.

CMSSW reports are sent as UDP messages, in a key-value format, and organized in a dictionary.
In general the collector tries to insert into the DB a list of dictionaries in bulk, and this operation frequently fails.
The problem is due to reports that do not contain the full list of dictionary keys.
If the "smaller" dictionary is the first of the list of dictionaries to insert, the bulk insertion fails.
Probably cx_Oracle infers the attributes to insert from the first dictionary in the list, and this will not match the larger dictionaries of the subsequent messages.

Messages are not lost, because after the bulk insertion fails, an insertion of the messages one by one is executed, and this succeeds.

It would be good to fix this issue.

The current dictionary must contain 27 key values, as listed in [*]

[*]
['client_domain',
'read_vector_bytes',
'site_name',
'server_domain',
'read_single_operations',
'app_info',
'user_dn',
'start_time',
'read_bytes',
'file_lfn',
'read_single_bytes',
'server_host',
'read_bytes_at_close',
'client_host',
'read_vector_operations',
'file_size',
'end_date',
'fallback',
'start_date',
'unique_id',
'end_time']
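A possible fix, under the assumption stated above that cx_Oracle derives the bind variables from the first row: pad every report to the full key set before calling executemany. The table name and SQL below are placeholders, not the collector's actual code.

```python
# Key set taken from the issue's list [*]; the real table may differ.
EXPECTED_KEYS = ['client_domain', 'read_vector_bytes', 'site_name',
                 'server_domain', 'read_single_operations', 'app_info',
                 'user_dn', 'start_time', 'read_bytes', 'file_lfn',
                 'read_single_bytes', 'server_host', 'read_bytes_at_close',
                 'client_host', 'read_vector_operations', 'file_size',
                 'end_date', 'fallback', 'start_date', 'unique_id',
                 'end_time']

def normalize(report, keys=EXPECTED_KEYS):
    """Pad a UDP report with None for every missing key, so all rows
    passed to cursor.executemany() bind the same set of variables."""
    return {k: report.get(k) for k in keys}

# A "smaller" first report no longer breaks the bulk insertion:
reports = [{'site_name': 'T2_US_Example'},     # incomplete report
           {k: 0 for k in EXPECTED_KEYS}]      # complete report
rows = [normalize(r) for r in reports]
# sql = 'INSERT INTO raw_table (...) VALUES (%s)' % ', '.join(
#     ':' + k for k in EXPECTED_KEYS)
# cursor.executemany(sql, rows)  # every row now has identical keys
```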

Enable monitoring and alarms for CMSSW popularity service

Hello,

the CMSSW popularity collector running on vocms044 doesn't come back up automatically after a reboot. I suggest adding:

  • automatic restart of the service; could be done using the simplevisor like other popularity services
  • monitoring and alarms in case the automatic restart fails; could be done with lemon metrics like other cmsweb services

Cheers
Nicolo'

Verify the request headers coming from the cmsweb frontend

From Diego:

One thing that is missing in the setup, but is not critical for now, is that both popdbweb and victorweb are not verifying the headers coming from the cmsweb frontends. Every back-end service must check them to make sure requests have not been tampered with or crafted locally by exploiting some failure in some other service running on the same back-end. Once that protection is in place, we cannot generate requests locally anymore.

On a separate note, it is also important that all requests pass through the frontends, for proper accountability and identification of all the clients.

CMS Popularity web: development instance alias

There is an alias usable to point to a development/integration instance of the CMS popularity system.
This alias is CMS-POP-DEV--LOAD-xxx-
cms-popularity-dev (the redirector URL) points to this alias, to expose the dev/integration instance.
Currently CMS-POP-DEV points to a dashboard machine.
When a development (integration) instance is ready on a CMS vobox, please ask to remove the alias from the dashboard machine.

Note from Andreas P. on accessing the machine through the web redirector:
open port 443 (https), either for any machine, for a subset (CERN hosts), or for a single machine (likely by IP address). If it's by host, you'll need to change 137.138.162.187 (vocms147, the old -dev front-end) to 128.142.243.211 (vocms0148, the new -dev front-end).

Align all popularity APIs with common notations

Right now the popularity APIs use different notation schemas: e.g. DSStatInTimeWindow uses upper-case keys in the returned dictionary, while getDSdata uses lower case. When programming against the service it is desirable to have a consistent schema.
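Until the notations are aligned, a client can normalize the keys itself; a minimal sketch:

```python
def lower_keys(record):
    """Normalize one API record's keys to lower case, so results from
    DSStatInTimeWindow and getDSdata can be handled uniformly."""
    return {k.lower(): v for k, v in record.items()}

# An upper-case DSStatInTimeWindow-style record, normalized:
print(lower_keys({'TOTCPU': 3600, 'NACC': 5}))
# {'totcpu': 3600, 'nacc': 5}
```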

XRootD inconsistency of t-stream monitoring data

A large discrepancy was found between the GLED-reported metrics for the t-stream (a.k.a. detailed stream) and the ML summary metrics.
The discrepancy was studied in depth during the summer, in collaboration with the Notre Dame site admin, by measuring real site traffic.
- Identified inconsistent close-time and read/write-bytes reports
- Reported to the AAA project leaders
- Matevz's preliminary investigation identifies a possible reason:
 “xrootd servers not handling the client disconnects properly (or at least not reporting them to monitoring)”

Proposed actions:
- move reports from the t-stream to the f-stream; the f-stream doesn't suffer from the same problem
- ATLAS is using the f-stream, and its traffic metrics are consistent w.r.t. the ML summary

CRAB Popularity Workflow: Validation Crab3

Validate the data collection workflow already running for CRAB3, and compare with CRAB2.
It would be useful to do this under controlled conditions, submitting the same kind of job, by the same user, to the same site, accessing the same files, with both CRAB2 and CRAB3, in order to compare the entries in the DB.

Monitoring validity of popularity metrics

Hello,

posting report by Domenico about recent incidents causing incomplete popularity record:

  1. For CRAB:
    the job monitoring outage, due to the MonALISA issue, caused an interruption of the data popularity collection for all CMS sites for a few days.
    This can be seen in [e1], where I report the number of records uploaded to the DB per day.
    From Sep 27th to Oct 2nd, and again on Oct 4th, the reduction is evident, as also shown in the first snapshot.

  2. For EOS:
    the migration of the EOS disk servers to puppet propagated a wrong configuration for the xrootd monitoring.
    As a consequence, UDP messages from most of the disk servers were sent to a wrong port, where nothing was running.
    The EOS popularity is therefore in large part wrong for a period of time going from Aug 28 to Oct 3rd (5 weeks), when the issue was discovered and fixed (see the second snapshot attached).

Need to figure out metrics to spot this kind of incidents earlier, and send alarms accordingly.

Cheers
n

Default value when no value is passed in popdb

For some parameters there are default values if the parameter is not specified. However, if the parameter is given in the query but no value is passed (an empty string), an error message is returned.

Would it be possible to apply the default value when an empty string is passed?

Example: https://cms-popularity.cern.ch/popdb/popularity/getDSdata?&tstart=2012-6-27&tstop=2012-7-5&sitename=&n=5&aggr=day&orderby=totcpu

Here sitename is specified in the query but has no value defined. I would like the API to apply the default "summary" in this case.
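A minimal sketch of the requested fallback behaviour; the helper name is illustrative, and in a Django view this reduces to roughly `request.GET.get('sitename') or 'summary'`:

```python
def get_param(params, name, default):
    """Fall back to the default both when the parameter is absent and
    when it is present but empty (e.g. '...&sitename=&...')."""
    value = params.get(name, default)
    return value if value != '' else default

query = {'tstart': '2012-6-27', 'sitename': ''}  # sitename given but empty
print(get_param(query, 'sitename', 'summary'))   # summary
print(get_param(query, 'aggr', 'day'))           # day
```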

CMSSW popularity aggregations

The CMSSW popularity DB is currently collecting the CMSSW UDP monitoring data, but nothing is aggregated and exposed through the web UI or JSON APIs.
Steps needed:

  1. create MVs in the DB schema
  2. build a file-to-dataset association agent (such as the one used for XRootD popularity) in order to resolve the file-to-dataset association
  3. extend the popularity web UI to access the DB information

Xrootd Popularity getSingleDSstat doesn't work when ranking by naccess

Hi,

orderby=naccess doesn't work in xrdpopularity/getSingleDSstat, the result is an ORA error message, see e.g.:

https://cms-popularity.cern.ch/popdb/xrdpopularity/getSingleDSstat?orderby=naccess&name=/DoubleElectron/Run2012B-PromptReco-v1/AOD

Database error occurred: ORA-00904: "TA"."NACCESS": invalid identifier (SQL: select round(( TDay-to_date('19700101','YYYYMMDD') )*86400)*1000 as millisecondsSinceEpoch, ta.naccess from ( select trunc(TDay,'DDD') as TDay, sum(numAccesses) as nAcc, round(sum(totCPU)/3600,0) as totCPU, sum(numUsers) as nUsers from CMS_EOS_POPULARITY_SYSTEM.MV_XRD_DS_STAT0_AGGR1 where collName = '/DoubleElectron/Run2012B-PromptReco-v1/AOD' and isUserCMS = 0 group by trunc(TDay,'DDD'), collName ) ta order by TDay )

It looks like a wrong column reference in the API query: the inner select aliases the aggregate as nAcc, but the outer select refers to ta.naccess.

Config file name is hardcoded in victor executable

In the main executable /opt/dq2/lib/dq2/victor/run.py the config file name is hardcoded to /opt/dq2/etc/dq2.cfg.

It would be useful to make it configurable through an argument or environment variable, so we can deploy multiple instances of the Victor agents (e.g. pointing to the prod vs the int DB) on the same machine.
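A sketch of one way run.py could resolve the path; the `DQ2_CONFIG` environment variable and the `--config` flag are invented examples, not existing options:

```python
import argparse
import os

def config_path(argv=None):
    """Resolve the Victor config file: the --config flag wins, then the
    DQ2_CONFIG environment variable, then the currently hardcoded path."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--config',
                        default=os.environ.get('DQ2_CONFIG',
                                               '/opt/dq2/etc/dq2.cfg'))
    return parser.parse_args(argv).config

print(config_path([]))  # /opt/dq2/etc/dq2.cfg, unless DQ2_CONFIG is set
print(config_path(['--config', '/opt/dq2-int/etc/dq2.cfg']))
```

Each deployed instance could then point at its own config (e.g. a prod and an int agent on the same machine) without editing the executable.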
