
alibaba / easyrec


A framework for large scale recommendation algorithms.

License: Apache License 2.0

Python 96.42% Shell 1.22% Lua 1.81% Dockerfile 0.08% C++ 0.47%
recommendation-algorithms recommender-system dssm esmm mind deepfm dlrm autoint din eges

easyrec's Introduction

EasyRec Introduction

 

What is EasyRec?


EasyRec is an easy-to-use framework for Recommendation

EasyRec implements state-of-the-art deep learning models used in common recommendation tasks: candidate generation (matching), scoring (ranking), and multi-task learning. It makes building high-performance models more efficient through simple configuration and hyperparameter tuning (HPO).

 

Get Started

Running Platform:

 

Why EasyRec?

Run everywhere

Diversified input data

Simple to config

It is smart

Large scale and easy deployment

  • Support large scale embedding and online learning
  • Many parallel strategies: ParameterServer, Mirrored, MultiWorker
  • Easy deployment to EAS: automatic scaling, easy monitoring
  • Consistency guarantee: train and serving

A variety of models

Easy to customize

Fast vector retrieval

 

Document

 

Contribute

Any contributions you make are greatly appreciated!

  • Please report bugs by submitting a GitHub issue.
  • Please submit contributions using pull requests.
  • Please refer to the Development document for more details.

 

Cite

If EasyRec is useful for your research, please cite:

@article{Cheng2022EasyRecAE,
  title={EasyRec: An easy-to-use, extendable and efficient framework for building industrial recommendation systems},
  author={Mengli Cheng and Yue Gao and Guoqiang Liu and Hongsheng Jin and Xiaowen Zhang},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.12766}
}

 

Contact

Join Us

  • DingDing Group: 32260796. (EasyRec usage general discussion.)
  • DingDing Group2: 37930014162, click this url or scan the QR code to join.
  • Email Group: [email protected].

Enterprise Service

  • If you need EasyRec enterprise service support or want to purchase cloud product services, you can contact us through the DingDing Group.

 

License

EasyRec is released under Apache License 2.0. Please note that third-party libraries may not have the same license as EasyRec.

easyrec's People

Contributors

0xflotus, chenglongliu123, chengmengli06, cosmozhang1995, dawn310826, kinghuin, lgqfhwy, livmortis, muxuezi, paradisehit, poson, tiankongdeguiji, weidankong, wenyangchou, wwxxzz, xsank, yangxudong, yjjinjie


easyrec's Issues

ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ].

Instance 20211017182224640gq00ata2 Failed.
FAILED: Failed 20211017182228723gepjc292_b9f620e9_506a_48cd_ba2d_d0dad28d8e24:ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ].

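This happens during distributed training. Does it mean resources are insufficient? The resource monitor shows that CPU and memory usage are not at 100%.
The hash bucket size is very large, in the tens of millions.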
-Dcluster="{"worker":{"count":8,"gpu":0,"cpu":1500,"memory":60000},"ps":{"count":8,"cpu":400,"memory":10000}}"

After deploying the EasyRec WDL example from DSW to EAS, how do I send requests?

#!/usr/bin/env python
import time

from eas_prediction import PredictClient
from eas_prediction import TFRequest
from eas_prediction import ENDPOINT_TYPE_DIRECT

client = PredictClient('http://xxxx.cn-beijing.pai-eas.aliyuncs.com', 'zhl_deemfm')
client.set_token('M2Fxxxx')
client.init()
req = TFRequest('serving_default')

# Feed every input as a string tensor of shape [1]
names = ["c1", "banner_pos", "site_id", "site_domain", "site_category", "app_id", "app_domain", "app_category",
         "device_id", "device_ip", "device_model", "device_type", "device_conn_type", "hour", "c14", "c15", "c16",
         "c17", "c18", "c19", "c20", "c21"]
for name in names:
  req.add_feed(name, [1], TFRequest.DT_STRING, [bytes("1", "utf-8")])
req.add_fetch('probs')

# Send 10 requests and measure the total elapsed time
start = time.time()
for _ in range(10):
  resp = client.predict(req)
timer = time.time() - start

print(resp)
print(resp.get_values('probs'))
print(resp.get_tensor_shape('probs'))
print("average response time: %s s" % (timer / 10))

AssertionError: sep[b','] maybe invalid: field_num=7, required_num=131

`2020-07-30 12:10:38.673426: W tensorflow/core/framework/op_kernel.cc:1261] Unknown: AssertionError: sep[b','] maybe invalid: field_num=7, required_num=131
Traceback (most recent call last):

File "/apsarapangu/disk3/mengli.cml/anaconda3/envs/tf_12_py36/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
ret = func(*args)

File "/apsarapangu/disk3/mengli.cml/easy-rec/easy_rec/python/input/csv_input.py", line 21, in _check_data
(sep, field_num, len(record_defaults))

AssertionError: sep[b','] maybe invalid: field_num=7, required_num=131`
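
The assertion above compares the number of fields found in a record (`field_num=7`) with the number the input config expects (`required_num=131`). A minimal sketch of the same check, run offline on one sample line, can help tell a wrong separator from a row that simply has too few columns. The function names, the sample row, and the list of candidate separators are assumptions for illustration and are not part of EasyRec:

```python
def count_fields(line, sep=","):
    """Return how many fields one record has when split on `sep`."""
    return len(line.rstrip("\r\n").split(sep))


def matching_separators(line, required_num, candidates=(",", "\t", ";", "|", "\x01", "\x02")):
    """Return the candidate separators that split `line` into `required_num` fields."""
    return [sep for sep in candidates if count_fields(line, sep) == required_num]


# Hypothetical tab-separated row checked against a comma separator.
sample = "1\t0.5\tabc\t7"
print(count_fields(sample, ","))       # field count with the configured separator
print(matching_separators(sample, 4))  # separators that yield the required count
```

If no candidate separator yields the required count, the data file and the field list in the config probably disagree on the number of columns.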

DIN with two sequence features fails: Input 1 has shape [4096 576 16] and doesn't match input 0 with shape [4096 2810 16].

InvalidArgumentError (see above for traceback): All dimensions except 2 must match. Input 1 has shape [4096 576 16] and doesn't match input 0 with shape [4096 2810 16].
[[node gradients/concat_5_grad/ConcatOffset (defined at /worker/tensorflow_jobs/easy_rec/python/compat/optimizers.py:249) = ConcatOffset[N=2, _class=["loc:@gradients/concat_5_grad/Slice"], _device="/job:worker/replica:0/task:6/device:CPU:0"](gradients/concat_6_grad/mod, gradients/concat_5_grad/ShapeN, gradients/concat_5_grad/ShapeN:1)]]
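
The error says the two tensors are concatenated along dimension 2, so dimensions 0 and 1 must be equal, yet dimension 1 is 2810 for one input and 576 for the other. The sketch below reproduces that rule in plain Python so the shapes from the log can be checked by hand. `concat_shape` is a hypothetical helper, not a TensorFlow or EasyRec function:

```python
def concat_shape(shapes, axis):
    """Return the shape produced by concatenating tensors of `shapes` along `axis`.

    Raises ValueError when ranks differ or any dimension other than `axis` differs.
    """
    result = list(shapes[0])
    for shape in shapes[1:]:
        if len(shape) != len(result):
            raise ValueError("rank mismatch: %s vs %s" % (list(shapes[0]), list(shape)))
        for dim, (a, b) in enumerate(zip(shapes[0], shape)):
            if dim != axis and a != b:
                raise ValueError("All dimensions except %d must match: %s vs %s"
                                 % (axis, list(shapes[0]), list(shape)))
        result[axis] += shape[axis]
    return result


seq_a = [4096, 2810, 16]  # shapes taken from the error message
seq_b = [4096, 576, 16]

try:
    concat_shape([seq_a, seq_b], axis=2)
except ValueError as err:
    print("axis=2 fails:", err)

print("axis=1 works:", concat_shape([seq_a, seq_b], axis=1))
```

Whether concatenating along dimension 1 is the right fix depends on the model; the sketch only shows which combination of shapes and axis is valid.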

Killing container master.RMCommunicator (RMCommunicator.java:onContainersCompleted(40))

2020-09-04 11:43:51,536 INFO [AMRM Callback Handler Thread] master.RMCommunicator (RMCommunicator.java:onContainersCompleted(40)) - got container status for containerID=container_1598507699008_0094_01_000010, state=COMPLETE, exitStatus=-104, diagnostics=Container [pid=8788,containerID=container_1598507699008_0094_01_000010] is running beyond physical memory limits. Current usage: 9.8 GB of 9.8 GB physical memory used; 43.0 GB of 47.6 TB virtual memory used. Killing container.
Dump of the process-tree for container_1598507699008_0094_01_000010 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 8795 8788 8788 8788 (java) 2011 1219 2151108608 64952 /usr/lib/jvm/java-1.8.0/bin/java -Xmx256M com.aliyu

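How do I configure a combo feature in model config?

How do I add a combo feature (a crossed feature) to model config? What should the feature name in the model config's feature group be, given that a combo feature is a combination of two independent features?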

KeyError: 'device_make'

  File "/usr/lib/python3.7/site-packages/easy_rec/python/feature_column/feature_column.py", line 46, in __init__
    self.parse_id_feaure(config)
  File "/usr/lib/python3.7/site-packages/easy_rec/python/feature_column/feature_column.py", line 119, in parse_id_feaure
    if self.is_wide(config):
  File "/usr/lib/python3.7/site-packages/easy_rec/python/feature_column/feature_column.py", line 86, in is_wide
    return self._wide_deep_dict[feature_name] in [ WideOrDeep.WIDE,
KeyError: 'device_make'

Word mistake

PAI-DSW DEMO (Rember to select Python 3 kernel)

Rember->Remember

Config file reports a brace error

[[ ------------------Disable OneDNN--------------------- ]]
Init odps proxy io environment success.
[2021-10-21 10:31:36.850331] [INFO] [78#78] [paiio/cc/platform/odps_io_manager/odps_io_config.cc:85] Odps environment init done.
[2021-10-21 10:31:53,176] [INFO] [78#MainThread] [tensorflow/python/util/auto_strategy_utils.py:108] Disable Auto Strategy.
[2021-10-21 10:31:53,176][INFO] Disable Auto Strategy.
[2021-10-21 10:31:53,177][INFO] set on pai environment variable: IS_ON_PAI
Traceback (most recent call last):
File "run.py", line 508, in <module>
tf.app.run()
File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 128, in run
_sys.exit(main(argv))
File "run.py", line 357, in main
pipeline_config = config_util.get_configs_from_pipeline_file(config, False)
File "/worker/tensorflow_jobs/easy_rec/python/utils/config_util.py", line 48, in get_configs_from_pipeline_file
text_format.Merge(config_str, pipeline_config)
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 735, in Merge
allow_unknown_field=allow_unknown_field)
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 803, in MergeLines
return parser.MergeLines(lines, message)
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 828, in MergeLines
self._ParseOrMerge(lines, message)
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 850, in _ParseOrMerge
self._MergeField(tokenizer, message)
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 923, in _MergeField
name = tokenizer.ConsumeIdentifierOrNumber()
File "/worker/venv/lib/python2.7/site-packages/google/protobuf/text_format.py", line 1392, in ConsumeIdentifierOrNumber
raise self.ParseError('Expected identifier or number, got %s.' % result)
google.protobuf.text_format.ParseError: 844:1 : '}': Expected identifier or number, got }.
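
The `ParseError` points at line 844, column 1 of the config, where the parser met a `}` it did not expect. One way to locate such a problem before submitting the job is a rough brace-balance scan of the config text. The sketch below is an assumption-laden helper, not part of EasyRec or protobuf: it skips `#` comments and quoted strings in a simple way and does not handle escaped quotes.

```python
def find_unbalanced_braces(text):
    """Return (line, column, char) for every brace that has no partner."""
    problems = []
    open_positions = []  # positions of '{' still waiting for a '}'
    for line_no, line in enumerate(text.splitlines(), start=1):
        quote = None
        for col, ch in enumerate(line, start=1):
            if quote:
                if ch == quote:
                    quote = None
                continue
            if ch in "\"'":
                quote = ch
            elif ch == "#":
                break  # rest of the line is a comment
            elif ch == "{":
                open_positions.append((line_no, col))
            elif ch == "}":
                if open_positions:
                    open_positions.pop()
                else:
                    problems.append((line_no, col, "}"))
    problems.extend((line_no, col, "{") for line_no, col in open_positions)
    return problems


# Hypothetical config fragment with one closing brace too many.
config_text = """model_config {
  feature_groups {
    group_name: "deep"
  }
}
}
"""
print(find_unbalanced_braces(config_text))
```

An empty result means the braces balance; it does not prove the config is otherwise valid.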
Failed to execute system command. (exit code: 251.)

How do I download the data used in the EasyRec repository?

When I run a command such as CUDA_VISIBLE_DEVICES=0 python -m easy_rec.python.train_eval --pipeline_config_path custom_config/dssm_hard_neg_sampler_on_taobao.config, the data cannot be downloaded.

http status code: 400, error code: InvalidRequest, message: It is forbidden to copy appendable object in versioning state,

Command
pai -name easy_rec_ext
-project algo_public_dev
-Dres_project=algo_public_dev
-Dconfig=oss://yanzhen1/easy_rec_test/deepfm.config
-Dcmd=export
-Dexport_dir=oss://yanzhen1/easy_rec_test/export/
-Dcluster='{"worker" : {"count":1, "cpu":1000, "memory":40000}}'
-Darn=acs:ram::1730760139076263:role/aliyunodpspaidefaultrole
-Dbuckets=oss://yanzhen1/
-DossHost=oss-cn-beijing-internal.aliyuncs.com;
Error
Traceback (most recent call last):
File "run.py", line 252, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 128, in run
_sys.exit(main(argv))
File "run.py", line 246, in main
easy_rec.export(FLAGS.export_dir, config, FLAGS.checkpoint_path)
File "/worker/tensorflow_jobs/easy_rec/python/main.py", line 350, in export
export_dir_base=export_dir, serving_input_receiver_fn=serving_input_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 694, in export_savedmodel
mode=model_fn_lib.ModeKeys.PREDICT)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 820, in _export_saved_model_for_mode
strip_default_attrs=strip_default_attrs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 959, in _export_all_saved_models
gfile.Rename(temp_export_dir.decode("utf-8") + '/', export_dir)
File "/usr/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 456, in rename
compat.as_bytes(oldname), compat.as_bytes(newname), overwrite, status)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: req_id: 5F225E6AD0E798313135AFAF, http status code: 400, error code: InvalidRequest, message: It is forbidden to copy appendable object in versioning state, oss host:oss-cn-beijing-internal.aliyuncs.com, path:/yanzhen1/easy_rec_test/export_tmp/temp-1596087897/assets/pipeline.config.

DSSM often hits loss = NaN

run_metadata=run_metadata))
File "/home/xin/anaconda3/envs/tf12/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

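The loss becomes NaN at the second step every time. I am using the taobao data, unmodified.
What I observe is that almost any training run on the taobao data can hit NaN, and it shows up within 2000 steps.
The code is the latest pulled directly from this repository, and so is the data, so the versions should match.
TensorFlow and Python versions: tf1.12, py3.6
Steps to reproduce: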

  1. cd EasyRec-master
  2. wget https://easyrec.oss-cn-beijing.aliyuncs.com/data/easyrec_data_20210818.tar.gz
  3. tar -zxvf easyrec_data_20210818.tar.gz
  4. pip install -r requirements.txt
  5. bash scripts/ci_test.sh
  6. CUDA_VISIBLE_DEVICES=0 python -m easy_rec.python.train_eval --pipeline_config_path samples/model_config/dssm_neg_sampler_on_taobao.config
    The data is identical as well.

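EasyRec training is slow. What can I do, and which parameters should I set?

Either add compute resources, for example:
(1) Increase the number of ps from 1 to 2, then to 4; do not add too many at once.
(2) Set the worker cpu to 1600 to increase data-processing parallelism.
(3) Increase the number of workers.

Or shrink the network: merge the item and room networks, and make the final dnn smaller.

Or set sync_replicas: false to switch from synchronous to asynchronous training.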
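
The resource suggestions above map onto the `-Dcluster` argument of the `pai` command. As a rough sketch, the JSON value can be generated rather than typed by hand, which avoids quoting mistakes while stepping the ps count from 1 to 2 to 4. `cluster_arg` is a hypothetical helper; the worker cpu of 1600 comes from the advice above, while the worker memory and ps cpu defaults are assumptions copied from other commands on this page:

```python
import json


def cluster_arg(ps_count, worker_count, worker_cpu=1600, worker_memory=40000, ps_cpu=1000):
    """Build a -Dcluster argument string for the pai command."""
    spec = {
        "ps": {"count": ps_count, "cpu": ps_cpu},
        "worker": {"count": worker_count, "cpu": worker_cpu, "memory": worker_memory},
    }
    # Compact separators keep the JSON free of spaces for shell use.
    return "-Dcluster='%s'" % json.dumps(spec, separators=(",", ":"))


# Increase ps gradually instead of jumping to a large count.
for ps_count in (1, 2, 4):
    print(cluster_arg(ps_count, worker_count=3))
```

Changing one number per run makes it easier to see which resource was actually the bottleneck.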

Does the PAI command below come from the same source as the latest easy_rec installation package or SDK?

pai -name easy_rec_ext -project algo_public
-Dcmd=train
-Dconfig=oss://easyrec/config/MultiTower/dwd_avazu_ctr_deepmodel_ext.config
-Dtables=odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_deepmodel_test
-Dcluster='{"ps":{"count":1, "cpu":1000}, "worker" : {"count":3, "cpu":1000, "gpu":100, "memory":40000}}'
-Dwith_evaluator=1
-Dmodel_dir=oss://easyrec/ckpt/MultiTower
-Darn=acs:ram::xxx:role/xxx
-Dbuckets=oss://easyrec/
-DossHost=oss-cn-beijing-internal.aliyuncs.com;

tensorflow.python.framework.errors_impl.UnimplementedError: GetChildren not implemented

Traceback (most recent call last):
  File "run.py", line 300, in <module>
    tf.app.run()
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 128, in run
    _sys.exit(main(argv))
  File "run.py", line 243, in main
    train_and_evaluate_impl(pipeline_config, continue_train=FLAGS.continue_train)
  File "/worker/tensorflow_jobs/easy_rec/python/main.py", line 289, in _train_and_evaluate_impl
    _train_and_evaluate(estimator, train_spec, eval_spec)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 637, in run
    getattr(self, task_to_run)()
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 642, in run_chief
    return self._start_distributed_training()
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 385, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1242, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1337, in _train_model_default
    input_fn, model_fn_lib.ModeKeys.TRAIN))
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1107, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1194, in _call_input_fn
    return input_fn(**kwargs)
  File "/worker/tensorflow_jobs/easy_rec/python/input/input.py", line 315, in _input_fn
    dataset = self._build(mode, params)
  File "/worker/tensorflow_jobs/easy_rec/python/input/csv_input.py", line 52, in _build
    file_paths = tf.gfile.Glob(self._input_path)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 385, in get_matching_files
    compat.as_bytes(single_filename), status)
  File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: GetChildren not implemented
Failed to execute system command. (exit code: 123.)

raise NanLossDuringTrainingError

File "/worker/venv/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 792, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Failed to execute system command. (exit code: 251.)

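Some configuration-like options are set in the config file while others are passed as command-line arguments; this should be unified

Export command
Local
python -m easy_rec.python.export --pipeline_config_path dwd_avazu_ctr_deepmodel.config --export_dir ./export
--pipeline_config_path: path to the config file

--model_dir: if specified, overrides model_dir in the config; typically used for periodic scheduled runs

--export_dir: the export directory

For example, export_dir here can only be given as an argument, which is not advisable; it would be better for it to use the same overridable mode as model_dir.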
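
The "overridable mode" requested here can be sketched as a single rule applied to every such option: use the command-line value when given, otherwise fall back to the config file. This is only an illustration of the proposal, not how `easy_rec.python.export` is implemented; `resolve_option` and the dict-based `config` (including its `export_dir` key) are hypothetical stand-ins for the real pipeline config:

```python
import argparse


def resolve_option(cli_value, config_value, name):
    """Prefer the command-line value and fall back to the config value."""
    if cli_value:
        return cli_value
    if config_value:
        return config_value
    raise ValueError("%s must be set in the config file or on the command line" % name)


parser = argparse.ArgumentParser()
parser.add_argument("--pipeline_config_path", required=True)
parser.add_argument("--model_dir", default=None)
parser.add_argument("--export_dir", default=None)

# Hypothetical parsed config; the real pipeline config is a protobuf message.
config = {"model_dir": "experiments/ckpt", "export_dir": "experiments/export"}

args = parser.parse_args(["--pipeline_config_path", "dwd_avazu_ctr_deepmodel.config",
                          "--export_dir", "./export"])
model_dir = resolve_option(args.model_dir, config.get("model_dir"), "model_dir")
export_dir = resolve_option(args.export_dir, config.get("export_dir"), "export_dir")
print(model_dir, export_dir)
```

With one rule for both options, a scheduled job can override either path without editing the config file.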

Memory setting too small, job gets killed

xargs: /worker/venv/bin/python: terminated by signal 9
Failed to execute system command. (exit code: 253.)
The job has been killed by "OOM Killer", please check your job's memory usage.
total-vm:30193768kB, anon-rss:19770320kB, file-rss:0kB, shmem-rss:0kB
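
The last line of the OOM message reports memory in kB. Converting it to GiB makes it easier to compare against the `memory` value requested in `-Dcluster`. The sketch below parses that line and converts the values; the helper names are assumptions, and the unit used by the `memory` field in `-Dcluster` is not stated on this page, so check it before comparing:

```python
import re


def parse_oom_usage(line):
    """Extract 'name:<number>kB' pairs from an OOM Killer report line."""
    return {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_gib(kb):
    """Convert kB as reported by the kernel (1 kB = 1024 bytes) to GiB."""
    return kb / (1024.0 * 1024.0)


line = "total-vm:30193768kB, anon-rss:19770320kB, file-rss:0kB, shmem-rss:0kB"
usage = parse_oom_usage(line)
for name, kb in usage.items():
    print("%s: %.2f GiB" % (name, kb_to_gib(kb)))
```

The `anon-rss` figure is the resident memory the process actually held when it was killed, so it gives a lower bound for the memory the worker needs.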

