alibaba / Alink

Alink is a machine learning algorithm platform based on Flink, developed by the PAI team of Alibaba's computing platform.
License: Apache License 2.0
In the LDA online corpus update step, the last sample may be less likely to be sampled.
There are Chinese comments in FeatureLabelUtil.
With very little data, some workers receive no data, and PCA training throws a NullPointerException.
May I ask whether this example, https://github.com/alibaba/Alink/blob/master/pyalink/ftrl_demo.ipynb, has been run through successfully?
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-3-af625ccf78c2> in <module>
13 feature_pipeline.fit(trainBatchData).save(FEATURE_PIPELINE_MODEL_FILE);
14
---> 15 BatchOperator.execute();
16
17 # load pipeline model
~/anaconda3/lib/python3.7/site-packages/pyalink-1.0.1_flink_1.9.0_scala_2.11-py3.7.egg/pyalink/alink/batch/base.pyc in execute(cls)
~/anaconda3/lib/python3.7/site-packages/py4j-0.10.8.1-py3.7.egg/py4j/java_gateway.py in __call__(self, *args)
1284 answer = self.gateway_client.send_command(command)
1285 return_value = get_return_value(
-> 1286 answer, self.gateway_client, self.target_id, self.name)
1287
1288 for temp_arg in temp_args:
~/anaconda3/lib/python3.7/site-packages/py4j-0.10.8.1-py3.7.egg/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:com.alibaba.alink.operator.batch.BatchOperator.execute.
: org.apache.flink.runtime.client.JobExecutionException: Could not retrieve JobResult.
at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:623)
at org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:222)
at org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:91)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:820)
at com.alibaba.alink.operator.batch.BatchOperator.execute(BatchOperator.java:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$2(Dispatcher.java:333)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
... 6 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:152)
at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:375)
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
... 7 more
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: java.net.SocketTimeoutException: connect timed out
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:270)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:907)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:230)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:106)
at org.apache.flink.runtime.scheduler.LegacyScheduler.createExecutionGraph(LegacyScheduler.java:207)
at org.apache.flink.runtime.scheduler.LegacyScheduler.createAndRestoreExecutionGraph(LegacyScheduler.java:184)
at org.apache.flink.runtime.scheduler.LegacyScheduler.<init>(LegacyScheduler.java:176)
at org.apache.flink.runtime.scheduler.LegacySchedulerFactory.createInstance(LegacySchedulerFactory.java:70)
at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:275)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:265)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:146)
... 10 more
Caused by: java.lang.RuntimeException: java.net.SocketTimeoutException: connect timed out
at com.alibaba.alink.operator.common.io.reader.HttpFileSplitReader.getFileLength(HttpFileSplitReader.java:79)
at com.alibaba.alink.operator.common.io.csv.GenericCsvInputFormat.createInputSplits(GenericCsvInputFormat.java:401)
at com.alibaba.alink.operator.common.io.csv.GenericCsvInputFormat.createInputSplits(GenericCsvInputFormat.java:39)
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:256)
... 22 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
at com.alibaba.alink.operator.common.io.reader.HttpFileSplitReader.getFileLength(HttpFileSplitReader.java:61)
... 25 more
The last test (testReadEasModel) in CsvSourceSinkTest.java fails. It is likely that the path cannot be reached from the US.
String path = "http://test-region.oss-cn-hangzhou-zmf.aliyuncs.com/alink_model_export/Alink-Expr-pre-6339/226607/model_data.csv"
Currently, Alink creates the execution environment by itself, which makes it difficult to integrate with other tools and frameworks. It would be nice to allow Alink to use an environment created elsewhere.
As titled: then could we use Alink to process streaming data inside a Flink program?
val env = ExecutionEnvironment.createLocalEnvironment()
val tEnv=BatchTableEnvironment.create(env)
val mle = new MLEnvironment(env, tEnv)
MLEnvironmentFactory.setDefault(mle)
After adding a local environment, running it reports:
Error:(18, 42) Static methods in interface require -target:jvm-1.8
val tEnv=BatchTableEnvironment.create(env)
How can these algorithms be exported as PMML models?
When integrating with Alink, I could not find in the documentation how to save a model trained by an Alink trainer. In my application, I use PipelineModel
to save the trained model as follows, but there is no documentation describing this approach. I also don't know how to save an online learning model.
Pipeline pipeline = new Pipeline().add(vectorAssembler).add(kMeans);
PipelineModel pipelineModel = pipeline.fit(data);
pipelineModel.save("/tmp/feature_pipe_model.csv");
After browsing the examples and documentation, I see that online learning is applied only to shallow models. Without support for deep learning, the coverage is inevitably incomplete.
The official documentation only covers PyAlink. Can other languages be used, such as Java or Scala?
A quick question: is Alink based on the DataStream API or the DataSet API?
The method getFieldTypes of the Flink class TableSchema is annotated as Deprecated (tracked in FLINK-12254, "Expose the new type system through the API"). The method getFieldDataTypes should be used instead of getFieldTypes to adopt the new type system.
Online learning currently supports FTRL for training; will other algorithms be supported later? And a beginner's question: why does online learning only support linear models, and can that limitation be lifted?
TypeError Traceback (most recent call last)
in
----> 1 useLocalEnv(2, flinkHome=None, config=None)
D:\python.venv\lib\site-packages\pyalink-1.0_flink_1.9.0_scala_2.11-py3.7.egg\pyalink\alink\env.pyc in useLocalEnv(parallelism, flinkHome, config)
D:\python.venv\lib\site-packages\pyalink-1.0_flink_1.9.0_scala_2.11-py3.7.egg\pyalink\alink\env.pyc in make_configuration(config)
D:\python.venv\lib\site-packages\pyalink-1.0_flink_1.9.0_scala_2.11-py3.7.egg\pyalink\alink\py4j_util.pyc in get_java_gateway()
D:\python.venv\lib\site-packages\pyalink-1.0_flink_1.9.0_scala_2.11-py3.7.egg\pyalink\alink\py4j_util.pyc in inst(cls)
D:\python.venv\lib\site-packages\pyalink-1.0_flink_1.9.0_scala_2.11-py3.7.egg\pyalink\alink\py4j_util.pyc in init(self)
TypeError: 'JavaPackage' object is not callable
In the HasVectorSize class, the ParamInfo of VECTOR_SIZE uses vectorSize as its name and both vectorSize and inputDim as aliases. This causes an exception when the flink-ml-api dependency is used, because the alias vectorSize duplicates the name. vectorSize should be removed from the aliases of VECTOR_SIZE.
Is there a Docker image or Dockerfile to run Alink?
This is the first step toward supporting multi-version data source connectors.
After packaging the official website example and uploading it to Flink, running ALSExample produced the following error:
2019-12-27 14:56:29
java.lang.Exception: The user defined 'open()' method caused an exception: null
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:499)
at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:157)
at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:107)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NegativeArraySizeException
at com.alibaba.alink.operator.common.recommendation.AlsTrain$9.open(AlsTrain.java:308)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.api.java.operators.translation.WrappingFunction.open(WrappingFunction.java:45)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:495)
... 6 more
I am using the cluster installation method. Why does the Maven build keep failing with the following error?
[ERROR] Failed to execute goal on project alink_core: Could not resolve dependencies for project com.alibaba.alink:alink_core:jar:0.1-SNAPSHOT: Could not transfer artifact net.sf.json-lib:json-lib:jar:2.2.3 from/to alimaven (http://maven.aliyun.com/nexus/content/groups/public): Transfer failed for http://maven.aliyun.com/nexus/content/groups/public/net/sf/json-lib/json-lib/2.2.3/json-lib-2.2.3.jar: Unknown host maven.aliyun.com: Temporary failure in name resolution -> [Help 1]
The dependency package cannot be found, and importing it into the pom file still gives the same error. If anyone knows what to do, please tell me; it's fairly urgent.
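For reference, one common workaround when an Aliyun mirror cannot be reached is to declare Maven Central explicitly as a fallback repository in pom.xml. This is only a sketch: the underlying error here is a DNS failure ("Unknown host maven.aliyun.com"), so fixing name resolution may be needed as well, and json-lib on Central may additionally require a classifier such as jdk15.

```xml
<!-- Hypothetical fallback repository, assuming maven.aliyun.com stays unreachable:
     let Maven try Central directly for artifacts such as net.sf.json-lib:json-lib. -->
<repositories>
  <repository>
    <id>central</id>
    <url>https://repo.maven.apache.org/maven2</url>
  </repository>
</repositories>
```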
This is caused by integer overflow.
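As a minimal illustration (generic code, not Alink's actual ALS implementation): multiplying two int counts before using the product as an array length wraps around silently, and the negative result surfaces as the NegativeArraySizeException seen in the trace above. Widening to long first avoids the wrap.

```java
public class OverflowDemo {
    // 50_000 * 50_000 = 2_500_000_000 does not fit in a 32-bit int,
    // so the product wraps around to a negative value.
    static int unsafeSize(int rows, int cols) {
        return rows * cols; // silent overflow for large inputs
    }

    // Widening one operand to long before multiplying keeps the true value.
    static long safeSize(int rows, int cols) {
        return (long) rows * cols;
    }

    public static void main(String[] args) {
        System.out.println(unsafeSize(50_000, 50_000)); // prints -1794967296
        System.out.println(safeSize(50_000, 50_000));   // prints 2500000000
        try {
            int[] buffer = new int[unsafeSize(50_000, 50_000)];
        } catch (NegativeArraySizeException e) {
            // Same failure mode as the ALS stack trace above.
            System.out.println("NegativeArraySizeException: " + e.getMessage());
        }
    }
}
```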
When trying to interoperate with Flink, I get a table environment mismatch error:
Exception in thread "main" org.apache.flink.table.api.ValidationException: Only tables from the same TableEnvironment can be joined.
```java
package com.alibaba.alink;

import com.alibaba.alink.operator.batch.BatchOperator;
import com.alibaba.alink.operator.batch.dataproc.SplitBatchOp;
import com.alibaba.alink.operator.batch.recommendation.AlsPredictBatchOp;
import com.alibaba.alink.operator.batch.recommendation.AlsTrainBatchOp;
import com.alibaba.alink.operator.batch.source.CsvSourceBatchOp;
import com.alibaba.alink.operator.batch.source.TableSourceBatchOp;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public class WrongDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
        String url = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/movielens_ratings.csv";
        String schema = "userCol bigint, itemCol bigint, rateCol double, timestamp string";
        Tuple3[] fake = new Tuple3[]{
            new Tuple3<Long, Long, Double>(1L, 1L, 0.6),
            new Tuple3<Long, Long, Double>(2L, 2L, 0.8),
            new Tuple3<Long, Long, Double>(2L, 3L, 0.6),
            new Tuple3<Long, Long, Double>(4L, 1L, 0.6),
            new Tuple3<Long, Long, Double>(4L, 2L, 0.3),
        };
        DataSet ds = env.fromElements(fake);
        Table table = tEnv.fromDataSet(ds);
        BatchOperator data = BatchOperator.fromTable(table);
        BatchOperator data3 = new TableSourceBatchOp(table);
        SplitBatchOp spliter = new SplitBatchOp().setFraction(0.8);
        BatchOperator data2 = new CsvSourceBatchOp()
            .setFilePath(url).setSchemaStr(schema);
        System.out.println(data.getDataSet().getExecutionEnvironment());
        System.out.println(BatchOperator.getExecutionEnvironmentFromDataSets(ds));
        System.out.println(BatchOperator.getExecutionEnvironmentFromOps(data2));
        System.out.println(BatchOperator.getExecutionEnvironmentFromOps(data3));
        System.out.println(BatchOperator.getExecutionEnvironmentFromOps(spliter.linkFrom(data2)));
        spliter.linkFrom(data);
        BatchOperator trainData = spliter;
        BatchOperator testData = spliter.getSideOutput(0);
        AlsTrainBatchOp als = new AlsTrainBatchOp()
            .setUserCol("userid").setItemCol("movieid").setRateCol("rating")
            .setNumIter(10).setRank(10).setLambda(0.1);
        BatchOperator model = als.linkFrom(trainData);
        AlsPredictBatchOp predictor = new AlsPredictBatchOp()
            .setUserCol("userid").setItemCol("movieid").setPredictionCol("prediction_result");
        BatchOperator preditionResult;
        preditionResult = predictor.linkFrom(model, testData).select("rating, prediction_result");
        preditionResult.print();
    }
}
```
Hi,
I found the current implementation of UDF/UDTF operators has the following limitations:
Moreover, in the current code, if the same column name is used in selectedCol(s), reservedCols, and outputCols, the resulting behavior is not very clear.
Maybe we can rephrase this as three rules, and the implementation should be aligned with them:
I have never learned Python; I'm a beginner here. I set up a Python 3.5 environment and downloaded the package, but running easy_install as instructed fails. The official instruction is easy_install [path]/pyalink-0.0.1-py3.*.egg; mine is easy_install /usr/apps/pyalink-1.0_flink_1.9.0_scala_2.11-py3.5.egg. Why has it been stuck for so long at: Couldn't find index page for 'pyalink' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.org/simple/
It hung for more than half an hour and finally failed, saying pyalink-1.0_flink_1.9.0_scala_2.11-py3.5 could not be found. Has anyone installed it successfully? I have tried many times without luck. Does anyone understand this? Waiting online; it's fairly urgent.
Modify the naive Bayes algorithm to work as a text classifier.
Not using the cluster version.
Modify the cutsArray param in Bucketizer.
Modify the handleInvalid param in VectorSizeHint.
Need to change the OSS links to accessible ones.
Neither mvn.aliyun.com nor Maven Central hosts it.
I tried to save the model using this code:
featurePipeline.save(FEATURE_PIPELINE_MODEL_FILE)
The pipeline model is saved to a CSV file.
Is there a way to save the model to MySQL and load it back?
Hello everyone, I have a question. After reading a CSV with CsvSourceStreamOp, when the CSV content changes, the printed content does not update. My code is below; please help me out, thank you!
```python
#!/usr/bin/python
from pyalink.alink import *
import numpy as np
import pandas as pd

useLocalEnv(1)

schemaStr = "A double, B double, C double, D double, label string"
path = "D:/program/Alink_script/data/iris2_1_stream.csv"
csvSource = CsvSourceStreamOp().setFilePath(path).setSchemaStr(schemaStr).setFieldDelimiter(",")
csvSource.print()
sample = SampleStreamOp().setRatio(0.01).linkFrom(csvSource)
sample.print(key="csvSource", refreshInterval=1)
StreamOperator.execute()
```
Most Python packages (currently more than 200 thousand) can be easily installed with pip, making it easier to integrate new packages like yours into large existing Python development environments.
This, however, would require uploading your package to PyPI, which I think will be straightforward for this package, given that it is already bundled in the .egg format.
Modify the Chinese and English documents of QuantileDiscretizer, OnehotEncoder, Bucketizer.
When I tested Kafka as the sink, something went wrong.
The code is as follows:
```java
public class writeKafka {
    private static void write() throws Exception {
        String URL = "data/iris.csv";
        String SCHEMA_STR
            = "sepal_length double, sepal_width double, petal_length double, petal_width double, category string";
        CsvSourceStreamOp data = new CsvSourceStreamOp().setFilePath(URL).setSchemaStr(SCHEMA_STR);
        Kafka011SinkStreamOp sink = new Kafka011SinkStreamOp()
            .setBootstrapServers("127.0.0.1:9092")
            .setDataFormat("json")
            .setTopic("iris");
        data.link(sink);
        StreamOperator.execute();
    }

    public static void main(String[] args) throws Exception {
        write();
    }
}
```
It fails like this:
09:54:15,518 WARN org.apache.kafka.common.utils.AppInfoParser - Error registering AppInfo mbean
javax.management.InstanceAlreadyExistsException: kafka.producer:type=app-info,id=kafka_011_-6885371916521496791
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.kafka.common.utils.AppInfoParser.registerAppInfo(AppInfoParser.java:58)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:335)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:191)
at com.alibaba.alink.operator.common.io.kafka.KafkaBaseOutputFormat.getKafkaProducer(KafkaBaseOutputFormat.java:160)
at com.alibaba.alink.operator.common.io.kafka.KafkaBaseOutputFormat.open(KafkaBaseOutputFormat.java:166)
at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:65)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:48)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:529)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:393)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
I don't know what caused it. That said, I found that the messages were actually produced normally.
Double-checked locking is widely cited and used as an efficient method for implementing lazy initialization in a multithreaded environment.
Unfortunately, it will not work reliably in a platform-independent way when implemented in Java without additional synchronization.
Declaring the singleton field volatile offers a much more elegant solution.
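A generic sketch of the fix being suggested (illustrative code, not a specific Alink class): the singleton field is declared volatile so the double-checked fast path cannot observe a partially constructed instance.

```java
public class LazySingleton {
    // volatile is what makes double-checked locking safe under the Java Memory
    // Model (Java 5+): it forbids the reordering that could publish a partially
    // constructed instance to another thread.
    private static volatile LazySingleton instance;

    private LazySingleton() {}

    public static LazySingleton getInstance() {
        LazySingleton local = instance;          // one volatile read on the fast path
        if (local == null) {
            synchronized (LazySingleton.class) { // slow path taken at most once
                local = instance;
                if (local == null) {
                    instance = local = new LazySingleton();
                }
            }
        }
        return local;
    }
}
```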
First, I can't get the alink_core jar.
So I imported alink_open-0.1...jar and alink_python-0.1..jar from the pyalink lib folder into my project.
When I test the GBDTExample in my project, I get the error described in the title.
The BaseSourceStreamOp class does not support Kafka sources for versions 0.8, 0.9, or 1.0; only the 0.11 line is supported, via Kafka011SourceStreamOp. The other Kafka source versions should be supported as well.
Hello everyone, I'd like to ask a question. I read data from Kafka via Kafka011SourceStreamOp, but the result is an Operator with a single message column. I want to parse the message (for example, split on commas) into a new Operator with multiple columns. How can I do that? Part of my code is below; thanks for any help!
```python
#!/usr/bin/python
from pyalink.alink import *
import numpy as np
import pandas as pd

useLocalEnv(1)

bootstrapServer = "10.124.165.91:9092"
topic = "test_alink1"
data = Kafka011SourceStreamOp().setBootstrapServers(bootstrapServer).setTopic(topic).setStartupMode("EARLIEST").setGroupId("")
# Here I can get the message column from Kafka, but I want to parse it into a new
# Operator without converting to a DataFrame and back, since the README warns that
# frequently switching between DataFrame and Operator hurts efficiency.
data.select("message").print()
StreamOperator.execute()
```