Comments (15)
Yes, currently the library only supports the predefined language analyzers. Let me think once more about whether I can fix this issue.
The main problem with the analyzers is that they are not serializable, and this causes a lot of issues.
from spark-lucenerdd.
Yes, analyzers in Lucene are not serializable.
@yeikel, pushed the feature in version 0.3.7-SNAPSHOT. You can try it out and let me know if there are any issues.
Besides the idea of wrapping them in some other type, I am not sure what else we can do.
I think a good option is to allow the user to specify the analyzer as a String using its fully qualified class name, e.g.,
my.cool.analyzers.CustomAnalyzer
which would be loaded using Java reflection. I think this would allow you to use any analyzer, and it requires only small modifications to the codebase. WDYT?
See, e.g. http://tutorials.jenkov.com/java-reflection/dynamic-class-loading-reloading.html
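The reflection approach suggested above can be sketched in plain Scala: ship the analyzer's fully qualified class name as a plain (serializable) String and instantiate it on the other side with `Class.forName`. Note that `Greeter`/`LoudGreeter` below are hypothetical stand-ins for Lucene's `Analyzer` and a custom subclass, not spark-lucenerdd's actual API.

```scala
// Hypothetical stand-in for Lucene's Analyzer base class.
trait Greeter {
  def greet(name: String): String
}

// Hypothetical stand-in for a user-defined analyzer subclass.
class LoudGreeter extends Greeter {
  override def greet(name: String): String = s"HELLO, ${name.toUpperCase}!"
}

object ReflectiveLoader {
  // Load a class by name and call its no-argument constructor; this is
  // what an executor could do with a class name received as a String,
  // sidestepping the need to serialize the analyzer instance itself.
  def instantiate[T](className: String): T =
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[T]
}

object ReflectionDemo {
  def main(args: Array[String]): Unit = {
    // classOf[...].getName yields the binary class name to look up.
    val className = classOf[LoudGreeter].getName
    val greeter   = ReflectiveLoader.instantiate[Greeter](className)
    println(greeter.greet("world"))
  }
}
```

The trade-off is that the analyzer class must be on the executors' classpath and expose a no-argument constructor, but the String itself serializes trivially.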
It is definitely an option.
Could you please clarify why passing an instance of the analyzer is not an option? Serialization issues?
It is very unfortunate that we need to do this.
The workaround sounds fair to me. Let's do it.
I believe that queryAnalyzer is not working as expected. Please see the example below:
// Imports assume Lucene 7.x; package locations differ across Lucene versions.
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.standard.StandardTokenizer
import org.apache.lucene.analysis.{Analyzer, CharArraySet, LowerCaseFilter, StopFilter, TokenFilter, Tokenizer}

import scala.collection.JavaConverters._
import scala.io.Source.fromInputStream

class StandardWithStopWords extends Analyzer {

  // Stop words loaded once from a resource file on the classpath.
  private val STOP_WORDS =
    fromInputStream(getClass.getResourceAsStream("/stop-words/list.txt")).getLines().toSet

  override def createComponents(fieldName: String): TokenStreamComponents = {
    val source: Tokenizer = new StandardTokenizer()
    val lowerCase = new LowerCaseFilter(source)
    val finalTokens = removeStopWords(lowerCase)
    new TokenStreamComponents(source, finalTokens)
  }

  private def removeStopWords(l: TokenFilter): StopFilter = {
    System.out.println("Removing stop words")
    // CharArraySet expects a java.util.Collection, hence the .asJava conversion.
    val charSet = new CharArraySet(STOP_WORDS.asJava, true)
    new StopFilter(l, charSet)
  }
}
val A = Seq(("Googlex", "123 Main Street", "US")).toDF("name", "address", "mkt_cd")
val B = Seq(
  ("Google", "123 Main Street", "US"),
  ("Googly", "123 Main Street", "US")
).toDF("name", "address", "mkt_cd")
val linkedResults = LuceneRDD.blockEntityLinkage(
  A,
  B,
  linkerQuery,
  blockingFields,
  blockingFields,
  1000,
  indexAnalyzer = "lucene.StandardWithStopWords",
  queryAnalyzer = "lucene.StandardWithStopWords"
)
19/03/19 10:28:29 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 10:28:29 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0] Partition is created...
19/03/19 10:28:29 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process initiated at 2019-03-19T10:28:29.288-04:00...
19/03/19 10:28:29 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
Removing stop words
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process completed at 2019-03-19T10:28:29.930-04:00...
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process took 0 seconds...
19/03/19 10:28:30 INFO LuceneRDDPartition: [partId=0]Indexed 2 documents
19/03/19 10:28:30 INFO Linker: +((name:Googlex~2)~1) +((address:123 Main Street~2)~2)
19/03/19 10:28:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1677 bytes result sent to driver
19/03/19 10:28:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 1449 ms on localhost (executor driver) (1/1)
19/03/19 10:28:30 INFO DAGScheduler: ResultStage 2 (show at BlockLinkageAmex.scala:226) finished in 1.557 s
19/03/19 10:28:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/19 10:28:30 INFO DAGScheduler: Job 0 finished: show at BlockLinkageAmex.scala:226, took 2.664930 s
19/03/19 10:28:30 INFO SparkContext: Starting job: show at BlockLinkageAmex.scala:226
19/03/19 10:28:30 INFO DAGScheduler: Got job 1 (show at BlockLinkageAmex.scala:226) with 1 output partitions
19/03/19 10:28:30 INFO DAGScheduler: Final stage: ResultStage 5 (show at BlockLinkageAmex.scala:226)
19/03/19 10:28:30 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3, ShuffleMapStage 4)
19/03/19 10:28:30 INFO DAGScheduler: Missing parents: List()
19/03/19 10:28:30 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[20] at show at BlockLinkageAmex.scala:226), which has no missing parents
19/03/19 10:28:30 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 43.2 KB, free 1989.5 MB)
19/03/19 10:28:30 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.6 KB, free 1989.5 MB)
19/03/19 10:28:30 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on FLL39P2SN2.ads.aexp.com:51436 (size: 14.6 KB, free: 1989.6 MB)
19/03/19 10:28:30 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1039
19/03/19 10:28:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[20] at show at BlockLinkageAmex.scala:226) (first 15 tasks are for partitions Vector(1))
19/03/19 10:28:30 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
19/03/19 10:28:30 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 4, localhost, executor driver, partition 1, PROCESS_LOCAL, 7712 bytes)
19/03/19 10:28:30 INFO Executor: Running task 0.0 in stage 5.0 (TID 4)
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 10:28:30 INFO Executor: Finished task 0.0 in stage 5.0 (TID 4). 1527 bytes result sent to driver
19/03/19 10:28:30 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 4) in 36 ms on localhost (executor driver) (1/1)
19/03/19 10:28:30 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
19/03/19 10:28:30 INFO DAGScheduler: ResultStage 5 (show at BlockLinkageAmex.scala:226) finished in 0.050 s
19/03/19 10:28:30 INFO DAGScheduler: Job 1 finished: show at BlockLinkageAmex.scala:226, took 0.065966 s
+-------+--
Unless I am missing something, I believe I should see two events in my logs for System.out.println("Removing stop words"): one at index time and another at search time.
I see now. You need to invoke an action, since the RDD is not computed otherwise (lazy evaluation), i.e., linkedResults.count().
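The lazy-evaluation point above can be illustrated without Spark at all: Scala's Iterator is lazy in the same way an RDD transformation is, so a side effect inside a map (like the println in the analyzer) only fires when a terminal operation, the "action", consumes the pipeline. This is a plain-Scala analogy, not spark-lucenerdd code.

```scala
object LazyDemo {
  // Counts how many times the map body actually ran.
  var evaluations = 0

  // Like an RDD transformation: defining this does no work yet.
  def transformed: Iterator[Int] =
    Iterator(1, 2, 3).map { x =>
      evaluations += 1 // side effect, like the println in the analyzer
      x * 2
    }

  def main(args: Array[String]): Unit = {
    val it = transformed
    println(s"after defining the pipeline: $evaluations evaluations")
    val total = it.sum // the "action": only now does map run
    println(s"after .sum: $evaluations evaluations, total = $total")
  }
}
```

Running this prints 0 evaluations before the sum and 3 after, mirroring why the analyzer's println only appears once an action such as count() triggers the indexing work.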
For this example I am running on only one machine in standalone mode, so there are no executors besides the driver. I was executing linkedResults.show to display the table, but I changed it to count and it produced the same results.
Please see the full logs below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/03/19 17:33:10 INFO SparkContext: Running Spark version 2.3.2
19/03/19 17:33:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/19 17:33:10 INFO SparkContext: Submitted application: MatchingTest
19/03/19 17:33:11 INFO SecurityManager: Changing view acls to: user
19/03/19 17:33:11 INFO SecurityManager: Changing modify acls to: user
19/03/19 17:33:11 INFO SecurityManager: Changing view acls groups to:
19/03/19 17:33:11 INFO SecurityManager: Changing modify acls groups to:
19/03/19 17:33:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); groups with view permissions: Set(); users with modify permissions: Set(user); groups with modify permissions: Set()
19/03/19 17:33:12 INFO Utils: Successfully started service 'sparkDriver' on port 53164.
19/03/19 17:33:12 INFO SparkEnv: Registering MapOutputTracker
19/03/19 17:33:12 INFO SparkEnv: Registering BlockManagerMaster
19/03/19 17:33:12 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/03/19 17:33:12 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/03/19 17:33:12 INFO DiskBlockManager: Created local directory at C:\Users\user\AppData\Local\Temp\blockmgr-f61ccecd-66d9-4623-8411-f97234361c3e
19/03/19 17:33:13 INFO MemoryStore: MemoryStore started with capacity 1989.6 MB
19/03/19 17:33:13 INFO SparkEnv: Registering OutputCommitCoordinator
19/03/19 17:33:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/03/19 17:33:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
19/03/19 17:33:13 INFO Executor: Starting executor ID driver on host localhost
19/03/19 17:33:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53185.
19/03/19 17:33:14 INFO NettyBlockTransferService: Server created on localhost:53185
19/03/19 17:33:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/03/19 17:33:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManagerMasterEndpoint: Registering block manager localhost:53185 with 1989.6 MB RAM, BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:18 INFO CodeGenerator: Code generated in 453.128391 ms
19/03/19 17:33:18 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/dev/spark-es-master/spark-warehouse/').
19/03/19 17:33:18 INFO SharedState: Warehouse path is 'file:/C:/dev/spark-es-master/spark-warehouse/'.
19/03/19 17:33:19 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/03/19 17:33:21 INFO CodeGenerator: Code generated in 19.063393 ms
19/03/19 17:33:21 INFO CodeGenerator: Code generated in 19.929444 ms
19/03/19 17:33:22 INFO CodeGenerator: Code generated in 26.14319 ms
19/03/19 17:33:22 INFO CodeGenerator: Code generated in 21.811701 ms
19/03/19 17:33:22 INFO SparkContext: Starting job: count at BlockLinkage.scala:198
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 4 (keyBy at LuceneRDD.scala:507)
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 9 (keyBy at LuceneRDD.scala:510)
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 16 (count at BlockLinkage.scala:198)
19/03/19 17:33:22 INFO DAGScheduler: Got job 0 (count at BlockLinkage.scala:198) with 1 output partitions
19/03/19 17:33:22 INFO DAGScheduler: Final stage: ResultStage 3 (count at BlockLinkage.scala:198)
19/03/19 17:33:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
19/03/19 17:33:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
19/03/19 17:33:22 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at keyBy at LuceneRDD.scala:507), which has no missing parents
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.2 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:22 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:22 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at keyBy at LuceneRDD.scala:507) (first 15 tasks are for partitions Vector(0, 1))
19/03/19 17:33:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/03/19 17:33:22 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[9] at keyBy at LuceneRDD.scala:510), which has no missing parents
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.1 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[9] at keyBy at LuceneRDD.scala:510) (first 15 tasks are for partitions Vector(0))
19/03/19 17:33:22 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
19/03/19 17:33:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8100 bytes)
19/03/19 17:33:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8100 bytes)
19/03/19 17:33:22 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 8084 bytes)
19/03/19 17:33:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/03/19 17:33:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/03/19 17:33:22 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/03/19 17:33:23 INFO CodeGenerator: Code generated in 58.379922 ms
19/03/19 17:33:23 INFO CodeGenerator: Code generated in 19.728008 ms
19/03/19 17:33:23 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 455 ms on localhost (executor driver) (1/2)
19/03/19 17:33:23 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 427 ms on localhost (executor driver) (1/1)
19/03/19 17:33:23 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
19/03/19 17:33:23 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 521 ms on localhost (executor driver) (2/2)
19/03/19 17:33:23 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/03/19 17:33:23 INFO DAGScheduler: ShuffleMapStage 1 (keyBy at LuceneRDD.scala:510) finished in 0.559 s
19/03/19 17:33:23 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:23 INFO DAGScheduler: running: Set(ShuffleMapStage 0)
19/03/19 17:33:23 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ResultStage 3)
19/03/19 17:33:23 INFO DAGScheduler: failed: Set()
19/03/19 17:33:23 INFO DAGScheduler: ShuffleMapStage 0 (keyBy at LuceneRDD.scala:507) finished in 0.948 s
19/03/19 17:33:23 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:23 INFO DAGScheduler: running: Set()
19/03/19 17:33:23 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ResultStage 3)
19/03/19 17:33:23 INFO DAGScheduler: failed: Set()
19/03/19 17:33:23 INFO DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at BlockLinkage.scala:198), which has no missing parents
19/03/19 17:33:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KB, free 1989.6 MB)
19/03/19 17:33:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.7 KB, free 1989.6 MB)
19/03/19 17:33:23 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:53185 (size: 5.7 KB, free: 1989.6 MB)
19/03/19 17:33:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:23 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at BlockLinkage.scala:198) (first 15 tasks are for partitions Vector(0, 1))
19/03/19 17:33:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/03/19 17:33:23 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, executor driver, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/19 17:33:23 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, executor driver, partition 1, PROCESS_LOCAL, 7701 bytes)
19/03/19 17:33:23 INFO Executor: Running task 0.0 in stage 2.0 (TID 3)
19/03/19 17:33:23 INFO Executor: Running task 1.0 in stage 2.0 (TID 4)
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/03/19 17:33:23 INFO LuceneRDDPartition: Config parameter lucenerdd.index.store.mode is set to 'disk'
19/03/19 17:33:23 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 17:33:23 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 17:33:23 INFO Executor: Finished task 1.0 in stage 2.0 (TID 4). 1900 bytes result sent to driver
19/03/19 17:33:23 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 4) in 274 ms on localhost (executor driver) (1/2)
19/03/19 17:33:23 INFO LuceneRDDPartition: Config parameter lucenerdd.index.store.mode is set to 'disk'
19/03/19 17:33:23 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 17:33:23 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 17:33:23 INFO LuceneRDDPartition: [partId=0] Partition is created...
19/03/19 17:33:23 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
19/03/19 17:33:23 INFO LuceneRDDPartition: [partId=0]Indexing process initiated at 2019-03-19T17:33:23.825-04:00...
19/03/19 17:33:23 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
Removing stop words
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexing process completed at 2019-03-19T17:33:24.532-04:00...
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexing process took 0 seconds...
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexed 2 documents
19/03/19 17:33:24 INFO Linker: +(name:googlex~2) +((address:main~2 address:123~1 address:street~2)~1)
19/03/19 17:33:24 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1900 bytes result sent to driver
19/03/19 17:33:24 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 1467 ms on localhost (executor driver) (2/2)
19/03/19 17:33:24 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/19 17:33:24 INFO DAGScheduler: ShuffleMapStage 2 (count at BlockLinkage.scala:198) finished in 1.506 s
19/03/19 17:33:24 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:24 INFO DAGScheduler: running: Set()
19/03/19 17:33:24 INFO DAGScheduler: waiting: Set(ResultStage 3)
19/03/19 17:33:24 INFO DAGScheduler: failed: Set()
19/03/19 17:33:24 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at BlockLinkage.scala:198), which has no missing parents
19/03/19 17:33:24 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 7.5 KB, free 1989.6 MB)
19/03/19 17:33:24 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.5 MB)
19/03/19 17:33:24 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:25 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:25 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at BlockLinkage.scala:198) (first 15 tasks are for partitions Vector(0))
19/03/19 17:33:25 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
19/03/19 17:33:25 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 5, localhost, executor driver, partition 0, ANY, 7754 bytes)
19/03/19 17:33:25 INFO Executor: Running task 0.0 in stage 3.0 (TID 5)
19/03/19 17:33:25 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/03/19 17:33:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/03/19 17:33:25 INFO Executor: Finished task 0.0 in stage 3.0 (TID 5). 1782 bytes result sent to driver
19/03/19 17:33:25 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 5) in 38 ms on localhost (executor driver) (1/1)
19/03/19 17:33:25 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
19/03/19 17:33:25 INFO DAGScheduler: ResultStage 3 (count at BlockLinkage.scala:198) finished in 0.080 s
19/03/19 17:33:25 INFO DAGScheduler: Job 0 finished: count at BlockLinkage.scala:198, took 2.688508 s
19/03/19 17:33:25 ERROR BlockLinkage: ========================================
19/03/19 17:33:25 ERROR BlockLinkage: || Elapsed time: 10.576 seconds ||
19/03/19 17:33:25 ERROR BlockLinkage: ========================================
19/03/19 17:33:25 ERROR BlockLinkage: ****************************************
19/03/19 17:33:25 ERROR BlockLinkage: ****************************************
19/03/19 17:33:25 INFO SparkUI: Stopped Spark web UI at http://localhost:4040
19/03/19 17:33:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/03/19 17:33:25 INFO MemoryStore: MemoryStore cleared
19/03/19 17:33:25 INFO BlockManager: BlockManager stopped
19/03/19 17:33:25 INFO BlockManagerMaster: BlockManagerMaster stopped
19/03/19 17:33:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/03/19 17:33:25 INFO SparkContext: Successfully stopped SparkContext
19/03/19 17:33:25 INFO ShutdownHookManager: Shutdown hook called
19/03/19 17:33:25 INFO ShutdownHookManager: Deleting directory C:\Users\user\AppData\Local\Temp\spark-28f2888e-4dde-4e26-bb54-0b8eaf118f2b
Process finished with exit code 0
I tried with a larger dataset (more partitions) and I am seeing the message multiple times, as expected.
Now, to clarify: shouldn't I also see this at search time? Or is my assumption incorrect?
Is this feature only supported for some queries? I am trying to use this with linkDataFrame, but it is not available.
Is this feature only supported for some queries? I am trying to use this with linkDataFrame, but it is not available.
The custom Analyzer can be used in linkDataFrame; you need to specify it during the creation of the LuceneRDD.
That makes sense. Thank you for confirming.
From what I've seen, this issue seems to be done. Can we close it?