Comments (15)
Yes, currently the library only supports the predefined language analyzers. Let me think once more about whether I can fix this issue.
The main problem with the analyzers is that they are not serializable, and this causes a lot of issues.
from spark-lucenerdd.
Yes, analyzers in Lucene are not serializable.
@yeikel, pushed the feature in version 0.3.7-SNAPSHOT. You can try it out and let me know if there are any issues.
Besides the idea of wrapping them in some other type, I am not sure what else we can do.
I think a good option is to allow the user to specify the analyzer as a String using its fully qualified class name, e.g.,
my.cool.analyzers.CustomAnalyzer
which would be loaded using Java reflection. I think this would allow you to use any analyzer, and it requires only small modifications to the codebase. WDYT?
See, e.g. http://tutorials.jenkov.com/java-reflection/dynamic-class-loading-reloading.html
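The reflection approach suggested above can be sketched in plain Scala: ship the analyzer's fully qualified class name as a plain (serializable) String and instantiate it on the other side with `Class.forName`. Note that `Greeter`/`LoudGreeter` below are hypothetical stand-ins for Lucene's `Analyzer` and a custom subclass, not spark-lucenerdd's actual API.

```scala
// Hypothetical stand-in for Lucene's Analyzer base class.
trait Greeter {
  def greet(name: String): String
}

// Hypothetical stand-in for a user-defined analyzer subclass.
class LoudGreeter extends Greeter {
  override def greet(name: String): String = s"HELLO, ${name.toUpperCase}!"
}

object ReflectiveLoader {
  // Load a class by name and call its no-argument constructor; this is
  // what an executor could do with a class name received as a String,
  // sidestepping the need to serialize the analyzer instance itself.
  def instantiate[T](className: String): T =
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[T]
}

object ReflectionDemo {
  def main(args: Array[String]): Unit = {
    // classOf[...].getName yields the binary class name to look up.
    val className = classOf[LoudGreeter].getName
    val greeter   = ReflectiveLoader.instantiate[Greeter](className)
    println(greeter.greet("world"))
  }
}
```

The trade-off is that the analyzer class must be on the executors' classpath and expose a no-argument constructor, but the String itself serializes trivially.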
It is definitely an option.
Could you please clarify why passing an instance of the analyzer is not an option? Serialization issues?
It is very unfortunate that we need to do this.
The workaround sounds fair to me. Let's do it.
I believe that queryAnalyzer is not working as expected. Please see the example below:
// Imports assume Lucene 7.x; package locations differ across Lucene versions.
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.standard.StandardTokenizer
import org.apache.lucene.analysis.{Analyzer, CharArraySet, LowerCaseFilter, StopFilter, TokenFilter, Tokenizer}

import scala.collection.JavaConverters._
import scala.io.Source.fromInputStream

class StandardWithStopWords extends Analyzer {

  // Stop words loaded once from a resource file on the classpath.
  private val STOP_WORDS =
    fromInputStream(getClass.getResourceAsStream("/stop-words/list.txt")).getLines().toSet

  override def createComponents(fieldName: String): TokenStreamComponents = {
    val source: Tokenizer = new StandardTokenizer()
    val lowerCase = new LowerCaseFilter(source)
    val finalTokens = removeStopWords(lowerCase)
    new TokenStreamComponents(source, finalTokens)
  }

  private def removeStopWords(l: TokenFilter): StopFilter = {
    System.out.println("Removing stop words")
    // CharArraySet expects a java.util.Collection, hence the .asJava conversion.
    val charSet = new CharArraySet(STOP_WORDS.asJava, true)
    new StopFilter(l, charSet)
  }
}
val A = Seq(("Googlex", "123 Main Street", "US")).toDF("name", "address", "mkt_cd")
val B = Seq(
  ("Google", "123 Main Street", "US"),
  ("Googly", "123 Main Street", "US")
).toDF("name", "address", "mkt_cd")
val linkedResults = LuceneRDD.blockEntityLinkage(
  A,
  B,
  linkerQuery,
  blockingFields,
  blockingFields,
  1000,
  indexAnalyzer = "lucene.StandardWithStopWords",
  queryAnalyzer = "lucene.StandardWithStopWords"
)
19/03/19 10:28:29 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 10:28:29 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0] Partition is created...
19/03/19 10:28:29 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process initiated at 2019-03-19T10:28:29.288-04:00...
19/03/19 10:28:29 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
Removing stop words
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process completed at 2019-03-19T10:28:29.930-04:00...
19/03/19 10:28:29 INFO LuceneRDDPartition: [partId=0]Indexing process took 0 seconds...
19/03/19 10:28:30 INFO LuceneRDDPartition: [partId=0]Indexed 2 documents
19/03/19 10:28:30 INFO Linker: +((name:Googlex~2)~1) +((address:123 Main Street~2)~2)
19/03/19 10:28:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1677 bytes result sent to driver
19/03/19 10:28:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 1449 ms on localhost (executor driver) (1/1)
19/03/19 10:28:30 INFO DAGScheduler: ResultStage 2 (show at BlockLinkageAmex.scala:226) finished in 1.557 s
19/03/19 10:28:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/19 10:28:30 INFO DAGScheduler: Job 0 finished: show at BlockLinkageAmex.scala:226, took 2.664930 s
19/03/19 10:28:30 INFO SparkContext: Starting job: show at BlockLinkageAmex.scala:226
19/03/19 10:28:30 INFO DAGScheduler: Got job 1 (show at BlockLinkageAmex.scala:226) with 1 output partitions
19/03/19 10:28:30 INFO DAGScheduler: Final stage: ResultStage 5 (show at BlockLinkageAmex.scala:226)
19/03/19 10:28:30 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3, ShuffleMapStage 4)
19/03/19 10:28:30 INFO DAGScheduler: Missing parents: List()
19/03/19 10:28:30 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[20] at show at BlockLinkageAmex.scala:226), which has no missing parents
19/03/19 10:28:30 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 43.2 KB, free 1989.5 MB)
19/03/19 10:28:30 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.6 KB, free 1989.5 MB)
19/03/19 10:28:30 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on FLL39P2SN2.ads.aexp.com:51436 (size: 14.6 KB, free: 1989.6 MB)
19/03/19 10:28:30 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1039
19/03/19 10:28:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[20] at show at BlockLinkageAmex.scala:226) (first 15 tasks are for partitions Vector(1))
19/03/19 10:28:30 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
19/03/19 10:28:30 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 4, localhost, executor driver, partition 1, PROCESS_LOCAL, 7712 bytes)
19/03/19 10:28:30 INFO Executor: Running task 0.0 in stage 5.0 (TID 4)
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
19/03/19 10:28:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 10:28:30 INFO Executor: Finished task 0.0 in stage 5.0 (TID 4). 1527 bytes result sent to driver
19/03/19 10:28:30 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 4) in 36 ms on localhost (executor driver) (1/1)
19/03/19 10:28:30 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
19/03/19 10:28:30 INFO DAGScheduler: ResultStage 5 (show at BlockLinkageAmex.scala:226) finished in 0.050 s
19/03/19 10:28:30 INFO DAGScheduler: Job 1 finished: show at BlockLinkageAmex.scala:226, took 0.065966 s
+-------+--
Unless I am missing something, I believe I should see two events in my logs for System.out.println("Removing stop words"): one at index time and another at search time.
I see now. You need to invoke an action, since the RDD is not computed otherwise (lazy evaluation), i.e., linkedResults.count().
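The lazy-evaluation point above can be illustrated without Spark at all: Scala's Iterator is lazy in the same way an RDD transformation is, so a side effect inside a map (like the println in the analyzer) only fires when a terminal operation, the "action", consumes the pipeline. This is a plain-Scala analogy, not spark-lucenerdd code.

```scala
object LazyDemo {
  // Counts how many times the map body actually ran.
  var evaluations = 0

  // Like an RDD transformation: defining this does no work yet.
  def transformed: Iterator[Int] =
    Iterator(1, 2, 3).map { x =>
      evaluations += 1 // side effect, like the println in the analyzer
      x * 2
    }

  def main(args: Array[String]): Unit = {
    val it = transformed
    println(s"after defining the pipeline: $evaluations evaluations")
    val total = it.sum // the "action": only now does map run
    println(s"after .sum: $evaluations evaluations, total = $total")
  }
}
```

Running this prints 0 evaluations before the sum and 3 after, mirroring why the analyzer's println only appears once an action such as count() triggers the indexing work.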
For this example I am running on only one machine in standalone mode, so there are no executors besides the driver. I was executing linkedResults.show to display the table, but I changed it to count and it produced the same results.
Please see the full logs below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/03/19 17:33:10 INFO SparkContext: Running Spark version 2.3.2
19/03/19 17:33:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/19 17:33:10 INFO SparkContext: Submitted application: MatchingTest
19/03/19 17:33:11 INFO SecurityManager: Changing view acls to: user
19/03/19 17:33:11 INFO SecurityManager: Changing modify acls to: user
19/03/19 17:33:11 INFO SecurityManager: Changing view acls groups to:
19/03/19 17:33:11 INFO SecurityManager: Changing modify acls groups to:
19/03/19 17:33:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); groups with view permissions: Set(); users with modify permissions: Set(user); groups with modify permissions: Set()
19/03/19 17:33:12 INFO Utils: Successfully started service 'sparkDriver' on port 53164.
19/03/19 17:33:12 INFO SparkEnv: Registering MapOutputTracker
19/03/19 17:33:12 INFO SparkEnv: Registering BlockManagerMaster
19/03/19 17:33:12 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/03/19 17:33:12 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/03/19 17:33:12 INFO DiskBlockManager: Created local directory at C:\Users\user\AppData\Local\Temp\blockmgr-f61ccecd-66d9-4623-8411-f97234361c3e
19/03/19 17:33:13 INFO MemoryStore: MemoryStore started with capacity 1989.6 MB
19/03/19 17:33:13 INFO SparkEnv: Registering OutputCommitCoordinator
19/03/19 17:33:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/03/19 17:33:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
19/03/19 17:33:13 INFO Executor: Starting executor ID driver on host localhost
19/03/19 17:33:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53185.
19/03/19 17:33:14 INFO NettyBlockTransferService: Server created on localhost:53185
19/03/19 17:33:14 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/03/19 17:33:14 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManagerMasterEndpoint: Registering block manager localhost:53185 with 1989.6 MB RAM, BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:14 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 53185, None)
19/03/19 17:33:18 INFO CodeGenerator: Code generated in 453.128391 ms
19/03/19 17:33:18 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/dev/spark-es-master/spark-warehouse/').
19/03/19 17:33:18 INFO SharedState: Warehouse path is 'file:/C:/dev/spark-es-master/spark-warehouse/'.
19/03/19 17:33:19 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/03/19 17:33:21 INFO CodeGenerator: Code generated in 19.063393 ms
19/03/19 17:33:21 INFO CodeGenerator: Code generated in 19.929444 ms
19/03/19 17:33:22 INFO CodeGenerator: Code generated in 26.14319 ms
19/03/19 17:33:22 INFO CodeGenerator: Code generated in 21.811701 ms
19/03/19 17:33:22 INFO SparkContext: Starting job: count at BlockLinkage.scala:198
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 4 (keyBy at LuceneRDD.scala:507)
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 9 (keyBy at LuceneRDD.scala:510)
19/03/19 17:33:22 INFO DAGScheduler: Registering RDD 16 (count at BlockLinkage.scala:198)
19/03/19 17:33:22 INFO DAGScheduler: Got job 0 (count at BlockLinkage.scala:198) with 1 output partitions
19/03/19 17:33:22 INFO DAGScheduler: Final stage: ResultStage 3 (count at BlockLinkage.scala:198)
19/03/19 17:33:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
19/03/19 17:33:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
19/03/19 17:33:22 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[4] at keyBy at LuceneRDD.scala:507), which has no missing parents
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.2 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:22 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:22 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[4] at keyBy at LuceneRDD.scala:507) (first 15 tasks are for partitions Vector(0, 1))
19/03/19 17:33:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/03/19 17:33:22 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[9] at keyBy at LuceneRDD.scala:510), which has no missing parents
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.1 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.6 MB)
19/03/19 17:33:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[9] at keyBy at LuceneRDD.scala:510) (first 15 tasks are for partitions Vector(0))
19/03/19 17:33:22 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
19/03/19 17:33:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8100 bytes)
19/03/19 17:33:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8100 bytes)
19/03/19 17:33:22 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 8084 bytes)
19/03/19 17:33:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/03/19 17:33:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/03/19 17:33:22 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/03/19 17:33:23 INFO CodeGenerator: Code generated in 58.379922 ms
19/03/19 17:33:23 INFO CodeGenerator: Code generated in 19.728008 ms
19/03/19 17:33:23 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1212 bytes result sent to driver
19/03/19 17:33:23 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 455 ms on localhost (executor driver) (1/2)
19/03/19 17:33:23 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 427 ms on localhost (executor driver) (1/1)
19/03/19 17:33:23 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
19/03/19 17:33:23 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 521 ms on localhost (executor driver) (2/2)
19/03/19 17:33:23 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/03/19 17:33:23 INFO DAGScheduler: ShuffleMapStage 1 (keyBy at LuceneRDD.scala:510) finished in 0.559 s
19/03/19 17:33:23 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:23 INFO DAGScheduler: running: Set(ShuffleMapStage 0)
19/03/19 17:33:23 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ResultStage 3)
19/03/19 17:33:23 INFO DAGScheduler: failed: Set()
19/03/19 17:33:23 INFO DAGScheduler: ShuffleMapStage 0 (keyBy at LuceneRDD.scala:507) finished in 0.948 s
19/03/19 17:33:23 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:23 INFO DAGScheduler: running: Set()
19/03/19 17:33:23 INFO DAGScheduler: waiting: Set(ShuffleMapStage 2, ResultStage 3)
19/03/19 17:33:23 INFO DAGScheduler: failed: Set()
19/03/19 17:33:23 INFO DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at BlockLinkage.scala:198), which has no missing parents
19/03/19 17:33:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KB, free 1989.6 MB)
19/03/19 17:33:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.7 KB, free 1989.6 MB)
19/03/19 17:33:23 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:53185 (size: 5.7 KB, free: 1989.6 MB)
19/03/19 17:33:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:23 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at BlockLinkage.scala:198) (first 15 tasks are for partitions Vector(0, 1))
19/03/19 17:33:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/03/19 17:33:23 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, executor driver, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/19 17:33:23 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, executor driver, partition 1, PROCESS_LOCAL, 7701 bytes)
19/03/19 17:33:23 INFO Executor: Running task 0.0 in stage 2.0 (TID 3)
19/03/19 17:33:23 INFO Executor: Running task 1.0 in stage 2.0 (TID 4)
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 2 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/03/19 17:33:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/03/19 17:33:23 INFO LuceneRDDPartition: Config parameter lucenerdd.index.store.mode is set to 'disk'
19/03/19 17:33:23 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 17:33:23 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 17:33:23 INFO Executor: Finished task 1.0 in stage 2.0 (TID 4). 1900 bytes result sent to driver
19/03/19 17:33:23 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 4) in 274 ms on localhost (executor driver) (1/2)
19/03/19 17:33:23 INFO LuceneRDDPartition: Config parameter lucenerdd.index.store.mode is set to 'disk'
19/03/19 17:33:23 INFO LuceneRDDPartition: Lucene index will be storage in disk
19/03/19 17:33:23 INFO LuceneRDDPartition: Index disk location C:\Users\user\AppData\Local\Temp\
19/03/19 17:33:23 INFO LuceneRDDPartition: [partId=0] Partition is created...
19/03/19 17:33:23 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
19/03/19 17:33:23 INFO LuceneRDDPartition: [partId=0]Indexing process initiated at 2019-03-19T17:33:23.825-04:00...
19/03/19 17:33:23 INFO LuceneRDDPartition: Loading class lucene.StandardWithStopWords using loader sun.misc.Launcher$AppClassLoader@18b4aac2
Removing stop words
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexing process completed at 2019-03-19T17:33:24.532-04:00...
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexing process took 0 seconds...
19/03/19 17:33:24 INFO LuceneRDDPartition: [partId=0]Indexed 2 documents
19/03/19 17:33:24 INFO Linker: +(name:googlex~2) +((address:main~2 address:123~1 address:street~2)~1)
19/03/19 17:33:24 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1900 bytes result sent to driver
19/03/19 17:33:24 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 1467 ms on localhost (executor driver) (2/2)
19/03/19 17:33:24 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/19 17:33:24 INFO DAGScheduler: ShuffleMapStage 2 (count at BlockLinkage.scala:198) finished in 1.506 s
19/03/19 17:33:24 INFO DAGScheduler: looking for newly runnable stages
19/03/19 17:33:24 INFO DAGScheduler: running: Set()
19/03/19 17:33:24 INFO DAGScheduler: waiting: Set(ResultStage 3)
19/03/19 17:33:24 INFO DAGScheduler: failed: Set()
19/03/19 17:33:24 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at BlockLinkage.scala:198), which has no missing parents
19/03/19 17:33:24 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 7.5 KB, free 1989.6 MB)
19/03/19 17:33:24 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.9 KB, free 1989.5 MB)
19/03/19 17:33:24 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:53185 (size: 3.9 KB, free: 1989.6 MB)
19/03/19 17:33:25 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1039
19/03/19 17:33:25 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at BlockLinkage.scala:198) (first 15 tasks are for partitions Vector(0))
19/03/19 17:33:25 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
19/03/19 17:33:25 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 5, localhost, executor driver, partition 0, ANY, 7754 bytes)
19/03/19 17:33:25 INFO Executor: Running task 0.0 in stage 3.0 (TID 5)
19/03/19 17:33:25 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/03/19 17:33:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/03/19 17:33:25 INFO Executor: Finished task 0.0 in stage 3.0 (TID 5). 1782 bytes result sent to driver
19/03/19 17:33:25 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 5) in 38 ms on localhost (executor driver) (1/1)
19/03/19 17:33:25 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
19/03/19 17:33:25 INFO DAGScheduler: ResultStage 3 (count at BlockLinkage.scala:198) finished in 0.080 s
19/03/19 17:33:25 INFO DAGScheduler: Job 0 finished: count at BlockLinkage.scala:198, took 2.688508 s
19/03/19 17:33:25 ERROR BlockLinkage: ========================================
19/03/19 17:33:25 ERROR BlockLinkage: || Elapsed time: 10.576 seconds ||
19/03/19 17:33:25 ERROR BlockLinkage: ========================================
19/03/19 17:33:25 ERROR BlockLinkage: ****************************************
19/03/19 17:33:25 ERROR BlockLinkage: ****************************************
19/03/19 17:33:25 INFO SparkUI: Stopped Spark web UI at http://localhost:4040
19/03/19 17:33:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/03/19 17:33:25 INFO MemoryStore: MemoryStore cleared
19/03/19 17:33:25 INFO BlockManager: BlockManager stopped
19/03/19 17:33:25 INFO BlockManagerMaster: BlockManagerMaster stopped
19/03/19 17:33:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/03/19 17:33:25 INFO SparkContext: Successfully stopped SparkContext
19/03/19 17:33:25 INFO ShutdownHookManager: Shutdown hook called
19/03/19 17:33:25 INFO ShutdownHookManager: Deleting directory C:\Users\user\AppData\Local\Temp\spark-28f2888e-4dde-4e26-bb54-0b8eaf118f2b
Process finished with exit code 0
I tried with a larger dataset (more partitions) and I am seeing the message multiple times, as expected.
Now, to clarify: shouldn't I also see this at search time? Or is my assumption incorrect?
Is this feature only supported for some queries? I am trying to use this with linkDataFrame, but it is not available.
Is this feature only supported for some queries? I am trying to use this with linkDataFrame, but it is not available.
The custom Analyzer can be used in linkDataFrame; you need to specify it during the creation of the LuceneRDD.
That makes sense. Thank you for confirming.
From what I've seen, this issue seems to be done. Can we close it?