
sashafrey / topicmod

This project has been moved to https://github.com/bigartm/bigartm

License: Other

CSS 0.04% C++ 73.68% C 1.80% Vim Script 0.04% Emacs Lisp 0.08% Java 14.17% Python 9.12% TeX 0.53% Shell 0.54%

topicmod's People

Contributors

antoniox, jeanpaulshapo, marinadudarenko, mellain, netbug, rdkl, sashafrey, vuvko


topicmod's Issues

Make sure all threads catch(...) in their main thread functions

An uncaught exception in a non-main thread causes terminate(). Such situations are very difficult to debug, because terminate() aborts all threads in an uncontrolled manner, causing null references, access violations, and all sorts of weirdness that is impossible to debug.

To avoid this, we should review all threads created in topicmd and make sure we catch all exceptions (and log a proper error message).
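A minimal sketch of such a guard (the helper name and signature are illustrative, not from the codebase): every thread entry point runs through a wrapper that catches everything and logs instead of letting the exception escape.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <thread>

// Hypothetical helper: run a thread body inside catch(...) so an escaped
// exception is logged instead of aborting the whole process.
// Returns true if the body completed, false if an exception was caught.
bool RunGuarded(const std::function<void()>& body) {
  bool ok = false;
  std::thread t([&]() {
    try {
      body();
      ok = true;
    } catch (const std::exception& e) {
      std::cerr << "Thread failed: " << e.what() << std::endl;
    } catch (...) {
      std::cerr << "Thread failed with an unknown exception" << std::endl;
    }
  });
  t.join();  // in real code the thread would run for the instance lifetime
  return ok;
}
```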

Fix interface of TokenTopicMatrix class

There are several issues:

  1. The class should be renamed to TopicModel.
  2. TokenWeights should be an iterator, renamed to TopicWeightIterator.
  3. The token_id() function should never return -1.
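A minimal sketch of what the proposed iterator could look like (the methods and storage are illustrative assumptions, not the actual interface):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical TopicWeightIterator: walks the topic weights of one token,
// replacing the raw TokenWeights accessor described above.
class TopicWeightIterator {
 public:
  explicit TopicWeightIterator(const std::vector<float>& weights)
      : weights_(weights), pos_(static_cast<size_t>(-1)) {}

  // Advance to the next topic; returns false past the last one.
  bool NextTopic() { return ++pos_ < weights_.size(); }
  size_t TopicIndex() const { return pos_; }
  float Weight() const { return weights_[pos_]; }

 private:
  const std::vector<float>& weights_;
  size_t pos_;
};
```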

Documentation: create diagrams with overall architecture

For internal documentation it would be nice to have two diagrams:

  1. One diagram representing how the modules (DataLoader, Processor, etc.) are connected to each other.
  2. A second representing code dependencies and interfaces (cpp_interface, c_interface).

Make proper error codes

rdkl:

Maybe replace

#define ARTM_SUCCESS

with

namespace artm {
enum ExitCodes {
  SUCCESS,
  FIRST_ERROR_TYPE,
  SECOND_ERROR_TYPE
};
}  // namespace artm

This also gives auto-completion.

alfrey:
I have to think about this :)

Add processors_count to InstanceConfig

Currently the number of processors is hardcoded to one. We need to support multicore machines by adding more processors to the instance. The fix should be fairly simple: just instantiate more Processor objects, connect them to the same queues (processor_queue_ and merger_queue_), and it should work! :)
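A sketch of the suggested fix, with stand-in types since the real Processor and queue classes live elsewhere in the codebase:

```cpp
#include <memory>
#include <queue>
#include <vector>

// Stand-ins for the real types; illustrative only.
struct Batch {};
class Processor {
 public:
  Processor(std::queue<Batch>* in, std::queue<Batch>* out)
      : in_(in), out_(out) {}
 private:
  std::queue<Batch>* in_;
  std::queue<Batch>* out_;
};

// Create processors_count Processors, all attached to the same
// processor_queue and merger_queue, instead of a single hardcoded one.
std::vector<std::unique_ptr<Processor>> CreateProcessors(
    int processors_count,
    std::queue<Batch>* processor_queue,
    std::queue<Batch>* merger_queue) {
  std::vector<std::unique_ptr<Processor>> processors;
  for (int i = 0; i < processors_count; ++i) {
    processors.emplace_back(new Processor(processor_queue, merger_queue));
  }
  return processors;
}
```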

Implement calculation of perplexity

Currently Merger and Processors will tune models forever (in the background).
Ideally we should have an option to stop tuning once perplexity (or any other measure) reaches a sufficient level. I don't have a clear vision of how to achieve this; you are welcome to suggest your design!
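One possible shape for such a stopping rule, as a minimal sketch (the function name, signature, and relative-tolerance criterion are assumptions, not a committed design):

```cpp
// Hypothetical convergence check: stop the background tuning loop once the
// relative improvement in perplexity between iterations falls below a
// tolerance. Any other quality measure could be plugged in the same way.
bool ShouldStopTuning(double prev_perplexity, double cur_perplexity,
                      double rel_tolerance) {
  if (prev_perplexity <= 0.0) return false;  // no previous measurement yet
  double rel_improvement =
      (prev_perplexity - cur_perplexity) / prev_perplexity;
  // Keep tuning while perplexity still improves noticeably (or got worse,
  // which may be transient noise).
  return rel_improvement >= 0.0 && rel_improvement < rel_tolerance;
}
```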

Don't hardcode Float

Currently the token-topic matrix uses 'float' as its datatype. This should perhaps become a separate typedef ('artm_float'), to make sure we can switch to double whenever better precision is needed.

I'll have to think about some details (for example, what to do with protobuf messages).
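The typedef itself would be a one-liner, sketched below (the name 'artm_float' comes from the issue text; its placement is an assumption):

```cpp
// Hypothetical single switch point for the matrix datatype: everything in
// the token-topic matrix uses artm_float instead of a bare float.
namespace artm {
typedef float artm_float;  // change to double when more precision is needed
}  // namespace artm
```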

create_model must generate model_id (not require it as input)

Currently c_interface and cpp_interface require the caller to provide a model_id. In fact, it should be Instance::UpdateModel which keeps track of existing model_ids and generates a new model_id. Such a function should be named CreateModel, and UpdateModel should be renamed to ReconfigureModel (to match the names in c_interface and cpp_interface).
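A minimal sketch of the proposed split (the Instance members and integer model ids are illustrative; the real classes and id types may differ):

```cpp
#include <map>
#include <string>

// Hypothetical Instance: CreateModel generates a fresh model_id instead of
// the caller inventing one; ReconfigureModel updates an existing model.
class Instance {
 public:
  int CreateModel(const std::string& model_config) {
    int model_id = next_model_id_++;
    models_[model_id] = model_config;
    return model_id;
  }

  bool ReconfigureModel(int model_id, const std::string& model_config) {
    auto it = models_.find(model_id);
    if (it == models_.end()) return false;  // unknown model_id
    it->second = model_config;
    return true;
  }

 private:
  int next_model_id_ = 1;
  std::map<int, std::string> models_;
};
```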

Dynamically add tokens to model when new data arrives

Currently the merger resets the whole model whenever it sees a new generation. There are better ways to do this:

  1. Scan the whole generation, and add only the tokens that were added.
  2. Scan only the new partitions (those that differ between the old and new generations).

Implement ArtmRequestBatchTopics

The ArtmRequestBatchTopics method of c_interface is not yet implemented. The goal of this method is to classify a collection of documents (i.e. a Batch) with a specific topic model.

Fix Linux instructions about 3rd party libraries

  1. There is a small inaccuracy in the documentation, section "How to build sources", part 2.5.2 "Linux".
    It does not say how to build the libraries (protobuf, rpcz, zeromq and glog). Presumably one should run make with the appropriate targets in the topicmod/3rdparty folder.
    For protobuf there actually is an instruction, but then it is unclear what exactly should be installed: the version from the distribution repositories, or the one built from topicmod/3rdparty?
  2. When cpp_client is launched without parameters, it prints a usage message, but the program name is shown as PlsaBatchEM (apparently the old name).

error LNK1104: cannot open file 'libprotobuf.lib'

error LNK1104: cannot open file 'libprotobuf.lib' in VS2010, in projects topicmd_tests, topicmd_dll and c_client.
Root cause: for some reason the $(VisualStudioVersion) macro doesn't exist.
Workaround: replace "vc$(VisualStudioVersion)" in the linker path settings with the explicit string "vc10.0".

Store collection to disk in DataLoader

DataLoader currently keeps all the data in memory. The suggested change is to:

  1. In DataLoader.AddBatch(), store each batch in a separate file (using google protocol buffers serialization). The location is defined by DataLoaderConfig.disk_path().
  2. In the DataLoader constructor, read the list of files under DataLoaderConfig.disk_path() and save it, to be able to iterate through those files.
  3. In DataLoader.ThreadFunction(), read batches from disk.
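The three steps above can be sketched as follows. This is a simplified stand-in: a real implementation would serialize the Batch protobuf message (e.g. via SerializeToOstream) rather than a raw string, and the class and file names here are assumptions.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical disk-backed loader; a string stands in for a serialized Batch.
class DiskDataLoader {
 public:
  explicit DiskDataLoader(const std::string& disk_path)
      : disk_path_(disk_path) {}

  // Step 1: AddBatch writes each batch to its own file under disk_path.
  void AddBatch(const std::string& serialized_batch) {
    std::ostringstream name;
    name << disk_path_ << "/batch_" << batch_files_.size() << ".pb";
    std::ofstream out(name.str(), std::ios::binary);
    out << serialized_batch;
    batch_files_.push_back(name.str());  // step 2: remember files to iterate
  }

  // Step 3: ThreadFunction() would read batches back one by one.
  std::string ReadBatch(size_t index) const {
    std::ifstream in(batch_files_[index], std::ios::binary);
    std::ostringstream content;
    content << in.rdbuf();
    return content.str();
  }

  size_t batch_count() const { return batch_files_.size(); }

 private:
  std::string disk_path_;
  std::vector<std::string> batch_files_;
};
```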

Implement streams in DataLoader

Goal: make it possible to set up several *streams* inside the data loader
(for example, one stream for training items and one for test items).
Models (and quality measures such as perplexity) specify which stream to use for tuning (evaluation).

Design a better representation of Topic

Currently the concept of a 'Topic' is barely represented in the code. Basically, a number of topics is associated with each model, and beyond that a topic is just an index into some arrays/matrices. This makes it difficult to associate extra information with each topic (for example, a flag distinguishing a 'subject topic' from a 'background topic').
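One possible direction, as a sketch (the struct name and fields are illustrative assumptions): give topics an explicit descriptor that metadata can attach to, while topic index i still addresses row/column i of the matrices.

```cpp
#include <string>

// Hypothetical descriptor: a home for per-topic metadata instead of a
// topic being nothing but an array index.
struct TopicDescriptor {
  std::string name;
  bool is_background = false;  // 'subject topic' vs 'background topic'
};

// A model would then hold a std::vector<TopicDescriptor> alongside its
// matrices; topic index i refers to descriptor i.
```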

Ctrl+C in cpp_client results in error "Assertion failed: Successful WSASTARTUP not yet performed"

This is a regression introduced in the alfrey_rpcz branch. If one presses Ctrl+C to abort cpp_client during ongoing processing, the following error message appears:

Assertion failed: Successful WSASTARTUP not yet performed (......\src\signaler.cpp:328)

As far as I can see, there is no simple way to solve this on Windows. On Linux it can be handled by calling the install_signal_handler() method from src\rpcz\include\rpcz\connection_manager.hpp.

I hope this is an annoying but harmless error. However, there might be some consequences - for example, leaked socket connections when clients crash. So I suggest we ignore this ticket for now, but keep it open as a reminder.

Implement simple Python interface

topicmd is designed as a library that can be used from any other language (at least from Python and Java, most likely C#, and perhaps Matlab).
We should implement a Python interface, since Python is a very nice environment for conducting research experiments. Everything that c_client does nowadays should be doable from Python (push a collection, create a model, fetch back the results).

Integration with Python should be done via the "ctypes" library.
An example of such integration can be found here:
http://sourceforge.net/p/vft11ccas/code/HEAD/tree/Source/alfrey/Python/
Ask sashafrey for more details.

Use CMake as a build engine

CMake should save us a lot of trouble. We already have many issues with our current build system:

  • The current Makefiles don't work well on Fedora, because they rely on /etc/ld.so.conf to load .so dependencies (while Ubuntu uses LD_LIBRARY_PATH).
  • We currently maintain both Makefiles and Visual Studio project files. CMake can generate both from a single CMakeLists.txt.
  • #47
  • #45
  • #6

Prototype sorted n(t,w) layout

From "An Architecture for Parallel Topic Models" by A. Smola and S. Narayanamurthy

To store the n(t,w) we use the same memory layout as Mallet. That is, we maintain a list of (topic, count) pairs for each word "w" sorting in order of decreasing counts. This allws us to implement a [Gibbs] sampler efficiently (with high probability we do not reach the end of the list) since the most likelly topics occur first.
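A minimal sketch of that layout, assuming a plain vector of (topic, count) pairs per word: on increment, the entry bubbles toward the front so the list stays sorted by decreasing count and a sampler scanning from the front usually stops early.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Per-word list of (topic, count) pairs, kept in decreasing-count order.
typedef std::pair<int, int> TopicCount;

void IncrementCount(std::vector<TopicCount>* counts, int topic) {
  for (size_t i = 0; i < counts->size(); ++i) {
    if ((*counts)[i].first == topic) {
      (*counts)[i].second++;
      // Bubble the entry towards the front to restore decreasing order.
      while (i > 0 && (*counts)[i - 1].second < (*counts)[i].second) {
        std::swap((*counts)[i - 1], (*counts)[i]);
        --i;
      }
      return;
    }
  }
  counts->push_back(std::make_pair(topic, 1));  // first occurrence: count 1
}
```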

Add option in Merger to forget old counters "exponentially" as in online algorithms

Currently Merger keeps accumulating counters. It never resets them to zero and does not "exponentially decrease" the values corresponding to old iterations. This is suboptimal: during the first steps the Phi matrix was inaccurate, so we accumulated some noise.

It is common for online algorithms to exponentially decrease counters. You are welcome to suggest the actual implementation details for this feature, and implement it :)

Implement Instance::Commit operation

This operation should store in-memory partitions to disk (to form the "index"). It is an important step towards big data: processing should not require keeping the whole collection in memory.

This is quite a big and complex fix, which requires at least the following:

  • Add the index location (a disk folder) to InstanceConfig.
  • Partition should become a base class with two inheritors: MemoryPartition and DiskPartition.
  • DataLoader should be able to handle both types of partitions (both disk and in-memory).
  • Instance should be able to start on top of an existing index (without re-feeding all data through insert_batch).

Processor should be able to re-use Thetas from previous iteration

With the current implementation the processor tunes Theta (the distribution of items over topics) from scratch each time, even though it doesn't differ much between iterations (N) and (N+1).

The idea is to ask DataLoader to save the thetas, and then pass them back to the Processor.

Rename topicmd -> artm

  • folders
  • namespace
  • vc proj
  • .dll, .lib, and .so files
    Also, consider placing the current code under artm/core, since we will likely add many other modules.

Support more than one request in c_interface

Currently request_model_topics in c_interface.cc is implemented via one global variable (a string message), which holds the data only for the last request. Ideally the result of a request should be stored until dispose_request is invoked.

The task is to implement a proper request manager (similar to instance_manager).
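A minimal sketch of such a request manager (class and method names are illustrative, modeled on the instance_manager analogy in the issue): each result is stored under a fresh id and kept until it is explicitly disposed.

```cpp
#include <map>
#include <string>

// Hypothetical request manager: replaces the single global string with a
// map from request id to the stored result.
class RequestManager {
 public:
  // Store a result and return the id the caller uses to fetch it later.
  int Store(const std::string& message) {
    int request_id = next_id_++;
    results_[request_id] = message;
    return request_id;
  }

  // Returns nullptr for unknown (or already disposed) ids.
  const std::string* Get(int request_id) const {
    auto it = results_.find(request_id);
    return it == results_.end() ? nullptr : &it->second;
  }

  void Dispose(int request_id) { results_.erase(request_id); }

 private:
  int next_id_ = 1;
  std::map<int, std::string> results_;
};
```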

Fix race condition in RPCZ

There is a critical bug in RPCZ which causes a crash in certain scenarios. Given that it is intermittent, it is hard for me to provide repro steps, but the problem is quite straightforward. To explain it, I need three pieces of code.

// Any stub method, generated by the RPCZ compiler.
void XXXStub::XXXMethod(...) {
  ::rpcz::rpc rpc;
  channel_->call_method(..., &rpc, ...);  // (1)
  rpc.wait();                             // (3)
}

// File rpcz\rpc_channel_impl.cc
void rpc_channel_impl::handle_client_response(...) {
  response_context.rpc_->set_status(status::OK);  // (2)
  response_context.rpc_->sync_event_->signal();   // (4)
}

Assume that the threads were scheduled so that the sequence was as follows:

  1. channel_->call_method(..., &rpc, ...);
  2. response_context.rpc_->set_status(status::OK);
  3. rpc.wait();
  4. response_context.rpc_->sync_event_->signal();

Now, the last piece of code, the rpc::wait() method:

int rpc::wait() {
  status_code status = get_status();
  CHECK_NE(status, status::INACTIVE) << "Request must be sent before calling wait()";
  if (status != status::ACTIVE) return get_status();  // THIS LINE SHOULD BE REMOVED
  sync_event_->wait();
  return 0;
}

If the threads were scheduled as described above, the status of the rpc will be OK, and rpc::wait() will exit immediately. Therefore, by the moment handle_client_response() reaches sync_event_->signal(), the sync_event might already be disposed, and we get a crash.

I believe the fix is just to remove the following line in rpc::wait() method:
if (status != status::ACTIVE) return get_status();

Configure 1ms sleeps across the code

Search for this code:

boost::this_thread::sleep(boost::posix_time::milliseconds(1));

It occurs in many places. Everywhere, the 1 ms should be configurable via a config object. Keep 1 ms as the default value.
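A sketch of what the configurable version could look like (the config struct and field name are made up for illustration; the boost call is shown as a comment since it is the code being replaced):

```cpp
// Hypothetical config field replacing the hardcoded literal.
struct PollingConfig {
  int idle_sleep_ms = 1;  // default keeps today's behavior
};

int IdleSleepMilliseconds(const PollingConfig& config) {
  // Call sites would then do:
  // boost::this_thread::sleep(
  //     boost::posix_time::milliseconds(config.idle_sleep_ms));
  return config.idle_sleep_ms;
}
```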

Scale regularization coefficients according to the size of the collection

Devise a normalization for each regularizer such that the coefficient value tau=1 is naturally interpreted as "normal"; so that there is a formulation (as in physics) of what happens to the model "per unit of regularization"; and so that tau=0.1 means "weaken the regularization by a factor of 10", while tau=10 means "strengthen it by a factor of 10".
