
sashafrey / topicmod

This project has been moved to https://github.com/bigartm/bigartm

License: Other

CSS 0.04% C++ 73.68% C 1.80% Vim Script 0.04% Emacs Lisp 0.08% Java 14.17% Python 9.12% TeX 0.53% Shell 0.54%

topicmod's People

Contributors

antoniox, jeanpaulshapo, marinadudarenko, mellain, netbug, rdkl, sashafrey, vuvko


topicmod's Issues

Make sure all threads catch(...) in their main thread functions

An uncaught exception in a non-main thread causes terminate(). Such situations are very difficult to debug, because terminate() aborts all threads in an uncontrolled manner, causing null references, access violations, and all sorts of weirdness that is impossible to debug.

To avoid this, we should review all threads created in topicmd and make sure we catch all exceptions (and log a proper error message).
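A minimal sketch of such a guard (the helper name and signature are illustrative, not from the codebase): every thread entry point runs through a wrapper that catches everything and logs instead of letting the exception escape.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <thread>

// Hypothetical helper: run a thread body inside catch(...) so an escaped
// exception is logged instead of aborting the whole process.
// Returns true if the body completed, false if an exception was caught.
bool RunGuarded(const std::function<void()>& body) {
  bool ok = false;
  std::thread t([&]() {
    try {
      body();
      ok = true;
    } catch (const std::exception& e) {
      std::cerr << "Thread failed: " << e.what() << std::endl;
    } catch (...) {
      std::cerr << "Thread failed with an unknown exception" << std::endl;
    }
  });
  t.join();  // in real code the thread would run for the instance lifetime
  return ok;
}
```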

Fix interface of TokenTopicMatrix class

There are several issues:

  1. The class should be renamed to TopicModel.
  2. TokenWeights should be an iterator, renamed to TopicWeightIterator.
  3. The token_id() function should never return -1.
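A minimal sketch of what the proposed iterator could look like (the methods and storage are illustrative assumptions, not the actual interface):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical TopicWeightIterator: walks the topic weights of one token,
// replacing the raw TokenWeights accessor described above.
class TopicWeightIterator {
 public:
  explicit TopicWeightIterator(const std::vector<float>& weights)
      : weights_(weights), pos_(static_cast<size_t>(-1)) {}

  // Advance to the next topic; returns false past the last one.
  bool NextTopic() { return ++pos_ < weights_.size(); }
  size_t TopicIndex() const { return pos_; }
  float Weight() const { return weights_[pos_]; }

 private:
  const std::vector<float>& weights_;
  size_t pos_;
};
```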

Documentation: create diagrams with overall architecture

For internal documentation it would be nice to have two diagrams:

  1. One diagram representing how the modules (DataLoader, Processor, etc.) are connected to each other.
  2. A second representing code dependencies and interfaces (cpp_interface, c_interface).

Make proper error codes

rdkl:

Maybe replace

#define ARTM_SUCCESS

with

namespace artm {
enum ExitCodes {
  SUCCESS,
  FIRST_ERROR_TYPE,
  SECOND_ERROR_TYPE
};
}  // namespace artm

This also gives auto-completion.

alfrey:
I have to think about this :)

Add processors_count to InstanceConfig

Currently the number of processors is hardcoded to one. We need to support multicore machines by adding more processors to the instance. The fix should be fairly simple: just instantiate more Processor objects, connect them to the same queues (processor_queue_ and merger_queue_), and it should work! :)
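A sketch of the suggested fix, with stand-in types since the real Processor and queue classes live elsewhere in the codebase:

```cpp
#include <memory>
#include <queue>
#include <vector>

// Stand-ins for the real types; illustrative only.
struct Batch {};
class Processor {
 public:
  Processor(std::queue<Batch>* in, std::queue<Batch>* out)
      : in_(in), out_(out) {}
 private:
  std::queue<Batch>* in_;
  std::queue<Batch>* out_;
};

// Create processors_count Processors, all attached to the same
// processor_queue and merger_queue, instead of a single hardcoded one.
std::vector<std::unique_ptr<Processor>> CreateProcessors(
    int processors_count,
    std::queue<Batch>* processor_queue,
    std::queue<Batch>* merger_queue) {
  std::vector<std::unique_ptr<Processor>> processors;
  for (int i = 0; i < processors_count; ++i) {
    processors.emplace_back(new Processor(processor_queue, merger_queue));
  }
  return processors;
}
```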

Implement calculation of perplexity

Currently Merger and Processors will tune models forever (in the background).
Ideally we should have an option to stop tuning once perplexity (or any other measure) reaches a sufficient level. I don't have a clear vision of how to achieve this; you are welcome to suggest your design!
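One possible shape for such a stopping rule, as a minimal sketch (the function name, signature, and relative-tolerance criterion are assumptions, not a committed design):

```cpp
// Hypothetical convergence check: stop the background tuning loop once the
// relative improvement in perplexity between iterations falls below a
// tolerance. Any other quality measure could be plugged in the same way.
bool ShouldStopTuning(double prev_perplexity, double cur_perplexity,
                      double rel_tolerance) {
  if (prev_perplexity <= 0.0) return false;  // no previous measurement yet
  double rel_improvement =
      (prev_perplexity - cur_perplexity) / prev_perplexity;
  // Keep tuning while perplexity still improves noticeably (or got worse,
  // which may be transient noise).
  return rel_improvement >= 0.0 && rel_improvement < rel_tolerance;
}
```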

Don't hardcode Float

Currently the token-topic matrix uses 'float' as its datatype. This should perhaps become a separate typedef ('artm_float'), to make sure we can switch to double whenever better precision is needed.

I'll have to think about some details (for example, what to do with protobuf messages).
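The typedef itself would be a one-liner, sketched below (the name 'artm_float' comes from the issue text; its placement is an assumption):

```cpp
// Hypothetical single switch point for the matrix datatype: everything in
// the token-topic matrix uses artm_float instead of a bare float.
namespace artm {
typedef float artm_float;  // change to double when more precision is needed
}  // namespace artm
```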

create_model must generate model_id (not require it as input)

Currently c_interface and cpp_interface require the caller to provide a model_id. In fact, it should be Instance::UpdateModel which keeps track of existing model_ids and generates a new model_id. Such a function should be named CreateModel, and UpdateModel should be renamed to ReconfigureModel (to match the names in c_interface and cpp_interface).
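A minimal sketch of the proposed split (the Instance members and integer model ids are illustrative; the real classes and id types may differ):

```cpp
#include <map>
#include <string>

// Hypothetical Instance: CreateModel generates a fresh model_id instead of
// the caller inventing one; ReconfigureModel updates an existing model.
class Instance {
 public:
  int CreateModel(const std::string& model_config) {
    int model_id = next_model_id_++;
    models_[model_id] = model_config;
    return model_id;
  }

  bool ReconfigureModel(int model_id, const std::string& model_config) {
    auto it = models_.find(model_id);
    if (it == models_.end()) return false;  // unknown model_id
    it->second = model_config;
    return true;
  }

 private:
  int next_model_id_ = 1;
  std::map<int, std::string> models_;
};
```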

Dynamically add tokens to model when new data arrives

Currently the merger resets the whole model whenever it sees a new generation. There are better ways to do this:

  1. Scan the whole generation, and add only the tokens that were added.
  2. Scan only the new partitions (those that differ between the old and new generations).

Implement ArtmRequestBatchTopics

The ArtmRequestBatchTopics method of c_interface is not yet implemented. The goal of this method is to classify a collection of documents (i.e. a Batch) with a specific topic model.

Fix Linux instructions about 3rd party libraries

  1. There is a small inaccuracy in the documentation, section "How to build sources", part 2.5.2 "Linux".
    It does not say how to build the libraries (protobuf, rpcz, zeromq and glog). Presumably one should run make with the appropriate targets in the topicmod/3rdparty folder.
    For protobuf there actually is an instruction, but then it is unclear what exactly should be installed: the version from the distribution repositories, or the one built from topicmod/3rdparty?
  2. When cpp_client is launched without parameters, it prints a usage message, but the program name is shown as PlsaBatchEM (apparently the old name).

error LNK1104: cannot open file 'libprotobuf.lib'

error LNK1104: cannot open file 'libprotobuf.lib' in VS2010, in projects topicmd_tests, topicmd_dll and c_client.
Root cause: for some reason the $(VisualStudioVersion) macro doesn't exist.
Workaround: replace "vc$(VisualStudioVersion)" in the linker path settings with the explicit string "vc10.0".

Store collection to disk in DataLoader

DataLoader currently keeps all the data in memory. The suggested change is to:

  1. In DataLoader.AddBatch(), store each batch in a separate file (using google protocol buffers serialization). The location is defined by DataLoaderConfig.disk_path().
  2. In the DataLoader constructor, read the list of files under DataLoaderConfig.disk_path() and save it, to be able to iterate through those files.
  3. In DataLoader.ThreadFunction(), read batches from disk.
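The three steps above can be sketched as follows. This is a simplified stand-in: a real implementation would serialize the Batch protobuf message (e.g. via SerializeToOstream) rather than a raw string, and the class and file names here are assumptions.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical disk-backed loader; a string stands in for a serialized Batch.
class DiskDataLoader {
 public:
  explicit DiskDataLoader(const std::string& disk_path)
      : disk_path_(disk_path) {}

  // Step 1: AddBatch writes each batch to its own file under disk_path.
  void AddBatch(const std::string& serialized_batch) {
    std::ostringstream name;
    name << disk_path_ << "/batch_" << batch_files_.size() << ".pb";
    std::ofstream out(name.str(), std::ios::binary);
    out << serialized_batch;
    batch_files_.push_back(name.str());  // step 2: remember files to iterate
  }

  // Step 3: ThreadFunction() would read batches back one by one.
  std::string ReadBatch(size_t index) const {
    std::ifstream in(batch_files_[index], std::ios::binary);
    std::ostringstream content;
    content << in.rdbuf();
    return content.str();
  }

  size_t batch_count() const { return batch_files_.size(); }

 private:
  std::string disk_path_;
  std::vector<std::string> batch_files_;
};
```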

Implement streams in DataLoader

Goal: make it possible to set up several *streams* inside the data loader
(for example, one stream for training items and one for test items).
Models (and quality measures such as perplexity) specify which stream to use for tuning (evaluation).

Design a better representation of Topic

Currently the concept of a 'Topic' is barely represented in the code. Basically, a number of topics is associated with each model, and beyond that a topic is just an index into some arrays/matrices. This makes it difficult to associate extra information with each topic (for example, a flag distinguishing a 'subject topic' from a 'background topic').
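One possible direction, as a sketch (the struct name and fields are illustrative assumptions): give topics an explicit descriptor that metadata can attach to, while topic index i still addresses row/column i of the matrices.

```cpp
#include <string>

// Hypothetical descriptor: a home for per-topic metadata instead of a
// topic being nothing but an array index.
struct TopicDescriptor {
  std::string name;
  bool is_background = false;  // 'subject topic' vs 'background topic'
};

// A model would then hold a std::vector<TopicDescriptor> alongside its
// matrices; topic index i refers to descriptor i.
```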

Ctrl+C in cpp_client results in error "Assertion failed: Successful WSASTARTUP not yet performed"

This is a regression introduced in the alfrey_rpcz branch. If one presses Ctrl+C to abort cpp_client during ongoing processing, the following error message appears:

Assertion failed: Successful WSASTARTUP not yet performed (......\src\signaler.cpp:328)

As far as I can see, there is no simple way to solve this on Windows. On Linux it can be handled by calling the install_signal_handler() method from src\rpcz\include\rpcz\connection_manager.hpp.

I hope this is an annoying but harmless error. However, there might be some consequences - for example, leaked socket connections when clients crash. So I suggest we ignore this ticket for now, but keep it open as a reminder.

Implement simple Python interface

topicmd is designed as a library that can be used from any other language (at least from Python and Java, most likely C#, and perhaps Matlab).
We should implement a Python interface, since Python is a very nice environment for conducting research experiments. Everything that c_client does nowadays should be doable from Python (push a collection, create a model, fetch back the results).

Integration with Python should be done via the "ctypes" library.
An example of such integration can be found here:
http://sourceforge.net/p/vft11ccas/code/HEAD/tree/Source/alfrey/Python/
Ask sashafrey for more details.

Use CMake as a build engine

CMake should save us a lot of trouble. We already have many issues with our current build system:

  • The current Makefiles don't work well on Fedora, because they rely on /etc/ld.so.conf to load .so dependencies (while Ubuntu uses LD_LIBRARY_PATH).
  • We currently maintain both Makefiles and Visual Studio project files. CMake can generate both from a single CMakeLists.txt.
  • #47
  • #45
  • #6

Prototype sorted n(t,w) layout

From "An Architecture for Parallel Topic Models" by A. Smola and S. Narayanamurthy

To store the n(t,w) we use the same memory layout as Mallet. That is, we maintain a list of (topic, count) pairs for each word "w" sorting in order of decreasing counts. This allws us to implement a [Gibbs] sampler efficiently (with high probability we do not reach the end of the list) since the most likelly topics occur first.
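A minimal sketch of that layout, assuming a plain vector of (topic, count) pairs per word: on increment, the entry bubbles toward the front so the list stays sorted by decreasing count and a sampler scanning from the front usually stops early.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Per-word list of (topic, count) pairs, kept in decreasing-count order.
typedef std::pair<int, int> TopicCount;

void IncrementCount(std::vector<TopicCount>* counts, int topic) {
  for (size_t i = 0; i < counts->size(); ++i) {
    if ((*counts)[i].first == topic) {
      (*counts)[i].second++;
      // Bubble the entry towards the front to restore decreasing order.
      while (i > 0 && (*counts)[i - 1].second < (*counts)[i].second) {
        std::swap((*counts)[i - 1], (*counts)[i]);
        --i;
      }
      return;
    }
  }
  counts->push_back(std::make_pair(topic, 1));  // first occurrence: count 1
}
```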

Add option in Merger to forget old counters "exponentially" as in online algorithms

Currently Merger keeps accumulating counters. It never resets them to zero and does not "exponentially decrease" the values corresponding to old iterations. This is suboptimal: during the first steps the Phi matrix was inaccurate, so we accumulated some noise.

It is common for online algorithms to exponentially decrease counters. You are welcome to suggest the actual implementation details for this feature, and implement it :)

Implement Instance::Commit operation

This operation should store in-memory partitions to disk (to form the "index"). It is an important step towards big data: processing should not require keeping the whole collection in memory.

This is quite a big and complex fix, which requires at least the following:

  • Add the index location (a disk folder) to InstanceConfig.
  • Partition should become a base class with two inheritors: MemoryPartition and DiskPartition.
  • DataLoader should be able to handle both types of partitions (both disk and in-memory).
  • Instance should be able to start on top of an existing index (without re-feeding all data through insert_batch).

Processor should be able to re-use Thetas from previous iteration

With the current implementation the processor tunes Theta (the distribution of items over topics) from scratch each time, even though it doesn't differ much between iterations (N) and (N+1).

The idea is to ask DataLoader to save the thetas, and then pass them back to the Processor.

Rename topicmd -> artm

  • folders
  • namespace
  • vc proj
  • .dll, .lib, and .so files
    Also, consider placing the current code under artm/core, since we will likely add many other modules.

Support more than one request in c_interface

Currently request_model_topics in c_interface.cc is implemented via one global variable (a string message), which holds the data only for the last request. Ideally the result of a request should be stored until dispose_request is invoked.

The task is to implement a proper request manager (similar to instance_manager).
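A minimal sketch of such a request manager (class and method names are illustrative, modeled on the instance_manager analogy in the issue): each result is stored under a fresh id and kept until it is explicitly disposed.

```cpp
#include <map>
#include <string>

// Hypothetical request manager: replaces the single global string with a
// map from request id to the stored result.
class RequestManager {
 public:
  // Store a result and return the id the caller uses to fetch it later.
  int Store(const std::string& message) {
    int request_id = next_id_++;
    results_[request_id] = message;
    return request_id;
  }

  // Returns nullptr for unknown (or already disposed) ids.
  const std::string* Get(int request_id) const {
    auto it = results_.find(request_id);
    return it == results_.end() ? nullptr : &it->second;
  }

  void Dispose(int request_id) { results_.erase(request_id); }

 private:
  int next_id_ = 1;
  std::map<int, std::string> results_;
};
```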

Fix race condition in RPCZ

There is a critical bug in RPCZ which causes a crash in certain scenarios. Given that it is intermittent, it is hard for me to provide repro steps, but the problem is quite straightforward. To explain it, I need three pieces of code.

// Any stub method, generated by the RPCZ compiler.
void XXXStub::XXXMethod(...) {
  ::rpcz::rpc rpc;
  channel_->call_method(..., &rpc, ...);  // (1)
  rpc.wait();                             // (3)
}

// File rpcz\rpc_channel_impl.cc
void rpc_channel_impl::handle_client_response(...) {
  response_context.rpc_->set_status(status::OK);  // (2)
  response_context.rpc_->sync_event_->signal();   // (4)
}

Assume that the threads were scheduled so that the sequence was as follows:

  1. channel_->call_method(..., &rpc, ...);
  2. response_context.rpc_->set_status(status::OK);
  3. rpc.wait();
  4. response_context.rpc_->sync_event_->signal();

Now, the last piece of code, the rpc::wait() method:

int rpc::wait() {
  status_code status = get_status();
  CHECK_NE(status, status::INACTIVE) << "Request must be sent before calling wait()";
  if (status != status::ACTIVE) return get_status();  // THIS LINE SHOULD BE REMOVED
  sync_event_->wait();
  return 0;
}

If the threads were scheduled as described above, the status of the rpc will be OK, and rpc::wait() will exit immediately. Therefore, by the moment handle_client_response() reaches sync_event_->signal(), the sync_event might already be disposed, and we get a crash.

I believe the fix is just to remove the following line in rpc::wait() method:
if (status != status::ACTIVE) return get_status();

Configure 1ms sleeps across the code

Search for this code:

boost::this_thread::sleep(boost::posix_time::milliseconds(1));

It occurs in many places. Everywhere, the 1 ms should be configurable via a config object. Keep 1 ms as the default value.
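A sketch of what the configurable version could look like (the config struct and field name are made up for illustration; the boost call is shown as a comment since it is the code being replaced):

```cpp
// Hypothetical config field replacing the hardcoded literal.
struct PollingConfig {
  int idle_sleep_ms = 1;  // default keeps today's behavior
};

int IdleSleepMilliseconds(const PollingConfig& config) {
  // Call sites would then do:
  // boost::this_thread::sleep(
  //     boost::posix_time::milliseconds(config.idle_sleep_ms));
  return config.idle_sleep_ms;
}
```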

Scale regularization coefficients according to the size of the collection

Devise a normalization for each regularizer such that the coefficient value tau=1 is naturally interpreted as "normal"; so that there is a formulation (as in physics) of what happens to the model "per unit of regularization"; and so that tau=0.1 means "weaken the regularization by a factor of 10", while tau=10 means "strengthen it by a factor of 10".
