Size of Training Data. (crfsuite issue, open)

chokkan commented on June 24, 2024
Size of Training Data.

from crfsuite.

Comments (5)

jndevanshu commented on June 24, 2024

Hi akazer2,
Were you able to resolve the issue?
I was also trying to train a model on a text file (10 MB), but crfsuite crashes with a segmentation fault.
Thanks in advance.

viveksck commented on June 24, 2024

Did anyone manage to resolve this?

usptact commented on June 24, 2024

The thing is that training requests much more memory than is needed just to hold your dataset in memory.

For datasets this big, I suggest using online algorithms. I have found Vowpal Wabbit (VW) to be not only very versatile but also to scale very well. Yes, that includes sequence tagging, as CRFSuite does. I can show how to do sequence tagging with VW.

bratao commented on June 24, 2024

@usptact, could you please provide an example of sequence tagging in Vowpal Wabbit? What are the command line and the input format?

usptact commented on June 24, 2024

The data format is similar to that of CRFSuite, except that spaces are used to separate features. VW also introduces feature spaces (namespaces). The following is training data for sequence tagging in VW format (notice the empty line between the two sequences; I am using only one feature space, called "f"):

label1 |f f1 f2 f3
label2 |f f2 f3 f4
label3 |f f4 f5 f1

label2 |f f2 f4
label3 |f f1 f3
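
If your data is already in CRFSuite's tab-separated layout (label first, then attributes, with a blank line between sequences), the conversion can be scripted. The following is only a minimal sketch under some assumptions: the function name `crfsuite_to_vw` is made up for illustration, and any `:`, `|`, or whitespace inside a feature name is replaced with `_` because VW gives those characters special meaning. That replacement also discards CRFSuite's optional `name:value` attribute weights, so adapt it if you rely on them.

```python
import re


def crfsuite_to_vw(lines, namespace="f"):
    """Convert CRFSuite-style lines to VW-style lines (hypothetical helper).

    Input lines are tab-separated: the label, then the attributes.
    Blank lines separate sequences and are passed through unchanged.
    """
    converted = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            converted.append("")  # sequence boundary in both formats
            continue
        label, *features = line.split("\t")
        # ':' separates a feature from its value and '|' starts a namespace
        # in VW, so neither may appear inside a feature name.
        cleaned = [re.sub(r"[:|\s]", "_", f) for f in features if f]
        converted.append(" ".join([f"{label} |{namespace}", *cleaned]))
    return converted


sample = [
    "label1\tf1\tf2\tf3\n",
    "label2\tf2\tw[0]=a:b\n",
    "\n",
    "label2\tf2\tf4\n",
]
print("\n".join(crfsuite_to_vw(sample)))
```

To convert a whole file, pass the open file object as `lines` and write the returned lines to `train.feat`.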

The sequence tagging model can be trained with this command:

# A comment cannot follow a line-continuation backslash, so the flags are explained here:
#   --passes          keep this small
#   --search_task     the task is sequence tagging
#   --search          number of possible labels
#   --named_labels    comma-separated list of string labels, if integer labels are not used
#   -b                number of bits for feature hashing - more is better
#   --l2              per-example regularization
#   -f                store the model
#   --readable_model  store the model in readable format
vw  --data train.feat \
    --cache \
    --passes 10 \
    --search_task sequence \
    --search "$NUM_LABELS" \
    --search_rollin=policy \
    --search_rollout=none \
    --named_labels "$(< labels)" \
    -b 28 \
    --l2=1e-5 \
    --l1=1e-7 \
    -f "$MODEL" \
    --readable_model "$MODEL.txt"
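
The command reads the label list from a file called `labels` and needs the label count in `$NUM_LABELS`. One way to derive both from the training data is sketched below. It assumes the VW layout shown earlier, where the label is everything before the first `|`; the helper name `collect_labels` is hypothetical, and listing labels in order of first appearance is just a choice made for this example.

```python
def collect_labels(vw_lines):
    """Return the distinct labels in order of first appearance (hypothetical helper)."""
    labels = []
    seen = set()
    for line in vw_lines:
        line = line.strip()
        if not line:
            continue  # blank lines only separate sequences
        label = line.split("|", 1)[0].strip()
        if label and label not in seen:
            seen.add(label)
            labels.append(label)
    return labels


example = [
    "label1 |f f1 f2 f3",
    "label2 |f f2 f3 f4",
    "label3 |f f4 f5 f1",
    "",
    "label2 |f f2 f4",
    "label3 |f f1 f3",
]
found = collect_labels(example)
print(",".join(found))  # prints: label1,label2,label3
print(len(found))       # prints: 3
```

Writing the first printed line to the `labels` file and using the second value as `NUM_LABELS` gives the two inputs the command expects.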
