I'm trying to train a model with a text file that is 42G in size. I have more than eno


Size of Training Data. about crfsuite HOT 5 OPEN

chokkan commented on June 24, 2024

Size of Training Data.

from crfsuite.

Comments (5)

jndevanshu commented on June 24, 2024

Hi akazer2,
Were you able to resolve the issue?
I was also trying to train on a text file (10 MB), but crfsuite gives a segmentation fault.
Thanks in advance.


viveksck commented on June 24, 2024

Did anyone manage to resolve this?


usptact commented on June 24, 2024

The thing is that training requests much more memory than is needed just to hold your dataset in memory.

For datasets this large, I suggest using online algorithms. I found Vowpal Wabbit (VW) to be not only very versatile but also to scale very well, and it supports sequence tagging just as CRFSuite does. I can show how to do sequence tagging with VW.


bratao commented on June 24, 2024

@usptact, could you please provide an example of sequence tagging in Vowpal Wabbit? What command line and input format do you use?


usptact commented on June 24, 2024

The data format is similar to that of CRFSuite, except spaces are used to separate features. VW also introduces feature spaces. The following is a training example for sequence tagging in VW format (notice the empty line between the two examples; I am using only one feature space, called "f"):

label1 |f f1 f2 f3
label2 |f f2 f3 f4
label3 |f f4 f5 f1

label2 |f f2 f4
label3 |f f1 f3
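
Since the two formats are so close, converting an existing CRFSuite training file is mostly a matter of swapping separators. Below is a minimal Python sketch, assuming the CRFSuite file is tab-separated with the label in the first column, a blank line between sequences, and unweighted features. The function name `crfsuite_to_vw` and the choice to replace special characters with `_` are my own illustration and not part of either tool.

```python
def crfsuite_to_vw(lines, namespace="f"):
    """Yield VW-format lines from CRFSuite-style lines (label<TAB>feat<TAB>feat...).

    Assumption: features carry no weights, so any ':' is part of the feature name.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            yield ""  # keep the blank line: it marks a sequence boundary
            continue
        label, *feats = line.split("\t")
        # '|', ':' and spaces have special meaning in VW input, so replace them
        # inside feature names (hypothetical choice of replacement character).
        clean = [
            f.replace(":", "_").replace("|", "_").replace(" ", "_")
            for f in feats
            if f
        ]
        yield f"{label} |{namespace} {' '.join(clean)}"


if __name__ == "__main__":
    sample = ["label1\tf1\tf2\tf3\n", "label2\tf2\tf3\tf4\n", "\n", "label2\tf2\tf4\n"]
    for vw_line in crfsuite_to_vw(sample):
        print(vw_line)
```

Because the function is a generator, it can be fed an open file object and written out line by line, so even a very large training file never has to sit in memory during conversion.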

The sequence tagging model can be trained with this command:

# Options are collected in a bash array so that each one can carry its own comment
# (a comment after a trailing backslash would break the line continuation).
vw_args=(
    --data train.feat
    --cache
    --passes 10                    # keep this small
    --search_task sequence         # the task is sequence tagging
    --search "$NUM_LABELS"         # number of possible labels
    --search_rollin=policy
    --search_rollout=none
    --named_labels "$(< labels)"   # comma-separated list of string labels, if integer labels are not used
    -b 28                          # number of bits for feature hashing - more is better
    --l2=1e-5                      # per-example regularization
    --l1=1e-7
    -f "$MODEL"                    # store the model
    --readable_model "$MODEL.txt"  # store the model in readable format
)
vw "${vw_args[@]}"
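
The command above expects a label count in `$NUM_LABELS` and a file named `labels` holding the comma-separated label names. One way to derive both from the training file is sketched below in Python, assuming the label is the first whitespace-separated token on every non-empty line, as in the example data. The helper name `collect_labels` is hypothetical, and listing labels in order of first appearance is simply my choice here.

```python
def collect_labels(lines):
    """Return the distinct labels in order of first appearance.

    Assumption: the label is the first whitespace-separated token
    on each non-empty line; blank lines (sequence boundaries) are skipped.
    """
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for line in lines:
        parts = line.split()
        if parts:
            seen.setdefault(parts[0], None)
    return list(seen)


if __name__ == "__main__":
    sample = [
        "label1 |f f1 f2 f3",
        "label2 |f f2 f3 f4",
        "label3 |f f4 f5 f1",
        "",
        "label2 |f f2 f4",
        "label3 |f f1 f3",
    ]
    labels = collect_labels(sample)
    print(len(labels))       # value to use for NUM_LABELS
    print(",".join(labels))  # content to write to the labels file
```

Passing an open file object instead of the sample list scans the training data in a single streaming pass, which matters for files in the tens of gigabytes.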
