How to use a list of strings as training data for an SVM using scikit-learn?

I am using scikit-learn to train an SVM on data where each observation (X) is a list of words. The tags for each observation (Y) are floating-point values. I have tried following the example given in the scikit-learn documentation (http://scikit-learn.org/stable/modules/svm.html) for multi-class classification. Here is my code:

from __future__ import division
from sklearn import svm
import os.path
import numpy

import re

# The stanford-postagger was included to see how it tags the words and to see
# if it would help in getting just the names of the ingredients.
# Turns out it's pointless.
#from nltk.tag.stanford import POSTagger
mainDirectory = './nyu/PROJECTS/Epicurious/DATA/ingredients'
#st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger','/usr/share/stanford-postagger/stanford-postagger.jar')

# This is where we read each line of the file and then run a regex match on it
# to get all the words before the first tab. (These are the names of the
# ingredients. Some of them may have adjectives like fresh, peeled, cut etc.
# Not sure what to do about them yet.)

def getFileDetails(_filename, _fileDescriptor):
    rankingRegexMatch = re.match('([0-9](?:\_)[0-9]?)', _filename)

    if len(rankingRegexMatch.group(0)) == 2:
        ranking = float(rankingRegexMatch.group(0)[0])
    else:
        ranking = float(rankingRegexMatch.group(0)[0]+'.'+rankingRegexMatch.group(0)[2])

    _keywords = []
    for line in _fileDescriptor:
        m = re.match('(\w+\s*\w*)(?=\t[0-9])', line)
        if m:
            _keywords.append(m.group(0))

    return [_keywords, ranking]
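As a quick sanity check, the tab-lookahead regex above can be exercised on a sample line (the sample string here is made up for illustration, not taken from the dataset):

```python
import re

# A line shaped like the ingredient files described above: words, a tab,
# then a number. The lookahead (?=\t[0-9]) anchors the match to the tab
# without consuming it.
line = "fresh basil\t4"
m = re.match(r'(\w+\s*\w*)(?=\t[0-9])', line)
print(m.group(0))  # fresh basil
```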

# Open each file in the directory and pass the name and file descriptor to
# getFileDetails.
def this_is_it(files):
    _allKeywords = []
    _allRankings = []
    for eachFile in files:
        fullFilePath = mainDirectory + '/' + eachFile
        f = open(fullFilePath)
        XandYForThisFile = getFileDetails(eachFile, f)
        _allKeywords.append(XandYForThisFile[0])
        _allRankings.append(XandYForThisFile[1])
    #_allKeywords = numpy.array(_allKeywords,dtype=object)
    svm_learning(_allKeywords, _allRankings)

def svm_learning(x, y):
    clf = svm.SVC()
    clf.fit(x, y)
# This just prints the directory path and then calls the callback x on the
# files.
def print_files(x, dir_path, files):
    print dir_path
    x(files)
# Code starts here.
os.path.walk(mainDirectory, print_files, this_is_it)

When the svm_learning(x, y) method is called, it throws this error:

Traceback (most recent call last):
  File "scan for files.py", line 72, in <module>
    os.path.walk(mainDirectory, print_files, this_is_it)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 238, in walk
    func(arg, top, names)
  File "scan for files.py", line 68, in print_files
  File "scan for files.py", line 56, in this_is_it
  File "scan for files.py", line 62, in svm_learning
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/svm/base.py", line 135, in fit
    X = atleast2d_or_csr(X, dtype=np.float64, order='C')
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 116, in atleast2d_or_csr
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 96, in _atleast2d_or_sparse
    X = array2d(X, dtype=dtype, order=order, copy=copy)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 80, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/Library/Python/2.7/site-packages/numpy-1.8.0.dev_bbcfcf6_20130307-py2.7-macosx-10.8-intel.egg/numpy/core/numeric.py", line 331, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
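The failure can be reproduced in isolation (a minimal sketch with made-up data, not from my actual files): numpy cannot cast a ragged list of word lists into the 2-D float64 array that fit() requests.

```python
import numpy as np

# Each row is a list of words and the rows have different lengths, so
# numpy cannot build a 2-D float64 array from it. This is the same cast
# that SVC.fit() attempts on X.
ragged = [['flour', 'sugar'], ['salt']]
try:
    np.asarray(ragged, dtype=np.float64)
    raised = False
except ValueError:
    raised = True
```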

Can anyone help? I am new to scikit-learn and could not find any help in the documentation.

asked Dec 2 '13 at 18:12

Have a look at the feature extraction documentation. -

1 Answer

You should have a look at: Text feature extraction. You are going to want to use either a TfidfVectorizer, a CountVectorizer, or a HashingVectorizer (if your data is very large). These components take your text in and output feature matrices that are acceptable to classifiers. Be advised that these work on lists of strings, with one string per example, so if you have a list of lists of strings (you have already tokenized), you may need to either join() the tokens to get a list of strings or skip tokenization.
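A minimal sketch of that advice (the ingredient lists and rankings below are invented for illustration, and SVR is used in place of SVC since the targets are floats):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVR

# Data in the question's shape: each X entry is a list of ingredient
# words, each y entry is a float ranking.
X_tokens = [['fresh', 'basil'], ['peeled', 'garlic'], ['sugar']]
y = [4.5, 3.0, 2.5]

# The vectorizers expect one string per example, so join() each token
# list back into a single string.
docs = [' '.join(tokens) for tokens in X_tokens]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, X.shape == (3, 5)

# The targets are continuous, so a regressor is the natural fit here.
model = SVR()
model.fit(X, y)
```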

answered Dec 3 '13 at 20:12
