Best way to convert words to numbers using a specific word list

I have a text file that contains one tweet per line, and the tweets need to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) for most of my string manipulation, and I'm getting the hang of sed, grep and Python's re module. This next problem, however, is a mind-blower for me, and I'm wondering if anyone could help me with it. I have tried a few Google searches, but to be honest, no luck :(

I always start with pseudocode to make it easier on me, and this is what I want: "Replace -token1- OR -token2- OR -token3- OR -token4- with integer '1', replace all other words/tokens with integer '0'"

Let's say the list of words/tokens that need to become '1' is the following:

  • :)
  • cool
  • happy
  • fun

and my tweets look like this:

  • this has been a fun day :)
  • i find python cool! it makes me happy

The output of the new program/function would be:

  • 0 0 0 0 1 0 1
  • 0 0 0 1 0 0 0 1

NOTE1: Notice how 'cool' has a '!' behind it. It should be counted as a match as well, although I can always remove all punctuation in the file first to make it easier.

NOTE2: All tweets will be lowercase. I already have a function that changes all the lines into lowercase.

Does anyone know how to do this using Unix tools with regex (such as sed, grep, awk), or how to do it in Python? BTW, this is NOT homework; I'm working on a sentiment analysis program and am experimenting a bit.

thanx! :)

asked May 26 '13 at 03:05

How do you want those zeros and ones? As a string or as an int array? Wouldn't you want to compute some sort of total based on the number of words and the relative strength of certain words/tokens?

I think the example is inconsistent. "cool!" is not in the list (but "cool" is). Should exclamation marks be treated specially (e.g. ignored)? What are the rules regarding that?

3 Answers

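from string import punctuation as pnc
# Words/tokens that map to 1; everything else maps to 0.
tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
for tweet in tweets:
    # Test the raw word first (so ':)' matches), then the word with
    # surrounding punctuation stripped (so 'cool!' matches 'cool').
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))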

Output:

0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

The or in the list comprehension is there to handle :), as suggested by @EOL.

There are still cases that will not be handled correctly, such as "cool :), I like it": the word ":)," is not in the token list, and stripping its punctuation leaves an empty string. The problem is inherent to the requirements.
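One possible way to cover more of these cases, sketched here under assumptions that go beyond the stated requirements: treat a whitespace-separated word as a match when it consists of a known token optionally wrapped in punctuation. The names encode_tweet and pattern are made up for this illustration, and the "token wrapped in punctuation" rule is a guess at what is wanted, not something the question specifies.

```python
import re
from string import punctuation

tokens = {':)', 'cool', 'happy', 'fun'}

# Try longer tokens first so a longer token wins over a shorter one
# that happens to be its prefix.
alternatives = '|'.join(
    re.escape(t) for t in sorted(tokens, key=len, reverse=True)
)
# Zero or more punctuation characters (assumed rule, see above).
punct = '[' + re.escape(punctuation) + ']*'
# A word matches if it is a known token, optionally wrapped in punctuation.
pattern = re.compile(punct + '(?:' + alternatives + ')' + punct)


def encode_tweet(tweet):
    """Return one '1' or '0' per whitespace-separated word."""
    return ' '.join(
        '1' if pattern.fullmatch(word) else '0' for word in tweet.split()
    )


for tweet in ['this has been a fun day :)',
              'i find python cool! it makes me happy',
              'cool :), i like it']:
    print(encode_tweet(tweet))
```

With this rule the word ":)," in the third tweet counts as a match, while a word that merely contains a token, such as "uncool", does not. Whether wrapped tokens like "(fun)" should count is a design decision the requirements leave open.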

answered May 26 '13 at 04:05

This is the fastest way, however could be made cleaner. I would just separate printing from processing. - Tadeck

print(" ".join("1" if word in tokens else "0" for word in tweet.split())) - jfs

I just noticed the OP's example treats "cool!" as matching "cool". Thought you would like to know. - Tadeck

+1 because this is the way to go (except for :)), and for the lesser used string.punctuation. However, I agree with J.F. Sebastian: reading str(int(…)) requires more mental gymnastics than his "1" if … else "0", so it is less legible and I don't recommend it. - Eric O Lebigot

The :) could be handled by adding word in tokens or … to the test. This would also speed the processing up a bit, since the most common case does not require stripping punctuation. - Eric O Lebigot
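Following the comments above, a minimal sketch that separates processing from printing and returns the flags as a list of ints, in case an int array is more convenient downstream than a string. The function names tweet_to_flags and format_flags are hypothetical, chosen only for this example; the matching rule is the same one used in the answer.

```python
from string import punctuation

tokens = {':)', 'cool', 'happy', 'fun'}


def tweet_to_flags(tweet, tokens):
    """Return a list of ints: 1 for words found in tokens, 0 otherwise."""
    # Check the raw word first (cheap, and needed for ':)'), then fall back
    # to the word with surrounding punctuation stripped.
    return [
        int(word in tokens or word.strip(punctuation) in tokens)
        for word in tweet.split()
    ]


def format_flags(flags):
    """Join the flags into a space-separated string for printing."""
    return ' '.join(str(flag) for flag in flags)


tweets = ['this has been a fun day :)',
          'i find python cool! it makes me happy']
for tweet in tweets:
    print(format_flags(tweet_to_flags(tweet, tokens)))
```

Keeping the flags as ints until the last moment makes it easy to feed them to other code (sums, feature vectors) and only convert to text when writing the output file.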

In awk:

awk '
NR==FNR {
    a[$1];
    next
}

{
    gsub(/!/, "", $0)  # This will ignore `!`. Other rules can be added.
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets

Test (you'll probably need to alter the gsub line to handle special cases):

[jaypal:~/Temp] cat lookup
:)
cool
happy
fun

[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy

[jaypal:~/Temp] awk '
NR==FNR {
    a[$1];
    next
}

{
    gsub(/!/, "", $0)
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

answered May 26 '13 at 04:05

If you need this as an all-regex solution, have a look at my answer here: "Changing lines of text into binary type pattern"

answered May 23 '17 at 11:05
