Trying to manipulate data: assign a list as the first element of an outer list, with the second element holding information about that list

Ok, I'm trying to transmit a list of values, alongside information regarding that list of values. I am trying to do that while manipulating the data. Let me show you what's going on:

worddictlist2 = []
for innertweet in namelist:
        worddictlist = []
        for tweet in innertweet[0]:
                worddict = {word: tweet.count(word) for word in wordlist}
                worddictlist.append(worddict)
                worddictlist2.append(worddictlist)

namelist is a variable with the following information:

[[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], category], [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], category2]]

I am counting the number of times that a particular word occurs in each phrase. However I still want to keep the category assignment in some way.

I've been trying to append different lists throughout the various loops, I've tried different list comprehensions, and I'm just not seeing the result I want, which will be as follows:

[[{word1: 0, word2: 7, word3: 12, word4: 6}, category], [{word1: 3, word2: 9, word3: 1, word4: 2}, category2]]

How can I get this output? Am I doing this inefficiently? The way I am torturing this data makes me feel like I am doing this process inefficiently.

asked 31 Jul 12 at 11:07

3 Answers

Given the data:

category = "C"
category2 = "C2"

namelist = [
  [['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'],
   category
  ],
  [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'],
   category2
  ]
]

wordlist = "blah string words".split()

Then this should work as described:

from collections import defaultdict

worddictlist2 = []
for innertweet in namelist:
    worddict = defaultdict(lambda: 0)
    category = innertweet[1]
    for tweet in innertweet[0]:
        for word in wordlist:
            worddict[word] += tweet.count(word)

    # optional - transform defaultdict into standard dict to make it printable
    worddictClean = {}
    worddictClean.update(worddict)

    worddictlist2.append([worddictClean, category])

print(worddictlist2)

And this results in:

[[{'blah': 12, 'string': 7, 'words': 0}, 'C'], [{'blah': 1, 'string': 3, 'words': 2}, 'C2']]

Answered 31 Jul 12, 11:07

This appears to do what I want. - steven matthews

Can you please explain defaultdict(lambda: 0)? - steven matthews

defaultdict is a special dictionary that works much the same as the standard {}. The only difference is that when you try to access a key that it does not contain, it immediately creates an entry for that key using the specified "factory function". In this case the factory function is lambda: 0, so that basically means that accessing any new key will immediately create an entry for it with value 0. More here: docs.python.org/library/… - Deestán
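The behaviour described in the comment above can be seen in a minimal sketch. The variable names here are illustrative only and do not come from the thread; `defaultdict(int)` is shown as an equivalent spelling, since calling `int()` returns 0.

```python
from collections import defaultdict

counts = defaultdict(lambda: 0)   # factory returns 0 for any missing key
counts['blah'] += 1               # no KeyError: 'blah' is created as 0, then incremented
looked_up = counts['unseen']      # merely reading a missing key also creates it

# int() returns 0, so this behaves the same way as the lambda above
counts_int = defaultdict(int)
counts_int['blah'] += 1

plain = dict(counts)              # convert to a regular dict, e.g. for printing
print(plain)
```

Note that the lookup of 'unseen' leaves an entry behind in the dictionary, which is worth keeping in mind if the keys are later iterated over.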

First, in the current code the worddict gets created anew for each tweet, which is probably not what you want. Also, by using str.count() you run the risk of counting a word that occurs in the tweet as part of another word, e.g. 'as is the case'.count('as') would be 2 rather than 1, since as appears in the word case as a substring. I would suggest splitting the tweet on whitespace and then iterating over the unique words in that split instead, like words = tweet.split() and {word: words.count(word) for word in list(set(words))}, or simply iterating over the words and incrementing the counts in the dictionary for every occurrence of a word. I'm not sure which is more efficient.
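The difference between substring counting and whole-word counting can be sketched as follows. `collections.Counter` is not mentioned in this answer; it is used here only as one standard-library way to do the "increment per occurrence" variant, and the variable names are illustrative.

```python
from collections import Counter

tweet = 'as is the case'

substring_hits = tweet.count('as')    # str.count also matches 'as' inside 'case'
words = tweet.split()                 # split on whitespace into tokens
whole_word_hits = words.count('as')   # list.count matches only the standalone token

# Counter tallies every token in a single pass over the list
tally = Counter(words)
print(substring_hits, whole_word_hits, dict(tally))
```

Calling `words.count(word)` once per unique word rescans the list each time, whereas a single tallying pass visits each token once, so the tallying approach is likely the cheaper one for long tweets.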

So, my suggestion would be

worddictlist2 = []
for innertweet in namelist:
    worddict = {}
    for tweet in innertweet[0]:
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    worddictlist2.append([worddict, innertweet[1]])

Given the input

namelist = [[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], 'category'], [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], 'category2']]

this code produces

[[{'blah,': 1, 'blah': 11, 'string,': 1, 'string': 6, 'another': 1}, 'category'], [{'string,': 1, 'string': 2, 'again,': 1, 'etc': 1, 'we': 1, 'here': 1, 'blah': 1, 'words,': 2, 'another': 1, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]

In order to get rid of the words with commas attached, you might want to eliminate the punctuation before counting the words, e.g. by adding tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet) to the code above:

import re

worddictlist2 = []
for innertweet in namelist:
    worddict = {}
    for tweet in innertweet[0]:
        tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet)
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    worddictlist2.append([worddict, innertweet[1]])

print(worddictlist2)

which yields

[[{'blah': 12, 'string': 7, 'another': 1}, 'category'], [{'again': 1, 'we': 1, 'string': 3, 'etc': 1, 'here': 1, 'blah': 1, 'another': 1, 'words': 2, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]

Answered 31 Jul 12, 12:07

I absolutely want the worddict to be created for each tweet. - steven matthews

everything else seems great so far. - steven matthews

worddict is going to have several tweets created, and then they're all going to be assigned to a category. - steven matthews

No, it does not. It assigns values per word. This is not what I want. I have a word list and am assigning numbers based on the frequency in this wordlist. - steven matthews
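One possible reading of the comments above is one dictionary per tweet, restricted to the words in wordlist, with each group of dictionaries kept next to its category. The following is only a sketch under that assumption: the sample data is copied from the first answer, 'C' and 'C2' are placeholder categories, and the names `result`, `per_tweet` and `tokens` are hypothetical.

```python
import re

# Sample data copied from the first answer; 'C' and 'C2' are placeholder categories.
namelist = [
    [['blah blah blah string blah blah blah blah blah blah',
      'another string, blah blah blah, string string',
      'string string string'], 'C'],
    [['string string another string, blah',
      'more words, more words, etc',
      'yet again, here we go'], 'C2'],
]
wordlist = 'blah string words'.split()

result = []
for tweets, category in namelist:
    per_tweet = []                                   # one dict per tweet in this category
    for tweet in tweets:
        tokens = re.findall(r'[a-zA-Z0-9]+', tweet)  # whole words, punctuation dropped
        per_tweet.append({word: tokens.count(word) for word in wordlist})
    result.append([per_tweet, category])             # keep the category with its dicts

print(result)
```

Counting against a token list rather than the raw string avoids the substring problem raised in this answer, while the dictionary keys stay limited to wordlist.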

Maybe like this:

worddictlist2 = []
for innertweet, cat in namelist:
   wdlist = {}  # fresh counts for each category
   for i in innertweet:
      for j in i.split():
         j = j.strip(',')  # strip leading/trailing commas
         wdlist.setdefault(j, 0)  # start at 0 if 'j' is an unknown key
         wdlist[j] += 1
   worddictlist2.append([wdlist, cat])  # list.append takes a single argument


print(worddictlist2)

gives:

[
 [{'another': 1, 'blah': 12, 'string': 7}, 'category'],
 [{'again': 1, 'another': 1, 'blah': 1, 'etc': 1, 'go': 1, 'here': 1, 'more': 2, 'string': 3, 'we': 1, 'words': 2, 'yet': 1}, 'category2']
]

Answered 31 Jul 12, 14:07
