phemail.py - 'NoneType' object has no attribute 'encode'

root@bt:~# ./phemail.py -g0@*******.com
Gathering emails from domain: ******.com
Traceback (most recent call last):
  File "./phemail.py", line 206, in <module>
  gatherEmails(domain[0],domain[1],p)
  File "./phemail.py", line 51, in gatherEmails
  namesurname = re.sub(' -.*','',a.text.encode('utf8'))
AttributeError: 'NoneType' object has no attribute 'encode'

Why is a.text of type NoneType?

asked Aug 28 '12 at 14:08

(1) please format your code with a code block ({} button). (2) please finish the sentence... -

*.com doesn't look like a valid domain name to me. -

Open phemail.py in a text editor, go to line 51 where the error is, and trace backwards to find where a.text is (or could be but is not) set. -

Suggested adding BeautifulSoup tag - the script crashes based on BS behavior. -

3 Answers

a.text has no value (it is None).
There is probably something wrong with the line where you initialize your a variable.

I would not advise doing things as root, by the way.

answered Aug 28 '12 at 14:08

By way of explanation, what the script is doing is using Google to search through indexed pages of LinkedIn, specifically for pages where users' names appear (as opposed to company profiles, jobs, discussions, etc). Since the target company name, and presumably the standard e-mail format for that company, are known (and specified in args to the script), the search appears to seek all LI profile page results mentioning the company, extract the names, and generate e-mail addresses from the names. It is not scraping e-mail addresses, or even domains - it is scraping names.

It actually shows a lack of understanding of how LI makes public profiles visible to search engines (or a tolerance for a lot of crap results), because your results will be full of 'directory' pages, not profiles.

But aside from that strategic error, you are also using the script wrong - Google does not support per-character wildcards - the wildcard primarily indicates that one or more words may fall between (or after/before - but it works best between) two other words. Wildcard behavior is a bit tricky and not completely documented for all cases, though. So even if this didn't fail later on, your output would be the first hundred names to appear on a very generic "site:" search of LinkedIn (without any company/domain info). Not sure how this is useful to anyone?

As for why the script fails on that specific line, you're iterating over the output of a BeautifulSoup.findAll call for the a-tags of the search result items. In this case, a.text has the value None (type NoneType), and that leads to the error because None has no encode() method. BeautifulSoup has a lot of great shortcuts, but they can be confusing to track back through for errors. The result of findAll is a set of tags, and an attribute lookup on a tag that it doesn't recognize acts like find, so I THINK a.text is like calling find('text') on the individual tag for that cycle of the iteration, which would return None if the tag has no child tag named "text". I can't say for sure why that doesn't work - I don't have BeautifulSoup on this machine - but you should be able to play with this a bit and see where it's going wrong.

In relevant part:

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,}
p = 10

def gatherEmails(l,domain,p):
    print "Gathering emails from domain: "+domain
    emails = []
    for i in range(0,p):
        url = "http://www.google.co.uk/search?hl=en&safe=off&q=site:linkedin.com/pub+"+re.sub('\..*','',domain)+"&start="+str(i)+"0"
        request=urllib2.Request(url,None,headers)
        response = urllib2.urlopen(request)
        data = response.read()
        html = BeautifulSoup(data)
        for a in html.findAll('a',attrs={'class':'l'}):
            namesurname = re.sub(' -.*','',a.text.encode('utf8'))
            firstname = re.sub(' ([a-zA-Z])+','',namesurname).lower()
            surname = re.sub('([a-zA-Z])+ ','',namesurname).lower()
            sys.stdout.write("\r%d%%" %((100*(i+1))/p))
            sys.stdout.flush()
            if firstname != surname and not re.search('\W',firstname) and not re.search('\W',surname):                
                if l == '0' : # 1- firstname.surname@example.com
                    emails.append(firstname+" "+surname)
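
To keep the loop from crashing while you track this down, you could guard against a missing text value before parsing it. This is only a minimal sketch, assuming the link text has already been pulled out of the tag; split_name is a hypothetical helper, not part of phemail.py, and it leaves out the encode('utf8') step so that it runs on both Python 2 and 3:

```python
import re


def split_name(link_text):
    """Parse "Firstname Surname - rest of title" into lowercase parts.

    Hypothetical helper: returns None instead of raising when the
    text is missing, so the caller can skip that search result.
    """
    if link_text is None:
        return None
    # Same regexes as the script: drop everything from " -" onwards,
    # then strip the surname (or the first name) from what is left.
    namesurname = re.sub(r' -.*', '', link_text)
    firstname = re.sub(r' ([a-zA-Z])+', '', namesurname).lower()
    surname = re.sub(r'([a-zA-Z])+ ', '', namesurname).lower()
    return firstname, surname
```

Inside the for loop you would call it on a.text and continue to the next tag when it returns None. That hides the symptom rather than fixing it, but it makes it easy to see whether every result is affected or only some of them.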

answered Aug 28 '12 at 15:08

Thanks all for the help, but I still can't find the solution. By the way, this is the author of this source code here - user1630278

You're using a version of Beautiful Soup before 3.0.8. Upgrade to get .text, .getText(separator), and (in Beautiful Soup 4) .get_text(separator).
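
If upgrading is not an option right away, a small helper can try the newer methods first and fall back to joining the tag's text nodes. This is a rough sketch rather than anything from the Beautiful Soup docs: tag_text is a made-up name, and it assumes the tag object offers findAll(text=True), which Beautiful Soup 3 tags do:

```python
def tag_text(tag):
    """Best-effort text extraction across Beautiful Soup versions.

    Hypothetical helper: tries get_text() (Beautiful Soup 4), then
    getText() (Beautiful Soup 3.0.8+), then joins the text nodes.
    """
    for name in ('get_text', 'getText'):
        method = getattr(tag, name, None)
        # On old versions the lookup yields None instead of a method.
        if callable(method):
            return method()
    # Fallback for old versions: collect the text nodes by hand.
    return u''.join(tag.findAll(text=True))
```

In the script that would mean replacing a.text.encode('utf8') with tag_text(a).encode('utf8').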

answered Aug 29 '12 at 23:08
