I would like to implement some kind of service my customers can use to find their company on a. blogs and forums, b. Facebook and Twitter, c. review sites.
a. blogs, forums: This can only be done by a crawler, right? A crawler that looks for the robots.txt on a forum/blog and then optionally reads the content (and of course the links) of the forum/blog. But where to start? Can I use a set of sites to start crawling from? Do I have to predefine them, or can I use some other search engine first? E.g. searching Google for that company and then crawling the SERPs? Is that legal?
b. Facebook, Twitter: They have APIs, so that should not be a problem, I think.
c. review sites: I looked at some review sites' TOS, and they state that crawling their sites with automated software is not permitted. On the other hand, the sites that are relevant to me are not disallowed in their robots.txt. What matters here?
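For reference, the robots.txt check described above could be sketched in Python roughly like this. It is only a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content, the `mention-bot` user agent, and the example.com URLs are made up for illustration, and a real crawler would download the file from the site first. Passing this check says nothing about what the site's TOS allows.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption for illustration); a real
# crawler would fetch it from the site's /robots.txt before anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/forum/thread-1",
        "https://example.com/private/page",
    ):
        print(url, is_allowed(ROBOTS_TXT, "mention-bot", url))
```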
Any other hints are welcome.
Thanks in advance :-)
asked Jan 8 '11 at 15:01
Honestly, the easiest way to do it would be to start with the search engines. They all have APIs for doing automated searches, so that'd probably give you the highest return for your time on getting backlinks/mentions of your client's products or brand.
That won't handle things behind authentication, only public stuff (of course). But it'll give you a good baseline to start with. From there, you could (if you want) use APIs or custom-written bots that are given auth creds on the sites, but honestly I think at that point you're missing the core question.
Is the core question, "Where are we mentioned?" or is the core question really... "What sites are getting traffic to come to us?" In most cases, it's the latter, in which case you can ignore all of what I said previously and just use Google Analytics, or similar software on your client's site to determine where traffic's coming from.
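To give a rough idea of what the "where is traffic coming from" report boils down to, here is a minimal Python sketch that tallies referring domains. It is not how Google Analytics works internally; the `top_referrers` helper and the sample referrer URLs are hypothetical, and it assumes you already have the referrer URLs available (for example from your web server logs).

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical referrer URLs (assumption for illustration); the empty string
# stands for a visit that arrived without a referrer.
SAMPLE_REFERRERS = [
    "https://forum.example.org/thread/42",
    "https://forum.example.org/thread/7",
    "https://blog.example.net/post",
    "",
]


def top_referrers(referrer_urls, limit=3):
    """Count visits per referring domain and return the most common ones."""
    domains = Counter()
    for url in referrer_urls:
        host = urlparse(url).netloc.lower()
        if host:  # skip visits that have no referrer
            domains[host] += 1
    return domains.most_common(limit)


if __name__ == "__main__":
    for domain, visits in top_referrers(SAMPLE_REFERRERS):
        print(domain, visits)
```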
Edit: OK, so if the question is where you are mentioned, I'd still start with the search engines as stated. Google's API is pretty easy to use, and it has a SOAP-based one that you can pull in as a web reference if you want; example
Re: review sites. If the site's TOS says you can't use automated bots, then it's a good idea not to use automated bots. The robots.txt is not legally binding (it's sort of a good-neighbor thing), so I wouldn't take the lack of an exclusion there as permission. Some review sites (the more modern ones) might disallow automated scraping of their site but still publish RSS or Atom feeds, or have some other API that you can hook into; that's worth checking.
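If a site does publish an RSS feed, scanning it for a company name takes only a few lines. The following is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; the feed content, the "Acme" company name, and the `find_mentions` helper are all made up for illustration, and in practice you would download the feed from the URL the site publishes (and Atom feeds use different element names).

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed (assumption for illustration).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Reviews</title>
    <item>
      <title>Acme Widgets review</title>
      <link>https://reviews.example.com/acme-widgets</link>
      <description>Acme shipped quickly.</description>
    </item>
    <item>
      <title>Other Co review</title>
      <link>https://reviews.example.com/other-co</link>
      <description>Nothing special.</description>
    </item>
  </channel>
</rss>
"""


def find_mentions(rss_xml: str, company: str) -> list[tuple[str, str]]:
    """Return (title, link) for every feed item whose title or description mentions company."""
    root = ET.fromstring(rss_xml)
    needle = company.lower()
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        # Case-insensitive substring match; a real service would want smarter matching.
        if needle in f"{title} {description}".lower():
            hits.append((title, link))
    return hits


if __name__ == "__main__":
    for title, link in find_mentions(SAMPLE_RSS, "Acme"):
        print(title, link)
```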