Parsing with HtmlAgilityPack

I am trying to parse the following data from an HTML document using HtmlAgilityPack:

<a href="http://abilene.craigslist.org/">abilene</a> <br>
<a href="http://albany.craigslist.org/"><b>albany</b></a> <br>
<a href="http://amarillo.craigslist.org/">amarillo</a> <br>
...

I would like to parse out the URL and the name of the city into two separate files.

Example:

urls.txt
"http://abilene.craigslist.org/"
"http://albany.craigslist.org/"
"http://amarillo.craigslist.org/"

cities.txt
abilene
albany
amarillo

This is what I have so far:

    public void ParseHtml()
    {
        // Clear the output text box.
        textBox1.Clear();

        // HtmlDocument is a managed wrapper around the HTML Document Object Model (DOM).
        HtmlAgilityPack.HtmlDocument hDoc = new HtmlAgilityPack.HtmlDocument();

        // Load the file.
        hDoc.Load(@"c:\AllCities.html");

        try
        {
            // Execute the XPath query typed into the xpathText box.
            // SelectNodes returns null when nothing matches, so the foreach
            // throws a NullReferenceException in that case.
            foreach (HtmlNode hNode in hDoc.DocumentNode.SelectNodes(xpathText.Text))
            {
                textBox1.Text += hNode.InnerHtml + "\r\n";
            }
        }
        catch (NullReferenceException)
        {
            textBox1.Text += "Can't process XPath query, modify it and try again.";
        }
    }

Any help would be greatly appreciated! Thanks guys!

asked Mar 10 '12 at 8:03

I think this can be useful for you -

Perfect! Got all 500 URLs in 30 seconds... -

I still need to get the cities from the HTML. -

Doesn't any node have the value of a? -

1 Answer

I take it you want to parse them from craigslist.org?
This is how I would do it.

// Needs: using System; using System.Collections.Generic; using System.IO;
//        using System.Net; using HtmlAgilityPack;
List<string> links = new List<string>();
List<string> names = new List<string>();
HtmlDocument doc = new HtmlDocument();
// Load the HTML, disposing the WebClient and the response stream afterwards
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead("http://geo.craigslist.org/iso/us"))
{
  doc.Load(stream);
}
// Get all links in the div with the ID 'list' that have an href attribute
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@id='list']/a[@href]");
// Or, if you only have the links already saved somewhere:
//HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches, so check before iterating
if (linkNodes != null)
{
  foreach (HtmlNode link in linkNodes)
  {
    links.Add(link.GetAttributeValue("href", ""));
    names.Add(link.InnerText); // InnerText, so you don't get any HTML tags (e.g. <b>)
  }
}
// Write both lists to files, one entry per line
File.WriteAllText("urls.txt", string.Join(Environment.NewLine, links.ToArray()));
File.WriteAllText("cities.txt", string.Join(Environment.NewLine, names.ToArray()));
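
The sample urls.txt in the question wraps each URL in double quotes, which the code above does not do. If that is really wanted, here is a minimal sketch of one way to get it, run against the three sample links from the question instead of the live site. The class name CityLinkExtractor and the helper Extract are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced and .NET 4 or later.

```csharp
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class CityLinkExtractor
{
    // Hypothetical helper: appends every href value and link text found in
    // the given HTML to the two lists.
    public static void Extract(string html, List<string> urls, List<string> cities)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", ""));
            // InnerText drops nested tags such as <b>; DeEntitize decodes
            // entities like &amp; in the link text.
            cities.Add(HtmlEntity.DeEntitize(anchor.InnerText).Trim());
        }
    }

    public static void Main()
    {
        // The three sample links from the question.
        string html =
            "<a href=\"http://abilene.craigslist.org/\">abilene</a> <br>\n" +
            "<a href=\"http://albany.craigslist.org/\"><b>albany</b></a> <br>\n" +
            "<a href=\"http://amarillo.craigslist.org/\">amarillo</a> <br>";

        List<string> urls = new List<string>();
        List<string> cities = new List<string>();
        Extract(html, urls, cities);

        // Wrap each URL in double quotes to match the sample urls.txt.
        File.WriteAllLines("urls.txt", urls.ConvertAll(u => "\"" + u + "\""));
        File.WriteAllLines("cities.txt", cities);
    }
}
```

File.WriteAllLines ends every line, including the last, with a newline, whereas the string.Join version above leaves no trailing newline. Pick whichever the consumer of the files expects.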

answered Mar 11 '12 at 14:03

Wow, perfect! Thank you very much! - John
