How can I check whether URLs match within a huge database of online products?

The problem seems simple at first, but it isn't. I'm using MongoDB and Node.js.

Problem: I have a URL, and I need to match it against all the URLs in my database. Remember, there is no rule that the URL I'm on always has "category" in front, or anything like that. And please don't take "cases" into consideration.

I have no idea what the parameter names are, or anything else about the URL.

  1. Let's assume the URL is something like example.com/category/product_name.html?session_id=2423412fd

    In the database I only have example.com/product_name.html

  2. The URL is something like example.com/index.php?productid=6&category=3&utm_campaign=google&utm_source=click

    In the database I only have example.com/index.php?productid=6

  3. The URL is something like example.com/product_name.html

    In the database I only have example.com/category/subcategory/product.html

I think I've made my point. What I'm looking for is a solution that matches URLs in any case (there are more cases than these). It can be an external service, a class, or something more complex.

But I need it to work, and to work very fast, because it runs on every page refresh.

Thanks!

asked Mar 10 '12 at 10:03

In example 3, wouldn't those two URLs be different because of the path?

@Anagio, the technique in example 3 is very common in SEO, but essentially it's the same page behind different-looking URLs.

@alexandru.topliceanu it's bad to use underscores for SEO; see mattcutts.com/blog/dashes-vs-underscores

2 Answers

I would use this function to split the URL into its components: http://php.net/manual/en/function.parse-url.php

Then take the parts of the path that you want to match from the URL and query the URLs in your database for matches.
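Since parse_url is a PHP function and the question uses Node.js, here is a minimal sketch of the same splitting step with Node's built-in URL class. The function name splitUrl is hypothetical, and prepending "http://" to scheme-less URLs is an assumption made so that the parser accepts inputs like the ones in the question.

```javascript
// Split a URL into host, path segments, and query parameters.
// Assumption: inputs without a scheme (e.g. "example.com/a.html") get
// "http://" prepended, because the URL parser requires an absolute URL.
function splitUrl(rawUrl) {
  const withScheme = rawUrl.includes("://") ? rawUrl : "http://" + rawUrl;
  const parsed = new URL(withScheme);
  return {
    host: parsed.hostname,
    // Drop the empty segments produced by leading or trailing slashes.
    segments: parsed.pathname.split("/").filter(Boolean),
    // Query parameters as a plain object (the last value wins on duplicates).
    query: Object.fromEntries(parsed.searchParams),
  };
}

const parts = splitUrl("example.com/category/product_name.html?session_id=2423412fd");
console.log(parts.host, parts.segments, parts.query);
```

Which parts to compare against the database (for example only the last path segment, or only selected query parameters) is left open here, as it is in the answer.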

answered Mar 10 '12 at 10:03

To follow on from Anagio's answer, the URL

example.com/index.php?productid=6&category=3&utm_campaign=google&utm_source=click

could be saved as a Mongo object like:

{
  url: "example.com/index.php?productid=6&category=3&utm_campaign=google&utm_source=click",
  indexes: [
    "example.com",
    "index.php",
    "productid=6",
    "category=3",
    "utm_campaign=google",
    "utm_source=click"
  ]
}
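One possible way to build such an object in Node.js is sketched below. The function name urlToIndexes is hypothetical, and the splitting rule (host, then each path segment, then each key=value pair) is an assumption inferred from the example object above, not a fixed algorithm.

```javascript
// Turn a URL into the list of tokens stored in the "indexes" field.
// Assumption: scheme-less URLs get "http://" prepended so that the
// built-in URL parser accepts them.
function urlToIndexes(rawUrl) {
  const withScheme = rawUrl.includes("://") ? rawUrl : "http://" + rawUrl;
  const parsed = new URL(withScheme);
  const indexes = [parsed.hostname];
  // One token per non-empty path segment.
  for (const segment of parsed.pathname.split("/")) {
    if (segment) indexes.push(segment);
  }
  // One "key=value" token per query parameter.
  for (const [key, value] of parsed.searchParams) {
    indexes.push(key + "=" + value);
  }
  return indexes;
}

const url = "example.com/index.php?productid=6&category=3&utm_campaign=google&utm_source=click";
const doc = { url: url, indexes: urlToIndexes(url) };
console.log(doc);
```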

You could then split up any new URL using the same algorithm, run a map/reduce over the indexes field to score each stored URL, and take the highest score as the best "fuzzy match".
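The scoring idea can be illustrated in plain JavaScript. This is not the map/reduce job itself, only an in-memory sketch of what it might compute, assuming the score is simply the number of tokens a stored URL shares with the incoming one. The names scoreMatch and bestFuzzyMatch are hypothetical.

```javascript
// Score = number of distinct tokens shared by the two token lists.
function scoreMatch(queryIndexes, candidateIndexes) {
  const wanted = new Set(queryIndexes);
  let score = 0;
  for (const token of new Set(candidateIndexes)) {
    if (wanted.has(token)) score += 1;
  }
  return score;
}

// Return the stored document with the highest score, or null if none
// shares any token with the incoming URL. Ties keep the first document.
function bestFuzzyMatch(queryIndexes, docs) {
  let best = null;
  let bestScore = 0;
  for (const doc of docs) {
    const score = scoreMatch(queryIndexes, doc.indexes);
    if (score > bestScore) {
      best = doc;
      bestScore = score;
    }
  }
  return best === null ? null : { url: best.url, score: bestScore };
}

const docs = [
  { url: "example.com/index.php?productid=6", indexes: ["example.com", "index.php", "productid=6"] },
  { url: "example.com/index.php?productid=7", indexes: ["example.com", "index.php", "productid=7"] },
];
const incoming = ["example.com", "index.php", "productid=6", "category=3", "utm_campaign=google", "utm_source=click"];
console.log(bestFuzzyMatch(incoming, docs));
```

Here the first stored URL shares three tokens with the incoming URL and the second shares two, so the first is returned. Scanning every document like this would likely be too slow for a huge collection on every page load, so the candidates would need to be narrowed in the database first; that part is not shown.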

answered Mar 10 '12 at 14:03

This solution could be interesting. I wonder if it really works in every case. How fast would it be to do this on every page load? - alejandro r

@NicholasTolleyCottrell, can you give an example of the map and reduce functions you would use for such a case? Their implementation is essential to obtaining the "fuzzy match" you were talking about. - alexandru.topliceanu
