Extract the main domain name from a given URL

I used the following to extract the domain from a URL (the list holds my test cases):

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {  
    String res = t.replaceAll(regex, "");  
}

I can get the following results:

google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

The first four cases are good, but the last one is not. I want blogspot.com for the last one, but I get zoyanailpolish.blogspot.com. What am I doing wrong?

asked Aug 27 '11 at 20:08

It looks like the regexes in this post might help you =) -

Then don’t put those silly woublewoos in your pattern. If all you want is to s/^[^.]+\.//, then I suggest you do that. -

Not clear what you want, though. Are you trying to remove the first component always, or all components but the one just before the TLD, or the first one only when it starts with a "ww", or ...? -

How about domains like example.com.tw and example.co.uk? -

Don't do it the hard regex way then. Using regex for this kind of problem is ridiculous. Split on dot into an array. Count the parts. Check if second last part isn't <=3 chars and/or starts with co (there are probably other ccTLDs you'd like to match). Grab the last two or three items depending on the outcome and join them together on the dot again. -
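A minimal Java sketch of the split-and-count heuristic described in the comment above. The class and method names (`NaiveDomainSplitter`, `mainDomain`) are made up for illustration, and reading the rule as "keep three labels when the second-to-last one is at most 3 characters or starts with co" is an assumption about what the commenter meant:

```java
import java.util.Arrays;

class NaiveDomainSplitter {
    // Hypothetical helper: returns the last two or three labels of a host name.
    static String mainDomain(String host) {
        String[] parts = host.split("\\.");
        if (parts.length <= 2) {
            return host; // nothing to strip
        }
        String secondLast = parts[parts.length - 2];
        // Assumed rule: a short label, or one starting with "co", is treated
        // as part of a two-level suffix such as co.uk or com.tw.
        boolean looksLikeSuffix = secondLast.length() <= 3 || secondLast.startsWith("co");
        int keep = looksLikeSuffix ? 3 : 2;
        return String.join(".", Arrays.copyOfRange(parts, parts.length - keep, parts.length));
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com"));
        System.out.println(mainDomain("www.example.co.uk"));
    }
}
```

The heuristic is only a guess at the suffix: a host like www.ibm.com, where the registered name itself is short, comes back unchanged.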

7 Answers

Using the Guava library, we can easily get the domain name:

InternetDomainName.from(tld).topPrivateDomain()

Refer to the API documentation for more details:

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

answered Oct 26 '16 at 11:10

Obtaining the host through a regex is complicated, if not impossible, because TLDs don't follow simple rules: they are provided by ICANN and change over time.

You should instead use the functionality provided by the Java library, like this:

URL myUrl = new URL(urlString);
myUrl.getHost();
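A small self-contained sketch of the same idea, for readers who want to run it. It swaps in `java.net.URI` for `URL` (the `URL(String)` constructor is deprecated in recent JDKs); the class and method names are hypothetical. As with the snippet above, it returns the full host, subdomains included:

```java
import java.net.URI;

class HostDemo {
    // Hypothetical helper: strips scheme, path and query, keeps the whole host.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        // The subdomain is still present in the result.
        System.out.println(hostOf("http://www.google.com/search?q=java"));
    }
}
```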

answered Aug 28 '11 at 02:08

Well, yes, but he already has all that. He wants to sometimes shift off some number of leading elements of the little-endian hostname, although he hasn’t told us how to know how many those might be. He seems to think we can eyeball domainnames and know whether the part we have is the “main” part already or not. I don’t think that’s possible. - tchrist

For the record, this does not answer the question. This returns whatever domain name was given including the subdomain. The OP was looking for the "root" domain name without subdomains, so if given "www.google.com" it should return "google.com". This method returns "www.google.com". This does work nicely if you are just trying to get the domain from a URL with a path and/or query string. - nerdherd

This is 2013, and the solution I found is straightforward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());

answered Nov 9, 13:19

It is much simpler:

    try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        // Keep only the last two labels of the host name
        String[] levels = domainName.split("\\.");
        if (levels.length > 1) {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // domainName is now "blogspot.com"
    } catch (MalformedURLException e) {
        // the input string was not a valid URL
    }

answered Mar 1 '16 at 19:03

What happens with: www.zoyanailpolish.blogspot.co.uk - Clive Paterson

As suggested by BalusC and others, the most practical solution would be to get a list of TLDs (see this), save them to a file, load them, and then determine which TLD a given URL string uses. From there you could construct the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD(url); // To be implemented. Add to helper class?

    // Strip only the trailing "." + tld; replace() would also remove earlier matches
    if (url.endsWith("." + tld)) {
        url = url.substring(0, url.length() - tld.length() - 1);
    }

    // lastIndexOf returns -1 when there is no subdomain, so the whole remainder is kept
    int pos = url.lastIndexOf('.');

    String mainDomain = url.substring(pos + 1) + "." + tld;

The implementation details are left up to you.
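For illustration only, here is one way the missing piece could be filled in. The hard-coded `SUFFIXES` list is a hypothetical stand-in for the TLD file, and the `MainDomainFinder` class and its `mainDomain` method are assumptions rather than part of the answer; a real implementation would load a complete, up-to-date list:

```java
import java.util.List;

class MainDomainFinder {
    // Hypothetical stand-in for the TLD file. Longer suffixes come first
    // so that "co.uk" is matched before "uk".
    private static final List<String> SUFFIXES =
            List.of("co.uk", "com.tw", "com", "net", "it", "uk", "tw");

    static String findTLD(String host) {
        for (String suffix : SUFFIXES) {
            if (host.endsWith("." + suffix)) {
                return suffix;
            }
        }
        throw new IllegalArgumentException("Unknown TLD in: " + host);
    }

    static String mainDomain(String host) {
        String tld = findTLD(host);
        // Drop the trailing "." + tld, then keep the last remaining label.
        String rest = host.substring(0, host.length() - tld.length() - 1);
        String label = rest.substring(rest.lastIndexOf('.') + 1);
        return label + "." + tld;
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com"));
        System.out.println(mainDomain("www.zoyanailpolish.blogspot.co.uk"));
    }
}
```

Ordering the list from longest to shortest suffix is what lets the same loop handle both .com and .co.uk hosts.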

answered Aug 28 '11 at 02:08

@James Poulson, thanks. Sorry, what is the output of your example? I do not quite understand. It removes the TLD first, then adds it back later. So what is the final output? - chnet

There is no output as this is pseudocode. A text file listing the TLDs needs to be created (TLDs can be found on the Wikipedia link), these need to be read into a data structure and the findTLD method needs to be filled in. If done correctly it should do what you want which in this case would give blogspot.com. - James P.

@James Poulson, right. Assuming I get the TLD, the pseudo example would remove .com from the url. Then it moves to the dot position before blogspot. That way, you can remove zoyanailpolish. - chnet

That's the idea :) . If you encounter any issues getting it to work let me know. - James P.

Probably this is not a good idea anymore as there are thousands of new TLD's coming in the next years. - andreas

The reason you are seeing zoyanailpolish.blogspot.com is that your regex only matches strings that start with 'ww'. What you are asking is that, in addition to removing leading components that start with 'ww', it should also work for a string starting with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This will remove any leading component that starts with 'ww', 'z', or 'a'. Customize it based on what exactly you need.
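A quick way to see what the extended pattern does is to run it over a few hosts. This is only a demonstration sketch; the class name and the extra amazon.com test host are made up for illustration:

```java
import java.util.List;

class PrefixRegexDemo {
    // The pattern suggested above: strips a first label starting with ww, z or a.
    static final String REGEX = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)";

    static String strip(String host) {
        return host.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        List<String> hosts = List.of(
                "www.google.com",
                "zoyanailpolish.blogspot.com",
                "xtop10.net",
                "amazon.com"); // hypothetical host, shows the prefix rule over-matching
        for (String host : hosts) {
            System.out.println(host + " -> " + strip(host));
        }
    }
}
```

Because the rule looks only at the first letters of the first label, a plain domain such as amazon.com is reduced to com, so the prefix list has to be chosen with care.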

answered Aug 28 '11 at 01:08

Right. In addition to removing all strings that start with 'ww', it should also work for strings starting with other prefixes (not only 'zoyanailpolish'). For example, "xyz.blogspot.com". - chnet

But as you showed, for xtop10.net it does not remove xtop10, so for certain strings nothing should be removed, right? The question is: do you have a custom list of strings you do not want removed, or is there a rule this is based on? - Bhaskar

to @Bhaskar, It depends. For example, xtop10.net, it is a website. It is a domain name. I do not need to do any changes. While for zoyanailpolish.blogspot.com, the domain name should be blogspot.com. So, I need to remove zoyanailpolish. - chnet

It is very clear what @chnet wants: "Right. I want the main domain and not the subdomains" - James P.

@James It is? Then he should have said that, now shouldn’t he? I hope he has fun telling that .com, .co.uk and pvt.k12.wy.us all count as the same sort of thing. - tchrist

InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained

answered May 17 '19 at 13:05
