Check for broken links
I am trying to find all the broken links in a web page using Java. Here is the code:
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    // Returns true if the link (after following any Location header) answers with HTTP 200.
    private static boolean isLive(String link) {
        HttpURLConnection urlconn = null;
        try {
            URL url = new URL(link);
            urlconn = (HttpURLConnection) url.openConnection();
            urlconn.setConnectTimeout(10000);
            urlconn.setRequestMethod("GET");
            urlconn.connect();
            String redirlink = urlconn.getHeaderField("Location");
            System.out.println(urlconn.getHeaderFields());
            // Follow a Location header manually unless it points back at the same URL.
            if (redirlink != null && !url.toExternalForm().equals(redirlink))
                return isLive(redirlink);
            else
                return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (Exception e) {
            // Any failure (malformed URL, unknown host, timeout) counts as "not live".
            System.out.println(e.getMessage());
            return false;
        } finally {
            if (urlconn != null)
                urlconn.disconnect();
        }
    }

    public static void main(String[] s) {
        String link = "http://www.somefakesite.net";
        System.out.println(isLive(link));
    }
}
The code is adapted from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.
This code reports an HTTP 200 status for every web page, including the broken ones. For example, http://www.somefakesite.net/ returns the following header fields:
{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}
Such sites do not exist, so how can I classify them as broken links?
Maybe the issue is that many web servers and DNS providers now detect those "broken" links and redirect you to their own "not found" pages, which are served with a 200 status.
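One way to check that guess is to resolve a host name that almost certainly does not exist and see whether the link under test resolves to the same address. This is only a sketch under assumptions: the class name `DnsCatchAllSketch`, its helper methods, and the random decoy host are made up for illustration, and a shared address is a hint rather than proof (it only catches catch-all behaviour at the DNS level).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class DnsCatchAllSketch {

    /** Resolves a host name to its address strings; returns an empty set if it does not resolve. */
    public static Set<String> resolve(String host) {
        Set<String> addresses = new HashSet<>();
        try {
            for (InetAddress a : InetAddress.getAllByName(host)) {
                addresses.add(a.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // A genuine "no such host" answer: leave the set empty.
        }
        return addresses;
    }

    /** True when the target shares at least one address with the decoy host. */
    public static boolean sharesAddress(Set<String> target, Set<String> decoy) {
        for (String address : target) {
            if (decoy.contains(address)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical decoy: a random name that should not be registered.
        String decoyHost = UUID.randomUUID().toString().replace("-", "") + ".net";
        Set<String> decoy = resolve(decoyHost);
        Set<String> target = resolve("www.somefakesite.net");

        if (target.isEmpty()) {
            System.out.println("Host does not resolve: broken link");
        } else if (sharesAddress(target, decoy)) {
            System.out.println("Host resolves to the same address as a random name: probably broken");
        } else {
            System.out.println("Host resolves to its own address");
        }
    }
}
```

If the decoy name resolves at all, the resolver in use is rewriting "no such host" answers, and a plain status-code check will not be enough on that network.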
Test it against a URL that you know returns a 404 status code (one where the browser shows its own error message rather than a custom page).
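A minimal sketch of such a test, assuming a throwaway local server stands in for "a URL that you know sends the 404 code". The class name `StatusCheckSketch`, the helpers `statusOf` and `isBrokenStatus`, and the `/ok` and `/missing` paths are invented for this example. Redirects are deliberately not followed, so the status you see is the one the server actually sent.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

public class StatusCheckSketch {

    /** Returns the raw HTTP status code for a link, without following redirects. */
    public static int statusOf(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(link).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestMethod("HEAD"); // headers are enough to read the status
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Treats every 4xx and 5xx status as a broken link. */
    public static boolean isBrokenStatus(int code) {
        return code >= 400;
    }

    public static void main(String[] args) throws IOException {
        // Local test server on a free port: one path answers 200, the other 404.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/ok", exchange -> {
            exchange.sendResponseHeaders(200, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/missing", exchange -> {
            exchange.sendResponseHeaders(404, -1);
            exchange.close();
        });
        server.start();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            for (String path : new String[] {"/ok", "/missing"}) {
                int code = statusOf(base + path);
                System.out.println(path + " -> " + code + ", broken: " + isBrokenStatus(code));
            }
        } finally {
            server.stop(0);
        }
    }
}
```

If the checker reports the 404 correctly here but still sees 200 for the fake site, the problem is in what the network returns, not in the Java code.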
EDIT to answer the author's comment (the reply is too long to fit in a comment): I do not see an easy answer to your problem, but there are several different types of failures:
