An example of extracting plain text from web pages with htmlparser

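As before, the lookup goes to the cache first. On a miss it calls findExtensions. If that returns null, an empty Extension list is stored in the cache so that the miss is remembered; otherwise the list that was found is cached. The code of findExtensions is as follows: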

private List<Extension> findExtensions(String scope) {
    String[] orders = null;
    String orderlist = conf.get("urlnormalizer.order." + scope);
    if (orderlist == null)
        orderlist = conf.get("urlnormalizer.order");
    if (orderlist != null && !orderlist.trim().equals("")) {
        orders = orderlist.split("\\s+");
    }
    String scopelist = conf.get("urlnormalizer.scope." + scope);
    Set<String> impls = null;
    if (scopelist != null && !scopelist.trim().equals("")) {
        String[] names = scopelist.split("\\s+");
        impls = new HashSet<String>(Arrays.asList(names));
    }
    Extension[] extensions = this.extensionPoint.getExtensions();
    HashMap<String, Extension> normalizerExtensions =
        new HashMap<String, Extension>();
    for (int i = 0; i < extensions.length; i++) {
        Extension extension = extensions[i];
        if (impls != null && !impls.contains(extension.getClazz()))
            continue;
        normalizerExtensions.put(extension.getClazz(), extension);
    }
    List<Extension> res = new ArrayList<Extension>();
    if (orders == null) {
        res.addAll(normalizerExtensions.values());
    } else {
        // first add those explicitly named in correct order
        for (int i = 0; i < orders.length; i++) {
            Extension e = normalizerExtensions.get(orders[i]);
            if (e != null) {
                res.add(e);
                normalizerExtensions.remove(orders[i]);
            }
        }
        // then add all others in random order
        res.addAll(normalizerExtensions.values());
    }
    return res;
}

For scope, the class comment is worth reading: "You can define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (defined in "urlnormalizer.scope.<scope_name>" property) and its own order (defined in "urlnormalizer.order.<scope_name>" property). If any of these properties are missing, default settings are used for the global scope." In other words, you can define a narrower scope to handle the special cases of one caller. For calls made from the Injector, for example, you can set urlnormalizer.order.inject. By default nothing is configured for the inject scope, so the global urlnormalizer.order applies. Its entry in nutch-default.xml is:

<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.
  </description>
</property>
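The lookup with fallback to the global property can be tried outside Nutch. The following is only a minimal sketch: it uses java.util.Properties in place of the Nutch configuration object, the class name ScopedOrderLookup and the class names "a.Basic" and "a.Regex" are made up for illustration, and "inject" and "fetcher" are assumed scope names.

```java
import java.util.Properties;

// Hypothetical stand-alone sketch; not part of Nutch.
public class ScopedOrderLookup {

    // Returns the class names configured for the scope, falling back to the
    // global "urlnormalizer.order" property. Returns null when neither
    // property holds a non-blank value.
    public static String[] resolveOrder(Properties conf, String scope) {
        String orderList = conf.getProperty("urlnormalizer.order." + scope);
        if (orderList == null) {
            orderList = conf.getProperty("urlnormalizer.order");
        }
        if (orderList == null || orderList.trim().isEmpty()) {
            return null;
        }
        // Unlike the listing above, trim first so that leading whitespace
        // does not produce an empty first element.
        return orderList.trim().split("\\s+");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("urlnormalizer.order", "a.Basic a.Regex");
        conf.setProperty("urlnormalizer.order.inject", "a.Regex");

        // The scope-specific property wins when it exists.
        System.out.println(String.join(",", resolveOrder(conf, "inject")));
        // No scope-specific property, so the global order is used.
        System.out.println(String.join(",", resolveOrder(conf, "fetcher")));
    }
}
```

Running main prints a.Regex for the inject scope and a.Basic,a.Regex for the scope that has no property of its own.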

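The remainder of findExtensions reads the normalizer order from urlnormalizer.order, fetches the matching normalizer extensions from the extension point, and then adds the extensions to res in the configured order.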
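The ordering step can be sketched on its own, with plain strings standing in for Extension objects. This is an illustration under assumptions, not the Nutch code: the class OrderedSelection and the names "Basic", "Regex" and "Pass" are hypothetical, and a LinkedHashMap is used so that the leftover entries come out in insertion order, whereas the HashMap in findExtensions gives no ordering guarantee for them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Nutch.
public class OrderedSelection {

    // 'available' maps a class name to a stand-in for its extension.
    // 'orders' may be null, meaning no explicit order was configured.
    public static List<String> select(Map<String, String> available, String[] orders) {
        Map<String, String> remaining = new LinkedHashMap<>(available);
        List<String> result = new ArrayList<>();
        if (orders != null) {
            // First add those explicitly named, in the configured order.
            for (String name : orders) {
                String ext = remaining.remove(name);
                if (ext != null) {
                    result.add(ext); // names that are not available are skipped
                }
            }
        }
        // Then add everything that was not named.
        result.addAll(remaining.values());
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> available = new LinkedHashMap<>();
        available.put("Basic", "basic");
        available.put("Regex", "regex");
        available.put("Pass", "pass");

        String[] orders = {"Regex", "Missing", "Basic"};
        System.out.println(select(available, orders)); // prints [regex, basic, pass]
    }
}
```

"Missing" is listed in the order but not available, so it is silently skipped, which matches the behaviour described in the property's description.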

The Injector calls normalize like this:

url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);

The comment on URLNormalizers says: This class uses a "chained filter" pattern to run defined normalizers. That is, the normalizers are applied one after another in the configured order:

public String normalize(String urlString, String scope)
        throws MalformedURLException {
    // optionally loop several times, and break if no further changes
    String initialString = urlString;
    for (int k = 0; k < loopCount; k++) {
        for (int i = 0; i < this.normalizers.length; i++) {
            if (urlString == null)
                return null;
            urlString = this.normalizers[i].normalize(urlString, scope);
        }
        if (initialString.equals(urlString))
            break;
        initialString = urlString;
    }
    return urlString;
}

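The URL is passed through each Normalizer object in normalizers. The loop stops after loopCount passes, or as soon as a pass produces the same result as the previous one.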
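To see why more than one pass can matter, here is a small sketch of the same loop with plain functions in place of normalizers. The class ChainedNormalize and both steps are hypothetical and are not Nutch normalizers; the second step removes "/./" segments with a single String.replace call, which leaves one behind when two segments overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-alone sketch; not part of Nutch.
public class ChainedNormalize {

    // Applies every step in order, repeating up to loopCount times and
    // stopping early once a full pass leaves the URL unchanged.
    public static String normalize(String url, List<UnaryOperator<String>> chain, int loopCount) {
        String previous = url;
        for (int k = 0; k < loopCount; k++) {
            for (UnaryOperator<String> step : chain) {
                if (url == null) {
                    return null; // an earlier step rejected the URL
                }
                url = step.apply(url);
            }
            if (previous.equals(url)) {
                break; // no further changes
            }
            previous = url;
        }
        return url;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(s -> s.trim());
        chain.add(s -> s.replace("/./", "/"));

        // One pass removes only the first of the two overlapping segments.
        System.out.println(normalize("http://x/././a", chain, 1)); // prints http://x/./a
        // With more passes allowed, the loop runs until nothing changes.
        System.out.println(normalize("http://x/././a", chain, 5)); // prints http://x/a
    }
}
```

With a limit of 5 the loop actually runs three passes: the third pass changes nothing, so it breaks out before the limit is reached.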

 
