Setting the User-Agent in HttpClient

gcgmh



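A few days ago I saw someone on CSDN say that pages on dianping.com cannot be scraped, so I tried it with htmlparser. Sure enough, it failed. The result was:
Server returned HTTP response code: 500 for URL: http://www.dianping.com/shop/2212912
So htmlparser was no use here. I then decided to try HttpClient instead: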

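import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient hc = new HttpClient();
GetMethod gm = new GetMethod("http://www.dianping.com/shop/1968937");
hc.executeMethod(gm); // send the GET request
System.out.print(gm.getResponseBodyAsString()); // print the response body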

The returned data:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>提示_大众点评网</title><style type="text/css">html{background:#f7f7f7;}body{background:#fff;color:#333;font-family:"MicrosoftYaHei","微软雅黑",Verdana,Arial;margin:2em auto 0 auto;width:700px;padding:1em 2em;-moz-border-radius:11px;-khtml-border-radius:11px;-webkit-border-radius:11px;border-radius:11px;border:1px solid #dfdfdf;}a{color:#2583ad;text-decoration:none;}a:hover{color:#d54e21;}h1{border-bottom:1px solid #dadada;clear:both;color:#666;margin:5px 0 5px 0;padding:0;padding-bottom:1px;}p{text-align:center;}sub{display:block;margin:0;padding:0;color:#aaa;font-size:11px;text-align:right;}</style></head><body><h1 id="logo" style="text-align: center"><img alt="dianping.com" src="http://i1.dpfile.com/s/img/simplelogo.gif" /></h1><p>对不起，您的访问存在某些问题。<br />如果您是正常访问，请与<a href="mailto:spam@dianping.com">spam@dianping.com</a>联系，并附上以下信息：<br /><textarea rows="10" cols="80">401
221.221.153.137
jakarta commons-httpclient/3.0</textarea></p><sub>401</sub></body></html>

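You can paste this into an HTML file to see how it renders. In short, the page says there is a problem with my visit and asks me to contact spam@dianping.com with the details shown in the text box. The telling detail is the client identification: by default, HttpClient sends the User-Agent jakarta commons-httpclient/3.0.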
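To see at a glance what the server recorded about the client, the textarea content can be pulled out of the response body. The following is only a minimal sketch using a regular expression from the JDK. The class name BlockPageInfo and the method diagnosticLines are made up for illustration, and the sample HTML is a shortened copy of the response above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockPageInfo {
    // Matches the first <textarea ...>...</textarea>, including line breaks inside it.
    private static final Pattern TEXTAREA =
            Pattern.compile("<textarea[^>]*>(.*?)</textarea>", Pattern.DOTALL);

    // Returns the lines inside the textarea, or an empty array if there is none.
    static String[] diagnosticLines(String html) {
        Matcher m = TEXTAREA.matcher(html);
        return m.find() ? m.group(1).trim().split("\\R") : new String[0];
    }

    public static void main(String[] args) {
        // Shortened copy of the block page returned by the site.
        String html = "<p><textarea rows=\"10\" cols=\"80\">401\n"
                + "221.221.153.137\n"
                + "jakarta commons-httpclient/3.0</textarea></p>";
        for (String line : diagnosticLines(html)) {
            System.out.println(line);
        }
    }
}
```

For the response above, the last line is the User-Agent string the server saw, which is what points to the cause of the rejection.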
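Since the default User-Agent looked like the problem, I tried changing it.
Add the following line to the code above, before the executeMethod call: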

// requires: import org.apache.commons.httpclient.params.HttpMethodParams;
hc.getParams().setParameter(HttpMethodParams.USER_AGENT,"Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) Gecko/20090803 Fedora/3.5.2-2.fc11 Firefox/3.5.2"); // set the User-Agent

Send the request again and it works. The site decides whether a visit is legitimate by checking the User-Agent header.
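The behaviour can be reproduced locally without touching the real site. The sketch below is not the code from this post: it swaps in the JDK's own java.net.http.HttpClient (Java 11+) instead of Commons HttpClient 3.x, and a small local com.sun.net.httpserver.HttpServer stands in for dianping.com. The filtering rule (accept only agents starting with "Mozilla/"), the 403 status, the /shop path, and all class and method names are assumptions made for illustration. How the real site decides is not known.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class UserAgentDemo {
    // Firefox User-Agent string taken from the post.
    static final String FIREFOX_UA = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) "
            + "Gecko/20090803 Fedora/3.5.2-2.fc11 Firefox/3.5.2";

    // Hypothetical rule: accept only agents that start with "Mozilla/".
    static boolean looksLikeBrowser(String userAgent) {
        return userAgent != null && userAgent.startsWith("Mozilla/");
    }

    // Starts a local stand-in server on an ephemeral port.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/shop", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean ok = looksLikeBrowser(ua);
            // Echo the rejected agent back, loosely like the block page in the post.
            byte[] body = (ok ? "shop page" : "403\n" + ua).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(ok ? 200 : 403, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET request; a null userAgent keeps the client's default header.
    static int fetchStatus(int port, String userAgent) throws IOException, InterruptedException {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/shop/1"));
        if (userAgent != null) {
            builder.header("User-Agent", userAgent);
        }
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        HttpResponse<String> response =
                client.send(builder.GET().build(), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println("default User-Agent: " + fetchStatus(port, null));
            System.out.println("Firefox User-Agent: " + fetchStatus(port, FIREFOX_UA));
        } finally {
            server.stop(0);
        }
    }
}
```

With this stand-in rule, the request using the client's default User-Agent should be rejected and the one carrying the Firefox string should be accepted, which mirrors what happened with the real site.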


2009-09-21 15:36

Comment 1, mypotatolove, 2012-07-23

I'd like to use HttpClient to crawl posts from Weibo. Could you explain how it works and what the implementation involves?
