Using the IE9 Developer Tools to Fetch the HTML of a Dynamic Web Page

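I have recently been working on a project that needs to extract data from web pages. Static pages are easy: requesting the page's URL returns the HTML. Some pages, however, are generated dynamically. When you page through results, for example, the URL in the address bar never changes, so getting the content of such pages takes more work. Below I use the paging action on https://honors.libraries.psu.edu/browse/author/all/ as an example to show how to obtain the HTML of a dynamic page.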


1. Open the site in IE9: https://honors.libraries.psu.edu/browse/author/all/



2. Press F12 to open the developer tools



In the developer tools, click "Network" --> "Start capturing", then click the "next page" link on the web page.


3. Inspect the captured request


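Click "Go to detailed view". The detailed view lists the request headers, cookies, and request body that step 4 reproduces in code.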


4. Bind the parameters to a C# HttpWebRequest object


        // Requires: using System.IO; using System.Net; using System.Text;

        ///<summary>
        ///Sends a POST request over HTTPS and returns the response body.
        ///</summary>
        ///<param name="URL">URL to request</param>
        ///<param name="strPostdata">Form data to send in the request body</param>
        ///<param name="encoding">Encoding used for the request body and the response page</param>
        ///<returns>The HTML returned by the server</returns>
        public string OpenReadWithHttps(string URL, string strPostdata, Encoding encoding)
        {
            // Cookies copied from the captured request. The values belong to that
            // browser session, so replace them with the ones from your own capture.
            CookieContainer cc = new CookieContainer();
            cc.Add(new Cookie("csrftoken", "04696113ff3ee3e8220dd9044921e100", "/browse/author/all/", "honors.libraries.psu.edu"));
            cc.Add(new Cookie("__utma", "148028590.1404245236.1416720957.1416734716.1416748914.3", "/browse/author/all/", "honors.libraries.psu.edu"));
            cc.Add(new Cookie("__utmz", "148028590.1416720957.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)", "/browse/author/all/", "honors.libraries.psu.edu"));
            cc.Add(new Cookie("__utmb", "148028590.2.10.1416748914", "/browse/author/all/", "honors.libraries.psu.edu"));
            cc.Add(new Cookie("__utmc", "148028590", "/browse/author/all/", "honors.libraries.psu.edu"));

            // Request headers copied from the detailed view.
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
            request.CookieContainer = cc;
            request.Method = "POST";
            request.Accept = "text/html, application/xhtml+xml, */*";
            request.ContentType = "application/x-www-form-urlencoded";
            request.Referer = "https://honors.libraries.psu.edu/browse/author/all/";
            request.KeepAlive = true;
            request.UserAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
            request.Host = "honors.libraries.psu.edu";
            request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US");
            // Sends "Accept-Encoding: gzip, deflate" and decompresses the response
            // automatically. Adding the header by hand would leave the body compressed.
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
            request.Headers.Add(HttpRequestHeader.CacheControl, "no-cache");

            byte[] buffer = encoding.GetBytes(strPostdata);
            request.ContentLength = buffer.Length;

            using (Stream writer = request.GetRequestStream()) // get the request stream
            {
                writer.Write(buffer, 0, buffer.Length); // write the form data to the stream
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
            {
                return reader.ReadToEnd();
            }
        }


Parameters:

URL: the address to request; strPostdata: the data sent by POST; encoding: the page encoding
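
Instead of adding each cookie by hand, the Cookie header string shown in the detailed view could be split into name/value pairs. The following is only a minimal sketch and not part of the method above: CookieHelper and FromHeader are hypothetical names, and it assumes the header is a plain "name=value; name=value" list and that a path of "/" is acceptable for every cookie.

```csharp
using System;
using System.Net;

public static class CookieHelper
{
    // Hypothetical helper: turns a captured Cookie header value
    // ("name=value; name=value") into a CookieContainer for one host.
    public static CookieContainer FromHeader(string header, string host)
    {
        CookieContainer container = new CookieContainer();
        string[] parts = header.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            string pair = part.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0)
            {
                continue; // skip fragments that have no cookie name
            }
            // Split on the first '=' only, because values such as __utmz contain '='.
            string name = pair.Substring(0, eq);
            string value = pair.Substring(eq + 1);
            // Assumption: path "/" so the cookie is sent for every URL on the host.
            container.Add(new Cookie(name, value, "/", host));
        }
        return container;
    }
}
```

The returned container could then be assigned to request.CookieContainer in place of the individual cc.Add calls.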

5. Call the method

        private void button2_Click(object sender, EventArgs e)
        {
            string url = "https://honors.libraries.psu.edu/browse/author/all/";
            // The csrfmiddlewaretoken value matches the csrftoken cookie set in OpenReadWithHttps.
            string strPostData = "csrfmiddlewaretoken=04696113ff3ee3e8220dd9044921e100&browse_start=all&browse_type=author&page=9&display=50&num_display_items=50";

            textBox1.Text = OpenReadWithHttps(url, strPostData, Encoding.UTF8);
        }
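
To fetch other pages, only the page field of the POST data should need to change. Below is a minimal sketch of building that string for any page number. PostDataBuilder.Build is a hypothetical helper; the field names come from the captured request above, and it assumes that display and num_display_items always carry the same page size, as they do in that capture.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PostDataBuilder
{
    // Hypothetical helper: builds the form-encoded POST body for one result page.
    public static string Build(string csrfToken, int page, int pageSize)
    {
        string size = pageSize.ToString(CultureInfo.InvariantCulture);
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("csrfmiddlewaretoken", csrfToken),
            new KeyValuePair<string, string>("browse_start", "all"),
            new KeyValuePair<string, string>("browse_type", "author"),
            new KeyValuePair<string, string>("page", page.ToString(CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("display", size),
            // Assumption: num_display_items mirrors display, as in the captured request.
            new KeyValuePair<string, string>("num_display_items", size)
        };

        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // Percent-encode names and values so reserved characters survive the POST.
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts.ToArray());
    }
}
```

Calling PostDataBuilder.Build("04696113ff3ee3e8220dd9044921e100", 9, 50) produces the same string as the hard-coded strPostData above, so the result can be passed to OpenReadWithHttps in a loop over page numbers.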

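To summarize: capture the page request with the IE9 developer tools, read off the request parameters, then bind those parameters to an HttpWebRequest object and send the request.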

