Bot-Factory, posted April 13, 2014

Hello Ubotters,

you have probably seen navigation links like: 1 2 3 ... 106 107 108 at the bottom of a page. I need to get all of those URLs in one go, but they are not all available on that page:

<ul class="pd-nav pagination">
<li class="grey previous">Back</li>
<li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>
<li class="selected">1</li>
<li><a href="/de/directories/people/a/ab/2.html">2</a></li>
<li><a href="/de/directories/people/a/ab/3.html">3</a></li>
<li><span class="grey">...</span></li>
<li><a href="/de/directories/people/a/ab/236.html">236</a></li>
<li><a href="/de/directories/people/a/ab/237.html">237</a></li>
<li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>

So my idea was to get the lowest number and the highest number, and then run a loop to create all the URLs on my own. I have this working already, but it requires JavaScript (the eval function) to do it.

Question: Is there a smarter and faster way to do that, without JavaScript and without additional browser / HTTP GET requests?
Working code:

set(#testhtml, "<ul class=\"pd-nav pagination\">
<li class=\"grey previous\">Back</li>
<li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>
<li class=\"selected\">1</li>
<li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
<li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
<li><span class=\"grey\">...</span></li>
<li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
<li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
<li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
comment("Collect every href in the pagination list, then pull out the page numbers")
set(#zzzz, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
set(#z2, $find regular expression(#zzzz, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
set(#z3, $replace(#z2, $new line, ","), "Global")
comment("$eval needs a loaded page in the browser")
navigate("http://www.google.com", "Wait")
set(#zMIN, $eval("var b=Math.min({#z3}); b"), "Global")
set(#zMAX, $eval("var b=Math.max({#z3}); b"), "Global")
comment("Loop MAX - MIN + 1 times so the last page is included")
loop($add($subtract(#zMAX, #zMIN), 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#zMIN}.html{$new line}", "Global")
    increment(#zMIN)
}
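For readers who want to check the min/max logic outside UBot, here is a minimal Python sketch of the same idea. It is not UBot script: the names `page_range`, `build_urls` and the `template` parameter are made up for illustration, and the href list is simply copied from the snippet in the post. It uses the same regular expression as the post and shows why the loop has to run `max - min + 1` times to include the last page.

```python
import re

# Hrefs as they appear in the pagination snippet from the post
# (the "Next" link repeats page 2).
hrefs = [
    "/de/directories/people/a/ab/2.html",
    "/de/directories/people/a/ab/2.html",
    "/de/directories/people/a/ab/3.html",
    "/de/directories/people/a/ab/236.html",
    "/de/directories/people/a/ab/237.html",
    "/de/directories/people/a/ab/238.html",
]

# Same pattern as in the post: digits between a "/" and ".html".
PAGE_RE = re.compile(r"(?<=/)[0-9]+(?=\.html)")


def page_range(links):
    """Return (lowest, highest) page number found in the links."""
    pages = [int(m.group()) for link in links for m in PAGE_RE.finditer(link)]
    return min(pages), max(pages)


def build_urls(links, template="/de/directories/people/a/ab/{page}.html"):
    """Build one URL per page from lowest to highest, both ends included."""
    low, high = page_range(links)
    # range() excludes its upper bound, hence high + 1.
    return [template.format(page=n) for n in range(low, high + 1)]


urls = build_urls(hrefs)
print(len(urls), urls[0], urls[-1])
```

No JavaScript engine or browser is involved here; min and max are computed directly on the extracted numbers.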
Aymen, posted April 13, 2014

You don't have to scrape all of them, just get the last one. Usually they all have the same URL structure, so all you have to do is change the page number:

/de/directories/people/a/ab/{#counter}.html
Bot-Factory (Author), posted April 13, 2014

Thanks Aymen,

I have now found a smarter way to do it, via xpath parser index. I put the first result into a list, count the list, subtract 1 and use that as the index. That gets me the last entry.

set(#testhtml, "<ul class=\"pd-nav pagination\">
<li class=\"grey previous\">Back</li>
<li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>
<li class=\"selected\">1</li>
<li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
<li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
<li><span class=\"grey\">...</span></li>
<li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
<li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
<li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#z1, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
add list to list(%tmp, $list from text(#z1, $new line), "Don\'t Delete", "Global")
comment("Index of the last link = list total - 1")
set(#z2, $plugin function("HTTP post.dll", "$xpath parser index", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", $subtract($list total(%tmp), 1), "href"), "Global")
set(#z3, $find regular expression(#z2, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
comment("Start at page 2 and loop last page - 1 times, so the loop stops at the last page")
set(#loopcounter, 2, "Global")
loop($subtract(#z3, 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#loopcounter}.html{$new line}", "Global")
    increment(#loopcounter)
}

No browser and no JavaScript, YEAH!!! I think that's it!

THANKS A LOT
Dan
Bot-Factory (Author), posted April 13, 2014

OK, even simpler: take the last list item directly instead of calling the xpath parser a second time.

set(#testhtml, "<ul class=\"pd-nav pagination\">
<li class=\"grey previous\">Back</li>
<li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>
<li class=\"selected\">1</li>
<li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
<li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
<li><span class=\"grey\">...</span></li>
<li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
<li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
<li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#z1, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
add list to list(%tmp, $list from text(#z1, $new line), "Don\'t Delete", "Global")
comment("The last list item holds the highest page number")
set(#z2, $list item(%tmp, $subtract($list total(%tmp), 1)), "Global")
set(#z3, $find regular expression(#z2, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
comment("Start at page 2 and loop last page - 1 times, so the loop stops at the last page")
set(#loopcounter, 2, "Global")
loop($subtract(#z3, 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#loopcounter}.html{$new line}", "Global")
    increment(#loopcounter)
}
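The "take the last link in the list" approach can be sketched in Python as well, using only the standard library's `html.parser`. This is an illustration under stated assumptions, not UBot script: the class `PaginationLinks` and the helpers `last_page` and `remaining_urls` are hypothetical names, and the sketch assumes, as the post does, that the last link inside the pagination list always points at the highest page.

```python
import re
from html.parser import HTMLParser

HTML = """
<ul class="pd-nav pagination">
<li class="grey previous">Back</li>
<li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>
<li class="selected">1</li>
<li><a href="/de/directories/people/a/ab/2.html">2</a></li>
<li><a href="/de/directories/people/a/ab/3.html">3</a></li>
<li><span class="grey">...</span></li>
<li><a href="/de/directories/people/a/ab/236.html">236</a></li>
<li><a href="/de/directories/people/a/ab/237.html">237</a></li>
<li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>
"""


class PaginationLinks(HTMLParser):
    """Collect href values of <a> tags inside <ul class="pd-nav pagination">."""

    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("class") == "pd-nav pagination":
            self.in_nav = True
        elif tag == "a" and self.in_nav and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_nav = False


def last_page(html):
    """Return the page number of the last link in the pagination list."""
    parser = PaginationLinks()
    parser.feed(html)
    match = re.search(r"/([0-9]+)\.html$", parser.hrefs[-1])
    return int(match.group(1))


def remaining_urls(html):
    """Build URLs for pages 2..last; page 1 is the page already loaded."""
    return [
        f"/de/directories/people/a/ab/{n}.html"
        for n in range(2, last_page(html) + 1)
    ]


print(last_page(HTML), len(remaining_urls(HTML)))
```

Starting at page 2 and ending at the last page gives one URL fewer than the last page number, which is why the loop count is the last page minus one.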
LazyBotter, posted April 14, 2014

Had a little mess around with it... How about this?

set(#testhtml, "<ul class=\"pd-nav pagination\">
<li class=\"grey previous\">Back</li>
<li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>
<li class=\"selected\">1</li>
<li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
<li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
<li><span class=\"grey\">...</span></li>
<li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
<li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
<li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
comment("Match only the page number of the link directly before the closing </ul>")
set(#Pos, $find regular expression(#testhtml, "(?<=ab\\/)[0-9]+(?=\\.html\">[^<]*<\\/a><\\/li>\\s*<\\/ul>)"), "Global")
comment("Count down from the last page to page 2")
loop($subtract(#Pos, 1)) {
    load html("/de/directories/people/a/ab/{#Pos}.html")
    decrement(#Pos)
}

Carl :-)
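The single-regex idea, anchoring on the link that sits directly before the closing `</ul>`, can be tried out in Python like this. It is a sketch only: `LAST_PAGE_RE`, `last_page` and `pages_descending` are illustrative names, and the pattern is a variant that tolerates any amount of whitespace before `</ul>`, not necessarily the exact expression used in the post.

```python
import re

HTML = """
<ul class="pd-nav pagination">
<li class="grey previous">Back</li>
<li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>
<li class="selected">1</li>
<li><a href="/de/directories/people/a/ab/2.html">2</a></li>
<li><a href="/de/directories/people/a/ab/3.html">3</a></li>
<li><span class="grey">...</span></li>
<li><a href="/de/directories/people/a/ab/236.html">236</a></li>
<li><a href="/de/directories/people/a/ab/237.html">237</a></li>
<li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>
"""

# Digits before ".html", but only for the link whose </li> is followed
# (after optional whitespace) by the closing </ul>.
LAST_PAGE_RE = re.compile(r'/([0-9]+)\.html">[^<]*</a>\s*</li>\s*</ul>')


def last_page(html):
    """Return the last page number, or None if the pattern is not found."""
    match = LAST_PAGE_RE.search(html)
    return int(match.group(1)) if match else None


def pages_descending(html):
    """Count down from the last page to page 2, like the loop in the post."""
    last = last_page(html)
    return list(range(last, 1, -1)) if last else []


print(last_page(HTML), pages_descending(HTML)[:3])
```

Using `[^<]*` for the link text instead of `.*` keeps the match on a single link, so the result is the same whether the HTML arrives on one line or on several.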