UBot Underground

Scrape Links from Navigation - 1 2 3 ... 106 107 108


Hello Ubotters,

You have probably seen navigation links like 1 2 3 ... 106 107 108 at the bottom of a page.

 

Now I need to get all those URLs in one go, but not all of them are present in the HTML of that page:

 

 <ul class="pd-nav pagination">
    <li class="grey previous">Back</li>
    <li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>

       <li class="selected">1</li>
       <li><a href="/de/directories/people/a/ab/2.html">2</a></li>
       <li><a href="/de/directories/people/a/ab/3.html">3</a></li>
       <li><span class="grey">...</span></li>
       <li><a href="/de/directories/people/a/ab/236.html">236</a></li>
       <li><a href="/de/directories/people/a/ab/237.html">237</a></li>
       <li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>
 
 
So my idea was to get the lowest and the highest page number and then run a loop to build all the URLs myself.

I have this working already, but it requires JavaScript (the $eval function).

Question:
Is there a smarter and faster way to do this, without JavaScript and without any extra browser or HTTP GET work?
 
 
 
 
Working Code:
set(#testhtml, " <ul class=\"pd-nav pagination\">
    <li class=\"grey previous\">Back</li>
    <li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>

       <li class=\"selected\">1</li>
       <li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
       <li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
       <li><span class=\"grey\">...</span></li>
       <li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
       <li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
       <li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#zzzz, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
set(#z2, $find regular expression(#zzzz, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
set(#z3, $replace(#z2, $new line, ","), "Global")
navigate("http://www.google.com", "Wait")
set(#zMIN, $eval("var b=Math.min({#z3})
b"), "Global")
set(#zMAX, $eval("var b=Math.max({#z3})
b"), "Global")
comment("Loop MAX - MIN + 1 times so that the last page (MAX) is included.")
loop($add($subtract(#zMAX, #zMIN), 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#zMIN}.html
", "Global")
    increment(#zMIN)
}
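
For readers who want to check the min/max idea outside UBot, here is a minimal Python sketch of the same steps: collect the hrefs, pull the page number out of each one, take the lowest and highest, and build every URL in between. It is only an illustration, not UBot code. The names `HrefCollector`, `page_numbers` and `PAGINATION_HTML` are made up for this example, and the sketch collects every link in the snippet instead of filtering by the `pd-nav pagination` class.

```python
from html.parser import HTMLParser
import re

# Sample markup copied from the post above.
PAGINATION_HTML = """<ul class="pd-nav pagination">
    <li class="grey previous">Back</li>
    <li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>
    <li class="selected">1</li>
    <li><a href="/de/directories/people/a/ab/2.html">2</a></li>
    <li><a href="/de/directories/people/a/ab/3.html">3</a></li>
    <li><span class="grey">...</span></li>
    <li><a href="/de/directories/people/a/ab/236.html">236</a></li>
    <li><a href="/de/directories/people/a/ab/237.html">237</a></li>
    <li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>"""


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def page_numbers(html):
    """Return the page number found in each link, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    numbers = []
    for href in collector.hrefs:
        # Same pattern as in the post: digits between "/" and ".html".
        match = re.search(r"(?<=/)[0-9]+(?=\.html)", href)
        if match:
            numbers.append(int(match.group()))
    return numbers


numbers = page_numbers(PAGINATION_HTML)
lowest, highest = min(numbers), max(numbers)

# range() excludes its end value, so highest + 1 keeps the last page.
urls = [f"/de/directories/people/a/ab/{n}.html" for n in range(lowest, highest + 1)]
```

Covering both ends of the range takes highest - lowest + 1 steps, which is an easy place to be off by one.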
 
 
 
 

You don't have to scrape all of them, just get the last one.

They usually all share the same URL structure, so all you have to do is change the page number:

/de/directories/people/a/ab/{#counter}.html
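
As a rough illustration of that suggestion in Python (not UBot code), the sketch below takes only the last link, splits it into prefix, page number and suffix, and fills in the counter. The function name `urls_from_last_link` and the default `first_page=2` are assumptions made for this example; 2 is used because page 1 is the page already being viewed.

```python
import re


def urls_from_last_link(last_href, first_page=2):
    """Build every page URL from first_page up to the page in last_href."""
    # Split ".../238.html" into (".../", "238", ".html").
    match = re.search(r"^(.*/)(\d+)(\.html)$", last_href)
    if match is None:
        raise ValueError(f"no page number found in {last_href!r}")
    prefix, last_page, suffix = match.group(1), int(match.group(2)), match.group(3)
    # The end of range() is exclusive, so add 1 to include the last page.
    return [f"{prefix}{page}{suffix}" for page in range(first_page, last_page + 1)]


urls = urls_from_last_link("/de/directories/people/a/ab/238.html")
```

Because the prefix comes from the link itself, nothing about the directory path needs to be hardcoded.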


 

Thanks Aymen,

 

I have now found a smarter way to do it, via the xpath parser index function.

I put the first result into a list, count the list, subtract 1, and use that as the index. That gets me the last entry.

 

 

set(#testhtml, " <ul class=\"pd-nav pagination\">
    <li class=\"grey previous\">Back</li>
    <li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>

       <li class=\"selected\">1</li>
       <li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
       <li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
       <li><span class=\"grey\">...</span></li>
       <li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
       <li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
       <li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#z1, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
add list to list(%tmp, $list from text(#z1, $new line), "Don\'t Delete", "Global")
set(#z2, $plugin function("HTTP post.dll", "$xpath parser index", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", $subtract($list total(%tmp), 1), "href"), "Global")
set(#z3, $find regular expression(#z2, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
set(#loopcounter, 2, "Global")
comment("The counter starts at 2, so last page - 1 passes end exactly on the last page.")
loop($subtract(#z3, 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#loopcounter}.html
", "Global")
    increment(#loopcounter)
}

 

 

No browser and no JavaScript, yeah!

I think that's it!

THANKS A LOT

Dan
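
The "count the list, subtract 1" step relies on zero-based indexing: with N items, the last one sits at index N - 1. A tiny Python illustration (not UBot code; the `hrefs` list is typed in by hand from the sample HTML, including the duplicate "Next" link):

```python
# The six hrefs the XPath query returns for the sample HTML,
# in document order. The first entry is the "Next" link.
hrefs = [
    "/de/directories/people/a/ab/2.html",
    "/de/directories/people/a/ab/2.html",
    "/de/directories/people/a/ab/3.html",
    "/de/directories/people/a/ab/236.html",
    "/de/directories/people/a/ab/237.html",
    "/de/directories/people/a/ab/238.html",
]

# Zero-based indexing: six items occupy indexes 0..5.
last_index = len(hrefs) - 1
last_href = hrefs[last_index]
```

The duplicate "Next" entry does not matter here, since only the position of the final link is used.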


OK, even simpler: take the last item straight from the list instead of running the XPath query a second time.

 

set(#testhtml, " <ul class=\"pd-nav pagination\">
    <li class=\"grey previous\">Back</li>
    <li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>

       <li class=\"selected\">1</li>
       <li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
       <li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
       <li><span class=\"grey\">...</span></li>
       <li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
       <li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
       <li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#z1, $plugin function("HTTP post.dll", "$xpath parser", #testhtml, "//ul[@class=\'pd-nav pagination\']/li/a", "href", "HTML"), "Global")
add list to list(%tmp, $list from text(#z1, $new line), "Don\'t Delete", "Global")
set(#z2, $list item(%tmp, $subtract($list total(%tmp), 1)), "Global")
set(#z3, $find regular expression(#z2, "(?<=\\/)[0-9]+(?=\\.html)"), "Global")
set(#loopcounter, 2, "Global")
comment("The counter starts at 2, so last page - 1 passes end exactly on the last page.")
loop($subtract(#z3, 1)) {
    set(#zzURLS, "{#zzURLS}/de/directories/people/a/ab/{#loopcounter}.html
", "Global")
    increment(#loopcounter)
}
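
One detail worth double-checking in any counter loop like this is how many passes it makes. The Python sketch below simulates a loop that starts its counter at 2, emits the counter and then increments it. It assumes a loop that runs its body exactly the given number of times, and `simulate_loop` is a made-up name for this example.

```python
def simulate_loop(iterations, start=2):
    """Emit the counter on each pass, then increment it."""
    emitted = []
    counter = start
    for _ in range(iterations):
        emitted.append(counter)
        counter += 1
    return emitted


last_page = 238

# Running last_page passes overshoots by one page.
too_many = simulate_loop(last_page)

# Starting at 2, it takes last_page - 1 passes to end on last_page.
just_right = simulate_loop(last_page - 1)
```

With the counter starting at 2, `too_many` ends at page 239, which does not exist, while `just_right` ends at 238.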


Had a little mess... 

 

How about this?

set(#testhtml, " <ul class=\"pd-nav pagination\">
    <li class=\"grey previous\">Back</li>
    <li class=\"next\"><a href=\"/de/directories/people/a/ab/2.html\">Next</a></li>

       <li class=\"selected\">1</li>
       <li><a href=\"/de/directories/people/a/ab/2.html\">2</a></li>
       <li><a href=\"/de/directories/people/a/ab/3.html\">3</a></li>
       <li><span class=\"grey\">...</span></li>
       <li><a href=\"/de/directories/people/a/ab/236.html\">236</a></li>
       <li><a href=\"/de/directories/people/a/ab/237.html\">237</a></li>
       <li><a href=\"/de/directories/people/a/ab/238.html\">238</a></li>
</ul>", "Global")
set(#Pos, $find regular expression(#testhtml, "(?<=ab\\/).*?(?=\\.html\"\\>.*<\\/a><\\/li>\\s<\\/ul>)"), "Global")
comment("Count down from the last page to page 2, which takes last page - 1 passes.")
loop($subtract(#Pos, 1)) {
    load html("/de/directories/people/a/ab/{#Pos}.html")
    decrement(#Pos)
}

Carl :-)
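
The regex-only idea can be tried out in Python as well. This is a sketch, not UBot code: it anchors a single pattern to the closing `</ul>` so that only the final link matches, then counts down to page 2. The names `LAST_PAGE_PATTERN` and `last_page_number` are invented for the example, and the pattern is a simplified variant of the one in the post, not an exact port.

```python
import re

# Sample markup copied from the thread.
PAGINATION_HTML = """<ul class="pd-nav pagination">
    <li class="grey previous">Back</li>
    <li class="next"><a href="/de/directories/people/a/ab/2.html">Next</a></li>
    <li class="selected">1</li>
    <li><a href="/de/directories/people/a/ab/2.html">2</a></li>
    <li><a href="/de/directories/people/a/ab/3.html">3</a></li>
    <li><span class="grey">...</span></li>
    <li><a href="/de/directories/people/a/ab/236.html">236</a></li>
    <li><a href="/de/directories/people/a/ab/237.html">237</a></li>
    <li><a href="/de/directories/people/a/ab/238.html">238</a></li>
</ul>"""

# Only the last link is followed directly by </a></li> and then </ul>,
# so the captured digits are the highest page number.
LAST_PAGE_PATTERN = re.compile(r'/(\d+)\.html">[^<]*</a></li>\s*</ul>')


def last_page_number(html):
    """Return the page number of the final pagination link."""
    match = LAST_PAGE_PATTERN.search(html)
    if match is None:
        raise ValueError("no pagination link found before </ul>")
    return int(match.group(1))


last_page = last_page_number(PAGINATION_HTML)

# Count down from the last page to page 2 (page 1 is the current page).
pages_descending = list(range(last_page, 1, -1))
```

Using `\s*` instead of a single whitespace character keeps the pattern working if the markup has different line endings or indentation before `</ul>`.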
