UBot Underground

Get Href Only For Main Url



Hi, I need some help.

I am trying to set up a script to extract emails from websites.

So, my first step is to get all the internal links for a specific website:

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
add list to list(%second url,$scrape attribute(<href=r"">,"fullhref"),"Delete","Global")

But with this code I get all links, including links for ads, banners and so on, and I need only the internal links.

 

Can someone give me an idea how to get only the internal links?


try something like this

navigate("http://www.cnn.com/", "Wait")
wait for element(<data-analytics="edition-picker-logo">, 15, "Appear")
comment("just to show url func")
set(#host name, $url, "Global")
set(#urls, $find regular expression($document text, "{$url}.*"), "Global")
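One thing to watch with that pattern: .* is greedy, so each match will run from the url to the end of that line in the page source and can drag in the closing quote and whatever attributes follow the href. It is probably worth tightening the pattern (for example, stopping at the closing quote) before using the results.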

CD



If the websites have the full href then your code could be used, but if they don't have the full href, only something like /index.php, then it will not work.


Well, not giving an example site leaves the answer wide open.

 

I suggest learning some xpath over at w3schools - xpath is awesome.

 

get to know regex as well
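For example, sticking with the $find regular expression approach from the reply above, something along these lines can pull every link in the page source that starts with the site's own address. The #internal urls variable and the character class are just my guesses at typical url characters, so treat it as a rough sketch:

ui text box("URL", #main url)
navigate(#main url, "Wait")
wait(5)
comment("match anything that begins with the current host and looks like a url")
set(#host name, $url, "Global")
set(#internal urls, $find regular expression($document text, "{#host name}[a-zA-Z0-9/._?=&%~-]*"), "Global")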

 

give a specific site and we can give a specific answer

 

using "Large Data" plugin you can use both

 

Or you can use the "HTTP Post" plugin.

 

or Python

 

CD


Think out of the box  :P

 

There are multiple ways of doing that:

 

1. Scrape element with full href and filter the list later via a loop
2. Extract the urls with xpath or regex, and maybe add the correct url in front of it if the site doesn't show the fullhref (a second sketch for this follows the example below).

Sometimes things can't be done in one go. Just scrape what you can get, and process the data later.

here's a quick example:

 

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
clear list(%second url)
clear list(%newurllist)
add list to list(%second url,$scrape attribute(<href=r"">,"fullhref"),"Delete","Global")
set(#counter,0,"Global")
loop($list total(%second url)) {
    if($contains($list item(%second url,#counter),#main url)) {
        then {
            add item to list(%newurllist,$list item(%second url,#counter),"Delete","Global")
        }
    }
    increment(#counter)
}
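
And for option 2 - pages that only give relative hrefs like /index.php - a rough sketch along the same lines. Scraping the plain "href" attribute and the nested if/then/else are assumptions on my part, so check them against the actual site:

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
clear list(%second url)
clear list(%newurllist)
comment("scrape the raw href attribute, which may still contain relative paths")
add list to list(%second url,$scrape attribute(<href=r"">,"href"),"Delete","Global")
set(#counter,0,"Global")
loop($list total(%second url)) {
    if($contains($list item(%second url,#counter),"http")) {
        then {
            comment("already a full link - keep it only if it belongs to the main site")
            if($contains($list item(%second url,#counter),#main url)) {
                then {
                    add item to list(%newurllist,$list item(%second url,#counter),"Delete","Global")
                }
            }
        }
        else {
            comment("relative link such as /index.php - put the main url in front of it")
            set(#relative link,$list item(%second url,#counter),"Global")
            add item to list(%newurllist,"{#main url}{#relative link}","Delete","Global")
        }
    }
    increment(#counter)
}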

Of course there might be even quicker / smarter ways of doing it. But you probably get the idea.

 

Cheers

Dan



Thank you for your time.

Your example helped me a lot.


This is how I do it:

 

    navigate("http://www.url.com""Wait")
    wait for browser event("Page Loaded""")
    wait($rand(10, 20))
    set(#url$url"Local")
    clear list(%scrapedurlsmain)
    add item to list(%scrapedurlsmain$scrape attribute(<href=w"{$url}*">"fullhref"), "Delete""Global")
    set list position(%scrapedurlsmain, 0)
    save to file("C:\\Work\\scrapeurlsmain.txt"%scrapedurlsmain)
    add list to list(%scrapedurlsmaingo$list from text($read file("C:\\Work\\scrapeurlsmain.txt"), $new line), "Delete""Global")
    navigate($random list item(%scrapedurlsmaingo), "Wait")
    wait for browser event("Page Loaded""")
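
The filtering here is done by the wildcard match in <href=w"{$url}*">, which only scrapes hrefs that start with the current page's url - so external links, ads and banners should never make it into the list in the first place.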

