UBot Underground

Get Href Only For Main Url



Hi, I need some help.

I am trying to set up a script to extract emails from websites.

So, my first step is to get all the internal links for a specific website:

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
add list to list(%second url,$scrape attribute(<href=r"">,"fullhref"),"Delete","Global")

But with this code I get all links, including links for ads, banners and so on, and I need only the internal links.

 

Can someone give me an idea how to get only the internal links?


try something like this

navigate("http://www.cnn.com/", "Wait")
wait for element(<data-analytics="edition-picker-logo">, 15, "Appear")
comment("just to show url func")
set(#host name, $url, "Global")
set(#urls, $find regular expression($document text, "{$url}.*"), "Global")
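One thing to watch with that pattern: .* is greedy, so each match will run from the url to the end of that line in the page source and can drag in the closing quote and whatever attributes follow the href. It is probably worth tightening the pattern (for example, stopping at the closing quote) before using the results.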

CD



If the websites have the full href then your code could be used, but if they don't have the full href, only something like /index.php, then it will not work.


Well, not giving an example site leaves the answer wide open.

 

I suggest learning some xpath over at w3schools - xpath is awesome.

 

get to know regex as well
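For example, sticking with the $find regular expression approach from the reply above, something along these lines can pull every link in the page source that starts with the site's own address. The #internal urls variable and the character class are just my guesses at typical url characters, so treat it as a rough sketch:

ui text box("URL", #main url)
navigate(#main url, "Wait")
wait(5)
comment("match anything that begins with the current host and looks like a url")
set(#host name, $url, "Global")
set(#internal urls, $find regular expression($document text, "{#host name}[a-zA-Z0-9/._?=&%~-]*"), "Global")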

 

give a specific site and we can give a specific answer

 

using "Large Data" plugin you can use both

 

Or you can use the "HTTP Post" plugin.

 

or Python

 

CD


Think out of the box  :P

 

There are multiple ways of doing that:

 

1. Scrape element with full href and filter the list later via a loop
2. Extract the urls with xpath or regex, and maybe add the correct url in front of it if the site doesn't show the fullhref (a second sketch for this follows the example below).

Sometimes things can't be done in one go. Just scrape what you can get, and process the data later.

here's a quick example:

 

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
clear list(%second url)
clear list(%newurllist)
add list to list(%second url,$scrape attribute(<href=r"">,"fullhref"),"Delete","Global")
set(#counter,0,"Global")
loop($list total(%second url)) {
    if($contains($list item(%second url,#counter),#main url)) {
        then {
            add item to list(%newurllist,$list item(%second url,#counter),"Delete","Global")
        }
    }
    increment(#counter)
}
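
And for option 2 - pages that only give relative hrefs like /index.php - a rough sketch along the same lines. Scraping the plain "href" attribute and the nested if/then/else are assumptions on my part, so check them against the actual site:

ui text box("URL",#main url)
navigate(#main url,"Wait")
wait(5)
clear list(%second url)
clear list(%newurllist)
comment("scrape the raw href attribute, which may still contain relative paths")
add list to list(%second url,$scrape attribute(<href=r"">,"href"),"Delete","Global")
set(#counter,0,"Global")
loop($list total(%second url)) {
    if($contains($list item(%second url,#counter),"http")) {
        then {
            comment("already a full link - keep it only if it belongs to the main site")
            if($contains($list item(%second url,#counter),#main url)) {
                then {
                    add item to list(%newurllist,$list item(%second url,#counter),"Delete","Global")
                }
            }
        }
        else {
            comment("relative link such as /index.php - put the main url in front of it")
            set(#relative link,$list item(%second url,#counter),"Global")
            add item to list(%newurllist,"{#main url}{#relative link}","Delete","Global")
        }
    }
    increment(#counter)
}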

Of course there might be even quicker / smarter ways of doing it. But you probably get the idea.

 

Cheers

Dan



Thank you for your time.

Your example helped me a lot.


This is how I do it:

 

    navigate("http://www.url.com""Wait")
    wait for browser event("Page Loaded""")
    wait($rand(10, 20))
    set(#url$url"Local")
    clear list(%scrapedurlsmain)
    add item to list(%scrapedurlsmain$scrape attribute(<href=w"{$url}*">"fullhref"), "Delete""Global")
    set list position(%scrapedurlsmain, 0)
    save to file("C:\\Work\\scrapeurlsmain.txt"%scrapedurlsmain)
    add list to list(%scrapedurlsmaingo$list from text($read file("C:\\Work\\scrapeurlsmain.txt"), $new line), "Delete""Global")
    navigate($random list item(%scrapedurlsmaingo), "Wait")
    wait for browser event("Page Loaded""")
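
The filtering here is done by the wildcard match in <href=w"{$url}*">, which only scrapes hrefs that start with the current page's url - so external links, ads and banners should never make it into the list in the first place.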

