allcapone1912, posted March 28, 2015:

Hi, I need some help. I am trying to set up a script to extract emails from websites, so the first part is to get all internal links for a given website:

ui text box("URL", #main url)
navigate(#main url, "Wait")
wait(5)
add list to list(%second url, $scrape attribute(<href=r"">, "fullhref"), "Delete", "Global")

With this code I get every link, including links to ads, banners and so on, but I only need the internal links. Can someone give me an idea how to get only the internal links?
Code Docta (Nick C.), posted March 28, 2015:

Try something like this:

navigate("http://www.cnn.com/", "Wait")
wait for element(<data-analytics="edition-picker-logo">, 15, "Appear")
comment("just to show url func")
set(#host name, $url, "Global")
set(#urls, $find regular expression($document text, "{$url}.*"), "Global")

CD
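If it helps: since $find regular expression grabs every match on the page, you can also feed it straight into a list and let the "Delete" option strip duplicate URLs. A minimal, untested sketch along the lines of Nick's snippet (the %internal urls name is just for illustration):

clear list(%internal urls)
comment("Delete removes duplicate matches, since the same URL usually appears many times in the page source")
add list to list(%internal urls, $find regular expression($document text, "{$url}.*"), "Delete", "Global")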
allcapone1912 (author), posted March 28, 2015:

If the sites use full hrefs then your code would work, but where they don't, where the href is only /index.php for example, it will not.
Code Docta (Nick C.), posted March 29, 2015:

Well, not giving an example site leaves your answer wide open. I suggest learning some XPath over at w3schools; XPath is awesome. Get to know regex as well. Give a specific site and we can give a specific answer. With the "Large Data" plugin you can use both, or else you can use the "HTTP Post" plugin or Python.

CD
Bot-Factory, posted March 29, 2015:

Think outside the box. There are multiple ways of doing that:

1. Scrape the elements with full href and filter the list later via a loop.
2. Extract the URLs with XPath or regex, and maybe add the correct URL in front of them if the site doesn't show the full href (see the sketch after this post).

Sometimes things can't be done in one go. Just scrape what you can get, and process the data later. Here's a quick example of the first approach:

ui text box("URL", #main url)
navigate(#main url, "Wait")
wait(5)
clear list(%second url)
clear list(%newurllist)
add list to list(%second url, $scrape attribute(<href=r"">, "fullhref"), "Delete", "Global")
set(#counter, 0, "Global")
loop($list total(%second url)) {
    if($contains($list item(%second url, #counter), #main url)) {
        then {
            add item to list(%newurllist, $list item(%second url, #counter), "Delete", "Global")
        }
    }
    increment(#counter)
}

Of course there might be even quicker / smarter ways of doing it, but you probably get the idea.

Cheers,
Dan
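As a rough sketch of Dan's second point: if a scraped href is relative (it contains no "http"), you can glue the main URL onto the front before filtering. Untested, and it reuses only the commands from the examples above; it assumes #main url has no trailing slash and that your UBot build supports else branches inside if:

clear list(%internal urls)
set(#counter, 0, "Global")
loop($list total(%second url)) {
    set(#this url, $list item(%second url, #counter), "Global")
    if($contains(#this url, "http")) {
        then {
            comment("already a full href, handled by the filter below")
        }
        else {
            comment("relative href such as /index.php, so prepend the main URL")
            comment("note: non-http schemes like mailto: would also get prefixed here")
            set(#this url, "{#main url}{#this url}", "Global")
        }
    }
    if($contains(#this url, #main url)) {
        then {
            add item to list(%internal urls, #this url, "Delete", "Global")
        }
    }
    increment(#counter)
}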
allcapone1912 (author), posted March 30, 2015:

Thank you for your time. Your example helped me a lot.
daddycaddy, posted April 11, 2015:

This is how I do it:

navigate("http://www.url.com", "Wait")
wait for browser event("Page Loaded", "")
wait($rand(10, 20))
set(#url, $url, "Local")
clear list(%scrapedurlsmain)
add item to list(%scrapedurlsmain, $scrape attribute(<href=w"{$url}*">, "fullhref"), "Delete", "Global")
set list position(%scrapedurlsmain, 0)
save to file("C:\\Work\\scrapeurlsmain.txt", %scrapedurlsmain)
add list to list(%scrapedurlsmaingo, $list from text($read file("C:\\Work\\scrapeurlsmain.txt"), $new line), "Delete", "Global")
navigate($random list item(%scrapedurlsmaingo), "Wait")
wait for browser event("Page Loaded", "")
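One caveat with the w"{$url}*" match: it only keeps links that start with the full URL of the page you are on, so from an inner page it can miss links to other sections of the site. A hypothetical tweak (untested; #site root and %root match are made-up names) is to match on the site root instead:

comment("pull the scheme and host out of the current URL, e.g. http://www.url.com")
clear list(%root match)
add list to list(%root match, $find regular expression($url, "https?://[^/]+"), "Delete", "Local")
set(#site root, $list item(%root match, 0), "Local")
comment("then keep every href that starts with the site root")
add list to list(%scrapedurlsmain, $scrape attribute(<href=w"{#site root}*">, "fullhref"), "Delete", "Global")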