Scrape only root urls on webpage?

Mambo · January 25, 2013

I'm trying to build a backlink scraper that scrapes results from a backlink-checking site. My issue is that I only want to scrape root domains from the results page: if the backlink points to a subfolder or to a page such as xyz.html, I don't want to save it.

Example:

www.domain.com/stuff/blah.html <- I don't want it

www.domain2.com <- I want it

domain3.com/hello/ <- I don't want it

domain4.com <- I want it

So, I need some sort of code that only grabs <a href> links which simply end with .com, .net or .org.

So, how would this be done? I'm thinking of grabbing all the a hrefs on the page and then cleaning them up with regex, or, if it's possible, grabbing them with regex from the beginning. I also need to clean out all URLs that link to the backlink-checking site itself.

I'm looking for some code for this since I'm not an expert with regex.

AutomationNinja · January 25, 2013

^(http(s)?://)?[^/]+
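As written, the pattern above matches the optional scheme plus the host at the start of any URL, so it trims a deep link down to its root rather than rejecting it. If the goal is to drop non-root links entirely and also skip links back to the checker site, something along these lines might work. This is only a rough Python sketch, not tested against any real backlink checker: the names `root_urls` and `own_host` are made up for illustration, and the .com/.net/.org list is taken from the question.

```python
import re
from html.parser import HTMLParser

# Optional http(s) scheme, then a host ending in .com/.net/.org,
# then at most one trailing slash. Anything with a path fails to match.
ROOT_URL = re.compile(
    r"^(?:https?://)?([^/\s?#:@]+\.(?:com|net|org))/?$", re.IGNORECASE
)


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def root_urls(html, own_host):
    """Return hrefs that are bare root domains and do not point at own_host.

    own_host is a hypothetical parameter: the domain of the backlink
    checker itself, e.g. "checker.com".
    """
    collector = HrefCollector()
    collector.feed(html)
    own_host = own_host.lower()
    kept = []
    for href in collector.hrefs:
        match = ROOT_URL.match(href)
        if not match:
            continue  # has a path, a query, or an unlisted TLD
        host = match.group(1).lower()
        if host == own_host or host.endswith("." + own_host):
            continue  # link back to the checker site (or a subdomain of it)
        kept.append(href)
    return kept
```

Call it with the page source and the checker's own domain, for example `root_urls(page_html, "checker.com")`. Extend the TLD group in `ROOT_URL` if other endings should count as root domains.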

Rob JH · July 20, 2014

Worked a treat thanks!
