UBot Underground

Scraping two attributes with regex / xpath



Hello.

 

I need to scrape two attributes from an HTML document:

name

href

 

I can get both with regex and xpath. That's not the problem.

 

 

The problem is that sometimes the href is empty or doesn't exist at all.

 

 

So let's say the HTML text has:

10x the name attribute

7x href with a URL

2x href with an empty string

1x no href attribute at all

 

 

When I now scrape the attributes into lists, the name list has 10 entries and the href one has 7.

 

But I need to know what belongs to what, i.e. which name attribute goes with which href. And if the href is empty or not there, I want to store "" in my list instead.
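
To make the mismatch concrete, here is a minimal sketch in Python (not UBot script). The sample markup and variable names are made up for illustration. It shows how scraping each attribute separately over the whole document produces lists of different lengths.

```python
import re

# Hypothetical sample: one element with a URL, one with an empty href,
# and one with no href attribute at all.
SAMPLE_HTML = """
<a name="alice" href="http://example.com/a">Alice</a>
<a name="bob" href="">Bob</a>
<a name="carol">Carol</a>
"""

# Each attribute is scraped on its own, across the whole document.
names = re.findall(r'name="([^"]*)"', SAMPLE_HTML)
hrefs = re.findall(r'href="([^"]+)"', SAMPLE_HTML)  # "+" skips empty values

print(names)  # ['alice', 'bob', 'carol']
print(hrefs)  # ['http://example.com/a']
# The lists differ in length, so position i in one list
# no longer corresponds to position i in the other.
```

Once the lists are out of step like this, there is no way to tell from the lists alone which name lost its href.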

 

 

How would you approach this?

 

Scraping the data in two steps maybe?

 

Getting the innerhtml of the parent element?

And then searching for name and href in the result?

So that I separate them from each other?

 

Or is there another way to do that?

 

 

Thanks in advance for your help

Dan


@Dan

 

How about this?

navigate("http://ubotsandbox.com/ubot-list-example-page-1.php", "Wait")
wait for browser event("Everything Loaded", "")
wait for element(<href="http://rickpowers.com/">, "", "Appear")
set(#var1, $scrape attribute(<id="MyExampleUsers">, "fullhref"), "Global")
if($contains(#var1, "ubotsandbox.com")) {
    then {
        set(#var1, $replace(#var1, "http://ubotsandbox.com/ubot-list-example-page-1.php", "No Link Found"), "Global")
    }
    else {
    }
}
set(#var2, $scrape attribute(<id="MyExampleUsers">, "innertext"), "Global")
clear list(%List1)
clear list(%List2)
add list to list(%List1, $list from text(#var1, "
"), "Don\'t Delete", "Global")
add list to list(%List2, $list from text(#var2, "
"), "Delete", "Global")
set(#var1, $nothing, "Global")
set(#var2, $nothing, "Global")


In my case the same element appears 30 times on that page.

So when I scrape the attribute, it will return 30 results. In my case I'm using xpath for that because I'm not using the browser at all.

 

After I have the 30 elements in a variable I add it to a list. 

So I now have all the elements in a list.

 

I then loop through that list and extract the name and href attributes. And if href returns $nothing, I replace it with a placeholder.
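
The loop described above could look roughly like the following Python sketch (again not UBot script). It assumes each list entry is the markup of one element as a string. `extract_pairs` and `PLACEHOLDER` are hypothetical names, and the regexes assume double-quoted attribute values.

```python
import re

PLACEHOLDER = ""  # stored when href is missing or empty

def extract_pairs(element_snippets):
    """Return one (name, href) tuple per element snippet, in input order."""
    pairs = []
    for snippet in element_snippets:
        name_match = re.search(r'\bname="([^"]*)"', snippet)
        href_match = re.search(r'\bhref="([^"]*)"', snippet)
        name = name_match.group(1) if name_match else PLACEHOLDER
        # A missing attribute and an empty one both become the placeholder.
        if href_match and href_match.group(1):
            href = href_match.group(1)
        else:
            href = PLACEHOLDER
        pairs.append((name, href))
    return pairs

# Hypothetical element list, one string per scraped element.
elements = [
    '<a name="alice" href="http://example.com/a">Alice</a>',
    '<a name="bob" href="">Bob</a>',
    '<a name="carol">Carol</a>',
]
pairs = extract_pairs(elements)
print(pairs)
```

Because both attributes are read from the same snippet in the same loop pass, every name keeps its own href (or the placeholder) and the result always has one entry per element.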

 

Watching some of the scraping videos on the training site now.

Good stuff by the way :-)

 

Dan


Dan, from the sound of your question (correct me if I'm wrong), you're scraping several different fields where there's a variable number of them on the page, and you need each line to pair up.

There are a few ways to do it, but the easiest and most hassle-free is to scrape the parent element and then do an inner scrape of the elements you need. Each inner scrape only sees one parent, so this ensures that everything matches.
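
As a rough illustration of the parent-then-inner idea, here is a Python sketch using the standard library's `xml.etree.ElementTree` and its limited XPath support. The markup, the `user` and `name` class names, and `scrape_rows` are assumptions made up for the example. Note that this parser needs well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample: each <li class="user"> is one parent.
SAMPLE = """
<ul id="users">
  <li class="user"><span class="name">Alice</span><a href="http://example.com/a">site</a></li>
  <li class="user"><span class="name">Bob</span><a href="">site</a></li>
  <li class="user"><span class="name">Carol</span></li>
</ul>
"""

def scrape_rows(markup):
    """Scrape each parent first, then look for the fields inside it."""
    root = ET.fromstring(markup.strip())
    rows = []
    for parent in root.findall(".//li[@class='user']"):
        # Inner scrape: these lookups only search within this one parent.
        name_el = parent.find("./span[@class='name']")
        link_el = parent.find("./a")
        name = name_el.text if name_el is not None and name_el.text else ""
        href = link_el.get("href", "") if link_el is not None else ""
        rows.append((name, href))
    return rows

rows = scrape_rows(SAMPLE)
print(rows)
```

A missing child simply yields "" for that row, so the number of rows always equals the number of parent elements.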



That's exactly what I want to do. And as you said the most reliable way is probably to do it in two steps. 

Scrape the parent element into a list and then run through that list to look for the elements I need. 

 

Dan
