UBot Underground

Can't figure out this 666 demon :(



I ran into a problem that is kicking me in the nuts...

 

I purchased Aymen's HTTP plugin today and decided to play around a bit by scraping Trulia listings.

 

First I scrape all the URLs from the number of result pages I want, then I make a GET request for each URL and store the HTML of that page in a variable.

 

Then I run my regexes against this variable to pull out the values I want.

 

It runs without errors, but when I open the table I tell it to save (.csv), the output is broken: at the start everything looks fine, and then toward the end it doesn't scrape anything, or only scrapes part of a value, I don't know.

You need the HTTP POST plugin from Aymen to run this.

 

To test it, enter a value of 10 in the text box labeled "Pages to scrape".

This code works when scraping the first page and all the properties in it, but if you run it on a larger scale it breaks. :(

clear table(&TABLE)
set(#page, 1, "Global")
set(#currentITEM, 0, "Global")
ui text box("Pages to scrape", #pagesTOscrape)
ui stat monitor("Current Page: ", #page)
ui stat monitor("Current Item: ", #currentITEM)
define Clear all lists {
    clear list(%urls)
    clear list(%addresses)
    clear list(%bedrooms)
    clear list(%bathrooms)
}
Clear all lists()
add item to list(%urls, "URLS", "Delete", "Global")
set(#citystate, "Los Angeles,CA", "Global")
set(#html, $plugin function("HTTP post.dll", "$http get", "http://www.trulia.com/for_sale/{#citystate}", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://trulia.com", "", 5), "Global")
add list to list(%urls, $find regular expression(#html, "(?<=<a itemprop=\"url\" data-row-index=.*href=\").*?(?=\")"), "Delete", "Global")
if($comparison(#pagesTOscrape, "=", 1)) {
    then {
    }
    else {
        loop($subtract(#pagesTOscrape, 1)) {
            increment(#page)
            set(#html, $plugin function("HTTP post.dll", "$http get", "http://www.trulia.com/for_sale/{#citystate}/{#page}_p", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://www.trulia.com/for_sale/{#citystate}/", "", ""), "Global")
            add list to list(%urls, $find regular expression(#html, "(?<=<a itemprop=\"url\" data-row-index=.*href=\").*?(?=\")"), "Delete", "Global")
        }
    }
}
add list to table as column(&TABLE, 0, 0, %urls)
set(#currentITEM, 0, "Global")
add item to list(%addresses, "Address", "Delete", "Global")
add item to list(%bedrooms, "Bedrooms", "Delete", "Global")
add item to list(%bathrooms, "Bathrooms", "Delete", "Global")
set list position(%urls, 1)
loop($subtract($list total(%urls), 1)) {
    set(#propertyHTML, $plugin function("HTTP post.dll", "$http get", "http://trulia.com{$next list item(%urls)}", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://www.trulia.com/for_sale/{#citystate}/", "", 5), "Global")
    add list to list(%addresses, $find regular expression(#propertyHTML, "(?<=<div id=\"center_row\">\\n.*\\n.*title=\").*?(?=,)"), "Don\'t Delete", "Global")
    add list to list(%bedrooms, $find regular expression(#propertyHTML, "(?<=<ul class=\"listInline typeEmphasize lhn mtn\">\\n.*<li>\\s+).*?(?=,)"), "Don\'t Delete", "Global")
    add list to list(%bathrooms, $find regular expression(#propertyHTML, "(?<=<ul class=\"listInline typeEmphasize lhn mtn\">\\n.*\\n.*<li>).*?(?=,)"), "Don\'t Delete", "Global")
    increment(#currentITEM)
}
add list to table as column(&TABLE, 0, 1, %addresses)
add list to table as column(&TABLE, 0, 2, %bedrooms)
add list to table as column(&TABLE, 0, 3, %bathrooms)
save to file("C:\\Users\\Joao\\Desktop\\ubot\\table.csv", &TABLE)

Here is my code.

 

 

Edit: I now think some of my regexes are wrong, maybe the bedrooms one, and that I should use scrape attribute or something instead. Will try.
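
One thing that may be worth checking in the script above: each "add list to list" call inside the property loop appends however many matches the regex finds on that page, which can be zero (or several). When that happens, %addresses, %bedrooms and %bathrooms stop lining up row by row with %urls, and the later rows of the table look empty or shifted. The sketch below is Python, not UBot, and the HTML fragments, the pattern and the helper name are all made up for illustration. It shows the effect and one possible workaround: store exactly one value per page, using a blank when nothing matches.

```python
import re

# Made-up stand-ins for three property pages; the real Trulia markup will differ.
pages = [
    "<li>3 beds</li><li>2 baths</li>",
    "<li>Studio</li>",                  # no "beds" text on this page
    "<li>4 beds</li><li>3 baths</li>",
]

# Hypothetical pattern, only for this demonstration.
BEDS = re.compile(r"<li>(\d+ beds)</li>")


def first_or_blank(pattern, html):
    """Return the first captured group, or '' when the pattern does not match."""
    match = pattern.search(html)
    return match.group(1) if match else ""


# Appending every match: a page with no match contributes nothing,
# so every later value slides up one row.
naive = []
for html in pages:
    naive.extend(BEDS.findall(html))

# Appending exactly one value per page keeps the column aligned with the URLs.
aligned = [first_or_blank(BEDS, html) for html in pages]

print(len(pages), len(naive), len(aligned))
print(naive)
print(aligned)
```

If the list totals of the four columns differ at the end of a run, this kind of drift is a likely suspect.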

Edited by KardoseR

Have you tried running the bot in parts and checking that the regex works correctly? If you post an HTML example of the page and what you want to scrape, I'll take a look.


http://www.trulia.com/for_sale/Los Angeles,CA is the page I want to scrape.

 

But I don't think there is anything wrong with my regex.

 

I need to scrape the addresses, the number of bedrooms, and the number of bathrooms.

 

The strange thing is that it works fine on the first pages, then it starts grabbing only half of a value, like "2 full" instead of "2 full bath", and after that it doesn't even catch the title.
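
The half-scraped values would be consistent with how the bedroom and bathroom patterns end: `.*?(?=,)` is a lazy match, so it stops at the first comma it can reach. Here is a small Python sketch of that behavior, assuming (hypothetically) that some listings contain text such as "2 full, 1 partial bath". The sample strings are invented, and a capture group stands in for the lookbehind because Python's `re` module only accepts fixed-width lookbehind.

```python
import re

# Invented sample fragments; the real page markup is an assumption here.
simple = "<li>2 full bath, 1,200 sqft</li>"
split_value = "<li>2 full, 1 partial bath, 1,200 sqft</li>"

# Same ending as the patterns in the script: lazy match up to the first comma.
until_comma = re.compile(r"<li>(.*?)(?=,)")

# One alternative: stop at the closing tag and trim the text in a later step.
until_tag = re.compile(r"<li>(.*?)</li>")

whole = until_comma.search(simple).group(1)
cut_short = until_comma.search(split_value).group(1)
full_item = until_tag.search(split_value).group(1)

print(whole)
print(cut_short)
print(full_item)
```

Anchoring the end of the match on a tag rather than on punctuation inside the text tends to be less sensitive to how each listing words its details.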

 

Even worse, I tested my regex in the RegexHero tester and it worked, even for an item whose table row was filled with nothing more than the link and the address.

Edited by KardoseR
