Can't figure out this 666 demon :(

runsoftware · April 5, 2014

I ran into a problem that is kicking me in the nuts...

I purchased Aymen's HTTP script today and decided to play around a bit by scraping Trulia listings.

First I scrape all the URLs from the number of pages I want, then I make the GET requests and store the current HTML of each page in a variable.

Then I run the desired regex against that variable.

It does everything perfectly, but when I open the table I tell it to save (.csv), it's messed up: at the start everything seems fine, and then toward the end it either scrapes nothing or scrapes only part of a value, I don't know which.

You need the HTTP POST plugin from Aymen to run this.

To test it, put a value of 10 in the text box labeled "Pages to scrape".

This code works when scraping just the first page and all the properties on it, but on a larger scale it breaks.

clear table(&TABLE)
set(#page, 1, "Global")
set(#currentITEM, 0, "Global")
ui text box("Pages to scrape", #pagesTOscrape)
ui stat monitor("Current Page: ", #page)
ui stat monitor("Current Item: ", #currentITEM)
define Clear all lists {
    clear list(%urls)
    clear list(%addresses)
    clear list(%bedrooms)
    clear list(%bathrooms)
}
Clear all lists()
add item to list(%urls, "URLS", "Delete", "Global")
set(#citystate, "Los Angeles,CA", "Global")
set(#html, $plugin function("HTTP post.dll", "$http get", "http://www.trulia.com/for_sale/{#citystate}", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://trulia.com", "", 5), "Global")
add list to list(%urls, $find regular expression(#html, "(?<=<a itemprop=\"url\" data-row-index=.*href=\").*?(?=\")"), "Delete", "Global")
if($comparison(#pagesTOscrape, "=", 1)) {
    then {
    }
    else {
        loop($subtract(#pagesTOscrape, 1)) {
            increment(#page)
            set(#html, $plugin function("HTTP post.dll", "$http get", "http://www.trulia.com/for_sale/{#citystate}/{#page}_p", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://www.trulia.com/for_sale/{#citystate}/", "", ""), "Global")
            add list to list(%urls, $find regular expression(#html, "(?<=<a itemprop=\"url\" data-row-index=.*href=\").*?(?=\")"), "Delete", "Global")
        }
    }
}
add list to table as column(&TABLE, 0, 0, %urls)
set(#currentITEM, 0, "Global")
add item to list(%addresses, "Address", "Delete", "Global")
add item to list(%bedrooms, "Bedrooms", "Delete", "Global")
add item to list(%bathrooms, "Bathrooms", "Delete", "Global")
set list position(%urls, 1)
loop($subtract($list total(%urls), 1)) {
    set(#propertyHTML, $plugin function("HTTP post.dll", "$http get", "http://trulia.com{$next list item(%urls)}", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36", "http://www.trulia.com/for_sale/{#citystate}/", "", 5), "Global")
    add list to list(%addresses, $find regular expression(#propertyHTML, "(?<=<div id=\"center_row\">\\n.*\\n.*title=\").*?(?=,)"), "Don\'t Delete", "Global")
    add list to list(%bedrooms, $find regular expression(#propertyHTML, "(?<=<ul class=\"listInline typeEmphasize lhn mtn\">\\n.*<li>\\s+).*?(?=,)"), "Don\'t Delete", "Global")
    add list to list(%bathrooms, $find regular expression(#propertyHTML, "(?<=<ul class=\"listInline typeEmphasize lhn mtn\">\\n.*\\n.*<li>).*?(?=,)"), "Don\'t Delete", "Global")
    increment(#currentITEM)
}
add list to table as column(&TABLE, 0, 1, %addresses)
add list to table as column(&TABLE, 0, 2, %bedrooms)
add list to table as column(&TABLE, 0, 3, %bathrooms)
save to file("C:\\Users\\Joao\\Desktop\\ubot\\table.csv", &TABLE)

Above is my code.

Edit: I now think my regex is wrong in some of them, maybe the bedrooms one, and I should use scrape attribute or something. Will try.

Edited April 5, 2014 by KardoseR

kev123 · April 6, 2014

Have you tried running the bot part by part and checking that the regex works correctly? If you post an HTML example of the page and what you want to scrape, I'll take a look.
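One way to do that check outside the bot is to count how many matches each pattern returns per page. This is only a rough sketch in Python rather than UBot script, and the patterns and page snippets are made up for illustration (they are not the real Trulia markup). The idea is that any page where a pattern matches zero times, or more than once, is a likely place for the columns to drift apart.

```python
import re

# Hypothetical patterns and page snippets, for illustration only.
PATTERNS = {
    "address": r'title="(.*?),',
    "bedrooms": r"<li>\s*(\d+ beds?)",
}

def match_counts(html, patterns):
    """Return how many matches each named pattern finds in one page."""
    return {name: len(re.findall(rx, html)) for name, rx in patterns.items()}

def pages_out_of_step(pages, patterns):
    """Return (index, counts) for pages where any pattern did not match exactly once."""
    flagged = []
    for i, html in enumerate(pages):
        counts = match_counts(html, patterns)
        if any(n != 1 for n in counts.values()):
            flagged.append((i, counts))
    return flagged

pages = [
    '<a title="123 Main St, Los Angeles"></a><ul><li> 3 beds</li></ul>',
    '<a title="9 Oak Ave, Los Angeles"></a><ul><li>Lot</li></ul>',
]

for index, counts in pages_out_of_step(pages, PATTERNS):
    print(index, counts)
```

With these made-up pages, only the second one gets flagged, because its bedrooms pattern finds nothing.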

runsoftware · April 6, 2014

http://www.trulia.com/for_sale/Los Angeles,CA is the page I want to scrape.

But I don't think there is anything wrong with my regex.

I need to scrape the addresses, number of bedrooms, and number of bathrooms.

The strange thing is that it does fine on the first pages, then it starts getting values cut in half, like "2 full" and not "2 full bath", and then it doesn't even catch the title.

Even worse, I tested my regex in the Regex Hero tester and it worked even for an item whose table row had nothing filled in besides the link and the address.
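A possible explanation for the "2 full" result, sketched in Python rather than UBot script: the patterns in the script end with `.*?(?=,)`, which stops just before the first comma. The sample string below is an assumption (the real page may be formatted differently), but if a bathroom value itself contains a comma, a pattern of that shape would cut it short.

```python
import re

# Hypothetical bathroom text; the real Trulia markup may differ.
sample = "<li>2 full, 1 partial bath</li>"

# Same shape as the thread's pattern: lazy match that ends before the first comma.
up_to_comma = re.search(r"(?<=<li>).*?(?=,)", sample).group(0)

# Alternative: end before the closing tag instead of before a comma.
up_to_tag = re.search(r"(?<=<li>).*?(?=</li>)", sample).group(0)

print(up_to_comma)  # 2 full
print(up_to_tag)    # 2 full, 1 partial bath
```

Anchoring the end of the match on the closing tag keeps the whole value, commas included.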
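That symptom would fit the columns falling out of step rather than the regex itself being wrong: if each column is built as its own list and a page contributes no match, that column ends up one row short and every later value shifts up. Here is a minimal sketch of the alternative, building one row per property and writing a blank when nothing matches. It is in Python, not UBot script, and the helper names, patterns, and sample pages are all hypothetical.

```python
import re

def first_or_blank(pattern, html):
    """Return the first captured group, or an empty string when nothing matches."""
    m = re.search(pattern, html)
    return m.group(1) if m else ""

def build_rows(pages, patterns):
    """Build one row per (url, html) pair so every column stays aligned with its URL."""
    rows = []
    for url, html in pages:
        rows.append([url] + [first_or_blank(rx, html) for rx in patterns])
    return rows

# Hypothetical sample data: the second property has no bedrooms entry.
pages = [
    ("/property/1", '<a title="123 Main St, LA"></a><li>3 beds</li>'),
    ("/property/2", '<a title="9 Oak Ave, LA"></a>'),
]
patterns = [r'title="(.*?),', r"<li>(\d+) beds?"]

rows = build_rows(pages, patterns)
for row in rows:
    print(row)
```

Each row always has the same number of cells, so a missing value shows up as a blank cell in its own row instead of shifting the rest of the column.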

Edited April 6, 2014 by KardoseR
