UBot Underground

Best way to tackle an interesting scrape


Recommended Posts

Hello everyone, this is my first post. I have been reading and watching loads of tutorials, but I am struggling to get my head around the way this bot needs to be structured.

 

I hope someone can make some suggestions.

 

The bot's purpose: to scrape member details from the members area of a website and add them to a database that can be used to email members.

 

Sounds simple enough; however, there are a few catches.

Catch 1: select country and state via dropdown boxes (image a1).

Catch 2: enter a search value, e.g. "a", as a reference to pull all names beginning with "a" (image a2).

The search returns results in a page similar to Google (image a3).

To get to the details of each member shown in the search results, you need to click "show more", which takes you to the member's details page, where the rest of the data to scrape is located, e.g. title, email, company, department (image a4).

Catch 3: not all members show the same data, e.g. some members have websites and others do not (image a5).

Catch 4: some info is comma-delimited (image a6).

 

Also, not all of the member information is wanted (awards, years in profession, etc.).

 

To get to the next member's information, you then need to return to the previous page with the search results and repeat the process.

 

Catch 5: once you get to the last member on the page, you need to select the second search results page (image a7) and continue.

 

So what is the best structure for this, and how do you add a safety mechanism that lets you pick up where you left off in case of bot failure, PC shutdown, etc.?

 

Now the next thing is this: there are a couple of hundred thousand members. How do I keep the data usable, e.g. break it into several databases (USA a-list, USA b-list)?

 

This bot is for my personal use only, but I would like to be able to update the information as new members join or cancel, etc.

 

I would also like to be able to extract searched data from the files created and use it for emailing, e.g. search by state or hobby and export all members associated with that field.

 

All scraped data needs to be entered into a CSV file, with one row per member and one column per category.
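For reference, that CSV layout (one row per member, one column per category) can be produced with a short Python sketch like this. The field names are invented placeholders; the real columns depend on what is scraped:

```python
import csv

# Hypothetical column headings; one column per category.
FIELDS = ["name", "title", "email", "company", "department", "website"]

def save_members(members, path):
    """Write one row per member; categories a member lacks stay blank."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        writer.writeheader()
        for member in members:
            writer.writerow(member)

# Example: a member with only some of the categories filled in.
save_members([{"name": "Jane", "email": "jane@example.com"}], "usa-a.csv")
```

`restval=""` is what keeps rows aligned when a member is missing a category (catch 3).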

 

Suggestions please :)

 

Thx James


I think this is pretty straightforward:

For catch 1: you will need to create a UI dropdown and mirror the values in the country list into your dropdown items by comma-delimiting them. Then set this variable to the country.

You can then reuse that variable when setting the name of the output file for the list (that's how you keep them all nice and separate). Save that file to /results using the Special Folder command.
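As an illustration of that naming scheme, here is a small Python sketch; the `results` folder and the country/letter filename pattern are just assumptions, not something UBot prescribes:

```python
import os

def results_path(country, letter, base="results"):
    """Build a per-country, per-letter output path, e.g. results/USA-a.csv."""
    os.makedirs(base, exist_ok=True)  # make sure the output folder exists
    return os.path.join(base, f"{country}-{letter}.csv")

path = results_path("USA", "a")
```

Reusing the same country variable for both the search and the filename means the exported files separate themselves automatically.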

For catch 2: a UI text box where you enter the letter you want, which is set to a variable; that variable is then entered as the value in the search box.

Catch 3: not really a catch; just punch the search button and scrape the results.

Those are basically the setup steps.

As for the run itself, the process I would use is:

1. Run the country and search-value query.
2. Scrape all the results pages. Depending on how the page is paginated, you could use a WHILE loop and make it conditional on whether a NEXT arrow or other nav item is shown on the results page. Or, if you know the number of results shown per page, take the total number of results for the query, divide it by the number per page, round the result up (e.g. with a bit of JavaScript), and then cycle through the pages by appending an incremental value to the NEXT item, repeated once per page (i.e. 100 results at 10 per page means a loop of 10).
3. Once you have gathered all the profile URLs from the results pages, run through another loop gathering the profile data.
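The page-count arithmetic in step 2 is just a rounded-up division; in Python it looks like this (no JavaScript needed):

```python
import math

def page_count(total_results, per_page):
    """How many result pages to cycle through for a query."""
    return math.ceil(total_results / per_page)

# 100 results at 10 per page means a loop of 10 pages.
```

The WHILE-on-NEXT-arrow approach is more robust when the site does not show a total result count.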

Scraping profile data:

Use IF commands to scrape. That way, if a field is not there, your bot won't error; it will just skip it.

If you want to scrape different things in different sessions, for example one time just email, another time email plus location, you could do this by creating UI checkboxes that match the field names.

You can then check the boxes on your UI in the bot and have the bot scrape only those elements.

If this is your forum, you could just add some unique CSS to each of the values you are scraping from the pages; otherwise I prefer to set each item to a variable.

So you say:

If Address Exists, then set #address to the scraped attribute or page scrape item.

If you use the checkboxes you can do:

If Checkbox (for example UI checkbox address) = True AND Address Exists, then set (as before).

This all happens within a loop that has the scraped profile URLs loaded as a list and that calls next list item on them.

At the end of each cycle, save off the list. You could set the column heads of the list for your CSV before going into the loop of profile URLs.
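The same "if it exists, set it; otherwise leave it blank" idea, combined with the checkbox filter, can be sketched in Python. The CSS class names and regex patterns here are invented; the real selectors depend on the site's HTML:

```python
import re

# Hypothetical per-field patterns; replace with the site's real selectors.
PATTERNS = {
    "email": re.compile(r'class="email">([^<]+)<'),
    "website": re.compile(r'class="website">([^<]+)<'),
}

def scrape_profile(html, wanted):
    """Scrape only the checked fields; a missing field becomes "" instead of an error."""
    record = {}
    for field in wanted:                      # 'wanted' plays the role of the UI checkboxes
        match = PATTERNS[field].search(html)  # the "If ... Exists" check
        record[field] = match.group(1) if match else ""
    return record
```

A member page with no website (catch 3) simply yields an empty website column, so the CSV rows stay aligned.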

Your last point:

Now the next thing is this, There are a couple of hundred thousand members - How to then have this as a usable (break the data into several databases E.G USA -a list, USA-b list)


For catch 1: instead of setting each country by hand, create a loop. Add all the countries to a list, get the list total, and then wrap that loop around all of the above. Once it completes, you will have saved off each list with the country prefix (set the next list item of the country list to a variable and use that variable to set the name of the exported list file).

For catch 2: instead of entering the values manually, fill a text file with the values you want, load them into a list, and then use a combination of the list total and a loop at the right place in the script to rotate through all values (for example a list a-z).

The outcome of the process would be:

Take a country from the list
Cycle through all letters and scrape profile URLs
Scrape the profile data
Save the list with the country name to /results
Go to country number 2
and so on.
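That outcome is two nested loops. A Python sketch of the control flow (the helpers here are stubs so it runs; real versions would do the actual scraping and saving):

```python
# Stub helpers so the control flow runs; replace with real scraping steps.
def scrape_profile_urls(country, letter):
    # Steps 1-2: run the query and collect profile URLs from all result pages.
    return [f"https://example.com/{country}/{letter}/profile-1"]

def scrape_profile(url):
    # Step 3: visit one profile page and pull its fields.
    return {"url": url}

def run(countries, letters):
    saved = []
    for country in countries:        # outer loop: one country at a time
        for letter in letters:       # inner loop: rotate the a-z search values
            urls = scrape_profile_urls(country, letter)
            members = [scrape_profile(u) for u in urls]
            saved.append((f"{country}-{letter}", members))  # one list per pair
    return saved

result = run(["USA"], ["a", "b"])
```

Each (country, letter) pair produces its own saved-off list, which is exactly the USA-a / USA-b split asked about.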

As you are working with large amounts of data, I would set up some safety precautions, such as writing the status of the completed countries and letters to a file. That way, if there is a crash, you can see where to pick it up from.
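One simple shape for that status file, sketched in Python (the file location and comma-separated format are just assumptions):

```python
import os

STATUS_FILE = "results/status.txt"  # hypothetical location for the progress log

def mark_done(country, letter):
    """Append a completed country/letter pair to the status file."""
    os.makedirs(os.path.dirname(STATUS_FILE), exist_ok=True)
    with open(STATUS_FILE, "a", encoding="utf-8") as f:
        f.write(f"{country},{letter}\n")

def already_done():
    """Return the set of (country, letter) pairs already completed."""
    if not os.path.exists(STATUS_FILE):
        return set()
    with open(STATUS_FILE, encoding="utf-8") as f:
        return {tuple(line.strip().split(",")) for line in f if line.strip()}

mark_done("USA", "a")  # on the next run, skip any pair found in already_done()
```

Because the file is append-only and written after each completed pair, a crash or PC shutdown loses at most the pair in progress.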

That's one way.

