UBot Underground

Bot crashing when scraping a large number of pages from CSV file


Recommended Posts

Hi,

I recently hired someone to create a bot for me using UBot. The bot needs to work from a CSV spreadsheet containing product listings on Amazon, one product per row. The first column of each row has the URL of the product on Amazon. The bot goes to the URL, scrapes some data, writes that data into the same row of the spreadsheet, and then moves on to the next row. I gave the UBot guy a detailed description of what the bot needs to do, including a sample of the spreadsheet containing 20 rows of products. He said no problem and we agreed on the price. However, when I informed him that the actual spreadsheet the bot will use may contain hundreds or even thousands of rows, he said the following:
 

I have a couple of issues with this.
Firstly, the bot does not handle large amounts of data very well, as it uses a lot of temporary system memory, which can cause the bot to freeze or the computer to crash if it is not regularly cleared out. Memory gets used up as the bot runs: each time it visits a new web page it uses a new piece of memory. Unfortunately this is an issue with the software I use to create bots and not something I have control over. If the spreadsheet contains thousands of items, then the running bot is potentially visiting thousands of web pages, which over time will use up the memory. In the background it will open hundreds of 'browser.exe' processes, which will cause the computer to slow down and the bot to freeze. I cannot take care of this problem in the way the bot is programmed, as it is an issue with UBot Studio itself.

 

Isn't there a way to overcome this problem with scripting? Can't the bot just use the same instance of the browser for each row? Or close the previous instance after it completes each row?
I offered to pay more money, but he insists it is a limitation of UBot Studio that cannot be overcome, and suggested I buy some third-party software to clean up the extra processes and free up memory.


A CSV file is little more than a .txt file, so unless it's huge I don't see the problem. The exception would be running a multithreaded bot, where the writes to the file can conflict.

 


He's saying the problem is with the browser.exe processes (read where I quoted what he said). The CSV is just a list of URLs. The bot reads the URL, goes to the page, collects some data, writes it back to the CSV file in the same row as the URL, then moves on to the next row.


I was loading huge lists (1 million URLs) with UBot, and it worked, although it took some time to load that data from a file (around 10 minutes just to add the data to the list, but the bot ran well from that point on).
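The loading step itself is only a couple of commands. A minimal sketch in UBot 4 script (the file path is a placeholder):

clear list(%urls)
comment("reads one URL per line; Delete drops duplicates, Global sets the scope")
add list to list(%urls, $list from file("C:\data\urls.txt"), "Delete", "Global")

With a million lines, the $list from file call is where those 10 minutes go.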

 

I talked about that with Eddie, the main developer, and he said UBot was not meant to load such large files, so it really isn't well suited to large data (I really do hope they work on this, since data keeps getting larger).

 

However, I never had problems with memory and browser.exe, and even if I did, there is a command that will close the browser.exe process (to free up memory), so that shouldn't be a problem.


That's what I told him. I thought the bot could just use one browser instance, but even if it can't, there must be a way to close the previous browser process. It seems the way he is doing it opens a new browser instance with every row and doesn't close the previous instance. What is the proper way to do this to avoid the problem he is describing?



That is the correct way: open a new browser instance, perform the task, then close it.

If the bot uses the same browser instance for every row, it will bog down after a while.

 



You can use the same instance of the browser, but you need to re-initialize it (with the "set user agent" command, for example), which will prevent browser crashes after a few rows. I was just dealing with a site where the browser crashed every few rows, and this approach solved the problem.
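In script form the idea looks roughly like this (a sketch only; %urls and the scrape step are placeholders, and the "Chrome" agent is just an example value for the stock "set user agent" command):

loop($list total(%urls)) {
    comment("re-initializing the browser each pass keeps it from bogging down")
    set user agent("Chrome")
    navigate($next list item(%urls), "Wait")
    comment("... scrape and save here ...")
}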

 

If you use "in new browser" command that's not needed (at least if you call the command for every row), since that command initializes a new browser; however, I think that sometimes this leaves the process "browser.exe" running, so you might just use "close page" command at the end, so that process gets stopped.


I once made a bot that worked with a 5-column, 50,000-row CSV, and it ran fine.

The problem you are describing actually splits into two parts: loading and processing a big CSV file (which can be done, but UBot is not very good at it), and loading lots of pages (which should not be a problem at all, as there are many ways to avoid browser crashes, as UBotDev.com and zap stated above).
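For the record, the two parts together come out to something like this (a sketch under assumptions: the file path, column layout, and scraped element are placeholders, and the parameter order of the table commands is assumed from stock UBot 4):

create table from file("C:\data\products.csv", &products)
set(#row, 0, "Global")
loop($table total rows(&products)) {
    in new browser {
        comment("column 0 holds the product URL")
        navigate($table cell(&products, #row, 0), "Wait")
        comment("write the scraped value back into the same row, column 1")
        set table cell(&products, #row, 1, $scrape attribute(<tagname="h1">, "innertext"))
        close page
    }
    increment(#row)
}
save to file("C:\data\products.csv", &products)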


but you need to re-initialize the browser (with "set user agent" command for example)

"Re-initialize": to return a computer program to the condition it was in at the start of processing, so that nothing remains from previous executions of the program.

 

Where is this information coming from, please? I'm asking because Eddie has said in the past:

 

If you use the in new browser command it will create a new browser with separate cache and cookies that will be deleted every time the commands inside it have finished running. You can also clear out all the data by restarting UBot Studio


Everything has its limits where memory usage is concerned, so you need to compromise or find a way to optimize.

The only case where you can't do anything about the browser crashing is when a page is so long that it takes over a GB of RAM; the same happens even in Chrome and Firefox, just at a higher memory limit. If you can avoid rendering all the content at once inside the browser, you reduce the memory footprint and prevent possible crashes. A better solution is not to render such pages in the browser at all; instead, try loading the HTML into a variable (if that works for the specific site).
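A sketch of that last idea (assumptions: the URL is a placeholder, and that $read file accepts a URL in your build; if it doesn't, an HTTP GET function from a third-party plugin does the same job):

comment("fetch the raw HTML without rendering it in a browser")
set(#html, $read file("http://www.example.com/product"), "Global")
comment("parse #html with $find regular expression or similar; no browser.exe involved")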


Thanks for all of your suggestions and comments. I was able to pass some of this information on to him, and he implemented the "close page" command as some of you suggested. I'm not seeing extra processes or excessive memory usage. I'll have to test it with a larger spreadsheet to know for sure, but it seems the problem is solved. :)



Nice to hear. Please let us know if that solved the problem.


Yes, it seems to be OK now. I have one more question; maybe you guys know, and if not I can make a new thread. How do you get the bot to close automatically when it's finished?



Nice to hear!

 

Here you go: http://www.ubotstudio.com/forum/index.php?/topic/12765-free-free-plugin-close-bot-command/


Thanks UbotDev.

 

Just to clarify further: if I use 'in new browser' for all my browser-based tasks and requests (big or small, as the case may be) and then finish off at the end with 'close page',

I get the following benefits:

1. System memory is constantly freed up by the 'close page' command, hence no more bot crashes, at least not for this reason.

2. No need to clean up cookies and cache, as it does the cleaning on exit?

 

Thanks.


"Close page" was added to close/kill the process that might stay active/opened after whne "in new browser" command ends.

 

1. Yes, it will free some memory if UBot is leaving extra browser.exe processes running.

2. I think you don't actually need to clear cookies, since "in new browser" initializes the browser and clears cookies (you should only clear Flash cookies, if any are left).

  • 2 months later...

Guys, thanks for your tips, very interesting!

What is the optimal way of doing this if we want to keep cookies intact, i.e., scraping from a membership-based website while still freeing up resources?

  • 3 months later...

You can, for example, log in in the "main" browser and then just call "in shared browser"... the cookies still work... then close that one.

I think the main browser will not crash, it just stays there... right?
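In sketch form (the login page and %urls list are placeholders):

comment("log in once in the main browser so the session cookies exist")
navigate("http://www.example.com/login", "Wait")
comment("... fill the login form and submit here ...")
loop($list total(%urls)) {
    in shared browser {
        comment("the shared browser reuses the main browser session and cookies")
        navigate($next list item(%urls), "Wait")
        comment("... scrape here; the block closes when done ...")
    }
}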

 

Is there another way?

