UBot Underground

Scraping 500k+ items - Best practices to store data?



Hello.

 

I'm working on a scraping bot (using the HTTP POST plugin).

But I need to scrape a lot of entries.

 

Scraping and extracting the data works fine. 

 

But I have some challenges with uBot's performance and memory consumption.

 

Basically I run through multiple loops. 

The first run will extract 1500 URLs into a list.

 

The next one will extract 20-40 URLs for every URL in that list.

 

So that will result in 30,000-60,000 entries.

And already after 5,000 entries it gets very sluggish...

 

How do you normally handle such large amounts of data?

Do you store it on disk in multiple files? Should I save the lists into a SQLite DB and load the data from there?

 

Would love to hear your best practice tips.

 

Thanks in advance for your help.

Dan

 

 

 


I go with SQLite for saving, and I write a define that saves once the data reaches a certain size. It's important to get this right, as it can slow down your bot.

 

A simple way would be to save your initial scrape to a file using the advanced file commands (read part of file, append to file) once your second scrape gets to a certain size.

 

I'm also halfway through a plugin for large tables, which will have all the features of normal tables but can handle a lot of data.



 

I'm currently playing around with SQLite. I need to write a define that writes a list of 500 entries into the database.

But not all into one cell; I need each item as a separate row.

 

But when I loop through the list and do an INSERT INTO for each item, it's way too slow.

 

Do you know a smarter way to get a huge list into the SQLite DB?

 

Can you share some more details about the plugin you are working on? 

The biggest challenge for me at the moment is memory consumption. The bot is already using about 500 MB, and it only has 3,000 entries in a list :-(

 

Dan


OK, I can now add up to 500 entries (SQLite's default limit for compound SELECTs) in one go:

 

 

The SQLite syntax for that is:

INSERT INTO 'tablename'
SELECT 'data1' AS 'column1', 'data2' AS 'column2'
UNION SELECT 'data3', 'data4'
UNION SELECT 'data5', 'data6'
UNION SELECT 'data7', 'data8'

 

 

Here's my define to create that:

define CreateSQLAdd {

    set list position(%urls, 0)
    set(#sqlcommand, "INSERT INTO \'data1\'", "Global")
    set(#sqlcommand, "{#sqlcommand}
SELECT \'{$next list item(%urls)}\' AS \'urls\'", "Global")
    loop($subtract($list total(%urls), 1)) {
        set(#sqlcommand, "{#sqlcommand}
UNION SELECT \'{$next list item(%urls)}\'", "Global")
    }
}
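
For example, with three URLs in %urls (the URLs below are just placeholders), #sqlcommand ends up containing:

INSERT INTO 'data1'
SELECT 'http://example.com/page1' AS 'urls'
UNION SELECT 'http://example.com/page2'
UNION SELECT 'http://example.com/page3'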

 

 

If you need to add more than 500 items, you have to split them across multiple statements.
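
As a side note, SQLite 3.7.11 and later also accept a multi-row VALUES form, which builds the same kind of batched insert and is a little easier to generate. The table and column names below just follow the define above, and older SQLite builds may still cap this form at around 500 rows per statement:

INSERT INTO data1 (urls)
VALUES ('http://example.com/page1'),
       ('http://example.com/page2'),
       ('http://example.com/page3');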

 

Dan


Why not use the table insert command if you want to add over 500 in one go, and add your list as a row? It's lightning fast.

I'm going to ask Aymen if the table insert could have options for update, etc.

 

Regarding the plugin: it will have all the functions of normal tables (probably not at launch, as it seems people want it now), it can hold stupid amounts of data without a hitch, and it will be free. I'm just waiting on the key; the approver at support will be back from vacation. If there are any features you want, let me know.



 

Hmm... the insert table command probably won't work.

I have to add 200 thousand entries to the database.

 

And the add table command can't update the database; it always overwrites it.

So I would need to add 200 thousand entries to a table, which would probably kill uBot :-)

 

So I scrape 500 entries, add them to a list, convert that to a SQL statement, add it to the DB, and clear the list. Then I start over.

 

That's what I'm currently working on. I'll let you know if it works once it's done :-)


Sorry, I missed a bit of information: do it in chunks, inserting 1,000-2,000 rows at a time. It's what I do and it works very well if you're looking to insert more than 500 records at once.



 

How does your plugin perform when I add 50,000 entries to the table?

How much memory is required for that?

 

With the native uBot lists/tables that's almost impossible. Well, at least in UBot Studio; I haven't tested whether that changes when I compile the bot.

 

It's my first bot where I need to process more than 1,000 URLs in one go :-/

 

The final goal is to extract 3 million URLs... and somehow store them so that they can be used later. I'm still not 100% sure how to approach that...
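
If it helps, one way to keep a few million URLs around for later is a small SQLite table with a UNIQUE constraint on the URL, so re-scraped duplicates are skipped automatically. This is just a sketch, and the table/column names are made up:

CREATE TABLE IF NOT EXISTS urls (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);

INSERT OR IGNORE INTO urls (url) VALUES ('http://example.com/page1');

INSERT OR IGNORE simply skips any row that would violate the UNIQUE constraint, so the same URL never ends up in the table twice.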

 

Dan


I just tested my bot with v5. And I must say, it's much better in terms of memory management.

v4 was already using 600 MB with 4,000 entries in a table.

 

With v5 I just added 20,000 entries and it was still at 220 MB of RAM. And the UI was still very responsive.

 

The only thing you shouldn't do is open the table in the debugger with the plus sign.

That will instantly kill uBot. Bam, 1.2 GB of RAM and the app is frozen :-)

 

Dan


50,000 was so minor I couldn't tell if it was uBot using the memory or the table.

I went to 500,000 and the memory while adding went to over a gig. I think this was because I was looping half a million times with no other actions apart from setting the table cells, which in any program (even uBot) is an unlikely workload and is memory/CPU heavy.

 

I carried out a memory clear, and currently, with half a million records inside, uBot is sitting at under 200 MB. I have carried out several actions in the browser and this hasn't increased much.



 

Very interesting. Would love to test that.

Does the plugin support:

$table cell

$table total columns

$table total rows

set table cell

clear table

 

Those would be Prio 1 features in my opinion.

Followed by:

add list to table as row

add list to table as column

 

Nice work Kev!


Yeah, of course, all the standard stuff and anything people can think of. Two things to note: it doesn't show values in the debugger (the API doesn't allow this), and the table will be the size you specify, for example how many rows and columns. I could make it auto-calculate like uBot's tables, but that would make it bulkier, and the whole point is storing large data. Obviously, when reading from a file you wouldn't need to specify the size.



Sounds pretty cool Kev.

 

Would love to give it a try!

Dan


Dan,

 

Why are you putting it all into a list or table first?

 

INSERT/UPDATE straight to the database. Use the DB as your list or table. You can go on forever.

 

Yeah, that's what I'm doing now. But I don't want to save every single item directly to the SQLite database; that would be very slow for 5 million entries.

I'm grouping them together: I scrape 250 items into a list, write them into the DB with one query, clear the list, and then I continue.

 

That's working fine so far, but I'm still looking at other ways to optimize it.

 

Dan


Dan, have you tried BEGIN TRANSACTION and COMMIT when inserting your data into SQLite?

I'm not 100% sure, but I think SQLite suspends the index updates and rebuilds them after COMMIT, so your INSERTs should be much faster.
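
For reference, a minimal sketch of that suggestion (the table and column names just follow the define earlier in the thread): wrapping the batched inserts in one explicit transaction means SQLite only flushes to disk once at COMMIT instead of once per statement, which is usually where most of the speed-up comes from.

BEGIN TRANSACTION;

INSERT INTO data1 (urls) VALUES ('http://example.com/page1');
INSERT INTO data1 (urls) VALUES ('http://example.com/page2');
-- ... the rest of the chunk ...

COMMIT;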



Interesting. Thanks a lot. Will definitely check it out.

Dan

