UBot Underground

Loop Process On Large File


Recommended Posts

I have a file of about 150,000 lines.

uBot has no issue pulling the text file data into a list and removing duplicate line entries.

However, I then have to loop over every remaining line in the file to remove some text from it.

The text that has to be removed is different on each line.

So the actual process is easy enough; the problem is that it is taking hours to go through each line and remove the target text.

Does anyone know of a faster way to do search/replace/remove on files of this size?
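For anyone comparing approaches: the usual fix is to stop looping line by line and do one regex pass over the whole file. A quick Python sketch of that idea (the sample data and the "N | " prefix pattern are made-up stand-ins for the real file):

```python
import re

# Hypothetical sample standing in for the 150,000-line file: each line is
# "NUMBER | URL", and the numeric prefix is the per-line text to remove.
lines = "1 | http://example.com/a\n2 | http://example.com/b"

# One compiled regex applied to the whole text replaces the per-line loop.
pattern = re.compile(r"^\d+\s\|\s", re.MULTILINE)
cleaned = pattern.sub("", lines)
urls = cleaned.splitlines()
```

A single `sub` over the full string avoids paying per-iteration overhead 150,000 times.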


Thanks for trying to help.

OK, initially each list item looks like this:

1 | http://...
2 | http://... 

etc.

 

Then I split the list so that it turns out like this:

1 | 
http://...
2 | 
http://...

 

So the end result is that I come up with a nice, clean list of URLs.

But because of the volume, it takes way too long to process.
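The prefix-strip plus dedupe described above can be done in one linear pass. A Python sketch (the sample items are hypothetical):

```python
# Hypothetical list items in the "N | URL" shape described above.
raw = ["1 | http://site.com/x", "2 | http://site.com/y", "3 | http://site.com/x"]

# Drop the "N | " prefix from each item, then dedupe while keeping order.
cleaned = [item.split(" | ", 1)[1] for item in raw]
unique = list(dict.fromkeys(cleaned))
```

`dict.fromkeys` gives hash-based deduplication, so the whole thing stays O(n) rather than re-scanning the list for each item.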


Try this; replace FILENAME.txt with your file:

set(#biglist,$list from file("FILENAME.txt"),"Global")
clear list(%urls)
add list to list(%urls,$list from text($trim($replace regular expression(#biglist,"\\d+\\s\\|\\s",$nothing)),$new line),"Delete","Global")


Here is an alternative way of looking at it, with find regex:

 

add list to list(%list, $list from text($find regular expression($read file("{$special folder("Desktop")}\\reg-test.txt"), "http.*"), $new line), "Delete", "Global")
plugin command("Bigtable.dll", "large List from Regex", "a", $read file("{$special folder("Desktop")}\\reg-test.txt"), "http.*", "replace")
alert($plugin function("Bigtable.dll", "Large list return", "a"))
plugin command("Bigtable.dll", "large List from Regex", "b", $read file("{$special folder("Desktop")}\\reg-test.txt"), "http.*", "replace")
alert($plugin function("Bigtable.dll", "Large list return", "b"))
plugin command("Bigtable.dll", "Large list Remove duplicates", "b")
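The find-regex idea, for comparison, in Python: instead of deleting the "N | " prefix, extract everything that matches `http.*`. The file contents here are a hypothetical stand-in for reg-test.txt:

```python
import re

# Made-up sample in the same "N | URL" shape as the file on disk.
text = "1 | http://a.example/page\n2 | http://b.example/page"

# "." does not match newlines, so each match runs from "http" to end of line.
urls = re.findall(r"http.*", text)
```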

 

 

Can't imagine it being fast with 150k lines, so I did it in the Large Data plugin too: http://www.ubotstudio.com/forum/index.php?/topic/16308-free-plugin-large-data/

 

CD


You should be able to get this into the under-10-seconds range. In one of my apps, for each link I:

1. strip http, www., etc.

2. get the hostname

3. work out the TLD

4. work out whether it's a subdomain.

 

This involves creating sub-lists and checking whether each one is contained in a "TLD list"/hashset.

 

Per 100,000 links it takes 2 seconds. Granted, it's in C#, but if you use the Large Data plugin the overhead of uBot shouldn't be very large, and what I'm doing involves a lot more work per link.
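The four steps above can be sketched in Python; the TLD set here is a tiny hypothetical sample, not a real public-suffix list:

```python
from urllib.parse import urlsplit

# Hypothetical hashset of known TLDs; a real app would load a full list.
tlds = {"com", "net", "co.uk"}

def classify(url):
    # 1. Strip the scheme and a leading "www."
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # 2-3. Work out the TLD: the longest dotted suffix found in the hashset.
    parts = host.split(".")
    tld = ""
    for i in range(len(parts)):
        suffix = ".".join(parts[i:])
        if suffix in tlds:
            tld = suffix
            break
    # 4. Anything longer than "name.tld" is a subdomain.
    is_subdomain = bool(tld) and len(parts) > len(tld.split(".")) + 1
    return host, tld, is_subdomain
```

The hashset lookup is what keeps this fast: each suffix check is O(1) instead of a scan over the whole TLD list.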

 

If you can't get it below ten seconds using the Large Data plugin, post your code here; maybe it's time for me to spend an afternoon optimizing the plugin, if possible.

 

Thanks,

kev123

