ReadFile vs Readln
This weekend I tried to expand my book page with more novels. To do that, I needed to update the TF/IDF database and insert rows into the novel table.
This time I tried to make my previous mecab.go run on my Linode 2G node. I tried ulimit, and also tried adding the process to the OOM killer exclusion list. The second method worked, but after running for about 30 minutes it froze.
I tried to ignore it, but then I found my website was down and some of my tmux windows had been killed. So there must be some kind of protection mechanism kicking in. I needed to try another way.
Then I found that I had previously used ioutil.ReadFile to read the whole txt file before passing it into mecab for tokenization. This is simple but memory-consuming, so I tried reading line-by-line instead.
The result was fantastic!! Memory stayed stable at around 50 MB. But I waited and waited and waited, and the run never finished. So I wrote a small benchmark using a 2.4 MB txt file, reading the contents and passing them into mecab.
On this file, the runtime difference was huge. I think the reason is that mecab has to keep re-initializing and re-parsing the contents. Initializing mecab requires setting up several parts, and reusing one instance would need some code refactoring. So I chose to do the computation on my MBPR. Again!
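I can't reproduce the mecab bindings here, but the pattern behind the slowdown is the usual one: expensive initialization inside the per-line loop instead of hoisted out of it. A sketch using regexp compilation as a stand-in for the mecab tagger setup:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenizePerLine re-creates the "tokenizer" (a regexp here, standing
// in for a mecab tagger) for every line, like my line-by-line version did.
func tokenizePerLine(lines []string) int {
	n := 0
	for _, l := range lines {
		re := regexp.MustCompile(`\S+`) // expensive init inside the loop
		n += len(re.FindAllString(l, -1))
	}
	return n
}

// tokenizeReused initializes once and reuses it across all lines,
// which is the refactor my mecab code would need.
func tokenizeReused(lines []string) int {
	re := regexp.MustCompile(`\S+`) // init hoisted out of the loop
	n := 0
	for _, l := range lines {
		n += len(re.FindAllString(l, -1))
	}
	return n
}

func main() {
	lines := strings.Split("a b c\nd e", "\n")
	fmt.Println(tokenizePerLine(lines), tokenizeReused(lines))
}
```

Both return the same token counts; only the runtime differs, and with thousands of files that difference dominates.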
With my new cleanup script, it finished in 10 minutes, processing 15,000+ files totaling 1.6 GB in zip format. That result is fine for me, since I won't run this operation very frequently.
After this attempt, I think maybe I should try Amazon EC2 again. I used it before choosing Linode; for my usage I only wanted a simple web server, so Linode was cheaper. But when I need to do some heavy computation, a Linode 2G is not enough, and it doesn't seem possible to buy a temporary upgrade or extra computation power.
I have seen courses on distributed computation whose labs ask students to run jobs on Amazon EC2 with temporary computation power. Maybe that's doable for me too. I hope the setup procedure has improved these days.