How to make a private English language database

Nobuo Saito   Saturday, July 06, 2002, 01:54 GMT
Tom wrote about using Google to search for English sentences.
That is a good idea, but, as Tom wrote, Google can be tricky to use because the web contains a lot of text written by non-native English speakers. If you make your own language database, you can avoid this problem.

First, you need a good text editor with "grep" capability. With "grep", you can search for words in *many* text files at once.
Next, go to the following site, where you can download movie scripts.

http://www.movie-page.com/movie_scripts.htm

Download as many scripts as possible. You only need the files with the file name extensions txt, htm, or html. Note that html files are plain ASCII text, so you can read them with an ordinary text editor, although they contain tags and codes that have little to do with the text's content.
Put them all in one folder. That's all. Then use the text editor's grep to search for words in the scripts in that folder.
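
If your editor has no grep, a small script can do the same job. This is only a rough sketch in Python, not the editor's actual grep: the function names `grep_folder` and `strip_tags` are made up for illustration, the Latin-1 encoding is a guess about how the scripts were saved, and the tag removal is a crude regular expression rather than a real HTML parser.

```python
import html
import re
from pathlib import Path

# Crude tag matcher; an assumption that is good enough for simple script pages.
TAG_RE = re.compile(r"<[^>]+>")
WANTED = {".txt", ".htm", ".html"}


def strip_tags(line: str) -> str:
    """Remove HTML tags and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", line))


def grep_folder(folder: str, pattern: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, text) for every matching line."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in WANTED:
            continue
        # Latin-1 never fails to decode; assumed encoding of the downloads.
        text = path.read_text(encoding="latin-1")
        for number, line in enumerate(text.splitlines(), start=1):
            if path.suffix.lower() != ".txt":
                line = strip_tags(line)
            if regex.search(line):
                hits.append((path.name, number, line.strip()))
    return hits
```

For example, `grep_folder("scripts", r"look forward to")` would list every line containing that phrase in a folder named `scripts` (the folder name is just an example).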
Tom   Saturday, July 06, 2002, 09:08 GMT
Good idea, but the size of such a database would be much smaller than Google's. Also, when the database grows bigger, it takes longer and longer to search it. (And Google takes about 1 second!)

When I didn't have unlimited access to the Internet at home, I sometimes searched the entire Project Gutenberg archive (a collection of English literature). The problem was that it took much longer to search 2 GB of text on my hard drive than it takes Google to search billions of webpages! In addition, Gutenberg contains mostly old texts, so it's not really contemporary English.

So I would still stick to Google (and be careful).
Nobuo Saito   Saturday, July 06, 2002, 10:08 GMT
Hi, Tom,

I didn't mean to say that my method is better than your method using Google. It's nice to have several ways to search for sentences.
One problem with Google is that the search results can contain too much garbage. It can be a hassle to pick out the appropriate ones.
I don't think the size of a database is much of a problem, as long as it is not too small and it contains the correct data you want.
A nice thing about movie scripts is that they contain mostly contemporary spoken English.
Project Gutenberg's collection consists mostly of classical books.
So the English used in those books may not be modern enough.
But it can still be useful.
I usually don't mind the search time, since I can stop a search at any time and still keep the results found so far, thanks to my good editor
(I can see the results on screen as my editor searches).
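
The same "see results as they come" behaviour can be imitated in a script with a generator. This is a sketch under assumptions, not how any particular editor works: `iter_matches` and `search_interactively` are hypothetical names, only .txt files are searched, and the files are assumed to be Latin-1.

```python
import re
from pathlib import Path
from typing import Iterator


def iter_matches(folder: str, pattern: str) -> Iterator[str]:
    """Yield matching lines one at a time instead of collecting them all."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="latin-1") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield f"{path.name}:{number}: {line.strip()}"


def search_interactively(folder: str, pattern: str) -> None:
    """Print hits as they are found; Ctrl+C stops the search early."""
    try:
        for hit in iter_matches(folder, pattern):
            print(hit)
    except KeyboardInterrupt:
        # Everything printed so far stays on screen.
        print("(search stopped early)")
```

Because each hit is printed the moment it is found, stopping halfway still leaves the partial results on screen.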
Another option is to divide the database into several groups.
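
One simple way to do this is to keep each group in its own subfolder and search only the groups you need. The sketch below assumes that layout; the function names and group names such as "comedy" or "drama" are invented for illustration, and the files are again assumed to be Latin-1 .txt files.

```python
import re
from pathlib import Path


def list_groups(root: str) -> list[str]:
    """Treat each subfolder of root as one group."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def search_groups(root: str, groups: list[str], pattern: str) -> list[str]:
    """Search only the .txt files inside the chosen group subfolders."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for group in groups:
        for path in sorted((Path(root) / group).glob("*.txt")):
            text = path.read_text(encoding="latin-1")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append(f"{group}/{path.name}:{number}: {line.strip()}")
    return hits
```

Searching one group at a time keeps each search short, and you can still pass `list_groups(root)` to search everything.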