The program I am creating is an Internet crawler/spider, a lot like Google Web Search and Google Images Search. I hope to release the spider under the GPL when it is done. I also want to release the database, because I believe that even the information should become open source (under an OSI-approved licence) as well. (A lot like Wikipedia is, because Google is way too proprietary with information they don't even own in the first place :( )
In my latest experiments I have been creating MySQL tables with over 2 million rows. (Yes, I have indexed over 2 million pages in under 2 days!) The problem I have been running into is that the data length for that table is around 300 MB, but the index length is around 1.8 GB. Right now all URLs are indexed into one table, and the crawler takes forever to fetch a batch of URL rows to work on, because the entire database needs to be sorted. I need to come up with a faster solution.
The solution I'm working on right now is to create an archive database to put revisions of URLs into, thus creating a smaller database for the crawler to work off.
But the unanswered questions are:
- How big can the archive get?
- How hard/slow will it be to serve out the database through a client? (A generated web page of search results, or an image blob search, for example.)
- Is MySQL truly the right choice for the database?


