A Little About Gigablast
June 14, 2021
A lot of people think there are tons of search engines, but that is a great misconception. In the United States there are really only Google and Bing. The other search engines pretty much all just use Bing's API to supply search results (and query completions and spell checking) to their websites. You can look at this comparison of search engines on Wikipedia to learn more. I'm not sure why DuckDuckGo and Qwant are on there because they use Bing according to this list of search engines.
I've been working on Gigablast since around the year 2000. Mostly, just myself. It is coded in C. Technically it is in C++, but I really only use the class structure of C++ as I find the rest of C++ to be more confusing than helpful. It is a matter of taste perhaps, as I grew up using EDTASM and Basic on the Color Computer in the 80s in Ohio.
Right now Gigablast runs on essentially 0 computers if you were to compare its hardware to Big Tech's search engines. This is something that needs to change. Certainly Gigablast's search algorithms will be undergoing more tuning, and speed-ups, but, at this point, the hardware issue is the biggest one. That is, the most improvement to be had in quality and speed will come from adding more hardware. You can see the Gigablast datacenter here.
It'd be nice if the legislation that is circulating becomes law to allow other people besides Big Tech a shot at making a successful business. You can see the advantages Google has over everyone else are monumental, in part due to platform control (bundling) and network effects. I think I could beat Google on technical merit. That is, I think I could deliver a significantly better search experience to consumers, but the obstacles they have constructed over the years from their immense power are too big to overcome simply by writing better code. At this point, search technology is stagnating at the hands of Big Tech.
Why Google Will Never Have Competition
June 4, 2021
1. Google has the U.S. Government under its spell. Google will just sidestep any attempt at Government regulation with the same ease they sidestep paying their fair share of income taxes.
2. Google's query logs, measuring in the billions of queries every week, are used to construct synonyms for future searches, thereby creating an insurmountable network effect.
3. Many websites and services like Cloudflare and Cloudfront make it hard for non-Google search engines to crawl the web. This is also somewhat of a network effect.
4. Google has the deep pockets to buy the hardware to create and maintain the largest search index.
5. Google's monopoly on search ads allows it to monetize search traffic far more effectively than any potential competitor. It does not allow competing engines to monetize using Google search ads.
6. Google does not allow other small search engines to crawl Youtube at the required rate to create an effective video search product.
So Google likes to say that competition is only a click away, but no matter how many times I click my mouse these six obstacles not only still remain, but grow every year.
How Big Tech Controls Web Search on Mobile
April 13, 2021
Google has a 97% monopoly on mobile web search ads (as of 2017) and does not, as a matter of policy, allow Google ads to be shown with the search results of other search engines. If this can change then other search engines can monetize and consumers might have more choices and not just Google, Google, Google.
Furthermore, the whole fiasco around Google's bidding process for the choice screen on Android devices in Europe is fundamentally flawed, because all of the alternative choices (except Yandex) are powered by the Bing search engine. So the choice screen is not much more than just Bing affiliate spam. Hey EU Commission, how is this breaking up Big Tech? This is just Google diverting a little search traffic to its Big Tech buddy, Microsoft. Furthermore, considering Microsoft's and Bill Gates' close relationships with countries that commit genocide, this is quite an incredible thing.
If you want some more thoughts, here's a link to an interview I just did for website planet on the direction and problems of web search today.
Also of note, Gigablast was mentioned in a couple of other articles recently:
Gigablast in the Australian Broadcasting Corporation (ABC).
Gigablast in the New York Times.
Cloudflare, Google, Bing Destroying the Infrastructure of the Free Web
June 13, 2020
It has come to my notice recently while spidering the web that Cloudflare, a Content Distribution Network (CDN), is now offering, as they put it, "DDos Security" by using several tactics that prevent free web search engines from spidering public web sites.
You have to be a non-blind person to pass the Turing test, as Cloudflare does not offer a handicap option. I believe this is illegal. After you click a checkbox you have to do that one Turing test where you select boxes within a grid of boxes where each box contains part of a picture. It asks you something like to select all the boxes with trucks in them. Once you pass this Turing test you are issued another cookie and allowed to access the website. It asked me to do this four times before I was able to access the web page in question! I would never do that to visit a website, personally, I would just go to another site.
Lastly, it's worthy of note that Cloudflare does not seem to impede Baidu's access to such public websites. It's funny that an American company like Cloudflare gives free web access to China's flagship search engine, but when it comes to a small America-based search engine they smear it in the face.
And, yes, Google, Bing, Baidu, we know Cloudflare is really you! You are using Cloudflare as a tool to limit internet access to other search engines trying to spider the web while your search engines have free rein.
The Evil of Bundling Content with Web Search
June 10, 2020
If you are searching for people in a professional context, a single company controls all of the resources. That company is Microsoft. Its Bing search engine combined with its Linkedin.com property gives it a complete stranglehold on this important section of web search. How many times have you searched for someone and received search results of people in a professional context? The answer is: a lot. And now Microsoft is essentially the only source for this information. This large and valuable segment of search is effectively monopolized.
If you are searching for videos, then the culprit would be Google. Google couples its own search engine with its Youtube.com properties to guarantee its victory over all competitors. If it so wishes it can cut off video search to practically any competing video search engine. And, it is indeed doing this slowly so that it seems natural and less culpable (see article below). In my unscientific observations it also seems it promotes Youtube.com results in its search engine more than it should. To the point of being biased. I currently have no empirical data to back up this bias, but it is measurable and seems worthy of investigation.
How many more sections of search will be consumed by the tech titans? How far will we let it go before Congress breaks up these evil search/content bundles of domination?
Rest in peace:
Stop Google and Microsoft from Censoring the Internet NOW
Apr 8, 2020
Content of https://patents.google.com/robots.txt :
User-agent: *
Disallow: /*
Allow: /$
Allow: /advanced$
Allow: /patent/
Allow: /sitemap/
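To see what those rules leave open to a generic crawler, here is a minimal sketch of a matcher. It assumes the longest-match precedence described in RFC 9309 (the most specific pattern wins, and a tie goes to Allow), with `*` as a wildcard and a trailing `$` as an end anchor. The function names are made up for illustration, and a particular crawler may interpret the file differently.

```python
import re

# Rules copied from the robots.txt quoted above: (directive, path pattern).
RULES = [
    ("disallow", "/*"),
    ("allow", "/$"),
    ("allow", "/advanced$"),
    ("allow", "/patent/"),
    ("allow", "/sitemap/"),
]

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_allowed(path, rules=RULES):
    """Longest matching pattern wins; a tie goes to 'allow'; no match means allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
```

Under these assumptions, `is_allowed("/patent/US1234")` and `is_allowed("/")` are true, while any other path such as `/scholar` falls through to the blanket `Disallow: /*`.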
Microsoft/Bing, which recently bought github.com, now throttles competing search engine crawlers from indexing github.com's content. However, no such throttling applies to Bing's own crawlers. Just think, the once epitome of openness, github, is now ironically being slowly and methodically shut down by evil big tech. Now, that's not to mention linkedin.com, of course, also owned by Bing, which is login-access only. So anytime you see linkedin.com in your search results it's basically an unlabeled ad, in violation of the FTC mandates for clearly labeling ads in the search results. Bing also has some arrangement with Google so Google still has crawling access to linkedin.com. This flies in the face of open internet access as well.
These guys have so much money they can slowly shut down the open internet, and wall off all the data so all competitive threats, including potentially-competing search engines, are completely extinguished while our Congressmen ask stupid little questions and do absolutely nothing. When people say 'you don't get it', this is what they are talking about. And it's only getting worse over time.
Note to Congress, etc.: You need to break up big tech. Search engine tech, ad tech and content simply can not co-exist in the same basket, as you get evil shit like this. I throw ad tech in there, because competing search engines need to have access to the same ad inventory as the big guys in order to even have a shot. Do your job and break these cheaters up. STOP INTERNET CENSORSHIP NOW.
U.S. Government Supporting Google Monopoly?
Mar 31, 2020
When it comes to material actions, the United States Government is actually supporting the Google monopoly. The stuff you read about in the news is just foo foo talk to give the media something to report about. I know this because when I crawl U.S. Government sites like senate.gov my spiders get blocked. I sent an email asking why and I got the following response:
You should have access again in 15 minutes or so. If you find yourself being blocked, please tamp down your rate request. Best, Simin Xi On behalf of the Senate Webmaster
Then I followed up, twice, actually, asking if Google was throttled the same way, or if I could be on the same whitelist that Google is on, and was met with deafening silence. I need to be able to spider somewhat fast to get any sort of coverage on these big websites. To summarize, if you want to start a search engine to compete against Google, the U.S. government will make sure you do not get far.
I guess this isn't as bad as when I applied to be the official U.S. search engine for firstgov.gov (now usa.gov) and lost out (lost to Microsoft 'in disguise', nonetheless) because I didn't give the guy in charge at the GSA the job at Gigablast that he asked for.
Feb 28, 2020
Gigablast is teaming up with Private Internet Access (now Kape Technologies), which operates some of the world's leading VPNs, to produce a private search engine called private.sh, which offers cryptographically-secured privacy.
This private search engine is unique, and is another level above the existing private search solutions. How does it work? Private Internet Access has a strong legal record of defending its privacy services, and it basically sits on top of the Gigablast search engine, scrubbing away any IP address information before forwarding a query done on private.sh to Gigablast.
Other "private" search engines have access to both your IP and your query, and who knows who their people really work for.
New Gigablast Search Engine for 2020
Feb 28, 2020
After about a year of work focused on search results quality, it is time to introduce another revision of the engine. Here are some highlights:
- Great advances in search quality (and more to come)
- New relevancy algorithms yield relevant results the others don't
- Added link after each summary sentence to show where it occurs in the cached document
- Dropdown menu for each result with numerous functions
- Improved synonym and spell checking algorithms and data sources
- Highlight query terms in the results with individual colors (Is this too crazy?)
- Backend: Added more hardware
- Backend: Improvements to QA tools to prevent quality regression
- Backend: Tools to make scaling a lot easier
The Google Contradiction
Feb 20, 2020
"A remarkable range of consumers, developers, computer scientists, and businesses agree that open software interfaces promote innovation and that no single company should be able to monopolize creativity by blocking software tools from working together. Openness and interoperability have helped developers create a variety of new products that consumers use to communicate, work, and play across different platforms." -- Google spokesperson Jose Castaneda (Wed Feb 19 2020)
However, Google prevents their search results from being combined with search results from other search engines to create a meta search engine. So, in other words Google is saying something like "we are free to use your stuff how we see fit, but you can't use our stuff how you see fit."
Hmmmm.... they'll have to use some corporate doublespeak to patch this up.
Sep 21, 2019
LinkedIn is a walled garden of data, owned by Microsoft, that is no longer an openly-shared platform. While Microsoft shares their LinkedIn data with Google, and of course Bing, which they own, other smaller search engines can not access the data, whatsoever.
Now that these data baron monopolists have gotten increasingly into owning their own content on the web, and prohibiting other search engines from accessing it, small search engine upstarts are even less likely to be able to compete.
Facebook also prohibits smaller search engines from indexing its data.
Please quit using LinkedIn and Facebook and seek open alternatives.
The Danger of Content Distribution Networks
Sep 21, 2019
A lot of power in the hands of a few. It's a recurring theme on the Internet. Companies like Cloudflare and Akamai, both CDNs, have blocked or impeded smaller search engines from spidering the content on the millions of websites for which they control the security and network logistics. Under the guise of protection their policies default to prohibiting smaller search engines from legitimately downloading and indexing this content.
It's also quite interesting to note that Cloudflare is heavily funded by both Google and Bing, and also Baidu. So once you get the dominant search engines actively interfering with smaller search engines' spidering of the content from millions of websites, it doesn't really get any more anti-competitive. This is something I would not expect anyone except a handful of people to realize, as there are not that many people spidering website content for their search engine as I am, but, nonetheless, it is a critical piece of the entire anti-competitive racket orchestrated by the data barons.
This could be fixed with a Bot Bill of Rights that required all bots in the same category be treated equally. No favoritism.
The Demise of the Meta Web Search Engine
Sep 21, 2019
Back in the 90s and early naughts, we had search engines called meta search engines that would search multiple web search engines and return the blended results. But now that there are only two large search engines in the United States, Google and, to a much lesser extent, Bing, the meta search engine has disappeared.
The meta search engine was a great way for smaller or newer search engines to get business because a newer search engine's results, however unique, could be blended into a potpourri of the results from other, more established, search engines. However, today this is not contractually possible, thanks to Google's anti-competitive policies.
Google prohibits any company that displays Google search results from ever, even on another website that doesn't show Google's results, displaying search results from another search engine. Basically, Google demands exclusivity.
These practices need to stop. They are hurting the search ecosystem.
New Changes Coming
Sep 21, 2019
I am actively working on some new algos that should be out soon, in addition to an expanded index.
New Stuff Afoot
Jan 27, 2018
Gigablast now automatically generates news every 10 minutes from millions of web sites. It is unrestricted in that it pulls news from any website. It is not sandboxed to a list of news outlets that pay money for traffic nor does it promote news sites that pay for promotion. Users can specify their preferred country and language of news in a simple drop down. The news is currently presented in up to two sections. The first section, if available, is called Breaking News and consists of stories that are under one hour old and it is sorted by the age of the story. The second section is Top Stories and basically contains all the relevant news stories from the last several hours, sorted by a popularity metric. As time passes stories will come and go, rise and fall, like the tide.
There are still some minor issues with topic clustering and detecting the correct publication date in some cases, but it will be ironed out over time. The same goes for country identification of some news websites. The News TODO list also includes categorizing the news into topics like sports or business.
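The two-section layout described above can be sketched in a few lines. This is only an illustration: the `Story` fields, the six-hour cutoff standing in for "the last several hours", and the `popularity` number are all assumptions, since the post does not say how popularity is computed.

```python
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    age_minutes: float  # time since publication
    popularity: float   # hypothetical popularity metric

def build_sections(stories, breaking_max_age=60, top_max_age=6 * 60):
    """Split stories into (Breaking News, Top Stories).

    Breaking News: under one hour old, newest first.
    Top Stories: everything inside the (assumed) six-hour window, most popular first.
    """
    breaking = sorted(
        (s for s in stories if s.age_minutes < breaking_max_age),
        key=lambda s: s.age_minutes,
    )
    top = sorted(
        (s for s in stories if s.age_minutes < top_max_age),
        key=lambda s: s.popularity,
        reverse=True,
    )
    return breaking, top
```

Rebuilding both lists on every 10-minute cycle gives the tide-like behavior: a story drops out of Breaking News once it passes the hour mark and out of Top Stories once it ages past the window.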
Secondly, Gigablast is making its foray into image search. Although it has a tiny index compared to the search monopoly, the image search shows great promise and is an area of active development.
Thirdly, you can now specify a country (and language) in a drop down menu when doing a search. Gigablast gives heavy weight to results from the specified country, and language. It also gives some weight, although not as much, to results that have unknown countries, or are international. As we move forward, we will become better at identifying and tuning country-based indicators, of which there are many. It's hard to identify the country or countries a website represents. This, as is the case with most search algorithms, is being continually improved.
Last, but the opposite of least, is a highly-improved web index. The new Gigablast index fixes several issues with search results. It is a much faster and better web search, re-engineered on the bit-level. Summaries are better, titles are better, and relevance in general is better.
Going forward I will be adding local search and improved synonyms for query expansion. We have a head start with "local" because of the work I did on an event search engine a while back called EventGuru. I have a few techniques I'm anxious to try out. From what I've seen during spot-testing, I'm confident the results will improve even more. Additionally, supporting mobile devices better will be imperative. This should wrap it up for the big new features that are coming soon, and the rest of the time will be spent tuning existing search features to iteratively close the relevance gap between ourselves and the "competitor".
Please help this independent search engine continue to improve and grow with a donation. New hardware is desperately needed to increase the index size and return even more relevant results.
A Bright Future
May 23, 2017
New Query-Based Features
Gigablast now has a new and improved query spell checker. It quickly searches a dictionary of over 600,000,000 entries covering most languages to determine correct spellings. Only English is supported PHONETICALLY at the moment. A more detailed description.
Additionally, Gigablast now has support for query completion and search-as-you-type technology. And, at the bottom of the results page, it now displays related queries.
Search Extinct Web Pages
Dec 28, 2016
Now Gigablast can search for pages that no longer exist or had some kind of error when downloading. For example, property24. A convenient link to the Internet Archive is provided so you can see what the page looked like in the past.
The Gigablast Web Search Appliance
Feb 2, 2016
Today is exciting. After substantial development and testing we are proud to reveal the Gigablast Web Search Appliance. The largest and fastest web search engine available. More info is here. It runs a souped-up version of Gigablast Open Source, called Gigablast PRO™. It can index over 100 billion, yes, billion, pages at full capacity. It can serve over a crisp 20 queries per second.
The New Search Engine is Here
June 27, 2015
The new Gigablast search engine is online now. If you have any feedback that I can use to improve the service, don't hesitate to tell me. Gigablast's scoring algorithm is completely transparent and you can see how individual search results are scored by clicking on the link that says scoring next to each search result. It also has some interesting new features for a web search engine, so take it for a spin.
15 Year Anniversary
September 1, 2014
It's been 15 years since I first started Gigablast. It's taken some interesting directions as of late. Most notably being open source. I've decided to revive the old blog entries that you can find below and continue working on top of those.
Giga Bits Introduced
Jan 31, 2004
Gigablast now generates related concepts for your query. I call them Giga Bits. I believe it is the best concept generator in the industry, but if you don't think so please drop me a note explaining why not, so I can improve it.
You can also ask Gigablast a simple question like "Who is President of Russia?" and it often comes up with the correct answer in the Giga Bits section. How do you think it does that?
In other news, the spider speed ups I rolled a few weeks ago are tremendously successful. I can easily burn all my bandwidth quota with insignificant load on my servers. I could not be happier with this.
Now I'm planning on turning Gigablast into a default AND engine. Why? Because it will decrease query latency by several times, believe it or not. That should put Gigablast on par with the fastest engines in the world, even though it only runs on 8 desktop machines. But don't worry, I will still leave the default OR functionality intact.
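One plausible reason a default AND is cheaper, sketched below: intersecting posting lists can start from the rarest term and only ever shrinks the candidate set, while OR has to carry every document that matches any term. The function names and the in-memory lists of document ids are assumptions for illustration, not Gigablast's actual data structures.

```python
def intersect_all(posting_lists):
    """Documents containing every term (AND).

    Starts from the shortest list so the candidate set can only shrink,
    and stops early once it is empty.
    """
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
        if not result:
            break
    return sorted(result)

def union_all(posting_lists):
    """Documents containing any term (OR); every posting has to be kept."""
    result = set()
    for postings in posting_lists:
        result.update(postings)
    return sorted(result)
```

With one rare term and one very common term, the AND result can be no larger than the rare term's list, so far fewer documents need to be scored and ranked.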
January Update Rolled
Jan 8, 2004
Gigablast now has a more professional, but still recognizable, logo, and a new catch phrase, "Information Acceleration". Lots of changes on the back end. You should notice significantly higher quality searches. The spider algorithm was sped up several times. Gigablast should be able to index several million documents per day, but that still remains to be tested. <knock on wood>. Site clustering was sped up. I added the ability to force all query terms to be required by using the &rat=1 cgi parm. Now Gigablast will automatically regenerate some of its databases when they are missing. And I think I wasted two weeks working like a dog on code that I'm not going to end up using! I hate when that happens...
An Easy way to Slash Motor Vehicle Emissions
Dec 11, 2003
Blanket the whole city with wi-fi access. (like Cerritos, California) When you want to travel from point A to point B, tell the central traffic computer. It will then give you a time window in which to begin your voyage and, most importantly, it will ensure that as long as you stay within the window you will always hit green lights.
If you stray from your path, you'll be able to get a new window via the wi-fi network. If everyone's car has gps and is connected to the wi-fi network, the central computer will also be able to monitor the flow of traffic and make adjustments to your itinerary in real-time. Essentially, the traffic computer will be solving a large system of linear, and possibly non-linear, constraints in real-time. Lots of fun... and think of how much more efficient travel will be!! If someone wants to secure funding, count me in.
Spellchecker Finally Finished
Nov 18, 2003
After a large, countable number of interruptions, I've finally completed the spellchecker. I tested the word 'dooty' on several search engines to see how they handled that misspelling. Here's what I got:
Wisenut: N/A (no spellchecker)
So there is no one way to code a spellchecker. It's a guessing game. And, hey Wisenut, want to license a good spellchecker for cheap? Let me know.
Gigablast uses its cached web pages to generate its dictionary instead of the query logs. When a word or phrase is not found in the dictionary, Gigablast replaces it with the closest match in the dictionary. If multiple words or phrases are equally close, then Gigablast resorts to a popularity ranking.
One interesting thing I noticed is that in Google's spellchecker you must at least get the first letter of the word correct, otherwise, Google will not be able to recommend the correct spelling. I made Gigablast this way too, because it really cuts down on the number of words it has to search to come up with a recommendation. This also allows you to have an extremely large dictionary distributed amongst several machines, where each machine is responsible for a letter.
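The approach described above (a dictionary built from page text, closest match wins, popularity breaks ties, and only words sharing the first letter are considered) can be sketched as follows. Levenshtein edit distance is an assumption here, since the post only says "closest match", and the class and method names are made up for illustration.

```python
from collections import Counter, defaultdict

def edit_distance(a, b):
    """Levenshtein distance, computed with one rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class SpellChecker:
    def __init__(self, documents):
        # Popularity is how often each word occurs in the indexed text.
        self.popularity = Counter(w for doc in documents for w in doc.lower().split())
        # One bucket per first letter; each bucket could live on its own machine.
        self.by_first_letter = defaultdict(list)
        for word in self.popularity:
            self.by_first_letter[word[0]].append(word)

    def suggest(self, word):
        word = word.lower()
        if not word or word in self.popularity:
            return word
        candidates = self.by_first_letter.get(word[0], [])
        if not candidates:
            return word
        # Closest match first; among equally close words, the most popular one.
        return min(candidates, key=lambda c: (edit_distance(word, c), -self.popularity[c]))
```

Because only one bucket is scanned per lookup, the cost depends on how many dictionary words share that first letter, not on the size of the whole dictionary.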
Also of note: I am planning on purchasing the hardware required for achieving a 5 billion document index capable of serving hundreds of queries per second within the next 12 months. Wish me luck... and thanks for using Gigablast.
Spiders On Again
Nov 10, 2003
After updating the spider code I've reactivated the spiders. Gigablast should be able to spider at a faster rate with even less impact on query response time than before. So add your urls now while the adding's good.
Going For Speed
Nov 3, 2003
I've finally got around to working on Gigablast's distributed caches. It was not doing a lot of caching before. The new cache class I rigged up has no memory fragmentation and minimal record overhead. It is vurhy nice.
I've stopped spidering just for a bit so I can dedicate all Gigablast's RAM to the multi-level cache system I have in place now and see how much I can reduce query latency. Disks are still my main point of contention by far so the caching helps out a lot. But I could still use more memory.
Take Gigablast for a spin. See how fast it is.
Bring Me Your Meta Tags
Oct 11, 2003
As of now Gigablast supports the indexing, searching and displaying of generic meta tags. You name them I fame them. For instance, if you have a tag like <meta name="foo" content="bar baz"> in your document, then you will be able to do a search like foo:bar or foo:"bar baz" and Gigablast will find your document.
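As a rough illustration of how a `name:value` search over generic meta tags can work, here is a minimal in-memory sketch. The `MetaIndex` class, its methods, and the example URL are hypothetical, and the phrase match is a plain substring test rather than whatever Gigablast does internally.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a document."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content") is not None:
                self.tags[attrs["name"].lower()] = attrs["content"]

class MetaIndex:
    def __init__(self):
        self.docs = {}                 # url -> {meta name: content}
        self.terms = defaultdict(set)  # (meta name, word) -> urls

    def add(self, url, html):
        collector = MetaTagCollector()
        collector.feed(html)
        self.docs[url] = collector.tags
        for name, content in collector.tags.items():
            for word in content.lower().split():
                self.terms[(name, word)].add(url)

    def search(self, query):
        """Handle foo:bar (single word) and foo:"bar baz" (substring phrase match)."""
        name, _, value = query.partition(":")
        name = name.lower()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            phrase = value[1:-1].lower()
            return {u for u, tags in self.docs.items() if phrase in tags.get(name, "").lower()}
        return set(self.terms.get((name, value.lower()), set()))
```

Adding a page that contains `<meta name="foo" content="bar baz">` makes it findable with both `foo:bar` and `foo:"bar baz"`.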
You can tell Gigablast to display the contents of arbitrary meta tags in the search results, like this. Note that you must assign the dt cgi parameter to a space-separated list of the names of the meta tags you want to display. You can limit the number of returned characters of each tag to X characters by appending a :X to the name of the meta tag supplied to the dt parameter. In the link above, I limited the displayed keywords to 32 characters.
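A small sketch of how a dt value could be interpreted, assuming it has already been URL-decoded so that the names are separated by plain spaces. The helper names are hypothetical; only the `name:X` convention comes from the description above.

```python
def parse_dt(dt):
    """Parse a dt value such as 'keywords:32 description' into (name, limit) pairs.

    limit is None when no ':X' suffix is given.
    """
    fields = []
    for item in dt.split():
        name, sep, limit = item.partition(":")
        fields.append((name.lower(), int(limit) if sep and limit.isdigit() else None))
    return fields

def display_meta(tags, dt):
    """Pick the requested meta tags out of a document's tags, truncated as requested."""
    shown = {}
    for name, limit in parse_dt(dt):
        if name in tags:
            shown[name] = tags[name] if limit is None else tags[name][:limit]
    return shown
```

For example, `display_meta(tags, "keywords:32 description")` would return at most 32 characters of the keywords tag and the full description tag.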
Why use generic metas? Because it is very powerful. It allows you to embed custom data in your documents, search for it and retrieve it. Originally I wanted to do something like this in XML, but now my gut instincts are that XML is not catching on because it is ugly and bloated. Meta tags are pretty and slick.
Verisign Stops Destroying the Internet
Oct 11, 2003
Ok, they actually stopped about a week ago, but I didn't get around to posting it until now. They really ought to lose their privileged position so this does not happen again. Please do not stop your boycott. They have not learned from their mistakes.
Verisign Continues to Damage Gigablast's Index
September 30, 2003
When the Gigablast spider tries to download a page from a domain it first gets the associated robots.txt file for that domain. When the domain does not exist it ends up downloading a robots.txt file from verisign. There are two major problems with this. The first is that verisign's servers may be slow which will slow down Gigablast's indexing. Secondly, and this has been happening for a while now, Gigablast will still index any incoming link text for that domain, thinking that the domain still exists, but just that spider permission was denied by the robots.txt file.
So, hats off to you verisign, thanks for enhancing my index with your fantastic "service". I hope your company is around for many years so you can continue providing me with your great "services".
If you have been hurt because of verisign's greed you might want to consider joining the class-action lawsuit announced Friday, September 26th, by the Ira Rothken law firm.
Want to learn more about how the internet is run? Check out the ICANN movie page. Movie #1 portrays verisign's CEO, Stratton Sclavos, quite well in my opinion.
(10/01/03) Update #5: verisign comes under further scrutiny.
Verisign Redesigns the Internet for their Own Profit
September 24, 2003
My spiders expect to get "not found" messages when they look up a domain that does not have an IP. When verisign uses their privileged position to change the underlying fundamentals of the internet just to line their own greedy pockets it really, really perturbs me. Now, rather than get the "not found" message, my spiders get back a valid IP, the IP of verisign's commercial servers. That causes my spiders to then proceed to download the robots.txt from that domain. This can take forever if their servers are slow. What a pain. Now I have to fix my freakin' code. And that's just one of many problems this company has caused.
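One way a spider could work around a wildcard like this, sketched under assumptions: look up a random name that should not exist under the same top-level domain, and treat any domain that resolves to that same address as "not found". This is a guess at a possible fix, not the change actually made to Gigablast, and the function names are made up.

```python
import secrets
import socket

def resolve(hostname):
    """Return the IPv4 address for hostname, or None when the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return None

def wildcard_ip(tld, resolver=resolve):
    """Resolve a random name that should not exist.

    Any address returned is taken to be the registry's wildcard; None means
    the TLD answers "not found" as expected.
    """
    probe = "nx-" + secrets.token_hex(12) + "." + tld
    return resolver(probe)

def domain_exists(domain, resolver=resolve):
    """True only if the domain resolves to something other than the wildcard address."""
    ip = resolver(domain)
    if ip is None:
        return False
    tld = domain.rsplit(".", 1)[-1]
    return ip != wildcard_ip(tld, resolver)
```

A real crawler would cache the wildcard address per TLD instead of probing on every lookup; the `resolver` parameter is there so the logic can be exercised without touching the network.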
Please join me in boycott. I'm going to discourage everyone I know from supporting this abusive, monopolistic entity.
(9/22/03) Update #1: verisign responded to ICANN's request that they stop. See what the slashdot community has to say about this response.
(9/22/03) Update #2: ICANN has now posted some complaints in this forum.
(9/24/03) Update #3: Slashdot has more coverage.
(9/24/03) Update #4: Please sign the petition to stop verisign.
September 18, 2003
Gigablast now supports some special new meta tags that allow for constraining a search to a particular zipcode, city, state or country. Support was also added for the standard author, language and classification meta tags. This page explains more. These meta tags should be standard, everyone should use them (but not abuse them!) and things will be easier for everybody.
Secondly, I have declared jihad against stale indexes. I am planning a significantly faster update cycle, not to mention growing the index to about 400 million pages, all hopefully in the next few months.
Foiling the Addurl Scripts
September 6, 2003
The new pseudo-Turing test on the addurl page should prevent most automated scripts from submitting boatloads of URLs. If someone actually takes the time to code a way around it then I'll just have to take it a step further. I would rather work on other things, though, so please quit abusing my free service and discontinue your scripts. Thanks.
Boolean is Here
September 1, 2003
I just rolled out the new boolean logic code. You should be able to do nested boolean queries using the traditional AND, OR and NOT boolean operators. See the updated help page for more detail.
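For readers curious how nested boolean queries can be evaluated, here is a minimal recursive-descent sketch over sets of document ids. The precedence (NOT binds tightest, then AND, then OR), the uppercase operator keywords, and the `BooleanQuery` class are assumptions for illustration; the help page is the authority on the syntax Gigablast actually accepts.

```python
import re

def tokenize(query):
    """Split a query into parentheses and bare words."""
    return re.findall(r"\(|\)|[^\s()]+", query)

class BooleanQuery:
    def __init__(self, index, all_docs):
        self.index = index              # term -> set of doc ids
        self.all_docs = set(all_docs)   # needed to evaluate NOT

    def evaluate(self, query):
        self.tokens = tokenize(query)
        self.pos = 0
        result = self._or()
        if self.pos != len(self.tokens):
            raise ValueError("unexpected token: " + self.tokens[self.pos])
        return result

    def _peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _next(self):
        token = self._peek()
        if token is None:
            raise ValueError("unexpected end of query")
        self.pos += 1
        return token

    def _or(self):
        result = self._and()
        while self._peek() == "OR":
            self._next()
            result = result | self._and()
        return result

    def _and(self):
        result = self._not()
        while self._peek() == "AND":
            self._next()
            result = result & self._not()
        return result

    def _not(self):
        if self._peek() == "NOT":
            self._next()
            return self.all_docs - self._not()
        return self._atom()

    def _atom(self):
        token = self._next()
        if token == "(":
            result = self._or()
            if self._next() != ")":
                raise ValueError("expected ')'")
            return result
        if token in (")", "AND", "OR"):
            raise ValueError("unexpected token: " + token)
        return set(self.index.get(token.lower(), set()))
```

Each grammar level calls the next tighter one, so parentheses nest to any depth without extra bookkeeping.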
I have declared jihad against swapping and am now running the 2.4.21-rc6-rmap15j Linux kernel with swap tuned to zero using the /proc/sys/vm/pagecache knobs. So far no machines have swapped, which is great, but I'm unsure of this kernel's stability.
All Swapped Out
August 29, 2003
I no longer recommend turning the swap off, at least not on linux 2.4.22. A kernel panicked on me and froze a server. Not good. If anyone has any ideas for how I can prevent my app from being swapped out, please let me know. I've tried mlockall() within my app but that makes its memory usage explode for some reason. I've also tried Rik van Riel's 2.4.21-rc6-rmap15j.txt patch on the 2.4.21 kernel, but it still does unnecessary swapping (although, strangely, only when spidering). If you know how to fix this problem, please help!!! Here is the output from the vmstat command on one of my production machines running 2.4.22. And here is the output from my test machine running 2.4.21-rc6-rmap15j.txt.
August 28, 2003
I updated the Linux kernel to 2.4.22, which was just released a few days ago on kernel.org. Now my gigabit cards are working, yay! I finally had to turn off swap using the swapoff command. When an application runs out of memory the swapper is supposed to write infrequently used memory to disk so it can give that memory to the application that needs it. Unfortunately, the Linux virtual memory manager enjoys swapping out an application's memory for no good reason. This can often make an application disastrously slow, especially when the application ends up blocking on code that it doesn't expect to! And, furthermore, when the application uses the disk intensely it has to wait even longer for memory to get swapped back in from disk. I recommend that anyone who needs high performance turn off the swap and just make sure their program does not use more physical memory than is available.
The Gang's All Here
August 17, 2003
I decided to add PostScript (.ps), PowerPoint (.ppt), Excel spreadsheet (.xls) and Microsoft Word (.doc) support in addition to the PDF support. Woo-hoo.
August 14, 2003
Gigablast now indexes PDF documents. Try the search gbtype:pdf to see some PDF results. gbtype is a new search field. It also supports the text type, gbtype:text, and will support other file types in the future.
Minor Code Updates
July 17, 2003
I've cleaned up the keyword highlight routines so they don't highlight isolated stop words. Gigablast now displays a blue bar above returned search results that do not have all of your query terms. When returning a page of search results Gigablast lets you know how long ago that page was cached by displaying a small message at the bottom of that page. NOTE: This small message is at the bottom of the page containing the search results, not at the bottom of any pages from the web page cache. That is a different cache entirely. Numerous updates to less user-visible things on the back end. Many bugs fixed, but still more to go. Thanks a bunch to Bruce Perens for writing the Electric Fence debug utility.
June 20, 2003
I've recently released Gigablast 2.0. Right now Gigablast can do about twice as many queries per second as before. When I take care of a few more things that rate should double again.
The ranking algorithm now treats phrase weights much better. If you search for something like boots in the uk you won't get a bunch of results that have that exact phrase in them, but rather you will get UK sites about boots (theoretically). And when you do a search like all the king's men you will get results that have that exact phrase. If you find any queries for which Gigablast is especially bad, but a competing search engine is good, please let me know. I am very interested.
2.0 also introduced a new index format. The new index is half the size of the old one. This allows my current setup to index over 400 million pages with dual redundancy. Before it was only able to index about 300 million pages. The decreased index size also speeds up the query process since only half as much data needs to be read from disk to satisfy a query.
I've also started a full index refresh, starting with top level pages that haven't been spidered in a while. This is especially nice because a lot of pages that were indexed before all my anti-spam algorithms were 100% in place are just now getting filtered appropriately. I've manually removed over 100,000 spam pages so far, too.
My Take on Looksmart's Grub
Apr 19, 2003
There's been some press about Grub, a program from Looksmart which you install on your machine to help Looksmart spider the web. Looksmart is only using Grub to save on their bandwidth. Essentially Grub just compresses web pages before sending them to Looksmart's indexer, thus reducing the bandwidth they have to pay for by a factor of 5 or so. The same thing could be accomplished through a proxy which compresses web pages. Eventually, once the HTTP MIME standard for requesting compressed web pages is better supported by web servers, Grub will not be necessary.
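The bandwidth saving from compression is easy to try out. In HTTP a client asks for a compressed page by sending the request header "Accept-Encoding: gzip". The Python sketch below simply gzips a page body and compares sizes. The sample page is made up and far more repetitive than real HTML, so the ratio it produces says nothing about the "factor of 5" estimate above.

```python
import gzip

def compressed_size(page_bytes):
    """Return (original_size, gzip_size) in bytes for a page body."""
    return len(page_bytes), len(gzip.compress(page_bytes))

# Made-up, highly repetitive HTML; real pages compress less uniformly.
sample = (
    "<html><body>"
    + "<p>Gigablast search results row</p>" * 200
    + "</body></html>"
).encode("utf-8")

original, packed = compressed_size(sample)
ratio = original / packed  # how many times smaller the gzipped body is
```

Running the same function over a sample of real crawled pages would give a more honest estimate of the savings.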
Mar 25, 2003
I just rolled some significant updates to Gigablast's back-end. Gigablast now has a uniformly-distributed, unreplicated search results cache. This means that if someone has done your search within the last several hours then you will get results back very fast. This also means that Gigablast can handle a lot more queries per second.
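One way to read "uniformly-distributed, unreplicated" is that each cached result lives on exactly one machine, chosen by hashing the query. The Python sketch below shows that idea only. The choice of MD5, the query normalization and the name `cache_host` are assumptions, not a description of Gigablast's actual scheme.

```python
import hashlib

def cache_host(query, num_hosts):
    """Pick the single host responsible for caching this query's results."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(query.lower().split())
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    # A good hash spreads queries evenly, so every host holds a similar share.
    return int.from_bytes(digest[:8], "big") % num_hosts
```

Because the mapping is deterministic, any machine that receives a query can work out which host to ask for a cached copy, and no result is stored twice.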
I also added lots of debug and timing messages that can be turned on and off via the Gigablast admin page. This allows me to quickly isolate problems and identify bottlenecks.
Gigablast now synchronizes the clocks on all machines on the network so the instant add-url should be more "instant". Before I made this change, one machine would tell another to spider a new url "now", where "now" was actually a few minutes into the future on the spider machine. Now that all the clocks are synchronized, this will not be a problem anymore.
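The post does not say how the clocks are synchronized. One common approach, shown here purely as an assumed illustration in Python, is to estimate the offset between two clocks from a single request and reply, on the assumption that the network delay is the same in both directions.

```python
def estimate_offset(sent_at, remote_time, received_at):
    """Estimate how far the remote clock is ahead of the local clock, in seconds.

    sent_at and received_at are local clock readings taken just before the
    request and just after the reply; remote_time is the remote clock reading
    carried in the reply. Assumes equal network delay in each direction.
    """
    midpoint = (sent_at + received_at) / 2.0
    return remote_time - midpoint

def to_remote_time(local_time, offset):
    """Translate a local timestamp into the remote machine's clock."""
    return local_time + offset
```

With an offset in hand, a machine can either correct its own clock or translate a "spider this now" timestamp before sending it, so "now" means the same instant on both ends.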
There were about 100 other changes and bug fixes, minor and major, that I made, too, that should result in significant performance gains. My next big set of changes should make searches at least 5 times faster, but it will probably take several months until completed. I will keep you posted.
Feb 20, 2003
To combat downtime I wrote a monitoring program. It will send me a text message on my cellphone if Gigablast ever stops responding to queries. This should prevent extended periods of downtime by alerting me to the problem so I can promptly fix it.
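A monitor of this kind can be quite small. The Python sketch below is a guess at the general shape, not the actual program. `run_query` and `send_alert` are hypothetical callables that the caller supplies, for instance a function that issues a test search and one that hands a message to a text-message gateway.

```python
import time

def check_once(run_query, send_alert, timeout_seconds=10.0):
    """Run one probe query; alert and return False if it fails or is too slow."""
    started = time.monotonic()
    try:
        run_query()
    except Exception as error:
        send_alert("search engine not responding: %s" % error)
        return False
    elapsed = time.monotonic() - started
    if elapsed > timeout_seconds:
        send_alert("search engine slow: %.1f seconds" % elapsed)
        return False
    return True

def monitor(run_query, send_alert, interval_seconds=60.0, max_checks=None):
    """Probe repeatedly; max_checks=None means run until interrupted."""
    checks = 0
    while max_checks is None or checks < max_checks:
        check_once(run_query, send_alert)
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval_seconds)
```

A production version would also want to avoid sending the same alert every minute during a long outage, and should run on a machine outside the cluster it watches.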
Connectivity Problems. Bah!
Feb 14, 2003
I had to turn off the main refresh spiders a few weeks ago because of internet connectivity problems. Lots of pages were inaccessible or were timing out to the point that spider performance was suffering too much.
After running tcpdump in combination with wget I noticed that the FIN packets of some web page transfers were being lost or delayed for over a minute. The TCP FIN packet is typically the last TCP packet sent to your browser when it retrieves a web page. It tells your browser to close the connection. Once it is received the little spinning logo in the upper right corner of your browser window should stop spinning.
The most significant problem was, however, that the initial incoming data packet for some URLs was being lost or excessively delayed. You can get by without receiving FIN packets but you absolutely need these TCP "P" (push) packets, the ones tcpdump marks with a P flag. I've tested my equipment and my ISP has tested their equipment and we have both concluded that the problem is upstream. Yesterday my ISP submitted a ticket to Worldcom/UUNet. Worldcom's techs have verified the problem and thought it was... "interesting".
I personally think it is a bug in some filtering or monitoring software installed at one of Worldcom's NAPs (Network Access Points). NAPs are where the big internet providers interface with each other. The most popular NAPs are in big cities, the Tier-1 cities, as they're called. There are also companies that host NAP sites where the big carriers like Worldcom can install their equipment. The big carriers then set up Peering Agreements with each other. Peering Agreements state the conditions under which two or more carriers will exchange internet traffic.
Once you have a peering agreement in place with another carrier then you must pay them based on how much data you transfer from your network to their network across a NAP. This means that downloading a file is much cheaper than uploading a file. When you send a request to retrieve some information, that request is small compared to the amount of data it retrieves. Therefore, the carrier that hosted the server from which you got the data will end up paying more. Doh! I got off the topic. I hope they fix the problem soon!
Jan 10, 2003
I'm now looking into serving text advertisements on top of the search results page so I can continue to fund my information retrieval research. I am also exploring the possibility of injecting ads into some of my xml-based search feeds. If you're interested in a search feed I should be able to give you an even better deal provided you can display the ads I feed you, in addition to any other ads you might want to add. If anyone has any good advice concerning what ad company I should use, I'd love to hear it.
Dec 27, 2002
After a brief hiatus I've restarted the Gigablast spiders. The problem was that they were having a negative impact on the query engine's performance, but now all spider processing yields computer resources to the query traffic much more readily. The result is that the spidering process only runs in the space between queries. This actually involved a lot of work. I had to insert code to suspend spider-related network transactions and cancel disk-read and disk-write threads.
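The idea of spidering "in the space between queries" can be pictured as a two-level priority queue in which query work always runs first. The Python sketch below is a simplified model with assumed names (`Scheduler`, `QUERY`, `SPIDER`). It does not show the harder part described above, which is suspending network transactions and cancelling disk threads that are already in flight.

```python
import heapq
import itertools

QUERY, SPIDER = 0, 1  # lower number runs first

class Scheduler:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # keeps FIFO order within a priority

    def submit(self, priority, name):
        """Queue a task; query tasks outrank spider tasks."""
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_task(self):
        """Return the name of the next task to run, or None when idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Spider tasks submitted before a query still wait until every pending query has been served.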
I've also launched my Gigaboost campaign. This rewards pages that link to gigablast.com with a boost in the search results rankings. The boost is only utilized to resolve ties in ranking scores so it does not taint the quality of the index.
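A boost that only resolves ties amounts to a secondary sort key. Here is a minimal Python sketch, assuming each result is a (url, score, links_to_gigablast) tuple. That layout and the function name `rank` are made up for illustration.

```python
def rank(results):
    """Sort by score, highest first; the boost only orders results with equal scores.

    Each result is a (url, score, links_to_gigablast) tuple.
    """
    # not r[2] is False (sorts first) for pages that link to gigablast.com.
    return sorted(results, key=lambda r: (-r[1], not r[2]))
```

A page with a lower score never overtakes one with a higher score, no matter whether it carries the boost.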
Gigablast.nu, in Scandinavia, now has a news index built from news sources in the Scandinavian region. It is not publicly available just yet because there are still a few details we are working out. I've also added better duplicate detection and removal. It won't be very noticeable until the index refresh cycle completes. In addition, Gigablast now removes session ids from urls, but this only applies to new links; urls already in the index will be fixed retroactively at a later date. There is also a new summary generator installed. It's over ten times faster than the old one. If you notice any problems with it please contact me. As always, I appreciate any constructive input you have to give.
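Removing session ids matters because the same page reached through two sessions otherwise looks like two different urls. The Python sketch below shows one way to do it with the standard library. The list of parameter names is an assumption, since the post does not say which ones Gigablast strips.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of session parameter names, for illustration only.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "sid"}

def strip_session_id(url):
    """Return the url with common session-id parameters removed."""
    parts = urlsplit(url)
    # Drop ";jsessionid=..." style parameters embedded in the path.
    path = re.sub(r";jsessionid=[^/?#;]*", "", parts.path, flags=re.IGNORECASE)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))
```

Note that `urlencode` re-encodes the remaining parameters, so unusual characters in the query string may come back escaped differently than they went in.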
Data Corruption Mysteries
Dec 20, 2002
I've been having problems with my hard drives. I have a bunch of Maxtor 160GB drives (Model # = 4G160J8) running on Linux 2.4.17 with the 48-bit LBA patch. Each machine has 4 of these drives in it, 2 on each IDE slot. I've had about 160 gigabytes of data on one before so I know the patch seems to do the job. But every now and then a drive will mess up a write. I do a lot of writing and it usually takes tens of gigabytes of writing before a drive does this. It writes out about 8 bytes that don't match what should have been written. This causes index corruption and I've had to install work-arounds in my code to detect and patch it.
I'm not sure if the problem is with the hard drive itself or with Linux. I've made sure that the problem wasn't in my code by doing a read after each write to verify. I thought it might be my motherboard or CPU. I use AMDs and Giga-byte motherboards. But gigablast.nu in Sweden has the same problem and it uses a Pentium 3. Furthermore, gigablast.nu uses a RAID of 160GB Maxtors, whereas gigablast.com does not. Gigablast.nu uses version 2.4.19 of Linux with the 48-bit LBA patch. So the problem seems to be with Linux, the LBA patch or the hard drive itself.
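The read-after-write check mentioned above looks roughly like this in Python. This is a sketch of the general technique, not the actual Gigablast code, and `write_verified` is a made-up name. One caveat: the read-back may be served from the operating system's page cache rather than from the disk itself, so a match does not prove the bytes on the platter are right.

```python
import os

def write_verified(path, offset, data):
    """Write data at offset in an existing file, then read it back and compare.

    Returns True when the bytes read back match what was written.
    """
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the write to the device
        f.seek(offset)
        return f.read(len(data)) == data
```

A stricter test would bypass or drop the cache before reading, or verify the data again later from a checksum stored alongside it.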
On top of all this mess, about 1 Maxtor, out of the 32 I have, completely fails on me every 4 months. The drive just gives I/O errors to the kernel and brings the whole system down. Luckily, gigablast.com implements a redundant architecture so the failing server will be replaced by its backup. So far Maxtor has replaced the drives I had fail. If you give them your credit card number they'll even send the replacements out in advance. But I believe the failure problem is an indicator that the data corruption problem is hard drive related, not Linux related. If anyone has any insight into this problem please let me know. You could quite easily be my hero.
If you're still reading this you're pretty hard core so here's what /var/log/messages says when the 4G160J8 completely fails.
Personal Video Recorders (PVRs)
Dec 20, 2002
Boy, these things are great. I bought a Tivo last year for my wife and she loved it. At first though she wasn't that enthusiastic because she wasn't very familiar with it. But now we rarely rent any more video tapes from Blockbuster or Hollywood Video because there's always something interesting to watch on the Tivo. You just let it know what shows you like and it will record them anytime they come on. We always have an overflow of Simpsons and Seinfeld episodes on there.
In the future though I don't think Tivo is going to make it. The reason? Home networking. Because I'm a professional computer person, we already have a home network installed. If the TV had an ethernet jack it would be in our network. 100Mbps is fast enough to send it a high-quality video stream from the computers already on the network. I have a cable modem which, in the future, should allow the computer using it to rip signals from the cable station, as well. For now though, you could split your cable and plug the new end into a tuner card on your PC. So once someone comes out with a small device for the television that converts an ethernet-based mpeg stream to a video signal we can use our home PC to act as the Tivo. This device should be pretty cheap, I'd imagine around $30 or so. The only thing you'd need then is a way to allow the remote control to talk to your PC.
Now I read about the EFF suing "Hollywood" in order to clarify consumer rights of fair use. Specifically, the EFF was said to be representing Replay TV. Hey! Isn't Replay TV owned in part by Disney (aka Hollywood)... hmmmm... Seems like Disney might have pretty good control over the outcome of this case. I think it's a conflict of interest when such an important trial, which would set a precedent for many cases to come, has the same plaintiff as defendant.
This makes me wonder about when Disney's Go.com division got sued by Overture (then known as Goto.com) for logo infringement. Disney had to pay around 20 million to Overture. I wonder what kind of ties Disney had to Overture. Ok, maybe I'm being a conspiracy theorist, so I'll stop now.
ECS K7S5A Motherboard Mayhem
Dec 20, 2002
I pinch pennies. When I bought my 8 servers I got the cheapest motherboards I could get for my AMD 1.4GHz Athlon T-Birds. At the time, in late January 2002, they turned out to be the K7S5A's. While running my search engine on them I experienced lots of segmentation faults. I spent a couple of days poring over the code wondering if I was tripping out. It wasn't until I ran memtest86 at boot time (run by lilo) that I found memory was being corrupted. I even tried new memory sticks to no avail. Fortunately I found some pages on the web that addressed the problem. It was the motherboard. It took me many hours to replace them on all 8 servers. I don't recommend ECS. I've been very happy with the Giga-byte motherboards I have now.