Factbites
 Where results make sense
About us   |   Why use us?   |   Reviews   |   PR   |   Contact us  

Topic: Nutch


In the News (Mon 21 Dec 09)

  
  Nutch - Wikipedia, the free encyclopedia
Nutch is an effort to build an open source search engine.
Nutch has a highly modular architecture allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
As of June 2005, Nutch has graduated from the Apache Incubator, and is now a subproject of Lucene.
en.wikipedia.org /wiki/Nutch   (161 words)

  
 Project searches for open-source niche | Tech News on ZDNet
Nutch itself has been operating secretly for roughly a year, gathering support from developers and funding from one of the biggest commercial players in search: Overture Services.
Nutch is actively seeking funding for hardware that would support traffic from Web surfers, but for now, its systems do not have the capacity to handle an influx of visitors.
Nutch is an alternative test bed for the company's use, she said.
news.zdnet.com /2100-3513_22-5064913.html   (1296 words)

  
 John Battelle's Searchblog: #5: Nutch Presages a New Kind of Search Engine
Because of this, anyone will be able to access Nutch's code and use it to their own ends, without paying licensing fees or hewing to a particular company's set of rules.
But Cutting says they hope that once Nutch is loosed on the world, tinkerers from Romania to China to Palo Alto will help build it into a robust platform, in the spirit of Linux or Apache (which has garnered more than 60 percent of the Web-server software market in just the last couple of years).
Nutch is moving its servers to Kahle's high-bandwidth location this weekend, a crucial step toward readying the engine for its public debut.
battellemedia.com /archives/000138.php   (1009 words)

  
 An Open-Source Search Engine (Nutch) Takes Shape - Article by Tech News World
Whether Nutch will be able to penetrate this market remains to be seen.
Ironically, according to Winfield, one potential problem with Nutch's approach could be the very element that the organization seems most proud of -- the fact that people will be able to see the exact formula for determining results.
"Nutch will be an interesting one to watch," he said, pointing out that an open-source search technology has the capacity to shake things up in the coming months.
www.10e20webdesign.com /news_center_press_coverage_tech_news_world.htm   (819 words)

  
 Lucene and Nutch   (Site not responding. Last check: 2007-09-07)
Nutch is a nascent effort to implement an open-source internet search application.
Nutch also provides a scalable, high-quality search application for intranets, and a platform for research.
He is also the principal architect of Nutch, an open source web search application.
www.wgrosso.com /Articles/EmergingTechnology/LuceneAndNutch.html   (160 words)

  
 Welcome to Nutch!
Nutch has now graduated from the Apache incubator, and is now a Subproject of Lucene.
We have now determined that the Apache license is the appropriate license for Nutch and no longer require the overhead of an independent non-profit organization.
Nutch's board of directors and its developers were both polled and supported the move to the Apache foundation.
www.nutch.org   (225 words)

  
 FAQ - Nutch Wiki   (Site not responding. Last check: 2007-09-07)
We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.
The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you have to change config files to get it to crawl your local disk.
By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes).
wiki.apache.org /nutch/FAQ   (1818 words)

  
 SeoPapers - article The Evolution of Search   (Site not responding. Last check: 2007-09-07)
Nutch is a two-year-old open source project, which has been hosted previously at SourceForge and backed by a non-profit organization.
Nutch builds on Lucene technology, which was developed under the watchful eye of Doug Cutting, the primary developer for both of these open source projects.
The primary developer of Nutch, Doug Cutting, feels that the closed-source advantage is not nearly as much of a factor as one might imagine it to be.
seopapers.com /article/277   (971 words)

  
 Nutch: developer information   (Site not responding. Last check: 2007-09-07)
Check the nutch developers mailing list to see if anyone is already working on what you are interested in working on.
Once you've done some, submit the diffs to the nutch developers mailing list, or attach them to a bug report.
Nutch needs contributions in the following areas (among others).
nutch.sourceforge.net /docs/en/developers.html   (344 words)

  
 Google's Greatest Threat - Open Source | Threadwatch.org
Nutch is an open source search engine crawler, indexer, etc. The project appears to have been a bit dormant since its first media splash a few years ago, but has just recently become incubated with the Apache Software Foundation.
I've not played with Nutch but i understand it's easy enough to set up and get running so i may have to see if there's a Gentoo ebuild for it heh...
Google came under heavy fire recently for not giving back to the OS community when they owe so much of their success to it.
www.threadwatch.org /node/1469   (454 words)

  
 TP: Nutch: The Free Search Alternative to Google
Nutch is software that one can download in order to deploy a web search engine.
The goal is for Nutch to be both easy to use for intranets and niches, while at the same time scaling to complex whole-web deployments.
Nutch is like the Apache foundation: we have no employees, and have a legal entity (a non-profit corporation) primarily to own the copyright, so that the project is independent from its individual developers.
www.heise.de /tp/r4/artikel/17/17593/1.html   (1615 words)

  
 IT Manager's Journal | Why the future of search may be open source   (Site not responding. Last check: 2007-09-07)
Nutch is an attempt to write search engine software that is open source and which displays information about how rankings are given.
Nutch's goal is open algorithms that transcend garden-variety manipulators.
Nutch, like Linux, would essentially be a clone of "the other guys" but would be open, and would hopefully offer the same incremental improvement seen in projects like Apache.
software.itmanagersjournal.com /software/04/01/23/212253.shtml   (816 words)

  
 Nutch: about
Nutch is a nascent effort to implement an open-source web search engine.
Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible.
Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine.
nutch.sourceforge.net /docs/en   (227 words)

  
 About :: Search :: Oregon State University
The migration to Nutch was initiated to improve flexibility and extensibility.
Nutch is an emerging open source project aimed at creating a non-biased, optimized search solution.
Nutch is a web application, so as our search database grows, so can the hardware that powers it.
search.oregonstate.edu /about   (593 words)

  
 Nutch   (Site not responding. Last check: 2007-09-07)
Nutch has a highly modular architecture allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
As of 2003 it is completely coded in Java, but data is written in language-independent formats.
Nutch at ObjectsSearch.com - A working implementation of Nutch at ObjectsSearch.com
pedia.newsfilter.co.uk /wikipedia/n/nu/nutch.html   (152 words)

  
 Archive Crawler Wiki: NutchSearchingArcs
Nutch webapp has hardcoding that says only show two pages from a particular site in a page of hits (See src/web/jsp/search.jsp).
Disadvantage is we'd have to move nutch to the ARCs rather than have nutch ask the cluster for an ARC entry (We might then merge the nutch indexes all into one large index).
Nutch population is done by putting a list of urls into db using 'inject'.
crawler.archive.org /cgi-bin/wiki.pl?NutchSearchingArcs   (768 words)
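
The per-site cap described in the snippet above (at most two pages from one site in a page of hits) can be illustrated with a small sketch. This is not the code in search.jsp; the function name `limit_hits_per_site`, the `max_per_site` parameter, and the example URLs are all hypothetical, and "site" is assumed here to mean the URL's host.

```python
from urllib.parse import urlparse


def limit_hits_per_site(hits, max_per_site=2):
    """Keep at most max_per_site hits per host, preserving rank order.

    Hypothetical helper: 'site' is assumed to be the host part of the URL.
    """
    seen = {}   # host -> number of hits already kept
    kept = []
    for url in hits:
        host = urlparse(url).netloc
        count = seen.get(host, 0)
        if count < max_per_site:
            kept.append(url)
            seen[host] = count + 1
    return kept


# Made-up ranked result list, best hit first.
ranked = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
limited = limit_hits_per_site(ranked)
print(limited)
```

In this sketch the third hit from a.example is dropped while the lower-ranked hit from b.example is kept, which is the effect the snippet describes.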

  
 Tom White's Blog: MapReduce
In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce function that recombines the intermediate representation into the final output.
Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project.
Nutch MapReduce may not be finished, but most of the major pieces seem to be in place, so it is only a matter of time before this exciting and powerful tool sees wider adoption.
weblogs.java.net /blog/tomwhite/archive/2005/09/mapreduce.html   (992 words)
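
The map and reduce operations described in the snippet above can be sketched in a single process. This is only an illustration of the idea, not the Nutch MapReduce API: the names `map_phase`, `reduce_phase`, `count_words`, and `sum_counts` are hypothetical, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(documents, map_fn):
    """Apply map_fn to every input and group the emitted values by key."""
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return intermediate


def reduce_phase(intermediate, reduce_fn):
    """Recombine each key's list of values into one output value."""
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def count_words(doc):
    """Map: emit (word, 1) for every word in the document."""
    for word in doc.lower().split():
        yield word, 1


def sum_counts(key, values):
    """Reduce: add up the per-occurrence counts for one word."""
    return sum(values)


# Made-up input documents.
docs = ["nutch uses lucene", "nutch crawls the web"]
counts = reduce_phase(map_phase(docs, count_words), sum_counts)
print(counts["nutch"])
```

In a distributed setting, the map calls and the per-key reduce calls would run on different machines; the grouping step in `map_phase` stands in for the shuffle between them.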

  
 media style
Nutch is an open-source alternative to commercial web search engine software.
The Nutch community was supported with corporate identity, a character and interface designs made by media style.
Furthermore, media style is an active code contributor to the Nutch project.
www.media-style.com /index.jsp?folderPK=265&action=&   (77 words)

  
 TP: Nutch: The Free Search Alternative to Google
Nutch can create similar economics for search engines.
By providing a platform for research, Nutch also enables more researchers to make advances in search technology.
Nutch is like the Apache foundation: we have no employees, and have a legal entity (a non-profit corporation) primarily to own the copyright, so that the project is independent from its individual developers.
www.heise.de /tp/deutsch/special/wos/17592/1.html   (1544 words)

  
 supermind.org - Lucene Nutch consulting » programming
The rationale behind this is to be able to write to a single Nutch segment, instead of requiring an x-way post-fetch segment merge where x is the number of fetcher threads (I haven’t put much thought into this).
Unlike Nutch where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, which are in turn a sequence of URLs with the same host.
Nutch, as it is, does not utilize HTTP 1.1’s connection persistence and request pipelining features, which would significantly cut down crawling time when crawling a small number of hosts extensively.
www.supermind.org /index.php?cat=5   (3104 words)
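
The fetch-list layout described in the snippet above (a sequence of host queues, each holding URLs that share a host) can be sketched as follows. The function name `build_host_queues` and the example URLs are hypothetical; this is not the blog author's or Nutch's code, only one possible reading of the data structure.

```python
from urllib.parse import urlparse


def build_host_queues(urls):
    """Group a flat URL list into per-host queues.

    Returns a list of (host, [urls]) pairs in order of first appearance,
    so that all URLs for one host sit together.
    """
    queues = {}  # dicts preserve insertion order in Python 3.7+
    for url in urls:
        host = urlparse(url).netloc
        queues.setdefault(host, []).append(url)
    return list(queues.items())


# Made-up flat fetch list, as a plain sequence of URLs.
flat_fetch_list = [
    "http://a.example/index.html",
    "http://b.example/about",
    "http://a.example/docs",
]
host_queues = build_host_queues(flat_fetch_list)
print(host_queues)
```

Grouping by host means one worker could handle a whole queue over a single connection, which is where the HTTP 1.1 persistence mentioned in the snippet would pay off.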

  
 CIOL : News : “All search engines are biased”
Targeting Google, Nutch, in an effort to implement an open-source web search engine says, "Today's oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain.
Nutch claims that all existing, major search engines, have proprietary ranking formulas, and will not explain why a given page ranks as it does.
Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible," states the organization.
www.ciol.com /content/news/2003/103081408.asp   (625 words)

  
 Doug Cutting Interview
Nutch aims to scale from simple intranet searching to search of the entire web, like Google and Yahoo!.
Nutch requires a Java servlet container, which some ISPs support, but most do not.
I believe Google does roughly what Nutch does: they broadcast queries to a number of nodes, each of which returns the top results over a set of pages.
blog.outer-court.com /archive/2004_05_28_index.html   (1618 words)
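
The broadcast-and-merge scheme mentioned in the interview snippet above can be sketched in a few lines. Everything here is an assumption made for illustration: the names `search_node` and `broadcast_search`, the in-memory "nodes", and the toy score (how often the query term occurs in a page). It is not how Nutch or Google actually score or distribute queries.

```python
import heapq


def search_node(pages, query, k):
    """Return this node's top-k (score, doc_id) pairs for a one-word query.

    Toy scoring assumption: score = number of occurrences of the query term.
    """
    scored = (
        (text.lower().split().count(query), doc_id)
        for doc_id, text in pages.items()
    )
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def broadcast_search(nodes, query, k):
    """Send the query to every node, then merge the partial top-k lists."""
    partial = [hit for pages in nodes for hit in search_node(pages, query, k)]
    return heapq.nlargest(k, partial)


# Two made-up nodes, each holding its own set of pages.
nodes = [
    {"a": "nutch nutch search", "b": "lucene"},
    {"c": "nutch", "d": "nutch nutch nutch"},
]
top = broadcast_search(nodes, "nutch", 2)
print(top)
```

Because each node returns only its own top k, the merge step sees at most k results per node, regardless of how many pages the node holds.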

  
 find23: jmx & nutch   (Site not responding. Last check: 2007-09-07)
I'm working to put nutch on top of a jmx layer and I will give some more details about my work here.
The first step to get nutch on top of jmx would be to write a set of adaptors around the existing nutch command line tools.
Actually, deployment of the nutch MBeans is done by scanning a folder that contains a set of XML deployment descriptor files.
www.find23.net /2004/07/jmx-nutch.html   (822 words)

  
 Erik Hatcher's Blog: Nutch - Google in a JAR
Nutch has made a big splash the past couple of days, first with an article in Business 2.0 (sorry, the full article requires subscription) and then with the inevitable /.
Nutch uses Lucene under the covers for indexing and searching.
Your use of this web site or any of its content or software indicates your agreement to be bound by these Terms of Participation.
weblogs.java.net /pub/wlg/367   (127 words)

  
 SpiritCompany - A Blog by Berlin Brown: Nutch and Experience   (Site not responding. Last check: 2007-09-07)
Nutch and Experience · 1 September 05 · [permalink]
Where searcher.dir is the home of my nutch search path, but that didn't help much.
Also, ROOT is actually needed because it looks like nutch is a little greedy on its use of Tomcat.
newspiritcompany.com /blog/article/193/nutch-and-experience   (195 words)

  
 Ranking HTML web documents using IR techniques and PageRank   (Site not responding. Last check: 2007-09-07)
Nutch [Nutch] is an open source search engine which uses lucene as a back-end search API.
The nutch code at plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java was hacked so that query objects will have H field matches in it as shown in the examples below.
Tuning of search results: Nutch's method of assigning an overall score to a document, given in Section 3.2.3, uses PageRank (link analysis score) and IR score.
www.cs.utexas.edu /users/vgupta/lsdm/report1.html   (3561 words)

  
 DaveNet : Nutch, an open source search engine   (Site not responding. Last check: 2007-09-07)
I first heard about Nutch at a lunch in Cambridge last week with John Battelle, a visiting professor at UC Berkeley, and former editor of the high-flying dotcom journal The Industry Standard.
He has been watching Nutch in his role as a columnist for Business 2.0.
After the lunch I did a Google search for Nutch, found it; confirming some but not all of what Battelle told me. So I linked to it on Scripting News and waited to see what would come next.
davenet.scripting.com /2003/08/13/nutchAnOpenSourceSearchEngine   (423 words)

  
 Nutch - Experiences and Findings :: SearchGuild.com   (Site not responding. Last check: 2007-09-07)
Hi All, I have begun what will be a rather serious investigation of the Nutch web crawler and I would like to share my experiences and findings with others who have done the same sort of thing.
At the moment I have been testing Nutch using urls from the dmoz file.
In fact I am thinking of writing some code which will capture the urls with errors and reintroduce them into the url fetch list to be re-spidered until they have failed a number of times in which case they will be discarded.
google.searchguild.com /tpage19919-0.html   (252 words)
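
The retry idea in the forum post above (put failed URLs back on the fetch list until they have failed a set number of times, then discard them) can be sketched as below. The function `fetch_with_retries`, the `max_failures` parameter, and the `fake_fetch` stand-in are hypothetical; the poster's own code is not shown in the source, and no real network access is involved.

```python
from collections import deque


def fetch_with_retries(urls, fetch, max_failures=3):
    """Fetch each URL, re-queueing failures until max_failures is reached."""
    queue = deque(urls)
    failures = {}              # url -> number of failed attempts so far
    fetched, discarded = [], []
    while queue:
        url = queue.popleft()
        try:
            fetch(url)
            fetched.append(url)
        except Exception:
            failures[url] = failures.get(url, 0) + 1
            if failures[url] >= max_failures:
                discarded.append(url)   # give up on this URL
            else:
                queue.append(url)       # put it back to be re-spidered
    return fetched, discarded


# Simulated fetcher: 'bad' always fails, 'flaky' fails on its first attempt only.
attempts = {}


def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if "bad" in url or ("flaky" in url and attempts[url] == 1):
        raise IOError("simulated fetch error")


fetched, discarded = fetch_with_retries(
    ["http://ok.example/", "http://flaky.example/", "http://bad.example/"],
    fake_fetch,
)
print(fetched, discarded)
```

Re-queueing at the back of the list, rather than retrying immediately, gives a temporarily unavailable server some time before the next attempt.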

  
 John Battelle's Searchblog: Nutch Update
The recent announcement of Mozdex, which is leveraging the Nutch open source engine, reminded me to ping Doug Cutting and see how things were going with the Nutch project.
But Doug said that his focus these days with Nutch is not to try to get a major, open source alternative to Google or Yahoo out there, though that remains a long term goal.
I'll be contributing as much as time permits to Nutch because I see it as an equalizer for the small guy.
battellemedia.com /archives/000598.php   (520 words)

  
 Bernhard Seefeld's Blog: Nutch   (Site not responding. Last check: 2007-09-07)
Nutch is hosted by the Internet Archive and is backed by Mitchell Kapor (of Lotus and OSAF fame), Tim O'Reilly (of O'Reilly & Associates) and others.
Apart from the political arguments, I think that such an open source web search engine will successfully attract contributions because a web search engine is a tool that is used intensively and diversely by software engineers (hackers hack hack tools).
Excerpt: Nutch: Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing.
www.bernhardseefeld.ch /archives/000053.html   (799 words)

Copyright © 2005-2007 www.factbites.com Usage implies agreement with terms.