Hemenway K., Calishain T. — Spidering Hacks
Title: Spidering Hacks
Authors: Hemenway K., Calishain T.
Abstract: The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you. Spidering Hacks takes you to the next level in Internet data retrieval, beyond search engines, by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented: you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you. Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:
* Aggregate and associate data from disparate locations, then store and manipulate the data as you like
* Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
* Integrate third-party data into your own applications or web sites
* Make your own site easier to scrape and more usable to others
* Keep up to date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Language:
Category: Technology
Subject index status: Index with page numbers is complete
Year of publication: 2003
Number of pages: 424
Added to catalog: 15.06.2007
Subject index
$browser object
.m3u files
.txt files
Aas, Gisle
AbleShoppers
absolute URLs
Accept-Charset headers
Accept-Encoding HTTP header
Accept-Language headers
Acceptable Use Policies
Acceptable use policy (AUP)
accessing particular
accessing particular URLs
account
across multiple domains using Google
across multiple sites for authors
ActiveState's ActivePerl
adding to request
advanced applications and wget utility
advanced techniques
advertisers and
advertisers and geotargeting
aggregating [See aggregating data]
aggregating data 2nd
aggregating entries from multiple
aggregating from multiple engines
aggregators
AIM (AOL Instant Messenger)
alert for new Amazon.com product reviews
Alexa
All Consuming 2nd
AlltheWeb.com 2nd 3rd
AlltheWeb.com sample
AltaVista
Amazon.com
America at Work, America at Leisure project
America at Work, America at Leisure project (Library of Congress)
Ampache
AmphetaDesk 2nd 3rd
anatomy of
Andromeda 2nd
announcing to world
AOL Instant Messenger (AIM)
Apache
Apache::MP3 [See Apache::MP3 module]
Apache::MP3 module 2nd
API
API developer's key
Apis
arbitrary
arbitrary classification systems
architectural style
archiving messages
archiving with yahoo2mbox
archiving Yahoo! Groups messages
Artymiak, Jacek (contributor)
as Perl module
ASIN 2nd
ASIN (Amazon.com Standard Identification Number) 2nd
Associates account
Associates sales statistics, publishing
associative data
attachments, saving only POP3
Audioscrobbler
AUP (Acceptable Use Policy)
Authentication
authors, searching across multiple sites for
automating
automating tasks
Ball, Chris (contributor)
Bandwidth
banking online
Bausch, Paul (contributor)
BBC's Radio Times
beginning process
Ben's Bargains
Benson, Erik (contributor)
Berkman Center for Internet & Society at Harvard Law School
Best practices
best practices for spidering
Better Business Bureau
Bidder's Edge sued by eBay
Biddle, Daniel (contributor)
bio
Blagg
Blawg Search
blog neighborhoods
Blogger 2nd
blogrolls
Blogs [See also weblogs]
Blosxom
book metadata and weblog mentions
BOTs [See spiders]
Boundary data
branding another site's data
Bregenzer, Adam (contributor)
browser attributes
Buffy the Vampire Slayer
Bugtraq reports, reformatting
Burke, Sean (contributor)
by keyword
CAIDA project 2nd
calculating distance
calculating mindshare
Calishain, Tara (author)
cd-discid program
CdS
chaining commands
change notification through email
characters, special
checking for new comments
checks on keywords
clarifying
classification numbers, unique
classification system
classification systems
clustered and related results
clustered search results
Code
Combined Log Format
combining information from FreeDB and
combining related information with other
comics
comics, downloading
Competitive intelligence
Compress::Zlib
Compress::Zlib module
Compressed
compressed data
consequences of violating
considering
contacting sites about your spider
Content [See also data]
cookie jar
cookies, enabling
Copyfight, the Politics of IP web site
copyright, violating
Cosmos [See Link Cosmos]
Cozens, Simon (contributor)
CPAN (Comprehensive Perl Archive Network) 2nd 3rd
CPAN module (Perl)
creating web site for
cron
Crone
cURL utility
cursors, rotating
customer advice
customer advice, scraping
dailystrips
DATA
databases
daterange: syntax
DaylightStation.com
Daypop
Developer Wiki
Developer's Token
Dewey Decimal system
dict protocol
DICT.org server
Dictionaries
diff utility (GNU)
difference between scrapers and
Directi
directories
directory indexes
directory, calculating mindshare
disc ID
discussion groups [See also Yahoo! Groups] 2nd
disobey.com
distance calculating, geographic
Dive Into Mark
DMOZ (Open Directory Project)
DNS lookup
Dornfest, Rael (contributor)
downloading
downloading images from Webshots
downloading movies from
Dynamic MP3 Lister
Eastler, William (contributor)
eBay's lawsuit against Bidder's Edge
EchoCloud 2nd
Edna
EIN (Employer Identification Number)
Electronic Freedom Foundation
Electronic Frontier Foundation
Email
email alert for new
error checking
ETag HTTP header
European train connections, finding faster
example
example template
Fake Cron
Faking a Referer
Fallin, Scott (contributor)
Fark
Farscape
Fastcron
Favorites tree
FedEx, tracking packages with
feed ID 2nd
FeedDemon
Feeds
fetching
fetching with
File::Spec
File::Spec module
Files 2nd
Filtering
Finance::Bank::HSBC
Finance::Bank::HSBC module
Finance::QIF
Finance::QIF module
Finance::Quote
Finance::Quote module
finding related sites using
finding related sites using RSS feeds
FireWire HD and
FireWire HD and iPods
FishHoo! fishing search engine
Folder
fopen( ) function
for requesting the hourly and weekly most-mentioned lists
for retrieving categorized books
for spidering
form data, posting with LWP
form, posting with LWP
Forums
Framing
framing data
FreeDB project
freshmeat.net
freshmeat.net sample
friends and recommendations
from Usenet with nget
from webcams
from Webshots
functionalities, combining
gaining access to CD device
Gamegrene.com
GameStop.com prices
GameStop.com, spidering prices
gathering
gathering tools
Geo::Distance
Geo::Distance module
geographic distance calculating
geotargeting
Getopt::Std
Getopt::Std module
getting for each book
gleaning data from
gleaning data from databases and information collections
GNUMP3d
Google
Google API sample
grabbing [See fetching]
graphing
graphing data with
graphing data with RRDTOOL
graphing Sales Rank
grep command
GuideStar
Hack #10, More Involved Requests with LWP::UserAgent
Hack #11, Adding HTTP Headers to Your Request
Hack #12, Posting Form Data with LWP
Hack #13, Authentication, Cookies, and Proxies
Hack #14, Handling Relative and Absolute URLs
Hack #15, Secured Access and Browser Attributes
Hack #17, Respecting robots.txt
Hack #19, Scraping with HTML::TreeBuilder
Hack #20, Parsing with HTML::TokeParser
Hack #21, WWW::Mechanize 101
Hack #22, Scraping with WWW::Mechanize
Hack #23, In Praise of Regular Expressions
Hack #24, Painless RSS with Template::Extract
Hack #25, A Quick Introduction to XPath
Hack #27, More Advanced wget Techniques
Hack #28, Using Pipes to Chain Commands
Hack #29, Running Multiple Utilities at Once
Hack #30, Utilizing the Web Scraping Proxy
Hack #36, Downloading Images from Webshots
Hack #38, Archiving Your Favorite Webcams
Hack #44, Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Hack #45, Gleaning Buzz from Yahoo!
Hack #46, Spidering the Yahoo! Catalog
Hack #50, Weblog-Free Google Results
Hack #52, Scraping Amazon.com Product Reviews
Hack #53, Receive an Email Alert for Newly Added Amazon.com Reviews
Hack #54, Scraping Amazon.com Customer Advice
Hack #55, Publishing Amazon.com Associates Statistics
Hack #56, Sorting Amazon.com Recommendations by Rating
Hack #57, Related Amazon.com Products with Alexa
Hack #58, Scraping Alexa's Competitive Data with Java
Hack #59, Finding Album Information with FreeDB and Amazon.com
Hack #60, Expanding Your Musical Tastes
Hack #62, Graphing Data with RRDTOOL
Hack #63, Stocking Up on Financial Quotes
Hack #64, Super Author Searching
Hack #66, Using All Consuming to Get Book Lists