1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling.  ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server, so it is hard to package and install elsewhere.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignore patterns, and control system and created a package
intended for personal use.  You can find it at
https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+Lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

 vim:ts=2:sw=2:tw=72:et