Amazon S3

S3 is an asset store: a repository of content with different rights for different users, and a formal SLA. It can be accessed remotely (with a per-GB transfer fee), or from within the EC2 farms. From within the EC2 farms, access to the US S3 store does not incur any bandwidth charges. Access to the EU store is billed at both ends. The S3 store is where AMIs are kept.

API

There is some coverage of S3 in the RESTful Web Services book. There is a SOAP API, but the REST API is the main one, as every uploaded asset can be accessed via an HTTP URL. BitTorrent support is an extra feature of the REST API.

Concepts

S3 Bucket

An S3 Bucket is something that holds data. Every bucket gets a hostname.
Bucket names are global across all users (and visible to anyone with a web browser), so Amazon warns users not to expect high performance from bucket operations. DNS updates may take a while too. Developers should not create and delete buckets wildly.

S3 Object

An S3 Object is an asset stored in the repository.
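As a minimal sketch of the "every bucket gets a hostname" idea, the helpers below build the hostname and a plain HTTP URL for an object. The function names and the percent-encoding policy (keeping "/" unescaped in keys) are assumptions made for illustration, not something the S3 documentation dictates.

```python
from urllib.parse import quote

S3_ENDPOINT = "s3.amazonaws.com"  # the endpoint used throughout these notes


def bucket_host(bucket: str) -> str:
    """Return the hostname for a bucket, e.g. smartfrog.s3.amazonaws.com."""
    return f"{bucket}.{S3_ENDPOINT}"


def object_url(bucket: str, key: str) -> str:
    """Return the plain HTTP URL of an object.

    Assumption: the key is percent-encoded, with '/' left as-is so that
    keys can look like paths.
    """
    return f"http://{bucket_host(bucket)}/{quote(key, safe='/')}"
```

For example, `object_url("smartfrog", "docs/read me.txt")` gives a URL on the host `smartfrog.s3.amazonaws.com` with the space in the key escaped as `%20`.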
Cost

Storage
$0.15 per GB-Month of storage used
Data Transfer
$0.10 per GB - all data transfer in
$0.18 per GB - first 10 TB / month data transfer out
$0.16 per GB - next 40 TB / month data transfer out
$0.13 per GB - data transfer out / month over 50 TB
Requests
$0.01 per 1,000 PUT or LIST requests
$0.01 per 10,000 GET and all other requests*
* No charge for delete requests
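The price list above can be turned into a rough monthly estimate. The following is only a sketch: the function names are invented here, and it assumes 1 TB = 1024 GB, which the price list does not actually state.

```python
GB_PER_TB = 1024  # assumption: the price list does not say which TB it means

# (tier size in GB, price per GB) for data transfer out, in billing order
TRANSFER_OUT_TIERS = [
    (10 * GB_PER_TB, 0.18),   # first 10 TB / month
    (40 * GB_PER_TB, 0.16),   # next 40 TB / month
    (float("inf"), 0.13),     # everything over 50 TB / month
]


def transfer_out_cost(gb: float) -> float:
    """Cost in dollars of transferring `gb` gigabytes out in one month."""
    cost = 0.0
    remaining = gb
    for tier_size, price in TRANSFER_OUT_TIERS:
        used = min(remaining, tier_size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


def monthly_cost(storage_gb=0.0, in_gb=0.0, out_gb=0.0,
                 put_list_requests=0, other_requests=0) -> float:
    """Rough monthly bill in dollars; DELETE requests are free and not counted."""
    return (storage_gb * 0.15
            + in_gb * 0.10
            + transfer_out_cost(out_gb)
            + put_list_requests / 1000 * 0.01
            + other_requests / 10000 * 0.01)
```

With these figures, uploading a million small objects costs about $10 in PUT fees and serving each of them once costs about $1 in GET fees, however little data is involved.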
The pricing is designed to deliver added value to the big users, the companies that use S3 for storing photos and videos. The per-request charge puts a slight penalty on people using S3 as the store for lots of small content, like icons and other artwork, because they build up a bill even if the amount of data is low. Given that every HTTP operation incurs server-side costs (electricity, CPU load, depreciation of servers), this probably makes sense.

S3 Costs and EC2

For storing AMIs, you pay about $1-$2 to upload the image, then, assuming it is a small image, about $1/month in storage. You don't pay anything for loading the image in the EC2 farm. For using S3 as the persistence layer for your AMI, you only pay for the storage, though of course you may pay more if the files so uploaded are made directly available to customers. However, as all GET requests served directly to your customers bypass the EC2 servers, you save on server fees in such a situation.

As mentioned before, transfer between the EC2 servers and the EU S3 store is billed at both ends. However, EU customers get to experience lower latency with the data in this store. In some applications, it may be better to store data in the EU S3 store, especially if you have client applications that upload directly from the EU customers' machines.

GUI

The S3 Firefox Organiser provides a GUI to let you synchronise a local directory with the S3 store. You can use this to publish AMIs or to move other data to and fro. For $0.10+VAT you can even use it to back up your photo collection.

File system semantics

This is not your normal filesystem. It has caching built in, and is designed to scale out by taking away some of the things people expect from a local filesystem.
What they do appear to guarantee is that once a change has propagated, it will be what people see.

Developer API (RESTful)

There are two APIs, a SOAP API and a REST API. We will ignore the SOAP API, as it requires a feature (WS-Security) that is not in Alpine.

List

List all objects matching a specific key. Returns an XML document you can apply XPath to. The size of the returned list does not significantly affect the time to create the list (i.e. linear or better). Note that parsing long lists can hurt the memory consumption of a DOM-style XML parser.

Service Operations

GET

GET /
Host: s3.amazonaws.com

Get the list of buckets that belong to the authenticated user.

Bucket Operations

Every bucket is a host, so you can operate on it by making requests. If the bucket behind a hostname does not exist, nslookup still works, but Amazon returns an error, such as

HTTP/1.1 404 Not Found
x-amz-request-id: 2D4A2D0DFCB1190C
x-amz-id-2: kFK/F3fWy2UcGCtlndSL1wQvIF2AI0p1oGBvF3frHN5HSMc5aKKk/6WjBoxdOhek
Content-Type: application/xml
Date: Fri, 23 Nov 2007 18:41:14 GMT
Connection: close
Server: AmazonS3

<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>NoSuchBucket</Code>
  <Message>The specified bucket does not exist</Message>
  <RequestId>11FFFDAC076F89E8</RequestId>
  <BucketName>something.new.smartfrog</BucketName>
  <HostId>I3TmzpV4EAlReRF9Ap3AAA5lGv0hnEq5ycNTrhVMcqCBEdzOc832/SGfiRSsCFdU</HostId>
</Error>

This shows some implementation aspects of how S3 works. DNS is set up so that any hostname under *.s3.amazonaws.com always resolves to their servers; there is no need to add/remove DNS entries as buckets come and go. This means that you cannot rely on hostname lookup to probe for a bucket; you have to check for a 404 response.

PUT

Creates a new bucket at the host identified.

PUT /
Host: smartfrog.s3.amazonaws.com

This is REST at its best: an HTTP operation that creates a new resource (the bucket), effectively giving it a hostname of its own as it does so!

GET

Get a list of objects matching the pattern passed down.

GET /?prefix=N&marker=Ned&max-keys=40 HTTP/1.1
Host: quotes.s3.amazonaws.com
Date: Wed, 01 Mar 2006 12:00:00 GMT
Authorization: AWS 15B4D3461F177624206A:xQE0diMbLRepdf3YB+FIEXAMPLE=

GET ?location

Returns the location of the bucket.

GET /?location HTTP/1.1
Host: quotes.s3.amazonaws.com
Date: Tue, 09 Oct 2007 20:26:04 +0000
Authorization: AWS 1ATXQ3HHA59CYF1CVS02:JUtd9kkJFjbKbkP9f6T/tAxozYY=

The response body looks like:

<LocationConstraint xmlns="http://s3.amazonaws.com/doc/2006-03-01/">EU</LocationConstraint>

This is a bit non-RESTful. They should really have given every server resources for metadata about the service itself, something like /services. Instead they've tacked on overlaying functionality based on the query string, probably because they added this later and didn't want to break anything by reserving content underneath the server.

DELETE

DELETE / HTTP/1.1
Host: quotes.s3.amazonaws.com
Date: Wed, 01 Mar 2006 12:00:00 GMT
Authorization: AWS 15B4D3461F177624206A:xQE0diMbLRepdf3YB+FIEXAMPLE=

Results in something like

HTTP/1.1 204 No Content
x-amz-id-2: JuKZqmXuiwFeDQxhD7M8KtsKobSzWA1QEjLbTMTagkKdBX2z7Il/jGhDeJ3j6s80
x-amz-request-id: 32FE2CEB32F5EE25
Date: Wed, 01 Mar 2006 12:00:00 GMT
Connection: close

204 is a special HTTP response meaning 'successful, nothing interesting to provide'. The Connection: close header is important, as the server will no longer exist once this change propagates. The x-amz-request-id header is a unique request ID; it can be quoted in support calls with Amazon if things are going wrong.

Object operations

PUT

Adds an object to a bucket. "The response indicates that the object has been successfully stored. Amazon S3 never stores partial objects." Request headers can include content-type and cache metadata; response headers include the ETag.

GET

Get the object. The ETag can be used for conditional GETs.

HEAD

Get the object's metadata. The ETag can be used for conditional HEAD operations, to poll for changes.

DELETE

Delete an object. It is idempotent: it is not an error to delete a nonexistent object.

Metadata

In the REST API, when you PUT a resource you can add name:value HTTP headers with the prefix x-amz-meta-. On a GET, the prefix is stripped and all duplicate entries are merged into a comma-separated list. Because the prefix is stripped, the data can be used to create HTTP headers for third party programs to handle. What you can't do is search for resources by metadata.

Request Security

You need to create a signed checksum of every request; the rules for this are quite complex. The solution is simple: delegate the work to libraries that implement it. Objects and buckets have ACL-based security; you can grant rights to individual users, or groups of users.

BitTorrent Support

Every resource that is world readable can have a BitTorrent description. Just add ?torrent at the end of the resource URL. If there are no peers, the torrent serves up the S3 resource: it is the seed. However, if there are peers serving the content, these may be picked up instead, depending upon networking settings. The result is that popular content may be downloaded faster, with some bandwidth costs saved.
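The ?torrent convention is simple enough to sketch in a few lines. The helper name is made up for this example, and it assumes it is handed a plain object URL with no query string of its own; the object key "Nelson" in the usage note is likewise hypothetical.

```python
def torrent_url(object_url: str) -> str:
    """Return the URL of the .torrent description for a world-readable object.

    Assumption: object_url is a plain object URL; anything that already
    carries a query string is rejected rather than guessed at.
    """
    if "?" in object_url:
        raise ValueError("expected a plain object URL without a query string")
    return object_url + "?torrent"
```

For instance, `torrent_url("http://quotes.s3.amazonaws.com/Nelson")` yields the same URL with `?torrent` appended.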
The torrent file is created on demand on the first ?torrent request; the time to create it is O(file size), and can be several minutes for a big file. If you want to serve torrent content, it is best to make the first ?torrent request yourself. When you delete an object, or remove anonymous access, S3 stops serving the torrent. This does not stop others continuing to serve the deleted file, though the .torrent may be harder to get hold of.

Logging

You can turn logging on for a bucket; the logs are delivered to a different bucket, and you pay for the storage.

Libraries

There's a good Ruby library. For Java, the jetS3t (pronounced Jet-Set) library does the work. For SmartFrog we went with Restlet, which has support for the Amazon Web Services custom authentication protocol.

Analysis

S3 is a datastore for large volume data; its costs may be comparable to trying to run the datacentre yourself. Because you can feed up URLs directly to customers, you can embed content from the S3 store straight into the web browser or other HTTP-enabled tool.

The SOAP API should be viewed as obsolete; the fun stuff is RESTy. Even there, however, you can see the API doing things that are not 'pure' REST. It's pretty close though, and because it uses PUT and DELETE, it is one of the key REST architectures.

The SLA is pretty good, and with its security model, you can use it as the back end for the non-database part of any application that combines database data with artifacts that are stored in a central repository. Traditionally, people use the filesystem for this, but having integrated with asset stores in the past, I can appreciate how hard it is to get all the details right. If you used S3, then right from the outset you'd be working against a long haul repository.
This does mean you'd encounter connectivity and reliability issues early, and ramp up bills if you are not careful, but it also means that by the time you go live, you know what the costs will be, and you know your front end will be able to cope with unreliable connections and S3's concurrency rules.

The limitations lie in the metadata and the (in)ability of the repository to look like a real filesystem. The metadata is good, but limited in size and usefulness. You'd have to walk every artifact and do HEAD requests if you were looking for a specific piece of metadata. You couldn't put something like an expiry date on an artifact and then search for all artifacts which had already expired. Nor, apparently, can you change metadata on an existing artifact. Because of these limitations, it is clearly just a way to set some custom HTTP headers on requests; all the real metadata would have to be stored in a database.

This is where it gets complex for EC2: S3 is the only way to store data persistently. You don't have a shared filestore; you have the local image's disk (which has to be considered unreliable) and the S3 repository. It's only after you've got a 2XX response from the S3 store that you know that data is successfully written; that the transaction is complete.
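The "only a 2XX means it is written" rule can be sketched as a small retry wrapper. Everything named here is hypothetical: `send` stands in for whatever performs one PUT and returns the HTTP status code, and the retry policy (retry on connection errors and non-4xx failures, give up on 4xx) is an assumption, not something S3 prescribes.

```python
from typing import Callable


def is_committed(status: int) -> bool:
    """Only a 2xx status means S3 has accepted the write."""
    return 200 <= status < 300


def put_with_retry(send: Callable[[], int], attempts: int = 3) -> bool:
    """Call send() until it reports a 2xx status or attempts run out.

    send is a hypothetical callable that performs one PUT and returns the
    HTTP status code; connection failures are expected to raise OSError.
    Returns True only when a 2xx response was seen.
    """
    for _ in range(attempts):
        try:
            status = send()
        except OSError:
            continue  # dropped connection: count it as a failed attempt
        if is_committed(status):
            return True
        if 400 <= status < 500:
            return False  # client error: repeating the same request will not help
    return False
```

A caller on EC2 would keep the local copy of the data until `put_with_retry` returns True, and treat False as "not stored" rather than "probably stored".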