I've been messing around lately trying to figure out whether hosting a Plex Media Server on EC2 and storing all the media in S3 was possible. Obviously this can get VERY expensive very quickly to actually do long term, which was never my plan; I was just doing it out of curiosity as a proof-of-concept sort of thing. The first thing I wanted to figure out was what instance type would work well. I decided to use Ubuntu 12.04 LTS because it's what I'm most familiar with. Here are the instance types I tested and a short description of what I found. For this portion I was copying media straight into the instance's ephemeral storage, which, for those who don't speak AWS, is the local hard drive on the server. (The media was 720p MP4.)
m1.large - Even making it transcode on the fly to 5 different devices, I wasn't seeing north of 50% CPU or RAM except the occasional spike. Too much computer.
m1.medium - Same 5 devices; CPU usage was pretty steadily north of 60%, occasionally holding around 75%, and the RAM was certainly filling up faster since there's less of it. Seemed just right.
m1.small - It works (which sort of surprised me, since I generally find the m1.small too slow for many applications), but it was sluggish, especially the web interface. It did manage to transcode okay for 2 streams, but at 3 it was maxed out, and with any other interaction one of the streams would suffer.
I ended up deciding the m1.medium was the right answer. (Running one of these 24/7 will cost you about $88 a month, not including any EBS or S3 storage or outgoing bandwidth.)
EBS Test - I figured it would be a good idea to see how Plex and EBS would play together, because if EBS couldn't give Plex enough performance, S3 never would. So I copied the movies over to EBS and got similar results to what I was seeing with the ephemeral storage; the delay to start a movie was a tiny bit longer, by maybe 1-3 seconds, after which it was fine.

S3 - Now came the hard part: how the heck can I make this work with S3? The key thing to understand is that S3 is intended to store data in a put/get fashion: if you want to read or modify a file, you download the entire thing, then upload the entire file when you're done if you changed anything. That said, the essentially limitless storage and very strong file durability make it an excellent place to store files (and it's slightly cheaper than EBS).
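If you've never worked with S3 directly, here's roughly what that put/get model looks like using the boto Python library (the bucket and file names are made up for illustration):

```python
import boto

# Credentials come from the environment or your boto config file.
conn = boto.connect_s3()
bucket = conn.get_bucket('my-media-bucket')  # hypothetical bucket name

# Reading means pulling down the WHOLE object...
key = bucket.get_key('movies/big-movie.mp4')
key.get_contents_to_filename('/tmp/big-movie.mp4')

# ...and changing even one byte means re-uploading the WHOLE object.
key.set_contents_from_filename('/tmp/big-movie.mp4')
```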
The first tool I tried was s3fs (http://code.google.com/p/s3fs/), which allows you to mount an S3 bucket so Linux sees it like any other mounted folder. I got it all working quite quickly, and at first it seemed to work well, but that was only because I had just copied the files in: while s3fs had uploaded them to S3, it was still using the locally cached versions. After unmounting, clearing the cache, and remounting, Plex couldn't play anything. I did notice that the network traffic into the EC2 instance was spiking, though, and after waiting a few minutes and trying the same movie again, it worked. What I think was happening: Plex would try to read the file, but s3fs had to download the entire file before it would feed any of it to Plex, so Plex would error out thinking something was wrong. Once s3fs had the entire file in cache it would work fine, because it would read the local copy. I tried all the different settings offered by s3fs to make Plex happy; none of it worked. So off to look for another option.
Then I stumbled upon s3ql (http://code.google.com/p/s3ql/), a similar concept but with a big difference in the way it reads from and writes to S3. s3fs was uploading files to S3 and downloading them again exactly as they were, so to read a 1GB file you first had to download 1GB. s3ql, on the other hand, creates a block-type file system when writing to S3: when uploading a 1GB file, it breaks it into chunks (the default is 10MB) and keeps an inventory of the blocks. (It also has a few other tricks it does during this process that I'll talk about later.) The great thing about the block system is that when you go to read that 1GB file, it only has to download the 10MB blocks you actually need.
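To make the block idea concrete, here's a rough sketch of the approach in Python with boto. This is NOT s3ql's actual code, just the gist of splitting a file into blocks and keeping an inventory:

```python
import boto

BLOCK_SIZE = 10 * 1024 * 1024  # s3ql's default block size: 10MB

def upload_in_blocks(bucket, local_path, prefix):
    """Split a file into fixed-size blocks, store each block as its own
    S3 object, and return an inventory mapping block number -> object key."""
    inventory = {}
    with open(local_path, 'rb') as f:
        block_num = 0
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            key_name = '%s/block-%06d' % (prefix, block_num)
            bucket.new_key(key_name).set_contents_from_string(chunk)
            inventory[block_num] = key_name
            block_num += 1
    return inventory
```

With an inventory like that, reading bytes from the middle of a file just means fetching the handful of blocks that cover the range you want, not the whole thing.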
The other advantage: a single download from S3 typically has a maximum speed of 10-15MB/sec, which is pretty fast, BUT downloading a 1GB file as a single stream takes about 1 minute and 30 seconds. If you instead download several 10MB blocks simultaneously (8 seems optimal), you'll see a combined throughput of around 50MB/sec, meaning that 1GB file now takes less than 30 seconds. So not only can you start reading the file sooner, since you're getting the blocks you need first, you can also get the entire file faster using multiple downloads.
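Here's a sketch of that parallel-fetch idea, again with boto plus a thread pool, reading the block inventory from the sketch above:

```python
from multiprocessing.pool import ThreadPool
import boto

THREADS = 8  # ~8 simultaneous downloads was the sweet spot in my testing

def fetch_block(args):
    bucket_name, key_name = args
    # Give each worker its own connection; boto connections aren't thread-safe.
    bucket = boto.connect_s3().get_bucket(bucket_name, validate=False)
    return bucket.get_key(key_name).get_contents_as_string()

def fetch_file(bucket_name, inventory):
    """Download all of a file's blocks in parallel, reassemble in order."""
    pool = ThreadPool(THREADS)
    args = [(bucket_name, inventory[n]) for n in sorted(inventory)]
    blocks = pool.map(fetch_block, args)  # map() preserves input order
    pool.close()
    return ''.join(blocks)
```

Eight streams at roughly 6MB/sec each is where that combined ~50MB/sec figure comes from.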
On to testing s3ql: I copied some media over, cleared s3ql's local cache, and with just the default settings Plex was much happier reading data from the s3ql mount. It was a bit sluggish and certainly not perfect, so I started messing with the s3ql options to see what I could do to improve things. There are 3 additional features that s3ql offers, all of which affect your read/write speed and its processor/RAM usage: Encryption, Data De-Duplication, and Compression. Of course I was using all of them at their defaults. I won't bore you with all the things I tried and their results; in the end I found adjusting the following settings gave me performance that Plex was happy with.
-Block Size: Since the main goal of this is movies, most of which are 1GB or more, 10MB blocks are a bit small, so I pushed it up to 30MB. That means instead of dealing with 103 blocks per 1GB, it's dealing with 35 blocks per 1GB. This still lets me take advantage of multiple threads (simultaneous downloads/uploads), yet lowers the number of GETs and PUTs, and thus the network overhead and the processing to assemble the blocks, leaving more room for throughput and more CPU for the other processes. (See the quick arithmetic after this list.)
-Compression: I changed it to zlib, the lightest of the 3 types offered besides none, but hey, S3 storage isn't exactly cheap, so if I can get away with some compression I should! (Granted, my media is MP4, which is already pretty compressed to begin with, so I'm not seeing massive space savings, but every little bit counts!) With the default, highest level of compression I was seeing a CPU hit of 20%; with zlib it never goes above 10%.
-Threads: This is how many blocks s3ql will upload to or download from S3 at a time. I believe the default was 3. I found anything above 8 didn't seem to improve throughput, so why waste the extra CPU/memory? As I said before, this gives you about 50MB/sec of throughput, which is pretty good.
-Cachesize: I don't recall what the default is, but it was pretty low considering I'm planning on regularly reading 1GB and larger files, and since the m1.medium instance comes with about 400GB of ephemeral storage for free, why not use it? (Worth noting that ephemeral storage is specific to each instance, so if you shut down your server, anything on it is gone forever. Because of that it's not really a primary place to store files, but it's a great place to use as a local cache, since the cache is constantly being rebuilt and only temporary anyway.)
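Since those settings all trade off against each other, here's the back-of-the-envelope arithmetic behind my choices. The speeds are just what I observed in my own testing, not guarantees:

```python
GB = 1024.0  # in MB

def blocks_per_gb(block_mb):
    return int(-(-GB // block_mb))  # ceiling division

for block_mb in (10, 30):
    print('%dMB blocks: %d GETs per 1GB file' % (block_mb, blocks_per_gb(block_mb)))
# 10MB blocks: 103 GETs per 1GB file
# 30MB blocks: 35 GETs per 1GB file

single_stream = 12.5  # MB/sec, middle of the 10-15MB/sec I was seeing
eight_threads = 50.0  # MB/sec combined with 8 threads
print('1GB, single stream: ~%d sec' % (GB / single_stream))  # ~81 sec
print('1GB, 8 threads: ~%d sec' % (GB / eight_threads))      # ~20 sec
```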
After all that, I now have an m1.medium EC2 instance running Plex Media Server and storing all its media in S3. I'm still messing around with the s3ql settings and doing some more testing, because I want to maximize performance and see if I can use one of the stronger compression schemes to save some $ on S3.
Once I'm done fine tuning things I'll move into my next phase of testing: seeing how it fares with regular use. I plan to put up a selection of movies and try to use it as my primary server for a bit, to see if I find any quirks that crop up after being up for several days and moving several movies back and forth. (Also, I'll lower the cache to something like 20GB, because I don't want to have to exercise 400GB of cache; that's going to cost a bit in S3 space.)
Damn, that's a long post; I hope others find it useful. Ultimately I don't think EC2 and S3 are a viable solution for most people, because of the amount of money the EC2 instance costs, plus the S3 storage and the per-GB fees on outgoing bandwidth. When I did the math, moving my entire media collection to S3 and using EC2 for my primary Plex server would end up costing anywhere from $2,000 to $4,000 a year, depending on actual use and whether I got a reserved instance. That's just not worth it when I already have tons of storage on my Synology DiskStation and have (2) Mac Minis doing a great job running Plex at the house. As I said before, this is/was an exercise to see if it could be done, and it CAN! Maybe one day Amazon will lower the EC2 and S3 prices enough that it becomes affordable.

Side note: There is no cost to transfer between S3 and EC2 as long as they are in the same region, so your only outgoing per-GB charges will be from your EC2 instance to you. (If they are in different regions, you'll pay to get the data out of S3 and again out of EC2, which is obviously not ideal.)
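For anyone who wants to run their own numbers, here's the kind of quick calculator I used. The instance price is the $88/month from above; the storage and bandwidth prices are placeholders, so plug in current AWS pricing for your region:

```python
INSTANCE_PER_MONTH = 88.0  # m1.medium on-demand, running 24/7
S3_PER_GB_MONTH = 0.095    # placeholder: check current S3 standard pricing
EGRESS_PER_GB = 0.12       # placeholder: check current EC2 -> internet pricing

def yearly_cost(library_gb, streamed_gb_per_month):
    instance = INSTANCE_PER_MONTH * 12
    storage = library_gb * S3_PER_GB_MONTH * 12
    bandwidth = streamed_gb_per_month * EGRESS_PER_GB * 12
    return instance + storage + bandwidth

# Example: a 2TB library streaming 300GB a month
print('$%.0f/year' % yearly_cost(2048, 300))  # ~$3,823 with these placeholder prices
```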