Analyzing CloudTrail Logs Using Hive/Hadoop

Published 2014-10-13 on Farid Zakaria's Blog

Disclaimer

This is simply a record for myself, as I had great difficulty finding information on the subject. It's not meant to be an in-depth guide to either CloudTrail or Hive/Hadoop.

Intro

Recently at work we've had an issue where some security group ingress rules were being modified (either automatically or manually), which was affecting test runs that rely on those rules. To track down the source of the modifications, we enabled CloudTrail. CloudTrail is part of the AWS family of web services; it records the AWS API calls made in your account and delivers the logs to an S3 bucket you can access.

The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.
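
For reference, each log object CloudTrail delivers to S3 is a JSON document containing a Records array. A trimmed, illustrative record (the field names match what the table schema below exposes; the values are made up) looks roughly like this:

{
  "Records": [
    {
      "eventVersion": "1.01",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "EXAMPLEPRINCIPALID",
        "arn": "arn:aws:iam::123456789012:user/Alice",
        "accountId": "123456789012"
      },
      "eventTime": "2014-07-01T10:00:53Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "AuthorizeSecurityGroupIngress",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "192.0.2.1",
      "userAgent": "aws-cli/1.3.2",
      "requestID": "example-request-id",
      "eventID": "example-event-id"
    }
  ]
}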

Hive

My experience with Hive has been very limited (simple exposure from running tutorials), however I was aware that it is a SQL-like execution engine that transforms queries into MapReduce jobs and runs them on Hadoop. Since Hive is built on Hadoop, it has native support for using S3 in place of HDFS.
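
As a minimal sketch (the bucket and path here are hypothetical), pointing a Hive external table directly at S3 is as simple as:

-- Any Hive external table can be backed by an S3 location instead of HDFS.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION 's3://my-example-bucket/logs/';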

With the little knowledge of Hive I had, I thought there should exist a very prominent white paper describing how to consume CloudTrail logs using Hive (via some custom SerDe). A co-worker was simply consuming the JSON log files via Python, however I was on a mission to see if I could solve the problem (querying relevant data from the logs) with an easy Hive setup! The benefit of standing up a Hadoop/Hive cluster for this is that it would be persistent and could easily be reused to query additional information.

Solution

After contacting some people from the EMR team (I was unable to find anything myself on the internet), I was finally pointed to some relevant information! I've included the reference link and the original example code below in case the link ever breaks.
reference: http://www.emrsandbox.com/beeswax/execute/design/4#query

The key thing to note from the example is that it uses a custom SerDe that is included with Hadoop clusters created by AWS Elastic MapReduce. The JARs provide the input format and the deserializer, which together properly consume the nested JSON records. With this you can now easily query CloudTrail logs!

-- This example creates an external Hive table from a location containing CloudTrail logs for a day.
-- A custom SerDe, CloudTrailLogDeserializer, which ships with the EMR AMI, is used for this example,
-- and a few sample queries are provided below.

ADD JAR /usr/share/aws/emr/goodies/lib/EmrHadoopGoodies-1.0.0.jar;
ADD JAR /usr/share/aws/emr/goodies/lib/EmrHiveGoodies-1.0.0.jar;

-- Create Hive table for CloudTrail logs
CREATE EXTERNAL TABLE IF NOT EXISTS CloudTrailTable
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailLogDeserializer'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://us-east-1.elasticmapreduce.samples/cloudtrail-logs/data/AWSLogs/176430881729/CloudTrail/us-east-1/2014/07/01/';

-- Show all API calls made by a given user
SELECT DISTINCT(eventName)
FROM CloudTrailTable
WHERE userIdentity.principalId = "63f29b7986e7c43cb8cd4";

-- Show all API calls made by a given user between times T1 and T2
SELECT DISTINCT(eventName)
FROM CloudTrailTable
WHERE userIdentity.principalId = "40c71535f6ba"
AND 
TO_UNIX_TIMESTAMP(eventTime,"yyyy-MM-dd'T'HH:mm:ss'Z'")
BETWEEN TO_UNIX_TIMESTAMP("2014-07-01T10:00:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'")
AND 
TO_UNIX_TIMESTAMP("2014-07-01T20:00:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'");

-- Show calls originating from the EMR service, along with the caller
SELECT eventName, userIdentity.principalId
FROM CloudTrailTable
WHERE eventSource = "elasticmapreduce.amazonaws.com";

-- Show count of different clients used
SELECT userAgent, COUNT(requestId) AS cnt
FROM CloudTrailTable
GROUP BY userAgent
ORDER BY cnt DESC;
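
-- (My own addition, not part of the original EMR sample.) Given the
-- original motivation, show who changed security group ingress rules;
-- these event names are the standard EC2 API actions for ingress changes.
SELECT eventTime, userIdentity.principalId, sourceIpAddress, eventName
FROM CloudTrailTable
WHERE eventName IN ("AuthorizeSecurityGroupIngress", "RevokeSecurityGroupIngress");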

-- An optional way to create the Hive table, if your CloudTrail logs consist of many small files.
-- When you copy the data using Hive, it will combine the small files into appropriately large ones.
-- Below are the steps to copy data from the CloudTrail logs location to an HDFS/S3 location.

-- SET mapred.input.dir.recursive=true; 
-- SET hive.mapred.supports.subdirectories=true;
-- SET hive.merge.mapredfiles=true;
-- SET hive.merge.mapfiles=true;

-- CREATE EXTERNAL TABLE IF NOT EXISTS CloudTrailTempTable
-- ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailLogDeserializer'
-- STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
-- OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
-- LOCATION 's3://us-east-1.elasticmapreduce.samples/cloudtrail-logs/data/AWSLogs/176430881729/CloudTrail/us-east-1/2014/07/';

-- Change the HDFS location to point to your home directory in HDFS.
-- CREATE EXTERNAL TABLE IF NOT EXISTS CloudTrailHDFS(
--  eventVersion STRING,
--  userIdentity STRUCT<
--   type:STRING,
--   principalid:STRING,
--   arn:STRING,
--   accountid:STRING,
--   invokedby:STRING,
--   accesskeyid:STRING,
--   sessioncontext:STRUCT<
--    attributes:STRUCT<
--     mfaauthenticated:STRING,
--     creationdate:STRING
--    >
--   >
--  >,
--  eventTime STRING,
--  eventSource STRING,
--  eventName STRING,
--  awsRegion STRING,
--  sourceIpAddress STRING,
--  userAgent STRING,
--  requestId STRING,
--  eventId STRING
--  )
-- LOCATION 'hdfs:///cloudTraillogs/';

-- INSERT OVERWRITE TABLE CloudTrailHDFS 
-- SELECT * FROM CloudTrailTempTable;