Recently I completed the Data Engineering on Google Cloud Platform Specialization (link here) through Coursera. Here is my review.
It’s good, reasonably advanced, has plenty of code examples, and I recommend it for anyone working on GCP. The only problem was a couple of issues in the final labs of the course.
The course is divided into 5 modules of increasing complexity:
- Google Cloud Platform Big Data and Machine Learning Fundamentals
- Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform
- Serverless Data Analysis with Google BigQuery and Cloud Dataflow
- Serverless Machine Learning with Tensorflow on Google Cloud Platform
- Building Resilient Streaming Systems on Google Cloud Platform
You can take the modules out of order or complete them sequentially. It’s up to you, but I’d recommend keeping it at least roughly sequential. I went from 1 to 3, then went back to 2, then 4 and finally 5.
The courses are hosted by Valliappa Lakshmanan from Google. He does a pretty great job overall. Each module starts with slides and discussion, followed by labs run through Google Codelabs (https://codelabs.developers.google.com/), which is a free-to-use training platform for hands-on labs in the Google Cloud Platform. Highly, highly recommended!
Each module is slated to take between 6 and 8 hours to complete. I found this to be roughly correct, although closer to 8 hours, especially on the later modules where the lab content really ramps up and you run into inevitable code issues (Google Pub/Sub version 0.27.0, I’m looking at you!), which means the labs take longer as you google for explanations…
Whilst I did complete 2 of the modules in one day (not recommended), they really are chock full of content, so I’d make sure to leave adequate time. If you had time off work, I’d say 5 modules in 5 days is doable, although still a fairly packed schedule.
The labs were great, with all of the code saved in GitHub for you to use. No complaints here: they all worked really well and fit into the rest of the course material nicely.
There are Quizzes..
There are also a number of short quizzes throughout (roughly 5 per module, with 2 or 3 questions in each). They can be tricky, and you get 3 attempts in any 8-hour period in case you don’t pass the first time. Funnily enough, I didn’t pass the very first quiz… I did on the second attempt, and from then on I made sure not to repeat the first-up miss for the rest of the course (which I didn’t :-))
The Machine Learning bit is the best!
My favourite part of the course was Serverless Machine Learning with Tensorflow on Google Cloud Platform. What a cool concept TensorFlow and its associated pals (Dataflow, Cloud ML and GCS) are. It really seems like the entire Google Cloud has been set up to handle massive, at-scale machine learning. The task is complicated: farming out data cleaning, feature transformations, hyperparameter tuning and data ingestion across myriad machines on the GCP network.
Certainly the process has been greatly simplified with tools like Dataflow in particular, but the fact remains: machine learning (at scale, in real time etc.) is a large and complex undertaking. Users should consider taking at least a couple of weeks to formulate a proper data structure and model. As many others have said, building the actual model is only a small percentage of the work. Getting the data, cleaning it and (as the course shows) understanding it will take up the bulk of your time.
The TensorFlow module focuses on building an ML neural network model that collates New York traffic data and attempts to estimate fares for taxi users given starting and ending locations. Variables such as time of day, day of week, Euclidean distance etc. are included in the model. What was most interesting to me was that one of the biggest improvements in RMSE came at the end, when the full data set was used: more data = more accurate model… interesting.
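For readers who haven't met these terms, here is a minimal sketch of the two ideas mentioned above: a Euclidean distance feature computed from pickup and dropoff coordinates, and RMSE as the error measure. It is not the course's code. The function names and the sample numbers are made up for illustration, and the course labs build these pieces with TensorFlow rather than plain Python.

```python
import math


def euclidean_distance(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Straight-line distance between two points, in coordinate degrees.

    This is a crude feature: it ignores the street grid and the curvature
    of the earth, but it gives the model a rough sense of trip length.
    """
    return math.sqrt(
        (dropoff_lon - pickup_lon) ** 2 + (dropoff_lat - pickup_lat) ** 2
    )


def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual fares."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Invented coordinates and fares, purely to show the calls.
    print(euclidean_distance(-73.99, 40.75, -73.95, 40.78))
    print(rmse([10.0, 12.0], [8.0, 15.0]))
```

A lower RMSE means the predicted fares sit closer to the real ones, and that is the number that dropped most when the full data set was used.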
Minor concerns with the course:
Valliappa Lakshmanan contacted me on Twitter about these issues. See below:
Thanks for the review! The pubsub issue has been fixed in GitHub, so if you update, you'll get fix. Quota issue for last lab is difficult — that level of free resources invites Bitcoin miners …
— lakshmanan v (@lak_gcp) November 21, 2017
Thanks so much for responding Valliappa!! I will leave the original comments below here for a little while just in case someone else comes across a related issue and it might be helpful.
A couple of concerns I had with the course occurred in the final module, Building Resilient Streaming Systems on Google Cloud Platform. The final few labs used a simulated streaming model through Google Pub/Sub that didn’t work on my Cloud Shell. The reason, as stated above, was a version conflict in Pub/Sub: the code in the lab relies on version 0.27.0 of Pub/Sub to work. To get around this issue, follow the steps here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/streaming/publish
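If you suspect the same kind of version conflict, a quick check of what is actually installed can save some googling. The sketch below is my own assumption about how to do that, not part of the lab or the linked steps: it uses Python's standard `importlib.metadata` (Python 3.8+), the helper names are hypothetical, and `google-cloud-pubsub` is assumed to be the relevant package name on PyPI.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, expected):
    """True only if the package is installed at exactly the expected version."""
    return installed_version(package) == expected


if __name__ == "__main__":
    # 0.27.0 is the version the lab code relied on, per the review above.
    package, expected = "google-cloud-pubsub", "0.27.0"
    found = installed_version(package)
    if found is None:
        print(f"{package} is not installed")
    elif matches_pin(package, expected):
        print(f"{package} {found} matches the version the lab expects")
    else:
        print(f"{package} {found} is installed, but the lab expects {expected}")
```

This only reports the mismatch. The actual fix is in the steps linked above.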
The other issue I had (which was unfortunately quite annoying) was that, because I am on the free GCP trial, I didn’t have enough “quota” to run the final lab. See below:
This is despite having over $300 credit still in my free trial, as you can see above. From what I can tell, there is an arbitrary quota limit that sits outside the $ figure stated in your trial. This is actually disappointing, and I think Google should be clearer upfront with users about these quotas. Free data and resources are awesome! But facing the reality that your actual testing of the platform for some use cases can only occur on paid accounts is not. It is testing after all!
Despite the teething issues in the last module, overall I liked this course quite a bit, and I’m really interested in getting Valliappa’s book now, which is to be released on November 25th (Link to Book Details).
Thanks for Reading!