Sourabh Bajaj2022-11-10T08:46:14+00:00http://sourabhbajaj.comSocial media Share icons for Jekyll2017-10-29T00:00:00+00:00/blog/2017/10/29/adding-social-media-share-icons-to-jekyll<p>I recently added share icons on each blogpost to make it easier for readers to share the posts on social media. I did it with just HTML and CSS. Here is what the final output looked like:</p>
<p><img src="/images/blog/2017-10/share_demo.png" align="center" alt="Share icons demo" style="margin:auto; display:block;" /></p>
<p>This is a short guide on how to add this to your own blog.</p>
<h3 id="download-images">Download Images</h3>
<p>First we need to download svg images for the social media buttons we’re going to create. You can use <a href="https://simpleicons.org/">SimpleIcons</a>. In this tutorial we’ll use images for Reddit, Hacker News, Twitter and LinkedIn.</p>
<p>Once you’ve downloaded the images, add them to the <code class="language-plaintext highlighter-rouge">_includes/social</code> directory in your Jekyll project.</p>
<h3 id="html-block">HTML block</h3>
<p>Let’s add the HTML code for the social media icons. You can create a <code class="language-plaintext highlighter-rouge">_includes/share.html</code> file for this. Don’t forget to replace <code class="language-plaintext highlighter-rouge">USERNAME</code> with your Twitter handle.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"sharebuttons"</span><span class="nt">></span>
<span class="nt"><hr</span> <span class="nt">/></span>
<span class="nt"><ul></span>
<span class="nt"><li></span>
<span class="nt"><p</span> <span class="na">class=</span><span class="s">"sharetitle"</span><span class="nt">></span> Share this: <span class="nt"></p></span>
<span class="nt"></li></span>
<span class="nt"><li</span> <span class="na">class=</span><span class="s">"reddit"</span><span class="nt">></span>
<span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://www.reddit.com/submit?url={{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url | uri_escape}}&title={{ page.title | default:"</span><span class="err">"</span> <span class="err">|</span> <span class="na">uri_escape</span> <span class="err">}}"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>
{% include social/share-icon-reddit.svg %}
<span class="nt"></a></span>
<span class="nt"></li></span>
<span class="nt"><li</span> <span class="na">class=</span><span class="s">"hn"</span><span class="nt">></span>
<span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://news.ycombinator.com/submitlink?u={{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url | uri_escape}}&t={{ page.title | default:"</span><span class="err">"</span> <span class="err">|</span> <span class="na">uri_escape</span><span class="err">}}"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>
{% include social/share-icon-hn.svg %}
<span class="nt"></a></span>
<span class="nt"></li></span>
<span class="nt"><li</span> <span class="na">class=</span><span class="s">"twitter"</span><span class="nt">></span>
<span class="nt"><a</span> <span class="na">href=</span><span class="s">"https://twitter.com/intent/tweet?via=USERNAME&url={{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url | uri_escape}}&text={{ page.title | default:"</span><span class="err">"</span> <span class="err">|</span> <span class="na">uri_escape</span><span class="err">}}"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>
{% include social/share-icon-twitter.svg %}
<span class="nt"></a></span>
<span class="nt"></li></span>
<span class="nt"><li</span> <span class="na">class=</span><span class="s">"linkedin"</span><span class="nt">></span>
<span class="nt"><a</span> <span class="na">href=</span><span class="s">"https://www.linkedin.com/shareArticle?mini=true&url={{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url | uri_escape}}&title={{ page.title | default:"</span><span class="err">"</span> <span class="err">|</span> <span class="na">uri_escape</span><span class="err">}}"</span> <span class="na">target=</span><span class="s">"_blank"</span><span class="nt">></span>
{% include social/share-icon-linkedin.svg %}
<span class="nt"></a></span>
<span class="nt"></li></span>
<span class="nt"></ul></span>
<span class="nt"></div></span>
</code></pre></div></div>
<h3 id="add-layout-to-the-post-template">Add layout to the post template</h3>
<p>We need to include this block in the layout file, which is typically <code class="language-plaintext highlighter-rouge">_layouts/post.html</code>. Add the following line to that file.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% include share.html %}
</code></pre></div></div>
<h3 id="style-the-buttons">Style the buttons</h3>
<p>In the CSS file for the site, add the following snippet. We got the colors for the icons from <a href="https://simpleicons.org/">SimpleIcons</a>.</p>
<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/* Share buttons */</span>
<span class="nc">.sharebuttons</span> <span class="p">{</span>
<span class="nl">margin</span><span class="p">:</span> <span class="m">0</span> <span class="nb">auto</span> <span class="m">0</span> <span class="nb">auto</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nt">ul</span> <span class="p">{</span>
<span class="nl">margin</span><span class="p">:</span> <span class="m">20px</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span><span class="p">;</span>
<span class="nl">text-align</span><span class="p">:</span> <span class="nb">center</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nt">ul</span> <span class="nt">li</span> <span class="p">{</span>
<span class="nl">display</span><span class="p">:</span> <span class="nb">inline</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nt">ul</span> <span class="nt">li</span> <span class="nt">a</span> <span class="p">{</span>
<span class="nl">text-decoration</span><span class="p">:</span> <span class="nb">none</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nt">ul</span> <span class="nt">li</span> <span class="nt">svg</span> <span class="p">{</span>
<span class="nl">width</span><span class="p">:</span> <span class="m">40px</span><span class="p">;</span>
<span class="nl">height</span><span class="p">:</span> <span class="m">40px</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nc">.reddit</span> <span class="nt">svg</span> <span class="p">{</span>
<span class="py">fill</span><span class="p">:</span> <span class="m">#FF4500</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nc">.hn</span> <span class="nt">svg</span> <span class="p">{</span>
<span class="py">fill</span><span class="p">:</span> <span class="m">#F0652F</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nc">.twitter</span> <span class="nt">svg</span> <span class="p">{</span>
<span class="py">fill</span><span class="p">:</span> <span class="m">#1DA1F2</span><span class="p">;</span>
<span class="p">}</span>
<span class="nc">.sharebuttons</span> <span class="nc">.linkedin</span> <span class="nt">svg</span> <span class="p">{</span>
<span class="py">fill</span><span class="p">:</span> <span class="m">#0077B5</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="share-this-post">Share this post</h3>
<p>Now you can build your site to check the output. Don’t forget to share this post :)</p>
How to Innovate on Purpose?2017-10-28T00:00:00+00:00/blog/2017/10/28/how-to-innovate-on-purpose<p><img src="/images/blog/2017-10/innovation.png" align="center" alt="Innovation" style="margin:auto; display:block;" /></p>
<p>I was recently watching <a href="http://www.spencergreenberg.com/">Spencer Greenberg’s</a> talk from <a href="https://www.youtube.com/watch?v=LiVyRtS_d9o">EA Global 2015</a> on deliberate innovation.</p>
<p>We often think of innovation in two broad themes: Genius and Luck. Genius covers people like Einstein, who were a cut above the rest and made scientific breakthroughs. Luck covers the eureka moments that led to discoveries and inventions. The problem is that neither is actionable, so neither helps us be more innovative.</p>
<p>When looking for innovative ideas, we start by looking for solutions that are possible, discoverable, smart and big. The issue here is that more often than not someone is already working on these.</p>
<p>Can we reverse the situation and look for ideas that are impossible, undiscoverable, stupid and tiny? That doesn’t seem feasible as stated, so let’s tweak each one just a bit:</p>
<ul>
<li>
<p><strong>Impossible until now</strong>: There is a lag between a technology becoming possible and it being applied for the masses. One can exploit this gap to create solutions using the new technology and make it more accessible. E.g. applying machine learning to different segments.</p>
</li>
<li>
<p><strong>Undiscoverable to others</strong>: The intersection of two completely different areas that few people are familiar with at once can lead to interesting applications and ideas. E.g. StitchFix is an intersection of data science and fashion.</p>
</li>
<li>
<p><strong>Stupid sounding</strong>: Some ideas sound stupid unless you know some secret about the world that makes them good ideas. E.g. early days of blockchain and cryptocurrencies.</p>
</li>
<li>
<p><strong>Tiny at first</strong>: Try to solve problems for a small group which you are intimately familiar with; this may lead to a path where you can solve problems for a larger group. E.g. Segment.io was started as an internal tool.</p>
</li>
</ul>
<p>In the end, it is important to remember that innovation also depends on a lot of other factors such as hard work, luck and quality.</p>
What's your ML test score?2017-09-07T00:00:00+00:00/blog/2017/09/07/what-s-your-ml-test-score<p>Using machine-learning systems in production is very different from running offline experiments, as you run into problems such as train/test skew, latency and resource requirements. The paper <a href="https://research.google.com/pubs/pub45742.html">What’s your ML Test Score?</a> by Eric Breck et al. provides a rubric for measuring the quality of ML system design.</p>
<p>The rubric has four sections so we’ll go over each of them.</p>
<h3 id="tests-for-features-and-data">Tests for Features and Data</h3>
<ul>
<li>Distributions of each feature</li>
<li>Features are the same in both the training and serving stack</li>
<li>Relationship between different features and targets</li>
<li>Privacy control in model training</li>
<li>Cost of computing each feature</li>
<li>Does not contain features determined unsuitable for use</li>
<li>Time to add new features to production</li>
</ul>
<p>The points about expensive and redundant features are important, as they can affect the ability of the system to meet the desired latency and throughput requirements. One option for solving such issues is to pre-cache expensive features and use them at prediction time, but this can lead to a lot of redundant compute.</p>
<p>I liked the point about making sure we’re not using features determined unsuitable. In the context of ML fairness, we could potentially ban features such as gender.</p>
<h3 id="tests-for-model-development">Tests for Model Development</h3>
<ul>
<li>Model code goes through code review</li>
<li>Offline proxy metrics are measuring what will be A/B tested</li>
<li>Hyperparameter tuning</li>
<li>Effect of model staleness</li>
<li>Simple models as a baseline</li>
<li>Model performs well across different data slices</li>
<li>Test for implicit bias in the model or data</li>
</ul>
<p>This section touches on good design principles such as optimizing for the right metrics, measuring staleness and updating the model on time. The point about performing well on different data slices is especially relevant when the majority of a website’s data comes from English-speaking or developed countries.</p>
<h3 id="tests-for-ml-infrastructure">Tests for ML Infrastructure</h3>
<ul>
<li>Reproducibility of model training</li>
<li>Integration tests for the ML systems</li>
<li>Quality tests before deployment of the model</li>
<li>Ability to rollback deployed models</li>
<li>Testing via a canary process</li>
</ul>
<p>Most points in this list are easy to follow, but the one I have found hard in practice is quality tests. Recommendation systems are one example: since the output of the system may change from time to time, it is hard to write automated tests that measure quality. I’m interested in learning how others solve this problem.</p>
<h3 id="monitoring-tests-for-ml-systems">Monitoring Tests for ML Systems</h3>
<ul>
<li>Upstream instability in features, both in training and serving</li>
<li>Data invariants hold in training and serving inputs</li>
<li>Model staleness</li>
<li>Train/Test skew in features and inputs</li>
<li>Slow leak regression in latency, throughput etc.</li>
<li>Regression in prediction quality</li>
</ul>
<p>This was a fantastic list as it covers a lot of hidden problems in model serving. As models get larger it can get expensive to serve them or features get more expensive to compute. Useful tools could be monitoring success metrics as a time-series and seeing if we hit consistent performance. Another could be to always have a small A/B test running against the old / baseline model.</p>
<p>The paper touches on basic problems that you run into quite often but that are not talked about much in the ML community. I’m curious to see how the problems around feature engineering and model complexity evolve with the advent of deep learning models.</p>
<h3 id="references">References</h3>
<ul>
<li>[1] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2016). What’s your ML Test Score? A rubric for ML production systems. Reliable Machine Learning in the Wild - NIPS 2016 Workshop, (NIPS).</li>
</ul>
The Start-up of You2017-03-26T00:00:00+00:00/blog/2017/03/26/the-startup-of-you<p>I recently finished reading <a href="https://www.amazon.com/Start-up-You-Future-Yourself-Transform-ebook/dp/B0050DIWHU">The Start-up of You</a> by Reid Hoffman. The book provides an interesting framework to think about your career. Here are some nuggets from the book.</p>
<ul>
<li>
<p>When you start a company, you make decisions in an information-poor, time-compressed, resource-constrained environment. There are no guarantees or safety nets, so you take on a certain amount of risk.</p>
</li>
<li>
<p>You’re selling your brainpower, your skills, and your energy, and you face massive competition as well. Possible employers, partners and investors all choose between you and someone who looks like you. So one should be able to complete the sentence “A company hires me over others because …”. This doesn’t mean you need to be cheaper or faster than everyone, as in life there are multiple gold medals. You can’t be the best at everything, so determine the local niche in which you can develop a competitive advantage.</p>
</li>
</ul>
<blockquote>
<blockquote>
<p>Though we are optimistic, we must remain vigilant and maintain a sense of urgency - Jeff Bezos</p>
</blockquote>
</blockquote>
<ul>
<li>
<p>Your competitive advantage is formed by three different forces: your assets, your aspirations and the market realities. The best direction has you pursuing worthy aspirations, using your assets while navigating the market realities.</p>
</li>
<li>
<p>The environment will change, you’ll change, your allies and competitors will change. So it is unwise at any point in your life to pinpoint a single dream around which your existence revolves.</p>
</li>
<li>
<p>Prioritize offers that provide you the most learning potential. Practical knowledge is best developed by doing, launching and learning from the users. Make small bets and think two steps ahead about where the current action will take you.</p>
</li>
</ul>
<blockquote>
<blockquote>
<p>The fastest way to change yourself is to hang out with people who are already the way you want to be.</p>
</blockquote>
</blockquote>
<ul>
<li>
<p>Making a decision may reduce opportunities in the short run, but increases opportunities in the long run.</p>
</li>
<li>
<p>The long-term answer to risk is to build resilience: if you don’t find risk, risk will find you.</p>
</li>
<li>
<p>You achieve big success when you’re both contrarian and right.</p>
</li>
<li>
<p>How you gather, manage and use information will determine whether you win or lose.</p>
</li>
</ul>
<blockquote>
<blockquote>
<p>It is people who help you understand your assets, aspirations, and the market realities; it’s people who help you vet and get introduced to possible allies and trust connections; it’s people who help you track the risk attached to a given opportunity.</p>
</blockquote>
</blockquote>
<ul>
<li>
<p>When asking for career advice go in the following order: domain experts, people who know you well and then just really smart people.</p>
</li>
<li>
<p>When making a decision, ask wide questions to figure out the criteria you should be using; ask narrow questions to figure out what weight you should give to each.</p>
</li>
</ul>
So you built a Machine Learning model?2017-03-16T00:00:00+00:00/blog/2017/03/16/so-you-built-a-machine-learning-model<p>You have been working on a Machine Learning project. You collected data from various sources, built your model and got some preliminary results. You notice you are getting about 80% accuracy on your test set which is less than what you desire. Now what? How do you improve the model?</p>
<p>Should you get more data? Build a more complex model? Increase or decrease regularization? Add or remove features? Run more iterations of gradient descent? Maybe try all of them?</p>
<p>Recently I got this question from a friend, who said it seemed that improving models is just trial and error. This prompted me to write this post on how to make an informed decision about what you should work on first.</p>
<h2 id="bias-and-variance">Bias and Variance</h2>
<p>To build a more accurate model, we first need to understand the different sources of error in our model.</p>
<p><strong>Bias</strong>: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.</p>
<p><strong>Variance</strong>: The error due to variance is taken as the variability of a model prediction for a given data point.</p>
<p><img src="/images/blog/2017-03/bias-variance.png" alt="Bias Variance" /></p>
<h3 id="mathematical-definition">Mathematical definition</h3>
<p>We are trying to predict \(Y\) and our input is \(X\). Let’s assume there is a relationship relating one to the other such as \(Y = f(X) + \epsilon\) where the error term \(\epsilon\) is normally distributed with a mean of zero like so \(\epsilon \sim \mathcal{N}(0,\sigma_\epsilon)\).</p>
<p>We may estimate a model \(\hat{f}(X)\) of \(f(X)\) using linear regression or another modeling technique. Then, the expected squared prediction error at a point \(x\) is:</p>
\[Err(x) = E\left[(Y-\hat{f}(x))^2\right]\]
<p>The error can be split into bias and variance components:</p>
\[Err(x) = \left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] +\sigma_\epsilon^2\]
\[Err(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}\]
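<p>For reference, here is a sketch of where the three terms come from. It assumes the noise \(\epsilon\) is independent of the fitted model \(\hat{f}(x)\), and it uses \(\bar{f}(x) = E[\hat{f}(x)]\) as shorthand (my notation, not taken from the references). First the noise separates out, because \(E[\epsilon] = 0\) and \(E[\epsilon^2] = \sigma_\epsilon^2\):</p>
\[\begin{aligned}
Err(x) &= E\left[\left(f(x) + \epsilon - \hat{f}(x)\right)^2\right] \\
&= E\left[\left(f(x) - \hat{f}(x)\right)^2\right] + 2\,E[\epsilon]\,E\left[f(x) - \hat{f}(x)\right] + E[\epsilon^2] \\
&= E\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \sigma_\epsilon^2
\end{aligned}\]
<p>Then adding and subtracting \(\bar{f}(x)\) splits the remaining term, and the cross term vanishes because \(E[\bar{f}(x) - \hat{f}(x)] = 0\):</p>
\[\begin{aligned}
E\left[\left(f(x) - \hat{f}(x)\right)^2\right] &= \left(f(x) - \bar{f}(x)\right)^2 + 2\left(f(x) - \bar{f}(x)\right)E\left[\bar{f}(x) - \hat{f}(x)\right] + E\left[\left(\bar{f}(x) - \hat{f}(x)\right)^2\right] \\
&= \left(\bar{f}(x) - f(x)\right)^2 + E\left[\left(\hat{f}(x) - \bar{f}(x)\right)^2\right]
\end{aligned}\]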
<p><img src="/images/blog/2017-03/error.png" align="left" alt="Irreducible error" style="width: 40%; margin-left:5%; margin-right:5%; margin-top:20px; margin-bottom:20px;" /></p>
<p>The irreducible error is the noise term in the true relationship, which fundamentally cannot be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0. However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.</p>
<h2 id="what-are-learning-curves">What are Learning Curves?</h2>
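<p>Now we know about bias and variance and the tradeoff between the two, but the problem of how to improve our model still remains. What is our model suffering from: high bias or high variance? To answer this we need to plot learning curves for the model, i.e. the training error \(J_{train}(\Theta)\) and the cross-validation error \(J_{CV}(\Theta)\) as a function of training set size.</p>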
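<p>As a sketch of what the two errors are, assume the squared-error cost used in the Coursera class listed in the references. Here \(h_\Theta\) is the model fitted on the first \(m\) training examples and \(m_{CV}\) is the size of the cross-validation set; both symbols are my assumptions for illustration:</p>
\[J_{train}(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2\]
\[J_{CV}(\Theta) = \frac{1}{2m_{CV}}\sum_{i=1}^{m_{CV}}\left(h_\Theta(x_{CV}^{(i)}) - y_{CV}^{(i)}\right)^2\]
<p>In that setup both errors are computed without the regularization term, so the two curves can be compared directly as \(m\) grows.</p>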
<h3 id="high-bias">High Bias</h3>
<ul>
<li>Low training set size: \(J_{train}(\Theta)\) will be low and \(J_{CV}(\Theta)\) will be high</li>
<li>Large training set size: \(J_{train}(\Theta)\) and \(J_{CV}(\Theta)\) will be high with \(J_{train}(\Theta) \approx J_{CV}(\Theta)\)</li>
</ul>
<h3 id="high-variance">High Variance</h3>
<ul>
<li>Low training set size: \(J_{train}(\Theta)\) will be low and \(J_{CV}(\Theta)\) will be high</li>
<li>Large training set size: \(J_{train}(\Theta)\) increases with training set size and \(J_{CV}(\Theta)\) continues to decrease without leveling off. \(J_{train}(\Theta) < J_{CV}(\Theta)\) but the difference between them remains significant</li>
</ul>
<p><img src="/images/blog/2017-03/high_bias.png" align="left" alt="High bias learning curve" style="width: 48%; margin-top:20px; margin-bottom:50px;" />
<img src="/images/blog/2017-03/high_variance.png" align="right" alt="High variance learning curve" style="width: 48%; margin-top:20px; margin-bottom:50px;" /></p>
<h2 id="what-to-do-next">What to do next?</h2>
<p>Once we have figured out whether we have a bias problem or a variance problem, we can make an informed choice about what to work on next.</p>
<h3 id="high-bias-1">High Bias</h3>
<ul>
<li>Try more complex features, polynomial terms or adding more nodes</li>
<li>Decreasing the regularization parameter \(\lambda\)</li>
</ul>
<h3 id="high-variance-1">High Variance</h3>
<ul>
<li>Collecting more training data as that will help the model generalize better</li>
<li>Reducing the feature set size</li>
<li>Increasing the regularization parameter \(\lambda\)</li>
</ul>
<p><img src="/images/blog/2017-03/flowchart.png" alt="Next steps flow chart" style="width: 70%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:10px;" /></p>
<h2 id="what-if-i-have-an-ml-pipeline">What if I have an ML pipeline?</h2>
<p>Most machine learning systems are built using a chain of models, so it is fairly common to have an ML pipeline and want to figure out which part to work on next. Ceiling Analysis can be useful here.</p>
<p>For Ceiling Analysis, plug in a perfect version of each component of the pipeline, one at a time, and measure how much the complete pipeline improves. This gives us a sense of which component offers the highest bang for the buck.</p>
<p><img src="/images/blog/2017-03/pipeline.png" alt="Pipeline" /></p>
<p>Let’s say in the above character detection pipeline you observe that a perfect character segmentation system gives a 1% boost to the overall system while a perfect character recognition system will provide a 7% boost. So we should focus on improving the recognition system much more so than the segmentation model.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://www.coursera.org/learn/machine-learning">Machine Learning</a> class on Coursera</li>
<li><a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a></li>
<li><a href="https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738">Pattern Recognition and Machine Learning </a></li>
<li><a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Understanding the Bias-Variance Tradeoff</a></li>
</ul>
Running GUI applications using Docker for Mac2017-02-07T00:00:00+00:00/blog/2017/02/07/gui-applications-docker-mac<p>This is a short guide explaining how to run GUI applications from within Docker on Mac. It uses XQuartz as the X server on the host and sets the <code class="language-plaintext highlighter-rouge">DISPLAY</code> variable within the container so that applications draw to it.</p>
<h3 id="install-xquartz">Install XQuartz</h3>
<p>You can install XQuartz using homebrew with <code class="language-plaintext highlighter-rouge">brew cask install xquartz</code> (newer Homebrew versions use <code class="language-plaintext highlighter-rouge">brew install --cask</code> instead of <code class="language-plaintext highlighter-rouge">brew cask install</code>) or directly from the website <a href="https://www.xquartz.org/">here</a>. At the time of writing, I had <code class="language-plaintext highlighter-rouge">2.7.11</code> installed on my machine with OSX El Capitan. After installing XQuartz, restart your machine.</p>
<h3 id="install-docker-for-mac">Install Docker for Mac</h3>
<p>Install docker using <code class="language-plaintext highlighter-rouge">brew cask install docker</code> or directly from the website <a href="https://docs.docker.com/docker-for-mac/">here</a>.</p>
<h3 id="run-xquartz">Run XQuartz</h3>
<p>Start XQuartz from command line using <code class="language-plaintext highlighter-rouge">open -a XQuartz</code>. In the XQuartz preferences, go to the “Security” tab and make sure you’ve got “Allow connections from network clients” ticked:</p>
<p><img src="/images/blog/2017-02/xquartz_preferences.png" alt="XQuartz Preferences" style="width: 50%; margin-left:10%; margin-right:10%; margin-top:10px; margin-bottom:10px;" /></p>
<h3 id="host-machine-ip">Host Machine IP</h3>
<p><code class="language-plaintext highlighter-rouge">IP=$(ifconfig en0 | grep inet | awk '$1=="inet" {print $2}')</code> should set the <code class="language-plaintext highlighter-rouge">IP</code> variable to the IP address of your local machine. If you’re on wifi you may want to use <code class="language-plaintext highlighter-rouge">en1</code> instead of <code class="language-plaintext highlighter-rouge">en0</code>. Check the value of the variable using <code class="language-plaintext highlighter-rouge">echo $IP</code>.</p>
<p>Now add the IP to the X server’s access list with <code class="language-plaintext highlighter-rouge">xhost +$IP</code>. Note that there is no space after the <code class="language-plaintext highlighter-rouge">+</code>: a bare <code class="language-plaintext highlighter-rouge">+</code> argument turns access control off for every host. If the xhost command is not found, check <code class="language-plaintext highlighter-rouge">/usr/X11/bin/xhost</code>, as it might not be in your path.</p>
<h3 id="running-a-container">Running a container</h3>
<p>You can now try running firefox in your container with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run -d --name firefox -e DISPLAY=$IP:0 -v /tmp/.X11-unix:/tmp/.X11-unix jess/firefox
</code></pre></div></div>
<p>or run octave using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run -d --name octave -e DISPLAY=$IP:0 -v /tmp/.X11-unix:/tmp/.X11-unix openmicroscopy/octave
</code></pre></div></div>
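<p>The steps above can be collected into a few shell functions. This is a minimal sketch rather than a tested setup: the function names <code class="language-plaintext highlighter-rouge">host_ip</code>, <code class="language-plaintext highlighter-rouge">x11_display</code> and <code class="language-plaintext highlighter-rouge">run_gui</code> and the <code class="language-plaintext highlighter-rouge">IFACE</code> variable are names I made up for illustration, and it assumes XQuartz and Docker are already installed and that XQuartz is ready to accept connections by the time <code class="language-plaintext highlighter-rouge">xhost</code> runs.</p>

```sh
#!/bin/sh
# Sketch only: helper names and the IFACE variable are illustrative.

# Print the IPv4 address of a network interface (en0 by default).
host_ip() {
    ifconfig "${1:-en0}" | awk '$1 == "inet" { print $2; exit }'
}

# Build the DISPLAY value a container should use for a given host IP.
x11_display() {
    printf '%s:0\n' "$1"
}

# Start XQuartz, allow this host, and run a GUI image against it.
run_gui() {
    image="$1"
    ip="$(host_ip "${IFACE:-en0}")"
    if [ -z "$ip" ]; then
        echo "no IPv4 address found; try IFACE=en1" >&2
        return 1
    fi
    open -a XQuartz
    # Pass the address as a single +address argument to xhost.
    xhost +"$ip"
    docker run -d -e DISPLAY="$(x11_display "$ip")" \
        -v /tmp/.X11-unix:/tmp/.X11-unix "$image"
}
```

<p>After sourcing the file, <code class="language-plaintext highlighter-rouge">run_gui jess/firefox</code> would start the same firefox container as above, and <code class="language-plaintext highlighter-rouge">IFACE=en1 run_gui jess/firefox</code> would use the wifi interface instead.</p>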
Upsert Backups with rsync2016-05-22T00:00:00+00:00/blog/2016/05/22/upsert-backups-rsync<p>We all know the importance of taking backups. I have run into a few machine failures in the past and they always seem to occur when you least expect them. Currently, Mac comes with a backup program called Time Machine, which automatically backs up your disk to an external drive. But some of its features have issues that I would like to highlight:</p>
<ul>
<li>You need to get a separate external hard drive that can only be used for Time Machine. Although you can get around this by <a href="http://osxdaily.com/2013/05/01/use-single-hard-drive-time-machine-and-file-storage/">partitioning</a> the external drive into multiple partitions.</li>
<li>This external drive needs to be formatted as Mac OS Extended (Journaled), which makes it harder to use on Windows/Linux.</li>
<li>You cannot make a list of things you want to have backed up; you can only exclude folders from your complete hard disk.</li>
<li>Time Machine makes an exact copy of your hard drive.</li>
</ul>
<p>The points listed above are not huge blockers by any means and are perfectly fine choices on Time Machine’s part. Still, the last one had been a growing pain for me as SSDs are much smaller in size compared to HDDs unless you have a lot of $$ to shell out.</p>
<p>Let’s say you have a 1 TB external drive and your Mac has 128GB of disk space. You transfer 50 GB of music/videos/data to your external drive and delete it from your local disk to create space. The problem with this approach is that the next time you run a backup, those files would be deleted from the external disk also, as they are no longer present on your local disk. This can be really frustrating if you want the new data to be merged into your old copy on the backup drive instead of wiping out the old data.</p>
<p>The solution I ended up settling on was to use the <code class="language-plaintext highlighter-rouge">rsync</code> <a href="http://linux.die.net/man/1/rsync">command line utility</a> on Unix for backups. The basic syntax for using rsync is very simple: <code class="language-plaintext highlighter-rouge">rsync OPTIONS SOURCE DESTINATION</code>.</p>
<p>For example, to back up your Documents directory onto the external drive:</p>
<p><code class="language-plaintext highlighter-rouge">rsync -a --progress --exclude '*.DS_Store' ~/Documents /Volumes/Seagate/</code></p>
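<p>Here the first flag <code class="language-plaintext highlighter-rouge">-a</code> stands for archive mode: it copies directories recursively and preserves permissions, timestamps and symlinks. Since we don’t pass <code class="language-plaintext highlighter-rouge">--delete</code>, files that are already on the backup drive but no longer in the source are left alone, which is exactly the upsert behaviour we want. The second flag, <code class="language-plaintext highlighter-rouge">--progress</code>, gives us additional feedback while the job is running. The <code class="language-plaintext highlighter-rouge">--exclude</code> pattern filters out files that we don’t want transferred to the backup drive, such as the <code class="language-plaintext highlighter-rouge">.DS_Store</code> files that macOS creates in directories. Finally, we have the source and destination paths.</p>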
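<p>To see the merge behaviour in isolation, here is a small sketch that runs the same command against throwaway directories. The directory and file names are made up for the demo and everything lives under a temporary directory, so nothing on a real backup drive is touched.</p>

```sh
#!/bin/sh
# Sketch: rsync -a without --delete merges into the backup copy.
# All paths are hypothetical and live under a temp directory.

work="$(mktemp -d)"
mkdir -p "$work/Documents" "$work/backup"

# First backup: one file goes across.
echo "old notes" > "$work/Documents/old.txt"
rsync -a --exclude '*.DS_Store' "$work/Documents" "$work/backup/"

# Free up local space, then add something new before the next backup.
rm "$work/Documents/old.txt"
echo "new notes" > "$work/Documents/new.txt"
rsync -a --exclude '*.DS_Store' "$work/Documents" "$work/backup/"

# List what the backup side holds after the second run.
ls "$work/backup/Documents"
```

<p>The final listing shows both <code class="language-plaintext highlighter-rouge">new.txt</code> and <code class="language-plaintext highlighter-rouge">old.txt</code>: the file deleted locally survives on the backup side because <code class="language-plaintext highlighter-rouge">--delete</code> was never passed.</p>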
<p>I ended up creating a new function in my <code class="language-plaintext highlighter-rouge">~/.zshrc</code> (<code class="language-plaintext highlighter-rouge">~/.bashrc</code> if you use bash) to help with backing up the different directories I wanted. It can be invoked directly by typing <code class="language-plaintext highlighter-rouge">backupDisk</code> in the terminal.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>backupDisk<span class="o">()</span> <span class="o">{</span>
rsync <span class="nt">-a</span> <span class="nt">--progress</span> <span class="nt">--exclude</span> <span class="s1">'*.DS_Store'</span> ~/Acads /Volumes/Seagate/
rsync <span class="nt">-a</span> <span class="nt">--progress</span> <span class="nt">--exclude</span> <span class="s1">'*.DS_Store'</span> ~/Music /Volumes/Seagate/
rsync <span class="nt">-a</span> <span class="nt">--progress</span> <span class="nt">--exclude</span> <span class="s1">'*.DS_Store'</span> ~/Photos /Volumes/Seagate/
rsync <span class="nt">-a</span> <span class="nt">--progress</span> <span class="nt">--exclude</span> <span class="s1">'*.DS_Store'</span> ~/Videos /Volumes/Seagate/
<span class="o">}</span>
</code></pre></div></div>
<p>PS: I started using this trick in school when online storage was way more expensive compared to external drives. You can also use one of the cloud storage providers as an alternative. I use Google Drive for most of my documents.</p>
Redshift SSD Benchmarks2014-12-20T00:00:00+00:00/blog/2014/12/20/redshift-ssd-benchmarks<p>Our warehouse runs completely on Redshift, and query performance is extremely important to us. Earlier this year, the AWS team announced the release of SSD instances for Amazon Redshift. Is the extra CPU truly worth it? We do a lot of processing with Redshift, so this question is big for us. To answer this, we decided to benchmark SSD performance and compare it to our original HDD performance.</p>
<p>Redshift is easy to use because its PostgreSQL JDBC drivers allow us to use a range of familiar SQL clients. Its speedy performance is achieved through columnar storage and data compression.</p>
<h2 id="experiment-setup">Experiment Setup</h2>
<p>The prices listed below are on-demand rates; reserved instances can be as much as 75% cheaper. The benchmark results are the mean run times over 3 runs of each query.</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: left">HDD Setup 1</th>
<th style="text-align: left">HDD Setup 2</th>
<th style="text-align: left">SSD Setup 1</th>
<th style="text-align: left">SSD Setup 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Nodes</td>
<td style="text-align: left">4 dw1.xlarge</td>
<td style="text-align: left">8 dw1.xlarge</td>
<td style="text-align: left">32 dw2.large</td>
<td style="text-align: left">4 dw2.8xlarge</td>
</tr>
<tr>
<td>Storage</td>
<td style="text-align: left">8 TB</td>
<td style="text-align: left">16 TB</td>
<td style="text-align: left">5.12 TB</td>
<td style="text-align: left">10.24 TB</td>
</tr>
<tr>
<td>Memory</td>
<td style="text-align: left">60 GB</td>
<td style="text-align: left">120 GB</td>
<td style="text-align: left">480 GB</td>
<td style="text-align: left">976 GB</td>
</tr>
<tr>
<td>vCPU</td>
<td style="text-align: left">8</td>
<td style="text-align: left">16</td>
<td style="text-align: left">64</td>
<td style="text-align: left">128</td>
</tr>
<tr>
<td>Price ($/hr)</td>
<td style="text-align: left">3.4</td>
<td style="text-align: left">6.8</td>
<td style="text-align: left">8.0</td>
<td style="text-align: left">19.2</td>
</tr>
</tbody>
</table>
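<p>The averaging over repeated runs mentioned above can be sketched roughly as follows. This is an illustrative Python helper, not the harness that was actually used for the benchmark; <code class="language-plaintext highlighter-rouge">run_query</code> is a hypothetical callable that executes one query against the cluster and returns when it finishes.</p>

```python
import statistics
import time


def mean_runtime(run_query, repeats=3):
    """Call run_query `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: runs the query and blocks until it completes
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

<p>Taking the mean of a few runs smooths out one-off effects such as a cold cache, although three runs is still a small sample.</p>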
<h3 id="query-1">Query 1.</h3>
<p>First, we ran a simple join query between a table with 1 billion rows and a table with 50 million rows. The total amount of data processed was around 46 GB. The results favoured the SSDs.</p>
<p><img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/1a.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;" /></p>
<h3 id="query-2">Query 2.</h3>
<p>This complex query features REGEX matching and aggregate functions across 1 million rows from 4 joins. The total amount of data processed was around 100 GB. The results favoured the SSDs even more strongly, with a performance improvement of 5x to 15x.</p>
<p><img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/2.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;" /></p>
<h3 id="query-3">Query 3.</h3>
<p>A query that runs window functions on a table of 1 billion rows showed surprising results. The total amount of data in this table is about 400 GB. Although both SSD setups performed better than the HDDs, the smaller SSD nodes outperformed the bigger ones, even though the bigger setup has double the memory and CPU power.</p>
<p><img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/3.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;" /></p>
<h3 id="query-4">Query 4.</h3>
<p>This last query has 4 join statements with a subquery that also includes 2 joins. The amount of data processed is around 107 GB. Since this query is very compute-heavy, it is not surprising that the SSDs perform 10x better. What is surprising is that the smaller SSD nodes are once again faster than the bigger ones.</p>
<p><img src="https://dnsta5v53r71w.cloudfront.net/images/redshift-ssd-benchmark/4a.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>We also ran some other queries, and the performance improvement from HDD to SSD was consistently about 5 to 10 times. From these experiments, the DW2 machines are clearly promising in terms of computation time. For the same price, SSDs provide 3.4 times more CPU power and memory. However, the disk storage is about 25% of that of the HDDs.</p>
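<p>The per-dollar figures can be checked from the setup table. The sketch below is a small Python calculation that assumes the comparison is between the 8-node dw1.xlarge cluster and the 32-node dw2.large cluster; the 4-node dw1.xlarge cluster gives the same ratios because both HDD clusters use the same node type. The function name is made up for illustration.</p>

```python
# Figures copied from the setup table: 8 dw1.xlarge (HDD) and 32 dw2.large (SSD).
hdd = {"price": 6.8, "vcpu": 16, "memory_gb": 120, "storage_tb": 16}
ssd = {"price": 8.0, "vcpu": 64, "memory_gb": 480, "storage_tb": 5.12}


def per_dollar_ratio(metric):
    """How many times more of `metric` one dollar per hour buys on SSD than on HDD."""
    return (ssd[metric] / ssd["price"]) / (hdd[metric] / hdd["price"])


for metric in ("vcpu", "memory_gb", "storage_tb"):
    print(f"{metric}: {per_dollar_ratio(metric):.2f}x")
```

<p>This prints 3.40x for both vCPU and memory and 0.27x for storage, which is in line with the 3.4 times and roughly 25% quoted above.</p>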
<p>A limitation of the dw2.large SSD instances is that a Redshift cluster can support at most 32 of them. That means a dw2.large cluster can provide at most 5.12 TB of disk storage. The only other option is to upgrade to dw2.8xlarge nodes, but this experiment shows little performance benefit in moving from dw2.large to dw2.8xlarge despite doubling the memory and CPU.</p>
<p><i><small>PS: This was originally written by Jason Shao on the <a href="https://tech.coursera.org/blog/2014/12/19/redshift-benchmark/">Coursera blog</a>.</small></i></p>
Pycon 2014 - Montreal2014-04-20T00:00:00+00:00/blog/2014/04/20/pycon-2014---montreal<p>I’ve always loved developing in Python, especially after working on QSTK and at Lucena and Coursera. This year I was finally able to make it to PyCon in Montreal, Canada with the help of the Python Software Foundation and Coursera.</p>
<p>It was great talking to so many other people who share the same love for the language that I do. It was a great learning experience and a gentle reminder that I still have a long way to go. Talking to Guido was the highlight of the conference for me.</p>
<p>I was with a few co-workers from Coursera as company sponsors. We use a fair amount of Python (Django, Fabric, IPython, and the scientific Python stack) at work. We were excited to show off our Python courses and let others know about the challenging engineering work we get to do out here in Mountain View, California.</p>
<p>As an engineer working here in the Bay Area I don’t often get to see the true global impact of my work, so hearing stories from students from all different backgrounds was rewarding. We were extremely humbled by how many people were seeking out education on the Coursera platform and pushing themselves to succeed.</p>
<p>In between my shifts at the Coursera booth, I tried to attend as many of the great talks at the conference as I could. A few that I was able to attend were Graham Dumpleton’s talk on <a href="http://pyvideo.org/video/2617/advanced-methods-for-creat">Advanced methods of creating decorators</a>, Tres Seaver’s talk <a href="http://pyvideo.org/video/2626/by-your-bootstraps-porting-your-application-to-p">By Your Bootstraps: Porting Your Application to Python3</a>, and <a href="http://pyvideo.org/video/2659/its-dangerous-to-go-alone-battling-the-invisibl">It’s Dangerous to Go Alone: Battling the Invisible Monsters of Tech</a> by Julie Pagano.</p>
<p>The shoutout from Jessica McKellar during the <a href="http://pyvideo.org/video/2684/keynote-jessica-mckellar">keynote</a> was really the cherry-on-top, solidifying the work that I do here at Coursera and the things we can give back as a company to the Python community at large.</p>
<p>See you all at PyCon 2015!</p>
Mac OSX Setup Guide2014-04-20T00:00:00+00:00/blog/2014/04/20/mac-os-x-setup-guide<p>I recently updated the Mac OS X setup guide I maintain to an interactive <a href="https://www.gitbook.com/">Gitbook</a>. You can reference it <a href="/mac-setup">here (Mac OS X Setup Guide)</a>.</p>
<p><img src="/images/blog/2014-04/mac-gitbook.png" alt="Screenshot" style="width: 80%; margin-left:10%; margin-right:10%; margin-top:20px; margin-bottom:20px;" /></p>
<p>This book covers the basics of setting up a development environment on a new MacBook for most major languages. All instructions have been tried on Mountain Lion and Mavericks, but they might lean more towards Mavericks. Whether you are an experienced programmer or not, this book is intended for everyone to use as a reference when installing a language or library.</p>
<p>We will set up Node (JavaScript), Python, C++, and Ruby environments. Even if you don’t program in all four, it is good to have them, as many command-line tools use one of them. We also install a few daily-use applications and LaTeX. As you read and follow these steps, feel free to send me any feedback or comments you may have.</p>
<p>All contributions to the book are welcome. Please help add support for other libraries and languages.</p>
Fix Value Error: unknown locale: UTF 82014-03-31T00:00:00+00:00/blog/2014/03/31/fix-valueerror-unknown-locale-utf-8<p>Today I was trying to install AWS CLI and got this error when running the help command.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>raise ValueError, <span class="s1">'unknown locale: %s'</span> % localename
ValueError: unknown locale: UTF-8
</code></pre></div></div>
<p>The solution was to add some environment variables to my zsh environment. So I added this to my <code class="language-plaintext highlighter-rouge">env-config</code> files.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">LANG</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_COLLATE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_CTYPE</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_MESSAGES</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_MONETARY</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_NUMERIC</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_TIME</span><span class="o">=</span><span class="s2">"en_US.UTF-8"</span>
<span class="nb">export </span><span class="nv">LC_ALL</span><span class="o">=</span>
</code></pre></div></div>
<p>To test the fix, run:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-c</span> <span class="s1">'import locale; print(locale.getdefaultlocale());'</span>
</code></pre></div></div>
<p>And you should see <code class="language-plaintext highlighter-rouge">('en_US', 'UTF-8')</code> as the output.</p>
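<p>If the output is still wrong, it can help to check which of the variables actually reached the current shell session. Here is a small Python sketch for that; the helper name is made up for illustration, and it assumes you want <code class="language-plaintext highlighter-rouge">en_US.UTF-8</code> for every variable in the export list above.</p>

```python
import os

# The variables exported above (LC_ALL is deliberately left empty there).
LOCALE_VARS = ["LANG", "LC_COLLATE", "LC_CTYPE", "LC_MESSAGES",
               "LC_MONETARY", "LC_NUMERIC", "LC_TIME"]


def missing_locale_vars(environ=os.environ, expected="en_US.UTF-8"):
    """Return the variables that are unset or not equal to `expected`."""
    return [name for name in LOCALE_VARS if environ.get(name) != expected]


if __name__ == "__main__":
    missing = missing_locale_vars()
    print("all set" if not missing else "not set: " + ", ".join(missing))
```

<p>Anything it lists usually means the config file was not sourced in the current terminal, so open a new one and try again.</p>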
<p>References for the post:</p>
<ul>
<li><a href="http://patrick.arminio.info/blog/2012/02/fix-valueerror-unknown-locale-utf8/">Patrick Arminio’s blog post</a></li>
<li><a href="http://stackoverflow.com/questions/19961239/pelican-3-3-pelican-quickstart-error-valueerror-unknown-locale-utf-8">Stackoverflow</a></li>
</ul>
Guide to make a website like this2014-02-09T00:00:00+00:00/blog/2014/02/09/Guide-to-make-a-website-like-this<p>This is my first blog post since I redesigned my website to be hosted using <a href="http://jekyllrb.com/">Jekyll</a>. I have been using <a href="http://pages.github.com/">GitHub Pages</a> for quite some time now to host my portfolio, but today I decided to revamp the whole thing and create a blog as well.</p>
<p>Now that I have just created this website and don’t know what to write about, I might as well document how I created it. You can get the source code for the website at <a href="https://github.com/sb2nov/sb2nov.github.io">https://github.com/sb2nov/sb2nov.github.io</a>.</p>
<p><strong>Step 1</strong> : Create a new GitHub repository <code class="language-plaintext highlighter-rouge">username.github.io</code> in your account.</p>
<p><strong>Step 2</strong> : Install Jekyll using the command <code class="language-plaintext highlighter-rouge">gem install jekyll</code>.</p>
<p><strong>Step 3</strong> : Clone Jekyll Bootstrap and point the remote at the repository you just created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/plusjade/jekyll-bootstrap.git portfolio
cd portfolio
git remote set-url origin git@github.com:USERNAME/USERNAME.github.io.git
git push origin master
</code></pre></div></div>
<p><strong>Step 4</strong> : Change and modify your theme using <a href="http://getbootstrap.com/">Twitter Bootstrap</a>.</p>
<p><strong>Step 5</strong> : Create the pages you want using <code class="language-plaintext highlighter-rouge">rake page name="pages/about.html"</code>, and create the other pages in the same way.</p>
<p><strong>Step 6</strong> : Create your first post using <code class="language-plaintext highlighter-rouge">rake post title="Hello World"</code>.</p>
<p><strong>Step 7</strong> : Remove the default example posts using <code class="language-plaintext highlighter-rouge">rm -rf _posts/core-examples</code>.</p>
<p><strong>Step 8</strong> : Edit <code class="language-plaintext highlighter-rouge">index.md</code> to suit your needs.</p>
<p><strong>Step 9</strong> : Change the default template as needed; it is located in <code class="language-plaintext highlighter-rouge">_includes/themes/bootstrap/defaults.html</code>. I made some style changes and removed the buttons and the navbar at the side to get my current layout. I also added links to the pages I had created in <code class="language-plaintext highlighter-rouge">step 5</code>.</p>
<p><strong>Step 10</strong> : Push the source code to the GitHub repository using <code class="language-plaintext highlighter-rouge">git push</code>, and GitHub will automatically render it.</p>
<p><strong>Step 11</strong> : If you find anything wrong in this guide, please let me know about it.</p>