<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Indie ML]]></title><description><![CDATA[How to learn machine learning online, research outside of Academia, how to monetize your ML skill, and become a better ML practitioner. ]]></description><link>https://www.emilwallner.com</link><image><url>https://substackcdn.com/image/fetch/$s_!aZv8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61943b02-c7d0-4f9f-961b-0cd37808bd1b_400x400.png</url><title>Indie ML</title><link>https://www.emilwallner.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:16:04 GMT</lastBuildDate><atom:link href="https://www.emilwallner.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Emil Wallner]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[emilwallner@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[emilwallner@substack.com]]></itunes:email><itunes:name><![CDATA[Emil Wallner]]></itunes:name></itunes:owner><itunes:author><![CDATA[Emil Wallner]]></itunes:author><googleplay:owner><![CDATA[emilwallner@substack.com]]></googleplay:owner><googleplay:email><![CDATA[emilwallner@substack.com]]></googleplay:email><googleplay:author><![CDATA[Emil Wallner]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[No ML Degree: How to Build an AI Portfolio]]></title><description><![CDATA[Think like an employer and stand out in the job market by: - Selecting the best resources to learn programming and machine learning skills - Identifying the types of projects that employers value most - 
Bringing unique ideas to the table - Developing habits that help you make progress on your projects - Creating a framework to evaluate and improve your projects - Finding and contacting companies that value real-world skills - Preparing for practical interviews]]></description><link>https://www.emilwallner.com/p/no-ml-degree-how-to-build-an-ai-portfolio-921</link><guid isPermaLink="false">https://www.emilwallner.com/p/no-ml-degree-how-to-build-an-ai-portfolio-921</guid><dc:creator><![CDATA[Emil Wallner]]></dc:creator><pubDate>Fri, 16 Aug 2024 21:07:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147827616/913701ea0d80e72ca899eb4bfdacb439.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>Think like an employer and stand out in the job market by:</strong> - Selecting the best resources to learn programming and machine learning skills - Identifying the types of projects that employers value most - Bringing unique ideas to the table - Developing habits that help you make progress on your projects - Creating a framework to evaluate and improve your projects - Finding and contacting companies that value real-world skills - Preparing for practical interviews</p><p>This guide is for self-learners, but it's also crucial for degree holders looking to strengthen their resumes with portfolio projects.</p>]]></content:encoded></item><item><title><![CDATA[No ML Degree]]></title><description><![CDATA[How to Land Your First Machine Learning Job Without a Degree]]></description><link>https://www.emilwallner.com/p/no-ml-degree</link><guid isPermaLink="false">https://www.emilwallner.com/p/no-ml-degree</guid><dc:creator><![CDATA[Emil Wallner]]></dc:creator><pubDate>Tue, 24 May 2022 04:30:34 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!gnnY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gnnY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gnnY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 424w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 848w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 1272w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gnnY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png" width="1319" height="981" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:981,&quot;width&quot;:1319,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gnnY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 424w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 848w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 1272w, https://substackcdn.com/image/fetch/$s_!gnnY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5a85e6f6-3a63-4e42-b738-00b07e03a335_1319x981.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p><strong>&#127911; Audio Version</strong>: <a href="https://open.spotify.com/episode/75rRRq8j6OAGyBdwLMLNmC?si=xg5sBUdmSQ6QG3_EX0Wa8g">Listen on Spotify</a> or <a href="https://drive.google.com/file/d/1-YMLsWleoCh6TJLLQRiFmyCF9pLc84yU/view?usp=sharing">Download the Audio version here</a> (1 Hour)</p><p><strong>Autodidacts focus on practical skills that ML professionals use every day. Once they have a core skillset and a micro portfolio, they apply to niche job opportunities that suit their no-degree background. </strong></p><p>It&#8217;s all about momentum. 
</p><p>Self-learners often approach their careers in stages. First, they build a portfolio and get a foot in the door. Later, many make additional sprints to achieve career-specific goals. They do these sprints full-time, part-time, or through internal transitions and upskilling.</p><p>Here&#8217;s what those transitions can look like:</p><ul><li><p>Learning software engineering &#8594; A tech internship at a small company&nbsp;</p></li><li><p>Building an ML portfolio &#8594; An entry-level ML role at a small company</p></li><li><p>Part-time ML portfolio building &#8594; A junior ML role at a mid-sized company</p></li><li><p>FAANG/Unicorn company interview prep &#8594; ML engineer at a known tech company</p></li></ul><p>This guide is for self-learners looking for their first ML job. But it&#8217;s also valuable for recent graduates and ML practitioners who want to stay up-to-date as ML evolves.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.fr/No-ML-Degree-Machine-Learning-ebook/dp/B0B1XFF1F8&quot;,&quot;text&quot;:&quot;Buy this e-book on Amazon&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.amazon.fr/No-ML-Degree-Machine-Learning-ebook/dp/B0B1XFF1F8"><span>Buy this e-book on Amazon</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://emilwallner.gumroad.com/l/no-ml-degree&quot;,&quot;text&quot;:&quot;Buy this e-book on Gumroad&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://emilwallner.gumroad.com/l/no-ml-degree"><span>Buy this e-book on Gumroad</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!sZRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sZRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 424w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 848w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 1272w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sZRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png" width="1456" height="1312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1312,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:343412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sZRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 424w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 848w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 1272w, https://substackcdn.com/image/fetch/$s_!sZRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6abbd03-f516-4c28-abfb-9b75221cb36e_1494x1346.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3>Old way: learn learn learn learn learn do    </h3><h3>New way: learn do learn do learn do  </h3><h3>- <a href="https://twitter.com/wes_kao/status/1422951212920496133">Wes Kao</a></h3><div><hr></div><h3>The Recipe to Fail</h3><p>I reckon effective self-learning mimics what ML professionals do on a daily basis. Yet, many self-learners do the opposite: they get stuck in long lists of online courses. </p><p>After gaining a handful of certificates, they run out of motivation.</p><p>They end up applying to popular entry-level positions, and when no interviews come, their confidence drops. They feel overwhelmed by <strong>everything to learn</strong> and have little hope of fixing it.</p><p>Here&#8217;s the thing. </p><p>There is little credential value in online certificates.</p><p>Online courses can provide structured learning resources but have a marginal impact on employment attractiveness. 
Most assessments have all answers available online, so there is little risk or consequence for cheating. </p><p>The same is true for most portfolios: many are copy-pasted or lightly tweaked versions of existing projects. It&#8217;s hard to tell the difference between real and low-effort projects. </p><p>For companies, it&#8217;s too risky to advance candidates without evidence that they are hireable. </p><p>After building a weak ML resume, many self-learners apply to well-known companies. High salaries and status seduce them. Viral Medium posts by autodidacts who land high-status jobs are often valuable, but they give the wrong impression. </p><blockquote><p>Early-career high-status jobs without degrees are exceptions that are highly context-dependent. </p></blockquote><p>Unfortunately, many self-learners don't know how to find niche job opportunities that suit their no-degree background. </p><p>Instead, popular positions are flooded with applications, and many applicants are complacent. With one-click applications and remote jobs, it&#8217;s easy to apply to hundreds of companies. Known tech companies receive a hundred to several thousand applicants per position. In these resume lotteries, university graduates come out on top. </p><p>Many self-learners don&#8217;t land any interviews.</p><p>At this stage, it&#8217;s tough to get back on track. Self-learners have spent their savings on living costs, they have little motivation left, and the pressure to create an income increases. </p><p>It&#8217;s hard to know what to do next. </p><p>Many ML experts recommend several years&#8217; worth of online courses, reflecting the depth of their own education and careers. Although it comes with good intent, this often sets unrealistic expectations of what it takes to enter the ML field.  
</p><p>At this point, self-learners don&#8217;t know if they should trust the academic camp that emphasizes calculus, algebra, statistics, and probability; the industry camp that argues for MLOps, pipelines, SQL, Git, and Kaggle; or the interview-hacking camp that believes you should master LeetCode, work through Cracking the Coding Interview, and memorize the first part of Goodfellow's Deep Learning book.</p><p>Self-learners often drop out at this stage.</p><p>Many never realize there are <strong>far better ways</strong> to land an ML job.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://twitter.com/intent/user?screen_name=EmilWallner&quot;,&quot;text&quot;:&quot;Follow Emil Wallner on Twitter&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://twitter.com/intent/user?screen_name=EmilWallner"><span>Follow Emil Wallner on Twitter</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.emilwallner.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.emilwallner.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h3>Becoming Hireable </h3><blockquote><p>Many self-learners have a crucial misunderstanding: knowledge is not the same as evidence of being hireable.</p></blockquote><p>This relationship was first studied by <a href="https://www.jstor.org/stable/1937919">Solon and Hungerford</a> in the late 80s.</p><p>For example, if a university student drops out a few weeks before graduating, employers don&#8217;t see it as 98% of a degree. Last-minute dropouts have almost the same knowledge as graduates but are only a fraction as hireable as someone who stayed a few more weeks. 
</p><p>While online courses are excellent learning resources, they seldom make a candidate more attractive to an employer. A self-learner armed only with online courses is in the same position as the last-minute dropout: they have the knowledge but not the hiring credibility. </p><blockquote><p>If you don&#8217;t earn a traditional degree for your learning, you need to use your knowledge to compete for employment attractiveness.</p></blockquote><p>Self-learners have to compete for brand recognition, time, attention, or money.</p><p>When ML professionals validate your work, they are risking their time and reputation. If someone pays you for work, they are risking their money. If a conference publishes your paper, they are risking their brand.</p><blockquote><p>When you compete for something scarce, you need trust, hard work, and talent. That&#8217;s what employers look for. </p></blockquote><p>The hard part of the ML self-learning path is not how to gain knowledge but how to create industry credibility.</p><p>A portfolio is a collection of evidence that ML professionals and institutions have traded something scarce for your talent and hard work. That portfolio could include work experience, open-source contributions, or beating established benchmarks. 
They show that you have obtained something rare and valuable.</p><h1>&#187; Get the full e-book on Gumroad or Amazon &#171;</h1><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://emilwallner.gumroad.com/l/no-ml-degree&quot;,&quot;text&quot;:&quot;Buy this e-book on Gumroad&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://emilwallner.gumroad.com/l/no-ml-degree"><span>Buy this e-book on Gumroad</span></a></p><h1>Part One &#8594; Programming</h1><h3>Start with Programming</h3><p>Software is the majority of modern machine learning.</p><blockquote><p>Although many are eager to head straight into ML, it&#8217;s often far wiser and more practical to enter tech with software engineering. </p></blockquote><p>Software development has far more entry-level positions and clearer learning paths, and it&#8217;s faster and easier to learn. <a href="https://insights.stackoverflow.com/survey/2021">20%</a> of professional developers are self-taught; in comparison, <a href="https://storage.googleapis.com/kaggle-media/surveys/Kaggle's%20State%20of%20Machine%20Learning%20and%20Data%20Science%202021.pdf">4.1%</a> of the employed data scientists on Kaggle are self-taught.</p><p>There are boot camps and other options to learn data science. However, it can be overwhelming to try to learn both programming and machine learning at the same time. 
It's usually better to get some experience with software engineering first, and then learn ML.</p><p>That said, data science boot camps can be helpful for data-adjacent roles that require less technical expertise, such as analytics, product management, support, and business roles. </p><h3>No-degree Tech Schools and Online Courses</h3><p>If you are still in high school, just curious, or already have a STEM degree, online coding courses are good options, such as <a href="https://www.codecademy.com/">Codecademy</a>, <a href="https://scrimba.com/">Scrimba</a>, and <a href="https://www.freecodecamp.org/">freeCodeCamp</a>. Online courses require the most motivation and offer the least support in landing jobs. Yet, they are cheap, fun, and easy to access.</p><blockquote><p>I learned programming via the <a href="https://42.fr/en/network-42/">42 network</a>. Imo, it&#8217;s still the best option if you are looking for a college-like experience for self-learners. </p></blockquote><p>42 schools don&#8217;t require high school diplomas. It&#8217;s free, portfolio-based, and peer-to-peer without any teachers. You can land a job within six months, or keep studying for four years. A good paid alternative to 42 is <a href="https://www.holbertonschool.com/">Holberton School.</a></p><p>Being part of a network is not just great for building life-long friendships and making learning more enjoyable, but it&#8217;s also important for referrals.</p><p>Self-learning schools are a great alternative for many who don&#8217;t jell well with traditional schools. There are also sound <a href="https://www.deansforimpact.org/wp-content/uploads/2016/12/The_Science_of_Learning.pdf">education theories</a> that support self-driven learning.</p><h3>Boot Camps</h3><p>The third alternative is traditional boot camps, either online or campus-based. 
They are often practical but offer a more structured, school-like learning experience.</p><p>If you are looking for more technical roles or don&#8217;t have any work experience, it&#8217;s worth going for boot camps that are at least six months long. <a href="https://www.bloomtech.com/">Lambda School/BloomTech</a> is worth considering.</p><p>Traditional three-month boot camps often offer the shortest transition into tech. I reckon they are well suited for people who already have a career in another field and are looking for lighter technical roles. </p><p>On average, tech boot camps are <a href="https://www.coursereport.com/coding-bootcamp-ultimate-guide#university">14 weeks and cost $14k</a>. This is not enough time to become an outstanding coder, but it&#8217;s enough to land a web development job and continue the path of becoming one. <a href="https://www.coursereport.com/coding-bootcamp-ultimate-guide#university">79%</a> of boot camp graduates land a tech job, and the average salary is <a href="https://www.coursereport.com/coding-bootcamp-ultimate-guide#university">$69k</a>. Many boot camps hire their graduates as instructors, so take the numbers with a grain of salt.</p><p>The curricula tend to be similar, so I&#8217;d opt for a camp with a <a href="https://www.switchup.org/">high ranking.</a></p><h3>Computer Science</h3><blockquote><p>&#8220;But a lot of the work of a day-to-day MLE or data scientist is around putting things in production, or managing things, or scaling things, or figuring out how to grab stuff off of ten different databases and clean it and put it somewhere. 
That&#8217;s the kind of stuff I&#8217;m really looking for.&#8221; - <a href="https://dshiring.com/">Chris Albon</a>, Wikipedia</p></blockquote><p><a href="https://www.gartner.com/document/4014200">90%</a> of today's models are trained and deployed on servers.</p><p>Most of the work is focused on making the data, training, and production process faster by improving efficiency and organization.</p><p>Where I studied, we learned the C programming language in depth. We started by developing our own standard C library and then reimplemented many classic algorithms and core system programs. C sits at a useful abstraction layer: from there, I can dabble in both lower-level and higher-level languages.</p><blockquote><p>Imo, a practical computer science curriculum with a focus on projects and programming is a solid base. Specializations include security, DevOps, back-end, and graphics.</p></blockquote><p>Once you move to ML, you&#8217;ll be mostly working in Python, so it&#8217;s worth dabbling in that too. </p><h3>Front-end and Mobile</h3><p>Mobile and front-end roles are less common entries into ML. However, there are significant cost, latency, and privacy benefits to running ML models on personal computers and phones.</p><p>Client inference and optimization are a valid entry point into ML. Although only 10% of ML inference happens on the client today, according to <a href="https://www.gartner.com/document/4014200">Gartner</a>, this is expected to increase to 50% by 2025.</p><p>The major shifts on the client side are human-in-the-loop, prompt engineering, and active learning. Creating smaller intermediate models, workflows, and programs to interact with server-side models is important. It&#8217;s also worth looking into Neural Radiance Fields and browser rendering. </p><p>On the tech side, I&#8217;d especially look into TensorFlow.js, ONNX.js, and Eigen (C++) compiled to WebAssembly, and experiment with the newly developed PyScript. 
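</p><p>To make the size-and-latency trade-off behind client-side inference concrete, here is a minimal sketch of symmetric int8 post-training quantization. It&#8217;s my own toy illustration (random weights, one global scale factor), not code from any of the libraries above:</p>

```python
import numpy as np

# Toy sketch: symmetric post-training int8 quantization.
# Shipping weights at a quarter of their float32 size is one reason
# client-side inference on phones and in browsers is practical.
def quantize_int8(w):
    """Map float32 weights onto int8 with a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0  # [-max, max] -> [-127, 127]
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("size ratio:", w.nbytes / q.nbytes)            # 4x smaller
print("max error:", float(np.abs(w - w_hat).max()))  # at most scale / 2
```

<p>The int8 copy is 4x smaller, and the round-trip error is bounded by half the quantization step, which is why small models often survive this kind of compression with little accuracy loss.</p><p>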
Full-stack roles fall between the two. They are a valid option, although I&#8217;d specialize in either the back-end or the front-end to avoid being spread too thin. </p><p>Regardless of which programming path you choose, I&#8217;d aim for at least six months to two years of study and work experience to build a good foundation. </p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ym0f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ym0f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ym0f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:456,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ym0f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!Ym0f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d47a945-f45c-4a8e-bc52-19cf126c5868_1200x326.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Two &#8594; Machine Learning</h1><h3><strong>Learning Machine Learning</strong></h3><p>From informal chats with other self-learners and looking at how long they keep studying, an average self-taught student has around three months of motivation before giving up.</p><p>You can&#8217;t learn everything and need to make strategic tradeoffs. At heart there are two challenges:</p><ol><li><p>Having a good enough resume to get interviews</p></li><li><p>Passing the interviews and getting an offer</p></li></ol><p>Many autodidacts focus on knowledge that&#8217;s useful for interviews and wing their resumes. I&#8217;d argue for the opposite. </p><p>There is little point in being good at ML interviews if people don&#8217;t invite you to interviews. Also, you&#8217;ll have a higher chance of keeping the job and advancing if you have useful skills on day one.</p><p>It&#8217;s too overwhelming to both create a strong resume and competitive interviewing skills. Many university students have 5-9 years&#8217; worth of theory training and do months&#8217; worth of interview preparation. It&#8217;s unreasonable to both have a strong no-degree resume and also be competitive in theory-heavy interviews. </p><p>There are enough ML opportunities that have practical interviews and light theory requirements. It's best to focus on those gigs and maintain momentum. Work experience will give you a significant boost in later job hunts.</p><p>So, the goals for those three months are to:</p><ul><li><p>Learn data-centric problem-solving tools</p></li><li><p>Identify, scope, communicate and solve problems</p></li><li><p>Build a portfolio with externally validated results</p></li><li><p>Gain a light overview of ML and statistics</p></li></ul><p>Machine learning boot camps can work if you have significant programming experience. 
Many companies are also looking for strong programmers and offer on-the-job ML training.</p><p>However, ML is more competitive than software, and a strong portfolio carries more weight than a boot camp graduation. Boot camps can also be inconvenient and expensive. There are also cheaper online boot camps such as <a href="https://www.udacity.com/nanodegree">Nanodegrees</a>. </p><p>Other options are cloud-specific ML <a href="https://www.techtarget.com/searchcloudcomputing/tip/3-popular-machine-learning-certifications-to-get-in-2022">certificates</a> such as GCP, Azure, and AWS. The tests have around 60 multiple-choice questions, and they feel too easy to game to be valuable in general.</p><p>But some find them useful, especially for customer-centric <a href="https://t.co/7qWSqDHZcr">roles</a> at specific cloud providers. Cloud providers also track and incentivize partner companies to hire candidates with their certificates.</p><h3>Practical ML Courses</h3><p>Pick a practical ML course and study it for one month.</p><p>You might think that one month of ML is not enough to build your projects. But it is. You&#8217;ll be rusty and need to check things frequently, but you&#8217;ll have enough to start solving data problems.</p><p>Look for instructors who are doing competitive industry work today. Current ML knowledge cycles are only <a href="https://mobile.twitter.com/karpathy/status/1468370605229547522">a few years</a> long, and best practices improve each year. </p><blockquote><p>Solid practical ML courses for coders include <a href="https://course.fast.ai/">FastAI</a> and <a href="https://www.kaggle.com/thirty-days-of-ml-assignments">Kaggle&#8217;s 30 Days of ML</a>. </p></blockquote><p>Many find recent practical courses messy. That&#8217;s true.</p><p>The pace is fast and there are a lot of tools, mixing off-the-shelf library calls with dabbling in the source code, context switching, and debugging. 
But that also reflects reality, especially the industry-level knowledge that people want to pay you for.</p><p>It&#8217;s less important whether you use a paid solution or an open-source library, write models from scratch, or know all the theory. What matters is spotting potential risks and weaknesses in your solutions and learning how to mitigate them. That&#8217;s what modern ML practice is about.</p><p>The things you want to learn include:</p><p><strong>Problem-solving</strong></p><ul><li><p>The types of problems that machine learning can and cannot solve</p></li><li><p>Knowing when to use paid APIs, open-source, or custom solutions</p></li><li><p>Basic awareness of how your model impacts a business, including privacy, UI/UX, legal, ethics, and the business model</p></li><li><p>Communicating expectations and timelines to technical and non-technical stakeholders</p></li><li><p>How and when to mitigate risk from your inexperience</p></li></ul><p><strong>Data</strong></p><ul><li><p>Understanding what data is available to you, and how to get more</p></li><li><p>Extracting, visualizing, cleaning, and loading data</p></li><li><p>Understanding the data and making informed decisions based on it</p></li></ul><p><strong>Models</strong></p><ul><li><p>Understanding the type of problem and how to find a solution</p></li><li><p>Setting and measuring appropriate objectives and success criteria</p></li><li><p>Quickly reaching a baseline model</p></li><li><p>Training models with state-of-the-art results</p></li><li><p>Fast and efficient debugging</p></li><li><p>Visualizing model performance</p></li><li><p>Deploying models and understanding memory, cost, queries-per-second, and latency</p></li></ul><p>By all means, skim online courses, and look up things on YouTube and Google, but after your month-long practical course, your focus should be 90% on your portfolio.</p><p>That honeymoon phase goes fast, and you need a resume.</p><h3>Breadth, Credibility, and Edge</h3><p>Imo, deep learning is the most 
exciting area and has the most future potential. I rarely use classic machine learning approaches, although they are common in the industry and often used in interviews.</p><p>In the evenings, it&#8217;s worth exploring StatQuest&#8217;s <a href="https://www.youtube.com/watch?v=qBigTkBLU6g&amp;list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9">the Basics (of statistics)</a> and <a href="https://www.youtube.com/watch?v=Gv9_4yMHFhI&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF">Machine Learning</a> playlists. You can use a flashcard app like <a href="https://apps.ankiweb.net/">Anki</a> to memorize the key concepts in these videos. </p><p>You have two types of portfolio projects: </p><ul><li><p>Degree-equivalent projects: 1-3 month result-driven projects that give you credibility</p></li><li><p>Talent projects: 1-4 week open-ended projects that make you stand out</p></li></ul><p>Degree-equivalent projects are what make employers invite you to interviews. They give you evidence that you can do the job. Talent projects are great both for marketing yourself and for standing out in the interview process.</p><p>However, if you only have shiny talent projects, many employers will doubt that you can do the daily grunt work to deliver on projects. 
</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png" width="398" height="108.12333333333333" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2AG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Three &#8594; A Base Portfolio</h1><h3>Weak Portfolio Projects </h3><p>A typical weak portfolio project is listing toy problems such as MNIST, Titanic, and Iris on your resume. 
For many employers, that is an instant rejection. They are considered school projects and don&#8217;t require talent or endurance to solve.</p><p>Some ML projects have neither a positive nor a negative impact on your resume. They are comparable to blank portfolio items. This might sound harsh, but these are often the most common ML projects. </p><p>These portfolio items are often too hard to evaluate or lack results. For example, a stock prediction app, a GAN to generate artwork, reinforcement learning applied to a game, or a cancer prediction model. Many recruiters will see 5-10 people with the same portfolio projects on any given day.</p><p>These could be great projects, but they lack enough information to inform a recruiter. </p><p>Self-learners are often naive. They don&#8217;t have any experience competing against candidates who lack integrity. Many fake their portfolios. They clone a project, change a few things, fake a git commit history, and write documentation. Some even study the source code to answer questions about it. </p><blockquote><p>Self-learners need to differentiate themselves from fake and low-effort portfolios.</p></blockquote><h3>Degree Equivalent Portfolio Projects</h3><p>A non-expert recruiter needs <strong>hard evidence</strong> that you didn&#8217;t copy-paste your projects. And even if you have impressive results, they need to be validated by someone else. Otherwise, you could have made trivial mistakes, made them up, or plagiarized them.</p><p>It&#8217;s your responsibility to create evidence equal to a degree. </p><p>A result-based portfolio project means that you achieved something objectively good and it&#8217;s easy to understand. 
You have three options when it comes to clear result-based portfolio items:</p><ul><li><p>A <a href="https://twitter.com/tunguz/status/1524089355089891331">high-ranking</a> score in an ML competition</p></li><li><p><a href="https://www.firsttimersonly.com/">A contribution</a> to a popular ML open-source project</p></li><li><p><a href="https://andreas-madsen.medium.com/becoming-an-independent-researcher-and-getting-published-in-iclr-with-spotlight-c93ef0b39b8b">A published paper</a>/workshop paper (mostly relevant for transitioning STEM researchers)</p></li></ul><p>These are hard to achieve for your first portfolio, but it&#8217;s worth knowing what to aim for. There are three more result-based portfolio items, but they require effort for recruiters to understand. Thus, they are less valuable than the previous three categories.</p><ul><li><p>An ML project with real users (ideally a deployed model with a UI) </p></li><li><p>An industry-specific solution with a mentor who provides a testimonial </p></li><li><p>ML content marketing with high engagement such as blogging, podcasts, and videos (developer advocacy roles)</p></li></ul><h3>High-effort Projects</h3><p>Some of the highest-value portfolio items are first-author papers published in machine learning conferences such as NeurIPS, ICLR, and ICML, top entries in large Kaggle competitions sponsored by known companies, and open-source contributions to popular libraries such as scikit-learn, TensorFlow, NumPy, or PyTorch.&nbsp;</p><p>These are easy to understand and act as a high-effort signal. But they are hard to achieve given the timeframe self-learners have.</p><p>More achievable results would be workshop papers at NeurIPS, ICML, and ICLR or a published paper in any <a href="https://scholar.google.es/citations?view_op=top_venues&amp;hl=en&amp;vq=eng_artificialintelligence">other ML conference</a>. I made a contribution to the <a href="http://www.aiartonline.com/design/emil-wallner/">creative workshop</a> at NeurIPS. 
Although I made a light contribution, having someone select the work and participate in the community is valuable. However, it&#8217;s more reasonable for someone transitioning from another research field, say Physics, which is common.</p><p>Smaller ML competitions are great portfolio projects. That could be more niche competitions on Kaggle, <a href="https://numer.ai/">Numerai</a>, ML conference competitions, or company competitions. Imo, it&#8217;s better to have a top ranking in a small competition than to be average in a large competition. </p><p>The third core opportunity is open-source contributions to up-and-coming projects. Also, it&#8217;s often the best way to collaborate and get to know people who work in ML. </p><blockquote><p>I&#8217;d highly consider contributing to one of the following open-source projects: <a href="https://ffcv.io/">FFCV</a>, <a href="https://www.eleuther.ai/">EleutherAI</a>, <a href="https://github.com/huggingface">Hugging Face</a>, <a href="https://www.pytorchlightning.ai/">Pytorch Lightning</a>, <a href="https://laion.ai/#top">LAION</a>, <a href="https://github.com/replicate">Replicate</a>, <a href="https://github.com/rwightman/pytorch-image-models">timm</a>, <a href="https://github.com/qubvel/segmentation_models">Segmentation Models</a>, <a href="https://gym.openai.com/">OpenAI Gym</a>, <a href="https://albumentations.ai/">Albumentations</a>, <a href="https://github.com/arogozhnikov/einops">einops</a>, <a href="https://github.com/microsoft/onnxjs">ONNX JS</a>, <a href="https://github.com/google/flax">FLAX</a>, and the <a href="https://github.com/fastai">FastAI Library</a>. </p></blockquote><p>These projects are done by some of the most talented people in the industry, and they are often looking for people to help. They might even have solid first issues listed in their GitHub repos. You can find more OS <a href="https://github.com/ml-tooling/best-of-ml-python">projects here</a>. 
</p><p>These portfolio items will require a lot of hard and focused work. But these systems are meritocratic. Good work will be recognized. </p><h3>Industry Portfolio Projects</h3><p>Working with someone in the industry to solve a real problem and having them write a testimonial is a safe and solid portfolio project. However, unless you have someone who vouches for your solution, it&#8217;s not a result-based project.</p><p>The hardest part is finding someone to work with.</p><p>One path is to email ten to twenty ML engineers at startups you respect. Ask them for industry problems with accessible data that you can tackle. A good place to find prospects is Twitter: look for people with fewer than 10k followers and a blog.</p><p>Email example:</p><p><code>Title: Industry ML problems</code></p><p><code>Hi Jane,  </code></p><p><code>I&#8217;m self-studying deep learning [Link to github] and I&#8217;m looking for problems I can tackle for my portfolio.</code></p><p><code>Given your interesting work on Twitter&#8217;s recommendation system [link to their blog], I thought you could have exposure to other unique industry problems. </code></p><p><code>I&#8217;m thinking of using Twitter&#8217;s API to do an NLP analysis to detect the percentage of bots on Twitter. Is that a good entry-level problem to tackle, or can you think of something else?</code></p><p><code>Cheers, Bob </code></p><p>Keep it short, indicate that you have done your homework, and give them an easy way out. If you translate their feedback into results, they&#8217;ll be happy to keep helping you.</p><p>When you have worked hard on the problem and have a great result, you can ping them again and ask for feedback on the project. ML engineers can point you to a good starting place, scope the project, help you when you get stuck, and potentially hire or recommend you later. 
</p><p>You can also approach people that post <a href="https://www.upwork.com/freelance-jobs/data-science/">data-related freelance</a> projects on freelancer marketplaces and look at sites that post <a href="https://www.kdnuggets.com/2020/12/data-science-volunteering.html">pro-bono data projects</a>.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Ruq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Ruq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!0Ruq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!0Ruq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!0Ruq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0Ruq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png" width="514" height="139.63666666666666" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:514,&quot;bytes&quot;:31038,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Ruq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!0Ruq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!0Ruq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0Ruq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F654b6259-16e1-4328-bc09-56fcb72b6154_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Four &#8594; Talent Projects</h1><h3><strong>Short expressive projects</strong></h3><p>To get an interview, you need a result-driven project with external validation. Companies want evidence that you are reliable and can do the job. This is very much like the value of a degree.</p><p>To deliver on result-driven projects, you often need to use well-established solutions to defined problems. In comparison, talent projects don&#8217;t carry the same burden of rigor. Instead, the focus is on novelty and storytelling. It's all about expressing yourself.</p><p>Here are a few characteristics of talent projects:</p><ul><li><p>It takes 1-4 weeks</p></li><li><p>It explores something novel</p></li><li><p>The result is a demo, blog post, or a visual</p></li></ul><p>Talent projects can be used in different scenarios:</p><ul><li><p>Stand out in interviews</p></li><li><p>Indicate you have passion for a particular topic</p></li><li><p>Personal branding and marketing yourself</p></li><li><p>Art and creative projects</p></li><li><p>Create a developer advocacy skillset</p></li></ul><p>Also, it&#8217;s worth noting that if you have, say, a Master's in another STEM subject or have done something else impressive, this can give you enough tailwind to do one or two shorter talent projects and still be competitive when looking for an ML job.</p><h3>The Risk of Open-ended Projects</h3><p>Talent projects are risky. You have to define a problem, find a solution, communicate the project, and market the project. They are more idea-driven and require more taste and skillsets outside of machine learning.</p><p>Many things go wrong. 
According to Gartner, <a href="https://www.infoworld.com/article/3639028/why-ai-investments-fail-to-deliver.html">85%</a> of ML projects fail.</p><p>The model you base your work on might not work well in practice; this is very common. It often takes a few attempts to find a model to work with. Even if it runs, there can be lots of issues with the model, and it can be hard to put it into production. </p><p>But even if you get a result, it could be too complex, you can&#8217;t communicate it, few care about your problem, or it&#8217;s already been done.</p><p>Talent projects are usually the most popular type of project among learners. They are often more flexible and fun, and have less accountability. However, since talent projects are hard to execute and popular among students, they also introduce the most noise. As a result, it is difficult to compare students who have completed open-ended projects.</p><h3>Interview Edge Projects</h3><p>Getting an interview is one problem, but getting the final offer is the second problem. The final selection often correlates with how passionate candidates are about <strong>their specific problems</strong> and their company.</p><p>All candidates will claim they are interested in a company, but there is nothing better than proof: a short portfolio project targeting a specific company or industry. This is far more convincing than words.</p><p>That&#8217;s where talent projects come in. Think of them as the icing on the cake. I mean, what&#8217;s a carrot cake without that thick layer of cream cheese frosting? 
</p><p>Here are a few examples of X-factor projects:</p><ul><li><p><a href="https://twitter.com/rice_fry/status/1482117604399480833">How to impress Karpathy and Elon in two weeks</a></p></li><li><p><a href="https://github.com/lucidrains?tab=repositories">How to turn novel papers into prototypes</a></p></li><li><p><a href="https://towardsdatascience.com/the-cold-start-problem-how-to-build-your-machine-learning-portfolio-6718b4ae83e9">How to collect data in the wild and create IRL demos</a></p></li></ul><p>At heart, an interview edge project shows that you are up to date with the latest developments, curious, and interested in your potential employer&#8217;s type of problems.</p><h3>Developer Advocacy Projects</h3><p>There are a lot of ML roles that focus on raising awareness of a product or service. That could be anything from blogging and podcasts to videos. Developer advocacy roles focus on engagement such as views, signups, likes, etc. These roles are often fun and creative and can also work as a transition into more technical roles.</p><p>As a rough ballpark for combined direct views per content piece (not including views on, say, Twitter or Instagram):</p><ul><li><p>Okay: 5k </p></li><li><p>Good: 25k </p></li><li><p>Excellent: 100k+</p></li></ul><p>The internet is meritocratic, so if you make an excellent contribution, communicate it, and share it, it has a high chance of generating decent user metrics. The external excitement is an indicator that you made a unique contribution. Without it, it's hard to tell if you made anything at all. 
</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8xlY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8xlY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8xlY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png" width="450" height="122.25" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:17886,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8xlY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!8xlY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5cc89a86-78ad-45ba-a745-33fe6a7a3ac2_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Five &#8594; Self Evaluation </h1><h3><strong>Metric-based Portfolio Items</strong></h3><p>Let&#8217;s look at an example of a weak portfolio item:</p><ul><li><p>A skin cancer 
classification model</p></li></ul><p>In reality, students will add jargon and different technologies.&nbsp;</p><p><em>Say, </em> &#8216;<em>A skin cancer classification model in PyTorch, cleaning the data and using a CNN with transfer learning, using Weights &amp; Biases for metric tracking, and so on.</em>&#8217;</p><p>This describes what they did. Many believe that adding more technical terms or steps will make it sound more impressive. It doesn&#8217;t.&nbsp;</p><p>It doesn&#8217;t have a result, we don&#8217;t know the context, and you wrote it, so you could have made everything up. The project could have been a significant effort or copy-pasted. Self-learners feel like it was the former, but an employer will assume it&#8217;s the latter.&nbsp;</p><p>Let&#8217;s improve it a little.</p><ul><li><p>A skin cancer classification model with 90% accuracy</p></li></ul><p>Now we have a metric, but without context and some objectivity, it doesn&#8217;t mean anything. 90% accurate on what? How many examples? Was it an established benchmark? Did you use a test set? Did you copy-paste an existing solution? </p><p>Another iteration:</p><ul><li><p>A skin cancer classification model with 90% accuracy on Benchmark X with a previous SOTA of 85%</p></li></ul><p>This almost makes the cut. </p><blockquote><p>A result has three components: a metric or testimonial, a context, and third-party validation.</p></blockquote><p>It has a metric and a context, but it lacks objectivity. SOTA means state of the art: having the best current solution to a defined problem.</p><p>This indicates that you solved a problem. Since your result is better than the existing solution, it reduces the chance that you copy-pasted a solution. However, it still raises doubts: maybe someone else on your team did all the work, the benchmark is weak, or your results are wrong. The objectivity is lacking. 
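</p><p>To make questions like &#8220;how many examples?&#8221; and &#8220;did you use a test set?&#8221; concrete, here is a toy, purely illustrative sketch (the labels and predictions are made up) of reporting a metric together with the sample size it was computed on:</p>

```python
# Toy illustration: a metric only means something with context,
# so report the sample size alongside the score.
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the held-out labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up held-out labels and model predictions
labels      = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(labels, predictions)
print(f"{score:.0%} accuracy on {len(labels)} held-out examples")
# -> 80% accuracy on 10 held-out examples
```

<p>A score over ten examples is obviously weak evidence; the same number over an established benchmark&#8217;s test set is not. The context is what turns the metric into a result. 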
</p><p>Let&#8217;s improve it.</p><ul><li><p>A skin cancer classification model with 90% accuracy on Benchmark X with a previous SOTA of 85%. Published in Machine Learning Conference X as the first author.&nbsp;</p></li></ul><p>A similar result but in a different context:</p><ul><li><p>A skin cancer classification model with 90% accuracy on Benchmark X on Machine Learning Competition X by Company Y. Solo participant placing 120/2450, in the top 5%. The top solution had 91% accuracy.&nbsp;</p></li></ul><p>Now we have a solid portfolio item. When a third party validates a result, it reduces the doubt that there was a mistake in your work or that you made it up. Many employers have neither the time nor the expertise to validate results, so they depend on external validation. Also, first author or sole participant means that you are responsible for the main result.&nbsp;</p><h3><strong>Testimonial-based Portfolio Items</strong></h3><p>Here&#8217;s another example starting with a weak portfolio item:</p><ul><li><p>An implementation of the LAMB optimizer&nbsp;</p></li></ul><p>Same story as last time. There are many LAMB optimizers online, and you could have copy-pasted it. If you don&#8217;t know what LAMB is, don&#8217;t worry about it.&nbsp;</p><ul><li><p>An implementation of the LAMB optimizer and a blog post about it [link]</p></li></ul><p>Again, it could be a &#8216;copy&#8217; project, but making the effort to write about it creates more context and increases the likelihood that it&#8217;s an actual project. But there is enough content online about LAMB optimizers to put something together without understanding it.&nbsp;</p><ul><li><p>A released open-source contribution to PyTorch, the LAMB optimizer [link], and a blog post [link].</p></li></ul><p>This makes the cut for a result. A contribution to a popular framework makes it clear that you did the work.
You had to understand the framework and what its maintainers need, learn it well enough to improve it, and pass a technical review for your submission.</p><p>To spice it up, you can also request a quote from someone on the PyTorch team (maintained by Meta, formerly Facebook), so:&nbsp;</p><ul><li><p>&#8220;X made a fast and well-documented implementation of the LAMB optimizer in PyTorch.&#8221;, Employee X at Meta. [endorsement link], [commit link] and a blog post [link]</p></li></ul><p>Having the endorsement public on Twitter, LinkedIn, or GitHub makes it verifiable. </p><p>A non-technical person might not know what LAMB or PyTorch is. Having a quote validates that it&#8217;s a real contribution and not, say, a pull request for a typo. The human element is also a nice touch.</p><h3>Product Portfolio Items</h3><p>Again, let&#8217;s look at an example, starting with a weak portfolio item:</p><ul><li><p>A super-resolution model</p></li></ul><p>No result, no context, no third-party validation.</p><ul><li><p>A super-resolution model with a Google Colab UI</p></li></ul><p>Adding a UI makes it much easier for a non-technical person to understand the value of the project, so that&#8217;s a good start.&nbsp;Yet, most recruiters are too lazy to run your notebook.</p><ul><li><p>A super-resolution model in production and a live JavaScript UI</p></li></ul><p>This is a significant improvement. Running models in production is a big part of ML, and a recruiter can now test it in a few seconds without any technical expertise. </p><ul><li><p>A super-resolution model in production and a live UI. [link] Optimized deployment taking the original RAM footprint from 1 GB to 250 MB, and the CPU inference latency from 4 seconds to 30 ms. [Google Colab benchmark link]</p></li></ul><p>This is an above-average project. RAM footprint and inference speed are crucial for both the user experience and the cost. These are both high priorities for an employer.
It indicates that you are both street-smart and result-oriented. However, a cynical recruiter will assume you cloned the project from GitHub. It&#8217;s missing external validation.</p><p>For this project to have edge and marketing potential, the super-resolution model has to be recently published at an ML conference or custom-made, and not yet accessible through a UI. </p><p>That&#8217;s what makes it newsworthy and helps it gain traction online.&nbsp;</p><ul><li><p>A super-resolution model in production and a live UI. [link] Optimized deployment taking the original RAM footprint from 1 GB to 150 MB, and the CPU inference latency from 4 seconds to 30 ms. [Google Colab benchmark link]. 100 weekly users [Stats screenshot], 250 stars on GitHub [link], seen on Hacker News [link], and recommended by X at Famous Company [link to tweet].</p></li></ul><p>These snippets can be added to your resume, your LinkedIn, and your blog posts. Open-ended projects need a cluster of external validation since the core result is not as clear as in the other categories of portfolio projects.</p><p>Oh, as a bonus, it&#8217;s extra valuable if you deploy it on a scalable back-end on any of the large cloud providers such as GCP, AWS, or Azure. Show evidence that it supports at least 100 QPS.
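</p><p>Producing numbers like the latency figures above is straightforward. A minimal sketch of such a benchmark, where <code>model_inference</code> is a hypothetical placeholder for your real model&#8217;s forward pass:</p>

```python
import statistics
import time

def model_inference(x):
    # Hypothetical placeholder for the real model's forward pass.
    return sum(v * v for v in x)

sample = [float(i) for i in range(10_000)]

# Warm up first, then time repeated runs -- single measurements are noisy.
for _ in range(5):
    model_inference(sample)

latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    model_inference(sample)
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50 = statistics.median(latencies_ms)
p95 = sorted(latencies_ms)[int(len(latencies_ms) * 0.95)]
print(f"p50: {p50:.2f} ms, p95: {p95:.2f} ms")
print(f"Single-worker throughput ceiling: ~{1000 / p50:.0f} QPS")
```

<p>Reporting p50 and p95 rather than a single run, and translating latency into an approximate QPS ceiling, maps the benchmark directly onto the claims an employer cares about.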
</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c020!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c020!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!c020!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!c020!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!c020!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c020!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png" width="412" height="111.92666666666666" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:412,&quot;bytes&quot;:31324,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!c020!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!c020!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!c020!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!c020!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F985cecaf-f618-435d-9e5f-0909f606e5dc_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Six &#8594; Ideas</h1><h3>Base Portfolio Ideas</h3><p>There are two types of ideas: beginner ideas and experience-based ideas. 
Novice ideas feel like eureka moments, and experience-based ideas are more like a sigh of relief after grinding a problem for months or years. </p><p>We take novice ideas too personally. By the time the idea-intoxication wears off, we&#8217;ve wasted months on something doomed to fail on day one.</p><blockquote><p>Until we have a few years of experience, it&#8217;s often better to rely on gut ideas as little as possible. </p></blockquote><p>Good base portfolio ideas are validated problems. </p><p>For example:</p><ul><li><p>GitHub issues in open-source projects</p></li><li><p>Machine learning <a href="https://www.tableau.com/about/blog/2019/3/7-data-competitions-data-scientists-and-analysts-104111">competitions</a></p></li><li><p>Asking ML practitioners what problems they are facing</p></li><li><p>Browsing ML project proposals on <a href="https://www.upwork.com/freelance-jobs/data-science/">freelancer websites</a></p></li></ul><p>For your ML portfolio base, you want to translate hard work into outcomes with as few risks as possible. </p><h3>Talent Project Ideas</h3><p>To develop newsworthy projects, you need to take risks.</p><p>In general, it&#8217;s better to overdeliver on a tiny real problem than to make a vague attempt on an ambitious problem. You want the story to be: I solved X. It worked. Versus: I did X, Y, Z, P, and S. It led to something. I think. </p><blockquote><p>Successful talent projects lead to short and clear stories.</p></blockquote><p>As novices, we want to rely on our intuition as little as possible. The secret of good ideas is to improve an already interesting idea &#8212; polish diamonds.</p><p>Finding exciting ideas to build on is all about quantity. In my experience, novices are better at comparing projects than at case-by-case evaluation.
Thus, we are better off ranking 30 potential project areas than deciding if a few specific ideas have merit.</p><p>Most beginners get exposed to ideas in the same way: popular Twitter accounts, HackerNews, Reddit, press, Arxiv Sanity, popular newsletters, YouTube channels, podcasts, and top ML conferences.</p><p>If the idea has appeared in any mainstream channel, it&#8217;s often overexploited, especially if it appears in several mainstream outlets.</p><h3>Sourcing Ideas</h3><p>Here are a few solid starting points:</p><ul><li><p>Top ML conferences from 1987 to 2007 </p></li><li><p>Stanford&#8217;s <a href="https://nlp.stanford.edu/courses/cs224n/">CS224n</a>, <a href="https://stanford-cs329s.github.io/">CS329S</a> &amp; <a href="http://cs231n.stanford.edu/index.html">CS231n projects</a></p></li><li><p>FastAI <a href="https://twitter.com/EmilWallner/status/1522207033931403264">student projects</a></p></li><li><p>Twitter <a href="https://twitter.com/EmilWallner/following">likes</a> &amp; GitHub <a href="https://github.com/emilwallner?tab=following">stars</a></p></li><li><p>Creative projects <a href="https://twitter.com/ak92501">like AK</a>, <a href="https://mlart.co/">ML x Art</a>, <a href="http://www.aiartonline.com/">NeurIPS gallery</a></p></li><li><p>Edge devices and <a href="https://twitter.com/EdgeImpulse">hardware projects</a></p></li><li><p>Kaggle <a href="https://www.kaggle.com/search?q=1st+place">Kernels</a></p></li><li><p>Top 15-40% of papers on <a href="https://arxiv-sanity-lite.com/">Arxiv Sanity</a></p></li></ul><p>Aim for first-author papers or projects made by people early in their careers. If a person made a project early in their career they often had less experience and access to compute. It increases the odds that you can contribute to the project.</p><p>New projects made by senior ML practitioners are often too complex and compute-heavy to build on. 
There are exceptions, but it can be hard for a beginner to tell.</p><h3>Ranking Ideas</h3><p>I try to have at least 20-30 project ideas before ranking them.</p><p>There are a handful of things you want to consider:</p><ul><li><p>Can you impress a non-technical person in less than 30 seconds?</p></li><li><p>Can you find a quick way to run the model, e.g. Colab or open source?</p></li><li><p>Do you have enough compute and knowledge for the project?</p></li><li><p>Is there a clear angle to improve it?</p></li><li><p>Does it <strong>really</strong> excite you?</p></li></ul><p>There is no clear way to rank them. It often depends on what you will use the project for, say your resume, marketing yourself, or getting a developer advocacy role.</p><p>My first two projects:</p><ul><li><p><a href="https://blog.floydhub.com/colorizing-b-w-photos-with-neural-networks/">Colorizing Black and White Photos</a> | <a href="https://arxiv.org/abs/1712.03400">Student paper inspiration</a> </p></li><li><p><a href="https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/">Turning Design Mockups Into Code</a> | <a href="https://arxiv.org/abs/1705.07962">Student paper inspiration </a></p></li></ul><p>Once you have the final five projects, you often want to pick the one that excites you most. This is when your gut can be helpful. Also, the sooner you can run the code and evaluate the model, the more you reduce the risk. If you can&#8217;t create a baseline within the first week, you want to move on to something else.</p><h3>Promoting Projects</h3><p>Imagine your reader as an average, distracted social media user. Make it simple. Having something highly visual or a few-click online demo is vital if you want people to share it. </p><p>Spreading good vibes on Twitter and having a good reply game will help you build a small audience. That will be crucial to getting your content some action on Twitter. The same is true for other communities.
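</p><p>There is no single correct formula for ranking, but forcing a side-by-side comparison helps. One rough sketch, where the ideas and 0-5 scores against the criteria above are entirely made up for illustration:</p>

```python
# Hypothetical ideas scored 0-5 on: wow factor, easy to run,
# feasible with your compute/knowledge, clear angle, excitement.
ideas = {
    "colorize photos":  [5, 4, 4, 3, 5],
    "mockup-to-code":   [5, 3, 3, 4, 4],
    "tabular pipeline": [2, 5, 5, 2, 2],
}

# Simple unweighted sum: the point is the forced comparison,
# not the exact formula.
ranked = sorted(ideas.items(), key=lambda kv: sum(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{sum(scores):>2}  {name}")
```

<p>Scoring 20-30 ideas this way takes minutes and surfaces the top handful worth investigating further.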
</p><p>If you use a specific tech, say <a href="https://keras-slack-autojoin.herokuapp.com/">Keras</a>, <a href="https://join.slack.com/t/ffcv-workspace/shared_invite/zt-11olgvyfl-dfFerPxlm6WtmlgdMuw_2A">FFCV</a>, <a href="https://discord.com/invite/mVcgxMPD7e">LAION</a>, <a href="https://discord.com/invite/xnpeRdg">FastAI</a>, etc., you can post it in their Slack channels and Discord groups. Larger products like TensorFlow also have dedicated marketing teams that <a href="https://developers.google.com/community/experts">can help you</a>. They often promote articles on their social channels. For example, here&#8217;s a promo Google did on <a href="https://www.youtube.com/watch?v=xKPk7tG2upc&amp;ab_channel=Google">my first ML project</a>. </p><p>You can also cross-post it to, say, <a href="https://dev.to/">Dev.to</a>, Medium publications such as <a href="https://towardsdatascience.com/">Towards Data Science</a>, <a href="https://substack.com/">Substack</a>, <a href="https://www.freecodecamp.org/news/tag/blog/">freeCodeCamp</a>, and <a href="https://hackernoon.com/">HackerNoon</a>, and pitch tech media outlets such as <a href="https://thenextweb.com/news/everything-you-need-to-know-about-the-tnw-contributor-program">TheNextWeb</a>. Publishing platforms often promote the content themselves.</p><p>Otherwise, Reddit&#8217;s r/learnmachinelearning and r/MachineLearning are good places to share your projects, as well as other programming-related subreddits, <a href="https://www.lesswrong.com/">LessWrong</a>, <a href="https://news.ycombinator.com/submit">Hacker News</a>, the <a href="https://forums.fast.ai/">FastAI forum</a>, and <a href="https://www.producthunt.com/">Product Hunt</a>. TikTok and YouTube are also worth testing.
If you get some action, GitHub&#8217;s trending section will also start pulling an audience.</p><p>If you have a model, add it to all the model platforms, such as <a href="https://huggingface.co/spaces">Huggingface spaces</a>, <a href="https://replicate.com/">Replicate</a>, <a href="https://modelplace.ai/">Modelplace</a>, and <a href="https://runwayml.com/hosted-models/">Runway models</a>.</p><p>A lot of the traffic will come from Google, so it&#8217;s worth looking at Google&#8217;s Keyword tool, other mainstream <a href="https://moz.com/">SEO tools</a>, Google&#8217;s trending topics, and Youtube&#8217;s Keyword tool. </p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yg3X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yg3X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!yg3X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!yg3X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yg3X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yg3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png" width="402" height="109.21" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:402,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yg3X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!yg3X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!yg3X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 
1272w, https://substackcdn.com/image/fetch/$s_!yg3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39fc5ff9-e2c5-43a9-b8dc-d1e84521f5ec_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Seven &#8594; Workflow</h1><h3>High-effort Focus</h3><p>Watching a YouTube video, running a few cells in a notebook, or copying a snippet of code for an assignment requires low effort. It doesn&#8217;t require much willpower or clarity of mind.</p><p>However, as soon as you start with a blank page and try to build your first project, you need high-effort focus. It&#8217;s more mentally straining, and creating and maintaining that focus takes work.</p><p>To build high-effort focus, have a look at <a href="https://jamesclear.com/atomic-habits">Atomic Habits</a> and <a href="https://www.calnewport.com/books/deep-work/">Deep Work</a>. </p><p>80% of it is having good sleep, exercise, and food routines. Sleep at least 8 hours, ideally 9 hours, in a quiet, dark, and cool environment. Get at least 20 minutes of exercise per day to elevate your heart rate, and eat healthy food that does not spike your blood sugar. </p><p>It&#8217;s the foundation for high-effort focus.</p><h3>Tiny Habits</h3><p>My favorite concept from Atomic Habits is <a href="https://www.npr.org/2020/02/25/809256398/tiny-habits-are-the-key-to-behavioral-change?t=1642959179725">tiny habits</a>. For example, for your first solo project on day one, start your working day with a blank notebook. Stare at it for 1 minute. That&#8217;s it. Celebrate and take the day off!</p><blockquote><p>It might seem odd to celebrate doing nothing. But far too many stay in the comfort zone of low-effort learning.</p></blockquote><p>On day two, aim to write one line. The following day, a paragraph of code. And so on.
Most only start with 5-15 min of high-effort focus per day. Work with that, celebrate success, and aim to increase it by 5-10 min every day.</p><p>If you can work that up to 1-3 hours per day, you have a solid chance of succeeding. That&#8217;s what most practitioners do. </p><p>Activating your high-effort focus every morning is the hardest part of your ML journey.</p><h3>Learning Schedule</h3><p>The time of day is the second key aspect to consider. Our potential for high-effort focus peaks from morning until lunch, declines rapidly after lunch, and improves again around dinner.</p><p>Three main time slots:</p><ul><li><p>8-14: High-effort focus (scoping, coding, major refactoring)</p></li><li><p>14-18: Low-effort focus (light debugging and simple refactoring)</p></li><li><p>18-22: Mid-effort focus (learning gaps + skimming)</p></li></ul><p>Active problem solving, such as coding or hard debugging, takes the most effort. That should be the focus of your morning slot. It&#8217;s difficult to get invested in a challenging programming problem after lunch. </p><p>It&#8217;s also worth testing Lex Fridman&#8217;s <a href="https://www.youtube.com/watch?v=0m3hGZvD-0s&amp;ab_channel=LexFridman">routine</a>. He takes a long break at lunch for exercise and leisure to re-energize for another high-effort session. </p><p>Some will deviate from this schedule, but the essence is to build awareness of when you are most effective and structure your workflow accordingly.</p><h3>Social Media</h3><p>The villain in your self-learning journey is social media addiction. As soon as things get a little hard, it will be prepared to send you off the rails.</p><p>For me, it&#8217;s always an ongoing battle. As social media gets stickier, I need to find new ways to deal with it.</p><p>Here are a few things to consider:</p><ul><li><p>Deactivate accounts that are not crucial for your life. I deactivated Instagram and Facebook and removed my LinkedIn.
This reduces entry points to getting sucked into a feed.</p></li><li><p>Add parent control and screen time limits to disable feed-based social media apps on your phone. If you can&#8217;t follow them, have a friend create a password to lock you out.</p></li><li><p>Add two-factor authentication on all social media accounts. At the start of your day and after lunch, log out of all accounts and put your phone in a <a href="https://www.amazon.com/dp/B08JSMQ2GC?ref_=cm_sw_r_cp_ud_dp_E8GQCRPZANQGSZMYM7EN">phone lock</a>.</p></li></ul><p>The last tip is to avoid meta procrastination, to procrastinate by over-engineering your procrastination remedy. </p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2AG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2AG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 848w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png" width="344" height="93.45333333333333" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1200,&quot;resizeWidth&quot;:344,&quot;bytes&quot;:31515,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2AG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 424w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 848w, 
https://substackcdn.com/image/fetch/$s_!2AG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1272w, https://substackcdn.com/image/fetch/$s_!2AG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac4bd1a-c220-4d01-84c9-c4b65e9e949f_1200x326.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h1>Part Eight &#8594; Applying to Companies</h1><h3><strong>Competitive Disadvantage</strong></h3><p>For popular entry-level jobs, many companies optimize for scale. They often use automated filters, have non-technical staff screen resumes, and ask one-size-fits-all questions. </p><p>These hiring pipelines are often optimized for university graduates who lack experience. They select candidates based on brand-name universities, and the interviews resemble general aptitude tests, coding tests, and CS and ML theory questions.</p><p>These processes are &#8216;hackable&#8217;. But you often need work experience to be considered and a <a href="https://huyenchip.com/ml-interviews-book/">few months</a> of interview preparation. These positions are often better as your second or third ML gig.</p><h3><strong>Competitive Advantage</strong></h3><p>There is a second type of hiring process that makes for ideal entry-level targets: specialized, practical, or small-scale hiring. </p><p>When organizations:</p><ul><li><p>Need specialized knowledge</p></li><li><p>Receive few applications</p></li><li><p>Lack on-the-job training resources </p></li><li><p>Value diverse candidates</p></li><li><p>Value and assess job-related skills</p></li></ul><p>These tend to be smaller organizations and startups, companies with specific cultures, or specialized teams within larger companies.
Their hiring processes are more bespoke: hiring managers are technical, and questions are tailored to each candidate&#8217;s work or reflect the skills you need for the day-to-day job.</p><h3>High-growth Startups and Small Organisations</h3><p>Smaller companies tend to be good targets because they often have technical hiring managers, often the technical founder, and consider every candidate because they receive few applications. They also need people who can add value on day one.</p><p>This gives them the context and motivation to analyze a no-degree resume. However, hiring processes vary a lot, they can be picky, and you often have to do more ML-adjacent work.</p><p>To find startups, have a look at <a href="https://www.producthunt.com/search?q=ai&amp;topics=Artificial%20Intelligence">Product Hunt</a>, browse Y Combinator&#8217;s <a href="https://www.ycombinator.com/companies">list of startups</a>, <a href="https://angel.co/jobs">AngelList</a>, check &#8216;<a href="https://hn.algolia.com/?dateRange=pastYear&amp;page=0&amp;prefix=true&amp;query=ask%20hn%3A%20who%20is%20hiring&amp;sort=byPopularity&amp;type=story">Ask HN: Who is hiring?</a>&#8217; monthly threads on Hacker News, <a href="https://remoteok.com/">remote jobs</a>, local incubators and offices, and portfolios of <a href="https://betaboom.com/blog/top-angel-investors/">angel investors</a>.</p><p>To find other smaller companies, contact your network. Everyone. Contact people on your social networks, your family, your cousins, and friends of your family. Get the word out. It can be awkward but often generates the warmest leads.</p><p>It can also be worth contacting research institutions. They often need people who can do the more engineering-heavy side of their ML research. </p><h3><strong>Midsized and Large Companies</strong></h3><p>Larger companies tend to have more conservative hiring, which favors university graduates, especially in old industries such as pharma and telecom. Yet, there are exceptions.
</p><p>There are a few main strategies to find no-degree opportunities in large companies:</p><ul><li><p>Search for no-degree graduates on LinkedIn and see where they work (e.g. famous boot camps, Kaggle, or no-degree schools)</p></li><li><p>Ask no-degree graduates directly </p></li><li><p>Look for companies that actively recruit <a href="https://www.nocsdegree.com/jobs/">no-degree candidates</a></p></li><li><p>Attract companies with an online and social media presence </p></li><li><p>Use interview-as-a-service companies such as <a href="https://triplebyte.com/">Triplebyte</a></p></li><li><p>Browse Glassdoor and look for companies with <a href="https://www.glassdoor.fr/Overview/Working-at-Skyscanner-EI_IE437400.11,21.htm?countryRedirect=true">practical interviews</a></p></li></ul><p>It requires some work to find these companies, but larger companies often pay higher salaries, give you more room to specialize in a specific ML area, and add more weight to your resume.</p><p>Job fairs and conferences can also be useful, but I believe it&#8217;s more efficient to research and contact companies online.</p><h3>Resume</h3><p>Keep it simple. </p><p>Add your essential contact information, tech jobs, and one or two bullet points with your most impressive ML projects. Aim for half a page. A concise resume makes you stand out and look confident.</p><p>Whatever you do, don&#8217;t refer to yourself as an ML enthusiast, don&#8217;t pad it with jargon, and don&#8217;t make the resume more than one page. </p><h3>Email Templates for Contacting Startups</h3><p>Email the company&#8217;s founder and CEO, and send two follow-up emails one week apart if they don&#8217;t reply. Don&#8217;t worry about whether the company has open positions. Email them anyway.</p><p>Here&#8217;s a rough template:</p><p><code>Title: Entry-level ML positions  </code></p><p><code>Hi John,  </code></p><p><code>I hope you&#8217;ve had an excellent week so far!  </code></p><p><code>I first saw your product on Product Hunt.
I loved the user interface, and I was impressed by the quality of the generative model. I&#8217;m currently looking for an entry-level ML position. </code></p><p><code>I&#8217;ve made open source contributions to PyTorch and ranked in the top 5% in a popular image segmentation competition on Kaggle. You can find more details in my portfolio [github] and [linkedin] here.  </code></p><p><code>If you have any opportunities at [company] or know anyone else hiring, please let me know.  </code></p><p><code>Cheers, Jane</code></p><h2>Interview Prep</h2><p>Don&#8217;t worry about this too much. </p><p>Your portfolio should do the heavy lifting for the companies you are targeting. They will focus on your portfolio and ask questions about your work. Some might give you a take-home exam or a simple technical interview to make sure you haven&#8217;t faked your resume.</p><blockquote><p>Get the basics right. </p></blockquote><p>Practice introducing yourself concisely in 30 seconds, and prepare a 30-second overview of each of your key projects. Do some of the easy <a href="https://leetcode.com/">LeetCode</a> questions in Python. Then, ask your friends to run light, practical mock interviews with you to get into the flow.</p><p>Once you start your outreach and interviews, you can use the gaps and breaks to brush up on StatQuest&#8217;s <a href="https://www.youtube.com/watch?v=qBigTkBLU6g&amp;list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9">the Basics (of statistics)</a> and <a href="https://www.youtube.com/watch?v=Gv9_4yMHFhI&amp;list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF">Machine Learning</a>. </p><p>You&#8217;ll fail some interviews that are theory-heavy. But that&#8217;s okay. 
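</p><p>For a sense of the difficulty level to aim for, here is &#8216;Two Sum&#8217;, a classic easy LeetCode-style problem, in Python:</p>

```python
# "Two Sum": return the indices of the two numbers that add up to target.
def two_sum(nums, target):
    seen = {}  # value -> index of values visited so far
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []  # no pair found

print(two_sum([2, 7, 11, 15], 9))  # → [0, 1]
```

<p>The single-pass dictionary solution is the kind of idiomatic answer interviewers look for at this level.</p><p>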
</p><p>Once you&#8217;ve worked for a few companies and are applying to popular positions at large companies, it&#8217;s worth doing a <a href="https://huyenchip.com/ml-interviews-book/">longer interview preparation</a>.</p><p>For companies you like, research their problems for a few hours and form a clear picture of how they are most likely solving them. Having specific questions and being able to discuss their problems in detail will make you come across as a peer. </p><blockquote><p>&#8216;Mozilla has a very interesting interview process. For individual contributors, you simulate a day of work and try to understand how they would perform.&#8217; - <a href="https://dshiring.com/">Julie Hollek, Mozilla</a></p></blockquote><h2>Plan B</h2><p>If you&#8217;ve identified and contacted 10-30 companies each week for three months, you should have a few offers or be in the late interviewing stages. If you don't have any warm leads, you want to explore a plan B.</p><p>It&#8217;s also worth looking for plan B options from day one. This gives you extra confidence and reduces the risk of lost income.</p><p>If you&#8217;ve had interviews with 5-10 companies, that&#8217;s often a sign that you are close to landing a job. In that case, make one or two shorter projects targeting specific companies and continue your outreach for another month or two.</p><p>If you&#8217;ve had fewer than five interviews and fell out after the first phone screen or first interview, that&#8217;s a sign that you need to rethink. 
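</p><p>The rules of thumb above can be sketched as a small decision helper. The function is hypothetical; the thresholds come straight from the text.</p>

```python
# Sketch of the article's rough rules of thumb after ~3 months of
# contacting 10-30 companies per week. Function name is hypothetical.
def next_step(interviews, warm_leads):
    if warm_leads == 0:
        return "explore plan B"
    if interviews >= 5:
        return "close: add 1-2 targeted projects, keep outreach 1-2 months"
    return "rethink: consider ML-adjacent roles"

print(next_step(interviews=7, warm_leads=2))
```

<p>Treat the output as a prompt for reflection, not a rule; your situation may differ.</p><p>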
Getting into ML is hard, especially if you are self-taught.</p><p>Here are a few options worth considering:</p><ul><li><p>Apply for software roles closely related to machine learning</p></li><li><p>Apply for software roles in companies that do a lot of ML</p></li><li><p>Apply for developer advocacy and content marketing roles in ML companies</p></li><li><p>Apply for product manager and analytics roles related to ML</p></li><li><p>Bid on ML contracting opportunities or software projects related to ML</p></li></ul><p>If you still have motivation and income left, aim to create the base portfolio projects mentioned earlier.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://twitter.com/intent/user?screen_name=EmilWallner&quot;,&quot;text&quot;:&quot;Follow Emil Wallner on Twitter&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://twitter.com/intent/user?screen_name=EmilWallner"><span>Follow Emil Wallner on Twitter</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.emilwallner.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.emilwallner.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h1>Part Nine &#8594; The Key Points</h1><ul><li><p><strong>Self-learning ML</strong></p><ul><li><p>Efficient self-learning is similar to an apprenticeship.</p></li><li><p>Starting with software engineering reduces your chances of failure.</p></li><li><p>Free peer-to-peer CS schools are my preferred way to learn to program.</p></li><li><p>Fast.ai and Kaggle&#8217;s 30 Days of ML have practical and efficient curricula.</p></li><li><p>Creating a solid portfolio requires more focus and effort than passively taking 
online courses. You need rigorous work habits to progress.</p></li></ul></li><li><p><strong>Hireability</strong></p><ul><li><p>Most employers don&#8217;t value certificates, online courses, or common projects.</p></li><li><p>Online courses are excellent learning resources, but completing them seldom makes candidates more attractive to employers. </p></li><li><p>Employers hire self-learners based on validated real-world results.</p></li><li><p>Avoid popular tech positions; search instead for small companies, companies with specific needs, and organizations with practical interviews. </p></li></ul></li><li><p><strong>Portfolio</strong></p><ul><li><p>Result-based portfolio projects have three components: a metric or testimonial, a context, and third-party validation.</p></li><li><p>Publishing papers, machine learning competitions, and contributing to open source projects are the safest portfolio projects.</p></li><li><p>The second-best portfolio options are creating live ML products, collaborating with people in the industry, and developing ML content with high engagement.</p></li><li><p>Shorter portfolio projects targeting a specific industry are great for standing out in the interview process.</p></li><li><p>Improving promising existing projects often creates better results than going with gut-feel project ideas.</p></li></ul></li></ul><p>Thanks to everyone who read drafts of this article! A special thanks to Miha Jenko, Micha&#235;l Trazzi, Curt Tigges, Richmond Alake, Nathan Waters, Ravi Chandra Veeramachaneni, Fabrizio Damicelli, Hasan Yaman, Emanuel Ramirez, Ed Campbell, Roy Keyes, Hussein Lezzaik, Utkarsh Malaiya, Priya Joseph, Job Henandez Lara, Agboja David, and Brian Ko.</p><p></p><div><hr></div><h3><strong>If you enjoyed this post and think it will help others, don't hesitate to like it here on Substack and share it. Also, if you have a question or feedback, please leave it in the comments. I reply to all comments. 
</strong></h3><div><hr></div>]]></content:encoded></item><item><title><![CDATA[How I built a €25K Machine Learning Rig]]></title><description><![CDATA[How to plan, buy, build, and store your 2-10 GPU machine learning servers and PCs]]></description><link>https://www.emilwallner.com/p/ml-rig</link><guid isPermaLink="false">https://www.emilwallner.com/p/ml-rig</guid><dc:creator><![CDATA[Emil Wallner]]></dc:creator><pubDate>Tue, 06 Apr 2021 04:30:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YSpm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2848c4e-68c5-49c4-980e-a6ea8030aa6e_1500x1125.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Below is my first beauty. </p><p>It has 4 NVIDIA RTX A6000s and an AMD EPYC 2 with 32 cores, with 192 GB of GPU memory and 256 GB of RAM (<strong><a href="https://docs.google.com/spreadsheets/d/1VMtiLZbgLAChKscBAbC1VovVwsI_8Y0BBAW8R06rc5I/edit?usp=sharing">part list</a></strong>).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!YSpm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2848c4e-68c5-49c4-980e-a6ea8030aa6e_1500x1125.jpeg" width="1456" height="1092" alt=""></figure></div><p>Let&#8217;s begin.</p><h3>GPUs</h3><p>Until AMD&#8217;s GPU machine learning libraries are more stable, NVIDIA is the only real option. Since NVIDIA&#8217;s latest Ampere microarchitecture is significantly better than the previous generation, I&#8217;ll focus only on Ampere GPUs. </p><p>NVIDIA has three broad GPU types: </p><ul><li><p><strong>Consumer</strong>: (RTX 3080 / RTX 3090)</p></li><li><p><strong>Prosumer</strong>: (A6000)</p></li><li><p><strong>Enterprise</strong>: (A100)</p></li></ul><p>There are a few convenient GPU counts per rig and GPU class: </p><ul><li><p><strong>Consumer</strong>: two RTX 3080s/RTX 3090s</p></li><li><p><strong>Prosumer</strong>: four A6000s</p></li><li><p><strong>Enterprise</strong>: </p><ul><li><p>8 A100s or A6000s (PCIe), </p></li><li><p>16 A100s (SXM4), and</p></li><li><p>20 A100s (PCIe-based modular blade nodes). </p></li></ul></li></ul><p>You can work around these limits, but it adds risk and reduces reliability and convenience.</p><p></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.buymeacoffee.com/emilwallner&quot;,&quot;text&quot;:&quot;Buy me a coffee! &#9749;&#65039;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.buymeacoffee.com/emilwallner"><span>Buy me a coffee! 
&#9749;&#65039;</span></a></p><div><hr></div><h3>Constraints For Consumer GPUs</h3><p>Let&#8217;s outline a few of the limitations for the consumer and prosumer cards.</p><p>Main limits:</p><ul><li><p>Motherboard limit with PCIe risers: 14 GPUs (x8 Gen 4.0 per GPU)</p></li><li><p>Consumer electricity limit per socket: 8 GPUs (4 in the US) </p></li><li><p>Consumer power supply limit: 5 GPUs (2000W)</p></li><li><p>Standard PC case size: 4 dual-slot GPUs</p></li></ul><p>Space and environment limits: </p><ul><li><p>Stacking cards next to each other: 4 A6000/3070 or 2 3080/3090</p></li><li><p>Shared office sound and heat limit: 2 GPUs (preferably water-cooled) </p></li><li><p>Consumer supply per customer: 1 GPU (most stores will only allow you to buy one consumer GPU, and they are only generally available 3-12 months after launch)</p></li></ul><p><a href="https://www.emilwallner.com/p/bellyflopping-nvidias-rtx-3090-release">I tried to buy 5 RTX 3090</a>, but after waiting four months due to supply issues, I opted for four RTX A6000. 
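</p><p>The power supply limit above is easy to sanity-check with back-of-envelope arithmetic. A Python sketch; the per-card and system wattages are rough assumptions, not measurements.</p>

```python
# Back-of-envelope check of the "5 GPUs per 2000W consumer PSU" limit.
# Per-card and system wattages are rough assumptions, not measurements.
def max_gpus(psu_watts, gpu_watts, system_watts=300, headroom=0.9):
    """GPUs that fit in a PSU budget, keeping ~10% headroom."""
    return int((psu_watts * headroom - system_watts) // gpu_watts)

print(max_gpus(2000, gpu_watts=300))  # A6000-class (300W) card
print(max_gpus(2000, gpu_watts=350))  # RTX 3090-class (350W) card
```

<p>With these assumptions, a 2000W supply fits five 300W cards but only four 3090-class cards, which is one reason 3+ GPU rigs favor 300W-and-under cards.</p><p>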
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tuty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f28690-2b94-407e-8a76-fb47ae05d449_1500x925.jpeg" width="1456" height="898" alt=""></figure></div><p><a href="https://lambdalabs.com/blog/deep-learning-hardware-deep-dive-rtx-30xx/">According to Lambda Labs</a> and <a href="https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop---Does-it-work-1935/">Puget Systems</a>, the 3080 and 3090 dual-slot blower editions run too hot to reliably fit four next to each other on a standard-sized motherboard. Thus you need PCIe risers, a water-cooled rig, or to cap the power usage. </p><p>Using PCIe risers in an open-air rig exposes the hardware to dust. A water-cooled rig requires maintenance and risks leaking during transportation. <a href="https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop---Does-it-work-1935/">Capping power usage</a> is non-standard and could lead to unreliability and performance loss. 
</p><p>For 3+ GPU rigs, many opt for cards that consume 300W or less, i.e. the RTX 3070 and below or the A6000 and up.</p><p>Most of today&#8217;s models are designed for 16 GB cards since the most mainstream cloud GPUs have 16 GB of GPU memory, but we are shifting towards 40 GB. Thus, the cards with the least memory will see growing overhead from rewriting software to fit a lower memory limit. </p><h4>Why do I see 8-GPU consumer rigs online?</h4><p>The 5+ GPU consumer rigs people see online are often crypto rigs with multiple power supplies. </p><p>Since crypto rigs don&#8217;t need high bandwidth, they use <a href="https://cryptomining-blog.com/tag/pci-e-usb-riser/">specific USB adapters</a> to connect the GPUs. The adapter transfers data but no power, so the GPU&#8217;s and the motherboard&#8217;s power are separated, which avoids mixing circuits. </p><p>However, the adapters are often of poor quality, and a small soldering error can destroy your hardware or even start a fire. They are especially unsuitable for ML rigs, which require PCIe risers that deliver 75W of power. </p><p>Crypto rigs also use mining power supplies from Alibaba with poor standards, or retrofit enterprise power supplies. Since people tend to place them in garages or containers, they accept the added safety risk. </p><div><hr></div><h3>Prosumer and Enterprise Features</h3><p>For the Ampere series, NVIDIA makes it hard to use high-end consumer cards in workstations with more than 2 GPUs. The 3-slot width, the high wattage, and several manufacturers discontinuing the 2-slot blower edition of the 3090 all indicate this. 
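</p><p>One common way to rewrite software for a lower memory limit is gradient accumulation: split the target batch into micro-batches that fit in GPU memory and accumulate gradients across them. A sketch of the arithmetic with purely hypothetical memory numbers:</p>

```python
# Gradient accumulation arithmetic: how many micro-batch steps a card
# needs to emulate a target batch size. All numbers are hypothetical.
import math

def accumulation_steps(target_batch, per_sample_gb, gpu_gb, model_gb):
    """Return (micro-batch size, steps) to emulate the target batch."""
    free_gb = gpu_gb - model_gb  # memory left for activations
    micro_batch = max(1, int(free_gb // per_sample_gb))
    steps = math.ceil(target_batch / micro_batch)
    return micro_batch, steps

# Same hypothetical model and batch on a 16 GB vs a 40 GB card:
print(accumulation_steps(64, per_sample_gb=0.5, gpu_gb=16, model_gb=6))
print(accumulation_steps(64, per_sample_gb=0.5, gpu_gb=40, model_gb=6))
```

<p>Under these assumptions, the 16 GB card needs four accumulation steps where the 40 GB card trains the batch in one pass; that extra bookkeeping is the overhead the smaller cards carry.</p><p>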
</p><p>Thus, the prosumer and enterprise Ampere cards&#8217; key selling point is support for 3+ GPU rigs with 24/7/365 workloads.</p><p>The prosumer and enterprise cards also have a few additional features.</p><p>Main features (compared to the RTX 3090):</p><ul><li><p>1.1 - 2 times faster (depending on GPU, binary floating-point format, and model)</p></li><li><p>1.7 - 3.3 times more memory</p></li><li><p>Lower energy consumption (better for stacking cards)</p></li><li><p>Datacenter deployment (non-profits can gain permission to use consumer cards)</p></li></ul><p>Nice-to-have features: </p><ul><li><p>ECC memory (error-correcting memory)</p></li><li><p>Multiple users per GPU, MIG (only enterprise cards)</p></li><li><p>Faster GPU-to-GPU communication, NVSwitch (A100 SXM4)</p></li></ul><p>The 80GB GPUs will give you an edge for specific models, but it&#8217;s hard to say if they have enough compute to benefit effectively from the massive models. The safest option is the 40GB version. However, it&#8217;s hard to ignore the bragging rights that come with 80GB GPUs.</p><p>In general, I don&#8217;t think in terms of NLP-, CV-, or RL-specific workloads. They will vary in performance, but since the machine learning landscape is shifting so fast, it&#8217;s not worth over-optimizing for a specific workload. </p><p>For a more in-depth comparison, read Tim Dettmers&#8217; <a href="https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/">go-to GPU guide</a>. Pay extra attention to the Tensor Core overview, sparse training, capping GPU wattage, and low-precision computation. 
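</p><p>The memory claim above follows directly from the cards&#8217; published memory sizes (24 GB for the RTX 3090, 48 GB for the A6000, 40 or 80 GB for the A100):</p>

```python
# Check the "1.7-3.3x more memory" claim against the RTX 3090's 24 GB,
# using each card's published memory size.
memory_gb = {"RTX 3090": 24, "A6000": 48, "A100 40GB": 40, "A100 80GB": 80}

baseline = memory_gb["RTX 3090"]
ratios = {card: gb / baseline for card, gb in memory_gb.items()}
for card, r in ratios.items():
    print(f"{card}: {r:.2f}x")
```

<p>The A100 40GB lands at roughly 1.7x and the A100 80GB at roughly 3.3x, matching the quoted range, with the A6000 at 2x in between.</p><p>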
</p><p></p><h3>Server Constraints</h3><p>While the power supply caps consumer rigs, server rigs are constrained by weight, case size, and networking overhead.</p><p>Main limits:</p><ul><li><p>Server with consumer parts: 4 PCIe GPUs</p></li><li><p>PCIe server case limit: 10 dual-slot GPUs (the width of a standard server)</p></li><li><p>Transporting a server manually: 10 PCIe GPUs or 4 SXM4 GPUs (30kg)</p></li></ul><p>Additional limits:</p><ul><li><p>PCIe server case limit with networking: 8 dual-slot GPUs (2 dual slots for networking)</p></li><li><p>SXM4 server case limit: 16 GPUs (168 kg)</p></li><li><p>PCIe blade server limit: 20 dual-slot GPUs</p></li></ul><p>The key constraint here is networking overhead. As soon as you connect one or more servers, you need <a href="https://slurm.schedmd.com/documentation.html">software</a> and hardware to manage the system. I highly recommend <a href="https://www.youtube.com/watch?v=rfu5FwncZ6s">Stephen Balaban's overview</a> of building GPU clusters for machine learning. </p><p>The second key concern is weight and repairs. </p><p>A server with eight SXM4 GPUs sits at around 75kg, so you ideally need a server lift. The SXM4 parts can also be harder to repair than the more standard parts that come with a PCIe server. </p><p>The A100 and A6000 also have versions without a built-in fan. These need a server case with a dozen 10K+ RPM fans, which makes them more fault-tolerant since you can hot-swap the fans. </p><div><hr></div><h3>Speed Benchmarks</h3><p>Lambda Labs has the best <a href="https://lambdalabs.com/blog/tag/benchmarks/">per-GPU benchmarks</a> and <a href="https://lambdalabs.com/gpu-benchmarks">overall benchmark</a>. </p><p>The benchmark is the average of several models using PyTorch with half-precision. 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!b7ib!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5e793794-2154-4994-b5b3-e8e5054d48b6_1426x760.jpeg" width="1426" height="760" alt=""><figcaption class="image-caption">FP16 PyTorch Lambda Labs Benchmark</figcaption></figure></div><p>In terms of speed, the A100 is 1.4 times faster than the A6000. The A6000, in turn, is 1.2 times faster than the 3090 and twice as fast as the 3080. </p><p>The other noteworthy benchmark is the comparison between PCIe and SXM4. 
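</p><p>The speed ratios quoted above chain together into relative throughput; a quick sketch normalized to the RTX 3080:</p>

```python
# Chain the quoted speed ratios into throughput relative to the RTX 3080
# (A6000 = 2x 3080 = 1.2x 3090; A100 = 1.4x A6000).
speed = {"RTX 3080": 1.0}
speed["A6000"] = 2.0 * speed["RTX 3080"]
speed["RTX 3090"] = speed["A6000"] / 1.2
speed["A100"] = 1.4 * speed["A6000"]

for card, s in speed.items():
    print(f"{card}: {s:.2f}x vs RTX 3080")
```

<p>So under this benchmark, an A100 delivers roughly 2.8x an RTX 3080, and a 3090 about 1.7x.</p><p>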
NVIDIA&#8217;s A100 PCIe can only connect to one other GPU, while NVIDIA&#8217;s A100 SXM4 can simultaneously connect to 8 - 16 GPUs.</p><p></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VaWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Feb282362-49d6-4708-ae9d-9903a8d9a767_1848x232.jpeg" width="1456" height="183" alt=""><figcaption class="image-caption">FP16 PyTorch w/ 8 GPUs Lambda Labs Benchmark</figcaption></figure></div><p></p><p>NVIDIA&#8217;s NVSwitch and SXM4 have 10x faster bandwidth in theory, but in an 8-GPU setting, it&#8217;s only 10% faster 
when compared to PCIe solutions. Since the SXM4 is 8% faster on a per-GPU basis, the NVswitch has a marginal impact. </p><p>The difference should stay marginal up to an 8-GPU system. According to Lambda Labs&#8217; <a href="https://www.youtube.com/watch?v=rfu5FwncZ6s&amp;ab_channel=Lambda">CEO</a>, they can see a 2x improvement for certain use cases in larger clusters. Hence, it&#8217;s aimed chiefly at installations of multiple 8-GPU systems. It&#8217;s also worth looking into the DGX A100 SuperPOD system at the scale of several hundred GPUs.</p><p>Also, in networking benchmarks, pay attention to GB/s (Gigabytes) and Gb/s (Gigabits): 1 GB/s equals 8 Gb/s.</p><div><hr></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LcQ-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LcQ-!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 424w, https://substackcdn.com/image/fetch/$s_!LcQ-!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 848w, https://substackcdn.com/image/fetch/$s_!LcQ-!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!LcQ-!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LcQ-!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif" width="360" height="640" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12849332,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LcQ-!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 424w, https://substackcdn.com/image/fetch/$s_!LcQ-!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 848w, https://substackcdn.com/image/fetch/$s_!LcQ-!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!LcQ-!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11f7eef-3c80-4d0e-97d1-7c86d818ddf6_360x640.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Testing my ML build for the first time</figcaption></figure></div><p></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.buymeacoffee.com/emilwallner&quot;,&quot;text&quot;:&quot;Buy me a cold beer! 
&#127866;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.buymeacoffee.com/emilwallner"><span>Buy me a cold beer! &#127866;</span></a></p><div><hr></div><h3>GPU Pricing</h3><p>The pricing approximates actual retail prices, rounded for simplicity, excluding VAT and discounts. </p><p>Enterprise:</p><ul><li><p>A100 SXM4 (80 GB): &#8364;18k </p></li><li><p>A100 SXM4 (40 GB): &#8364;13k </p></li><li><p>A100 PCIe (40 GB): &#8364;9k</p></li></ul><p>Prosumer and consumer:</p><ul><li><p>RTX A6000 / A40 (48GB): &#8364;4500</p></li><li><p>RTX 3090 (24 GB): &#8364;1500-2000</p></li><li><p>RTX 3080 (10 GB): &#8364;800-1300</p></li><li><p>RTX 3070 (8 GB): &#8364;700-1000</p></li></ul><p>NVIDIA also provides startup and education discounts so that you can <a href="https://web.archive.org/web/20210220133707/https://info.nvidia.com/rs/156-OFN-742/images/NPN_NVIDIA_RTX_A6000_A40_20201201.pdf">save 15-30% per GPU</a>. For startups, apply to the <a href="https://mynvidia.force.com/Inception/s/regform">inception program</a>. In total, it takes about one week to get the discounts. </p><p>I saved around &#8364;4k on my 4 x RTX A6000 build by assembling it myself and using NVIDIA&#8217;s GPU discounts.</p><p>The SXM4 cards are sold as part of an 8-GPU server, so the per-GPU pricing is a rough approximation; the custom GPU-to-GPU interconnect makes them more expensive.</p><h4>Machine Learning Rig Tiers</h4><p>These are estimated pre-built prices without discounts and VAT. 
</p><p>High-growth startups, large research labs, and enterprise: </p><ul><li><p><strong>&#8364;240-340k</strong>: 8 x A100 SXM4 (80 GB)</p></li><li><p><strong>&#8364;120-170k</strong>: 8 x A100 SXM4 (40 GB)</p></li></ul><p>Startups, research labs, and SMEs:</p><ul><li><p><strong>&#8364;90k:</strong> 8 x A100 PCIe (40 GB)</p></li><li><p><strong>&#8364;50k:</strong> 4 x A100 PCIe or 8 x RTX A40 (fanless RTX A6000)</p></li><li><p><strong>&#8364;25k:</strong> 4 x RTX A6000 (<strong>&#8364;</strong>21k if you build it yourself and use the GPU discounts)</p></li><li><p><strong>&#8364;25k:</strong> 4 x RTX 3090 (Liquid cooling)</p></li><li><p><strong>&#8364;15k:</strong> 4 x RTX 3090 (Crypto-style or capped performance)</p></li></ul><p>Students, hobbyists, and consultants:</p><ul><li><p><strong>&#8364;10k:</strong> 4 x RTX 3070</p></li><li><p><strong>&#8364;7k:</strong> 2 x RTX 3090</p></li><li><p><strong>&#8364;5k:</strong> 1 x RTX 3090 or 2 x RTX 3080</p></li><li><p><strong>&#8364;4k:</strong> 1 x RTX 3080</p></li><li><p><strong>&#8364;3k:</strong> 1 x RTX 3070</p></li></ul><p>Budget is one aspect, but the key concern is where you place the machine. </p><p>When you start, you often have the machine in the same room and cope with the inconvenience.</p><p>As you scale, you&#8217;ll need more infrastructure. You might move it to a separate office room and later put it in a data center, starting with colocation and then climbing from tier 1 to 4 data centers for added fault tolerance. </p><p>I find <a href="https://soundcloud.com/emil-wallner-915098314/4-a6000-sound/s-jmeUtZg3Z1L">4 GPUs too loud</a>, and they generate too much heat to have in an office or at home without proper cooling. Here&#8217;s a <a href="https://www.youtube.com/watch?v=mvBeCSaaDxA&amp;ab_channel=PugetSystems">quick benchmark</a> by Puget Systems. Think of a small leaf blower blowing hot air, equal to a 1600W radiator. 
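</p><p>To put the heat claim in context, here is a minimal sketch of the arithmetic, assuming the RTX A6000&#8217;s 300W rated board power and a guessed ~400W of system overhead (CPU, drives, fans, PSU losses):</p>

```python
# Rough heat-output estimate for a multi-GPU rig: essentially all
# electrical draw ends up as heat in the room.
# Assumptions: 300 W per GPU (RTX A6000 board power), 400 W system overhead.

def rig_heat_watts(num_gpus: int, gpu_watts: int = 300, overhead_watts: int = 400) -> int:
    """Approximate total waste heat in watts for a rig with num_gpus GPUs."""
    return num_gpus * gpu_watts + overhead_watts

print(rig_heat_watts(4))  # 4 x RTX A6000 -> 1600, the 1600W-radiator comparison
</imports>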
</p><p>The starting price of a data center colocation is around &#8364;80-250 per GPU per month, including &#8364;25 per GPU in electricity charges. You can ask for a quote from all your <a href="https://www.datacentermap.com/quote.html">local data center colocations here</a>. If you plan on running workloads 24/7/365 on 4+ GPUs, I highly recommend it. </p><p>You can easily buy parts for a 4 GPU server, much as you would for a PC. A <a href="https://www.thinkmate.com/system/a+-server-4124gs-tnr">barebone 5+ GPU ML server</a> will cost around &#8364;7k. </p><div><hr></div><h3>CPU </h3><p>Go with AMD. </p><p>AMD has 5x more internal bandwidth than Intel, and it&#8217;s both cheaper and better. A majority of the Ampere ML servers use AMD.</p><p>AMD has three main CPU types:</p><ul><li><p><strong>Consumer</strong>: <a href="https://en.wikipedia.org/wiki/Ryzen">Ryzen 5000</a> (AM4 socket)</p></li><li><p><strong>Prosumer</strong>: <a href="https://en.wikipedia.org/wiki/Ryzen">Ryzen Threadripper</a> 3rd Gen (sTRX4 socket; sWRX8 for the 1st Gen Pro version)</p></li><li><p><strong>Enterprise</strong>: <a href="https://en.wikipedia.org/wiki/Epyc">EPYC</a> 2 (SP3 socket)</p></li></ul><p>For a 1-GPU system, Ryzen is excellent; for 2-4 GPU PCs, go with Threadripper. For 5+ GPU systems and server builds, go with EPYC. </p><p>Threadripper is faster than EPYC, but EPYC has twice the memory channels, supports RDIMM, and uses less energy. If you plan to use your computer as a server, I&#8217;d go with EPYC. </p><p>I ended up with an AMD EPYC 2 Rome 7502P with 32 cores, using eight cores per GPU as a rough guideline. Also, pay attention to whether a motherboard supports single-processor setups, dual-processor setups, or both.</p><h4>CPU Cooling</h4><p>For cooling, Noctua fans are the quietest, most performant, and most reliable. However, I find the brown color scheme rather ugly. They are also big, so make sure they fit with your RAM and chassis. 
</p><p>For RGB fans, I enjoy Corsair&#8217;s All-in-one (AIO) liquid CPU coolers. They bring life to the build. The colors are programmable, and the system frees up space around the CPU. They use antifreeze liquid, and the leak risk is tiny.</p><p>All Threadripper and EPYC CPUs are the same size, making the coolers compatible, though you might need a mounting bracket. Also, check that the cooler supports the wattage of the CPU you choose. </p><p>Anyway, here are my top picks:</p><ul><li><p><strong>Ryzen 5000: </strong>Noctua NH-D15 or Corsair H100i RGB PLATINUM</p></li><li><p><strong>Threadripper:</strong> Noctua NH-U14S TR4-SP3 or Corsair Hydro Series H100x</p></li><li><p><strong>EPYC: </strong>Dynatron A26 2U (for servers)</p></li></ul><p>I avoid custom liquid cooling due to cost, maintenance, freezing risk, transport risk, and lack of flexibility. </p><div><hr></div><h3>Motherboard</h3><p>Here are a few motherboards worth considering for AMD:</p><ul><li><p><strong>Ryzen 5000</strong>: MSI PRO B550-A PRO AM4 (ATX)</p></li><li><p><strong>Threadripper 3rd Gen</strong>: ASRock TRX40 CREATOR (ATX)</p></li><li><p><strong>Threadripper Pro</strong>: ASUS Pro WS WRX80E-SAGE SE (EATX)</p></li><li><p><strong>EPYC 2</strong>: AsRock ROMED8-2T (ATX) (<strong>My motherboard</strong>)</p></li></ul><p>My principal deciding factors were the PCIe slots and IPMI. </p><p>If you plan on using your ML rig as a regular PC and want built-in support for, say, WiFi, headphone jack, microphone jack, and sleep functionality &#8212; you are best off with a consumer or prosumer motherboard. </p><p>In my case, I went with a dual-usage prosumer/server motherboard with support for remote management via the Intelligent Platform Management Interface (IPMI). Via an Ethernet connection and a web GUI, I can install the OS, turn the machine on/off, and connect to a virtual monitor. IPMI is ideal if you plan to run the machine 24/7/365. 
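</p><p>As an illustration of what IPMI enables, here is a sketch built around the standard <code>ipmitool</code> CLI; the host address and username are hypothetical placeholders for your BMC&#8217;s settings (ipmitool prompts for the password, or you can pass <code>-P</code>):</p>

```python
# Build ipmitool command lines for out-of-band control of a server's BMC
# over Ethernet. Host and user below are hypothetical placeholders.

def ipmi_cmd(action: str, host: str = "192.168.1.50", user: str = "admin") -> list:
    """Return an ipmitool argv for a chassis power action: 'status', 'on', 'off', or 'cycle'."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
            "chassis", "power", action]

print(" ".join(ipmi_cmd("status")))
```

<p>Run the printed command in a terminal (or pass the list to <code>subprocess.run</code>) to query or toggle the machine without a monitor attached.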
</p><p>CPU sockets have a built-in chipset, and prosumer and consumer boards have additional chipsets that enable specific CPUs or features, for example, B550 for Ryzen and TRX40 for Threadripper.</p><p>For Ryzen 5000 builds, it&#8217;s ideal to have a BIOS flash button. Otherwise, you need an earlier-generation Ryzen CPU to update the BIOS for Ryzen 5000 compatibility. </p><p>5+ GPU <a href="https://www.supermicro.com/en/products/motherboard/H12DSG-O-CPU">server-only motherboards</a> are hard to buy separately. While consumer setups are modular, larger server builds are integrated. </p><h4>Motherboard sizes</h4><p>The standard motherboard size is ATX, 305&nbsp;&#215;&nbsp;244&nbsp;mm, and it works great for both server chassis and PCs. I mostly look at ATX boards to avoid any chassis spacing issues.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b4Pw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b4Pw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b4Pw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!b4Pw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b4Pw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b4Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg" width="1456" height="1005" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/d949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1005,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:206681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b4Pw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!b4Pw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b4Pw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b4Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd949816c-9b39-4c7b-9578-495444fb28ad_1974x1363.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 
9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Some of the other form factors vary in size <a href="https://en.wikipedia.org/wiki/ATX">depending on the manufacturer</a>, so you will be more limited in chassis options. It&#8217;s not a big deal for consumer chassis, but for server chassis, you don&#8217;t want the height to be more than the ATX&#8217;s 305 mm.</p><div><hr></div><h3>PCI Express (PCIe)</h3><p>Below is the motherboard I went with, the AsRock ROMED8-2T (ATX).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xFY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xFY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_xFY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_xFY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!_xFY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xFY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1483003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_xFY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_xFY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_xFY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 
1272w, https://substackcdn.com/image/fetch/$s_!_xFY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2848789c-6b62-45d0-a59b-6f4a4871c5d1_1556x872.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The important thing to look for is the PCIe slots, where you plug the GPUs, the vertical gray slots above. Above, you have seven single-width slots.</p><p>The connection will be to the far right of the GPU. 
As you can see, there&#8217;s a tight gap between the RAM slots and the first GPU.</p><p>When you have four dual-width GPUs on a 7-slot board, the 4th GPU will extend past the bottom edge of the board. Thus, you need a PC or server chassis that supports 8 PCIe extension slots. </p><p>For two triple-slot RTX 3090 cards, the first card covers the first three PCIe slots, the fourth slot stays empty, and the second card covers the last three slots.</p><p>If you plan to buy an NVLink bridge to connect two GPUs, they often come in 2-slot, 3-slot, and 4-slot versions. In the picture, you&#8217;d need two 2-slot bridges. For the triple-slot cards with a gap in between, you&#8217;d need a 4-slot bridge: the width of the card, 3-slot, plus the 1-slot gap.</p><p>There are a few things worth knowing about PCIe slots:</p><ul><li><p><strong>PCIe physical length</strong>: the slots in the picture are x16, the standard for GPUs, which is 89 mm long.</p></li><li><p><strong>PCIe bandwidth</strong>: sometimes, you have the length of an x16 slot, but only half has pins that connect it to the motherboard, making it an x16 slot with x8 bandwidth. For reference, crypto-rigs will use x16 adapters but with x1 bandwidth.</p></li><li><p><strong>Generation speed</strong>: The above board is Generation 4.0. Each generation tends to be twice as fast as the previous generation. NVIDIA&#8217;s latest GPUs are Gen 4.0 but have comparable performance on Gen 3.0 boards in practice. </p></li><li><p><strong>Multiple GPU requirements</strong>: For 4-10 GPU systems, most recommend at least x8 Gen 3.0 per GPU. </p></li></ul><h4>PCIe lanes</h4><p>Another thing most people look for is the total number of PCIe lanes, i.e., the total internal bandwidth. It gives you a rough indication of networking, storage, and multi-GPU capacity. </p><p>Motherboard manufacturers can use PCIe lanes to prioritize certain features, such as storage, PCIe slots, CPU-to-CPU communication, etc. 
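</p><p>Those lane trade-offs can be sanity-checked with a rough budget before you pick a platform, using the per-device figures this guide works with (x16 per GPU, x8 for a 10 Gb Ethernet controller, x4 per NVMe SSD); the build below is a hypothetical example:</p>

```python
# Rough PCIe lane budget for a planned build.
# Per-device lane costs are rules of thumb: GPU x16, 10 Gb Ethernet x8, NVMe x4.
LANE_COST = {"gpu": 16, "10gbe": 8, "nvme": 4}

def lanes_needed(devices: dict) -> int:
    """Total PCIe lanes for a {device_type: count} build list."""
    return sum(LANE_COST[dev] * count for dev, count in devices.items())

build = {"gpu": 4, "10gbe": 1, "nvme": 2}
print(lanes_needed(build))  # 4*16 + 1*8 + 2*4 = 80
```

<p>80 lanes fits comfortably within the 128 lanes of EPYC 2 or Threadripper Pro but far exceeds Ryzen 5000&#8217;s 20, which is why lane counts matter for multi-GPU platforms.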
</p><p>For reference, one GPU will use x16 lanes, a 10 Gb/s Ethernet port uses x8 lanes, and an NVMe SSD will use x4 lanes.</p><p>Here&#8217;s what AMD&#8217;s chipsets enable:</p><ul><li><p><strong>Ryzen 5000</strong>: 20 PCIe lanes Gen 4.0</p></li><li><p><strong>Threadripper 3rd Gen:</strong> 88 PCIe lanes Gen 4.0</p></li><li><p><strong>Threadripper Pro 1st Gen:</strong> 128 PCIe lanes Gen 4.0</p></li><li><p><strong>EPYC 2:</strong> 128 PCIe lanes Gen 4.0</p></li></ul><div><hr></div><h3>Chassis</h3><p>The most-used ML workstation chassis is the <a href="https://www.amazon.fr/Corsair-Carbide-Bo%C3%AEtier-Fen%C3%AAtr%C3%A9-Airflow/dp/B00D6GINF4/ref=sr_1_1_sspa?__mk_fr_FR=%C3%85M%C3%85%C5%BD%C3%95%C3%91&amp;dchild=1&amp;keywords=Corsair+Carbide+Air+540&amp;qid=1617616069&amp;s=computers&amp;sr=1-1-spons&amp;psc=1&amp;spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyREJQMVhSREQ5WEcwJmVuY3J5cHRlZElkPUEwNTE3MjkwSjRPTUxKR0gwU00xJmVuY3J5cHRlZEFkSWQ9QTA0MTk5MTUzTlE0VVFZUU40NTJJJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ==">Corsair Carbide Air 540</a>; for consumer servers, it&#8217;s the <a href="https://www.amazon.fr/dp/B00C8GQGEI?ref_=pe_3044141_319369301_E401_E_dt_1">Chenbro Micom RM41300-FS81</a>. From a sound, dust, and transportation point of view, these two cases are ideal. Both will <a href="https://i.imgur.com/E87I7BIl.jpg">house the RTX 3090</a>, but you need a rear-end power connector for the Chenbro. </p><p>I started with the <a href="https://www.thermaltake.com/core-p5-tempered-glass-edition.html">Thermaltake Core P5 Tempered Glass Edition</a>. From an aesthetic angle, it&#8217;s the best, but it&#8217;s rather clunky and not ideal for dust. Given the GPUs' heat and noise, I decided to convert it into a server with the Chenbro chassis and put it in a data center.</p><p>Space between the GPUs has more impact than the main chassis airflow. If you are going for three or more 3080/3090 cards, you want to look into open-air crypto-rig setups. 
However, open-air rigs are both very noisy and vulnerable to dust. Ideally, you want to put one in a sound-isolated room with cooling and dust filters. </p><p>The Chenbro chassis has two 120 mm 2700 RPM fans on the lid, which create excellent airflow for the GPUs. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ii_A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ii_A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ii_A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg" width="1456" height="1897" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1897,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2154295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ii_A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ii_A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F66064a90-52bc-49b9-b32a-709b7ae59b81_1500x1954.jpeg 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>PSU, RAM, and Storage</h4><p>Once you have the GPU, CPU, motherboard, and chassis &#8212; the rest of the components are easy to pick.</p><ul><li><p><strong>Power Supply:</strong> I looked at the two brands generally considered the best, EVGA and Corsair. I added up the total GPU wattage, an extra 250W for the rest of the system, and a safety margin. Here&#8217;s a more accurate <a href="https://www.newegg.com/tools/power-supply-calculator/">calculator</a>. I ended up with the EVGA&nbsp;SuperNOVA 1600W T2. 
</p></li><li><p><strong>RAM: </strong>I looked at what the motherboard provider recommended and picked modules I could easily buy online. It&#8217;s recommended to fill the available slots, and I wanted the RAM to match or exceed the GPU memory. According to <a href="https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/">Tim Dettmers</a>, RAM speed has little impact on overall performance. I went with 8 x Kingston 32GB 3200MHz DDR4 KSM32RD4/32ME, so 256 GB. </p></li><li><p><strong>NVMe SSDs: </strong>I checked the highest-rated SSDs on <a href="https://pcpartpicker.com/">PCpartpicker</a> and <a href="https://www.newegg.com/">Newegg</a>. I used 0.5 TB per GPU as a guideline and went for PCIe Gen 4.0 drives. I grabbed two 2 TB Samsung 980 Pro M.2 NVMe drives. </p></li><li><p><strong>Hard drives:</strong> I used the same strategy as for the SSDs, but with 6 TB per GPU for slow storage. This ended up being 2 x 12 TB Seagate IronWolf Pro, 3.5'', SATA 6Gb/s, 7200 RPM, 256MB cache. For a more rigorous benchmark, you can study the <a href="https://www.backblaze.com/blog/backblaze-hard-drive-stats-q2-2020/">disk failure rates</a>. </p></li><li><p><strong>NVLink:</strong> It&#8217;s a nice-to-have that can improve performance by <a href="https://www.pugetsystems.com/labs/hpc/RTX-2080Ti-with-NVLINK---TensorFlow-Performance-Includes-Comparison-with-GTX-1080Ti-RTX-2070-2080-2080Ti-and-Titan-V-1267/">a few percent</a> on specific workloads. It does not combine the memory of two GPUs into a single pool; that&#8217;s just confusing marketing. </p></li></ul><div><hr></div><h3>Purchases</h3><p><a href="https://pcpartpicker.com/">PCpartpicker</a> and <a href="https://www.newegg.com/">Newegg</a> are the most user-friendly price-comparison tools. </p><p>Nowadays, if I can&#8217;t find something on Amazon, I think twice before buying it. Roughly 30% of the lesser-known stores gave me a headache. 
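</p><p>As an aside, the component sizing rules of thumb above (total GPU wattage plus an extra 250W plus a margin for the PSU; roughly 0.5 TB of NVMe and 6 TB of hard-drive space per GPU) can be sketched as a quick calculation. This is a minimal sketch; the 20% margin is my own assumption of a reasonable safety factor, not a hard requirement:</p>

```python
# Back-of-the-envelope PSU and storage sizing from the rules of thumb above.
# The 20% margin is an assumed safety factor, not a hard requirement.
def psu_watts(gpu_watts: float, n_gpus: int,
              rest_of_system: float = 250, margin: float = 0.20) -> float:
    """Total GPU wattage, plus ~250W for the rest, plus a safety margin."""
    return (gpu_watts * n_gpus + rest_of_system) * (1 + margin)

def storage_tb(n_gpus: int, ssd_per_gpu: float = 0.5,
               hdd_per_gpu: float = 6) -> dict:
    """Fast (NVMe) and slow (hard-drive) storage guidelines per GPU."""
    return {"nvme": n_gpus * ssd_per_gpu, "hdd": n_gpus * hdd_per_gpu}

print(round(psu_watts(300, 4)))  # (300*4 + 250) * 1.2 = 1740
print(storage_tb(4))             # {'nvme': 2.0, 'hdd': 24}
```

<p>For four hypothetical 300W cards, this lands at roughly 1740W, which shows how quickly a multi-GPU build approaches the limits of even a 1600W unit.</p><p>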
</p><p>A few examples:</p><ul><li><p>Many stores don&#8217;t have the products they list online (5 stores)</p></li><li><p>Stores forgot my order, and I had to follow up for 3-9 weeks (3 stores)</p></li><li><p>One store sent me an order they had already canceled, then charged me several hundred euros for the return (1 store)</p></li><li><p>One store didn&#8217;t ship my product for a month, and instead of a refund, they gave me a voucher (1 store)</p></li><li><p>Customer service was either poorly automated or so bad that resolving any issue became unbearable (6 stores)</p></li></ul><p><a href="https://www.pny.eu/en/company/where-to-buy/1-france/11-datacenter---hpc-ai">PNY lists</a> the retailers of prosumer and enterprise cards. I reached out to all 20 suppliers in France. Half didn&#8217;t reply. Of those that did, 60% didn&#8217;t have the latest cards, and the quotes I got varied in price by 5-10%. In France, <a href="https://www.carri.com/">CARRI systems</a> had the best price and good customer service. </p><h4>Build lists</h4><p>PCpartpicker has more than <a href="https://pcpartpicker.com/builds/#g=499,497,494,492,493,432,441&amp;sort=recent&amp;page=1">40,000 builds with the RTX 30 series</a>, but most use a single GPU, a few use two, and none use 3+. 
</p><ul><li><p><a href="https://docs.google.com/spreadsheets/d/1VMtiLZbgLAChKscBAbC1VovVwsI_8Y0BBAW8R06rc5I/edit?usp=sharing">Four RTX A6000 Workstation and Server, EPYC 2</a> (my build)</p></li><li><p><a href="https://l7.curtisnorthcutt.com/the-best-4-gpu-deep-learning-rig">4 x RTX 2080</a> (Curtis will soon release a 4 RTX 3090 build)</p></li><li><p><a href="https://au.pcpartpicker.com/list/8KXXjp">Titan RTX, Intel</a> (Article by <a href="https://www.mrdbourke.com/notes-on-building-a-deep-learning-pc/">Daniel Bourke</a>)</p></li><li><p><a href="https://pcpartpicker.com/b/h6DxFT">Titan RTX, Threadripper</a> (<a href="https://towardsdatascience.com/building-a-5-000-machine-learning-workstation-with-an-nvidia-titan-rtx-and-ryzen-threadripper-46c49383fdac">Article by Jeff Heaton</a>)</p></li><li><p><a href="https://pcpartpicker.com/user/learnedvector/saved/#view=f4c8Jx">RTX 2080 Ti, Intel</a> (<a href="https://www.youtube.com/watch?v=Utwnm2kjYAM&amp;ab_channel=TheA.I.Hacker-MichaelPhi">Video by Michael Phi</a>)</p></li><li><p><a href="https://pcpartpicker.com/list/kCQ7Cz">Two RTX 3090, Ryzen</a> (<a href="https://www.reddit.com/r/deeplearning/comments/k00ist/dual_rtx_3090_workstations_review/">source</a>)</p></li><li><p><a href="https://pcpartpicker.com/b/nyZfrH">One RTX 3090, Threadripper</a></p></li><li><p><a href="https://pcpartpicker.com/b/V78MnQ">Two RTX 2070, Threadripper, Corsair Air 540</a></p></li><li><p><a href="https://pcpartpicker.com/b/VbGcCJ">One RTX 3090, Threadripper, Thermaltake Core P3</a> (Gorgeous!)</p></li><li><p><a href="https://pcpartpicker.com/b/nR9G3C">Two RTX 3090, Threadripper, Server</a></p></li><li><p><a href="https://pcpartpicker.com/b/mWskcf">One RTX 3080, Ryzen</a></p></li></ul><h4>Pre-built rigs</h4><p>Here are the <a href="https://www.pny.eu/en/company/where-to-buy/1-france/11-datacenter---hpc-ai">providers per region</a> that offer pre-built rigs. 
</p><p>Here's a list of retailers with transparent pricing:</p><p><strong>EU</strong></p><ul><li><p><a href="https://www.aime.info/en/">Aime</a> (GER)</p></li><li><p><a href="https://www.deltacomputer.com/server/gpu-mic-server.html">Delta</a> (GER)</p></li><li><p><a href="https://www.sysgen.de/gpu-computing/">sysGEN</a> (GER)</p></li><li><p><a href="https://www.novatech.co.uk/workstation/deeplearning/">Novatech</a> (UK)</p></li></ul><p><strong>US</strong></p><ul><li><p><a href="https://lambdalabs.com/">Lambda Labs</a> (US)</p></li><li><p><a href="https://bizon-tech.com/">Bizon</a> (US)</p></li><li><p><a href="https://www.lenovo.com/us/en/think-workstations/thinkstation-p-series-towers/ThinkStation-P920/p/30BCCTO1WWENUS0">Lenovo ThinkStation P920</a> (US)</p></li><li><p><a href="https://www.exxactcorp.com/Deep-Learning-NVIDIA-GPU-Workstations">Exxactcorp</a> (US)</p></li><li><p><a href="https://www.pugetsystems.com/">Puget Systems</a> (US)</p></li></ul><p>If you know of other providers that list prices, <a href="https://forms.gle/8V3isj51yXnGjSip9">please submit them here</a>, and I&#8217;ll add them. </p><div><hr></div><h3><strong>Building and Installing </strong></h3><p>The hard part in building a rig is finding the parts, especially if you are trying to do something unconventional. </p><p>Putting the pieces together and installing them takes less than an hour, but you probably want to spend a few extra hours to be on the safe side.</p><p>I had a <a href="https://www.amazon.fr/gp/product/B077NV5TDM/ref=ppx_yo_dt_b_asin_title_o05_s00?ie=UTF8&amp;psc=1">mobile repair kit</a> at home that was useful, but you&#8217;ll be okay with a standard screwdriver and a good selection of bits. </p><p>I used the remote management system to install the software. 
When I plugged the ethernet cable into my router, the board was assigned an IP address. Entering that address in a browser gave me a web interface, where I updated the BIOS and installed Ubuntu 20.04 LTS.</p><p>I then installed <a href="https://lambdalabs.com/lambda-stack-deep-learning-software">the Lambda Stack</a> for the GPU drivers, machine learning libraries, etc. I highly recommend it.</p><p>If you are using IPMI, change the VGA output to internal in the BIOS. Otherwise, you can&#8217;t use the virtual monitor in the IPMI without removing the GPUs. </p><div><hr></div><h3>Conclusion</h3><p>The main reason to own hardware is workflow: you stop wasting time chasing cloud savings, and you&#8217;re encouraged to experiment freely. </p><p>You&#8217;ll save money building a consumer ML rig. If you price in your time, the cost savings of building prosumer and enterprise rigs are dubious. However, you&#8217;ll learn a lot and become a much more educated consumer. Plus, it&#8217;s a valuable skill when all pre-built suppliers are having GPU supply issues. </p><p>Nvidia is making it hard to use high-end consumer cards in 3+ GPU rigs. For a prosumer rig with a server room at home, I&#8217;d go for 4 x 3090 in an open-air rig; with more limited space, a 2 x 3090 workstation. </p><p>With a larger budget, 4 x RTX A6000 is a good option, but given the noise and heat, I&#8217;d go for a server build and place it in a data center. </p><p>The A100 has the most mindshare, but the A6000 / A40 offers more value for money. The SXM4 version is too clunky and offers a marginal performance gain over the PCIe version. I&#8217;d like to see a transparent benchmark on a large cluster to see the benefits in practice. </p><p>If you have any questions, <a href="https://twitter.com/emilwallner?lang=en">ping me on Twitter</a>, or drop a comment below. 
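</p><p>A practical footnote to the install steps above: once the OS and drivers are in place, it&#8217;s worth confirming that the system actually sees all the cards before running anything heavy. A minimal sketch (it assumes nvidia-smi, which ships with the NVIDIA driver, is on the PATH):</p>

```python
import shutil
import subprocess

def visible_gpus() -> list:
    """Return the GPU names nvidia-smi reports, or [] if no driver is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=15,
        )
    except OSError:
        return []
    if out.returncode != 0:
        return []
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

<p>If the count doesn&#8217;t match the number of installed cards, check the PCIe seating and power connectors before blaming the software. 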
</p><h3>A Podcast Episode About the ML Rig </h3><div><hr></div><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8adc67590427eeb30983bfdba7&quot;,&quot;title&quot;:&quot;9. Emil Wallner on Building a &#8364;25000 Machine Learning Rig&quot;,&quot;subtitle&quot;:&quot;By Micha&#235;l Trazzi&quot;,&quot;description&quot;:&quot;Podcast episode&quot;,&quot;url&quot;:&quot;https://open.spotify.com/episode/1vqQ1tlZGiEH3fRc56vi6J&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/episode/1vqQ1tlZGiEH3fRc56vi6J" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.buymeacoffee.com/emilwallner&quot;,&quot;text&quot;:&quot;Buy me a coffee! &#9749;&#65039;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.buymeacoffee.com/emilwallner"><span>Buy me a coffee! &#9749;&#65039;</span></a></p>]]></content:encoded></item></channel></rss>