AlbertaAI Kaggle Camp

Introduction

Kaggle is an online community of data scientists and machine learners, owned by Google, Inc. Kaggle allows users to find and publish data sets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. Kaggle got its start by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and short-form AI education.

How do Kaggle competitions work?

The competition host prepares the data and a description of the problem. Participants experiment with different techniques and compete against each other to produce the best models. Work is shared publicly through Kaggle Kernels to achieve a better benchmark and to inspire new ideas. Submissions can be made through Kaggle Kernels, through manual upload or using the Kaggle API. For most competitions, submissions are scored immediately (based on their predictive accuracy relative to a hidden solution file) and summarized on a live leaderboard. After the deadline passes, the competition host pays the prize money in exchange for “a worldwide, perpetual, irrevocable and royalty-free license to use the winning Entry”, i.e. the algorithm, software and related intellectual property developed, which is “non-exclusive unless otherwise specified”. Alongside its public competitions, Kaggle also offers private competitions limited to Kaggle’s top participants. Kaggle offers a free tool for data science teachers to run academic machine learning competitions, Kaggle In Class. Kaggle also hosts recruiting competitions in which data scientists compete for a chance to interview at leading data science companies like Facebook, Winton Capital, and Walmart.
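The scoring step described above can be illustrated with a small sketch. It assumes a hypothetical hidden solution that maps row ids to labels and uses plain accuracy as the metric. Real competitions each define their own metric and file format, and the function name `score_submission` is made up for illustration.

```python
def score_submission(submission, solution):
    """Return the fraction of ids in `solution` that were predicted correctly.

    Both arguments are dicts mapping a row id to a label. Ids missing
    from the submission count as wrong.
    """
    if not solution:
        raise ValueError("solution must not be empty")
    correct = sum(
        1 for row_id, label in solution.items()
        if submission.get(row_id) == label
    )
    return correct / len(solution)


# Hypothetical data, for illustration only.
hidden_solution = {1: "cat", 2: "dog", 3: "cat", 4: "bird"}
my_submission = {1: "cat", 2: "dog", 3: "dog", 4: "bird"}
print(score_submission(my_submission, hidden_solution))  # 0.75
```

Participants never see the solution file itself; they only see the resulting score on the leaderboard.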

About AlbertaAI Kaggle Camp

Thanks to recent advances in AI research and applications, demand for AI-related developer, engineer, and researcher positions is growing rapidly. However, moving from academia into industry is difficult, and it is even harder for non-professionals who want to invest time and money in learning and applying these advances. We believe one of the best ways to gain experience and knowledge in a new field is through plenty of practice.

We use one of the most popular data science platforms, Kaggle, to give practitioners an opportunity to gather, discuss, and compete. We’d like to offer this opportunity to everyone who is interested in landing a position in data science, machine learning, or artificial intelligence and wants to practice the related skills.

Competition 1

Planet: Understanding the Amazon from Space

https://www.kaggle.com/c/planet-understanding-the-amazon-from-space

Every minute, the world loses an area of forest the size of 48 football fields. And deforestation in the Amazon Basin accounts for the largest share, contributing to reduced biodiversity, habitat loss, climate change, and other devastating effects. But better data about the location of deforestation and human encroachment on forests can help governments and local stakeholders respond more quickly and effectively.

Planet, designer and builder of the world’s largest constellation of Earth-imaging satellites, will soon be collecting daily imagery of the entire land surface of the earth at 3-5 meter resolution. While considerable research has been devoted to tracking changes in forests, it typically depends on coarse-resolution imagery from Landsat (30 meter pixels) or MODIS (250 meter pixels). This limits its effectiveness in areas where small-scale deforestation or forest degradation dominate.

Furthermore, these existing methods generally cannot differentiate between human causes of forest loss and natural causes. Higher resolution imagery has already been shown to be exceptionally good at this, but robust methods have not yet been developed for Planet imagery.

In this competition, Planet and its Brazilian partner SCCON are challenging Kagglers to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world – and ultimately how to respond.

Competition 2

Quora Question Pairs

https://www.kaggle.com/c/quora-question-pairs

Where else but Quora can a physicist help a chef with a math problem and get cooking tips in return? Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high quality answers to questions resulting in an improved experience for Quora writers, seekers, and readers.
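To make the duplicate-classification task concrete, here is a minimal baseline sketch. It is not Quora's Random Forest model or any competitor's solution. It simply measures word overlap (Jaccard similarity) between two questions, and the 0.5 threshold and the function names are arbitrary assumptions for illustration.

```python
import re


def word_overlap(q1, q2):
    """Jaccard similarity between the lowercase word sets of two questions."""
    words1 = set(re.findall(r"[a-z0-9']+", q1.lower()))
    words2 = set(re.findall(r"[a-z0-9']+", q2.lower()))
    if not words1 and not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


def is_duplicate(q1, q2, threshold=0.5):
    """Guess that a pair is a duplicate when word overlap reaches the threshold."""
    return word_overlap(q1, q2) >= threshold


# Hypothetical question pairs, for illustration only.
pairs = [
    ("How do I learn Python?", "How do I learn Python quickly?"),
    ("How do I learn Python?", "What is the capital of France?"),
]
for q1, q2 in pairs:
    print(round(word_overlap(q1, q2), 2), is_duplicate(q1, q2))
# prints:
# 0.83 True
# 0.0 False
```

A baseline like this ignores word order and meaning, which is why stronger natural language processing techniques are needed to do well in the competition.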

Competition 3

TalkingData AdTracking Fraud Detection Challenge

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they’ve built an IP blacklist and device blacklist.
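The flagging idea in the paragraph above can be sketched in a few lines. This is only an illustration of the general approach, not TalkingData's actual system. The input format (one `(ip, installed)` pair per click), the `min_clicks` threshold, and the function name are all assumptions.

```python
from collections import Counter


def flag_suspicious_ips(clicks, min_clicks=3):
    """Return the set of IPs with at least `min_clicks` clicks and no installs.

    `clicks` is an iterable of (ip, installed) pairs, one pair per click.
    """
    click_counts = Counter()
    install_counts = Counter()
    for ip, installed in clicks:
        click_counts[ip] += 1
        if installed:
            install_counts[ip] += 1
    return {
        ip for ip, n in click_counts.items()
        if n >= min_clicks and install_counts[ip] == 0
    }


# Hypothetical click log, for illustration only.
click_log = [
    ("10.0.0.1", False), ("10.0.0.1", False), ("10.0.0.1", False),
    ("10.0.0.2", False), ("10.0.0.2", True),
    ("10.0.0.3", False),
]
print(flag_suspicious_ips(click_log))  # {'10.0.0.1'}
```

Here `10.0.0.1` is flagged because it clicked three times without installing, while `10.0.0.2` installed once and `10.0.0.3` clicked too few times to judge.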

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

Stage 1

We will host a two-stage camp. In the first stage, we’d like to invite people with various backgrounds to participate in one of the selected (finished) competitions. Kaggle’s late submission option allows us to submit predictions and receive evaluation results immediately. Some guidance will be provided throughout the one-month period. We’d like to select a group of participants to join the next stage. During the first stage, mentorship will be provided by several active Kaggle competitors (Kaggle Master and Kaggle Expert level).

Rules in Stage 1

There are no specific rules for stage 1. You are very welcome to use any public resources from the Internet. There are plenty of public kernels and competition technique summaries in the competition discussion forums. Feel free to borrow any ideas, code, and reports from others. We care less about your final performance than about your team collaboration and contribution.

Stage 2

In stage 2, selected participants will be invited to form several groups (or to compete solo, if they prefer). We will select a competition in one of the following forms:

An ongoing competition
A private competition

In stage 2, full mentorship will be provided to help you achieve better performance.

Rules in Stage 2

In stage 2, please strictly follow the competition rules listed by Kaggle.

Timeline

Stage 1

Begin: Dec 15, 2018
End: Jan 15, 2019 (tentative)

Stage 2

Begin: Jan 20, 2019 (tentative)
End: TBD

How to participate?

We welcome everyone who is interested in spending time learning and practicing data science, machine learning, and artificial intelligence. However, to ensure that participants are engaged and committed, we will charge each participant a $10 entrance fee for the first stage of the camp. This amount will be fully refunded after you finish the first stage.

In addition, participants should be familiar with basic programming fundamentals (roughly the level of a first-year university CS course).

Please fill out the following Google Form to register for the camp.
https://goo.gl/forms/hrFL466CNl7NfL4k1

In stage 1, we will gather all participants and help form groups based on personal background and interests. Only the first gathering is mandatory. After that, each group can schedule its own meeting times.

Deliverable

Each group is expected to submit its code base and a short report (1 to 2 pages is usually enough) that demonstrates the team’s efforts. Here we provide a list of previous summaries written by top Kagglers.

https://www.kaggle.com/c/quora-question-pairs/discussion/34325

Prize

We will select the top team in stage 2.

A gift card of the winners’ choice ($200) will be awarded as the final prize.

Administrator