Machine Learning Engineer - Data Scraping
Data
Machine Learning
São Paulo, SP
Remote
TRACTIAN is transforming the industrial world by empowering frontline maintenance workers to achieve more. We’ve fused cutting-edge hardware with innovative software into one powerful platform, disrupting legacy systems and delivering smarter, faster solutions for our clients.
Design and maintain robust data collection pipelines from a wide range of sources, including websites, documents, APIs, and raw sensor data
Extract and structure information from unstructured or semi-structured formats into clean, standardized schemas
Handle real-world data challenges like pagination, rate limits, CAPTCHAs, noise, missing values, and inconsistent formatting
Clean, filter, and validate raw data to ensure high quality, consistency, and usability across our systems
Develop small tools and utilities to support and automate data collection workflows
Support the creation and maintenance of labeling pipelines for ML applications
Collaborate with engineering and product teams to optimize data storage and access patterns
Document data sources, collection methodologies, and processing procedures for reproducibility
0–2 years of experience in software development, data engineering, or related fields
Degree in Computer Science, Computer Engineering, Information Systems, or equivalent technical background
Understanding of HTML, CSS selectors, and how web pages are structured
Strong problem-solving skills and an eye for detail
Ability to work in a fast-paced environment and manage shifting priorities
Proficiency in Python, especially for data manipulation and automation
Experience (academic or professional) with data extraction using tools like `requests`, `BeautifulSoup`, or similar
Familiarity with REST APIs and the HTTP protocol
Experience with data cleaning techniques such as:
Handling missing or inconsistent values
Removing duplicates and outliers
Standardizing formats (e.g., dates, units, text normalization)
Validating data against schemas or expected ranges
(Optional) Exposure to browser automation tools like Selenium or Playwright
Experience with web scraping libraries/frameworks like Scrapy, Playwright, or Selenium
Familiarity with proxy usage, headless browsers, or CAPTCHA bypass techniques
Understanding of database systems (SQL or NoSQL)
Exposure to rapid prototyping tools like Streamlit
Previous experience working with or around industrial equipment or maintenance systems
• Competitive salary and stock options
• Optional fully funded English / Spanish courses
• 30 days of paid annual leave
• Education and courses stipend
• Employee Giving
• Earn a trip anywhere in the world every 4 years
• Day off during the week of your birthday
• Up to R$1,000/mo in meal and remote work allowances
• Health plan with national coverage and no copayments
• Dental Insurance: we help you with dental treatment for a better quality of life.
• Gympass and Sports Incentive: R$300/mo extra if you practice physical activities
If you want to build a ship, don't organize people to collect wood, assign them tasks, and give orders. Instead, teach them to long for the vast and endless sea.
Antoine de Saint-Exupéry